What’s the Magic Word? A Control Theory of LLM Prompting

Aman Bhargava
California Institute of Technology
Pasadena, CA, USA
abhargav[at]caltech[dot]edu
Cameron Witkowski
University of Toronto
Toronto, ON, Canada
cameron.witkowski[at]mail.utoronto.ca
Shi-Zhuo Looi
California Institute of Technology
Pasadena, CA, USA
looi[at]caltech[dot]edu
Matt Thomson
California Institute of Technology
Pasadena, CA, USA
mthomson[at]caltech[dot]edu

Abstract

Prompt engineering is crucial for deploying LLMs but is poorly understood mathematically. We formalize LLM systems as a class of discrete stochastic dynamical systems to explore prompt engineering through the lens of control theory. We offer a mathematical analysis of the limitations on the controllability of self-attention as a function of the singular values of the parameter matrices. We present complementary empirical results on the controllability of a panel of LLMs, including Falcon-7b, Llama-7b, and Falcon-40b. Given initial state $\mathbf{x}_{0}$ from Wikitext and prompts of length $k\leq 10$ tokens, we find that the “correct” next token is reachable at least 97% of the time, and that the top 75 most likely next tokens are reachable at least 85% of the time. Intriguingly, short prompt sequences can dramatically alter the likelihood of specific outputs, even making the least likely tokens become the most likely ones. This control-theoretic analysis of LLMs demonstrates the significant and poorly understood role of input sequences in steering output probabilities, offering a foundational perspective for enhancing language model system capabilities.

1 Introduction

LLMs pre-trained on unsupervised next token prediction objectives exhibit unprecedented dynamic reprogrammability achieved through “prompting”, often referred to as zero-shot learning [1, 2, 3, 4, 5, 6]. These capabilities appear to emerge as the model’s size, training data, and training time are scaled. The dynamic reprogrammability of LLMs is akin to the adaptable computational capacities observed in biological systems. This feature finds applications across domains such as machine translation [7], code generation [8], and chatbots [9]. A rigorous understanding of the prompt’s influence over LLM generation would be of great utility for understanding LLMs and building more robust and capable systems leveraging LLMs.

Strategies for controlling pre-trained LLM generation today fall into three broad categories [10]:

1. Input Optimization (Prompting): Adjusting the input tokens (e.g., rewording the prompt) to improve subsequent text generation.
2. Model Optimization: Adjusting the weights of the network (e.g., fine-tuning, RLHF) to improve model behavior during inference.
3. Post-processing: Adjusting or re-ranking generated text (e.g., surrogate ranking algorithm).

Of all these approaches, input optimization (i.e., prompting) is the least invasive and lowest-cost method – and the least understood. Prompt optimization is also deeply connected to the zero-shot capabilities of LLMs – the mysterious emergent capabilities of LLMs such as problem-solving, knowledge retrieval, reasoning, and apparent general intelligence [11]. With such a view, we seek to characterize the controllability of LLMs via prompting (Figure 1).

Figure 1: Illustration of the control-theoretic approach to LLM prompt engineering. Left: the LLM system diagram mapping an initial state $\mathbf{x}_{0}$ to a system output $\mathbf{y}$ under the influence of a control input $\mathbf{u}$ (all token sequences). Right: sketch of the reachable output sets $\mathcal{R}_{y}^{k}(\mathbf{x}_{0})$ for varying control input lengths $k$.

1.1 Contribution

We formalize LLM systems in the mathematical framework of control theory in Section 3. Our analysis focuses on the reachable set of outputs $\mathcal{R}_{y}(\mathbf{x}_{0})$ for an LLM system. The reachable set is a fundamental concept in control theory that underlies notions of controllability, stability, and observability (cf. Appendix A). The reachable output set $\mathcal{R}_{y}(\mathbf{x}_{0})$ is the set of output sequences $\mathbf{y}$ for which there exists a control input sequence $\mathbf{u}^{*}$ that steers the LLM from initial state $\mathbf{x}_{0}$ to output $\mathbf{y}$ (cf. Definitions 3.3, A.5).

Our mathematical results in Section 4 prove an upper bound on the contents of the reachable output set for a self-attention head as a function of the singular values of its parameter matrices. Since self-attention is the only component in a transformer block where significant information is exchanged between token representations, this bound provides a foothold for analysis of LLM controllability from the perspective of mechanistic interpretability (e.g., [12, 13, 14]). Our result represents an analytically computable necessary condition for an output to be in the reachable set (Equation 7).

Our empirical results apply state-of-the-art prompt optimization techniques (Section 5.1) to demonstrate a lower bound on the contents of the reachable output set for a panel of LLMs, including Llama-7b [15], Falcon-7b, and Falcon-40b [16]. Specifically, we sample initial states $\mathbf{x}_{0}$ from the Wikitext dataset [17] and probe the reachable output tokens $y$ under length-constrained control input sequences $\mathbf{u}:|\mathbf{u}|\leq k$. The length constraint $k$ is highly relevant for optimal control of LLMs, as prompts with fewer tokens require less computation and memory. We find that the reachable output set contains the “correct” next Wikitext token following $\mathbf{x}_{0}$ over 97% of the time with prompts of $k\leq 10$ tokens. We expand our analysis of the contents of $\mathcal{R}_{y}(\mathbf{x}_{0})$ by sampling target output tokens $y$ based on the LLM’s initial estimate of output likelihood $P_{LM}(y|\mathbf{x}_{0})$. We find that the top 75 most likely output tokens $y$ are reachable at least 85% of the time with prompts of $k\leq 10$ tokens. Intriguingly, some tokens drawn from the set of least likely outputs are controllable to the most likely output with $k\leq 4$ control input tokens. Our results suggest that prior likelihood-based metrics, such as cross-entropy loss, cannot guarantee exclusion from the reachable set, emphasizing the gap in our current understanding of LLM systems and control. Implications of our results and open questions in LLM control theory are further discussed in Section 6.

2 Related Work

Much of the work on prompt optimization is concerned with finding prompts that induce higher LLM performance on “fill-in-the-blank” or “cloze” tasks [18]. One can frame a range of tasks including knowledge retrieval [19], reasoning [20], and sentiment analysis [21] as fill-in-the-blank tasks:

• Knowledge Retrieval: “The Titanic sank in the year [MASK].” (Answer: “1912”)
• Reasoning: “A is taller than B. B is taller than C. Is A taller than C? Answer: [MASK]” (Answer: “Yes”)
• Sentiment Analysis: “I am sad today. The sentiment of the previous sentence was [MASK]” (Answer: “Negative”)

Notably, there is some freedom in the “prompt text” that surrounds the question to convert it into a “fill-in-the-blank” task. As it turns out, the prompt tokens have a large effect on LLM performance [1, 10, 22].

Modern prompt optimization algorithms generally consist of two iterated steps: a sampling step where new prompts are generated and a testing step where the utility of the new prompts is evaluated, and the best are selected for the next iteration. Algorithms primarily differ in the sampling procedure, where various heuristics may be used to pick high-value swaps [23, 24, 25]. Overall, AutoPrompt and its derivative algorithms have been the most numerically successful prompt optimization methods, with the greedy coordinate gradient (GCG) algorithm having state-of-the-art performance [26].
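The sample-then-test loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not any of the cited algorithms: the three-token vocabulary, the count-based `toy_loss`, and the uniform random token swaps are all hypothetical stand-ins (AutoPrompt and GCG use gradient information to choose swaps, which is omitted here).

```python
import math
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary

def toy_loss(prompt, x0, y_star):
    """Negative log-likelihood of y_star under a toy count-based model (assumption)."""
    state = prompt + x0
    weights = {tok: math.exp(state.count(tok)) for tok in VOCAB}
    return -math.log(weights[y_star] / sum(weights.values()))

def optimize_prompt(x0, y_star, k, iterations=20, candidates=8, seed=0):
    """Iterate a sampling step (random single-token swaps) and a testing step."""
    rng = random.Random(seed)
    best = [rng.choice(VOCAB) for _ in range(k)]
    best_loss = toy_loss(best, x0, y_star)
    for _ in range(iterations):
        # Sampling step: propose prompts that differ from the incumbent by one token.
        pool = []
        for _ in range(candidates):
            cand = list(best)
            cand[rng.randrange(k)] = rng.choice(VOCAB)
            pool.append(cand)
        # Testing step: evaluate every candidate and keep the best only if it improves.
        winner = min(pool, key=lambda p: toy_loss(p, x0, y_star))
        winner_loss = toy_loss(winner, x0, y_star)
        if winner_loss < best_loss:
            best, best_loss = winner, winner_loss
    return best, best_loss

prompt, final_loss = optimize_prompt(["a", "b"], "c", k=3)
print(prompt, round(final_loss, 4))
```

Because a candidate replaces the incumbent only when it strictly lowers the loss, the loss of the returned prompt never exceeds that of the initial random prompt.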

The AutoPrompt Family:

AutoPrompt [27] pioneered the current wave of prompt optimization. Shin et al. propose a prompt optimization technique and demonstrate its effectiveness for engineering prompts to improve LLM performance on knowledge and sentiment analysis tasks. At its core, the AutoPrompt algorithm leverages gradient information at the token embedding layer to inform iterative token exchanges within the prompt. This method was extended in [26] as the greedy coordinate gradient (GCG) algorithm. Taking inspiration from adversarial examples [28], Zou et al. applied this AutoPrompt variant to generate “jailbreak” prompts that cause aligned LLMs to generate objectionable content.

Other Prompt Optimization Methods:

Other investigations on LLMs as prompt optimizers [24] and further analysis of manual prompt optimization [25] are informative but do not exceed the AutoPrompt family’s performance. Some other methods include GBDA [29], an approach based on the Gumbel-Softmax reparametrization, the PEZ algorithm [23], which directly optimizes embeddings via gradient information, and FluentPrompt [30], which differs from AutoPrompt by incorporating Langevin dynamics. Another family of techniques relating closely to our work is RL-based prompt optimization methods [31, 32, 33, 34]. Such methods seek to optimize a prompt generation policy to maximize some reward signal, using a host of off-the-shelf reinforcement learning algorithms. Despite the variety of alternatives, GCG retains state-of-the-art performance.

Control Theory for LLMs:

To our knowledge, the only other work to date on the controllability or reachability of LLM text generation is [35]. Soatto et al. analyze the controllability of LLMs in terms of “meaningful sentences”, defined as the sigma-algebra generated by snippets of text written on the Internet. Their empirical analysis revolves around demonstrating that LLMs are capable of attributing meaning. The theoretical analysis of LLM controllability is limited to “meaningful sentences”, eliminating the possibility of out-of-distribution inputs and outputs. These restrictions render their results challenging to leverage toward a practical understanding of LLM controllability. As stated in Section 5.5 of [35], “If fed gibberish, the well-trained bot operates out of distribution, which does not allow predicting the reachable set”. We situate our work as a practically oriented exploration of LLM controllability. Motivated by challenges in developing LLM systems, we do not eliminate “meaningless sentences” from the state space or input space. We aim to establish a rigorous, general framework for understanding LLM systems and controllability that is amenable to the development of theory and practical engineering insights on systems design.

3 Control Theory for LLMs

Control theory originates from the study of automatic control systems in engineering. It seeks to understand how a “plant” system can be influenced toward a desired state using a “control signal” – often in the presence of disturbances and uncertainty.

Control theory is central to a variety of engineering problems, from electrical engineering to autopilot to telecommunications to manufacturing. Surprisingly, control theory has also been highly applicable to a diverse range of scientific disciplines. Analyzing systems through the lens of controllability has proven fruitful for generating insight into biological systems such as cell signaling pathways and neural networks [36], the economics of central banking [37], and controlling the spread of infectious diseases [38]. One of the central benefits of studying systems via controllability is that a range of questions and problems naturally emerge from the framing: when is control possible? What is the cost of control? How computationally intensive is control? These questions are both practically useful and often lead to fundamental insights about the nature of the system in question.

To develop a control theory of LLMs, we begin with fundamental definitions of systems and control in Appendix A. We extend these fundamentals to define LLM systems (Definition 3.1) and outline specific canonical control concepts and problems such as reachability and controllability (Definitions 3.3, 3.4) that arise naturally for LLM systems.

Language Model Notation:

We denote a causal language model using $P_{LM}$ . $P_{LM}$ maps from an ordered list of tokens from a vocabulary set $\mathcal{V}$ (e.g., $\mathbf{x}\in\mathcal{V}^{n}$ ) to the probability distribution over the next token $P_{LM}(x_{n+1}|\mathbf{x})\in[0,1]^{|\mathcal{V}|}$ . We use $\mathcal{V}^{*}$ to denote the set of all possible sequences of any length composed of tokens from $\mathcal{V}$ . The addition operator indicates the concatenation of token sequences. Bolded lowercase variables (e.g., $\mathbf{x}=[x^{1},\dots,x^{n}]$ ) denote token sequences while unbolded lowercase variables refer to individual tokens (e.g., $x\in\mathcal{V}$ ). The length of a token sequence is denoted $|\mathbf{x}|$ .

While LLMs are at times leveraged in a manner that masks the iterative aspects of generation, the reality is that token generation and externally imposed “control input” sequences are generated and processed sequentially, leading to non-trivial system dynamics. Several key differences remain between LLM-based systems and systems typically modeled through ordinary differential equations (ODEs), which have long been a cornerstone in the study of continuous-time dynamical systems:

1. Discrete state and time: LLM systems operate on sequences of discrete tokens over a discrete time set, in contrast to the continuous state spaces and time sets studied in classical control theory.
2. Shift-and-Grow State Dynamics: Whereas the system state in an ODE-based system has a fixed size over time, the system state $\mathbf{x}(t)$ for LLM systems grows as tokens are added to the state sequence.
3. Mutual exclusion on control input token vs. generated token: The LLM system state $\mathbf{x}(t)$ is written to one token at a time. The newest token is either drawn from the control input $u(t)$ or is generated by the LLM by sampling $x^{\prime}\sim P_{LM}(x^{\prime}|\mathbf{x}(t))$. This differs from traditional discrete stochastic systems, where the control sequence and internal dynamics generally affect the state synchronously.

We begin by rigorously defining LLM systems with user input, drawing from the abstract mathematical definition of a system (Definition A.1).

Definition 3.1 (LLM System with Control Input).

An autoregressive LLM system with control input $\Sigma=(\mathcal{V},P_{LM})$ consists of:

• $\mathcal{T}=\mathbb{N}$ – The time set is the natural numbers.
• $\mathcal{X}=\mathcal{V}^{*}$ – The state space consists of all possible token sequences of any length drawn from $\mathcal{V}$. We denote the state at time $t$ as $\mathbf{x}(t)=[x^{0}(t),\dots,x^{t}(t)]$.
• $\mathcal{U}=\mathcal{V}\cup\varnothing$ – The input takes values from the vocabulary set $\mathcal{V}$ or null.
• $\phi:\mathcal{X}\times\mathcal{U}\times\mathcal{T}^{2}\to\mathcal{X}$ – The transition map is

$$\phi(\mathbf{x}(t),u(t),t,t+1)=\begin{cases}\mathbf{x}(t)+u(t)&\text{ if }u(t)\neq\varnothing\\ \mathbf{x}(t)+x^{\prime}&\text{ else }\end{cases} \tag{1}$$

where $x^{\prime}\sim P_{LM}(x^{\prime}|\mathbf{x}(t))$. Note that the general multi-step transition map $\phi(\mathbf{x}(t),\mathbf{u},t,t+N)$ can be achieved by iterating Equation 1 for control sequences $\mathbf{u}$ defined over the interval $[t,t+N]$.
• $h(\mathbf{x}(t);r)=[x^{t-r}(t),\dots,x^{t}(t)]$ – The readout map returns the most recent $r$ tokens from state $\mathbf{x}(t)$.
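The components of Definition 3.1 can be sketched in a few lines of Python. The bigram table `toy_lm`, the vocabulary {a, b, c}, and the helper names `step`, `rollout`, and `readout` are hypothetical and chosen only for illustration; `None` plays the role of the null input $\varnothing$, and the sketch uses argmax decoding in place of sampling so that the example is deterministic.

```python
from typing import Callable, Dict, List, Optional

Token = str
State = List[Token]
# Stand-in for P_LM: maps a state to a distribution over the next token.
LM = Callable[[State], Dict[Token, float]]

def toy_lm(state: State) -> Dict[Token, float]:
    """Hypothetical bigram-style model over the vocabulary {a, b, c}."""
    table = {
        "a": {"a": 0.1, "b": 0.7, "c": 0.2},
        "b": {"a": 0.2, "b": 0.1, "c": 0.7},
        "c": {"a": 0.6, "b": 0.3, "c": 0.1},
    }
    return table[state[-1] if state else "a"]

def step(state: State, u: Optional[Token], lm: LM) -> State:
    """One application of the transition map (Equation 1), with argmax decoding."""
    if u is not None:
        return state + [u]          # the control input writes the next token
    probs = lm(state)
    return state + [max(probs, key=probs.get)]  # the model writes the next token

def rollout(x0: State, controls: List[Optional[Token]], lm: LM) -> State:
    """Multi-step transition map obtained by iterating `step`."""
    state = list(x0)
    for u in controls:
        state = step(state, u, lm)
    return state

def readout(state: State, r: int) -> State:
    """Readout map: the most recent r tokens of the state."""
    return state[-r:]

final_state = rollout(["a"], ["c", None, None], toy_lm)
print(final_state, readout(final_state, 2))
```

Here one control token is written first, after which the null input yields generation to the model for two steps, illustrating the mutual exclusion between control tokens and generated tokens.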

We note that this LLM system definition is generalizable to a variety of LLM augmentations, including chain-of-thought [39], retrieval-augmented generation [40], and chatbot interaction. For example, chain-of-thought is equivalent to sampling the readout map $h(\mathbf{x}(t);r)$ at time $T>|\mathbf{u}|+|\mathbf{x}_{0}|+r$ for prompt $\mathbf{u}$ and initial state $\mathbf{x}_{0}$. A similar formulation may be applied to LLM systems endowed with programmatic tools (e.g., [41]).

In Definition 3.1, we assume that the control input gets to “decide” whether to yield token generation to the LLM ( $u(t)=\varnothing$ ) or override the LLM and add some token $u(t)\neq\varnothing$ to the state $\mathbf{x}(t)$ . This assumption generally holds when building LLM systems, though it may not hold when using existing systems (e.g., via non-streaming API). When discussing finite-length control inputs – e.g., the family of $k$ -long input sequences $\mathbf{u}\in\mathcal{V}^{k}$ – the value of $u(\ell):\ell>k$ is implicitly $\varnothing$ unless otherwise stated.

While next token generation $x^{\prime}\sim P_{LM}(x^{\prime}|\mathbf{x}(t))$ in Equation 1 is probabilistic, we may render the system deterministic by sampling with zero temperature (i.e., greedy decoding). The greedy decoding assumption provides a foothold to analyze the reachable sets and controllability of LLM systems without invoking notions of stochastic control as in [42, 35]. Moreover, it remains connected to temperature-based stochastic decoding strategies, since greedy decoding is the limiting case of temperature-based sampling as the temperature approaches zero.
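The limiting behavior can be checked numerically. The following sketch assumes the common convention of dividing logits by a temperature before the softmax; the logit values are arbitrary and purely illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # illustrative values, not taken from any model
for temperature in (1.0, 0.5, 0.1, 0.01):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

As the temperature shrinks, the probability mass concentrates on the highest-logit token, so sampling coincides with the argmax choice in the limit.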

We now extend Definition A.4 to define output reachability for LLM systems:

Definition 3.2 (LLM Output Reachability).

Output token sequence $\mathbf{y}\in\mathcal{V}^{r}$ is reachable from initial state $\mathbf{x}_{0}\in\mathcal{V}^{*}$ for LLM system $\Sigma=(\mathcal{V},P_{LM})$ iff there exists some time $T$ and input $\mathbf{u}^{*}\in\mathcal{U}^{k}$ for some $k+|\mathbf{x}_{0}|\leq T$ that steers the LLM from initial state $\mathbf{x}_{0}$ to output $\mathbf{y}=h(\mathbf{x}(T);r)$ at time $T$.

We disregard the trivial solution wherein the control input $\mathbf{u}^{*}(t)$ overrides the LLM to force the state sequence to take on the desired output value $\mathbf{y}$ . We focus on the case of immediate generation, where $T=k+|\mathbf{x}_{0}|+r$ .

The reachable output set definition for LLM systems follows from Definition A.5:

Definition 3.3 (LLM Reachable Output Set).

The reachable output set from initial state $\mathbf{x}_{0}\in\mathcal{V}^{*}$ for LLM system $\Sigma=(\mathcal{V},P_{LM})$ is denoted $\mathcal{R}_{y}(\mathbf{x}_{0})$ and consists of all reachable outputs $\mathbf{y}\in\mathcal{V}^{*}$ from initial state $\mathbf{x}_{0}$.

Output controllability for LLMs follows from Definition A.7:

Definition 3.4 (LLM Output Controllability).

An LLM system $\Sigma=(\mathcal{V},P_{LM})$ is output controllable iff, for every initial state $\mathbf{x}_{0}\in\mathcal{V}^{*}$ , the reachable output set $\mathcal{R}_{y}(\mathbf{x}_{0})=\mathcal{V}^{*}$ .

The turn-based nature of writing to the LLM state sequence $\mathbf{x}(t)$ invites the question of whether the prompt $\mathbf{u}$ should precede the imposed state $\mathbf{x}_{0}$ or come after it.¹ We focus our efforts on cases where $\mathbf{u}$ comes before imposed state sequence $\mathbf{x}_{0}$ due to its importance for developing system prompts and controlling text completion-based generation where the desired output is $\mathbf{x}_{0}+\mathbf{y}^{*}$ for some desired continuation $\mathbf{y}^{*}$ of partial string $\mathbf{x}_{0}$. Due to the costly nature of long prompts, we are especially interested in the existence of prompts $\mathbf{u}^{*}$ with minimal length $|\mathbf{u}^{*}|$.

¹ Both situations are reasonable in developing LLM systems: $\mathbf{u}$ preceding $\mathbf{x}_{0}$ may arise when prompting an LLM to complete a partial string $\mathbf{x}_{0}$, while $\mathbf{u}$ following $\mathbf{x}_{0}$ may arise when prompting an LLM in the presence of an imposed system prompt $\mathbf{x}_{0}$. Therefore, how an initial state $\mathbf{x}_{0}$ is interleaved with control input $\mathbf{u}$ is largely a design decision.
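For a very small vocabulary, the single-token reachable set under a prepended prompt can be enumerated by brute force. The sketch below assumes a hypothetical deterministic model, `toy_next_token`, that outputs the most frequent token in its context (ties broken alphabetically); it stands in for greedy decoding from $P_{LM}$ and is not a real language model.

```python
from itertools import product

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary

def toy_next_token(state):
    """Hypothetical greedy model: the most frequent token in the context wins.

    max() returns the first maximal element, so ties go to the earliest
    token in VOCAB (alphabetical order)."""
    return max(VOCAB, key=state.count)

def reachable_set(x0, k):
    """Map each reachable next token y to the first prompt u with |u| <= k
    found (shortest first) such that the model outputs y on u + x0."""
    reached = {}
    for length in range(k + 1):
        for u in product(VOCAB, repeat=length):
            y = toy_next_token(list(u) + x0)
            reached.setdefault(y, u)
    return reached

x0 = ["a", "b"]
for k in (0, 1, 2):
    print(k, reachable_set(x0, k))
```

The enumeration visits on the order of $|\mathcal{V}|^{k}$ prompts, which is why this exhaustive approach is only viable for toy settings and why optimization-based search is needed for real models.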

Definitions 3.3 and 3.4 form the basis for our control theory of LLMs. While amenable to theoretical analysis as in Section 4 and [35], empirical analysis of the reachable set and controllability is challenging due to the intractable size of $\mathcal{V}^{*}$ . We propose the following statistical measure of controllability for practically assessing the controllability of an LLM system w.r.t. a dataset $\mathcal{D}$ under prompt length constraint $|\mathbf{u}|\leq k$ :

Definition 3.5 ( $k$ - $\epsilon$ Controllability).

Consider a dataset of state-output pairs $\mathcal{D}=\{(\mathbf{x}_{0}^{i},\mathbf{y}^{i})\}_{i\in[N]}$. An LLM $\Sigma=(\mathcal{V},P_{LM})$ is $k$-$\epsilon$ controllable w.r.t. $\mathcal{D}$ if

$$\Pr\{\mathbf{y}\notin\mathcal{R}^{k}_{y}(\mathbf{x}_{0})\}\leq\epsilon \tag{2}$$

for $(\mathbf{x}_{0},\mathbf{y})\sim\mathcal{D}$, where $\mathcal{R}^{k}_{y}(\mathbf{x}_{0})$ is the reachable set of outputs as in Definition 3.3 under the constraint that prompts $\mathbf{u}$ must have length $|\mathbf{u}|\leq k$.
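An empirical estimate of the smallest admissible $\epsilon$ is the fraction of dataset pairs whose target is not reached by any prompt of length at most $k$. The sketch below assumes a hypothetical count-based toy model and a hand-written four-element dataset; with a real LLM the exhaustive search in `is_reachable` would be replaced by a prompt optimizer, in which case the estimate becomes an upper bound on the miss rate.

```python
from itertools import product

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary

def toy_next_token(state):
    """Hypothetical greedy model: most frequent context token, ties alphabetical."""
    return max(VOCAB, key=state.count)

def is_reachable(x0, y, k):
    """True if some prompt u with |u| <= k, prepended to x0, yields output y."""
    return any(
        toy_next_token(list(u) + x0) == y
        for length in range(k + 1)
        for u in product(VOCAB, repeat=length)
    )

def epsilon_hat(dataset, k):
    """Fraction of (x0, y) pairs whose target y is not reachable within k tokens."""
    misses = sum(not is_reachable(x0, y, k) for x0, y in dataset)
    return misses / len(dataset)

# Illustrative dataset of (initial state, desired next token) pairs.
dataset = [
    (["a", "b"], "c"),
    (["a", "a", "b"], "b"),
    (["c"], "a"),
    (["a", "a", "a"], "c"),
]
for k in (1, 2, 4):
    print(k, epsilon_hat(dataset, k))
```

The estimate is non-increasing in $k$, since every prompt admissible under a shorter length budget remains admissible under a longer one.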

Our empirical work in Section 5.2 explores $k$ - $\epsilon$ controllability w.r.t. initial states $\mathbf{x}_{0}$ sampled from the Wikitext dataset. While empirical analysis of LLM controllability is challenging due to the lack of apparent structure in LLM dynamics and the combinatorially large state space, we may still experimentally establish the existence of optimal prompts $\mathbf{u}^{*}$ that elicit a given output, and thus establish a lower bound on the content of the reachable set. Meanwhile, our theoretical work in Section 4 establishes upper bounds on the content of the reachable set for self-attention. We hope these complementary approaches aid in unifying our understanding of LLM systems.

4 The Self-Attention Control Theorem

Self-attention is a central component in modern transformer-based language models [1, 15, 43, 44]. Introduced in [45] and popularized by [46], self-attention is the primary component in transformers where token representations exchange information. Self-attention mechanisms have significantly advanced the field of natural language processing, enabling models to capture long-range dependencies and achieve impressive performance on various tasks. Despite the widespread adoption and success of self-attention, the extent to which the outputs of self-attention layers can be precisely controlled via the input sequence remains an open question.

In this section, we present the Self-Attention Control Theorem, which proves bounds for understanding the reachability of self-attention outputs given limited control over the input token representations.

4.1 Preliminaries

Definition 4.1 (Self-Attention).

Self-attention $\Xi$ is parameterized by weight matrices $\boldsymbol{\theta}=(\mathbf{W}_{q},\mathbf{W}_{\rm key},\mathbf{W}_{v})$ . $\Xi$ is a mapping from $\mathbb{R}^{N\times d_{in}}$ to $\mathbb{R}^{N\times d_{out}}$ , where $N$ is the number of input token representations, each of dimensionality $d_{in}$ , and $d_{out}$ is the dimensionality of the output token representations.

$$\Xi(\mathbf{X};\boldsymbol{\theta})=\mathbf{D}^{-1}\exp\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{\rm key}}}\right)\mathbf{V} \tag{3}$$

where $\exp()$ denotes element-wise exponentiation of the matrix entries, $\mathbf{W}_{q},\mathbf{W}_{\rm key}\in\mathbb{R}^{d_{in}\times d_{\rm key}}$ , $\mathbf{W}_{v}\in\mathbb{R}^{d_{in}\times d_{out}}$ , $\mathbf{Q}=\mathbf{X}\mathbf{W}_{q}$ , $\mathbf{K}=\mathbf{X}\mathbf{W}_{\rm key}$ , $\mathbf{V}=\mathbf{X}\mathbf{W}_{v}$ , and $\mathbf{D}$ is a diagonal positive definite matrix defined as

$$\mathbf{D}:=\text{diag}\left(\exp\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{\rm key}}}\right)\mathbf{1}_{N\times 1}\right) \tag{4}$$

where $\mathbf{1}_{N\times 1}$ is an $N\times 1$ matrix of ones.
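Equations 3 and 4 translate directly into code. The following dependency-free sketch represents matrices as lists of rows; the helper names `matmul`, `transpose`, and `self_attention` and the example weights are illustrative choices, and no causal mask is applied, matching Definition 4.1 as stated.

```python
import math

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention following Equations 3 and 4."""
    d_key = len(Wq[0])
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scores = matmul(Q, transpose(K))
    # Element-wise exponential of Q K^T / sqrt(d_key).
    E = [[math.exp(s / math.sqrt(d_key)) for s in row] for row in scores]
    # Diagonal entries of D are the row sums of E (Equation 4).
    D = [sum(row) for row in E]
    # D^{-1} E is the row-normalized attention matrix.
    A = [[e / d for e in row] for row, d in zip(E, D)]
    return matmul(A, V)

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
# With zero query weights every score is 0, so attention is uniform and each
# output row is the mean of the value rows.
print(self_attention(X, zeros, identity, identity))
```

The same function accepts any number of rows $N$, reflecting that the parameters of $\Xi$ do not depend on sequence length.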

The parameters and operation of $\Xi$ are independent of the number of token representations $N$ . Self-attention is typically applied to discrete token sequences by embedding each token in the sequence as a vector in $\mathbb{R}^{d_{in}}$ to construct the matrix of $N$ token representations $\mathbf{X}\in\mathbb{R}^{N\times d_{in}}$ .

We focus on the reachability of output token representations $\Xi(\mathbf{X};\boldsymbol{\theta})$ , where we partition the input $\mathbf{X}\in\mathbb{R}^{(k+m)\times d_{in}}$ into a $k\times d_{in}$ block of control input representations $\mathbf{U}$ and an $m\times d_{in}$ block of imposed state representations $\mathbf{X_{0}}$ (cf. Definition 3.1) where $k+m=N$ . Thus the complete input matrix $\mathbf{X}$ is a concatenation of the control input $\mathbf{U}$ and the imposed state $\mathbf{X}_{0}$ .

$$\begin{aligned}\Xi(\mathbf{X};\boldsymbol{\theta})&=\Xi\left(\begin{bmatrix}\mathbf{U}\\ \mathbf{X}_{0}\end{bmatrix};\boldsymbol{\theta}\right)=\Xi([\mathbf{U};\mathbf{X}_{0}];\boldsymbol{\theta})&&\text{(5)}\\ &=\begin{bmatrix}\mathbf{U}^{\prime}\\ \mathbf{Y}\end{bmatrix}=[\mathbf{U}^{\prime};\mathbf{Y}]&&\text{(6)}\end{aligned}$$

We also partition the output $\mathbf{X^{\prime}}=\Xi(\mathbf{X};\boldsymbol{\theta})\in\mathbb{R}^{(k+m)\times d_{out}}$ into a corresponding $k\times d_{out}$ matrix $\mathbf{U^{\prime}}$ and an $m\times d_{out}$ matrix $\mathbf{Y}$.

We aim to characterize the reachable set of output representations $\mathbf{Y}\in\mathcal{R}_{y}^{k}(\mathbf{X_{0}})$ under $m$ imposed input representations $\mathbf{X_{0}}$ and $k$ controllable input representations $\mathbf{U}$. Although the reachable set is now a set of continuous-valued output representation matrices in $\mathbb{R}^{m\times d_{out}}$, we can readily adapt Definitions 3.2-3.3 to define the reachable set for these conditions:

Reachability for Self Attention:

Following from the original output reachability definition (Definition 3.2), let ${\mathbf{Y}}^{*}\in\mathbb{R}^{m\times d_{out}}$ be the desired output. We consider $\mathbf{Y}^{*}$ reachable from initial state $\mathbf{X}_{0}$ if there exists some $\mathbf{U}$ that steers the output of $\Xi\big{(}[\mathbf{U};\mathbf{X}_{0}];\boldsymbol{\theta}\big{)}$ to output $[\mathbf{U}^{\prime};\mathbf{Y}]$ such that $\mathbf{Y}=\mathbf{Y}^{*}$ .

4.2 The theorem and its motivation

Our approach is to split the output $\mathbf{Y}$ into two parts, $\mathbf{Y}=\mathbf{Y}_{u}+\mathbf{Y}_{x}$ , corresponding to the control input and imposed state, respectively. $\mathbf{Y}_{x}$ can be bounded as a function of $\mathbf{X},k$ , and $\boldsymbol{\theta}$ . $\mathbf{Y}_{u}$ is the remaining component arising from $\mathbf{U}$ . Each of the two parts $\mathbf{Y}_{u}$ and $\mathbf{Y}_{x}$ is split into two further components, one orthogonal to $\mathbf{Y}^{*}$ and one parallel to it. For instance, we denote the orthogonal part of $\mathbf{Y}_{x}$ by $\mathbf{Y}_{x,\perp}$ . Thus we have

$$\begin{aligned}\mathbf{Y}&=\mathbf{Y}_{u}+\mathbf{Y}_{x}\\ &=(\mathbf{Y}_{u,\parallel}+\mathbf{Y}_{u,\perp})+(\mathbf{Y}_{x,\parallel}+\mathbf{Y}_{x,\perp})\end{aligned}$$

After rearranging, we have $\mathbf{Y}=(\mathbf{Y}_{u,\parallel}+\mathbf{Y}_{x,\parallel})+(\mathbf{Y}_{u,\perp}+\mathbf{Y}_{x,\perp})\in\operatorname{span}(\mathbf{Y}^{*})\oplus\operatorname{span}(\mathbf{Y}^{*})^{\perp}$. If the desired output is reachable, then $\mathbf{Y}_{u,\perp}+\mathbf{Y}_{x,\perp}=\mathbf{0}$ and also $\|\mathbf{Y}_{u,\perp}\|=\|\mathbf{Y}_{x,\perp}\|$ (see Appendix B.3).
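The decomposition can be made concrete numerically. In the sketch below, each output row at an imposed-state position is split according to whether a value vector comes from $\mathbf{U}$ or from $\mathbf{X}_{0}$, with both parts sharing the same softmax denominator. Treating the parallel and orthogonal components row by row, relative to the corresponding row of $\mathbf{Y}^{*}$, is an assumption made here for illustration (the precise construction is in Appendix B); the function names and example matrices are likewise hypothetical.

```python
import math

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def matmul(A, B):
    return [[dot(row, col) for col in zip(*B)] for row in A]

def attention_split(U, X0, Wq, Wk, Wv):
    """Return (Y_u, Y_x): output rows at the X0 positions, split by value source."""
    X = U + X0                       # stacked input [U; X0]
    k = len(U)
    d_key = len(Wq[0])
    Q = matmul(X0, Wq)               # queries for the imposed-state rows only
    K = matmul(X, Wk)
    V = matmul(X, Wv)
    Y_u, Y_x = [], []
    for q in Q:
        w = [math.exp(dot(q, key) / math.sqrt(d_key)) for key in K]
        D = sum(w)                   # shared softmax denominator
        cols = range(len(V[0]))
        Y_u.append([sum(w[j] * V[j][c] for j in range(k)) / D for c in cols])
        Y_x.append([sum(w[j] * V[j][c] for j in range(k, len(X))) / D for c in cols])
    return Y_u, Y_x

def split_parallel_orthogonal(y, y_star):
    """Split row y into components parallel and orthogonal to row y_star."""
    scale = dot(y, y_star) / dot(y_star, y_star)
    parallel = [scale * t for t in y_star]
    orthogonal = [a - b for a, b in zip(y, parallel)]
    return parallel, orthogonal

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
Y_u, Y_x = attention_split([[2.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]], zeros, identity, identity)
print(Y_u, Y_x)
print(split_parallel_orthogonal([1.0, 1.0], [2.0, 0.0]))
```

Summing `Y_u` and `Y_x` row by row recovers the $\mathbf{Y}$ block of the full self-attention output on $[\mathbf{U};\mathbf{X}_{0}]$.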

Theorem 4.2 (Self-Attention Control Theorem, proved in Appendix B).

Let $\mathbf{Y}_{x}^{\operatorname{max}}=\Xi(\mathbf{X}_{0};\boldsymbol{\theta})$ be the output of the self-attention layer given only the imposed state $\mathbf{X}_{0}$ . As before, we denote the $i$ -th row of the orthogonal component of $\mathbf{Y}_{x}^{\operatorname{max}}$ to the desired $\mathbf{Y}^{*}$ as $\mathbf{Y}_{x,\perp}^{\operatorname{max},i}$ .

Then $\mathbf{Y}^{*}$ is unreachable for any control input $\mathbf{U}$ if, for any $i\in\{1,\dots,m\}$ ,

$$\|\mathbf{Y}_{x,\perp}^{\operatorname{max},i}\|>k\gamma_{i}(\mathbf{X}_{0},\boldsymbol{\theta}) \tag{7}$$

where

$$\gamma_{i}(\mathbf{X}_{0},\boldsymbol{\theta}):=\frac{e^{\alpha}}{g_{i}}\sigma_{v}M_{u},\quad\alpha=\sigma_{q}\sigma_{\rm key}M_{u}M_{x}/\sqrt{d_{\mathrm{key}}} \tag{8}$$

$$g_{i}(\mathbf{X}_{0},\boldsymbol{\theta}):=\sum_{j=1}^{m}\exp\left((\mathbf{X}_{0})^{i}\mathbf{W}_{q}\mathbf{W}_{\mathrm{key}}^{\top}(\mathbf{X}_{0})^{j\top}/\sqrt{d_{\mathrm{key}}}\right), \tag{9}$$

$\sigma_{v},\sigma_{q}$ and $\sigma_{\rm key}$ being the maximum singular values of the value, query and key projection matrices, respectively, and with $M_{u}:=\max_{j}\|\mathbf{U}^{j}\|$ , $M_{x}:=\max_{j}\|(\mathbf{X}_{0})^{j}\|$ being the maximum norms of the control and imposed token embeddings, respectively.
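The quantities in Equations 7-9 are straightforward to evaluate once the maximum singular values are known. The sketch below takes $\sigma_{q}$, $\sigma_{\rm key}$, $\sigma_{v}$, and $M_{u}$ as given inputs (in practice the singular values would come from an SVD of the weight matrices); the example uses identity weights, whose maximum singular value is 1, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def row_times_matrix(x, W):
    return [dot(x, col) for col in zip(*W)]

def g_i(i, X0, Wq, Wk):
    """Equation 9: sum of exponentiated attention scores of row i against X0."""
    d_key = len(Wq[0])
    q = row_times_matrix(X0[i], Wq)
    return sum(
        math.exp(dot(q, row_times_matrix(xj, Wk)) / math.sqrt(d_key)) for xj in X0
    )

def gamma_i(i, X0, Wq, Wk, sigma_q, sigma_key, sigma_v, M_u):
    """Equation 8, with the singular values and M_u supplied by the caller."""
    d_key = len(Wq[0])
    M_x = max(norm(x) for x in X0)
    alpha = sigma_q * sigma_key * M_u * M_x / math.sqrt(d_key)
    return math.exp(alpha) / g_i(i, X0, Wq, Wk) * sigma_v * M_u

def excluded_by_theorem(Y_perp_max, X0, Wq, Wk, sigma_q, sigma_key, sigma_v, M_u, k):
    """True if Equation 7 holds for some row i, i.e. Y* is certified unreachable."""
    return any(
        norm(Y_perp_max[i]) > k * gamma_i(i, X0, Wq, Wk, sigma_q, sigma_key, sigma_v, M_u)
        for i in range(len(X0))
    )

identity = [[1.0, 0.0], [0.0, 1.0]]
X0 = [[1.0, 0.0], [0.0, 1.0]]
Y_perp_max = [[3.0, 0.0], [0.0, 0.0]]  # illustrative orthogonal components
for k in (2, 5):
    print(k, excluded_by_theorem(Y_perp_max, X0, identity, identity, 1.0, 1.0, 1.0, 1.0, k))
```

Note that the test is one-sided: a `False` result does not certify reachability, since Equation 7 gives only a sufficient condition for exclusion from the reachable set.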

Remark 4.3.

The upper bound $k\gamma_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$ scales linearly with $k$ , implying that the set of unreachable $\mathbf{Y}^{*}$ becomes smaller as $k$ grows larger. Moreover, $\gamma$ is solely a function of the imposed state $\mathbf{X}_{0}$ .

Proof Summary:

An important idea of the proof is the decomposition of the output representations $\mathbf{Y}$ into two components: $\mathbf{Y}_{x}$ and $\mathbf{Y}_{u}$. The $\mathbf{Y}_{x}$ component arises from the value projections of the imposed state $\mathbf{X}_{0}$, while $\mathbf{Y}_{u}$ arises from the value projections of the control input $\mathbf{U}$. Although the softmax operation in the self-attention mechanism introduces cross-terms between $\mathbf{X}$ and $\mathbf{U}$ in both $\mathbf{Y}_{x}$ and $\mathbf{Y}_{u}$, we can disentangle their influences by considering the auxiliary representations $\mathbf{Y}_{x}^{\operatorname{max}}$ and $\mathbf{Y}_{u}^{\operatorname{max}}$. Specifically, $\mathbf{Y}_{x}^{\operatorname{max}}=\Xi(\mathbf{X}_{0};\boldsymbol{\theta})$ represents the output of the self-attention mechanism $\Xi$ when only the imposed state $\mathbf{X}_{0}$ is provided as input, without any control input $\mathbf{U}$. We derive the bound in Theorem 4.2 by first deriving the bound $\beta_{i}\geq\|\mathbf{Y}_{u}^{i}\|$ on row $i$ of $\mathbf{Y}_{u}$. In Appendix B.2, we observe that, if $\|\mathbf{Y}_{x,\perp}^{i}\|\geq\beta_{i}$, it is impossible for $\mathbf{Y}_{u,\perp}^{i}$ to nullify the orthogonal component of $\mathbf{Y}_{x}$, rendering $\mathbf{Y}^{*}$ unreachable. A simplification of this inequality yields our bound $\|\mathbf{Y}_{x,\perp}^{\operatorname{max},i}\|>k\gamma_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$.

Discussion of Theorem 4.2:

The reachable set exclusion condition in Equation (7) arises when the output representation $\mathbf{Y}_{x}^{\operatorname{max}}$, which depends only on the imposed state $\mathbf{X}_{0}$, is too far away from the desired output $\mathbf{Y}^{*}$ for the control input $\mathbf{U}$ to steer the output towards $\mathbf{Y}^{*}$. The ability of the control input $\mathbf{U}$ to nullify the impact of $\mathbf{Y}_{x}^{\operatorname{max}}=\Xi(\mathbf{X}_{0};\boldsymbol{\theta})$ scales with the number of control tokens $k$ (see hyperbolic relationship in Equation 13). A longer control input can “dominate” the influence of $\mathbf{X}_{0}$ by increasing the relative contribution of $\mathbf{Y}_{u}$ to the overall output $\mathbf{Y}$.

Furthermore, the proof reveals that the output of self-attention can be decomposed into components that depend primarily on different parts of the input (i.e., $\mathbf{X}_{0}$ and $\mathbf{U}$). While there are cross-terms in the attention matrix $(\mathbf{X}_{0})^{i}\mathbf{W}_{q}\mathbf{W}_{\mathrm{key}}^{\top}(\mathbf{X}_{0})^{j\top}$, these only introduce positive scaling factors (e.g., functions of $g_{i}$) to components (e.g., $\mathbf{Y}_{x}$, $\mathbf{Y}_{u}$) that are not dependent on the control input, allowing us to derive an analytic bound on the reachable output set for self-attention via $\mathbf{Y}_{x}^{\operatorname{max}}$ (see Equations 21-25,23).

The implications of Theorem 4.2 are further discussed in Section 6. See Appendix B for proofs, including Section B.3 for a more general statement of reachability conditions in terms of the parallel and orthogonal components of $\mathbf{Y}^{*}$ and $\mathbf{Y}$.

5 Experiments

To gain a practical, empirical understanding of the reachable set $\mathcal{R}_{y}^{k}(\mathbf{x}_{0})$ , we probe the existence of optimal prompts $\mathbf{u}^{*}$ across datasets $\mathcal{D}$ of initial state–desired output pairs $(\mathbf{x}_{0},y^{*})$ . We scope our experiments to study immediate control (i.e., we check the LLM output after $|y^{*}|$ tokens are generated) where the control input $\mathbf{u}$ is prepended to the imposed state $\mathbf{x}_{0}$ . Moreover, we focus on the case of controlling the LLM system to produce a single output token $y^{*}\in\mathcal{V}$ under some constraint $|u|\leq k$ . This “single-step” control renders the problem of gauging reachability computationally tractable and is a fundamental step toward understanding the iterated dynamics of LLM systems in terms of reachability and controllability. We leave the exploration of reachability and controllability under an extended time horizon (e.g., chain-of-thought, chatbot dynamics, tool-wielding LLMs) and under the requirement of multi-token outputs $\mathbf{y}$ to future work.

5.1 Methods

We apply prompt optimization algorithms to establish the existence of optimal prompts $\mathbf{u}^{*}$ of length $k$ that steer the LLM system from initial state $\mathbf{x}_{0}$ to output $y$ for some dataset $\mathcal{D}$ of initial state-output pairs. In general, prompt optimization algorithms accept a token sequence and a loss function on said token sequence, along with a specification of which tokens are manipulable. The output of a prompt optimizer is a manipulated token sequence (i.e., optimized prompt) designed to minimize the loss. We apply two computational methods to generate optimal prompts: greedy back-generation (Algorithm 2) and greedy coordinate gradient (GCG, introduced in [26]; Algorithm 3). We found that greedy back-generation performed best for short prompts of $k\leq 3$ tokens, while GCG was the best-performing algorithm for prompts of 4 or more tokens. To our knowledge, our greedy back-generation algorithm is novel. For brevity, we place the full description of the two algorithms and our parameter values in Appendix C, as the specifics of the algorithms are not the main contribution of this work.

We focus on understanding the content and structure of the reachable set of LLM system outputs $\mathcal{R}_{y}^{k}(\mathbf{x}_{0})$ , particularly under a constraint on the number of input tokens $k$ . To determine which output tokens are reachable under varying input sequence lengths, we apply an incremental prompt lengthening procedure when searching for optimal prompts on some dataset $\mathcal{D}$ .

Algorithm 1 Back-off Prompt

Require: State-output token sequence $(\mathbf{x}_{0},y)$; LLM system $\Sigma=(P_{LM},\mathcal{V})$.
1: for $k=1$ to $3$ do
2:     $\mathbf{u}_{k}=\text{Greedy Back Generate}(\mathbf{x}_{0},y;\Sigma)$
3:     if $\mathbf{u}_{k}$ steers $\Sigma$ from $\mathbf{x}_{0}\to y$ then
4:         return $\mathbf{u}_{k}$
5:     end if
6: end for
7: for $k\in\{4,6,8,10\}$ do
8:     $\mathbf{u}_{k}=\text{Greedy Coordinate Gradient}(\mathbf{x}_{0},y;\Sigma)$
9:     if $\mathbf{u}_{k}$ steers $\Sigma$ from $\mathbf{x}_{0}\to y$ then
10:         return $\mathbf{u}_{k}$
11:     end if
12: end for
13: return Failed to establish reachability.
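As a rough illustration, the control flow of Algorithm 1 can be sketched in Python. The two optimizers and the `steers` check are passed in as callables. Their signatures, including the explicit length argument `k`, are assumptions made for this sketch and are not taken from the released code.

```python
from typing import Callable, List, Optional, Sequence

Tokens = Sequence[int]
# Hypothetical optimizer interface: (x0, y, k) -> candidate prompt of length k.
Optimizer = Callable[[Tokens, int, int], List[int]]
# Hypothetical check: does prompt u steer the system from x0 to output token y?
SteerCheck = Callable[[Tokens, Tokens, int], bool]


def back_off_prompt(
    x0: Tokens,
    y: int,
    greedy_back_generate: Optimizer,
    greedy_coordinate_gradient: Optimizer,
    steers: SteerCheck,
) -> Optional[List[int]]:
    """Try increasing prompt lengths; return the first prompt that steers x0 -> y."""
    # Lengths 1 to 3 use greedy back-generation, lengths 4, 6, 8, 10 use GCG.
    schedule = [(k, greedy_back_generate) for k in (1, 2, 3)]
    schedule += [(k, greedy_coordinate_gradient) for k in (4, 6, 8, 10)]
    for k, optimize in schedule:
        u = optimize(x0, y, k)
        if steers(u, x0, y):
            return u
    return None  # failed to establish reachability
```

Because the search stops at the first success, the length of the returned prompt is an upper bound on the shortest steering prompt for that instance, and a `None` result does not prove that the target is unreachable.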

5.2 Results

Our results revolve around the reachable set $\mathcal{R}_{y}^{k}(\mathbf{x}_{0})$ for state sequences sampled from the Wikitext dataset. Results were computed for a panel of models, including Falcon-7b, Falcon-40b, and Llama-7b. Falcon-7b results are showcased in this section while additional plots and results for Falcon-40b and Llama-7b can be found in Section D. We applied the same Back-off Prompt strategy (Algorithm 1) to determine $k$ - $\epsilon$ controllability for all experiments, varying the specifics of the dataset $\mathcal{D}$ for each experiment.

“Ground truth” reachability:

We established the reachability of the “ground truth” next token $y$ following state token sequence $\mathbf{x}_{0}$ in Wikitext. In our tests on a dataset of 5000 state-output sequences with states of length 8 to 32 tokens, we found that the true next token $y$ is reachable over 97% of the time across all models with a prompt of length $k\leq 10$ (Figure 3). Plots and supplementary figures for Falcon-40b and Llama-7b controllability w.r.t. ground truth Wikitext outputs can be found in Section D.1.

Top-75 reachability:

To explore the reachable set $\mathcal{R}_{y}^{k}(\mathbf{x}_{0})$ beyond the ground truth of Wikitext outputs, we generated a synthetic dataset of outputs by sampling 25 Wikitext sequences $\mathbf{x}_{0}$ and selecting the top 75 most likely next-tokens according to the model itself $P_{LM}(y|\mathbf{x}_{0})$ as the target tokens (Figure 3). We found that the top 75 output tokens were reachable over 85% of the time for all models with control sequence length $k=10$ . Supplementary figures including results for Llama-7b and Falcon-40b on $k$ - $\epsilon$ controllability with respect to the top 75 most likely output tokens can be found in Section D.2.

Uniformly sampled target outputs:

To maximally push the bounds of the reachable set within our single output token scope, we created another synthetic dataset where the target output token $y^{*}$ was sampled uniformly from the highest likelihood next token to the lowest likelihood token. Although the overall $k$-$\epsilon$ score was relatively poor (only 46.43% reachable with $k=10$ for Falcon-7b), we were intrigued by the near-uniform relationship between prior token rank (based on $P_{LM}(y|\mathbf{x}_{0})$) and the required number of prompt tokens. Figure 3 plots the relationship between prior target token rank based on $P(y^{*}|\mathbf{x}_{0})$ and the prompt length $k$ required to elicit the target token. While over half were unreachable, the remaining reachable tokens appear uniformly distributed in terms of required prompt length, regardless of rank. Supplementary figures analyzing the $k$-$\epsilon$ controllability of Falcon-7b with respect to uniformly sampled target outputs $y$ can be found in Section D.3.
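Reachability percentages of the kind reported above can be tabulated from per-instance search results. The sketch below assumes each instance is summarized by the shortest prompt length at which the search succeeded, or `None` if every tried length failed. The function name and input format are hypothetical.

```python
from typing import Dict, Iterable, Optional


def reachable_fraction_by_k(
    min_prompt_lengths: Iterable[Optional[int]],
    ks: Iterable[int] = (1, 2, 3, 4, 6, 8, 10),
) -> Dict[int, float]:
    """Fraction of instances whose target was reached with a prompt of length <= k."""
    lengths = list(min_prompt_lengths)
    if not lengths:
        raise ValueError("need at least one instance")
    return {
        k: sum(1 for n in lengths if n is not None and n <= k) / len(lengths)
        for k in ks
    }
```

Since the lengths come from a heuristic search, each fraction is a lower bound on the true share of instances reachable within $k$ tokens.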

6 Discussion

We proposed a control theoretic framework for understanding language model prompting, orienting our investigation around the reachable set of outputs $\mathcal{R}_{y}^{k}(\mathbf{x}_{0})$ . We proved a bound on the reachable set of outputs for self-attention in terms of the singular values of its weight matrices, and we established fundamental results on the reachability of “correct” next tokens (according to Wikitext). We expanded the scope of this investigation by probing the reachability of tokens assigned high likelihood by the LLM itself (top 75 most likely next tokens), and tokens assigned minimal likelihood by the LLM itself (randomly sampled target tokens).

The Self-Attention Control Theorem (Theorem 4.2) provides a sufficient condition for the unreachability of a desired output $\mathbf{Y}^{*}$ in terms of the projection of a single row of $\mathbf{Y}_{x}^{\operatorname{max}}=\Xi(\mathbf{X}_{0};\boldsymbol{\theta})$ onto the orthogonal complement of $\mathbf{Y}^{*}$. If the orthogonal component of $\mathbf{Y}_{x}^{\operatorname{max}}$ exceeds $k\gamma_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$, then no prompt of length $\leq k$ can steer the self-attention head to output $\mathbf{Y}^{*}$ under the input constraints. The threshold $k\gamma_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$ depends on the imposed input $\mathbf{X}_{0}$, the number of control tokens $k$, and the maximum singular values of the query, key, and value weight matrices, $\boldsymbol{\theta}=(\mathbf{W}_{\rm key},\mathbf{W}_{q},\mathbf{W}_{v})$. Intuitively, this result suggests that if the output $\mathbf{Y}=\mathbf{Y}_{x}+\mathbf{Y}_{u}$ has component $\mathbf{Y}_{x}$ too large and misaligned with $\mathbf{Y}^{*}$, then no control input with $k$ or fewer tokens can yield a component $\mathbf{Y}_{u}$ that corrects the misalignment, even if control inputs $\mathbf{U}$ yield maximal influence on the output under the $k$-token limit (Figure 2).

Bounding the reachable set for self-attention is deeply related to the mechanism by which consistent representations are formed for multi-token generation. Steering a language model to generate a desired token sequence requires that the control input induce a token representation in the right-most token such that the next token prediction logits $P(\mathbf{y}|\mathbf{u}+\mathbf{x}_{0})$ achieve a desired value. Moreover, generated tokens are fed back into the model, and their representations must be steered as well to control iterated generation. Self-attention is the primary mechanism by which the token representations exchange information, making the reachable set of output representations across multiple tokens in $\mathbf{X}_{0}$ for self-attention a fundamental part of LLM control theory. The Self-Attention Control Theorem provides a step towards understanding the limitations and possibilities of controlling the self-attention layer, and by extension, the language model as a whole.

Our empirical results suggest that there is far more to the reachability of a given output than just prior likelihood or the prior rank the LLM assigns to a given token. Although prompt optimization-based $k$-$\epsilon$ controllability experiments are only able to provide a lower bound on the content of the reachable set, the ability to frequently steer even the least likely token to become the most likely token with just a few input tokens is intriguing (Figure 3, bottom right). This result indicates the importance of further investigating the reachability and controllability of LLMs, particularly for developing capable and reliable LLM systems.

Our investigations provide an entry point into the understanding of LLM controllability via prompts. However, a comprehensive understanding necessitates extending our exploration into diverse regimes. Exploring controllability with longer prompts and longer questions (base token sequences) will be pivotal. Equally important is the study of diverse models to verify the generality of our findings. The direct comparison of controllability scores of different model families is challenging since each model family uses a different tokenizer. The Llama family tokenizer, for instance, has a vocabulary of 32,000 tokens whereas the Falcon family has a vocabulary of roughly 65,000 tokens. Further work is required to robustly compare controllability across models.

An intriguing observation from our study is the log-linear relationship between prompt length $k$ and controllability fraction $\epsilon$ (see Figure 4 in Appendix D). While this is compelling within our studied domain, it raises the essential question: is this relationship robust outside our current explorative scope? Unearthing universal scaling laws in LLM controllability would not only inform practical control applications but also open the door for theoretical insight into the nature of LLM behavior.

The progress we have made, both in understanding the bounds on self-attention controllability and the empirical measures of $k$ - $\epsilon$ LLM controllability, underscores the potential of this control theoretic framing for studying LLMs. Below is a non-exhaustive list of open problems in LLM control, all stemming from the framing in section A:

• Control Properties of Chain-of-Thought: Chain-of-Thought is a powerful technique where LLMs are allowed to generate intermediate tokens (i.e., “thoughts”) between a question and an answer [39]. The control properties (e.g., stability, reachability) of systems leveraging these techniques are of great interest for understanding and composing systems of LLMs in the real world.
• Distributional Control: To what extent can we control the output distribution of a language model $P_{LM}(\mathbf{y}|\mathbf{x}_{0}+\mathbf{u})$ to a desired distribution $P^{*}(\mathbf{y})$?
• Computational Cost of Control: What are the performance characteristics of LLM control regularized by computational cost?
• Learnability of Control: To what extent can LLMs learn to control each other? Work such as [24] showed that LLMs are capable of human-level prompt engineering, but it is unclear how well an LLM can learn to control another when explicitly optimized on the objective of LLM control.
• Controllable Subspaces: In the control of linear dynamical systems, it is known that uncontrollable systems are often coordinate transformable into a representation where a subset of the coordinates are controllable and a subset are uncontrollable [47]. We have shown that controllable and uncontrollable components naturally emerge for self-attention heads in Section 4. Can this be generalized to transformer blocks with nonlinearities and residual streams?
• Composable LLM Systems: One of the greatest boons of control theory is the ability to compose control modules and subsystems into an interpretable, predictable, and effective whole [48]. The composition of LLM systems (potentially with non-LLM control modules) is an exciting avenue for scaling superintelligent systems.

Practically, our findings lay the groundwork for more effective and efficient prompt engineering. The ability to control even the least likely tokens illuminates untapped capabilities within LLMs, hinting at a potentially broader spectrum of application than previously recognized. Such insights could lead to the development of more nuanced and sophisticated LLM systems, capable of handling tasks with greater precision and adaptability.

Impact statement

This paper introduces foundational work aimed at enhancing our understanding and control of generative language models (LLMs) as they become integral to crucial societal functions. The increasing integration of generative AI into critical infrastructures — such as healthcare data analysis, insurance and financial data processing, and emergency response systems — underscores the urgency for a sophisticated control theory. Drawing on the principles of control theory, which have historically ensured the dependability of machines in life-or-death scenarios (e.g., in cruise control and aircraft navigation systems), our goal is to extend these guarantees to LLM-based applications. By doing so, we aim to make these advanced AI systems as trustworthy and robust as their electro-mechanical counterparts, thereby securing their role in supporting and safeguarding society.

Code availability

All code used to produce the experimental results is provided with the submission.

References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 1877–1901, Curran Associates, Inc., 2020.
[2] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” 2022.
[3] T. Hagendorff, “Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods,” 2023.
[4] D. Noever and F. McKee, “Numeracy from literacy: Data science as an emergent skill from large language models,” 2023.
[5] OpenAI, “GPT-4 technical report,” 2023.
[6] OpenAI, Nov 2022.
[7] L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi, and Z. Tu, “Document-level machine translation with large language models,” arXiv preprint arXiv:2304.02210, 2023.
[8] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code llama: Open foundation models for code,” 2023.
[9] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022.
[10] H. Zhang, H. Song, S. Li, M. Zhou, and D. Song, “A survey of controllable text generation using transformer-based pre-trained language models,” CoRR, vol. abs/2201.05337, 2022.
[11] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, “Sparks of artificial general intelligence: Early experiments with gpt-4,” 2023.
[12] T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah, “Towards monosemanticity: Decomposing language models with dictionary learning,” Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
[13] H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability beyond attention visualization,” 2021.
[14] A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso, “Towards automated circuit discovery for mechanistic interpretability,” 2023.
[15] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” 2023.
[16] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, E. Goffinet, D. Heslow, J. Launay, Q. Malartic, B. Noune, B. Pannier, and G. Penedo, “Falcon-40B: an open large language model with state-of-the-art performance,” HuggingFace, 2023.
[17] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” 2016.
[18] W. L. Taylor, “‘Cloze procedure’: A new tool for measuring readability,” Journalism Quarterly, vol. 30, no. 4, pp. 415–433, 1953.
[19] F. Petroni, T. Rocktäschel, P. S. H. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel, “Language models as knowledge bases?,” CoRR, vol. abs/1909.01066, 2019.
[20] J. Weston, A. Bordes, S. Chopra, and T. Mikolov, “Towards ai-complete question answering: A set of prerequisite toy tasks,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings (Y. Bengio and Y. LeCun, eds.), 2016.
[21] Z. Wang, Q. Xie, Z. Ding, Y. Feng, and R. Xia, “Is chatgpt a good sentiment analyzer? a preliminary study,” arXiv preprint arXiv:2304.04339, 2023.
[22] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can we know what language models know?,” 2020.
[23] Y. Wen, N. Jain, J. Kirchenbauer, M. Goldblum, J. Geiping, and T. Goldstein, “Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery,” 2023.
[24] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba, “Large language models are human-level prompt engineers,” 2023.
[25] L. Reynolds and K. McDonell, “Prompt programming for large language models: Beyond the few-shot paradigm,” 2021.
[26] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” 2023.
[27] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” 2020.
[28] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” 2015.
[29] C. Guo, A. Sablayrolles, H. Jégou, and D. Kiela, “Gradient-based adversarial attacks against text transformers,” 2021.
[30] W. Shi, X. Han, H. Gonen, A. Holtzman, Y. Tsvetkov, and L. Zettlemoyer, “Toward human readable prompt tuning: Kubrick’s the shining is a good movie, and a good prompt too?,” 2022.
[31] M. Deng, J. Wang, C.-P. Hsieh, Y. Wang, H. Guo, T. Shu, M. Song, E. P. Xing, and Z. Hu, “Rlprompt: Optimizing discrete text prompts with reinforcement learning,” 2022.
[32] Y. Hao, Z. Chi, L. Dong, and F. Wei, “Optimizing prompts for text-to-image generation,” 2023.
[33] T. Zhang, X. Wang, D. Zhou, D. Schuurmans, and J. E. Gonzalez, “Tempera: Test-time prompting via reinforcement learning,” 2022.
[34] D.-K. Kim, S. Sohn, L. Logeswaran, D. Shim, and H. Lee, “Multiprompter: Cooperative prompt optimization with multi-agent reinforcement learning,” 2023.
[35] S. Soatto, P. Tabuada, P. Chaudhari, and T. Y. Liu, “Taming ai bots: Controllability of neural states in large language models,” 2023.
[36] T.-M. Yi, Y. Huang, M. I. Simon, and J. Doyle, “Robust perfect adaptation in bacterial chemotaxis through integral feedback control,” Proceedings of the National Academy of Sciences, vol. 97, no. 9, pp. 4649–4653, 2000.
[37] S. Aniţa, V. Arnăutu, and V. Capasso, An introduction to optimal control problems in life sciences and economics: From mathematical models to numerical simulation with MATLAB®, vol. 2. Springer, 2011.
[38] S. Roy, Y. Wan, and A. Saberi, “A network control theory approach to virus spread mitigation,” in 2009 IEEE Conference on Technologies for Homeland Security, pp. 599–606, 2009.
[39] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2023.
[40] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[41] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive apis,” 2023.
[42] K. Sivaramakrishnan, V. Sivaramakrishnan, and M. M. K. Oishi, “Stochastic reachability of discrete-time stochastic systems via probability measures,” 2023.
[43] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[44] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth, “Recent advances in natural language processing via large pre-trained language models: A survey,” ACM Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023.
[45] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2016.
[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[47] E. D. Sontag, Mathematical control theory: deterministic finite dimensional systems, vol. 6. Springer Science & Business Media, 2013.
[48] F.-L. Lian, J. Moyne, and D. Tilbury, “Network design consideration for distributed control systems,” IEEE transactions on control systems technology, vol. 10, no. 2, pp. 297–307, 2002.
[49] R. E. Kalman, P. L. Falb, and M. A. Arbib, Topics in mathematical system theory, vol. 33. McGraw-Hill New York, 1969.
[50] K. Ogata, Modern control engineering fifth edition. Prentice Hall, 2010.

Appendix A Abstract Systems and Control Theory Background

This section provides an overview of fundamental control-theoretic concepts from an abstract, set-theoretic perspective. We primarily draw from the canonical textbooks [47, 49, 50].

Diverse definitions of “system” or “machine” exist in the literature, all representing the same core concept but varying in mathematical details. We offer the following high-level definition based on [47] Chapter 2:

Definition A.1 (System).

A “system” or “machine” $\Sigma=(\mathcal{T,X,U},\phi)$ consists of:

• $\mathcal{T}:$ The time set along which system state evolves.
• $\mathcal{X}:$ The state space.
• $\mathcal{U}:$ The input space.
• $\phi:\mathcal{X\times U\times T}^{2}\to\mathcal{X}:$ The transition map.

A system may also be equipped with an output space and readout map $(\mathcal{Y},h)$ :

• $\mathcal{Y}:$ The output space.
• $h:\mathcal{X\times U\times T}\to\mathcal{Y}:$ The readout map.

In other words, at time $t\in\mathcal{T}$ , the system’s state takes on values $x\in\mathcal{X}$ , and the control input takes values $u\in\mathcal{U}$ . The system evolves over time with the transition map $\phi(x,u,t,t^{\prime})$ that returns the new state value $x^{\prime}\in\mathcal{X}$ at time $t^{\prime}>t$ . A system can also have a readout map $h(x,u,t)$ that produces the output value $y\in\mathcal{Y}$ given the current time, state, and input value. An input $u\in\mathcal{U}$ defined over interval $[t,t^{\prime}]$ may be said to steer the system $\Sigma=(\mathcal{T,X,U},\phi)$ from state $x_{0}$ to state $x^{\prime}$ if $x^{\prime}=\phi(x_{0},u,t,t^{\prime})$ . A wide variety of systems are expressible within this framework. E.g., we obtain discrete-time dynamical systems for $\mathcal{T}=\mathbb{Z}^{+}$ . Continuous-time dynamical systems emerge for $\mathcal{T}=\mathbb{R}^{+}$ . We apply Definition A.1 to formulate LLM systems in Definition 3.1.
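To make Definition A.1 concrete, the following is a minimal Python sketch of a time-invariant, discrete-time system. The class name `DiscreteSystem`, the one-step function `step`, and the simplification that the readout depends on the state alone are assumptions made for illustration, not part of the definition.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
Input = Hashable


@dataclass(frozen=True)
class DiscreteSystem:
    """Time-invariant discrete-time system that consumes one input symbol per step."""

    step: Callable[[State, Input], State]  # one-step transition
    readout: Callable[[State], Hashable]   # readout map h (state-only in this sketch)

    def phi(self, x0: State, u: Sequence[Input]) -> State:
        """Transition map: apply the input sequence u starting from state x0."""
        x = x0
        for symbol in u:
            x = self.step(x, symbol)
        return x

    def steers(self, x0: State, u: Sequence[Input], x_target: State) -> bool:
        """True iff the input sequence u steers the system from x0 to x_target."""
        return self.phi(x0, u) == x_target
```

For example, `DiscreteSystem(step=lambda x, u: (x + u) % 5, readout=lambda x: x % 2)` models a counter modulo 5 whose output is the parity of the state.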

Note that we assume that the system $\Sigma$ is time-invariant; its dynamics $\phi$ do not change as a function of time. This assumption is widely applicable and is often made in the literature [49, 50, 47] to simplify definitions and discussions of systems.

Reachability is a core control theory concept and is central to defining controllability. At their core, definitions of reachability revolve around the existence of control inputs $u\in\mathcal{U}$ that steer the system from a starting state $x_{0}\in\mathcal{X}$ to some desired state(s). Following from Chapters 1-2 of [49] and Chapter 2 of [47], we define state reachability as:

Definition A.2 (State Reachability).

State $x\in\mathcal{X}$ is reachable from initial state $x_{0}\in\mathcal{X}$ for system $\Sigma=(\mathcal{T,X,U},\phi)$ iff there exists some time $T$ and control input $u^{*}\in\mathcal{U}$ such that $u^{*}$ steers the system from state $x_{0}$ to state $x$ at time $T$ .

We may use this definition of state reachability to define the reachable state set for some initial state $x_{0}\in\mathcal{X}$ :

Definition A.3 (Reachable State Set).

The reachable state set from initial state $x_{0}\in\mathcal{X}$ for system $\Sigma=(\mathcal{T,X,U},\phi)$ is denoted $\mathcal{R}(x_{0})\subseteq\mathcal{X}$ and consists of all reachable states $x\in\mathcal{X}$ from initial state $x_{0}$ (cf. Definition A.2).

For systems with readout maps $h$ , notions of output reachability arise naturally. Note that state reachability is neither necessary nor sufficient to guarantee output reachability.

Definition A.4 (Output Reachability).

Output $y\in\mathcal{Y}$ is reachable from initial state $x_{0}\in\mathcal{X}$ for system $\Sigma=(\mathcal{T,X,U},\phi,\mathcal{Y},h)$ iff there exists some time $T$ and control input $u^{*}\in\mathcal{U}$ such that $u^{*}$ steers the system from state $x_{0}$ to output $y$ in time $T$ .

Definition A.5 (Reachable Output Set).

The reachable output set from initial state $x_{0}\in\mathcal{X}$ for system $\Sigma=(\mathcal{T,X,U},\phi,\mathcal{Y},h)$ is denoted $\mathcal{R}_{y}(x_{0})$ and consists of all reachable outputs $y\in\mathcal{Y}$ from initial state $x_{0}$ (cf. Definition A.4).

A system is controllable when the reachable set extends to the entire state space. Practically speaking, this implies that one can steer the system from any initial state to any desired state. We develop the reachable set for LLM systems in Definition 3.3 and LLM reachability in Definition 3.2.

Definition A.6 (State Controllability).

System $\Sigma=(\mathcal{T,X,U},\phi)$ is state controllable iff, for every initial state $x_{0}\in\mathcal{X}$ , the reachable set $\mathcal{R}(x_{0})=\mathcal{X}$ .

Definition A.7 (Output Controllability).

System $\Sigma=(\mathcal{T,X,U},\phi,\mathcal{Y},h)$ is output controllable iff, for every initial state $x_{0}\in\mathcal{X}$ , the reachable output set $\mathcal{R}_{y}(x_{0})=\mathcal{Y}$ .
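For a finite, time-invariant, discrete-time system, Definitions A.3 to A.7 can be checked by exhaustive search. The sketch below is illustrative only: the function names are hypothetical, the system is given by a one-step function `step(x, u)`, and the initial state is counted as reachable via the empty input, which is an assumption about how Definition A.2 treats $T=0$.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Sequence, Set


def reachable_states(x0: Hashable, inputs: Sequence, step: Callable) -> Set:
    """Breadth-first search over all states reachable from x0 (x0 included)."""
    seen = {x0}
    frontier = deque([x0])
    while frontier:
        x = frontier.popleft()
        for u in inputs:
            nxt = step(x, u)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def reachable_outputs(x0, inputs: Sequence, step: Callable, readout: Callable) -> Set:
    """Outputs obtained by applying the readout to every reachable state."""
    return {readout(x) for x in reachable_states(x0, inputs, step)}


def is_state_controllable(states: Iterable, inputs: Sequence, step: Callable) -> bool:
    """Definition A.6: every state is reachable from every initial state."""
    all_states = set(states)
    return all(reachable_states(x0, inputs, step) == all_states for x0 in all_states)


def is_output_controllable(states, outputs, inputs, step, readout) -> bool:
    """Definition A.7: every output is reachable from every initial state."""
    all_outputs = set(outputs)
    return all(
        reachable_outputs(x0, inputs, step, readout) == all_outputs
        for x0 in set(states)
    )
```

Exhaustive search of this kind is only feasible for very small state spaces. For LLM systems the state space of token sequences is far too large, which is why the experiments rely on prompt optimization instead.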

A range of fruitful questions stem from these definitions: if there is a cost associated with control inputs $u\in\mathcal{U}$ (e.g., power constraints, length constraints), what is the minimum cost of control? What is the minimum time required to get from the initial state to the desired final state or output? If the system is not completely controllable, under what conditions is it controllable? Under which readout maps is a system output controllable? We develop controllability for LLMs abstractly in Definition 3.4 and in an empirically/statistically testable fashion in Definition 3.5.

Appendix B Theory on Self-Attention Controllability

Note: Key terms for the proof are introduced in Section 4 surrounding Theorem 4.2. Specifically, the definition of self-attention mechanism $\Xi$ , the control problem setup, and the reachable set $\mathcal{R}_{y}^{k}(\mathbf{X}_{0})$ are required background for this proof.

Notation: For each token representation matrix $\mathbf{Q,K,V}\in\mathbb{R}^{(k+m)\times\cdot}$ , we denote the first $k$ rows corresponding to $\mathbf{U}$ using $u$ as a subscript, like $\mathbf{Q}_{u}$ . The remaining $m$ rows corresponding to $\mathbf{X}_{0}$ are denoted with subscript $x$ like $\mathbf{Q}_{x}$ .

B.1 Proof of Theorem 4.2

Let $\mathbf{A}$ be the exponentiated query-key outer product matrix with the following block structure:

$$\mathbf{A}=\exp\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{\rm key}}}\right)=\exp\left(\frac{1}{\sqrt{d_{\rm key}}}\begin{bmatrix}\mathbf{Q}_{u}\mathbf{K}_{u}^{\top}&\mathbf{Q}_{u}\mathbf{K}_{x}^{\top}\\ \mathbf{Q}_{x}\mathbf{K}_{u}^{\top}&\mathbf{Q}_{x}\mathbf{K}_{x}^{\top}\end{bmatrix}\right)=\begin{bmatrix}\mathbf{A}_{uu}&\mathbf{A}_{ux}\\ \mathbf{A}_{xu}&\mathbf{A}_{xx}\end{bmatrix} \tag{10}$$

where $\mathbf{Q}_{u}=\mathbf{U}\mathbf{W}_{q}$ , $\mathbf{K}_{x}=\mathbf{X}_{0}\mathbf{W}_{\rm key}$ , and similarly for $\mathbf{K}_{u},\mathbf{Q}_{x}$ . We apply a similar quadrant decomposition to $\mathbf{D}$ , defined initially in Equation 4.

$$\mathbf{D}=\operatorname{diag}\left(\exp\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{\rm key}}}\right)\mathbf{1}_{N\times 1}\right)=\begin{bmatrix}\mathbf{D}_{u}&\mathbf{0}\\ \mathbf{0}&\mathbf{D}_{x}\end{bmatrix} \tag{11}$$

where the quadrant demarcations in $\mathbf{D}$ follow from Equation 10.

We may now express the self-attention mechanism output representations $\mathbf{Y}$ as

$$\mathbf{Y}=\underbrace{\mathbf{D}_{x}^{-1}\mathbf{A}_{xu}\mathbf{V}_{u}}_{\mathbf{Y}_{u}}+\underbrace{\mathbf{D}_{x}^{-1}\mathbf{A}_{xx}\mathbf{V}_{x}}_{\mathbf{Y}_{x}} \tag{12}$$
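The decomposition in Equation 12 can be sanity-checked numerically. The NumPy sketch below assumes non-causal softmax attention with the control rows stacked above the state rows, as in the notation of this appendix. The function name and the choice to read $d_{\rm key}$ off the column count of $\mathbf{W}_{\rm key}$ are assumptions of the sketch.

```python
import numpy as np


def attention_on_state_rows(U, X0, Wq, Wk, Wv):
    """Return (Y_u, Y_x, Y): the two terms of Eq. (12) and the full output on X0's rows."""
    k = U.shape[0]
    Z = np.vstack([U, X0])                  # control tokens first, then imposed state
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    d_key = Wk.shape[1]
    A = np.exp(Q @ K.T / np.sqrt(d_key))    # exponentiated scores, Eq. (10)
    D = A.sum(axis=1)                       # row-wise softmax denominators, Eq. (11)
    A_xu, A_xx = A[k:, :k], A[k:, k:]       # blocks acting on the rows of X0
    D_x = D[k:, None]
    Y_u = (A_xu @ V[:k]) / D_x              # contribution of the control input
    Y_x = (A_xx @ V[k:]) / D_x              # contribution of the imposed state
    Y = ((A @ V) / D[:, None])[k:]          # ordinary softmax attention, state rows only
    return Y_u, Y_x, Y
```

For any inputs of compatible shapes, `Y_u + Y_x` should agree with `Y` up to floating-point error, since both terms share the same denominator $\mathbf{D}_{x}$.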

Lemma B.1.

For any control input $\mathbf{U}$ whose rows satisfy $\|\mathbf{U}^{j}\|\leq M_{u}$ for all $j\in\{1,\dots,k\}$ , the norm of the $i$ -th row of $\mathbf{Y}_{u}$ is bounded as follows

$$\|\mathbf{Y}_{u}^{i}\|\leq\beta_{i}(\mathbf{X}_{0},k)$$

where

$$\beta_{i}(\mathbf{X}_{0},k):=\frac{ke^{\alpha}}{g_{i}(\mathbf{X}_{0},\boldsymbol{\theta})+ke^{\alpha}}\sigma_{v}M_{u}, \tag{13}$$

and

$$\alpha=\sigma_{q}\sigma_{\rm key}M_{u}M_{x}/\sqrt{d_{\rm key}},\qquad g_{i}(\mathbf{X}_{0},\boldsymbol{\theta}):=\mathbf{D}_{xx}^{i}=\sum_{j=1}^{m}\exp\left((\mathbf{X}_{0})^{i}\mathbf{W}_{q}\mathbf{W}_{\rm key}^{\top}(\mathbf{X}_{0})^{j\top}/\sqrt{d_{\rm key}}\right).$$
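The bound of Lemma B.1 can be checked empirically on small random instances. In the NumPy sketch below, $M_{x}$ is taken to be the largest row norm of $\mathbf{X}_{0}$, which is an assumption about how $M_{x}$ is defined in Section 4, and the function names are hypothetical.

```python
import numpy as np


def beta_bound(X0, Wq, Wk, Wv, k, M_u):
    """Evaluate beta_i(X0, k) of Eq. (13) for every row i of X0."""
    d_key = Wk.shape[1]
    sigma_q = np.linalg.norm(Wq, 2)          # largest singular value of W_q
    sigma_key = np.linalg.norm(Wk, 2)
    sigma_v = np.linalg.norm(Wv, 2)
    M_x = np.linalg.norm(X0, axis=1).max()   # assumed: max row norm of X0
    alpha = sigma_q * sigma_key * M_u * M_x / np.sqrt(d_key)
    g = np.exp((X0 @ Wq) @ (X0 @ Wk).T / np.sqrt(d_key)).sum(axis=1)
    return k * np.exp(alpha) / (g + k * np.exp(alpha)) * sigma_v * M_u


def y_u_row_norms(U, X0, Wq, Wk, Wv):
    """Euclidean norms of the rows of Y_u = D_x^{-1} A_xu V_u from Eq. (12)."""
    d_key = Wk.shape[1]
    Q_x = X0 @ Wq
    A_xu = np.exp(Q_x @ (U @ Wk).T / np.sqrt(d_key))
    A_xx = np.exp(Q_x @ (X0 @ Wk).T / np.sqrt(d_key))
    D_x = A_xu.sum(axis=1) + A_xx.sum(axis=1)
    Y_u = (A_xu @ (U @ Wv)) / D_x[:, None]
    return np.linalg.norm(Y_u, axis=1)
```

Given a control input `U` whose rows have norm at most `M_u`, the lemma predicts `y_u_row_norms(U, X0, Wq, Wk, Wv) <= beta_bound(X0, Wq, Wk, Wv, U.shape[0], M_u)` elementwise. The bound is typically loose because of the $e^{\alpha}$ factor.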

Proof.

Our objective is to establish an upper bound on $\|\mathbf{Y}_{u}^{i}\|$ , the Euclidean norm of the $i$ -th row of the matrix $\mathbf{Y}_{u}$ , which represents the contribution of the control input to the output of the self-attention layer. $g_{i}=\mathbf{D}_{xx}^{i}$ represents the component of the row-wise softmax denominator $\mathbf{D}_{x}$ from $\mathbf{A}_{xx}$ (solely a function of $\mathbf{X}_{0}$ ). Similarly, $\mathbf{D}_{xu}$ represents the component of $\mathbf{D}_{x}$ from $\mathbf{A}_{xu}$ , and $\mathbf{D}_{x}=\mathbf{D}_{xx}+\mathbf{D}_{xu}$ . We observe that $\mathbf{D}_{xu}^{i}$ is the sum of the entries in the $i$ -th row of $\mathbf{A}_{xu}$ :

\mathbf{D}_{xu}^{i}=\sum_{j=1}^{k}(\mathbf{A}_{xu})_{ij}=\sum_{j=1}^{k}\exp(\langle\mathbf{Q}_{x}^{i},\mathbf{K}_{u}^{j}\rangle/\sqrt{d_{\rm key}}),

(14)

where $\mathbf{Q}_{x}^{i}$ and $\mathbf{K}_{u}^{j}$ denote the $i$ -th row of $\mathbf{Q}_{x}$ and the $j$ -th row of $\mathbf{K}_{u}$ , respectively. Every entry of the diagonal matrix $\mathbf{D}_{xu}$ is strictly positive.

Recall that $\mathbf{D}_{x}=\mathbf{D}_{xx}+\mathbf{D}_{xu}$ and $\mathbf{V}_{u}=\mathbf{U}\mathbf{W}_{v}$ . We begin by expressing the $i$ -th row of $\mathbf{Y}_{u}$ as:

\mathbf{Y}_{u}^{i}=(\mathbf{D}_{xx}^{i}+\mathbf{D}_{xu}^{i})^{-1}(\mathbf{A}_{xu}^{i}\mathbf{V}_{u}^{i}),

(15)

where $\mathbf{D}_{xx}^{i}$ and $\mathbf{D}_{xu}^{i}$ denote the $i$ -th diagonal entries of the matrices $\mathbf{D}_{xx}$ and $\mathbf{D}_{xu}$ , respectively, $\mathbf{A}_{xu}^{i}$ represents the $i$ -th row of the matrix $\mathbf{A}_{xu}$ , and $\mathbf{V}_{u}^{i}$ corresponds to the $i$ -th row of the matrix $\mathbf{V}_{u}$ .

Let $\alpha_{ij}:=\langle\mathbf{Q}_{x}^{i},\mathbf{K}_{u}^{j}\rangle/\sqrt{d_{\rm key}}$ denote the scaled key-query dot products, so that $(\mathbf{A}_{xu})_{ij}=\exp(\alpha_{ij})$ by Equation 14. Then $\alpha_{ij}\leq\alpha$ for all $i,j$, where $\alpha$ is defined to be an upper bound on the scaled key-query dot products between vectors in $\mathbf{U}$ and $\mathbf{X}$ given by

\alpha=\sigma_{q}\sigma_{\rm key}M_{u}M_{x}/\sqrt{d_{\rm key}}.

(16)

Recall that $\sigma_{q},\sigma_{\rm key}$ are the maximal singular values of $\mathbf{W}_{q},\mathbf{W}_{\rm key}$ respectively.

By applying the Cauchy-Schwarz inequality and using the definitions of $\alpha$, $M_{u}$, and $M_{x}$, we obtain the bound $(\mathbf{A}_{xu})_{ij}\leq e^{\alpha}$ for all $i,j$, and thus:

\mathbf{D}_{xu}^{i}=\mathbf{A}^{i}_{xu}\mathbf{1}\leq\sum_{j=1}^{k}e^{\alpha}=ke^{\alpha},

(17)

where $\mathbf{1}$ is a constant vector consisting of all entries equal to 1.

Next, we note that $\|\mathbf{V}_{u}^{i}\|\leq C$, where $C:=\sigma_{v}M_{u}$, and $\sigma_{v}$ denotes the maximum singular value of the value projection matrix $\mathbf{W}_{v}$. This fact follows directly from the definition of $\mathbf{V}_{u}$. It allows us, while bounding $\|\mathbf{Y}_{u}^{i}\|$, to replace $\mathbf{V}_{u}^{i}$ with a constant vector $\mathbf{C}$ whose entries are all equal to $C$, yielding an upper bound on $\|\mathbf{A}_{xu}^{i}\mathbf{V}_{u}^{i}\|$:

\|\mathbf{A}_{xu}^{i}\mathbf{V}_{u}^{i}\|\leq\|\mathbf{A}_{xu}^{i}\mathbf{C}\|=C\sum_{j=1}^{k}(\mathbf{A}_{xu})_{ij}=C\mathbf{D}_{xu}^{i}.

(18)

We now rewrite the norm $\|\mathbf{Y}_{u}^{i}\|$ . Toward that end, let $g_{i}(\cdot,\boldsymbol{\theta}):\mathbb{R}^{m\times d}\to[0,\infty)$ denote the function of $\mathbf{X}_{0}$ defined by $g_{i}(\mathbf{X}_{0},\boldsymbol{\theta}):=\mathbf{D}_{xx}^{i}$ .

\|\mathbf{Y}_{u}^{i}\|=\frac{\|\mathbf{A}_{xu}^{i}\mathbf{V}_{u}^{i}\|}{\mathbf{D}_{xx}^{i}+\mathbf{D}_{xu}^{i}}

(19)

\leq\frac{\|\mathbf{A}_{xu}^{i}\mathbf{C}\|}{\mathbf{D}_{xx}^{i}+\mathbf{D}_{xu}^{i}}=\frac{\langle\mathbf{A}_{xu}^{i},\mathbf{C}\rangle}{\mathbf{D}_{xx}^{i}+\mathbf{A}_{xu}^{i}\mathbf{1}}

(20)

\leq\frac{Cke^{\alpha}}{g_{i}+ke^{\alpha}}

(21)

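The final line follows from (17) and the observation that the function $f(x):=x/(x+g_{i})$, where $g_{i}>0$, is monotone increasing for $x\geq 0$: the right-hand side of (20) equals $Cf(\mathbf{D}_{xu}^{i})$, and $\mathbf{D}_{xu}^{i}\leq ke^{\alpha}$.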

Let

\beta_{i}(\mathbf{X}_{0},k):=\frac{ke^{\alpha}}{g_{i}(\mathbf{X}_{0},\boldsymbol{\theta})+ke^{\alpha}}\sigma_{v}M_{u}.

(22)

We have established that $\|\mathbf{Y}_{u}^{i}\|\leq\beta_{i}(\mathbf{X}_{0},k)$ for any control input $\mathbf{U}$ whose rows satisfy $\|\mathbf{U}^{j}\|\leq M_{u}$ for all $j\in\{1,\dots,k\}$ . The same bound holds for $\|\mathbf{Y}_{u,\perp}^{i}\|$ , the norm of the projection of $\mathbf{Y}_{u}^{i}$ onto the orthogonal complement of $\mathbf{Y}^{*}$ . ∎
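As a numerical sanity check of Lemma B.1, the following NumPy sketch draws random parameters and inputs, computes $\mathbf{Y}_{u}$ as in Equation 12, and compares each row norm against $\beta_{i}$. The dimensions, the 0.3 weight scale, and the function name `lemma_b1_check` are assumptions made purely for illustration; $M_{u}$ and $M_{x}$ are taken to be the largest row norms of the sampled $\mathbf{U}$ and $\mathbf{X}_{0}$.

```python
import numpy as np


def lemma_b1_check(m=5, k=3, d=8, d_key=4, seed=0):
    """Return (row norms of Y_u, beta_i) for one random self-attention instance."""
    rng = np.random.default_rng(seed)
    # Hypothetical random parameters and inputs (illustration only).
    W_q = 0.3 * rng.normal(size=(d, d_key))
    W_key = 0.3 * rng.normal(size=(d, d_key))
    W_v = 0.3 * rng.normal(size=(d, d))
    X0 = rng.normal(size=(m, d))
    U = rng.normal(size=(k, d))

    # Row-norm bounds and maximal singular values (matrix 2-norms).
    M_u = np.linalg.norm(U, axis=1).max()
    M_x = np.linalg.norm(X0, axis=1).max()
    s_q = np.linalg.norm(W_q, 2)
    s_key = np.linalg.norm(W_key, 2)
    s_v = np.linalg.norm(W_v, 2)

    # Unnormalized attention blocks for the queries coming from X0.
    Q_x = X0 @ W_q
    A_xu = np.exp(Q_x @ (U @ W_key).T / np.sqrt(d_key))   # shape (m, k)
    A_xx = np.exp(Q_x @ (X0 @ W_key).T / np.sqrt(d_key))  # shape (m, m)
    g = A_xx.sum(axis=1)            # g_i = D_xx^i
    D_x = g + A_xu.sum(axis=1)      # D_xx^i + D_xu^i

    # Contribution of the control input to the output rows.
    Y_u = (A_xu @ (U @ W_v)) / D_x[:, None]

    alpha = s_q * s_key * M_u * M_x / np.sqrt(d_key)
    ke = k * np.exp(alpha)
    beta = ke / (g + ke) * s_v * M_u
    return np.linalg.norm(Y_u, axis=1), beta


norms, beta = lemma_b1_check()
print(np.all(norms <= beta))
```

Because $\alpha$ is a worst-case bound, the gap between the observed row norms and $\beta_{i}$ in such an experiment is typically large.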

B.2 Simplified reachability hypothesis

We can restate the hypothesis of our self-attention theorem, Theorem 4.2

\|\mathbf{Y}^{\operatorname{max},i}_{x,\perp}\|>\frac{ke^{\alpha}}{g_{i}}\sigma_{v}M_{u}

(23)

as equivalent to

\|\mathbf{Y}^{\operatorname{min},i}_{x,\perp}\|>\beta_{i}=\frac{ke^{\alpha}}{ke^{\alpha}+g_{i}}\sigma_{v}M_{u}.

(24)

Since

\mathbf{Y}^{\operatorname{min},i}_{x}=\frac{g_{i}}{g_{i}+ke^{\alpha}}\mathbf{Y}^{\operatorname{max},i}_{x},

(25)

and $g_{i}$ and $ke^{\alpha}$ are positive scalars, we can substitute (25) into (24), multiply both sides by $(ke^{\alpha}+g_{i})$, and then divide both sides by $g_{i}$ to obtain hypothesis (23). Each step is reversible, so hypotheses (23) and (24) are equivalent.

B.3 More general theorem

Theorem B.2 (Self-Attention Control Theorem 2).

Consider a self-attention layer with input $\mathbf{X}\in\mathbb{R}^{m\times d}$ and control input $\mathbf{U}\in\mathbb{R}^{k\times d}$, where $m$ is the number of imposed tokens, $k$ is the number of control tokens, and $d$ is the token embedding dimension. Let $\mathbf{Y}^{*}\in\mathbb{R}^{m\times d}$ be the desired output, and let $\mathbf{Y}\in\mathbb{R}^{m\times d}$ be the actual output of the self-attention layer. As before, define $\mathbf{Y}_{\perp}=\mathbf{Y}_{x,\perp}+\mathbf{Y}_{u,\perp}\in\mathbb{R}^{m\times d}$ as the projection of the output onto the orthogonal complement of $\mathbf{Y}^{*}$.

If either:

(A) $\|\mathbf{Y}\|=\|\mathbf{Y}^{*}\|$ and there exists a component $\mathbf{Y}_{\perp}^{ij}\neq 0$ of the matrix $\mathbf{Y}_{\perp}$ , or

(B) $\|\mathbf{Y}\|\neq\|\mathbf{Y}^{*}\|$ ,

then

\mathbf{Y}\neq\mathbf{Y}^{*}

(26)

for any control input $\mathbf{U}\in\mathbb{R}^{k\times d}$ such that $\max_{j}\|\mathbf{U}^{j}\|\leq M_{u}$ .

This theorem is also illustrated in Figure 2 and is a more general theorem than Theorem 4.2: the hypothesis of Theorem 4.2 implies that some row satisfies $\|\mathbf{Y}^{\operatorname{min},i}_{x,\perp}\|>\|\mathbf{Y}^{\operatorname{max},i}_{u,\perp}\|$, so it must be the case that there exists some nonzero entry $\mathbf{Y}^{ij}_{\perp}$ of the matrix $\mathbf{Y}_{\perp}$ in the case that $\|\mathbf{Y}\|=\|\mathbf{Y}^{*}\|$.

The Self-Attention Control Theorem (Theorem 4.2) provides valuable insights despite being less general than Theorem B.2. An advantage of Theorem 4.2 is its more specific hypothesis, Equation (7), which provides a concrete criterion for determining whether the desired output can be achieved by the self-attention layer and depends on the properties of the input tokens, the control tokens, and the learned parameters of the self-attention layer, such as the maximum singular values of the query and key projection matrices.

Proof of Theorem B.2.

We will prove the theorem by contradiction. Assume that $\mathbf{Y}=\mathbf{Y}^{*}$ for some control input $\mathbf{U}$ satisfying $\max_{j}\|\mathbf{U}^{j}\|\leq M_{u}$ .

Case (A): If $\|\mathbf{Y}\|=\|\mathbf{Y}^{*}\|$ and there exists a component $\mathbf{Y}_{\perp}^{ij}\neq 0$ of the matrix $\mathbf{Y}_{\perp}$ , then $\mathbf{Y}_{\perp}\neq\mathbf{0}$ . This implies that $\mathbf{Y}$ and $\mathbf{Y}^{*}$ are not parallel, and therefore $\mathbf{Y}\neq\mathbf{Y}^{*}$ , contradicting the assumption.

Case (B): If $\|\mathbf{Y}\|\neq\|\mathbf{Y}^{*}\|$ , then $\mathbf{Y}\neq\mathbf{Y}^{*}$ directly, again contradicting the assumption.

In both cases, we have a contradiction, so the assumption that $\mathbf{Y}=\mathbf{Y}^{*}$ must be false, and we can conclude that $\mathbf{Y}\neq\mathbf{Y}^{*}$ for any control input $\mathbf{U}$ satisfying $\max_{j}\|\mathbf{U}^{j}\|\leq M_{u}$ .

To show that Theorem 4.2 is a special case of the more general theorem, consider the hypothesis of Theorem 4.2: from (7) we conclude that some row satisfies $\|\mathbf{Y}^{min,i}_{x,\perp}\|>\|\mathbf{Y}^{max,i}_{u,\perp}\|$ . This implies that $\mathbf{Y}_{x,\perp}^{ij}\neq-\mathbf{Y}_{u,\perp}^{ij}$ for some entry $(i,j)$ , and therefore $\mathbf{Y}_{\perp}^{ij}=\mathbf{Y}_{x,\perp}^{ij}+\mathbf{Y}_{u,\perp}^{ij}\neq 0$ . This satisfies the condition of case (A) in the more general theorem, assuming $\|\mathbf{Y}\|=\|\mathbf{Y}^{*}\|$ . Thus, the hypothesis of Theorem 4.2 is a special case of the hypothesis in the more general theorem. ∎

By incorporating this bound into the hypothesis, Theorem 4.2 offers a more practical and actionable result, allowing researchers and practitioners to assess the controllability of a self-attention layer based on measurable quantities, without the need to exhaustively search the space of possible control inputs. Moreover, the presence of the bound opens up opportunities for further analysis and optimization, potentially guiding the design of control strategies that satisfy the bound and ensure that the desired output can be reached. Additionally, the bound can be used to derive insights into the relationship between the properties of the input tokens, the control tokens, and the achievable control over the self-attention layer’s output. While Theorem B.2 provides a more general result, Theorem 4.2 complements it by incorporating a specific bound involving $\gamma_{i}$ into its hypothesis, which makes it more practical for the control of self-attention layers in applications.

B.4 Discussion

As seen in equation (13), $\beta_{i}(\mathbf{X}_{0},k)$ exhibits a hyperbolic dependence on $ke^{\alpha}$. This suggests that increasing the number of control tokens can “dominate” the output of the self-attention, overwhelming the influence of the imposed state sequence $\mathbf{X}_{0}$. The theorem’s reachability condition depends on the number of control tokens $k$ through the threshold $\gamma_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$. As discussed in Remark 4.3, the threshold scales linearly with $k$, suggesting that increasing the number of control tokens can potentially enhance controllability. However, this effect is modulated by the other terms in the threshold, such as $\alpha$ and $g_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$, which depend on the properties of the imposed tokens and the model parameters. Specifically, the prefactor $ke^{\alpha}/(g_{i}+ke^{\alpha})$ of $\beta_{i}$ saturates to 1 as $k\to\infty$ or as $\alpha$ becomes very large, so that $\beta_{i}$ approaches $\sigma_{v}M_{u}$.
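The saturation of the prefactor in equation (13) can be seen with a few lines of Python. This is only an illustrative sketch: the values of $\alpha$ and $g_{i}$ below are hypothetical and not taken from any of the models studied, and `beta_fraction` is a name introduced here.

```python
import numpy as np


def beta_fraction(k, alpha, g_i):
    """Prefactor k e^alpha / (g_i + k e^alpha) of beta_i in Equation (13)."""
    ke = k * np.exp(alpha)
    return ke / (g_i + ke)


# Hypothetical values, chosen only to make the trend visible.
alpha, g_i = 1.0, 50.0
for k in (1, 10, 100, 1000):
    print(k, float(beta_fraction(k, alpha, g_i)))
```

The printed prefactor increases monotonically with $k$ and stays below 1, which is the hyperbolic behavior described above.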

The term $g_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$ captures the influence of the imposed tokens on the attention weights and appears in the denominator of the threshold $\gamma_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$ . Larger values of $g_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$ lead to a lower threshold, which may make reachability easier, thus increasing the potentially reachable set size.

The hypothesis of Theorem 4.2 implies that some row of the projection of the minimum possible output $\mathbf{Y}_{x}^{{\rm min}}$ onto the orthogonal complement of $\mathbf{Y}^{*}$ exceeds the corresponding row of the maximum possible projection of $\mathbf{Y}_{u}$ . This ensures the existence of a non-zero component in $\mathbf{Y}_{\perp}$ and precludes reachability. Thus, Theorem 4.2 provides a more specific, practically applicable criterion for assessing controllability than Theorem B.2.

Theorem 4.2’s reachability condition depends on the maximum singular values of the query, key, and value projection matrices ($\mathbf{W}_{q},\mathbf{W}_{\rm key},\mathbf{W}_{v}$). The $\alpha$ term in Theorem 4.2, which involves the maximum singular values of the query and key projection matrices, provides an upper bound on the scaled dot products that is only tight in the special case of maximal alignment between the query and key matrices. In general, the actual size of the threshold $\gamma_{i}$ will be smaller depending on $g_{i}$ and the alignment of queries from $\mathbf{X}_{0}$ with the keys from $\mathbf{U}$ and $\mathbf{X}_{0}$. The distribution of the singular values will also heavily impact the tightness of the bound: if all singular values are the same (e.g., $\mathbf{W}_{q}$ and $\mathbf{W}_{\rm key}$ are each orthogonal matrices), the bound will be tight. If there are a few very large singular values and many small ones, the bound is loose. Therefore, the reachability condition in the theorem can be overly optimistic when used as a test for reachability, though it remains a sufficient condition for unreachability.

Theorem 4.2 and Theorem B.2 focus exclusively on the self-attention mechanism and do not directly address the impact of activation functions and other non-linearities present in the full transformer architecture on the controllability of the final model outputs. In a typical transformer block, the output of the self-attention layer is combined with the residual connection and then passed through a feed-forward sublayer containing a non-linear activation function, such as ReLU or GELU, before proceeding to the next layer. These non-linearities can affect the propagation of signals through the network and, consequently, the controllability of the end-to-end model.

Analyzing controllability in the presence of multiple layers with interleaved non-linearities is an open problem. Investigating this challenge through the lens of, for instance, non-linear control theory has the potential to guide the design of transformer models with enhanced steerability and interpretability, which may advance the frontier of controllable and explainable AI systems. This direction has the potential to advance our understanding of the complex dynamics of large language models and develop principled approaches to controlling their behavior. However, significant research is still needed to realize this goal.

Appendix C Prompt Optimization Algorithms

Greedy Back-Generation:

While testing all prompts in $\mathcal{V}^{k}$ is intractable for $k>1$, it takes only $|\mathcal{V}|$ forward passes of the network to compute the loss on $y$ induced by all possible single-token prompts $u\in\mathcal{V}$. Our Greedy Back-Generation algorithm leverages this fact to generate prompts $u\in\mathcal{V}^{k}$ one token at a time, working backward: at step $i$, it prepends the greedy-optimal single-token extension $u^{\prime}=\arg\max_{u^{\prime}}P_{LM}(y|u^{\prime}+u+x)$ to the current prompt $u\in\mathcal{V}^{i-1}$.

Algorithm 2 Greedy Token-Wise Prompt Generation

Require: A causal LLM $P_{LM}$ with vocabulary $\mathcal{V}$, a set of base tokens $x\in\mathcal{V}^{n}$, a desired final token $y\in\mathcal{V}$, and a desired number of prompt tokens $k$

Ensure: Magic words $u^{*}$ of length $k$

1: Initialize $u^{*}$ to be empty.

2: for $i=1$ to $k$

3:   for all $u^{\prime}\in\mathcal{V}$

4:     compute $P_{LM}(y|u^{\prime}+u^{*}+x)$

5:   end for

6:   Select the $u^{\prime}$ that maximizes the probability of $y$ given $u^{\prime}+u^{*}+x$. Prepend $u^{\prime}$ to $u^{*}$.

7: end for

8: return $u^{*}$

This method is optimal for $k=1$ prompt token $u^{*}\in\mathcal{V}$ and generally outperforms GCG for short prompts of length $k\leq 3$. Computing 1 additional prompt token takes roughly 1-4 minutes when using an NVIDIA A100-80GB GPU with a 7 billion parameter model and 5-20 minutes on 2 NVIDIA A100-80GB GPUs with a 40 billion parameter model.
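The greedy back-generation loop can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the experiments: `prob_y` is a hypothetical callable standing in for $P_{LM}(y|\cdot)$, and `toy_prob_y` is a made-up scoring function used only so that the example runs without a model.

```python
from typing import Callable, List, Sequence


def greedy_back_generation(
    prob_y: Callable[[Sequence[int]], float],
    x: Sequence[int],
    vocab: Sequence[int],
    k: int,
) -> List[int]:
    """Build a k-token prompt by prepending the greedy-optimal token at each step.

    prob_y(tokens) is a hypothetical stand-in for P_LM(y | tokens).
    """
    u_star: List[int] = []
    for _ in range(k):
        # One evaluation per vocabulary entry, i.e. |V| forward passes per step.
        best = max(vocab, key=lambda t: prob_y([t] + u_star + list(x)))
        # Prepend the best single-token extension to the current prompt.
        u_star = [best] + u_star
    return u_star


def toy_prob_y(tokens: Sequence[int]) -> float:
    """Toy score (not a language model): rewards occurrences of token 2."""
    return sum(1 for t in tokens if t == 2) / (len(tokens) + 1)


print(greedy_back_generation(toy_prob_y, x=[0, 1], vocab=range(4), k=3))
```

With a real model, the inner `max` would be evaluated in batches over the vocabulary, since each step requires $|\mathcal{V}|$ forward passes.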

Greedy Coordinate Gradient (GCG):

The Greedy Coordinate Gradient algorithm, presented by [26] building off the work of [27], is the state-of-the-art method for optimizing prompts. Starting with a random prompt of length $k$, the algorithm generates a batch of alternative prompts. Each member of the batch swaps a random token in the current prompt with a promising alternate token. The value metric for a swap is given by a first-order approximation of the change in loss $\mathcal{L}=\text{CELoss}(y,P_{LM}(y|u+x))$ with respect to the embedding of each token in $u$.

Algorithm 3 Greedy Coordinate Gradient

Require: A causal LLM $P_{LM}$ that accepts token strings from a vocabulary $\mathcal{X}$, an embedding dictionary $\mathbf{e}$, embeddings $\mathbf{e}^{*}_{i}$ corresponding to each token $i$ of $u^{*}$, a set of base tokens $x_{1:n}$, a desired number of prompt tokens $k$, iterations $T$, $k_{sub}$, and batch size $B$

Ensure: Magic words $u^{*}$ of length $k$

1: Initialize $u^{*}$ to be random tokens from vocabulary.

2: for $iteration=1$ to $T$

3:   for $i=1$ to $k$

4:     $\mathcal{X}_{i}=$ Top-$k_{sub}$($\mathbf{e}^{\top}\nabla_{\mathbf{e}^{*}_{i}}P_{LM}(x_{n}|u^{*}+x_{1:n-1})$)

5:   end for

6:   for $b=1$ to $B$

7:     $i=$ randint($[1,\dots,k]$)

8:     $j=$ randint($[1,\dots,k_{sub}]$)

9:     $\tilde{u}^{*}_{b}[i]=\mathcal{X}_{i}[j]$

10:   end for

11:   $u^{*}=\tilde{u}^{*}_{b^{\ast}}$, where $b^{\ast}=\operatorname{argmax}_{b}(P_{LM}(x_{n}|\tilde{u}^{*}_{b}+x_{1:n-1}))$

12: end for

13: return $u^{*}$

This method outperforms all other methods we tested for prompts of length $k>3$. We use a batch size $B=768$, sampled from the top $k_{sub}=128$ token replacements at each index, and iterate for $T=34$ iterations. For each instance, this optimization took roughly 2 minutes for the 7 billion parameter models on a single A100-80GB GPU and 4-8 minutes for the 40 billion parameter model on 4 A100-80GB GPUs.
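The structure of a GCG iteration can be illustrated on a toy model whose gradient is available in closed form. The sketch below is an assumption-laden illustration, not the experimental code: a mean-pooled softmax model (`log_prob`) stands in for the LLM, all function names and default hyperparameters are hypothetical, the gradient of the log-probability is used as the first-order score, and a swap is accepted only if it improves on the current prompt, which is a choice made here for the example.

```python
import numpy as np


def log_prob(tokens, y, E, W_out):
    """Toy stand-in for log P_LM(y | tokens): softmax over a mean-pooled embedding."""
    logits = W_out @ E[tokens].mean(axis=0)
    logits = logits - logits.max()
    return logits[y] - np.log(np.exp(logits).sum())


def grad_wrt_embeddings(tokens, y, E, W_out):
    """Analytic gradient of log_prob with respect to each token embedding."""
    logits = W_out @ E[tokens].mean(axis=0)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad_h = W_out[y] - p @ W_out  # gradient w.r.t. the pooled embedding
    return np.tile(grad_h / len(tokens), (len(tokens), 1))


def gcg(x, y, E, W_out, k=4, T=20, k_sub=8, B=32, seed=0):
    rng = np.random.default_rng(seed)
    V = E.shape[0]
    u = rng.integers(0, V, size=k)  # random initial prompt
    for _ in range(T):
        grads = grad_wrt_embeddings(np.concatenate([u, x]), y, E, W_out)[:k]
        # Top-k_sub candidate tokens per prompt position (first-order scores).
        cand = np.argsort(-(grads @ E.T), axis=1)[:, :k_sub]
        # Batch of B prompts, each with one random position swapped.
        batch = np.tile(u, (B, 1))
        pos = rng.integers(0, k, size=B)
        choice = rng.integers(0, k_sub, size=B)
        batch[np.arange(B), pos] = cand[pos, choice]
        scores = [log_prob(np.concatenate([b, x]), y, E, W_out) for b in batch]
        best = int(np.argmax(scores))
        # Assumption of this sketch: keep the current prompt unless a swap improves it.
        if scores[best] > log_prob(np.concatenate([u, x]), y, E, W_out):
            u = batch[best]
    return u


rng = np.random.default_rng(1)
E, W_out = rng.normal(size=(50, 16)), rng.normal(size=(50, 16))
x, y = np.array([3, 7, 11]), 5
u = gcg(x, y, E, W_out)
print(u, log_prob(np.concatenate([u, x]), y, E, W_out))
```

Because the toy model pools embeddings by averaging, every position receives the same gradient and hence the same candidate set; in a real transformer the candidate sets $\mathcal{X}_{i}$ differ across positions.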

Appendix D Supplementary Figures: Optimal Control Prompts

D.1 “Ground Truth” Controllability Results

This subsection includes supplementary figures for the controllability of Llama-7b, Falcon-7b, and Falcon-40b “ground truth” target outputs from Wikitext. For each initial state sequence $\mathbf{x}_{0}$ , the target output $y$ is the token immediately following $\mathbf{x}_{0}$ in Wikitext. We measured the $k$ - $\epsilon$ controllability of each of the 7 billion parameter models with a dataset of 5000 state-output pairs while we used a dataset of 500 state-output pairs for Falcon-40b.

Figure 4 shows each model’s log-spaced $k$-$\epsilon$ curves on the Wikitext dataset, revealing a log-linear relationship between maximum prompt length $k$ and the fraction of uncontrollable initial state-target output pairs $(\mathbf{x}_{0},y)$. We visualize the relationship between prompt length and the prior cross-entropy loss of each LLM on predicting the target output $y$ given the state sequence $\mathbf{x}_{0}$ (i.e., $-\log P_{LM}(y|\mathbf{x}_{0})$) in Figure 5, where we find it difficult to predict the required prompt length from the base loss.
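A $k$-$\epsilon$ curve of this kind can be computed from per-instance results as sketched below. The sketch assumes that, for each pair $(\mathbf{x}_{0},y)$, the shortest successful prompt length found by the optimizer is recorded (or `None` if none was found within the budget), and that $\epsilon$ at a given $k$ is the fraction of pairs not solved with at most $k$ prompt tokens. The function name and input format are hypothetical.

```python
from typing import Dict, List, Optional


def k_epsilon_curve(min_prompt_len: List[Optional[int]], k_max: int) -> Dict[int, float]:
    """Fraction of (x0, y) pairs NOT solved with at most k prompt tokens, for k = 1..k_max.

    min_prompt_len[i] is the shortest successful prompt length found for
    instance i, or None if no successful prompt was found.
    """
    n = len(min_prompt_len)
    return {
        k: sum(1 for length in min_prompt_len if length is None or length > k) / n
        for k in range(1, k_max + 1)
    }


# Hypothetical results for four instances; the third was never solved.
print(k_epsilon_curve([1, 3, None, 2], k_max=3))
```

Plotting the returned values against $k$ on a logarithmic vertical axis gives the style of curve described for Figure 4.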

Finally, Figure 6 shows a histogram of the tokens in the optimized prompts generated in the ground truth $k$ - $\epsilon$ controllability experiments on Wikitext.

D.2 Top-75 Wikitext Controllability Results

This subsection includes supplementary figures for the controllability of Llama-7b, Falcon-7b, and Falcon-40b on the Wikitext dataset where the target output token $y$ for a given initial state token sequence $\mathbf{x}_{0}$ is sampled uniformly from the top 75 highest-probability tokens as determined by the language model itself, $P_{LM}(y|\mathbf{x}_{0})$. Specifically, the dataset $\mathcal{D}$ consists of 25 unique initial state token sequences $\mathbf{x}_{0}$ sampled from Wikitext, each replicated 75 times for the top 75 most probable subsequent tokens $y\sim P(y|\mathbf{x}_{0})$. This procedure yielded a dataset of 1875 initial state-target output pairs $(\mathbf{x}_{0},y)$ for the 7 billion parameter models. Due to the computational requirements for the 40 billion parameter model, the number of unique initial state token sequences was decreased to 10, resulting in a dataset of 750 initial state-target output pairs. The $k$-$\epsilon$ plots for each model are shown in Figure 7. On average, across the 3 models, the top 75 outputs were reachable 86.865% of the time with $k\leq 10$ prompt tokens. Similar log-linear trends were observed in the $k$-$\epsilon$ plot. Figure 8 shows the relationship between base loss and required prompt length, revealing a more dramatic “exclusion zone” in the top left, similar to the main “ground truth” results in Figure 5. Finally, Figure 9 plots a histogram of the 40 most common tokens observed in the optimized control input prompts from the top-75 experiments.

D.3 Uniformly Sampled Output Token Results

This section contains supplementary figures for $k$-$\epsilon$ controllability experiments on a synthetic dataset $\mathcal{D}=\{(\mathbf{x}_{0},y)\}$ where $\mathbf{x}_{0}$ are sampled from the Wikitext dataset and $y$ is sampled uniformly from the vocabulary. The uniform target output dataset $\mathcal{D}$ consists of 616 state-output pairs. Due to computational constraints, $k$-$\epsilon$ controllability was only measured for Falcon-7b. Overall, only 46.42% of the target outputs were reachable with $k=10$ prompt tokens. Figure 10 visualizes the $k$-$\epsilon$ results, the relationship between base loss and prompt length, and the most frequently observed tokens in the optimized control prompts. While the “exclusion zone” behavior (cf. Figures 5 and 8) is observed in the base loss vs. prompt length subplot, base loss remains a poor predictor of required prompt length. Moreover, Figure 3 reveals an even more uniform relationship between the initial rank of the target output token and the required prompt length.

Glossary of Symbols

Self-Attention:

• $\Xi$: The self-attention mechanism, a mapping from $\mathbb{R}^{N\times d_{in}}$ to $\mathbb{R}^{N\times d_{out}}$
• $\mathbf{X}$: The input matrix to self-attention, $\mathbf{X}\in\mathbb{R}^{N\times d_{in}}$
• $N$: The number of input token representations
• $d_{in}$: The dimensionality of each input token representation
• $d_{out}$: The dimensionality of each output token representation
• $\mathbf{W}_{q},\mathbf{W}_{\rm key},\mathbf{W}_{v}$: The query, key, and value projection weight matrices
• $d_{\rm key}$: The dimensionality of the key vectors
• $\mathbf{Q},\mathbf{K},\mathbf{V}$: The query, key, and value matrices
• $\mathbf{D}$: The diagonal matrix used for normalization
• $\mathbf{1}_{N\times 1}$: An $N\times 1$ matrix of ones

Input Partitioning:

• $\mathbf{U}$: The $k\times d_{in}$ submatrix of $\mathbf{X}$ corresponding to the control input
• $\mathbf{X}_{0}$: The $m\times d_{in}$ submatrix of $\mathbf{X}$ corresponding to the imposed state
• $k$: The number of control input tokens
• $m$: The number of imposed state tokens

Output Partitioning:

• $\mathbf{U}^{\prime}$: The $k\times d_{out}$ submatrix of the output corresponding to the control input
• $\mathbf{Y}$: The $m\times d_{out}$ submatrix of the output corresponding to the imposed state
• $\mathbf{Y}^{*}$: The desired output, $\mathbf{Y}^{*}\in\mathbb{R}^{m\times d_{out}}$
• $\mathbf{Y}_{u},\mathbf{Y}_{x}$: The components of $\mathbf{Y}$ arising from $\mathbf{U}$ and $\mathbf{X}_{0}$ respectively
• $\mathbf{Y}_{u,||},\mathbf{Y}_{x,||}$: The components of $\mathbf{Y}_{u}$ and $\mathbf{Y}_{x}$ parallel to $\mathbf{Y}^{*}$
• $\mathbf{Y}_{u,\perp},\mathbf{Y}_{x,\perp}$: The components of $\mathbf{Y}_{u}$ and $\mathbf{Y}_{x}$ orthogonal to $\mathbf{Y}^{*}$
• $\mathbf{Y}_{x,\perp}^{\operatorname{min}}$: The minimum value of $\mathbf{Y}_{x,\perp}$ over all control inputs that are uniformly bounded in norm by a fixed constant $M_{u}$ in the hypothesis of the theorem

Reachability Conditions:

• $\beta_{i}(\mathbf{X}_{0},k)$: The upper bound on the norm of row $i$ of $\mathbf{Y}_{u,\perp}$
• $\gamma_{i}(\mathbf{X}_{0},\boldsymbol{\theta})$: A number that depends on $\mathbf{X}_{0}$ and $\boldsymbol{\theta}=(\mathbf{W}_{q},\mathbf{W}_{\rm key},\mathbf{W}_{v})$
• $\alpha$: An upper bound on the scaled key-query dot products
• $\sigma_{q},\sigma_{\rm key},\sigma_{v}$: The maximum singular values of $\mathbf{W}_{q}$, $\mathbf{W}_{\rm key}$, $\mathbf{W}_{v}$
• $M_{u},M_{x}$: The maximum norms of the control and imposed token embeddings
