What’s the Magic Word? A Control Theory of LLM Prompting

Aman Bhargava
California Institute of Technology
Pasadena, CA, USA
abhargav[at]caltech[dot]edu
Cameron Witkowski
University of Toronto
Toronto, ON, Canada
cameron.witkowski[at]mail.utoronto.ca
Shi-Zhuo Looi
California Institute of Technology
Pasadena, CA, USA
looi[at]caltech[dot]edu
Matt Thomson
California Institute of Technology
Pasadena, CA, USA
mthomson[at]caltech[dot]edu
Abstract

Prompt engineering is crucial for deploying LLMs but is poorly understood mathematically. We formalize LLM systems as a class of discrete stochastic dynamical systems to explore prompt engineering through the lens of control theory. We offer a mathematical analysis of the limitations on the controllability of self-attention as a function of the singular values of the parameter matrices. We present complementary empirical results on the controllability of a panel of LLMs, including Falcon-7b, Llama-7b, and Falcon-40b. Given initial state $\mathbf{x}_0$ from Wikitext and prompts of length $k \leq 10$ tokens, we find that the “correct” next token is reachable at least 97% of the time, and that the top 75 most likely next tokens are reachable at least 85% of the time. Intriguingly, short prompt sequences can dramatically alter the likelihood of specific outputs, even making the least likely tokens become the most likely ones. This control-theoretic analysis of LLMs demonstrates the significant and poorly understood role of input sequences in steering output probabilities, offering a foundational perspective for enhancing language model system capabilities.

1 Introduction

LLMs pre-trained on unsupervised next token prediction objectives exhibit unprecedented dynamic reprogrammability achieved through “prompting”, often referred to as zero-shot learning [1, 2, 3, 4, 5, 6]. These capabilities appear to emerge as the model’s size, training data, and training time are scaled. The dynamic reprogrammability of LLMs is akin to the adaptable computational capacities observed in biological systems. This feature finds applications across domains such as machine translation [7], code generation [8], and chatbots [9]. A rigorous understanding of the prompt’s influence over LLM generation would be of great utility for understanding LLMs and building more robust and capable systems leveraging LLMs.

Strategies for controlling pre-trained LLM generation today fall into three broad categories [10]:

  1. Input Optimization (Prompting): Adjusting the input tokens (e.g., rewording the prompt) to improve subsequent text generation.

  2. Model Optimization: Adjusting the weights of the network (e.g., fine-tuning, RLHF) to improve model behavior during inference.

  3. Post-processing: Adjusting or re-ranking generated text (e.g., surrogate ranking algorithm).

Of all these approaches, input optimization (i.e., prompting) is the least invasive and lowest-cost method – and the least understood. Prompt optimization is also deeply connected to the zero-shot capabilities of LLMs – the mysterious emergent capabilities of LLMs such as problem-solving, knowledge retrieval, reasoning, and apparent general intelligence [11]. With such a view, we seek to characterize the controllability of LLMs via prompting (Figure 1).

Figure 1: Illustration of the control-theoretic approach to LLM prompt engineering. Left: the LLM system diagram mapping an initial state $\mathbf{x}_0$ to a system output $\mathbf{y}$ under the influence of a control input $\mathbf{u}$ (all token sequences). Right: sketch of the reachable output sets $\mathcal{R}_y^k(\mathbf{x}_0)$ for varying control input lengths $k$.

1.1 Contribution

We formalize LLM systems in the mathematical framework of control theory in Section 3. Our analysis focuses on the reachable set of outputs $\mathcal{R}_y(\mathbf{x}_0)$ for an LLM system. The reachable set is a fundamental concept in control theory that underlies notions of controllability, stability, and observability (cf. Appendix A). The reachable output set $\mathcal{R}_y(\mathbf{x}_0)$ is the set of output sequences $\mathbf{y}$ for which there exists a control input sequence $\mathbf{u}^*$ that steers the LLM from initial state $\mathbf{x}_0$ to output $\mathbf{y}$ (cf. Definitions 3.3 and A.5).

Our mathematical results in Section 4 prove an upper bound on the contents of the reachable output set for a self-attention head as a function of the singular values of its parameter matrices. Since self-attention is the only component in a transformer block where significant information is exchanged between token representations, this bound provides a foothold for analysis of LLM controllability from the perspective of mechanistic interpretability (e.g., [12, 13, 14]). Our result represents an analytically computable necessary condition for an output to be in the reachable set (Equation 7).

Our empirical results apply state-of-the-art prompt optimization techniques (Section 5.1) to demonstrate a lower bound on the contents of the reachable output set for a panel of LLMs, including Llama-7b [15], Falcon-7b, and Falcon-40b [16]. Specifically, we sample initial states $\mathbf{x}_0$ from the Wikitext dataset [17] and probe the reachable output tokens $y$ under length-constrained control input sequences $\mathbf{u} : |\mathbf{u}| \leq k$. The length constraint $k$ is highly relevant for optimal control of LLMs, as prompts with fewer tokens require fewer computation and memory resources. We find that the reachable output set contains the “correct” next Wikitext token following $\mathbf{x}_0$ over 97% of the time with prompts of $k \leq 10$ tokens. We expand our analysis of the contents of $\mathcal{R}_y(\mathbf{x}_0)$ by sampling target output tokens $y$ based on the LLM’s initial estimate of output likelihood $P_{LM}(y | \mathbf{x}_0)$. We find that the top 75 most likely output tokens $y$ are reachable at least 85% of the time with prompts of $k \leq 10$ tokens. Intriguingly, some tokens drawn from the set of least likely outputs are controllable to the most likely output with $k \leq 4$ control input tokens. Our results suggest that prior likelihood-based metrics, such as cross-entropy loss, cannot guarantee exclusion from the reachable set, emphasizing the gap in our current understanding of LLM systems and control.
Implications of our results and open questions in LLM control theory are further discussed in Section 6.

2 Related Work

Much of the work on prompt optimization is concerned with finding prompts that induce higher LLM performance on “fill-in-the-blank” or “cloze” tasks [18]. One can frame a range of tasks including knowledge retrieval [19], reasoning [20], and sentiment analysis [21] as fill-in-the-blank tasks:

  • Knowledge Retrieval: “The Titanic sank in the year [MASK].” (Answer: “1912”)

  • Reasoning: “A is taller than B. B is taller than C. Is A taller than C? Answer: [MASK]” (Answer: “Yes”)

  • Sentiment Analysis: “I am sad today. The sentiment of the previous sentence was [MASK]” (Answer: “Negative”)

Notably, there is some freedom in the bolded “prompt text” that surrounds the question to convert it into a “fill-in-the-blank” task. As it turns out, the prompt tokens have a large effect on LLM performance [1, 10, 22].

Modern prompt optimization algorithms generally consist of two iterated steps: a sampling step where new prompts are generated and a testing step where the utility of the new prompts is evaluated, and the best are selected for the next iteration. Algorithms primarily differ in the sampling procedure, where various heuristics may be used to pick high-value swaps [23, 24, 25]. Overall, AutoPrompt and its derivative algorithms have been the most numerically successful prompt optimization methods, with the greedy coordinate gradient (GCG) algorithm having state-of-the-art performance [26].
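The two-step structure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any of the cited algorithms: the function `optimize_prompt`, the uniform random single-token swap used as the sampling heuristic, and the toy utility `toy_score` are all hypothetical stand-ins. A real method would score prompts with an LLM and would typically use a more informed sampling heuristic.

```python
import random

def optimize_prompt(score, vocab, length, iters=50, candidates=16, seed=0):
    """Generic sample-then-test loop; returns the best prompt found and its score."""
    rng = random.Random(seed)
    best = [rng.choice(vocab) for _ in range(length)]
    best_score = score(best)
    for _ in range(iters):
        # Sampling step: propose single-token swaps of the current best prompt.
        # (Uniform random swaps are an assumption; real methods use heuristics
        # such as gradient information to pick high-value swaps.)
        proposals = []
        for _ in range(candidates):
            candidate = list(best)
            candidate[rng.randrange(length)] = rng.choice(vocab)
            proposals.append(candidate)
        # Testing step: evaluate every proposal and keep the best one seen so far.
        for candidate in proposals:
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score

# Hypothetical utility: number of positions that match a hidden target prompt.
TARGET = ["c", "a", "b"]

def toy_score(prompt):
    return sum(p == t for p, t in zip(prompt, TARGET))

best, best_score = optimize_prompt(toy_score, vocab=["a", "b", "c", "d"], length=3)
print(best, best_score)
```

Because a proposal replaces the incumbent only when it scores strictly higher, the returned score never decreases across iterations. The algorithms discussed in this section differ mainly in how the proposals are generated.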

The AutoPrompt Family:

AutoPrompt [27] pioneered the current wave of prompt optimization. Shin et al. propose a prompt optimization technique and demonstrate its effectiveness for engineering prompts to improve LLM performance on knowledge and sentiment analysis tasks. At its core, the AutoPrompt algorithm leverages gradient information at the token embedding layer to inform iterative token exchanges within the prompt. This method was extended in [26] as the greedy coordinate gradient (GCG) algorithm. Taking inspiration from adversarial examples [28], Zou et al. applied this AutoPrompt variant to generate “jailbreak” prompts that cause aligned LLMs to generate objectionable content.

Other Prompt Optimization Methods:

Other investigations on LLMs as prompt optimizers [24] and further analysis of manual prompt optimization [25] are informative but do not exceed the AutoPrompt family’s performance. Other methods include GBDA [29], an approach based on the Gumbel-Softmax reparametrization; the PEZ algorithm [23], which directly optimizes embeddings via gradient information; and FluentPrompt [30], which differs from AutoPrompt by incorporating Langevin dynamics. Another family of techniques relating closely to our work is RL-based prompt optimization [31, 32, 33, 34]. Such methods seek to optimize a prompt generation policy to maximize some reward signal, using a host of off-the-shelf reinforcement learning algorithms. Despite the variety of alternatives, GCG retains state-of-the-art performance.

Control Theory for LLMs:

To our knowledge, the only other work to date on the controllability or reachability of LLM text generation is [35]. Soatto et al. analyze the controllability of LLMs in terms of “meaningful sentences”, defined as the sigma-algebra generated by snippets of text written on the Internet. Their empirical analysis revolves around demonstrating that LLMs are capable of attributing meaning. The theoretical analysis of LLM controllability is limited to “meaningful sentences”, eliminating the possibility of out-of-distribution inputs and outputs. These restrictions render their results challenging to leverage toward a practical understanding of LLM controllability. As stated in Section 5.5 of [35], “If fed gibberish, the well-trained bot operates out of distribution, which does not allow predicting the reachable set”. We situate our work as a practically oriented exploration of LLM controllability. Motivated by challenges in developing LLM systems, we do not eliminate “meaningless sentences” from the state space or input space. We aim to establish a rigorous, general framework for understanding LLM systems and controllability that is amenable to the development of theory and practical engineering insights on systems design.

3 Control Theory for LLMs

Control theory originates from the study of automatic control systems in engineering. It seeks to understand how a “plant” system can be influenced toward a desired state using a “control signal” – often in the presence of disturbances and uncertainty.

Control theory is central to a variety of engineering problems, from electrical engineering to autopilot to telecommunications to manufacturing. Surprisingly, control theory has also been highly applicable to a diverse range of scientific disciplines. Analyzing systems through the lens of controllability has proven fruitful for generating insight into biological systems such as cell signaling pathways and neural networks [36], the economics of central banking [37], and controlling the spread of infectious diseases [38]. One of the central benefits of studying systems via controllability is that a range of questions and problems naturally emerge from the framing: when is control possible? What is the cost of control? How computationally intensive is control? These questions are both practically useful and often lead to fundamental insights about the nature of the system in question.

To develop a control theory of LLMs, we begin with fundamental definitions of systems and control in Appendix A. We extend these fundamentals to define LLM systems (Definition 3.1) and outline specific canonical control concepts and problems such as controllability and reachability (Definitions 3.3 and 3.4) that arise naturally for LLM systems.

Language Model Notation:

We denote a causal language model using $P_{LM}$. $P_{LM}$ maps from an ordered list of tokens from a vocabulary set $\mathcal{V}$ (e.g., $\mathbf{x} \in \mathcal{V}^n$) to the probability distribution over the next token $P_{LM}(x_{n+1} | \mathbf{x}) \in [0,1]^{|\mathcal{V}|}$. We use $\mathcal{V}^*$ to denote the set of all possible sequences of any length composed of tokens from $\mathcal{V}$. The addition operator indicates the concatenation of token sequences. Bolded lowercase variables (e.g., $\mathbf{x} = [x^1, \dots, x^n]$) denote token sequences while unbolded lowercase variables refer to individual tokens (e.g., $x \in \mathcal{V}$). The length of a token sequence is denoted $|\mathbf{x}|$.

While LLMs are at times leveraged in a manner that masks the iterative aspects of generation, the reality is that token generation and externally imposed “control input” sequences are generated and processed sequentially, leading to non-trivial system dynamics. Several key differences remain between LLM-based systems and systems typically modeled through ordinary differential equations (ODEs), which have long been a cornerstone in the study of continuous-time dynamical systems:

  1. Discrete state and time: LLM systems operate on sequences of discrete tokens over a discrete time set, in contrast to the continuous state spaces and time sets studied in classical control theory.

  2. Shift-and-Grow State Dynamics: Whereas the system state in an ODE-based system has a fixed size over time, the system state $\mathbf{x}(t)$ for LLM systems grows as tokens are added to the state sequence.

  3. Mutual exclusion on control input token vs. generated token: The LLM system state $\mathbf{x}(t)$ is written to one token at a time. The newest token is either drawn from the control input $u(t)$ or is generated by the LLM by sampling $x' \sim P_{LM}(x' | \mathbf{x}(t))$. This differs from traditional discrete stochastic systems, where the control sequence and internal dynamics generally affect the state synchronously.

We begin by rigorously defining LLM systems with user input, drawing from the abstract mathematical definition of a system (Definition A.1).

Definition 3.1 (LLM System with Control Input).

An autoregressive LLM system with control input $\Sigma = (\mathcal{V}, P_{LM})$ consists of:

  • $\mathcal{T} = \mathbb{N}$ – The time set is the natural numbers.

  • $\mathcal{X} = \mathcal{V}^*$ – The state space consists of all possible token sequences of any length drawn from $\mathcal{V}$. We denote the state at time $t$ as $\mathbf{x}(t) = [x^0(t), \dots, x^t(t)]$.

  • $\mathcal{U} = \mathcal{V} \cup \varnothing$ – The input takes values from the vocabulary set $\mathcal{V}$ or null.

  • $\phi : \mathcal{X} \times \mathcal{U} \times \mathcal{T}^2 \to \mathcal{X}$ – The transition map is

    $$\phi(\mathbf{x}(t), u(t), t, t+1) = \begin{cases} \mathbf{x}(t) + u(t) & \text{if } u(t) \neq \varnothing \\ \mathbf{x}(t) + x' & \text{else} \end{cases} \tag{1}$$

    where $x' \sim P_{LM}(x' | \mathbf{x}(t))$. Note that the general multi-step transition map $\phi(\mathbf{x}(t), u, t, t+N)$ can be achieved by iterating equation 1 for control sequences $\mathbf{u}$ defined over the interval $[t, t+N]$.

  • $h(\mathbf{x}(t); r) = [x^{t-r}(t), \dots, x^t(t)]$ – The readout map returns the most recent $r$ tokens from state $\mathbf{x}(t)$.
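To make the transition and readout maps concrete, the following is a minimal sketch of Definition 3.1 under simplifying assumptions. The names `transition`, `readout`, `rollout`, and `toy_next_token` are hypothetical, `None` stands in for the null input $\varnothing$, and the toy deterministic next-token rule replaces sampling from a real $P_{LM}$.

```python
from typing import Callable, List, Optional, Sequence

Token = str
State = List[Token]

def transition(state: State, u: Optional[Token],
               next_token: Callable[[State], Token]) -> State:
    """One step of the transition map in equation 1.

    If the control input is non-null it is appended to the state;
    otherwise the model's own next token is appended.
    """
    new_token = u if u is not None else next_token(state)
    return state + [new_token]

def readout(state: State, r: int) -> State:
    """Readout map: the most recent r tokens, following the prose description."""
    return state[-r:] if r > 0 else []

def rollout(x0: Sequence[Token], controls: Sequence[Optional[Token]],
            next_token: Callable[[State], Token]) -> State:
    """Multi-step transition map obtained by iterating the one-step map."""
    state = list(x0)
    for u in controls:
        state = transition(state, u, next_token)
    return state

# Hypothetical stand-in for a language model: deterministically emit the
# cyclic successor of the last token in a three-token vocabulary.
VOCAB = ["a", "b", "c"]

def toy_next_token(state: State) -> Token:
    return VOCAB[(VOCAB.index(state[-1]) + 1) % len(VOCAB)]

# One control token followed by two null inputs (the model generates).
final_state = rollout(["a"], ["c", None, None], toy_next_token)
output = readout(final_state, 2)
print(final_state, output)
```

Here the control input writes one token and then yields to the model for two steps, which illustrates the mutual exclusion between control tokens and generated tokens: each time step appends exactly one token from exactly one source.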

We note that this LLM system definition is generalizable to a variety of LLM augmentations, including chain-of-thought [39], retrieval-augmented generation [40], and chatbot interaction. For example, chain-of-thought is equivalent to sampling the readout map $h(\mathbf{x}(t); r)$ at time $T > |\mathbf{u}| + |\mathbf{x}_0| + r$ for prompt $\mathbf{u}$ and initial state $\mathbf{x}_0$. A similar formulation may be applied to LLM systems endowed with programmatic tools (e.g., [41]).

In Definition 3.1, we assume that the control input gets to “decide” whether to yield token generation to the LLM ($u(t) = \varnothing$) or override the LLM and add some token $u(t) \neq \varnothing$ to the state $\mathbf{x}(t)$. This assumption generally holds when building LLM systems, though it may not hold when using existing systems (e.g., via non-streaming API). When discussing finite-length control inputs – e.g., the family of $k$-long input sequences $\mathbf{u} \in \mathcal{V}^k$ – the value of $u(\ell) : \ell > k$ is implicitly $\varnothing$ unless otherwise stated.

While next token generation $x' \sim P_{LM}(x' | \mathbf{x}(t))$ in equation 1 is probabilistic, we may render the system deterministic by sampling with zero temperature (i.e., greedy decoding). The greedy decoding assumption provides a foothold to analyze the reachable sets and controllability of LLM systems without invoking notions of stochastic control as in [42, 35]. Moreover, greedy decoding remains connected to temperature-based stochastic decoding strategies, since it is the limiting case of temperature-based sampling as the temperature approaches zero.
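The limiting behavior can be checked numerically. The sketch below assumes the standard temperature-scaled softmax over logits. The function names and the example logits are illustrative only and are not taken from any model discussed here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; subtracting the max keeps exp() stable."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Greedy decoding: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits over a three-token vocabulary.
logits = [2.0, 1.0, 0.5]
for temperature in (1.0, 0.5, 0.1, 0.01):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

As the temperature shrinks, the probability mass concentrates on the token chosen by `greedy_token`, so sampling at very low temperature becomes effectively deterministic.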

We now extend Definition A.4 to define output reachability for LLM systems:

Definition 3.2 (LLM Output Reachability).

Output token sequence $\mathbf{y} \in \mathcal{V}^r$ is reachable from initial state $\mathbf{x}_0 \in \mathcal{V}^*$ for LLM system $\Sigma = (\mathcal{V}, P_{LM})$ iff there exists some time $T$ and input $\mathbf{u}^* \in \mathcal{U}^k$ for some $k + |\mathbf{x}_0| \leq T$ that steers the LLM from initial state $\mathbf{x}_0$ to output $\mathbf{y} = h(\mathbf{x}(T); r)$ at time $T$.

We disregard the trivial solution wherein the control input $\mathbf{u}^*(t)$ overrides the LLM to force the state sequence to take on the desired output value $\mathbf{y}$. We focus on the case of immediate generation, where $T = k + |\mathbf{x}_0| + r$.

The reachable output set definition for LLM systems follows from Definition A.5:

Definition 3.3 (LLM Reachable Output Set).

The reachable output set from initial state $\mathbf{x}_0 \in \mathcal{V}^*$ for LLM system $\Sigma = (\mathcal{V}, P_{LM})$ is denoted $\mathcal{R}_y(\mathbf{x}_0)$ and consists of all reachable outputs $\mathbf{y} \in \mathcal{V}^*$ from initial state $\mathbf{x}_0$.
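For a toy system, the length-constrained reachable output set can be enumerated by brute force. The sketch below is illustrative only: it assumes greedy decoding, a single output token ($r = 1$), prompts $\mathbf{u}$ placed before $\mathbf{x}_0$, and a hypothetical deterministic model `toy_greedy_next` whose output depends on how many times the token "a" appears in its context. Exhaustive search over $|\mathcal{V}|^k$ prompts is intractable for real vocabularies, which is why prompt optimization is needed in practice.

```python
from itertools import product

VOCAB = ["a", "b", "c", "d"]

def toy_greedy_next(tokens):
    """Hypothetical greedy LM: output is indexed by the number of 'a' tokens."""
    return VOCAB[tokens.count("a") % len(VOCAB)]

def reachable_set(x0, k, next_token, vocab):
    """Map each reachable output token to a shortest prompt that elicits it.

    Enumerates every prompt u with |u| <= k, prepends it to x0, and records
    the next token produced under greedy decoding.
    """
    witnesses = {}
    for length in range(k + 1):
        for u in product(vocab, repeat=length):
            y = next_token(list(u) + list(x0))
            # Prompts are visited in order of increasing length, so the first
            # witness recorded for each output is a shortest one.
            witnesses.setdefault(y, u)
    return witnesses

x0 = ["b"]
for k in range(4):
    print(k, sorted(reachable_set(x0, k, toy_greedy_next, VOCAB)))
```

In this toy system the reachable set grows by one token with each additional prompt token, mirroring the nested sets sketched in Figure 1. The recorded witnesses play the role of the control inputs $\mathbf{u}^*$ with minimal length.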

Output controllability for LLMs follows from Definition A.7:

Definition 3.4 (LLM Output Controllability).

An LLM system $\Sigma = (\mathcal{V}, P_{LM})$ is output controllable iff, for every initial state $\mathbf{x}_0 \in \mathcal{V}^*$, the reachable output set $\mathcal{R}_y(\mathbf{x}_0) = \mathcal{V}^*$.

The turn-based nature of writing to the LLM state sequence $\mathbf{x}(t)$ invites the question of whether the prompt $\mathbf{u}$ should precede the imposed state $\mathbf{x}_0$ or come after it. (Both situations are reasonable in developing LLM systems: $\mathbf{u}$ preceding $\mathbf{x}_0$ may arise when prompting an LLM to complete a partial string $\mathbf{x}_0$, while $\mathbf{u}$ following $\mathbf{x}_0$ may arise when prompting an LLM in the presence of an imposed system prompt $\mathbf{x}_0$. How an initial state $\mathbf{x}_0$ is interleaved with control input $\mathbf{u}$ is therefore largely a design decision.) We focus our efforts on cases where $\mathbf{u}$ comes before imposed state sequence $\mathbf{x}_0$ due to its importance for developing system prompts and controlling text completion-based generation where the desired output is $\mathbf{x}_0 + \mathbf{y}^*$ for some desired continuation $\mathbf{y}^*$ of partial string $\mathbf{x}_0$. Due to the costly nature of long prompts, we are especially interested in the existence of prompts $\mathbf{u}^*$ with minimal length $|\mathbf{u}^*|$.

Definitions 3.3 and 3.4 form the basis for our control theory of LLMs. While amenable to theoretical analysis as in Section 4 and [35], empirical analysis of the reachable set and controllability is challenging due to the intractable size of $\mathcal{V}^*$. We propose the following statistical measure of controllability for practically assessing the controllability of an LLM system w.r.t. a dataset $\mathcal{D}$ under prompt length constraint $|\mathbf{u}| \leq k$:

Definition 3.5 ($k$-$\epsilon$ Controllability).

Consider a dataset of state-output pairs $\mathcal{D} = \{(\mathbf{x}_0^i, \mathbf{y}^i)\}_{i \in [N]}$. An LLM $\Sigma = (\mathcal{V}, P_{LM})$ is $k$-$\epsilon$ controllable w.r.t. $\mathcal{D}$ if

$$\Pr\{\mathbf{y} \notin \mathcal{R}^k_y(\mathbf{x}_0)\} \leq \epsilon \tag{2}$$

for $(\mathbf{x}_0, \mathbf{y}) \sim \mathcal{D}$, where $\mathcal{R}^k_y(\mathbf{x}_0)$ is the reachable set of outputs as in Definition 3.3 under the constraint that prompts $\mathbf{u}$ must have length $|\mathbf{u}| \leq k$.
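The probability in equation 2 can be estimated as the fraction of dataset pairs whose target output is not found to be reachable. The sketch below is a toy illustration under stated assumptions: `is_reachable` uses exhaustive search over prompts placed before $\mathbf{x}_0$, which is feasible only for tiny vocabularies, and the model `toy_greedy_next` and the dataset are hypothetical. With a real LLM, a prompt optimizer would replace the exhaustive search, and the resulting estimate would only upper-bound the true $\epsilon$, since a failed search does not prove unreachability.

```python
from itertools import product

VOCAB = ["a", "b", "c", "d"]

def toy_greedy_next(tokens):
    """Hypothetical greedy LM: output is indexed by the number of 'a' tokens."""
    return VOCAB[tokens.count("a") % len(VOCAB)]

def is_reachable(x0, y, k, next_token, vocab):
    """True if some prompt u with |u| <= k, prepended to x0, yields token y."""
    return any(
        next_token(list(u) + list(x0)) == y
        for length in range(k + 1)
        for u in product(vocab, repeat=length)
    )

def empirical_epsilon(dataset, k, next_token, vocab):
    """Fraction of (x0, y) pairs whose target y is not reachable within k tokens."""
    misses = sum(not is_reachable(x0, y, k, next_token, vocab) for x0, y in dataset)
    return misses / len(dataset)

# Hypothetical dataset of (initial state, target next token) pairs.
dataset = [(["b"], "a"), (["b"], "b"), (["a"], "c"), (["b"], "d")]
for k in (0, 1, 3):
    print(k, empirical_epsilon(dataset, k, toy_greedy_next, VOCAB))
```

The toy system is $k$-$\epsilon$ controllable w.r.t. this dataset for any $\epsilon$ at or above the printed value for that $k$. As expected, the estimate is non-increasing in $k$ because longer prompt budgets can only enlarge the reachable set.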

Our empirical work in Section 5.2 explores $k$-$\epsilon$ controllability w.r.t. initial states $\mathbf{x}_0$ sampled from the Wikitext dataset. While empirical analysis of LLM controllability is challenging due to the lack of apparent structure in LLM dynamics and the combinatorially large state space, we may still experimentally establish the existence of optimal prompts $\mathbf{u}^*$ that elicit a given output, and thus establish a lower bound on the content of the reachable set. Meanwhile, our theoretical work in Section 4 establishes upper bounds on the content of the reachable set for self-attention. We hope these complementary approaches aid in unifying our understanding of LLM systems.

4 The Self-Attention Control Theorem

Self-attention is a central component in modern transformer-based language models [1, 15, 43, 44]. Introduced in [45] and popularized by [46], self-attention is the primary component in transformers where token representations exchange information. Self-attention mechanisms have significantly advanced the field of natural language processing, enabling models to capture long-range dependencies and achieve impressive performance on various tasks. Despite the widespread adoption and success of self-attention, the extent to which the outputs of self-attention layers can be precisely controlled via the input sequence remains an open question.

In this section, we present the Self-Attention Control Theorem, which proves bounds for understanding the reachability of self-attention outputs given limited control over the input token representations.

4.1 Preliminaries

Definition 4.1 (Self-Attention).

Self-attention $\Xi$ is parameterized by weight matrices $\boldsymbol{\theta}=(\mathbf{W}_q,\mathbf{W}_{\rm key},\mathbf{W}_v)$. $\Xi$ is a mapping from $\mathbb{R}^{N\times d_{in}}$ to $\mathbb{R}^{N\times d_{out}}$, where $N$ is the number of input token representations, each of dimensionality $d_{in}$, and $d_{out}$ is the dimensionality of the output token representations.

$$\Xi(\mathbf{X};\boldsymbol{\theta})=\mathbf{D}^{-1}\exp\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{\rm key}}}\right)\mathbf{V} \tag{3}$$

where $\exp(\cdot)$ denotes element-wise exponentiation of the matrix entries, $\mathbf{W}_q,\mathbf{W}_{\rm key}\in\mathbb{R}^{d_{in}\times d_{\rm key}}$, $\mathbf{W}_v\in\mathbb{R}^{d_{in}\times d_{out}}$, $\mathbf{Q}=\mathbf{X}\mathbf{W}_q$, $\mathbf{K}=\mathbf{X}\mathbf{W}_{\rm key}$, $\mathbf{V}=\mathbf{X}\mathbf{W}_v$, and $\mathbf{D}$ is a diagonal positive definite matrix defined as

$$\mathbf{D}:=\operatorname{diag}\left(\exp\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{\rm key}}}\right)\mathbf{1}_{N\times 1}\right) \tag{4}$$

where $\mathbf{1}_{N\times 1}$ is an $N\times 1$ matrix of ones.

The parameters and operation of $\Xi$ are independent of the number of token representations $N$. Self-attention is typically applied to discrete token sequences by embedding each token in the sequence as a vector in $\mathbb{R}^{d_{in}}$ to construct the matrix of $N$ token representations $\mathbf{X}\in\mathbb{R}^{N\times d_{in}}$.
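Definition 4.1 can be written out in a few lines of NumPy. The following is a minimal sketch, not the authors' implementation: the function name `self_attention`, the toy dimensions, and the random weights are illustrative assumptions, and no causal mask is applied because Equation (3) does not include one.

```python
import numpy as np

def self_attention(X, W_q, W_key, W_v):
    """Equation (3): D^{-1} exp(Q K^T / sqrt(d_key)) V, without any attention mask."""
    Q, K, V = X @ W_q, X @ W_key, X @ W_v
    d_key = W_key.shape[1]
    scores = Q @ K.T / np.sqrt(d_key)
    # Subtracting each row's max rescales exp(.) and D by the same factor,
    # so the result is unchanged while overflow is avoided.
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    D_diag = A.sum(axis=1, keepdims=True)  # diagonal of D, Equation (4)
    return (A / D_diag) @ V

# The same parameters apply to any number of tokens N (toy sizes, assumed).
rng = np.random.default_rng(0)
d_in, d_key, d_out = 4, 3, 5
W_q = rng.normal(size=(d_in, d_key))
W_key = rng.normal(size=(d_in, d_key))
W_v = rng.normal(size=(d_in, d_out))
for N in (2, 7):
    X = rng.normal(size=(N, d_in))
    print(N, self_attention(X, W_q, W_key, W_v).shape)
```

Each output row is a convex combination of the rows of $\mathbf{V}$, which is why the later bounds are stated in terms of norms of value projections.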

We focus on the reachability of output token representations $\Xi(\mathbf{X};\boldsymbol{\theta})$, where we partition the input $\mathbf{X}\in\mathbb{R}^{(k+m)\times d_{in}}$ into a $k\times d_{in}$ block of control input representations $\mathbf{U}$ and an $m\times d_{in}$ block of imposed state representations $\mathbf{X}_0$ (cf. Definition 3.1) where $k+m=N$. Thus the complete input matrix $\mathbf{X}$ is a concatenation of the control input $\mathbf{U}$ and the imposed state $\mathbf{X}_0$.

\begin{align}
\Xi(\mathbf{X};\boldsymbol{\theta}) &= \Xi\left(\begin{bmatrix}\mathbf{U}\\ \mathbf{X}_0\end{bmatrix};\boldsymbol{\theta}\right) = \Xi([\mathbf{U};\mathbf{X}_0];\boldsymbol{\theta}) \tag{5}\\
&= \begin{bmatrix}\mathbf{U}'\\ \mathbf{Y}\end{bmatrix} = [\mathbf{U}';\mathbf{Y}] \tag{6}
\end{align}

We also partition the output $\mathbf{X}'=\Xi(\mathbf{X};\boldsymbol{\theta})\in\mathbb{R}^{(k+m)\times d_{out}}$ into a corresponding $k\times d_{out}$ matrix $\mathbf{U}'$ and an $m\times d_{out}$ matrix $\mathbf{Y}$.

We aim to characterize the reachable set of output representations $\mathbf{Y}\in\mathcal{R}_y^k(\mathbf{X}_0)$ under $m$ imposed input representations $\mathbf{X}_0$ and $k$ controllable input representations $\mathbf{U}$. Although the reachable set is now a set of continuous-valued output representation matrices in $\mathbb{R}^{m\times d_{out}}$, we can readily adapt Definitions 3.2-3.3 to define the reachable set for these conditions:

Reachability for Self Attention:

Following from the original output reachability definition (Definition 3.2), let $\mathbf{Y}^*\in\mathbb{R}^{m\times d_{out}}$ be the desired output. We consider $\mathbf{Y}^*$ reachable from initial state $\mathbf{X}_0$ if there exists some $\mathbf{U}$ that steers the output of $\Xi\big([\mathbf{U};\mathbf{X}_0];\boldsymbol{\theta}\big)$ to output $[\mathbf{U}';\mathbf{Y}]$ such that $\mathbf{Y}=\mathbf{Y}^*$.
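The partition in Equations (5)-(6) and the reachability condition can be illustrated numerically. This is a sketch under stated assumptions: the helper names `steer` and `reaches`, the tolerance `atol`, and the toy dimensions are hypothetical, and `reaches` only tests one given $\mathbf{U}$; it does not search over control inputs.

```python
import numpy as np

def self_attention(X, W_q, W_key, W_v):
    """Unmasked self-attention as in Equation (3)."""
    scores = (X @ W_q) @ (X @ W_key).T / np.sqrt(W_key.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (A / A.sum(axis=1, keepdims=True)) @ (X @ W_v)

def steer(U, X0, theta):
    """Return (U', Y) from Equations (5)-(6) for control input U and imposed state X0."""
    k = U.shape[0]
    out = self_attention(np.vstack([U, X0]), *theta)  # input is [U; X0]
    return out[:k], out[k:]                           # first k rows are U', last m rows are Y

def reaches(U, X0, theta, Y_star, atol=1e-8):
    """True if this particular U steers the imposed-state rows to Y_star."""
    _, Y = steer(U, X0, theta)
    return bool(np.allclose(Y, Y_star, atol=atol))

rng = np.random.default_rng(0)
theta = (rng.normal(size=(4, 3)), rng.normal(size=(4, 3)), rng.normal(size=(4, 5)))
U, X0 = rng.normal(size=(2, 4)), rng.normal(size=(3, 4))   # k = 2, m = 3
U_prime, Y = steer(U, X0, theta)
print(U_prime.shape, Y.shape)          # shapes (k, d_out) and (m, d_out)
print(reaches(U, X0, theta, Y))        # a target produced by U is reachable by construction
```

Establishing that a target is unreachable requires ruling out every admissible $\mathbf{U}$, which is what the analytic bound in the next subsection addresses.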

4.2 The theorem and its motivation

Our approach is to split the output $\mathbf{Y}$ into two parts, $\mathbf{Y}=\mathbf{Y}_u+\mathbf{Y}_x$, corresponding to the control input and imposed state, respectively. $\mathbf{Y}_x$ can be bounded as a function of $\mathbf{X}$, $k$, and $\boldsymbol{\theta}$. $\mathbf{Y}_u$ is the remaining component arising from $\mathbf{U}$. Each of the two parts $\mathbf{Y}_u$ and $\mathbf{Y}_x$ is split into two further components, one orthogonal to $\mathbf{Y}^*$ and one parallel to it. For instance, we denote the orthogonal part of $\mathbf{Y}_x$ by $\mathbf{Y}_{x,\perp}$. Thus we have

\begin{align*}
\mathbf{Y} &= \mathbf{Y}_u+\mathbf{Y}_x\\
&= (\mathbf{Y}_{u,\parallel}+\mathbf{Y}_{u,\perp})+(\mathbf{Y}_{x,\parallel}+\mathbf{Y}_{x,\perp})
\end{align*}

After rearranging, we have $\mathbf{Y}=(\mathbf{Y}_{u,\parallel}+\mathbf{Y}_{x,\parallel})+(\mathbf{Y}_{u,\perp}+\mathbf{Y}_{x,\perp})\in\operatorname{span}(\mathbf{Y}^*)\oplus\operatorname{span}(\mathbf{Y}^*)^{\perp}$. If the desired output is reachable, then $\mathbf{Y}_{u,\perp}+\mathbf{Y}_{x,\perp}=\mathbf{0}$ and also $\|\mathbf{Y}_{u,\perp}\|=\|\mathbf{Y}_{x,\perp}\|$ (see Appendix B.3).
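The parallel/orthogonal split can be checked on a small example. The sketch below assumes the decomposition is taken row by row, projecting row $i$ of a matrix onto row $i$ of $\mathbf{Y}^*$ (consistent with the row-wise statement of Theorem 4.2); the helper name `split_rows` and the numeric values are illustrative only.

```python
import numpy as np

def split_rows(Y, Y_star):
    """Split each row of Y into parts parallel and orthogonal to the matching row of Y_star.

    Assumes every row of Y_star is nonzero.
    """
    coeff = np.sum(Y * Y_star, axis=1, keepdims=True) / np.sum(Y_star * Y_star, axis=1, keepdims=True)
    parallel = coeff * Y_star
    return parallel, Y - parallel

# Toy target and an arbitrary imposed-state contribution Y_x.
Y_star = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
Y_x = np.array([[0.5, 1.0, -1.0], [3.0, 1.0, 2.0]])
Y_u = Y_star - Y_x  # the control contribution that would make Y = Y_star exactly

_, Y_u_perp = split_rows(Y_u, Y_star)
_, Y_x_perp = split_rows(Y_x, Y_star)
print(np.allclose(Y_u_perp + Y_x_perp, 0.0))            # orthogonal parts cancel
print(np.linalg.norm(Y_u_perp, axis=1), np.linalg.norm(Y_x_perp, axis=1))  # equal row norms
```

Because the projection is linear, the orthogonal parts of $\mathbf{Y}_u$ and $\mathbf{Y}_x$ must cancel whenever their sum equals $\mathbf{Y}^*$, which forces their norms to match.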

Figure 2: Visualization of Theorem 4.2: $\mathbf{Y}^*$ and components of $\mathbf{Y}_u,\mathbf{Y}_x$. If $\|\mathbf{Y}_{x,\perp}^{\max,i}\|$ exceeds $k\gamma$, then no prompt of length $\leq k$ can steer the self-attention to output $\mathbf{Y}^*$ given imposed $\mathbf{X}_0$ and constraints on $\|\mathbf{U}^i\|\leq M_u$.
Theorem 4.2 (Self-Attention Control Theorem, proved in Appendix B).

Consider a self-attention layer with imposed input $\mathbf{X}_0\in\mathbb{R}^{m\times d}$ and control input $\mathbf{U}\in\mathbb{R}^{k\times d}$, where $m$ is the number of imposed tokens, $k$ is the number of control tokens, and $d$ is the token embedding dimension. Let $\mathbf{Y}^*\in\mathbb{R}^{m\times d}$ be the desired output, and let $\mathbf{Y}\in\mathbb{R}^{m\times d}$ be the actual output of the self-attention layer.

Let $\mathbf{Y}_x^{\max}=\Xi(\mathbf{X}_0;\boldsymbol{\theta})$ be the output of the self-attention layer given only the imposed state $\mathbf{X}_0$. As before, we denote the $i$-th row of the component of $\mathbf{Y}_x^{\max}$ orthogonal to the desired $\mathbf{Y}^*$ as $\mathbf{Y}_{x,\perp}^{\max,i}$.

Then $\mathbf{Y}^*$ is unreachable for any control input $\mathbf{U}$ if, for any $i\in\{1,\dots,m\}$,

$$\|\mathbf{Y}_{x,\perp}^{\max,i}\|>k\gamma_i(\mathbf{X}_0,\boldsymbol{\theta}) \tag{7}$$

where

$$\gamma_i(\mathbf{X}_0,\boldsymbol{\theta}):=\frac{e^{\alpha}}{g_i}\sigma_v M_u,\quad \alpha=\sigma_q\sigma_{\rm key}M_u M_x/\sqrt{d_{\rm key}} \tag{8}$$
$$g_i(\mathbf{X}_0,\boldsymbol{\theta}):=\sum_{j=1}^{m}\exp\left((\mathbf{X}_0)^i\mathbf{W}_q\mathbf{W}_{\rm key}^{\top}(\mathbf{X}_0)^{j\top}/\sqrt{d_{\rm key}}\right), \tag{9}$$

with $\sigma_v,\sigma_q$ and $\sigma_{\rm key}$ being the maximum singular values of the value, query and key projection matrices, respectively, and with $M_u:=\max_j\|\mathbf{U}^j\|$, $M_x:=\max_j\|(\mathbf{X}_0)^j\|$ being the maximum norms of the control and imposed token embeddings, respectively.
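The quantities in Equations (7)-(9) are directly computable from $\mathbf{X}_0$, $\boldsymbol{\theta}$, $M_u$ and $k$. The following sketch evaluates the exclusion condition on a hand-built toy layer. The function name `exclusion_test` and the toy values are assumptions, the orthogonal component is taken row-wise with respect to row $i$ of $\mathbf{Y}^*$ (also an assumption), and the exponentials are evaluated naively, so very large scores could overflow.

```python
import numpy as np

def exclusion_test(Y_star, X0, W_q, W_key, W_v, M_u, k):
    """Evaluate Equation (7) row by row; returns (violated, perp_norms, k * gamma)."""
    d_key = W_key.shape[1]
    # Largest singular values (spectral norms) of the projection matrices.
    sigma_q, sigma_key, sigma_v = (np.linalg.norm(W, 2) for W in (W_q, W_key, W_v))
    M_x = np.linalg.norm(X0, axis=1).max()
    alpha = sigma_q * sigma_key * M_u * M_x / np.sqrt(d_key)

    A = np.exp((X0 @ W_q) @ (X0 @ W_key).T / np.sqrt(d_key))
    g = A.sum(axis=1)                            # Equation (9)
    gamma = np.exp(alpha) / g * sigma_v * M_u    # Equation (8)

    Y_x_max = (A / g[:, None]) @ (X0 @ W_v)      # Xi(X0; theta)
    # Assumption: "orthogonal" means orthogonal to row i of Y_star (rows must be nonzero).
    coeff = np.sum(Y_x_max * Y_star, axis=1, keepdims=True) / np.sum(Y_star**2, axis=1, keepdims=True)
    perp_norms = np.linalg.norm(Y_x_max - coeff * Y_star, axis=1)
    return perp_norms > k * gamma, perp_norms, k * gamma

# Toy layer: zero query/key weights give uniform attention, identity values.
W_q = W_key = np.zeros((2, 2))
W_v = np.eye(2)
X0 = np.eye(2)
Y_star = np.array([[1.0, 0.0], [1.0, 0.0]])
for k in (3, 20):
    violated, perp, bound = exclusion_test(Y_star, X0, W_q, W_key, W_v, M_u=0.1, k=k)
    print(k, perp, bound, violated.any())
```

In this toy case $g_i=2$, $\alpha=0$ and $\gamma_i=0.05$, while the orthogonal row norms are $0.5$. The condition therefore holds for $k=3$ ($k\gamma_i=0.15$) but not for $k=20$ ($k\gamma_i=1$), so only the shorter control budget is provably insufficient.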

Remark 4.3.

The upper bound $k\gamma_i(\mathbf{X}_0,\boldsymbol{\theta})$ scales linearly with $k$, implying that the set of $\mathbf{Y}^*$ excluded by Equation (7) becomes smaller as $k$ grows larger. Moreover, $\gamma_i$ depends on the control input only through the norm bound $M_u$; it is otherwise a function of the imposed state $\mathbf{X}_0$ and the parameters $\boldsymbol{\theta}$.

Proof Summary:

An important idea of the proof is the decomposition of the output representations $\mathbf{Y}$ into two components: $\mathbf{Y}_x$ and $\mathbf{Y}_u$. The $\mathbf{Y}_x$ component arises from the value projections of the imposed state $\mathbf{X}_0$, while $\mathbf{Y}_u$ arises from the value projections of the control input $\mathbf{U}$. Although the softmax operation in the self-attention mechanism introduces cross-terms between $\mathbf{X}_0$ and $\mathbf{U}$ in both $\mathbf{Y}_x$ and $\mathbf{Y}_u$, we can disentangle their influences by considering the auxiliary representations $\mathbf{Y}_x^{\max}$ and $\mathbf{Y}_u^{\max}$. Specifically, $\mathbf{Y}_x^{\max}=\Xi(\mathbf{X}_0;\boldsymbol{\theta})$ represents the output of the self-attention mechanism $\Xi$ when only the imposed state $\mathbf{X}_0$ is provided as input, without any control input $\mathbf{U}$.
We derive the bound in Theorem 4.2 by first deriving the bound $\beta_i\geq\|\mathbf{Y}_u^i\|$ on row $i$ of $\mathbf{Y}_u$. In Appendix B.2, we observe that, if $\|\mathbf{Y}_{x,\perp}^i\|\geq\beta_i$, it is impossible for $\mathbf{Y}_{u,\perp}^i$ to nullify the orthogonal component of $\mathbf{Y}_x$, rendering $\mathbf{Y}^*$ unreachable. A simplification of this inequality yields our bound $\|\mathbf{Y}_{x,\perp}^{\max,i}\|>k\gamma_i(\mathbf{X}_0,\boldsymbol{\theta})$.
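The split $\mathbf{Y}=\mathbf{Y}_u+\mathbf{Y}_x$ follows from Equation (3) by grouping the attention-weighted sum over keys into control tokens and imposed tokens. A minimal numerical sketch is given below; the function name `decompose`, the returned `scale` factor, and the toy weights are illustrative assumptions, not notation from the proof. Under Definition 4.1 (no mask), row $i$ of $\mathbf{Y}_x$ is a positive rescaling of row $i$ of $\mathbf{Y}_x^{\max}$ by $g_i/\mathbf{D}_{ii}$, where $\mathbf{D}_{ii}$ also includes the attention mass placed on control tokens.

```python
import numpy as np

def decompose(U, X0, W_q, W_key, W_v):
    """Return (Y, Y_u, Y_x, Y_x_max, scale) for the imposed-state rows of Xi([U; X0])."""
    k, d_key = U.shape[0], W_key.shape[1]
    X = np.vstack([U, X0])
    # Unnormalized attention from each imposed token (query) to every token (key).
    A = np.exp((X0 @ W_q) @ (X @ W_key).T / np.sqrt(d_key))
    D = A.sum(axis=1, keepdims=True)             # normalizer over all k + m keys
    g = A[:, k:].sum(axis=1, keepdims=True)      # normalizer over imposed keys only, Equation (9)
    V = X @ W_v
    Y_u = A[:, :k] @ V[:k] / D                   # contribution of control-token values
    Y_x = A[:, k:] @ V[k:] / D                   # contribution of imposed-token values
    Y_x_max = A[:, k:] @ V[k:] / g               # Xi(X0; theta): no control tokens present
    return Y_u + Y_x, Y_u, Y_x, Y_x_max, g / D

rng = np.random.default_rng(0)
W_q, W_key, W_v = (0.3 * rng.normal(size=s) for s in [(4, 3), (4, 3), (4, 5)])
U, X0 = rng.normal(size=(2, 4)), rng.normal(size=(3, 4))
Y, Y_u, Y_x, Y_x_max, scale = decompose(U, X0, W_q, W_key, W_v)
print(np.allclose(Y_x, scale * Y_x_max))   # Y_x is a row-wise rescaling of Y_x_max
print(scale.ravel())                       # each factor lies strictly between 0 and 1
```

Adding control tokens can only shrink the factor $g_i/\mathbf{D}_{ii}$, which is one way to see why a longer control input reduces the relative influence of $\mathbf{X}_0$.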

Discussion of Theorem 4.2:

The reachable set exclusion condition in Equation (7) arises when the output representation $\mathbf{Y}_x^{\max}$, which depends only on the imposed state $\mathbf{X}_0$, is too far away from the desired output $\mathbf{Y}^*$ for the control input $\mathbf{U}$ to steer the output towards $\mathbf{Y}^*$. The ability of the control input $\mathbf{U}$ to nullify the impact of $\mathbf{Y}_x^{\max}=\Xi(\mathbf{X}_0;\boldsymbol{\theta})$ scales with the number of control tokens $k$ (see hyperbolic relationship in Equation 13). A longer control input can "dominate" the influence of $\mathbf{X}_0$ by increasing the relative contribution of $\mathbf{Y}_u$ to the overall output $\mathbf{Y}$.

Furthermore, the proof reveals that the output of self-attention can be decomposed into components that depend primarily on different parts of the input (i.e., $\mathbf{X}_0$ and $\mathbf{U}$). While there are cross-terms in the attention matrix $(\mathbf{X}_0)^i\mathbf{W}_q\mathbf{W}_{\rm key}^{\top}(\mathbf{X}_0)^{j\top}$, these only introduce positive scaling factors (e.g., functions of $g_i$) to components (e.g., $\mathbf{Y}_x$, $\mathbf{Y}_u$) that are not dependent on the control input, allowing us to derive an analytic bound on the reachable output set for self-attention via $\mathbf{Y}_x^{\max}$ (see Equations 21-25,23).

The implications of Theorem 4.2 are further discussed in Section 6. See Appendix B for proofs, including Section B.3 for a more general statement of reachability conditions in terms of the parallel and orthogonal components of $\mathbf{Y}^*$ and $\mathbf{Y}$.

5 Experiments

To gain a practical, empirical understanding of the reachable set $\mathcal{R}_y^k(\mathbf{x}_0)$, we probe the existence of optimal prompts $\mathbf{u}^*$ across datasets $\mathcal{D}$ of initial state–desired output pairs $(\mathbf{x}_0,y^*)$. We scope our experiments to study immediate control (i.e., we check the LLM output after $|y^*|$ tokens are generated) where the control input $\mathbf{u}$ is prepended to the imposed state $\mathbf{x}_0$. Moreover, we focus on the case of controlling the LLM system to produce a single output token $y^*\in\mathcal{V}$ under some constraint $|\mathbf{u}|\leq k$. This “single-step” control renders the problem of gauging reachability computationally tractable and is a fundamental step toward understanding the iterated dynamics of LLM systems in terms of reachability and controllability. We leave the exploration of reachability and controllability under an extended time horizon (e.g., chain-of-thought, chatbot dynamics, tool-wielding LLMs) and under the requirement of multi-token outputs $\mathbf{y}$ to future work.

5.1 Methods

We apply prompt optimization algorithms to establish the existence of optimal prompts $\mathbf{u}^*$ of length $k$ that steer the LLM system from initial state $\mathbf{x}_0$ to output $y$ for some dataset $\mathcal{D}$ of initial state-output pairs. In general, prompt optimization algorithms accept a token sequence and a loss function on said token sequence, along with a specification of which tokens are manipulable. The output of a prompt optimizer is a manipulated token sequence (i.e., optimized prompt) designed to minimize the loss. We apply two computational methods to generate optimal prompts: greedy back-generation (Algorithm 2) and greedy coordinate gradient (GCG, introduced in [26], Algorithm 3). We found that greedy back-generation performed best for short prompts of $k\leq 3$ tokens, while GCG was the best-performing algorithm for prompts of 4 or more tokens. To our knowledge, our greedy back-generation algorithm is novel. For brevity, we place the full description of the two algorithms and our parameter values in Appendix C, as the specifics of the algorithms are not the main contribution of this work.

We focus on understanding the content and structure of the reachable set of LLM system outputs $\mathcal{R}_y^k(\mathbf{x}_0)$, particularly under a constraint on the number of input tokens $k$. To determine which output tokens are reachable under varying input sequence lengths, we apply an incremental prompt lengthening procedure when searching for optimal prompts on some dataset $\mathcal{D}$.

Algorithm 1 Back-off Prompt
Require: State-output token sequence $(\mathbf{x}_0,y)$; LLM system $\Sigma=(P_{LM},\mathcal{V})$.
for $k=1$ to $3$ do
    $\mathbf{u}_k=\text{Greedy Back Generate}(\mathbf{x}_0,y;\Sigma)$
    if $\mathbf{u}_k$ steers $\Sigma$ from $\mathbf{x}_0\to y$ then
        return $\mathbf{u}_k$
    end if
end for
for $k\in\{4,6,8,10\}$ do
    $\mathbf{u}_k=\text{Greedy Coordinate Gradient}(\mathbf{x}_0,y;\Sigma)$
    if $\mathbf{u}_k$ steers $\Sigma$ from $\mathbf{x}_0\to y$ then
        return $\mathbf{u}_k$
    end if
end for
return Failed to establish reachability.
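The control flow of Algorithm 1 can be sketched in Python with the two optimizers and the success check passed in as callables. Everything named here is hypothetical: `back_off_prompt`, `back_generate`, `gcg`, and `steers` are placeholder names, and passing the prompt length `k` to each optimizer is an assumption, since the listing above does not show how the length is supplied. The real optimizers are described in Appendix C and are not reproduced.

```python
from typing import Callable, Optional, Sequence

Prompt = Sequence[int]                             # token ids
Optimizer = Callable[[Prompt, int, int], Prompt]   # (x0, y, k) -> candidate prompt of length k
Checker = Callable[[Prompt, Prompt, int], bool]    # (u, x0, y) -> does u steer x0 to y?

def back_off_prompt(
    x0: Prompt,
    y: int,
    back_generate: Optimizer,
    gcg: Optimizer,
    steers: Checker,
    short_lengths: Sequence[int] = (1, 2, 3),
    long_lengths: Sequence[int] = (4, 6, 8, 10),
) -> Optional[Prompt]:
    """Try increasingly long prompts; return the first that steers x0 to y, else None."""
    for optimizer, lengths in ((back_generate, short_lengths), (gcg, long_lengths)):
        for k in lengths:
            u = optimizer(x0, y, k)
            if steers(u, x0, y):
                return u
    return None  # failed to establish reachability within the length budget

# Stub usage: pretend any prompt with at least 6 tokens succeeds.
stub = lambda x0, y, k: [0] * k
found = back_off_prompt([11, 12], 7, stub, stub, lambda u, x0, y: len(u) >= 6)
print(None if found is None else len(found))
```

Returning the first successful prompt means the reported length is the shortest one tried that worked, not necessarily the shortest that exists; a `None` result shows only that these optimizers failed, not that the target is unreachable.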

5.2 Results

Our results revolve around the reachable set $\mathcal{R}_y^k(\mathbf{x}_0)$ for state sequences sampled from the Wikitext dataset. Results were computed for a panel of models, including Falcon-7b, Falcon-40b, and Llama-7b. Falcon-7b results are showcased in this section while additional plots and results for Falcon-40b and Llama-7b can be found in Section D. We applied the same Back-off Prompt strategy (Algorithm 1) to determine $k$-$\epsilon$ controllability for all experiments, varying the specifics of the dataset $\mathcal{D}$ for each experiment.

“Ground truth” reachability:

We established the reachability of the “ground truth” next token $y$ following state token sequence $\mathbf{x}_0$ in Wikitext. In our tests on a dataset of 5000 state-output sequences with states of length 8 to 32 tokens, we found that the true next token $y$ is reachable over 97% of the time across all models with a prompt of length $k\leq 10$ (Figure 3). Plots and supplementary figures for Falcon-40b and Llama-7b controllability w.r.t. ground truth Wikitext outputs can be found in Section D.1.

Figure 3: Top Left: $k$-$\epsilon$ values on initial state $\mathbf{x}_0$ and target output token $y^*$ from Wikitext. 97.16% of the instances were solved with a prompt of length $k \leq 10$.
Top Right: $k$-$\epsilon$ values reaching the top 75 most likely outputs $y^*$ for each $\mathbf{x}_0$ from Wikitext. The top 75 targets were reachable at least 89.39% of the time with a prompt of length $k \leq 10$.
Bottom Left: Prior likelihood rank of target token $y^*$ versus required prompt length to elicit $y^*$. Target tokens were sampled uniformly from the least to most likely token given $\mathbf{x}_0$ sampled from Wikitext.

Top-75 reachability:

To explore the reachable set $\mathcal{R}_y^k(\mathbf{x}_0)$ beyond the ground truth of Wikitext outputs, we generated a synthetic dataset of outputs by sampling 25 Wikitext sequences $\mathbf{x}_0$ and selecting the top 75 most likely next-tokens according to the model itself $P_{LM}(y \mid \mathbf{x}_0)$ as the target tokens (Figure 3). We found that the top 75 output tokens were reachable over 85% of the time for all models with control sequence length $k = 10$. Supplementary figures including results for Llama-7b and Falcon-40b on $k$-$\epsilon$ controllability with respect to the top 75 most likely output tokens can be found in Section D.2.
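
Selecting the top 75 targets for a given $\mathbf{x}_0$ amounts to sorting the model's next-token scores. A minimal sketch, assuming the scores for one state are available as a 1-D NumPy array (the function name and the array-based interface are illustrative, not taken from the paper's code):

```python
import numpy as np


def top_n_targets(logits: np.ndarray, n: int = 75) -> np.ndarray:
    """Return the ids of the n highest-scoring next tokens, most likely first."""
    # Sorting the negated scores ascending orders tokens from most to least likely;
    # a stable sort keeps the lower token id first when scores tie.
    order = np.argsort(-logits, kind="stable")
    return order[:n]
```

Since softmax is monotone, ranking by logits and ranking by $P_{LM}(y \mid \mathbf{x}_0)$ give the same order.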

Uniformly sampled target outputs:

To maximally push the bounds of the reachable set within our single output token scope, we created another synthetic dataset where the target output token $y^*$ was sampled uniformly from the highest likelihood next token to the lowest likelihood token. Although the overall $k$-$\epsilon$ score was relatively poor (only 46.43% reachable with $k = 10$ for Falcon-7b), we were intrigued by the near-uniform relationship between prior token rank (based on $P_{LM}(y \mid \mathbf{x}_0)$) and the required number of prompt tokens. Figure 3 plots the relationship between prior target token rank based on $P(y^* \mid \mathbf{x}_0)$ and the required prompt length $k$ to elicit the target token. While over half were unreachable, the remaining reachable tokens appear uniformly distributed in terms of required prompt length, regardless of rank. Supplementary figures analyzing the $k$-$\epsilon$ controllability of Falcon-7b with respect to uniformly sampled target outputs $y$ can be found in Section D.3.
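
The prior rank of a target and the uniform sampling of targets across ranks can be sketched as below. The helper names are hypothetical, and rank 0 is taken to mean the most likely token, which is a convention chosen here rather than one stated in the paper.

```python
import numpy as np


def prior_rank(logits: np.ndarray, token_id: int) -> int:
    """Rank of token_id under the prior scores; 0 is the most likely token."""
    # Count how many tokens score strictly higher than the target.
    return int(np.sum(logits > logits[token_id]))


def sample_uniform_target(logits: np.ndarray, rng: np.random.Generator):
    """Draw a target token uniformly over ranks, from most to least likely."""
    order = np.argsort(-logits, kind="stable")
    rank = int(rng.integers(0, len(order)))  # upper bound is exclusive
    return int(order[rank]), rank
```

Sampling uniformly over ranks is the same as sampling uniformly over the vocabulary; keeping the rank alongside the token makes it straightforward to plot rank against required prompt length afterwards. When scores tie, `prior_rank` assigns the tied tokens the same rank, so it can differ from the position in `order`.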

6 Discussion

We proposed a control theoretic framework for understanding language model prompting, orienting our investigation around the reachable set of outputs $\mathcal{R}_y^k(\mathbf{x}_0)$. We proved a bound on the reachable set of outputs for self-attention in terms of the singular values of its weight matrices, and we established fundamental results on the reachability of “correct” next tokens (according to Wikitext). We expanded the scope of this investigation by probing the reachability of tokens assigned high likelihood by the LLM itself (top 75 most likely next tokens), and tokens assigned minimal likelihood by the LLM itself (randomly sampled target tokens).

The Self-Attention Control Theorem (Theorem 4.2) provides a sufficient condition for the unreachability of a desired output $\mathbf{Y}^*$ in terms of the projection of a single row of $\mathbf{Y}^{\mathrm{max}}_x = \Xi(\mathbf{X}_0; \boldsymbol{\theta})$ onto the orthogonal complement of $\mathbf{Y}^*$. If the orthogonal component of $\mathbf{Y}^{\mathrm{max}}_x$ exceeds $k\gamma$, then no prompt of length $\leq k$ can steer the self-attention head to output $\mathbf{Y}^*$ under the input constraints. The threshold $k\gamma_i(\mathbf{X}_0, \boldsymbol{\theta})$ depends on the imposed input $\mathbf{X}_0$, the number of control tokens $k$, and the maximum singular values of the query, key, and value weight matrices, $\boldsymbol{\theta} = (\mathbf{W}_k, \mathbf{W}_q, \mathbf{W}_v)$. Intuitively, this result suggests that if the output $\mathbf{Y} = \mathbf{Y}_x + \mathbf{Y}_u$ has component $\mathbf{Y}_x$ too large and misaligned with $\mathbf{Y}^*$, then no control input with $k$ or fewer tokens can yield a component $\mathbf{Y}_u$ that corrects the misalignment, even if control inputs $\mathbf{U}$ yield maximal influence on the output under the $k$-token limit (Figure 2).
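
The geometric check described above can be sketched numerically for a single output row. This is an illustration under assumptions, not a restatement of Theorem 4.2: the Euclidean norm is assumed, the function names are hypothetical, and $\gamma$ is passed in as a precomputed number because its expression in terms of $\mathbf{X}_0$ and $\boldsymbol{\theta}$ is given in Section 4 rather than here.

```python
import numpy as np


def orthogonal_component(y_max_row, y_star_row) -> np.ndarray:
    """Part of y_max_row that is orthogonal to the desired row y_star_row."""
    y_max = np.asarray(y_max_row, dtype=float)
    y_star = np.asarray(y_star_row, dtype=float)
    denom = float(y_star @ y_star)
    if denom == 0.0:
        raise ValueError("the desired row must be nonzero")
    # Subtract the projection of y_max onto the direction of y_star.
    return y_max - (float(y_max @ y_star) / denom) * y_star


def certifies_unreachable(y_max_row, y_star_row, k: int, gamma: float) -> bool:
    """True if the orthogonal component exceeds k * gamma (sufficient condition only)."""
    residual = np.linalg.norm(orthogonal_component(y_max_row, y_star_row))
    return bool(residual > k * gamma)
```

A `False` result says nothing about reachability, since the condition is sufficient for unreachability but not necessary.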

Bounding the reachable set for self-attention is deeply related to the mechanism by which consistent representations are formed for multi-token generation. Steering a language model to generate a desired token sequence requires that the control input induce a token representation in the right-most token such that the next token prediction logits $P(\mathbf{y} \mid \mathbf{u} + \mathbf{x}_0)$ achieve a desired value. Moreover, generated tokens are fed back into the model, and their representations must be steered as well to control iterated generation. Self-attention is the primary mechanism by which the token representations exchange information, making the reachable set of output representations across multiple tokens in $\mathbf{X}_0$ for self-attention a fundamental part of LLM control theory. The Self-Attention Control Theorem provides a step towards understanding the limitations and possibilities of controlling the self-attention layer, and by extension, the language model as a whole.

Our empirical results suggest that there is far more to the reachability of a given output than just prior likelihood or the prior rank the LLM assigns to a given token. Although prompt optimization-based $k$-$\epsilon$ controllability experiments are only able to provide a lower bound on the content of the reachable set, the ability to frequently steer even the least likely token into being the most likely token with just a few input tokens is intriguing (Figure 3, bottom right). This result indicates the importance of further investigating the reachability and controllability of LLMs, particularly for developing capable and reliable LLM systems.

Our investigations provide an entry into the understanding of LLM controllability via prompts. However, a comprehensive understanding necessitates extending our exploration into diverse regimes. Exploring the controllability with longer prompts and longer questions (base token sequences) will be pivotal. Equally important is the study of diverse models to verify the generality of our findings. The direct comparison of controllability scores of different model families is challenging since each model family uses a different tokenizer. The Llama family tokenizer, for instance, has a vocabulary of 30,000 tokens whereas the Falcon family has a vocabulary of 65,536 tokens. Further work is required to robustly compare controllability across models.

An intriguing observation from our study is the log-linear relationship between prompt length $k$ and controllability fraction $\epsilon$ (see Figure 4 in Appendix D). While this is compelling within our studied domain, it raises the essential question: is this relationship robust outside our current explorative scope? Unearthing universal scaling laws in LLM controllability would not only inform practical control applications but also open the door for theoretical insight into the nature of LLM behavior.

The progress we have made, both in understanding the bounds on self-attention controllability and the empirical measures of $k$-$\epsilon$ LLM controllability, underscores the potential of this control theoretic framing for studying LLMs. Below is a non-exhaustive list of open problems in LLM control, all stemming from the framing in Section A:

  • Control Properties of Chain-of-Thought: Chain-of-Thought is a powerful technique where LLMs are allowed to generate intermediate tokens (i.e., “thoughts”) between a question and an answer [39]. The control properties (e.g., stability, reachability) of systems leveraging these techniques are of great interest for understanding and composing systems of LLMs in the real world.

  • Distributional Control: To what extent can we control the output distribution of a language model $P_{LM}(\mathbf{y} \mid \mathbf{x}_0 + \mathbf{u})$ to a desired distribution $P^*(\mathbf{y})$?

  • Computational Cost of Control: What are the performance characteristics of LLM control regularized by computational cost?

  • Learnability of Control: To what extent can LLMs learn to control each other? Work such as [24] showed that LLMs are capable of human-level prompt engineering, but it is unclear how well an LLM can learn to control another when explicitly optimized on the objective of LLM control.

  • Controllable Subspaces: In the control of linear dynamical systems, it is known that uncontrollable systems are often coordinate transformable into a representation where a subset of the coordinates are controllable and a subset are uncontrollable [47]. We have shown that controllable and uncontrollable components naturally emerge for self-attention heads in section 4 – can this be generalized to transformer blocks with nonlinearities and residual streams?

  • Composable LLM Systems: One of the greatest boons of control theory is the ability to compose control modules and subsystems into an interpretable, predictable, and effective whole [48]. The composition of LLM systems (potentially with non-LLM control modules) is an exciting avenue for scaling super intelligent systems.

Practically, our findings lay the groundwork for more effective and efficient prompt engineering. The ability to control even the least likely tokens illuminates untapped capabilities within LLMs, hinting at a potentially broader spectrum of application than previously recognized. Such insights could lead to the development of more nuanced and sophisticated LLM systems, capable of handling tasks with greater precision and adaptability.

Impact statement

This paper introduces foundational work aimed at enhancing our understanding and control of generative language models (LLMs) as they become integral to crucial societal functions. The increasing integration of generative AI into critical infrastructures — such as healthcare data analysis, insurance and financial data processing, and emergency response systems — underscores the urgency for a sophisticated control theory. Drawing on the principles of control theory, which have historically ensured the dependability of machines in life-or-death scenarios (e.g., in cruise control and aircraft navigation systems), our goal is to extend these guarantees to LLM-based applications. By doing so, we aim to make these advanced AI systems as trustworthy and robust as their electro-mechanical counterparts, thereby securing their role in supporting and safeguarding society.

Code availability

All code used to produce the experimental results is provided with the submission.

References

  • [1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 1877–1901, Curran Associates, Inc., 2020.
  • [2] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” 2022.
  • [3] T. Hagendorff, “Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods,” 2023.
  • [4] D. Noever and F. McKee, “Numeracy from literacy: Data science as an emergent skill from large language models,” 2023.
  • [5] OpenAI, “Gpt-4 technical report,” 2023.
  • [6] OpenAI, Nov 2022.
  • [7] L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi, and Z. Tu, “Document-level machine translation with large language models,” arXiv preprint arXiv:2304.02210, 2023.
  • [8] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code llama: Open foundation models for code,” 2023.
  • [9] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022.
  • [10] H. Zhang, H. Song, S. Li, M. Zhou, and D. Song, “A survey of controllable text generation using transformer-based pre-trained language models,” CoRR, vol. abs/2201.05337, 2022.
  • [11] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, “Sparks of artificial general intelligence: Early experiments with gpt-4,” 2023.
  • [12] T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah, “Towards monosemanticity: Decomposing language models with dictionary learning,” Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
  • [13] H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability beyond attention visualization,” 2021.
  • [14] A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso, “Towards automated circuit discovery for mechanistic interpretability,” 2023.
  • [15] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” 2023.
  • [16] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, E. Goffinet, D. Heslow, J. Launay, Q. Malartic, B. Noune, B. Pannier, and G. Penedo, “Falcon-40B: an open large language model with state-of-the-art performance,” HuggingFace, 2023.
  • [17] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” 2016.
  • [18] W. L. Taylor, “‘Cloze procedure’: A new tool for measuring readability,” Journalism Quarterly, vol. 30, no. 4, pp. 415–433, 1953.
  • [19] F. Petroni, T. Rocktäschel, P. S. H. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel, “Language models as knowledge bases?,” CoRR, vol. abs/1909.01066, 2019.
  • [20] J. Weston, A. Bordes, S. Chopra, and T. Mikolov, “Towards ai-complete question answering: A set of prerequisite toy tasks,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings (Y. Bengio and Y. LeCun, eds.), 2016.
  • [21] Z. Wang, Q. Xie, Z. Ding, Y. Feng, and R. Xia, “Is chatgpt a good sentiment analyzer? a preliminary study,” arXiv preprint arXiv:2304.04339, 2023.
  • [22] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can we know what language models know?,” 2020.
  • [23] Y. Wen, N. Jain, J. Kirchenbauer, M. Goldblum, J. Geiping, and T. Goldstein, “Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery,” 2023.
  • [24] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba, “Large language models are human-level prompt engineers,” 2023.
  • [25] L. Reynolds and K. McDonell, “Prompt programming for large language models: Beyond the few-shot paradigm,” 2021.
  • [26] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” 2023.
  • [27] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” 2020.
  • [28] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” 2015.
  • [29] C. Guo, A. Sablayrolles, H. Jégou, and D. Kiela, “Gradient-based adversarial attacks against text transformers,” 2021.
  • [30] W. Shi, X. Han, H. Gonen, A. Holtzman, Y. Tsvetkov, and L. Zettlemoyer, “Toward human readable prompt tuning: Kubrick’s the shining is a good movie, and a good prompt too?,” 2022.
  • [31] M. Deng, J. Wang, C.-P. Hsieh, Y. Wang, H. Guo, T. Shu, M. Song, E. P. Xing, and Z. Hu, “Rlprompt: Optimizing discrete text prompts with reinforcement learning,” 2022.
  • [32] Y. Hao, Z. Chi, L. Dong, and F. Wei, “Optimizing prompts for text-to-image generation,” 2023.
  • [33] T. Zhang, X. Wang, D. Zhou, D. Schuurmans, and J. E. Gonzalez, “Tempera: Test-time prompting via reinforcement learning,” 2022.
  • [34] D.-K. Kim, S. Sohn, L. Logeswaran, D. Shim, and H. Lee, “Multiprompter: Cooperative prompt optimization with multi-agent reinforcement learning,” 2023.
  • [35] S. Soatto, P. Tabuada, P. Chaudhari, and T. Y. Liu, “Taming ai bots: Controllability of neural states in large language models,” 2023.
  • [36] T.-M. Yi, Y. Huang, M. I. Simon, and J. Doyle, “Robust perfect adaptation in bacterial chemotaxis through integral feedback control,” Proceedings of the National Academy of Sciences, vol. 97, no. 9, pp. 4649–4653, 2000.
  • [37] S. Aniţa, V. Arnăutu, and V. Capasso, An introduction to optimal control problems in life sciences and economics: From mathematical models to numerical simulation with MATLAB®, vol. 2. Springer, 2011.
  • [38] S. Roy, Y. Wan, and A. Saberi, “A network control theory approach to virus spread mitigation,” in 2009 IEEE Conference on Technologies for Homeland Security, pp. 599–606, 2009.
  • [39] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2023.
  • [40] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
  • [41] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive apis,” 2023.
  • [42] K. Sivaramakrishnan, V. Sivaramakrishnan, and M. M. K. Oishi, “Stochastic reachability of discrete-time stochastic systems via probability measures,” 2023.
  • [43] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [44] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth, “Recent advances in natural language processing via large pre-trained language models: A survey,” ACM Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023.
  • [45] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2016.
  • [46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [47] E. D. Sontag, Mathematical control theory: deterministic finite dimensional systems, vol. 6. Springer Science & Business Media, 2013.
  • [48] F.-L. Lian, J. Moyne, and D. Tilbury, “Network design consideration for distributed control systems,” IEEE transactions on control systems technology, vol. 10, no. 2, pp. 297–307, 2002.
  • [49] R. E. Kalman, P. L. Falb, and M. A. Arbib, Topics in mathematical system theory, vol. 33. McGraw-Hill New York, 1969.
  • [50] K. Ogata, Modern control engineering fifth edition. Prentice Hall, 2010.

Appendix A Abstract Systems and Control Theory Background

This section aims to provide an overview of fundamental control-theoretic concepts from an abstract, set-theoretic perspective. We primarily draw from canonical textbooks [47, 49], and [50].

Diverse definitions of “system” or “machine” exist in the literature, all representing the same core concept but varying in mathematical details. We offer the following high-level definition based on [47] Chapter 2:

Definition A.1 (System).

A “system” or “machine” $\Sigma = (\mathcal{T}, \mathcal{X}, \mathcal{U}, \phi)$ consists of:

  • $\mathcal{T}$: The time set along which system state evolves.

  • $\mathcal{X}$: The state space.

  • $\mathcal{U}$: The input space.

  • $\phi: \mathcal{X} \times \mathcal{U} \times \mathcal{T}^2 \to \mathcal{X}$: The transition map.

A system may also be equipped with an output space and readout map $(\mathcal{Y}, h)$:

  • $\mathcal{Y}$: The output space.

  • $h: \mathcal{X} \times \mathcal{U} \times \mathcal{T} \to \mathcal{Y}$: The readout map.

In other words, at time $t \in \mathcal{T}$, the system’s state takes on values $x \in \mathcal{X}$, and the control input takes values $u \in \mathcal{U}$. The system evolves over time with the transition map $\phi(x, u, t, t')$ that returns the new state value $x' \in \mathcal{X}$ at time $t' > t$. A system can also have a readout map $h(x, u, t)$ that produces the output value $y \in \mathcal{Y}$ given the current time, state, and input value. An input $u \in \mathcal{U}$ defined over interval $[t, t']$ may be said to steer the system $\Sigma = (\mathcal{T}, \mathcal{X}, \mathcal{U}, \phi)$ from state $x_0$ to state $x'$ if $x' = \phi(x_0, u, t, t')$. A wide variety of systems are expressible within this framework. E.g., we obtain discrete-time dynamical systems for $\mathcal{T} = \mathbb{Z}^+$. Continuous-time dynamical systems emerge for $\mathcal{T} = \mathbb{R}^+$. We apply Definition A.1 to formulate LLM systems in Definition 3.1.
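
To make Definition A.1 concrete, the following sketch encodes a small discrete-time system in Python. It is a toy illustration with several simplifying assumptions that are not part of the definition: states, inputs, and outputs are integers, the system is time-invariant so $\phi$ depends only on the input sequence, and the readout depends on the state alone. The class name and the modular counter example are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class DiscreteSystem:
    """Time-invariant discrete-time system with integer states and inputs."""

    step: Callable[[int, int], int]  # one-step transition: (state, input) -> state
    readout: Callable[[int], int]    # simplified readout h: state -> output

    def phi(self, x: int, u: Sequence[int]) -> int:
        """Transition map: apply the input sequence u, one symbol per time step."""
        for u_t in u:
            x = self.step(x, u_t)
        return x


# Toy example: a counter modulo 5 whose output is the parity of the state.
counter = DiscreteSystem(step=lambda x, u: (x + u) % 5, readout=lambda x: x % 2)
final_state = counter.phi(0, [1, 2, 3])  # the input [1, 2, 3] steers state 0 to final_state
```

In this encoding, the length of `u` plays the role of the elapsed time $t' - t$, and "steering" from $x_0$ to $x'$ is the statement `counter.phi(x0, u) == x_prime`.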

Note that we assume that the system $\Sigma$ is time-invariant; its dynamics $\phi$ do not change as a function of time. This assumption is widely applicable and is often made in the literature [49, 50, 47] to simplify definitions and discussions of systems.

Reachability is a core control theory concept and is central to defining controllability. At their core, definitions of reachability revolve around the existence of control inputs $u \in \mathcal{U}$ that steer the system from a starting state $x_0 \in \mathcal{X}$ to some desired state(s). Following from Chapters 1-2 of [49] and Chapter 2 of [47], we define state reachability as:

Definition A.2 (State Reachability).

State $x \in \mathcal{X}$ is reachable from initial state $x_0 \in \mathcal{X}$ for system $\Sigma = (\mathcal{T}, \mathcal{X}, \mathcal{U}, \phi)$ iff there exists some time $T$ and control input $u^* \in \mathcal{U}$ such that $u^*$ steers the system from state $x_0$ to state $x$ at time $T$.

We may use this definition of state reachability to define the reachable state set for some initial state $x_0 \in \mathcal{X}$:

Definition A.3 (Reachable State Set).

The reachable state set from initial state $x_0 \in \mathcal{X}$ for system $\Sigma = (\mathcal{T}, \mathcal{X}, \mathcal{U}, \phi)$ is denoted $\mathcal{R}(x_0) \subseteq \mathcal{X}$ and consists of all reachable states $x \in \mathcal{X}$ from initial state $x_0$ (cf. Definition A.2).
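
For a finite, time-invariant, discrete-time system, the reachable state set of Definition A.3 can be enumerated by breadth-first search over one-step transitions. The sketch below is an illustration under those assumptions; the function name is hypothetical, and $x_0$ is counted as reachable from itself via the empty input, which is a convention adopted here.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Set


def reachable_states(
    x0: Hashable,
    inputs: Iterable[Hashable],
    step: Callable[[Hashable, Hashable], Hashable],
) -> Set[Hashable]:
    """Enumerate all states reachable from x0 by some finite input sequence."""
    inputs = list(inputs)
    seen = {x0}
    frontier = deque([x0])
    while frontier:
        x = frontier.popleft()
        for u in inputs:
            nxt = step(x, u)
            if nxt not in seen:  # first visit: record and expand later
                seen.add(nxt)
                frontier.append(nxt)
    return seen


# Toy example: a counter modulo 6.
def mod6_step(x: int, u: int) -> int:
    return (x + u) % 6
```

The search terminates only when the set of reachable states is finite. For LLM systems the state space of token sequences grows without bound, which is one reason the experiments in Section 5 probe the reachable set by optimization rather than enumeration.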

For systems with readout maps $h$, notions of output reachability arise naturally. Note that state reachability is neither necessary nor sufficient to guarantee output reachability.

Definition A.4 (Output Reachability).

Output $y \in \mathcal{Y}$ is reachable from initial state $x_0 \in \mathcal{X}$ for system $\Sigma = (\mathcal{T}, \mathcal{X}, \mathcal{U}, \phi, \mathcal{Y}, h)$ iff there exists some time $T$ and control input $u^* \in \mathcal{U}$ such that $u^*$ steers the system from state $x_0$ to output $y$ in time $T$.

Definition A.5 (Reachable Output Set).

The reachable output set from initial state $x_0 \in \mathcal{X}$ for system $\Sigma = (\mathcal{T}, \mathcal{X}, \mathcal{U}, \phi, \mathcal{Y}, h)$ is denoted $\mathcal{R}_y(x_0)$ and consists of all reachable outputs $y \in \mathcal{Y}$ from initial state $x_0$ (cf. Definition A.4).

A system is controllable when the reachable set extends to the entire state space. Practically speaking, this implies that one can steer the system from any initial state to any desired state. We develop the reachable set for LLM systems in Definition 3.3 and LLM reachability in Definition 3.2.

Definition A.6 (State Controllability).

System $\Sigma = (\mathcal{T}, \mathcal{X}, \mathcal{U}, \phi)$ is state controllable iff, for every initial state $x_0 \in \mathcal{X}$, the reachable set $\mathcal{R}(x_0) = \mathcal{X}$.

Definition A.7 (Output Controllability).

System $\Sigma=(\mathcal{T,X,U},\phi,\mathcal{Y},h)$ is output controllable iff, for every initial state $x_0\in\mathcal{X}$, the reachable output set $\mathcal{R}_y(x_0)=\mathcal{Y}$.
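As an illustration of Definitions A.5 to A.7, the following minimal sketch enumerates reachable sets by brute force for a toy finite system. The state space (integers mod 5), the two transition maps `add` and `mul`, the readout `parity`, and the step budget `max_steps` are all hypothetical choices made for this example; the sketch also assumes that $T=0$ is allowed, so the initial state counts as reachable.

```python
def reachable_states(x0, inputs, phi, max_steps):
    """States reachable from x0 using at most max_steps control inputs (T >= 0)."""
    seen = {x0}
    frontier = {x0}
    for _ in range(max_steps):
        # Apply every available control input to every newly found state.
        frontier = {phi(x, u) for x in frontier for u in inputs} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


def reachable_outputs(x0, inputs, phi, h, max_steps):
    """Reachable output set R_y(x0): readout h applied to every reachable state."""
    return {h(x) for x in reachable_states(x0, inputs, phi, max_steps)}


def is_state_controllable(states, inputs, phi, max_steps):
    """True iff the reachable set equals the whole state space for every x0."""
    return all(reachable_states(x0, inputs, phi, max_steps) == set(states)
               for x0 in states)


def is_output_controllable(states, inputs, outputs, phi, h, max_steps):
    """True iff the reachable output set equals the output space for every x0."""
    return all(reachable_outputs(x0, inputs, phi, h, max_steps) == set(outputs)
               for x0 in states)


# Hypothetical toy systems on the integers mod 5.
STATES = range(5)
add = lambda x, u: (x + u) % 5   # additive dynamics
mul = lambda x, u: (x * u) % 5   # multiplicative dynamics (0 is absorbing)
parity = lambda x: x % 2         # readout map h

print(is_state_controllable(STATES, [1], add, max_steps=4))
print(is_state_controllable(STATES, [1, 2], mul, max_steps=10))
print(is_output_controllable(STATES, [1, 2], [0, 1], mul, parity, max_steps=10))
```

In the multiplicative toy system the state 0 is absorbing, so neither form of controllability holds, even though every nonzero initial state reaches both output values.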

A range of fruitful questions stem from these definitions: if there is a cost associated with control inputs $u\in\mathcal{U}$ (e.g., power constraints, length constraints), what is the minimum cost of control? What is the minimum time required to get from the initial state to the desired final state or output? If the system is not completely controllable, under what conditions is it controllable? Under which readout maps is a system output controllable? We develop controllability for LLMs abstractly in Definition 3.4 and in an empirically/statistically testable fashion in Definition 3.5.

Appendix B Theory on Self-Attention Controllability

Note: Key terms for the proof are introduced in Section 4 surrounding Theorem 4.2. Specifically, the definition of self-attention mechanism $\Xi$, the control problem setup, and the reachable set $\mathcal{R}_y^k(\mathbf{X}_0)$ are required background for this proof.

Notation: For each token representation matrix $\mathbf{Q,K,V}\in\mathbb{R}^{(k+m)\times\cdot}$, we denote the first $k$ rows corresponding to $\mathbf{U}$ using $u$ as a subscript, like $\mathbf{Q}_u$. The remaining $m$ rows corresponding to $\mathbf{X}_0$ are denoted with subscript $x$, like $\mathbf{Q}_x$.

B.1 Proof of Theorem 4.2

Let $\mathbf{A}$ be the exponentiated query-key outer product matrix with the following block structure:

$$\mathbf{A}=\exp\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{\rm key}}}\right)=\exp\left(\begin{bmatrix}\mathbf{Q}_u\mathbf{K}_u^{\top}&\mathbf{Q}_u\mathbf{K}_x^{\top}\\ \mathbf{Q}_x\mathbf{K}_u^{\top}&\mathbf{Q}_x\mathbf{K}_x^{\top}\end{bmatrix}\frac{1}{\sqrt{d_{\rm key}}}\right)=\begin{bmatrix}\mathbf{A}_{uu}&\mathbf{A}_{ux}\\ \mathbf{A}_{xu}&\mathbf{A}_{xx}\end{bmatrix} \tag{10}$$

where $\mathbf{Q}_u=\mathbf{U}\mathbf{W}_q$, $\mathbf{K}_x=\mathbf{X}_0\mathbf{W}_{\rm key}$, and similarly for $\mathbf{K}_u,\mathbf{Q}_x$. We apply a similar quadrant decomposition to $\mathbf{D}$, defined initially in Equation 4.

$$\mathbf{D}=\text{diag}\left(\exp\left(\frac{\mathbf{QK}^{\top}}{\sqrt{d_{\rm key}}}\right)\mathbf{1}_{N\times 1}\right)=\begin{bmatrix}\mathbf{D}_u&\mathbf{0}\\ \mathbf{0}&\mathbf{D}_x\end{bmatrix} \tag{11}$$

where the quadrant demarcations in $\mathbf{D}$ follow from Equation 10.

We may now express the self-attention mechanism output representations $\mathbf{Y}$ as

$$\mathbf{Y}=\underbrace{\mathbf{D}_x^{-1}\mathbf{A}_{xu}\mathbf{V}_u}_{\mathbf{Y}_u}+\underbrace{\mathbf{D}_x^{-1}\mathbf{A}_{xx}\mathbf{V}_x}_{\mathbf{Y}_x} \tag{12}$$
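The block decomposition in Equations 10 to 12 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: the toy sizes, the random weights (scaled by 0.3 only to keep the exponentials moderate), and the absence of any causal mask are choices made here. It builds $\mathbf{Y}_u$ and $\mathbf{Y}_x$ from the blocks of $\mathbf{A}$ and $\mathbf{D}$ and compares their sum with the last $m$ rows of ordinary softmax attention over the stacked input.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, d, d_key = 3, 4, 8, 8          # hypothetical toy sizes

U = rng.normal(size=(k, d))          # control input tokens
X0 = rng.normal(size=(m, d))         # imposed state tokens
W_q = 0.3 * rng.normal(size=(d, d_key))
W_key = 0.3 * rng.normal(size=(d, d_key))
W_v = 0.3 * rng.normal(size=(d, d))

Z = np.vstack([U, X0])               # first k rows: U, remaining m rows: X0
Q, K, V = Z @ W_q, Z @ W_key, Z @ W_v

A = np.exp(Q @ K.T / np.sqrt(d_key)) # Equation 10 (elementwise exponential)
D = A.sum(axis=1)                    # diagonal of D in Equation 11

# Quadrants needed for the rows corresponding to X0.
A_xu, A_xx = A[k:, :k], A[k:, k:]
D_x = D[k:]
V_u, V_x = V[:k], V[k:]

# Equation 12: contribution of the control input and of the imposed state.
Y_u = (A_xu @ V_u) / D_x[:, None]
Y_x = (A_xx @ V_x) / D_x[:, None]

# Reference: row-wise softmax attention, restricted to the X0 rows.
Y_ref = ((A / D[:, None]) @ V)[k:]
print(np.allclose(Y_u + Y_x, Y_ref))
```

Dividing by `D_x[:, None]` applies $\mathbf{D}_x^{-1}$ row by row, which avoids forming the diagonal matrix explicitly.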
Lemma B.1.

For any control input $\mathbf{U}$ whose rows satisfy $\|\mathbf{U}^j\|\leq M_u$ for all $j\in\{1,\dots,k\}$, the norm of the $i$-th row of $\mathbf{Y}_u$ is bounded as follows

$$\|\mathbf{Y}_u^i\|\leq\beta_i(\mathbf{X}_0,k)$$

where

$$\beta_i(\mathbf{X}_0,k):=\frac{ke^{\alpha}}{g_i(\mathbf{X}_0,\boldsymbol{\theta})+ke^{\alpha}}\sigma_vM_u, \tag{13}$$

and

$$\alpha=\sigma_q\sigma_{\rm key}M_uM_x/\sqrt{d_{\rm key}},\qquad g_i(\mathbf{X}_0,\boldsymbol{\theta}):=\mathbf{D}_{xx}^i=\sum_{j=1}^{m}\exp\left((\mathbf{X}_0)^i\mathbf{W}_q\mathbf{W}_{\rm key}^{\top}(\mathbf{X}_0)^{j\top}/\sqrt{d_{\rm key}}\right).$$
Proof.

Our objective is to establish an upper bound on $\|\mathbf{Y}_u^i\|$, the Euclidean norm of the $i$-th row of the matrix $\mathbf{Y}_u$, which represents the contribution of the control input to the output of the self-attention layer. $g_i=\mathbf{D}_{xx}^i$ represents the component of the row-wise softmax denominator $\mathbf{D}_x$ from $\mathbf{A}_{xx}$ (solely a function of $\mathbf{X}_0$). Similarly, $\mathbf{D}_{xu}$ represents the component of $\mathbf{D}_x$ from $\mathbf{A}_{xu}$, and $\mathbf{D}_x=\mathbf{D}_{xx}+\mathbf{D}_{xu}$. We observe that $\mathbf{D}_{xu}^i$ is the sum of the entries in the $i$-th row of $\mathbf{A}_{xu}$:

$$\mathbf{D}_{xu}^i=\sum_{j=1}^{k}(\mathbf{A}_{xu})_{ij}=\sum_{j=1}^{k}\exp(\langle\mathbf{Q}_x^i,\mathbf{K}_u^j\rangle/\sqrt{d_{\rm key}}), \tag{14}$$

where $\mathbf{Q}_x^i$ and $\mathbf{K}_u^j$ denote the $i$-th row of $\mathbf{Q}_x$ and the $j$-th row of $\mathbf{K}_u$, respectively. Every entry of the diagonal matrix $\mathbf{D}_{xu}$ is strictly positive.

Recall that $\mathbf{D}_x=\mathbf{D}_{xx}+\mathbf{D}_{xu}$ and $\mathbf{V}_u=\mathbf{U}\mathbf{W}_v$. We begin by expressing the $i$-th row of $\mathbf{Y}_u$ as:

$$\mathbf{Y}_u^i=(\mathbf{D}_{xx}^i+\mathbf{D}_{xu}^i)^{-1}(\mathbf{A}_{xu}^i\mathbf{V}_u^i), \tag{15}$$

where $\mathbf{D}_{xx}^i$ and $\mathbf{D}_{xu}^i$ denote the $i$-th diagonal entries of the matrices $\mathbf{D}_{xx}$ and $\mathbf{D}_{xu}$, respectively, $\mathbf{A}_{xu}^i$ represents the $i$-th row of the matrix $\mathbf{A}_{xu}$, and $\mathbf{V}_u^i$ corresponds to the $i$-th row of the matrix $\mathbf{V}_u$.

Let $\alpha_{ij}:=\langle\mathbf{Q}_x^i,\mathbf{K}_u^j\rangle/\sqrt{d_{\rm key}}$ denote the scaled key-query dot products, so that $(\mathbf{A}_{xu})_{ij}=\exp(\alpha_{ij})$ by Equation 14. Then $\alpha_{ij}\leq\alpha$ for all $i,j$, where $\alpha$ is defined to be an upper bound on the scaled key-query dot products between vectors in $\mathbf{U}$ and $\mathbf{X}$ given by

$$\alpha=\sigma_q\sigma_{\rm key}M_uM_x/\sqrt{d_{\rm key}}. \tag{16}$$

Recall that $\sigma_q,\sigma_{\rm key}$ are the maximal singular values of $\mathbf{W}_q,\mathbf{W}_{\rm key}$ respectively.

By applying the Cauchy-Schwarz inequality and using the definitions of $\alpha$, $M_u$, and $M_x$, we obtain the bound $(\mathbf{A}_{xu})_{ij}\leq e^{\alpha}$ for all $i,j$, and thus:

$$\mathbf{D}_{xu}^i=\mathbf{A}^i_{xu}\mathbf{1}\leq\sum_{j=1}^{k}e^{\alpha}=ke^{\alpha}, \tag{17}$$

where $\mathbf{1}$ is a constant vector consisting of all entries equal to 1.

Next, we note that $\|\mathbf{V}_u^i\|\leq C$, where $C:=\sigma_vM_u$, and $\sigma_v$ denotes the maximum singular value of the value projection matrix $\mathbf{W}_v$. This follows directly from the definition of $\mathbf{V}_u$. It allows us, while bounding $\|\mathbf{Y}_u^i\|$, to replace $\mathbf{V}_u^i$ with a constant vector $\mathbf{C}$ whose entries are all equal to $C$, yielding an upper bound on $\|\mathbf{A}_{xu}^i\mathbf{V}_u^i\|$:

$$\|\mathbf{A}_{xu}^i\mathbf{V}_u^i\|\leq\|\mathbf{A}_{xu}^i\mathbf{C}\|=C\sum_{j=1}^{k}(\mathbf{A}_{xu})_{ij}=C\mathbf{D}_{xu}^i. \tag{18}$$

We now rewrite the norm $\|\mathbf{Y}_u^i\|$. Toward that end, let $g_i(\cdot,\boldsymbol{\theta}):\mathbb{R}^{m\times d}\to[0,\infty)$ denote the function of $\mathbf{X}_0$ defined by $g_i(\mathbf{X}_0,\boldsymbol{\theta}):=\mathbf{D}_{xx}^i$.

\begin{align}
\|\mathbf{Y}_u^i\| &=\frac{\|\mathbf{A}_{xu}^i\mathbf{V}_u^i\|}{\mathbf{D}_{xx}^i+\mathbf{D}_{xu}^i} \tag{19}\\
&\leq\frac{\|\mathbf{A}_{xu}^i\mathbf{C}\|}{\mathbf{D}_{xx}^i+\mathbf{D}_{xu}^i}=\frac{\langle\mathbf{A}_{xu}^i,\mathbf{C}\rangle}{\mathbf{D}_{xx}^i+\mathbf{A}_{xu}^i\mathbf{1}} \tag{20}\\
&\leq\frac{Cke^{\alpha}}{g_i+ke^{\alpha}} \tag{21}
\end{align}

The final line follows from (17) and the observation that the function $f(x):=x/(x+g_i)$, where $g_i>0$, is monotone increasing.
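The monotonicity claim can be verified in one line. The following worked step is an added illustration that uses only the quantities already defined above: differentiating $f$ shows it is increasing on $x\geq 0$, so the upper bound (17) on $\mathbf{D}_{xu}^i$ can be substituted into the ratio from (18) and (20).

```latex
f'(x) = \frac{(x+g_i) - x}{(x+g_i)^2} = \frac{g_i}{(x+g_i)^2} > 0
\quad \text{for } x \ge 0,\ g_i > 0,
\qquad\text{hence}\qquad
\frac{C\,\mathbf{D}_{xu}^i}{g_i + \mathbf{D}_{xu}^i}
= C f(\mathbf{D}_{xu}^i)
\le C f(ke^{\alpha})
= \frac{Cke^{\alpha}}{g_i + ke^{\alpha}} .
```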

Let

$$\beta_i(\mathbf{X}_0,k):=\frac{ke^{\alpha}}{g_i(\mathbf{X}_0,\boldsymbol{\theta})+ke^{\alpha}}\sigma_vM_u. \tag{22}$$

We have established that $\|\mathbf{Y}_u^i\|\leq\beta_i(\mathbf{X}_0,k)$ for any control input $\mathbf{U}$ whose rows satisfy $\|\mathbf{U}^j\|\leq M_u$ for all $j\in\{1,\dots,k\}$. The same bound holds for $\|\mathbf{Y}_{u,\perp}^i\|$, the norm of the projection of $\mathbf{Y}_u^i$ onto the orthogonal complement of $\mathbf{Y}^*$. ∎
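The bound of Lemma B.1 can be sanity-checked numerically. The sketch below is an illustration under assumptions, not the authors' experimental code: the toy dimensions, the random weights, the choice $M_u=1$, and the reading of $M_x$ as the largest row norm of $\mathbf{X}_0$ are all assumptions made here. It samples control inputs whose rows have norm exactly $M_u$ and records the largest observed $\|\mathbf{Y}_u^i\|$ for each row $i$, which should never exceed $\beta_i$.

```python
import numpy as np


def spectral_norm(W):
    """Largest singular value of a 2-D array."""
    return np.linalg.norm(W, 2)


def lemma_b1_bound(X0, W_q, W_key, W_v, k, M_u):
    """beta_i(X0, k) from Equation 22, one value per row of X0."""
    d_key = W_q.shape[1]
    M_x = np.linalg.norm(X0, axis=1).max()   # assumed: max row norm of X0
    alpha = spectral_norm(W_q) * spectral_norm(W_key) * M_u * M_x / np.sqrt(d_key)
    # g_i: row sums of the exponentiated state-state block A_xx.
    g = np.exp(X0 @ W_q @ W_key.T @ X0.T / np.sqrt(d_key)).sum(axis=1)
    return k * np.exp(alpha) / (g + k * np.exp(alpha)) * spectral_norm(W_v) * M_u


def control_contribution(U, X0, W_q, W_key, W_v):
    """Y_u = D_x^{-1} A_xu V_u, the control-input term of Equation 12."""
    d_key = W_q.shape[1]
    Q_x = X0 @ W_q
    A_xu = np.exp(Q_x @ (U @ W_key).T / np.sqrt(d_key))
    A_xx = np.exp(Q_x @ (X0 @ W_key).T / np.sqrt(d_key))
    D_x = A_xu.sum(axis=1) + A_xx.sum(axis=1)
    return (A_xu @ (U @ W_v)) / D_x[:, None]


rng = np.random.default_rng(0)
k, m, d = 3, 5, 6                            # hypothetical toy sizes
M_u = 1.0
X0 = rng.normal(size=(m, d))
W_q, W_key, W_v = (0.3 * rng.normal(size=(d, d)) for _ in range(3))

beta = lemma_b1_bound(X0, W_q, W_key, W_v, k, M_u)

worst = np.zeros(m)
for _ in range(200):
    U = rng.normal(size=(k, d))
    U *= M_u / np.linalg.norm(U, axis=1, keepdims=True)  # each row has norm M_u
    row_norms = np.linalg.norm(control_contribution(U, X0, W_q, W_key, W_v), axis=1)
    worst = np.maximum(worst, row_norms)

print(np.all(worst <= beta))
```

Random sampling only probes the bound; it does not show how tight $\beta_i$ is, since the worst-case $\mathbf{U}$ need not appear among the samples.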

B.2 Simplified reachability hypothesis

We can restate the hypothesis of our self-attention theorem, Theorem 4.2,

$$\|\mathbf{Y}^{\operatorname{max}}_{x,\perp}\|>\frac{ke^{\alpha}}{g_i}\sigma_vM_u \tag{23}$$

as equivalent to

$$\|\mathbf{Y}^{\operatorname{min},i}_{x,\perp}\|>\beta_i=\frac{ke^{\alpha}}{ke^{\alpha}+g_i}\sigma_vM_u. \tag{24}$$

Since

$$\mathbf{Y}^{\operatorname{min},i}_{x}=\frac{g_i}{g_i+ke^{\alpha}}\mathbf{Y}^{\operatorname{max},i}_{x}, \tag{25}$$

and $g_i>0$ and $ke^{\alpha}$ are positive scalars, we can cancel the factor of $(ke^{\alpha}+g_i)^{-1}$ on both sides, and then divide both sides by $g_i$, to obtain the equivalent hypothesis (23) from hypothesis (24).
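Written out step by step, the cancellation reads as follows. This is an added worked derivation; it assumes that the scalar relation (25) carries over to the orthogonal projections, which holds because projection is linear, and it writes the row index $i$ explicitly on $\mathbf{Y}^{\operatorname{max}}_{x,\perp}$.

```latex
\begin{align*}
\|\mathbf{Y}^{\operatorname{min},i}_{x,\perp}\|
  = \frac{g_i}{g_i + ke^{\alpha}}\,\|\mathbf{Y}^{\operatorname{max},i}_{x,\perp}\|
  &> \frac{ke^{\alpha}}{ke^{\alpha} + g_i}\,\sigma_v M_u \\
\Longleftrightarrow\quad
g_i\,\|\mathbf{Y}^{\operatorname{max},i}_{x,\perp}\|
  &> ke^{\alpha}\,\sigma_v M_u
  && \text{(multiply by } ke^{\alpha} + g_i > 0\text{)} \\
\Longleftrightarrow\quad
\|\mathbf{Y}^{\operatorname{max},i}_{x,\perp}\|
  &> \frac{ke^{\alpha}}{g_i}\,\sigma_v M_u
  && \text{(divide by } g_i > 0\text{)}
\end{align*}
```

Both steps multiply or divide by strictly positive scalars, so the direction of the inequality is preserved and the two hypotheses are equivalent.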

B.3 More general theorem

Theorem B.2 (Self-Attention Control Theorem 2).

Consider a self-attention layer with input $\mathbf{X} \in \mathbb{R}^{m \times d}$ and control input $\mathbf{U} \in \mathbb{R}^{k \times d}$, where $m$ is the number of imposed tokens, $k$ is the number of control tokens, and $d$ is the token embedding dimension. Let $\mathbf{Y}^{*} \in \mathbb{R}^{m \times d}$ be the desired output, and let $\mathbf{Y} \in \mathbb{R}^{m \times d}$ be the actual output of the self-attention layer. As before, define $\mathbf{Y}_{\perp} = \mathbf{Y}_{x,\perp} + \mathbf{Y}_{u,\perp} \in \mathbb{R}^{m \times d}$ as the projection of the output onto the orthogonal complement of $\mathbf{Y}^{*}$.

If either:

(A) $\|\mathbf{Y}\| = \|\mathbf{Y}^{*}\|$ and there exists a component $\mathbf{Y}_{\perp}^{ij} \neq 0$ of the matrix $\mathbf{Y}_{\perp}$, or

(B) $\|\mathbf{Y}\| \neq \|\mathbf{Y}^{*}\|$,

then

$$\mathbf{Y} \neq \mathbf{Y}^{*} \tag{26}$$

for any control input $\mathbf{U} \in \mathbb{R}^{k \times d}$ such that $\max_{j} \|\mathbf{U}^{j}\| \leq M_u$.

This theorem is also illustrated in Figure 2 and is more general than Theorem 4.2: the hypothesis of Theorem 4.2 implies that some row satisfies $\|\mathbf{Y}^{\min,i}_{x,\perp}\| > \|\mathbf{Y}^{\max,i}_{u,\perp}\|$, so there must exist some nonzero entry $\mathbf{Y}^{ij}_{\perp}$ of the matrix $\mathbf{Y}_{\perp}$ in the case that $\|\mathbf{Y}\| = \|\mathbf{Y}^{*}\|$.

The Self-Attention Control Theorem (Theorem 4.2) provides valuable insights despite being less general than Theorem B.2. An advantage of Theorem 4.2 is its more specific hypothesis, Equation (7), which provides a concrete criterion for determining whether the desired output can be achieved by the self-attention layer. This criterion depends on the properties of the input tokens, the control tokens, and the learned parameters of the self-attention layer, such as the maximum singular values of the query and key projection matrices.

Proof of Theorem B.2.

We will prove the theorem by contradiction. Assume that $\mathbf{Y} = \mathbf{Y}^{*}$ for some control input $\mathbf{U}$ satisfying $\max_{j} \|\mathbf{U}^{j}\| \leq M_u$.

Case (A): If $\|\mathbf{Y}\| = \|\mathbf{Y}^{*}\|$ and there exists a component $\mathbf{Y}_{\perp}^{ij} \neq 0$ of the matrix $\mathbf{Y}_{\perp}$, then $\mathbf{Y}_{\perp} \neq \mathbf{0}$. This implies that $\mathbf{Y}$ and $\mathbf{Y}^{*}$ are not parallel, and therefore $\mathbf{Y} \neq \mathbf{Y}^{*}$, contradicting the assumption.

Case (B): If $\|\mathbf{Y}\| \neq \|\mathbf{Y}^{*}\|$, then $\mathbf{Y} \neq \mathbf{Y}^{*}$ directly, again contradicting the assumption.

In both cases, we have a contradiction, so the assumption that $\mathbf{Y} = \mathbf{Y}^{*}$ must be false, and we can conclude that $\mathbf{Y} \neq \mathbf{Y}^{*}$ for any control input $\mathbf{U}$ satisfying $\max_{j} \|\mathbf{U}^{j}\| \leq M_u$.

To show that Theorem 4.2 is a special case of the more general theorem, consider the hypothesis of Theorem 4.2: from (7) we conclude that some row satisfies $\|\mathbf{Y}^{\min,i}_{x,\perp}\| > \|\mathbf{Y}^{\max,i}_{u,\perp}\|$. This implies that $\mathbf{Y}_{x,\perp}^{ij} \neq -\mathbf{Y}_{u,\perp}^{ij}$ for some entry $(i,j)$, and therefore $\mathbf{Y}_{\perp}^{ij} = \mathbf{Y}_{x,\perp}^{ij} + \mathbf{Y}_{u,\perp}^{ij} \neq 0$. This satisfies the condition of case (A) in the more general theorem, assuming $\|\mathbf{Y}\| = \|\mathbf{Y}^{*}\|$. Thus, the hypothesis of Theorem 4.2 is a special case of the hypothesis in the more general theorem. ∎
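The two cases of Theorem B.2 can be checked numerically for a given pair of matrices. The following is a minimal sketch, assuming that the projection onto the orthogonal complement of $\mathbf{Y}^{*}$ is taken row by row and that $\|\cdot\|$ is the Frobenius norm; neither choice is fixed by this appendix, and the function names are hypothetical.

```python
import numpy as np

def orthogonal_component(Y, Y_star):
    """Row-wise part of Y orthogonal to the matching row of Y_star.

    Assumes no row of Y_star is all zeros.
    """
    unit = Y_star / np.linalg.norm(Y_star, axis=1, keepdims=True)
    parallel = np.sum(Y * unit, axis=1, keepdims=True) * unit
    return Y - parallel

def rules_out_equality(Y, Y_star, tol=1e-9):
    """True if case (A) or case (B) of Theorem B.2 holds for this pair."""
    norms_equal = abs(np.linalg.norm(Y) - np.linalg.norm(Y_star)) <= tol
    Y_perp = orthogonal_component(Y, Y_star)
    case_a = norms_equal and bool(np.any(np.abs(Y_perp) > tol))
    case_b = not norms_equal
    return case_a or case_b
```

For example, with target rows `[1, 0]` and `[0, 2]`, an output with rows `[0, 1]` and `[2, 0]` has the same Frobenius norm but a nonzero orthogonal component, so case (A) applies.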

By incorporating a specific bound involving $\gamma_i$ into its hypothesis, Theorem 4.2 offers a more practical and actionable result, allowing researchers and practitioners to assess the controllability of a self-attention layer based on measurable quantities, without the need to exhaustively search the space of possible control inputs. Moreover, the bound opens up opportunities for further analysis and optimization, potentially guiding the design of control strategies that satisfy the bound and ensure that the desired output can be reached. Additionally, the bound can be used to derive insights into the relationship between the properties of the input tokens, the control tokens, and the achievable control over the self-attention layer’s output. Thus, while Theorem B.2 provides a more general result, Theorem 4.2 complements it with a criterion that is more practical for control of self-attention layers in applications.

B.4 Discussion

As seen in equation (13), $\beta_i(\mathbf{X}_0, k)$ exhibits a hyperbolic dependence on $k e^{\alpha}$. This suggests that a sufficiently large number of control tokens can “dominate” the output of the self-attention, overwhelming the influence of the imposed state sequence $\mathbf{X}_0$. The theorem’s reachability condition depends on the number of control tokens $k$ through the threshold $\gamma_i(\mathbf{X}_0, \boldsymbol{\theta})$. As discussed in Remark 4.3, the threshold scales linearly with $k$, suggesting that increasing the number of control tokens can potentially enhance controllability. However, this effect is modulated by the other terms in the threshold, such as $\alpha$ and $g_i(\mathbf{X}_0, \boldsymbol{\theta})$, which depend on the properties of the imposed tokens and the model parameters. Specifically, the ratio $k e^{\alpha} / (k e^{\alpha} + g_i)$ appearing in $\beta_i$ saturates to 1 as $k \to \infty$ or as $\alpha$ becomes very large.

The term $g_i(\mathbf{X}_0, \boldsymbol{\theta})$ captures the influence of the imposed tokens on the attention weights and appears in the denominator of the threshold $\gamma_i(\mathbf{X}_0, \boldsymbol{\theta})$. Larger values of $g_i(\mathbf{X}_0, \boldsymbol{\theta})$ lead to a lower threshold, which makes the unreachability condition of Theorem 4.2 easier to satisfy and may thus shrink the potentially reachable set. This is consistent with the observation above that a higher threshold, obtained by increasing $k$, can enhance controllability.
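The dependence on $k$ and $g_i$ described above can be tabulated directly. The sketch below uses $\beta_i$ exactly as written in Equation (24) and assumes the threshold has the form $\gamma_i = k e^{\alpha} \sigma_v M_u / g_i$, which is inferred from (24) and (25) rather than quoted from the main text; all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def beta(k, alpha, g, sigma_v, M_u):
    """beta_i as in Eq. (24): k e^alpha / (k e^alpha + g_i) * sigma_v * M_u."""
    control_mass = k * np.exp(alpha)
    return control_mass / (control_mass + g) * sigma_v * M_u

def gamma(k, alpha, g, sigma_v, M_u):
    """Assumed threshold form: k e^alpha * sigma_v * M_u / g_i."""
    return k * np.exp(alpha) * sigma_v * M_u / g

# Hypothetical parameter values, for illustration only.
alpha, g, sigma_v, M_u = 0.5, 20.0, 1.0, 1.0
ks = np.array([1, 10, 100, 1000])
betas = beta(ks, alpha, g, sigma_v, M_u)
gammas = gamma(ks, alpha, g, sigma_v, M_u)
for k, b, c in zip(ks, betas, gammas):
    print(f"k={int(k):5d}  beta={b:.4f}  gamma={c:.4f}")
```

Under these assumptions, `betas` increases with `k` but stays below `sigma_v * M_u`, while `gammas` grows in proportion to `k` and shrinks as `g` grows.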

The hypothesis of Theorem 4.2 implies that some row of the projection of the minimum possible output $\mathbf{Y}_{x}^{\min}$ onto the orthogonal complement of $\mathbf{Y}^{*}$ exceeds the corresponding row of the maximum possible projection of $\mathbf{Y}_{u}$. This ensures the existence of a non-zero component in $\mathbf{Y}_{\perp}$ and precludes reachability. Thus, Theorem 4.2 provides a more specific, practically applicable criterion for assessing controllability than Theorem B.2.

Theorem 4.2’s reachability condition depends on the maximum singular values of the query, key, and value projection matrices ($\mathbf{W}_q, \mathbf{W}_{\rm key}, \mathbf{W}_v$). The $\alpha$ term in Theorem 4.2, which involves the maximum singular values of the query and key projection matrices, provides an upper bound on the scaled dot products that is only tight in the special case of maximal alignment between the query and key matrices. In general, the actual size of the threshold $\gamma_i$ will be smaller depending on $g_i$ and the alignment of queries from $\mathbf{X}_0$ with the keys from $\mathbf{U}$ and $\mathbf{X}_0$. The distribution of the singular values will also heavily impact the tightness of the bound: if all singular values are the same (e.g., $\mathbf{W}_q$ and $\mathbf{W}_{\rm key}$ are each orthogonal matrices, up to scaling), the bound will be tight. If there are a few very large singular values and many small ones, the bound is loose. Therefore, the reachability condition in the theorem can be overly optimistic when used as a test for reachability, though it remains a sufficient condition for unreachability.
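The effect of the singular value distribution on tightness can be illustrated with random matrices. The sketch below is not the paper's experiment: it assumes a column-vector convention (queries $\mathbf{W}_q x$, keys $\mathbf{W}_{\rm key} u$), omits the $1/\sqrt{d}$ scaling, and uses hypothetical sizes and spectra. For each random unit query input $x$ it computes the largest dot product any unit key input $u$ could achieve and compares it with the spectral-norm bound $\sigma_{\max}(\mathbf{W}_q)\,\sigma_{\max}(\mathbf{W}_{\rm key})$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical embedding dimension

def random_orthogonal(n):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def attainable_fraction(W_q, W_key, n_samples=1000):
    """Mean best-case |<W_q x, W_key u>| over random unit x, divided by the bound."""
    bound = np.linalg.norm(W_q, 2) * np.linalg.norm(W_key, 2)  # spectral norms
    x = rng.standard_normal((n_samples, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    # For fixed x, the maximizing unit u is parallel to W_key^T W_q x,
    # so the best score equals the norm of that vector.
    best = np.linalg.norm(x @ W_q.T @ W_key, axis=1)
    return best.mean() / bound

# Case 1: all singular values equal (orthogonal matrices).
frac_flat = attainable_fraction(random_orthogonal(d), random_orthogonal(d))

# Case 2: one large singular value and many small ones.
s = np.full(d, 0.1)
s[0] = 10.0

def skewed():
    return random_orthogonal(d) @ np.diag(s) @ random_orthogonal(d).T

frac_skewed = attainable_fraction(skewed(), skewed())
print(f"flat spectrum: {frac_flat:.3f}   skewed spectrum: {frac_skewed:.3f}")
```

With a flat spectrum every query input can attain the bound, whereas with a skewed spectrum a typical query input falls well short of it, which is the sense in which the bound is loose.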

Theorem 4.2 and Theorem B.2 focus exclusively on the self-attention mechanism and do not directly address the impact of activation functions and other non-linearities present in the full transformer architecture on the controllability of the final model outputs. In a typical transformer block, the output of the self-attention layer is added to the residual stream and then passed, usually after layer normalization, through a feed-forward sublayer with a non-linear activation function such as ReLU or GELU before proceeding to the next layer. These non-linearities can affect the propagation of signals through the network and, consequently, the controllability of the end-to-end model.

Analyzing controllability in the presence of multiple layers with interleaved non-linearities is an open problem. Investigating it through the lens of, for instance, non-linear control theory could guide the design of transformer models with enhanced steerability and interpretability, deepen our understanding of the complex dynamics of large language models, and yield principled approaches to controlling their behavior. However, significant research is still needed to realize this goal.

Appendix C Prompt Optimization Algorithms

Greedy Back-Generation:

While testing all prompts in $\mathcal{V}^{k}$ is intractable for $k > 1$, it takes only $|\mathcal{V}|$ forward passes of the network to compute the loss on $y$ induced by all possible single-token prompts $u \in \mathcal{V}$. Our Greedy Back-Generation algorithm leverages this fact to generate prompts $u \in \mathcal{V}^{k}$ one token at a time, working backward: at step $i$, it selects the greedy-optimal single-token extension $u' = \arg\max_{u'} P_{LM}(y \mid u' + u + x)$ of the current prompt $u \in \mathcal{V}^{i-1}$.

Algorithm 2 Greedy Token-Wise Prompt Generation
Require: A causal LLM $P_{LM}$ with vocabulary $\mathcal{V}$, a set of base tokens $x \in \mathcal{V}^{n}$, a desired final token $y \in \mathcal{V}$, and a desired number of prompt tokens $k$.
Ensure: Magic words $u^{*}$ of length $k$.
1:  Initialize $u^{*}$ to be empty.
2:  for $i = 1$ to $k$ do
3:     for all $u' \in \mathcal{V}$ do
4:        compute $P_{LM}(y \mid u' + u^{*} + x)$
5:     end for
6:     Select the $u'$ that maximizes the probability of $y$ given $u' + u^{*} + x$. Prepend $u'$ to $u^{*}$.
7:  end for
8:  return $u^{*}$

This method is optimal for $k = 1$ prompt token $u^{*} \in \mathcal{V}$ and generally outperforms GCG for short prompts of length $k \leq 3$. Computing 1 additional prompt token takes roughly 1-4 minutes when using an NVIDIA A100-80GB GPU with a 7 billion parameter model and 5-20 minutes on 2 NVIDIA A100-80GB GPUs with a 40 billion parameter model.
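The loop in Algorithm 2 can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used for the experiments: `toy_lm_prob` is a hypothetical stand-in for $P_{LM}(y \mid \cdot)$ built from random weights so that the example runs without a real model, and the vocabulary size is a toy value. In practice the inner loop over the vocabulary would be batched on the GPU.

```python
import numpy as np

VOCAB_SIZE = 50  # hypothetical toy vocabulary
_rng = np.random.default_rng(0)
_EMB = _rng.standard_normal((VOCAB_SIZE, 8))
_OUT = _rng.standard_normal((8, VOCAB_SIZE))

def toy_lm_prob(tokens, y):
    """Hypothetical stand-in for P_LM(y | tokens): position-weighted embeddings."""
    weights = 0.5 ** np.arange(len(tokens))[::-1]  # later tokens weigh more
    h = np.tanh(weights @ _EMB[np.array(tokens)])
    logits = h @ _OUT
    p = np.exp(logits - logits.max())
    return p[y] / p.sum()

def greedy_back_generation(prob_fn, x, y, k, vocab_size):
    """Build a k-token prompt by repeatedly prepending the best single token."""
    u_star = []                                   # step 1: empty prompt
    for _ in range(k):                            # step 2
        scores = [prob_fn([u] + u_star + list(x), y)
                  for u in range(vocab_size)]     # steps 3-5: try every token
        best = int(np.argmax(scores))             # step 6: greedy choice
        u_star = [best] + u_star                  # prepend to the prompt
    return u_star

x, y = [3, 14, 15], 9                             # hypothetical state and target
u1 = greedy_back_generation(toy_lm_prob, x, y, 1, VOCAB_SIZE)
u3 = greedy_back_generation(toy_lm_prob, x, y, 3, VOCAB_SIZE)
```

For `k = 1` the result coincides with exhaustive search over the vocabulary, which is the sense in which the method is optimal for a single prompt token; for larger `k` the choice is only greedy.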

Greedy Coordinate Gradient (GCG):

The Greedy Coordinate Gradient algorithm, presented by [26] building off the work of [27], is the state-of-the-art method for optimizing prompts. Starting with a random prompt of length $k$, the algorithm generates a batch of alternative prompts. Each member of the batch swaps a random token in the current prompt with a promising alternate token. The value metric for a swap is given by a first-order approximation of the change in loss $\mathcal{L} = \text{CELoss}(y, P_{LM}(y \mid u + x))$ with respect to the embedding of each token in $u$.

Algorithm 3 Greedy Coordinate Gradient
Require: A causal LLM $P_{LM}$ that accepts token strings from a vocabulary $\mathcal{X}$, an embedding dictionary $\mathbf{e}$, embeddings $\mathbf{e}^{*}_{i}$ corresponding to each token $i$ of $u^{*}$, a set of base tokens $x_{1:n}$, a desired number of prompt tokens $k$, iterations $T$, $k_{sub}$, and batch size $B$.
Ensure: Magic words $u^{*}$ of length $k$.
1:  Initialize $u^{*}$ to be random tokens from vocabulary.
2:  for $iteration = 1$ to $T$ do
3:     for $i = 1$ to $k$ do
4:        $\mathcal{X}_{i} =$ Top-$k_{sub}$ $(\mathbf{e}^{\top} \nabla_{\mathbf{e}^{*}_{i}} P_{LM}(x_{n} \mid u^{*} + x_{1:n-1}))$
5:     end for
6:     for $b = 1$ to $B$ do
7:        $i =$ randint($[1, \dots, k]$)
8:        $j =$ randint($[1, \dots, k_{sub}]$)
9:        $\tilde{u}^{*}_{b} = u^{*}$ with $\tilde{u}^{*}_{b}[i] = \mathcal{X}_{i}[j]$
10:     end for
11:     $u^{*} = \tilde{u}^{*}_{b^{*}}$, where $b^{*} = \operatorname{argmax}_{b}\, P_{LM}(x_{n} \mid \tilde{u}^{*}_{b} + x_{1:n-1})$
12:  end for
13:  return $u^{*}$

This method outperforms all other methods we tested for prompts of length $k > 3$. We use a batch size $B = 768$, sampled from the top $k_{sub} = 128$ token replacements at each index, and iterate for $T = 34$ iterations. For each instance, this optimization took roughly 2 minutes for the 7 billion parameter models on a single A100-80GB GPU and 4-8 minutes for the 40 billion parameter model on 4 A100-80GB GPUs.
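The structure of Algorithm 3 can be sketched on a toy model whose gradient is available in closed form. Everything model-related below is a hypothetical stand-in: the "LLM" is a linear mix of token embeddings with random per-position matrices, the sizes `V`, `D`, and `MAX_LEN` and the defaults for `T`, `k_sub`, and `B` are toy values far smaller than those reported above, and the gradient is taken of the log-probability rather than the probability, which points in the same direction and so selects the same candidates.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, MAX_LEN = 40, 8, 16                       # hypothetical toy sizes
E = rng.standard_normal((V, D))                 # embedding dictionary e
POS = rng.standard_normal((MAX_LEN, D, D)) / np.sqrt(D)  # per-position mixing
W_OUT = rng.standard_normal((D, V))             # output projection

def log_prob_and_grads(tokens, y):
    """Toy log P(y | tokens) and its gradient w.r.t. each token embedding."""
    tokens = np.asarray(tokens)
    n = len(tokens)
    h = np.einsum("ldk,lk->d", POS[:n], E[tokens])
    logits = h @ W_OUT
    logits = logits - logits.max()
    log_p = logits - np.log(np.exp(logits).sum())
    grad_h = W_OUT[:, y] - W_OUT @ np.exp(log_p)      # d log p(y) / d h
    grads = np.einsum("ldk,d->lk", POS[:n], grad_h)   # d log p(y) / d e_l
    return log_p[y], grads

def gcg(x, y, k, T=20, k_sub=8, B=32):
    u = rng.integers(V, size=k)                       # step 1: random prompt
    for _ in range(T):                                # step 2
        _, grads = log_prob_and_grads(np.concatenate([u, x]), y)
        # steps 3-5: top-k_sub replacements per position by first-order gain
        cand = np.argsort(-(grads[:k] @ E.T), axis=1)[:, :k_sub]
        batch = np.tile(u, (B, 1))                    # each candidate copies u
        pos = rng.integers(k, size=B)                 # step 7: random position
        sub = rng.integers(k_sub, size=B)             # step 8: random candidate
        batch[np.arange(B), pos] = cand[pos, sub]     # step 9: single-token swap
        scores = [log_prob_and_grads(np.concatenate([b, x]), y)[0]
                  for b in batch]
        u = batch[int(np.argmax(scores))]             # step 11: keep best swap
    return u

x, y = np.array([3, 7, 1]), 5                         # hypothetical state, target
u_star = gcg(x, y, k=4)
```

As in the algorithm above, the current prompt is always replaced by the best member of the batch, so the objective is not guaranteed to improve at every iteration; keeping the current prompt as an extra candidate would be one way to make it monotone, at the cost of departing from the stated procedure.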

Appendix D Supplementary Figures: Optimal Control Prompts

D.1 “Ground Truth” Controllability Results

This subsection includes supplementary figures for the controllability of Llama-7b, Falcon-7b, and Falcon-40b “ground truth” target outputs from Wikitext. For each initial state sequence $\mathbf{x}_0$, the target output $y$ is the token immediately following $\mathbf{x}_0$ in Wikitext. We measured the $k$-$\epsilon$ controllability of each of the 7 billion parameter models with a dataset of 5000 state-output pairs, while we used a dataset of 500 state-output pairs for Falcon-40b.

Figure 4 shows each model’s log-spaced $k$-$\epsilon$ curves on the Wikitext dataset, revealing a log-linear relationship between maximum prompt length $k$ and the fraction of uncontrollable initial state-target output pairs $(\mathbf{x}_0, y)$. Figure 5 visualizes the relationship between required prompt length and the prior cross-entropy loss of each LLM on predicting the target output $y$ given the state sequence $\mathbf{x}_0$ (i.e., $-\log P_{LM}(y \mid \mathbf{x}_0)$); we find it difficult to predict the required prompt length from the base loss.
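A $k$-$\epsilon$ curve of this kind can be computed from per-instance results. The sketch below assumes that $\epsilon$ at a given $k$ is the fraction of state-output pairs for which no steering prompt of length at most $k$ was found, and that each instance is summarized by the shortest prompt length that worked; the input list is hypothetical, not data from the experiments.

```python
import numpy as np

def k_epsilon_curve(required_lengths, k_max):
    """Fraction of instances still unsolved at each prompt length 0..k_max.

    required_lengths[i] is the shortest prompt length found for instance i
    (0 if the target is already the top prediction, None if none was found).
    """
    req = np.array([np.inf if r is None else r for r in required_lengths],
                   dtype=float)
    ks = np.arange(0, k_max + 1)
    eps = np.array([(req > k).mean() for k in ks])
    return ks, eps

# Hypothetical results for five instances.
ks, eps = k_epsilon_curve([0, 1, 1, 3, None], k_max=3)

# Log-linear trend: fit log(eps) against k wherever eps is positive.
mask = eps > 0
slope, intercept = np.polyfit(ks[mask], np.log(eps[mask]), 1)
```

A negative `slope` corresponds to the roughly linear decay of $\log(\epsilon)$ with $k$ described above.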

Finally, Figure 6 shows a histogram of the tokens in the optimized prompts generated in the ground truth $k$-$\epsilon$ controllability experiments on Wikitext.

Figure 4: Log-spaced main results of $k$-$\log(\epsilon)$ controllability. Interestingly, the relationship between $k$ and $\log(\epsilon)$ appears roughly linear for each question length in the regime studied.
Top left: $k$-$\log(\epsilon)$ values for Falcon-7b. With $k = 10$ control tokens, 97.16% of the target outputs were reachable.
Top right: $k$-$\log(\epsilon)$ values for Llama-7b. With $k = 10$ control tokens, 98.64% of the target outputs were reachable.
Bottom right: $k$-$\log(\epsilon)$ values for Falcon-40b. With $k = 10$ control tokens, 97.00% of the target outputs were reachable.
Figure 5: Required prompt length $k$ versus base loss on the target output $\mathcal{L} = -\log P_{LM}(y \mid \mathbf{x}_0)$ on “ground truth” Wikitext target outputs $y$ immediately following $\mathbf{x}_0$. Top left: Falcon-7b. Top right: Llama-7b. Bottom right: Falcon-40b. While there does appear to be an “exclusion zone” in the top left-hand corner where a high prompt length is never associated with a base loss below a given threshold, base loss appears to be a poor predictor of required prompt length.
(a) Falcon-7b
(b) Llama-7b
(c) Falcon-40b
Figure 6: Prompt token frequencies for Falcon-7b (top), Llama-7b (middle), and Falcon-40b (bottom) from Wikitext ground truth target token $k$-$\epsilon$ controllability experiments.

D.2 Top-75 Wikitext Controllability Results

This subsection includes supplementary figures for the controllability of Llama-7b, Falcon-7b, and Falcon-40b on the Wikitext dataset where the target output token $y$ for a given initial state token sequence $\mathbf{x}_0$ is sampled uniformly from the top 75 highest-probability tokens as determined by the language model itself, $P_{LM}(y \mid \mathbf{x}_0)$. Specifically, the dataset $\mathcal{D}$ consists of 25 unique initial state token sequences $\mathbf{x}_0$ sampled from Wikitext, each replicated 75 times for the top 75 most probable subsequent tokens $y \sim P(y \mid \mathbf{x}_0)$. This procedure yielded a dataset of 1875 initial state-target output pairs $(\mathbf{x}_0, y)$ for the 7 billion parameter models. Due to the computational requirements for the 40 billion parameter model, the number of unique initial state token sequences was decreased to 10, resulting in a dataset of 750 initial state-target output pairs. The $k$-$\epsilon$ plots for each model are shown in Figure 7. On average, across the 3 models, the top 75 outputs were reachable 86.865% of the time with $k \leq 10$ prompt tokens. Similar log-linear trends were observed in the $k$-$\epsilon$ plot. Figure 8 shows the relationship between base loss and required prompt length, revealing a more dramatic “exclusion zone” in the top left, similar to the main “ground truth” results in Figure 5.
Finally, Figure 9 plots a histogram of the 40 most common tokens observed in the optimized control input prompts from the top-75 experiments.
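The construction of the top-75 dataset and the reported average can be reproduced in outline. In the sketch below the next-token distributions are random placeholders rather than model outputs, the vocabulary size of 1000 is hypothetical, and `top_n_pairs` is an illustrative helper name; only the counts (25 or 10 states, 75 targets each) and the three per-model reachability figures come from the text.

```python
import numpy as np

def top_n_pairs(next_token_probs, n=75):
    """Pair each state index with its n most probable next tokens."""
    pairs = []
    for s, probs in enumerate(next_token_probs):
        top = np.argsort(-probs)[:n]          # indices of the n largest probs
        pairs.extend((s, int(y)) for y in top)
    return pairs

rng = np.random.default_rng(0)
# Placeholder distributions: 25 states over a hypothetical 1000-token vocabulary.
fake_probs = rng.dirichlet(np.ones(1000), size=25)

pairs_7b = top_n_pairs(fake_probs)            # 25 states for the 7B models
pairs_40b = top_n_pairs(fake_probs[:10])      # 10 states for the 40B model

# Mean of the per-model reachability percentages reported in Figure 7.
mean_reachable = np.mean([89.387, 85.493, 85.714])
```

The pair counts follow directly from the number of states times 75, and the mean of the three per-model figures matches the 86.865% quoted above after rounding.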

Figure 7: $k$-$\epsilon$ controllability plots on the top 75 most likely output tokens.
Top left: $k$-$\epsilon$ values for Falcon-7b. With $k = 10$ control tokens, 89.387% of the top 75 output tokens were reachable.
Top right: $k$-$\epsilon$ values for Llama-7b. With $k = 10$ control tokens, 85.493% of the top 75 output tokens were reachable.
Bottom right: $k$-$\epsilon$ values for Falcon-40b. With $k = 10$ control tokens, 85.714% of the top 75 output tokens were reachable.
Figure 8: Required prompt length $k$ versus base loss on the target output $\mathcal{L} = -\log P_{LM}(y \mid \mathbf{x}_0)$ on the synthetic top-75 dataset. Top left: Falcon-7b. Top right: Llama-7b. Bottom right: Falcon-40b. While there does appear to be an “exclusion zone” in the top left-hand corner where a high prompt length is never associated with a base loss below a given threshold, base loss appears to be a poor predictor of required prompt length.
(a) Falcon-7b
(b) Llama-7b
(c) Falcon-40b
Figure 9: Prompt token frequencies for Falcon-7b (top), Llama-7b (middle), and Falcon-40b (bottom) from Wikitext top-75 synthetic dataset $k$-$\epsilon$ controllability experiments.

D.3 Uniformly Sampled Output Token Results

This section contains supplementary figures for $k$-$\epsilon$ controllability experiments on a synthetic dataset $\mathcal{D} = \{(\mathbf{x}_0, y)\}$ where $\mathbf{x}_0$ are sampled from the Wikitext dataset and $y$ is sampled uniformly from the vocabulary. The uniform target output dataset $\mathcal{D}$ consists of 616 state-output pairs. Due to computational constraints, $k$-$\epsilon$ controllability was only measured for Falcon-7b. Overall, only 46.42% of the target outputs were reachable with $k = 10$ prompt tokens. Figure 10 visualizes the $k$-$\epsilon$ results, the relationship between base loss and prompt length, and the most frequently observed tokens in the optimized control prompts. While the “exclusion zone” behavior (cf. Figures 5 and 8) is observed in the base loss vs. prompt length subplot, base loss remains a poor predictor of required prompt length. Moreover, Figure 3 reveals an even more uniform relationship between the initial rank of the target output token and the required prompt length.
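The two per-example quantities discussed above, base loss and initial rank of the target token, can be computed as in the following sketch. It assumes base loss is $-\log P_{LM}(y|\mathbf{x}_0)$ as in the Figure 8 caption, and it assumes a zero-based rank convention (rank 0 means $y$ is already the most likely token), which may differ from the convention used in the figures. The probability vector is a made-up toy distribution, not model output.

```python
import math
import random

def base_loss(probs, y):
    """Negative log-probability of target token y under the unprompted model."""
    return -math.log(probs[y])

def initial_rank(probs, y):
    """Number of tokens strictly more likely than y (0 = y is the argmax)."""
    return sum(1 for p in probs if p > probs[y])

# Toy next-token distribution over a hypothetical 50-token vocabulary.
rng = random.Random(0)
vocab_size = 50
weights = [rng.random() for _ in range(vocab_size)]
total = sum(weights)
probs = [w / total for w in weights]

# Target token sampled uniformly from the vocabulary, as in this section.
y = rng.randrange(vocab_size)
print(y, base_loss(probs, y), initial_rank(probs, y))
```

Because $y$ is drawn uniformly rather than from the model's own distribution, most sampled targets will have a high base loss and a poor initial rank, which is consistent with the lower reachability reported for this dataset.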

Figure 10: Supplementary figures on uniformly sampled target output controllability tests on Falcon-7b. Top Left: $k$-$\epsilon$ plot (46.42% controllable at $k = 10$). Top Right: Base loss versus required prompt length. Bottom: Histogram of top 40 most frequent tokens in optimized control prompts.

Glossary of Symbols

Self-Attention:

  • $\Xi$ : The self-attention mechanism, a mapping from $\mathbb{R}^{N \times d_{in}}$ to $\mathbb{R}^{N \times d_{out}}$

  • $\mathbf{X}$ : The input matrix to self-attention, $\mathbf{X} \in \mathbb{R}^{N \times d_{in}}$

  • $N$ : The number of input token representations

  • $d_{in}$ : The dimensionality of each input token representation

  • $d_{out}$ : The dimensionality of each output token representation

  • $\mathbf{W}_q, \mathbf{W}_{\rm key}, \mathbf{W}_v$ : The query, key, and value projection weight matrices

  • $d_{\rm key}$ : The dimensionality of the key vectors

  • $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ : The query, key, and value matrices

  • $\mathbf{D}$ : The diagonal matrix used for normalization

  • $\mathbf{1}_{N \times 1}$ : An $N \times 1$ matrix of ones
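To show how the symbols in this list fit together, here is a minimal pure-Python sketch of single-head self-attention. It assumes the standard softmax form $\Xi(\mathbf{X}) = \mathbf{D}^{-1} \exp(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_{\rm key}})\,\mathbf{V}$ with $\mathbf{D} = \mathrm{diag}(\exp(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_{\rm key}})\,\mathbf{1}_{N \times 1})$ and no causal mask; the helper names and the example matrices are illustrative only.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def self_attention(x, w_q, w_key, w_v):
    """Return (Xi(X), attention weights) for an N x d_in input X."""
    q, k, v = matmul(x, w_q), matmul(x, w_key), matmul(x, w_v)
    d_key = len(w_key[0])
    scores = matmul(q, transpose(k))                       # N x N dot products
    e = [[math.exp(s / math.sqrt(d_key)) for s in row] for row in scores]
    d = [sum(row) for row in e]                            # diagonal of D
    weights = [[e_ij / d_i for e_ij in row] for row, d_i in zip(e, d)]
    return matmul(weights, v), weights

# Example with N = 3 tokens, d_in = 2, d_key = 2, d_out = 2 (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[0.5, -0.5], [0.25, 1.0]]
w_key = [[1.0, 0.0], [0.5, 0.5]]
w_v = [[1.0, 2.0], [-1.0, 0.5]]

out, attn = self_attention(x, w_q, w_key, w_v)
print(out)
```

Dividing each row of the exponentiated scores by its sum is what multiplication by $\mathbf{D}^{-1}$ does, so every row of the attention weights sums to one.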

Input Partitioning:

  • $\mathbf{U}$ : The $k \times d_{in}$ submatrix of $\mathbf{X}$ corresponding to the control input

  • $\mathbf{X}_0$ : The $m \times d_{in}$ submatrix of $\mathbf{X}$ corresponding to the imposed state

  • $k$ : The number of control input tokens

  • $m$ : The number of imposed state tokens

Output Partitioning:

  • $\mathbf{U}^{\prime}$ : The $k \times d_{out}$ submatrix of the output corresponding to the control input

  • $\mathbf{Y}$ : The $m \times d_{out}$ submatrix of the output corresponding to the imposed state

  • $\mathbf{Y}^{*}$ : The desired output, $\mathbf{Y}^{*} \in \mathbb{R}^{m \times d_{out}}$

  • $\mathbf{Y}_u, \mathbf{Y}_x$ : The components of $\mathbf{Y}$ arising from $\mathbf{U}$ and $\mathbf{X}_0$ respectively

  • $\mathbf{Y}_{u,||}, \mathbf{Y}_{x,||}$ : The components of $\mathbf{Y}_u$ and $\mathbf{Y}_x$ parallel to $\mathbf{Y}^{*}$

  • $\mathbf{Y}_{u,\perp}, \mathbf{Y}_{x,\perp}$ : The components of $\mathbf{Y}_u$ and $\mathbf{Y}_x$ orthogonal to $\mathbf{Y}^{*}$

  • $\mathbf{Y}_{x,\perp}^{min}$ : The minimum value of $\mathbf{Y}_{x,\perp}$ over all control inputs that are uniformly bounded in norm by a fixed constant $M_u$ in the hypothesis of the theorem
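The split of $\mathbf{Y}$ into $\mathbf{Y}_u$ and $\mathbf{Y}_x$ can be checked numerically. The sketch below assumes unmasked softmax self-attention applied to the stacked input $[\mathbf{U}; \mathbf{X}_0]$, so that each imposed-state row shares one normalizer across control and imposed-state keys. The function names and matrices are hypothetical; the parallel and orthogonal projections onto $\mathbf{Y}^{*}$ are not reproduced here.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def exp_scores(queries, keys, d_key):
    """exp(q . k / sqrt(d_key)) for every query-key pair."""
    return [[math.exp(sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_key))
             for k in keys] for q in queries]

def scale_rows(mat, d):
    """Divide row i of mat by d[i] (multiplication by D^{-1})."""
    return [[val / d_i for val in row] for row, d_i in zip(mat, d)]

def split_imposed_output(u, x0, w_q, w_key, w_v):
    """Return (Y_u, Y_x): contributions of U and X_0 to the imposed-state rows."""
    d_key = len(w_key[0])
    q_x = matmul(x0, w_q)
    e_u = exp_scores(q_x, matmul(u, w_key), d_key)    # m x k
    e_x = exp_scores(q_x, matmul(x0, w_key), d_key)   # m x m
    d = [sum(ru) + sum(rx) for ru, rx in zip(e_u, e_x)]  # shared normalizer
    y_u = scale_rows(matmul(e_u, matmul(u, w_v)), d)
    y_x = scale_rows(matmul(e_x, matmul(x0, w_v)), d)
    return y_u, y_x

def full_attention(x, w_q, w_key, w_v):
    """Unmasked softmax self-attention on the full input."""
    e = exp_scores(matmul(x, w_q), matmul(x, w_key), len(w_key[0]))
    return scale_rows(matmul(e, matmul(x, w_v)), [sum(row) for row in e])

# Made-up example: k = 1 control token, m = 2 imposed tokens, d_out = 3.
u = [[0.5, -1.0]]
x0 = [[1.0, 0.0], [0.0, 2.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_key = [[0.5, 0.1], [-0.2, 0.3]]
w_v = [[1.0, 2.0, 0.0], [0.0, -1.0, 1.0]]

y_u, y_x = split_imposed_output(u, x0, w_q, w_key, w_v)
y_full = full_attention(u + x0, w_q, w_key, w_v)[len(u):]  # last m rows
max_gap = max(abs(a + b - c)
              for ru, rx, rf in zip(y_u, y_x, y_full)
              for a, b, c in zip(ru, rx, rf))
print(max_gap)
```

The gap between $\mathbf{Y}_u + \mathbf{Y}_x$ and the last $m$ rows of the full attention output should be at the level of floating-point rounding.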

Reachability Conditions:

  • $\beta_i(\mathbf{X}_0, k)$ : The upper bound on the norm of row $i$ of $\mathbf{Y}_{u,\perp}$

  • $\gamma_i(\mathbf{X}_0, \boldsymbol{\theta})$ : A number that depends on $\mathbf{X}_0$ and $\boldsymbol{\theta} = (\mathbf{W}_q, \mathbf{W}_{\rm key}, \mathbf{W}_v)$

  • $\alpha$ : An upper bound on the scaled key-query dot products

  • $\sigma_q, \sigma_{\rm key}, \sigma_v$ : The maximum singular values of $\mathbf{W}_q$, $\mathbf{W}_{\rm key}$, $\mathbf{W}_v$

  • $M_u, M_x$ : The maximum norms of the control and imposed token embeddings
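As an illustration of how the last three entries interact, the following worked inequality shows one way a bound of the kind denoted $\alpha$ can arise. It assumes row-vector token embeddings with $\|\mathbf{x}_i\| \leq M_x$ and $\|\mathbf{u}_j\| \leq M_u$, and uses only the Cauchy-Schwarz inequality and the fact that the maximum singular value bounds how much a matrix can stretch a vector. It is a sketch under these assumptions and is not necessarily the exact constant used in the theorem.

```latex
\left| \frac{\langle \mathbf{x}_i \mathbf{W}_q,\ \mathbf{u}_j \mathbf{W}_{\rm key} \rangle}{\sqrt{d_{\rm key}}} \right|
\;\leq\; \frac{\|\mathbf{x}_i \mathbf{W}_q\| \, \|\mathbf{u}_j \mathbf{W}_{\rm key}\|}{\sqrt{d_{\rm key}}}
\;\leq\; \frac{\sigma_q \, \sigma_{\rm key} \, M_x \, M_u}{\sqrt{d_{\rm key}}}
```

The first step is Cauchy-Schwarz; the second uses $\|\mathbf{x}_i \mathbf{W}_q\| \leq \sigma_q \|\mathbf{x}_i\|$ and $\|\mathbf{u}_j \mathbf{W}_{\rm key}\| \leq \sigma_{\rm key} \|\mathbf{u}_j\|$.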