Repeat After Me:
Transformers are Better than State Space Models at Copying

Samy Jelassi    David Brandfonbrener    Sham M. Kakade    Eran Malach
Abstract

Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as “generalized state space models” (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context. We start with a theoretical analysis of the simple task of string copying and prove that a two layer transformer can copy strings of exponential length while GSSMs are fundamentally limited by their fixed-size latent state. Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context. Finally, we evaluate pretrained large language models and find that transformer models dramatically outperform state space models at copying and retrieving information from context. Taken together, these results suggest a fundamental gap between transformers and GSSMs on tasks of practical interest.

Machine Learning, ICML

1 Introduction

Transformers (Vaswani et al., 2017) are the workhorse of modern sequence modeling, achieving remarkable performance on a variety of tasks, but they have unavoidable inefficiencies. Specifically, they require $\Omega(L)$ memory and compute to predict the next token of a sequence of length $L$. (Footnote 1: In some naive implementations of transformers, it is common to allocate an $L \times L$ matrix to compute the attention. However, memory-efficient implementations, such as FlashAttention (Dao et al., 2022), compute the attention with $O(L)$ memory.)

This has spurred a boom in attempts to create architectures that can achieve similar performance as transformers, but with $O(1)$ memory to predict each token. This class of models includes state space models like S4 (Gu et al., 2021) or Mamba (Gu & Dao, 2023), as well as traditional RNN models (Hochreiter & Schmidhuber, 1997) and models that can be trained in parallel like linear attention (Katharopoulos et al., 2020; Choromanski et al., 2020) and parallel RNNs (Bradbury et al., 2016; Peng et al., 2023; Sun et al., 2023). In this paper, we will refer to this entire class of models that use a fixed-size memory as “generalized state space models” or GSSMs (see a formal definition in Section 2).

Recent work has demonstrated impressive performance of GSSMs, but it is not yet clear what these models sacrifice for their improved efficiency, if anything. In this paper, we find that one particular capability that is sacrificed is the ability to retrieve and repeat parts of the input context. As a result, transformers are better than GSSMs at a variety of tasks that require accessing arbitrary parts of the context.

To understand this gap in capabilities, we begin by presenting a theoretical analysis of the copying task. (Footnote 2: Note that we study copying of the input and not copying of training data (McCoy et al., 2023; Carlini et al., 2022).) First, we show via construction that a simple transformer model can copy strings of length that is exponential in the number of heads of the transformer. This construction relies on the ability of the transformer to implement a mechanism of “storage” and retrieval of sequences of n tokens (n-grams), where the n-grams are used to track where to copy from. In contrast, we show that, trivially, GSSMs cannot accurately copy strings with more bits than the size of the latent state.

[Figure 1 panels: (a) Copying: training efficiency (legend: Transformer, GSSM). (b) Copying: length generalization (legend: RoPE, NoPE, Alibi, HAlibi, LSTM, Mamba). (c) Lookup with pretrained models (legend: Pythia 410M, 1.4B, 2.8B; Mamba 360M, 1.4B, 2.8B).]
Figure 1: (a) Copying: training efficiency. Here we train models to copy strings of length $\leq 300$ and evaluate string-level accuracy on strings of length 300. Transformers train much faster than GSSMs. An LSTM cannot even learn the task within this number of samples. (b) Copying: length generalization. Here we train models to copy on strings of length $\leq 50$ until all models are perfect in-distribution and evaluate string-level accuracy. Purple dotted line indicates maximum training string length and green dotted line indicates context window during training. Evaluating on longer inputs, the transformer models dramatically outperform the GSSMs. Using our Hard-Alibi positional encoding, we can even generalize well beyond the training context size. (c) Lookup with pretrained models. Here the task requires looking up and retrieving a number from a “phone book” of varying length that is entirely in context. We evaluate pretrained models 1-shot without any finetuning. Pythia (a transformer model) substantially outperforms Mamba (a GSSM) across model sizes.

Our theory studies representation expressivity, but not whether these representations will be learned. Moreover, in practice a large GSSM may have enough capacity to represent the entire input in the latent state, at least in theory. To resolve these concerns, we conduct a variety of synthetic experiments with models of $\sim$160M parameters. We find that transformers are both much more efficient at learning to copy (Figure 1(a)) and also generalize better to longer inputs (Figure 1(b)). Additionally, we verify experimentally that the copy “algorithm” learned by transformers indeed relies on n-grams to perform a lookup of where to copy from (Figure 3), similarly to our theoretical construction.

Finally, we present a variety of experiments on pre-trained models to test their ability to remember and access the input context. In particular, we show that Pythia transformers (Biderman et al., 2023) outperform Mamba GSSMs (Gu & Dao, 2023) of similar size at a variety of memory-intensive tasks including copying and retrieving information from the context (Figure 1(c)). This is especially notable since the Mamba models achieve lower perplexity than the Pythia models at language modeling on the Pile (Gao et al., 2020). These experiments illustrate the practical relevance of the memory issues that we raise, and hint at one way that architectural choices can impact the downstream performance of LLMs above and beyond training perplexity.

2 Theory: Representational Capacity

In this section we use the copy task for a theoretical comparison between state space models and transformers. We prove two main results. First, we construct a small transformer that solves the copy task for sequence lengths that are exponential in the transformer size. Second, we show that any state space model fails to solve the copy task, unless its latent state grows linearly with the sequence length.

2.1 Setting

Let $\mathbb{D}$ be a dictionary, which contains $D$ “alphabet” tokens. A sequence-to-sequence model is a function $H: \mathbb{D}^* \to \mathbb{D}^*$, which maps an input sequence of tokens to an output sequence. We think of the input $x_1, \dots, x_i$ as the “prompt” to the model, and of the output sequence $H(x_1, \dots, x_i)$ as the generated “answer”.

A sequence-to-token mapping is a function $h: \mathbb{D}^* \to \mathbb{D}$. Any sequence-to-token model $h$ naturally defines a sequence-to-sequence model $H$ by auto-regressive inference. Namely, for every input sequence $x_1, \dots, x_i \in \mathbb{D}$ we define recursively $x_{i+j} = h(x_1, \dots, x_{i+j-1})$ and let $H(x_1, \dots, x_i) = (x_{i+1}, x_{i+2}, \dots)$.
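The auto-regressive construction of $H$ from $h$ can be sketched in a few lines of Python. This is only an illustration: the function name `autoregressive`, the truncation to `num_tokens` outputs, and the toy model `echo_two_back` are assumptions made here, not part of the paper.

```python
from typing import Callable, List, Sequence

Token = str


def autoregressive(h: Callable[[Sequence[Token]], Token],
                   prompt: Sequence[Token],
                   num_tokens: int) -> List[Token]:
    """Return the first `num_tokens` tokens of H(prompt), where H is induced by h."""
    seq = list(prompt)
    for _ in range(num_tokens):
        # x_{i+j} = h(x_1, ..., x_{i+j-1}): each new token sees the whole history.
        seq.append(h(seq))
    return seq[len(prompt):]


# Hypothetical toy sequence-to-token model: repeat the token two steps back.
def echo_two_back(seq: Sequence[Token]) -> Token:
    return seq[-2]


print(autoregressive(echo_two_back, ["a", "b"], 4))  # ['a', 'b', 'a', 'b']
```

Because $H$ in the text produces an unbounded sequence, any runnable version has to truncate it somewhere; here the caller chooses the number of generated tokens.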

Generalized state space models.

A state space $\mathcal{S}$ is some finite set. We denote by $\mathrm{mem}(\mathcal{S})$ the number of bits required to encode the states of $\mathcal{S}$, namely $\mathrm{mem}(\mathcal{S}) = \log(|\mathcal{S}|)$. A generalized state space model (GSSM) is a sequence model defined by an update rule $u: \mathcal{S} \times \mathbb{D} \to \mathcal{S}$ and some output function $r: \mathcal{S} \to \mathbb{D}$. Let $s_0 \in \mathcal{S}$ be some initial state. Given some sequence $x_1, \dots, x_L$, the state of the model at iteration $i$ is denoted by $S_i(x_1, \dots, x_i)$ and the output token is denoted by $R_i(x_1, \dots, x_i)$.
The state and output are defined recursively: 1) $S_0(\emptyset) = s_0$, 2) $S_i(x_1, \dots, x_i) = u(S_{i-1}(x_1, \dots, x_{i-1}), x_i)$, 3) $R_i(x_1, \dots, x_i) = r(S_i(x_1, \dots, x_i))$.
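To make the recursion concrete, here is a minimal Python sketch of a GSSM as an update rule, an output function, and an initial state. The class name `GSSM` and the parity example are illustrative assumptions; the only point is that everything the model knows about the prefix must pass through the single variable `state`.

```python
from typing import Callable, Hashable, Iterable, List


class GSSM:
    """A generalized state space model given by (u, r, s0)."""

    def __init__(self,
                 update: Callable[[Hashable, str], Hashable],
                 readout: Callable[[Hashable], str],
                 initial_state: Hashable) -> None:
        self.u = update
        self.r = readout
        self.s0 = initial_state

    def run(self, tokens: Iterable[str]) -> List[str]:
        """Return the outputs R_1, ..., R_L for inputs x_1, ..., x_L."""
        state = self.s0                   # S_0 = s0
        outputs = []
        for x in tokens:
            state = self.u(state, x)      # S_i = u(S_{i-1}, x_i)
            outputs.append(self.r(state)) # R_i = r(S_i)
        return outputs


# Hypothetical toy model with |S| = 2 (one bit of memory):
# the state tracks the parity of the number of "1" tokens seen so far.
parity = GSSM(update=lambda s, x: s ^ (x == "1"),
              readout=lambda s: "odd" if s else "even",
              initial_state=False)

print(parity.run(["1", "0", "1", "1"]))  # ['odd', 'odd', 'even', 'odd']
```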

Figure 2: An illustration of the copy task.
Remark 2.1.

It is important to note that for any sequence model, there are two types of memory considerations: 1) input-independent memory (parameters) and 2) input-dependent memory (activations). The GSSM definition constrains the input-dependent memory (activations), which corresponds to $\mathrm{mem}(\mathcal{S})$, and does not restrict in any way the amount of input-independent memory (parameters) or the run-time of state updates. Since our main goal is to show a lower bound on the state space memory, leaving all other considerations unconstrained only strengthens our results.

Transformers.

Given some input of length $L$ and dimension $d$, denoted $\bm{x}_1, \dots, \bm{x}_L \in \mathbb{R}^d$, an attention head is parameterized by $W_k, W_q, W_v \in \mathbb{R}^{d \times d}$. We denote $\bm{k}_i = W_k \bm{x}_i$, $\bm{q}_i = W_q \bm{x}_i$, $\bm{v}_i = W_v \bm{x}_i$ and denote $K_i = [\bm{k}_1, \dots, \bm{k}_i] \in \mathbb{R}^{d \times i}$ and $V_i = [\bm{v}_1, \dots, \bm{v}_i] \in \mathbb{R}^{d \times i}$. We denote the output of the head at token $i$ by $\bm{o}_i \in \mathbb{R}^d$, where $\bm{o}_i = V_i \cdot \mathrm{softmax}(K_i \cdot \bm{q}_i)$.

We consider a transformer with $l$ attention heads, each one of dimension $d$ so that the full dimension of the Transformer is $dl$. An embedding is some mapping $\Psi: \mathbb{D} \to \mathbb{R}^d$. An MLP is a function $f: \mathbb{R}^{dl} \to \mathbb{R}^{dl}$ s.t. $f(\bm{x}) = U_1 \sigma(U_2 \bm{x})$, for some activation function $\sigma$. Both the embedding and the MLP layer are assumed to be applied on the token level. An attention-block is a set of $l$ heads applied in parallel, and a transformer-block is an attention-block followed by an MLP which operates on the concatenated output of the $l$ heads. The output of the model is sampled based on the output of the final layer. For simplicity, we study the $\operatorname{arg\,max}$ “sampling” (i.e., predicting the most probable token).
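A single attention head as defined above can be sketched in dependency-free Python. This is an illustrative sketch only: vectors are plain lists, the helper names are made up here, and the score for key $j$ is taken to be the inner product $\bm{k}_j \cdot \bm{q}_i$ (that is, $K_i \cdot \bm{q}_i$ is read as $K_i^\top \bm{q}_i$, which is an interpretation of the notation).

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product W x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_head(xs: List[Vector], Wk: Matrix, Wq: Matrix, Wv: Matrix) -> List[Vector]:
    """Return o_1, ..., o_L, where o_i attends only to positions 1..i."""
    keys = [matvec(Wk, x) for x in xs]
    queries = [matvec(Wq, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    d = len(values[0])
    outputs = []
    for i in range(len(xs)):
        # Scores of the current query against keys k_1..k_i.
        scores = [sum(k * q for k, q in zip(keys[j], queries[i])) for j in range(i + 1)]
        weights = softmax(scores)
        # o_i is the attention-weighted sum of the value vectors v_1..v_i.
        outputs.append([sum(weights[j] * values[j][c] for j in range(i + 1))
                        for c in range(d)])
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
outs = attention_head([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
print(outs)
```

Note that the loop stores all keys and values seen so far, which is the $\Omega(L)$ input-dependent memory discussed in the introduction.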

The copy task.

To define the copy task, we add two special tokens to $\mathbb{D}$: (1) beginning-of-sequence token, denoted $\langle\textsc{BOS}\rangle$, and (2) copy token, denoted $\langle\textsc{COPY}\rangle$. So now $|\mathbb{D}| = D + 2$. A length-$L$ copy distribution $\mathcal{D}_L$ over $\mathbb{D}^{L+2}$ generates strings of the form: “$\langle\textsc{BOS}\rangle, x_1, \dots, x_L, \langle\textsc{COPY}\rangle$”, where $\bm{x} \in (\mathbb{D} \setminus \{\langle\textsc{BOS}\rangle, \langle\textsc{COPY}\rangle\})^L$.

For some sequence-to-sequence model $H: \mathbb{D}^* \to \mathbb{D}^*$, we denote the error of $H$ on a copy distribution $\mathcal{D}_L$ by

$$\mathrm{err}_{\mathcal{D}_L}(H) = \Pr_{\mathcal{D}_L}\left[ H_{1:L}(\langle\textsc{BOS}\rangle, \bm{x}, \langle\textsc{COPY}\rangle) \neq \bm{x} \right]$$

where $H_{1:L}(\cdot)$ denotes the first $L$ tokens generated by $H$. That is, we expect the model to output an exact copy of $\bm{x}$.
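The error defined above can be estimated by sampling. The sketch below is a Monte Carlo estimate for the uniform copy distribution; the token strings `"<BOS>"` and `"<COPY>"`, the function names, and the convention that a model is called as `H(prompt, num_tokens)` are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

BOS, COPY = "<BOS>", "<COPY>"


def sample_copy_example(L: int, alphabet: Sequence[str],
                        rng: random.Random) -> Tuple[List[str], List[str]]:
    """Draw x uniformly from alphabet^L and return (prompt, x)."""
    x = [rng.choice(alphabet) for _ in range(L)]
    return [BOS] + x + [COPY], x


def copy_error(H: Callable[[List[str], int], Sequence[str]], L: int,
               alphabet: Sequence[str], num_samples: int = 1000, seed: int = 0) -> float:
    """Estimate err_{D_L}(H): the fraction of prompts whose first L outputs differ from x."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(num_samples):
        prompt, x = sample_copy_example(L, alphabet, rng)
        if list(H(prompt, L)) != x:  # exact-match: a single wrong token is an error
            errors += 1
    return errors / num_samples


# Two hypothetical reference models: one copies perfectly, one ignores its input.
def perfect_copier(prompt: List[str], num_tokens: int) -> List[str]:
    return prompt[1:1 + num_tokens]


def constant_model(prompt: List[str], num_tokens: int) -> List[str]:
    return ["a"] * num_tokens


print(copy_error(perfect_copier, 5, list("abc")))
print(copy_error(constant_model, 5, list("abc")))
```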

2.2 Transformers can copy inputs of exponential length

In this section, we show that transformers can implement the copy operation for input sequences with length exponential in the number of heads. Namely, we construct a transformer with two blocks that gets small error on the copy task.

Construction: hash-based copying.

The key idea in the construction is to first “hash” sequences of $n$ tokens ($n$-grams), then at each iteration of the auto-regression attend to the previous occurrence of the most recent $n$-gram, and output the succeeding token. That is, we show that a transformer can implement the copying algorithm illustrated in Figure 3 (and see also Algorithm 1 in the Appendix).
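The lookup idea described above can be sketched in plain Python. This is not Algorithm 1 from the Appendix, only a hedged illustration of the same principle: the function name `ngram_copy`, the dictionary used as the "hash", the choice to bootstrap by emitting the first $n$ tokens directly, and the tie-breaking rule (first occurrence wins) are all assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def ngram_copy(x: Sequence[str], n: int) -> List[str]:
    """Reproduce x token by token using only n-gram lookups into x."""
    # "Hash" every n-gram of the input to the position of the token that follows it.
    # When an n-gram repeats, only its first occurrence is kept.
    successor: Dict[Tuple[str, ...], int] = {}
    for i in range(len(x) - n):
        successor.setdefault(tuple(x[i:i + n]), i + n)

    out = list(x[:n])  # assumption: the first n tokens are emitted directly
    while len(out) < len(x):
        recent = tuple(out[-n:])             # most recent n-gram of the output
        out.append(x[successor[recent]])     # emit the token that followed it in the input
    return out


s = list("abcabd")
print("".join(ngram_copy(s, 3)))  # all 3-grams are distinct, so the copy is exact
print("".join(ngram_copy(s, 2)))  # "ab" occurs twice, so the copy goes wrong
```

The second call shows the failure mode that the guarantees below are stated in terms of: once an $n$-gram occurs twice in the input, the lookup can land on the wrong occurrence.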

Positional embedding: Hard-ALiBi.

To perform the hashing described in the algorithm, we need to be able to leverage local positional information to define a hash, and also to apply this hash function globally on the entire input. To do this, we use a hard version of ALiBi (Press et al., 2021), which we call Hard-ALiBi. Just as in ALiBi, we add a bias $b_i$ to the $i$-th attention head as follows: $\bm{o}_i = V_i \cdot \mathrm{softmax}(K_i \cdot \bm{q}_i + b_i)$. Specifically, we set $b_i$ s.t. $b_{i,j} = -\infty$ for $j \leq i - m$ and $b_{i,j} = 0$ for $j > i - m$. We allow different heads with different choices of $m$ and also allow for $m = \infty$ which corresponds to softmax attention with no positional embedding. This is illustrated in Figure 8(c) (Appendix). While the Hard-ALiBi is introduced for our theoretical construction, we observe it also offers significant benefits empirically, as discussed in Section 3.
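The bias pattern can be written out explicitly. The sketch below builds the full $L \times L$ bias matrix for one head with window $m$; the function name `hard_alibi_bias`, the 0-indexed positions, and the explicit causal mask ($j \le i$, implied by the definition of $K_i$ but not stated in this paragraph) are assumptions for illustration.

```python
import math
from typing import List


def hard_alibi_bias(L: int, m: float = math.inf) -> List[List[float]]:
    """Bias b[i][j] added to the attention score of query i against key j.

    A key is visible (bias 0) only if it is one of the last m positions up to
    and including i; every other key gets bias -inf and thus zero softmax weight.
    m = inf gives plain causal attention with no positional information.
    """
    bias = []
    for i in range(L):
        row = []
        for j in range(L):
            visible = (j <= i) and (j > i - m)
            row.append(0.0 if visible else -math.inf)
        bias.append(row)
    return bias


for row in hard_alibi_bias(4, m=2):
    print(row)
```

With `m=2`, each query sees only itself and the previous position, which is the kind of local window needed to assemble an $n$-gram from neighboring tokens.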

Figure 3: An illustration of the $n$-gram based copy algorithm. In order to predict the next token, we match the current $n$-gram to the corresponding $n$-gram in the input, then output the next token.

Guarantees.

The copy algorithm given in Algorithm 1 (and similarly, our transformer construction) can perfectly copy the input sequence, as long as there are no repeated $n$-gram patterns in the input. Therefore, the error of the algorithm depends on the probability of repeated $n$-grams:

Definition 2.2.

Let $\mathcal{D}_L$ be some copy distribution. For some $n \in \mathbb{N}$, let $p_{\mathrm{n\text{-}gram}}(\mathcal{D}_L)$ be the probability that $x_1, \dots, x_L$ contains two repeated sequences of $n$ tokens. Namely:

$$p_{\mathrm{n\text{-}gram}}(\mathcal{D}_L) = \Pr_{\mathcal{D}_L}\left[ \exists_{i \neq j}~\mathrm{s.t.}~x_i, \dots x_{i+n} = x_j, \dots, x_{j+n} \right]$$

Below we state the main theoretical result on copying with transformers, showing that transformers can copy their input, with error bounded by the probability of repeated $n$-grams:

Theorem 2.3.

For all $n$, there exists a depth-2 transformer $\mathcal{T}$ of dimension $O(n \log(D))$ s.t. for all $2n \leq L \leq D^n$, and for any copy distribution $\mathcal{D}_L$, $\mathrm{err}_{\mathcal{D}_L}(\mathcal{T}) < p_{\mathrm{n\text{-}gram}}(\mathcal{D}_L)$.

Intuitively, the probability of repeated $n$-grams decays quickly when increasing the value of $n$. Indeed, we show that for the uniform distribution over sequences, this probability decays exponentially with $n$:

Lemma 2.4.

Let $\mathcal{D}_L$ be the copy distribution generated by sampling $\bm{x}$ from the uniform distribution over the “alphabet” (non-special) tokens. Then, $p_{\mathrm{n\text{-}gram}}(\mathcal{D}_L) < L^2 D^{-n}$.
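The bound in Lemma 2.4 can be sanity-checked numerically. The sketch below estimates the repeated-$n$-gram probability for uniformly random strings and prints it next to $L^2 D^{-n}$. It is only an illustration: the function names, the fixed seed, and the parameter choices are assumptions, and windows are taken to be exactly $n$ tokens long as in the prose of Definition 2.2.

```python
import random
from typing import Sequence


def has_repeated_ngram(x: Sequence, n: int) -> bool:
    """True if some window of n consecutive tokens occurs at two different positions."""
    seen = set()
    for i in range(len(x) - n + 1):
        gram = tuple(x[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


def estimate_p_ngram(L: int, D: int, n: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the repeated-n-gram probability under the uniform distribution."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [rng.randrange(D) for _ in range(L)]  # tokens are integers 0..D-1
        hits += has_repeated_ngram(x, n)
    return hits / trials


L, D, n = 20, 26, 3
print("estimate:", estimate_p_ngram(L, D, n))
print("bound L^2 D^-n:", L ** 2 * D ** (-n))
```

Increasing `n` by one divides the bound by `D`, which is the exponential decay the lemma refers to.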

Combining the above results, we get that transformers can copy sequences of tokens drawn from the uniform distribution, using a number of parameters that depends only logarithmically on the input sequence length.

Corollary 2.5.

Fix some $\epsilon \in (0, 1/2)$ and some $L \geq \Omega(\log(1/\epsilon))$. There exists a depth-2 transformer $\mathcal{T}$ of dimension $O(\log(L/\epsilon) \log(D))$ s.t. for the uniform copy distribution $\mathcal{D}_L$, $\mathrm{err}_{\mathcal{D}_L}(\mathcal{T}) < \epsilon$.
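One way to see how the corollary follows from Theorem 2.3 and Lemma 2.4 is the chain below. This is a sketch of the argument rather than the paper's own proof: the specific choice of $n$ is an assumption, and the requirement $2n \le L$ of Theorem 2.3 is presumably what the condition $L \geq \Omega(\log(1/\epsilon))$ is for.

```latex
\begin{align*}
\mathrm{err}_{\mathcal{D}_L}(\mathcal{T})
  &< p_{\mathrm{n\text{-}gram}}(\mathcal{D}_L)
     && \text{(Theorem 2.3, requires } 2n \le L \le D^n\text{)} \\
  &< L^2 D^{-n}
     && \text{(Lemma 2.4)} \\
  &\le \epsilon
     && \text{whenever } n \ge \log_D(L^2/\epsilon).
\end{align*}
% Choosing n = \lceil \log_D(L^2/\epsilon) \rceil gives D^n \ge L^2/\epsilon \ge L, and
\[
  n \log(D) \;\le\; \log(L^2/\epsilon) + \log(D) \;\le\; 2\log(L/\epsilon) + \log(D),
\]
% so the dimension O(n \log(D)) of Theorem 2.3 is within the stated O(\log(L/\epsilon)\log(D)).
```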

Remark 2.6.

For simplicity we do not limit the precision of the parameters or activations, but note that our results hold for finite-precision transformers, using $O(\log(\log(L)))$ bits.

2.3 State Space Models cannot copy inputs beyond memory size

We saw that transformers are able to copy uniform sequences of tokens, with parameter count logarithmic in the sequence length. We now show that GSSMs cannot copy uniform input sequences, unless the capacity of their state space grows linearly with the sequence length. This is intuitive: to be able to copy the entire input sequence, the model needs to store it in its state space, which requires the memory to grow linearly with the sequence length.

Theorem 2.7.

Fix some GSSM $H$ over state space $\mathcal{S}$. Then, for all $L$, for the uniform copy distribution $\mathcal{D}_L$, the model $H$ has error $\mathrm{err}_{\mathcal{D}_L}(H) > 1 - \frac{|\mathcal{S}|}{D^L}$.

Given Theorem 2.7, the following Corollary is immediate:

Corollary 2.8.

Fix some $L \in \mathbb{N}$. Then, every GSSM $H$ with state space $\mathcal{S}$ s.t. $\mathrm{mem}(\mathcal{S}) < L \log(D) - 1$ has error $\mathrm{err}_{\mathcal{D}_L}(H) > 1/2$ for the uniform copy distribution $\mathcal{D}_L$.
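The step from Theorem 2.7 to Corollary 2.8 is a short calculation, spelled out here under the assumption that $\log$ is base 2, so that $\mathrm{mem}(\mathcal{S})$ is measured in bits:

```latex
\begin{align*}
\mathrm{mem}(\mathcal{S}) = \log_2 |\mathcal{S}| < L \log_2(D) - 1
  &\;\Longrightarrow\; |\mathcal{S}| < \tfrac{1}{2} D^{L} \\
  &\;\Longrightarrow\; \mathrm{err}_{\mathcal{D}_L}(H)
      > 1 - \frac{|\mathcal{S}|}{D^{L}} > 1 - \tfrac{1}{2} = \tfrac{1}{2}.
\end{align*}
```

In words: a state space that is even one bit short of the $L \log_2(D)$ bits needed to distinguish all $D^L$ inputs already forces the model to fail on more than half of them.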

Remark 2.9.

As mentioned previously, the input-dependent memory of transformers grows linearly with the sequence length, which is less memory-efficient compared to GSSMs. However, it is interesting to note that from the above result, at least for the copy task, transformers are almost optimal in terms of their input-dependent memory. More specifically, an implication of Theorem 2.3 is that there exists a transformer which can copy inputs of length $L$ using $\tilde{O}(L)$ input-dependent memory, and due to Corollary 2.8 this is indeed optimal (up to logarithmic factors). (Footnote 3: We use $\tilde{O}$ to hide logarithmic factors.)

3 Learning to Copy

In the previous section, we proved that transformers can represent the copy operation for exponentially long sequences, while GSSMs fail to copy long sequences due to their limited memory. While these results show that in theory, transformers can outperform GSSMs, our theoretical results do not establish that such a gap will be observed in practice for two reasons. First, it is not clear that transformers can indeed learn to copy from examples. Second, GSSMs in practice may use a large latent state memory, so that our bounds only hold for very long sequences of tokens. For example, a latent state of 1000 32-bit floating point numbers has enough bits to store at least 2000 tokens from a 50K token vocabulary. However, even though a GSSM could fit the context into memory, it may not learn to do so.
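The capacity figure quoted above is a simple bit count, which the following sketch reproduces. The function name is hypothetical, and the calculation treats every bit of the latent state as freely usable storage, which is an idealization.

```python
import math


def max_tokens_in_state(num_floats: int, bits_per_float: int, vocab_size: int) -> int:
    """Upper bound on how many arbitrary tokens fit in the state, by bit counting."""
    state_bits = num_floats * bits_per_float  # total bits in the latent state
    bits_per_token = math.log2(vocab_size)    # bits needed to identify one token
    return math.floor(state_bits / bits_per_token)


# 1000 floats x 32 bits = 32,000 bits; a 50K vocabulary needs about 15.6 bits per token.
print(max_tokens_in_state(1000, 32, 50_000))
```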

Our goal in this section is to verify that our theoretical analysis bears out experimentally when training models from scratch on synthetic data, before moving on to study pretrained models in the next section. Specifically, we train transformers and GSSMs (LSTM (Hochreiter & Schmidhuber, 1997) and Mamba (Gu & Dao, 2023)) on variants of the copy task shown in Figure 2.

3.1 Experimental setup

We now provide a brief overview of our experimental setup. Further details may be found in Appendix A. Code and data available at: https://github.com/sjelassi/transformers_ssm_copy

Architecture.

In all our experiments, we set the model hyperparameters so that the Mamba and transformers have a similar number of parameters ($\approx 160$ million parameters). Since we find that large LSTMs are hard to train (as confirmed in Pascanu et al. (2013)), we use the largest LSTM we managed to train which has $\approx 40$ million parameters.

Dataset.

During training, we generate in an online manner a batch of 64 examples at each epoch. At test time, we evaluate our models on $10$ batches of $128$ examples. We report the mean and standard-deviation over these 10 batches. If not specified otherwise, our token space $\mathcal{V}$ is of size 30 and made of the alphabet letters i.e. $\mathcal{V} = \{a, \dots, z, \langle\textsc{BOS}\rangle, \langle\textsc{EOS}\rangle, \langle\textsc{COPY}\rangle\}$ where $\langle\textsc{BOS}\rangle$ is the beginning of sentence token, $\langle\textsc{EOS}\rangle$ the end of sentence token and $\langle\textsc{COPY}\rangle$ the separator token. All the strings are sampled uniformly i.e. we first sample the length of the sequence and then independently sample each position of the string from $\mathcal{V}$. Finally, we “pack the context” with i.i.d. sequences during training similarly to Zhou et al. (2023): we fill the context with multiple independent samples of the task.

Positional information.

Positional information also plays an important role in the length generalization capacity of Transformers (Jelassi et al., 2023; Kazemnejad et al., 2023; Shen et al., 2023). Previously popular methods of input-layer positional embeddings (e.g. sinusoidal (Vaswani et al., 2017) or learned (Radford et al., 2019)) have been replaced by relative positional encodings at each attention layer (e.g. RoPE (Su et al., 2023), ALiBi (Press et al., 2021), or NoPE (Kazemnejad et al., 2023)). Below, we experiment with these positional encodings along with the Hard-ALiBi encoding introduced in Section 2.

Figure 4: String-level copying accuracy on data with duplicated n-grams, for n-gram lengths from 2 to 8. Copying fails when the duplicated n-gram is too long as the model can no longer perform n-gram lookups.

3.2 Data efficiency on the copy task

We begin by training our models on the simple task of copying a sequence of input tokens described in Figure 2. The model gets an input of $\leq L$ tokens followed by a separator ($\langle\textsc{COPY}\rangle$) token, and needs to output the same sequence again from the beginning. In this section, we focus on in-distribution learning: we train on strings of random length $\leq L = 300$ and record the string-level accuracy on evaluation strings sampled from the training distribution. Results for this experiment are shown in Figure 1(a). Clearly, there is a large gap between the transformers and GSSMs: the transformers need 100x fewer samples than the best GSSMs to learn the copy task.

Note that the sharp changes in accuracy displayed in Figure 1(a) are due to the log-scaled x-axis and choice of string-level accuracy as a metric. In Figure 9(a), we report the character-level accuracy, which yields smoother curves demonstrating the learning process of GSSMs. Regarding LSTMs, we find that they do not manage to learn on length-300 strings even at the character level. In Figure 9(b), we show that LSTMs are able to learn to copy on shorter strings and that string length is the bottleneck.
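To make the difference between the two metrics concrete, the following sketch computes both on a toy batch. The function names are hypothetical and the definitions are the natural ones (exact match per string versus fraction of matching positions); the paper's evaluation code may differ in details. As a rough idealization, if each character were correct independently with probability $p$, a length-$L$ string would be fully correct with probability $p^L$, so $0.99^{300} \approx 0.05$: string-level accuracy stays near zero until character-level accuracy is very close to one.

```python
def string_level_accuracy(preds, targets):
    """Fraction of strings reproduced exactly."""
    exact = [p == t for p, t in zip(preds, targets)]
    return sum(exact) / len(targets)


def char_level_accuracy(preds, targets):
    """Fraction of target positions predicted correctly."""
    correct = 0
    total = 0
    for p, t in zip(preds, targets):
        total += len(t)
        # Positions missing from a too-short prediction count as errors.
        correct += sum(a == b for a, b in zip(p, t))
    return correct / total


targets = ["abcd", "abcd"]
preds = ["abcd", "abxd"]  # one wrong character in the second string
print(string_level_accuracy(preds, targets))  # 0.5
print(char_level_accuracy(preds, targets))    # 0.875
```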

3.3 Length generalization on the copy task

The prior experiment demonstrates superior efficiency of learning in-distribution. Now, we test the ability of the learned functions to generalize out-of-distribution. Specifically, we consider generalization from short sequences to longer sequences. Testing this sort of generalization can help us to better understand which function the model has learned, i.e. whether the model has truly learned the “correct” copy operation or whether it just learned to copy sequences of the particular size it was trained on.

Here, we train all models on sequences of $\leq 50$ tokens, and test them on sequences of up to 1000 tokens, reporting string-level accuracy. As seen in Figure 1(b), all models are able to (eventually) solve the task in-distribution on lengths of $\leq 50$, but transformer-based models display much better generalization to longer inputs compared to GSSMs. Namely, we observe that the performance of the GSSMs (LSTM and Mamba) drops to zero almost immediately when increasing the input length, while the performance of transformers decays much more gradually with length.

Positional information.

When looking at the relative performance of different transformer models in Figure 1(b), it becomes clear that the positional encoding is important to length generalization. Specifically, the ALiBi and NoPE transformers dramatically outperform the RoPE model on longer inputs. This is likely because the sinusoidal embeddings of RoPE create a more dramatic change than the decay of ALiBi or NoPE when we go to longer inputs.

Improved generalization with Hard-ALiBi.

To test our understanding of how transformers learn to copy, we now consider swapping in the Hard-ALiBi positional encoding that we used in our theoretical construction of hash-based copying (introduced in Subsection 2.2 and illustrated in Figure 8 in the Appendix). Figure 1(b) shows that a transformer trained with Hard-ALiBi embedding on sequences of length $\leq 50$ achieves almost perfect length generalization up to sequences of length 1000. Note that this is well beyond the context length ever encountered in training.

3.4 Transformers learn to use n-gram hashing

Next, we attempt to determine whether the transformer trained on the copy task indeed applies the mechanism of storage and retrieval of n-grams. To do this, we evaluate the performance of a transformer with Hard-ALiBi positional encoding trained on the copy task when tested on a distribution of examples that intentionally contains duplicate n-grams. That is, we draw uniform sequences of tokens, and then randomly replace some n-gram with another n-gram that already appears in the sequence, such that each example always contains two copies of the same n-gram (typically followed by a different token). We use the Hard-ALiBi model here since it performs the best for the copy task, as shown in Figure 1(a). Figure 4 shows the performance of the transformer for different choices of $n$. We observe that the transformer maintains roughly the same accuracy for $n \leq 4$, but that its accuracy starts dropping when the inputs contain duplicate sequences of 5 or more tokens. This suggests that the transformer relies on something like 5-gram retrieval to do the copy task. Figure 11 further strengthens this point: we report the performance of perfect n-gram models on the copy task and observe that the performance of Transformers enhanced with Hard-ALiBi matches that of a 5-gram model.
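The following sketch illustrates why a duplicated n-gram breaks an n-gram lookup copier. It is an idealized stand-in for the mechanism discussed above, not the trained model: the function name `ngram_copy`, the choice to seed the output with the first $n$ tokens, and the rule of taking the first match on ties are all assumptions made for illustration.

```python
def ngram_copy(source, n):
    """Copy `source` by looking up the last n emitted tokens in `source`."""
    # Assumption: the first n tokens are given, so a key always exists.
    out = list(source[:n])
    while len(out) < len(source):
        key = out[-n:]
        # Assumption: on duplicated n-grams, the first occurrence wins.
        match = next(
            i for i in range(len(source) - n + 1) if source[i:i + n] == key
        )
        # Emit the token that follows the matched n-gram.
        out.append(source[match + n])
    return out


# The bigram "ab" appears twice, followed by different tokens (X and Y).
src = list("abXcdabYef")
print(ngram_copy(src, 3) == src)  # True: every 3-gram used as a key is unique
print(ngram_copy(src, 2) == src)  # False: the repeated bigram is ambiguous
```

With $n = 2$ the copier emits X after the second "ab" and then repeats the earlier segment, so the output diverges from the source; with $n = 3$ the key "dab" disambiguates the two occurrences.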

3.5 GSSMs cannot arbitrarily retrieve from context

Figure 5: Top: An illustration of the suffix key variant of the n-gram lookup task. Bottom: When trained on strings of length $\leq 30$, transformers outperform GSSMs on longer inputs, illustrating superior performance on this memory-intensive task. Models shown: transformers (NoPE, ALiBi, Hard-ALiBi) and GSSMs (LSTM, Mamba).

Figure 6: Top: An illustration of the prefix key variant of the n-gram lookup task. Bottom: When trained on strings of length $\leq 30$, GSSMs perform as well as the Hard-ALiBi transformer and better than the other transformers. This slight variant of the task requires much less memory and is thus more suitable to the strengths of GSSMs at storing a small state over time. Models shown: transformers (NoPE, ALiBi, Hard-ALiBi) and GSSMs (LSTM, Mamba).

We now introduce another task to probe the mechanisms that the models use to copy from the context: the n-gram lookup task. In this task the model needs to use a given n-gram as a key to look up the k tokens that follow that key in the input. We consider two variants of the task: suffix keys and prefix keys. In both variants, we assess length generalization to understand the function that the models have learned.

First, we consider the suffix key version of n-gram lookup. In this task, the model is given a sequence of $L$ input tokens, a separator, and then an n-gram from the input sequence. The model then needs to output the sequence of $k$ tokens following the chosen n-gram (see Figure 5 for an illustration). This task is closely related to induction heads (Olsson et al., 2022). It requires the model to be able to “store” the entire context in order to find the key and retrieve the tokens that follow it. We train all models on sequences of at most 30 tokens and show results in Figure 5. Transformers perform well on this task, with a relatively small drop in performance when increasing the sequence length up to 100. This suggests that transformers can learn to perform n-gram storage and retrieval. GSSMs, however, perform poorly beyond their training distribution. Intuitively, this is because the task requires the models to store the entire input sequence, something that GSSMs struggle to do.

Next, we try the prefix key version of n-gram lookup. Here we provide the n-gram key at the beginning and then the full input sequence (illustrated in Figure 6). In this version of the task the model does not need to store the entire input since it can look for the key on the fly as the sequence is processed. This is good for the GSSMs, since they can write the key into the state and then ignore inputs that do not match. Indeed, GSSMs achieve perfect length-generalization on this variant. Interestingly, the GSSMs even outperform the NoPE and ALiBi transformers (although not the Hard-Alibi model). We hypothesize that this may be an issue where these positional embeddings make it more difficult to effectively perform the hashing lookup over a long distance in relative positions. Taken together, these results illustrate how GSSMs seem to be memory limited, but can be effective when the tasks only require a summary of the inputs rather than storing the entire context.
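The two variants differ only in where the key is placed relative to the input sequence, which the sketch below makes explicit. The separator spelling, the lowercase alphabet, and the function name `make_lookup_example` are hypothetical; the sketch also does not exclude keys that happen to occur more than once in a random sequence, which a careful benchmark would need to handle.

```python
import random
import string

SEP = "<SEP>"  # hypothetical separator token


def make_lookup_example(length, n, k, rng, variant="suffix"):
    """Return (prompt, answer) for one n-gram lookup example."""
    seq = [rng.choice(string.ascii_lowercase) for _ in range(length)]
    # Pick a key position that leaves room for the k answer tokens.
    start = rng.randrange(0, length - n - k + 1)
    key = seq[start:start + n]
    answer = seq[start + n:start + n + k]
    if variant == "suffix":
        # Key comes last: the whole sequence must be kept until the key is seen.
        prompt = seq + [SEP] + key
    else:
        # Key comes first: it can be stored and matched on the fly.
        prompt = key + [SEP] + seq
    return prompt, answer


rng = random.Random(1)
suffix_prompt, suffix_answer = make_lookup_example(30, 3, 2, rng, "suffix")
prefix_prompt, prefix_answer = make_lookup_example(30, 3, 2, rng, "prefix")
```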

4 Pre-trained Models

In this section, we compare the performance of pre-trained transformers and pre-trained GSSMs on memory-intensive tasks such as copying long strings, retrieval and few-shot question answering. We show that transformers outperform GSSMs of similar scale on such memory-intensive tasks, even when the GSSM has lower perplexity as a language model. These results confirm that the limitations of GSSMs raised in previous sections apply to large-scale models trained on real pretraining data.

4.1 Setup

In the experiments below, we compare Pythia transformer models (Biderman et al., 2023) of sizes ranging from 410M to 2.8B against Mamba models (Gu & Dao, 2023) of similar sizes. All these models have been pre-trained on the Pile (Gao et al., 2020) and use the same tokenizer. The Mamba models generally have slightly lower perplexity on the training set for a given size. The main difference between the Pythia and the Mamba models is their architectural design.

We compare these models by measuring their performance while varying the input instance length and consider two types of tasks: copy-based and information retrieval tasks. The copy-based tasks consist of presenting a random text to the model and asking it to copy the text. In the information retrieval tasks, we provide a text to the model and ask it a related question. These retrieval tasks can be seen as “selective copy”, since the model needs to copy a small chunk of the input text in order to respond to the question. To measure performance, we use the string-level accuracy in all the experiments except in Figure 7(c), where we consider question answering and thus report the F1 score. We evaluate the models over 10 batches of size 64 for all the tasks except for question answering, where we evaluate over 50 questions because the number of questions with a given context length is limited. Further details are in Appendix A.

Figure 7: Panels (a) and (b) show Pythia (410M, 1.4B, 2.8B) and Mamba (360M, 1.4B, 2.8B); panel (c) shows Pythia 2.8B and Mamba 2.8B. (a) Copy: natural language strings. We compare pretrained models on their ability to copy natural language strings sampled from C4 of varying lengths and report string-level accuracy. The transformer models substantially outperform the GSSMs. (b) Copy: shuffled strings. To test whether it mattered that the strings were in natural language, we randomly shuffle the word order of the strings from the previous experiment. We find that this degrades performance, especially for the Mamba models. (c) Question answering (SQuAD). We compare Pythia and Mamba on a standard question answering dataset where we bin the dataset based on the length of the context paragraph. We find that Mamba performance decays more quickly with the length of the context.

4.2 Copying the input context

We first observe that pre-trained transformers outperform pre-trained GSSMs at copying long natural language strings. In Figure 7(a), we randomly sample strings from the C4 dataset (Raffel et al., 2020) with varying number of tokens. Our prompt consists of two copies of the sampled string plus the first word of the string and we expect the model to complete the third copy. Even the smallest transformer model dramatically outperforms the largest GSSM. This happens even though the large GSSMs have enough bits in the state variable to potentially store the context. This supports the idea that an architectural bias of transformers makes it easier for them to copy from the context.

Unlike strings of tokens sampled uniformly at random, natural text can often be compressed, possibly allowing language models to copy longer strings even with limited memory. To test whether this matters, in Figure 7(b) we conduct the same experiment as above but randomly shuffle the order of the words in the strings. We find that when we shuffle the words, both GSSMs and transformers perform worse on the task, but the effect is more stark for GSSMs. Even the largest GSSM now gets zero accuracy on strings of length 300. This suggests that when the input is more difficult to compress, the GSSM suffers due to its fixed-size state.
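The prompt format and the shuffled control used in the two copying experiments can be sketched as follows. The helper names are hypothetical, and joining the copies with a single space is an assumption: the text specifies two copies plus the first word, but not the delimiter.

```python
import random


def build_copy_prompt(text):
    """Return (prompt, expected completion) for the copy experiment."""
    words = text.split()
    # Two full copies followed by the first word of the third copy.
    prompt = f"{text} {text} {words[0]}"
    # The model is expected to produce the rest of the third copy.
    target = " ".join(words[1:])
    return prompt, target


def shuffle_words(text, rng):
    """Randomly permute the word order, keeping the same multiset of words."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


rng = random.Random(0)
text = "the quick brown fox"
prompt, target = build_copy_prompt(text)
shuffled_prompt, shuffled_target = build_copy_prompt(shuffle_words(text, rng))
```

Shuffling preserves the vocabulary and length of each string while removing most of the word-order regularities that make natural text compressible.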

4.3 Retrieval from the input context

While copying provides a clear task to separate the model classes, it is not a particularly realistic task. That said, it presents an extreme case of a type of behavior that is highly relevant for many tasks of interest. In particular, many tasks require retrieving specific information from the context that is relevant to the desired output. This subsection presents examples of how our results transfer to more practical tasks.

Phone-book lookup.

We first consider a “phone-book” experiment where we provide a synthetic phone-book to the model and ask it to return the phone number when given a name. We generate the phone-book by randomly sampling $L$ names and their associated phone numbers. One line of this phone-book looks like “John Powell: 609-323-7777”. Our prompt to the model consists of the phone-book, two few-shot examples and a question asking for the phone number of a randomly sampled name from the phone-book. Figure 1(c) reports the accuracy obtained by the pretrained transformers and GSSMs while varying the size of the phone-book $L$. We observe that even the smallest transformer (410M parameters) outperforms the largest GSSMs (2.8B parameters) when the phone-book is large enough ($L \geq 70$). This shows that in retrieval tasks which require access to the whole context, GSSMs struggle to store the relevant information in their fixed-size state.
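A prompt of this shape could be assembled as in the sketch below. Only the “Name: number” line format and the use of two few-shot examples come from the text; the name pools, the question wording, and the function names are hypothetical placeholders.

```python
import random

# Hypothetical name pools, used only for illustration.
FIRST = ["John", "Mary", "Alice", "David", "Nora", "Omar"]
LAST = ["Powell", "Smith", "Chen", "Garcia", "Okafor", "Levi"]


def make_phone_book(num_entries, rng):
    """Sample distinct names and give each a random 3-3-4 digit number."""
    all_names = [f"{f} {l}" for f in FIRST for l in LAST]
    # rng.sample raises ValueError if more names are requested than exist.
    names = rng.sample(all_names, num_entries)
    return {
        name: f"{rng.randint(100, 999)}-{rng.randint(100, 999)}-{rng.randint(1000, 9999)}"
        for name in names
    }


def build_prompt(book, rng, num_shots=2):
    """Phone-book, then few-shot question/answer pairs, then the query."""
    lines = [f"{name}: {number}" for name, number in book.items()]
    *shots, query = rng.sample(list(book), num_shots + 1)
    parts = ["\n".join(lines)]
    for name in shots:
        # Hypothetical question wording.
        parts.append(f"What is the phone number of {name}?\n{book[name]}")
    parts.append(f"What is the phone number of {query}?")
    return "\n\n".join(parts), book[query]


rng = random.Random(0)
book = make_phone_book(10, rng)
prompt, expected = build_prompt(book, rng)
```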

Question-Answering.

In this experiment, we compare the 2.8B parameter Mamba and transformer models (in our experiments, smaller models were unable to achieve reasonable and consistent performance on this dataset) on the SQuAD question-answering dataset (Rajpurkar et al., 2018). This dataset provides text paragraphs together with a few questions regarding the text. We probe the models to answer the question by providing a single demonstration of a question/answer pair (corresponding to the same text) before giving the target question. We bin the paragraphs according to their lengths, and report the F1 score as a function of the paragraph length for both models in Figure 7(c). We observe that while for short paragraphs, both the Pythia transformer and Mamba achieve comparable performance, the performance of Mamba degrades more quickly with the paragraph length, while the transformer-based model maintains a similar accuracy even for longer texts. This result shows that the fixed memory of GSSMs also limits their performance on standard natural tasks.
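For reference, a token-level F1 score in the style commonly used for SQuAD can be sketched as below. This is a simplified version under stated assumptions: it lowercases and splits on whitespace only, whereas the official SQuAD evaluation also strips punctuation and articles, and the exact scoring code used for Figure 7(c) is not specified in the text.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (simplified)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Precision 2/3 and recall 1 give an F1 of 0.8 (up to floating-point rounding).
print(token_f1("the cat sat", "the cat"))
```

Unlike string-level accuracy, this score gives partial credit when the model retrieves most of the answer span.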

5 Related Work

There exists a broad body of prior work on the representational capacity of GSSMs like RNNs (Merrill, 2019; Merrill et al., 2020) as well as transformers (Weiss et al., 2021; Merrill et al., 2022; Wei et al., 2022; Sanford et al., 2023; Edelman et al., 2022). Previous works that study transformers do so through comparison to other complexity classes, such as threshold circuits (Merrill et al., 2022), RASP language (Weiss et al., 2021) or first-order logic (Chiang et al., 2023) (see Strobl et al. (2023) for a thorough review). These works do not provide insights into how transformers implement algorithms for solving specific problems. In contrast, our theoretical result constructs a transformer for the copy task, which illustrates the mechanism and provides tight bounds on the model size. Together with the result showing that GSSMs cannot copy long sequences, our theory characterizes the power of different sequence models on the copy task. Other theoretical separation results between transformers and RNNs (Sanford et al., 2023; Merrill, 2019) use more complex tasks of less practical relevance.

Other papers have previously demonstrated the capacity of transformers to leverage the entire input context for tasks like retrieval, question answering, and in-context learning (Devlin et al., 2018; Raffel et al., 2020; Petroni et al., 2020; Brown et al., 2020; Liu et al., 2023b; Kamradt, 2023). Another line of work has studied the “induction head” mechanism in transformers that performs a retrieval operation much like the one we observe for copying (Olsson et al., 2022). But, to our knowledge, there is not a comparison in related work between transformers and GSSMs of similar quality on these tasks.

Several of our experiments study length generalization as a way to assess whether the model found the “right way” to solve the task. Prior work on length generalization in transformers has focused on the data distribution (Anil et al., 2022), positional embeddings (Kazemnejad et al., 2023), and arithmetic tasks (Delétang et al., 2022; Ruoss et al., 2023; Jelassi et al., 2023; Zhou et al., 2023). We extend many of these ideas to the copying task.

Finally, we note that while we focus on tasks where transformers outperform GSSMs, there are also tasks where GSSMs outperform transformers. For example, Liu et al. (2023a) shows that transformers fail to generalize out of distribution for “flip-flop language modeling”, while LSTMs do so easily. These tasks require tracking a small $O(1)$ state variable over time. Another benefit of GSSMs is the ability to input long contexts like DNA sequences that may be impractical for transformers (Nguyen et al., 2023).

Concurrently to our work, Akyürek et al. (2024); Grazzi et al. (2024); Park et al. (2024) studied the difference between Transformers and Mamba at in-context learning, which can be seen as a form of copying. In particular, Akyürek et al. (2024) finds that Transformers have an advantage over other architectures at this task because they have “n-gram heads”. Similarly to these works, we highlight the limitations of SSMs in memory-intensive tasks such as copying, which stem from their limited state size. We also show that Transformers can perform copying using the Hard-ALiBi positional encoding, which improves the model’s ability to learn $n$-gram matching.

6 Discussion

We have demonstrated through theory and experiments that transformers are better than GSSMs at copying from their input context. However, we emphasize that state space models have many advantages over transformers. The memory and computational complexity of GSSMs does not increase with the input length, which is ideal for training and inference on long inputs. Additionally, state space models such as RNNs are better at tracking state variables across long sequences (Liu et al., 2023a), which may be useful for generating long consistent text. Importantly, language processing in the human brain appears to be much more similar to how state space models process language (Tikochinski et al., 2024).

We therefore believe that future work should focus on building hybrid architectures that endow state space models with an attention-like mechanism, allowing them to retrieve relevant pieces of text from their input. Indeed, humans have an incredibly limited capacity for memorizing sequences (Miller, 1956), but can translate entire novels if we allow them to look back at the text (Shelton, 1612).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Acknowledgements

We thank Boaz Barak for helpful discussions. Kempner Institute computing resources enabled this work. Samy Jelassi acknowledges funding supported by the Center of Mathematical Sciences and Applications. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. Sham Kakade acknowledges funding from the Office of Naval Research under award N00014-22-1-2377.

References

  • Akyürek et al. (2024) Akyürek, E., Wang, B., Kim, Y., and Andreas, J. In-context language learning: Architectures and algorithms. arXiv preprint arXiv:2401.12973, 2024.
  • Anil et al. (2022) Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G., Dyer, E., and Neyshabur, B. Exploring length generalization in large language models. Advances in Neural Information Processing Systems, 35:38546–38556, 2022.
  • Biderman et al. (2023) Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.  2397–2430. PMLR, 2023.
  • Bradbury et al. (2016) Bradbury, J., Merity, S., Xiong, C., and Socher, R. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Carlini et al. (2022) Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
  • Chiang et al. (2023) Chiang, D., Cholak, P., and Pillay, A. Tighter bounds on the expressivity of transformer encoders. arXiv preprint arXiv:2301.10743, 2023.
  • Choromanski et al. (2020) Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
  • Dao et al. (2022) Dao, T., Fu, D., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  • Delétang et al. (2022) Delétang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Hutter, M., Legg, S., and Ortega, P. A. Neural networks and the chomsky hierarchy. arXiv preprint arXiv:2207.02098, 2022.
  • Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Edelman et al. (2022) Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pp.  5793–5831. PMLR, 2022.
  • Gao et al. (2020) Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  • Grazzi et al. (2024) Grazzi, R., Siems, J., Schrodi, S., Brox, T., and Hutter, F. Is mamba capable of in-context learning? arXiv preprint arXiv:2402.03170, 2024.
  • Gu & Dao (2023) Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • Gu et al. (2021) Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Jelassi et al. (2023) Jelassi, S., d’Ascoli, S., Domingo-Enrich, C., Wu, Y., Li, Y., and Charton, F. Length generalization in arithmetic transformers. arXiv preprint arXiv:2306.15400, 2023.
  • Kamradt (2023) Kamradt, G. Llmtest_needleinahaystack. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.
  • Katharopoulos et al. (2020) Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp.  5156–5165. PMLR, 2020.
  • Kazemnejad et al. (2023) Kazemnejad, A., Padhi, I., Ramamurthy, K. N., Das, P., and Reddy, S. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, 2023.
  • Liu et al. (2023a) Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Exposing attention glitches with flip-flop language modeling. arXiv preprint arXiv:2306.00946, 2023a.
  • Liu et al. (2023b) Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023b.
  • Loshchilov & Hutter (2017) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • McCoy et al. (2023) McCoy, R. T., Smolensky, P., Linzen, T., Gao, J., and Celikyilmaz, A. How much do language models copy from their training data? evaluating linguistic novelty in text generation using raven. Transactions of the Association for Computational Linguistics, 11:652–670, 2023.
  • Merrill (2019) Merrill, W. Sequential neural networks as automata. arXiv preprint arXiv:1906.01615, 2019.
  • Merrill et al. (2020) Merrill, W., Weiss, G., Goldberg, Y., Schwartz, R., Smith, N. A., and Yahav, E. A formal hierarchy of rnn architectures. arXiv preprint arXiv:2004.08500, 2020.
  • Merrill et al. (2022) Merrill, W., Sabharwal, A., and Smith, N. A. Saturated Transformers are Constant-Depth Threshold Circuits. Transactions of the Association for Computational Linguistics, 10:843–856, 08 2022. ISSN 2307-387X. doi: 10.1162/tacl_a_00493. URL https://doi.org/10.1162/tacl_a_00493.
  • Miller (1956) Miller, G. A. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological review, 63(2):81–97, 1956.
  • Nguyen et al. (2023) Nguyen, E., Poli, M., Faizi, M., Thomas, A., Birch-Sykes, C., Wornow, M., Patel, A., Rabideau, C., Massaroli, S., Bengio, Y., et al. Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution. arXiv preprint arXiv:2306.15794, 2023.
  • Olsson et al. (2022) Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
  • Park et al. (2024) Park, J., Park, J., Xiong, Z., Lee, N., Cho, J., Oymak, S., Lee, K., and Papailiopoulos, D. Can mamba learn how to learn? a comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248, 2024.
  • Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International conference on machine learning, pp.  1310–1318. Pmlr, 2013.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Peng et al. (2023) Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K., et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
  • Petroni et al. (2020) Petroni, F., Lewis, P., Piktus, A., Rocktäschel, T., Wu, Y., Miller, A. H., and Riedel, S. How context affects language models’ factual predictions. arXiv preprint arXiv:2005.04611, 2020.
  • Press et al. (2021) Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
  • Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019. URL https://api.semanticscholar.org/CorpusID:160025533.
  • Raffel et al. (2020) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Rajpurkar et al. (2018) Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.
  • Ruoss et al. (2023) Ruoss, A., Delétang, G., Genewein, T., Grau-Moya, J., Csordás, R., Bennani, M., Legg, S., and Veness, J. Randomized positional encodings boost length generalization of transformers. arXiv preprint arXiv:2305.16843, 2023.
  • Sanford et al. (2023) Sanford, C., Hsu, D., and Telgarsky, M. Representational strengths and limitations of transformers. arXiv preprint arXiv:2306.02896, 2023.
  • Shelton (1612) Shelton, T. The Ingenious Gentleman Don Quixote of La Mancha. 1612. Written by Miguel de Cervantes, translated by Thomas Shelton.
  • Shen et al. (2023) Shen, R., Bubeck, S., Eldan, R., Lee, Y. T., Li, Y., and Zhang, Y. Positional description matters for transformers arithmetic. arXiv preprint arXiv:2311.14737, 2023.
  • Strobl et al. (2023) Strobl, L., Merrill, W., Weiss, G., Chiang, D., and Angluin, D. Transformers as recognizers of formal languages: A survey on expressivity. arXiv preprint arXiv:2311.00208, 2023.
  • Su et al. (2023) Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, pp.  127063, 2023.
  • Sun et al. (2023) Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
  • Tikochinski et al. (2024) Tikochinski, R., Goldstein, A., Meiri, Y., Hasson, U., and Reichart, R. An incremental large language model for long text processing in the brain. 2024.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wei et al. (2022) Wei, C., Chen, Y., and Ma, T. Statistically meaningful approximation: a case study on approximating turing machines with transformers. Advances in Neural Information Processing Systems, 35:12071–12083, 2022.
  • Weiss et al. (2021) Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. In International Conference on Machine Learning, pp.  11080–11090. PMLR, 2021.
  • Wolf et al. (2019) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • Zhou et al. (2023) Zhou, H., Bradley, A., Littwin, E., Razin, N., Saremi, O., Susskind, J., Bengio, S., and Nakkiran, P. What algorithms can transformers learn? a study in length generalization. arXiv preprint arXiv:2310.16028, 2023.

Appendix A Experimental setup

In this section, we provide additional details about our experimental setup. We first give a description of the positional encodings used in our transformers experiments (Subsection A.1) and then give details about the training and evaluation procedures (Subsection A.2).

A.1 Positional encodings in the transformers

Figure 8: Positional encoding schemes for transformers: illustration of the different positional encodings of the transformers that are trained in our experiments. (a) corresponds to the NoPE encoding (Kazemnejad et al., 2023), where no positional encoding is applied to any of the attention heads; (b) depicts the ALiBi encoding (Press et al., 2021), where $m$ is a head-specific scalar; and (c) shows the Hard-ALiBi encoding introduced in Section 2. For the sake of illustration, we consider the case where we mask three heads, which means that we force Heads 1, 2 and 3 to attend, respectively, to their current token; to their current and preceding tokens; and to their current, preceding and second preceding tokens. The remaining heads are set as NoPE heads.

We consider multiple positional encoding schemes in our experiments in Section 3:

  • the NoPE scheme (Kazemnejad et al., 2023), where no positional information is added to any of the attention scores (Figure 8(a)). This architecture choice helps to get better length generalization in multiple tasks including the copy task.

  • the ALiBi scheme (Press et al., 2021), which biases the attention scores with a penalty that is proportional to their distance (Figure 8(b)). $m$ is a head-specific slope fixed before training.

  • the Hard-ALiBi scheme introduced in Section 2, which has $M$ masked attention heads where we explicitly force the model to attend to their directly previous tokens and $H-M$ heads set to be NoPE attention heads. In Figure 8(c), we display the case where we have $M=4$ masked heads: in the first head, the tokens just attend to themselves; in the second head, the tokens attend to themselves and to the previous ones; in the third head, the tokens attend to themselves, the previous ones and the second preceding tokens. The remaining $H-M$ heads are set to NoPE.
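As a rough illustration of the three schemes, the sketch below builds the additive attention bias of a single head in plain Python. The function names, the 0-indexed positions and the use of `-inf` for causal masking are assumptions made for illustration and are not taken from the authors' code; the slope formula follows the $m_h = 2^{-h/2}$ setting reported in Subsection A.2.

```python
import math

NEG_INF = -math.inf

def nope_bias(length):
    """NoPE head: no positional bias, only the causal mask (key j <= query i)."""
    return [[0.0 if j <= i else NEG_INF for j in range(length)]
            for i in range(length)]

def alibi_bias(length, m):
    """ALiBi head: penalty proportional to the query-key distance, with slope m."""
    return [[-m * (i - j) if j <= i else NEG_INF for j in range(length)]
            for i in range(length)]

def hard_alibi_bias(length, window):
    """Hard-ALiBi masked head: attend only to the last `window` tokens,
    the current token included; everything else is masked out."""
    return [[0.0 if i - window < j <= i else NEG_INF for j in range(length)]
            for i in range(length)]

def alibi_slopes(num_heads):
    """Head-specific slopes m_h = 2^(-h/2) for h = 1, ..., num_heads."""
    return [2 ** (-h / 2) for h in range(1, num_heads + 1)]
```

In this sketch a masked head with `window=1` sees only the current token, `window=2` the current and preceding tokens, and so on, while the remaining heads would use `nope_bias`.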

A.2 Pretraining and evaluation details

Software dependencies.

We implement all of our training in PyTorch (Paszke et al., 2019). We use the HuggingFace library (Wolf et al., 2019) and the Mamba GitHub repository (Gu & Dao, 2023).

Architectures.

In our experiments in Section 3, the backbone of our transformers is the GPT-NeoX architecture. We set the number of layers to 12, the hidden size to 1024 and the number of heads to $H=16$. We consider the different positional encodings that are described in Subsection A.1. For ALiBi, we set the head-specific scalar as in the original paper, i.e. $m_h = 2^{-h/2}$ for $h \in \{1,\dots,H\}$. For the Hard-ALiBi model, we sweep over the number of masked heads $M \in \{2,\dots,10\}$ and find that the best model corresponds to $M=6$. Regarding the Mamba models, we set the number of layers to 24 and the hidden size to 1024. We also sweep over the state space dimension $S \in \{16,32,64,128,256\}$ and find that the best model is $S=32$. This choice of hyperparameters ensures that both transformers and Mamba models have a comparable number of parameters. Lastly, our LSTM is made of 4 layers and width 1024.

Training hyperparameters.

In Section 3, at each epoch, we sample online a batch of size 64. To fill the context with examples, we choose a context length ($C=420$ for all the experiments except Figure 1(a), where we set $C=620$) and pack as many examples as possible to fit this context. So in our case, one sample contains many instances. We run the experiments for 15 epochs for both transformers and Mamba, while for LSTMs we need 300 epochs. All methods are trained with the AdamW optimizer (Loshchilov & Hutter, 2017) with learning rate 5e-5, a linear rate decay schedule, 300 steps of warmup and default weight decay of 1e-1. For LSTMs and Mamba, we did a sweep over learning rates $\{5\mathrm{e}{-5}, 1\mathrm{e}{-4}, 5\mathrm{e}{-4}\}$. Finally, to train all the models, we use the next-token prediction loss but we apply a mask on the input instance so that we only penalize the model whenever it makes a mistake on the labels (and not on the inputs and labels jointly).
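The loss masking described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the helper name `masked_labels`, the token ids and the sentinel value are hypothetical. The value -100 is chosen because it is the default `ignore_index` of PyTorch's cross-entropy loss; whether the authors relied on that default is an assumption.

```python
IGNORE_INDEX = -100  # assumption: PyTorch's default ignore_index for cross-entropy

def masked_labels(token_ids, num_input_tokens):
    """Next-token labels in which targets inside the input part are ignored.

    labels[p] is the target for the prediction made at position p, i.e. the
    token at position p + 1. Targets that still belong to the first
    `num_input_tokens` tokens (the input instance) are replaced by
    IGNORE_INDEX so that they do not contribute to the loss.
    """
    labels = []
    for p in range(len(token_ids) - 1):
        target_pos = p + 1
        if target_pos >= num_input_tokens:
            labels.append(token_ids[target_pos])
        else:
            labels.append(IGNORE_INDEX)
    return labels
```

For a hypothetical instance `<BOS> a b c <COPY> a b c`, only the three copied tokens after `<COPY>` would be penalized.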

Compute resources.

Pretraining was all done on an internal cluster using RTX8000 GPUs. We estimate that the final training run needed to produce the results in the paper took approximately 600 GPU hours.

Evaluation algorithm.

We evaluate the models over 10 batches of size 64 for all the tasks except for the question answering one where we evaluate over 50 questions because the number of questions with a given context length is limited.

Decoding algorithm.

At inference, all our models use greedy decoding for generation and we set the temperature to 0.

Appendix B Additional Experiments

In Subsection B.1, we focus on the in-distribution learning of the copy task and show that the number of samples needed by GSSMs is much higher than the one for transformers. In Subsection B.2, we study the performance of pre-trained models on the copy task in the case where the strings are sampled uniformly. This experiment shows that when the text to copy is totally random, the gap between pre-trained transformers and GSSMs is even larger.

B.1 Data efficiency on the copy task

Transformer:  RoPE   NoPE   Alibi   HAlibi
GSSM:  LSTM    Mamba

Figure 9: (a) Copying long strings: character-level accuracy. Here we train models to copy strings of length $\leq 300$ and evaluate character-level accuracy on strings of length 300. Transformers train much faster than GSSMs. Mamba has a more progressive learning curve than in Figure 1(a). An LSTM cannot even learn the task within this number of samples at the character level. (b) Copying short strings: string-level accuracy. Here we train models to copy strings of length $\leq 30$ and evaluate string-level accuracy on strings of length 30. Transformers train much faster than GSSMs. Compared to Figure 1(a), we see that Mamba needs far fewer samples in order to learn to copy length-30 strings. An LSTM can learn to copy but requires 100x more training examples. (c) Copying short strings: character-level accuracy. Here we train models to copy strings of length $\leq 30$ and evaluate character-level accuracy on strings of length 30.

In this section, we provide additional plots to complement the data efficiency experiment from Figure 1(a). We want to highlight the following points:

  • In Figure 1(a), we see a sharp transition for the Mamba learning curve. However, Figure 9(a) shows that the learning process is smoother at the character level. Besides, LSTMs are not able to learn the copy task on length-300 strings even at the character level.

  • We consider the experiment of learning to copy much shorter strings, namely strings with length $\leq 30$. Figure 9(b) shows that the gap in terms of training examples between transformers and Mamba is much smaller, i.e. Mamba only needs 10x more data. Besides, we see that the LSTM is able to learn the copy task but it needs 100x more data than transformers.

B.2 Pre-trained models on the uniform copy task

In this section, we provide an additional experiment that shows the superiority of pre-trained Pythia over pre-trained Mamba models in the copy task.


Pythia:  410M    1.4B    2.8B
Mamba:   360M    1.4B   2.8B

Figure 10: Copy: uniform strings. To test whether it mattered that the strings were in natural language, we generate uniformly sampled strings (the generation process is described in Section 3). We find that this degrades the Mamba models while Pythia models are able to keep a high performance.

We consider the same setup as in Section 3: we sample uniform strings of alphabet characters with a fixed length and ask the model to copy them, using the same prompt format as the one described in Subsection 4.2.

This setting is a more extreme version of Figure 7(b) since the strings are more random: in Figure 7(b), the order of the nouns was random but the nouns were English nouns, while here the strings are totally random. In Figure 10, we see a clear separation between the transformers and Mamba models, with the smallest Pythia outperforming the largest Mamba. However, compared to Figure 7(b), the Pythia performance is much higher since the 1.4B model is able to get almost 100% accuracy.

B.3 Performance of $n$-gram models at copying

n-gram length:  2    3    4    5
Transformers:  Hard-ALiBi

Figure 11: String-level copying accuracy obtained by perfect $n$-gram models and Transformers with Hard-ALiBi. The Transformers' performance matches that of a 5-gram model.

In Figure 11, we display the performance of perfect $n$-gram models in the copy task. To obtain these curves, we uniformly sample 128 strings over 3 seeds and report the probability there is an $n$-gram. This probability corresponds to the performance of a perfect $n$-gram model. We observe that Transformers enhanced with the Hard-ALiBi positional encoding have a performance close to a perfect 5-gram model.
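A possible way to estimate such a curve is sketched below. It rests on one reading of "perfect $n$-gram model" that is an assumption on our part: the model copies a string correctly exactly when no $n$-gram in it is followed by two different characters. The function names, the lowercase alphabet and the seed handling are likewise hypothetical.

```python
import random
import string

def ngram_copy_is_unambiguous(s, n):
    """True if every n-gram of s is always followed by the same character."""
    continuation = {}
    for i in range(len(s) - n):
        gram, nxt = s[i:i + n], s[i + n]
        # setdefault stores the first continuation seen for this n-gram
        if continuation.setdefault(gram, nxt) != nxt:
            return False
    return True

def estimate_success(length, n, num_strings=128, seed=0):
    """Monte Carlo estimate of the fraction of uniform strings of the given
    length that an idealized n-gram lookup could copy without ambiguity."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_strings):
        s = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        hits += ngram_copy_is_unambiguous(s, n)
    return hits / num_strings
```

Averaging `estimate_success` over a few seeds for several lengths and values of `n` would give curves of the kind shown in Figure 11, under the stated assumptions.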

Appendix C Proofs - Upper Bound

This section gives a detailed proof of Theorem 2.3 and Lemma 2.4.

C.1 Technical Lemmas

We begin by introducing some technical lemmas that we use in the proof of Theorem 2.3.

Algorithm 1 Hash-based copying
  Input: sequence $x_1,\dots,x_L$
  Let $s:\mathbb{D}^{*}\to\mathbb{R}^{d}$ be some hashing function.
  for $i=n+2,\dots,L$ do
     $k_i \leftarrow s(x_{i-n},x_{i-n+1},\dots,x_{i-1})$
     $v_i \leftarrow x_i$
  end for
  for $j=1,\dots,L$ do
     if $j\leq n$ then
        $y_j \leftarrow x_j$
     else
        $q_j \leftarrow s(y_{j-n},\dots,y_{j-1})$
        Let $i\in[L]$ s.t. $k_i=q_j$, and set $y_j \leftarrow x_i$
     end if
  end for
  Output: sequence $y_1,\dots,y_L$
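A minimal Python sketch of this procedure is given below, under a few assumptions that are ours rather than the paper's: positions are 0-indexed, a key is built for every position that has a full preceding $n$-gram (a simplification of the loop bounds above), a Python dictionary keyed by tuples stands in for the hashing function $s$, and when an $n$-gram occurs several times the first occurrence wins.

```python
def hash_based_copy(x, n):
    """Copy the sequence x by repeatedly looking up the last n emitted tokens."""
    length = len(x)
    # Lookup table: key = n-gram preceding position i, value = token at i.
    table = {}
    for i in range(n, length):
        key = tuple(x[i - n:i])
        table.setdefault(key, x[i])  # keep the first occurrence of each n-gram
    # The first n tokens are copied directly; the rest come from the table.
    y = list(x[:n])
    for j in range(n, length):
        query = tuple(y[j - n:j])
        y.append(table[query])
    return y
```

When all $n$-grams in the input are distinct the output equals the input; a repeated $n$-gram with a different continuation makes the lookup ambiguous and the copy can go wrong, as in `hash_based_copy("abcabd", 2)`.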
Lemma C.1.

Let $h_t(\bm{x}_1,\dots,\bm{x}_i)=\frac{1}{\min(t,i)}\sum_{j=\max(1,i-t+1)}^{i}\bm{x}_j$. Then, $h_t$ can be computed using a hard-ALiBi attention head.

Proof.

Let $W_k, W_q = 0$ (zero matrix) and let $W_v = I_d$ (identity matrix). We choose $b_i\in\{0,-\infty\}^{i}$ s.t.

$$b_{i,j}=\begin{cases}-\infty & j\leq i-t\\ 0 & j>i-t\end{cases}$$

Since the query and key matrices are zero, every attention score equals its bias, so the softmax puts uniform weight $\frac{1}{\min(t,i)}$ on the positions $j\in\{\max(1,i-t+1),\dots,i\}$ and zero weight elsewhere. With $W_v=I_d$, the head therefore outputs $h_t$.
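The construction can be checked numerically with a small sketch in plain Python. The function names and the list-of-lists representation are illustrative assumptions; the head below uses zero query and key matrices, an identity value matrix and the bias $b_{i,j}$ from the proof, and is compared against a direct computation of the windowed mean.

```python
import math

def hard_alibi_head(xs, t):
    """Attention head with W_q = W_k = 0, W_v = I and the hard-ALiBi bias."""
    dim = len(xs[0])
    outputs = []
    for i in range(1, len(xs) + 1):  # 1-indexed query position
        # Scores reduce to the bias: -inf if j <= i - t, else 0 (causal, j <= i).
        scores = [-math.inf if j <= i - t else 0.0 for j in range(1, i + 1)]
        weights = [math.exp(s) for s in scores]  # exp(-inf) == 0.0
        total = sum(weights)
        outputs.append([
            sum(w * xs[j][k] for j, w in enumerate(weights)) / total
            for k in range(dim)
        ])
    return outputs

def windowed_mean(xs, t):
    """Direct evaluation of h_t: mean of the last min(t, i) vectors."""
    outputs = []
    for i in range(1, len(xs) + 1):
        window = xs[max(1, i - t + 1) - 1:i]
        outputs.append([sum(col) / min(t, i) for col in zip(*window)])
    return outputs
```

Both functions should agree up to floating-point error for any window size $t \geq 1$.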

Lemma C.2.

Assume that $d=\lceil\log(D)\rceil+2$. Then, there exists an embedding $\Psi$ s.t.

  • For every $x\in\mathbb{D}$ it holds that $\lVert\Psi(x)\rVert_2=1$ and $\lVert\Psi(x)\rVert_\infty\leq 1$.

  • For $x'\neq x$ it holds that $\langle\Psi(x),\Psi(x')\rangle<1-\frac{1}{d}$.

  • For every $x\neq\langle\textsc{BOS}\rangle$, $\langle\Psi(x),\Psi(\langle\textsc{BOS}\rangle)\rangle=0$, and for every $x\neq\langle\textsc{COPY}\rangle$, $\langle\Psi(x),\Psi(\langle\textsc{COPY}\rangle)\rangle=0$.

Proof.

Denote $d'=\lceil\log(D)\rceil$, and observe that we can encode all $D$ “non-special” tokens as vectors in $\left\{\pm\frac{1}{\sqrt{d}}\right\}^{d'}$, and denote this encoding by $\Psi'$. Now, define:

$$\Psi(x)=\begin{cases}[1,0,\dots,0] & x=\langle\textsc{BOS}\rangle\\ [0,1,0,\dots,0] & x=\langle\textsc{COPY}\rangle\\ [0,0,\Psi'(x)] & \text{o.w.}\end{cases}$$
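The definition above can be instantiated in a few lines of Python. Several choices here are assumptions for illustration: the logarithm is taken in base 2, each non-special token is encoded by the binary digits of its index mapped to signs, and the strings `"<BOS>"` and `"<COPY>"` name the special tokens. The checks that accompany the sketch cover only the inner-product and orthogonality properties.

```python
import math

def build_embedding(num_tokens):
    """Return (embedding dict, d) following the case definition of Psi."""
    d_prime = max(1, math.ceil(math.log2(num_tokens)))  # assumption: base-2 log
    d = d_prime + 2
    scale = 1.0 / math.sqrt(d)
    emb = {
        "<BOS>": [1.0] + [0.0] * (d - 1),
        "<COPY>": [0.0, 1.0] + [0.0] * (d - 2),
    }
    for tok in range(num_tokens):
        # Sign pattern of the binary expansion of the token index.
        bits = [(tok >> b) & 1 for b in range(d_prime)]
        emb[tok] = [0.0, 0.0] + [scale if bit else -scale for bit in bits]
    return emb, d

def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

Two distinct non-special tokens differ in at least one sign, so their inner product is at most $(d'-2)/d$, which is below $1-\frac{1}{d}$.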

Lemma C.3.

Let $\bm{z}\in\mathbb{R}^{K}$ be some vector such that, for some constants $a>b>0$, there exists $i\in[K]$ s.t. $z_i=a$ and for all $j\neq i$ we have $\lvert z_j\rvert\leq b$. Denote $\bm{s}=\mathrm{softmax}(\bm{z})$. Then $s_i\geq\frac{1}{1+K\exp(b-a)}$ and $s_j\leq\exp(b-a)$ for all $j\neq i$.

Proof.

First, notice that:

$$\exp(a)=\exp(z_i)\leq\sum_{j=1}^{K}\exp(z_j)\leq\exp(z_i)+(K-1)\exp(b)\leq\exp(a)+K\exp(b)=\exp(a)\left(1+K\exp(b-a)\right)$$

Observe the following:

$$s_i=\frac{\exp(z_i)}{\sum_{j=1}^{K}\exp(z_j)}\geq\frac{\exp(a)}{\exp(a)\left(1+K\exp(b-a)\right)}=\frac{1}{1+K\exp(b-a)}$$

Finally, for every $j\neq i$:

$$s_j=\frac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)}\leq\frac{\exp(b)}{\exp(a)}=\exp(b-a)$$
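The two bounds are easy to sanity-check numerically. The sketch below is only an illustration: the helper names and the example vector are made up, and a standard max-shifted softmax is used for numerical stability.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def satisfies_lemma_bounds(z, i, a, b):
    """Check s_i >= 1 / (1 + K exp(b - a)) and s_j <= exp(b - a) for j != i,
    assuming z[i] == a and |z[j]| <= b for every other index j."""
    k = len(z)
    s = softmax(z)
    lower_ok = s[i] >= 1.0 / (1.0 + k * math.exp(b - a))
    upper_ok = all(s[j] <= math.exp(b - a) for j in range(k) if j != i)
    return lower_ok and upper_ok
```

For instance, with $a=5$, $b=1$ and $K=6$, the dominant coordinate receives at least $1/(1+6e^{-4})$ of the mass.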

C.2 Proof of Theorem 2.3

We begin by constructing the first block of the transformer, which computes the “lookup-table” for the copy algorithm. This lookup-table consists of (key, value) pairs for each position $i$, where the key encodes the $n$-gram preceding the $i$-th token, and the value is the $i$-th token. Namely, if the sequence is $x_1,\dots,x_i$, then $\mathrm{key}_i=(x_{i-n},\dots,x_{i-1})$ and $\mathrm{value}_i=x_i$. Additionally, the transformer block also computes a query, which is just the “current” $n$-gram, i.e. $\mathrm{query}_i=(x_{i-n+1},\dots,x_i)$. The copy algorithm matches the current $\mathrm{query}$ with previous $\mathrm{key}$-s, retrieving the matching $\mathrm{value}$.

The following lemma shows that a combination of $n$ hard-ALiBi attention heads (with a different choice of $m$ for each head), together with an MLP layer, can compute the correct $(\mathrm{key}_i,\mathrm{value}_i,\mathrm{query}_i)$ for each position. We use a slightly modified $\mathrm{key}_i,\mathrm{query}_i$ to handle cases where $i\leq n$ (or, $i$ is one of the first $n$ tokens after the $\langle\textsc{COPY}\rangle$ token).

Lemma C.4.

Let $\Psi$ be the one-hot embedding. Then, there exists a hard-ALiBi transformer block with 3 outputs, denoted $T^{\mathrm{key}},T^{\mathrm{query}},T^{\mathrm{value}}$, which correspond to 3 blocks of the output dimension, s.t. $T^{\mathrm{key}}:\mathbb{R}^{d\times *}\to\mathbb{R}^{(d+1)n\times *}$, $T^{\mathrm{query}}:\mathbb{R}^{d\times *}\to\mathbb{R}^{(d+1)n\times *}$ and $T^{\mathrm{value}}:\mathbb{R}^{d\times *}\to\mathbb{R}^{d\times *}$ satisfying, for all $\bm{x}$ sampled from a length-$n$ copy distribution,

  1. Value output: for all $i$,

     $$T_i^{\mathrm{value}}(\Psi(x_1),\dots,\Psi(x_i))=\Psi(x_i)$$

  2. Key output:

     • For $t=1,\dots,n$, if $i>n$

       $$T^{\mathrm{key}}_{(t-1)d+1:td,\,i}(\Psi(x_1),\dots,\Psi(x_i))=\Psi(x_{i-t})$$

       and if $i\leq n$

       $$T^{\mathrm{key}}_{(t-1)d+1:td,\,i}(\Psi(x_1),\dots,\Psi(x_i))=0$$

     • Additionally, for $t=1,\dots,n$, for all $i$

       $$T^{\mathrm{key}}_{nd+t,\,i}(\Psi(x_1),\dots,\Psi(x_i))=\bm{1}\{i=t+1\}$$

  3. Query output:

     • For $t=1,\dots,n$, if $i\geq n$

       $$T^{\mathrm{query}}_{(t-1)d+1:td,\,i}(\Psi(x_1),\dots,\Psi(x_i))=\Psi(x_{i-t+1})$$

       and if $i<n$

       $$T^{\mathrm{query}}_{(t-1)d+1:td,\,i}(\Psi(x_1),\dots,\Psi(x_i))=0$$

     • Additionally, for $t=1,\dots,n$, for all $i$

       $$T^{\mathrm{query}}_{nd+t,\,i}(\Psi(x_1),\dots,\Psi(x_i))=n\cdot\bm{1}\{i=L+t\}$$

Proof.

We prove the following:

  1. 1.

    For the value output, we simply take Tvalue=h1superscript𝑇valuesubscript1T^{\mathrm{value}}=h_{1}italic_T start_POSTSUPERSCRIPT roman_value end_POSTSUPERSCRIPT = italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT as defined in Lemma C.1.

  2. 2.

    For each t=0,,n𝑡0𝑛t=0,\dots,nitalic_t = 0 , … , italic_n, define:

    gt(𝒙1,,𝒙i)=(t+1)ht+1(𝒙1,,𝒙i)tht(𝒙1,,𝒙i)subscript𝑔𝑡subscript𝒙1subscript𝒙𝑖𝑡1subscript𝑡1subscript𝒙1subscript𝒙𝑖𝑡subscript𝑡subscript𝒙1subscript𝒙𝑖g_{t}({\bm{x}}_{1},\dots,{\bm{x}}_{i})=(t+1)\cdot h_{t+1}({\bm{x}}_{1},\dots,{% \bm{x}}_{i})-t\cdot h_{t}({\bm{x}}_{1},\dots,{\bm{x}}_{i})italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ( italic_t + 1 ) ⋅ italic_h start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - italic_t ⋅ italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )

    where we define h00subscript00h_{0}\equiv 0italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ≡ 0. Observe that if i>t𝑖𝑡i>titalic_i > italic_t then:

    gt(𝒙1,,𝒙i)=(t+1)1t+1j=iti𝒙jt1tj=it+1i𝒙j=𝒙itsubscript𝑔𝑡subscript𝒙1subscript𝒙𝑖𝑡11𝑡1superscriptsubscript𝑗𝑖𝑡𝑖subscript𝒙𝑗𝑡1𝑡superscriptsubscript𝑗𝑖𝑡1𝑖subscript𝒙𝑗subscript𝒙𝑖𝑡g_{t}({\bm{x}}_{1},\dots,{\bm{x}}_{i})=(t+1)\cdot\frac{1}{t+1}\sum_{j=i-t}^{i}% {\bm{x}}_{j}-t\cdot\frac{1}{t}\sum_{j=i-t+1}^{i}{\bm{x}}_{j}={\bm{x}}_{i-t}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ( italic_t + 1 ) ⋅ divide start_ARG 1 end_ARG start_ARG italic_t + 1 end_ARG ∑ start_POSTSUBSCRIPT italic_j = italic_i - italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - italic_t ⋅ divide start_ARG 1 end_ARG start_ARG italic_t end_ARG ∑ start_POSTSUBSCRIPT italic_j = italic_i - italic_t + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUBSCRIPT italic_i - italic_t end_POSTSUBSCRIPT

    and if it𝑖𝑡i\leq titalic_i ≤ italic_t then:

    gt(𝒙1,,𝒙i)=t1ij=1i𝒙j(t1)1ij=1i𝒙j=1ij=1i𝒙jsubscript𝑔𝑡subscript𝒙1subscript𝒙𝑖𝑡1𝑖superscriptsubscript𝑗1𝑖subscript𝒙𝑗𝑡11𝑖superscriptsubscript𝑗1𝑖subscript𝒙𝑗1𝑖superscriptsubscript𝑗1𝑖subscript𝒙𝑗g_{t}({\bm{x}}_{1},\dots,{\bm{x}}_{i})=t\cdot\frac{1}{i}\sum_{j=1}^{i}{\bm{x}}% _{j}-(t-1)\cdot\frac{1}{i}\sum_{j=1}^{i}{\bm{x}}_{j}=\frac{1}{i}\sum_{j=1}^{i}% {\bm{x}}_{j}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_t ⋅ divide start_ARG 1 end_ARG start_ARG italic_i end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - ( italic_t - 1 ) ⋅ divide start_ARG 1 end_ARG start_ARG italic_i end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_i end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT

    For every $j \in [d]$, denote

    $$\begin{aligned}
    \hat{g}_{t,j}(\bm{x}_1,\dots,\bm{x}_i) &= \sigma\big(e_j\cdot g_t(\bm{x}_1,\dots,\bm{x}_i) - n\Psi(\langle\textsc{BOS}\rangle)\cdot g_n(\bm{x}_1,\dots,\bm{x}_i)\big) \\
    &\quad - \sigma\big(-e_j\cdot g_t(\bm{x}_1,\dots,\bm{x}_i) - n\Psi(\langle\textsc{BOS}\rangle)\cdot g_n(\bm{x}_1,\dots,\bm{x}_i)\big)
    \end{aligned}$$

    Claim: $\hat{g}_t(\Psi(x_1),\dots,\Psi(x_i)) = \bm{1}\{i>n\}\cdot\Psi(x_{i-t})$

    Proof: Fix some $j \in [d]$. Observe that for all $i$, $\left\lvert e_j\cdot g_t(\Psi(x_1),\dots,\Psi(x_i))\right\rvert \leq 1$.

    • If $i \leq n$, we have $g_n(\Psi(x_1),\dots,\Psi(x_i)) = \frac{1}{i}\sum_{j'=1}^{i}\Psi(x_{j'})$ and so $\Psi(\langle\textsc{BOS}\rangle)\cdot g_n(\Psi(x_1),\dots,\Psi(x_i)) = \frac{1}{i} \geq \frac{1}{n}$, where we use the properties of $\Psi$ and the fact that $x_1 = \langle\textsc{BOS}\rangle$. Hence $n\Psi(\langle\textsc{BOS}\rangle)\cdot g_n(\Psi(x_1),\dots,\Psi(x_i)) \geq 1 \geq \left\lvert e_j\cdot g_t(\Psi(x_1),\dots,\Psi(x_i))\right\rvert$, so both arguments of $\sigma$ are non-positive. Therefore, $\hat{g}_{t,j}(\Psi(x_1),\dots,\Psi(x_i)) = 0$.

    • If $i > n \geq t$, then:

      $$\begin{aligned}
      \hat{g}_{t,j}(\Psi(x_1),\dots,\Psi(x_i)) &= \sigma\left(e_j\cdot\Psi(x_{i-t}) - n\Psi(\langle\textsc{BOS}\rangle)\cdot\Psi(x_{i-t})\right) \\
      &\quad - \sigma\left(-e_j\cdot\Psi(x_{i-t}) - n\Psi(\langle\textsc{BOS}\rangle)\cdot\Psi(x_{i-t})\right) \\
      &= \sigma(e_j\cdot\Psi(x_{i-t})) - \sigma(-e_j\cdot\Psi(x_{i-t})) = e_j\cdot\Psi(x_{i-t})
      \end{aligned}$$

      where we use the fact that $x_{i-t} \neq \langle\textsc{BOS}\rangle$ and therefore $\Psi(\langle\textsc{BOS}\rangle)\cdot\Psi(x_{i-t}) = 0$.

    Denote

    $$\tilde{g}_t(\bm{x}_1,\dots,\bm{x}_i) = \frac{1}{2}\sigma\left(2\Psi(\langle\textsc{BOS}\rangle)\cdot\left(g_t(\bm{x}_1,\dots,\bm{x}_i) - h_1(\bm{x}_1,\dots,\bm{x}_i)\right) - 1\right)$$

    Claim: $\tilde{g}_t(\Psi(x_1),\dots,\Psi(x_i)) = \bm{1}\{i = t+1\}$

    Proof: Denote $g_{t,i} = g_t(\Psi(x_1),\dots,\Psi(x_i))$ and $h_{1,i} = h_1(\Psi(x_1),\dots,\Psi(x_i))$. Observe:

    • If $i = t+1$, then $g_{t,i} = \Psi(x_1) = \Psi(\langle\textsc{BOS}\rangle)$ and $h_{1,i} = \Psi(x_i) \perp \Psi(\langle\textsc{BOS}\rangle)$ and therefore $\tilde{g}_{t,i} = 1$.

    • If $i > t+1$ then $g_{t,i} = \Psi(x_{i-t}) \perp \Psi(\langle\textsc{BOS}\rangle)$ and $h_{1,i} = \Psi(x_i) \perp \Psi(\langle\textsc{BOS}\rangle)$ and so $\tilde{g}_{t,i} = 0$.

    • If $1 < i \leq t$ then $\Psi(\langle\textsc{BOS}\rangle)\cdot g_{t,i} = \frac{1}{i} \leq \frac{1}{2}$ and $h_{1,i} = \Psi(x_i) \perp \Psi(\langle\textsc{BOS}\rangle)$ and so $\tilde{g}_{t,i} = 0$.

    • If $i = 1$ then $g_{t,i} = h_{1,i} = \Psi(\langle\textsc{BOS}\rangle)$ and therefore $\tilde{g}_{t,i} = 0$.

    Finally, we can take $T^{\mathrm{key}} = [\hat{g}_1,\dots,\hat{g}_q,\tilde{g}_1,\dots,\tilde{g}_q]$.

  3.

    For all $t = 1,\dots,n$, define $g^*_t(\bm{x}_1,\dots,\bm{x}_i) = \sigma(\Psi(\langle\textsc{COPY}\rangle)\cdot g_{t-1}(\bm{x}_1,\dots,\bm{x}_i))$.

    Claim: $g^*_t(\Psi(x_1),\dots,\Psi(x_i)) = \bm{1}\{i = L+t\}$

    Proof: Denote $g_{t,i} = g_t(\Psi(x_1),\dots,\Psi(x_i))$. Observe:

    • If $i = L+t$ then $g_{t-1,i} = \Psi(x_{i-t+1}) = \Psi(x_{L+1}) = \Psi(\langle\textsc{COPY}\rangle)$ and therefore $g^*_{t,i} = 1$.

    • If $i \neq L+t$ and $i > t-1$ then $g_{t-1,i} = \Psi(x_{i-t+1}) \perp \Psi(\langle\textsc{COPY}\rangle)$ and therefore $g^*_{t,i} = 0$.

    • If $i \leq t$ then since $x_1,\dots,x_i \neq \langle\textsc{COPY}\rangle$ we get $\Psi(\langle\textsc{COPY}\rangle)\cdot g_{t-1,i} = 0$ and therefore $g^*_{t,i} = 0$.

    Therefore, we can take $T^{\mathrm{query}} = [\hat{g}_0,\dots,\hat{g}_{q-1}, n\cdot g^*_1,\dots,n\cdot g^*_q]$.
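The gadgets used in this construction can be sanity-checked numerically. The following is a minimal sketch rather than the construction itself: it assumes one-hot embeddings in place of $\Psi$, models each hard-ALiBi head as a uniform average over the last $\min(i, w)$ positions, and forms $g_t$ as $(t+1)$ times the window-$(t+1)$ average minus $t$ times the window-$t$ average. The helper names (`window_mean`, `g`, `gated`, `copy_flag`) and the toy sequence are illustrative choices, not taken from the paper.

```python
from fractions import Fraction

def one_hot(token, dim):
    """Stand-in for Psi: orthonormal one-hot vectors (an assumption)."""
    return [1 if k == token else 0 for k in range(dim)]

def window_mean(xs, i, w):
    """Uniform average of the last min(i, w) vectors among positions 1..i."""
    window = xs[max(0, i - w):i]
    if not window:  # w == 0: empty window, contributes nothing
        return [Fraction(0)] * len(xs[0])
    return [Fraction(sum(col), len(window)) for col in zip(*window)]

def g(xs, i, t):
    """(t+1) * mean over window t+1  minus  t * mean over window t."""
    a = window_mean(xs, i, t + 1)
    b = window_mean(xs, i, t)
    return [(t + 1) * p - t * q for p, q in zip(a, b)]

def relu(v):
    return max(v, 0)

def gated(a, b):
    """sigma(a - b) - sigma(-a - b): equals a when b = 0, and 0 when b >= |a|."""
    return relu(a - b) - relu(-a - b)

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def copy_flag(xs, i, t, psi_copy):
    """g*_t at position i: ReLU of the overlap between Psi(COPY) and g_{t-1}."""
    return relu(dot(psi_copy, g(xs, i, t - 1)))

# Toy sequence: BOS a b COPY a b, so L = 3 and x_{L+1} = COPY.
BOS, A, B, COPY = 0, 1, 2, 3
tokens = [BOS, A, B, COPY, A, B]
xs = [one_hot(tok, 4) for tok in tokens]

shifted = g(xs, 5, 2)        # i > t: recovers x_{i-t} = x_3
averaged = g(xs, 2, 3)       # i <= t: plain average of x_1, x_2
flags = [copy_flag(xs, i, 2, one_hot(COPY, 4)) for i in range(1, 7)]
```

With exact rational arithmetic, `shifted` equals the embedding of the third token, `averaged` is the mean of the first two embeddings, and `flags` is nonzero only at position $L + t = 5$.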

Now, we prove Theorem 2.3 by showing that a single attention head with no positional embedding, placed on top of the construction in Lemma C.4, realizes the copy algorithm. Since the first block computes the correct choice of $\mathrm{key}_i, \mathrm{query}_i, \mathrm{value}_i$, by correctly scaling the attention matrix we verify that the output of the second layer at position $i$ corresponds to $\approx \mathrm{value}_j$ for $j$ s.t. $\mathrm{key}_j = \mathrm{query}_i$.

Proof of Theorem 2.3.

Let $T^{\mathrm{value}}, T^{\mathrm{key}}, T^{\mathrm{query}}$ be the outputs of the Transformer block guaranteed by Lemma C.4. Observe that, for some temperature $\tau \in \mathbb{R}$, the following function can be computed by a softmax-attention layer on top of this block:

$$H(\Psi(x_1),\dots,\Psi(x_i)) = T^{\mathrm{value}}\cdot\mathrm{softmax}(\tau\cdot T^{\mathrm{key}}\cdot T^{\mathrm{query}}_i)$$

where e.g. $T^{\mathrm{value}}$ denotes $T^{\mathrm{value}}(\Psi(x_1),\dots,\Psi(x_i))$.

For now, assume that all the $n$-grams in $\bm{x}$ are unique, and that the length of the input satisfies $2L+2 \leq K$ for $K = D^n$.

Claim: Fix some $i > L$, denote $\bm{z} = T^{\mathrm{key}}\cdot T^{\mathrm{query}}_i$. Then, $z_{i-L+1} = n$ and $\left\lvert z_j\right\rvert < n - \frac{1}{d}$ for all $j \neq i-L+1$.

Proof: We separate into the following cases:

  • If $i > L+n-1$, then for every $j$ we have

    $$\begin{aligned}
    T^{\mathrm{key}}_j\cdot T^{\mathrm{query}}_i &= \bm{1}\{j>n\}\cdot[\Psi(x_{j-1}),\dots,\Psi(x_{j-n})]^{\top}[\Psi(x_i),\dots,\Psi(x_{i-n+1})] \\
    &= \bm{1}\{j>n\}\cdot\sum_{t=1}^{n}\Psi(x_{j-t})\cdot\Psi(x_{i-t+1})
    \end{aligned}$$

    Now, if $j = i-L+1$ then $x_{j-t} = x_{i-L+1-t} = x_{i-t+1}$ and since $j > n$ we get

    $$T_j^{\mathrm{key}}\cdot T_i^{\mathrm{query}} = \sum_{t=1}^{n}\left\lVert\Psi(x_{i-t+1})\right\rVert = n$$

    If $j \neq i-L+1$, since there are no repeated $n$-grams, there is at least some $t \in [n]$ s.t. $\Psi(x_{j-t}) \neq \Psi(x_{i-t+1})$ and by the choice of the embedding $\Psi(x_{j-t})\cdot\Psi(x_{i-t+1}) \leq 1 - \frac{1}{d}$. In this case, we get $\left\lvert T_j^{\mathrm{key}}\cdot T_i^{\mathrm{query}}\right\rvert \leq n - \frac{1}{d}$.

  • If $L < i \leq L+n-1$ and $j \leq n$ then

    $$T_j^{\mathrm{key}}\cdot T_i^{\mathrm{query}} = n e_{j-1}\cdot e_{i-L} = n\cdot\bm{1}\{j = i-L+1\}$$

    which satisfies the required bound.

  • If $L < i \leq L+n-1$ and $j > n$ then

    $$T_j^{\mathrm{key}}\cdot T_i^{\mathrm{query}} = \sum_{t=1}^{n}\Psi(x_{j-t})\cdot\Psi(x_{i-t+1})$$

    and as before, since there are no repeated $n$-grams, we get $\left\lvert T_j^{\mathrm{key}}\cdot T_i^{\mathrm{query}}\right\rvert \leq n - \frac{1}{d}$.
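The estimate for mismatched $n$-grams can be spelled out in one line. This is a sketch of the upper bound only, assuming every embedding has unit norm (as the computation for $j = i-L+1$ suggests), so that each inner product is at most 1 by Cauchy-Schwarz, and writing $t^\star$ (notation introduced here for illustration) for an offset at which the two $n$-grams differ:

$$\sum_{t=1}^{n}\Psi(x_{j-t})\cdot\Psi(x_{i-t+1}) \leq \underbrace{\sum_{t\neq t^\star} 1}_{=\,n-1} + \left(1-\frac{1}{d}\right) = n - \frac{1}{d}.$$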

Claim: Fix some $\epsilon \in (0,1)$ and some $i > L$, denote $\bm{s} = \mathrm{softmax}(\tau T^{\mathrm{key}}\cdot T^{\mathrm{query}}_i) = \mathrm{softmax}(\tau\cdot\bm{z})$. If $\tau = d\ln(\frac{2K}{\epsilon})$, then $s_{i-L+1} \geq 1-\epsilon$ and $s_j \leq \frac{\epsilon}{2K}$ for all $j \neq i-L+1$.

Proof: Using the previous claim, together with Lemma C.3, we get that:

  • $s_{i-L+1} \geq \frac{1}{1+i\exp(-\tau/d)} \geq \frac{1}{1+K\exp(-\tau/d)} \geq \frac{1}{1+\epsilon/2} = 1 - \frac{\epsilon/2}{1+\epsilon/2} \geq 1-\epsilon$

  • For $j \neq i-L+1$,

    $$s_j \leq \exp(-\tau/d) \leq \frac{\epsilon}{2K}$$
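The effect of the temperature can be checked on a toy score vector. The sketch below is an illustration under stated assumptions, not part of the proof: it takes a score vector with one entry equal to $n$ and all others equal to $n - \frac{1}{d}$ (the worst case allowed by the previous claim), with arbitrary values for $n$, $d$, $K$ and $\epsilon$; the names `softmax` and `temperature` are hypothetical helpers.

```python
import math

def softmax(scores, tau):
    """Numerically stable softmax of tau * scores."""
    m = max(scores)
    exps = [math.exp(tau * (z - m)) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temperature(d, K, eps):
    """tau = d * ln(2K / eps), as in the claim."""
    return d * math.log(2 * K / eps)

# Arbitrary toy values (assumptions for illustration only).
n, d, K, eps = 3, 8, 50, 0.1
target = 10                      # index playing the role of i - L + 1
scores = [n - 1 / d] * K         # every competitor sits at the bound n - 1/d
scores[target] = n               # the matching position scores exactly n

tau = temperature(d, K, eps)
s = softmax(scores, tau)
others = [w for j, w in enumerate(s) if j != target]
```

Here `s[target]` stays above $1-\epsilon$ and every entry of `others` stays below $\frac{\epsilon}{2K}$, since each competitor is damped by a factor $\exp(-\tau/d) = \frac{\epsilon}{2K}$ relative to the target.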

Claim: Fix some $\epsilon \in (0,1)$ and some $i > L$. Then, for $\tau \geq d\ln(\frac{2K}{\epsilon})$, it holds that:

$$\left\lVert H(\Psi(x_1),\dots,\Psi(x_i)) - \Psi(x_{i-L+1})\right\rVert \leq \epsilon$$

Proof: Let $\bm{s}$ be as defined in the previous claim. Then:

$$\begin{aligned}
\left\lVert H(\Psi(x_1),\dots,\Psi(x_i)) - \Psi(x_{i-L+1})\right\rVert &= \left\lVert\sum_{j=1}^{i}s_j\Psi(x_j) - \Psi(x_{i-L+1})\right\rVert \\
&\leq (1-s_{i-L+1})\left\lVert\Psi(x_{i-L+1})\right\rVert + \sum_{j\neq i-L+1}s_j\left\lVert\Psi(x_j)\right\rVert \\
&= (1-s_{i-L+1}) + \sum_{j\neq i-L+1}s_j \leq \epsilon + (i-1)\frac{\epsilon}{2K} \leq 2\epsilon
\end{aligned}$$
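The readout error can also be measured directly on a small example. This is a hedged sketch under simplifying assumptions: one-hot vectors stand in for $\Psi$ (so the embedding dimension equals the alphabet size), the scores are set by hand to $n$ at the target and $n - \frac{1}{d}$ elsewhere, and the toy sequence and the names `one_hot`, `softmax`, `readout` are illustrative rather than taken from the paper.

```python
import math

def one_hot(token, dim):
    """Stand-in for Psi: unit-norm one-hot vectors (an assumption)."""
    return [1.0 if k == token else 0.0 for k in range(dim)]

def softmax(scores, tau):
    """Numerically stable softmax of tau * scores."""
    m = max(scores)
    exps = [math.exp(tau * (z - m)) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def readout(tokens, weights, dim):
    """H = sum_j s_j * Psi(x_j), the attention output at one position."""
    out = [0.0] * dim
    for tok, w in zip(tokens, weights):
        for k, v in enumerate(one_hot(tok, dim)):
            out[k] += w * v
    return out

# Arbitrary toy values (assumptions for illustration only).
d, n, K, eps = 4, 2, 16, 0.1
tokens = [0, 1, 2, 3, 1, 2]
target = 2                          # index playing the role of i - L + 1
scores = [n - 1 / d] * len(tokens)  # competitors at the bound n - 1/d
scores[target] = n

tau = d * math.log(2 * K / eps)
s = softmax(scores, tau)
H = readout(tokens, s, d)
want = one_hot(tokens[target], d)
err = math.sqrt(sum((h - w) ** 2 for h, w in zip(H, want)))
```

In this toy setting `err` falls well below the $2\epsilon$ obtained in the derivation, because only five competitors share the residual attention mass.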

Now, denote by Φ:d𝔻:Φsuperscript𝑑𝔻\Phi:{\mathbb{R}}^{d}\to{\mathbb{D}}roman_Φ : blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT → blackboard_D the output map given by Φ(𝒛)=argmaxx𝔻𝒛Ψ(x)Φ𝒛subscript𝑥𝔻𝒛Ψ𝑥\Phi({\bm{z}})=\arg\max_{x\in{\mathbb{D}}}{\bm{z}}\cdot\Psi(x)roman_Φ ( bold_italic_z ) = roman_arg roman_max start_POSTSUBSCRIPT italic_x ∈ blackboard_D end_POSTSUBSCRIPT bold_italic_z ⋅ roman_Ψ ( italic_x ) (which can be computed by an argmaxargmax\operatorname*{arg\,max}roman_arg roman_max over a linear function).

Claim: If $\tau \ge d \ln(8Kd)$, then for all $i > L$ we have $\Phi(H(\Psi(x_1),\dots,\Psi(x_i))) = x_{i-L+1}$.

Proof: Denote $\bm{y}_i = H(\Psi(x_1),\dots,\Psi(x_i))$. First, using the previous claim, we observe that

$$\begin{aligned}
\bm{y}_i \cdot \Psi(x_{i-L+1}) &= (\bm{y}_i - \Psi(x_{i-L+1})) \cdot \Psi(x_{i-L+1}) + \left\lVert \Psi(x_{i-L+1}) \right\rVert \\
&\ge 1 - \left\lVert \bm{y}_i - \Psi(x_{i-L+1}) \right\rVert \ge 1 - \frac{1}{4d}
\end{aligned}$$

Next, observe that for all $j \neq i-L+1$ we have

$$\begin{aligned}
\bm{y}_i \cdot \Psi(x_j) &= (\bm{y}_i - \Psi(x_{i-L+1})) \cdot \Psi(x_j) + \Psi(x_j) \cdot \Psi(x_{i-L+1}) \\
&\le \left\lVert \bm{y}_i - \Psi(x_{i-L+1}) \right\rVert + 1 - \frac{1}{d} \le 1 - \frac{3}{4d} < \bm{y}_i \cdot \Psi(x_{i-L+1})
\end{aligned}$$

From the above claim, the Transformer construction outputs the correct token at each step of the auto-regressive generation. ∎
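The margin argument in the claim above can be checked numerically. The following is a minimal sketch under stated assumptions, not the construction from the proof: `sign_embedding` is a hypothetical stand-in for $\Psi$ that maps token $k$ to its $d$-bit binary code with entries $\pm 1/\sqrt{d}$, which yields unit-norm vectors whose pairwise inner products are at most $1 - 2/d \le 1 - 1/d$. The sketch perturbs each embedding by a vector of norm $1/(4d)$, the error level used in the claim, and decodes with the argmax map $\Phi$.

```python
import numpy as np

def sign_embedding(num_tokens, d):
    """Hypothetical Psi: token k -> its d-bit binary code with entries +-1/sqrt(d)."""
    assert num_tokens <= 2 ** d
    bits = (np.arange(num_tokens)[:, None] >> np.arange(d)[None, :]) & 1
    return (2.0 * bits - 1.0) / np.sqrt(d)

def decode(y, psi):
    """Phi(y) = argmax_x  y . Psi(x), an argmax over a linear function of y."""
    return int(np.argmax(psi @ y))

d = 6
psi = sign_embedding(2 ** d, d)   # one row per token
rng = np.random.default_rng(0)

recovered = []
for token in range(psi.shape[0]):
    noise = rng.normal(size=d)
    noise *= (1.0 / (4 * d)) / np.linalg.norm(noise)   # ||y - Psi(x)|| = 1/(4d)
    recovered.append(decode(psi[token] + noise, psi))

all_correct = recovered == list(range(psi.shape[0]))
```

Under these assumptions the correct token scores at least $1 - \frac{1}{4d}$ while every other token scores at most $1 - \frac{2}{d} + \frac{1}{4d}$, so the argmax recovers the intended token for every perturbation of this size.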

C.3 Proof of Lemma 2.4

Proof of Lemma 2.4.

Fix some $i < j \in [L]$. Let $I := \{i,\dots,i+n\}$ and $J := \{j,\dots,j+n\}$. We first bound the probability of drawing some $\bm{x}$ s.t. $\bm{x}_I = \bm{x}_J$. Note that there are $D^{\lvert I \cup J \rvert}$ choices for $\bm{x}_{I \cup J}$. We count the number of choices for $\bm{x}_{I \cup J}$ s.t. $\bm{x}_I = \bm{x}_J$. Notice that in this case, $\bm{x}_{I \cup J}$ is determined by $\bm{x}_{I \setminus J}$, therefore there are $D^{\lvert I \setminus J \rvert}$ possible choices. We conclude that

$$\Pr\left[\bm{x}_I = \bm{x}_J\right] = \frac{D^{\lvert I \setminus J \rvert}}{D^{\lvert I \cup J \rvert}} = D^{\lvert I \setminus J \rvert - \lvert I \cup J \rvert} = D^{-n}$$

Using the union bound, we get that

$$\Pr\left[\exists\, i < j ~\mathrm{s.t.}~ \bm{x}_{i,\dots,i+n} = \bm{x}_{j,\dots,j+n}\right] \le \sum_{i<j} \Pr\left[\bm{x}_{i,\dots,i+n} = \bm{x}_{j,\dots,j+n}\right] < L^2 D^{-n}$$
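The counting argument and the union bound can be sanity-checked by exhaustive enumeration on small instances. The sketch below is illustrative only: the function names are hypothetical, and it assumes windows of exactly $n$ consecutive tokens (0-indexed), a uniform distribution over strings, and parameters small enough that all $D^L$ strings can be enumerated.

```python
from itertools import product

def has_repeated_ngram(x, n):
    """True if some window of n consecutive tokens occurs at two different positions."""
    seen = set()
    for i in range(len(x) - n + 1):
        gram = x[i:i + n]
        if gram in seen:
            return True
        seen.add(gram)
    return False

def pair_collision_probability(D, L, n, i, j):
    """Exact Pr[x_I = x_J] for the windows starting at i and j (overlap allowed)."""
    hits = sum(x[i:i + n] == x[j:j + n] for x in product(range(D), repeat=L))
    return hits / D ** L

def exact_repeat_probability(D, L, n):
    """Exact probability that a uniform string of length L has a repeated n-gram."""
    hits = sum(has_repeated_ngram(x, n) for x in product(range(D), repeat=L))
    return hits / D ** L

D, L, n = 4, 6, 3
p_overlap = pair_collision_probability(D, L, n, 0, 1)    # overlapping windows
p_disjoint = pair_collision_probability(D, L, n, 0, 3)   # disjoint windows
p_any = exact_repeat_probability(D, L, n)
union_bound = L ** 2 * D ** (-n)
```

In this setting both overlapping and disjoint window pairs collide with probability exactly $D^{-n}$, and the probability of any repetition stays below $L^2 D^{-n}$.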

Appendix D Proofs - Lower Bound

In this section, we prove Theorem 2.7. We begin by showing that, for every input, the output of the model in each iteration is a deterministic function of the state of the model after observing the input:

Lemma D.1.

Let $H_{u,r}: \mathbb{D}^{n'} \to \mathbb{D}^{n}$ be some fixed-state sequence-to-sequence model. Then, there exists a map $G: \mathcal{S} \to \mathbb{D}^{n}$ s.t. for all $\bm{x} \in \mathbb{D}^{n'}$

$$H_{u,r}(\bm{x}) = G \circ S_{n'}(\bm{x})$$
Proof.

Let $x_{n'+1},\dots,x_{n'+n}$ be the outputs of $H_{u,r}$. We need to show that there exist functions $G_1,\dots,G_n$ s.t., for $G = (G_1,\dots,G_n)$, we have $H_{u,r}(x_1,\dots,x_{n'}) = G(S_{n'}(x_1,\dots,x_{n'}))$. We give the following recursive definition:

  • $G_1(s) = r(s)$, $\tilde{G}_1(s) = u(s, G_1(s))$.

  • $G_i(s) = r(\tilde{G}_{i-1}(s))$, $\tilde{G}_i(s) = u(\tilde{G}_{i-1}(s), G_i(s))$.

Denote $s = S_{n'}(x_1,\dots,x_{n'})$. We prove by induction that $G_i(s) = x_{n'+i}$ and also that $\tilde{G}_i(s) = S_{n'+i}(x_1,\dots,x_{n'+i})$.

  • $G_1(s) = r(s) = R_{n'}(x_1,\dots,x_{n'}) = x_{n'+1}$.

  • $\tilde{G}_1(s) = u(s, G_1(s)) = u(s, x_{n'+1}) = S_{n'+1}(x_1,\dots,x_{n'+1})$

  • $G_i(s) = r(\tilde{G}_{i-1}(s)) = r(S_{n'+i-1}(x_1,\dots,x_{n'+i-1})) = R_{n'+i-1}(x_1,\dots,x_{n'+i-1}) = x_{n'+i}$

  • $\tilde{G}_i(s) = u(\tilde{G}_{i-1}(s), G_i(s)) = u(S_{n'+i-1}(x_1,\dots,x_{n'+i-1}), x_{n'+i}) = S_{n'+i}(x_1,\dots,x_{n'+i})$

and so the required follows. ∎
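The recursion for $G_i$ and $\tilde{G}_i$ can be illustrated on a toy fixed-state model. The sketch below is a hedged illustration, not the paper's model: the update rule `u`, the readout `r`, the initial state 0 and the sizes are arbitrary hypothetical choices, and the convention $\tilde{G}_0(s) = s$ is introduced only to unify the base case with the recursive step. The function `H` recomputes the state from the full sequence at every step, while `G` only ever sees the state reached after the prompt.

```python
from itertools import product

NUM_STATES, D = 7, 3          # hypothetical |S| and alphabet size

def u(s, t):
    """Hypothetical state update u(s, token)."""
    return (2 * s + t + 1) % NUM_STATES

def r(s):
    """Hypothetical readout r(s), mapping a state to a token."""
    return (s * s) % D

def S(tokens, s0=0):
    """State after reading `tokens` from the (assumed) initial state s0."""
    s = s0
    for t in tokens:
        s = u(s, t)
    return s

def H(prompt, n):
    """Autoregressive generation: each output is r(state of the whole sequence so far)."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(r(S(seq)))
    return tuple(seq[len(prompt):])

def G_tilde(i, s):
    """G~_i(s) = u(G~_{i-1}(s), G_i(s)), with the convention G~_0(s) = s."""
    return s if i == 0 else u(G_tilde(i - 1, s), G_step(i, s))

def G_step(i, s):
    """G_i(s) = r(G~_{i-1}(s))."""
    return r(G_tilde(i - 1, s))

def G(s, n):
    return tuple(G_step(i, s) for i in range(1, n + 1))

n_prime, n = 3, 4
lemma_holds = all(H(x, n) == G(S(x), n)
                  for x in product(range(D), repeat=n_prime))
```

In this toy setting `lemma_holds` evaluates to true for every prompt, so any two prompts that reach the same state necessarily produce the same output sequence.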

Given the previous Lemma, we bound the error of the model by comparing the number of possible states to the number of possible inputs.

Proof of Theorem 2.7.

From Lemma D.1, there exists some function $G: \mathcal{S} \to \mathbb{D}^{n}$ s.t. $H_{u,r} = G \circ S_{n'}$. For each $\bm{x}$, we denote by $\tilde{\bm{x}}$ the sequence $\langle \textsc{BOS} \rangle, \bm{x}, \langle \textsc{COPY} \rangle$. Now, observe the following:

$$\begin{aligned}
1 - \mathrm{err}_{\mathcal{D}_n}(H_{u,r}) &= \Pr_{\mathcal{D}_n}\left[H_{u,r}(\tilde{\bm{x}}) = \bm{x}\right] \\
&= \frac{1}{D^n} \sum_{\bm{x} \in \mathbb{D}^n} \bm{1}\{H_{u,r}(\tilde{\bm{x}}) = \bm{x}\} \\
&= \frac{1}{D^n} \sum_{s \in \mathcal{S}} \sum_{\bm{x}:\, S_{n+2}(\tilde{\bm{x}}) = s} \bm{1}\{G \circ S_{n+2}(\tilde{\bm{x}}) = \bm{x}\} \\
&= \frac{1}{D^n} \sum_{s \in \mathcal{S}} \sum_{\bm{x}:\, S_{n+2}(\tilde{\bm{x}}) = s} \bm{1}\{G(s) = \bm{x}\} \le \frac{\lvert \mathcal{S} \rvert}{D^n}
\end{aligned}$$
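The counting bound can be observed directly on small random fixed-state models. The following sketch is only illustrative and rests on several assumptions: the models are random lookup tables rather than trained networks, the token ids `BOS = D` and `COPY = D + 1` and the initial state 0 are hypothetical conventions, and the sizes are chosen so that all $D^n$ inputs can be enumerated. For each model it counts how many strings are copied exactly.

```python
import random
from itertools import product

def count_exact_copies(u, r, s0, D, n):
    """Number of x in D^n that a table-based fixed-state model (u, r) copies exactly."""
    BOS, COPY = D, D + 1                  # hypothetical ids for the special tokens
    correct = 0
    for x in product(range(D), repeat=n):
        s = s0
        for t in (BOS, *x, COPY):         # read the prompt <BOS>, x, <COPY>
            s = u[s][t]
        out = []
        for _ in range(n):                # autoregressive generation from the state
            y = r[s]
            out.append(y)
            s = u[s][y]
        correct += tuple(out) == x
    return correct

D, n, num_states = 2, 6, 5
rng = random.Random(0)
worst = 0
for _ in range(50):
    # u[s][t] is the next state, r[s] is the emitted token; both drawn at random.
    u = [[rng.randrange(num_states) for _ in range(D + 2)] for _ in range(num_states)]
    r = [rng.randrange(D) for _ in range(num_states)]
    worst = max(worst, count_exact_copies(u, r, 0, D, n))

accuracy_bound = num_states / D ** n      # |S| / D^n
```

Since the output depends on the input only through the state reached after `COPY`, each state can account for at most one correctly copied string, so `worst` never exceeds `num_states` regardless of how the tables are drawn.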