In-Context Learning State Vector with Inner and Momentum Optimization

Dongfang Li, Zhenyu Liu, Xinshuo Hu, Zetian Sun, Baotian Hu, Min Zhang
Harbin Institute of Technology (Shenzhen), Shenzhen, China
{crazyofapple, liuzhenyuhit}@gmail.com
{hubaotian, zhangmin2021}@hit.edu.cn
Corresponding author.
Abstract

Large Language Models (LLMs) have exhibited an impressive ability to perform In-Context Learning (ICL) from only a few examples. Recent works have indicated that the functions learned by ICL can be represented through compressed vectors derived from the transformer. However, the working mechanisms and optimization of these vectors are yet to be thoroughly explored. In this paper, we address this gap by presenting a comprehensive analysis of these compressed vectors, drawing parallels to the parameters trained with gradient descent, and introducing the concept of the state vector. Inspired by the works on model soup and momentum-based gradient descent, we propose inner and momentum optimization methods that are applied to refine the state vector progressively as test-time adaptation. Moreover, we simulate state vector aggregation in the multiple example setting, where demonstrations comprising numerous examples are usually too lengthy for regular ICL, and further propose a divide-and-conquer aggregation method to address this challenge. We conduct extensive experiments using Llama-2 and GPT-J in both zero-shot and few-shot settings. The experimental results show that our optimization method effectively enhances the state vector and achieves state-of-the-art performance on diverse tasks. Code is available at https://github.com/HITsz-TMG/ICL-State-Vector

1 Introduction

In-Context Learning (ICL) has emerged as a powerful capability in tandem with the scaling of large language models (LLMs) (Brown et al., 2020). By simply conditioning on a few input-label pairs as demonstrations, LLMs yield a significant improvement and deliver remarkable results in various downstream Natural Language Processing (NLP) tasks (Wei et al., 2022; Liu et al., 2023a). For example, a model prompted with the input “gaot $\rightarrow$ goat, sakne $\rightarrow$ snake, brid $\rightarrow$” can produce the output “bird”. Given these successes, it is worthwhile to inquire about the exact internal working mechanisms of ICL. Considering the opaque operation of ICL within the auto-regressive transformer, it is plausible that ICL might function as a general mechanism that leverages both demonstrations and the query to yield the prediction (Dong et al., 2023).

Recently, some studies have found that the ICL mapping function exists in the outputs of the attention layers or attention heads (Liu et al., 2023b; Dai et al., 2023) when applying causal effects analysis on a different set of models and tasks, such as the task vector (Hendel et al., 2023) and the function vector (Todd et al., 2023). These works show that the functionalities learned through ICL can be encapsulated in compressed vectors derived from transformers, which then can be used to intervene in the transformer to handle queries without demonstrations. This revelation suggests a potential mechanism in which ICL first utilises demonstrations to learn the mapping function from inputs to labels in shallow transformer layers, and then uses the ICL function in deeper transformer layers to make predictions (Hendel et al., 2023). However, while these compressed vectors encapsulate learned information in a more condensed form and show significant promise in applying ICL, there still exists a considerable gap in understanding the operational mechanisms and optimization strategies of these vectors. This gap hinders further understanding and use of ICL.

In this paper, we aim to bridge the existing gap by presenting a comprehensive analysis of compressed vectors. Specifically, we investigate their similarities with parameters trained via gradient descent and introduce the formulation of state vector that encapsulates the processing state of ICL stored in the attention activations. Building on the concept of state vector, and drawing insights from the model soup (Wortsman et al., 2022) and momentum-based gradient optimization algorithms (Qian, 1999; Sutskever et al., 2013), we propose inner optimization and momentum optimization strategies which are progressively applied to enhance the state vector. Moreover, we further exploit the demonstration compression capabilities of the state vector to address the practical challenges encountered when applying ICL in settings with multiple examples, where demonstrations are typically too lengthy for standard ICL, such as in the 100-shot setting which is prevalent in practice. Specifically, we introduce a divide-and-conquer aggregation method that effectively aggregates the ICL functions of these extensive examples. This approach enables us to scale up for processing extended examples by compressing them into a single state vector. We conduct extensive experiments using Llama-2 (Touvron et al., 2023) and GPT-J (Wang and Komatsuzaki, 2021) in both zero-shot and few-shot settings. The experimental results show that our method effectively enhances the state vector and achieves state-of-the-art performance on diverse tasks. This not only manifests the effectiveness of our approach but also paves the way for a more comprehensive understanding of ICL.

Our contributions are summarized as follows:

  • We delve into the working mechanism of compressed vectors in ICL and highlight their similarities with parameters trained via gradient descent. Building on this observation, we propose the formulation of the state vector.

  • We propose inner and momentum optimization to progressively refine the state vector as an efficient test-time adaptation. Additionally, we introduce a divide-and-conquer aggregation to effectively scale up to large numbers of examples.

  • We show the practicality of our proposed methods across a wide range of tasks through extensive experiments. Our results also offer insights for future research aiming to fully understand the functionalities of ICL.

2 Related Work

Mechanistic Interpretability.

Recent works have focused on the working mechanisms of ICL (Chan et al., 2022; Xie et al., 2022; Wang et al., 2023). Olsson et al. (2022) argue that induction heads may be the mechanistic source of general ICL in transformers. Akyürek et al. (2022) show that transformer-based in-context learners can implicitly implement standard optimization algorithms on linear models. A mainstream assumption posits that ICL resembles gradient descent. von Oswald et al. (2023) demonstrate how a linear attention-only transformer model can perform a gradient descent-like procedure implicitly. Dai et al. (2023) compare standard gradient descent-based fine-tuning and ICL, and find that the transformer attention of ICL exhibits a dual form of gradient descent-based optimization. Moreover, some works revisit and modify this theory on the layer causality dependence (Natan et al., 2023) or training batch size (Shen et al., 2023). In contrast, we focus on the application of the dual form of gradient descent and ICL and present optimization methods with inspiration from the dual form.

Task Representation.

Numerous studies have extensively explored the concept of compressing various tasks into task representations as a means of effectively manipulating tasks within ICL ability. Notably, Shao et al. (2023) and Mu et al. (2023) have successfully yielded compositional task representations by training a composition model. In a slightly different vein, some researchers have delved into the art of devising methodologies to compose minor parameter adjustments acquired through task fine-tuning (Ilharco et al., 2022; Panigrahi et al., 2023; Yu et al., 2023; Hu et al., 2024; Merullo et al., 2023). An alternative line of research finds that the task representation could be extracted in ICL (Liu et al., 2023b; Hendel et al., 2023; Todd et al., 2023; Yang et al., 2023). Different from these approaches, our work avoids the need for additional training and focuses more on analysing why these compressed vectors work and how to improve their performance.

3 Formalization

In this section, we first provide a detailed examination of attention activation which is found to contain the compressed ICL function by previous works (Hendel et al., 2023; Todd et al., 2023). Then, we highlight its inherent similarities with parameters trained through gradient descent. Finally, we introduce the concept of the state vector drawing inspiration from these observations.

A classic template of ICL has the following necessary components: (1) $N$ examples that are used to form the demonstrations, where each example contains an input query $\mathcal{X}$ and its corresponding label $\mathcal{Y}$. (2) Separate tokens $\mathcal{S}$ that separate the input query and the label for each example (e.g., $\rightarrow$). (3) A query $\mathcal{X}_q$ for prediction. With the above components, the contextual model input of ICL can be written as follows:

$$\mathcal{X}_1, \mathcal{S}, \mathcal{Y}_1, \mathcal{X}_2, \mathcal{S}, \mathcal{Y}_2, \cdots, \mathcal{X}_N, \mathcal{S}, \mathcal{Y}_N, \mathcal{X}_q, \mathcal{S}.$$
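As a small illustration of this template, the following sketch assembles the model input from a list of input-label pairs and a final query. The function name `build_icl_prompt`, the comma delimiter between examples, and the default arrow separator are assumptions made for illustration and are not taken from the released code.

```python
def build_icl_prompt(examples, query, sep="→"):
    """Format N (input, label) pairs plus a final query as X_1 S Y_1, ..., X_N S Y_N, X_q S."""
    parts = [f"{x} {sep} {y}" for x, y in examples]
    parts.append(f"{query} {sep}")  # the prompt ends on the last separate token
    return ", ".join(parts)


demos = [("gaot", "goat"), ("sakne", "snake")]
prompt = build_icl_prompt(demos, "brid")
print(prompt)  # gaot → goat, sakne → snake, brid →
```

The prompt deliberately ends on a separate token, since the analysis that follows concerns the activation at that position.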

Here we analyse the attention activation of the last separate token. In the $l$-th transformer layer, the output activation $\mathbf{a}^{l}$ of the attention heads of the last separate token is:

$$\mathbf{a}^{l} = W_V [X'; X] \operatorname{softmax}\left(\frac{\left(W_K [X'; X]\right)^T \mathbf{q}}{\sqrt{d}}\right), \tag{1}$$

where $X'$ denotes the hidden states of the demonstrations, $X$ denotes the hidden states of the query and the last separate token (called the zero-shot input), $\mathbf{q}$ denotes the attention query vector of the last separate token, $[X'; X]$ denotes matrix concatenation, $\sqrt{d}$ is the scaling factor, and $W_K$ and $W_V$ are parameter weight matrices.

Consistent with previous works (Dai et al., 2023; Natan et al., 2023), we omit the softmax operation and the scaling factor to approximate standard attention as relaxed linear attention for qualitative analysis. Consequently, the activation can be simplified as follows:

$$\begin{aligned}
\mathbf{a}^{l} &\approx W_V [X'; X] \left(W_K [X'; X]\right)^T \mathbf{q} \\
&= \left(W_V X \left(W_K X\right)^T + W_V X' \left(W_K X'\right)^T\right) \mathbf{q} \\
&= \left(W_{\text{ZSL}} + \sum_i \left((W_V \mathbf{x}'_i) \otimes (W_K \mathbf{x}'_i)\right)\right) \mathbf{q}.
\end{aligned} \tag{2}$$

We define $W_{\text{ZSL}} = W_V X \left(W_K X\right)^T$ as the initialized parameters since it is the attention result in the Zero-Shot Learning (ZSL) setting.
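The decomposition in Eqn. 2 can be checked numerically. The sketch below uses random matrices as stand-ins for $W_V$, $W_K$, $X'$, $X$ and $\mathbf{q}$ (all sizes and values are arbitrary assumptions) and confirms that linear attention over the concatenated context equals the zero-shot term plus a sum of per-token outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_demo, n_query = 4, 6, 3           # hidden size, demonstration tokens, zero-shot tokens

W_V = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
X_demo = rng.normal(size=(d, n_demo))  # X': one column per demonstration token
X_zsl = rng.normal(size=(d, n_query))  # X: query and last separate token
q = rng.normal(size=d)                 # attention query vector of the last separate token

# First line of Eqn. 2: linear attention over the concatenated context [X'; X].
X_all = np.concatenate([X_demo, X_zsl], axis=1)
a_full = W_V @ X_all @ (W_K @ X_all).T @ q

# Last line of Eqn. 2: zero-shot term plus one outer product per demonstration token.
W_zsl = W_V @ X_zsl @ (W_K @ X_zsl).T
delta_W = sum(np.outer(W_V @ x, W_K @ x) for x in X_demo.T)
a_split = (W_zsl + delta_W) @ q

print(np.allclose(a_full, a_split))
```

The identity holds only for the relaxed linear attention; once the softmax is restored, the two terms no longer separate additively.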

To draw a meaningful comparison between attention activation and parameters trained through gradient descent, we now shift our focus towards analyzing a simple linear transformation represented by $\mathbf{y}_i = W\mathbf{x}_i$. Given a loss function $\mathcal{L}$ and the learning rate $\eta$, the gradient of the linear weight is:

$$\nabla_W \mathcal{L}(\mathbf{y}_i) = \frac{\partial \mathcal{L}(\mathbf{y}_i)}{\partial \mathbf{y}_i} \frac{\partial \mathbf{y}_i}{\partial W} = \nabla_{\mathbf{y}_i} \mathcal{L}(\mathbf{y}_i)\, \mathbf{x}_i^T. \tag{3}$$

Denoting the back-propagated errors as $\mathbf{e}_i = -\eta \nabla_{\mathbf{y}_i} \mathcal{L}$, we can get the full batch gradient with training examples:

$$\Delta W_{GD} = \sum_i \mathbf{e}_i \otimes \mathbf{x}'_i, \tag{4}$$

where $\mathbf{x}'_i$ denotes the $i$-th input training example. Hence, if we treat $W_K \mathbf{x}'_i$ in Eqn. 2 as the training examples and take $W_V \mathbf{x}'_i \approx \mathbf{e}_i$ as corresponding to some meta-gradients (Dai et al., 2023; Natan et al., 2023), the activation can be written as:

$$\mathbf{a}^{l} = \left(W_{\text{ZSL}} + \sum_i \mathbf{e}_i \otimes W_K \mathbf{x}'_i\right) \mathbf{q} = \left(W_{\text{ZSL}} + \Delta W_{GD}\right) \mathbf{q}. \tag{5}$$

Hence, it can be inferred that the output activation $\mathbf{a}^{l}$ can be regarded as parameters trained via gradient descent which utilizes the demonstrations as training instances.
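To make the dual form concrete, the sketch below takes one full-batch gradient step on a toy linear layer and shows that applying the updated weights to $\mathbf{q}$ gives the same result as an unnormalised attention read-out with keys $\mathbf{x}'_i$ and values $\mathbf{e}_i$. The squared-error loss, the random data, and all variable names are assumptions chosen for illustration; the source leaves $\mathcal{L}$ unspecified.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n, eta = 5, 3, 8, 0.1

W0 = rng.normal(size=(d_out, d_in))     # initial weights (plays the role of W_ZSL)
X = rng.normal(size=(n, d_in))          # training inputs x'_i, one per row
targets = rng.normal(size=(n, d_out))   # regression targets (illustrative only)
q = rng.normal(size=d_in)               # test-time input

# Assumed loss: L_i = 0.5 * ||W x_i - t_i||^2, so dL/dy_i = W x_i - t_i.
errors = -eta * (X @ W0.T - targets)    # e_i = -eta * dL/dy_i, one per row

# Eqn. 4: the full-batch update is a sum of outer products e_i (x) x'_i.
delta_W = sum(np.outer(e, x) for e, x in zip(errors, X))

# Parameter view: apply the updated weights to q.
out_params = (W0 + delta_W) @ q
# Attention view: values e_i weighted by unnormalised key-query scores x'_i . q.
out_attention = W0 @ q + errors.T @ (X @ q)

print(np.allclose(out_params, out_attention))
```

In Eqn. 5 the keys are $W_K \mathbf{x}'_i$ rather than $\mathbf{x}'_i$ themselves; the projection is dropped here only to keep the toy example short.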

With the above dual form between activation and trained parameters, and in light of observations that transformers tend to learn the ICL function primarily in their first $L$ layers (Wang et al., 2023), we have the following hypothesis: During the process of ICL, the first $L$ layers progressively update the flow of information using each example in the demonstration through forward computation. The processing state of ICL is then stored within the activation of the attention head. The subsequent layers access and utilize the processing state to reinstate the ICL function, which is used implicitly for predicting the queries. Therefore we concatenate the activation in the initial $L$ layers and introduce the notation of the state vector:

$$\mathcal{V}^{L}_{N} = \mathop{\Big\|}\limits_{l=1}^{L} \mathbf{a}^{l}, \tag{6}$$

where $L$ is the number of layers and $N$ is the number of examples in the demonstration. $\|$ denotes the concatenation operation. Note that we have a completely different construction strategy and usage compared to the function vector (Todd et al., 2023). Although the task vector (Hendel et al., 2023) may be functionally equivalent in the forward process, the proposed state vector differs significantly in its integration into the model, making it easier and more effective to analyse and interpret.

4 Method

Figure 1: The overall framework of the proposed state vector. The state vectors are extracted from the output activations of attention heads. These state vectors are progressively optimized by inner optimization and momentum optimization, or are aggregated through a divide-and-conquer (D&C) aggregation. Finally, the processed state vector is utilized to intervene in the inference forward pass.

4.1 Overview

As illustrated in Figure 1, our approach initially extracts the state vector from the attention head that corresponds to the final separate token in the first $L$ layers using a demonstration and a dummy query. Then, with the view of treating the state vector as trained parameters, coupled with drawing inspiration from the model soup and the momentum-based gradient optimization algorithm, we introduce two methods that progressively optimize the state vector as test-time adaptation (Liang et al., 2023): (1) inner optimization (§4.2) and (2) momentum optimization (§4.3). Moreover, we propose a divide-and-conquer (D&C) state vector aggregation method for efficiently compressing the ICL function in the multiple example setting (§4.4).

After the state vector optimization or aggregation, we utilize the processed state vector to intervene in the model during the forward inference pass. In particular, we first input a test query in the zero-shot setting or with the demonstration in the few-shot setting. During the forward pass in the first $L$ layers, we replace the attention activation of the last separate token with the corresponding activation in the state vector. In other words, the state vector is leveraged to intervene in the output of the first $L$ transformer layers, blocking the attention of the last separate token to the previous context. With state vector intervention, the transformer learns the ICL function from the processing state stored in the state vector, and continues to make the prediction on the test query.
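The extraction and intervention steps described above can be sketched on a toy model. Everything below is a deliberately simplified stand-in rather than the authors' implementation: the "attention" of each layer is a fixed linear map over the context mean, and the names `forward`, `patch`, and `sep_embedding` are hypothetical. The sketch only shows the mechanics of reading the first $L$ activations at the last separate token and writing them back during a zero-shot pass.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_layers, L = 4, 6, 3
weights = [rng.normal(size=(hidden, hidden)) * 0.1 for _ in range(n_layers)]
sep_embedding = rng.normal(size=hidden)   # residual-stream input of the last separate token


def forward(context, patch=None):
    """Run the toy stack for the last separate token.

    Returns the final residual and the per-layer attention activations.
    `patch` maps a layer index to an activation that overrides that layer's output.
    """
    residual, activations = sep_embedding.copy(), []
    for l, W in enumerate(weights):
        attn = W @ (context.mean(axis=0) + residual)  # toy stand-in for attention over the context
        if patch is not None and l in patch:
            attn = patch[l]                           # intervention: ignore the context at this layer
        activations.append(attn)
        residual = residual + attn
    return residual, activations


few_shot_context = rng.normal(size=(10, hidden))  # demonstrations plus dummy query
zero_shot_context = rng.normal(size=(2, hidden))  # test query only

# Extraction: activations of the first L layers, concatenated as in Eqn. 6.
_, demo_acts = forward(few_shot_context)
state_vector = np.concatenate(demo_acts[:L])

# Intervention: overwrite the first L attention activations in a zero-shot pass.
patch = {l: state_vector[l * hidden:(l + 1) * hidden] for l in range(L)}
patched_out, patched_acts = forward(zero_shot_context, patch=patch)
plain_out, _ = forward(zero_shot_context)
```

In a real transformer the same effect would typically be obtained by hooking the attention output at the last separate token; the dictionary-based patch here is just the simplest way to express that replacement.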

4.2 Inner Optimization

Inspired by the works on the model soup (Wortsman et al., 2022; Chronopoulou et al., 2023), which show that weight-space averaging not only yields performance improvement but also often enhances robustness, we ask the following research question (RQ1): Is it possible to optimize our state vector using the model soup approach? To explore this question, we propose an inner optimization method to improve the effectiveness and robustness of the state vector. Specifically, we extract the state vector not only from the separate token of the dummy query but also from the separate token of each example. Formally, with a forward pass in an $N$-shot ICL setting, we extract $N$ state vectors $\mathcal{V}^{L}_{i}$ ($1 \leq i \leq N$) from the last $N$ separate tokens. Subsequently, we apply a uniform averaging process to these state vectors as follows:

$$\overline{\mathcal{V}}^{L}_{N} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{V}^{L}_{i}, \tag{7}$$

where $\overline{\mathcal{V}}^{L}_{N}$ is the inner optimized state vector, which can be directly utilized for inference intervention or serve as the initial state vector for later momentum optimization.
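Eqn. 7 amounts to a plain mean over the extracted vectors. A minimal sketch, with a hypothetical helper name and made-up two-dimensional vectors:

```python
import numpy as np


def inner_optimize(state_vectors):
    """Uniformly average the N state vectors V_1..V_N (Eqn. 7)."""
    return np.mean(np.stack(state_vectors), axis=0)


# Hypothetical state vectors read off at three separate tokens.
vectors = [np.array([1.0, 0.0]), np.array([3.0, 2.0]), np.array([2.0, 4.0])]
v_bar = inner_optimize(vectors)
print(v_bar)  # [2. 2.]
```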

4.3 Momentum Optimization

Since we view the state vector as parameters trained gradually through demonstration examples, the difference between two state vectors with adjacent corresponding separate tokens can also be regarded as the influence of the middle example, akin to the gradient. Motivated by this understanding, coupled with extensive studies of the gradient optimization algorithm (Sutskever et al., 2013; Duchi et al., 2010; Loshchilov and Hutter, 2019), we direct our focus toward a simple momentum-based gradient optimization algorithm, seeking to answer the following research question (RQ2): Can our state vector be optimized using a momentum-based optimization algorithm? To answer this question, we propose momentum optimization. Formally, we first extract the influence of each example by subtracting two adjacent state vectors:

$$E^{L}_{i} = \mathcal{V}^{L}_{i} - \mathcal{V}^{L}_{i-1}, \tag{8}$$

where $E^{L}_{i}$ is the influence of the $i$-th ($1 < i \leq N$) example in the early $L$ layers. Then, we apply the momentum gradient optimization algorithm to obtain the optimized influence $\widetilde{E}^{L}_{i}$, and add it to the last state vector:

$$\widehat{\mathcal{V}}^{L}_{N} = \overline{\mathcal{V}}^{L}_{N} + \widetilde{E}^{L} = \overline{\mathcal{V}}^{L}_{N} + \texttt{opt}([E^{L}_{i}]_{i=1}^{N}), \tag{9}$$

where $\widehat{\mathcal{V}}^{L}_{N}$ is the momentum optimized state vector and $\overline{\mathcal{V}}^{L}_{N}$ is the inner optimized state vector. $\texttt{opt}(\cdot)$ denotes the momentum gradient optimization algorithm. We also explore various other gradient optimization algorithms in §6.1.
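One way to read Eqns. 8 and 9 is sketched below. The exact form of $\texttt{opt}(\cdot)$ is not spelled out in this section, so the code assumes a classical heavy-ball accumulation with a hypothetical coefficient `beta` and step size `step`, and it uses the influences that Eqn. 8 defines, i.e. those with $1 < i \leq N$. Treat it as an illustration of the idea, not as the released implementation.

```python
import numpy as np


def momentum_optimize(state_vectors, beta=0.5, step=1.0):
    """Sketch of Eqns. 7-9 with an assumed heavy-ball form for opt(.)."""
    stacked = np.stack(state_vectors)        # shape (N, dim): V_1 .. V_N
    v_bar = stacked.mean(axis=0)             # inner optimized state vector (Eqn. 7)
    influences = stacked[1:] - stacked[:-1]  # E_i = V_i - V_{i-1} for 1 < i <= N (Eqn. 8)
    velocity = np.zeros_like(v_bar)
    for e in influences:                     # assumed update: m <- beta * m + E_i
        velocity = beta * velocity + e
    return v_bar + step * velocity           # Eqn. 9 with optimized influence step * m


vectors = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 2.0])]
v_hat = momentum_optimize(vectors, beta=0.5, step=1.0)
print(v_hat)
```

With `step=0.0` the function falls back to the inner optimized vector, which makes it easy to compare the two variants on the same set of extracted state vectors.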

4.4 Divide-and-Conquer Aggregation

In addition to optimizing the state vector to more effectively represent the ICL function from a small number of examples, we also explore its capacity to encapsulate multiple examples within a single vector. However, regular ICL cannot be directly applied to multiple examples due to the context length limitation of current LLMs. This leads us to investigate the following question (RQ3): Can we use the state vector to represent multiple examples that are unmanageable for regular ICL? To address this question, we propose a divide-and-conquer method for state vector aggregation. As depicted in Figure 1, our approach involves two distinct aggregation stages (i.e., the divide stage and the conquer stage). In the divide stage, examples are randomly divided into groups, termed grouped demonstrations. Within each group, a random example is selected to serve as a dummy query, which allows us to extract a group-specific state vector. In the conquer stage, these dummy queries are paired with their corresponding labels to form input-label pairs. From these input-label pairs, we form an aggregated demonstration, add an additional dummy query, and subsequently extract the aggregated state vector. It is worth noting that during the forward pass of aggregated state vector extraction, we utilise the group-specific state vectors to intervene in the attention activations of the separate tokens of their corresponding examples. The divide-and-conquer approach allows us to aggregate the ICL function of each grouped demonstration into its respective group-specific state vector, and subsequently aggregate the ICL function of each group-specific state vector into a single, comprehensive aggregated state vector. This aggregated vector is then utilized for interventions during inference, similarly to the optimized state vector discussed in §4.2 and §4.3. Moreover, in the few-shot setting, the aggregated demonstrations are treated as inference demonstrations. The divide-and-conquer approach effectively circumvents the context-length constraints inherent in LLMs, thereby enabling a more effective and efficient aggregation of information across multiple examples.
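The control flow of the two stages can be sketched as follows. The callable `extract_state_vector(demos, dummy_query, interventions)` is a hypothetical placeholder for a model forward pass, and drawing the additional dummy query from the example pool is an assumption, since the text does not say where it comes from. The toy extractor used in the usage lines simply counts how many examples a vector summarises, which makes it easy to see that every example contributes to the final result.

```python
import random


def divide_and_conquer(examples, group_size, extract_state_vector, seed=0):
    """Two-stage aggregation sketch.

    `interventions` maps an example's position in the aggregated demonstration to the
    group-specific state vector patched in at that example's separate token.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    groups = [shuffled[i:i + group_size] for i in range(0, len(shuffled), group_size)]

    # Divide: one example per group acts as the dummy query; the rest are demonstrations.
    aggregated_demos, group_vectors = [], {}
    for group in groups:
        dummy_idx = rng.randrange(len(group))
        dummy = group[dummy_idx]
        demos = group[:dummy_idx] + group[dummy_idx + 1:]
        group_vectors[len(aggregated_demos)] = extract_state_vector(demos, dummy[0], {})
        aggregated_demos.append(dummy)  # dummy query paired with its label

    # Conquer: extract one vector from the aggregated demonstration plus a further
    # dummy query, patching each group-specific vector in at its own example.
    extra_dummy = rng.choice(examples)[0]
    aggregated = extract_state_vector(aggregated_demos, extra_dummy, group_vectors)
    return aggregated, aggregated_demos


def count_covered(demos, dummy_query, interventions):
    # Toy stand-in: a "state vector" that just counts the examples it summarises.
    return len(demos) + sum(interventions.values())


examples = [(f"x{i}", f"y{i}") for i in range(10)]
coverage, agg = divide_and_conquer(examples, group_size=5, extract_state_vector=count_covered)
print(coverage, len(agg))  # 10 2
```

Each forward pass only ever sees one group or the short aggregated demonstration, which is how the procedure stays within the context window.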

5 Experiment

5.1 Setup

We conduct the evaluation across 12 datasets that encompass different domains.

  • Linguistics includes Antonym (Nguyen et al., 2017), Capitalize, Present-Past, and Singular-Plural (Todd et al., 2023), focusing on transformations in the form or meaning of words.

  • Translation is represented by the English-French (Lample et al., 2018) dataset, which involves translating English words into their French counterparts.

  • Knowledge comprises Country-Capital (Todd et al., 2023), AG News (Zhang et al., 2015), Person-Sport, Person-Instrument, Person-Occupation, Product-Company, and Landmark-Country (Hernandez et al., 2023), which are centred around question-to-answer mappings for commonsense knowledge queries.

We employ Llama-2-7B and GPT-J-6B as our LLMs, chosen for their moderate model sizes, open-source availability, and capability for ICL. We also provide the results with larger models (i.e., Llama-2-13B) in Appendix H. We use Llama-2-7B as the default model unless otherwise specified. Our method is orthogonal to the choice of transformer-based decoder-only autoregressive LLMs.

To simplify evaluation, we restrict outputs to a single token and use first-output-token accuracy as the evaluation metric, as in previous work (Hendel et al., 2023; Todd et al., 2023).
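A minimal sketch of this metric is given below, assuming predictions and labels are already reduced to their first token ids; the function name and the example ids are hypothetical.

```python
def first_token_accuracy(predicted_first_tokens, label_first_tokens):
    """Fraction of queries whose first generated token matches the label's first token."""
    if len(predicted_first_tokens) != len(label_first_tokens):
        raise ValueError("predictions and labels must have the same length")
    if not label_first_tokens:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted_first_tokens, label_first_tokens))
    return hits / len(label_first_tokens)


# Hypothetical token ids: first generated token vs. first token of the gold label.
print(first_token_accuracy([17, 42, 7, 9], [17, 42, 8, 9]))  # 0.75
```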

5.2 Baseline

In the paper, we compare with the following methods:

  • Regular is the baseline for the zero-shot setting that uses only the given query as input, while ICL baseline (Wei et al., 2022) makes predictions on the label by taking both the demonstrations and the given query.

  • Function vector (Todd et al., 2023) is extracted from attention activation using the causal mediation method and is then added to the hidden state of certain transformer layers during inference.

  • Task vector (Hendel et al., 2023) is extracted from the hidden state of the separate token and is used to replace the hidden state at the corresponding layer during inference.

| Model | Setting | Method | Antonym | English-French | Person-Instrument | Person-Occupation | Product-Company | Landmark-Country | Average |
|---|---|---|---|---|---|---|---|---|---|
| Llama-2 | Zero-shot | Regular | 1.0 ± 0.2 | 0.1 ± 0.1 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.4 ± 0.2 | 0.0 ± 0.0 | 0.3 |
| Llama-2 | Zero-shot | Function vector | 45.1 ± 2.0 | 21.6 ± 2.0 | 11.3 ± 10.7 | 0.1 ± 0.1 | 25.6 ± 4.3 | 32.9 ± 21.6 | 22.8 |
| Llama-2 | Zero-shot | Task vector | 56.2 ± 2.8 | 63.2 ± 3.6 | 61.8 ± 8.4 | 27.9 ± 15.2 | 55.5 ± 20.1 | 57.8 ± 26.3 | 53.7 |
| Llama-2 | Zero-shot | State vector (inn.) | 61.0 ± 1.0 | 66.5 ± 2.2 | 67.4 ± 2.6 | 42.7 ± 4.2 | 64.5 ± 10.6 | 81.0 ± 1.7 | 63.9 |
| Llama-2 | Zero-shot | State vector (mom.) | 60.4 ± 0.7 | 67.5 ± 1.8 | 68.7 ± 1.6 | 45.6 ± 5.9 | 71.3 ± 3.6 | 77.7 ± 1.8 | 65.2 |
| Llama-2 | Few-shot | ICL baseline | 64.8 ± 4.8 | 74.3 ± 0.8 | 71.7 ± 3.7 | 56.1 ± 2.7 | 80.8 ± 0.8 | 87.0 ± 0.3 | 72.5 |
| Llama-2 | Few-shot | Function vector | 54.5 ± 0.9 | 65.2 ± 1.4 | 60.8 ± 5.6 | 54.2 ± 2.2 | 76.0 ± 1.3 | 84.2 ± 2.9 | 65.8 |
| Llama-2 | Few-shot | Task vector | 65.7 ± 1.8 | 73.8 ± 0.9 | 66.6 ± 5.2 | 56.4 ± 2.3 | 81.9 ± 1.8 | 86.7 ± 0.9 | 71.8 |
| Llama-2 | Few-shot | State vector (inn.) | 66.2 ± 1.6 | 74.6 ± 0.9 | 70.1 ± 4.3 | 57.0 ± 2.2 | 82.8 ± 1.6 | 87.5 ± 0.9 | 73.0 |
| Llama-2 | Few-shot | State vector (mom.) | 65.8 ± 3.7 | 74.3 ± 1.1 | 74.9 ± 2.9 | 58.2 ± 0.4 | 82.0 ± 1.0 | 87.6 ± 0.3 | 73.8 |
| GPT-J | Zero-shot | Regular | 8.1 ± 0.6 | 7.2 ± 0.6 | 0.0 ± 0.0 | 0.0 ± 0.0 | 1.9 ± 0.5 | 0.9 ± 0.2 | 3.0 |
| GPT-J | Zero-shot | Function vector | 33.1 ± 1.8 | 29.1 ± 8.5 | 4.1 ± 5.8 | 11.1 ± 2.3 | 46.3 ± 5.7 | 22.5 ± 10.2 | 24.4 |
| GPT-J | Zero-shot | Task vector | 23.6 ± 3.8 | 32.2 ± 5.1 | 44.4 ± 5.0 | 28.3 ± 18.6 | 43.8 ± 5.7 | 41.3 ± 12.3 | 35.6 |
| GPT-J | Zero-shot | State vector (inn.) | 33.4 ± 1.9 | 31.7 ± 3.8 | 49.3 ± 2.0 | 30.0 ± 6.2 | 42.8 ± 4.3 | 61.9 ± 1.6 | 41.5 |
| GPT-J | Zero-shot | State vector (mom.) | 31.1 ± 1.0 | 35.1 ± 2.4 | 50.3 ± 3.0 | 42.4 ± 1.5 | 44.2 ± 1.5 | 60.3 ± 0.9 | 43.9 |
| GPT-J | Few-shot | ICL baseline | 59.2 ± 1.4 | 69.9 ± 2.0 | 44.7 ± 6.7 | 29.3 ± 1.0 | 62.5 ± 1.0 | 69.3 ± 0.5 | 55.8 |
| GPT-J | Few-shot | Function vector | 56.4 ± 1.9 | 65.8 ± 1.9 | 49.1 ± 2.2 | 30.3 ± 1.9 | 58.5 ± 3.3 | 69.2 ± 0.6 | 54.9 |
| GPT-J | Few-shot | Task vector | 58.5 ± 1.6 | 70.6 ± 1.2 | 42.3 ± 6.4 | 27.8 ± 3.3 | 66.0 ± 2.6 | 63.1 ± 5.3 | 54.7 |
| GPT-J | Few-shot | State vector (inn.) | 58.7 ± 2.2 | 70.9 ± 1.3 | 46.5 ± 4.9 | 29.4 ± 1.7 | 66.3 ± 2.1 | 66.4 ± 2.8 | 56.4 |
| GPT-J | Few-shot | State vector (mom.) | 59.6 ± 1.4 | 70.1 ± 2.2 | 51.9 ± 2.4 | 30.4 ± 1.1 | 63.8 ± 0.8 | 68.6 ± 0.3 | 57.4 |
Table 1: Performance of state vector optimization. The best results in the zero-shot setting are underlined and the best results in the few-shot setting are in bold. The basic state vector is mathematically equivalent to the task vector, so their results coincide. Note that we only present the results for six tasks here and leave the rest to the Appendix. We also report the standard deviation, and the results pass a significance test ($p < .05$).
[Figure 2 panels: (a) Llama-2 Antonym; (b) Llama-2 Person-Instrument; (c) Llama-2 English-French; (d) Llama-2 AG News; (e) GPT-J Antonym; (f) GPT-J Person-Instrument; (g) GPT-J English-French; (h) GPT-J AG News]
Figure 2: Performance of aggregation across number of examples. Avg. denotes the average aggregation baseline and D&C. denotes the divide-and-conquer aggregation. The X axis represents the number of examples, and the Y axis represents the accuracy.

5.3 Inner Optimization (RQ1)

As shown in Table 1, our inner-optimized state vector improves significantly over the task vector and the function vector in both zero-shot and few-shot settings. In the zero-shot setting, inner optimization yields an average improvement over the task vector of 10.2 percentage points on Llama-2 and 5.9 points on GPT-J across six datasets. In the few-shot setting, inner optimization also achieves an improvement of 1.2 points on Llama-2 and 1.7 points on GPT-J. These improvements demonstrate the effectiveness of inner optimization. However, although state vector (inn.) outperforms the task vector, its few-shot performance on some datasets is inferior to the ICL baseline. We attribute this primarily to the introduction of query information from examples. While inner optimization enhances task-relevant information in the state vector, it also introduces noise from other dummy queries, hindering the model's ability to focus on the current query and thereby reducing performance. In addition to the performance improvements, inner optimization also alleviates the high variance of the original task vector in the zero-shot setting. In practical use, the performance of the task vector is sensitive to the choice of demonstrations and dummy queries, which weakens its robustness. Inner optimization mitigates this issue, with a motivation similar to that of model averaging, thereby enhancing the robustness of the state vector.
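The averaging intuition can be sketched as below. We assume here that inner optimization amounts to a uniform average over the state vectors from the last seven example positions (the count given in Appendix A); the exact formulation is the one in §4.2, and the synthetic vectors, noise scale, and function name are illustrative only.

```python
import numpy as np


def inner_optimize(state_vectors: np.ndarray, last_k: int = 7) -> np.ndarray:
    """Uniformly average the last `last_k` state vectors (rows).

    Assumption: rows are ordered by example position in the demonstration.
    """
    if not 1 <= last_k <= len(state_vectors):
        raise ValueError("last_k must be between 1 and the number of vectors")
    return state_vectors[-last_k:].mean(axis=0)


# Synthetic stand-in: a shared "task" signal plus per-position noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=16)
vectors = signal + rng.normal(scale=0.5, size=(10, 16))

averaged = inner_optimize(vectors)
print("distance of last vector to signal:", np.linalg.norm(vectors[-1] - signal))
print("distance of averaged vector to signal:", np.linalg.norm(averaged - signal))
```

In this toy setting the noise is independent across positions, which is the case where averaging is expected to help; real state vectors need not satisfy that assumption.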

5.4 Momentum Optimization (RQ2)

As depicted in Table 1, building upon the inner-optimized state vector, our momentum optimization further enhances the state vector, achieving the best average performance in all settings. In the zero-shot setting, momentum optimization improves on the inner-optimized state vector by an average of 1.3 percentage points on Llama-2 and 2.4 points on GPT-J. In the few-shot setting, the state vector with momentum optimization achieves an average increase of 0.8 points on Llama-2 and 1.0 points on GPT-J. This shows the effectiveness of momentum optimization. With the combination of inner optimization and momentum optimization, our state vector (mom.) surpasses the original variant (the task vector) by 11.5 points for Llama-2 and 8.3 points for GPT-J in the zero-shot setting. In the few-shot setting, our state vector (mom.) still outperforms the task vector by 2.0 points for Llama-2 and 2.7 points for GPT-J. Furthermore, without any demonstration in the input during inference, the state vector (mom.) reaches about 90% of the ICL baseline performance on Llama-2 and 78% on GPT-J. When compared to ICL with the same examples as the demonstration, state vector (mom.) outperforms ICL on average for both Llama-2 and GPT-J. These improvements verify the effectiveness of our progressive optimization strategy. Note that applying momentum optimization directly to task vectors does not yield average improvements across tasks in our preliminary experiment. We speculate that this inconsistency stems from the poor robustness of the task vectors, which hinders stable momentum optimization and leads to poor performance on some tasks.
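The figures quoted above follow from the Average column of Table 1; the short check below recomputes them. Gains are absolute differences in accuracy, and the zero-shot ratios divide the zero-shot state vector (mom.) average by the few-shot ICL baseline average. The dictionary layout and key names are ours.

```python
# Average accuracies copied from Table 1.
averages = {
    "Llama-2": {"zero": {"task": 53.7, "inn": 63.9, "mom": 65.2},
                "few": {"icl": 72.5, "task": 71.8, "inn": 73.0, "mom": 73.8}},
    "GPT-J": {"zero": {"task": 35.6, "inn": 41.5, "mom": 43.9},
              "few": {"icl": 55.8, "task": 54.7, "inn": 56.4, "mom": 57.4}},
}

summary = {}
for model, rows in averages.items():
    summary[model] = {
        # Momentum over inner optimization, per setting.
        "mom_vs_inn_zero": round(rows["zero"]["mom"] - rows["zero"]["inn"], 1),
        "mom_vs_inn_few": round(rows["few"]["mom"] - rows["few"]["inn"], 1),
        # Momentum-optimized state vector over the task vector.
        "mom_vs_task_zero": round(rows["zero"]["mom"] - rows["zero"]["task"], 1),
        "mom_vs_task_few": round(rows["few"]["mom"] - rows["few"]["task"], 1),
        # Share of few-shot ICL accuracy retained without demonstrations.
        "zero_shot_share_of_icl": rows["zero"]["mom"] / rows["few"]["icl"],
    }

for model, stats in summary.items():
    print(model, stats)
```

The retained shares come out at roughly 0.899 for Llama-2 and 0.787 for GPT-J.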

5.5 Divide-and-Conquer Aggregation (RQ3)

In this experiment, we explore the performance of D&C state vector aggregation across varying numbers of examples. Besides the regular and ICL baselines mentioned above, we introduce average aggregation as a strong baseline. This approach first extracts a state vector from each example group and then uses their arithmetic mean as the aggregated vector. We compare our D&C aggregation method with the baselines using from 10 to 100 examples across two models. Due to limited computational resources, we were not able to run an exhaustive search over all datasets. Thus, we only present the results for four tasks.

As illustrated in Figure 2, the D&C aggregation and the average aggregation exhibit similar trends in both few-shot and zero-shot settings. The performance of both aggregation methods initially falls short of the ICL baseline, but it improves as the number of examples increases. The initial poor performance can be attributed to the limited number of state vectors. Additionally, although the D&C aggregation initially falls behind the average aggregation, it improves more substantially as the number of examples increases and ultimately outperforms average aggregation in the multiple-example setting, highlighting the effectiveness of D&C aggregation.

| Method | Zero-shot | Few-shot |
|---|---|---|
| ICL baseline | 0.2 ± 0.4 | 71.0 ± 10.8 |
| Task vector | 52.9 ± 9.4 | 68.5 ± 10.5 |
| State vector (mom.) | 65.2 ± 10.2 | 72.2 ± 10.6 |
| State vector (adag.) | 11.7 ± 12.0 | 16.1 ± 10.2 |
| State vector (rms.) | 0.8 ± 0.9 | 1.5 ± 1.0 |
| State vector (adam.) | 6.7 ± 6.1 | 10.6 ± 8.5 |
Table 2: Performance comparison of gradient optimization algorithms. The method means the optimization algorithm applied to the $\texttt{opt}(\cdot)$ in Eqn. 9.
Figure 3: Average zero-shot performance across six datasets for each choice of the intermediate layer $L$. The solid line means the average value, while the shaded area indicates the standard deviation.

6 Analysis

6.1 Ablation with Other Optimization Methods

We present an ablation study of several classical gradient optimization algorithms, aiming to delve deeper into state vector optimization. We compare the momentum-based gradient optimization algorithm with the following additional first-order gradient optimization algorithms: Adagrad (adag.) (Duchi et al., 2010), RMSprop (rms.) (Graves, 2013) and Adam (adam.) (Kingma and Ba, 2015). As shown in Table 2, we observe a significant decrease in state vector performance with these algorithms, unlike with momentum-based optimization. This outcome indicates a discrepancy between the state vector and parameters updated with gradient descent. It suggests that these first-order gradient optimization algorithms may not be well suited to state vector optimization. Due to computational constraints, we were not able to run an exhaustive search over all hyper-parameters.
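To make the comparison concrete, the sketch below contrasts a momentum-style update with an Adagrad-style update. It is one possible reading, not a reproduction of Eqn. 9: we assume that differences between consecutive state vectors play the role of gradients, that the update is added to the last state vector, and that only the final step of each rule is applied. The retention rate 0.5 is the value from Appendix A; the learning rate, function names, and toy inputs are hypothetical.

```python
import numpy as np


def pseudo_gradients(state_vectors: np.ndarray) -> np.ndarray:
    """Differences between consecutive state vectors (assumed 'gradients')."""
    return np.diff(state_vectors, axis=0)


def momentum_update(grads: np.ndarray, retention: float = 0.5) -> np.ndarray:
    """Accumulate the pseudo-gradients with a retained fraction of history."""
    velocity = np.zeros_like(grads[0])
    for g in grads:
        velocity = retention * velocity + g
    return velocity


def adagrad_update(grads: np.ndarray, lr: float = 1.0,
                   eps: float = 1e-8) -> np.ndarray:
    """Final Adagrad-style step: last pseudo-gradient divided by the root
    of the accumulated squared pseudo-gradients (per coordinate)."""
    accum = np.zeros_like(grads[0])
    for g in grads:
        accum = accum + g ** 2
    return lr * grads[-1] / (np.sqrt(accum) + eps)


# Toy one-dimensional "state vectors" at four example positions.
state_vectors = np.array([[0.0], [1.0], [3.0], [6.0]])
grads = pseudo_gradients(state_vectors)          # [[1.], [2.], [3.]]
with_momentum = state_vectors[-1] + momentum_update(grads)
with_adagrad = state_vectors[-1] + adagrad_update(grads)
print(with_momentum, with_adagrad)
```

The adaptive rule rescales every coordinate by its own history, so the size of the step no longer reflects the size of the observed differences; whether this is what degrades performance in our experiments is not something this toy example can establish.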

6.2 Layer Selection

We investigate the impact of layer selection on the extraction of state vectors in transformer models. We evaluate the average performance across different datasets in the zero-shot setting, as illustrated in Figure 3. Our results reveal a two-phase trend: initially, increasing the number of layers used for state vector extraction improves performance, but this improvement reverses beyond the 14th layer. We relate this to the dynamics of ICL function processing in transformers, in line with previous works (Voita et al., 2019; Wang et al., 2023). In the initial layers, transformers are primarily engaged in learning and encapsulating the ICL function within the state vector, so additional layers enrich the functional information in the state vector. In contrast, the later layers prioritize applying this learned information to prediction. Here, additional layers tend to introduce noise, especially from the predicted labels of dummy queries, which may negatively impact performance.

[Figure 4 panels: (a) Antonym; (b) English-French; (c) Product-Company]
Figure 4: 2D PCA visualization of the state vectors in the Antonym, English-French, and Product-Company tasks. Each color represents the state vectors corresponding to examples occupying a specific position in the demonstration, and the outlying cluster corresponds to the first position.

6.3 Qualitative Study

We visualize the original state vectors in the Antonym, English-French, and Product-Company tasks using Principal Component Analysis (PCA). As depicted in Figure 4, we make three observations: (1) State vectors corresponding to examples occupying the same position tend to form distinct clusters. This clustering pattern suggests a high degree of similarity among state vectors within each example position, despite different contexts. (2) A notable separation is evident between the state vectors originating from the first example and those from other positions. This demarcation implies that ICL may begin to function effectively with only a few examples. (3) The clusters move in a consistent way as the example position increases. This trend may indicate an accumulation of task-specific information, where each additional example contributes to a more nuanced understanding of the task by the model. These findings suggest a progressive enhancement in the model's ability to internalize and reflect the subtleties of the task at hand. Moreover, these observations are consistent with the efficacy of momentum optimization, which leverages the observed clustering trend.
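A projection of this kind can be obtained with a few lines of NumPy. The sketch below uses synthetic vectors (one cluster per example position) purely to show the computation; the dimensions, noise level, and names are illustrative, and real state vectors would replace the generated array.

```python
import numpy as np


def pca_2d(vectors: np.ndarray) -> np.ndarray:
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Synthetic stand-in: 5 example positions, 20 demonstrations each.
rng = np.random.default_rng(0)
n_positions, n_runs, dim = 5, 20, 32
centers = rng.normal(size=(n_positions, dim))
vectors = np.concatenate(
    [center + 0.1 * rng.normal(size=(n_runs, dim)) for center in centers])
positions = np.repeat(np.arange(n_positions), n_runs)  # color label per point

coords = pca_2d(vectors)
for pos in range(n_positions):
    print(pos, coords[positions == pos].mean(axis=0))
```

Plotting `coords` colored by `positions` gives a figure of the same type as Figure 4.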

7 Conclusion

In this paper, we reveal that the ICL compressed vector can be viewed as parameters trained through gradient descent on the demonstrations. We then introduce the concept of the state vector, coupled with two optimization methods to enhance the capability of ICL, and conduct comprehensive experiments across two popular LLMs and multiple tasks to support our claim. Furthermore, our approach demonstrates the ability to compress context while maintaining lower variance. In the future, we aim to extend our methods to more complex ICL scenarios and apply them to larger LLMs. We also call for more nuanced and realistic studies of ICL.

References

  • Akyürek et al. [2022] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. ArXiv preprint, abs/2211.15661, 2022.
  • Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Chan et al. [2022] Stephanie Chan, Adam Santoro, Andrew K. Lampinen, Jane Wang, Aaditya Singh, Pierre H. Richemond, James L. McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022.
  • Chronopoulou et al. [2023] Alexandra Chronopoulou, Matthew Peters, Alexander Fraser, and Jesse Dodge. AdapterSoup: Weight averaging to improve generalization of pretrained language models. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2054–2063, 2023.
  • Dai et al. [2023] Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? language models secretly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4005–4019, 2023.
  • Dao et al. [2022] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  • Dong et al. [2023] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. ArXiv preprint, abs/2301.00234, 2023.
  • Duchi et al. [2010] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 257–269, 2010.
  • Graves [2013] Alex Graves. Generating sequences with recurrent neural networks. ArXiv preprint, abs/1308.0850, 2013.
  • Hendel et al. [2023] Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. ArXiv preprint, abs/2310.15916, 2023.
  • Hernandez et al. [2023] Evan Hernandez, Arnab Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. ArXiv preprint, abs/2308.09124, 2023.
  • Hu et al. [2024] Xinshuo Hu, Dongfang Li, Zihao Zheng, Zhenyu Liu, Baotian Hu, and Min Zhang. Separate the wheat from the chaff: Model deficiency unlearning via parameter-efficient module operation. In Proc. of AAAI, 2024.
  • Ilharco et al. [2022] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. ArXiv preprint, abs/2212.04089, 2022.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors, Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023, pages 611–626. ACM, 2023. doi: 10.1145/3600006.3613165.
  • Lample et al. [2018] Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. In International Conference on Learning Representations, 2018.
  • Liang et al. [2023] Jian Liang, Ran He, and Tieniu Tan. A comprehensive survey on test-time adaptation under distribution shifts. ArXiv preprint, abs/2303.15361, 2023.
  • Liu et al. [2023a] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023a.
  • Liu et al. [2023b] Sheng Liu, Lei Xing, and James Zou. In-context vectors: Making in context learning more effective and controllable through latent space steering. ArXiv preprint, abs/2311.06668, 2023b.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  • Merullo et al. [2023] Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language models implement simple word2vec-style vector arithmetic. ArXiv preprint, abs/2305.16130, 2023.
  • Mu et al. [2023] Jesse Mu, Xiang Lisa Li, and Noah D. Goodman. Learning to compress prompts with gist tokens. ArXiv preprint, abs/2304.08467, 2023.
  • Natan et al. [2023] Tomer Bar Natan, Gilad Deutch, Nadav Magar, and Guy Dar. In-context learning and gradient descent revisited. ArXiv preprint, abs/2311.07772, 2023.
  • Nguyen et al. [2017] Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. Distinguishing antonyms and synonyms in a pattern-based neural network. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 76–85, 2017.
  • Olsson et al. [2022] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. ArXiv preprint, abs/2209.11895, 2022.
  • Panigrahi et al. [2023] Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. In International Conference on Machine Learning, 2023.
  • Qian [1999] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
  • Shao et al. [2023] Nan Shao, Zefan Cai, Hanwei Xu, Chonghua Liao, Yanan Zheng, and Zhilin Yang. Compositional task representations for large language models. In International Conference on Learning Representations, 2023.
  • Shen et al. [2023] Lingfeng Shen, Aayush Mishra, and Daniel Khashabi. Do pretrained transformers really learn in-context by gradient descent? ArXiv preprint, abs/2310.08540, 2023.
  • Sutskever et al. [2013] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Proc. of ICML, volume 28 of JMLR Workshop and Conference Proceedings, pages 1139–1147, 2013.
  • Todd et al. [2023] Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. ArXiv preprint, abs/2310.15213, 2023.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, and Kevin R. Stone et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023.
  • Voita et al. [2019] Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proc. of EMNLP, pages 4396–4406, 2019.
  • von Oswald et al. [2023] Johannes von Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, 2023.
  • Wang and Komatsuzaki [2021] Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, 2021.
  • Wang et al. [2023] Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9840–9855, 2023.
  • Wei et al. [2022] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022, 2022.
  • Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 23965–23998, 2022.
  • Xie et al. [2022] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. In International Conference on Learning Representations, 2022.
  • Yang et al. [2023] Jiaxi Yang, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Iterative forward tuning boosts in-context learning in language models. ArXiv preprint, abs/2305.13016, 2023.
  • Yu et al. [2023] Le Yu, Yu Bowen, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. ArXiv preprint, abs/2311.03099, 2023.
  • Zhang et al. [2015] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657, 2015.

Appendix A Implementation Details

In this paper, we use random sampling to create subsets for each dataset. Each subset consists of 10 instances for demonstrations and one instance for a dummy query, since we employ 10-shot as the default ICL setting. The remaining instances are split into test and development sets with a 7:3 ratio. For experiments with multiple examples, we sample 100 instances instead of 10. We use “\rightarrow” as the separate token, similar to previous works. We tried other tokens but observed no significant difference. All experiments are reported over 5 random seeds. The inference mechanism with the state vector described in §4.1 has a key hyper-parameter (i.e., the layer $L$). Previous studies [Hendel et al., 2023] have shown that the choice of $L$ influences performance. We select the best layer for each task according to accuracy on the development set. For the inner optimization in §4.2, we choose the last seven state vectors to optimize. This is because the early state vectors yield subpar performance, primarily due to the limited number of available examples. For the momentum optimization, we choose 0.5 as the retention rate for historical momentum from the options 0.25, 0.5 and 0.75. We run all experiments on a single NVIDIA A100 80G GPU. Each of our experiments consumes between 10 minutes and 8 hours of GPU time, depending on the dataset.
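The sampling protocol and the layer selection can be sketched as follows. The helper names, the integer rounding of the 7:3 split, and the accuracy values in the example are assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple


def make_subset(instances: Sequence, n_demos: int = 10,
                seed: int = 0) -> Tuple[List, object, List, List]:
    """Sample demonstrations and one dummy query, then split the remaining
    instances into test and development sets with a 7:3 ratio."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    demos = shuffled[:n_demos]
    dummy_query = shuffled[n_demos]
    rest = shuffled[n_demos + 1:]
    n_test = len(rest) * 7 // 10  # integer arithmetic avoids float rounding
    return demos, dummy_query, rest[:n_test], rest[n_test:]


def select_layer(dev_accuracy_by_layer: Dict[int, float]) -> int:
    """Pick the layer L with the highest development-set accuracy."""
    return max(dev_accuracy_by_layer, key=dev_accuracy_by_layer.get)


instances = list(range(111))  # placeholder for (input, label) pairs
demos, dummy_query, test_set, dev_set = make_subset(instances, seed=0)
print(len(demos), len(test_set), len(dev_set))

# Hypothetical development accuracies for three candidate layers.
best_layer = select_layer({10: 0.50, 14: 0.62, 18: 0.58})
print(best_layer)
```

Repeating `make_subset` with five different seeds corresponds to the five random seeds over which results are reported.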

Appendix B More Details about Baseline

In this section, we present an in-depth and comprehensive analysis of two baselines (i.e. task vector [Hendel et al., 2023] and function vector [Todd et al., 2023]). Furthermore, we offer a more nuanced comparison with our proposed state vector, highlighting the distinct differences and advantages of our approach.

The task vector is designed to extract the ICL function from the hidden state of a specific layer within the transformer model. Intervention is achieved by directly replacing the corresponding hidden state during inference. In contrast, Todd et al. [2023] first extract the ICL function from the output activations across all attention heads in all transformer layers. These activations are then ranked by their causal effect, quantified by the variance in the model’s output space with or without individual activation interventions. The arithmetic mean of the 10 activations with the highest causal effect is the function vector, which is subsequently added to the hidden state of a specific layer during the inference stage.
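The difference between the two interventions (adding versus replacing) can be illustrated with plain arrays. In this sketch we assume each attention head's output is already expressed in the hidden-state space and that causal-effect scores are given; the function names and toy values are hypothetical and no model is involved.

```python
import numpy as np


def function_vector(head_outputs: np.ndarray, causal_effects: np.ndarray,
                    top_k: int = 10) -> np.ndarray:
    """Mean of the `top_k` head outputs with the highest causal effect.

    head_outputs: (num_heads, hidden_dim); causal_effects: (num_heads,).
    """
    top = np.argsort(causal_effects)[-top_k:]
    return head_outputs[top].mean(axis=0)


def add_intervention(hidden_state: np.ndarray,
                     vector: np.ndarray) -> np.ndarray:
    """Function-vector style: add the vector to the hidden state."""
    return hidden_state + vector


def replace_intervention(hidden_state: np.ndarray,
                         vector: np.ndarray) -> np.ndarray:
    """Task-vector style: the original hidden state is discarded."""
    return vector.copy()


# Toy setup: 20 heads, hidden size 4, head i outputs the constant value i.
effects = np.arange(20, dtype=float)
heads = effects[:, None] * np.ones((20, 4))
fv = function_vector(heads, effects)         # mean of heads 10..19
hidden = np.ones(4)
task_vector = np.full(4, 2.0)

print(add_intervention(hidden, fv))
print(replace_intervention(hidden, task_vector))
```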

In contrast to these methods, our approach for state vector extraction focuses on obtaining the ICL processing state from the output activations of the attention heads within the first $L$ layers. During inference, we replace the corresponding activations with optimized ones. While functionally equivalent to the forward process of the task vector when disregarding state vector optimization (i.e., the vanilla state vector), our approach offers enhanced mechanical explainability. This is attributable to its motivation from the dual form of in-context learning and gradient descent, as explicated in previous work [Dai et al., 2023, Natan et al., 2023]. Furthermore, inspired by the dual form, we focus on the further optimization process. On the other hand, unlike the function vector, which extracts activations based on the causal effects resulting from individual interventions, our method is rooted in the underlying mechanisms of ICL. This strategy not only improves mechanical explainability but also demonstrates greater performance, as evidenced by extensive experiments. Experiments also show notably poor performance of the function vector on certain knowledge-based datasets, such as Person-Occupation.

Appendix C More Details about Datasets

Here, we describe in detail the tasks that we use to evaluate the state vectors.

  • Antonym [Nguyen et al., 2017] contains 2398 word pairs that are antonyms of each other (e.g. “massive” \rightarrow “tiny”). We use the processed version of the dataset from the function vector work [Todd et al., 2023], which keeps only the word pairs where both words can be tokenized as a single token.

  • Capitalize [Todd et al., 2023] contains 813 word pairs that capitalize the first letter of the given input word (e.g. “plan” \rightarrow “Plan”).

  • Present-Past [Todd et al., 2023] contains 293 word pairs, where simple past tense verbs are output when given simple present tense verbs (e.g. “adapt” \rightarrow “adapted”).

  • Singular-Plural [Todd et al., 2023] contains 205 word pairs, where the output is the plural form of a given singular word (e.g., “wallet” \rightarrow “wallets”).

  • English-French [Lample et al., 2018] contains 4698 pairs of words, each consisting of a word in English and its translation into French (e.g., “circle” \rightarrow “cercle”). We use the processed version from the function vector work [Todd et al., 2023].

  • Country-Capital [Todd et al., 2023] contains 197 instances, where the output is the name of the capital city of the given country (e.g. “Angola” \rightarrow “Luanda”).

  • AG News [Zhang et al., 2015] contains 7600 instances. Each instance contains the news headline and the first few sentences of an article as input, and the output labels are Business, Science, Sports, and World.

  • Person-Sport [Hernandez et al., 2023] contains 318 instances. Each instance contains the name of a professional athlete and the sport that they play (e.g. “Hank Aaron” \rightarrow “baseball”).

  • Person-Instrument [Hernandez et al., 2023] contains 510 instances. Each instance contains the name of a professional musician and the instrument they play (e.g. “Tom Fletcher” \rightarrow “guitar”).

  • Person-Occupation [Hernandez et al., 2023] contains 821 instances. Each instance contains the name of a well-known individual and their occupation.

  • Product-Company [Hernandez et al., 2023] contains 522 instances. Each instance contains the name of a commercial product and the company that sells the product.

  • Landmark-Country [Hernandez et al., 2023] contains 836 instances. Each instance contains the name of a landmark and the country in which it is located.

Appendix D Efficiency Analysis

Figure 5: Time efficiency analysis of Llama-2-7B and GPT-J-6B. Inn denotes our state vector with inner optimization. Mom denotes our state vector with momentum optimization.

In this section, we present an efficiency analysis of the two proposed optimization methods. We evaluate the average inference time using 1000 test examples on a single NVIDIA A100 (80G) GPU, covering six main datasets and 10 random seeds per dataset. The results are illustrated in Figure 5. In the zero-shot setting, we compress the ICL function into the state vector, which eliminates the need to concatenate demonstrations during inference. As shown in Figure 5, the proposed inner optimization and momentum optimization triple the inference speed while achieving 89% of the regular ICL performance on Llama-2-7B and 78% on GPT-J-6B (see Table 1 in the paper). In the few-shot setting, the proposed inner optimization and momentum optimization achieve better results than standard ICL at the cost of a minimal loss in inference speed (e.g., 99% and 96% of the regular ICL speed). Moreover, our method is orthogonal to attention speedup techniques, such as FlashAttention [Dao et al., 2022] and PagedAttention [Kwon et al., 2023]. Therefore, our approach can also benefit from these works to achieve further efficiency improvements. We leave the exploration of alternative enhancements as future work.
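The measurement described above can be sketched as follows. This is a minimal illustration, not the authors' benchmarking code: `average_latency`, `relative_speed`, and the stand-in inference functions are hypothetical names, and a real GPU measurement would additionally need to synchronize the device before reading the clock.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def average_latency(run: Callable[[str], object], queries: Sequence[str]) -> float:
    """Mean wall-clock seconds per query for one inference routine."""
    durations = []
    for query in queries:
        start = time.perf_counter()
        run(query)  # hypothetical model call; its result is ignored here
        durations.append(time.perf_counter() - start)
    return mean(durations)


def relative_speed(baseline_latency: float, method_latency: float) -> float:
    """Speed relative to the baseline: 1.0 means equal, 3.0 means three times faster."""
    return baseline_latency / method_latency


if __name__ == "__main__":
    # Placeholder workloads standing in for regular ICL and state-vector inference.
    def regular_icl(query: str) -> str:
        return ("demo " * 200 + query).upper()  # longer input: demonstrations are concatenated

    def state_vector_zero_shot(query: str) -> str:
        return query.upper()  # no demonstrations in the prompt

    queries = ["query"] * 1000
    base = average_latency(regular_icl, queries)
    ours = average_latency(state_vector_zero_shot, queries)
    print(f"relative speed: {relative_speed(base, ours):.2f}x")
```

Reporting a ratio rather than raw seconds keeps the comparison independent of the hardware on which the two routines are timed.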

| Prompt | Llama-2 | +SV |
| --- | --- | --- |
| What instrument did X play? | 8.7±0.7 | 67.3±2.8 |
| Can you tell me which musical instrument was played by X? | 25.1±0.7 | 69.0±4.3 |
| What was the primary instrument of X in their music career? | 17.3±1.4 | 70.3±2.9 |

Table 3: Text portability of the momentum-optimized state vector. The templates are provided with “X” replaced by a query word. “+SV” denotes adding the momentum-optimized state vector.

Appendix E Natural Text Completions

In this section, we evaluate the effectiveness of the momentum-optimized state vector on natural text completions. Given a natural text template, we let the model greedily generate 5 tokens with or without intervention in the zero-shot setting. We use exact match accuracy as the metric. Table 3 shows the results of natural text completions on Llama-2. The performance boosts observed with the momentum-optimized state vector on the separate tokens indicate that it can guide the model to generate answers correctly. We include more examples of natural text completions in Appendix F.
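The scoring step can be sketched as below. The exact matching rule is not spelled out in this appendix, so the sketch assumes a completion counts as correct when the normalized target string occurs in the normalized generated text; `normalize`, `is_match`, and `accuracy` are hypothetical helper names, and the example strings are illustrative only.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())


def is_match(generated: str, target: str) -> bool:
    """Assumed rule: the target must appear verbatim in the short generation."""
    return normalize(target) in normalize(generated)


def accuracy(generations: Sequence[str], targets: Sequence[str]) -> float:
    """Percentage of generations that match their target."""
    if not targets or len(generations) != len(targets):
        raise ValueError("generations and targets must be non-empty and equally long")
    hits = sum(is_match(g, t) for g, t in zip(generations, targets))
    return 100.0 * hits / len(targets)


if __name__ == "__main__":
    generations = ["It is biographie.", "A written account of someone's life."]
    targets = ["biographie", "biographie"]
    print(accuracy(generations, targets))  # 50.0
```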

Appendix F Case Study

| Task | Setting | Text |
| --- | --- | --- |
| English-French | Prompt | What is the meaning of biography? |
| English-French | Llama-2 | A written account of someone’s life. |
| English-French | + state vector | It is biographie. |
| Antonym | Prompt | When I think of upright, I think of |
| Antonym | Llama-2 | I think of a person who is standing up for what they believe in. |
| Antonym | + state vector | I think of down. |

Table 4: Natural prompt cases with the momentum-optimized state vector on the Antonym task and the English-French task.

In this section, we present a case study, shown in Table 4, to demonstrate the efficacy of the momentum-optimized state vector in natural text completions. Consider the query “What is the meaning of biography?”. The vanilla Llama-2 model answers this question directly. However, when influenced by an English-French state vector, Llama-2 changes its response and translates the queried word into French instead. Similarly, when presented with the sentence “When I think of upright, I think of” and influenced by an Antonym state vector, Llama-2 completes the sentence with an antonym. These instances show the model applying the ICL function stored in the momentum-optimized state vector, which enables it to generate content relevant to the specified task.

Appendix G Full Result

| Model | Setting | Method | Capitalize | Country-Capital | Present-Past | Singular-Plural | Person-Sport | AG News | Average (All) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2 | Zero-shot | Regular | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.2 |
| Llama-2 | Zero-shot | Function vector | 98.6±0.4 | 67.4±20.7 | 80.2±4.5 | 94.2±0.6 | 1.4±0.5 | 57.7±0.9 | 44.7 |
| Llama-2 | Zero-shot | Task vector | 92.9±6.5 | 92.8±2.8 | 95.2±1.7 | 95.3±1.9 | 86.9±4.5 | 47.8±1.3 | 69.4 |
| Llama-2 | Zero-shot | State vector (inn.) | 99.6±0.4 | 94.0±1.3 | 96.5±1.2 | 97.1±1.0 | 89.7±3.2 | 52.0±5.5 | 76.0 |
| Llama-2 | Zero-shot | State vector (mom.) | 99.1±0.3 | 94.5±0.7 | 96.5±0.7 | 96.6±1.0 | 88.1±2.6 | 50.0±8.3 | 76.3 |
| Llama-2 | Few-shot | ICL baseline | 99.9±0.1 | 95.2±1.0 | 98.3±0.6 | 98.5±0.1 | 94.8±0.2 | 76.0±5.7 | 83.1 |
| Llama-2 | Few-shot | Function vector | 99.7±0.1 | 82.2±3.8 | 94.6±1.7 | 97.3±0.7 | 88.4±1.9 | 80.7±4.6 | 78.1 |
| Llama-2 | Few-shot | Task vector | 98.0±1.0 | 92.9±3.4 | 98.2±0.5 | 98.5±1.3 | 95.4±0.4 | 64.3±8.4 | 81.5 |
| Llama-2 | Few-shot | State vector (inn.) | 99.7±0.1 | 94.4±1.3 | 98.3±0.6 | 98.5±0.4 | 95.2±0.2 | 76.0±8.5 | 83.3 |
| Llama-2 | Few-shot | State vector (mom.) | 99.3±0.1 | 94.9±0.7 | 98.3±0.6 | 98.8±0.3 | 95.7±0.2 | 76.3±5.9 | 83.8 |
| GPT-J | Zero-shot | Regular | 0.3±0.1 | 1.8±1.7 | 19.4±2.1 | 22.7±2.9 | 0.0±0.0 | 0.0±0.0 | 5.2 |
| GPT-J | Zero-shot | Function vector | 66.3±8.4 | 57.0±9.9 | 63.1±2.1 | 69.3±2.1 | 0.8±1.1 | 46.4±4.5 | 37.4 |
| GPT-J | Zero-shot | Task vector | 51.0±4.7 | 31.6±4.8 | 37.0±5.3 | 61.6±1.2 | 46.4±4.0 | 55.0±3.7 | 41.4 |
| GPT-J | Zero-shot | State vector (inn.) | 58.2±1.3 | 45.5±8.3 | 47.3±2.0 | 61.9±0.7 | 51.7±1.8 | 59.7±5.4 | 47.8 |
| GPT-J | Zero-shot | State vector (mom.) | 58.6±0.8 | 52.9±6.1 | 45.9±0.2 | 62.5±0.7 | 51.4±1.4 | 61.3±4.8 | 49.7 |
| GPT-J | Few-shot | ICL baseline | 99.3±0.3 | 88.2±3.4 | 96.9±0.9 | 99.3±0.5 | 82.4±3.5 | 76.3±1.7 | 73.1 |
| GPT-J | Few-shot | Function vector | 98.6±0.6 | 78.6±5.1 | 90.8±1.3 | 95.9±0.9 | 81.6±1.4 | 72.7±3.2 | 70.6 |
| GPT-J | Few-shot | Task vector | 99.3±0.3 | 89.8±2.8 | 97.3±1.0 | 99.3±0.5 | 83.3±3.6 | 63.3±8.7 | 71.7 |
| GPT-J | Few-shot | State vector (inn.) | 99.4±0.3 | 89.2±3.6 | 97.3±0.8 | 99.3±0.5 | 83.8±3.5 | 75.7±1.2 | 73.6 |
| GPT-J | Few-shot | State vector (mom.) | 99.4±0.2 | 90.1±3.5 | 97.6±0.9 | 99.4±0.3 | 83.7±3.0 | 78.0±2.2 | 74.4 |

Table 5: Performance of state vector optimization across the other six tasks and average performance across all tasks. The best results in the zero-shot setting are underlined and the best results in the few-shot setting are in bold. The result of the basic state vector is mathematically equivalent to that of the task vector.

In this section, we provide additional results with the Llama-2-7B and GPT-J models. We first present the optimization results on the six tasks not covered in the main results, together with the average performance across all tasks. As shown in Table 5, our inner optimization and momentum optimization effectively enhance the state vector.

Figure 6: Performance of aggregation across the number of examples on (a) Llama-2 Person-Occupation, (b) Llama-2 Product-Company, (c) GPT-J Person-Occupation, and (d) GPT-J Product-Company. Avg. denotes the average aggregation baseline and D&C. denotes the divide-and-conquer aggregation. The X axis represents the number of examples, and the Y axis represents the accuracy.

Moreover, we provide the results of state vector aggregation on two additional datasets. As shown in Figure 6, as the number of examples increases, the trends of both D&C and average aggregation follow a pattern similar to the main result shown in Figure 2, illustrating the effectiveness of our aggregation methods.

Appendix H Result on Larger Model

| Model | Setting | Method | Antonym | English-French | Person-Instrument | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-13B | Zero-shot | Regular | 1.2±0.7 | 0.2±0.2 | 0.0±0.0 | 0.5 |
| Llama-2-13B | Zero-shot | Function vector | 47.1±1.6 | 23.2±4.3 | 0.1±0.1 | 23.5 |
| Llama-2-13B | Zero-shot | Task vector | 46.0±2.4 | 43.1±7.2 | 58.2±6.3 | 49.1 |
| Llama-2-13B | Zero-shot | State vector (inn.) | 47.0±1.2 | 50.5±1.9 | 66.6±3.1 | 54.7 |
| Llama-2-13B | Zero-shot | State vector (mom.) | 47.9±1.1 | 55.9±3.4 | 68.5±2.0 | 57.4 |
| Llama-2-13B | Few-shot | ICL baseline | 67.0±0.1 | 74.5±1.3 | 75.0±0.2 | 72.2 |
| Llama-2-13B | Few-shot | Function vector | 65.7±1.7 | 75.2±2.6 | 72.2±0.4 | 71.3 |
| Llama-2-13B | Few-shot | Task vector | 64.8±1.2 | 70.5±3.5 | 70.6±3.1 | 68.6 |
| Llama-2-13B | Few-shot | State vector (inn.) | 65.5±0.8 | 75.8±1.6 | 77.0±1.3 | 72.8 |
| Llama-2-13B | Few-shot | State vector (mom.) | 65.9±0.7 | 75.6±0.4 | 78.6±1.1 | 73.4 |

Table 6: Performance of state vector optimization across three tasks on Llama-2-13B. The best results in the zero-shot setting are underlined and the best results in the few-shot setting are in bold. The result of the basic state vector is mathematically equivalent to that of the task vector.
Figure 7: Performance of aggregation on Llama-2-13B across the number of examples on (a) AG News, (b) Antonym, (c) English-French, and (d) Product-Company. Avg. denotes the average aggregation baseline and D&C. denotes the divide-and-conquer aggregation. The X axis represents the number of examples, and the Y axis represents the accuracy.

In this section, we provide the optimization and aggregation results on a larger model. We choose Llama-2-13B because its memory requirements fit our hardware. We present the results of the optimization methods on three representative datasets in Table 6, and the results of the aggregation method on four representative datasets in Figure 7. The results show that our inner and momentum optimization and the D&C aggregation method also benefit the state vector in the larger-model setting.
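As a reading aid for Table 6, the Average column appears to be the unweighted mean of the three per-task accuracies, rounded to one decimal place. The short check below reproduces it for the two momentum-optimized rows; that the column is computed this way is an assumption inferred from the numbers, and `task_average` is a hypothetical helper name.

```python
from typing import Sequence


def task_average(scores: Sequence[float]) -> float:
    """Unweighted mean over tasks, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


# Per-task accuracies (Antonym, English-French, Person-Instrument) copied from Table 6.
zero_shot_mom = (47.9, 55.9, 68.5)
few_shot_mom = (65.9, 75.6, 78.6)

print(task_average(zero_shot_mom))  # 57.4
print(task_average(few_shot_mom))   # 73.4
```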

Appendix I Qualitative Study

Figure 8: The 2D PCA visualization of the state vector in (a) the Antonym task and (b) the English-French task of GPT-J, where each color represents the state vector corresponding to examples occupying specific positions in the demonstration and the outlier is of the first order.

In Figure 8, we present a Principal Component Analysis (PCA) visualization of the original state vector in GPT-J for both the Antonym task and the English-French translation task. Note that the cluster distributions observed in GPT-J closely mirror those of Llama-2. This similarity indicates that the consistent and progressive enhancement in model capacity originally identified in Llama-2 in §6.3 also appears in GPT-J. Such findings demonstrate the broad applicability and generalizability of our momentum optimization approach across different models.
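A 2D projection of this kind can be sketched with an SVD-based PCA. The snippet assumes NumPy is available and uses random vectors as placeholders for real state vectors; the array shape, the `pca_2d` helper, and the position labels are illustrative assumptions rather than the setup behind Figure 8.

```python
import numpy as np


def pca_2d(vectors: np.ndarray) -> np.ndarray:
    """Project row vectors onto their first two principal components."""
    x = np.asarray(vectors, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)  # PCA requires mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder data: 10 demonstration positions x 20 samples, 64-dimensional vectors.
    positions = np.repeat(np.arange(10), 20)
    state_vectors = rng.normal(size=(200, 64)) + positions[:, None] * 0.5
    points = pca_2d(state_vectors)
    print(points.shape)  # (200, 2)
```

Coloring the projected points by `positions` would give a plot in the style of Figure 8, with one color per example position in the demonstration.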

Figure 9: Standard deviation of performance on Llama-2 across three datasets: (a) demonstration robustness and (b) dummy robustness.

Appendix J Robustness Analysis

In this appendix, we examine the robustness of the state vector with inner optimization. Specifically, we evaluate the task vector and the inner-optimized state vector on the Llama-2 model, focusing on three tasks. We measure and report the standard deviation of performance using 100 diverse demonstrations or dummy queries. As illustrated in Figure 9, our analysis yields three key observations:

  • The task vector and state vector exhibit greater sensitivity to dummy queries than to demonstrations. This finding suggests that dummy queries have a greater impact on performance compared to demonstrations, underscoring the importance of reducing the noise from dummy queries to enhance state vector performance.

  • In the few-shot setting, both the task vector and the state vector (inn.) exhibit significantly greater robustness than in the zero-shot setting. There is a noticeable reduction in the standard deviation across diverse demonstrations or dummy queries when applying demonstrations during ICL inference. This improvement may be attributed to the richer ICL function information provided by demonstrations, which in turn bolsters performance stability.

  • Compared to the task vector, our inner optimized state vector shows markedly enhanced robustness to the variations in demonstrations and dummy queries, in both zero-shot and few-shot settings. This highlights the effectiveness of our proposed inner optimization in improving the robustness of the state vector.
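The robustness measure used in these observations can be sketched as the standard deviation of accuracy over repeated runs, where a lower value means a more stable method. The sketch assumes the sample standard deviation, which this appendix does not specify, and the accuracy lists are made-up placeholders rather than reported results.

```python
from statistics import stdev
from typing import Sequence


def robustness_std(accuracies: Sequence[float]) -> float:
    """Sample standard deviation of accuracy across runs; lower means more robust."""
    return stdev(accuracies)


if __name__ == "__main__":
    # Hypothetical accuracies from runs with different demonstrations or dummy queries.
    task_vector_runs = [40.0, 55.0, 48.0, 62.0, 45.0]
    state_vector_runs = [52.0, 55.0, 53.0, 56.0, 54.0]
    print(robustness_std(task_vector_runs) > robustness_std(state_vector_runs))  # True
```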

Appendix K Limitation

The definition of state vectors is contingent upon specific assumptions and lacks a rigorous theoretical foundation, which may impact its generalizability and reliability across different NLP tasks. Additionally, the experiments were conducted on a limited scale with moderate-sized models and datasets. These constraints may affect the applicability of the results to larger models or more complex datasets. Further research will explore these aspects to establish a more robust validation of the proposed methods.