Context Embeddings for
Efficient Answer Generation in RAG

David Rau (University of Amsterdam, Amsterdam, Netherlands), Shuai Wang (The University of Queensland, Brisbane, Australia), Hervé Déjean (Naver Labs Europe, Grenoble, France), and Stéphane Clinchant (Naver Labs Europe, Grenoble, France)

Abstract.

Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly increases the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method allows for different compression rates, trading off decoding time for answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69$\times$ while achieving higher performance than existing efficient context compression methods.

Context Compression, LLM, RAG

1. Introduction

Figure 1. COCOM: Compressing multiple contexts for RAG into a small set of Context Embeddings ($\xi\in\{4,16,128\}$) leads to a massive speed-up in answer generation while maintaining higher performance compared to other methods. Results are shown for the ASQA dataset.

Large Language Models (LLMs) are pre-trained on massive amounts of textual data; for instance, Llama 2 (Touvron et al., 2023) was trained on 2 trillion tokens. Through billions of learnable parameters, LLMs not only excel at modeling language but also build up a knowledge base that can later be used for question answering. However, the model is limited to the knowledge contained in the pre-training data. In knowledge-intensive scenarios, relying solely on the parametric memory of the model is often insufficient. To alleviate this, context can be provided explicitly from an external source through a preceding retrieval step (Retrieval-Augmented Generation, RAG). Although LLMs show notable improvements when given additional relevant context in knowledge-intensive tasks, this approach has limitations.

A key drawback is that adding more context to the input considerably slows down generation during inference. This occurs because the cost of the self-attention mechanism in transformers grows quadratically in compute and memory with increasing input length. At the same time, previous research has shown that providing multiple documents as context can improve RAG performance (Izacard et al., 2022; Hsia et al., 2024). This is particularly critical for QA applications where reasoning over context from multiple documents is necessary, such as in multi-doc QA tasks (Fan et al., 2019; Joshi et al., 2017; Yang et al., 2018). In fact, the observation that modern transformers can naturally cope with many context documents for answer generation in open-domain QA tasks was central to the development of RAG (Dehghani et al., 2019; Izacard and Grave, 2021). However, as the input length grows, the position bias in LLMs might further complicate the extraction of relevant information (Liu et al., 2023).

Previous work has shown that the increased generation time in RAG can be alleviated by reducing the model’s input through context compression. This can be achieved either by applying lexical-based compression, where unimportant terms or tokens in the context are identified and filtered out during generation (Jiang et al., 2023), or by embedding-based compression, where embedding models transform the context into fewer embedding tokens in the LLM input (Ge et al., 2024; Tan et al., 2024; Cheng et al., 2024; Muennighoff et al., 2024). Notably, state-of-the-art embedding-based compression methods often achieve higher effectiveness and lower latency compared to lexical-based compression methods (Cheng et al., 2024).

However, despite the current embedding-based compression approaches achieving lower latency in RAG systems, several limitations remain:

• Large compressor model: To achieve high effectiveness, these methods rely on large compression models, e.g., (Cheng et al., 2024; Muennighoff et al., 2024).
• Low effectiveness: Current embedding-based compression methods underuse the potential of LLMs for answer generation, as they tune only some model components and leave the decoder LLM untuned. We hypothesize that freezing the decoder hinders the use of compressed contexts.
• Fixed compression rate: Current methods do not offer different compression rates with respect to the length of the input context, which would allow trading off inference time for generation quality while retaining high effectiveness.
• Single document limitation: Current effective methods only support using a single document context to generate answers.

We address the described limitations, similar to concurrently developed methods, by compressing contexts into a small number of context embeddings, which are then provided as input to the LLM. This allows us to reduce the input size to a fraction of its surface form, which leads to a reduced decoding time during answer generation. We call our model COCOM (COntext COmpression Model), a multi-context compression method leveraging a single model for context compression and answer generation.

We further show that with appropriate pretraining and tuning approaches, our compression model achieves significantly higher effectiveness than current context compression approaches (see Figure 1). We summarize our contributions as follows:

• We present COCOM, an effective context compression method that reduces long contexts to only a handful of context embeddings, speeding up generation while achieving higher performance compared to other methods.
• In an efficiency study, we demonstrate the efficiency-effectiveness trade-offs achievable with different compression rates. We further illustrate the time and memory required for compression. We reduce inference time by up to 5.69$\times$ and GFLOPs by up to 22$\times$ while maintaining high performance.
• We conduct an ablation to understand which factors matter most for effective generation, analyzing the impact of the pretraining collection, pretraining, fine-tuning on the target dataset, and freezing or training the decoder.

The rest of this paper is structured in the following way. Section 2 discusses related work on RAG, efficiency, and compression approaches. We continue in Section 3 discussing the RAG task and our novel COCOM approach to effective context compression. Section 4 details the experimental setup in terms of the RAG models and the five QA tasks. In Section 5, we present the main COCOM results in terms of effectiveness and efficiency. Section 6 conducts further analysis of how compression affects the model. We end with discussion and conclusions in Section 7, and limitations in Section 8.

Table 1. Comparison to previous works on Embedding-based Context Compression.

Work	Light Compressor	Decoder Tuning	Adaptable $\gamma$	Multi-Doc	Efficient Answer Generation
GridLM (Muennighoff et al., 2024)	✗	✓	✗	✗	✗
AutoCompressor (Chevalier et al., 2023)	✗	✓	✓	✓	✓
ICAE (Ge et al., 2024)	✗	✗	✓	✓	✓
xRAG (Cheng et al., 2024)	✗	✗	✗	✗	✓
COCOM-light (ours)	✓	✓	✓	✓	✓
COCOM (ours)	✗	✓	✓	✓	✓

2. Related Work

In this section, we discuss related work on RAG, efficiency, and compression approaches.

The initial motivation for this work stems from a recent study by Morris et al. (2023), which demonstrates that a bag-of-words representation of the original surface terms can be recovered from text embeddings. This observation that embeddings can encapsulate the content of an entire passage inspired the idea to provide context in the form of an embedding rather than the original context in token form to an LLM.

The underlying motivation for reducing the input size in the context of RAG is, as mentioned earlier, the computational cost of contextualizing long inputs and the resulting increase in decoding time (Asai et al., 2024). We address this by reducing the provided context to only a handful of context embeddings that are provided directly to the LLM.

Reducing the input to RAG models is a very active research field, with many works being done concurrently with ours. Among those works, two primary lines of research have emerged: embedding-based and lexical-based context compression. We discuss them in the following.

2.1. Lexical-based Compression.

Lexical-based compression focuses on either selecting tokens from the context (Li, 2023) or summarizing contexts (Xu et al., 2023), both aiming to retain essential information while reducing the overall context size. LLMLingua comprises a query-independent token filtering module that uses an LLM to first select important tokens in the context. Then, a query-dependent token classifier is used to select tokens to form the compressed context.

On the other hand, Zhu et al. (2024) do not consider compression at the term level, but at the document level. Retrieved documents are either included or excluded with respect to the query. Only the included documents form the context for answer generation. It is worth noting that current lexical-based compression approaches all rely on specific query inputs. Therefore, compression needs to be (partially) performed online, which prevents compressing documents offline and slows down generation.

2.2. Embedding-based Compression.

Embedding-based compression approaches focus on compressing the context into one or multiple summary embeddings that can be directly interpreted by the decoder model. The first work in this line is AutoCompressor (Chevalier et al., 2023). This approach compresses contextual information by splitting it into randomly segmented chunks and subsequently aggregating these into summary embeddings through an iterative process until the target embedding size is met. However, the training of the summary embeddings relies exclusively on next-token prediction, raising concerns about their ability to effectively encapsulate relevant contextual data. Furthermore, AutoCompressor is designed primarily for long contexts, generating a minimum of 50 summary embeddings. Such a configuration is not suitable for common RAG pipelines where short passages are retrieved, such as KILT.

Building on AutoCompressor, ICAE by Ge et al. (2024) explores training a context compressor using the same LLM as the decoder model and compresses only once to obtain the summary embeddings. However, their approach limits the model’s capacity by using a frozen decoder module, which prevents the decoder from being updated during training. In this paper, we argue that decoder training is an important factor that strongly impacts the performance of the model. We illustrate this argument in Section 4.2.1.

Furthermore, GridLM (Muennighoff et al., 2024) addresses the issue of decoding the same context twice: first for retrieval and then again as the context provided to the LLM. They use the same LLM for ranking and generation, which allows them to cache all representations while encoding the contexts and to reuse them during generation. Compared to ours, this approach is limited to a single context, does not speed up decoding, and results in very large storage requirements.

Cheng et al. (2024) propose xRAG concurrently to our method. They directly reuse frozen ranking representations based on embedding models while freezing the decoder. Although this approach successfully resolves the double decoding problem, it suffers from low effectiveness because the representation is not trained prior to its application to compression tasks. This issue becomes particularly challenging when lightweight encoder models, such as DPR with 109 million parameters, are used as compressors. In such cases, the model achieves similar effectiveness to the Mistral-7b model when retrieval is not applied.¹ Moreover, using retrieval representations from lightweight models for compression is counter-intuitive. Representations gathered from retrieval tasks may lack sufficient information to fully recover the context. Conversely, representations learned for compression demonstrate their capacity to reconstruct the original context (Ge et al., 2024). This suggests that, upon further adjustment, such representations may show a higher potential to serve as an effective retriever.

¹ By default, xRAG uses a 7B SFR LLM-based ranking model as compressor.

2.3. Overview

In Table 1 we contrast our method with the described related works on embedding-based compression. It is important to note that most previous works mentioned so far have only considered cases that may not directly apply to RAG settings but rather to long-context question answering. In their setting, only one relevant document is used for each query to fulfill the user request.

Therefore, such models are not naturally able to deal effectively with multiple documents. Furthermore, their reported effectiveness may not directly indicate the final performance in RAG systems, where a document may be irrelevant and multiple top-retrieved documents are often used. As a decoder model should, by design, be able to handle multiple context representations, we argue that fine-tuning the decoder is a simple yet necessary solution compared to existing works.

3. Methodology

In this section, we detail the RAG task and our novel COCOM approach to effective context compression.

3.1. Task Definition: RAG

RAG employs a ranking system $\mathcal{R}$ and a parametric generative language model $\theta_{LLM}$, where the ranking system can be multi-staged. First, the ranking system builds a search index $\mathcal{I}$ based on a collection. Then, at request time, the index $\mathcal{I}$ is searched, yielding context segments² $\mathcal{C}$ that are relevant to the user input $x$: $f_{\mathcal{I},\mathcal{R}}:\{x\}\rightarrow\mathcal{C}$.

² The segments can be at different granularities, for instance sentences, passages, or entire documents. In this work, we focus on passages.

Next, the LLM generates a response $r$ based on the context $\mathcal{C}$ and user input $x$ :

(1)

\theta_{LLM}:\{\mathcal{C},x\}\rightarrow r

Note that in RAG the context is added to the input of the LLM, which dramatically increases the input length, as $|\mathcal{C}|\gg|x|$.

3.2. COCOM: Effective Context Compression

The main idea of COCOM is to enhance efficiency by compressing the context, which is typically given in surface form as input tokens, into a smaller set of context embeddings, which then serve as the input to the LLM. An overview of our entire pipeline is given in Figure 2. More formally, our approach can be described as follows:

Given a context $\mathcal{C}$ tokenized into a sequence of tokens $\{t_{1},t_{2},\dots,t_{n}\}$ and a compressor model $\phi_{comp}$, we compress $\mathcal{C}$ into context embeddings $\mathcal{E}$, a smaller set of embeddings $\{e_{1},e_{2},\ldots,e_{k}\}$, where $k\ll n$. Each embedding $e_{i}\in\mathbb{R}^{d}$, with $d$ being the LLM’s hidden dimension.

(2)

\phi_{comp}:\{t_{1},t_{2},\ldots,t_{n}\}\rightarrow\{e_{1},e_{2},\ldots,e_{k}\}\in\mathbb{R}^{d}

Next, based on the compressed context embeddings $\mathcal{E}$ and the user input $x$, the LLM $\theta_{LLM}$ generates a response $r$:

(3)

\theta_{LLM}:\{\mathcal{E},x\}\rightarrow r

The $\phi_{comp}$ model is trained to generate context embeddings that capture the content of the input tokens in a compressed form. As both models are trained jointly, $\theta_{LLM}$ learns to decode these context embeddings, extracting the relevant information required to answer user queries.

COCOM compresses contexts into context embeddings independently of the question. This means that individual contexts not only have to be contextualized by an LLM just once, but can also be pre-computed offline and stored, drastically reducing the computational cost of the LLM at inference time. Further, by feeding only a small number of context embeddings instead of the long context, the input size is reduced to a fraction, leading to a massive speed-up in answer generation.

For COCOM, we utilize the same model for compression and answer generation, i.e., $\phi_{comp}=\theta_{LLM}$. Therefore, we effectively train a single model on the two tasks. For the compression task, we prepend a special token <AE> to the input and, depending on $\xi$, append a different number of context embedding tokens <CTX> at the end of the sequence. We directly use the representations of the last hidden layer as our context embeddings, which serve as input to the same model for answer generation.
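The input layout described above can be illustrated with a minimal sketch. The function name `build_compressor_input` and the toy tokens are hypothetical. Taking the number of <CTX> tokens as the floor of n divided by the compression rate follows Section 3.2.1, and reading the context embeddings off the trailing <CTX> positions is an assumption about how the last-layer representations are selected.

```python
def build_compressor_input(context_tokens, compression_rate,
                           ae_token="<AE>", ctx_token="<CTX>"):
    """Lay out the compressor input as <AE> + context tokens + k <CTX> placeholders."""
    k = len(context_tokens) // compression_rate  # number of context embeddings
    sequence = [ae_token] + list(context_tokens) + [ctx_token] * k
    # Positions whose last-layer hidden states would serve as context embeddings
    # (assumed here to be the trailing <CTX> positions).
    ctx_positions = list(range(len(sequence) - k, len(sequence)))
    return sequence, ctx_positions


toy_context = [f"t{i}" for i in range(1, 9)]  # toy context with n = 8 tokens
sequence, ctx_positions = build_compressor_input(toy_context, compression_rate=4)
print(sequence)
print(ctx_positions)
```

In a real model the strings would be token ids and the selected positions would index into the hidden-state tensor of the last layer.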

As demonstrated later in the experiments, our method also allows us to potentially employ any embedding model as a compressor, including more lightweight encoder-only models such as BERT (see Section 5.2).

3.2.1. Adaptable Compression Rate

The number of context embeddings $k=|\mathcal{E}|$ can be varied, which allows us to control the level of compression of the original context $\mathcal{C}=\{t_{1},\dots,t_{n}\}$. We calculate the number of context embeddings $k$ per context $\mathcal{C}$ based on a compression rate $\xi$ and the length of the tokenized input $n=|\mathcal{C}|$:

(4)

k=\left\lfloor\frac{n}{\xi}\right\rfloor

For instance, when compressing a context with length $n=128$ with a compression rate $\xi=64$ we obtain 2 context embeddings, reducing the input by 64 times.
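As a small worked sketch of this calculation, the hypothetical helper `num_context_embeddings` below applies the floor division to the example above and to the three compression rates used in the experiments. How contexts shorter than the compression rate are handled is not stated in the text; the sketch simply returns what the formula yields.

```python
def num_context_embeddings(n_tokens: int, compression_rate: int) -> int:
    """Number of context embeddings for a context of n_tokens tokens."""
    if compression_rate <= 0:
        raise ValueError("compression_rate must be positive")
    return n_tokens // compression_rate  # floor(n / compression rate)


# The example from the text: n = 128 and a compression rate of 64.
print(num_context_embeddings(128, 64))

# The three compression rates used in the experiments, for a 128-token context.
for rate in (4, 16, 128):
    print(rate, num_context_embeddings(128, rate))
```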

3.2.2. Multiple Contexts

Knowledge-intensive tasks can benefit from providing the context of multiple retrieved passages (Izacard et al., 2022; Hsia et al., 2024), especially where reasoning over multiple contexts is necessary to solve the task (Fan et al., 2019; Joshi et al., 2017; Yang et al., 2018). In classical RAG the contexts of multiple passages are concatenated and provided to the model. Similarly in COCOM we can provide context embeddings of multiple passages to the LLM. Contexts are compressed independently following Equation 2. We add [SEP] special tokens between the context embeddings before feeding them to the LLM to distinguish context stemming from different passages in the input.
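The assembly of multiple compressed passages can be sketched as follows. The function names and placeholder strings are hypothetical, and placing the question after the context embeddings is an assumption, since the exact prompt layout is not specified here.

```python
def assemble_llm_input(passage_embeddings, question_tokens, sep_token="[SEP]"):
    """Join per-passage context embeddings with [SEP] and append the question."""
    sequence = []
    for index, embeddings in enumerate(passage_embeddings):
        if index > 0:
            sequence.append(sep_token)  # separator between passages only
        sequence.extend(embeddings)
    return sequence + list(question_tokens)


def compressed_context_length(num_passages, tokens_per_passage, compression_rate):
    """Input positions used by the compressed contexts, including separators."""
    k = tokens_per_passage // compression_rate
    return num_passages * k + max(num_passages - 1, 0)


# Three passages, each already compressed into two (placeholder) embeddings.
passages = [["e1_1", "e1_2"], ["e2_1", "e2_2"], ["e3_1", "e3_2"]]
llm_input = assemble_llm_input(passages, ["who", "wrote", "it", "?"])
print(llm_input)
print(compressed_context_length(5, 128, 16))
```

Assuming five passages of 128 tokens each, the uncompressed context would occupy 640 positions, whereas at a compression rate of 16 the compressed contexts plus separators occupy 44.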

3.3. Pre-training Context Embeddings

We propose two auto-regressive variations of the next-token prediction task to learn to compress context into context embeddings and to use these context embeddings as input to the LLM.

Following our earlier notation, the objective function for the standard next token prediction for input $\mathcal{X}=\{x_{1},x_{2},\dots,x_{T}\}$ can be written as:

(5)

\mathcal{L}(\theta_{LLM})=-\sum_{x_{t}\in\mathcal{X}}\log P_{\theta_{LLM}}(x_{t}\mid x_{1},x_{2},\ldots,x_{t-1})

3.3.1. Auto-encoding with Context Embeddings.

We modify the next token prediction task to recover the original input tokens from the compressed context embeddings $\mathcal{E}$ . This way we jointly train the compressor and LLM to decompress the original input which can be seen as a form of auto-encoding.

(6)

\mathcal{E}=\phi_{comp}(x_{1},x_{2},\dots,x_{T})

(7)

\mathcal{L}(\theta_{LLM},\phi_{comp})=-\sum_{x_{t}\in\mathcal{X}}\log P_{\theta_{LLM}}(x_{t}\mid\mathcal{E},x_{1},\dots,x_{t-1})

This task serves as a preliminary step toward our final objective of answering questions from context embeddings. For this objective, we first aim to learn to compress and decompress the same input effectively.

3.3.2. Language Modeling from Context Embeddings.

Our final task is to answer questions based on the context embeddings. To this end, in our language modeling task, we train the model to continue a given input conditioned on context embeddings. This way the model learns not only to compress a given input but also to leverage the content of the context embeddings effectively.

We split input $\mathcal{X}=\{x_{1},x_{2},\dots,x_{T}\}$ into $\mathcal{X}_{A}=\{x_{1},x_{2},\dots,x_{j}\}$ and $\mathcal{X}_{B}=\{x_{j+1},\dots,x_{T}\}$. After compressing the first part $\mathcal{X}_{A}$ into $\mathcal{E}_{A}=\phi_{comp}(\mathcal{X}_{A})$, we learn to generate the continuation, namely the second part $\mathcal{X}_{B}$, based on the compressed representations $\mathcal{E}_{A}$. This can be seen as a variation of the next-token prediction task, but conditioned on context embeddings.

(8)

\mathcal{L}(\theta_{LLM},\phi_{comp})=-\sum_{x_{t}\in\mathcal{X}_{B}}\log P_{\theta_{LLM}}\big(x_{t}\mid\phi_{comp}(\mathcal{X}_{A}),x_{j+1},\dots,x_{t-1}\big)

This language modeling task is complementary to the auto-encoding task. If we employed only the auto-encoding task, the LLM would be biased towards merely recovering the original input instead of leveraging the content of the context embeddings.
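The construction of a language-modeling sample can be sketched as below. The helper `make_lm_example` is hypothetical, and both the choice of the split index and the assumption that the decoder sees only the compressed first part plus the preceding tokens of the second part are illustrative readings of the task description.

```python
def make_lm_example(tokens, split_index):
    """Split a token sequence into a part to compress and a part to predict."""
    if not 0 < split_index < len(tokens):
        raise ValueError("split_index must leave both parts non-empty")
    part_a, part_b = tokens[:split_index], tokens[split_index:]
    # Teacher forcing: each target in part B is predicted from the compressed
    # part A plus the part-B tokens that precede it.
    steps = [(part_b[:i], part_b[i]) for i in range(len(part_b))]
    return part_a, steps


tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
to_compress, steps = make_lm_example(tokens, split_index=4)
print(to_compress)
for visible_prefix, target in steps:
    print(visible_prefix, "->", target)
```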

3.4. Fine-tuning

For the downstream RAG application, we fine-tune the model on a question $q$ and the relevant context(s), which are retrieved by a retrieval system and compressed into context embeddings $\mathcal{E}$; both are combined into an instruction $I_{q,\mathcal{E}}$. We train the LLM to generate the target response $R=(r_{1},r_{2},\dots,r_{T})$. We fine-tune our models on a combined set of publicly available QA datasets. We employ instruction fine-tuning, updating the models only based on the target responses.

(9)

\mathcal{L}(\theta_{LLM},\phi_{comp})=-\sum_{r_{t}\in R}\log P_{\theta_{LLM}}(r_{t}\mid I_{q,\mathcal{E}},r_{1},r_{2},\dots,r_{t-1})
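Restricting the loss to the target response is commonly implemented by masking the instruction positions in the label sequence. The sketch below assumes this masking approach and uses -100, the default `ignore_index` of PyTorch's cross-entropy loss; the helper name and the toy ids are hypothetical, and the source does not state how the restriction is implemented.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(instruction_ids, response_ids):
    """Return model inputs and labels where only response positions are scored."""
    input_ids = list(instruction_ids) + list(response_ids)
    # Instruction positions (question and context embeddings) contribute no loss.
    labels = [IGNORE_INDEX] * len(instruction_ids) + list(response_ids)
    return input_ids, labels


input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)
print(labels)
```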

4. Experimental Setup

In this section, we detail our experimental setup in terms of the RAG models and the five QA tasks.

4.1. Implementation Details

We use Mistral-7B-Instruct-v0.2⁴ as our backbone LLM for answer generation. For context compression in COCOM, we utilize the same model. For our more lightweight context compression in COCOM-light, we employ bert-base-uncased.⁵ We apply three different compression rates: $\xi\in\{4,16,128\}$. We employ SPLADE-v3 (Lassance et al., 2024) with reranking of the top-50 using DeBERTa-v3 (He et al., 2021) as our retrieval system. For all our experiments, we use the top-5 documents as context.

⁴ https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
⁵ https://huggingface.co/google/bert-base-uncased

4.2. Training

For both pre-training and fine-tuning, we apply parameter-efficient LoRA tuning.

4.2.1. Pre-training

For pre-training, we employ the two tasks described earlier: auto-encoding and language modeling. Samples are drawn randomly with equal probability from both tasks; we tried different ratios but found this to perform best. Efficient batch processing requires that every sample in a batch contains a fixed-length tokenized input. To achieve this, we split the Wikipedia-KILT (Petroni et al., 2020) corpus⁶ into chunks of 128 tokens using the Llama-2-7b tokenizer. We pre-train on 10 million samples in total. Training hyperparameters can be found in Table 10 in the Appendix.

⁶ We publish this resource as a Huggingface dataset under https://huggingface.co/datasets/dmrau/kilt-128.
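The data preparation described above can be sketched in a few lines. The helper names are hypothetical, token ids are replaced by plain integers, and dropping the final chunk when it is shorter than 128 tokens is an assumption made to keep every sample at a fixed length.

```python
import random


def chunk_tokens(token_ids, chunk_size=128):
    """Split token ids into consecutive fixed-length chunks.

    The trailing remainder shorter than chunk_size is dropped (an assumption).
    """
    usable = len(token_ids) - len(token_ids) % chunk_size
    return [token_ids[i:i + chunk_size] for i in range(0, usable, chunk_size)]


def sample_task(rng):
    """Pick one of the two pre-training tasks with equal probability."""
    return rng.choice(["auto_encoding", "language_modeling"])


chunks = chunk_tokens(list(range(300)))
rng = random.Random(0)
tasks = [sample_task(rng) for _ in range(1000)]
print(len(chunks), [len(chunk) for chunk in chunks])
print(tasks[:5])
```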

4.2.2. Fine-tuning

The BERGEN library (Rau et al., 2024) is used to fine-tune the model. We fine-tune our models on various datasets concurrently. To construct our fine-tuning dataset,⁷ we combine training samples from Natural Questions (Kwiatkowski et al., 2019), MS MARCO⁸ (Nguyen et al., 2016), adversarial QA (Bartolo et al., 2020), HotpotQA (Yang et al., 2018), WikiQA (Yang et al., 2015), SCIQ (Johannes Welbl, 2017), ASQA (Stelmakh et al., 2022), TriviaQA (Joshi et al., 2017), Freebase QA (Jiang et al., 2019), and SQuAD (Rajpurkar et al., 2016), all of which are question answering datasets. We then filter out queries with more than 128 tokens and labels with more than 64 tokens. For more details, we refer to Table 12 in the Appendix. Training hyperparameters can be found in Table 11 in the Appendix.

⁷ We publish the dataset under https://huggingface.co/datasets/dmrau/multi_qa.
⁸ We select only the first 100k queries.

4.3. Evaluation

We evaluate our model on several widely used QA datasets: Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), ASQA (Stelmakh et al., 2022), and PopQA (Mallen et al., 2023).

4.3.1. Metrics

As our main metric, following the standard protocol to evaluate fine-tuned models we use Exact Match (EM). To compare our results to previous works, which partially rely on untuned decoders and therefore produce verbose answers, we revert to the Match metric (M), which indicates whether the label is contained (as an exact match) in the generated answer.
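The difference between the two metrics can be made concrete with a short sketch. The normalization used here (lowercasing, removing punctuation and English articles) is a common convention for QA evaluation and is an assumption; the exact normalization used in the experiments is not described in this section.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def match(prediction, references):
    """1.0 if any normalized reference is contained in the normalized prediction."""
    pred = normalize(prediction)
    refs = [normalize(ref) for ref in references]
    return float(any(ref and ref in pred for ref in refs))


verbose_answer = "The answer is Paris."
print(exact_match(verbose_answer, ["Paris"]), match(verbose_answer, ["Paris"]))
print(exact_match("Paris", ["Paris"]), match("Paris", ["Paris"]))
```

A verbose answer that contains the label is counted by Match but not by EM, which is why Match is the fairer metric for untuned decoders.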

4.4. Baselines without Context Compression

We fine-tune the base model (Mistral-7B-Instruct-v0.2):

• RAG (upper bound): The model receives the top-5 retrieved contexts alongside the query and answers the question. This model applies no context compression and serves as an upper bound in our experiments.
• Closed Book (lower bound, w/o RAG): The LLM generates an answer based on the query without any provided context. This serves as a lower-bound baseline.

4.5. Baselines with Context Compression

We compare our models to the context compression methods listed below. As mentioned earlier, these models tune only parts of their model components on the downstream data but leave their decoder LLM untuned, applying it zero-shot. We argue that this is a major limitation, as answering questions from context embeddings differs fundamentally from standard language modeling, which hinders the model from effectively leveraging the context embeddings.

To ensure comparability among approaches we use the same retrieval system as mentioned earlier in Section 4.1.

• AutoCompressor (Chevalier et al., 2023): We use the princeton-nlp/AutoCompressor-Llama-2-7b-6k checkpoint, which produces 50 summary vectors. As the model is limited to compressing a single context, we use only the top retrieved document as context.
• ICAE (Ge et al., 2024): We use the Mistral-7B-Instruct-v0.2 LoRA checkpoint,⁹ which uses the same base LLM as ours and is therefore directly comparable. ICAE is fine-tuned to compress a single long context, whereas we use multiple contexts. To bridge this gap, we concatenate the top five retrieved documents as the context input for the model and truncate it to a maximum length of 512 tokens. Note that the model has a maximum output length of 128 compressed tokens, which corresponds to a compression rate of approximately 4 relative to the concatenated context input.
• xRAG: We utilize the xRAG-7b¹⁰ and the 8x7B mixture-of-experts model¹¹ alongside their strongest SFR compressor. The base model is again the same as ours, to ensure comparability. As their model is limited to compressing a single context into a single compressed representation, we use the top retrieved document for the xRAG setting.¹² We apply their predefined stopping criteria for answer generation, which aim at curbing the verbosity of an untuned decoder LLM.

⁹ https://huggingface.co/sggetao/icae
¹⁰ https://huggingface.co/Hannibal046/xrag-7b
¹¹ https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
¹² We also tested compressing five documents together, which yielded lower effectiveness.

Table 2. Results in Exact Match (EM) comparing COCOM (-light) to other context compression works. For the Match metric (M), see Table 9 in the Appendix. All methods use 5 context passages unless indicated otherwise. ^★ Method limited to a single context. ^△ Upper baseline. ^▽ Lower baseline. ^∗ indicates statistical non-significance (p > 0.05) with respect to COCOM ($\xi$=4).

Decoder		Method	Compression rate ( $\xi$ )	Dataset
				NQ	TriviaQA	HotpotQA	ASQA	PopQA	Average
Zero-shot		AutoCompressor (Chevalier et al., 2023)^★	$\times$ 4	0.000	0.000	0.000	0.000	0.000	0.000
		ICAE (Ge et al., 2024)	$\times$ 4	0.210	0.592	0.184	0.222	0.290	0.300
		xRAG (Cheng et al., 2024)^★
		Mistral-7B-v0.2	$\times$ 128	0.184	0.622	0.185	0.182	0.199	0.274
		Mixtral-8x7b	$\times$ 128	0.265	0.744	0.239	0.292	0.318	0.372
Fine-tuned	Mistral-7B-v0.2	RAG^△ (no compression)	-	0.597	0.883	0.500	0.622*	0.514	0.623
		LLM^▽ (without context)	-	0.359	0.708	0.264	0.546	0.199	0.416
		COCOM-light (ours)	$\times$ 4	0.539	0.849	0.409	0.601*	0.458	0.571
			$\times$ 16	0.492	0.823	0.367	0.565	0.385	0.526
			$\times$ 128	0.444	0.794	0.321	0.550	0.314	0.485
		COCOM (ours)	$\times$ 4	0.554	0.859	0.430	0.609	0.474	0.585
			$\times$ 16	0.539	0.852*	0.426*	0.602*	0.465	0.577
			$\times$ 128	0.511	0.835	0.378	0.585*	0.391	0.540

5. Results

In this section, we present the main COCOM and COCOM-light results in terms of effectiveness and efficiency.

5.1. Main Results

The main results for COCOM are presented in Table 2. We measure performance following the standard practice for fine-tuned models using the Exact Match (EM) metric. Compared to existing context compression methods,¹³ our approach demonstrates significantly higher effectiveness (paired t-test, p < 0.05) across different compression rates for all datasets tested. COCOM even outperforms by a large margin the much stronger xRAG Mixtral-8x7B model, which has 8 times more parameters than COCOM. The highest performance is observed at a low compression rate ($\xi$=4). Increasing the compression rate results in a slight performance decline, which we analyze further in Section 6.1.

¹³ As mentioned in Section 2, existing context compression methods do not tune the decoder LLM and therefore compare their performance and make effectiveness claims against zero-shot baselines. However, we argue that tuning compression models while freezing the decoder LLM cannot be considered zero-shot, as it involves tuning some parts of the model on the task data. This setting is akin to soft-prompt tuning (Cuconasu et al., 2024; Li and Liang, 2021), where the compressor model effectively parameterizes the soft prompt. Consequently, the performance of these methods should be regarded as intermediate between zero-shot and full decoder tuning and should be compared against similar tuning settings, such as soft-prompt tuning.

Compared to our upper bound baseline RAG without compression, we reduce the context by up to 128 times while still maintaining relatively high performance on average over datasets.

Performance decreases on average by 4 points for our strongest model (COCOM, $\xi$=4) and by about 8 points for the highest compression rate (COCOM, $\xi$=128). Compared to the lower-bound baseline (LLM without provided context), we gain up to 17 points while adding only a small number of context embeddings to the input.

Note, while EM is a standard metric for evaluating tuned models, it might underestimate zero-shot decoder methods that do not adapt the decoder to generate answers. To address this, we also provide results using the Match metric in the appendix in Table 9. Although models that do not tune their decoder achieve relatively higher performance when measured in Match, our method’s effectiveness compared to other methods still remains consistently significantly higher.

Overall, considering the effectiveness and the efficiency gains from context compression (discussed further in Section 5.3), COCOM shows a very favorable trade-off.

5.2. COCOM-light

Even though context compression has to be done only once offline, using a very large LLM can be costly, especially in resource-constrained settings. To this end, we propose COCOM-light, a computationally efficient context compression model based on BERT as a context compressor.

To alleviate the dimensional mismatch between the BERT-based compressor and the (typically larger) LLM, we learn a linear projection layer $\bm{W}^{\gamma\times d}$, where $\gamma$ is the compression rate (denoted $\xi$ elsewhere) and $d$ the hidden dimension of the LLM. To obtain a set of Context Embeddings, we leverage the last hidden representation of each input token. We simply split the hidden representations into blocks of length $\gamma$ and project each block into a single Context Embedding. This way, we learn a block-wise aggregation of the input representations that, depending on the input length and the compression rate $\gamma$, yields a different number of Context Embeddings per input. Note that a similar approach is applied in xRAG, where a projection layer is used on the embedding vector to resolve the dimensional mismatch. However, we argue that compressing into a single embedding vector could significantly restrict the compression quality, especially when using lightweight encoder models such as BERT. This restriction can result in much lower effectiveness compared to using a larger embedding model (Cheng et al., 2024).
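One possible reading of this block-wise projection is sketched below in plain Python. It assumes that the hidden vectors of a block are concatenated and mapped by a single linear layer to the LLM dimension, and that a trailing partial block is dropped; the text does not specify either detail, and the function name, toy dimensions, and constant weights are purely illustrative.

```python
def blockwise_project(hidden_states, weight, block_len):
    """Project each block of block_len encoder vectors to one context embedding.

    hidden_states: n vectors of size d_enc (last-layer encoder outputs).
    weight: (block_len * d_enc) rows by d_llm columns (assumed layout).
    """
    n_blocks = len(hidden_states) // block_len  # partial last block is dropped
    columns = list(zip(*weight))
    embeddings = []
    for b in range(n_blocks):
        block = hidden_states[b * block_len:(b + 1) * block_len]
        flat = [value for vector in block for value in vector]  # concatenate block
        embeddings.append([sum(x * w for x, w in zip(flat, col)) for col in columns])
    return embeddings


n_tokens, d_enc, d_llm, block_len = 8, 3, 5, 4  # toy sizes, not the real ones
hidden_states = [[1.0] * d_enc for _ in range(n_tokens)]
weight = [[0.5] * d_llm for _ in range(block_len * d_enc)]
context_embeddings = blockwise_project(hidden_states, weight, block_len)
print(len(context_embeddings), len(context_embeddings[0]))
```

With eight input tokens and a block length of four, the sketch yields two context embeddings, each of the LLM's (toy) hidden size.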

We present the results in Table 2, measured in EM. Results for Match can again be found in Table 9 in the appendix. We find that COCOM-light is highly effective for small compression rates, but its performance drops considerably for the highest compression rate of $\xi$ =128. Compared to other methods, COCOM-light poses an effective alternative to its bigger counterpart COCOM in resource-constrained settings.

5.3. Computational Efficiency

Table 3. Decoding efficiency in generation time, GPU memory, and number of operations (GFLOPs) for COCOM (-light) on dataset NQ. $\xi$ denotes the compression rate. Factors in parentheses are improvements relative to RAG (no compr.).

Model	$\xi$	Decoding Time	GPU Mem.	GFLOPs
Mistral-7b-v0.2		(ms)	(GB)
RAG (no compr.)	-	1064	18.1	25031
LLM (no context)	-	159	14.1	607
COCOM (-light)	4	371 ( $\times$ 2.87)	15.1 ( $\times$ 1.20)	7016 ( $\times$ 3.57)
	16	213 ( $\times$ 5.00)	14.4 ( $\times$ 1.29)	2465 ( $\times$ 10.16)
	128	187 ( $\times$ 5.69)	14.2 ( $\times$ 1.27)	1138 ( $\times$ 22.00)

Table 4. Compression efficiency and storage requirements for compressing 24m contexts on a single A100 80GB GPU.

Compressor	$\xi$	Time (h)	Index size (TB)
COCOM	4	89	6.06
	16	77	1.51
	128	73	0.19
COCOM-light	4	1	6.06
	16	1	1.51
	128	1	0.19

We measure efficiency in answer generation time (ms), maximum GPU memory (GB), and number of operations per generated token (Giga FLOPs) using the torch profiler. We run the experiments on a single A100 40GB with a fixed batch size of 16, the maximum batch size that fits on the GPU across models. We load the model in half-precision and use the PyTorch inference mode. We discard the first warm-up batch from the measurement and measure the bare forward pass of the model. Note that decoding results are independent of the compressor; therefore, COCOM and COCOM-light share efficiency results.
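
The measurement protocol (time the forward pass per batch and discard the warm-up batch) can be sketched as follows. This is a CPU-only stand-in that uses the Python standard library instead of the torch profiler; `mean_latency_ms` and `toy_forward` are hypothetical names introduced for illustration only.

```python
import time


def mean_latency_ms(forward, batches, warmup=1):
    """Average wall-clock time per batch in milliseconds, skipping warm-up batches."""
    timings = []
    for index, batch in enumerate(batches):
        start = time.perf_counter()
        forward(batch)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if index >= warmup:  # discard the first `warmup` batches
            timings.append(elapsed_ms)
    if not timings:
        raise ValueError("need more batches than warm-up steps")
    return sum(timings) / len(timings)


def toy_forward(batch):
    """Hypothetical stand-in for a model forward pass."""
    return sum(x * x for x in batch)


batches = [list(range(1000)) for _ in range(5)]
print(f"{mean_latency_ms(toy_forward, batches):.3f} ms per batch")
```

On a GPU, the device would additionally need to be synchronized before reading the clock, since kernels are launched asynchronously.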

We show our efficiency results for answer generation in Table 3 for different compression rates $\xi$ and compare them to RAG without context compression. Context compression with COCOM drastically reduces answer generation time, GPU memory, and the number of operations, with up to 5.69 $\times$ lower inference time, 1.27 $\times$ lower GPU memory, and 22 $\times$ fewer GFLOPs compared to no compression.
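
As a quick check of the reported decoding speed-ups, the factors can be recomputed from the decoding times in Table 3. The snippet below only uses the values printed in the table; the helper name `speedup` is ours.

```python
# Decoding times in milliseconds, copied from Table 3 (dataset NQ).
BASELINE_MS = 1064  # RAG without compression
COMPRESSED_MS = {4: 371, 16: 213, 128: 187}  # compression rate -> decoding time


def speedup(baseline_ms, compressed_ms):
    """Factor by which decoding time shrinks relative to the baseline."""
    return baseline_ms / compressed_ms


for rate, time_ms in COMPRESSED_MS.items():
    print(f"xi={rate}: x{speedup(BASELINE_MS, time_ms):.2f}")
# xi=4: x2.87
# xi=16: x5.00
# xi=128: x5.69
```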

In addition, Table 4 presents the costs of compressing all documents in the kilt-100w collection (24m contexts) with COCOM and COCOM-light at various compression rates. By employing a computationally much lighter compression module, COCOM-light compresses the collection significantly faster than COCOM (up to 89 $\times$ ). Index size varies inversely with the compression rate: higher compression rates result in smaller index storage requirements. However, this trade-off leads to lower answer quality, as shown in Table 2.
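
The inverse relation between index size and compression rate can be checked against Table 4. The sketch below assumes that storage scales exactly as $1/\xi$ and extrapolates from the reported size at $\xi$ =4; the function name is hypothetical.

```python
REFERENCE_RATE = 4
REFERENCE_SIZE_TB = 6.06  # index size at compression rate 4 (Table 4)
REPORTED_TB = {16: 1.51, 128: 0.19}  # remaining index sizes from Table 4


def predicted_index_size_tb(rate):
    """Index size under the assumption that it scales as 1 / rate."""
    return REFERENCE_SIZE_TB * REFERENCE_RATE / rate


for rate, reported in REPORTED_TB.items():
    predicted = predicted_index_size_tb(rate)
    print(f"xi={rate}: predicted {predicted:.3f} TB, reported {reported} TB")
```

The predictions agree with the reported sizes up to rounding, which is consistent with storing a number of Context Embeddings per context that is proportional to $1/\xi$.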

5.4. Ablations

In the following section, we run additional ablation experiments for COCOM and COCOM-light. Most results can be found in Table 6. We report performance in Exact Match on two datasets (NQ and ASQA).

5.4.1. Handling multiple contexts.

In Table 5, we compare the performance of COCOM with one retrieved context ( $k=1$ ) versus our default setup of $k=5$ . On both datasets and for all compression rates, we observe a substantial gain when using more contexts. Moreover, COCOM with one retrieved context is still significantly better than the baselines that rely on a single retrieved document (ICAE, xRAG) in Table 2. As a decoder model should by design be able to handle multiple context representations, we argue that fine-tuning the decoder is a simple yet necessary solution compared to existing works.
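
To illustrate why multiple contexts remain cheap after compression, the sketch below counts how many decoder input positions the contexts occupy. It assumes every context has the maximum length of 128 tokens (Table 10) and that a context of $n$ tokens yields $\lceil n/\xi \rceil$ Context Embeddings; question and prompt tokens are ignored, and the function names are hypothetical.

```python
import math

CONTEXT_TOKENS = 128  # assumed length of every retrieved context


def embeddings_per_context(rate, context_tokens=CONTEXT_TOKENS):
    """Number of Context Embeddings for one context at compression rate `rate`."""
    return math.ceil(context_tokens / rate)


def context_positions(rate, k):
    """Decoder input positions taken up by k compressed contexts."""
    return k * embeddings_per_context(rate)


k = 5
print(f"uncompressed: {k * CONTEXT_TOKENS} positions")  # 640
for rate in (4, 16, 128):
    print(f"xi={rate}: {context_positions(rate, k)} positions")
# xi=4: 160 positions
# xi=16: 40 positions
# xi=128: 5 positions
```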

Table 5. Impact of the number of provided contexts on COCOM measured in EM on datasets NQ and ASQA.

Model	$\xi$	NQ		ASQA
		k=5	k=1	k=5	k=1
COCOM	4	0.554	0.4987	0.609	0.558
	16	0.539	0.4913	0.602	0.541
	128	0.511	0.4818	0.585	0.544

5.4.2. Pre-training Context Compression

Central to our approach is the compression of context into a small number of Context Embeddings. We argue that context compression fundamentally differs from the language modeling objective on which the model was originally trained. Consequently, we have employed auto-encoding and language-modeling-from-context-embedding tasks to learn how to effectively compress the context and utilize these compressed representations during decoding. Table 6 shows the impact of the pre-training tasks on downstream performance after fine-tuning. Our results suggest that the dedicated pre-training tasks for context compression improve downstream QA performance, which admits two possible explanations: either context compression is too complex to be learned concurrently with the downstream task, or larger fine-tuning datasets are necessary to effectively learn how to compress contexts.

5.4.3. Pre-training Corpus

Our method employs an initial pre-training step aimed at initializing context compression. We train auto-regressively on the same target corpus, which is later used to retrieve relevant contexts. In this experiment, our objective is to assess how variations in the pre-training corpus impact downstream QA performance, thereby testing the robustness of our approach. To explore this, we additionally pre-train the model on the "sample-10BT" subset of FineWeb (Penedo et al., 2024). We employ the same training methodology described in Section 4.2.1, where we segment the collection into non-overlapping passages of 128 tokens using the Llama-2-7b tokenizer and train on a subset of 10 million tokens, similar to the target corpus. The results presented in Table 6 indicate a slight decrease in performance when pre-training on a corpus other than the target corpus. Nonetheless, our approach demonstrates robustness in handling variations in the pre-training corpus, highlighting its adaptability and effectiveness in context compression.
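
The segmentation into non-overlapping passages can be sketched as below. Whitespace splitting is used here as a stand-in for the Llama-2-7b tokenizer, and keeping the final, shorter passage is an assumption; the function name is hypothetical.

```python
def split_into_passages(tokens, passage_length=128):
    """Split a token sequence into consecutive, non-overlapping passages."""
    return [
        tokens[start:start + passage_length]
        for start in range(0, len(tokens), passage_length)
    ]


# Toy document of 300 "tokens"; a real pipeline would use the model tokenizer.
text = " ".join(f"w{i}" for i in range(300))
tokens = text.split()
passages = split_into_passages(tokens)
print([len(passage) for passage in passages])  # [128, 128, 44]
```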

5.4.4. Decoder LLM Tuning

Existing context compression methods tune only the compression module while keeping the decoder, which is responsible for generating the answer, frozen. A core distinction from these methods is that in COCOM we tune all components, including the decoder. We hypothesize that context embeddings differ significantly from the input token embeddings the model was trained on, thereby hindering effective utilization without dedicated tuning. We investigate the consequences of freezing the decoder and solely tuning the compressor, akin to existing methods (Table 6). Our findings show that tuning the decoder is critical to achieving high effectiveness. This reinforces our hypothesis that dedicated tuning on context embeddings seems essential for better performance.

Table 6. Impact of pre-training corpus, pre-training, and decoder tuning on downstream performance (EM). Compression rate $\xi$ = 128.

Ablation	Datasets
	NQ	ASQA
COCOM-light (baseline)	0.444	0.550
w/o pre-training	0.423	0.524
pre-training on FineWeb	0.427	0.545
w/o tuning decoder	0.353	0.438
COCOM (baseline)	0.519	0.585
w/o pre-training	0.490	0.565
pre-training on FineWeb	0.503	0.581
w/o tuning decoder	0.421	0.521

5.4.5. Fine-tuning Data

In our experiments, we fine-tune our models simultaneously on multiple QA datasets before evaluating them on individual datasets. We explore the impact of this multi-dataset fine-tuning compared to training on a single dataset. Specifically, we fine-tune and evaluate our models on NQ (Natural Questions). For assessing transferability, we also conduct zero-shot evaluations on other datasets. The results are presented in Figure 3. We find that fine-tuning solely on a single dataset, such as NQ, leads to slightly higher performance on that specific dataset. However, training on multiple datasets demonstrates superior transferability across all datasets, resulting in better average performance overall.

6. Analysis

In this section, we conduct further analysis on how compression affects the model.

6.1. Context compression

In our earlier results in Section 5.1, we observe a decline in performance with higher compression rates, particularly for the lightweight compressor in COCOM-light. To investigate potential reasons for this drop, we assess the model’s ability to perform the two pre-training tasks: (i) compressing and decompressing input (auto-encoding) and (ii) language modeling from compressed representations after pre-training.

Our results in Table 7 indicate that both models effectively learn the auto-encoding task at lower compression rates ( $\xi$ =4, 16), but struggle to recover the input when the context is compressed into fewer embeddings ( $\xi$ =128), with this issue being more pronounced for the lightweight compression module.

We identify two possible explanations. First, compressing longer contexts is inherently more challenging and might require additional objectives. Second, decompressing information from a smaller set of Context Embeddings may be more difficult due to the sequential decoding nature of decoder-only models. Introducing additional pause tokens (Goyal et al., 2024) could help alleviate this issue, providing the model with a means to hierarchically decompress information, drawing on ideas from Chain-of-Thought prompting (Wei et al., 2022). We also experimented with pre-training on more samples but found that this did not improve downstream performance. Regarding the second pre-training task, it is noteworthy that COCOM-light outperforms its larger counterpart in language modeling from Context Embeddings. This analysis shows that compressing and reconstructing longer texts is challenging and needs further investigation.

Table 7. Pre-training evaluation on the tasks Auto Encoding (AE) and Language Modeling from Context Embeddings (LMCE) measured in Rouge-L.

Model	$\xi$	Rouge-L
		AE	LMCE
COCOM-light	4	0.9979	0.2045
	16	0.9912	0.1991
	128	0.5545	0.1771
COCOM	4	0.9734	0.1882
	16	0.9643	0.1800
	128	0.7938	0.1618
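
For reference, Rouge-L is based on the longest common subsequence (LCS) between a reference and a candidate text. The following is a simplified sketch of an LCS-based F1 score with whitespace tokenization; the exact Rouge-L implementation and settings behind Table 7 are not specified here, so the numbers it produces are not meant to be comparable.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based F1 between two whitespace-tokenized, lowercased texts."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Rosalind Bailey is a British actress"
print(rouge_l_f1(reference, reference))                       # perfect reconstruction
print(rouge_l_f1(reference, "Rosalind Bailey is an actress"))  # partial reconstruction
```

In the auto-encoding task, the reference is the original context and the candidate is the text decoded from the Context Embeddings, so a score close to 1 indicates near-lossless reconstruction.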

6.2. Case Study Answer Quality

We investigate the answers generated with different models. For this, we randomly select a query from the NQ dataset and compare the responses generated by each method. Table 8 presents the responses to the selected question.

From the responses, we observe that without RAG, the LLM tends to hallucinate and provide an irrelevant name as an answer. On the other hand, xRAG understands the question but returns an incorrect named entity, likely due to limitations in reading compressed embeddings accurately. ICAE struggles to comprehend the question, resulting in an unreasonable answer. Both COCOM and COCOM-light answer the question correctly at a compression rate of 4. However, they encounter difficulties when the compression rate is increased to 128.

It is also worth noting that the xRAG response was intentionally truncated to a maximum of 30 tokens in its original publication, with the stopping criteria involving halting at punctuation marks such as periods, commas, and colons.

Table 8. Case Study: Generated responses using different methods. Dataset: NQ

Model Input

Question: who played sarah hedley in when the boat comes in?

Context 1: Rosalind Bailey. Rosalind Bailey Rosalind Bailey (born 1946) is a British actress, known for her portrayal of Sarah Headley ("née" Lytton) in the 1970s and 1980s BBC television drama "When the Boat Comes In". Bailey has appeared in numerous British television drama series, including "Byker Grove", "Distant Shores" and "Burn Up". Her stage work includes playing Miss Mary Shepherd in Alan Bennett’s play "The Lady in the Van".

Context 2: Malcolm Terris. Malcolm Terris Malcolm Terris (born 11 January 1941 in Sunderland, County Durham) is a British actor. He had a lengthy career in a large number of television programmes. Possibly his best-known role was in "When the Boat Comes In", a popular 1970s series, where he played the part of Matt Headley. His film career includes appearances in "The First Great Train Robbery" (1978), "McVicar" (1980), "The Plague Dogs" (1982, voice only), "Slayground" (1983), "The Bounty" (1984) as Thomas Huggan, ship’s surgeon, "Mata Hari" (1985), "Revolution" (1985), "Scandal" (1989), and "Chaplin" (1992). His TV appearances include: One episode of

Context 3: When the Boat Comes In. When the Boat Comes In When the Boat Comes In is a British television period drama produced by the BBC between 1976 and 1981. The series stars James Bolam as Jack Ford, a First World War veteran who returns to his poverty-stricken (fictional) town of Gallowshield in the North East of England. The series dramatises the political struggles of the 1920s and 1930s and explores the impact of national and international politics upon Ford and the people around him. Section:Production. The majority of episodes were written by creator James Mitchell, but in Series 1 north-eastern

Context 4: Susie Youssef. Youssef began her comedy career as a writer for "The Ronnie Johns Half Hour" in 2006, and made her acting debut in the short film "Clicked" in the role of Lina in 2011. In 2014, she played Jane in the short film "Kevin Needs to Make New Friends: Because Everyone Hates Him for Some Reason" and then turned to television where she appeared in "The Chaser’s Media Circus". In 2014, Youssef played the lead role of Sarah in the Hayloft Project’s stage play "The Boat People" which won the Best On Stage award at the FBi SMAC Awards

Context 5: Madelaine Newton. Madelaine Newton Madelaine Newton is a British actress best known for her portrayal of Dolly in 1970s BBC television drama "When the Boat Comes In". She is married to actor Kevin Whately, known for his role as Robert "Robbie" Lewis in both "Inspector Morse" and its spin-off "Lewis". They have two children. She starred alongside her husband in the "Inspector Morse" episode "Masonic Mysteries" as Beryl Newsome - the love-interest of Morse - whom Morse was wrongly suspected of murdering. She played Whately’s on-screen wife in the 1988 Look and Read children’s serial, Geordie Racer. She also made

Generated Responses

Label: Rosalind Bailey

LLM: Anna Cropper

RAG: Rosalind Bailey

xRAG: 1976 : The role of Sarah Hedley in When the Boat Comes In was played by Rosalie Crutchley.

ICAE: Sarah Hadland

COCOM-4: Rosalind Bailey

COCOM-light-4: Rosalind Bailey

COCOM-128: Alison Steadman

COCOM-light-128: Rosalind Elliott

7. Conclusion

In this paper, we presented COCOM, our novel approach for context compression. Our main finding is that COCOM accelerates answer generation by reducing the model’s input: multiple contexts are compressed into context embeddings that, once pre-computed, serve to augment the answer generation.

Our approach maximizes the potential of the LLM by tuning all components, outperforming existing methods for context compression in RAG. By offering a trade-off between efficiency and effectiveness, our method allows for the selection of varying numbers of context compression tokens. This flexibility enables us to balance higher answer quality against faster generation times as needed. Unlike previous methods, our approach allows for the input of multiple contexts, which enhances generation quality and makes optimal use of the reduced decoding time. This is because the difference between providing the context in token form and as a reduced set of embeddings becomes most apparent for very long inputs.

We hope that our work will inspire further research in context compression and pave the way for efficient and effective deployment of Retrieval-Augmented Generation (RAG) models in real-world applications.

8. Limitations

We end this paper by discussing the remaining limitations of our model and of our experiments.

Our approach offers great potential to reduce the computational footprint of RAG. However, in our experiments we were constrained by computational resources, which limited us to a relatively small model of 7 billion parameters. This constraint prevented us from exploring the capabilities of larger models such as Llama 70B or Mixtral 8x7B, which might offer enhanced performance but require significant computational power for training and inference.

Our approach demonstrates the potential to leverage a much larger set of documents compared to non-compressed models, leading to notable efficiency gains. These gains are particularly evident when dealing with a substantial volume of documents. However, due to resource limitations, our experiments have been restricted to only 5 documents. This limited scope may not fully reflect the method’s effectiveness when scaled to larger document collections, where the benefits could be more pronounced.

Additionally, the evaluation of our method has been conducted exclusively on Question Answering (QA) tasks and using English corpora. A more comprehensive assessment, encompassing diverse tasks and multilingual datasets, would be necessary to thoroughly understand the model’s capabilities and limitations in different scenarios.

References

Asai et al. (2024) Akari Asai, Zexuan Zhong, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi, and Wen-tau Yih. 2024. Reliable, Adaptable, and Attributable Language Models with Retrieval. arXiv preprint arXiv:2403.03187 (2024).
Bartolo et al. (2020) Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension. Transactions of the Association for Computational Linguistics 8 (2020), 662–678. https://doi.org/10.1162/tacl_a_00338
Cheng et al. (2024) Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. 2024. xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token. arXiv preprint arXiv:2405.13792 (2024).
Chevalier et al. (2023) Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapting Language Models to Compress Contexts. arXiv:2305.14788 [cs.CL]
Cuconasu et al. (2024) Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. 2024. The Power of Noise: Redefining Retrieval for RAG Systems. arXiv:2401.14887 [cs.IR]
Dehghani et al. (2019) Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, and Maarten de Rijke. 2019. Learning to Transform, Combine, and Reason in Open-Domain Question Answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, J. Shane Culpepper, Alistair Moffat, Paul N. Bennett, and Kristina Lerman (Eds.). ACM, 681–689. https://doi.org/10.1145/3289600.3291012
Fan et al. (2019) Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 3558–3567. https://doi.org/10.18653/v1/P19-1346
Ge et al. (2024) Tao Ge, Hu Jing, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. 2024. In-context Autoencoder for Context Compression in a Large Language Model. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=uREj4ZuGJE
Goyal et al. (2024) Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. 2024. Think before you speak: Training Language Models With Pause Tokens. In The Twelfth International Conference on Learning Representations.
He et al. (2021) Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 (2021).
Hsia et al. (2024) Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, and Graham Neubig. 2024. RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems. arXiv:2403.09040 [cs.CL]
Izacard and Grave (2021) Gautier Izacard and Edouard Grave. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. http://arxiv.org/abs/2007.01282 arXiv:2007.01282 [cs].
Izacard et al. (2022) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Atlas: Few-shot Learning with Retrieval Augmented Language Models. http://arxiv.org/abs/2208.03299 arXiv:2208.03299 [cs].
Jiang et al. (2023) Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 13358–13376. https://doi.org/10.18653/v1/2023.emnlp-main.825
Jiang et al. (2019) Kelvin Jiang, Dekun Wu, and Hui Jiang. 2019. FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 318–323. https://doi.org/10.18653/v1/N19-1028
Welbl et al. (2017) Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing Multiple Choice Science Questions. arXiv:1707.06209v1.
Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Regina Barzilay and Min-Yen Kan (Eds.). Association for Computational Linguistics, Vancouver, Canada, 1601–1611. https://doi.org/10.18653/v1/P17-1147
Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466.
Lassance et al. (2024) Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New baselines for SPLADE. arXiv preprint arXiv:2403.06789 (2024).
Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 4582–4597. https://doi.org/10.18653/v1/2021.acl-long.353
Li (2023) Yucheng Li. 2023. Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering. arXiv:2304.12102 [cs.CL]
Liu et al. (2023) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. https://doi.org/10.48550/arXiv.2307.03172 arXiv:2307.03172 [cs].
Mallen et al. (2023) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 9802–9822. https://doi.org/10.18653/v1/2023.acl-long.546
Morris et al. (2023) John Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander Rush. 2023. Text Embeddings Reveal (Almost) As Much As Text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 12448–12460. https://doi.org/10.18653/v1/2023.emnlp-main.765
Muennighoff et al. (2024) Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. 2024. Generative Representational Instruction Tuning. arXiv:2402.09906 [cs.CL]
Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human-generated machine reading comprehension dataset. (2016).
Penedo et al. (2024) Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557 [cs.CL] https://arxiv.org/abs/2406.17557
Petroni et al. (2020) Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2020. KILT: a benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252 (2020).
Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Jian Su, Kevin Duh, and Xavier Carreras (Eds.). Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264 arXiv:1606.05250 [cs.CL]
Rau et al. (2024) David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, and Stéphane Clinchant. 2024. BERGEN: A Benchmarking Library for Retrieval-Augmented Generation. arXiv:2407.01102 [cs.CL] https://arxiv.org/abs/2407.01102
Stelmakh et al. (2022) Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. 2022. ASQA: Factoid Questions Meet Long-Form Answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 8273–8288. https://doi.org/10.18653/v1/2022.emnlp-main.566
Tan et al. (2024) Sijun Tan, Xiuyu Li, Shishir Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E Gonzalez, and Raluca Ada Popa. 2024. LLoCO: Learning Long Contexts Offline. arXiv preprint arXiv:2404.07979 (2024).
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
Xu et al. (2023) Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation. arXiv:2310.04408 [cs.CL]
Yang et al. (2015) Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lluís Màrquez, Chris Callison-Burch, and Jian Su (Eds.). Association for Computational Linguistics, Lisbon, Portugal, 2013–2018. https://doi.org/10.18653/v1/D15-1237
Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 2369–2380. https://doi.org/10.18653/v1/D18-1259
Zhu et al. (2024) Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, et al. 2024. Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection. arXiv preprint arXiv:2405.16178 (2024).

Appendix A Appendix

Table 9. Results in Match (M) comparing COCOM (-light) to other context compression works. All methods use 5 context passages unless indicated otherwise. ^★ Method limited to single context. ^△ upper baseline. ^▽ lower baseline. ^∗ indicates statistical non-significance (p > 0.05) with respect to COCOM $\xi$ =4.

Decoder		Method	Compression rate ( $\xi$ )	Dataset
				NQ	TriviaQA	HotpotQA	ASQA	PopQA	Average
Zero-shot		AutoCompressor (Chevalier et al., 2023) ^★	$\times$ 4	0.351	0.703	0.314	0.574	0.237	0.435
		ICAE (Ge et al., 2024)	$\times$ 4	0.421	0.784	0.293	0.469	0.426	0.479
		xRAG (Cheng et al., 2024)^★
		Mistral-7B-v0.2	$\times$ 128	0.316	0.766	0.267	0.339	0.326	0.403
		Mixtral-8x7b	$\times$ 128	0.405	0.852	0.326	0.457	0.412	0.490
Fine-tuned	Mistral-7B-v0.2	RAG^△ (no compression)	-	0.637	0.917	0.544	0.665*	0.543	0.661
		LLM^▽ (no context)	-	0.403	0.753	0.283	0.573	0.208	0.444
		COCOM-light (ours)	$\times$ 4	0.579	0.882	0.439	0.633*	0.473	0.601
			$\times$ 16	0.529	0.857	0.395	0.604	0.395	0.556
			$\times$ 128	0.479	0.828	0.347	0.586	0.326	0.513
		COCOM (ours)	$\times$ 4	0.589	0.894	0.461	0.640	0.487	0.614
			$\times$ 16	0.577*	0.886*	0.456*	0.633*	0.478	0.606
			$\times$ 128	0.546	0.866	0.403	0.617*	0.402	0.567

Table 10. Hyperparameters for Pretraining

Hyperparameter	Assignment
learning rate	1e-4
lr scheduler type	linear
warmup ratio	0.05
weight decay	0.1
overall batch size	256
optimizer	AdamW
epochs	1
LoRA layers	all linear layers
LoRA alpha	32
LoRA dropout	0.1
LoRA $r$	16
LoRA bias	None
GPU	8 x A100 80GB
context max length	128

Table 11. Hyperparameters for Fine-tuning

Hyperparameter	Assignment
learning rate	1e-4
lr scheduler type	linear
warmup ratio	0.05
weight decay	0.1
overall batch size	64
optimizer	AdamW
epochs	2
LoRA layers	all linear layers
LoRA alpha	32
LoRA dropout	0.1
LoRA $r$	16
LoRA bias	None
GPU	8 x A100 80GB
retriever(s)	SPLADE-v3 (+ DeBERTa-v3)
num passages	5

Table 12. Datasets contained in the multi-dataset collection used for fine-tuning our COCOM (-light). We filtered out queries with more than 128 tokens and labels of more than 64 tokens.

Dataset	Number of examples
NQ	87,925
MSMARCO	100,000
Adversarial QA	30,000
HotpotQA	88,869
WikiQA	873
SciQ	11,679
ASQA	4,353
Wiki QA	61,817
Freebase	20,358
SQuAD	87,599
Total	493,473
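
The length filter mentioned in the caption of Table 12 can be sketched as follows. Token counts are approximated by whitespace splitting, which is a stand-in for the model tokenizer, and the field names `query` and `labels` as well as the toy examples are hypothetical.

```python
MAX_QUERY_TOKENS = 128
MAX_LABEL_TOKENS = 64


def count_tokens(text):
    """Whitespace token count; a stand-in for the model tokenizer."""
    return len(text.split())


def keep_example(example):
    """Keep an example only if its query and all labels are within the limits."""
    return count_tokens(example["query"]) <= MAX_QUERY_TOKENS and all(
        count_tokens(label) <= MAX_LABEL_TOKENS for label in example["labels"]
    )


examples = [
    {"query": "who played sarah hedley in when the boat comes in?",
     "labels": ["Rosalind Bailey"]},
    {"query": "word " * 200, "labels": ["too long a query"]},  # dropped: query too long
    {"query": "short question", "labels": ["word " * 100]},    # dropped: label too long
]
filtered = [example for example in examples if keep_example(example)]
print(len(filtered))  # 1
```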