Context Embeddings for Efficient Answer Generation in RAG

David Rau (University of Amsterdam, Amsterdam, Netherlands), Shuai Wang (The University of Queensland, Brisbane, Australia), Hervé Déjean (Naver Labs Europe, Grenoble, France), and Stéphane Clinchant (Naver Labs Europe, Grenoble, France)

Abstract.

Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly increases the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method allows for different compression rates, trading off decoding time for answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69× while achieving higher performance compared to existing efficient context compression methods.

Context Compression, LLM, RAG

1. Introduction

Figure 1. COCOM: Compressing multiple contexts for RAG into a small set ($\xi = \{4, 16, 128\}$) of Context Embeddings leads to a massive speed-up in answer generation while maintaining higher performance compared to other methods. Results are shown for the ASQA dataset.

Large Language Models (LLMs) are pre-trained on massive amounts of textual data; for instance, Llama 2 (Touvron et al., 2023) was trained on 2 trillion tokens during pre-training. Through billions of learnable parameters, LLMs not only excel at modeling language but also build up a knowledge base that can later be used for question answering. On the other hand, the model is limited to the knowledge contained in the pre-training data. In knowledge-intensive scenarios, relying solely on the parametric memory of the model is often insufficient. To alleviate this, context can be provided explicitly from an external source through a preceding retrieval step (Retrieval-Augmented Generation, RAG). Although LLMs show notable improvements when given additional relevant context in knowledge-intensive tasks, this approach has limitations.

A key drawback is that adding more context to the input considerably slows down generation during inference. This occurs because the compute and memory requirements of the self-attention mechanism in transformers grow quadratically with increasing input length. At the same time, previous research has shown that providing multiple documents as context can improve RAG performance (Izacard et al., 2022; Hsia et al., 2024). This is particularly critical for QA applications where reasoning over context from multiple documents is necessary, such as in multi-doc QA tasks (Fan et al., 2019; Joshi et al., 2017; Yang et al., 2018). In fact, the observation that modern transformers can naturally cope with many context documents for answer generation in open domain QA tasks was central to the development of RAG (Dehghani et al., 2019; Izacard and Grave, 2021). However, as the input length becomes larger, the position bias in LLMs might further complicate the extraction of relevant information (Liu et al., 2023).

Previous work has shown that the increased generation time in RAG can be alleviated by reducing the model’s input through context compression. This can be achieved either by applying lexical-based compression, where unimportant terms or tokens in the context are identified and filtered out during generation (Jiang et al., 2023), or by embedding-based compression, where embedding models transform the context into fewer embedding tokens in the LLM input (Ge et al., 2024; Tan et al., 2024; Cheng et al., 2024; Muennighoff et al., 2024). Notably, state-of-the-art embedding-based compression methods often achieve higher effectiveness and lower latency compared to lexical-based compression methods (Cheng et al., 2024).

However, despite the current embedding-based compression approaches achieving lower latency in RAG systems, several limitations remain:

  • Large compressor model: These methods rely on large compression models to achieve high effectiveness, such as (Cheng et al., 2024; Muennighoff et al., 2024).

  • Low effectiveness: The effectiveness of current embedding-based compression methods underestimates the potential of LLMs for answer generation, as they only tune parts of model components and leave the decoder LLM untuned. We hypothesize that freezing the decoder hinders the use of compressed contexts.

  • Fixed compression rate: Current methods do not offer different compression rates with respect to the length of the input context, which would allow trading off inference time for generation quality at high effectiveness.

  • Single document limitation: Current effective methods only support using a single document context to generate answers.

We address the described limitations, similar to concurrently developed methods, by compressing contexts into a small number of context embeddings which are then provided as input to the LLM. This allows us to reduce the input size to a fraction of its surface form, which leads to a reduced decoding time during answer generation. We call our model COCOM (COntext COmpression Model), a multi-context compression method leveraging a single model for context compression and answer generation.

Additionally, we further show that with appropriate pretraining and tuning approaches, our compressing model achieves significantly higher effectiveness than current context compressing approaches (see Figure 1). We summarize our contributions as follows:

  • We present COCOM, an effective context compression method, reducing long contexts to only a handful of context embeddings speeding up the generation time while achieving higher performance compared to other methods.

  • In an efficiency study, we demonstrate the efficiency-effectiveness trade-offs achievable with different compression rates. We further illustrate the time and memory required for compression. We reduce inference time by up to 5.69× and GFLOPs by up to 22× while maintaining high performance.

  • We conduct an ablation to understand which factors are the most important for effective generation and analyze the impact of the pretraining collection, pretraining, fine-tuning on the target dataset, and freezing or training the decoder.

The rest of this paper is structured in the following way. Section 2 discusses related work on RAG, efficiency, and compression approaches. We continue in Section 3 discussing the RAG task and our novel COCOM approach to effective context compression. Section 4 details the experimental setup in terms of the RAG models and the five QA tasks. In Section 5, we present the main COCOM results in terms of effectiveness and efficiency. Section 6 conducts further analysis of how compression affects the model. We end with discussion and conclusions in Section 7, and limitations in Section 8.

Table 1. Comparison to previous works on Embedding-based Context Compression.
Work Light Compressor Decoder Tuning Adaptable $\gamma$ Multi-Doc Efficient Answer Generation
GridLM (Muennighoff et al., 2024)
AutoCompressor (Chevalier et al., 2023)
ICAE (Ge et al., 2024)
xRAG (Cheng et al., 2024)
COCOM-light (ours)
COCOM (ours)

2. Related Work

In this section, we discuss related work on RAG, efficiency, and compression approaches.

The initial motivation for this work stems from a recent study by Morris et al. (2023), which demonstrates that a bag-of-words representation of the original surface terms can be recovered from text embeddings. This observation that embeddings can encapsulate the content of an entire passage inspired the idea to provide context in the form of an embedding rather than the original context in token form to an LLM.

In the context of RAG, the underlying motivation for reducing the input size is, as mentioned earlier, the computational cost of contextualizing long inputs and the resulting increase in decoding time (Asai et al., 2024). We address this by reducing the provided context to only a handful of context embeddings that are given directly to the LLM.

Reducing the input to RAG models is a very active research field, with many works being done concurrently with ours. Among those works, two primary lines of research have emerged: embedding-based and lexical-based context compression. We discuss them in the following.

2.1. Lexical-based Compression.

Lexical-based compression focuses on either selecting tokens from the context (Li, 2023) or summarizing contexts (Xu et al., 2023), both aiming to retain essential information while reducing overall context size. LLMLingua comprises a query-independent token filtering module that uses an LLM to first select important tokens in the context. Then, a query-dependent token classifier is used to select tokens to form the compressed context.

On the other hand, Zhu et al. (2024) do not consider compression at the term level, but at the document level. Retrieved documents are either included or excluded with respect to the query. Only the included documents form the context for answer generation. It is worth noting that current lexical-based compression approaches all rely on specific query inputs. Therefore, compression needs to be (partially) processed online, which prevents compressing documents offline and slows down generation.

2.2. Embedding-based Compression.

Embedding-based compression approaches focus on compressing the context into one or multiple summary embeddings that can be directly interpreted by the decoder model. The first work in this line is AutoCompressor (Chevalier et al., 2023). This approach compresses contextual information by splitting it into randomly segmented chunks and iteratively aggregating these into summary embeddings until the target embedding size is met. However, the training of the summary embeddings relies exclusively on next token prediction tasks, raising concerns about their ability to effectively encapsulate relevant contextual data. Furthermore, AutoCompressor is designed primarily for long contexts, generating a minimum of 50 summary embeddings. Such a configuration is not suitable for common RAG pipelines where short passages are retrieved, such as KILT.

Building on AutoCompressor, ICAE by Ge et al. (2024) explores training a context compressor using the same LLM as the decoder model and compresses only once to obtain the summary embeddings. However, their approach limits the model’s capacity by using a frozen decoder module, preventing the accumulation of gradients from the decoder part during training. In this paper, we argue that decoder training is an important factor that strongly impacts the performance of the model. We illustrate this argument in Section 4.2.1.

Furthermore, GridLM (Muennighoff et al., 2024) addresses the issue of double decoding the same context, first for retrieval and then again as the provided context to the LLM. They use the same LLM for ranking and generation, which allows them to cache all representations while encoding the contexts and to reuse them during generation. Compared to ours, this approach is limited to a single context, does not speed up decoding, and results in gigantic storage requirements.

Cheng et al. (2024) propose xRAG concurrently to our method. They directly reuse frozen ranking representations based on embedding models while freezing the decoder. Although this approach successfully resolves the double decoding problem, it suffers from low effectiveness because the representation is not trained prior to its application to compression tasks. This issue becomes particularly challenging when light-weight encoder models, such as DPR with 109 million parameters, are used as compressors. In such cases, the model achieves similar effectiveness to the Mistral-7b model when retrieval is not applied (by default, xRAG uses a 7B SFR LLM-based ranking model as compressor). On the other hand, using retrieval representations from lightweight models for compression is counter-intuitive: representations gathered from retrieval tasks may lack sufficient information to fully recover the context. Conversely, representations learned for compression demonstrate the capacity to reconstruct the original context (Ge et al., 2024). This suggests that, upon further adjustment, they may show a higher potential to serve as an effective retriever.

2.3. Overview

In Table 1 we contrast our method with the described related works on embedding-based compression. It is important to note that most previous works mentioned so far have only considered cases that may not directly apply to RAG settings but rather to long-context question answering. In their setting, only one relevant document is used for each query to fulfill the user request.

Therefore, such models are not naturally able to deal effectively with multiple documents. Furthermore, their reported effectiveness may not directly indicate the final performance in RAG systems, where a document may be irrelevant and multiple top-retrieved documents are often used. As a decoder model should, by design, be able to handle multiple context representations, we argue that fine-tuning the decoder is a simple yet necessary solution compared to existing works.

3. Methodology

In this section, we detail the RAG task and our novel COCOM approach to effective context compression.

3.1. Task Definition: RAG

RAG employs a ranking system $\mathcal{R}$ and a parametric generative language model $\theta_{LLM}$, where the ranking system can be multi-staged. First, the ranking system builds a search index $\mathcal{I}$ based on a collection. Then, at request time, the index $\mathcal{I}$ is searched, yielding context segments $\mathcal{C}$ that are relevant to the user input $x$: $f_{\mathcal{I},\mathcal{R}}: \{x\} \rightarrow \mathcal{C}$. (The segments can be at different granularities, for instance sentences, passages, or entire documents. In this work, we focus on passages.)

Next, the LLM generates a response $r$ based on the context $\mathcal{C}$ and user input $x$:

(1) $\theta_{LLM}: \{\mathcal{C}, x\} \rightarrow r$

Note how in RAG the context is added to the input of the LLM, dramatically increasing the input to the LLM, as $|\mathcal{C}| \gg |x|$.
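
To make the two stages concrete, the following is a minimal sketch of the retrieve-then-generate input construction. It is not the system used in this paper: the word-overlap ranker, the `tokenize` helper, and the prompt layout are illustrative assumptions that stand in for $\mathcal{R}$, a real tokenizer, and a real instruction template.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def retrieve(index: List[str], query: str, top_k: int = 2) -> List[str]:
    """Toy ranking system: order passages by word overlap with the query."""
    query_terms = set(tokenize(query))
    ranked = sorted(
        index,
        key=lambda passage: len(query_terms & set(tokenize(passage))),
        reverse=True,
    )
    return ranked[:top_k]


def build_llm_input(contexts: List[str], query: str) -> str:
    """Classical RAG: concatenate the retrieved contexts C with the user input x."""
    return "\n".join(contexts) + "\nQuestion: " + query


# Hypothetical three-passage collection.
index = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Nile is a river in Africa.",
]
query = "What is the capital of France?"
contexts = retrieve(index, query)
llm_input = build_llm_input(contexts, query)
print(len(tokenize(query)), "query tokens vs", len(tokenize(llm_input)), "input tokens")
```

Even in this toy setting the LLM input is several times longer than the query alone, which is the cost that context compression targets.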

3.2. COCOM: Effective Context Compression

Figure 2. Overview of our COCOM (-light) model pipeline.

The main idea of COCOM is to enhance efficiency by compressing the context, which is typically given in surface form as input tokens into a smaller set of context embeddings which then serve as the input to the LLM. An overview of our entire pipeline is given in Figure 2. More formally, our approach can be described as follows:

Given a context $\mathcal{C}$ tokenized into a sequence of tokens $\{t_1, t_2, \ldots, t_n\}$ and a compressor model $\phi_{comp}$, we compress $\mathcal{C}$ into context embeddings $\mathcal{E}$, a smaller set of embeddings $\{e_1, e_2, \ldots, e_k\}$, where $k \ll n$. Each embedding $e_i \in \mathbb{R}^d$, with $d$ being the LLM’s hidden dimension.

(2) $\phi_{comp}: \{t_1, t_2, \ldots, t_n\} \rightarrow \{e_1, e_2, \ldots, e_k\} \in \mathbb{R}^d$

Next, based on the compressed context embeddings $\mathcal{E}$ and the user input $x$, the LLM $\theta_{LLM}$ generates a response $r$:

(3) $\theta_{LLM}: \{\mathcal{E}, x\} \rightarrow r$

The $\phi_{comp}$ model is trained to generate context embeddings that capture the content of the input tokens in a compressed form. As both models are trained jointly, $\theta_{LLM}$ learns to decode these context embeddings, extracting the relevant information required to answer user queries.

COCOM compresses contexts into context embeddings independently of the question. This means that individual contexts not only have to be contextualized by an LLM just once, but can also be pre-computed offline and stored, drastically reducing the computational costs of the LLM at inference time. Further, by feeding only a small number of context embeddings instead of the long context, the input size is reduced to a fraction, leading to a massive speed-up for answer generation.

For COCOM, we utilize the same model for compression and answer generation, $\phi_{comp} = \theta_{LLM}$. Therefore, we effectively train a single model on the two tasks. For the compression task, we prepend a special token <AE> to the input and, depending on $\xi$, append a different number of context embedding tokens <CTX> at the end of the sequence. We directly use the representations of the last hidden layer as our context embeddings, which serve as input to the same model for answer generation.

As demonstrated later in the experiments, our method also allows us to potentially employ any embedding model as a compressor, including more lightweight encoder-only models such as BERT (see Section 5.2).
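
The compression input layout described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, tokens are plain strings, and we assume the context embeddings are read from the last-layer hidden states at the trailing <CTX> positions, which the text suggests but does not spell out.

```python
from typing import List, Sequence

AE_TOKEN = "<AE>"    # special token prepended for the compression task
CTX_TOKEN = "<CTX>"  # placeholder token, one per context embedding


def build_compressor_input(context_tokens: Sequence[str], k: int) -> List[str]:
    """Lay out the compressor input as <AE>, context tokens, then k <CTX> tokens."""
    return [AE_TOKEN, *context_tokens, *([CTX_TOKEN] * k)]


def select_context_embeddings(
    last_hidden: Sequence[Sequence[float]], k: int
) -> List[List[float]]:
    """Assumption: the k context embeddings are the last-layer states of the k <CTX> slots."""
    start = len(last_hidden) - k
    return [list(vector) for vector in last_hidden[start:]]


tokens = ["a", "b", "c", "d"]
sequence = build_compressor_input(tokens, k=2)
# Fake last-layer hidden states (dimension 1), one per input position.
fake_hidden = [[float(position)] for position in range(len(sequence))]
context_embeddings = select_context_embeddings(fake_hidden, k=2)
print(sequence)
print(context_embeddings)
```

Because the selection depends only on the context, the resulting embeddings can be computed once per passage and stored, as the text notes.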

3.2.1. Adaptable Compression Rate

The number of context embeddings $k = |\mathcal{E}|$ can be varied, which allows controlling the level of compression of the original context $\mathcal{C} = \{t_1, \ldots, t_n\}$. We calculate the number of context embeddings $k$ per context $\mathcal{C}$ based on a compression rate $\xi$ and the length of the tokenized input $n = |\mathcal{C}|$.

(4) $k = \left\lfloor \frac{n}{\xi} \right\rfloor$

For instance, when compressing a context of length $n = 128$ with a compression rate $\xi = 64$, we obtain 2 context embeddings, reducing the input by 64 times.
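
Equation 4 amounts to an integer division of the context length by the compression rate. The short sketch below evaluates it for a 128-token context; the function name and the particular set of rates are illustrative choices.

```python
def num_context_embeddings(n_tokens: int, compression_rate: int) -> int:
    """Number of context embeddings: floor(n / compression rate), as in Equation 4."""
    if compression_rate < 1:
        raise ValueError("compression_rate must be >= 1")
    return n_tokens // compression_rate


for rate in (4, 16, 64, 128):
    print(f"rate={rate:>3}: {num_context_embeddings(128, rate)} context embeddings")
```

Note that the floor yields zero embeddings whenever the context is shorter than the compression rate; how such short inputs are handled is not stated in this section.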

3.2.2. Multiple Contexts

Knowledge-intensive tasks can benefit from providing the context of multiple retrieved passages (Izacard et al., 2022; Hsia et al., 2024), especially where reasoning over multiple contexts is necessary to solve the task (Fan et al., 2019; Joshi et al., 2017; Yang et al., 2018). In classical RAG, the contexts of multiple passages are concatenated and provided to the model. Similarly, in COCOM we can provide context embeddings of multiple passages to the LLM. Contexts are compressed independently following Equation 2. We add [SEP] special tokens between the context embeddings before feeding them to the LLM to distinguish context stemming from different passages in the input.
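
A minimal sketch of this assembly step is shown below. Embeddings are represented by string placeholders, the function name is hypothetical, and placing the question after the context embeddings is our assumption about the input order.

```python
from typing import List, Sequence

SEP_TOKEN = "[SEP]"


def assemble_llm_input(
    per_passage_embeddings: Sequence[Sequence[str]],
    question_tokens: Sequence[str],
) -> List[str]:
    """Join independently compressed passages with [SEP], then append the question."""
    sequence: List[str] = []
    for passage_idx, embeddings in enumerate(per_passage_embeddings):
        if passage_idx > 0:
            sequence.append(SEP_TOKEN)  # separator only between passages
        sequence.extend(embeddings)
    sequence.extend(question_tokens)
    return sequence


# Five passages of 128 tokens at a compression rate of 16: 8 embeddings each.
passages = [[f"e{p}_{i}" for i in range(8)] for p in range(5)]
question = ["who", "wrote", "hamlet", "?"]
compressed_input = assemble_llm_input(passages, question)
uncompressed_len = 5 * 128 + len(question)
print(len(compressed_input), "positions instead of", uncompressed_len)
```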


3.3. Pre-training Context Embeddings

We propose two auto-regressive variations of the next-token prediction task to learn to compress context into context embeddings and to use these context embeddings as input to the LLM.

Following our earlier notation, the objective function for the standard next token prediction for input $\mathcal{X} = \{x_1, x_2, \ldots, x_T\}$ can be written as:

(5) $\mathcal{L}(\theta_{LLM}) = -\sum_{x_t \in \mathcal{X}} \log P_{\theta_{LLM}}(x_t \mid x_1, x_2, \ldots, x_{t-1})$

3.3.1. Auto-encoding with Context Embeddings.

We modify the next token prediction task to recover the original input tokens from the compressed context embeddings $\mathcal{E}$. This way we jointly train the compressor and LLM to decompress the original input, which can be seen as a form of auto-encoding.

(6) $\mathcal{E} = \phi_{comp}(x_1, x_2, \ldots, x_T)$
(7) $\mathcal{L}(\theta_{LLM}, \phi_{comp}) = -\sum_{x_t \in \mathcal{X}} \log P_{\theta_{LLM}}(x_t \mid \mathcal{E}, x_1, \ldots, x_{t-1})$

This task serves as a preliminary step toward our final objective of answering questions from context embeddings. For this objective, we first aim to learn to compress and decompress the same input effectively.

3.3.2. Language Modeling from Context Embeddings.

Our final task is to answer questions based on the context embeddings. To this end, in our language modeling task, we train the model to continue a given input conditioned on context embeddings. This way the model learns not only to compress a given input but also to leverage the content of the context embeddings effectively.

We split input $\mathcal{X} = \{x_1, x_2, \ldots, x_T\}$ into $\mathcal{X}_A = \{x_1, x_2, \ldots, x_j\}$ and $\mathcal{X}_B = \{x_{j+1}, \ldots, x_T\}$. After compressing the first part $\mathcal{X}_A$ into $\mathcal{E}_A$, we learn to generate the continuation, namely the second part $\mathcal{X}_B$, based on the compressed representations $\mathcal{E}_A = \phi_{comp}(\mathcal{X}_A)$. This can be seen as a variation of the next token prediction task, but conditioned on context embeddings.

(8) $\mathcal{L}(\theta_{LLM}, \phi_{comp}) = -\sum_{x_t \in \mathcal{X}_B} \log P_{\theta_{LLM}}\big(x_t \mid \phi_{comp}(\mathcal{X}_A), x_1, \ldots, x_{t-1}\big)$

This language modeling task is complementary to the auto-encoding task. If we only employed the auto-encoding task, the LLM would be biased towards merely recovering the original input instead of leveraging the content of the context embeddings.
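
The two pre-training tasks differ only in what is compressed and what must be generated. The sketch below builds one training sample for either task. The equal-probability task choice follows Section 4.2.1, while the uniformly random split point $j$ and the dictionary layout are assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence


def make_pretraining_sample(tokens: Sequence[str], rng: random.Random) -> Dict[str, object]:
    """Build one sample: auto-encoding (recover X) or language modeling (continue X_A with X_B)."""
    tokens = list(tokens)
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to allow a split")
    if rng.random() < 0.5:
        # Auto-encoding: compress the full input and reconstruct it.
        return {"task": "auto_encoding", "compress": tokens, "target": list(tokens)}
    # Language modeling: compress the prefix X_A, generate the continuation X_B.
    j = rng.randint(1, len(tokens) - 1)  # assumed: split point drawn uniformly
    return {"task": "language_modeling", "compress": tokens[:j], "target": tokens[j:]}


rng = random.Random(0)
chunk: List[str] = [f"x{t}" for t in range(1, 9)]
samples = [make_pretraining_sample(chunk, rng) for _ in range(1000)]
print(sum(s["task"] == "auto_encoding" for s in samples), "auto-encoding samples out of 1000")
```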

3.4. Fine-tuning

For the downstream RAG application, we fine-tune the model on a question $q$ and the relevant context(s), retrieved by a retrieval system and compressed into context embeddings $\mathcal{E}$, which are combined into an instruction $I_{q,\mathcal{E}}$. We train the LLM to generate the target response $R = (r_1, r_2, \ldots, r_T)$. We fine-tune our models on a combined set of publicly available QA datasets. We employ instruction fine-tuning, updating the models based only on the target responses.

(9) $\mathcal{L}(\theta_{LLM}, \phi_{comp}) = -\sum_{r_t \in R} \log P_{\theta_{LLM}}(r_t \mid I_{q,\mathcal{E}}, r_1, r_2, \ldots, r_{t-1})$
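
Updating the model only on the target responses means that instruction positions contribute nothing to the loss. The following sketch illustrates that masking with hand-picked token probabilities; the function name and the numbers are hypothetical and only serve to show which terms enter the sum in Equation 9.

```python
import math
from typing import Sequence


def response_only_nll(token_log_probs: Sequence[float], is_response: Sequence[bool]) -> float:
    """Negative log-likelihood summed over response positions only."""
    if len(token_log_probs) != len(is_response):
        raise ValueError("log-probs and mask must have the same length")
    return -sum(lp for lp, keep in zip(token_log_probs, is_response) if keep)


# Three instruction positions followed by two response positions.
log_probs = [math.log(0.5)] * 3 + [math.log(0.25), math.log(0.5)]
mask = [False, False, False, True, True]
loss = response_only_nll(log_probs, mask)
print(round(loss, 4))
```

Here only the last two positions are scored, so the loss equals $-(\log 0.25 + \log 0.5) = \log 8$.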

4. Experimental Setup

In this section, we detail our experimental setup in terms of the RAG models and the five QA tasks.

4.1. Implementation Details

We use Mistral-7B-Instruct-v0.2 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) as our backbone LLM for answer generation. For context compression in COCOM, we utilize the same model. For our more light-weight context compression, in COCOM-light, we employ bert-base-uncased (https://huggingface.co/google/bert-base-uncased). We apply three different compression rates: $\xi = \{4, 16, 128\}$. We employ SPLADE-v3 (Lassance et al., 2024) with reranking of the top-50 using DeBERTa-v3 (He et al., 2021) as our retrieval system. For all our experiments, we use the top-5 documents as context.

4.2. Training

For both pre-training and fine-tuning, we apply parameter-efficient LoRA tuning.

4.2.1. Pre-training

For pre-training, we employ the two pre-training tasks described earlier: auto-encoding and language modeling. Samples are drawn randomly with equal probability from both tasks. We tried different ratios but found this to perform best. Efficient batch processing requires that every sample in a batch contains a fixed-length tokenized input. To achieve this, we split the Wikipedia-KILT (Petroni et al., 2020) corpus into chunks of 128 tokens using the Llama-2-7b tokenizer (we publish this resource as a Hugging Face dataset under https://huggingface.co/datasets/dmrau/kilt-128). We pre-train on 10M samples in total. Training hyperparameters can be found in the Appendix in Table 10.
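
Splitting a tokenized corpus into fixed-length chunks can be sketched as below. Token ids are plain integers here instead of the output of a real tokenizer, and dropping a shorter trailing chunk is our assumption; the text does not say whether the remainder is dropped or padded.

```python
from typing import List, Sequence


def chunk_tokens(
    token_ids: Sequence[int], chunk_len: int = 128, drop_last: bool = True
) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of chunk_len tokens."""
    chunks = [
        list(token_ids[start:start + chunk_len])
        for start in range(0, len(token_ids), chunk_len)
    ]
    # Assumed policy: discard a trailing chunk that is shorter than chunk_len.
    if drop_last and chunks and len(chunks[-1]) < chunk_len:
        chunks.pop()
    return chunks


document = list(range(300))  # stand-in for a tokenized Wikipedia article
fixed = chunk_tokens(document)
print(len(fixed), [len(c) for c in fixed])
```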

4.2.2. Fine-tuning

The BERGEN library (Rau et al., 2024) is used to fine-tune the model. We fine-tune our models on various datasets concurrently. To construct our fine-tuning dataset (published at https://huggingface.co/datasets/dmrau/multi_qa), we combine training samples from Natural Questions (Kwiatkowski et al., 2019), MS MARCO (Nguyen et al., 2016; we select only the first 100k queries), adversarial QA (Bartolo et al., 2020), HotpotQA (Yang et al., 2018), WikiQA (Yang et al., 2015), SCIQ (Johannes Welbl, 2017), ASQA (Stelmakh et al., 2022), TriviaQA (Joshi et al., 2017), Freebase QA (Jiang et al., 2019), and SQuAD (Rajpurkar et al., 2016), all of which are question answering datasets. We then filter out queries with more than 128 tokens and labels with more than 64 tokens. For more details, we refer to Table 12 in the Appendix. Training hyperparameters can be found in Table 11 in the Appendix.
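
The length filter can be illustrated with a short sketch. The text does not say which tokenizer the limits refer to, so whitespace splitting is used here purely as a stand-in, and the sample dictionary layout (`query`, `labels`) is hypothetical.

```python
MAX_QUERY_TOKENS = 128
MAX_LABEL_TOKENS = 64

def count_tokens(text):
    # Whitespace splitting is a stand-in for the real tokenizer (assumption).
    return len(text.split())

def keep_sample(sample):
    """Keep a sample only if the query and all of its labels are within limits."""
    if count_tokens(sample["query"]) > MAX_QUERY_TOKENS:
        return False
    return all(count_tokens(label) <= MAX_LABEL_TOKENS for label in sample["labels"])

samples = [
    {"query": "who wrote hamlet", "labels": ["William Shakespeare"]},
    {"query": "word " * 200, "labels": ["short label"]},   # query too long
    {"query": "short query", "labels": ["word " * 100]},   # label too long
]
filtered = [s for s in samples if keep_sample(s)]
```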

4.3. Evaluation

We evaluate our model on several widely used QA datasets: Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), ASQA (Stelmakh et al., 2022), and PopQA (Mallen et al., 2023).

4.3.1. Metrics

As our main metric, following the standard protocol to evaluate fine-tuned models we use Exact Match (EM). To compare our results to previous works, which partially rely on untuned decoders and therefore produce verbose answers, we revert to the Match metric (M), which indicates whether the label is contained (as an exact match) in the generated answer.
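
The difference between the two metrics can be made concrete with a small sketch. The normalization used here (lowercasing, removing punctuation and English articles) is a common convention for QA evaluation and is our assumption; the paper does not spell out its normalization.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, labels):
    """EM: the normalized prediction equals one of the normalized labels."""
    return any(normalize(prediction) == normalize(label) for label in labels)

def match(prediction, labels):
    """M: one of the normalized labels occurs inside the normalized prediction."""
    return any(normalize(label) in normalize(prediction) for label in labels)

labels = ["Rosalind Bailey"]
verbose = "The role was played by Rosalind Bailey."
concise = "Rosalind Bailey"
```

With these definitions, the verbose answer counts as a Match but not as an Exact Match, while the concise answer satisfies both, which is why Match is the fairer metric for untuned decoders.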

4.4. Baselines without Context Compression

We fine-tune the base model (Mistral-7B-Instruct-v0.2):

  • RAG - upper bound. The model receives the top-5 retrieved contexts alongside the query and answers the question. Since it applies no context compression, this model serves as an upper bound in our experiments.

  • Closed Book (w/o RAG) - lower bound. The LLM generates an answer based on the query without any provided context. This serves as a lower-bound baseline.

4.5. Baselines with Context Compression

We compare our models to the context compression methods listed below. As mentioned earlier, these methods tune only parts of their model components on the downstream data and leave their decoder LLM untuned, applying it zero-shot. We argue that this is a major limitation: answering questions from context embeddings differs fundamentally from standard language modeling, which hinders the model from effectively leveraging the context embeddings.

To ensure comparability among approaches we use the same retrieval system as mentioned earlier in Section 4.1.

  • AutoCompressor (Chevalier et al., 2023): We use the princeton-nlp/AutoCompressor-Llama-2-7b-6k checkpoint, which produces 50 summary vectors. As their model is limited to compressing a single context, we use only the top retrieved document as context.

  • ICAE (Ge et al., 2024): We use the Mistral-7B-Instruct-v0.2 LoRA checkpoint (https://huggingface.co/sggetao/icae), which uses the same base LLM as ours and is therefore directly comparable. ICAE is fine-tuned to compress a single long context, whereas we use multiple contexts. To accommodate this, we concatenate the top five retrieved documents as the context input for the model and truncate it to a maximum length of 512 tokens. Note that the model has a maximum output length of 128 compressed tokens, which corresponds to a compression rate of approximately 4 relative to the concatenated context input.

  • xRAG (Cheng et al., 2024): We use the xRAG-7b model (https://huggingface.co/Hannibal046/xrag-7b) and the 8x7B mixture-of-experts model (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) alongside their strongest SFR compressor. The base model is again the same as ours, to ensure comparability. As their model is limited to compressing only a single context into a single compressed representation, we use the top retrieved document for the xRAG setting. (We also tested compressing five documents together, which yielded lower effectiveness.) We apply their predefined stopping criteria for answer generation, which aim at cutting the verbose output of an untuned decoder LLM.

Table 2. Results in Exact Match (EM) comparing COCOM (-light) to other context compression works. For the Match metric (M), see Table 9 in the Appendix. All methods use 5 context passages unless indicated otherwise; AutoCompressor and xRAG are limited to a single context (see Section 4.5). RAG (no compression) is the upper baseline and LLM (without context) is the lower baseline. * indicates statistical non-significance (p > 0.05) with respect to COCOM ξ=4.

| Decoder | Method | Compression rate (ξ) | NQ | TriviaQA | HotpotQA | ASQA | PopQA | Average |
|---|---|---|---|---|---|---|---|---|
| Zero-shot | AutoCompressor (Chevalier et al., 2023) | ×4 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Zero-shot | ICAE (Ge et al., 2024) | ×4 | 0.210 | 0.592 | 0.184 | 0.222 | 0.290 | 0.300 |
| Zero-shot | xRAG (Cheng et al., 2024), Mistral-7B-v0.2 | ×128 | 0.184 | 0.622 | 0.185 | 0.182 | 0.199 | 0.274 |
| Zero-shot | xRAG (Cheng et al., 2024), Mixtral-8x7b | ×128 | 0.265 | 0.744 | 0.239 | 0.292 | 0.318 | 0.372 |
| Fine-tuned Mistral-7B-v0.2 | RAG (no compression) | - | 0.597 | 0.883 | 0.500 | 0.622* | 0.514 | 0.623 |
| Fine-tuned Mistral-7B-v0.2 | LLM (without context) | - | 0.359 | 0.708 | 0.264 | 0.546 | 0.199 | 0.416 |
| Fine-tuned Mistral-7B-v0.2 | COCOM-light (ours) | ×4 | 0.539 | 0.849 | 0.409 | 0.601* | 0.458 | 0.571 |
| Fine-tuned Mistral-7B-v0.2 | COCOM-light (ours) | ×16 | 0.492 | 0.823 | 0.367 | 0.565 | 0.385 | 0.526 |
| Fine-tuned Mistral-7B-v0.2 | COCOM-light (ours) | ×128 | 0.444 | 0.794 | 0.321 | 0.550 | 0.314 | 0.485 |
| Fine-tuned Mistral-7B-v0.2 | COCOM (ours) | ×4 | 0.554 | 0.859 | 0.430 | 0.609 | 0.474 | 0.585 |
| Fine-tuned Mistral-7B-v0.2 | COCOM (ours) | ×16 | 0.539 | 0.852* | 0.426* | 0.602* | 0.465 | 0.577 |
| Fine-tuned Mistral-7B-v0.2 | COCOM (ours) | ×128 | 0.511 | 0.835 | 0.378 | 0.585* | 0.391 | 0.540 |

5. Results

In this section, we present the main COCOM and COCOM-light results in terms of effectiveness and efficiency.

5.1. Main Results

The main results for COCOM are presented in Table 2. We measure performance following the standard practice for fine-tuned models using the Exact Match (EM) metric. Compared to existing context compression methods, our approach demonstrates significantly higher effectiveness (paired t-test, p < 0.05) across different compression rates for all datasets tested. COCOM even outperforms, by a large margin, the much stronger xRAG Mixtral-8x7B model, which has 8 times more parameters than COCOM. The highest performance is observed at a low compression rate (ξ=4). Increasing the compression rate results in a slight performance decline, which we analyze further in Section 6.1.

Side note: As mentioned in Section 2, existing context compression methods do not tune the decoder LLM and therefore compare their performance and make effectiveness claims against zero-shot baselines. However, we argue that tuning compression models while freezing the decoder LLM cannot be considered zero-shot, as it involves tuning some parts of the model on the task data. This setting is akin to soft-prompt tuning (Cuconasu et al., 2024; Li and Liang, 2021), where the compressor model effectively parameterizes the soft prompt. Consequently, the performance of these methods should be regarded as intermediate between zero-shot and full decoder tuning and should be compared against similar tuning settings, such as soft-prompt tuning.

Compared to our upper bound baseline RAG without compression, we reduce the context by up to 128 times while still maintaining relatively high performance on average over datasets.

Performance decreases on average by about 4 points for our strongest model (COCOM ξ = 4) and by about 8 points at the highest compression rate (COCOM ξ = 128). Compared to the lower-bound baseline, the LLM without provided context, we gain up to 17 points while adding only a small number of context embeddings to the input.
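
The gaps discussed here can be recomputed from the per-dataset EM scores in Table 2. The sketch below is only a convenience for checking the arithmetic; the dictionary keys and helper names are ours.

```python
# EM scores from Table 2, in the order NQ, TriviaQA, HotpotQA, ASQA, PopQA.
em = {
    "RAG (no compression)": [0.597, 0.883, 0.500, 0.622, 0.514],
    "LLM (without context)": [0.359, 0.708, 0.264, 0.546, 0.199],
    "COCOM x4": [0.554, 0.859, 0.430, 0.609, 0.474],
    "COCOM x128": [0.511, 0.835, 0.378, 0.585, 0.391],
}

def average(scores):
    return sum(scores) / len(scores)

avg = {name: average(scores) for name, scores in em.items()}

def gap_in_points(a, b):
    """Difference between two average EM scores, in points (EM x 100)."""
    return 100 * (avg[a] - avg[b])

drop_x4 = gap_in_points("RAG (no compression)", "COCOM x4")
drop_x128 = gap_in_points("RAG (no compression)", "COCOM x128")
gain_x4 = gap_in_points("COCOM x4", "LLM (without context)")
```

Printing `drop_x4`, `drop_x128`, and `gain_x4` gives the average losses against uncompressed RAG and the gain over the closed-book LLM, all in EM points.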

Note that while EM is a standard metric for evaluating tuned models, it might underestimate zero-shot decoder methods that do not adapt the decoder to generate answers. To address this, we also provide results using the Match metric in Table 9 in the Appendix. Although models that do not tune their decoder achieve relatively higher performance when measured in Match, the effectiveness of our method remains consistently and significantly higher than that of the other methods.

Overall, considering the effectiveness and the efficiency gains from context compression (discussed further in Section 5.3), COCOM shows a very favorable trade-off.

5.2. COCOM-light

Even though context compression has to be done only once offline, using a very large LLM can be costly, especially in resource-constrained settings. To this end, we propose COCOM-light, a computationally efficient context compression model that uses BERT as the context compressor.

To alleviate the dimensional mismatch between the BERT-based compressor and the typically larger LLM, we learn a linear projection layer $\bm{W}^{\gamma \times d}$, where γ is the compression rate (denoted ξ elsewhere) and d is the hidden dimension of the LLM. To obtain a set of Context Embeddings, we leverage the last hidden representation of each input token. We simply split the hidden representations into blocks of length γ and project each block into a single Context Embedding. This way, we learn a block-wise aggregation of the input representations that, depending on the input length and the compression rate γ, yields a different number of Context Embeddings per input. Note that a similar approach is applied in xRAG, where a projection layer is used on the embedding vector to resolve the dimensional mismatch. However, we argue that compressing into a single embedding vector could significantly restrict the compression quality, especially when using lightweight encoder models such as BERT. This restriction can result in much lower effectiveness compared to using a larger embedding model (Cheng et al., 2024).
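
The block-wise aggregation can be sketched in plain Python. This is an illustration under assumptions rather than the authors' implementation: we assume the γ token vectors of a block are concatenated before the linear projection (so the weight has γ·h rows for compressor hidden size h), and that a short final block is zero-padded. The function and variable names are ours.

```python
def compress_blocks(hidden, weight, rate):
    """Project each block of `rate` token vectors to one context embedding.

    hidden: list of n token vectors, each of length h (compressor hidden size).
    weight: matrix with rate * h rows and d columns (d = LLM hidden size).
    The last block is zero-padded if n is not a multiple of rate (assumption).
    """
    h = len(hidden[0])
    d = len(weight[0])
    embeddings = []
    for start in range(0, len(hidden), rate):
        block = hidden[start:start + rate]
        flat = [x for vec in block for x in vec]      # concatenate the block
        flat += [0.0] * (rate * h - len(flat))        # pad a short final block
        embeddings.append(
            [sum(flat[i] * weight[i][j] for i in range(rate * h)) for j in range(d)]
        )
    return embeddings

# Toy sizes: 10 tokens, h = 2, d = 3, rate = 4, giving ceil(10 / 4) = 3 embeddings.
hidden = [[float(t), 1.0] for t in range(10)]
weight = [[1.0] * 3 for _ in range(4 * 2)]
context_embeddings = compress_blocks(hidden, weight, rate=4)
```

The number of output embeddings depends only on the input length and the compression rate, which is the property the paragraph above relies on.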

We present the results in Table 2, measured in EM. Results for Match can again be found in Table 9 in the Appendix. We find that COCOM-light, while highly effective at small compression rates, drops considerably in performance at the highest compression rate of ξ=128. Compared to other methods, COCOM-light is an effective alternative to its bigger counterpart COCOM in resource-constrained settings.

5.3. Computational Efficiency

Table 3. Decoding efficiency in generation time, GPU memory, and number of operations (GFLOPs) for COCOM (-light) on the NQ dataset, using Mistral-7b-v0.2. ξ is the compression rate. Factors in parentheses give the reduction relative to RAG (no compr.).

| Model | ξ | Decoding time (ms) | GPU mem. (GB) | GFLOPs |
|---|---|---|---|---|
| RAG (no compr.) | - | 1064 | 18.1 | 25031 |
| LLM (no context) | - | 159 | 14.1 | 607 |
| COCOM (-light) | 4 | 371 (×2.87) | 15.1 (×1.20) | 7016 (×3.57) |
| COCOM (-light) | 16 | 213 (×5.00) | 14.4 (×1.26) | 2465 (×10.16) |
| COCOM (-light) | 128 | 187 (×5.69) | 14.2 (×1.27) | 1138 (×22.00) |

Table 4. Compression efficiency and storage requirements when compressing ~24m contexts on a single A100 80GB GPU.

| Compressor | ξ | Time (h) | Index size (TB) |
|---|---|---|---|
| COCOM | 4 | 89 | 6.06 |
| COCOM | 16 | 77 | 1.51 |
| COCOM | 128 | 73 | 0.19 |
| COCOM-light | 4 | 1 | 6.06 |
| COCOM-light | 16 | 1 | 1.51 |
| COCOM-light | 128 | 1 | 0.19 |

We measure efficiency in answer generation time (ms), maximum GPU memory (GB), and number of operations per generated token (Giga FLOPs) using the torch profiler. We run the experiments on a single A100 40GB with a fixed batch size of 16, the maximum batch size that fits on the GPU across models. We load the model in half-precision and use the PyTorch inference mode. We discard the first warm-up batch from the measurement and measure the bare forward pass of the model. Note that decoding results are independent of the compressor; therefore, COCOM and COCOM-light share efficiency results.
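
The warm-up handling in this protocol can be sketched with a generic timing harness. This is not the torch profiler setup used in the paper: it uses only wall-clock timing from the standard library, and `dummy_forward` is a hypothetical stand-in for the model's forward pass.

```python
import time

def mean_batch_time_ms(forward, batches):
    """Average wall-clock time per batch in ms, skipping the first (warm-up) batch."""
    timings = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        forward(batch)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i > 0:  # discard the warm-up batch
            timings.append(elapsed_ms)
    return sum(timings) / len(timings)

# Stand-in for a model forward pass; a real setup would call the LLM here.
def dummy_forward(batch):
    return [x * 2 for x in batch]

batches = [list(range(16)) for _ in range(5)]  # fixed batch size of 16
avg_ms = mean_batch_time_ms(dummy_forward, batches)
```

On a GPU, kernels are launched asynchronously, so a real measurement would also need to synchronize the device before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch).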

We show our efficiency results for answer generation in Table 3 for different compression rates ξ and compare them to RAG without context compression. Context compression with COCOM drastically reduces answer generation time, GPU memory, and the number of operations: up to 5.69× lower inference time, 1.27× lower GPU memory, and 22× fewer GFLOPs compared to no compression.
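
The reduction factors can be derived directly from the raw measurements in Table 3. The sketch below only redoes that division; the dictionary layout and function name are ours.

```python
# Decoding measurements from Table 3 (NQ): time in ms, GPU memory in GB, GFLOPs.
baseline = {"time_ms": 1064, "mem_gb": 18.1, "gflops": 25031}  # RAG, no compression
cocom = {
    4: {"time_ms": 371, "mem_gb": 15.1, "gflops": 7016},
    16: {"time_ms": 213, "mem_gb": 14.4, "gflops": 2465},
    128: {"time_ms": 187, "mem_gb": 14.2, "gflops": 1138},
}

def reduction_factors(row):
    """How many times smaller each cost is compared with the uncompressed baseline."""
    return {key: baseline[key] / row[key] for key in baseline}

factors = {rate: reduction_factors(row) for rate, row in cocom.items()}
```

Decoding time and GFLOPs shrink much faster than GPU memory, presumably because memory use is dominated by the model weights rather than by the input length.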

In addition, Table 4 presents the compression costs for all documents in the kilt-100w collection (~24m contexts) using the COCOM and COCOM-light models at various compression rates. COCOM-light compresses substantially faster than COCOM (up to 89×) because it employs a computationally much lighter compression module. Index size varies inversely with the compression rate: higher compression rates result in smaller index storage requirements. However, this trade-off leads to lower quality in answer generation, as shown in Table 2.
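
The inverse relation between compression rate and index size can be illustrated with a back-of-the-envelope estimate. All inputs below are assumptions chosen for illustration (24 million contexts of 128 tokens each, 4096-dimensional embeddings stored in half precision); the paper does not state how the index sizes in Table 4 were computed.

```python
import math

def index_size_tb(n_contexts, tokens_per_context, rate, dim, bytes_per_value=2):
    """Estimated storage in terabytes (10**12 bytes) for pre-computed embeddings."""
    embeddings_per_context = math.ceil(tokens_per_context / rate)
    total_bytes = n_contexts * embeddings_per_context * dim * bytes_per_value
    return total_bytes / 1e12

# Assumed values: 24 million contexts of 128 tokens, 4096-dim embeddings, fp16.
sizes = {rate: index_size_tb(24_000_000, 128, rate, 4096) for rate in (4, 16, 128)}
```

Under these assumptions the estimate lands in the same range as Table 4 and reproduces the inverse scaling with the compression rate; the exact figures depend on the true number and length of the contexts.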

5.4. Ablations

In the following section, we run additional ablation experiments for COCOM and COCOM-light. Most results can be found in Table 6. We report performance in Exact Match on two datasets (NQ and ASQA).

5.4.1. Handling multiple contexts.

In Table 5, we compare the performance of COCOM with 1 retrieved context (k=1) versus our default setup (k=5). On both datasets and for all compression rates, we observe a substantial gain when using more contexts. Moreover, COCOM with 1 retrieved context is still significantly better than the baselines in Table 2 that rely on a single retrieved document (ICAE, xRAG). As a decoder model should by design be able to handle multiple context representations, we argue that fine-tuning the decoder is a simple yet necessary solution compared to existing works.

Table 5. Impact of the number of provided contexts on COCOM, measured in EM on the NQ and ASQA datasets.

| Model | ξ | NQ (k=5) | NQ (k=1) | ASQA (k=5) | ASQA (k=1) |
|---|---|---|---|---|---|
| COCOM | 4 | 0.554 | 0.4987 | 0.609 | 0.558 |
| COCOM | 16 | 0.539 | 0.4913 | 0.602 | 0.541 |
| COCOM | 128 | 0.511 | 0.4818 | 0.585 | 0.544 |

5.4.2. Pre-training Context Compression

Central to our approach is the compression of context into a small number of Context Embeddings. We argue that context compression fundamentally differs from the language modeling objective on which the model was originally trained. Consequently, we have employed auto-encoding and language-modeling-from-context-embedding tasks to learn how to effectively compress the context and utilize these compressed representations during decoding. We report the impact of the pre-training tasks on downstream performance after fine-tuning in Table 6 (w/o pre-training). Our results suggest that the dedicated pre-training tasks for context compression improve downstream QA performance, for which we see two possible explanations. Either context compression is too complex to be learned concurrently with the downstream task, or larger fine-tuning datasets are necessary to effectively learn how to compress contexts.

5.4.3. Pre-training Corpus

Our method employs an initial pre-training step aimed at initializing context compression. We train auto-regressively on the same target corpus that is later used to retrieve relevant contexts. In this experiment, our objective is to assess how variations in the pre-training corpus impact downstream QA performance, thereby testing the robustness of our approach. To explore this, we additionally pre-train the model on the "sample-10BT" subset of FineWeb (Penedo et al., 2024). We employ the same training methodology described in Section 4.2.1: we segment the collection into non-overlapping passages of 128 tokens using the Llama-2-7b tokenizer and train on a subset of 10 million samples, as for the target corpus. The results presented in Table 6 indicate a slight decrease in performance when pre-training on a corpus other than the target corpus. Nonetheless, our approach is robust to variations in the pre-training corpus.

5.4.4. Decoder LLM Tuning

Existing context compression methods tune only the compression module while keeping the decoder, which is responsible for generating the answer, frozen. A core distinction of COCOM from these methods is that we tune all components, including the decoder. We hypothesize that context embeddings differ significantly from the input token embeddings the model was trained on, thereby hindering effective utilization without dedicated tuning. We investigate the consequences of freezing the decoder and solely tuning the compressor, akin to existing methods. Our findings (Table 6, w/o tuning decoder) show that tuning the decoder is critical to achieving high effectiveness. This supports our hypothesis that tuning the decoder specifically on context embeddings is essential for better performance.

Table 6. Impact of pre-training corpus, pre-training, and decoder tuning on downstream performance (EM). Compression rate ξ = 128.

| Ablation | NQ | ASQA |
|---|---|---|
| COCOM-light (baseline) | 0.444 | 0.550 |
| COCOM-light, w/o pre-training | 0.423 | 0.524 |
| COCOM-light, pre-training on FineWeb | 0.427 | 0.545 |
| COCOM-light, w/o tuning decoder | 0.353 | 0.438 |
| COCOM (baseline) | 0.511 | 0.585 |
| COCOM, w/o pre-training | 0.490 | 0.565 |
| COCOM, pre-training on FineWeb | 0.503 | 0.581 |
| COCOM, w/o tuning decoder | 0.421 | 0.521 |

5.4.5. Fine-tuning Data

In our experiments, we fine-tune our models simultaneously on multiple QA datasets before evaluating them on individual datasets. We explore the impact of this multi-dataset fine-tuning compared to training on a single dataset. Specifically, we fine-tune and evaluate our models on NQ (Natural Questions). For assessing transferability, we also conduct zero-shot evaluations on other datasets. The results are presented in Figure 3. We find that fine-tuning solely on a single dataset, such as NQ, leads to slightly higher performance on that specific dataset. However, training on multiple datasets demonstrates superior transferability across all datasets, resulting in better average performance overall.

Figure 3. Impact on zero-shot transferability of fine-tuning on multiple datasets (multi) concurrently vs. on a single dataset for COCOM. Compression rate ξ = 128.

6. Analysis

In this section, we conduct further analysis on how compression affects the model.

6.1. Context compression

In our earlier results in Section 5.1, we observe a decline in performance with higher compression rates, particularly for the lightweight compressor in COCOM-light. To investigate potential reasons for this drop, we assess the model’s ability to perform the two pre-training tasks: (i) compressing and decompressing input (auto-encoding) and (ii) language modeling from compressed representations after pre-training.

Our results in Table 7 indicate that both models effectively learn the auto-encoding task at lower compression rates (ξ=4, 16), but struggle to recover the input when the context is compressed into fewer embeddings (ξ=128), with this issue being more pronounced for the lightweight compression module.

We identify two possible explanations: First, compressing longer contexts is inherently more challenging and might require additional objectives. Second, decompressing information from a smaller set of Context Embeddings may be more difficult due to the sequential decoding nature of decoder-only models. Introducing additional pause tokens (Goyal et al., 2024) could help alleviate this issue, providing the model with a means to hierarchically decompress information, drawing on ideas from Chain-of-Thought prompting (Wei et al., 2022). We also experimented with pre-training on more samples but found that this did not improve downstream performance. Regarding the second pre-training task, it is noteworthy that COCOM-light outperforms its larger counterpart in language modeling from Context Embeddings. This analysis shows compressing and re-constructing longer texts is challenging and needs further investigation.

Table 7. Pre-training evaluation on the tasks Auto Encoding (AE) and Language Modeling from Context Embeddings (LMCE), measured in Rouge-L.

| Model | ξ | AE (Rouge-L) | LMCE (Rouge-L) |
|---|---|---|---|
| COCOM-light | 4 | 0.9979 | 0.2045 |
| COCOM-light | 16 | 0.9912 | 0.1991 |
| COCOM-light | 128 | 0.5545 | 0.1771 |
| COCOM | 4 | 0.9734 | 0.1882 |
| COCOM | 16 | 0.9643 | 0.1800 |
| COCOM | 128 | 0.7938 | 0.1618 |
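
For readers unfamiliar with the metric in Table 7, Rouge-L is based on the longest common subsequence (LCS) between a reference and a generated text. The sketch below computes an F1 variant with whitespace tokenization; both choices are simplifying assumptions, since the paper does not state which Rouge-L variant or tokenizer was used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """Rouge-L F1 with whitespace tokenization (a simplification)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

perfect = rouge_l_f1("when the boat comes in", "when the boat comes in")
partial = rouge_l_f1("when the boat comes in", "the boat comes")
```

A perfect reconstruction scores 1.0, so the AE scores close to 1 at ξ=4 and ξ=16 indicate that the input is recovered almost verbatim.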

6.2. Case Study Answer Quality

We investigate the answers generated with different models. For this, we randomly select a query from the NQ dataset and compare the responses generated by each method. Table 8 presents the responses to the selected question.

From the responses, we observe that without RAG, the LLM tends to hallucinate and provides an irrelevant name as an answer. xRAG, on the other hand, understands the question but returns an incorrect named entity, likely due to limitations in reading compressed embeddings accurately. ICAE struggles to comprehend the question, resulting in an unreasonable answer. Both COCOM and COCOM-light answer the question correctly at a compression rate of 4. However, they encounter difficulties when the compression rate is increased to 128.

It is also worth noting that the xRAG response was intentionally truncated to a maximum of 30 tokens in its original publication, with the stopping criteria involving halting at punctuation marks such as periods, commas, and colons.

Table 8. Case Study: Generated responses using different methods. Dataset: NQ
Model Input
Question: who played sarah hedley in when the boat comes in?
Context 1: Rosalind Bailey. Rosalind Bailey Rosalind Bailey (born 1946) is a British actress, known for her portrayal of Sarah Headley ("née" Lytton) in the 1970s and 1980s BBC television drama "When the Boat Comes In". Bailey has appeared in numerous British television drama series, including "Byker Grove", "Distant Shores" and "Burn Up". Her stage work includes playing Miss Mary Shepherd in Alan Bennett’s play "The Lady in the Van".
Context 2: Malcolm Terris. Malcolm Terris Malcolm Terris (born 11 January 1941 in Sunderland, County Durham) is a British actor. He had a lengthy career in a large number of television programmes. Possibly his best-known role was in "When the Boat Comes In", a popular 1970s series, where he played the part of Matt Headley. His film career includes appearances in "The First Great Train Robbery" (1978), "McVicar" (1980), "The Plague Dogs" (1982, voice only), "Slayground" (1983), "The Bounty" (1984) as Thomas Huggan, ship’s surgeon, "Mata Hari" (1985), "Revolution" (1985), "Scandal" (1989), and "Chaplin" (1992). His TV appearances include: One episode of
Context 3: When the Boat Comes In. When the Boat Comes In When the Boat Comes In is a British television period drama produced by the BBC between 1976 and 1981. The series stars James Bolam as Jack Ford, a First World War veteran who returns to his poverty-stricken (fictional) town of Gallowshield in the North East of England. The series dramatises the political struggles of the 1920s and 1930s and explores the impact of national and international politics upon Ford and the people around him. Section:Production. The majority of episodes were written by creator James Mitchell, but in Series 1 north-eastern
Context 4: Susie Youssef. Youssef began her comedy career as a writer for "The Ronnie Johns Half Hour" in 2006, and made her acting debut in the short film "Clicked" in the role of Lina in 2011. In 2014, she played Jane in the short film "Kevin Needs to Make New Friends: Because Everyone Hates Him for Some Reason" and then turned to television where she appeared in "The Chaser’s Media Circus". In 2014, Youssef played the lead role of Sarah in the Hayloft Project’s stage play "The Boat People" which won the Best On Stage award at the FBi SMAC Awards
Context 5: Madelaine Newton. Madelaine Newton Madelaine Newton is a British actress best known for her portrayal of Dolly in 1970s BBC television drama "When the Boat Comes In". She is married to actor Kevin Whately, known for his role as Robert "Robbie" Lewis in both "Inspector Morse" and its spin-off "Lewis". They have two children. She starred alongside her husband in the "Inspector Morse" episode "Masonic Mysteries" as Beryl Newsome - the love-interest of Morse - whom Morse was wrongly suspected of murdering. She played Whately’s on-screen wife in the 1988 Look and Read children’s serial, Geordie Racer. She also made
Generated Responses
Label: Rosalind Bailey
LLM: Anna Cropper
RAG: Rosalind Bailey
xRAG: 1976 : The role of Sarah Hedley in When the Boat Comes In was played by Rosalie Crutchley.
ICAE: Sarah Hadland
COCOM-4: Rosalind Bailey
COCOM-light-4: Rosalind Bailey
COCOM-128: Alison Steadman
COCOM-light-128: Rosalind Elliott

7. Conclusion

In this paper, we presented COCOM, our novel approach for context compression. Our main finding is that COCOM accelerates answer generation by reducing the model’s input: multiple contexts are compressed into context embeddings that, once pre-computed, serve to augment answer generation.

Our approach maximizes the potential of the LLM by tuning all components, outperforming existing methods for context compression in RAG. By offering a trade-off between efficiency and effectiveness, our method allows for the selection of varying numbers of context compression tokens. This flexibility enables us to balance higher answer quality against faster generation times as needed. Unlike previous methods, our approach allows for the input of multiple contexts, which enhances generation quality and makes the best use of the reduced decoding time, because the difference between contexts in token form and a reduced set of embeddings is most apparent for very long inputs.

We hope that our work will inspire further research in context compression and pave the way for efficient and effective deployment of Retrieval-Augmented Generation (RAG) models in real-world applications.

8. Limitations

We end this paper by discussing the remaining limitations of our model and of our experiments.

Our approach offers great potential to reduce the computational footprint of RAG. However, in our experiments we were constrained by computational resources, which limited us to a relatively small model of 7 billion parameters. This constraint prevented us from exploring the capabilities of larger models such as LLaMA-70B or Mixtral-8x7B, which might offer enhanced performance but require significant computational power for training and inference.

Our approach demonstrates the potential to leverage a much larger set of documents compared to non-compressed models, leading to notable efficiency gains. These gains are particularly evident when dealing with a substantial volume of documents. However, due to resource limitations, our experiments have been restricted to only 5 documents. This limited scope may not fully reflect the method’s effectiveness when scaled to larger document collections, where the benefits could be more pronounced.

Additionally, the evaluation of our method has been conducted exclusively on Question Answering (QA) tasks and using English corpora. A more comprehensive assessment, encompassing diverse tasks and multilingual datasets, would be necessary to thoroughly understand the model’s capabilities and limitations in different scenarios.

References

  • Asai et al. (2024) Akari Asai, Zexuan Zhong, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi, and Wen-tau Yih. 2024. Reliable, Adaptable, and Attributable Language Models with Retrieval. arXiv preprint arXiv:2403.03187 (2024).
  • Bartolo et al. (2020) Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension. Transactions of the Association for Computational Linguistics 8 (2020), 662–678. https://doi.org/10.1162/tacl_a_00338
  • Cheng et al. (2024) Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. 2024. xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token. arXiv preprint arXiv:2405.13792 (2024).
  • Chevalier et al. (2023) Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapting Language Models to Compress Contexts. arXiv:2305.14788 [cs.CL]
  • Cuconasu et al. (2024) Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. 2024. The Power of Noise: Redefining Retrieval for RAG Systems. arXiv:2401.14887 [cs.IR]
  • Dehghani et al. (2019) Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, and Maarten de Rijke. 2019. Learning to Transform, Combine, and Reason in Open-Domain Question Answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, J. Shane Culpepper, Alistair Moffat, Paul N. Bennett, and Kristina Lerman (Eds.). ACM, 681–689. https://doi.org/10.1145/3289600.3291012
  • Fan et al. (2019) Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 3558–3567. https://doi.org/10.18653/v1/P19-1346
  • Ge et al. (2024) Tao Ge, Hu Jing, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. 2024. In-context Autoencoder for Context Compression in a Large Language Model. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=uREj4ZuGJE
  • Goyal et al. (2024) Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. 2024. Think before you speak: Training Language Models With Pause Tokens. In The Twelfth International Conference on Learning Representations.
  • He et al. (2021) Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 (2021).
  • Hsia et al. (2024) Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, and Graham Neubig. 2024. RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems. arXiv:2403.09040 [cs.CL]
  • Izacard and Grave (2021) Gautier Izacard and Edouard Grave. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. http://arxiv.org/abs/2007.01282 arXiv:2007.01282 [cs].
  • Izacard et al. (2022) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Atlas: Few-shot Learning with Retrieval Augmented Language Models. http://arxiv.org/abs/2208.03299 arXiv:2208.03299 [cs].
  • Jiang et al. (2023) Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 13358–13376. https://doi.org/10.18653/v1/2023.emnlp-main.825
  • Jiang et al. (2019) Kelvin Jiang, Dekun Wu, and Hui Jiang. 2019. FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 318–323. https://doi.org/10.18653/v1/N19-1028
  • Johannes Welbl (2017) Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing Multiple Choice Science Questions. arXiv:1707.06209v1.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Regina Barzilay and Min-Yen Kan (Eds.). Association for Computational Linguistics, Vancouver, Canada, 1601–1611. https://doi.org/10.18653/v1/P17-1147
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466.
  • Lassance et al. (2024) Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New baselines for SPLADE. arXiv preprint arXiv:2403.06789 (2024).
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 4582–4597. https://doi.org/10.18653/v1/2021.acl-long.353
  • Li (2023) Yucheng Li. 2023. Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering. arXiv:2304.12102 [cs.CL]
  • Liu et al. (2023) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. https://doi.org/10.48550/arXiv.2307.03172 arXiv:2307.03172 [cs].
  • Mallen et al. (2023) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 9802–9822. https://doi.org/10.18653/v1/2023.acl-long.546
  • Morris et al. (2023) John Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander Rush. 2023. Text Embeddings Reveal (Almost) As Much As Text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 12448–12460. https://doi.org/10.18653/v1/2023.emnlp-main.765
  • Muennighoff et al. (2024) Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. 2024. Generative Representational Instruction Tuning. arXiv:2402.09906 [cs.CL]
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human-Generated Machine Reading Comprehension Dataset. (2016).
  • Penedo et al. (2024) Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557 [cs.CL] https://arxiv.org/abs/2406.17557
  • Petroni et al. (2020) Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2020. KILT: a benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252 (2020).
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Jian Su, Kevin Duh, and Xavier Carreras (Eds.). Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264 arXiv:1606.05250 [cs.CL]
  • Rau et al. (2024) David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, and Stéphane Clinchant. 2024. BERGEN: A Benchmarking Library for Retrieval-Augmented Generation. arXiv:2407.01102 [cs.CL] https://arxiv.org/abs/2407.01102
  • Stelmakh et al. (2022) Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. 2022. ASQA: Factoid Questions Meet Long-Form Answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 8273–8288. https://doi.org/10.18653/v1/2022.emnlp-main.566
  • Tan et al. (2024) Sijun Tan, Xiuyu Li, Shishir Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E Gonzalez, and Raluca Ada Popa. 2024. LLoCO: Learning Long Contexts Offline. arXiv preprint arXiv:2404.07979 (2024).
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
  • Xu et al. (2023) Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation. arXiv:2310.04408 [cs.CL]
  • Yang et al. (2015) Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lluís Màrquez, Chris Callison-Burch, and Jian Su (Eds.). Association for Computational Linguistics, Lisbon, Portugal, 2013–2018. https://doi.org/10.18653/v1/D15-1237
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 2369–2380. https://doi.org/10.18653/v1/D18-1259
  • Zhu et al. (2024) Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, et al. 2024. Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection. arXiv preprint arXiv:2405.16178 (2024).

Appendix A Appendix

Table 9. Results in Match (M) comparing COCOM (-light) to other context compression works. All methods use 5 context passages unless indicated otherwise. Method limited to single context. upper baseline. lower baseline. * indicates statistical non-significance (p > 0.05) with respect to COCOM ξ = 4.
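| Decoder | Method | Compression rate (ξ) | NQ | TriviaQA | HotpotQA | ASQA | PopQA | Average |
|---|---|---|---|---|---|---|---|---|
| Zero-shot | AutoCompressor (Chevalier et al., 2023) | ×4 | 0.351 | 0.703 | 0.314 | 0.574 | 0.237 | 0.435 |
| Zero-shot | ICAE (Ge et al., 2024) | ×4 | 0.421 | 0.784 | 0.293 | 0.469 | 0.426 | 0.479 |
| Zero-shot | xRAG (Cheng et al., 2024), Mistral-7B-v0.2 | ×128 | 0.316 | 0.766 | 0.267 | 0.339 | 0.326 | 0.403 |
| Zero-shot | xRAG (Cheng et al., 2024), Mixtral-8x7b | ×128 | 0.405 | 0.852 | 0.326 | 0.457 | 0.412 | 0.490 |
| Fine-tuned Mistral-7B-v0.2 | RAG (no compression) | - | 0.637 | 0.917 | 0.544 | 0.665* | 0.543 | 0.661 |
| Fine-tuned Mistral-7B-v0.2 | LLM (no context) | - | 0.403 | 0.753 | 0.283 | 0.573 | 0.208 | 0.444 |
| Fine-tuned Mistral-7B-v0.2 | COCOM-light (ours) | ×4 | 0.579 | 0.882 | 0.439 | 0.633* | 0.473 | 0.601 |
| Fine-tuned Mistral-7B-v0.2 | COCOM-light (ours) | ×16 | 0.529 | 0.857 | 0.395 | 0.604 | 0.395 | 0.556 |
| Fine-tuned Mistral-7B-v0.2 | COCOM-light (ours) | ×128 | 0.479 | 0.828 | 0.347 | 0.586 | 0.326 | 0.513 |
| Fine-tuned Mistral-7B-v0.2 | COCOM (ours) | ×4 | 0.589 | 0.894 | 0.461 | 0.640 | 0.487 | 0.614 |
| Fine-tuned Mistral-7B-v0.2 | COCOM (ours) | ×16 | 0.577* | 0.886* | 0.456* | 0.633* | 0.478 | 0.606 |
| Fine-tuned Mistral-7B-v0.2 | COCOM (ours) | ×128 | 0.546 | 0.866 | 0.403 | 0.617* | 0.402 | 0.567 |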
Table 10. Hyperparameters for Pretraining
| Hyperparameter | Assignment |
|---|---|
| learning rate | 1e-4 |
| lr scheduler type | linear |
| warmup ratio | 0.05 |
| weight decay | 0.1 |
| overall batch size | 256 |
| optimizer | AdamW |
| epochs | 1 |
| LoRA layers | all linear layers |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| LoRA r | 16 |
| LoRA bias | None |
| GPU | 8 × A100 80GB |
| context max length | 128 |
Table 11. Hyperparameters for Fine-tuning
| Hyperparameter | Assignment |
|---|---|
| learning rate | 1e-4 |
| lr scheduler type | linear |
| warmup ratio | 0.05 |
| weight decay | 0.1 |
| overall batch size | 64 |
| optimizer | AdamW |
| epochs | 2 |
| LoRA layers | all linear layers |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| LoRA r | 16 |
| LoRA bias | None |
| GPU | 8 × A100 80GB |
| retriever(s) | SPLADE-v3 (+ DeBERTa-v3) |
| num passages | 5 |
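
As a rough illustration of how the fine-tuning settings in Table 11 interact, the sketch below collects them in a small configuration object and derives the number of optimizer steps and the learning rate at a given step. The class and function names (FineTuneConfig, total_steps, linear_lr) are hypothetical and not taken from the authors' code. The sketch assumes that "linear" denotes linear warmup followed by linear decay to zero, and that the last partial batch of each epoch is kept; both are assumptions, not statements from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning values from Table 11; the field names are illustrative."""
    learning_rate: float = 1e-4
    warmup_ratio: float = 0.05
    weight_decay: float = 0.1
    overall_batch_size: int = 64
    epochs: int = 2
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.1
    num_passages: int = 5

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r


def total_steps(cfg: FineTuneConfig, num_examples: int) -> int:
    # Assumption: the last, partially filled batch of each epoch is kept.
    return math.ceil(num_examples / cfg.overall_batch_size) * cfg.epochs


def linear_lr(cfg: FineTuneConfig, step: int, n_steps: int) -> float:
    # Assumption: "linear" means linear warmup to the peak rate over the
    # first warmup_ratio * n_steps steps, then linear decay to zero.
    warmup = int(cfg.warmup_ratio * n_steps)
    if step < warmup:
        return cfg.learning_rate * step / max(1, warmup)
    remaining = max(0, n_steps - step)
    return cfg.learning_rate * remaining / max(1, n_steps - warmup)


cfg = FineTuneConfig()
# 493,473 is the total number of examples reported in Table 12.
n_steps = total_steps(cfg, num_examples=493_473)
peak_step = int(cfg.warmup_ratio * n_steps)
print(n_steps, peak_step, linear_lr(cfg, peak_step, n_steps))
```

The step counts depend on the stated assumptions; a setup that drops the last batch or trains on a different number of examples would give slightly different values.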
Table 12. Datasets contained in the multi-dataset collection used for fine-tuning our COCOM (-light). We filtered out queries with more than 128 tokens and labels of more than 64 tokens.
| Dataset | Number of examples |
|---|---|
| NQ | 87,925 |
| MSMARCO | 100,000 |
| Adversarial QA | 30,000 |
| HotpotQA | 88,869 |
| WikiQA | 873 |
| SciQ | 11,679 |
| ASQA | 4,353 |
| Wiki QA | 61,817 |
| Freebase | 20,358 |
| SQuAD | 87,599 |
| Total | 493,473 |
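
The length filter described in the caption of Table 12 could look roughly like the sketch below. The function name filter_examples, the "query" and "labels" keys, and the whitespace tokenizer are all placeholders chosen for illustration; the actual pipeline presumably counts tokens with the model's tokenizer. Dropping an example when any one of its labels is too long is also an assumption.

```python
from typing import Callable, Dict, List


def filter_examples(
    examples: List[Dict],
    tokenize: Callable[[str], List[str]] = str.split,
    max_query_tokens: int = 128,
    max_label_tokens: int = 64,
) -> List[Dict]:
    """Keep examples whose query and labels fit within the token limits."""
    kept = []
    for ex in examples:
        # Drop queries longer than the query limit.
        if len(tokenize(ex["query"])) > max_query_tokens:
            continue
        # Assumption: one over-long label is enough to drop the example.
        if any(len(tokenize(label)) > max_label_tokens for label in ex["labels"]):
            continue
        kept.append(ex)
    return kept


examples = [
    {"query": "who wrote hamlet", "labels": ["William Shakespeare"]},
    {"query": "why " * 129, "labels": ["short answer"]},   # query too long
    {"query": "what is rag", "labels": ["word " * 65]},    # label too long
]
kept = filter_examples(examples)
print(len(kept))
```

Passing a different tokenize callable, for example one wrapping a subword tokenizer, changes only how lengths are counted; the thresholds stay at the values given in the caption.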