[1]\fnmHaowen \surHou \equalcontThese authors contributed equally to this work.

\equalcont

These authors contributed equally to this work.

[1]\orgnameGuangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), \orgaddress\streetYutang, \cityShenzhen, \postcode518000, \stateGuangdong, \countryChina

Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression

[email protected]    \fnmFei \surMa    \fnmBinwen \surBai    \fnmXinxin \surZhu    \fnmFei \surYu *
Abstract

Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. However, to mitigate the issue of hallucinations, LLMs often incorporate retrieval-augmented pipeline to provide them with rich external knowledge and context. Nevertheless, challenges stem from inaccurate and coarse-grained context retrieved from the retriever. Supplying irrelevant context to the LLMs can result in poorer responses, increased inference latency, and higher costs. This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content, thereby accelerating and enhancing the use of LLMs. The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and minimizes generation latency while maintaining performance levels comparable to those achieved with the use of the full context. Specifically, we achieved a 50% reduction in context-related costs, resulting in a 5% reduction in inference memory usage and a 2.2-fold increase in inference speed, with only a minor drop of 0.047 in Rouge-1. These findings suggest that our method strikes an effective balance between efficiency and performance.

keywords:
Large Language Models, Context Compression, Retrieval Augmented Generation
Refer to caption
Figure 1: Retrieval Augmented Generation(RAG) pipeline with Instruction-Aware Contextual Compression.

1 Introduction

Large language models (LLMs) have exhibited impressive capabilities in terms of both their robust performance and generalization across a diverse spectrum of natural language processing tasks, as well as practical real-world applications [5, 30, 31]. To address certain issues with Large Language Models (LLMs), such as long-context [39] or hallucination [14, 29] problems, retrieval-augmented generation (RAG) [18] has emerged. RAG has become an important approach to enhance Large Language Models.

However, when using RAG, there can still be problems with irrelevant information. On one hand, inaccurate recall may lead to the retrieval of irrelevant documents. On the other hand, even within relevant documents, there might be irrelevant content that could distract the Large Language Model (LLM) from the relevant information. Passing the full document to the LLMs can lead to poor responses, large inference latency, and high costs.

Contextual compression aims to address this issue. The concept is straightforward: rather than directly presenting retrieved documents in their original form, they can be compressed, ensuring that only relevant information is conveyed. Some research efforts [19] have been committed to effectively compressing the context or prompt for large language models, with the aim of utilizing the most concise input while simultaneously preserving the robust performance of these models.

In this paper, we introduce Instruction-Aware Contextual Compressor (IACC), a novel approach that harnesses both ranking and generation information to eliminate extraneous context, thereby mitigating the computational overhead associated with the given context. Consequently, this leads to a reduction in both inference memory usage and inference time. The Instruction-Aware Contextual Compressor is adept at preserving finely detailed content directly related to instructions while compactly representing context, resulting in an efficient and streamlined input for Large Language Models (LLMs) without compromising their performance.

The main contributions of our paper are as follows:

  1. 1.

    We introduce Instruction-Aware Contextual Compressor, an innovative model aimed at enhancing the context efficiency of LLMs, which is able to reduce memory usage and inference latency without sacrificing LLMs performance.

  2. 2.

    We found that Instruction-aware contextual compression by generation is more effective than Instruction-aware contextual compression by ranking, even though the former utilizes training data that is only one-tenth of the latter.

  3. 3.

    We developed the WikiQA-LongForm Dataset, a long-form open-domain question answering dataset based on Wikipedia entries, which can be used for training and evaluating models’ context compression capabilities. This dataset is now publicly available at WikiQA-LongForm for use in other research projects.

The remaining sections of the paper are structured as follows: In Section 2, we delve into the related work. Section 3 outlines the method and model architecture. Section 4 describes experimental setup. Results are detailed in Section 5, and we provide conclusion in Section 6.

2 Related Work

In this section, we review existing approaches [21, 22, 10, 36] aimed at addressing the limitations imposed by context length in Large Language Models (LLMs). These limitations have motivated the development of various techniques to extend the context window of LLMs and enhance their performance.

2.1 Retrieval-Augmented Generation

Retrieval has been integrated into language models for years to enhance various aspects such as perplexity [4, 32], factual accuracy [24], downstream task accuracy [8, 12, 13, 18], and in-context learning capability [11]. Combined with a standalone retriever [16, 34, 20], retrieval-augmented LLM is a well-established approach for addressing question answering with long documents in an open-domain context. In previous studies, language models have been augmented with retrieval during inference [17, 41], fine-tuning [13, 18, 8], and pretraining [4, 13, 32]. There are also some methods that aim to integrate LLM and retriever into a single model, creating an end-to-end solution [15, 28].

2.2 Long Context Large Language Models

Many approaches have sought to improve the handling of longer contexts in Large Language Models (LLMs) through modifications to their underlying architectures. Notably, the Longformer [3] employs a linear attention mechanism that scales with sequence length, allowing it to accommodate longer contexts effectively. CoLT5 [1] introduces conditional computation techniques that enable the model to focus more on crucial tokens in both feedforward and attention layers. However, it’s worth noting that many existing works have not yet adopted such architectural modifications, mainly due to the high computational cost associated with training LLMs. Another category of approaches addresses the context length limitation by employing context chunking strategies. The Parallel Context Windows [26] proposes a parallel context window method, which calculates attention within individual chunks and incorporates positional embeddings between these chunks.

2.3 Prompt Engineering

Prompt engineering is a relatively emerging discipline focused on crafting and refining prompts to harness the power of language models (LMs) for diverse applications and research endeavors. Proficiency in prompt engineering aids in gaining a deeper insight into the capabilities and constraints of large language models (LLMs). Researchers employ prompt engineering to enhance the performance of LLMs across an array of common and intricate tasks, including question answering and arithmetic reasoning. Developers leverage prompt engineering to devise resilient and efficient prompting strategies that interact seamlessly with LLMs and other associated tools. Prompt engineering encompasses two key directions: text-to-text and text-to-image [33, 25] interactions. This area of research has witnessed significant manual efforts, exemplified by A Prompt Pattern Catalog [37], where a comprehensive collection of handcrafted prompt techniques is meticulously documented. On the other hand, there are automated approaches to prompt generation, such as the work on ”Automatic Prompt Engineer” by Zhou and ”LM-BFF” by Gao [42, 7].

2.4 Context Compression

Context compression can be considered a form of prompt engineering, although their emphases are slightly different. Several techniques aim to compress prompts effectively while maintaining context relevance. The Selective Context [19] approach leverages concepts from information theory, specifically self-information, to compress the context. Another approach, Learning to Compress Prompts with Gist Tokens [23], trains Gist models to compress prompt words into ”gist” tokens before inputting them into the LLMs. Additionally, LeanContext [2] extracts a dynamic number k𝑘kitalic_k of key sentences from prompts and uses a reinforcement learning mechanism to determine the optimal value of k𝑘kitalic_k for compression.

3 Method

We introduce Instruction-Aware Contextual Compression [19], an innovative approach for context compression that leverages both ranking and generation information. In contrast to the instruction-agnostic context compression methods used previously, Instruction-Aware Contextual Compression is a method that relies on instructions to perform context compression. Depending on the specific instruction provided, the model produces different compression outcomes, as shown in Figure 2, removing irrelevant portions of the context, ultimately achieving improved context compression results.

To achieve Instruction-Aware Contextual Compression, we propose a two-stage pre-training methodology comprising the following stages: (1) Ranking-Based Learning Stage. (2) Generative Learning Stage. This section commences with an exposition of the model architecture employed in Instruction-Aware Contextual Compressor, and then introduces how we trained it in two different stages.

Original: The 2018 Chinese Professional Baseball League (CPBL) Red vs White All-Star Game was a special event held during the 29th CPBL season. It took place on July 7th and July 8th, 2018, at the Taipei Tianmu Baseball Stadium in Taipei, Taiwan. The main game was held on the first day, while the second day featured five skill competitions. The main game concluded with the Three-Family Mart Brothers White Team defeating the Taiwan Cooperative Bank Red Team by a one-run difference.

Compressed by instruction ”When was the 2018 CPBL Red vs. White All-Star Game held?”: The 2 0 1 8 Chinese Professional Baseball League ( CP BL ) Red vs White All - Star Game was a special event held during the 2 9 th CP BL season . It took place on July 7 th and July 8 th , 2 0 1 8 , at the Taip ei Tian mu Baseball Stadium in Taip ei , Taiwan

Compressed by instruction ”Who won the 2018 CPBL Red vs White All-Star Game? ”: The 2 0 1 8 Chinese Professional Baseball League ( CP BL ) Red vs White All - Star Game was a special event held during the 2 9 th CP BL season . The main game concluded with the Three - Family Mart Brothers White Team defeat ing the Taiwan Cooperativ e Bank Red Team by a one - run difference

Figure 2: A visualisation of Instruction-Aware Contextual Compression. Deeper color indicates a stronger relevance to the instruction.

3.1 Model Architecture

Refer to caption
Figure 3: Model architecture of Instruction-Aware Contextual Compressor. We jointly optimize two objectives which enforce the model to extract contextual representation most relevant to the instruction.

We introduce Instruction-Aware Contextual Compressor as a trainable module to implement the Instruction-Aware Contextual Compression method. Instruction-Aware Contextual Compressor adopts an encoder-decoder architecture, consisting of both an encoder and a decoder, as illustrated in Figure 3.

The document encoder is a standard multi-layer transformer encoder and utilized to extract features from the input documents.

The decoder is a standard multi-layer transformer decoder, which equipped with two distinct functionalities, which can be toggled by modifying the masking:

  1. 1.

    The ranking decoder performs re-ranking based on the output features obtained from the document encoder. In this mode, interaction occurs between instruction features and document features within cross-attention layers. The decoder employs bidirectional self-attention layers without any masking.

  2. 2.

    The generation decoder, on the other hand, replaces the bidirectional self-attention layers in the decoder with causal self-attention layers. It uses a [BOS] token to denote the start of a sequence and an end-of-sequence token to signify its conclusion.

The model consists of 8 encoder and 8 decoder layers, which affect its depth and ability to capture complex patterns. The primary model dimension is 512 and it uses a feed-forward dimension of 1024 for its inner layers. The model has 6 attention heads, allowing it to focus on different aspects of the input data. The model has a total of 0.18 billion trainable parameters, which is significantly smaller than the current mainstream large language models like Llama  [30, 31], which have 7 billion, 13 billion, and 70 billion parameters, respectively. We initialize Instruction-Aware Contextual Compressor with the pre-trained weights of umT5[6], which has been pre-trained on a multilingual corpus, enabling it to handle multilingual tasks effectively.

The model’s architecture, in conjunction with its training objectives, empowers it to capture the intricate interplay between instructions and documents, facilitating the extraction of the most pertinent information from the documents.

3.2 Training Objectives

We jointly optimize two objectives during training, with one ranking-based objective and one generation based objective. Each instruction-document pair only requires one forward pass through the document encoder, and two forward passes through the decoder, where different functionalities are activated to compute the two losses as delineated below

Ranking Loss activates the ranking decoder. It aims to learn instruction-document representation that captures the fine-grained alignment between instruction and document. In ranking task, the model uses a ranking head (a linear layer) to predict a instruction-document matching score given their instruction feature and document feature. All positive samples are placed in the first position, and an additional 19 hard negative samples are retrieved by the retriever. Then, the model is trained as a 20-class classification task. This approach effectively boosts the scores of positive samples while suppressing the scores of negative samples. The specific formula for calculating the loss is as follows:

Lranking=ictilog(esijcesj)subscript𝐿𝑟𝑎𝑛𝑘𝑖𝑛𝑔superscriptsubscript𝑖𝑐subscript𝑡𝑖𝑙𝑜𝑔superscript𝑒subscript𝑠𝑖superscriptsubscript𝑗𝑐superscript𝑒subscript𝑠𝑗L_{ranking}=-\sum_{i}^{c}t_{i}log(\frac{e^{s_{i}}}{\sum_{j}^{c}{e^{s_{j}}}})italic_L start_POSTSUBSCRIPT italic_r italic_a italic_n italic_k italic_i italic_n italic_g end_POSTSUBSCRIPT = - ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_l italic_o italic_g ( divide start_ARG italic_e start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_ARG ) (1)

where sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the ranking score of i𝑖iitalic_i-th sample. tisubscript𝑡𝑖t_{i}italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the i𝑖iitalic_i-th target. c𝑐citalic_c represents the number of class.

The Language Modeling Loss activates the generation decoder, with the goal of generating useful response based on the provided context and instruction. It is optimized using a cross-entropy loss that trains the model to maximize the likelihood of the text in an autoregressive manner. We incorporate a label smoothing factor of 0.1 when calculating the loss. In comparison to the ranking loss, the Language Modeling loss equips the model with the ability to generalize for following instructions. This empowers the model to gain a deeper understanding of the potential correct context location and to effectively model fine-grained correlations. the language modeling loss can be written as:

Llm=t=1Nlogp(yt|y<t)subscript𝐿𝑙𝑚superscriptsubscript𝑡1𝑁𝑝conditionalsubscript𝑦𝑡subscript𝑦absent𝑡L_{lm}=-\sum_{t=1}^{N}\log{p(y_{t}|y_{<t})}italic_L start_POSTSUBSCRIPT italic_l italic_m end_POSTSUBSCRIPT = - ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log italic_p ( italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_y start_POSTSUBSCRIPT < italic_t end_POSTSUBSCRIPT ) (2)

where p(yt|y<t)𝑝conditionalsubscript𝑦𝑡subscript𝑦absent𝑡p(y_{t}|y_{<t})italic_p ( italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_y start_POSTSUBSCRIPT < italic_t end_POSTSUBSCRIPT ) denotes the output probability for the correct token ytsubscript𝑦𝑡y_{t}italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT given the previous context y<tsubscript𝑦absent𝑡y_{<t}italic_y start_POSTSUBSCRIPT < italic_t end_POSTSUBSCRIPT. N𝑁Nitalic_N denotes the sequence length.

3.3 Instruction-Aware Contextual Compression by Ranking

Instruction-Aware contextual compression by ranking is a fairly straightforward process. First, an appropriate text-splitting strategy is employed, which can involve splitting based on specific character or by length, effectively converting the document into multiple chunks. Next, the model’s ranking capability is applied to score chunks with instruction, followed by sorting them. Chunks are then retained based on a specified percentage. Throughout the compression process, care is taken to maintain the order of the chunks.

3.4 Instruction-Aware Contextual Compression by Generation

Compressing context by generation leverages the ability of Grad-CAM [27] to capture fine-grained relevance. The calculation process of Grad-CAM can be summarized as follows: Firstly, perform a forward pass to obtain the final classification probabilities. Calculate the gradients of the token with respect to the target class. Average the gradients for each token and extract an attention map within a specific cross-attention layer. Grad-CAM scores can be obtained by multiplying the attention map and the gradient weight vector. These Grad-CAM scores can be considered as the contribution of each token to the classification result.

There’s no need to pre-split the context; instead, the entire context is input into the model. Following that, k-step responses are generated based on the context and instruction. In the Instruction-Aware Contextual Compressor, attention maps and gradients are recorded for specific cross-attention layers, which are used to compute token-level Grad-CAM scores. These token-level Grad-CAM scores are then averaged to obtain chunk/sentence-level Grad-CAM scores. The chunks are subsequently sorted based on their Grad-CAM scores, and a specified percentage of the highest-scoring chunks is retained. Similarly, throughout the compression process, care is taken to maintain the order of the chunks.

3.5 Ensemble the two methods

Effectively ensemble these two methods of Instruction-Aware contextual compression can yield better compression results. The magnitude difference between the ranking score and Grad-CAM score makes it challenging to determine a suitable weighting parameter for fusion. Therefore, in the end, we opted for a non-parametric approach. Specifically, we individually rank the two types of information and then use the average of the two rankings to compress the context.

4 Experiments

4.1 Datasets

In this study, we utilized two types of datasets. The first are ranking datasets, designed to empower the model with robust re-ranking capabilities. The second are generation datasets, intended to equip the model with generative abilities. The ranking datasets comprise 15 million samples, while the generation datasets consist of 1.63 million samples.

4.1.1 Ranking datasets

𝐓𝟐superscript𝐓2\mathbf{T^{2}}bold_T start_POSTSUPERSCRIPT bold_2 end_POSTSUPERSCRIPTRanking [38] is a large-scale Chinese passage ranking dataset published in April 2023, which comprises 307K queries and 2.3M unique passages from real-world search engines. To constructing more accurate ranking algorithms, each query-passage pair has 4-level fine-grained annotations. For the retrieval task, it classifies Level-2 and Level-3 passages as relevant passages, while categorizing all remaining passages as irrelevant.

M3E Dataset [35] comprises a total of 22M sentence pair samples from a diverse range of topics, including Chinese encyclopedia, finance, healthcare, law, news, academia, etc. This dataset mainly consists of datasets used for other tasks, among which over 3M of data is instruction fine-tuning data, while some datasets comes from tasks such as Q&A, parallel semantics, machine reading comprehension, corpus, NL2SQL, text classification, text summarization, natural language processing, etc.

4.1.2 Generation Datasets

Dureader Dataset [9] is a extensive open-domain Chinese machine reading comprehension dataset, encompassing 200,000 questions, 420,000 answers, and 1 million documents. The questions and documents are sourced from Baidu Search and Baidu Zhidao, while the answers are manually crafted. In this study, we only use its robust subset, which contains 14,500 samples for training and 1.42k samples for validation.

WikiQA-LongForm Dataset is a long-form open-domain question answering dataset based on Wikipedia entries. We employed a heuristic approach and our proprietary NLU model to filter out lower-quality and sensitive or controversial entries, retaining 254,547 high-quality entries. These entries were transformed into multi-turn dialogue data using ChatGPT, resulting in high-quality Long Form QA data after further heuristic filtering. The WikiQA-LongForm Dataset is a contribution of this study and is publicly available at WikiQA-LongForm for use in other research projects.

4.2 Large Language Models

During our experimentation, we conducted tests using Instruction-Aware Contextual Compressor on ChatGPT, which is based on the GPT-3.5-turbo-0613 architecture. ChatGPT represents an Instruct-tuned language model that has undergone further enhancement through Reinforcement Learning from Human Feedback (RLHF) and boasts an impressive 175 billion parameters. The foundational language model of ChatGPT appears to be code-davinci-0022, and previously, davinci, as outlined in [5]. Our objective was to compare the performance of ChatGPT with and without the application of Instruction-Aware Contextual Compressor to gain insights into its impact on the model’s efficiency and accuracy. The settings of ChatGPT are all set to their default values, except for the top_p𝑡𝑜𝑝_𝑝top\_pitalic_t italic_o italic_p _ italic_p parameter, which is set to 0.1. This adjustment is made to reduce the impact of randomness on the evaluation results.

4.3 Experimental Settings

We conduct a comparative analysis to assess the effectiveness of Instruction-Aware Contextual Compressor and analyze the associated trade-offs.

RAG Baseline: The table 1 displays several RAG baselines, including scenarios where RAG is not used, direct feeding of the correct context to the large language model, the scenario where only a retriever is used in pipeline, and the scenario where a retriever is used in combination with Instruction-Aware Contextual Compressor for re-ranking.

Table 1: The LLM performance in different scenarios
Rouge-1 Rouge-2 Rouge-L Recall@1
Ground Truth 0.683 0.539 0.631 1.0
Recalled Top1 0.656 0.507 0.605 0.86
Recall + Rerank Top1 0.675 0.529 0.623 0.962

Compression Baseline: Our evaluation involves a comparison between Instruction-Aware Contextual Compressor and Selective Context [19], which employs a basic approach to filter out an equivalent amount of data based on self-information. Selective Context utilizes the GPT-2 model with 124 million parameters. Some readers may doubt whether the GPT-2 124M model is too small to be considered a sufficiently robust baseline for comparison. To this end, we used Baichuan-7B with 7 billion parameters to run the Selective Context, which has parameters 56 times larger than GPT-2. The results at a retention ratio of 0.5 are presented in the table below:

Table 2: Comparison of ROUGE-L scores for the Selective Context method using Baichuan-7B and GPT-2 models.
Model Parameters R (rouge-l) P (rouge-l) F (rouge-l)
Baichuan-7B 7B 0.5448 0.5419 0.5024
GPT-2 124M 0.5758 0.5506 0.5205

We found that Baichuan-7B did not perform better than GPT-2. This also indicates that the selective context method is not scalable. This therefore indicates that the baseline we used is sufficiently robust.

Retention Ratios: In our experiments, we explore various content retention ratios: 0.2, 0.35, 0.5, 0.65, and 0.8. These ratios determine the proportion of content to be retained. This exploration allows us to examine the trade-off between efficiency and performance as the amount of retained information varies.

Setting for Inference Measure: To measure the inference acceleration and memory savings brought about by context compression, we conducted practical measurements using an NVIDIA 4090 GPU with 24GB of VRAM. The LLM model used was Baichuan-7B [40], and the data format employed was bfloat16.

5 Results and Discussions

5.1 Comparison to Original Context

We initially compare the performance of Instruction-Aware Contextual Compressor with varying context retention ratios to the reranked original context, which utilizes the original context after reranking but no compression at all. All results are shown in table 3, and the ”diff” column represents the difference in performance compared to uncompressed text.

As shown in the table 3, at retention rate of 0.8, the performance loss is minimal, with Rouge-1 showing only a marginal decrease in the range of 0.003 to 0.008. This demonstrates a high level of consistency between answers provided in compressed contexts and those in original contexts. Surprisingly, the Rouge-2 and Rouge-L score with the generation method is even higher than the original text, which was unexpected. This indicates that our method successfully filtered unrelated or even noisy content, improving the LLM’s performance.

As the retention ratio decreases, the effectiveness of all methods declines, which is expected since there is less valuable information provided to the LLM. Overall, the generation method outperforms the ranking method, with a slower rate of performance decline from 0.8 to 0.35 compared to the ranking method. However, it was unexpected that at a retention ratio of 0.2, there was a sudden significant drop in performance, indicating a rapid loss of effectiveness for the generation method at low retention ratios.

In traditional machine learning, ensemble learning is widely regarded as a robust and effective method for improving model performance. Therefore, we propose using the ”average rank” to combine the generation method and the ranking method. This involves taking the average of the ranks assigned by the generation method and the ranking method, resulting in a new ranking score for a given text. Overall, the ”average rank” method outperforms both the generation and ranking methods. It shows significant improvement from 0.2 to 0.65 retention ratios, with fewer losses compared to the original context. At a retention ratio of 0.8, while it may not outperform the generation method, it still surpasses the ranking method.

Table 3: Comparing Instruction-Aware Contextual Compressor with different context retention ratio to the original context
Method Rouge-1 Rouge-2 Rouge-L Rouge-1 Diff Rouge-2 Diff Rouge-L Diff
origin 0.675 0.529 0.623 0 0 0
ranking-0.8 0.667 0.522 0.617 -0.008 -0.007 -0.006
ranking-0.65 0.643 0.493 0.594 -0.032 -0.036 -0.029
ranking-0.5 0.633 0.481 0.584 -0.042 -0.048 -0.039
ranking-0.35 0.608 0.454 0.56 -0.067 -0.075 -0.063
ranking-0.2 0.587 0.429 0.539 -0.088 -0.1 -0.084
generation-0.8 0.672 0.53 0.624 -0.003 0.001 0.001
generation-0.65 0.66 0.515 0.611 -0.015 -0.014 -0.012
generation-0.5 0.642 0.495 0.596 -0.033 -0.034 -0.027
generation-0.35 0.62 0.471 0.574 -0.055 -0.058 -0.049
generation-0.2 0.559 0.4 0.509 -0.116 -0.129 -0.114
ensembled-0.8 0.669 0.523 0.619 -0.006 -0.006 -0.004
ensembled-0.65 0.66 0.515 0.611 -0.015 -0.014 -0.012
ensembled-0.5 0.649 0.504 0.601 -0.026 -0.025 -0.022
ensembled-0.35 0.628 0.479 0.582 -0.047 -0.05 -0.041
ensembled-0.2 0.599 0.442 0.553 -0.076 -0.087 -0.07

5.2 Comparison to Baseline

In this section, we compare our method to Selective Context baseline, and the results are presented in Figure 4. Selective Context is a context compression method based on text self-information and represents the state-of-the-art as of the time of writing this paper. Comparing our method to Selective Context, which serves as a baseline, can effectively demonstrate the validity of our approach. As shown in Figure 4, our proposed method, Instruction-Aware Contextual Compressor, is even more effective compared to Selective Context. Both methods, whether purely ranking-based or generation-based, outperform Selective Context, and the lead becomes more significant as the retention ratio decreases. This indicates that our proposed method excels in selecting more informative content even when only a limited amount of information can be retained.

Refer to caption
Refer to caption
Refer to caption
Figure 4: Performance of Instruction-Aware Contextual Compression compared to the Selective Context baseline

5.3 The Impact of Generation Steps

For context compression using generative information, intuitively, if the number of generation steps is too few, it might not have generated a complete response. Consequently, the effectiveness at this stage may be suboptimal. As the number of generation steps increases, the effectiveness of compression is expected to improve. To explore this, we conducted experiments with a fixed retention ratio of 0.5, testing a series of generation step values ranging from 4 to 64. The results, as shown in Figure 5, indeed demonstrate that for generation-based context compression, the performance gradually improves with an increase in the number of steps. However, after reaching 32 steps, it reaches a plateau, indicating a diminishing marginal return with further increases in the number of generation steps.

Refer to caption
Figure 5: The Impact of Generation Steps on Context Compression Effectiveness

5.4 Speed Up and Memory Saving

We also measured the impact of context compression on the Large Language Model (LLM). As demonstrated in the table 4, when the retention ratio is set to 0.5, the inference speed per token increases by a factor of 2.2, while memory usage decreases by 5.05%.

Table 4: Speed up after context compression
Retention Ratio Ranking Generation Ensemble
0.8 1.38 1.00 0.94
0.65 1.73 1.17 1.08
0.5 2.05 1.31 1.20
0.35 2.41 1.45 1.32
0.2 2.81 1.58 1.43

5.5 Re-ranking Performance

The ability to filter out irrelevant documents through re-ranking is also a crucial capability of Instruction-Aware Contextual Compression. Following a hierarchical design approach, a large number of initially retrieved documents are first re-ordered using Instruction-Aware Contextual Compressor to select the top-k documents, which are then further compressed. Therefore, we present the retrieval performance of Instruction-Aware Contextual Compressor on extensive datasets, measured by Recall@1, 5, and 10, as shown in 5.

Table 5: Re-ranking Performance of Instruction-Aware Contextual Compressor
Dataset Recall@1 Recall@5 Recall@10
wikipedia-cn-20230720-dataset 0.975 0.998 0.999
wiki_atomic_edits 0.968 0.997 0.999
alpaca_gpt4 0.789 0.933 0.971
bq 0.718 0.833 0.903
firefly 0.632 0.849 0.947
webqa 0.667 0.913 0.974
dureader_dataset 0.913 0.988 0.996
cmrc2018 0.972 0.986 0.993
csl 0.846 0.961 0.986
pawsx 0.606 0.996 1.000
dureader_robust 0.627 0.859 0.921
T2superscriptT2\rm T^{2}roman_T start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPTRanking_train_dataset 0.405 0.756 0.901
tiracl 0.727 0.891 0.956
belle_2m 0.749 0.921 0.972
mlqa 0.718 0.908 0.960
lcqmc 0.435 0.876 0.960
hc3_chinese 0.270 0.751 0.897
zhihu_kol 0.238 0.557 0.787
xlsum 0.395 0.833 0.933
ocnli 0.371 0.876 0.961
chatmed_consult 0.594 0.803 0.899

6 Conclusion

In this paper, we introduced Instruction-Aware Contextual Compression to filter out less relevant content, providing a more concise and efficient context representation for LLMs, all without compromising their performance. An important discovery we made is that generation-based Instruction-Aware Contextual Compression is more effective than ranking-based Instruction-Aware Contextual Compression methods. With generation-based Instruction-Aware Contextual Compression, using only 1/10 of the data, the results can surpass those of the Instruction-Aware Contextual Compression method. With only 50% of the context retained, we achieved a 2.2x inference speedup for the LLM and saved 5% of GPU VRAM, while the Rouge-1 metric only dropped by 0.047. According to our evaluations, the results show that Instruction-Aware Contextual Compressor significantly improves the efficiency of LLMs and serves as a valuable component in the Retrieval-augmented Generation pipeline.

Data availability and access

The code and data used in this project are open-sourced and available at https://github.com/howard-hou/instruction-aware-contextual-compressor. The provided information is sufficient to reproduce the results of this paper. Additional data supporting the findings of this study can be obtained from the corresponding author upon reasonable request.

References

  • \bibcommenthead
  • Ainslie et al [2023] Ainslie J, Lei T, de Jong M, et al (2023) Colt5: Faster long-range transformers with conditional computation. ArXiv preprint abs/2303.09752. URL https://arxiv.org/abs/2303.09752
  • Arefeen et al [2023] Arefeen MA, Debnath B, Chakradhar S (2023) Leancontext: Cost-efficient domain-specific question answering using llms. ArXiv preprint abs/2309.00841. URL https://arxiv.org/abs/2309.00841
  • Beltagy et al [2020] Beltagy I, Peters ME, Cohan A (2020) Longformer: The long-document transformer. ArXiv preprint abs/2004.05150. URL https://arxiv.org/abs/2004.05150
  • Borgeaud et al [2022] Borgeaud S, Mensch A, Hoffmann J, et al (2022) Improving language models by retrieving from trillions of tokens. In: ICML
  • Brown et al [2020] Brown T, Mann B, Ryder N, et al (2020) Language models are few-shot learners. NeurIPS
  • Chung et al [2023] Chung HW, Constant N, García X, et al (2023) Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. ArXiv abs/2304.09151. URL https://api.semanticscholar.org/CorpusID:258187051
  • Gao et al [2021] Gao T, Fisch A, Chen D (2021) Making pre-trained language models better few-shot learners. In: Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021, Association for Computational Linguistics (ACL), pp 3816–3830
  • Guu et al [2020] Guu K, Lee K, Tung Z, et al (2020) REALM: Retrieval augmented language model pre-training. In: ICML
  • He et al [2017] He W, Liu K, Liu J, et al (2017) Dureader: a chinese machine reading comprehension dataset from real-world applications. ArXiv abs/1711.05073. URL https://api.semanticscholar.org/CorpusID:3662564
  • Honovich et al [2022] Honovich O, Shaham U, Bowman SR, et al (2022) Instruction induction: From few examples to natural language task descriptions. ArXiv preprint abs/2205.10782. URL https://arxiv.org/abs/2205.10782
  • Huang et al [2023] Huang J, Ping W, Xu P, et al (2023) Raven: In-context learning with retrieval augmented encoder-decoder language models. arXiv preprint arXiv:230807922
  • Izacard and Grave [2021] Izacard G, Grave É (2021) Leveraging passage retrieval with generative models for open domain question answering. In: EACL
  • Izacard et al [2022] Izacard G, Lewis P, Lomeli M, et al (2022) Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:220803299
  • Ji et al [2022] Ji Z, Lee N, Frieske R, et al (2022) Survey of hallucination in natural language generation. ACM Computing Surveys 55:1 – 38. URL https://api.semanticscholar.org/CorpusID:246652372
  • Jiang et al [2022] Jiang Z, Gao L, Araki J, et al (2022) Retrieval as attention: End-to-end learning of retrieval and reading within a single transformer. In: EMNLP
  • Karpukhin et al [2020] Karpukhin V, Oguz B, Min S, et al (2020) Dense passage retrieval for open-domain question answering. In: EMNLP
  • Khandelwal et al [2019] Khandelwal U, Levy O, Jurafsky D, et al (2019) Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:191100172
  • Lewis et al [2020] Lewis P, Perez E, Piktus A, et al (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS
  • Li et al [2023] Li Y, Dong B, Lin C, et al (2023) Compressing context to enhance inference efficiency of large language models. URL https://api.semanticscholar.org/CorpusID:263830231
  • Lin et al [2023] Lin SC, Asai A, Li M, et al (2023) How to train your dragon: Diverse augmentation towards generalizable dense retrieval. arXiv preprint arXiv:230207452
  • Liu et al [2022] Liu J, Shen D, Zhang Y, et al (2022) What makes good in-context examples for GPT-3? In: Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. Association for Computational Linguistics, Dublin, Ireland and Online, pp 100–114, 10.18653/v1/2022.deelio-1.10, URL https://aclanthology.org/2022.deelio-1.10
  • Lu et al [2022] Lu Y, Bartolo M, Moore A, et al (2022) Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, pp 8086–8098, 10.18653/v1/2022.acl-long.556, URL https://aclanthology.org/2022.acl-long.556
  • Mu et al [2023] Mu J, Li XL, Goodman N (2023) Learning to compress prompts with gist tokens. ArXiv preprint abs/2304.08467. URL https://arxiv.org/abs/2304.08467
  • Nakano et al [2021] Nakano R, Hilton J, Balaji S, et al (2021) WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:211209332
  • Oppenlaender [2022] Oppenlaender J (2022) Prompt engineering for text-based generative art. arXiv preprint arXiv:220413988 abs/2204.13988. URL https://arxiv.org/abs/2204.13988
  • Ratner et al [2023] Ratner N, Levine Y, Belinkov Y, et al (2023) Parallel context windows for large language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 6383–6402
  • Selvaraju et al [2016] Selvaraju RR, Das A, Vedantam R, et al (2016) Grad-cam: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision 128:336–359. URL https://api.semanticscholar.org/CorpusID:15019293
  • Shi et al [2023] Shi W, Min S, Yasunaga M, et al (2023) RePlug: Retrieval-augmented black-box language models. arXiv preprint arXiv:230112652
  • Shuster et al [2021] Shuster K, Poff S, Chen M, et al (2021) Retrieval augmentation reduces hallucination in conversation. In: Conference on Empirical Methods in Natural Language Processing, URL https://api.semanticscholar.org/CorpusID:233240939
  • Touvron et al [2023a] Touvron H, Lavril T, Izacard G, et al (2023a) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971
  • Touvron et al [2023b] Touvron H, Martin L, Stone K, et al (2023b) Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:230709288
  • Wang et al [2023a] Wang B, Ping W, Xu P, et al (2023a) Shall we pretrain autoregressive language models with retrieval? a comprehensive study. arXiv preprint arXiv:230406762
  • Wang et al [2023b] Wang J, Liu Z, Zhao L, et al (2023b) Review of large vision models and visual prompt engineering. arXiv preprint arXiv:230700855 abs/2307.00855. URL https://arxiv.org/abs/2307.00855
  • Wang et al [2022] Wang L, Yang N, Huang X, et al (2022) Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:221203533
  • Wang Yuxin [2023] Wang Yuxin HsSun Qingxuan (2023) M3e: Moka massive mixed embedding model
  • Wei et al [2022] Wei J, Wang X, Schuurmans D, et al (2022) Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of Neural Information Processing Systems 35:24824–24837
  • White et al [2023] White J, Fu Q, Hays S, et al (2023) A prompt pattern catalog to enhance prompt engineering with chatgpt. ArXiv preprint abs/2302.11382. URL https://arxiv.org/abs/2302.11382
  • Xie et al [2023] Xie X, Dong Q, Wang B, et al (2023) T2ranking: A large-scale chinese benchmark for passage ranking. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval URL https://api.semanticscholar.org/CorpusID:258041275
  • Xu et al [2023] Xu P, Ping W, Wu X, et al (2023) Retrieval meets long context large language models. URL https://api.semanticscholar.org/CorpusID:263620134
  • Yang et al [2023] Yang AM, Xiao B, Wang B, et al (2023) Baichuan 2: Open large-scale language models. ArXiv abs/2309.10305. URL https://api.semanticscholar.org/CorpusID:261951743
  • Yogatama et al [2021] Yogatama D, de Masson d’Autume C, Kong L (2021) Adaptive semiparametric language models. Transactions of the Association for Computational Linguistics 9:362–373
  • Zhou et al [2022] Zhou Y, Muresanu AI, Han Z, et al (2022) Large language models are human-level prompt engineers. In: The Eleventh International Conference on Learning Representations

Acknowledgements

This project was supported by the Industrial Metaverse project of Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), with the project number 000015; it was also supported by the Guangming District Government GPT Service project, with the project number 23210016.

Author information

Haowen Hou and Fei Ma are contributed equally to this work.

Authors and Affiliations

Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), Yutang, Shenzhen, 518000, Guangdong, China.

Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu & Fei Yu

Contributions

All authors have contributed to the ideas and design of this research. Haowen Hou was in charge of developing and training the model, and also wrote this manuscript. Fei Ma carried out the experiments and the analysis of the outcomes. Binwen Bai and Xinxin Zhu managed the data collection, arrangement, cleansing, and its ultimate transformation into the training format. Fei Yu offered feedback on the preliminary drafts of the manuscript. All authors have reviewed and approved the final manuscript.

Corresponding author

Correspondence to Haowen Hou.

Ethics declarations

Competing interests

The authors confirm that there are no financial conflicts or personal connections that might be perceived as affecting the findings presented in this study.

Ethical and informed consent for data used

There are no concerns regarding the ethical use of the data or the informed consent process. All necessary ethical guidelines and protocols have been followed, and informed consent was obtained from all participants involved in the study.