Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Jiawen Xie12, Pengyu Cheng1, Xiao Liang13, Yong Dai1, Nan Du1
1Tencent AI Lab
2Shanghai Jiao Tong University
3Tsinghua University
Work done during an internship at Tencent AI Lab.Corresponding author.
Abstract

Although dominant in natural language processing, transformer-based models still struggle with long-sequence processing, due to the computational costs of their self-attention operations, which increase exponentially as the length of the input sequence grows. To address this challenge, we propose a Simple framework to enhance the long-content processing of off-the-shelf pre-trained transformers via three steps: Chunk, Align, and Select (SimCAS). More specifically, we first divide each long-sequence input into a batch of chunks, then align the inter-chunk information during the encoding steps, and finally, select the most representative hidden states from the encoder for the decoding process. With our SimCAS, the computation and memory costs can be reduced to linear complexity. In experiments, we demonstrate the effectiveness of the proposed method on various real-world long-text summarization and reading comprehension tasks, in which SimCAS significantly outperforms prior long-sequence processing baselines. The code is at https://github.com/xjw-nlp/SimCAS.

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers


Jiawen Xie12thanks: Work done during an internship at Tencent AI Lab., Pengyu Cheng1thanks: Corresponding author., Xiao Liang13, Yong Dai1, Nan Du1 1Tencent AI Lab 2Shanghai Jiao Tong University 3Tsinghua University


1 Introduction

Transformers (Vaswani et al., 2017) have become a fundamental model architecture for sequential data modeling (Sun et al., 2019a; Dosovitskiy et al., 2021; Wysocki et al., 2023), especially in Natural Language Processing (NLP) (Devlin et al., 2018; Brown et al., 2020), where texts are regarded as sequences of tokens. Built with transformer blocks, pre-trained language models (PLMs) have recently shown astonishing empirical performance in various NLP tasks such as question answering Yang et al. (2019), controllable generation (Byrne et al., 2021; Cheng and Li, 2022), summarization (Rush et al., 2015; Nallapati et al., 2016) and logic reasoning (Wei et al., 2022; Cheng et al., 2024a). However, one fatal weakness, that has hindered transformer-based models from being applied in broader application scenarios, is the quadratically raised computational consumption of self-attention operations when the input length increases. Hence, vanilla transformers have continuously been challenged by long-context tasks, such as machine reading comprehension (Kwiatkowski et al., 2019; Gong et al., 2020; Pang et al., 2022) and long-text summarization (Huang et al., 2021; Ma et al., 2022).

To enhance transformers with more efficient long-sequence processing, prior works focus on two perspectives, efficient attention operations (Beltagy et al., 2020; Zaheer et al., 2020; Choromanski et al., 2020) and sub-sequence processing (Liu et al., 2022). Efficient attention targets on reducing the memory and calculation cost of self-attention operations while preserving transformers’ empirical performance on downstream tasks. Unfortunately, most efficient attention methods require customized self-attention implementations, which demand from-scratch training instead of being directly plugged into existing pre-trained models. Moreover, some empirical studies have demonstrated that efficient-attention methods inevitably sacrifice the short-sequence processing performance compared with full-attention models (Phang et al., 2022).

The alternative solution to long-sequence processing is to decompose the long-sequence input into multiple sub-sequences and then feed-forward each separately, known as sub-sequence processing (Moro et al., 2022). Although full-attention blocks are sufficiently utilized in each sub-sequence, these methods have been challenged to capture the semantic information across different sub-contexts. To solve this, some works assign the same fragment to different chunks (Ivgi et al., 2023), which however significantly increases the computational cost of each chunk and runs counter to the original efficiency goal.

To gather the advantages of the methods above, we introduce a Simple yet effective learning framework with three typical operations: Chunk, Align, and Select (SimCAS). In detail, SimCAS first chunks the input sequence into a batch of sub-sequences then feeds each subsequence forward the elaborately designed encoding blocks with an inter-chunk alignment mechanism. Finally, the most semantically representative hidden representations are selected via a tailored selection policy to compress the overall sequence length. To align the semantic information across sub-sequence chunks, we introduce a sequential batch alignment operation to calibrate the start and end token embeddings of each sub-sequence in the encoder layers. To learn the selector policy, inspired by the recent success of reinforcement learning in NLP (Ouyang et al., 2022), we adopt the Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm with the decoder treated as the environment. Moreover, we use the attention scores and output likelihood to calculate effective rewards for the selector’s policy optimization. To verify the effectiveness of SimCA, we conducted comprehensive experiments on seven long-context datasets from the domains of document-level summarization, multi-document summarization, and reading comprehension. The empirical results show that SimCAS outperforms other long-sequence processing baselines with high-level scalability. The main contributions of this paper are:

  • We propose a simple yet effective long-sequence processing framework, which noticeably extends the application range of existing full-attention PLMs. Unlike prior works compromising the performance on short sequences, our method maintains satisfying performances processing either short or long contexts.

  • We discover that transformers can be conceptualized as simulation environments for policy learning. We leverage transformers’ attention scores and output likelihoods to optimize the selector policy to compress the long-sequence information. The optimized selector policy meanwhile facilitates the transformer to concentrate more on hidden states with high semantic importance.

  • We conduct comprehensive experiments illustrating that SimCAS consistently surpasses previous baselines. Furthermore, we provide model scaling, efficiency comparisons, and ablation studies to substantiate the superior performance of our proposed method.

2 Background

Language Modeling

The training objective for sequence generation consists of a sequence of token decisions made in an auto-regressive manner. This is formulated as a product of decision probabilities corresponding to specific tokens. Given an input sequence 𝑿=(𝒙1,𝒙2,,𝒙N)𝑿subscript𝒙1subscript𝒙2subscript𝒙𝑁{\bm{X}}=({\bm{x}}_{1},{\bm{x}}_{2},\cdots,{\bm{x}}_{N})bold_italic_X = ( bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) and its corresponding output 𝒀=(𝒚1,𝒚2,,𝒚M)𝒀subscript𝒚1subscript𝒚2subscript𝒚𝑀{\bm{Y}}=({\bm{y}}_{1},{\bm{y}}_{2},\cdots,{\bm{y}}_{M})bold_italic_Y = ( bold_italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_italic_y start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ), we model the following conditional probability:

pϕ(𝒀|𝑿)=m=1Mpϕ(𝒚m|𝒀<m,𝑿),subscript𝑝italic-ϕconditional𝒀𝑿superscriptsubscriptproduct𝑚1𝑀subscript𝑝italic-ϕconditionalsubscript𝒚𝑚subscript𝒀absent𝑚𝑿p_{\phi}({\bm{Y}}|{\bm{X}})=\prod_{m=1}^{M}p_{\phi}({\bm{y}}_{m}|{\bm{Y}}_{<m}% ,{\bm{X}}),italic_p start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( bold_italic_Y | bold_italic_X ) = ∏ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( bold_italic_y start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | bold_italic_Y start_POSTSUBSCRIPT < italic_m end_POSTSUBSCRIPT , bold_italic_X ) , (1)

where 𝒀<m=(𝒚1,𝒚2,,𝒚m1)subscript𝒀absent𝑚subscript𝒚1subscript𝒚2subscript𝒚𝑚1{\bm{Y}}_{<m}=({\bm{y}}_{1},{\bm{y}}_{2},\dots,{\bm{y}}_{m-1})bold_italic_Y start_POSTSUBSCRIPT < italic_m end_POSTSUBSCRIPT = ( bold_italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , bold_italic_y start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT ), and ϕitalic-ϕ\phiitalic_ϕ represents the model parameters.

Proximal Policy Optimization

In the domain of reinforcement learning (RL), Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a widely used policy gradient method (Kakade, 2001) for its remarkable performance and efficiency in solving complex control and decision-making tasks (Vinyals et al., 2019; Akkaya et al., 2019; Cheng et al., 2024b). The vanilla policy gradient estimator has the form: θ𝔼πθ(𝒂t|𝒔t)[Atπ(𝒂t,𝒔t)]𝔼^t[θlogπθ(𝒂t|𝒔t)A^t]subscript𝜃subscript𝔼subscript𝜋𝜃conditionalsubscript𝒂𝑡subscript𝒔𝑡delimited-[]subscriptsuperscript𝐴𝜋𝑡subscript𝒂𝑡subscript𝒔𝑡subscript^𝔼𝑡delimited-[]subscript𝜃subscript𝜋𝜃conditionalsubscript𝒂𝑡subscript𝒔𝑡subscript^𝐴𝑡\nabla_{\theta}\mathbb{E}_{\pi_{\theta}({\bm{a}}_{t}|{\bm{s}}_{t})}[A^{\pi}_{t% }({\bm{a}}_{t},{\bm{s}}_{t})]\approx\hat{\mathbb{E}}_{t}[\nabla_{\theta}\log% \pi_{\theta}({\bm{a}}_{t}|{\bm{s}}_{t})\hat{A}_{t}]∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT [ italic_A start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ] ≈ over^ start_ARG blackboard_E end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_log italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) over^ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ], where 𝒔t𝒮subscript𝒔𝑡𝒮{\bm{s}}_{t}\in{\mathcal{S}}bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ caligraphic_S is the state at t𝑡titalic_t-step, πθ(𝒂t|𝒔t)subscript𝜋𝜃conditionalsubscript𝒂𝑡subscript𝒔𝑡\pi_{\theta}({\bm{a}}_{t}|{\bm{s}}_{t})italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is a stochastic policy acting 𝒂tsubscript𝒂𝑡{\bm{a}}_{t}bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT at 𝒔tsubscript𝒔𝑡{\bm{s}}_{t}bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, A^tsubscript^𝐴𝑡\hat{A}_{t}over^ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the estimated value of the advantage function Atπ(𝒂t,𝒔t)superscriptsubscript𝐴𝑡𝜋subscript𝒂𝑡subscript𝒔𝑡A_{t}^{\pi}({\bm{a}}_{t},{\bm{s}}_{t})italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), and 𝔼^tsubscript^𝔼𝑡\hat{\mathbb{E}}_{t}over^ start_ARG blackboard_E end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT denotes the empirical average over a sample batch. The PPO algorithm improves the training stability of the policy gradient, by optimizing the following objective:

L=𝔼^t[min(rt(θ)A^t,clip(rt(θ),1ε,1+ε)A^t)],𝐿subscript^𝔼𝑡delimited-[]subscript𝑟𝑡𝜃subscript^𝐴𝑡clipsubscript𝑟𝑡𝜃1𝜀1𝜀subscript^𝐴𝑡L=\hat{\mathbb{E}}_{t}[\min(r_{t}(\theta)\hat{A}_{t},\text{clip}(r_{t}(\theta)% ,1-\varepsilon,1+\varepsilon)\hat{A}_{t})],italic_L = over^ start_ARG blackboard_E end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ roman_min ( italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ ) over^ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , clip ( italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ ) , 1 - italic_ε , 1 + italic_ε ) over^ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ] , (2)

where rt(θ)=πθ(𝒂t|𝒔t)πθold(𝒂t|𝒔t)subscript𝑟𝑡𝜃subscript𝜋𝜃conditionalsubscript𝒂𝑡subscript𝒔𝑡subscript𝜋subscript𝜃oldconditionalsubscript𝒂𝑡subscript𝒔𝑡r_{t}(\theta)=\frac{\pi_{\theta}({\bm{a}}_{t}|{\bm{s}}_{t})}{\pi_{\theta_{% \text{old}}}({\bm{a}}_{t}|{\bm{s}}_{t})}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ ) = divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT old end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG is the probability ratio between new and old policies, and ε>0𝜀0\varepsilon>0italic_ε > 0 is a hyper-parameter for clipping.

Refer to caption
Figure 1: The SimCAS framework: The long inputs are first divided into a batch of chunks, each of which is filled with start token [S], padding token [P] and end token [E]. Then the inter-chunk information can be transferred via the alignment of [S] and [E] representations after each encoder layer. Next, hidden states are selected for decoding steps. The decoder output logits and attention scores are utilized as rewards for updating the token selector.

3 Methodology

Given a long-input text 𝑿=(𝒙1,𝒙2,,𝒙N)𝑿subscript𝒙1subscript𝒙2subscript𝒙𝑁{\bm{X}}=({\bm{x}}_{1},{\bm{x}}_{2},\dots,{\bm{x}}_{N})bold_italic_X = ( bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) with a fairly large input length N𝑁Nitalic_N, we aim to design a model pϕ(𝒀|𝑿)subscript𝑝italic-ϕconditional𝒀𝑿p_{\phi}({\bm{Y}}|{\bm{X}})italic_p start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( bold_italic_Y | bold_italic_X ) to predict a corresponding label sequence 𝒀=(𝒚1,𝒚2,,𝒚M)𝒀subscript𝒚1subscript𝒚2subscript𝒚𝑀{\bm{Y}}=({\bm{y}}_{1},{\bm{y}}_{2},\dots,{\bm{y}}_{M})bold_italic_Y = ( bold_italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , bold_italic_y start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ), where 𝒀𝒀{\bm{Y}}bold_italic_Y can either be classification labels or output sequence tokens. The major difficulty of the task is that the input length N𝑁Nitalic_N is so large that the original self-attention operations become infeasible with the quadratic complexity 𝒪(N2)𝒪superscript𝑁2{\mathcal{O}}(N^{2})caligraphic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ).

To address the challenge, we propose a novel method that intuitively splits the long inputs into chunks with feasible lengths, then selects the most representative tokens for decoding steps. To guarantee inter-chunk semantic information extracted during encoding, we design an aligning scheme in the encoder blocks. In the following, we will describe the chunking scheme, the aligning strategy, and the selector design in detail.

3.1 Chunking Scheme

Assume the maximum input sequence length of a pre-trained transformer model is S𝑆Sitalic_S. We first split the long input sequence into B=NS𝐵𝑁𝑆B=\lceil\frac{N}{S}\rceilitalic_B = ⌈ divide start_ARG italic_N end_ARG start_ARG italic_S end_ARG ⌉ chunks with each chunk length equal to S𝑆Sitalic_S. To simplify the expression, we assume that a sentence is in only one chunk. In the practical experiment, we use the sentence-level segmentation (Moro and Ragazzi, 2022) to acquire a sequence of sentences, and then employ the greedy method to allocate sequentially these sentences to the chunks. If there is not enough space in the chunk for the current sentence, we will use the padding tokens to fill the chunk. Mathematically, these divided segments can be represented as:

{(𝒄1k,𝒄2k,,𝒄Sk)}k=1B,superscriptsubscriptsubscriptsuperscript𝒄𝑘1subscriptsuperscript𝒄𝑘2subscriptsuperscript𝒄𝑘𝑆𝑘1𝐵\big{\{}({\bm{c}}^{k}_{1},{\bm{c}}^{k}_{2},\dots,{\bm{c}}^{k}_{S})\big{\}}_{k=% 1}^{B},{ ( bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT , (3)

where 𝒄ik=𝒙(k1)S+isubscriptsuperscript𝒄𝑘𝑖subscript𝒙𝑘1𝑆𝑖{\bm{c}}^{k}_{i}={\bm{x}}_{(k-1)S+i}bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = bold_italic_x start_POSTSUBSCRIPT ( italic_k - 1 ) italic_S + italic_i end_POSTSUBSCRIPT, and \lceil\cdot\rceil⌈ ⋅ ⌉ is the ceiling function. Since the chunk might not have S𝑆Sitalic_S tokens from 𝑿𝑿{\bm{X}}bold_italic_X, we append special padding tokens at the end (see Figure 1). After chunking, we add the start token ([S]) and end token ([E]) to each chunk, and treat the chunks as a normal transformer input batch 𝒞𝒞{\mathcal{C}}caligraphic_C with batch size B𝐵Bitalic_B:

𝒞={([S],𝒄1k,𝒄2k,𝒄Sk,[E])}k=1B.𝒞superscriptsubscript[S]subscriptsuperscript𝒄𝑘1subscriptsuperscript𝒄𝑘2subscriptsuperscript𝒄𝑘𝑆[E]𝑘1𝐵{\mathcal{C}}=\big{\{}(\texttt{[S]},{\bm{c}}^{k}_{1},{\bm{c}}^{k}_{2}\dots,{% \bm{c}}^{k}_{S},\texttt{[E]})\big{\}}_{k=1}^{B}.caligraphic_C = { ( [S] , bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT … , bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT , [E] ) } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT . (4)

3.2 Sequential Batch Alignment

After chunking the input text into a standard token batch, we can encode it with the transformer encoder layers. As in Figure 1, we assume the encoder has N1subscript𝑁1N_{1}italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT layers. Denote the hidden representations of k𝑘kitalic_k-th chunk in 𝒞𝒞{\mathcal{C}}caligraphic_C (in equation 4) at l𝑙litalic_l-th encoder layer as 𝑯k,l=(𝒉0k,l,𝒉1k,l,,𝒉Sk,l,𝒉S+1k,l)superscript𝑯𝑘𝑙subscriptsuperscript𝒉𝑘𝑙0subscriptsuperscript𝒉𝑘𝑙1subscriptsuperscript𝒉𝑘𝑙𝑆subscriptsuperscript𝒉𝑘𝑙𝑆1{\bm{H}}^{k,l}=({\bm{h}}^{k,l}_{0},{\bm{h}}^{k,l}_{1},\dots,{\bm{h}}^{k,l}_{S}% ,{\bm{h}}^{k,l}_{S+1})bold_italic_H start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT = ( bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT , bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S + 1 end_POSTSUBSCRIPT ), where 𝒉0k,lsubscriptsuperscript𝒉𝑘𝑙0{\bm{h}}^{k,l}_{0}bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and 𝒉S+1k,lsubscriptsuperscript𝒉𝑘𝑙𝑆1{\bm{h}}^{k,l}_{S+1}bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S + 1 end_POSTSUBSCRIPT are hidden representation of [S] and [E] tokens respectively, and 𝒉ik,lsubscriptsuperscript𝒉𝑘𝑙𝑖{\bm{h}}^{k,l}_{i}bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the embedding for 𝒄iksubscriptsuperscript𝒄𝑘𝑖{\bm{c}}^{k}_{i}bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with 1iS1𝑖𝑆1\leq i\leq S1 ≤ italic_i ≤ italic_S.

As mentioned in Section 1, chunking methods make it difficult to capture the inter-chunk semantic information. Inspired by recent work using special tokens for full sequence information (Mohtashami and Jaggi, 2023), we align the information of [S] and [E] tokens at each encoding block. More specifically, at l𝑙litalic_l-th layer, our batch alignment average the hidden states of [S] and [E] of all chunks:

𝒉¯BOSl=1Bk=1B𝒉0k,l,𝒉¯EOSl=1Bk=1B𝒉S+1k,l.formulae-sequencesubscriptsuperscript¯𝒉𝑙BOS1𝐵superscriptsubscript𝑘1𝐵subscriptsuperscript𝒉𝑘𝑙0subscriptsuperscript¯𝒉𝑙EOS1𝐵superscriptsubscript𝑘1𝐵subscriptsuperscript𝒉𝑘𝑙𝑆1\bar{{\bm{h}}}^{l}_{\text{BOS}}=\frac{1}{B}\sum_{k=1}^{B}{\bm{h}}^{k,l}_{0},\ % \ \bar{{\bm{h}}}^{l}_{\text{EOS}}=\frac{1}{B}\sum_{k=1}^{B}{\bm{h}}^{k,l}_{S+1}.over¯ start_ARG bold_italic_h end_ARG start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT BOS end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_B end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , over¯ start_ARG bold_italic_h end_ARG start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT EOS end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_B end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S + 1 end_POSTSUBSCRIPT . (5)

Then we replace 𝒉0k,lsubscriptsuperscript𝒉𝑘𝑙0{\bm{h}}^{k,l}_{0}bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and 𝒉S+1k,lsubscriptsuperscript𝒉𝑘𝑙𝑆1{\bm{h}}^{k,l}_{S+1}bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S + 1 end_POSTSUBSCRIPT with the aligned 𝒉¯BOSlsubscriptsuperscript¯𝒉𝑙BOS\bar{{\bm{h}}}^{l}_{\text{BOS}}over¯ start_ARG bold_italic_h end_ARG start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT BOS end_POSTSUBSCRIPT and 𝒉¯EOSlsubscriptsuperscript¯𝒉𝑙EOS\bar{{\bm{h}}}^{l}_{\text{EOS}}over¯ start_ARG bold_italic_h end_ARG start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT EOS end_POSTSUBSCRIPT into the hidden states for the next-layer encoding block, as shown in Figure 1.

3.3 Token Selector

After being encoded into the last hidden space with our sequential batch aligning scheme, the chunks should be reformed back to a sequence for the next decoding steps. Directly tiling the chunks’ representations back into a sequence is still infeasible because the overlong sequence makes it difficult to fuse information at the decoding stage. Therefore, we propose a token selector to select the most representative hidden representations for decoding steps. Inspired by Ramamurthy et al. (2023), we design the selector from the perspective of reinforcement learning (RL).

Selection Module Design

Formally, the token selector takes the last hidden states 𝑯L={𝒉0k,L,𝒉1k,L,,𝒉Sk,L,𝒉S+1k,L}k=1Bsuperscript𝑯𝐿superscriptsubscriptsubscriptsuperscript𝒉𝑘𝐿0subscriptsuperscript𝒉𝑘𝐿1subscriptsuperscript𝒉𝑘𝐿𝑆subscriptsuperscript𝒉𝑘𝐿𝑆1𝑘1𝐵{\bm{H}}^{L}=\{{\bm{h}}^{k,L}_{0},{\bm{h}}^{k,L}_{1},\dots,{\bm{h}}^{k,L}_{S},% {\bm{h}}^{k,L}_{S+1}\}_{k=1}^{B}bold_italic_H start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT = { bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT , bold_italic_h start_POSTSUPERSCRIPT italic_k , italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S + 1 end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT and selection actions 𝑨t=(𝒂1,𝒂2,,𝒂t1)subscript𝑨𝑡subscript𝒂1subscript𝒂2subscript𝒂𝑡1{\bm{A}}_{t}=({\bm{a}}_{1},{\bm{a}}_{2},\dots,{\bm{a}}_{t-1})bold_italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ( bold_italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , bold_italic_a start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) as inputs and predicts the next selection action 𝒂tsubscript𝒂𝑡{\bm{a}}_{t}bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, where each action 𝒂tsubscript𝒂𝑡{\bm{a}}_{t}bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT has two values "select" and "skip" for operating the t𝑡titalic_t-th token 𝒙tsubscript𝒙𝑡{\bm{x}}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in 𝑿𝑿{\bm{X}}bold_italic_X. We set the state of RL as 𝒔t=(𝑯L,𝑨t)subscript𝒔𝑡superscript𝑯𝐿subscript𝑨𝑡{\bm{s}}_{t}=({\bm{H}}^{L},{\bm{A}}_{t})bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ( bold_italic_H start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT , bold_italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), then the selector is a policy πθ(𝒂t|𝒔t)subscript𝜋𝜃conditionalsubscript𝒂𝑡subscript𝒔𝑡\pi_{\theta}({\bm{a}}_{t}|{\bm{s}}_{t})italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) to predict next action 𝒂tsubscript𝒂𝑡{\bm{a}}_{t}bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT.

We implement the selector with the actor-critic style (Konda and Tsitsiklis, 1999). Both actor and critic are simple feed-forward networks, but the actor outputs a probability distribution over the action space and the critic outputs a single scalar value. At state 𝒔tsubscript𝒔𝑡{\bm{s}}_{t}bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, to consider the sequential effect of previous action 𝑨tsubscript𝑨𝑡{\bm{A}}_{t}bold_italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, we first take the average of all selected hidden states as

𝒉¯t=k,i𝕀{𝒂j="select",j<t}𝒉ik,Lk,i𝕀{𝒂j="select",j<t},subscript¯𝒉𝑡subscript𝑘𝑖subscript𝕀formulae-sequencesubscript𝒂𝑗"select"𝑗𝑡superscriptsubscript𝒉𝑖𝑘𝐿subscript𝑘𝑖subscript𝕀formulae-sequencesubscript𝒂𝑗"select"𝑗𝑡\bar{{\bm{h}}}_{t}=\frac{\sum_{k,i}\mathbb{I}_{\{{\bm{a}}_{j}=\textit{"select"% },j<t\}}{\bm{h}}_{i}^{k,L}}{\sum_{k,i}\mathbb{I}_{\{{\bm{a}}_{j}=\textit{"% select"},j<t\}}},over¯ start_ARG bold_italic_h end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_k , italic_i end_POSTSUBSCRIPT blackboard_I start_POSTSUBSCRIPT { bold_italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = "select" , italic_j < italic_t } end_POSTSUBSCRIPT bold_italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k , italic_L end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_k , italic_i end_POSTSUBSCRIPT blackboard_I start_POSTSUBSCRIPT { bold_italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = "select" , italic_j < italic_t } end_POSTSUBSCRIPT end_ARG , (6)

where j=(k1)S+i𝑗𝑘1𝑆𝑖j=(k-1)S+iitalic_j = ( italic_k - 1 ) italic_S + italic_i maps chunk indices back to the input sequence, and 𝕀{}subscript𝕀\mathbb{I}_{\{\cdot\}}blackboard_I start_POSTSUBSCRIPT { ⋅ } end_POSTSUBSCRIPT is the indicator function. Then we concatenate the current selector state 𝒉¯tsubscript¯𝒉𝑡\bar{{\bm{h}}}_{t}over¯ start_ARG bold_italic_h end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and token information 𝒉ik,Lsuperscriptsubscript𝒉𝑖𝑘𝐿{\bm{h}}_{i}^{k,L}bold_italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k , italic_L end_POSTSUPERSCRIPT, t=(k1)S+i𝑡𝑘1𝑆𝑖t=(k-1)S+iitalic_t = ( italic_k - 1 ) italic_S + italic_i, to predict next action 𝒂tsubscript𝒂𝑡{\bm{a}}_{t}bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT via the actor:

πθ(𝒂t|𝒔t)=actor(𝒉¯t,𝒉ik,L).subscript𝜋𝜃conditionalsubscript𝒂𝑡subscript𝒔𝑡actorsubscript¯𝒉𝑡superscriptsubscript𝒉𝑖𝑘𝐿\pi_{\theta}({\bm{a}}_{t}|{\bm{s}}_{t})=\text{actor}(\bar{{\bm{h}}}_{t},{\bm{h% }}_{i}^{k,L}).italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = actor ( over¯ start_ARG bold_italic_h end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k , italic_L end_POSTSUPERSCRIPT ) . (7)

Reward Design

To train the selector within an RL scheme, we treat the transformer as an environment and design action rewards. Inspired by Yuan et al. (2021), we can directly utilize the language modeling likelihood as the generation quality reward for selection actions:

RLM=ξexp{1Mlogpϕ(𝒀|𝑿)},subscript𝑅LM𝜉1𝑀subscript𝑝italic-ϕconditional𝒀𝑿R_{\text{LM}}=\xi\exp\{\frac{1}{M}\log p_{\phi}({\bm{Y}}|{\bm{X}})\},italic_R start_POSTSUBSCRIPT LM end_POSTSUBSCRIPT = italic_ξ roman_exp { divide start_ARG 1 end_ARG start_ARG italic_M end_ARG roman_log italic_p start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( bold_italic_Y | bold_italic_X ) } , (8)

where ξ𝜉\xiitalic_ξ is a coefficient that magnifies the value of the reward for easier optimization of the selector. However, RLMsubscript𝑅LMR_{\text{LM}}italic_R start_POSTSUBSCRIPT LM end_POSTSUBSCRIPT is only a scalar value, which cannot provide fine-grained guidance to the selector. Therefore, we use the input-output cross-attention scores to calibrate RLMsubscript𝑅LMR_{\text{LM}}italic_R start_POSTSUBSCRIPT LM end_POSTSUBSCRIPT. More specifically, we denote the cross-attention matrix of the q𝑞qitalic_q-th attention head in the l𝑙litalic_l-th layer of the decoder as 𝑨qlM×Nsubscriptsuperscript𝑨𝑙𝑞superscript𝑀𝑁{\bm{A}}^{l}_{q}\in\mathbb{R}^{M\times N}bold_italic_A start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_M × italic_N end_POSTSUPERSCRIPT, and the overall cross-attention

𝑨¯=1N2Ql=1N2q=1Q𝑨ql,¯𝑨1subscript𝑁2𝑄superscriptsubscript𝑙1subscript𝑁2superscriptsubscript𝑞1𝑄subscriptsuperscript𝑨𝑙𝑞\bar{{\bm{A}}}=\frac{1}{N_{2}\cdot Q}\sum_{l=1}^{N_{2}}\sum_{q=1}^{Q}{\bm{A}}^% {l}_{q},over¯ start_ARG bold_italic_A end_ARG = divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ italic_Q end_ARG ∑ start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_q = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT bold_italic_A start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , (9)

where N2subscript𝑁2N_{2}italic_N start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is the number of decoder layers, and Q𝑄Qitalic_Q is the number of cross-attention heads. With the overall feedback RLMsubscript𝑅LMR_{\text{LM}}italic_R start_POSTSUBSCRIPT LM end_POSTSUBSCRIPT and cross-attention 𝑨¯=(a¯ij)M×N¯𝑨subscriptsubscript¯𝑎𝑖𝑗𝑀𝑁\bar{{\bm{A}}}=(\bar{a}_{ij})_{M\times N}over¯ start_ARG bold_italic_A end_ARG = ( over¯ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_M × italic_N end_POSTSUBSCRIPT, we adjust the reward Rj+superscriptsubscript𝑅𝑗R_{j}^{+}italic_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT to each selected tokens. For "skip" action, we intend to limit the selected sequence length. Assume the number of overall input tokens and selected token representations is Lallsubscript𝐿all{L}_{\text{all}}italic_L start_POSTSUBSCRIPT all end_POSTSUBSCRIPT and Lselectsubscript𝐿select{L}_{\text{select}}italic_L start_POSTSUBSCRIPT select end_POSTSUBSCRIPT respectively. We set a hyper-parameter Lhypersubscript𝐿hyper{L}_{\text{hyper}}italic_L start_POSTSUBSCRIPT hyper end_POSTSUBSCRIPT to control the size of Lselectsubscript𝐿select{L}_{\text{select}}italic_L start_POSTSUBSCRIPT select end_POSTSUBSCRIPT with the skipping reward Rsuperscript𝑅R^{-}italic_R start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT:

Rj+=a¯j1a¯0RLM,a¯j=1Mi=1Ma¯ij,formulae-sequencesuperscriptsubscript𝑅𝑗subscript¯𝑎𝑗1subscript¯𝑎0subscript𝑅LMsubscript¯𝑎𝑗1𝑀superscriptsubscript𝑖1𝑀subscript¯𝑎𝑖𝑗\displaystyle R_{j}^{+}=\frac{\bar{a}_{j}}{1-\bar{a}_{0}}R_{\text{LM}},\bar{a}% _{j}=\frac{1}{M}\sum_{i=1}^{M}\bar{a}_{ij},italic_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT = divide start_ARG over¯ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG 1 - over¯ start_ARG italic_a end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG italic_R start_POSTSUBSCRIPT LM end_POSTSUBSCRIPT , over¯ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT over¯ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT , (10)
R={RLMLallif Lselect<LhyperRLMLselectotherwise.superscript𝑅casessubscript𝑅LMsubscript𝐿allif Lselect<Lhypersubscript𝑅LMsubscript𝐿selectotherwise.\displaystyle R^{-}=\begin{cases}\frac{R_{\text{LM}}}{L_{\text{all}}}&\text{if% ${L}_{\text{select}}<{L}_{\text{hyper}}$}\\ \frac{R_{\text{LM}}}{L_{\text{select}}}&\text{otherwise.}\end{cases}italic_R start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT = { start_ROW start_CELL divide start_ARG italic_R start_POSTSUBSCRIPT LM end_POSTSUBSCRIPT end_ARG start_ARG italic_L start_POSTSUBSCRIPT all end_POSTSUBSCRIPT end_ARG end_CELL start_CELL if italic_L start_POSTSUBSCRIPT select end_POSTSUBSCRIPT < italic_L start_POSTSUBSCRIPT hyper end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL divide start_ARG italic_R start_POSTSUBSCRIPT LM end_POSTSUBSCRIPT end_ARG start_ARG italic_L start_POSTSUBSCRIPT select end_POSTSUBSCRIPT end_ARG end_CELL start_CELL otherwise. end_CELL end_ROW

With the selector and rewards designed above, we can optimize the selector with the PPO algorithm described in Section 2. Note that in our setups, the environment (the transformer) changes during the training steps. Therefore, we alternatively update the selector and transformer: in each interaction, we first fix the transformer and use the reward RLMsubscript𝑅LMR_{\text{LM}}italic_R start_POSTSUBSCRIPT LM end_POSTSUBSCRIPT and cross-attention scores to update the selector, then fix the selector and update the transformer with language modeling loss. In addition, the selector is similar to the chunk-wise RNN, so the time overhead of the selection process is low.

4 Related Work

Efficient Transformers

The attention mechanism in transformers requires quadratically increased computational complexity with respect to the input sequence length, which limits the application scenarios, especially for long-text processing. To address this issue, various previous works have been proposed for designing more efficient attention operations (Tay et al., 2023; Fournier et al., 2023). Longformer (Beltagy et al., 2020), BIGBIRD (Zaheer et al., 2020), GMAT (Gupta and Berant, 2020), and ETC (Ainslie et al., 2020) reduce the memory consumption of dense attentions by elaborate combinations of global attention and local attention mechanisms. LongT5 (Guo et al., 2022) is based on the original T5 with global-local attention sparsity patterns to handle long input and has an additional pre-training phase on C4 dataset. CoLT5 (Ainslie et al., 2023) is an improved version of LongT5, with the same parameter size and pre-training dataset as LongT5, and it optimizes the computation of the attention module and the feed-forward module and proposes a new pre-training objective. Additionally, Reformer (Kitaev et al., 2020) leverages a locality-sensitive hashing to the attention mechanism, changing its complexity from 𝒪(n2)𝒪superscript𝑛2\mathcal{O}(n^{2})caligraphic_O ( italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) to 𝒪(nlogn)𝒪𝑛𝑛\mathcal{O}(n\log n)caligraphic_O ( italic_n roman_log italic_n ), where n𝑛nitalic_n is the input text sequence length. Routing Transformer (Roy et al., 2021) applies a sparse routing module based on online k-means to self-attention while reducing the overall complexity of attention. Approximation-based methods, such as Performers (Choromanski et al., 2020) and RFA (Peng et al., 2021), use linear space and time complexity to estimate the attention matrix based on random features. Luna (Ma et al., 2021) attends only to a fixed number of hidden vectors. Linformer (Wang et al., 2020) calculates self-attention by a low-rank matrix. However, the vast majority of these methods are difficult to apply to existing PLMs. Moreover, Xiong et al. (2022) proposes that many efficient-attention transformers do not even perform as well as simple local-attention models on downstream language tasks.

Chunking Methods for Long Sequence

Another solution for long-sequence processing is to perform sequence chunking and then process them respectively (Zhong et al., 2022; Wang et al., 2023). Among chunking methods, SLED (Ivgi et al., 2023) splits the long sequence into overlapping chunks and processes each chunk with the encoder, then fuses cross-chunk information with the decoder. PageSum (Liu et al., 2022) separates the long sequence into non-overlapping chunks and effectively tackles them by the principle of locality (Denning, 2005). Unlimiformer (Bertsch et al., 2023) encodes long inputs in chunks and utilizes only the top-k input tokens for every attention head.

Length Extrapolation

Length extrapolation in transformers refers to their ability to handle input sequences of varying lengths during both training and inference (Press et al., 2022; Sun et al., 2023). Transformers use a self-attention mechanism to capture dependencies across positions in a sequence, allowing them to generalize well to sequences of different lengths. This flexibility is essential for tasks like NLP and time series analysis, where input lengths can vary.

Sequence Length Reduction

Reducing the length of hidden states (Guan et al., 2022; Kim et al., 2022) is the method of model compression from the width perspective, which is promising since some studies showed that there is redundant encoded information in token representations (Ethayarajh, 2019; Klafka and Ettinger, 2020). Among the redundancy, some tokens carry more task-specific information than others, suggesting that these tokens are more salient and imperative to be selected to feed into subsequent operations. Compared with model compression via layer-wise pruning, token-level pruning does not come at the expense of model performance in complex reasoning (Sanh et al., 2019; Sun et al., 2019b).

Base Model Method arXiv PubMed
R-1 R-2 R-L BS R-1 R-2 R-L BS
LEDlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT Standard 46.63 19.62 41.83 - - - - -
LEDlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT PRIMERA 47.60 20.80 42.60 - - - - -
PEGASUSlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT Standard 44.21 16.95 38.83 - 45.97 20.15 41.34 -
PEGASUSlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT BIGBIRD 46.63 19.02 41.77 - 46.32 20.65 42.33 -
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT HEPOS 47.87 20.00 41.50 - 47.93 20.74 42.58 -
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT Standard 40.36 13.78 36.11 59.44 40.36 13.29 35.02 61.77
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT Standard 42.97 15.54 37.02 61.18 42.87 15.44 36.93 63.08
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT SimCAS 47.22 19.35 42.25 63.51 48.17 21.11 43.90 66.33
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT SimCAS 48.14 19.77 42.93 63.78 48.65 21.40 44.14 66.52
Table 1: Average results on arXiv and PubMed test sets. R-1/2/L is the ROUGE-1/2/L F1 score. BS refers to the neural model-based metrics BERTScore. Bold (underline) are used to denote the best (second-best) metric.

5 Experiments

To evaluate the performance of SimCAS, we conduct experiments on the long-text abstractive summarization and machine reading comprehension tasks. In the following, we introduce the information about the datasets, baselines, model implementations, and evaluation results of our experiments.

5.1 Datasets

We conduct experiments on two types of NLP tasks: long-text summarization and machine reading comprehension. For long-text summarization, we use four single-document summarization datasets: arXiv, PubMed (Cohan et al., 2018), GovReport (Huang et al., 2021), SummScreen (Chen et al., 2022), and two multi-document summarization datasets: Multi-News (Fabbri et al., 2019b), WCEP (Gholipour Ghalandari et al., 2020). For the reading comprehension task, we test on the NarrativeQA (Kočiský et al., 2018) dataset. We list the introduction about these datasets below. More dataset details can be found in Appendix § A.

arXiv & PubMed111https://github.com/armancohan/long-summarization are two long-document summarization datasets in the scientific research domain. Each document is a scientific paper whose summary is the corresponding abstract.

GovReport222https://github.com/luyang-huang96/LongDocSum is a long-document summarization dataset based on reports published by the U.S. Government Accountability Office and Congressional Research Service.

SummScreen333https://github.com/mingdachen/SummScreen, which includes TV series transcripts, often indirectly presents plot details through character dialogues scattered throughout the transcript. These details need to be consolidated to create concise plot descriptions.

Multi-News444https://github.com/Alex-Fabbri/Multi-News is a large-scale multi-document summarization dataset. It consists of news articles and human-written summaries of these articles. Each summary is professionally written by editors and with links to the original articles cited.

WCEP555https://github.com/allenai/PRIMER is a dataset for multi-document summarization (MDS). It contains short, human-written summaries about news events, obtained from the Wikipedia Current Events Portal (WCEP).

NarrativeQA666https://github.com/deepmind/narrativeqa is a reading comprehension dataset over entire books from Project Gutenberg and movie scripts from different websites.

5.2 Baselines

There are several baselines for comparison: among them, those based on the full-attention mechanism are HiMAP (Fabbri et al., 2019a), BERTREG (Gholipour Ghalandari et al., 2020), Submodular+Abs (Gholipour Ghalandari et al., 2020), BART (Lewis et al., 2020), PEGASUS (Zhang et al., 2020), DynE (Hokamp et al., 2020), GraphSum (Li et al., 2020), BART-Long-Graph (Pasunuru et al., 2021), SLED (Ivgi et al., 2023), Memorizing Transformers (Wu et al., 2022), Unlimiformer (Bertsch et al., 2023).

Moreover, several baselines introduce the tailored sparse attention to address longer sequences such as LED (Longformer Encoder-Decoder) (Beltagy et al., 2020), BIGBIRD (Zaheer et al., 2020), PRIMERA (Xiao et al., 2022), HEPOS (Huang et al., 2021) and LED+RELAX (Parnell et al., 2022). More details of baselines can be found in Appendix §C.

R-1 R-2 R-L BS
R-1 1.000 0.872 0.913 0.810
R-2 0.872 1.000 0.898 0.794
R-L 0.913 0.898 1.000 0.825
BS 0.810 0.794 0.825 1.000
Table 2: Pearson correlation coefficient between the four automatic evaluation metrics (R-1, R-2, R-L, BS) used for a base BART with beam search on GovReport.

5.3 Implementation Details

Base Model Method GovReport SummScreen
R-1 R-2 R-L BS R-1 R-2 R-L BS
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT SLED 54.70 24.40 25.40 - 32.70 7.90 19.10 -
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT SLED 57.50 26.30 27.40 - 35.20 8.70 19.40 -
LEDlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT PRIMERA 55.10 23.90 25.90 67.00 32.30 7.10 18.30 57.10
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT Memorizing 55.20 25.10 26.40 67.50 32.70 7.40 19.20 57.40
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT Unlimiformer 56.60 26.30 27.60 68.20 34.70 8.50 19.90 58.50
PRIMERA Unlimiformer 57.40 26.20 28.00 68.10 33.30 7.60 18.90 57.70
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT Standard 51.72 19.37 23.11 64.12 29.73 5.23 15.65 54.30
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT Standard 52.94 19.78 23.71 64.44 29.89 5.32 15.71 54.43
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT SimCAS 59.30 25.95 27.07 68.17 43.45 12.74 18.38 62.46
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT SimCAS 60.29 26.68 27.97 68.64 44.15 13.42 18.50 62.82
Table 3: Average results on GovReport and SummScreen test sets. R-1/2/L is the ROUGE-1/2/L F1 score. BS refers to the neural model-based metrics BERTScore. The best and second-best results are bolded and underlined respectively.
System R-1 R-2 R-L BS
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 42.04 14.88 23.34 -
HiMAP 44.17 16.05 21.38 -
GraphSum 45.87 17.56 23.39 -
BART-Long-Graph 49.24 18.99 23.97 -
LED+RELAX 47.23 18.86 25.03 -
PRIMERA 49.94 21.05 25.85 -
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT 40.54 12.18 22.39 58.75
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 42.16 14.69 23.51 60.70
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT+SimCAS 48.88 20.01 25.32 65.05
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT+SimCAS 49.40 20.47 25.96 65.40
System R-1 R-2 R-L BS
BERTREG 35.00 13.50 25.50 -
SUBMODULAR+ABS 34.40 13.10 25.00 -
DynE 35.40 15.10 25.60 -
LED 39.79 18.94 32.10 -
LED+RELAX 41.11 19.46 33.13 -
PRIMERA 46.08 25.21 37.86 -
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT 36.12 13.77 29.98 60.84
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 37.66 15.98 31.01 62.32
BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT+SimCAS 45.68 22.80 37.71 70.59
BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT+SimCAS 46.29 24.45 38.61 71.38
Table 4: Average results on Multi-News (left) and WCEP (right) test sets. R-1/2/L is the ROUGE-1/2/L F1 score. BS refers to the model-based metrics BERTScore. Bold (underline) are used to denote the best (second-best) metric.
Datasets w/o Chunk w/o Align w/o Select SimCAS ΔChunksubscriptΔChunk\Delta_{\text{Chunk}}roman_Δ start_POSTSUBSCRIPT Chunk end_POSTSUBSCRIPT ΔAlignsubscriptΔAlign\Delta_{\text{Align}}roman_Δ start_POSTSUBSCRIPT Align end_POSTSUBSCRIPT ΔSelectsubscriptΔSelect\Delta_{\text{Select}}roman_Δ start_POSTSUBSCRIPT Select end_POSTSUBSCRIPT
arXiv 30.10 ±plus-or-minus\pm± 0.73 36.12 ±plus-or-minus\pm± 0.47 32.70 ±plus-or-minus\pm± 0.62 36.46 ±plus-or-minus\pm± 0.56 21.13%absent21.13%\uparrow\text{21.13\%}↑ 21.13% 0.94%absent0.94%\uparrow\text{0.94\%}↑ 0.94% 11.50%absent11.50%\uparrow\text{11.50\%}↑ 11.50%
PubMed 29.63 ±plus-or-minus\pm± 0.51 37.20 ±plus-or-minus\pm± 0.64 34.32 ±plus-or-minus\pm± 0.59 37.77 ±plus-or-minus\pm± 0.51 27.47%absent27.47%\uparrow\text{27.47\%}↑ 27.47% 1.53%absent1.53%\uparrow\text{1.53\%}↑ 1.53% 10.05%absent10.05%\uparrow\text{10.05\%}↑ 10.05%
GovReport 31.23 ±plus-or-minus\pm± 0.50 37.12 ±plus-or-minus\pm± 0.66 32.91 ±plus-or-minus\pm± 0.81 37.54 ±plus-or-minus\pm± 0.48 20.20%absent20.20%\uparrow\text{20.20\%}↑ 20.20% 1.13%absent1.13%\uparrow\text{1.13\%}↑ 1.13% 14.07%absent14.07%\uparrow\text{14.07\%}↑ 14.07%
SummScreen 19.71 ±plus-or-minus\pm± 0.49 20.69 ±plus-or-minus\pm± 0.45 19.63 ±plus-or-minus\pm± 0.62 25.15 ±plus-or-minus\pm± 0.42 27.60%absent27.60%\uparrow\text{27.60\%}↑ 27.60% 2.56%absent2.56%\uparrow\text{2.56\%}↑ 2.56% 08.10%absent08.10%\uparrow\text{08.10\%}↑ 08.10%
Multi-News 25.34 ±plus-or-minus\pm± 0.38 31.45 ±plus-or-minus\pm± 0.55 27.98 ±plus-or-minus\pm± 0.35 31.36 ±plus-or-minus\pm± 0.51 23.76%absent23.76%\uparrow\text{23.76\%}↑ 23.76% 0.29%absent0.29%\downarrow\text{0.29\%}↓ 0.29% 12.08%absent12.08%\uparrow\text{12.08\%}↑ 12.08%
WCEP 28.42 ±plus-or-minus\pm± 0.72 35.01 ±plus-or-minus\pm± 0.64 30.33 ±plus-or-minus\pm± 0.76 35.31 ±plus-or-minus\pm± 0.59 24.24%absent24.24%\uparrow\text{24.24\%}↑ 24.24% 0.86%absent0.86%\uparrow\text{0.86\%}↑ 0.86% 16.42%absent16.42%\uparrow\text{16.42\%}↑ 16.42%
NarrativeQA 21.70 ±plus-or-minus\pm± 0.55 31.52 ±plus-or-minus\pm± 0.46 23.20 ±plus-or-minus\pm± 0.23 31.76 ±plus-or-minus\pm± 0.44 46.36%absent46.36%\uparrow\text{46.36\%}↑ 46.36% 0.76%absent0.76%\uparrow\text{0.76\%}↑ 0.76% 36.90%absent36.90%\uparrow\text{36.90\%}↑ 36.90%
Table 5: Ablation study results on the development sets of all datasets. Performance changes compared with the full model (BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT+SimCAS) are reported. The metrics of summarization and reading comprehension datasets are the average of ROUGE-1/2/L and F1 respectively.
Refer to caption
Refer to caption
Figure 2: The left-hand-side plot shows the change of the actual number of tokens entered into the model and the number of tokens selected by our selector during training. The red dashed line represents the conditional boundary for the skipping reward. The right-hand-side plot shows the effect of increasing the number of input tokens in the inference phase on the time latency and the number of selected tokens. The area marked in blue on the right represents the limit of the number of tokens that the V100 can handle (350K tokens).

Our implementation is based on PyTorch (Paszke et al., 2019) and Transformers (Wolf et al., 2020) libraries. We train our model by using 8 NVIDIA V100 32G GPUs.

During the training phase, the maximum input lengths for BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT and BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT are set to 8192 and 16384, respectively, unless otherwise specified. To ensure efficient training, we update the parameters of the original backbone and the selector alternately. The reward estimation for each action is computed based on decoder cross-attention and the output feedback of the generative model. This estimation process is detached from the computation graph and does not participate in backpropagation.

At the inference stage, compared to the original generation process, our framework only adds a chunk-wise selection procedure between the encoder and the decoder, which takes very little time. At the decoding stage, the target sequence is generated with beam search in an auto-regressive manner (Wiseman and Rush, 2016).

5.4 Evaluations

Like most previous works, for abstractive summarization tasks, we measure the quality of generated summaries using the popular metric ROUGE (Lin, 2004). On the test set of arXiv, PubMed, GovReport, SummScreen, Multi-News, and WCEP, we report full-length F1-based ROUGE-1, ROUGE-2, and ROUGE-L scores computed with the standard ROUGE Perl package. Furthermore, we also use a popular model-based semantic metric BERTScore777In order to make a fair comparison with the previous work, we also use checkpoint “facebook/bart-large-mnli” for BERTScore. (Zhang* et al., 2020) to demonstrate the superiority of our approaches comprehensively. As shown in Table 2, ROUGE-1 and ROUGE-L are strongly correlated (Pearson correlation score of 0.913). In particular, the relatively low Pearson correlation coefficient (<<< 0.8) between ROUGE-2 and BERTScore indicates a significant difference in preferences.

As for the reading comprehension task, we use the F1 and exact match (EM) metrics defined in SCROLLS (Shaham et al., 2022) to evaluate the model performance on the NarrativeQA dataset. Both metrics normalize the reference and system output strings via lowercasing, removing punctuation and stopwords, and normalizing whitespace.

6 Discussion

Main Result Anaylses

Table 1 and 3 report results over four long-document test sets. We note that casting the backbone BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT into our SimCAS can significantly improve the model performance, and increasing the size of the model can further improve performance. Our approach outperforms baseline on several metrics and achieves new state-of-the-art performance on the PubMed test set. In Table 4, the results of multi-document summarization tasks have a similar trend. Apart from PRIMERA, which customizes a pre-training objective for multi-document summarization, our method significantly outperforms previous results. See Figure 3, we observe a substantial improvement in the performance of our model on the NarrativeQA test set compared to previous works. Given the fact that this dataset has an extremely long average input length (\geq100K) and short average output length (\leq10), we attribute this significant performance enhancement to our method’s proficiency in filtering out an immense amount of task-irrelevant information. This, in turn, enables the decoding blocks to fuse information more efficiently.

Refer to caption
Figure 3: System performance comparison on NarrativeQA test set.
Refer to caption
Figure 4: The AVG ROUGE scores (R-1, R-2, and R-L) of the pre-trained models (BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, LEDbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, SimCASbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT) with 0, 10, and 100 training examples with variance. All results are obtained by the average of 5 random runs with different seeds.

Ablation Study

SimCAS is composed of three components: Chunk, Align, and Select. To investigate the contributions of each component, we independently removed each one. Notably, after removing Chunk, the maximum input length was restricted to 1024, causing Align to fail. As demonstrated in Table 5, the performance experienced a significant decline when either Chunk or Select was removed, underscoring their effectiveness. The Align component improved performance on all datasets, with the exception of the Multi-News dataset. We hypothesize that more sophisticated alignment methods could potentially enhance performance further.

Effectiveness of Conditional Skipping Reward

To prevent our selector from choosing an excessive number of token embeddings, we implement a conditional reward mechanism for skipping actions, which imposes a soft constraint on the number of selected token embeddings. Figure 2 (left) demonstrates the effectiveness of this reward mechanism. As training progresses, the number of selected embeddings gradually converges to a threshold of 2048. This indicates that our selection process is effective in identifying essential token embeddings. Additionally, as the input sequence length increases, the ratio of selected token embeddings to the total length decreases. This is due to the length penalty in equation LABEL:eq:select_skip_reward, which discourages the selection of an excessive number of token embeddings.

Increasing Maximum Input Length

We explore the performance changes of our model on a single 32G V100 GPU as the input sequence length increases during inference. Initially, we assess the efficiency of our framework in processing long sequences using BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT as the backbone. We depict the fluctuations in the ratio of the average time required by SimCAS to the time taken by BART (1024) in Figure 2 (right). Specifically, a line graph represents the relative time per sample, while a bar graph displays the number of token embeddings selected by SimCAS. When the input length is short (<214absentsuperscript214<2^{14}< 2 start_POSTSUPERSCRIPT 14 end_POSTSUPERSCRIPT), the time cost exhibits near-linear growth. We attribute this phenomenon to our method’s significant reduction in computational complexity and the redundant computation resource for short sequences. However, with a dramatic increase in input length (>217absentsuperscript217>2^{17}> 2 start_POSTSUPERSCRIPT 17 end_POSTSUPERSCRIPT), the time cost rises sharply. As indicated in the blue area on the far right of the graph, our method can handle input sequences of up to 350k in length on a V100. Moreover, even as the number of input tokens increases significantly, the selected hidden states remain at a reasonable size, demonstrating our selector’s capability to retrieve high-contribution information.

Low-Resource Evaluation

In real-world applications, the quantity of training samples available for downstream tasks is frequently quite constrained. Consequently, the performance of the model under low-resource conditions is of paramount importance. In our setup, we randomly sample a few (10 and 100) training examples from specific datasets to adapt these PLMs such as BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT (1024), LEDbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT (4096), and SimCASbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT (4096), for corresponding data distributions. The results in Figure 4 indicate that our model demonstrates superior sample efficiency in low-resource scenarios compared to previous robust baselines.

7 Conclusion and Future Work

In this paper, we introduced a simple method for long-text processing via chunking, aligning, and selecting, called SimCAS. We divided the input long sequence into chunks, and encoded them with sequential batch alignment to capture the inter-chunk semantics. To select the important token representations in the encoded output of the encoder, we introduced a reinforcement learning-based token selector with the PPO method. We leverage the transformer as an environment and design a reward scheme for the corresponding actions based on the output logits and decoder cross-attention feedback to optimize the hidden token selector. Substantial experiment results and analyses demonstrate the satisfying effectiveness of SimCAS. Besides, our method does not depend on any particular tasks or models, which have good generalization ability for various application scenarios.

For future work, our method can be naturally adopted into non-language long-sequence processing tasks, such as molecular structure analysis. Also, SimCAS has the potential to enhance the long text pre-training to transformers.

8 Limitations

Even though our framework significantly reduces the overhead of processing long sequences, due to memory constraints, the maximum input length on a single V100 32G GPU during training is 16k, and there is an extremely large consumption of GPU memory even if the batch size is small, which might suppress the potential of our method. In addition, we use small-scale PLMs with good performance as the backbone for downstream tasks. The effectiveness of our framework on large language models needs to be investigated further.

9 Ethical Considerations

Similar to existing methods, there is no guarantee that the generated content is factually consistent and free from hallucination Maynez et al. (2020); Kang and Hashimoto (2020). Therefore caution is imperative when applying our method to scenarios with high demand of generation accuracy.

References

  • Ainslie et al. (2023) Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontanon, Siddhartha Brahma, Yury Zemlyanskiy, David Uthus, Mandy Guo, James Lee-Thorp, Yi Tay, Yun-Hsuan Sung, and Sumit Sanghai. 2023. CoLT5: Faster long-range transformers with conditional computation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5085–5100, Singapore. Association for Computational Linguistics.
  • Ainslie et al. (2020) Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 268–284, Online. Association for Computational Linguistics.
  • Akkaya et al. (2019) Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. 2019. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113.
  • Beltagy et al. (2020) Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
  • Bertsch et al. (2023) Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley. 2023. Unlimiformer: Long-range transformers with unlimited length input.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • Byrne et al. (2021) Bill Byrne, Karthik Krishnamoorthi, Saravanan Ganesh, and Mihir Kale. 2021. TicketTalk: Toward human-level performance with end-to-end, transaction-based dialog systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 671–680, Online. Association for Computational Linguistics.
  • Chen et al. (2022) Mingda Chen, Zewei Chu, Sam Wiseman, and Kevin Gimpel. 2022. SummScreen: A dataset for abstractive screenplay summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8602–8615, Dublin, Ireland. Association for Computational Linguistics.
  • Cheng et al. (2024a) Pengyu Cheng, Tianhao Hu, Han Xu, Zhisong Zhang, Yong Dai, Lei Han, and Nan Du. 2024a. Self-playing adversarial language game enhances llm reasoning. arXiv preprint arXiv:2404.10642.
  • Cheng and Li (2022) Pengyu Cheng and Ruineng Li. 2022. Replacing language model for style transfer. arXiv preprint arXiv:2211.07343.
  • Cheng et al. (2024b) Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Tianhao Hu, Peixin Cao, Nan Du, and Xiaolong Li. 2024b. Adversarial preference optimization: Enhancing your alignment via rm-llm game. In Findings of the Association for Computational Linguistics.
  • Choromanski et al. (2020) Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. 2020. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
  • Cohan et al. (2018) Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.
  • Denning (2005) Peter J. Denning. 2005. The locality principle. Commun. ACM, 48(7):19–24.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
  • Ethayarajh (2019) Kawin Ethayarajh. 2019. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. arXiv preprint arXiv:1909.00512.
  • Fabbri et al. (2019a) Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019a. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.
  • Fabbri et al. (2019b) Alexander R Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R Radev. 2019b. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749.
  • Fournier et al. (2023) Quentin Fournier, Gaétan Marceau Caron, and Daniel Aloise. 2023. A practical survey on faster and lighter transformers. ACM Comput. Surv. Just Accepted.
  • Gholipour Ghalandari et al. (2020) Demian Gholipour Ghalandari, Chris Hokamp, Nghia The Pham, John Glover, and Georgiana Ifrim. 2020. A large-scale multi-document summarization dataset from the Wikipedia current events portal. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1302–1308, Online. Association for Computational Linguistics.
  • Gong et al. (2020) Hongyu Gong, Yelong Shen, Dian Yu, Jianshu Chen, and Dong Yu. 2020. Recurrent chunking mechanisms for long-text machine reading comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6751–6761, Online. Association for Computational Linguistics.
  • Guan et al. (2022) Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, and Minyi Guo. 2022. Transkimmer: Transformer learns to layer-wise skim. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7275–7286, Dublin, Ireland. Association for Computational Linguistics.
  • Guo et al. (2022) Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. 2022. LongT5: Efficient text-to-text transformer for long sequences. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736, Seattle, United States. Association for Computational Linguistics.
  • Gupta and Berant (2020) Ankit Gupta and Jonathan Berant. 2020. Gmat: Global memory augmentation for transformers. arXiv preprint arXiv:2006.03274.
  • Hokamp et al. (2020) Chris Hokamp, Demian Gholipour Ghalandari, Nghia The Pham, and John Glover. 2020. Dyne: Dynamic ensemble decoding for multi-document summarization.
  • Huang et al. (2021) Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. 2021. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1419–1436, Online. Association for Computational Linguistics.
  • Ivgi et al. (2023) Maor Ivgi, Uri Shaham, and Jonathan Berant. 2023. Efficient Long-Text Understanding with Short-Text Models. Transactions of the Association for Computational Linguistics, 11:284–299.
  • Kakade (2001) Sham M Kakade. 2001. A natural policy gradient. Advances in neural information processing systems, 14.
  • Kang and Hashimoto (2020) Daniel Kang and Tatsunori B. Hashimoto. 2020. Improved natural language generation via loss truncation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 718–731, Online. Association for Computational Linguistics.
  • Kazemnejad et al. (2023) Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. 2023. The impact of positional encoding on length generalization in transformers.
  • Kim et al. (2022) Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, and Kurt Keutzer. 2022. Learned token pruning for transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 784–794.
  • Kitaev et al. (2020) Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.
  • Klafka and Ettinger (2020) Josef Klafka and Allyson Ettinger. 2020. Spying on your neighbors: Fine-grained probing of contextual embeddings for information about surrounding words. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4801–4811, Online. Association for Computational Linguistics.
  • Konda and Tsitsiklis (1999) Vijay Konda and John Tsitsiklis. 1999. Actor-critic algorithms. In Advances in Neural Information Processing Systems, volume 12. MIT Press.
  • Kočiský et al. (2018) Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 7:453–466.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  • Li et al. (2020) Wei Li, Xinyan Xiao, Jiachen Liu, Hua Wu, Haifeng Wang, and Junping Du. 2020. Leveraging graph to improve abstractive multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6232–6243, Online. Association for Computational Linguistics.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Liu et al. (2022) Yixin Liu, Ansong Ni, Linyong Nan, Budhaditya Deb, Chenguang Zhu, Ahmed Hassan Awadallah, and Dragomir Radev. 2022. Leveraging locality in abstractive text summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6081–6093, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Ma et al. (2022) Congbo Ma, Wei Emma Zhang, Mingyu Guo, Hu Wang, and Quan Z. Sheng. 2022. Multi-document summarization via deep learning techniques: A survey. ACM Comput. Surv., 55(5).
  • Ma et al. (2021) Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. 2021. Luna: Linear unified nested attention. In Advances in Neural Information Processing Systems.
  • Maynez et al. (2020) Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.
  • Mohtashami and Jaggi (2023) Amirkeivan Mohtashami and Martin Jaggi. 2023. Landmark attention: Random-access infinite context length for transformers. In Workshop on Efficient Systems for Foundation Models @ ICML2023.
  • Moro and Ragazzi (2022) Gianluca Moro and Luca Ragazzi. 2022. Semantic self-segmentation for abstractive summarization of long documents in low-resource regimes. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):11085–11093.
  • Moro et al. (2022) Gianluca Moro, Luca Ragazzi, Lorenzo Valgimigli, and Davide Freddi. 2022. Discriminative marginalized probabilistic neural method for multi-document summarization of medical literature. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 180–189, Dublin, Ireland. Association for Computational Linguistics.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
  • Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Pang et al. (2022) Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel Bowman. 2022. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5336–5358, Seattle, United States. Association for Computational Linguistics.
  • Parnell et al. (2022) Jacob Parnell, Inigo Jauregi Unanue, and Massimo Piccardi. 2022. A multi-document coverage reward for relaxed multi-document summarization.
  • Pasunuru et al. (2021) Ramakanth Pasunuru, Mengwen Liu, Mohit Bansal, Sujith Ravi, and Markus Dreyer. 2021. Efficiently summarizing text and graph encodings of multi-document clusters. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4768–4779, Online. Association for Computational Linguistics.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Peng et al. (2021) Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. 2021. Random feature attention. In International Conference on Learning Representations.
  • Phang et al. (2022) Jason Phang, Yao Zhao, and Peter J Liu. 2022. Investigating efficiently extending transformers for long input summarization. arXiv preprint arXiv:2208.04347.
  • Press et al. (2022) Ofir Press, Noah Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations.
  • Ramamurthy et al. (2023) Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. 2023. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In The Eleventh International Conference on Learning Representations.
  • Roy et al. (2021) Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient Content-Based Sparse Attention with Routing Transformers. Transactions of the Association for Computational Linguistics, 9:53–68.
  • Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Schulman et al. (2016) John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. 2016. High-dimensional continuous control using generalized advantage estimation. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. CoRR, abs/1707.06347.
  • Shaham et al. (2022) Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. 2022. SCROLLS: Standardized CompaRison over long language sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 12007–12021, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Sun et al. (2019a) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019a. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7464–7473.
  • Sun et al. (2019b) Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019b. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355.
  • Sun et al. (2023) Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. 2023. A length-extrapolatable transformer. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14590–14604, Toronto, Canada. Association for Computational Linguistics.
  • Tay et al. (2023) Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, and Donald Metzler. 2023. Scaling laws vs model architectures: How does inductive bias influence scaling?
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, L. Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom Le Paine, Caglar Gulcehre, Ziyun Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy P. Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. 2019. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, pages 1–5.
  • Wang et al. (2020) Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity.
  • Wang et al. (2023) Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. 2023. Augmenting language models with long-term memory.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
  • Wiseman and Rush (2016) Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306, Austin, Texas. Association for Computational Linguistics.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  • Wu et al. (2022) Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing transformers. In International Conference on Learning Representations.
  • Wysocki et al. (2023) Oskar Wysocki, Zili Zhou, Paul O’Regan, Deborah Ferreira, Magdalena Wysocka, Dónal Landers, and André Freitas. 2023. Transformers and the Representation of Biomedical Background Knowledge. Computational Linguistics, 49(1):73–115.
  • Xiao et al. (2022) Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. 2022. PRIMERA: Pyramid-based masked sentence pre-training for multi-document summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5245–5263, Dublin, Ireland. Association for Computational Linguistics.
  • Xiong et al. (2022) Wenhan Xiong, Barlas Oguz, Anchit Gupta, Xilun Chen, Diana Liskovich, Omer Levy, Scott Yih, and Yashar Mehdad. 2022. Simple local attentions remain competitive for long-context tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1975–1986, Seattle, United States. Association for Computational Linguistics.
  • Yang et al. (2019) Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Yuan et al. (2021) Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems, volume 34, pages 27263–27277. Curran Associates, Inc.
  • Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297.
  • Zhang et al. (2020) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.
  • Zhang* et al. (2020) Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
  • Zhong et al. (2022) Zexuan Zhong, Tao Lei, and Danqi Chen. 2022. Training language models with memory augmentation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5657–5673, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Appendix A Dataset Statistics

Refer to caption
Figure 5: Input text length distributions on the six summarization datasets
Dataset Train Valid Test Avg. Input Tokens
arXiv 203.0K 6.4K 6.4K 6.0K
PubMed 119.9K 6.6K 6.7K 3.0K
GovReport 17.5K 1.0K 1.0K 9.6K
SummScreen 22.6K 2.1K 2.1K 6.6K
Multi-News 44.9K 5.6K 5.6K 1.8K
WCEP 8.1K 1.0K 1.0K 3.9K
NarrativeQA 32.7K 3.5K 10.6K 121.7K
Table 6: Statistics of used datasets.
Dataset Model Batch Size Max Epoch Warmup Steps
arXiv BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 1 10 6400,10000
PubMed BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 1 10 6400, 10000
GovReport BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 1 20 6400, 10000
SummScreen BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 1 20 2500, 6400
Multi-News BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 1 20 6400, 10000
WCEP BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 1 100 1600, 2500
NarrativeQA BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT 1 10 6400, 10000
Table 7: Hyper-parameter grid for downstream task fine-tuning. We use Adam optimizer (β1=0.9subscript𝛽10.9\beta_{1}=0.9italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.9, β2=0.999subscript𝛽20.999\beta_{2}=0.999italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.999, ϵ=1e6italic-ϵ1𝑒6\epsilon=1e-6italic_ϵ = 1 italic_e - 6) for all datasets.
Dataset Generation Parameters
arXiv beam: 4, max_len: 300, min_len: 50, no_repeat_ngrams: 3, length_penalty: 5.0
PubMed beam: 4, max_len: 400, min_len: 40, no_repeat_ngrams: 3, length_penalty: 4.0
GovReport beam: 4, max_len: 740, min_len: 50, no_repeat_ngrams: 3, length_penalty: 4.0
SummScreen beam: 4, max_len: 750, min_len: 50, no_repeat_ngrams: 3, length_penalty: 4.0
Multi-News beam: 4, max_len: 400, min_len: 150, no_repeat_ngrams: 3, length_penalty: 2.0
WCEP beam: 4, max_len: 40, min_len: 15, no_repeat_ngrams: 3, length_penalty: 2.0
NarrativeQA beam: 4, max_len: 20, no_repeat_ngrams: 3, length_penalty: 1.0
Table 8: Hyper-parameter settings during inference for each dataset.

Appendix B Hyper-parameters & Packages

Table 7 and Table 8 delineate the hyper-parameter settings at the training and inference phases of the experiment, respectively. For evaluation metrics, we used the following packages:

Appendix C Introduction for Baselines

We use the competitive baselines that demonstrate the downstream task results for comparison. Among them, BART (Lewis et al., 2020) is a standard full-attention PLM for sequence generation. Compared with BART, PEGASUS (Zhang et al., 2020) has a tailored pre-training objective for abstractive text summarization. LED (Longformer Encoder-Decoder) (Beltagy et al., 2020) uses the sparse attention-based encoder and full-attention decoder. Before pre-training, its parameters are initialized from BART. BIGBIRD (Zaheer et al., 2020), for an encoder-decoder setup, also introduces their specified sparse attention mechanism only at the encoder side. PRIMERA based on LED introduces a task-specific per-training objective for multi-document summarization. SLED (Ivgi et al., 2023) processes long sequences via short-context PLMs. The origin long sequence is partitioned into overlapping chunks. HEPOS (Huang et al., 2021) proposes head-wise positional strides to effectively pinpoint salient information from the source. Memorizing Transformers (Wu et al., 2022) employs a trainable attention gate to moderate between the standard cross-attention and attention over retrieved keys from a datastore. Unlimiformer (Bertsch et al., 2023) uses the k-nearest-neighbor KNN index to choose full-length input to reduce computation overhead. HiMAP (Fabbri et al., 2019a) expands the existing pointer-generator network model into a hierarchical network. GraphSum Li et al. (2020) employs well-known graph representations of documents to effectively process multiple input documents for abstractive summarization. BART-Long-Graph (Pasunuru et al., 2021) is fine-tuned based on LED and additionally injects discourse graph. LED+RELAX Parnell et al. (2022) introduces a RELAX gradient estimator with multi-document coverage reward. BERTREG (Gholipour Ghalandari et al., 2020) is a regression-based sentence ranking system with BERT embedding, which is used as an extractive summarization method. Submodular+Abs (Gholipour Ghalandari et al., 2020) consists of a submodular-based extractive summarizer and a bottom-up abstractive summarizer. DynE (Hokamp et al., 2020) ensembles multiple-document for abstractive summarization by single document summarization models.

Appendix D Perplexity on Test Sets

To more comprehensively demonstrate the effectiveness of our approach, we additionally use perplexity as an evaluation metric for comparative experiments. The experimental results in Table 9 indicate that our system (SimCAS) significantly enhances model performance compared to the baselines (BART).

Method arXiv PubMed GovReport SummScreen Multi-News WCEP NarrativeQA
BART 31.82 34.81 42.10 48.91 49.40 50.91 23.34
SimCAS 6.89 5.58 6.23 10.80 8.25 8.85 4.71
Table 9: Perplexity on all seven test sets including arXiv, PubMed, GovReport, SummScreen, Multi-News, WCEP, and NarrativeQA.

Appendix E Time Latency at Inference Stage

At the inference stage, our framework additionally introduces a selection process compared with the standard transformer. Therefore, We study the proportion of our selector in the average inference time cost per sample. The results in Table 10 indicate that our selector only adds a small amount of time cost while significantly improving model performance, demonstrating the effectiveness of our approach.

Appendix F Potential to select representation

Because in the decoder cross-attention module, the encoded output from the encoder is used as “Key” and “Value” to participate in the calculation, its attention score can lead to the contribution of the corresponding token representation from encoded output to the current token decision at the decoding step. Figure6 demonstrates how the cross-attention scores change during the decoding of the reference output with one example in GovReport. We can observe that 1) most of the token decisions in the decoding phase focus only on a small set of encoded representations; 2) For each token decision, the contributions of different encoded token representations vary greatly. These phenomena suggest that there is still the feasibility of further filtering out low-contribution encoded representations.

Refer to caption
Figure 6: Visualization of the average cross-attention of all attention heads at all layers in the decoder (excluding start position). This example is generated by the BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT+SimCAS trained on GovReport.

Appendix G Compatible with short-input tasks

Method arXiv PubMed GovReport SummScreen Multi-News WCEP NarrativeQA
Selection (ms) 157.5 70.4 293.9 103.4 74.2 9.8 112.4
All (ms) 9661.5 13799.1 104953.7 30424.0 17664.2 1356.2 689.6
Ratio (%) 1.63 0.51 0.28 0.34 0.42 0.72 16.30
Table 10: The average time overhead of the selection process (Select) and the whole process (All), and their ratio (Ratio) for a single sample.

While the sparse-attention transformers have been proven effective on a wide range of long-sequence datasets, as shown in Figure 7, these methods tend to underperform traditional full-attention transformers on the more common short-sequence tasks. However, in real scenarios, short-sequence inputs and long-sequence inputs are often mixed together, and the former occupies the vast majority. This limits the application scope of sparse-attention transformer architecture. In contrast, applying our framework SimCAS to existing full-attention transformers has strong flexibility. Specifically, given the full-attention model under SimCAS, if the input sequence exceeds the maximum length of a single chunk, we will perform chunking and selecting, otherwise, we can naturally switch to a standard short-text encoding form by skipping chunking procedures.

Appendix H Efficacy of SBA

Refer to caption
Figure 7: Comparison of the performance of full-attention PLM BART and sparse-attention PLM LED on short-text summarization datasets CNNDM (Nallapati et al., 2016) and XSum (Narayan et al., 2018). AVG ROUGE denotes the average of ROUGE-1/2/L F1 scores.

As our framework introduces SBA to align inter-chunk information during the encoding steps, we explore the change of hidden state in the forward propagation process under the influence of this component. Figure 8 demonstrates the visualized similarity matrix between each chunk’s start hidden state after the encoding block. It can be seen from an array of cosine similarity matrices in the top half that, through the alternation of SBA and encoding blocks, the cosine similarities between the start hidden states from different chunks are all close to 1, which indicates that their directions in high-dimensional space are almost the same.

Considering that cosine similarity can only reflect the closeness between vector directions, we design a similarity calculation method to add the measure of Euclidean distance. Given a pair of vectors 𝒗1,𝒗2dsubscript𝒗1subscript𝒗2superscript𝑑{\bm{v}}_{1},{\bm{v}}_{2}\in\mathbb{R}^{d}bold_italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, the custom similarity Sim between them is formulated as follows:

Sim=(𝒗1,𝒗2)𝒗1𝒗2(1+𝒗1𝒗2),Simsubscript𝒗1subscript𝒗2delimited-∥∥subscript𝒗1delimited-∥∥subscript𝒗21delimited-∥∥subscript𝒗1subscript𝒗2\textbf{Sim}=\frac{({\bm{v}}_{1},{\bm{v}}_{2})}{\lVert{\bm{v}}_{1}\rVert*% \lVert{\bm{v}}_{2}\rVert*(1+\lVert{\bm{v}}_{1}-{\bm{v}}_{2}\rVert)},Sim = divide start_ARG ( bold_italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_ARG start_ARG ∥ bold_italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∥ ∗ ∥ bold_italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ ∗ ( 1 + ∥ bold_italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - bold_italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ ) end_ARG ,

by which, we scale the cosine similarity to observe its change. As can be seen from the lower part of Figure 8, after each encoder layer, the start hidden states will vary in scale due to different contexts.

Algorithm 1 Token embedding selection with PPO
1:  Initialize agent and hyper-parameters
2:  Initialize advantages, rewards, and returns
3:  Compute Rewards and Advantages
4:  Initialize next_value 0absent0\leftarrow 0← 0, last_gae_lam 0absent0\leftarrow 0← 0
5:  valid_reward_num \leftarrow size of select_rewards
6:  for t=0𝑡0t=0italic_t = 0 to num_steps-1 do
7:     Compute action sum for each step and update rewards
8:  end for
9:  cur_num 0absent0\leftarrow 0← 0
10:  for t=0𝑡0t=0italic_t = 0 to num_steps-1 do
11:     Update rewards and actions based on select_rewards
12:  end for
13:  for t=𝑡absentt=italic_t = num_steps-1 to 0 do
14:     Compute delta, advantages, and returns
15:  end for
16:  Reverse advantages and returns
17:  Policy and Value Function Update
18:  for epoch=0𝑒𝑝𝑜𝑐0epoch=0italic_e italic_p italic_o italic_c italic_h = 0 to update_epochs-1 do
19:     Shuffle batch indices
20:     for batch_st=0𝑏𝑎𝑡𝑐_𝑠𝑡0batch\_st=0italic_b italic_a italic_t italic_c italic_h _ italic_s italic_t = 0 to seq_len by mini_batch do
21:        Compute policy loss, value loss, and entropy loss
22:        Perform gradient ascent
23:     end for
24:     if approx_kl >>> threshold then
25:        Break
26:     end if
27:  end for

Appendix I More Experimental Details

Refer to caption
Figure 8: Evolution of similarities between start hidden states of each chunk after each encoding layer during one forward propagation. Only cosine similarity is used in the top half of the figure. The lower part takes into account both cosine similarity and Euclidean distance.

Since we aim to minimize dataset-specific modeling, we unify the processing of the used long-document summarization, multi-document summarization, and machine reading comprehension datasets. In the NarrativeQA dataset for machine reading comprehension, we concatenate together the question and reference content, each of which prepends a prefix to indicate the type of text. For a fair comparison, we uniformly use the full-attention PLM BART888The checkpoints of BART are “facebook/bart-base” and “facebook/bart-large” containing around 139 M and 406 M parameters respectively, whose maximum encoding length is 1024. on the seven public datasets above. Built on the BART, our framework introduces an additional parameterized selector to focus on more task-specific token representations. The selector follows the actor-critic style (Konda and Tsitsiklis, 1999) and contains around 8M parameters. There are two Adam optimizers with β1=0.9subscript𝛽10.9\beta_{1}=0.9italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.9, β2=0.999subscript𝛽20.999\beta_{2}=0.999italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.999 for BART and selector respectively.

Additionally, to handle longer sequences during the training phase, we set the chunk size to 512 instead of 1024 in all experiments (considering the start token and end token). To maintain chunk-wise alignment, we pad the chunks to uniform the size of each chunk. During the forward propagation, the embedding layer embeds position representations for each chunk independently.999Considering the context of each chunk is different, although our design may lead to repeated position representation, we argue that after the encoding stage, the combination of the same token and position information would still produce various representations (Kazemnejad et al., 2023)

Training Details

In this paper, for the training of BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT on various datasets we uniformly employ the Adam optimizer with the following dynamic learning rate:

lr=γmin(step0.5,step×warmup1.5),𝑙𝑟𝛾superscriptstep0.5stepsuperscriptwarmup1.5lr=\gamma\min(\textit{step}^{-0.5},\textit{step}\times\textit{warmup}^{-1.5}),italic_l italic_r = italic_γ roman_min ( step start_POSTSUPERSCRIPT - 0.5 end_POSTSUPERSCRIPT , step × warmup start_POSTSUPERSCRIPT - 1.5 end_POSTSUPERSCRIPT ) ,

where γ𝛾\gammaitalic_γ affects the maximum learning rate and warmup indicates the warmup steps, step is the number of updating steps, and lr𝑙𝑟lritalic_l italic_r is the learning rate. In addition, there is another separate Adam optimizer for our reinforcement learning-based parameterized selector, in which the learning rate is fixed to 1×1041superscript1041\times 10^{-4}1 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT. The optimization of the selector and transformer is performed alternately, with one model trained while the other model is fixed.

Inference Details

In practice, We also investigate how the model performs when the maximum effective input length is increased during inference. In Table 11, we can observe that although the maximum input length is set to 16384 during the training phase due to the memory limitation, increasing this maximum length during inference still improves the model performance.

Maximum Input Length R-1 R-2 R-L BS
4096 44.96 22.60 37.17 69.74
8192 45.41 22.79 37.40 70.51
16384 45.22 22.48 37.34 70.46
32768 45.45 22.56 37.45 70.52
++\infty+ ∞ 45.68 22.80 37.71 70.59
Table 11: Performance of BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT-SimCAS on WCEP with different maximum input lengths during inference. R-1/2/L is the ROUGE-1/2/L F1 score. BS refers to the model-based metric BERTScore. \star: the maximum input length in the training phase. ++\infty+ ∞: the input sequence to the model is complete. The best results are bolded.

The Details of Selector

Our token embedding selection process based on PPO is shown in Algorithm 1. In our setting, the selector consists of an actor component and a critic component with PPO optimization, both of which are the feed-forward network, except for the final layers. In order to enable the selector to choose more diverse token representations instead of becoming homogeneous during the chunk-wise selection process, the state space consists of the current token representation and the selector’s hidden state that is affected by previous actions. At the beginning of the selection procedure, the initial selector hidden state is the average of the start hidden state of all chunk representations. At each time step, we input the current selector hidden state and a chunk of token representations. Then the actor of the selector outputs a probability distribution over action space, and the critic of the selector outputs a single estimated scalar value for each token based on the hidden state selector and the corresponding token representation. For the state transition after executing current actions, we update the hidden state of the selector using the selected token representation from the current chunk. Note that in order to avoid the extreme case where the selector skips all tokens, in each chunk decision, if all actions are “skipping”, they are all switched to “selecting”, and the corresponding action probability and value are re-obtained. Additionally, to increase the training stability, the advantage in the PPO algorithm is approximated using the Generalized Advantage Estimation (GAE) (Schulman et al., 2016).

Appendix J Case Study on GovReport Dataset

In addition to using regular automatic evaluation metrics to measure the effect of model generation, we also present some actual output to support the results. Figure 9 displays several examples of summaries generated by the fine-tuned base model BARTlargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT and our BARTbasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT-SimCAS. We can observe that the system output of our model has fewer grammatical errors and higher coherence compared with the base model. Furthermore, since our model is able to perceive longer sequences, our output is more informative and better aligned with the reference text.

Refer to caption
Figure 9: Example summary generated by BART and SimCAS trained on GovReport dataset. The maximum input length of standard BART and SimCAS is 1024 and 16384 respectively. The sentence in green is included in the SimCAS summary, while the one in red is discarded.