Human-like Episodic Memory for Infinite Context LLMs

Zafeirios Fountas, Huawei Noah’s Ark Lab, London, UK
Martin A Benfeghoul (Equal Contribution), Huawei Noah’s Ark Lab, London, UK
Adnan Oomerjee, Huawei Noah’s Ark Lab, London, UK
Fenia Christopoulou, Huawei Noah’s Ark Lab, London, UK
Gerasimos Lampouras, Huawei Noah’s Ark Lab, London, UK
Haitham Bou-Ammar, Huawei Noah’s Ark Lab, London, UK; University College London, UK
Jun Wang, University College London, UK
Abstract

Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs, enabling them to effectively handle practically infinite context lengths while maintaining computational efficiency. EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement in an on-line fashion. When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient and human-like access to relevant information. Experiments on the LongBench dataset demonstrate EM-LLM’s superior performance, outperforming the state-of-the-art InfLLM model with an overall relative improvement of 4.3% across various tasks, including a 33% improvement on the PassageRetrieval task. Furthermore, our analysis reveals strong correlations between EM-LLM’s event segmentation and human-perceived events, suggesting a bridge between this artificial system and its biological counterpart. This work not only advances LLM capabilities in processing extended contexts but also provides a computational framework for exploring human memory mechanisms, opening new avenues for interdisciplinary research in AI and cognitive science.

1 Introduction

For contemporary pre-trained large language models (LLMs), the context window serves as the primary mechanism to incorporate domain-specific, private, or common up-to-date information. However, despite their remarkable and ever-expanding capabilities, LLMs still exhibit significant limitations when tasked with processing extensive contexts [Liu et al., 2024a]. These limitations stem from inherent challenges in Transformer-based architectures. Recent studies have shown that Transformers struggle with extrapolating to contexts longer than their training window size [Kazemnejad et al., 2024]. On top of this, employing softmax attention over extended token sequences requires substantial computational resources for each token generation, and the resulting attention embeddings risk becoming excessively noisy and losing their distinctiveness [Tworkowski et al., 2023].

To mitigate those challenges, recent works have focused on retrieval-based methods, either in the form of in-context augmentation (e.g., RAG-based techniques [Lewis et al., 2020, Gao et al., 2024]) or via retrieval of previously-inferred key-value pairs (KV) within individual attention heads [Wu et al., 2022, Tworkowski et al., 2023, Bertsch et al., 2023]. Notably, state-of-the-art performance is achieved when KV pairs are initially organised into non-overlapping segments and then retrieved together as one block of sequential tokens [Xiao et al., 2024a]. While such techniques present interesting avenues of research, results still indicate a significant gap between the performance of LLMs in short- vs long-context tasks, even when existing long-context architectures are employed [Liu et al., 2024a].

This work tackles the above challenges and attempts to bridge this performance gap by taking inspiration from the algorithmic interpretation of episodic memory in the human brain -- the memory system responsible for encoding, storing, and retrieving personal experiences and events. The brain makes sense of its continuous experience in the real world by segmenting it into discrete episodic events [Clewett et al., 2019, Zacks, 2020], which are organised in a hierarchical and nested-timescale structure [Baldassano et al., 2017] and stored in long-term memory. Notably, the boundaries between such events are the access points when it comes to memory retrieval [Michelmann et al., 2023a] and are widely believed to correspond to points in time with high prediction errors between the brain’s generative model and its raw sensory input (a.k.a., surprise). In this context, surprise refers to moments when the brain’s predictions about incoming sensory information are significantly violated, leading to a mismatch between what is expected and what is actually perceived. These instances of high surprise are thought to signal important changes in the environment or narrative, prompting the brain to segment the ongoing experience into distinct events [Zacks et al., 2007, 2011, Roseboom et al., 2019, Sinclair et al., 2021, Fountas et al., 2022]. Once segmented and stored, the brain can recall episodic memories based on their similarity to its current experience, recency, original temporal order, and their proximity to other recalled memories (temporal asymmetry and contiguity, Howard and Kahana, 2002).

Following these insights, we propose a novel architecture, EM-LLM, that integrates crucial aspects of event cognition and episodic memory into Transformer-based LLMs. For memory formation, we segment the sequence of the tokens presented to the underlying LLM into individual memory units representing episodic events. The boundaries, and thus the size of those events, are initially determined dynamically, based on the level of surprise of the model during inference, and then refined to maximise cohesion within memory units and separation of memory content across them (see Sections 3.2 and 3.3). This refinement process leverages graph-theoretic metrics, treating the similarity between attention keys (the learned representations used in Transformer self-attention mechanisms) as a weighted adjacency matrix, and aims to enhance the model’s ability to efficiently recall relevant information when addressing complex tasks with extended contexts. Importantly, this memory formation process incurs minimal additional computational cost, with the surprise-based segmentation requiring no extra computation and the refinement step having a complexity of $\mathcal{O}(kn)$, where $k$ is typically very small compared to the number of tokens $n$. With this efficient memory formation process, by grouping similar information in single units, we minimise the number of units needed to recall details around specific events. For memory recall, our approach integrates similarity-based retrieval with mechanisms that facilitate temporal contiguity and asymmetry effects. By retrieving and buffering salient memory units, our model leverages and enhances the recently discovered propensity of LLMs to exhibit human-like patterns in sequential information retrieval [Ji-An et al., 2024]. This method not only ensures efficient access to pertinent information but also mimics the temporal dynamics found in human free recall studies (such as Howard and Kahana, 2002), further enhancing the model’s ability to handle complex tasks that require nuanced temporal reasoning.

To prove our hypotheses, we first employ a series of human-annotated podcast scripts, where we show that information in LLM attention heads can be semantically grouped in a way that correlates with the event structure perceived by humans. Therefore, LLM-perceived surprise can indeed serve as a proxy for the cognitive signals that drive human event segmentation, as confirmed by previous works [Kumar et al., 2023]. Then, using the long-context PG-19 dataset [Rae et al., 2020], which comprises a diverse corpus of English books, we evaluate the effectiveness of both steps in our segmentation method for grouping relevant information, and assess the performance of different refinement objectives. Finally, we show that our method is scalable and significantly outperforms the state-of-the-art model InfLLM [Xiao et al., 2024a] on the widely-used LongBench benchmark [Bai et al., 2023] for long-context tasks, achieving an overall relative improvement of 4.3%, including a substantial 33% improvement on the PassageRetrieval task.

2 Related work

2.1 Long-context in LLMs

Recently, several approaches have been proposed to extend the context window of Transformer-based models. Those include methods that address the limited representational capacity of softmax attention, and its quadratic computational and memory cost [Katharopoulos et al., 2020, Munkhdalai et al., 2024]. Other methods target the poor extrapolation of typical positional encodings to out-of-distribution (OOD) context lengths [Kazemnejad et al., 2024]. The latter is evident in most widely used methods, including the original absolute positional encodings [Vaswani et al., 2017] and the more recent relative positional encodings, such as the popular Rotary Positional Embeddings (RoPE) [Su et al., 2024]. To address this, some approaches propose scaling of the rotation angles [Chen et al., 2024a] or the base constant [Xiong et al., 2023, Liu et al., 2024b, Peng et al., 2024, Ding et al., 2024]. Others scale positions without affecting the embedding function [Press et al., 2021, Chen et al., 2023, Jin et al., 2024], explore alternative strategies such as KERPLE [Chi et al., 2022] and FIRE [Li et al., 2024], or adopt mechanisms from certain LMs like T5 [Raffel et al., 2020].

Concerning computational efficiency and diluted attention, successful approaches propose methods for general improvements to the efficiency of Transformers through optimised computations [Dao, 2024, Han et al., 2024a, Aminabadi et al., 2022, Kwon et al., 2023, Liu et al., 2024c, Brandon et al., 2023] or compression techniques [Nawrot et al., 2024, Zhang et al., 2023], as well as training methods tailored for long-context scenarios [Zhu et al., 2024, Chen et al., 2024b]. Another direction is the utilisation of retrieval-based methods, the vast majority of which rely on a vector database that keeps a key-value cache and on scalable approximations of k-nearest neighbors (k-NNs) to perform lookups [Wu et al., 2022, Tworkowski et al., 2023, Bertsch et al., 2023]. Interestingly, since using a key-value cache with k-NN lookup can be seen as an approximation of applying softmax attention to the full token sequence (see Appendix A.4), k-NN retrieval methods can be used without any fine-tuning [Bertsch et al., 2023]. For an exception that does not rely on k-NNs, see Wang et al. [2023].
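The intuition behind this approximation can be illustrated with a small numerical sketch. The function name `softmax_attention`, the toy keys, values and query below are all hypothetical and chosen only for illustration; the sketch simply compares softmax attention over all keys with softmax attention restricted to the `top_k` highest-scoring keys, and is not the derivation of the appendix.

```python
import math

def softmax_attention(query, keys, values, top_k=None):
    """Softmax attention for one query; if top_k is set, attend only to the
    top_k keys with the highest dot-product score (a k-NN style lookup)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    idx = list(range(len(keys)))
    if top_k is not None:
        idx = sorted(idx, key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in idx)               # subtract max for stability
    weights = [math.exp(scores[i] - peak) for i in idx]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / norm
            for d in range(dim)]

# Toy data (made up): two keys dominate the attention mass.
query = [4.0, 0.0]
keys = [[2.0, 0.0], [1.9, 0.0], [-1.0, 0.0], [-1.5, 0.3], [0.0, 1.0], [-2.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]

full = softmax_attention(query, keys, values)
approx = softmax_attention(query, keys, values, top_k=2)
print(full, approx)
```

When the softmax mass is concentrated on a few keys, as in this toy case, dropping the remaining keys changes the output only marginally; the approximation degrades when attention is spread more evenly.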

A recent and interesting variant of k-NN retrieval involves retrieving large groups of tokens, rather than individual ones. Models that rely on this approach include SLED [Ivgi et al., 2023] and the more recent InfLLM [Xiao et al., 2024a], which achieves state-of-the-art performance on long-context benchmarks. InfLLM segments the entire context length into fixed-size memory units and employs k-NN lookup using the tokens with the highest accumulated scores per unit. This can be seen as a form of hierarchical attention, as illustrated in Fig. 1. While group-based retrieval represents a promising direction, our approach significantly advances this concept by dynamically determining token groupings in a manner akin to human memory formation, addressing a fundamental limitation of InfLLM’s fixed-size segmentation and enabling more adaptive and context-sensitive processing of extended information.

Figure 1: Group-based $k$-NN retrieval can be seen as a form of hierarchical episodic attention. Initially, $k = 4$ groups of tokens are selected (left) and then used for softmax attention (right), as if all other similarity scores were forced to be zero (non-shaded areas of the left curve). This framework can support multiple levels of episodic attention.

2.2 Neural models of Episodic Memory and Event Cognition

The concept of episodic memory, central to our approach, has been extensively studied in both theoretical neuroscience and machine learning. Neural models of episodic memory capture human behaviour and neuroimaging data, providing insights into how the brain processes and stores experiences and suggesting links between memory, efficient representations and navigation of physical and conceptual spaces [Gershman et al., 2012, Benna and Fusi, 2021]. In machine learning, episodic memory-inspired approaches have yielded significant improvements across various domains. For instance, episodic control has enhanced reinforcement learning agents’ performance and learning speed [Blundell et al., 2016, Pritzel et al., 2017, Coda-Forno et al., 2024]. In addition, models of memory construction and consolidation have been successful in alleviating catastrophic forgetting in neural networks [Kirkpatrick et al., 2017, Lopez-Paz and Ranzato, 2017, Chaudhry et al., 2019, Buzzega et al., 2020, Prabhu et al., 2020], including LLMs [Das et al., 2024], and appear to explain key features of human memory, such as imagination and future thinking [Spens and Burgess, 2024].

These models have revealed key aspects of episodic memory, particularly in describing how experiences are segmented into events, and when new memories are encoded and retrieved [Lu et al., 2022]. Surprise plays a critical role in this process, triggering event boundaries and memory formation [Fountas et al., 2022, Kumar et al., 2023]. This event-based structure is deeply intertwined with our perception of time [Roseboom et al., 2019, Sherman et al., 2022], highlighting the interdependence of memory and temporal cognition. This insight has helped generative models for video [Zakharov et al., 2022a, b] and reinforcement learning [Zakharov et al., 2021] to capture temporal dynamics more accurately. In terms of memory retrieval, studies in human free recall have shown a distinctive increased likelihood of retrieving items encoded close together in time (temporal contiguity) and in succession (temporal asymmetry) (see Fig. 2A). Recently, it was shown that attention heads in Transformer-based LLMs that are associated with in-context learning already exhibit the same dynamic retrieval behaviour [Ji-An et al., 2024] (Fig. 2B), including both contiguity and asymmetry effects. Therefore, Transformers have the inherent ability to act as episodic memory retrieval models, if provided with the right information within their context window. Our work leverages these concepts of surprise-based event segmentation and LLMs’ inherent temporal contiguity and asymmetry effects to enable a new generation of Infinite Context-Length LLMs, capable of processing and understanding information over vastly extended timescales.

Figure 2: (A) Example of the temporal contiguity and asymmetry effect in human free recall. Data averaged over several large free recall studies (adapted from Howard and Kahana, 2002). (B) The attention scores of a GPT2 head averaged over all tokens tested. Figure adapted from Ji-An et al. [2024]. (C) Schematic illustrating our proposed process for memory formation and retrieval: ① Input sequence with surprise-based segmentation (purple arrows indicate high surprise). ② Formation of episodic memories: input is segmented into events and stored, with initial tokens and local context preserved. Note that the boundary refinement process is not shown here for clarity. ③ Memory retrieval via k-NN search, selecting contiguous events from episodic memory. ④ Final context window structure, comprising initial tokens, contiguity buffer (populated by neighboring events), similarity buffer (from k-NN retrieval), and local context.

3 EM-LLM: LLM with Episodic Memory

3.1 Architecture

EM-LLM is designed to be applied directly to pre-trained LLMs, enabling them to handle context lengths significantly larger than their original training length. Our architecture divides the context into three distinct groups: initial tokens, evicted tokens, and local context. This structure, while incorporating insights from recent work on token block retrieval [Xiao et al., 2024a], introduces novel elements inspired by human episodic memory.

The local context represents the most recent tokens, maximising information about the current task, and fits within the typical context window of the underlying LLM. This group utilises full softmax attention and plays a role similar to the focus of attention in cognitive models of working memory, holding the most immediately relevant information for the current task [Cowan, 2001]. The evicted tokens typically comprise the majority of past tokens in a long-context scenario, extending far beyond the LLM’s original training length. These tokens are managed by our proposed memory model functioning similarly to short-term episodic memory in the brain. Finally, following previous work, we also maintain a group of 128 initial tokens in the LLM context. These act as attention sinks and help recover the performance of window attention, as first observed by Xiao et al. [2024b], Han et al. [2024b] and later adopted by Xiao et al. [2024a]. For retrieved tokens, which are therefore discontinuous and outside the local context, we assign a fixed position embedding as in Raffel et al. [2020], Xiao et al. [2024a]. This architecture enables EM-LLM to effectively process and utilise information from positions outside its pre-trained local context window, while maintaining the underlying LLM’s performance characteristics.
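The three-way division of the context described above can be sketched as a simple index computation. This is a minimal illustration under stated assumptions: the function name `partition_context` is hypothetical, the value 128 for the initial tokens comes from the text, and the local window size of 4096 is only a placeholder rather than a setting reported here.

```python
def partition_context(num_tokens, n_init=128, n_local=4096):
    """Split token positions 0..num_tokens-1 into the three groups:
    initial tokens, evicted tokens and local context.
    n_local=4096 is a placeholder value, not a reported setting."""
    init_end = min(n_init, num_tokens)
    # The local context covers the most recent tokens and never overlaps
    # with the initial tokens.
    local_start = max(init_end, num_tokens - n_local)
    initial = range(0, init_end)
    evicted = range(init_end, local_start)     # managed by the episodic memory
    local = range(local_start, num_tokens)     # full softmax attention
    return initial, evicted, local

initial, evicted, local = partition_context(10_000)
print(len(initial), len(evicted), len(local))
```

For sequences shorter than the local window, the evicted group is simply empty, so the sketch reduces to ordinary attention over the whole input.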

3.2 Memory formation via Surprise

In the context of LLMs, we define episodic memory as the organised, event-based collection of past key-value pairs, analogous to the latent representations of personal experiences in human memory. Just as unexpected or novel information plays a crucial role in human memory formation, we posit that analogous indicators of novelty in LLMs can serve as an effective proxy for identifying significant “events” within the model’s experience. In Bayesian terms, surprise is quantified by the negative log-likelihood of observing the current, ground-truth token given the previous tokens in an auto-regressive model, with high values indicating the unpredictability or novelty of each new token within the context according to the model, i.e., it is “surprised” by the next token. Following work on cognitive modelling [Roseboom et al., 2019, Fountas et al., 2022], we employ a thresholding mechanism to perform an initial identification of event boundaries (used for the first time in LLMs). Formally, a token $x_t$ is considered a potential boundary if its surprise value exceeds a threshold $T$:

$$-\log P(x_t \mid x_1, \ldots, x_{t-1}; \theta) > T \quad \text{with} \quad T = \mu_{t-\tau:t} + \gamma \sigma_{t-\tau:t} \tag{1}$$

where $\mu_{t-\tau:t}$ and $\sigma_{t-\tau:t}^2$ are the mean and variance of surprise for a window offset $\tau$, and $\gamma$ is a scaling factor. The choice of threshold $T$ is critical in balancing the granularity of segmentation with the model’s sensitivity to contextual shifts. If $T$ is too high, we will identify very few event boundaries, especially if the local context contains few surprising tokens. Conversely, a low $T$ results in frequent boundary identification. Using a moving window ensures that $T$ adapts to contextual shifts, minimizing the need for manual tuning while maintaining control over threshold sensitivity via $\gamma$. We also explored a fixed threshold approach ($T = T_{\text{fixed}}$), though our primary focus remained on the dynamic threshold due to its adaptability to varying contexts. This initial segmentation results in a set of potential event boundaries $\mathcal{B} = \{b_1, b_2, \ldots, b_k\}$, where each $b_i$ represents the index of a token exceeding the surprise threshold. These boundaries serve as the starting point for our subsequent refinement process, which aims to optimise the intra-event coherence and inter-event distinctiveness of the resulting memory segments.
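The thresholding rule of Eq. 1 can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name `surprise_boundaries`, the toy probabilities, and the choice to compute the window statistics over the `window` tokens preceding the current one are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev

def surprise_boundaries(token_probs, window=4, gamma=1.0):
    """Return the indices t whose surprise -log P(x_t | x_<t) exceeds
    T = mu + gamma * sigma, with mu and sigma taken over a moving window.
    Assumption: the window holds the `window` surprise values before t."""
    surprise = [-math.log(p) for p in token_probs]
    boundaries = []
    for t in range(window, len(surprise)):
        recent = surprise[t - window:t]
        threshold = mean(recent) + gamma * pstdev(recent)
        if surprise[t] > threshold:
            boundaries.append(t)
    return boundaries

# Made-up probabilities that a model assigned to each observed token.
probs = [0.9, 0.8, 0.9, 0.85, 0.05, 0.9, 0.8, 0.9, 0.85, 0.02]
print(surprise_boundaries(probs))  # -> [4, 9]
```

Because the threshold follows the recent surprise statistics, the unlikely tokens at positions 4 and 9 are flagged, while the tokens that directly follow a surprising one are not, since the window mean is temporarily inflated.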

3.3 Boundary refinement

While surprise-based segmentation provides an effective initial estimate of event boundaries, we make the key observation that the utility of elements within an event during memory recall depends on their likelihood of being utilised by the current query. Therefore, we theorise that memory recall will be most efficient with high intra-event similarity between keys while maintaining low inter-event similarity. For instance, see the similarity of groups in Fig. 1. To further ensure this, we introduce a boundary refinement step which looks to optimise this objective. Such an objective is typically optimised in the context of graph-clustering, hence we will express this refinement process in a graph-theoretic manner. To achieve this, we treat the similarity matrix between all keys of an attention head $h$ within the local context window for tokens $x_1, x_2, \ldots, x_n$ as an adjacency matrix. We define the adjacency matrix $A^h$ as

$$A^h_{ij} = \text{sim}(K^h_i, K^h_j), \tag{2}$$

where $K^h_i$ and $K^h_j$ are the key vectors corresponding to tokens $x_i$ and $x_j$, respectively. The similarity function measures the closeness of two key vectors; in our implementation, we use dot product similarity ${K^h_i}^{\mathsf{T}} \cdot K^h_j$ due to its effectiveness in capturing semantic relationships in high-dimensional spaces [Vaswani et al., 2017] and to align with the mechanism of self-attention in Transformers.
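As a small illustration of Eq. 2 with dot product similarity, the adjacency matrix of one attention head can be built from its key vectors as follows. The function name `key_similarity_matrix` and the two-dimensional toy keys are hypothetical; real key vectors have the head dimension of the underlying model.

```python
def key_similarity_matrix(keys):
    """Build A with A[i][j] = dot product of key vectors i and j (Eq. 2)."""
    return [[sum(a * b for a, b in zip(k_i, k_j)) for k_j in keys]
            for k_i in keys]

# Toy keys (made up): the first two tokens point in a similar direction,
# the third is orthogonal to the first.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
A = key_similarity_matrix(keys)
for row in A:
    print(row)
```

The resulting matrix is symmetric, and tokens with similar keys receive large edge weights, which is the structure the graph-clustering metrics operate on.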

To evaluate the quality of potential boundaries, we define a metric function $f(A, \mathcal{B}): \mathbb{R}^{n \times n} \times \{1, \ldots, n\}^k \rightarrow \mathbb{R}$. This function quantifies the cohesion within events and separation between events based on the graph structure represented by the similarity matrix $A$ and event boundaries $\mathcal{B}$. We experimented with two widely-accepted graph-clustering metrics: modularity and conductance [Miasnikof et al., 2018]. Modularity [Newman and Girvan, 2004] provides a measure of the quality of a particular division of a network into communities, with higher values indicating higher edge density in the identified cluster when compared to the density of edges expected in a random cluster. As our edge weights represent the similarity between two tokens, we seek to maximise this metric. Modularity is defined as:

$$f_M(A^h, \mathcal{B}) = \frac{1}{4m} \sum_{i,j} \left[ A^h_{ij} - \frac{\sum_i A^h_{ij} \cdot \sum_j A^h_{ij}}{2m} \right] \delta(c_i, c_j) \tag{3}$$

where $m$ is the total edge weight in the graph, $c_i$ is the community (episodic event) to which node $i$ is assigned, and $\delta$ is the Kronecker delta function. Conductance, on the other hand, measures the fraction of total weighted edges cut by a given community boundary, and is defined as:

$$f_C(A^h, \mathcal{B}) = \min_{S \in V} \frac{\sum_{i \in S, j \notin S} A^h_{ij}}{\min(\text{vol}(S), \text{vol}(V \setminus S))}, \qquad \text{with} \; \text{vol}(S) = \sum_{i \in S, j \in S} A_{ij}, \; \text{vol}(V \setminus S) = \sum_{i \notin S, j \notin S} A_{ij} \tag{4}$$

where $S = \{b_i, b_i + 1, \ldots, b_{i+1}\}$ is a subset of all nodes $V = \{b_1, b_1 + 1, \ldots, b_k\}$ in the induced graph, with $b_i \in \mathcal{B}$. Lower conductance values indicate better community structure. Our boundary refinement algorithm iteratively adjusts the initial surprise-based boundaries to optimise these metric functions. While our best results were achieved using modularity, we also include comparisons with conductance-based boundary refinement to provide a comprehensive analysis. The overall process can be summarized in Algorithm 1.
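To make the two metrics concrete, the sketch below evaluates Eq. 3 and Eq. 4 on a toy graph. Several choices are assumptions made only for illustration: the helper names are hypothetical, $m$ is taken as half the sum of all entries of the adjacency matrix, self-similarities are set to zero, and candidate sets whose volume is zero are skipped when computing conductance to avoid division by zero.

```python
def event_labels(n, boundaries):
    """Assign each of n tokens an event id; a new event starts at each boundary."""
    starts, labels, current = set(boundaries), [], 0
    for i in range(n):
        if i in starts and i > 0:
            current += 1
        labels.append(current)
    return labels

def modularity(A, labels):
    """Eq. 3, assuming m = half the total weight of the symmetric matrix A."""
    n = len(A)
    m = sum(map(sum, A)) / 2.0
    row = [sum(A[i]) for i in range(n)]
    col = [sum(A[i][j] for i in range(n)) for j in range(n)]
    total = sum(A[i][j] - row[i] * col[j] / (2.0 * m)
                for i in range(n) for j in range(n) if labels[i] == labels[j])
    return total / (4.0 * m)

def conductance(A, labels):
    """Eq. 4, minimised over the events; zero-volume sets are skipped (assumption)."""
    n, best = len(A), float("inf")
    for event in set(labels):
        inside = [i for i in range(n) if labels[i] == event]
        outside = [i for i in range(n) if labels[i] != event]
        cut = sum(A[i][j] for i in inside for j in outside)
        vol_in = sum(A[i][j] for i in inside for j in inside)
        vol_out = sum(A[i][j] for i in outside for j in outside)
        if min(vol_in, vol_out) > 0:
            best = min(best, cut / min(vol_in, vol_out))
    return best

# Toy similarity graph (made up): tokens {0, 1} and {2, 3} form two tight groups.
A = [[0.0, 1.0, 0.1, 0.0],
     [1.0, 0.0, 0.0, 0.1],
     [0.1, 0.0, 0.0, 1.0],
     [0.0, 0.1, 1.0, 0.0]]
good, bad = event_labels(4, [2]), event_labels(4, [1])
print(modularity(A, good), modularity(A, bad))
print(conductance(A, good), conductance(A, bad))
```

On this toy graph the boundary at position 2 yields the higher modularity and the lower conductance, matching the intuition that both metrics reward cohesive, well-separated events.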

Algorithm 1 Event segmentation in KV cache
Input: tokens: List of tokens in the sequence
Input: $T$: Threshold for surprisal to identify initial boundaries
Input: $f$: Metric function to evaluate potential boundaries
Output: $\mathcal{B}$: List of final boundary positions
$\mathcal{B} \leftarrow$ [i for i in range(length(tokens)) if $-\log(P(\texttt{tokens}[i])) > T$]    $\triangleright$ Boundary identification
for i in range(length($\mathcal{B}$) - 1) do
    $\alpha, \beta = \mathcal{B}[i], \mathcal{B}[i+1]$
    $\mathcal{B}[i+1] \leftarrow \arg\max_{\hat{\beta} \in (\alpha, \beta]} f(A, \{\alpha, \hat{\beta}\})$    $\triangleright$ Boundary refinement
end for
return $\mathcal{B}$

This algorithm first identifies initial boundaries based on the surprise threshold $T$, then refines these boundaries by finding the optimal position $\hat{\beta}$ between each pair of consecutive initial boundaries $(\alpha, \beta)$ that optimises the chosen metric function $f$ (either maximising modularity or minimising conductance). This process ensures that the final segmentation (1) captures points of high surprise and (2) optimises for coherent information grouping. The boundary identification step incurs negligible computational cost, as it only evaluates existing LLM outputs. The time complexity of Algorithm 1 is dominated by the boundary refinement step, which has an overall complexity of $\mathcal{O}(kn)$, where $k$ is the number of initial boundaries and $n$ is the sequence length. A detailed analysis of the algorithm’s complexity, including the computation of modularity and conductance metrics, is provided in Appendix A.2. Despite this modest computational overhead, the resulting improvement in segment quality leads to significant performance gains in downstream tasks, particularly those requiring complex temporal reasoning.
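A runnable sketch of the two steps of Algorithm 1 is given below, using modularity as the metric. It is an illustration under assumptions, not the authors' implementation: all function names are hypothetical, a fixed threshold is used for brevity, $m$ is taken as half the total weight of the similarity matrix, and each candidate boundary is scored by evaluating the metric on the whole matrix, which is simpler but more expensive than the $\mathcal{O}(kn)$ procedure referred to in the text. The loop stops one short of the last boundary so that a next boundary always exists.

```python
def event_labels(n, boundaries):
    """Assign each of n tokens an event id; a new event starts at each boundary."""
    starts, labels, current = set(boundaries), [], 0
    for i in range(n):
        if i in starts and i > 0:
            current += 1
        labels.append(current)
    return labels

def modularity(A, labels):
    """Modularity as in Eq. 3, assuming m = half the total weight of A."""
    n = len(A)
    m = sum(map(sum, A)) / 2.0
    row = [sum(A[i]) for i in range(n)]
    col = [sum(A[i][j] for i in range(n)) for j in range(n)]
    total = sum(A[i][j] - row[i] * col[j] / (2.0 * m)
                for i in range(n) for j in range(n) if labels[i] == labels[j])
    return total / (4.0 * m)

def segment_events(surprise, A, T, metric=modularity):
    n = len(surprise)
    # Boundary identification: tokens whose surprise exceeds the threshold T.
    B = [i for i in range(n) if surprise[i] > T]
    # Boundary refinement: move each boundary within (alpha, beta] to the
    # position that maximises the metric, keeping the other boundaries fixed.
    for i in range(len(B) - 1):
        alpha, beta = B[i], B[i + 1]

        def score(candidate):
            trial = B[:i + 1] + [candidate] + B[i + 2:]
            return metric(A, event_labels(n, trial))

        B[i + 1] = max(range(alpha + 1, beta + 1), key=score)
    return B

# Toy example (made up): tokens 0-2 and 3-5 form two similarity blocks, but
# the surprise signal places the second boundary one token late (at index 4).
block = [0, 0, 0, 1, 1, 1]
A = [[0.0 if i == j else (1.0 if block[i] == block[j] else 0.1)
      for j in range(6)] for i in range(6)]
surprise = [5.0, 0.2, 0.3, 0.4, 4.0, 0.3]
print(segment_events(surprise, A, T=2.0))  # -> [0, 3]
```

In this toy case the refinement moves the second boundary from index 4 back to index 3, where the key-similarity structure actually changes.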

3.4 Memory Retrieval

When inferring a new token, a number of episodic events are selected and become a part of the (extended) context window of the underlying LLM. Our memory retrieval process employs a two-stage mechanism to select relevant episodic events for the LLM’s context window (Fig. 2C). First, we retrieve $k_s$ events using $k$-NN search based on dot product similarity between the current query and representative tokens of each event. These representatives, selected as per Xiao et al. [2024a], are the most influential tokens within each event. For large memory stores, we utilise approximate $k$-NN [Douze et al., 2024] to maintain efficiency. These $k_s$ events, retrieved based on their similarity to the current query, form a part of the LLM’s context window that we refer to as the similarity buffer.

The second stage of our retrieval process introduces another buffer, which we refer to as the contiguity buffer, designed to maintain temporal context. Implemented as a queue of size $k_c$, this buffer promotes temporal relationships in retrieval. When an event is retrieved, we also enqueue its neighbouring events (within $\pm n$ positions in the original sequence) into this buffer. This mechanism enables the LLM’s “induction” attention heads to exhibit the contiguity and asymmetry effects discussed in Section 2.2. The queue structure allows for a natural decay of temporal context as new events are processed, with older or repeated events being dequeued as new ones are added. In total, $k = k_s + k_c + 2$ events are added to the context window, striking a balance between relevance and temporal relationships in a manner analogous to human episodic memory retrieval.
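The two-stage retrieval can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual implementation: each event is summarised by a single representative vector (the method uses several representative tokens per event), exact rather than approximate search is used, and a repeated neighbour is handled by moving it to the tail of the queue. The names `retrieve_similar` and `update_contiguity` are hypothetical.

```python
from collections import deque


def retrieve_similar(query, event_reps, k_s):
    """Indices of the k_s events with the largest dot product to the query."""
    scores = [sum(q * r for q, r in zip(query, rep)) for rep in event_reps]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k_s]


def update_contiguity(buffer, retrieved, n_events, n=1):
    """Enqueue the neighbours (within +/- n positions) of each retrieved event.

    `buffer` is a deque created with maxlen=k_c, so the oldest entries are
    dropped automatically once the queue is full.
    """
    for event in retrieved:
        for neighbour in range(event - n, event + n + 1):
            if neighbour == event or not 0 <= neighbour < n_events:
                continue
            if neighbour in buffer:
                # Assumption: a repeated event is refreshed, not duplicated.
                buffer.remove(neighbour)
            buffer.append(neighbour)
    return buffer
```

A typical call would create `buffer = deque(maxlen=k_c)` once and call `update_contiguity` after every similarity lookup, so that the temporal context decays as new events displace old ones.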

4 Experiments

4.1 Performance of EM-LLM on long-context tasks

As previously mentioned, InfLLM is, at the time of writing, considered to achieve state-of-the-art performance on long-context benchmarks ($\infty$-Bench, LongBench), as well as being the only method which uses group-based $k$-NN retrieval in LLMs on such benchmarks. We, therefore, employ this model as our baseline for comparison with our own methods on short context windows (4K+2K, as in Xiao et al., 2024a).

Task               InfLLM   Max Imp.   EM-LLM S   EM-LLM SM   EM-LLM S+C   EM-LLM SM+C
NarrativeQA        22.12    +1.49%     21.32      21.13       21.80        22.45
MultiNews          26.70    -0.30%     26.52      26.54       26.69        26.62
Qasper             29.33    +0.17%     28.99      29.38       29.11        28.68
TREC               69.00    +2.17%     70.00      70.00       70.50        70.50
MultiFieldQA       47.42    +0.42%     47.49      47.39       47.46        47.62
TriviaQA           86.67    +1.10%     86.93      87.62       87.35        87.47
HotpotQA           36.56    +9.38%     39.99      39.01       39.05        38.90
SAMSum             42.52    +0.87%     42.34      42.13       42.89        42.48
2WikiMQA           22.31    +6.41%     23.74      22.75       22.65        23.46
PassageRetrieval   64.00    +33.47%    85.42      78.92       84.67        84.08
Musique            17.68    +6.17%     17.58      17.82       17.93        18.77
LCC                56.67    +0.63%     54.90      57.03       54.79        56.79
GovReport          31.03    +1.90%     31.24      31.62       31.34        31.43
RepoBench-P        52.97    +1.34%     50.76      53.68       51.34        52.86
QMSum              23.49    +2.13%     23.82      23.20       23.99        23.47
Avg. score         41.90    +4.30%     43.40      43.22       43.44        43.70
Table 1: EM-LLM performance on LongBench compared to our baseline InfLLM. S: surprise threshold, SM: surprise threshold + refinement with modularity, S+C: surprise threshold + contiguity buffer, SM+C: surprise threshold + refinement with modularity + contiguity buffer. Max Imp. shows the maximum relative improvement over InfLLM across all EM-LLM variants.

Results on the LongBench dataset (Table 1) show that our method is able to improve on InfLLM in all but one task, with the best method achieving an overall increase in performance of 1.8 percentage points (a relative improvement of 4.3%). Note that the table shows the best single method in terms of overall performance for each ablation. Looking at individual task performance across all experiments, we are able to beat InfLLM in all tasks (see Appendix A.1). Interestingly, we see an especially large jump in performance on the PassageRetrieval task across all ablations, with up to a 33% improvement on InfLLM. This task requires the model to identify the original paragraph from a summary, a challenging task that tests the model’s ability to accurately recall a wide range of detailed information from a large context concurrently. The substantial improvement on this task highlights the effectiveness of our event segmentation method in enhancing long-term memory recall and retrieval accuracy in LLMs. Additionally, our method achieves a notable 9.38% improvement on the HotpotQA task, which involves complex reasoning over multiple supporting documents, further emphasising the model’s ability to provide explanations for answers.
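For readers who wish to check the distinction between percentage points and relative improvement, the short Python sketch below recomputes the headline figures from the values in Table 1. The helper name `relative_improvement` is illustrative only.

```python
def relative_improvement(baseline, score):
    """Relative improvement of `score` over `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline


# Values copied from Table 1 (InfLLM baseline vs. best EM-LLM variant).
avg_points = 43.70 - 41.90                              # percentage points
avg_gain = relative_improvement(41.90, 43.70)           # average score
passage_gain = relative_improvement(64.00, 85.42)       # PassageRetrieval
hotpot_gain = relative_improvement(36.56, 39.99)        # HotpotQA

print(f"{avg_points:.1f} points, {avg_gain:.2f}%, "
      f"{passage_gain:.2f}%, {hotpot_gain:.2f}%")
```

The absolute gain of 1.8 points on the average score corresponds to the 4.3% relative figure because it is divided by the InfLLM average of 41.90.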

4.2 Humans and LLM surprise cluster similar tokens together

As mentioned in Section 3.2, we employ modularity and conductance as two refinement objectives in our boundary refinement algorithm, due to their qualities in assessing the intra- and inter-event similarities between individual tokens. We will now use such metrics to compare various event segmentation methods, including human event segmentation data. Additionally, we introduce one further, simple metric for this experiment: the ratio between intra- and inter-community similarity (I/IS), calculated for each head and community $S$ as follows:

$$\text{intra} = \sum_{i \in S,\, j \in S} A_{ij}, \qquad \text{inter} = \sum_{i \in S,\, j \notin S} A_{ij}, \qquad \text{I/IS} \equiv \frac{\text{intra}}{\text{inter}} \tag{5}$$
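As a concrete illustration of Eq. (5), the following Python sketch computes the three quantities for one community on a small similarity matrix. It assumes `A` is given as nested lists and `community` as a collection of token indices; the function name and the handling of a zero denominator are illustrative choices.

```python
def intra_inter_similarity(A, community):
    """Return (intra, inter, I/IS) for one community S, following Eq. (5)."""
    S = set(community)
    n = len(A)
    intra = sum(A[i][j] for i in S for j in S)
    inter = sum(A[i][j] for i in S for j in range(n) if j not in S)
    # Assumption: report infinity when the community has no outside similarity.
    ratio = intra / inter if inter else float("inf")
    return intra, inter, ratio


# Tokens 0 and 1 are similar to each other and weakly similar to token 2.
A = [
    [1.0, 1.0, 0.5],
    [1.0, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
intra, inter, ratio = intra_inter_similarity(A, [0, 1])
```

Here the community {0, 1} has an intra-similarity of 4.0 and an inter-similarity of 1.0, giving I/IS = 4.0; higher values indicate a more cohesive, better separated event.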

Kumar et al. [2023] found strong correlations between human-perceived events and prediction errors across 3 short podcasts (7–30 minutes on average), when processing the corresponding transcript with an LLM. Taking advantage of such human data and results from previous works on this dataset [Michelmann et al., 2021, Lositsky et al., 2016], we compare the segmentation quality and correlation with human data for each of our methods (Fig. 3) using our similarity metrics.

Figure 3: Comparison of human event segmentation with different computational segmentation methods in two human-annotated audio datasets (see Appendix A.3). (A) Difference in metrics for the cohesion and separation of KV cache of LLaMA2 attention heads. The graphs report the difference of each method with the corresponding random segmentation. (B) Distance between human reports and different methods. In both sets of results, fixed methods (F, FM, FC) perform worse than their surprise-based counterparts (S, SM, SC) with InfLLM’s method (F) performing worse than random.

As shown in Fig. 3A, human-perceived events achieve significantly higher scores in similarity metrics compared to fixed or random events, suggesting that surprise is indeed an important factor for humans in their own perception of events. Furthermore, surprise-only segmentation (S) achieves very similar results to humans, while the addition of our refinement algorithm (SM, SC, FM, FC) significantly improves performance. Fig. 3B further shows that surprise-based methods (S, SM, SC) consistently identify event boundaries that are closest to those perceived by humans.

4.3 Comparing segmentation methods

LLM          Metric   F             FM              FC              S               SM             SC
Mistral-7B   Mod ↑    -2.3 ± 4.1    29.2 ± 44.0     6.7 ± 25.9      18.6 ± 29.6     39.9 ± 55.5    29.5 ± 42.7
Mistral-7B   Con ↓    9.1 ± 8.7     -16.9 ± 6.7     -12.5 ± 9.6     -23.6 ± 9.4     -24.6 ± 9.3    -27.6 ± 9.8
Mistral-7B   I/IS ↑   -4.3 ± 4.0    31.2 ± 21.4     3.7 ± 14.9      17.9 ± 17.0     35.3 ± 27.7    21.6 ± 22.4
LLaMA2-7B    Mod ↑    -1.1 ± 4.3    13.4 ± 19.5     0.6 ± 7.3       8.7 ± 16.0      18.7 ± 26.4    11.5 ± 19.4
LLaMA2-7B    Con ↓    11.9 ± 9.8    -18.8 ± 7.4     -13.7 ± 10.9    -29.5 ± 10.2    -29.7 ± 10.1   -33.3 ± 10.3
LLaMA2-7B    I/IS ↑   -3.8 ± 3.7    20.7 ± 184.7    -1.1 ± 6.8      15.0 ± 880.0    25.0 ± 19.9    16.5 ± 15.4
LLaMA3-8B    Mod ↑    -1.6 ± 3.6    18.9 ± 25.6     0.9 ± 11.8      13.1 ± 21.5     27.0 ± 35.6    18.3 ± 28.5
LLaMA3-8B    Con ↓    11.3 ± 9.5    -20.3 ± 6.9     -14.6 ± 11.4    -29.7 ± 9.2     -30.6 ± 9.2    -33.9 ± 9.6
LLaMA3-8B    I/IS ↑   -3.8 ± 3.1    24.5 ± 13.9     -1.1 ± 5.8      15.7 ± 11.0     28.1 ± 16.1    16.4 ± 12.2
Table 2: Comparison with graph-theoretic metrics in the KV cache of different LLMs and segmentation methods using the PG-19 dataset. Reported values are the difference with the corresponding random segmentation. Mod: modularity $\times 10^{5}$, Con: conductance, I/IS: intra/inter-similarity $\times 10^{3}$. Hyper-parameters: surprise threshold = 0.001.

Looking at Table 2, it is clear that surprise-based segmentation with refinement (SM, SC) provides the best results in terms of event similarity metrics, regardless of the base LLM used. While the surprise-only method (S) achieves some good results, we observe that refinement is especially adept at improving this performance with regard to our metrics, as it is directly optimising for such an objective. Interestingly, however, the fixed-based refinement methods (FM, FC) do not reach the same performance as their surprise-based counterparts, further showing that the initial segmentation with a surprise threshold is crucial to achieving the best possible balance in intra-/inter-similarity with our methods.
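The values in Table 2 are reported as differences from a random segmentation. The Python sketch below illustrates one way such a difference could be computed for conductance. It is a hedged illustration only: it assumes the textbook definition of conductance (cut weight over the smaller of the two volumes), averages over segments, and draws random segmentations with the same number of boundaries; the exact definitions used for the table are those of Section 3.2 and Appendix A.2, and all function names here are hypothetical.

```python
import random


def conductance(A, start, end):
    """Cut weight leaving tokens [start, end) over the smaller side's volume."""
    n = len(A)
    inside = range(start, end)
    cut = sum(A[i][j] for i in inside for j in range(n) if not start <= j < end)
    vol_in = sum(A[i][j] for i in inside for j in range(n))
    vol_out = sum(map(sum, A)) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else 0.0


def mean_conductance(A, boundaries):
    """Average conductance over the contiguous segments defined by boundaries."""
    edges = sorted(set(boundaries) | {0}) + [len(A)]
    segments = list(zip(edges[:-1], edges[1:]))
    return sum(conductance(A, s, e) for s, e in segments) / len(segments)


def diff_from_random(A, boundaries, trials=100, seed=0):
    """Metric of a segmentation minus its mean over random segmentations."""
    rng = random.Random(seed)
    k = len(set(boundaries) - {0})  # number of non-trivial boundaries
    baseline = 0.0
    for _ in range(trials):
        random_boundaries = rng.sample(range(1, len(A)), k)
        baseline += mean_conductance(A, random_boundaries)
    return mean_conductance(A, boundaries) - baseline / trials
```

Since lower conductance is better, a negative difference indicates a segmentation that separates events more cleanly than chance, which matches the sign convention of the Con rows in Table 2.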

4.4 Similarity, Contiguity, Recency and Temporal Order

As demonstrated in Tables 1 and 2, along with Fig. 3, each of our ablations shows improvements over InfLLM (also see Appendix A.1). As mentioned in Section 4.3, refinement has a strong positive impact on our similarity metrics. This is seen to translate well to model performance in Table 1, achieving the best performance in a third of the tasks, as well as agreeing with human data (Fig. 3). The effects of contiguity are also clearly demonstrated in this table, with the addition of our contiguity buffer achieving the best performance on three tasks, and the second-best overall score. Furthermore, these methods are shown to be generally complementary, achieving the best overall performance when combined.

However, the fact that certain tasks still appear to benefit more from either surprise-only, refinement, or contiguity is an interesting result. This is likely due to the nature of the tasks and the varying importance of contiguity across these tasks. For instance, in Supplementary Fig. 5, the MultiNews task scores higher than our baseline only for a ratio of 70% contiguity to similarity buffers. Where contiguity is not crucial, adding such a buffer to our context window also reduces the size of the similarity buffer, and therefore provides potentially less directly relevant events. This is compatible with our own findings that a contiguity buffer that is as big as or smaller than the similarity buffer yields the best results, suggesting that the similarity buffer is still the most crucial part of our approach. This is especially the case when combined with refinement, which we expect is due to the improved similarity of refined events, hence further reducing the need for contiguous events.

5 Discussion

Human studies

The surprise-based segmentation and boundary refinement processes in EM-LLM mirror key aspects of human event perception and memory formation. Our approach aligns with theories proposing that humans segment continuous experiences into discrete events based on prediction errors or moments of surprise [Zacks et al., 2007, Fountas et al., 2022]. This segmentation process is crucial for organising and later retrieving episodic memories efficiently. Indeed, significant correlations have been found between human event segmentation and prediction errors in both LLMs [Kumar et al., 2023] and video models [Fountas et al., 2022, Mariola et al., 2022]. Our results add to this growing body of evidence, demonstrating that LLM-based surprise can serve as a proxy for human event segmentation at multiple levels of hierarchical abstraction, and that the resulting event structure in EM-LLM’s attention heads correlates strongly with human-perceived events. This finding creates a more direct, low-level connection between LLM mechanisms and human cognitive processes.

Furthermore, our model’s use of both similarity-based and temporally contiguous retrieval mechanisms parallels human memory retrieval patterns. The temporal contiguity effect, where items experienced close together in time are often recalled together, is a robust phenomenon in human memory research [Howard and Kahana, 2002]. Further experiments could deepen our understanding of the connections between EM-LLM and human episodic memory. First, following Michelmann et al. [2023b], we could test whether the timing of the event boundaries or the degree of modularity per level that our method produces is closer on average to the human consensus than those of individual human subjects. Second, we can explore the level at which different ratios of the contiguity buffer allow the human biases presented in Fig. 2A and the analysis in Ji-An et al. [2024] to be more easily reproduced. Finally, we could investigate how skewing event recall based on recency and originally-recorded surprise affects model performance and to what extent it produces biased behaviour found in studies of free recall.

In addition, the architecture of EM-LLM, with its differentiated context handling described in Section 3.1, invites further interesting comparisons to cognitive models of human memory beyond episodic. The tokens forming the local context, which hold the most recent and task-relevant information, share characteristics with the concept of working memory. For instance, Baddeley [2003]’s influential model of working memory, which posits a limited-capacity system for temporary information storage and manipulation, bears similarities to our local context functionality. Yet, the analogy is not perfect. Our broader context window, including both local context and retrieved memories, might be more accurately compared to Ericsson and Kintsch [1995]’s concept of long-term working memory, which proposes a mechanism for rapid access to relevant information in long-term memory, extending beyond the traditional capacity limits of working memory. Alternatively, our architecture aligns well with Cowan [2001]’s embedded-processes model, where our local context could be likened to the limited-capacity “focus of attention” within working memory, while the full context window parallels the activated portion of long-term memory. Future work could explore these analogies more deeply, providing a flexible test-bed for rapidly exploring hypotheses about human memory, and potentially informing debates about capacity limits in working memory. Additionally, inspired by the multi-component nature of Baddeley’s model, one might explore the integration of modality-specific buffers within EM-LLM to enhance its performance on multi-modal tasks.

Machine learning

In refining event boundaries, we utilised modularity and conductance as metrics for evaluating community structure in the similarity graph of attention keys. While effective in our experiments, we acknowledge that numerous other methods for graph clustering and sequence segmentation could potentially be applied [Fortunato, 2010, Yang et al., 2016]. Our choice was motivated by their established theoretical foundations and computational efficiency, though comparative studies suggest performance can vary based on network characteristics [Yang et al., 2016]. Interestingly, our surprise-based initial boundary detection shares similarities with Bayesian online change-point detection [Adams and MacKay, 2007], suggesting potential avenues for integrating time series analysis techniques into LLM context processing. Future work could explore whether more sophisticated segmentation or clustering algorithms could improve EM-LLM’s performance, particularly for extremely long contexts or streaming data scenarios. Such investigations could enhance our model and contribute to understanding how information is structured and processed in LLMs, bridging the gap between traditional sequence analysis and LLM context processing.

Looking ahead, several more avenues for future research emerge from this work. One promising direction is to extend our surprise-based segmentation and boundary refinement processes to operate at each layer of the Transformer independently. This could lead to more nuanced and hierarchical representations of episodic memories, following the underlying semantic structure of the input more closely. Additionally, exploring how EM-LLM could be utilised to enable imagination and future thinking has great potential for advancing model-based reinforcement learning and continual learning techniques in LLMs. By leveraging its event-based structure to simulate potential future scenarios or recall past experiences in novel contexts, EM-LLM could enhance an LLM’s ability to plan, adapt, and learn continuously from new information.

6 Conclusion

In this work, we introduced EM-LLM, a novel and flexible architecture that integrates key aspects of human episodic memory and event cognition into transformer-based language models. Our approach enables LLMs to effectively process and utilise information from vastly extended contexts, far beyond their original training lengths. By combining surprise-based event segmentation with graph-theoretic boundary refinement, and a two-stage memory retrieval process, EM-LLM demonstrates superior performance on long-context tasks compared to state-of-the-art models. Crucially, our method requires no pre-training and can be readily applied to existing LLMs, offering a promising path towards virtually infinite context windows. This capability has the potential to revolutionise how we interact with LLMs, enabling continuous, personalised interactions over extended periods. Furthermore, the flexibility of our framework suggests it could serve as a viable alternative to traditional retrieval-augmented generation (RAG) techniques, especially when combined with efficient compression methods to reduce the memory requirements for the model’s KV cache.

In conclusion, EM-LLM represents a significant step forward in the development of language models with extended context-processing capabilities. By bridging insights from cognitive science with machine learning, our approach not only enhances the performance of LLMs on long-context tasks but also provides a scalable computational framework for testing hypotheses about human memory. We hope this study will inspire the community to expand research on the intersection between LLMs and human memory mechanisms.

References

  • Liu et al. [2024a] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024a.
  • Kazemnejad et al. [2024] Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36, 2024.
  • Tworkowski et al. [2023] Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Miłoś. Focused transformer: Contrastive training for context scaling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=s1FjXzJ0jy.
  • Lewis et al. [2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Gao et al. [2024] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024.
  • Wu et al. [2022] Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TrjbxzRcnf-.
  • Bertsch et al. [2023] Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley. Unlimiformer: Long-range transformers with unlimited length input. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=lJWUJWLCJo.
  • Xiao et al. [2024a] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Song Han, and Maosong Sun. Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory, 2024a.
  • Clewett et al. [2019] David Clewett, Sarah DuBrow, and Lila Davachi. Transcending time in the brain: How event memories are constructed from experience. Hippocampus, 29(3):162–183, 2019.
  • Zacks [2020] Jeffrey M Zacks. Event perception and memory. Annual review of psychology, 71:165–191, 2020.
  • Baldassano et al. [2017] Christopher Baldassano, Janice Chen, Asieh Zadbood, Jonathan W Pillow, Uri Hasson, and Kenneth A Norman. Discovering event structure in continuous narrative perception and memory. Neuron, 95(3):709–721, 2017.
  • Michelmann et al. [2023a] Sebastian Michelmann, Uri Hasson, and Kenneth A. Norman. Evidence that event boundaries are access points for memory retrieval. Psychological Science, 34(3):326–344, 2023a. doi:10.1177/09567976221128206. URL https://doi.org/10.1177/09567976221128206. PMID: 36595492.
  • Zacks et al. [2007] Jeffrey M Zacks, Nicole K Speer, Khena M Swallow, Todd S Braver, and Jeremy R Reynolds. Event perception: a mind-brain perspective. Psychological bulletin, 133(2):273, 2007.
  • Zacks et al. [2011] Jeffrey M Zacks, Christopher A Kurby, Michelle L Eisenberg, and Nayiri Haroutunian. Prediction error associated with the perceptual segmentation of naturalistic events. Journal of cognitive neuroscience, 23(12):4057–4066, 2011.
  • Roseboom et al. [2019] Warrick Roseboom, Zafeirios Fountas, Kyriacos Nikiforou, David Bhowmik, Murray Shanahan, and Anil K Seth. Activity in perceptual classification networks as a basis for human subjective time perception. Nature communications, 10(1):267, 2019.
  • Sinclair et al. [2021] Alyssa H. Sinclair, Grace M. Manalili, Iva K. Brunec, R. Alison Adcock, and Morgan D. Barense. Prediction errors disrupt hippocampal representations and update episodic memories. Proceedings of the National Academy of Sciences, 118(51):e2117625118, 2021. doi:10.1073/pnas.2117625118. URL https://www.pnas.org/doi/abs/10.1073/pnas.2117625118.
  • Fountas et al. [2022] Zafeirios Fountas, Anastasia Sylaidi, Kyriacos Nikiforou, Anil K. Seth, Murray Shanahan, and Warrick Roseboom. A Predictive Processing Model of Episodic Memory and Time Perception. Neural Computation, 34(7):1501–1544, 06 2022. ISSN 0899-7667. doi:10.1162/neco_a_01514. URL https://doi.org/10.1162/neco_a_01514.
  • Howard and Kahana [2002] Marc W Howard and Michael J Kahana. A distributed representation of temporal context. Journal of mathematical psychology, 46(3):269–299, 2002.
  • Ji-An et al. [2024] Li Ji-An, Corey Y. Zhou, Marcus K. Benna, and Marcelo G. Mattar. Linking in-context learning in transformers to human episodic memory, 2024.
  • Kumar et al. [2023] Manoj Kumar, Ariel Goldstein, Sebastian Michelmann, Jeffrey M Zacks, Uri Hasson, and Kenneth A Norman. Bayesian surprise predicts human event segmentation in story listening. Cognitive science, 47(10):e13343, 2023.
  • Rae et al. [2020] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH.
  • Bai et al. [2023] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
  • Katharopoulos et al. [2020] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020.
  • Munkhdalai et al. [2024] Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention, 2024.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Su et al. [2024] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. ISSN 0925-2312. doi:10.1016/j.neucom.2023.127063. URL https://www.sciencedirect.com/science/article/pii/S0925231223011864.
  • Chen et al. [2024a] Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, and Lidong Bing. CLEX: Continuous length extrapolation for large language models. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=wXpSidPpc5.
  • Xiong et al. [2023] Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.
  • Liu et al. [2024b] Xiaoran Liu, Hang Yan, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws of roPE-based extrapolation. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=JO7k0SJ5V6.
  • Peng et al. [2024] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u.
  • Ding et al. [2024] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024.
  • Press et al. [2021] Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
  • Chen et al. [2023] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
  • Jin et al. [2024] Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. Llm maybe longlm: Self-extend llm context window without tuning, 2024.
  • Chi et al. [2022] Ta-Chung Chi, Ting-Han Fan, Peter Ramadge, and Alexander Rudnicky. KERPLE: Kernelized relative positional embedding for length extrapolation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=hXzOqPlXDwm.
  • Li et al. [2024] Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=rR03qFesqk.
  • Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  • Dao [2024] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec.
  • Han et al. [2024a] Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David Woodruff, and Amir Zandieh. Hyperattention: Long-context attention in near-linear time. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=Eh0Od2BJIM.
  • Aminabadi et al. [2022] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’22. IEEE Press, 2022. ISBN 9784665454445.
  • Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702297. doi:10.1145/3600006.3613165. URL https://doi.org/10.1145/3600006.3613165.
  • Liu et al. [2024c] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ringattention with blockwise transformers for near-infinite context. In The Twelfth International Conference on Learning Representations, 2024c. URL https://openreview.net/forum?id=WsRHpHH4s0.
  • Brandon et al. [2023] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023.
  • Nawrot et al. [2024] Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, and Edoardo M Ponti. Dynamic memory compression: Retrofitting llms for accelerated inference. arXiv preprint arXiv:2403.09636, 2024.
  • Zhang et al. [2023] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Re, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2o: Heavy-hitter oracle for efficient generative inference of large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=RkRrPp7GKO.
  • Zhu et al. [2024] Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. PoSE: Efficient context window extension of LLMs via positional skip-wise training. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3Z1gxuAQrA.
  • Chen et al. [2024b] Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. LongLoRA: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=6PmJoRfdaK.
  • Wang et al. [2023] Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=BryMFPQ4L6.
  • Ivgi et al. [2023] Maor Ivgi, Uri Shaham, and Jonathan Berant. Efficient long-text understanding with short-text models. Transactions of the Association for Computational Linguistics, 11:284--299, 2023. doi:10.1162/tacl_a_00547. URL https://aclanthology.org/2023.tacl-1.17.
  • Gershman et al. [2012] Samuel J Gershman, Christopher D Moore, Michael T Todd, Kenneth A Norman, and Per B Sederberg. The successor representation and temporal context. Neural Computation, 24(6):1553--1568, 2012.
  • Benna and Fusi [2021] Marcus K Benna and Stefano Fusi. Place cells may simply be memory cells: Memory compression leads to spatial tuning and history dependence. Proceedings of the National Academy of Sciences, 118(51):e2018422118, 2021.
  • Blundell et al. [2016] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.
  • Pritzel et al. [2017] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2827--2836. PMLR, 06--11 Aug 2017.
  • Coda-Forno et al. [2024] Julian Coda-Forno, Changmin Yu, Qinghai Guo, Zafeirios Fountas, and Neil Burgess. Leveraging episodic memory to improve world models for reinforcement learning. In Memory in Artificial and Real Intelligence (MemARI) Workshop at 36th Conference on Neural Information Processing Systems (NeurIPS 2022), 2024.
  • Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521--3526, 2017.
  • Lopez-Paz and Ranzato [2017] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/f87522788a2be2d171666752f97ddebb-Paper.pdf.
  • Chaudhry et al. [2019] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-GEM. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hkf2_sC5FX.
  • Buzzega et al. [2020] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15920--15930. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/b704ea2c39778f07c617f6b7ce480e9e-Paper.pdf.
  • Prabhu et al. [2020] Ameya Prabhu, Philip H. S. Torr, and Puneet K. Dokania. Gdumb: A simple approach that questions our progress in continual learning. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision -- ECCV 2020, pages 524--540, Cham, 2020. Springer International Publishing.
  • Das et al. [2024] Payel Das, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarath Swaminathan, Sihui Dai, Aurélie Lozano, Georgios Kollias, Vijil Chenthamarakshan, Soham Dan, et al. Larimar: Large language models with episodic memory control. arXiv preprint arXiv:2403.11901, 2024.
  • Spens and Burgess [2024] Eleanor Spens and Neil Burgess. A generative model of memory construction and consolidation. Nature Human Behaviour, pages 1--18, 2024.
  • Lu et al. [2022] Qihong Lu, Uri Hasson, and Kenneth A Norman. A neural network model of when to retrieve and encode episodic memories. elife, 11:e74445, 2022.
  • Sherman et al. [2022] Maxine T Sherman, Zafeirios Fountas, Anil K Seth, and Warrick Roseboom. Trial-by-trial predictions of subjective time from human brain activity. PLOS Computational Biology, 18(7):e1010223, 2022.
  • Zakharov et al. [2022a] Alexey Zakharov, Qinghai Guo, and Zafeirios Fountas. Variational predictive routing with nested subjective timescales. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=JxFgJbZ-wft.
  • Zakharov et al. [2022b] Alexey Zakharov, Qinghai Guo, and Zafeirios Fountas. Long-horizon video prediction using a dynamic latent hierarchy. arXiv preprint arXiv:2212.14376, 2022b.
  • Zakharov et al. [2021] Alexey Zakharov, Matthew Crosby, and Zafeirios Fountas. Episodic memory for subjective-timescale models. In ICML 2021 Workshop on Unsupervised Reinforcement Learning, 2021. URL https://openreview.net/forum?id=30lZDhrjonR.
  • Cowan [2001] Nelson Cowan. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and brain sciences, 24(1):87--114, 2001.
  • Xiao et al. [2024b] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=NG7sS51zVF.
  • Han et al. [2024b] Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. LM-infinite: Zero-shot extreme length generalization for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3991--4008, Mexico City, Mexico, June 2024b. Association for Computational Linguistics. URL https://aclanthology.org/2024.naacl-long.222.
  • Miasnikof et al. [2018] Pierre Miasnikof, Alexander Shestopaloff, Anthony Bonner, and Yuri Lawryshyn. A Statistical Performance Analysis of Graph Clustering Algorithms, pages 170--184. 05 2018. ISBN 978-3-319-92870-8. doi:10.1007/978-3-319-92871-5_11.
  • Newman and Girvan [2004] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.
  • Douze et al. [2024] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024.
  • Michelmann et al. [2021] Sebastian Michelmann, Amy R Price, Bobbi Aubrey, Camilla K Strauss, Werner K Doyle, Daniel Friedman, Patricia C Dugan, Orrin Devinsky, Sasha Devore, Adeen Flinker, et al. Moment-by-moment tracking of naturalistic learning and its underlying hippocampo-cortical interactions. Nature communications, 12(1):5394, 2021.
  • Lositsky et al. [2016] Olga Lositsky, Janice Chen, Daniel Toker, Christopher J Honey, Michael Shvartsman, Jordan L Poppenk, Uri Hasson, and Kenneth A Norman. Neural pattern change during encoding of a narrative predicts retrospective duration estimates. elife, 5:e16070, 2016.
  • Mariola et al. [2022] Alberto Mariola, Zafeirios Fountas, Lionel Barnett, and Warrick Roseboom. Event segmentation in continuous, naturalistic videos from model-based, data-driven, and human perspectives. 2022.
  • Michelmann et al. [2023b] Sebastian Michelmann, Manoj Kumar, Kenneth A Norman, and Mariya Toneva. Large language models can segment narrative events similarly to humans. arXiv preprint arXiv:2301.10297, 2023b.
  • Baddeley [2003] Alan Baddeley. Working memory: looking back and looking forward. Nature reviews neuroscience, 4(10):829--839, 2003.
  • Ericsson and Kintsch [1995] K Anders Ericsson and Walter Kintsch. Long-term working memory. Psychological review, 102(2):211, 1995.
  • Fortunato [2010] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75--174, 2010.
  • Yang et al. [2016] Zhao Yang, René Algesheimer, and Claudio J Tessone. A comparative analysis of community detection algorithms on artificial networks. Scientific reports, 6(1):30750, 2016.
  • Adams and MacKay [2007] Ryan Prescott Adams and David JC MacKay. Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742, 2007.
  • Newman [2004] Mark EJ Newman. Fast algorithm for detecting community structure in networks. Physical Review E—Statistical, Nonlinear, and Soft Matter Physics, 69(6):066133, 2004.
  • Panaretos and Zemel [2019] Victor M. Panaretos and Yoav Zemel. Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application, 6:405--431, 2019. ISSN 2326-831X. doi:10.1146/annurev-statistics-030718-104938. URL https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-030718-104938.

Appendix A Appendix / supplemental material

A.1 Supplementary figures

Figure 4: Ablation study in LongBench. Comparison of EM-LLM performance for different combinations of model features (represented by different colours) and different values of $\gamma$ (the threshold’s scaling factor). Model variants are aligned on the x-axis based on the average block size that emerges for each case. The $\gamma$ values for each model variant are shown in the first sub-plot. The corresponding InfLLM performance is also shown.
Figure 5: Ablation study in LongBench. Comparison of EM-LLM performance for different ratios of the contiguity and similarity buffers (represented by different colours) and different values of $\gamma$. Model variants are aligned on the x-axis based on the average block size that emerges for each case. The $\gamma$ values for each model variant are shown in the first sub-plot. The corresponding InfLLM performance is also shown.

A.2 Complexity Analysis of EM-LLM Algorithm

Here, we provide a detailed analysis of the computational complexity of our Algorithm 1, focusing on the boundary refinement step and the calculation of modularity and conductance metrics.

Boundary Refinement Step

The boundary refinement step involves finding the optimal position $\hat{\beta}$ between each pair of consecutive initial boundaries $(\alpha, \beta)$ that optimizes the chosen metric function $f$. This step has the following components:

  • Iteration over initial boundaries: $\mathcal{O}(k)$, where $k$ is the number of initial boundaries.

  • For each pair of boundaries, we compute the metric function $f$ for all positions between $\alpha$ and $\beta$. In the worst case, this is $\mathcal{O}(n)$ operations per boundary pair.

Therefore, the overall complexity of this step is $\mathcal{O}(kn)$.
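The loop structure behind this bound can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: the function name `refine_boundaries`, the two-argument `metric(alpha, candidate)` signature, the choice to maximise the metric, and the candidate range $(\alpha, \beta]$ are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def refine_boundaries(
    boundaries: Sequence[int],
    metric: Callable[[int, int], float],
) -> List[int]:
    """Return one refined boundary per consecutive pair (alpha, beta).

    Every candidate position in (alpha, beta] is scored with
    metric(alpha, candidate); the highest-scoring candidate replaces beta.
    Both the signature and the maximisation are illustrative assumptions.
    """
    if not boundaries:
        return []
    refined = [boundaries[0]]  # the first boundary has no predecessor
    # O(k) iterations over consecutive pairs of initial boundaries.
    for alpha, beta in zip(boundaries[:-1], boundaries[1:]):
        # At most n candidates per pair, giving the O(kn) worst-case bound.
        best = max(range(alpha + 1, beta + 1), key=lambda c: metric(alpha, c))
        refined.append(best)
    return refined


def toy_metric(start: int, end: int) -> float:
    """Hypothetical metric that prefers segments of exactly three tokens."""
    return -abs((end - start) - 3)


print(refine_boundaries([0, 5, 12], toy_metric))
```

In this sketch each pair is searched against the original boundaries; whether a refined boundary should feed into the next pair is a design choice the text above does not settle.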

Metric Function Computation

The metric functions (modularity or conductance) are computed at the level of individual memory units. For a memory unit of size $m$:

  • Modularity: The naive computation involves summing over all pairs of nodes within the unit, resulting in a worst-case complexity of $\mathcal{O}(m^{2})$. However, in practice, the similarity graph is often sparse, meaning many node pairs have negligible similarity. Leveraging this sparsity, more efficient implementations can achieve $\mathcal{O}(l)$ complexity, where $l$ is the number of non-zero similarity edges within the unit [Newman, 2004]. Typically, $l$ is much smaller than $m^{2}$, especially for larger units, leading to significant computational savings.

  • Conductance: This requires computing the sum of edge weights crossing the boundary and the total volume of the unit, which can be done in $\mathcal{O}(m)$ time.

Given that $m$ is typically much smaller than $n$ and varies for each unit, we can consider the average unit size $\bar{m}$ and average number of non-zero similarity edges $\bar{l}$. The total complexity for computing metrics across all units is then $\mathcal{O}(k\bar{l})$ for modularity (which in the worst case is $\mathcal{O}(k\bar{m}^{2})$, but typically much lower) or $\mathcal{O}(k\bar{m})$ for conductance.
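To make the two metrics concrete, the sketch below evaluates them for a single contiguous unit of a dense similarity matrix. It uses the standard textbook definitions (conductance as cut weight over the smaller volume, and the Newman-Girvan modularity contribution of one community), which may differ in detail from the definitions used in the main text. The function names are hypothetical, and this naive version makes no attempt to reach the complexities quoted above.

```python
from typing import List

Matrix = List[List[float]]


def conductance(A: Matrix, start: int, end: int) -> float:
    """Cut weight of unit [start, end) divided by the smaller of the two volumes."""
    n = len(A)
    inside = range(start, end)
    outside = [i for i in range(n) if not start <= i < end]
    cut = sum(A[i][j] for i in inside for j in outside)
    vol_in = sum(A[i][j] for i in inside for j in range(n))
    vol_out = sum(A[i][j] for i in outside for j in range(n))
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else 0.0


def modularity_term(A: Matrix, start: int, end: int) -> float:
    """Modularity contribution of unit [start, end) for a symmetric matrix A."""
    n = len(A)
    total = sum(sum(row) for row in A)  # twice the total edge weight
    if total == 0:
        return 0.0
    inside = range(start, end)
    internal = sum(A[i][j] for i in inside for j in inside)  # naive O(m^2) sum
    vol_in = sum(A[i][j] for i in inside for j in range(n))
    return internal / total - (vol_in / total) ** 2


# Toy graph: tokens {0, 1} and {2, 3} are tightly linked, with a weak 1-2 link.
A = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
print(conductance(A, 0, 2), conductance(A, 0, 1))
print(modularity_term(A, 0, 2), modularity_term(A, 0, 1))
```

On this toy graph, the unit covering tokens 0 and 1 should score lower conductance and higher modularity than the unit containing token 0 alone, which is the behaviour a boundary refinement step would exploit.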

Overall Complexity

Combining the boundary refinement step and metric computation, the overall complexity is:

  • For modularity: $\mathcal{O}(kn + k\bar{m}^{2})$

  • For conductance: $\mathcal{O}(kn + k\bar{m})$

Since typically $\bar{m} \ll n$, the dominant term in both cases is $\mathcal{O}(kn)$. For modularity, this additionally relies on $\bar{m}^{2}$ not exceeding the order of $n$, or on the sparse $\mathcal{O}(k\bar{l})$ computation described above. Therefore, we express the overall complexity of our algorithm as $\mathcal{O}(kn)$.

A.3 Analysis of human data

The human data released as part of Kumar et al. [2023] used Gaussian smoothing on the average signal across participants to define a probability distribution of likely event boundary positions with respect to timestamps in the podcast. To calculate our similarity metrics, as shown in Fig. 3A, we need to express this data in terms of discrete event positions with respect to tokens in the transcript. For fair comparison, we therefore identified human-annotated positions by selecting the most likely positions in the distribution, taking as many as our initial surprise-based event segmentation had identified in the transcript. Following the same process as Kumar et al. [2023], we then used their provided word onset times to translate these timestamps to token positions, allowing us to calculate our similarity metrics.

In Fig. 3B, we use the Wasserstein distance to compare the relative positions of event boundaries between human annotations and those found by our own methods. The Wasserstein distance is a versatile metric used to compare two probability distributions [Panaretos and Zemel, 2019]. We used this metric to better capture the uncertainty present in the human data, and found it to give more meaningful results than standard correlation or discrete distance metrics, which showed very little difference between methods. To calculate it, we need to convert our own discrete boundary positions to a distribution across token positions. We did so by defining a Mixture of Gaussians (MoG), with each Gaussian corresponding to a single position. Note that, for fair comparison with human data, we apply the same process to the discrete version of the human-annotated positions described above, and use this for comparison.
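The conversion and comparison described above can be sketched as follows. This is an illustrative reconstruction, not the analysis code: the function names, the equal mixture weights, the standard deviation `sigma`, and the evaluation of the mixture on an integer token grid are all assumptions. For two distributions on the same unit-spaced grid, the 1-D Wasserstein-1 distance reduces to the summed absolute difference of their cumulative distributions.

```python
import math
from itertools import accumulate
from typing import List, Sequence


def mixture_of_gaussians(
    boundaries: Sequence[int], n_tokens: int, sigma: float = 2.0
) -> List[float]:
    """Discretised, equally weighted MoG over token positions 0..n_tokens-1.

    One Gaussian is centred on each boundary; sigma is an assumed value.
    """
    if not boundaries:
        raise ValueError("at least one boundary is required")
    weights = [
        sum(math.exp(-((t - b) ** 2) / (2.0 * sigma**2)) for b in boundaries)
        for t in range(n_tokens)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Wasserstein-1 distance between two distributions on the same unit grid."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Hypothetical boundary positions (in tokens) for a 60-token transcript.
human = mixture_of_gaussians([10, 30], n_tokens=60)
close_model = mixture_of_gaussians([11, 29], n_tokens=60)
far_model = mixture_of_gaussians([20, 45], n_tokens=60)

print(wasserstein_1d(human, close_model), wasserstein_1d(human, far_model))
```

A segmentation whose boundaries sit near the human ones should yield the smaller distance, and the smoothing means that near misses are penalised in proportion to how far they are off rather than counted as outright errors.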

A.4 Approximate equivalence of K-nearest neighbours and softmax attention

Here we sketch an argument that using k-NN retrieval over a key-value cache as part of the attention mechanism in transformers approximates applying softmax attention over the entire sequence of tokens.

Let $q$ be a query vector and $K = \{k_1, k_2, \dots, k_n\}$ the set of key vectors in a transformer model with dimensionality $d$. Each key $k_i$ has a corresponding value vector $v_i$, with $V = \{v_1, v_2, \dots, v_n\}$. The softmax attention weights $a_i$ are defined as:

$$a_i = \frac{\exp(q \cdot k_i \, d^{-\frac{1}{2}})}{\sum_{j=1}^{n} \exp(q \cdot k_j \, d^{-\frac{1}{2}})} \tag{6}$$

The output vector $u$ is computed as:

$$u = \sum_{i=1}^{n} a_i v_i \tag{7}$$

In the k-NN approach, a subset $K'$ of size $k$ is selected, containing the keys nearest to $q$. The approximated attention weights $a'_i$ over this subset are:

$$a'_i = \frac{\exp(q \cdot k_i \, d^{-\frac{1}{2}})}{\sum_{j \in K'} \exp(q \cdot k_j \, d^{-\frac{1}{2}})} \quad \text{for } k_i \in K' \tag{8}$$

The approximate output vector $u'$ is:

$$u' = \sum_{k_i \in K'} a'_i v_i \tag{9}$$
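Equations 6 to 9 can be checked numerically with a small sketch such as the one below. It uses random Gaussian keys and values purely for illustration, so the captured mass it prints says nothing about real attention heads. All function names are hypothetical, and neighbours are ranked by dot product, which is one possible reading of "nearest".

```python
import math
import random
from typing import List, Optional, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax_weights(q: Vector, keys: List[Vector], indices: Sequence[int]) -> List[float]:
    """Softmax of q . k_i / sqrt(d), normalised over `indices` only (Eq. 6 or 8)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, keys[i]) * scale for i in indices]
    top = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention(q: Vector, keys: List[Vector], values: List[Vector],
              indices: Optional[Sequence[int]] = None) -> Vector:
    """Weighted sum of values over `indices` (Eq. 9), or all keys if None (Eq. 7)."""
    idx = list(range(len(keys))) if indices is None else list(indices)
    weights = softmax_weights(q, keys, idx)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, idx):
        for c, v in enumerate(values[i]):
            out[c] += w * v
    return out


def top_k(q: Vector, keys: List[Vector], k: int) -> List[int]:
    """Indices of the k keys with the highest dot product with q."""
    ranked = sorted(range(len(keys)), key=lambda i: dot(q, keys[i]), reverse=True)
    return ranked[:k]


def captured_mass(q: Vector, keys: List[Vector], subset: Sequence[int]) -> float:
    """Share of the full softmax weight that falls on `subset`."""
    full = softmax_weights(q, keys, list(range(len(keys))))
    return sum(full[i] for i in subset)


random.seed(0)
n, d, k = 200, 16, 20  # illustrative sizes only
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

subset = top_k(q, keys, k)
u = attention(q, keys, values)
u_knn = attention(q, keys, values, subset)
error = math.dist(u, u_knn)
print(f"captured mass: {captured_mass(q, keys, subset):.3f}, ||u' - u||: {error:.3f}")
```

Increasing `k` raises the captured mass and shrinks the error, and with `k = n` the two outputs coincide up to floating-point rounding.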

Assumptions

  1. Exponential Dominance: The exponential function in the softmax is sharply peaked, implying that keys with the highest similarities to $q$ contribute significantly more to the sum than others.

  2. Representativeness of k-NN Subset: The subset $K'$ captures the majority of the attention weight from the full set $K$.

Lemma 1: Dominance of k-NN Subset

If $K'$ consists of the $k$ keys with the highest dot products $q \cdot k_i$, then:

$$\sum_{j \in K'} \exp(q \cdot k_j \, d^{-\frac{1}{2}}) \geq \alpha \sum_{j=1}^{n} \exp(q \cdot k_j \, d^{-\frac{1}{2}}) \tag{10}$$

for some $\alpha \approx 1$, typically very close to 1.

Proof: This follows from the exponential dominance assumption and the nature of the exponential function, which is sharply peaked.

Lemma 2: Approximation of Output Vector

Given the dominance of $K'$ as shown in Lemma 1, the approximate output $u'$ effectively represents the full output $u$:

$$\|u' - u\| \leq \epsilon \tag{11}$$

where $\epsilon$ is a small error term.

Proof: Follows from the weighted sum structure of $u$ and $u'$, using the bounds established in Lemma 1.
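One way to make $\epsilon$ explicit is sketched below. It relies on an additional assumption not stated above, namely that the value vectors are bounded, $\|v_i\| \leq V_{\max}$ for all $i$. Write $\rho = \sum_{k_i \in K'} a_i$ for the share of the full softmax weight captured by $K'$, so that Lemma 1 reads $\rho \geq \alpha$ and Eq. 8 gives $a'_i = a_i / \rho \geq a_i$ for $k_i \in K'$. The symbols $\rho$ and $V_{\max}$ are introduced here for illustration only.

```latex
\begin{aligned}
u' - u &= \sum_{k_i \in K'} (a'_i - a_i)\, v_i \;-\; \sum_{k_i \notin K'} a_i\, v_i, \\
\|u' - u\| &\leq V_{\max} \sum_{k_i \in K'} (a'_i - a_i) \;+\; V_{\max} \sum_{k_i \notin K'} a_i \\
&= V_{\max}\,(1 - \rho) + V_{\max}\,(1 - \rho) \\
&= 2\,(1 - \rho)\, V_{\max} \;\leq\; 2\,(1 - \alpha)\, V_{\max}.
\end{aligned}
```

Under this assumption one may take $\epsilon = 2(1 - \alpha) V_{\max}$, which tends to zero as $\alpha \to 1$.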

Given the lemmas and under the stated assumptions, the k-NN retrieval mechanism within a key-value cache effectively approximates the softmax attention mechanism in transformers. This argument highlights the trade-off between efficiency and accuracy inherent in using approximate methods like k-NN retrieval.