Human-like Episodic Memory for Infinite Context LLMs

Zafeirios Fountas (Huawei Noah’s Ark Lab, London, UK)
Martin A Benfeghoul* (Huawei Noah’s Ark Lab, London, UK)
Adnan Oomerjee (Huawei Noah’s Ark Lab, London, UK)
Fenia Christopoulou (Huawei Noah’s Ark Lab, London, UK)
Gerasimos Lampouras (Huawei Noah’s Ark Lab, London, UK)
Haitham Bou-Ammar (Huawei Noah’s Ark Lab, London, UK; University College London, UK)
Jun Wang (University College London, UK)
*Equal Contribution.

Abstract

Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs, enabling them to effectively handle practically infinite context lengths while maintaining computational efficiency. EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement in an on-line fashion. When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient and human-like access to relevant information. Experiments on the LongBench dataset demonstrate EM-LLM’s superior performance, outperforming the state-of-the-art InfLLM model with an overall relative improvement of $4.3\%$ across various tasks, including a $33\%$ improvement on the PassageRetrieval task. Furthermore, our analysis reveals strong correlations between EM-LLM’s event segmentation and human-perceived events, suggesting a bridge between this artificial system and its biological counterpart. This work not only advances LLM capabilities in processing extended contexts but also provides a computational framework for exploring human memory mechanisms, opening new avenues for interdisciplinary research in AI and cognitive science.

1 Introduction

For contemporary pre-trained large language models (LLMs), the context window serves as the primary mechanism to incorporate domain-specific, private, or common up-to-date information. However, despite their remarkable and ever-expanding capabilities, LLMs still exhibit significant limitations when tasked with processing extensive contexts [Liu et al., 2024a]. These limitations stem from inherent challenges in Transformer-based architectures. Recent studies have shown that Transformers struggle with extrapolating to contexts longer than their training window size [Kazemnejad et al., 2024]. On top of this, employing softmax attention over extended token sequences requires substantial computational resources for each token generation, and the resulting attention embeddings risk becoming excessively noisy and losing their distinctiveness [Tworkowski et al., 2023].

To mitigate those challenges, recent works have focused on retrieval-based methods, either in the form of in-context augmentation (e.g., RAG-based techniques [Lewis et al., 2020, Gao et al., 2024]) or via retrieval of previously-inferred key-value pairs (KV) within individual attention heads [Wu et al., 2022, Tworkowski et al., 2023, Bertsch et al., 2023]. Notably, state-of-the-art performance is achieved when KV pairs are initially organised into non-overlapping segments and then retrieved together as one block of sequential tokens [Xiao et al., 2024a]. While such techniques present interesting avenues of research, results still indicate a significant gap between the performance of LLMs in short- vs long-context tasks, even when existing long-context architectures are employed [Liu et al., 2024a].

This work tackles the above challenges and attempts to bridge this performance gap by taking inspiration from the algorithmic interpretation of episodic memory in the human brain -- the memory system responsible for encoding, storing, and retrieving personal experiences and events. The brain makes sense of its continuous experience in the real world by segmenting it into discrete episodic events [Clewett et al., 2019, Zacks, 2020], which are organised in a hierarchical and nested-timescale structure [Baldassano et al., 2017] and stored in long-term memory. Notably, the boundaries between such events are the access points when it comes to memory retrieval [Michelmann et al., 2023a] and are widely believed to correspond to points in time with high prediction errors between the brain’s generative model and its raw sensory input (a.k.a., surprise). In this context, surprise refers to moments when the brain’s predictions about incoming sensory information are significantly violated, leading to a mismatch between what is expected and what is actually perceived. These instances of high surprise are thought to signal important changes in the environment or narrative, prompting the brain to segment the ongoing experience into distinct events [Zacks et al., 2007, 2011, Roseboom et al., 2019, Sinclair et al., 2021, Fountas et al., 2022]. Once segmented and stored, the brain can recall episodic memories based on their similarity to its current experience, recency, original temporal order, and their proximity to other recalled memories (temporal asymmetry and contiguity, Howard and Kahana, 2002).

Following these insights, we propose a novel architecture, EM-LLM, that integrates crucial aspects of event cognition and episodic memory into Transformer-based LLMs. For memory formation, we segment the sequence of the tokens presented to the underlying LLM into individual memory units representing episodic events. The boundaries, and thus the size of those events, are initially determined dynamically, based on the level of surprise of the model during inference, and then refined to maximise cohesion within memory units and separation of memory content across them (see Section 3.2). This refinement process leverages graph-theoretic metrics, treating the similarity between attention keys (the learned representations used in Transformer self-attention mechanisms) as a weighted adjacency matrix, and aims to enhance the model’s ability to efficiently recall relevant information when addressing complex tasks with extended contexts. Importantly, this memory formation process incurs minimal additional computational cost, with the surprise-based segmentation requiring no extra computation and the refinement step having a complexity of $\mathcal{O}(kn)$ , where $k$ is typically very small compared to the number of tokens $n$ . With this efficient memory formation process, by grouping similar information in single units, we minimise the number of units needed to recall details around specific events. For memory recall, our approach integrates similarity-based retrieval with mechanisms that facilitate temporal contiguity and asymmetry effects. By retrieving and buffering salient memory units, our model leverages and enhances the recently discovered propensity of LLMs to exhibit human-like patterns in sequential information retrieval [Ji-An et al., 2024]. 
This method not only ensures efficient access to pertinent information but also mimics the temporal dynamics found in human free recall studies (such as Howard and Kahana, 2002), further enhancing the model’s ability to handle complex tasks that require nuanced temporal reasoning.

To test our hypotheses, we first employ a series of human-annotated podcast scripts, where we show that information in LLM attention heads can be semantically grouped in a way that correlates with the event structure perceived by humans. This suggests that LLM-perceived surprise can serve as a proxy for the cognitive signals that drive human event segmentation, in line with previous work [Kumar et al., 2023]. Then, using the long-context PG-19 dataset [Rae et al., 2020], which comprises a diverse corpus of English books, we evaluate the effectiveness of both steps in our segmentation method for grouping relevant information, and assess the performance of different refinement objectives. Finally, we show that our method is scalable and significantly outperforms the state-of-the-art model InfLLM [Xiao et al., 2024a] on the widely-used LongBench benchmark [Bai et al., 2023] for long-context tasks, achieving an overall relative improvement of $4.3\%$ , including a substantial $33\%$ improvement on the PassageRetrieval task.

2 Related work

2.1 Long-context in LLMs

Recently, several approaches have been proposed to extend the context window of Transformer-based models. Those include methods that address the limited representational capacity of softmax attention, and its quadratic computational and memory cost [Katharopoulos et al., 2020, Munkhdalai et al., 2024]. Other methods target the poor extrapolation of typical positional encodings to out-of-distribution (OOD) context lengths [Kazemnejad et al., 2024]. The latter is evident in most widely used methods, including the original absolute positional encodings [Vaswani et al., 2017] and the more recent relative positional encodings, such as the popular Rotary Positional Embeddings (RoPE) [Su et al., 2024]. To address this, some approaches propose scaling of the rotation angles [Chen et al., 2024a] or the base constant [Xiong et al., 2023, Liu et al., 2024b, Peng et al., 2024, Ding et al., 2024]. Others scale positions without affecting the embedding function [Press et al., 2021, Chen et al., 2023, Jin et al., 2024], explore alternative strategies such as KERPLE [Chi et al., 2022] and FIRE [Li et al., 2024], or adopt mechanisms from certain LMs like T5 [Raffel et al., 2020].

Concerning computational efficiency and diluted attention, successful approaches propose methods for general improvements to the efficiency of Transformers through optimised computations [Dao, 2024, Han et al., 2024a, Aminabadi et al., 2022, Kwon et al., 2023, Liu et al., 2024c, Brandon et al., 2023] or compression techniques [Nawrot et al., 2024, Zhang et al., 2023], as well as training methods tailored for long-context scenarios [Zhu et al., 2024, Chen et al., 2024b]. Another direction is the utilisation of retrieval-based methods, the vast majority of which relies on a vector database that keeps a key-value cache and scalable approximations of k-nearest neighbors (k-NNs) to perform lookups [Wu et al., 2022, Tworkowski et al., 2023, Bertsch et al., 2023]. Interestingly, since using a key-value cache with k-NN lookup can be seen as an approximation of applying softmax attention to the full token sequence (see Appendix A.4), k-NN retrieval methods can be used without any fine-tuning [Bertsch et al., 2023]. For an exception that does not rely on k-NNs, see Wang et al. [2023].

A recent and interesting variant of k-NN retrieval involves retrieving large groups of tokens, rather than individual ones. Models that rely on this approach include SLED [Ivgi et al., 2023] and the more recent InfLLM [Xiao et al., 2024a], which achieves state-of-the-art performance on long-context benchmarks. InfLLM segments the entire context length into fixed-size memory units and employs k-NN lookup using the tokens with the highest accumulated scores per unit. This can be seen as a form of hierarchical attention, as illustrated in Fig. 1. While group-based retrieval represents a promising direction, our approach significantly advances this concept by dynamically determining token groupings in a manner akin to human memory formation, addressing a fundamental limitation of InfLLM’s fixed-size segmentation and enabling more adaptive and context-sensitive processing of extended information.

Figure 1: Group-based $k$ -NN retrieval can be seen as a form of hierarchical episodic attention. Initially, $k=4$ groups of tokens are selected (left) and then used for softmax attention (right), as if all other similarity scores were forced to be zero (non-shaded areas of the left curve). This framework can support multiple levels of episodic attention.
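The two-level view in Figure 1 can be illustrated with a minimal sketch. It assumes scalar values, a single query, and that each group is scored by its best-matching key; the latter is a hypothetical simplification, since InfLLM selects representatives by accumulated attention scores. The function name `group_knn_attention` is illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def group_knn_attention(query, groups, k):
    """groups: list of groups, each a list of (key_vector, scalar_value) pairs."""
    # Level 1: score each group by its best-matching key (assumed representative).
    group_scores = [max(dot(query, key) for key, _ in g) for g in groups]
    top = sorted(range(len(groups)), key=lambda g: group_scores[g], reverse=True)[:k]
    top = sorted(top)  # restore original temporal order of the selected groups
    # Level 2: softmax attention restricted to tokens of the selected groups,
    # as if all other similarity scores had been masked out.
    selected = [kv for g in top for kv in groups[g]]
    logits = [dot(query, key) for key, _ in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    output = sum(w / norm * value for w, (_, value) in zip(weights, selected))
    return output, top

groups = [
    [([1.0, 0.0], 1.0), ([1.0, 0.0], 1.0)],
    [([0.0, 1.0], 5.0)],
    [([-1.0, 0.0], 9.0)],
]
output, chosen = group_knn_attention([2.0, 0.0], groups, k=1)
```

With `k=1` only the first group is attended to, so the values stored in the other groups cannot influence the output.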

2.2 Neural models of Episodic Memory and Event Cognition

The concept of episodic memory, central to our approach, has been extensively studied in both theoretical neuroscience and machine learning. Neural models of episodic memory capture human behaviour and neuroimaging data, providing insights into how the brain processes and stores experiences and suggesting links between memory, efficient representations and navigation of physical and conceptual spaces [Gershman et al., 2012, Benna and Fusi, 2021]. In machine learning, episodic memory-inspired approaches have yielded significant improvements across various domains. For instance, episodic control has enhanced reinforcement learning agents’ performance and learning speed [Blundell et al., 2016, Pritzel et al., 2017, Coda-Forno et al., 2024]. In addition, models of memory construction and consolidation have been successful in alleviating catastrophic forgetting in neural networks [Kirkpatrick et al., 2017, Lopez-Paz and Ranzato, 2017, Chaudhry et al., 2019, Buzzega et al., 2020, Prabhu et al., 2020], including LLMs [Das et al., 2024], and appear to explain key features of human memory, such as imagination and future thinking [Spens and Burgess, 2024].

These models have revealed key aspects of episodic memory, particularly in describing how experiences are segmented into events, and when new memories are encoded and retrieved [Lu et al., 2022]. Surprise plays a critical role in this process, triggering event boundaries and memory formation [Fountas et al., 2022, Kumar et al., 2023]. This event-based structure is deeply intertwined with our perception of time [Roseboom et al., 2019, Sherman et al., 2022], highlighting the interdependence of memory and temporal cognition. This insight has helped generative models for video [Zakharov et al., 2022a, b] and reinforcement learning [Zakharov et al., 2021] to capture temporal dynamics more accurately. In terms of memory retrieval, studies in human free recall have shown a distinctive increased likelihood of retrieving items encoded close together in time (temporal contiguity) and in succession (temporal asymmetry) (see Fig. 2A). Recently, it was shown that attention heads in Transformer-based LLMs that are associated with in-context learning already exhibit the same dynamic retrieval behaviour [Ji-An et al., 2024] (Fig. 2B), including both contiguity and asymmetry effects. Therefore, Transformers have the inherent ability to act as episodic memory retrieval models, if provided with the right information within their context window. Our work leverages these concepts of surprise-based event segmentation and LLMs’ inherent temporal contiguity and asymmetry effects to enable a new generation of Infinite Context-Length LLMs, capable of processing and understanding information over vastly extended timescales.

3 EM-LLM: LLM with Episodic Memory

3.1 Architecture

EM-LLM is designed to be applied directly to pre-trained LLMs, enabling them to handle context lengths significantly larger than their original training length. Our architecture divides the context into three distinct groups: initial tokens, evicted tokens, and local context. This structure, while incorporating insights from recent work on token block retrieval [Xiao et al., 2024a], introduces novel elements inspired by human episodic memory.

The local context represents the most recent tokens, maximising information about the current task, and fits within the typical context window of the underlying LLM. This group utilises full softmax attention and plays a role similar to the focus of attention in cognitive models of working memory, holding the most immediately relevant information for the current task [Cowan, 2001]. The evicted tokens typically comprise the majority of past tokens in a long-context scenario, extending far beyond the LLM’s original training length. These tokens are managed by our proposed memory model functioning similarly to short-term episodic memory in the brain. Finally, following previous work, we also maintain a group of $128$ initial tokens in the LLM context. These act as attention sinks and help recover the performance of window attention, as first observed by Xiao et al. [2024b], Han et al. [2024b] and later adopted by Xiao et al. [2024a]. For retrieved tokens, which are therefore discontinuous and outside the local context, we assign a fixed position embedding as in Raffel et al. [2020], Xiao et al. [2024a]. This architecture enables EM-LLM to effectively process and utilise information from positions outside its pre-trained local context window, while maintaining the underlying LLM’s performance characteristics.
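As a minimal sketch of the three-way split described above, the following divides a token sequence into initial tokens, evicted tokens, and local context. The function name `partition_context` and the default `n_local=4096` are assumptions made for illustration; only the $128$ initial tokens come from the text.

```python
def partition_context(tokens, n_init=128, n_local=4096):
    """Split a token list into (initial, evicted, local) groups.

    n_init follows the 128 initial tokens mentioned in the text;
    n_local is an assumed local window size used only for illustration.
    """
    initial = tokens[:n_init]
    # The local context is the most recent n_local tokens, never overlapping
    # the initial tokens.
    local_start = max(n_init, len(tokens) - n_local)
    evicted = tokens[n_init:local_start]  # handled by the episodic memory
    local = tokens[local_start:]          # receives full softmax attention
    return initial, evicted, local

initial, evicted, local = partition_context(list(range(10)), n_init=2, n_local=3)
```

For sequences shorter than `n_init + n_local`, the evicted group is simply empty and the model behaves like the underlying LLM.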

3.2 Memory formation via Surprise

In the context of LLMs, we define episodic memory as the organised, event-based collection of past key-value pairs, analogous to the latent representations of personal experiences in human memory. Just as unexpected or novel information plays a crucial role in human memory formation, we posit that analogous indicators of novelty in LLMs can serve as an effective proxy for identifying significant “events” within the model’s experience. In Bayesian terms, surprise is quantified by the negative log-likelihood of observing the current, ground-truth token given the previous tokens in an auto-regressive model, with high values indicating the unpredictability or novelty of each new token within the context according to the model, i.e., it is “surprised” by the next token. Following work on cognitive modelling [Roseboom et al., 2019, Fountas et al., 2022], we employ a thresholding mechanism to perform an initial identification of event boundaries (used for the first time in LLMs). Formally, a token $x_{t}$ is considered a potential boundary if its surprise value exceeds a threshold $T$ :

$$-\log P(x_{t}\mid x_{1},\ldots,x_{t-1};\theta)>T\quad\text{with}\quad T=\mu_{t-\tau:t}+\gamma\,\sigma_{t-\tau:t}\tag{1}$$

where $\mu_{t-\tau:t}$ and $\sigma_{t-\tau:t}^{2}$ are the mean and variance of surprise over a window of offset $\tau$ (so that $\sigma_{t-\tau:t}$ is the corresponding standard deviation), and $\gamma$ is a scaling factor. The choice of threshold $T$ is critical in balancing the granularity of segmentation with the model’s sensitivity to contextual shifts. If $T$ is too high, we will identify very few event boundaries, especially if the local context contains few surprising tokens. Conversely, a low $T$ results in frequent boundary identification. Using a moving window ensures that $T$ adapts to contextual shifts, minimising the need for manual tuning while maintaining control over threshold sensitivity via $\gamma$ . We also explored a fixed threshold approach ( $T=T_{\text{fixed}}$ ), though our primary focus remained on the dynamic threshold due to its adaptability to varying contexts. This initial segmentation results in a set of potential event boundaries $\mathcal{B}=\{b_{1},b_{2},...,b_{k}\}$ , where each $b_{i}$ represents the index of a token exceeding the surprise threshold. These boundaries serve as the starting point for our subsequent refinement process, which aims to optimise the intra-event coherence and inter-event distinctiveness of the resulting memory segments.
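The thresholding rule in Eq. (1) can be sketched as follows. This is an illustrative version only: it assumes the window covers the $\tau$ tokens preceding position $t$ (excluding $t$ itself), uses the population standard deviation, and skips positions with fewer than two previous surprise values. The function name `surprise_boundaries` and the default parameter values are hypothetical.

```python
import math
from statistics import mean, pstdev

def surprise_boundaries(token_probs, tau=4, gamma=1.0):
    """Return indices t where -log P(x_t | x_<t) exceeds mu + gamma * sigma.

    token_probs[t] is the probability the model assigned to the observed
    token at position t. mu and sigma are computed over the previous `tau`
    surprise values (an assumed reading of the moving window).
    """
    surprise = [-math.log(p) for p in token_probs]
    boundaries = []
    for t in range(len(surprise)):
        window = surprise[max(0, t - tau):t]
        if len(window) < 2:
            continue  # not enough history to estimate a threshold
        threshold = mean(window) + gamma * pstdev(window)
        if surprise[t] > threshold:
            boundaries.append(t)
    return boundaries

# One poorly predicted token (probability 0.05) among well predicted ones.
probs = [0.9, 0.8, 0.9, 0.85, 0.05, 0.9, 0.8, 0.9]
print(surprise_boundaries(probs))  # prints [4]
```

Because the threshold is relative to recent surprise, the tokens following the spike are not flagged even though the window now contains one large value.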

3.3 Boundary refinement

While surprise-based segmentation provides an effective initial estimate of event boundaries, we make the key observation that the utility of elements within an event during memory recall depends on their likelihood of being utilised by the current query. Therefore, we theorise that memory recall will be most efficient with high intra-event similarity between keys while maintaining low inter-event similarity. For instance, see the similarity of groups in Fig. 1. To further ensure this, we introduce a boundary refinement step which looks to optimise this objective. Such an objective is typically optimised in the context of graph-clustering, hence we will express this refinement process in a graph-theoretic manner. To achieve this, we treat the similarity matrix between all keys of an attention head $h$ within the local context window for tokens $x_{1},x_{2},...,x_{n}$ as an adjacency matrix. We define the adjacency matrix $A^{h}$ as

$$A^{h}_{ij}=\text{sim}(K^{h}_{i},K^{h}_{j}),\tag{2}$$

where $K^{h}_{i}$ and $K^{h}_{j}$ are the key vectors corresponding to tokens $x_{i}$ and $x_{j}$ , respectively. The similarity function measures the closeness of two key vectors; in our implementation, we use dot product similarity ${K^{h}}^{\mathsf{T}}_{i}\cdot K^{h}_{j}$ due to its effectiveness in capturing semantic relationships in high-dimensional spaces [Vaswani et al., 2017] and to align with the mechanism of self-attention in Transformers.

To evaluate the quality of potential boundaries, we define a metric function $f(A,\mathcal{B}):\mathbb{R}^{n\times n}\times\{1,\ldots,n\}^{k}\rightarrow\mathbb{R}$ . This function quantifies the cohesion within events and separation between events based on the graph structure represented by the similarity matrix $A$ and event boundaries $\mathcal{B}$ . We experimented with two widely-accepted graph-clustering metrics: modularity and conductance [Miasnikof et al., 2018]. Modularity [Newman and Girvan, 2004] provides a measure of the quality of a particular division of a network into communities, with higher values indicating higher edge density in the identified cluster when compared to the density of edges expected in a random cluster. As our edge weights represent the similarity between two tokens, we seek to maximise this metric. Modularity is defined as:

$$f_{M}(A^{h},\mathcal{B})=\frac{1}{4m}\sum_{i,j}\left[A^{h}_{ij}-\frac{\sum_{i}A^{h}_{ij}\cdot\sum_{j}A^{h}_{ij}}{2m}\right]\delta(c_{i},c_{j})\tag{3}$$

where $m$ is the total edge weight in the graph, $c_{i}$ is the community (episodic event) to which node $i$ is assigned, and $\delta$ is the Kronecker delta function. Conductance, on the other hand, measures the fraction of total weighted edges cut by a given community boundary, and is defined as:

$$f_{C}(A^{h},\mathcal{B})=\min_{S\in V}\frac{\sum_{i\in S,j\notin S}A^{h}_{ij}}{\min(\text{vol}(S),\text{vol}(V\setminus S))},\qquad\text{with}\;\text{vol}(S)=\sum_{i\in S,j\in S}A_{ij},\;\text{vol}(V\setminus S)=\sum_{i\notin S,j\notin S}A_{ij}\tag{4}$$

where $S=\{b_{i},b_{i}+1,...,b_{i+1}\}$ is a subset of all nodes $V=\{b_{1},b_{1}+1,...,b_{k}\}$ in the induced graph, with $b_{i}\in\mathcal{B}$ . Lower conductance values indicate better community structure. Our boundary refinement algorithm iteratively adjusts the initial surprise-based boundaries to optimise these metric functions. While our best results were achieved using modularity, we also include comparisons with conductance-based boundary refinement to provide a comprehensive analysis. The overall process is summarised in Algorithm 1.
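To make the two metrics concrete, the sketch below evaluates Eqs. (2) to (4) on a toy set of keys. It rests on several assumptions that are not stated in the text: $m$ is taken as half the sum of all entries of $A$, events are half-open index ranges starting at each boundary, conductance is computed for a single event rather than minimised over events, and all similarities are positive. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def adjacency(keys):
    """Eq. (2) with dot-product similarity between key vectors."""
    return [[dot(ki, kj) for kj in keys] for ki in keys]

def labels_from_boundaries(boundaries, n):
    """Token i belongs to the event opened by the last boundary <= i."""
    starts, labels, event = set(boundaries), [], 0
    for i in range(n):
        if i in starts and i > 0:
            event += 1
        labels.append(event)
    return labels

def modularity(A, labels):
    """Eq. (3), keeping the 1/(4m) normalisation used in the text."""
    n = len(A)
    degree = [sum(row) for row in A]   # row sums equal column sums (A symmetric)
    m = sum(degree) / 2.0              # assumed definition of total edge weight
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                total += A[i][j] - degree[i] * degree[j] / (2.0 * m)
    return total / (4.0 * m)

def conductance(A, labels, event):
    """Inner ratio of Eq. (4) for one event S, with vol as defined there."""
    n = len(A)
    inside = [i for i in range(n) if labels[i] == event]
    outside = [i for i in range(n) if labels[i] != event]
    cut = sum(A[i][j] for i in inside for j in outside)
    vol_in = sum(A[i][j] for i in inside for j in inside)
    vol_out = sum(A[i][j] for i in outside for j in outside)
    return cut / min(vol_in, vol_out)

keys = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
A = adjacency(keys)
good = labels_from_boundaries([0, 2], len(keys))  # [0, 0, 1, 1]
bad = labels_from_boundaries([0, 1], len(keys))   # [0, 1, 1, 1]
```

On this toy input the boundary that separates the two groups of similar keys yields higher modularity and lower conductance than the misplaced boundary, which is the behaviour the refinement step relies on.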

Algorithm 1 Event segmentation in KV cache

Input: tokens: List of tokens in the sequence
Input: $T$ : Threshold for surprisal to identify initial boundaries
Input: $f$ : Metric function to evaluate potential boundaries
Output: $\mathcal{B}$ : List of final boundary positions

$\mathcal{B}\leftarrow$ [i for i in range(length(tokens)) if $-\log(P(\texttt{tokens}[i]))>T$ ]    $\triangleright$ Boundary identification
for i in range(length( $\mathcal{B}$ )) do
    $\alpha,\beta=\mathcal{B}[i],\mathcal{B}[i+1]$
    $\mathcal{B}[i+1]\leftarrow\arg\max_{\hat{\beta}\in(\alpha,\beta]}f(A,\{\alpha,\hat{\beta}\})$    $\triangleright$ Boundary refinement
end for
return $\mathcal{B}$

This algorithm first identifies initial boundaries based on the surprise threshold $T$ , then refines these boundaries by finding the optimal position $\hat{\beta}$ between each pair of consecutive initial boundaries $(\alpha,\beta)$ that optimises the chosen metric function $f$ (either maximising modularity or minimising conductance). This process ensures that the final segmentation (1) captures points of high surprise and (2) optimises for coherent information grouping. The boundary identification step incurs negligible computational cost, as it only evaluates existing LLM outputs. The time complexity of Algorithm 1 is dominated by the boundary refinement step, which has an overall complexity of $\mathcal{O}(kn)$ , where $k$ is the number of initial boundaries and $n$ is the sequence length. A detailed analysis of the algorithm’s complexity, including the computation of modularity and conductance metrics, is provided in Appendix A.2. Despite this modest computational overhead, the resulting improvement in segment quality leads to significant performance gains in downstream tasks, particularly those requiring complex temporal reasoning.
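A runnable sketch of Algorithm 1 follows, under explicit assumptions. The surprise values and the similarity matrix are given as inputs, the threshold is fixed, and the loop stops at the second-to-last boundary so that $\mathcal{B}[i+1]$ always exists. The metric `split_modularity` is a hypothetical reading of $f(A,\{\alpha,\hat{\beta}\})$: it scores the span from $\alpha$ to the following boundary (or the end of the sequence) when split at $\hat{\beta}$. This naive version recomputes the metric for every candidate, so it does not achieve the $\mathcal{O}(kn)$ cost discussed above.

```python
def split_modularity(A, alpha, b, end):
    """Modularity (Eq. 3) of the span [alpha, end) split into [alpha, b) and [b, end)."""
    idx = range(alpha, end)
    degree = {i: sum(A[i][j] for j in idx) for i in idx}
    m = sum(degree.values()) / 2.0  # assumed definition of total edge weight
    total = 0.0
    for i in idx:
        for j in idx:
            if (i < b) == (j < b):  # both tokens fall in the same event
                total += A[i][j] - degree[i] * degree[j] / (2.0 * m)
    return total / (4.0 * m)

def segment_events(surprise, A, T, f):
    n = len(surprise)
    # Boundary identification: tokens whose surprise exceeds the threshold.
    B = [i for i in range(n) if surprise[i] > T]
    # Boundary refinement: move each boundary within (alpha, beta] to the
    # position that maximises the metric.
    for i in range(len(B) - 1):
        alpha, beta = B[i], B[i + 1]
        end = B[i + 2] if i + 2 < len(B) else n
        B[i + 1] = max(range(alpha + 1, beta + 1),
                       key=lambda b: f(A, alpha, b, end))
    return B

# Six tokens forming two groups of similar keys: {0, 1, 2} and {3, 4, 5}.
A = [[1.0 if (i < 3) == (j < 3) else 0.1 for j in range(6)] for i in range(6)]
# The surprise spike arrives one token late, at position 4.
surprise = [2.0, 0.1, 0.2, 0.3, 1.5, 0.1]
print(segment_events(surprise, A, T=1.0, f=split_modularity))  # prints [0, 3]
```

In this toy case the surprise threshold places the second boundary at token 4, and refinement moves it back to token 3, where the key similarities actually change.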

3.4 Memory Retrieval

When inferring a new token, a number of episodic events are selected and become a part of the (extended) context window of the underlying LLM. Our memory retrieval process employs a two-stage mechanism to select relevant episodic events for the LLM’s context window (Fig.2C). First, we retrieve $k_{s}$ events using $k$ -NN search based on dot product similarity between the current query and representative tokens of each event. These representatives, selected as per Xiao et al. [2024a], are the most influential tokens within each event. For large memory stores, we utilise approximate $k$ -NN [Douze et al., 2024] to maintain efficiency. These $k_{s}$ events, retrieved based on their similarity to the current query, form a part of the LLM’s context window that we refer to as the similarity buffer.

The second stage of our retrieval process introduces another buffer, which we refer to as the contiguity buffer, designed to maintain temporal context. Implemented as a queue of size $k_{c}$ , this buffer promotes temporal relationships in retrieval. When an event is retrieved, we also enqueue its neighboring events (within $\pm n$ positions in the original sequence) into this buffer. This mechanism enables the LLM’s “induction” attention heads to exhibit the contiguity and asymmetry effects discussed in Section 2.2. The queue structure allows for a natural decay of temporal context as new events are processed, with older or repeated events being dequeued as new ones are added. In total, $k=k_{s}+k_{c}$ events are added to the context window, striking a balance between relevance and temporal relationships in a manner analogous to human episodic memory retrieval.
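The two buffers can be sketched as below. This is a simplified illustration, not the authors' implementation: each event is summarised by a single representative key, exact search replaces approximate $k$ -NN, and a repeated neighbour is moved to the back of the queue. The class name `EpisodicRetriever` and its parameters are hypothetical.

```python
from collections import deque

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class EpisodicRetriever:
    def __init__(self, representatives, k_s=2, k_c=2, n=1):
        self.reps = representatives          # one key per event (assumed simplification)
        self.k_s = k_s                       # size of the similarity buffer
        self.n = n                           # neighbourhood radius in event positions
        self.contiguity = deque(maxlen=k_c)  # oldest entries fall out automatically

    def retrieve(self, query):
        # Stage 1: similarity buffer, the k_s events closest to the query.
        scores = [dot(query, r) for r in self.reps]
        ranked = sorted(range(len(self.reps)), key=lambda e: scores[e], reverse=True)
        similar = ranked[:self.k_s]
        # Stage 2: contiguity buffer, temporal neighbours of retrieved events.
        for e in similar:
            for nb in range(e - self.n, e + self.n + 1):
                if nb == e or not 0 <= nb < len(self.reps):
                    continue
                if nb in self.contiguity:
                    self.contiguity.remove(nb)  # refresh a repeated event
                self.contiguity.append(nb)
        contiguous = [e for e in self.contiguity if e not in similar]
        return similar, contiguous

reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
retriever = EpisodicRetriever(reps, k_s=1, k_c=2, n=1)
first = retriever.retrieve([1.0, 0.0])   # ([0], [1])
second = retriever.retrieve([0.0, 1.0])  # ([2], [1, 3])
```

The second call shows the queue carrying temporal context across retrieval steps: event 1 was enqueued as a neighbour of event 0 and is still available when event 2 is retrieved.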

4 Experiments

4.1 Performance of EM-LLM on long-context tasks

As previously mentioned, InfLLM is, at the time of writing, considered to achieve state-of-the-art performance on long-context benchmarks ( $\infty$ -Bench, LongBench), as well as being the only method which uses group-based k-NN retrieval in LLMs on such benchmarks. We, therefore, employ this model as our baseline for comparison with our own methods on short context windows (4K+2K, as in Xiao et al., 2024a).

Task	InfLLM	Max Imp.	EM-LLM (S)	EM-LLM (SM)	EM-LLM (S+C)	EM-LLM (SM+C)
NarrativeQA	22.12	+1.49 $\%$	21.32	21.13	21.80	22.45
MultiNews	26.70	-0.30 $\%$	26.52	26.54	26.69	26.62
Qasper	29.33	+0.17 $\%$	28.99	29.38	29.11	28.68
TREC	69.00	+2.17 $\%$	70.00	70.00	70.50	70.50
MultiFieldQA	47.42	+0.42 $\%$	47.49	47.39	47.46	47.62
TriviaQA	86.67	+1.10 $\%$	86.93	87.62	87.35	87.47
HotpotQA	36.56	+9.38 $\%$	39.99	39.01	39.05	38.90
SAMSum	42.52	+0.87 $\%$	42.34	42.13	42.89	42.48
2WikiMQA	22.31	+6.41 $\%$	23.74	22.75	22.65	23.46
PassageRetrieval	64.00	+33.47 $\%$	85.42	78.92	84.67	84.08
Musique	17.68	+6.17 $\%$	17.58	17.82	17.93	18.77
LCC	56.67	+0.63 $\%$	54.90	57.03	54.79	56.79
GovReport	31.03	+1.90 $\%$	31.24	31.62	31.34	31.43
RepoBench-P	52.97	+1.34 $\%$	50.76	53.68	51.34	52.86
QMSum	23.49	+2.13 $\%$	23.82	23.20	23.99	23.47
Avg. score:	41.90	+4.30 $\%$	43.40	43.22	43.44	43.70

Table 1: EM-LLM performance on LongBench compared to our baseline InfLLM. S: surprise threshold, SM: surprise threshold + refinement with modularity, S+C: surprise threshold + contiguity buffer, SM+C: surprise threshold + refinement with modularity + contiguity buffer. Max Imp. shows the maximum relative improvement over InfLLM across all EM-LLM variants.

Results on the LongBench dataset (Table 1) show that our method is able to improve on InfLLM in all but one task, with the best method achieving an overall increase in performance of $1.8$ percentage points (a relative improvement of $4.3\%$ ). Note that the table shows the best single method in terms of overall performance for each ablation. Looking at individual task performance across all experiments, we are able to beat InfLLM in all tasks (see Appendix A.1). Interestingly, we see an especially large jump in performance on the PassageRetrieval task across all ablations, with up to a $33\%$ improvement on InfLLM. This task requires the model to identify the original paragraph from a summary, a challenging task that tests the model’s ability to accurately recall a wide range of detailed information from a large context concurrently. The substantial improvement on this task highlights the effectiveness of our event segmentation method in enhancing long-term memory recall and retrieval accuracy in LLMs. Additionally, our method achieves a notable $9.38\%$ improvement on the HotpotQA task, which involves complex reasoning over multiple supporting documents, further emphasising the model’s ability to provide explanations for answers.
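As a worked check of these figures, assuming Max Imp. denotes the relative change of the best EM-LLM score over the InfLLM score in the same row of Table 1, the average score and the PassageRetrieval score give:

$$\frac{43.70-41.90}{41.90}\times 100\%\approx 4.30\%,\qquad\frac{85.42-64.00}{64.00}\times 100\%\approx 33.47\%$$

The first ratio corresponds to the $1.8$ percentage point gain in average score, and the second to the PassageRetrieval gain obtained with the S variant.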

4.2 Humans and LLM surprise cluster similar tokens together

As mentioned in Section 3.3, we employ modularity and conductance as two refinement objectives in our boundary refinement algorithm, due to their qualities in assessing the intra- and inter-event similarities between individual tokens. We will now use such metrics to compare various event segmentation methods, including human event segmentation data. Additionally, we introduce one further, simple metric for this experiment: the ratio between intra- and inter-community similarity (I/IS), calculated for each head and community $S$ as follows:

$$\text{intra}=\sum_{i\in S,j\in S}A_{ij},\qquad\text{inter}=\sum_{i\in S,j\notin S}A_{ij},\qquad\text{I/IS}\equiv\frac{\text{intra}}{\text{inter}}\tag{5}$$
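A minimal sketch of Eq. (5) for a single head and a single community is given below. The function name `intra_inter_ratio` and the toy similarity matrix are illustrative assumptions.

```python
def intra_inter_ratio(A, community):
    """I/IS of Eq. (5): similarity inside the community over similarity leaving it."""
    members = set(community)
    n = len(A)
    intra = sum(A[i][j] for i in members for j in members)
    inter = sum(A[i][j] for i in members for j in range(n) if j not in members)
    return intra / inter

# Four tokens: {0, 1} and {2, 3} are mutually similar (1.0), cross pairs are weak (0.25).
A = [[1.0 if (i < 2) == (j < 2) else 0.25 for j in range(4)] for i in range(4)]
print(intra_inter_ratio(A, [0, 1]))  # prints 4.0
print(intra_inter_ratio(A, [1, 2]))  # prints 1.0
```

A community aligned with the similarity structure scores higher than one that straddles the two groups, which is why larger I/IS values are treated as better in the comparison that follows.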

Kumar et al. [2023] found strong correlations between human-perceived events and prediction errors across 3 short podcasts (7-30 minutes on average), when processing the corresponding transcript with an LLM. Taking advantage of such human data and results from previous works on this dataset [Michelmann et al., 2021, Lositsky et al., 2016], we compare the segmentation quality and correlation with human data for each of our methods (Fig. 3) using our similarity metrics.

As shown in Fig. 3A, human-perceived events achieve significantly higher scores in similarity metrics compared to fixed or random events, suggesting that surprise is indeed an important factor for humans in their own perception of events. Furthermore, surprise-only segmentation ( $\mathbf{S}$ ) achieves very similar results to humans, while the addition of our refinement algorithm ( $\mathbf{SM}$ , $\mathbf{SC}$ , $\mathbf{FM}$ , $\mathbf{FC}$ ) significantly improves performance. Fig. 3B further shows that surprise-based methods ( $\mathbf{S}$ , $\mathbf{SM}$ , $\mathbf{SC}$ ) consistently identify event boundaries that are closest to those perceived by humans.

4.3 Comparing segmentation methods

LLM	Metric	F	FM	FC	S	SM	SC
Mistral-7B	Mod $\uparrow$	-2.3 $\pm$ 4.1	29.2 $\pm$ 44.0	6.7 $\pm$ 25.9	18.6 $\pm$ 29.6	39.9 $\pm$ 55.5	29.5 $\pm$ 42.7
	Con $\downarrow$	9.1 $\pm$ 8.7	-16.9 $\pm$ 6.7	-12.5 $\pm$ 9.6	-23.6 $\pm$ 9.4	-24.6 $\pm$ 9.3	-27.6 $\pm$ 9.8
	I/IS $\uparrow$	-4.3 $\pm$ 4.0	31.2 $\pm$ 21.4	3.7 $\pm$ 14.9	17.9 $\pm$ 17.0	35.3 $\pm$ 27.7	21.6 $\pm$ 22.4
LLaMA2-7B	Mod $\uparrow$	-1.1 $\pm$ 4.3	13.4 $\pm$ 19.5	0.6 $\pm$ 7.3	8.7 $\pm$ 16.0	18.7 $\pm$ 26.4	11.5 $\pm$ 19.4
	Con $\downarrow$	11.9 $\pm$ 9.8	-18.8 $\pm$ 7.4	-13.7 $\pm$ 10.9	-29.5 $\pm$ 10.2	-29.7 $\pm$ 10.1	-33.3 $\pm$ 10.3
	I/IS $\uparrow$	-3.8 $\pm$ 3.7	20.7 $\pm$ 184.7	-1.1 $\pm$ 6.8	15.0 $\pm$ 880.0	25.0 $\pm$ 19.9	16.5 $\pm$ 15.4
LLaMA3-8B	Mod $\uparrow$	-1.6 $\pm$ 3.6	18.9 $\pm$ 25.6	0.9 $\pm$ 11.8	13.1 $\pm$ 21.5	27.0 $\pm$ 35.6	18.3 $\pm$ 28.5
	Con $\downarrow$	11.3 $\pm$ 9.5	-20.3 $\pm$ 6.9	-14.6 $\pm$ 11.4	-29.7 $\pm$ 9.2	-30.6 $\pm$ 9.2	-33.9 $\pm$ 9.6
	I/IS $\uparrow$	-3.8 $\pm$ 3.1	24.5 $\pm$ 13.9	-1.1 $\pm$ 5.8	15.7 $\pm$ 11.0	28.1 $\pm$ 16.1	16.4 $\pm$ 12.2

Table 2: Comparison with graph-theoretic metrics in the KV cache of different LLMs and segmentation methods using the PG-19 dataset. Reported values are the difference with the corresponding random segmentation. Mod: modularity $\times 10^{5}$, Con: conductance, I/IS: intra/inter-similarity $\times 10^{3}$. Hyper-parameters: surprise threshold $= 0.001$.

Table 2 shows that surprise-based segmentation with refinement ( $\mathbf{SM}$ , $\mathbf{SC}$ ) provides the best results in terms of event similarity metrics, regardless of the base LLM used. While the surprise-only method ( $\mathbf{S}$ ) achieves some good results, refinement is especially adept at improving this performance with regard to our metrics, as it directly optimises for such an objective. Interestingly, however, the fixed-based refinement methods ( $\mathbf{FM}$ , $\mathbf{FC}$ ) do not reach the same performance as their surprise-based counterparts, further showing that the initial segmentation with a surprise threshold is crucial to achieving the best possible balance in intra-/inter-similarity with our methods.

4.4 Similarity, Contiguity, Recency and Temporal Order

As demonstrated in Tables 1 and 2, along with Fig. 3, each of our ablations shows improvements over InfLLM (see also Appendix A.1). As mentioned in Section 4.3, refinement has a strong positive impact on our similarity metrics. This translates well to model performance in Table 1, where refinement achieves the best performance in a third of the tasks, and it also agrees with human data (Fig. 3). The effects of contiguity are also clearly demonstrated in this table, with the addition of our contiguity buffer achieving the best performance on three tasks and the second-best overall score. Furthermore, these methods are shown to be generally complementary, achieving the best overall performance when combined.

However, the fact that certain tasks still appear to benefit more from either surprise-only, refinement, or contiguity is an interesting result. This is likely due to the nature of the tasks and the varying importance of contiguity across them. For instance, in Supplementary Fig. 5, the MultiNews task scores higher than our baseline only for a ratio of $70\%$ contiguity to similarity buffers. Where contiguity is not crucial, adding such a buffer to our context window also reduces the size of the similarity buffer, and therefore provides potentially less directly relevant events. This is compatible with our own findings that a contiguity buffer that is as big as or smaller than the similarity buffer yields the best results, suggesting that the similarity buffer is still the most crucial part of our approach. This is especially the case when combined with refinement, which we expect is due to the improved similarity of refined events, hence further reducing the need for contiguous events.
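The trade-off described above, where enlarging the contiguity buffer shrinks the similarity buffer under a fixed budget of retrieved events, can be illustrated with the following sketch. It is not the EM-LLM implementation: the function names, the rounding rule and the choice of immediate temporal neighbours (next event first, then previous) are all assumptions made for illustration.

```python
def split_retrieval_budget(total_events, contiguity_ratio):
    """Split a fixed budget of retrieved events into (similarity, contiguity) slots."""
    if not 0.0 <= contiguity_ratio <= 1.0:
        raise ValueError("contiguity_ratio must lie in [0, 1]")
    k_contiguity = round(total_events * contiguity_ratio)
    return total_events - k_contiguity, k_contiguity


def retrieve_events(similarity_scores, total_events, contiguity_ratio):
    """Pick the top-scoring events, then fill the contiguity slots with their neighbours."""
    k_sim, k_con = split_retrieval_budget(total_events, contiguity_ratio)
    ranked = sorted(range(len(similarity_scores)),
                    key=lambda i: similarity_scores[i], reverse=True)
    similar = ranked[:k_sim]
    chosen = set(similar)
    neighbours = []
    for idx in similar:
        # Hypothetical rule: consider the event after, then the event before.
        for nb in (idx + 1, idx - 1):
            in_range = 0 <= nb < len(similarity_scores)
            if in_range and nb not in chosen and len(neighbours) < k_con:
                neighbours.append(nb)
                chosen.add(nb)
    return similar, neighbours


# Six stored events with toy similarity scores to the current query.
scores = [0.1, 0.9, 0.2, 0.05, 0.8, 0.3]
similar, neighbours = retrieve_events(scores, total_events=4, contiguity_ratio=0.5)
print(similar, neighbours)
```

With a ratio of 0.5, only two of the four slots go to the highest-scoring events; raising the ratio further would leave fewer slots for directly relevant events, which is the effect discussed in the paragraph above.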

5 Discussion

Human studies

The surprise-based segmentation and boundary refinement processes in EM-LLM mirror key aspects of human event perception and memory formation. Our approach aligns with theories proposing that humans segment continuous experiences into discrete events based on prediction errors or moments of surprise [Zacks et al., 2007, Fountas et al., 2022]. This segmentation process is crucial for organising and later retrieving episodic memories efficiently. Indeed, significant correlations have been found between human event segmentation and prediction errors in both LLMs [Kumar et al., 2023] and video models [Fountas et al., 2022, Mariola et al., 2022]. Our results add to this growing body of evidence, demonstrating that LLM-based surprise can serve as a proxy for human event segmentation at multiple levels of hierarchical abstraction, and that the resulting event structure in EM-LLM’s attention heads correlates strongly with human-perceived events. This finding creates a more direct, low-level connection between LLM mechanisms and human cognitive processes. Furthermore, our model’s use of both similarity-based and temporally contiguous retrieval mechanisms parallels human memory retrieval patterns, allowing for the expression of robust phenomena found in human memory research [Howard and Kahana, 2002].

One such phenomenon is the temporal contiguity effect, where items experienced close together in time are often recalled together [Howard and Kahana, 2002]. Further experiments could deepen our understanding of the connections between EM-LLM and human episodic memory. First, following Michelmann et al. [2023b], one potential direction is to test whether the timing of the event boundaries or the degree of modularity per level that our method produces is closer on average to the human consensus than individual human subjects are. Second, we can explore the level at which different ratios of the contiguity buffer allow the human biases presented in Fig. 2A and the analysis in Ji-An et al. [2024] to be more easily reproduced. Finally, we could investigate how skewing event recall based on recency and originally-recorded surprise affects model performance and to what extent it produces biased behaviour found in studies of free recall.

In addition, the architecture of EM-LLM, with its differentiated context handling described in Section 3.1, invites further interesting comparisons to cognitive models of human memory beyond episodic. The group of tokens forming the local context, which holds the most recent and task-relevant information, shares characteristics with the concept of working memory. For instance, Baddeley [2003]’s influential model of working memory, which posits a limited-capacity system for temporary information storage and manipulation, bears similarities to our local context functionality. Yet, the analogy is not perfect. Our broader context window, including both local context and retrieved memories, might be more accurately compared to Ericsson and Kintsch [1995]’s concept of long-term working memory, which proposes a mechanism for rapid access to relevant information in long-term memory, extending beyond the traditional capacity limits of working memory. Alternatively, our architecture aligns well with Cowan [2001]’s embedded-processes model, where our local context could be likened to the limited-capacity “focus of attention” within working memory, while the full context window parallels the activated portion of long-term memory. Future work could explore these analogies more deeply, providing a flexible test-bed for rapidly exploring hypotheses about human memory, and potentially informing debates about capacity limits in working memory. Additionally, inspired by the multi-component nature of Baddeley’s model, one might explore the integration of modality-specific buffers within EM-LLM to enhance its performance on multi-modal tasks.

Machine learning

In refining event boundaries, we utilized modularity and conductance as metrics for evaluating community structure in the similarity graph of attention keys. While effective in our experiments, we acknowledge that numerous other methods for graph clustering and sequence segmentation could potentially be applied [Fortunato, 2010, Yang et al., 2016]. Our choice was motivated by their established theoretical foundations and computational efficiency, though comparative studies suggest performance can vary based on network characteristics [Yang et al., 2016]. Interestingly, our surprise-based initial boundary detection shares similarities with Bayesian online change-point detection [Adams and MacKay, 2007], suggesting potential avenues for integrating time series analysis techniques into LLM context processing. Future work could explore whether more sophisticated segmentation or clustering algorithms could improve EM-LLM’s performance, particularly for extremely long contexts or streaming data scenarios. Such investigations could enhance our model and contribute to understanding how information is structured and processed in LLMs, bridging the gap between traditional sequence analysis and LLM context processing.

Looking ahead, several more avenues for future research emerge from this work. One promising direction is to extend our surprise-based segmentation and boundary refinement processes to operate at each layer of the Transformer independently. This could lead to more nuanced and hierarchical representations of episodic memories, following the underlying semantic structure of the input more closely. Additionally, exploring how EM-LLM could be utilised to enable imagination and future thinking has great potential for advancing model-based reinforcement learning and continual learning techniques in LLMs. By leveraging its event-based structure to simulate potential future scenarios or recall past experiences in novel contexts, EM-LLM could enhance an LLM’s ability to plan, adapt, and learn continuously from new information.

6 Conclusion

In this work, we introduced EM-LLM, a novel and flexible architecture that integrates key aspects of human episodic memory and event cognition into transformer-based language models. Our approach enables LLMs to effectively process and utilise information from vastly extended contexts, far beyond their original training lengths. By combining surprise-based event segmentation with graph-theoretic boundary refinement, and a two-stage memory retrieval process, EM-LLM demonstrates superior performance on long-context tasks compared to state-of-the-art models. Crucially, our method requires no pre-training and can be readily applied to existing LLMs, offering a promising path towards virtually infinite context windows. This capability has the potential to revolutionise how we interact with LLMs, enabling continuous, personalized interactions over extended periods. Furthermore, the flexibility of our framework suggests it could serve as a viable alternative to traditional retrieval-augmented generation (RAG) techniques, especially when combined with efficient compression methods to reduce the memory requirements for the model’s KV cache.

In conclusion, EM-LLM represents a significant step forward in the development of language models with extended context-processing capabilities. By bridging insights from cognitive science with machine learning, our approach not only enhances the performance of LLMs on long-context tasks but also provides a scalable computational framework for testing hypotheses about human memory. We hope this study will inspire the community to expand research on the intersection between LLMs and human memory mechanisms.

References

Liu et al. [2024a] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157--173, 2024a.
Kazemnejad et al. [2024] Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36, 2024.
Tworkowski et al. [2023] Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Miłoś. Focused transformer: Contrastive training for context scaling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=s1FjXzJ0jy.
Lewis et al. [2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459--9474, 2020.
Gao et al. [2024] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024.
Wu et al. [2022] Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TrjbxzRcnf-.
Bertsch et al. [2023] Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley. Unlimiformer: Long-range transformers with unlimited length input. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=lJWUJWLCJo.
Xiao et al. [2024a] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Song Han, and Maosong Sun. Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory, 2024a.
Clewett et al. [2019] David Clewett, Sarah DuBrow, and Lila Davachi. Transcending time in the brain: How event memories are constructed from experience. Hippocampus, 29(3):162--183, 2019.
Zacks [2020] Jeffrey M Zacks. Event perception and memory. Annual review of psychology, 71:165--191, 2020.
Baldassano et al. [2017] Christopher Baldassano, Janice Chen, Asieh Zadbood, Jonathan W Pillow, Uri Hasson, and Kenneth A Norman. Discovering event structure in continuous narrative perception and memory. Neuron, 95(3):709--721, 2017.
Michelmann et al. [2023a] Sebastian Michelmann, Uri Hasson, and Kenneth A. Norman. Evidence that event boundaries are access points for memory retrieval. Psychological Science, 34(3):326--344, 2023a. doi:10.1177/09567976221128206. URL https://doi.org/10.1177/09567976221128206. PMID: 36595492.
Zacks et al. [2007] Jeffrey M Zacks, Nicole K Speer, Khena M Swallow, Todd S Braver, and Jeremy R Reynolds. Event perception: a mind-brain perspective. Psychological bulletin, 133(2):273, 2007.
Zacks et al. [2011] Jeffrey M Zacks, Christopher A Kurby, Michelle L Eisenberg, and Nayiri Haroutunian. Prediction error associated with the perceptual segmentation of naturalistic events. Journal of cognitive neuroscience, 23(12):4057--4066, 2011.
Roseboom et al. [2019] Warrick Roseboom, Zafeirios Fountas, Kyriacos Nikiforou, David Bhowmik, Murray Shanahan, and Anil K Seth. Activity in perceptual classification networks as a basis for human subjective time perception. Nature communications, 10(1):267, 2019.
Sinclair et al. [2021] Alyssa H. Sinclair, Grace M. Manalili, Iva K. Brunec, R. Alison Adcock, and Morgan D. Barense. Prediction errors disrupt hippocampal representations and update episodic memories. Proceedings of the National Academy of Sciences, 118(51):e2117625118, 2021. doi:10.1073/pnas.2117625118. URL https://www.pnas.org/doi/abs/10.1073/pnas.2117625118.
Fountas et al. [2022] Zafeirios Fountas, Anastasia Sylaidi, Kyriacos Nikiforou, Anil K. Seth, Murray Shanahan, and Warrick Roseboom. A Predictive Processing Model of Episodic Memory and Time Perception. Neural Computation, 34(7):1501--1544, 06 2022. ISSN 0899-7667. doi:10.1162/neco_a_01514. URL https://doi.org/10.1162/neco_a_01514.
Howard and Kahana [2002] Marc W Howard and Michael J Kahana. A distributed representation of temporal context. Journal of mathematical psychology, 46(3):269--299, 2002.
Ji-An et al. [2024] Li Ji-An, Corey Y. Zhou, Marcus K. Benna, and Marcelo G. Mattar. Linking in-context learning in transformers to human episodic memory, 2024.
Kumar et al. [2023] Manoj Kumar, Ariel Goldstein, Sebastian Michelmann, Jeffrey M Zacks, Uri Hasson, and Kenneth A Norman. Bayesian surprise predicts human event segmentation in story listening. Cognitive science, 47(10):e13343, 2023.
Rae et al. [2020] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH.
Bai et al. [2023] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
Katharopoulos et al. [2020] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156--5165. PMLR, 2020.
Munkhdalai et al. [2024] Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention, 2024.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Su et al. [2024] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. ISSN 0925-2312. doi:10.1016/j.neucom.2023.127063. URL https://www.sciencedirect.com/science/article/pii/S0925231223011864.
Chen et al. [2024a] Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, and Lidong Bing. CLEX: Continuous length extrapolation for large language models. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=wXpSidPpc5.
Xiong et al. [2023] Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.
Liu et al. [2024b] Xiaoran Liu, Hang Yan, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws of roPE-based extrapolation. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=JO7k0SJ5V6.
Peng et al. [2024] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u.
Ding et al. [2024] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024.
Press et al. [2021] Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
Chen et al. [2023] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
Jin et al. [2024] Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. Llm maybe longlm: Self-extend llm context window without tuning, 2024.
Chi et al. [2022] Ta-Chung Chi, Ting-Han Fan, Peter Ramadge, and Alexander Rudnicky. KERPLE: Kernelized relative positional embedding for length extrapolation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=hXzOqPlXDwm.
Li et al. [2024] Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=rR03qFesqk.
Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
Dao [2024] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec.
Han et al. [2024a] Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David Woodruff, and Amir Zandieh. Hyperattention: Long-context attention in near-linear time. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=Eh0Od2BJIM.
Aminabadi et al. [2022] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’22. IEEE Press, 2022. ISBN 9784665454445.
Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, pages 611--626, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702297. doi:10.1145/3600006.3613165. URL https://doi.org/10.1145/3600006.3613165.
Liu et al. [2024c] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ringattention with blockwise transformers for near-infinite context. In The Twelfth International Conference on Learning Representations, 2024c. URL https://openreview.net/forum?id=WsRHpHH4s0.
Brandon et al. [2023] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023.
Nawrot et al. [2024] Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, and Edoardo M Ponti. Dynamic memory compression: Retrofitting llms for accelerated inference. arXiv preprint arXiv:2403.09636, 2024.
Zhang et al. [2023] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Re, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2o: Heavy-hitter oracle for efficient generative inference of large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=RkRrPp7GKO.
Zhu et al. [2024] Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. PoSE: Efficient context window extension of LLMs via positional skip-wise training. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3Z1gxuAQrA.
Chen et al. [2024b] Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. LongloRA: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=6PmJoRfdaK.
Wang et al. [2023] Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=BryMFPQ4L6.
Ivgi et al. [2023] Maor Ivgi, Uri Shaham, and Jonathan Berant. Efficient long-text understanding with short-text models. Transactions of the Association for Computational Linguistics, 11:284--299, 2023. doi:10.1162/tacl_a_00547. URL https://aclanthology.org/2023.tacl-1.17.
Gershman et al. [2012] Samuel J Gershman, Christopher D Moore, Michael T Todd, Kenneth A Norman, and Per B Sederberg. The successor representation and temporal context. Neural Computation, 24(6):1553--1568, 2012.
Benna and Fusi [2021] Marcus K Benna and Stefano Fusi. Place cells may simply be memory cells: Memory compression leads to spatial tuning and history dependence. Proceedings of the National Academy of Sciences, 118(51):e2018422118, 2021.
Blundell et al. [2016] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.
Pritzel et al. [2017] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2827--2836. PMLR, 06--11 Aug 2017.
Coda-Forno et al. [2024] Julian Coda-Forno, Changmin Yu, Qinghai Guo, Zafeirios Fountas, and Neil Burgess. Leveraging episodic memory to improve world models for reinforcement learning. In Memory in Artificial and Real Intelligence (MemARI) Workshop at 36th Conference on Neural Information Processing Systems (NeurIPS 2022), 2024.
Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521--3526, 2017.
Lopez-Paz and Ranzato [2017] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/f87522788a2be2d171666752f97ddebb-Paper.pdf.
Chaudhry et al. [2019] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-GEM. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hkf2_sC5FX.
Buzzega et al. [2020] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15920--15930. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/b704ea2c39778f07c617f6b7ce480e9e-Paper.pdf.
Prabhu et al. [2020] Ameya Prabhu, Philip H. S. Torr, and Puneet K. Dokania. Gdumb: A simple approach that questions our progress in continual learning. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision -- ECCV 2020, pages 524--540, Cham, 2020. Springer International Publishing.
Das et al. [2024] Payel Das, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarath Swaminathan, Sihui Dai, Aurélie Lozano, Georgios Kollias, Vijil Chenthamarakshan, Soham Dan, et al. Larimar: Large language models with episodic memory control. arXiv preprint arXiv:2403.11901, 2024.
Spens and Burgess [2024] Eleanor Spens and Neil Burgess. A generative model of memory construction and consolidation. Nature Human Behaviour, pages 1--18, 2024.
Lu et al. [2022] Qihong Lu, Uri Hasson, and Kenneth A Norman. A neural network model of when to retrieve and encode episodic memories. elife, 11:e74445, 2022.
Sherman et al. [2022] Maxine T Sherman, Zafeirios Fountas, Anil K Seth, and Warrick Roseboom. Trial-by-trial predictions of subjective time from human brain activity. PLOS Computational Biology, 18(7):e1010223, 2022.
Zakharov et al. [2022a] Alexey Zakharov, Qinghai Guo, and Zafeirios Fountas. Variational predictive routing with nested subjective timescales. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=JxFgJbZ-wft.
Zakharov et al. [2022b] Alexey Zakharov, Qinghai Guo, and Zafeirios Fountas. Long-horizon video prediction using a dynamic latent hierarchy. arXiv preprint arXiv:2212.14376, 2022b.
Zakharov et al. [2021] Alexey Zakharov, Matthew Crosby, and Zafeirios Fountas. Episodic memory for subjective-timescale models. In ICML 2021 Workshop on Unsupervised Reinforcement Learning, 2021. URL https://openreview.net/forum?id=30lZDhrjonR.
Cowan [2001] Nelson Cowan. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and brain sciences, 24(1):87--114, 2001.
Xiao et al. [2024b] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=NG7sS51zVF.
Han et al. [2024b] Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. LM-infinite: Zero-shot extreme length generalization for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3991--4008, Mexico City, Mexico, June 2024b. Association for Computational Linguistics. URL https://aclanthology.org/2024.naacl-long.222.
Miasnikof et al. [2018] Pierre Miasnikof, Alexander Shestopaloff, Anthony Bonner, and Yuri Lawryshyn. A Statistical Performance Analysis of Graph Clustering Algorithms, pages 170--184. 05 2018. ISBN 978-3-319-92870-8. doi:10.1007/978-3-319-92871-5_11.
Newman and Girvan [2004] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.
Douze et al. [2024] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024.
Michelmann et al. [2021] Sebastian Michelmann, Amy R Price, Bobbi Aubrey, Camilla K Strauss, Werner K Doyle, Daniel Friedman, Patricia C Dugan, Orrin Devinsky, Sasha Devore, Adeen Flinker, et al. Moment-by-moment tracking of naturalistic learning and its underlying hippocampo-cortical interactions. Nature communications, 12(1):5394, 2021.
Lositsky et al. [2016] Olga Lositsky, Janice Chen, Daniel Toker, Christopher J Honey, Michael Shvartsman, Jordan L Poppenk, Uri Hasson, and Kenneth A Norman. Neural pattern change during encoding of a narrative predicts retrospective duration estimates. elife, 5:e16070, 2016.
Mariola et al. [2022] Alberto Mariola, Zafeirios Fountas, Lionel Barnett, and Warrick Roseboom. Event segmentation in continuous, naturalistic videos from model-based, data-driven, and human perspectives. 2022.
Michelmann et al. [2023b] Sebastian Michelmann, Manoj Kumar, Kenneth A Norman, and Mariya Toneva. Large language models can segment narrative events similarly to humans. arXiv preprint arXiv:2301.10297, 2023b.
Baddeley [2003] Alan Baddeley. Working memory: looking back and looking forward. Nature reviews neuroscience, 4(10):829--839, 2003.
Ericsson and Kintsch [1995] K Anders Ericsson and Walter Kintsch. Long-term working memory. Psychological review, 102(2):211, 1995.
Fortunato [2010] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75--174, 2010.
Yang et al. [2016] Zhao Yang, René Algesheimer, and Claudio J Tessone. A comparative analysis of community detection algorithms on artificial networks. Scientific reports, 6(1):30750, 2016.
Adams and MacKay [2007] Ryan Prescott Adams and David JC MacKay. Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742, 2007.
Newman [2004] Mark EJ Newman. Fast algorithm for detecting community structure in networks. Physical Review E—Statistical, Nonlinear, and Soft Matter Physics, 69(6):066133, 2004.
Panaretos and Zemel [2019] Victor M. Panaretos and Yoav Zemel. Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application, 6:405--431, 2019. ISSN 2326-831X. doi:10.1146/annurev-statistics-030718-104938. URL https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-030718-104938.

Appendix A Appendix / supplemental material

A.1 Supplementary figures

A.2 Complexity Analysis of EM-LLM Algorithm

Here, we provide a detailed analysis of the computational complexity of our Algorithm 1, focusing on the boundary refinement step and the calculation of modularity and conductance metrics.

Boundary Refinement Step

The boundary refinement step involves finding the optimal position $\hat{\beta}$ between each pair of consecutive initial boundaries $(\alpha,\beta)$ that optimizes the chosen metric function $f$. This step has the following components:

• Iteration over initial boundaries: $\mathcal{O}(k)$, where $k$ is the number of initial boundaries.
• Metric evaluation: for each pair of boundaries, we compute the metric function $f$ for all positions between $\alpha$ and $\beta$. In the worst case, this is $\mathcal{O}(n)$ operations per boundary pair.

Therefore, the overall complexity of this step is $\mathcal{O}(kn)$.
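The refinement loop described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `conductance` and `refine_boundaries`, the dense similarity matrix `sim`, and the choice to score each candidate unit $[\alpha,p)$ against the rest of the graph are all assumptions made for the example, and the naive metric evaluation here does not attempt to match the complexity figures quoted in this section.

```python
import numpy as np

def conductance(sim, start, end):
    """Conductance of the contiguous unit [start, end) in the similarity graph `sim`.

    cut = total similarity between the unit and every other token;
    volume = total similarity attached to the tokens on one side of the cut.
    """
    inside = np.zeros(sim.shape[0], dtype=bool)
    inside[start:end] = True
    cut = sim[inside][:, ~inside].sum()
    vol_in = sim[inside].sum()
    vol_out = sim[~inside].sum()
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else float("inf")

def refine_boundaries(sim, boundaries):
    """Move each initial boundary beta to the position in (alpha, beta]
    that gives the unit [alpha, position) the lowest conductance."""
    refined = []
    alpha = 0
    for beta in sorted(set(boundaries)):
        if beta <= alpha:
            continue  # nothing to search between alpha and beta
        candidates = range(alpha + 1, beta + 1)
        best = min(candidates, key=lambda p: conductance(sim, alpha, p))
        refined.append(best)
        alpha = beta  # the next search starts from the initial boundary (assumption)
    return refined

# Toy similarity graph: tokens 0-3 form one cluster, tokens 4-9 another.
labels = np.array([0] * 4 + [1] * 6)
sim = np.where(labels[:, None] == labels[None, :], 1.0, 0.05)
np.fill_diagonal(sim, 0.0)

refined = refine_boundaries(sim, [6])  # initial boundary is two tokens late
print(refined)
```

On this toy graph the initial boundary at token 6 is moved to token 4, the true edge between the two clusters. Swapping `min` for `max` and `conductance` for a modularity score would give the modularity variant under the same assumptions.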

Metric Function Computation

The metric functions (modularity or conductance) are computed at the level of individual memory units. For a memory unit of size $m$:

• Modularity: The naive computation involves summing over all pairs of nodes within the unit, resulting in a worst-case complexity of $\mathcal{O}(m^{2})$. However, in practice, the similarity graph is often sparse, meaning many node pairs have negligible similarity. Leveraging this sparsity, more efficient implementations can achieve $\mathcal{O}(l)$ complexity, where $l$ is the number of non-zero similarity edges within the unit [Newman, 2004]. Typically, $l$ is much smaller than $m^{2}$, especially for larger units, leading to significant computational savings.
• Conductance: This requires computing the sum of edge weights crossing the boundary and the total volume of the unit, which can be done in $\mathcal{O}(m)$ time.

Given that $m$ is typically much smaller than $n$ and varies for each unit, we can consider the average unit size $\bar{m}$ and average number of non-zero similarity edges $\bar{l}$. The total complexity for computing metrics across all units is then $\mathcal{O}(k\bar{l})$ for modularity (which in the worst case is $\mathcal{O}(k\bar{m}^{2})$, but typically much lower) or $\mathcal{O}(k\bar{m})$ for conductance.

Overall Complexity

Combining the boundary refinement step and metric computation, the overall complexity is:

• For modularity: $\mathcal{O}(kn+k\bar{m}^{2})$
• For conductance: $\mathcal{O}(kn+k\bar{m})$

Since typically $\bar{m}\ll n$, and since sparsity keeps the modularity cost well below its $\mathcal{O}(k\bar{m}^{2})$ worst case, the dominant term in both cases is $\mathcal{O}(kn)$. Therefore, we express the overall complexity of our algorithm as $\mathcal{O}(kn)$.

A.3 Analysis of human data

The human data released as part of Kumar et al. [2023] used Gaussian smoothing on the average signal across participants to define a probability distribution of likely event boundary positions with respect to timestamps in the podcast. In order to calculate our similarity metrics, as shown in Fig. 3A, we need to express this data in terms of discrete event positions with respect to tokens in the transcript. For fair comparison, we therefore selected the most likely positions in the distribution as human-annotated boundaries, taking exactly as many positions as our initial surprise-based event segmentation had identified in the transcript. Following the same process as Kumar et al. [2023], we then used their provided word onset times to translate these timestamps to token positions, allowing us to calculate our similarity metrics.

In Fig. 3B, we use the Wasserstein distance to compare the relative positions of event boundaries between human annotations and those found by our own methods. The Wasserstein distance is a versatile metric for comparing two probability distributions [Panaretos and Zemel, 2019]. We chose it to better capture the uncertainty present in the human data, and found it to give more meaningful results than standard correlation or discrete distance metrics, which showed very little difference between methods. To calculate it, we need to convert our own discrete boundary positions to a distribution across token positions. We did so by defining a Mixture of Gaussians (MoG), with each Gaussian corresponding to a single position. Note that, for fair comparison with human data, we apply the same process to the discrete version of the human-annotated positions described above, and use this for comparison.
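The conversion and comparison described above can be sketched as follows. This is an illustrative sketch only: the function names, the equal component weights, the bandwidth `sigma=2.0`, and the boundary positions in the example are all assumptions, since the text does not specify them. The distance is computed with the standard one-dimensional identity, in which the Wasserstein-1 distance equals the summed absolute difference of the two cumulative distributions on a shared, unit-spaced grid.

```python
import numpy as np

def boundaries_to_distribution(positions, n_tokens, sigma=2.0):
    """Mixture of Gaussians over token positions 0..n_tokens-1,
    with one equal-weight component centred on each boundary."""
    grid = np.arange(n_tokens)
    centres = np.asarray(positions, dtype=float)[:, None]
    density = np.exp(-0.5 * ((grid[None, :] - centres) / sigma) ** 2).sum(axis=0)
    return density / density.sum()  # normalise to a probability distribution

def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions on the same unit-spaced grid."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

# Hypothetical boundary positions (token indices), for illustration only.
n_tokens = 100
human = boundaries_to_distribution([20, 60], n_tokens)
model_a = boundaries_to_distribution([22, 61], n_tokens)   # close to the human boundaries
model_b = boundaries_to_distribution([35, 80], n_tokens)   # far from the human boundaries

dist_a = wasserstein_1d(human, model_a)
dist_b = wasserstein_1d(human, model_b)
print(dist_a, dist_b)
```

Because the distance is measured in token positions, a segmentation whose boundaries sit a few tokens away from the human ones scores much lower than one whose boundaries are displaced by tens of tokens, which is the graded behaviour that discrete match-based metrics lack.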

A.4 Approximate equivalence of K-nearest neighbours and softmax attention

Here we attempt to show that using k-NN retrieval over a key-value cache as part of the attention mechanism in transformers approximates applying softmax attention over the entire sequence of tokens.

Let $q$ be a query vector and $K=\{k_{1},k_{2},\dots,k_{n}\}$ the set of key vectors in a transformer model with dimensionality $d$. Each key $k_{i}$ has a corresponding value vector $v_{i}$, with $V=\{v_{1},v_{2},\dots,v_{n}\}$. The softmax attention weights $a_{i}$ are defined as:

$$a_{i}=\frac{\exp(q\cdot k_{i}\,d^{-\frac{1}{2}})}{\sum_{j=1}^{n}\exp(q\cdot k_{j}\,d^{-\frac{1}{2}})}\tag{6}$$

The output vector $u$ is computed as:

$$u=\sum_{i=1}^{n}a_{i}v_{i}\tag{7}$$

In the k-NN approach, a subset $K^{\prime}\subseteq K$ of size $k$ is selected, containing the keys nearest to $q$. The approximated attention weights $a^{\prime}_{i}$ over this subset are:

$$a^{\prime}_{i}=\frac{\exp(q\cdot k_{i}\,d^{-\frac{1}{2}})}{\sum_{k_{j}\in K^{\prime}}\exp(q\cdot k_{j}\,d^{-\frac{1}{2}})}\quad\text{for }k_{i}\in K^{\prime}\tag{8}$$

The approximate output vector $u^{\prime}$ is:

$$u^{\prime}=\sum_{k_{i}\in K^{\prime}}a^{\prime}_{i}v_{i}\tag{9}$$
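The two output vectors defined above can be compared numerically. The following is a minimal sketch on synthetic data, not an experiment from the paper: the function names, the random keys and values, and the query construction (a scaled copy of one key, chosen so that the softmax is sharply peaked) are all assumptions made for illustration.

```python
import numpy as np

def softmax_attention(q, K, V):
    """Full softmax attention of one query over all n keys (Eqs. 6-7)."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ V, weights

def knn_attention(q, K, V, k):
    """Softmax attention restricted to the k keys with the largest q.k (Eqs. 8-9)."""
    scores = K @ q / np.sqrt(q.shape[0])
    top = np.argsort(scores)[-k:]            # indices of the k nearest keys
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()
    return weights @ V[top], top

rng = np.random.default_rng(0)
n, d, k = 1000, 64, 16
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
q = 3.0 * K[0]  # synthetic query aligned with one key, so attention is peaked

u, a = softmax_attention(q, K, V)
u_knn, top = knn_attention(q, K, V, k)
print("attention mass captured by K':", a[top].sum())
print("||u' - u||:", np.linalg.norm(u_knn - u))
```

With a flatter score distribution (for example, a random query unrelated to any key) the captured mass drops and the gap between the two outputs grows, which is why the argument that follows depends on the softmax being sharply peaked.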

Assumptions

1. Exponential Dominance: The exponential function in the softmax is sharply peaked, implying that keys with the highest similarities to $q$ contribute significantly more to the sum than others.
2. Representativeness of k-NN Subset: The subset $K^{\prime}$ captures the majority of the attention weight from the full set $K$.

Lemma 1: Dominance of k-NN Subset

If $K^{\prime}$ consists of the $k$ keys with the highest dot products $q\cdot k_{i}$, then:

$$\sum_{k_{j}\in K^{\prime}}\exp(q\cdot k_{j}\,d^{-\frac{1}{2}})\geq\alpha\sum_{j=1}^{n}\exp(q\cdot k_{j}\,d^{-\frac{1}{2}})\tag{10}$$

for some $\alpha\leq 1$ that is typically very close to 1.

Proof: This follows from the exponential dominance assumption and the nature of the exponential function, which is sharply peaked.

Lemma 2: Approximation of Output Vector

Given the dominance of $K^{\prime}$ as shown in Lemma 1, the approximate output $u^{\prime}$ effectively represents the full output $u$:

$$\|u^{\prime}-u\|\leq\epsilon\tag{11}$$

where $\epsilon$ is a small error term.

Proof: Follows from the weighted sum structure of $u$ and $u^{\prime}$, using the bounds established in Lemma 1.
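To make the error term in Lemma 2 concrete, the following sketch works through one possible bound. It assumes, beyond what is stated above, that the value vectors are bounded, with $V_{\max}=\max_{i}\|v_{i}\|$, and it writes $S=\sum_{k_{j}\in K^{\prime}}a_{j}$ for the attention mass captured by $K^{\prime}$, so that $S\geq\alpha$ by Lemma 1. Both symbols are introduced here for illustration only.

```latex
\begin{align*}
a^{\prime}_{i} &= \frac{a_{i}}{S} \quad \text{for } k_{i}\in K^{\prime},\\
u^{\prime}-u &= \sum_{k_{i}\in K^{\prime}}\left(a^{\prime}_{i}-a_{i}\right)v_{i}
               \;-\; \sum_{k_{i}\notin K^{\prime}}a_{i}v_{i},\\
\|u^{\prime}-u\| &\leq V_{\max}\left(\sum_{k_{i}\in K^{\prime}}a_{i}\left(\frac{1}{S}-1\right)
               + \sum_{k_{i}\notin K^{\prime}}a_{i}\right)
               = V_{\max}\bigl((1-S)+(1-S)\bigr)\\
             &= 2(1-S)\,V_{\max} \;\leq\; 2(1-\alpha)\,V_{\max}.
\end{align*}
```

Under these assumptions one may take $\epsilon=2(1-\alpha)V_{\max}$, which vanishes as $\alpha\to 1$.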

Given the lemmas and under the stated assumptions, the k-NN retrieval mechanism within a key-value cache effectively approximates the softmax attention mechanism in transformers. This argument highlights the efficiency versus accuracy trade-off inherent in using approximate methods like k-NN retrieval.