Retrieval-Enhanced Machine Learning: Synthesis and Opportunities

To Eun Kim (ORCID 0000-0002-2807-1623), Carnegie Mellon University, PA 15213, United States; Alireza Salemi (ORCID 0009-0006-1937-2615), University of Massachusetts Amherst, MA 01003, United States; Andrew Drozdov (ORCID 0000-0002-1025-5715), University of Massachusetts Amherst, MA 01003, United States; Fernando Diaz (ORCID 0000-0003-2345-1288), Carnegie Mellon University, PA 15213, United States; and Hamed Zamani (ORCID 0000-0002-0800-3340), University of Massachusetts Amherst, MA 01003, United States
Abstract.

In the field of language modeling, models augmented with retrieval components have emerged as a promising solution to several challenges faced in natural language processing (NLP), including knowledge grounding, interpretability, and scalability. Despite the primary focus on NLP, we posit that the paradigm of retrieval enhancement can be extended to a broader spectrum of machine learning (ML), such as computer vision, time series prediction, and computational biology. Therefore, this work introduces a formal framework for this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature across various ML domains using consistent notation, which is missing from the current literature. We also find that, while a number of studies employ retrieval components to augment their models, there is a lack of integration with foundational Information Retrieval (IR) research. We bridge this gap between seminal IR research and contemporary REML studies by investigating each component that comprises the REML framework. Ultimately, the goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.

journal: CSUR

1. Introduction

1.1. Background

In recent years, the research landscape surrounding large language models (LLMs) has witnessed substantial growth, underscored by the profound potential these models hold for various natural language processing (NLP) tasks. One of the significant advancements that has propelled this field forward is the scaling of the number of parameters of LLMs, which has enabled the training of models with unprecedented size and complexity (Zhao et al., 2023). We witness a similar trend in other fields adjacent to machine learning, for example, large vision foundation models for representing images and videos (Dosovitskiy et al., 2021; Arnab et al., 2021). Concurrently, the notion of in-context learning (ICL) (Dong et al., 2023) has emerged as a transformative capability, allowing LLMs to dynamically adapt to and incorporate new information during inference. In parallel, the information retrieval (IR) community has been actively exploring techniques aimed at improving the efficiency, effectiveness, and robustness of accessing information from large-scale collections.

The convergence of these two domains has given rise to a new trend in research, where models are equipped with retrieval capabilities to access external knowledge during both training and inference stages (Mialon et al., 2023; Zamani et al., 2022). This integration of retrieval mechanisms into the prediction pipeline has gained significant traction, as it allows models to ground their predictions in external knowledge without necessitating an increase in model capacity. Methods presented by Hashemi et al. (2020) and Lewis et al. (2020b) are among the earliest work in this space; the former focuses on retrieval-augmented representation learning by extending the transformer network, while the latter studies the paradigm of retrieval-augmented generation (RAG) for knowledge-intensive language tasks. That said, using retrieval results to improve machine learning systems is not new. Pseudo-relevance feedback methods (methods for representing search queries using the top retrieved documents) are perhaps the first set of methods in this category (Attar and Fraenkel, 1977; Croft and Harper, 1979). The ICL ability inherent in LLMs has played a pivotal role in facilitating the dissemination and adoption of these retrieval-augmented approaches. By integrating retrieved documents into the prompt of an LLM, researchers have been able to harness external knowledge sources without fundamentally altering the underlying model architecture.

1.2. Motivation

Since improving model performance by increasing the number of parameters is not sustainable, one motivation of retrieval-based models stems from the finding that, while large models tend to memorize training data (Carlini et al., 2021), incorporating retrieval-based methods can effectively transfer the burden of memorization to external storage systems (Borgeaud et al., 2022). We advocate for enhancing machine learning (ML) models in general (i.e., beyond generation) with the ability to employ stored information via information retrieval techniques. IR has already shown its merits in aiding human interaction with vast text databases. We posit that IR’s utility can be broadened to enable machine access not only to extensive text databases but also to knowledge represented in more abstract forms. By integrating ML architectures with direct access to IR systems, we aim to separate the processes of reasoning and memory. Zamani et al. (2022) termed this approach retrieval-enhanced machine learning (REML), a broader concept that extends ML. Extending their work, we further survey the recent advances of REML in the field of ML, including NLP, with consistent mathematical notation. By doing so, we aim to equip researchers with a comprehensive and structured overview of REML methodologies, enabling them to swiftly embark on research within this domain.

1.3. Applications of REML

The landscape of the REML paradigm encompasses a diverse array of sub-domains, each with its unique set of challenges and applications. This includes seminal work in language modeling (Guu et al., 2020; Lewis et al., 2020b; Borgeaud et al., 2022; Izacard and Grave, 2021b; Zhong et al., 2022; Izacard et al., 2024; Ram et al., 2023; Wang et al., 2023b; Shi et al., 2024; Li et al., 2022; Khandelwal et al., 2020; Lyu et al., 2023), machine translation (Khandelwal et al., 2021), question answering (Yu et al., 2022a; Chen et al., 2017a; Lee et al., 2019; Nakano et al., 2022; Lazaridou et al., 2022; Wu et al., 2022d; Chen et al., 2023c, d; Zhang et al., 2024), fact verification (Lewis et al., 2020b; Petroni et al., 2023; Chen et al., 2023b), open-domain (Shuster et al., 2022b, a; Komeili et al., 2022; Thoppilan et al., 2022) and task-oriented (Thulke et al., 2021; Cai et al., 2023; Eric et al., 2017; Raghu et al., 2021; Nekvinda and Dušek, 2022) dialog systems, slot filling (Glass et al., 2021), state tracking (King and Flanigan, 2023), multimodal dialog (Fan et al., 2021), reinforcement learning (Fernández and Veloso, 2006; Goyal et al., 2022; Humphreys et al., 2022), computer vision (Chen et al., 2023a; Yasunaga et al., 2023; Ramos et al., 2023; Shrestha et al., 2024), commonsense reasoning (Yu et al., 2022c), evidence attribution (Gao et al., 2023b; Aksitov et al., 2023; Menick et al., 2022; Huo et al., 2023; Gao et al., 2023a), knowledge-graph augmentation (Sen et al., 2023; Kang et al., 2023; Baek et al., 2023; Ju et al., 2022; Hu et al., 2023; Zhang et al., 2022b; Yu et al., 2022a), ranking (Hui et al., 2022a), personalization (Salemi et al., 2024b, a), mathematical problem-solving (Yang et al., 2024b), code generation (Zhang et al., 2023; Zhou et al., 2023; Wang et al., 2024a), representation learning for audio and speech (Sanabria et al., 2023; Lin et al., 2024), time series forecasting (Jing et al., 2022; Yang et al., 2022), and protein structure prediction (Cramer, 2021). The industry and open source communities have swiftly adopted retrieval-based models, recognizing their potential for accelerated adaptation and performance enhancement. Frameworks such as LangChain (https://www.langchain.com), LlamaIndex (https://www.llamaindex.ai), and DSPy (Khattab et al., 2024) have emerged, streamlining the implementation of retrieval-based models. This broad spectrum of domains (not an exhaustive list) underscores the versatility and impact of the REML paradigm across diverse applications.

1.4. Main Contributions of This Work

Although many current applications are centered around natural language processing, we believe that ML models that leverage retrieval components are not confined to language models, but can be extended to any ML model. To address this broader applicability, we formalize the framework as Retrieval-Enhanced Machine Learning (REML) and synthesize existing studies with consistent mathematical notation, which is lacking in the current literature. Also, despite the advancements in REML models, the rich and extensive body of work from information retrieval research remains significantly underutilized, even though it offers numerous methodologies and insights that can substantially benefit REML models. This work aims to bridge this gap by integrating IR research into the design of REML models. Ultimately, we hope this work will enable researchers across various fields leveraging ML to easily understand the framework of REML and its extensibility.

2. Retrieval-Enhanced Machine Learning

| Notation | Description |
|---|---|
| $x$ | input instance |
| $\mathcal{X}$ | input space |
| $y$ | output target |
| $\mathcal{Y}$ | output space |
| $\mathrm{L}$ | labeled data (i.e., $\mathrm{L} \subset \mathcal{X} \times \mathcal{Y}$) |
| $\mathrm{U}$ | unlabeled data (i.e., $\mathrm{U} \subset \mathcal{X}$) |
| $\mathcal{L}$ | the downstream loss function |
| $f_\theta$ | a predictive machine learning model parameterized by $\theta$ |
| $g_\omega$ | a retrieval model parameterized by $\omega$ |
| $\mathcal{M}$ | a tuple of predictive and retrieval model |
| $d$ | retrieval item |
| $\mathcal{D}$ | retrieval space |
| $\mathcal{D}_{\text{key}}$ | retrieval key space |
| $\mathrm{C}$ | stored retrievable items |
| $q$ | query |
| $\mathcal{Q}$ | query space |
| $r$ | retrieval results |
| $\mathcal{R}$ | retrieval result space |
| $s$ | model feedback |
| $\mathcal{S}$ | model feedback space |
| $\mu$ | evaluation metric |

Table 1. Notations used in this paper to synthesize REML research.

To begin an in-depth exploration of retrieval-enhanced machine learning (REML), we reiterate the generalized formal definition of the task set by Zamani et al. (2022). Like all predictive machine learning frameworks, subsequently referred to as ML models, REML is tasked with learning a functional relationship that maps an input space $\mathcal{X}$ to an output space $\mathcal{Y}$. Unlike other ML models, REML predicts outcomes through interactions with one or more information access models, each facilitating access to a database or repository of knowledge. Hence, REML is formally articulated as $y = f_\theta(x; g_{\omega_1}, g_{\omega_2}, \cdots, g_{\omega_N})$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ denote the input instance and target output, respectively, $f_\theta$ represents an ML model parameterized by $\theta$, and $g_{\omega_i}$ represents an information access model parameterized by $\omega_i$. Here, $N$ signifies the total number of information access models that $f_\theta$ can consult. Each $g_{\omega_i}$ is associated with a collection, repository, or memory $\mathrm{C}_i$, which might consist of natural language texts or alternative indexed representations. Consequently, the collection $\mathrm{C}_i$ serves as a versatile array of parameters available to $g_{\omega_i}$, which may be employed ad hoc, in the same way as many non-parametric and lazy learning techniques. Zamani et al. (2022) outline three necessary requirements (Reqs) and two optional properties (Opts) for REML models.

  1. Req 1: Querying. Every $f_\theta$ should be able to generate input-dependent queries directed towards the $g_{\omega_i}$ models. Refer to §3.

  2. Req 2: Retrieval. Every $g_{\omega_i}$ must be capable of processing the queries from $f_\theta$ and fetching pertinent information from its corresponding repository $\mathrm{C}_i$. Refer to §4.

  3. Req 3: Response Utilization. Every $f_\theta$ should utilize the information obtained from the $g_{\omega_i}$ models in its prediction-making process. Refer to §5.

  4. Opt 1: Storing. $f_\theta$ may archive some information in $\mathrm{C}_i$ for later retrieval, applicable during both training and inference. Refer to §6.

  5. Opt 2: Feedback. $f_\theta$ may offer feedback to $g_{\omega_i}$ during training and inference to improve $f_\theta$, $g_{\omega_i}$, or both. Refer to §7.

The simplest form of REML model is depicted in Figure 1(a) by focusing solely on the essential criteria. The second category, illustrated in Figure 1(b), utilizes the first optional property by storing information in a storage for subsequent retrieval. The third category, presented in Figure 1(c), employs the second optional property, offering feedback to the information access models. The final category incorporates all optional properties, as detailed in Figure 1(d).

Based on these requirements, Zamani et al. (2022) proposed a comprehensive framework for REML, as illustrated in Figure 2. This framework is structured around two principal components: the predictive model $f_\theta$ and the information retrieval models $g_{\omega_i}$. For any given input $x$, the predictive model $f_\theta$ has the flexibility to initiate multiple retrieval operations. This could involve dispatching multiple queries, engaging with numerous data repositories, offering feedback to the information retrieval components, or employing a mix of these strategies. Notably, for certain inputs the number of retrieval operations may be zero, so REML generalizes conventional predictive modeling.
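The three necessary requirements and the storing option can be illustrated with a minimal Python sketch. It is not the framework of Zamani et al. (2022) itself: the class `InformationAccessModel`, the function `reml_predict`, and the term-overlap scoring are hypothetical names and simplifications introduced only to make the control flow of $y = f_\theta(x; g_{\omega_1}, \cdots, g_{\omega_N})$ concrete.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InformationAccessModel:
    """A toy g_omega paired with its collection C (hypothetical interface)."""
    collection: List[str]

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        # Req 2: score each stored item by term overlap with the query (toy scoring).
        terms = set(query.lower().split())
        scored = [(len(terms & set(d.lower().split())), d) for d in self.collection]
        scored = [(s, d) for s, d in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [d for _, d in scored[:k]]

    def store(self, item: str) -> None:
        # Opt 1: the predictive model may archive items for later retrieval.
        self.collection.append(item)


def reml_predict(
    x: str,
    retrievers: List[InformationAccessModel],
    make_queries: Callable[[str], List[str]],
    utilize: Callable[[str, List[str]], str],
) -> str:
    """Compute y = f(x; g_1, ..., g_N) with explicit query/retrieve/utilize steps."""
    results: List[str] = []
    for g in retrievers:
        for q in make_queries(x):          # Req 1: input-dependent queries
            results.extend(g.retrieve(q))  # Req 2: retrieval from C_i
    return utilize(x, results)             # Req 3: response utilization


g = InformationAccessModel(["paris is the capital of france", "berlin is in germany"])
answer = reml_predict(
    "capital of france",
    [g],
    make_queries=lambda x: [x],
    utilize=lambda x, r: r[0] if r else "unknown",
)
```

If `make_queries` returns an empty list, no retrieval is performed and `utilize` receives only the input, which corresponds to the case where the number of retrieval operations is zero.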

(a) Retrieval-only (b) Retrieval with memory (c) Retrieval with feedback (d) Retrieval with memory & feedback
Figure 1. Retrieval-enhanced machine learning models should implement three necessary requirements (querying, retrieval, and response utilization) and may implement two optional properties (storing information and providing feedback to the information access model). This results in four categories of REML models presented above. Figure is taken from (Zamani et al., 2022).
Figure 2. A generic framework for REML (Zamani et al., 2022). The multiplicative nature of the information access process implies that access to the information can be distributed and/or done iteratively. Note that the components do not have to be completely separated; e.g., the Query Generation or Response Processing module can be handled within the Predictive Model. In the abstract, however, we consider each of them a component of the information access process that can be described separately.

3. Querying

Querying refers to how requests are represented and constructed. How do we define $\mathcal{Q}$? How does $f_\theta$ construct $q \in \mathcal{Q}$ from $x$?

In REML, the process of acquiring information hinges upon the act of querying a knowledge or information repository. Consequently, the formulation of a query from the input, whether unstructured or structured, stands as the pivotal initiation point for the interplay between predictive and retrieval models within the REML framework. The following sections introduce common operations employed to generate queries based on the task’s input.

3.1. Deciding Where to Query

Before sending a query to an information access system, a REML system can decide where its query should be sent. Bearing in mind that each $g_{\omega_i}$ is associated with $\mathrm{C}_i$ (the multiplicative nature of the information access process depicted in Figure 2), the Query Generation module first decides which tuple(s) of $g_{\omega_i}$ and $\mathrm{C}_i$ should be selected depending on the context, which aligns with the mixture-of-experts-like interpretation described by Pan et al. (2023). This query decision problem can be understood through the following sub-problems: 1) Corpus Selection: deciding which corpora need to be searched (can be null when no retrieval is needed), and 2) Retriever Selection: deciding which retrieval model should be used to search over the chosen corpora. [Footnote 3: In KiC (Pan et al., 2023), a router selects an expert predictive model (not a retriever) with a specific knowledge source. However, we think that this helps the understanding of the first step of the Query Generation module.]

3.1.1. Corpus Selection

Selecting $\mathrm{C}_i$ concerns not only what kind of external information should be provided to $f_\theta$ (Pan et al., 2023) but also when to query (no corpus is selected when retrieval is not beneficial to $f_\theta$). This can be a critical question, as retrieval augmentation can hurt performance for certain input types (Maekawa et al., 2024; Mallen et al., 2023; Asai et al., 2024). Skipping unnecessary retrieval can also save computational resources by reducing the number of searches (Labruna et al., 2024). Several criteria can be used to decide whether external information is beneficial to the predictive model: the decision can be based on term popularity (Mallen et al., 2023), input complexity (Jeong et al., 2024), or a trained model (Asai et al., 2024).
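As a rough illustration of a popularity-based criterion, the sketch below skips retrieval when every entity mentioned in the input is popular and selects a corpus otherwise. The function name `select_corpus`, the popularity table, the threshold value, and the corpus name are all assumptions made for illustration; this is not the exact procedure of any of the cited works.

```python
from typing import Dict, List, Optional


def select_corpus(
    entities: List[str],
    popularity: Dict[str, int],
    threshold: int,
    corpus_name: str = "wiki",
) -> Optional[str]:
    """Return the corpus to search, or None when retrieval is skipped.

    Assumption: entities below the popularity threshold (or unseen ones)
    are treated as long-tail knowledge that the model is unlikely to have
    memorized, so retrieval is triggered for them.
    """
    rare = [e for e in entities if popularity.get(e, 0) < threshold]
    return corpus_name if rare else None


# Hypothetical popularity counts (e.g., page views), for illustration only.
popularity = {"Paris": 50_000, "Kiribati": 120}
skip = select_corpus(["Paris"], popularity, threshold=1_000)
search = select_corpus(["Kiribati"], popularity, threshold=1_000)
```

Returning `None` corresponds to the null corpus selection described above, in which the predictive model answers without consulting any retriever.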

3.1.2. Retriever Selection

Once one or more $\mathrm{C}_i$ are selected, a REML system can benefit further by choosing the optimal retriever specialized in searching the selected corpora. This is a relatively new and challenging task, and we refer readers to Khramtsova et al. (2023, 2024) for a deeper understanding.

3.2. Reformulating the Input

In many cases, raw user input cannot be used directly as a query to the retrieval model, so the input must be reformulated into a different representation. This reformulation is performed by a separate component or by the predictive model itself, depending on the specific requirements of the system. The general equation for transformation is

(1) $q = \textit{transform}_q(x, \textit{context})$

where $x$ is the input to which the transformation is applied (i.e., the original input to the task, the previous search query, the output of another transformation, etc.) and $\textit{context}$ is side information that this function can use in performing the transformation. For example, $\textit{context}$ can be a user profile, which can help personalize this transformation for a user (Salemi et al., 2024b). Common motivations for transforming the input into an alternative format include, but are not limited to, compression (e.g., truncation), expansion, and conversion.

3.2.1. Compression

In certain scenarios, not all words or components of the input prove relevant for the search objective. Consequently, the common practice of omitting specific segments of the input has been employed in numerous prior studies. In the majority of cases, sequence-to-sequence models are trained to identify and mark the segments that require reduction (Ni et al., 2019; Musa et al., 2019; Yadegari et al., 2022; Khashabi et al., 2017). At times, a straightforward approach such as segmenting the input into distinct chunks and utilizing these segments as queries can be highly effective (Borgeaud et al., 2022). In multi-modal search scenarios, omitting a specific modality from the input and conducting searches across a corpus from different modalities has proven to be valuable (Gui et al., 2022).

3.2.2. Expansion

In certain scenarios, the input alone may lack essential information required by the search system to yield the desired results. In such situations, the input can be augmented with additional pertinent data. This process of expansion broadens the context and enhances the search system’s capability to retrieve relevant and meaningful results, thereby improving overall system performance. Typically, expansion is achieved by concatenating the input with previously retrieved results (Zhu et al., 2021; Xiong et al., 2021) or with generated text that is conditioned on the input (Wang et al., 2023d; Mao et al., 2021; Liu et al., 2022; Chuang et al., 2023).

3.2.3. Conversion

In some situations, reshaping the input into a new query based on its inherent structure, instead of mere expansion, proves advantageous. This approach is particularly valuable when crafting structured queries for databases (Arcadinho et al., 2022; Dou et al., 2023) and API access (Schick et al., 2023; Qin et al., 2023; Ouyang et al., 2022; Jin et al., 2024).

The conversion operation may result in a transformation of the input space $\mathcal{X}$ into the query space $\mathcal{Q}$. Consequently, it is essential to note that $\mathcal{Q}$ and $\mathcal{X}$ are not necessarily equivalent, signifying that the transformed queries might operate in a distinct space compared to the original input. In complex multi-modal search scenarios, employing the input directly might not be viable. Transforming the input into a different modality then becomes imperative, ensuring seamless and efficient communication between the predictive model and the search system (Gao et al., 2022; Lin and Byrne, 2022; Wu and Mooney, 2022; Lin et al., 2023; Salemi et al., 2023a).

Moreover, transforming the input from the original input space into the latent space of a language model and retrieving information from the model’s prior interactions with data, like what happens in kNN-LM (Khandelwal et al., 2020), represent additional transformations that alter the query space (Chen et al., 2022; Yogatama et al., 2021; He et al., 2021; Khandelwal et al., 2021; Kassner and Schütze, 2020). Indeed, Neural Turing Machines (Graves et al., 2014; Gulcehre et al., 2017; Rae et al., 2016) and Memory Transformers (Zhong et al., 2022; Wu et al., 2022a; Wan et al., 2022) employ a similar conversion process to translate input into a latent variable. This transformation is essential for enabling effective retrieval from the memory/storage component of these models.

3.3. Decomposing the Input

This category involves breaking down a complex input into simpler parts, often to better understand the content and retrieve more accurate results. This technique is particularly useful when dealing with long and complex inputs that cover multiple topics or concepts (Min et al., 2019; Perez et al., 2020; Zhou et al., 2022). The general equation for decomposition is

(2) $Q = \textit{decompose}(x, \textit{context})$

where $x$ is the input that should be decomposed (i.e., the original input to the task, the output of a transformation, etc.), $\textit{context}$ is side information that this function can use in performing the decomposition, and $Q$ is the set of decomposed queries. Note that decomposition returns a set of queries, in contrast with the transformation operation, which returns only a single query.

3.4. Unified Equation for Query Generation

The unified equation for query generation is therefore

(3) $Q = \textit{decompose}(\textit{transform}_q(x, \textit{context}), \textit{context})$

Any of these functions can be replaced with the identity function (Izacard and Grave, 2021b; Karpukhin et al., 2020; Asai et al., 2022; Yamada et al., 2021; Lewis et al., 2020b, b; Thorne et al., 2018; Guu et al., 2020), regardless of modality (Salemi et al., 2023b; Qu et al., 2021). Furthermore, these functions can be applied multiple times and in various orders. Particularly in multi-hop question answering and fact verification, prior research extensively employs multiple transformations and decompositions to fulfill the task requirements (Qi et al., 2019; Trivedi et al., 2023; Yadav et al., 2020; Das et al., 2019; Jiang et al., 2023). Given the intricacies of these tasks, leveraging a combination of various transforms and decompositions becomes essential.
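The composition in Equation (3) can be sketched in Python as follows. The helper names (`identity_transform`, `expand_with_profile`, `split_on_and`, `generate_queries`) and the string-based heuristics are hypothetical simplifications; in practice both steps are typically implemented with learned models.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]


def identity_transform(x: str, context: Context) -> str:
    # The identity case: the raw input is used as the query.
    return x


def expand_with_profile(x: str, context: Context) -> str:
    # Toy expansion: append side information (a hypothetical user profile field).
    profile = context.get("profile", "")
    return f"{x} {profile}".strip()


def no_decomposition(x: str, context: Context) -> List[str]:
    # The identity case for decomposition: a single-query set.
    return [x]


def split_on_and(x: str, context: Context) -> List[str]:
    # Toy decomposition: split a conjunctive input into sub-queries.
    parts = [p.strip() for p in x.split(" and ")]
    return [p for p in parts if p]


def generate_queries(
    x: str,
    context: Context,
    transform: Callable[[str, Context], str] = identity_transform,
    decompose: Callable[[str, Context], List[str]] = no_decomposition,
) -> List[str]:
    """Q = decompose(transform_q(x, context), context)."""
    return decompose(transform(x, context), context)


single = generate_queries("who founded CMU", {})
multi = generate_queries(
    "who founded CMU and when was it founded", {}, decompose=split_on_and
)
personalized = generate_queries(
    "best laptops", {"profile": "student"}, transform=expand_with_profile
)
```

Because both steps are passed in as functions, either can be swapped for the identity or chained repeatedly, which mirrors the multi-hop setting where transformations and decompositions are applied several times.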

4. Searching

Searching refers to how requests and items are combined to construct the retrieval results. How does $g_\omega$ generate $r \in \mathcal{R}$ from $q \in \mathcal{Q}$?

Depending on the nature of the documents, the queries, and the tasks, different search functionalities are required and expected. For instance, in some task-oriented dialogue systems, $f_\theta$ (the conversational agent) requires access to relational databases. Therefore, in such scenarios, structured queries such as SQL are used for searching. That being said, retrieval items in most applications are in the form of semi-structured or unstructured text or involve multi-modal aspects. In the following, we review different retrieval models for these situations.

4.1. Retrieval Models with Sparse Representations

Many text-based retrieval models use sparse representations for representing queries and documents. For instance, term-matching (lexical) retrieval models, such as TF-IDF (Salton and Buckley, 1988), BM25 (Robertson et al., 1995), and query likelihood (Ponte and Croft, 1998), represent each query and document using a $V$-dimensional sparse vector, where $V$ denotes the vocabulary size. In these models, the dimensions associated with the terms that appear in the given text carry non-zero values and the rest are zero. Most of these models are based on the term independence, or bag-of-words, assumption. That being said, models that consider term position and ordering exist, such as higher-order language models (Song and Croft, 1999), positional language models (Lv and Zhai, 2009), and sequential dependency models (Metzler and Croft, 2005).

The sparsity of the representations in these retrieval models enables them to use an inverted index data structure for scalable and efficient retrieval. Note that these models often suffer from the vocabulary mismatch problem: when the query and the document use different vocabulary for the same concept, the shared concept does not contribute to the estimated relevance score. This can significantly impact the performance of $g_\omega$, especially from the recall perspective. Query expansion and document expansion approaches exist to address the vocabulary mismatch problem, including pseudo-relevance feedback models (Lavrenko and Croft, 2001; Zhai and Lafferty, 2001; Rocchio, 1971). Neural network solutions for expanding the documents, such as SPLADE (Formal et al., 2021), have shown promising results when sufficiently large-scale training data is available.
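To make the role of the inverted index concrete, the following sketch scores documents with one common variant of BM25. The class name `BM25Index`, the whitespace tokenizer, the particular IDF smoothing, and the default values of `k1` and `b` are assumptions for illustration; production systems differ in tokenization and in the exact formula.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class BM25Index:
    """Toy inverted index with BM25 scoring (one common variant)."""

    def __init__(self, docs: List[str], k1: float = 1.2, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.n_docs = len(docs)
        self.doc_len: List[int] = []
        # Inverted index: term -> {doc_id: term frequency}.
        self.postings: Dict[str, Dict[int, int]] = defaultdict(dict)
        for doc_id, text in enumerate(docs):
            tokens = text.lower().split()
            self.doc_len.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.avg_len = sum(self.doc_len) / max(self.n_docs, 1)

    def search(self, query: str, k: int = 3) -> List[Tuple[int, float]]:
        scores: Dict[int, float] = defaultdict(float)
        for term in query.lower().split():
            plist = self.postings.get(term)
            if not plist:
                continue  # Only postings of query terms are ever touched.
            df = len(plist)
            idf = math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))
            for doc_id, tf in plist.items():
                length_norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


index = BM25Index(["cheap car for sale", "the cat sat on the mat"])
hits = index.search("car")
mismatch = index.search("automobile")
```

The query "automobile" returns no results even though the first document is about a car, which is a small instance of the vocabulary mismatch problem that query and document expansion aim to address.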

An alternative to lexical representation is using latent vectors. For instance, SNRM (Zamani et al., 2018) learns high-dimensional sparse latent vectors produced by deep learning models for representing queries and documents.

4.2. Retrieval Models with Dense Representations

Queries and documents can be represented using low-dimensional (compared to the vocabulary size) dense vectors. Such dense vectors are often obtained using pre-trained language models, such as BERT (Devlin et al., 2019), that are fine-tuned for retrieval tasks (Karpukhin et al., 2020). Dense retrieval models are commonly based on bi-encoder architectures: one encoder for representing the query and another for representing the document. These encoders can share parameters. Some dense retrieval methods, such as DPR (Karpukhin et al., 2020), represent each query or document by a single dense vector, while others, such as ColBERT (Khattab and Zaharia, 2020), use one vector per token, resulting in multiple vectors for each query and document. Approximate nearest neighbor (ANN) algorithms, such as HNSW (Malkov and Yashunin, 2020), are used for efficient retrieval when dealing with dense representations. Dense retrieval approaches are also commonly used when dealing with multi-media and multi-modal data (Qu et al., 2021; Salemi et al., 2023a).
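The difference between single-vector and multi-vector scoring can be sketched as below. The vectors are hand-written stand-ins for encoder outputs (no actual encoder is run), the function names are hypothetical, and the exhaustive search is a placeholder for an ANN index such as HNSW, which would be used at scale.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    # One vector per query and per document; relevance is their inner product.
    return dot(query_vec, doc_vec)


def late_interaction_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    # One vector per token: each query token is matched to its most similar
    # document token, and the per-token maxima are summed.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def brute_force_top_k(query_vec: Vector, doc_vecs: List[Vector], k: int) -> List[int]:
    # Exhaustive search over all documents; an ANN index approximates this step.
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: single_vector_score(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


doc_vecs = [[0.9, 0.1], [0.0, 1.0]]
top = brute_force_top_k([1.0, 0.0], doc_vecs, k=1)
multi = late_interaction_score(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 0.5]],
)
```

The multi-vector score keeps token-level matching signals at the cost of storing one vector per token, whereas the single-vector form needs only one inner product per document.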

4.3. Reranking Models

Modern search engines are mainly designed around a multi-stage cascaded architecture: a stack of ranking models where the first model efficiently retrieves a list of documents and each following model reranks the results from the previous stage. A common scenario is a two-stage process: retrieval and reranking. Reranking models are often optimized using explicit or implicit relevance labels. These models are called learning-to-rank models. Early learning-to-rank models rely on manually extracted and engineered feature sets, while the most recent ones rely on deep learning models for representation learning and reranking. A common strategy for reranking using deep learning models is called cross encoding (Nogueira and Cho, 2019), meaning that a query and a candidate document are concatenated and fed to a network like BERT (Devlin et al., 2019), which is trained (or fine-tuned) to produce a relevance score. Learning-to-rank models can be optimized using pointwise, pairwise, or listwise loss functions. For more information, refer to the learning-to-rank survey by Liu (2009) and the neural ranking model surveys by Mitra and Craswell (2018) and Guo et al. (2020a).
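The two-stage process described above can be sketched as a small cascade. Everything here is a toy assumption: `term_overlap` stands in for an efficient first-stage retriever, and `jaccard` stands in for an expensive reranker such as a cross-encoder; neither is a real ranking model.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def term_overlap(query: str, doc: str) -> float:
    """Cheap first-stage score: number of terms shared by query and document."""
    return float(len(set(query.split()) & set(doc.split())))


def jaccard(query: str, doc: str) -> float:
    """Toy stand-in for a reranker: overlap normalized by the union of terms."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if q | d else 0.0


def cascade(query: str, docs: List[str], first_stage: Scorer, reranker: Scorer,
            k_retrieve: int, k_final: int) -> List[str]:
    """Retrieve k_retrieve candidates cheaply, then rerank only those candidates."""
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:k_retrieve]
    return sorted(candidates, key=lambda d: reranker(query, d), reverse=True)[:k_final]


docs = [
    "neural networks for image classification and ranking of photos",
    "neural ranking models",
    "cooking recipes",
]
top = cascade("neural ranking", docs, term_overlap, jaccard, k_retrieve=2, k_final=1)
```

The first stage cannot separate the first two documents (both share two terms with the query), while the second stage promotes the shorter, more focused one. The reranker is only ever applied to `k_retrieve` candidates, which is what keeps the cascade efficient.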

4.4. Generative Retrieval Models

Generative retrieval models, or differentiable search indexes, adopt an encoder-decoder or a decoder-only neural network architecture with the goal of generating document identifiers given the query. Even though earlier attempts at developing these models (Tay et al., 2022) failed to perform effectively at scale (Pradeep et al., 2023), recent research by Zeng et al. (2024) developed prefix-oriented optimization approaches that let generative retrieval models scale up effectively to large collections. These models often assign a semantic document identifier to each document in the collection and are optimized to generate the identifiers of relevant documents using sequential decoding algorithms, such as beam search.

4.5. Unified Equation for Searching

The current literature on searching can be generalized by the equations of addressing and reading borrowed from the paradigm of Neural Turing Machines (NTM) (Graves et al., 2014, 2016; Rae et al., 2016). Before reading from the collection $\textrm{C}_{i}$, the model should decide which part of the collection to attend to by addressing. Addressing can be done by comparing the query $q_{t} \in Q$ (from 3) with the keys in the collection and/or by finding the location in the collection. Addressing is also used when constructing the collection, which will be discussed in Section 6.

4.5.1. Content-based addressing

At the $t$-th iteration, with slight abuse of notation (simplifying $\textrm{C}_{i}^{t}$ to $\textrm{C}_{t}$), given a query $q_{t}$ and a collection $\textrm{C}_{t}$, content-based addressing can be defined as:

(4) $w_{t}^{content} = \textit{address}_{\textit{content}}(q_{t}, \textrm{C}_{t}) = topK(sort(\textit{score}(q_{t}, \textit{transform}_{s}(\textrm{C}_{t}))), k)$

where $k$ is the number of relevant addresses to be selected based on the query, and score is a scoring function, such as BM25 (Robertson et al., 1995) or cosine similarity (Guu et al., 2020; Majumder et al., 2024). The content-based address vector $w_{t}^{content}$ can be exhaustively computed by pairwise comparisons of the query and all elements of the collection (Graves et al., 2014, 2016; Weston et al., 2015; Santoro et al., 2016; Chen et al., 2018; Sukhbaatar et al., 2015), accelerated (Lample et al., 2019; Weston et al., 2015), or approximated by ANN, selecting the top $k$ items $d_{i} \in \textrm{C}$, resulting in $w_{t}^{content}$ with $k$ non-zero elements (Rae et al., 2016; Kumar et al., 2016; Wu et al., 2022c; Khandelwal et al., 2020; Zhong et al., 2022; Alon et al., 2022; Majumder et al., 2024). The function $\textit{transform}_{s}$ may be needed when the contents in the collection cannot be readily compared with $q_{t}$, e.g., mapping to a feature space (Weston et al., 2015; Shi et al., 2024; Guu et al., 2020; Borgeaud et al., 2022) and lexical transformation (Madaan et al., 2022). This function can be an identity function when transformation is not needed (Grave et al., 2017).
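The exhaustive variant of content-based addressing can be sketched in a few lines of Python. This is an illustration under stated assumptions: cosine similarity is used as score, the identity function is the default for the transformation, and the address vector is binary (one could equally store the scores themselves as weights). The Python names mirror the symbols of Equation (4) but are otherwise hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def address_content(q_t, C_t, k, score=cosine, transform_s=lambda d: d) -> List[float]:
    """Score every item against the query, sort by score, and mark the top k slots."""
    scores = [score(q_t, transform_s(d)) for d in C_t]
    ranked = sorted(range(len(C_t)), key=lambda i: scores[i], reverse=True)
    w_t = [0.0] * len(C_t)
    for i in ranked[:k]:
        w_t[i] = 1.0  # binary address (assumption); scores could be stored instead
    return w_t


w = address_content([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2)
```

The resulting vector has exactly `k` non-zero entries whenever the collection holds at least `k` items. An ANN index would replace the exhaustive scoring loop while producing an address vector of the same shape.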

4.5.2. Location-based addressing

Location-based addressing lets the searching system access the corpus purely by storage location, such as an index, without any lexical or contextual comparison between a query and the corpus. Therefore, it is often used for storing (Section 6) or recency-based retrieval. Thus,

(5) $w_{t}^{location} = \textit{address}_{\textit{location}}(q_{t}, context)$

where $context$ is the side information that this function can use in generating the location-based address (e.g., addresses previously generated by this function). For some applications, both content- and location-based addressing can be used together. To this end, the final address can be defined as:

(6) $w_{t} = combine(w_{t}^{location}, w_{t}^{content})$

where combine is a function that generates an address based on the location-based and content-based addresses. Although most of the previous work uses either pure content-based addressing (Santoro et al., 2016; Guu et al., 2020; Khandelwal et al., 2020; Rae et al., 2016) or pure location-based addressing (Weston et al., 2015; Shinn et al., 2023), some work employs both content- and location-based addressing (Graves et al., 2014, 2016).
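A minimal sketch of location-based addressing and of the combine step follows. It assumes a recency-based location address whose side information is a small dictionary with hypothetical keys (`num_slots`, `last_written`, `n_recent`), and it assumes combine is a convex interpolation with a fixed gate. This is only one possible choice, loosely reminiscent of gated interpolation in memory-augmented networks; the cited papers each define their own mechanisms.

```python
from typing import Dict, List


def address_location(q_t, context: Dict[str, int]) -> List[float]:
    """Recency-based address: ignores the query content and marks the most
    recently written slots, using only the side information in `context`."""
    w = [0.0] * context["num_slots"]
    last, n_recent = context["last_written"], context["n_recent"]
    for i in range(max(0, last - n_recent + 1), last + 1):
        w[i] = 1.0
    return w


def combine(w_location: List[float], w_content: List[float], gate: float = 0.5) -> List[float]:
    """Convex interpolation of the two address vectors (an assumed choice)."""
    return [gate * a + (1.0 - gate) * b for a, b in zip(w_location, w_content)]


w_loc = address_location(None, {"num_slots": 4, "last_written": 3, "n_recent": 2})
w_t = combine(w_loc, [1.0, 0.0, 1.0, 0.0])
```

Slots favored by both mechanisms receive the highest weight, while slots selected by only one of them are retained with a reduced weight.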

4.5.3. Unified Equation for Searching

With the final address vector $w_{t}$, the retrieval results $r_{t}$ from $g_{\omega}$ can be defined as

(7) $r_{t} = \textit{read}(w_{t}, \textit{transform}_{s}(\textrm{C}_{t})),$

where read simply selects, at the locations given by $w_{t}$, the content of the corpus as transformed by $\textit{transform}_{s}$.
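Under the assumption that the address vector is a list of per-slot weights, the read step can be sketched as a simple selection. The helper `search` and the use of `str.lower` as the transformation are illustrative choices only.

```python
from typing import Callable, List, Sequence


def read(w_t: Sequence[float], transformed: Sequence[str]) -> List[str]:
    """Select the transformed items whose address weight is non-zero."""
    return [item for weight, item in zip(w_t, transformed) if weight != 0]


def search(w_t: Sequence[float], C_t: Sequence[str],
           transform_s: Callable[[str], str] = lambda d: d) -> List[str]:
    """Apply the transformation to the collection, then read at the addressed slots."""
    return read(w_t, [transform_s(d) for d in C_t])


r_t = search([1.0, 0.0, 1.0], ["Doc A", "Doc B", "Doc C"], transform_s=str.lower)
```

A soft variant would instead return a weighted combination of the items, which requires the items to be vectors rather than raw text.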

5. Presentation & Consumption

Presentation refers to how retrieval results are represented, and consumption refers to how they are incorporated into the predictive model. How do we define $\mathcal{R}$? How does $f_{\theta}$ incorporate/use $r \in \mathcal{R}$?

In this section, we cover two key parts of REML. Presentation involves not only how we define the result space $\mathcal{R}$, but also how the results from retrieval are prepared for the next step of consumption. Based on the application, the presentation stage can range from simple copying of the results to more complex pipelines with intermediate preprocessing and model-based transformations. Consumption is the process through which the predictive model ($f_{\theta}$) incorporates the retrieved information. There are many considerations when designing effective methods for presentation and consumption. One typically wants to incorporate as much information as possible while balancing the tradeoffs between efficiency and accuracy.

5.1. Presentation

When presenting search results to a human reader, the interface is designed to make the findings easy to consume, for example by sorting items by relevance or highlighting salient snippets (White et al., 2003). In REML, we follow a similar principle, except that the target consumer of the retrieved data is a machine, which has a different set of limitations and capabilities. Table 2 summarizes the research related to presentation.

5.1.1. Transforming the data

Depending on the task and the source of data, the retrieved data may be incomplete prior to consumption. The transformation of data is a general and powerful process that converts data through a separate model depending on the needs of the system. Common reasons for further data transformation include decontextualization, translation, and summarization, among others. Thus, $r$ can be transformed by the following equation:

(8) $r^{\prime} = transform_{p}(r)$

Decontextualization

When the retrieved item is only a few sentences of a much larger document, it may require decontextualization to resolve anaphora or previously defined abbreviations (Newman et al., 2023).

Translation

It may be the case that search operates in a cross-lingual representation space, and there will be a mismatch in language between the retrieved items and desired output language (Lavrenko et al., 2002). A multilingual language model may be robust to code switching in the retrieved context, but it will likely be more reliable to translate any retrieved documents before processing them for prediction (Parton et al., 2008; Jiang et al., 2024). Translation can be applied to other modalities, such as regenerating an image to match an expected style.

Summarization

Due to possible context limits of the predictive model, it is desirable to condense document data so that more documents can fit into the context. This can be achieved through automatic summarization, converting the original data into a shortened form through an extractive or abstractive process (Gao et al., 2023b; Li et al., 2023). Furthermore, data can be summarized in the context of the input, providing clarity and explaining why the document is relevant.

5.1.2. Combining result items

To further optimize the presentation of the result items for size or clarity, multiple items can be combined, e.g., by summarizing all items jointly (Wu et al., 2021; Sarthi et al., 2024). Not all REML systems will combine items, as operating over individual items can be efficient and effective. Furthermore, combining items may lead to complications such as miscalibration between the individual scores and the score of a combined result. The combination step is represented by compose:

(9) $r^{\prime} = compose(r)$

5.1.3. Truncate results list

If not all documents fit into the context for consumption, we discard or truncate documents based on these limits, optimizing for length and potentially other properties such as diversity (Hofstätter et al., 2023; Bahri et al., 2020; Meng et al., 2024). This step is represented by truncate:

(10) $r^{\prime} = truncate(r)$

5.1.4. Unified Equation for Presentation

The full equation for presentation:

(11) $r^{\prime} = truncate(compose(transform_{p}(r)))$

where any of these functions can be replaced with simple forms such as an identity function (Lewis et al., 2020a; Izacard and Grave, 2021b, a). Additionally, we can imagine these functions being applied multiple times in any order. We include this ordering as the one that seems most natural when taking into account context length limits. $transform_{p}$ operates on individual items, and $compose$ is similar to $transform_{p}$ but operates on groups of items. Not shown is the loading of retrieved items. Unlike traditional ML systems, where data is typically used only for training, REML systems have unique requirements associated with data loading. Since the amount of external data required for an input is dynamic and considerable (Borgeaud et al., 2022), efficient loading and management of data are essential (Douze et al., 2024; Guo et al., 2020b). We assume loading is handled implicitly by the retrieval module.
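The presentation pipeline can be sketched as a composition of three small functions. The concrete choices below are toy assumptions made only for illustration: keeping the first sentence stands in for a summarizer, adjacent items are merged in fixed-size groups, and the context limit is a word budget.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Crude stand-in for a summarizer: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def transform_p(r: List[str], per_item: Callable[[str], str] = first_sentence) -> List[str]:
    """Transform each retrieved item independently."""
    return [per_item(item) for item in r]


def compose(r: List[str], group_size: int = 2) -> List[str]:
    """Merge adjacent items into groups of at most group_size items."""
    return [" ".join(r[i:i + group_size]) for i in range(0, len(r), group_size)]


def truncate(r: List[str], max_words: int = 6) -> List[str]:
    """Keep items in rank order until the word budget would be exceeded."""
    kept, used = [], 0
    for item in r:
        n_words = len(item.split())
        if used + n_words > max_words:
            break
        kept.append(item)
        used += n_words
    return kept


def present(r: List[str]) -> List[str]:
    """Apply the three steps in the order transform, compose, truncate."""
    return truncate(compose(transform_p(r)))


r = ["A b c. D e.", "F g. H.", "I j k l."]
r_prime = present(r)
```

Replacing any of the three functions with an identity function recovers the simpler systems mentioned above, for instance `present = lambda r: r` when retrieved items are passed through unchanged.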

| Category | Method | Description |
| --- | --- | --- |
| Transform (§5.1.1) | ALCE (Gao et al., 2023b) | Explores both summarization and extractive snippets to compress retrieved items. |
| Transform (§5.1.1) | Teach LLMs to Personalize (Li et al., 2023) | Context-independent and context-dependent summarization to emphasize key retrieved aspects. |
| Transform (§5.1.1) | QADecontext (Newman et al., 2023) † | An example application where decontextualization is done as the downstream task when presenting passages from scientific documents. |
| Compose (§5.1.2) | Fixed Chunking (Wu et al., 2021) | Recursively summarizes adjacent chunks in books. |
| Compose (§5.1.2) | RAPTOR (Sarthi et al., 2024) | First clusters then summarizes related chunks of text. |
| Truncate (§5.1.3) | FiD-Light (Hofstätter et al., 2023) | Extracts subset lists of vectors in an FiD-like model to speed up the attention-related performance bottleneck during decoding. |
| Truncate (§5.1.3) | Choppy (Bahri et al., 2020) † | A supervised approach to ranked list truncation. |

Table 2. Instances of Presentation-related research. †: Relevant for future REML research.

5.2. Consumption

In REML, the predictive model is presented with one or more documents. The effectiveness of $f_{\theta}$ is influenced by the consumption of the presented documents. Ideally, $f_{\theta}$ would consume all the documents simultaneously, yet our systems are computationally limited; hence, the granularity of consumption is typically limited to a subset of the presented documents. Depending on the task, different consumption algorithms may vary in utility: some algorithms are used for extraction and others for on-the-fly updates of the predictive model parameters. In contrast, decoding algorithms, such as beam search (Freitag and Al-Onaizan, 2017) or nucleus sampling (Holtzman et al., 2020), provide ways to decode effective outputs given the presented documents and can incorporate verification for improved effectiveness. There are additional concerns during consumption, such as efficiency (De Jong et al., 2023; Hofstätter et al., 2023) and attribution (Gao et al., 2023b; Schuster et al., 2024; Asai et al., 2024; Menick et al., 2022), that provide further utility.

5.2.1. Consumption at different granularities of retrieval items

Typically, multiple items are retrieved, and it is a design choice whether to process retrieved items separately or together. The following are the main paradigms for consuming retrieval items:

  • Single: Only a single item is incorporated in the prediction. This may be sufficient for simple queries, but often information will need to be combined across multiple retrieved items.

  • Ensemble: Predictions are made for multiple retrieval items in the single-fashion, then aggregated (Khandelwal et al., 2020; Shi et al., 2024).

  • Joint: When the predictive model has a sufficiently flexible context representation, multiple retrieval items can be passed simultaneously in a single inference procedure (Izacard and Grave, 2021b; Lewis et al., 2020a). This is potentially richer than the ensemble approach since each retrieved item is aware of the others. Due to computational limits, only a few retrieval items can be processed this way, and the ensemble approach is relatively more scalable.

  • Multi-round: A hybrid approach where a subset of retrieved items is processed at a time, and the next subset incorporates information about the retrieved items processed thus far (Jiang et al., 2023). Although more scalable than the joint approach, this may be slower. That being said, some applications (e.g., dialogue) naturally conform to the multi-round framework.
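The ensemble paradigm in the list above can be sketched as follows. The sketch assumes that the predictive model has already produced one output distribution per retrieved item, and that the distributions are averaged with weights obtained from a softmax over retrieval scores. This weighting is one plausible aggregation chosen for illustration; the cited systems define their own aggregation rules.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble(per_item_dists: List[Dict[str, float]],
             retrieval_scores: List[float]) -> Dict[str, float]:
    """Weighted average of per-item output distributions.

    per_item_dists[i] is the distribution predicted when only retrieved
    item i is consumed (the single paradigm applied once per item)."""
    weights = softmax(retrieval_scores)
    labels = set().union(*per_item_dists)  # all labels seen in any distribution
    return {y: sum(w * d.get(y, 0.0) for w, d in zip(weights, per_item_dists))
            for y in labels}


p = ensemble([{"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 1.0}], [0.0, 0.0])
```

Because each item is processed independently, the per-item predictions can be computed in parallel, which is the source of the scalability advantage over the joint paradigm.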

These paradigms are atomic functions that can be composed to form more complex operations. For instance, given lists of retrieved items, denoted as $r_{0}, r_{1}, r_{2}, r_{3}$, the atomic functions with different granularities can be composed as in the following equation:

(12) $y = ensemble(multiround(joint(r_{0}), joint(r_{1})), multiround(joint(r_{2}), joint(r_{3})))$.

5.2.2. Consumption algorithms

Independent of the choice of granularity, there are algorithms applicable for consumption. Broadly, they fall into the following categories:

  • Extractive: The predictive model is limited to extracting exact information from the retrieved item, e.g., a span of text from a retrieved passage to answer a question (Khandelwal et al., 2020; Lan et al., 2023). This can be achieved through pointer networks (Vinyals et al., 2015), constrained decoding (Hokamp and Liu, 2017; Hu et al., 2019; Post and Vilar, 2018), and other similar techniques.

  • Analogical: Learning by example, case-based reasoning, and retrieve-and-edit approaches all fall under the category of analogical reasoning. Each of these involves different underlying mechanisms, but essentially the predictive model extrapolates from one or more demonstrative examples to make its prediction and does not necessarily extract factual knowledge from the retrieved items (Das et al., 2020).

  • Contextual: The predictive model incorporates the retrieved items in its context, but the decoding of the output is not constrained in any way (Shi et al., 2024).

  • Latent: A hybrid approach where retrieved items are not incorporated directly into the context, but are instead incorporated in other ways, e.g., by merging hidden states (Yogatama et al., 2021). Similar to the contextual-approach, decoding is not constrained.

5.2.3. Decoding algorithms

Independent of the consumption algorithm, there are different decoding algorithms that can be used to predict a high-quality output in REML. Broadly, decoding algorithms explored thus far fall into the following categories:

  • Output-only: Search algorithms like beam search (Freitag and Al-Onaizan, 2017) will only consider the model output when scoring candidate predictions (Khandelwal et al., 2020).

  • Retrieval-enhanced: Search algorithms like beam search will consider both the model output and the retrieved items when scoring candidate predictions (Lewis et al., 2020a; Asai et al., 2024). This should penalize spurious associations between query and retrieved item.

  • Verification-based: The initial output of the predictive model will be scrutinized by a verification module and potentially rejected if a condition is met, e.g., the output does not entail the retrieved item (Jiang et al., 2023).
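The verification-based category in the list above can be sketched as a propose, verify, and retry loop. This is a generic illustration rather than the procedure of any cited system: the callables `propose`, `verify`, and `retrieve` are hypothetical stand-ins for the predictive model, the verification module, and the retriever, and the choice to re-retrieve using the rejected candidate as the new query is an assumption.

```python
from typing import Callable, List, Optional


def verified_decode(propose: Callable[[str, List[str]], str],
                    verify: Callable[[str, List[str]], bool],
                    retrieve: Callable[[str], List[str]],
                    query: str,
                    max_rounds: int = 3) -> Optional[str]:
    """Return the first candidate accepted by the verifier, or None."""
    items = retrieve(query)
    for _ in range(max_rounds):
        candidate = propose(query, items)
        if verify(candidate, items):
            return candidate
        # Rejected: retrieve again, using the rejected candidate as the query.
        items = retrieve(candidate)
    return None


def toy_retrieve(q: str) -> List[str]:
    return ["paris is the capital of france"] if "france" in q else []


def toy_propose(q: str, items: List[str]) -> str:
    return items[0] if items else "something about france"


def toy_verify(candidate: str, items: List[str]) -> bool:
    return candidate in items  # accept only outputs supported by a retrieved item


answer = verified_decode(toy_propose, toy_verify, toy_retrieve, "capital?")
```

In the toy run, the first candidate is rejected because nothing was retrieved to support it; the rejection triggers a second retrieval, after which the supported output is accepted.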

5.2.4. Consumption efficiency

Steps can be taken in presentation to speed up inference, such as by compressing passages through summarization or truncating the list of retrieved items (De Jong et al., 2023; Hofstätter et al., 2023). There are other approaches to improve efficiency that are more tightly integrated with consumption, e.g., partially precomputing passage embeddings.

5.2.5. Attribution and other extensions

Advanced applications of REML will augment the predictive model output space to incorporate REML-specific information. The most common instance of this is probably to support attribution, so that each part of the output can be traced back to the relevant retrieved item (Gao et al., 2023b; Schuster et al., 2024; Asai et al., 2024; Menick et al., 2022). Other cases involve in-line verification or calls to external tools that would not be easily possible without incorporating the appropriate retrieved item (Asai et al., 2024).

| Category | Method | Type | Description |
| --- | --- | --- | --- |
| Consumption Granularities (§5.2.1) | $k$NN-LM (Khandelwal et al., 2020) | Single + Ensemble | Probabilities are computed for each retrieved item independently then combined. |
| Consumption Granularities (§5.2.1) | RePLUG (Shi et al., 2024) | Single + Ensemble | Probabilities are computed for each retrieved item independently then combined. |
| Consumption Granularities (§5.2.1) | RAG (Lewis et al., 2020a) | Joint | Retrieved items are concatenated before consumption. |
| Consumption Granularities (§5.2.1) | FiD (Izacard and Grave, 2021b) | Joint | Retrieved items are concatenated in the decoder during consumption. |
| Consumption Granularities (§5.2.1) | FLARE (Jiang et al., 2023) | Multi-round | Potentially retrieves new items as generation progresses. |
| Consumption Algorithms (§5.2.2) | kNN-LM (Khandelwal et al., 2020) | Extractive | A single word is selected from the retrieved context. |
| Consumption Algorithms (§5.2.2) | CoG (Lan et al., 2023) | Extractive | An extension of kNN-LM that can output both words and phrases. |
| Consumption Algorithms (§5.2.2) | CBR (Das et al., 2020) | Analogical | Incorporates knowledge graphs into neural models in the spirit of case-based reasoning. |
| Consumption Algorithms (§5.2.2) | Dynamic L2M (Drozdov et al., 2023) | Analogical | Retrieves demonstrations for few-shot prompting. |
| Consumption Algorithms (§5.2.2) | RePLUG (Shi et al., 2024) | Contextual | Uses retrieved items in the context. |
| Consumption Algorithms (§5.2.2) | SPALM (Yogatama et al., 2021) | Latent | Incorporates retrieved items into the hidden state. |
| Decoding Algorithms (§5.2.3) | kNN-LM (Khandelwal et al., 2020) | Text-only | Retrieval probability is ignored for next word prediction. |
| Decoding Algorithms (§5.2.3) | RAG (Lewis et al., 2020a) | Retrieval-enhanced | Retrieval probability is incorporated in beam search. |
| Decoding Algorithms (§5.2.3) | Self-RAG (Asai et al., 2024) | Retrieval-enhanced | Critic probability is incorporated in beam search. |
| Decoding Algorithms (§5.2.3) | FLARE (Jiang et al., 2023) | Verification-based | Discards low confidence continuations, triggering retrieval. |
| Consumption Efficiency (§5.2.4) | LUMEN (De Jong et al., 2023) | Precompute | Partially computes passage representations offline. |
| Attribution and Extensions (§5.2.5) | ALCE (Gao et al., 2023b) | Attribution | End of sentence. Multi-source. |
| Attribution and Extensions (§5.2.5) | SemQA (Schuster et al., 2024) | Attribution | Mid-sentence. Multi-source. |
| Attribution and Extensions (§5.2.5) | Self-RAG (Asai et al., 2024) | Attribution | End of sentence. Single source. |
| Attribution and Extensions (§5.2.5) | Self-RAG (Asai et al., 2024) | Verification | Outputs a special token indicating whether to use document. |

Table 3. Instances of consumption research.

6. Storing

Storing refers to how retrievable items are represented and indexed. How do we define $\mathcal{D}$? How does $g_{\omega}$ construct each $d \in \mathcal{D}$?

Storing is one of the optional yet crucial properties of REML models and refers to how retrievable items are saved, represented, and indexed. The storage components can be categorized into coupled and decoupled storage. If at least one external memory is optimized jointly with the predictive model, we say the architecture has coupled storage. In the coupled storage architecture, contents can be populated into the storage online and updated alongside the predictive model, e.g., Neural Turing Machines (NTM) (Graves et al., 2014, 2016) and REALM (Guu et al., 2020). On the other hand, if all external storage comes from off-the-shelf systems and the contents are populated offline, we say the architecture has decoupled storage, e.g., kNN-LM (Khandelwal et al., 2020) and ED2LM (Hui et al., 2022b).

In their simplest form, REML systems operate with decoupled storage, where the retrieval model is implemented independently of the predictive model. Since entries in the storage are populated offline and many off-the-shelf retrieval models are readily available, it is relatively convenient to construct decoupled storage. On the other hand, many advanced REML systems operate with coupled storage, where the retrieval model is directly influenced by the predictive model. Here, the entries in the storage are populated online and updated alongside the predictive model. In this section, we describe storage operations and challenges associated with coupled and decoupled storage.

6.1. Primary Storage Operations

In general, there are three types of operations that each storage system must support to be effectively utilized in REML systems: Address Generation, Read, and Write. The storage operations will typically be used in three scenarios: 1) $f_{\theta}$ needs to incorporate historical context in its prediction, e.g., long context language models; 2) $f_{\theta}$ conducts various types of online learning using recent/past interactions, e.g., experience replay in reinforcement learning and language agents; and 3) $f_{\theta}$ is a memory network-like architecture (Weston et al., 2015) where retrieval is an abstract process and everything written to or read from the storage is controlled by the network in the service of the objective being optimized.

6.1.1. Address Generation

An important aspect of utilizing storage in REML systems is the ability to store and retrieve specific pieces of information. Therefore, the storage system must be capable of generating a specific address for reading from or writing to the storage. Storage location can be divided into abstract location (slots in the storage space) and physical location (where in hardware the storage sits). For abstract location, it boils down to the role of $\textit{address}_{\textit{location}}$, as introduced in Section 4.5. In most cases, it will be a simple rotational function (Graves et al., 2014, 2016) that allows iteration through a sequence of slots. For better efficiency, the function can store entries into clusters by content (Weston et al., 2015) or layers (Lample et al., 2019), reducing the computation during searching. For physical location, entries that do not need to be in RAM or VRAM can be moved to disk with memory mapping, while entries that must be in RAM, such as an embedding index to be searched, can be compressed without notable performance degradation (Izacard et al., 2024).

6.1.2. Read

Reading from the storage is an essential part of REML models and is closely tied to the search operations. However, in this section we focus on how retrievable items are represented in the storage; how they are read is discussed in Section 4.

6.1.3. Write

At time $t$, after the address vector $w_{t}$ is obtained, the storage operation can be performed by a write function that updates the datastore $\textrm{C}_{t}$ as follows:

(13) $\textrm{C}_{t+1} = \textit{write}(w_{t}, \textrm{C}_{t}, \textit{payload}_{t})$

where $\textit{payload}_{t}$ can be a vector or a raw representation, which can be preprocessed by functions similar to those defined in Section 5 before being stored (Hui et al., 2022b).

From the viewpoint of the addressing mechanism described in Section 4.5, location-based addressing is used before the write operation, following the framework of the Neural Turing Machine (NTM) (Graves et al., 2014, 2016). In most cases, the write operation simply appends the latest entry to the end of the storage (the address vector points to the next available slot) and is executed after every new input in sequential order, following the neural cache model (Grave et al., 2017). On the other hand, some architectures, such as the Memory Neural Network (MemNN) (Weston et al., 2015) or large memory layers (Lample et al., 2019), operate differently and generate an address specifying where in memory to write the new entry, potentially overwriting any entry that was there before. Regularization can be applied in memory networks to ensure that a substantial portion of the memory is used. These decisions can be made in concert with the storage management component.
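The two write behaviors described above, appending to the next available slot and overwriting at a generated address, can be sketched with a fixed-size slot memory. The representation is an assumption made for illustration: empty slots hold `None`, the address vector is binary, and the helper name `next_free_address` is hypothetical.

```python
from typing import List, Optional

Slot = Optional[str]


def next_free_address(C_t: List[Slot]) -> List[int]:
    """Location-based address pointing at the first empty slot (append behavior)."""
    w = [0] * len(C_t)
    for i, slot in enumerate(C_t):
        if slot is None:
            w[i] = 1
            break
    return w


def write(w_t: List[int], C_t: List[Slot], payload_t: str) -> List[Slot]:
    """Return a new datastore with the payload stored at every addressed slot;
    any previous content of an addressed slot is overwritten."""
    return [payload_t if w else old for w, old in zip(w_t, C_t)]


C0: List[Slot] = [None, None, None]
C1 = write(next_free_address(C0), C0, "a")  # append
C2 = write(next_free_address(C1), C1, "b")  # append
C3 = write([1, 0, 0], C2, "c")              # overwrite slot 0 at a generated address
```

The function returns a new list rather than mutating its input, mirroring the functional form of Equation (13); a real system would typically update the storage in place.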

6.2. Phases of Storage Operation

In REML systems, storage operates through two distinct phases. The initial phase, called Storage Construction, involves the system setting up the storage infrastructure with the necessary information to facilitate its operations. Following this setup, the system transitions to the Storage Management phase, where it determines the appropriate strategies for storing information, including the selection of what data to retain, the optimal storage locations, and the methods for organizing the information for future retrieval.

6.2.1. Storage Construction

In most cases, a REML system will initialize its storage by processing an entire retrieval corpus. This can be done offline before training, after training but before inference, or throughout training as needed. Storage construction is well documented and studied, and the initial storage construction is essentially a series of write and address generation operations. Storage can be constructed as a key-value structure where the retrieval space $\mathcal{D}$ can be defined as:

(14) $\mathcal{D} = \{(k_i, v_i) \,|\, d \in \textrm{C},\, k_i = transform_k(d),\, v_i = transform_v(d)\}$

where $transform_k$ is a key representation function that can take each entry in the corpus or an input instance $x$ to generate a key in the storage. Similarly, $transform_v$ is a value representation function that can take each entry in the corpus or an input instance $x$ to generate a value in the storage. Note that $transform_k$ and $transform_v$ can simply be the identity function, in which case the key and value are stored unchanged. Table 4 describes how each paper constructed its storage offline and/or online. For example, in EMAT (Wu et al., 2022d), where the collection consists of question-answer pairs, $transform_k$ uses only the question in each pair to generate the key representation, and $transform_v$ uses only the answer from each pair to generate the value representation.
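As a rough illustration of Equation (14), the sketch below builds a key-value storage from a corpus by applying a key transform and a value transform to every entry. The helper names `build_storage` and `toy_embed` are hypothetical; `toy_embed` is a deliberately trivial stand-in for a learned encoder and does not reflect how EMAT computes its embeddings.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

D = TypeVar("D")
K = TypeVar("K")
V = TypeVar("V")


def build_storage(
    corpus: Iterable[D],
    transform_k: Callable[[D], K],
    transform_v: Callable[[D], V],
) -> List[Tuple[K, V]]:
    """Apply the key and value transforms to every corpus entry d."""
    return [(transform_k(d), transform_v(d)) for d in corpus]


def toy_embed(text: str) -> Tuple[int, int]:
    """Toy stand-in for an encoder: (length, number of vowels)."""
    return (len(text), sum(ch in "aeiou" for ch in text))


# An EMAT-like setup: the corpus holds (question, answer) pairs, the key is
# derived from the question only and the value from the answer only.
qa_corpus = [("who wrote hamlet", "shakespeare"), ("capital of france", "paris")]
qa_storage = build_storage(
    qa_corpus,
    transform_k=lambda pair: toy_embed(pair[0]),
    transform_v=lambda pair: toy_embed(pair[1]),
)

# Identity transforms: keys and values are the raw entries themselves.
raw_storage = build_storage(qa_corpus, lambda d: d, lambda d: d)
```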

6.2.2. Storage Management

Once the storage is initialized, the storage operations need to be scheduled for both optimal task completion and efficiency; we dub this phase storage management. Efficient usage of storage can be understood in terms of space and speed, which comes down to three decisions: when, what, and how to store.

When to store

In most scenarios, $g_\omega$ will pull information from the storage built at an initial phase based on the needs of $f_\theta$. However, this setup can introduce a storage staleness problem when $\omega$ changes while $g_\omega$ is being sequentially or jointly trained with $f_\theta$. Since $\mathcal{D}_{\text{key}}$ (the retrieval space of keys in C) is constructed by $\omega$ during training, it is necessary to refresh the storage as $\omega$ is updated; this was already noted in MemNN (Weston et al., 2015). This comes down to two questions: 1) how often to update, and 2) what portion of the storage to update. For the first question, one can update the storage either synchronously (every training step) or asynchronously (every $T$ training steps) (Asai et al., 2023a). For the second question, one can choose to update the entire storage, a subset of the storage, or refrain from updating the storage at all.

What to store

Continuing from when to store, if the storage is being periodically updated, it may be beneficial to store selectively. For synchronous updates, updating a subset of the storage, either by in-batch approximation or by reranking, is preferred due to the large computational overhead, while a full storage update is often performed when updates are asynchronous (Izacard et al., 2024; Asai et al., 2023a; Zhong et al., 2022; Guu et al., 2020). Another way to store selectively is to erase some of the entries stored in the past along with the periodic update, as the storage can become full. One simple approach is to set a window size for the storage and manage it like a queue (Shinn et al., 2023; Wu et al., 2022b; Dai et al., 2019; Rae et al., 2020), similar to discarding the oldest entry in the storage (Grave et al., 2017). Weston et al. (2015) devised a separate erasure module that scores the utility of each entry to discard the least useful entries.
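The queue-style management mentioned above can be sketched with a bounded double-ended queue: once the window is full, each new write silently evicts the oldest entry. The class name `WindowedStorage` and its methods are hypothetical and only illustrate the first-in-first-out policy, not any specific cited model.

```python
from collections import deque
from typing import Deque, List, Tuple


class WindowedStorage:
    """Fixed-size storage that discards the oldest entry when full."""

    def __init__(self, window_size: int) -> None:
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        # deque(maxlen=n) drops items from the opposite end on overflow
        self._entries: Deque[Tuple[str, str]] = deque(maxlen=window_size)

    def write(self, key: str, value: str) -> None:
        self._entries.append((key, value))

    def entries(self) -> List[Tuple[str, str]]:
        return list(self._entries)


storage = WindowedStorage(window_size=3)
for step in range(5):
    storage.write(f"key-{step}", f"value-{step}")
```

A utility-based erasure policy, as in the separate erasure module described above, would replace the implicit "oldest first" rule with an explicit score per entry.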

How to store

This encompasses entry representation, e.g., index compression (Wu et al., 2022b; Rae et al., 2020; Martins et al., 2022) and quantization (Izacard et al., 2024), and the architectural choice of the storage, e.g., key-value structure (Grave et al., 2017; Khandelwal et al., 2020; Wu et al., 2022c; Zhong et al., 2022; Min et al., 2023; Borgeaud et al., 2022; Yogatama et al., 2021; Alon et al., 2022; Hui et al., 2022b), where the compression and representation computations can happen incrementally by batch (Zamani et al., 2022).

6.3. Storage Types

Model | Key | Value

Coupled Storage (§6.3.1)
NTM (Graves et al., 2014, 2016) | transformed output of $f_\theta$ | (the same as the Key)
MemNN (Weston et al., 2015) & MemN2N (Sukhbaatar et al., 2015) | input feature embedding | (the same as the Key)
Dynamic Memory Network (Kumar et al., 2016) | input feature embedding | (the same as the Key)
Neural Cache Model (Grave et al., 2017) | hidden representation of RNN | next word
RUM (Chen et al., 2018) | user-item embedding | (the same as the Key)
Transformer-XL (Dai et al., 2019) | hidden representation of Transformer | (the same as the Key)
LongMem (Wang et al., 2023a) | attention-key | attention-value
Memorizing Transformer (Wu et al., 2022c) | attention-key | attention-value
MemTransformer (Burtsev et al., 2021) | sequence of tokens | tokens
RPT (Rubin and Berant, 2023) | token chunk embedding | (the same as the Key)
Unlimiformer (Bertsch et al., 2023) | token chunk embedding | (the same as the Key)
MeMViT (Wu et al., 2022b) | image frame embedding | (the same as the Key)
PFMN (Lee et al., 2018) | image frame embedding | (the same as the Key)
REALM (Guu et al., 2020) | document embedding | (the same as the Key)
REPLUG LSR (Shi et al., 2024) | document embedding | (the same as the Key)
ATLAS (Izacard et al., 2024) | document embedding | (the same as the Key)
EMAT (Wu et al., 2022d) | question embedding | answer embedding
QAMAT (Chen et al., 2023c) | question embedding | question-answer embedding
TRIME (Zhong et al., 2022) | context (sequence of tokens) | next token
NPM (Min et al., 2023) | token embedding | token
Reflexion (Shinn et al., 2023) | NL self-reflection in NL | (the same as the Key)
CLIN (Majumder et al., 2024) | self-reflection in NL | (the same as the Key)
ExpeL (Zhao et al., 2024) | self-reflection in NL | (the same as the Key)
Generative Agent (Park et al., 2023) | stream of experience in NL | (the same as the Key)
Voyager (Wang et al., 2023c) | program description embedding | program code
MemPrompt (Madaan et al., 2022) | NL question | NL human feedback

Decoupled Storage (§6.3.2)
kNN-LM (Khandelwal et al., 2020) | context embedding | next token
SPALM (Yogatama et al., 2021) | context embedding | next token
RAG (Lewis et al., 2020b) | document embedding | (the same as the Key)
RETOMATON (Alon et al., 2022) | context embedding | next token, pointer
KIF (Fan et al., 2021) | evidence embedding (multimodal) | (the same as the Key)
RETRO (Borgeaud et al., 2022) | token chunk embedding | token chunk
ED2LM (Hui et al., 2022b) | document embedding | document
REPLUG (Shi et al., 2024) | document embedding | (the same as the Key)

Table 4. Instances of REML models with external storage. NL is an abbreviation of natural language.

In the literature, two types of storage architectures are identified in REML systems: Coupled Storage and Decoupled Storage. The following sections will elaborate on these architectures and introduce the characteristics associated with each type of storage.

6.3.1. Coupled Storage

Coupled storage is defined as a storage that can be updated online during training and inference of the predictive model and can be jointly optimized. Initial developments of coupled storage were primarily led by the Neural Turing Machine (NTM) (Graves et al., 2014, 2016) and the Memory Network (MemNN) (Weston et al., 2015). By leveraging external addressable storage, they handled tasks ranging from abstract operations, such as copying, recall, and sorting, to language reasoning; the contents of the storage in these architectures are dense vectors that can be readily used by $f_\theta$. We refer readers to the original papers of the two models for a deeper understanding of the basic shape of REML models with coupled storage. There are a few notable characteristics of REML systems with coupled storage, including but not limited to:

Staleness of Coupled Storage

One of the primary concerns with coupled storage arises when $g_\omega$ is being trained, which makes the storage stale. This is still an ongoing challenge in the community, but as mentioned in Section 6.2, several techniques have been devised to circumvent the issue by deciding how often to update (synchronously or asynchronously) and what portion of the storage to update (full or partial). This yields five strategies, including avoiding updates altogether. Synchronous full update is the simplest approach to the staleness problem: the storage is updated at every training step. It has been attempted in a few studies (Rubin and Berant, 2023; Bertsch et al., 2023), but its large computational overhead prevents it from being used in practical settings (Izacard et al., 2024). Synchronous partial update can be done by selecting a batch of entries to update (Izacard et al., 2024). Depending on the application, there can be various batch selection strategies, such as lexical similarity (Zhong et al., 2022) or in-document sampling (Min et al., 2023). Asynchronous full update is done by updating the full storage every $T$ training steps (Guu et al., 2020; Izacard et al., 2024; Shi et al., 2024; Wu et al., 2022d). This allows staleness in the index before it is updated again. For example, Wu et al. (2022d) freeze the storage at the beginning of each training epoch and only update it at the end of each epoch. As far as we know, there have been few attempts at asynchronous partial update, as it may degrade training performance by a larger margin. Alternatively, since index recreation is highly expensive, it is possible to ignore the staleness and avoid re-indexing, with adequate strategies, without compromising performance by a large margin (Rae et al., 2016; Lewis et al., 2020b; Izacard et al., 2024; Guu et al., 2020; Wang et al., 2023a; Wu et al., 2022c).
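To make the trade-off between update frequency and staleness concrete, the following toy simulation tracks how many retriever updates the index lags behind at each training step under a full-refresh schedule. The function `staleness_trace` and its version counters are illustrative assumptions; no real encoder or index is involved, and partial updates are not modelled.

```python
from typing import List, Optional


def staleness_trace(num_steps: int, refresh_every: Optional[int]) -> List[int]:
    """Lag (in retriever updates) between the index and the retriever at each step.

    refresh_every=1    -> synchronous full update
    refresh_every=T    -> asynchronous full update every T steps
    refresh_every=None -> the index is never refreshed
    """
    retriever_version = 0  # bumped on every training step (omega changes)
    index_version = 0      # retriever version used to build the current index
    trace: List[int] = []
    for step in range(1, num_steps + 1):
        # staleness observed when retrieving at the start of this step
        trace.append(retriever_version - index_version)
        retriever_version += 1
        if refresh_every is not None and step % refresh_every == 0:
            index_version = retriever_version  # re-encode the full storage
    return trace


sync_trace = staleness_trace(6, refresh_every=1)
async_trace = staleness_trace(6, refresh_every=3)
frozen_trace = staleness_trace(4, refresh_every=None)
```

With a refresh period of three steps the lag cycles through 0, 1, 2, whereas a frozen index drifts further from the retriever at every step.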

Cold Start Problem in Coupled Storage

Another characteristic of models equipped with coupled storage is that they can suffer from a cold start problem, where the performance of $f_\theta$ is suboptimal before the storage is filled with enough information. Most architectures that start with an empty storage, such as language agents (LA) or long-context language models that process a long document over multiple training steps, have this issue (Park et al., 2023; Majumder et al., 2024; Shinn et al., 2023; Zhao et al., 2024; Wang et al., 2023c; Wu et al., 2022c; Zhong et al., 2022). However, the cold start problem can be alleviated when the model can be adapted to a new task and the storage/experience built during previous tasks is transferable to the new task (Majumder et al., 2024).

Versatility of Coupled Storage

Despite its disadvantages, coupled storage has seen significant development both theoretically and in various applications. This includes making REML systems with coupled storage end-to-end trainable (Sukhbaatar et al., 2015), capturing the position and temporality of language using episodic storage (Kumar et al., 2016), and scaling the storage through sparse access (Rae et al., 2016; Zaremba and Sutskever, 2016) and strategic storage management (Lample et al., 2019; Grave et al., 2017). These have led to applications in various domains, such as meta-learning (Santoro et al., 2016), sequential recommendation (Chen et al., 2018), and video summarization and recognition (Lee et al., 2018; Wu et al., 2022b). In long-context language modeling (Bertsch et al., 2023; Zhong et al., 2022), there have been approaches to solve the task through attention recurrence (Dai et al., 2019; Yogatama et al., 2021; Wu et al., 2022c; Wang et al., 2023a) and compression of hidden states (Rae et al., 2020; Martins et al., 2022; Wu et al., 2022b). Recently, research on language agents has focused on agents’ ability to use language models for perceiving, reasoning, planning, and managing memory while interacting with external environments (Sumers et al., 2024). These agents, which learn independently from their observations and adapt their knowledge, are equipped with external storage for long-term memory, allowing them to store past reasoning (Majumder et al., 2024; Wang et al., 2023c; Zhao et al., 2024; Park et al., 2023; Shinn et al., 2023) or user feedback (Madaan et al., 2022) for future use and self-reflection (Madaan et al., 2024).

6.3.2. Decoupled Storage

Decoupled storage is defined as a storage method in REML systems where $g_\omega$ operates independently of $f_\theta$. In contrast to coupled storage, where the entries are updated dynamically online, usually influenced by joint optimization of $f_\theta$ and $g_\omega$, decoupled storage involves offline population of entries, i.e., the storage becomes read-only during the online stage. There are a few notable characteristics of REML systems with decoupled storage, including but not limited to:

Ease of Implementation.

Since the retriever is completely independent of the predictive model, implementing the REML system becomes much easier than with coupled storage (Borgeaud et al., 2022; Hui et al., 2022b; Yogatama et al., 2021; Khandelwal et al., 2020; Alon et al., 2022). When training a REML system with a decoupled storage architecture, one can either train the retriever separately or use an off-the-shelf retriever that is already publicly available (Shi et al., 2024). This also means that one can easily edit a REML system by simply replacing its components. As a result, this design avoids the storage staleness and cold start problems of the coupled storage architecture. The ease of implementation stands out when a system needs to incorporate multiple storages that are multi-modal or multi-source (Fan et al., 2021; Yang et al., 2024a; Yu et al., 2022b), which can be tricky to implement with a coupled storage architecture.

Performance sub-optimality.

Generally, it is known that in REML systems with multiple components, end-to-end training yields a better performance compared to training each component individually (Sachan et al., 2021; Wang et al., 2024b; Zamani and Bendersky, 2024). However, in REML systems with decoupled storage, the storage component and predictive model are trained separately. This configuration might lead to sub-optimal performance in the system’s downstream tasks. In other words, despite the convenience, the fixed nature of the storage during the training of the predictive model, and vice versa, in decoupled storage systems, prevents them from adapting to each other’s needs.

7. Optimization

Optimization refers to how different model parameters are adjusted for performance. How does $g_\omega$ use $s \in \mathcal{S}$ provided by $f_\theta$ to update $\omega$? How does $f_\theta$ use $\mathcal{L}$ to update $\theta$?

As depicted in Figure 2, a REML system consists of multiple components interacting with each other. Optimization can be done end-to-end (i.e., optimizing all model parameters simultaneously for a common goal). Alternatively, optimization can be done for a subset of model parameters, such as independent optimization of each component in REML. For instance, independently optimizing the Query Generation component requires ground-truth queries. Such data is difficult to obtain for some components. Distant or weak supervision is a potential solution to address this issue. In the rest of this section, we mainly focus on the optimization of the retrieval model and the predictive model as the two main components of REML systems.

7.1. Retrieval Model Optimization

7.1.1. No REML Optimization

In a wide range of studies, retrieval models are not optimized. In many of them, queries and documents are in the form of unstructured text. In that case, query and document representations are often computed based on term statistics, such as term frequencies in the documents or document frequencies in the document collection. Using retrieval models such as TF-IDF (Salton and Buckley, 1988), BM25 (Robertson et al., 1995), and query likelihood (Ponte and Croft, 1998), with default parameters, belongs to this category. For example, the DrQA model (Chen et al., 2017a) uses the ElasticSearch implementation of TF-IDF for document retrieval from Wikipedia for factoid question answering. The SelfMem model (Cheng et al., 2024) uses BM25 for document retrieval for a number of retrieval-augmented text generation tasks, such as translation and dialogue.
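As a concrete example of a retrieval model driven purely by term statistics, the sketch below scores documents with a common formulation of BM25. The whitespace tokenizer, the smoothed IDF variant, and the values `k1=1.2` and `b=0.75` are assumptions chosen for illustration; they are widely used choices, not necessarily the settings of the systems cited above.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with a BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores: List[float] = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # smoothed IDF that stays non-negative
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # term-frequency saturation with document-length normalization
            length_norm = k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scores.append(score)
    return scores


docs = ["the cat sat on the mat", "dogs chase cats", "retrieval augmented generation"]
scores = bm25_scores("cat mat", docs)
```

No parameter here is learned from REML training signals, which is what places this kind of retriever in the "no optimization" category.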

Pre-trained language models (e.g., encoder representations) can also be used to produce latent representations for queries and documents, where simple similarity functions, such as dot product or cosine similarity, are used for computing relevance scores for a query-document pair. Even though these language models went through expensive optimization procedures, their optimizations are not REML-specific. Note that employing language model representations for retrieval with no optimization often does not perform well. For instance, Lien et al. (2024) demonstrated that using plain BERT or RoBERTa representations for zero-shot retrieval is substantially worse than term matching models, like BM25.
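The similarity functions mentioned above are simple enough to write out directly. In this sketch the query and document vectors are hand-written toy values standing in for encoder outputs, and the helper names `dot`, `cosine`, and `rank` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Vector, v: Vector) -> float:
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom > 0.0 else 0.0


def rank(query: Vector, docs: List[Vector], sim: Callable[[Vector, Vector], float]) -> List[int]:
    """Return document indices sorted by decreasing similarity to the query."""
    return sorted(range(len(docs)), key=lambda i: sim(query, docs[i]), reverse=True)


query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
dot_ranking = rank(query_vec, doc_vecs, dot)
cosine_ranking = rank(query_vec, doc_vecs, cosine)
```

Dot product is sensitive to vector norms while cosine similarity is not, so the two can produce different rankings on real embeddings even though they agree on this toy input.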

More recently, it has been observed that large-scale instruction-tuned language models, such as GPT-3.5, can be carefully instructed to rank a few documents for a given query (Sun et al., 2023). These models can be effective, but they cannot be applied to large document collections and can only be used as re-ranking models.

Retrieving from databases through structured queries, such as SQL, also belongs to this category. A wide range of task-oriented dialogue systems, such as intelligent assistants for travel booking and restaurant reservation, require access to databases for up-to-date availability (Chen et al., 2017b).

7.1.2. Independent Optimization

REML may take advantage of retrieval models whose optimization is independent of the predictive model’s parameters. Retrieval models are often trained using a set of query-document-relevance triplets. The relevance signal may come from (1) explicit annotations, e.g., from expert assessors or crowdworkers, (2) implicit feedback (Joachims, 2002; Joachims et al., 2017), e.g., user clicks, dwell time, mouse movements, etc., or (3) automatically generated weak signals (also known as distant supervision signals), such as the appearance of a phrase in the documents (e.g., the answer in the context of QA), retrieval scores from another retrieval model (Dehghani et al., 2017), or annotations produced by large language models (Thomas et al., 2023). Using these training triplets, retrieval models can be optimized using three different approaches: (1) pointwise ranking, (2) pairwise ranking, and (3) listwise ranking. Refer to Liu (2009) for more information on various ranking loss functions.
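The three families of ranking objectives can be illustrated with one representative loss each. The specific choices below (binary cross-entropy on a logit for pointwise, a RankNet-style logistic loss for pairwise, and a ListNet-style softmax cross-entropy for listwise) are assumptions picked as common examples; many other losses exist within each family.

```python
import math
from typing import List, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def softmax(xs: Sequence[float]) -> List[float]:
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pointwise_loss(score: float, label: int) -> float:
    """Binary cross-entropy on one (query, document) score treated as a logit."""
    return softplus(score) - label * score


def pairwise_loss(pos_score: float, neg_score: float) -> float:
    """Logistic loss on the margin between a relevant and a non-relevant document."""
    return softplus(-(pos_score - neg_score))


def listwise_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Cross-entropy between softmax(labels) and softmax(scores) over a whole list."""
    target = softmax(labels)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


good_order = listwise_loss([2.0, 1.0, 0.0], [1.0, 0.0, 0.0])
bad_order = listwise_loss([0.0, 1.0, 2.0], [1.0, 0.0, 0.0])
```

Pointwise losses look at one document at a time, pairwise losses at document pairs, and listwise losses at the full candidate list, which is the distinction drawn in the text.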

In the context of REML, a large set of studies, e.g., (Vu et al., 2023; Hashemi et al., 2021; Lyu et al., 2023; Jiang et al., 2023), use commercial search engines, such as Bing or Google, as their retrieval models. These search engines are optimized using a combination of the relevance signals mentioned above. Therefore, they are considered as independent optimization models. A set of approaches, such as (Izacard and Grave, 2021b), train retrieval models on explicitly labeled collections, such as MS MARCO (Campos et al., 2016) or provenance labels in the KILT benchmark (Petroni et al., 2021), and then use the trained models on an often out-of-domain REML scenario. Weak or distant supervision is also used in (Qu et al., 2021, 2020; Salemi et al., 2023b) for open-domain (visual) question answering by assuming that any document that contains the answer phrase is relevant.
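The answer-containment heuristic used for weak supervision can be sketched in a few lines: a document is labeled relevant if it contains the answer phrase. The function name `weak_relevance_triplets` and the case-insensitive substring match are illustrative assumptions; the cited works may apply additional normalization or filtering.

```python
from typing import List, Tuple


def weak_relevance_triplets(
    query: str, answer: str, documents: List[str]
) -> List[Tuple[str, str, int]]:
    """Build (query, document, label) triplets with label 1 when the document
    contains the answer phrase (case-insensitive substring match)."""
    needle = answer.lower()
    return [(query, doc, int(needle in doc.lower())) for doc in documents]


triplets = weak_relevance_triplets(
    query="who wrote hamlet",
    answer="Shakespeare",
    documents=[
        "Hamlet is a tragedy written by William Shakespeare.",
        "Macbeth is set in Scotland.",
    ],
)
```

Such labels are noisy by construction: a document can mention the answer phrase without actually supporting the answer, which is why this is treated as weak rather than explicit supervision.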

7.1.3. Conditional Optimization

In conditional optimization, the retrieval model is optimized, conditioned on the predictive model $f_\theta$. A group of conditional optimization approaches use knowledge distillation. For instance, Izacard and Grave (2021a) use the aggregation of cross-attention weights in the fusion-in-decoder architecture as weak signals to train the retrieval model. Here, the decoder that provides the weights plays the role of a teacher model and a dense passage retrieval model plays the role of a student model. Alternatively, Yang and Seo (2020) use the similarity score produced by an answer span selection model, i.e., the reader, as teacher scores and minimize the KL-divergence between them and retrieval scores.
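A distillation objective of this kind can be sketched as the KL-divergence between a teacher distribution and the retriever's distribution over the same candidate documents. Normalizing both score lists with a softmax and using the direction KL(teacher || retriever) are assumptions made for this sketch; the cited papers should be consulted for their exact formulations.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_scores: Sequence[float], retriever_scores: Sequence[float]) -> float:
    """KL between the softmax-normalized teacher and retriever (student) scores."""
    return kl_divergence(softmax(teacher_scores), softmax(retriever_scores))


teacher = [3.0, 1.0, 0.0]         # e.g. aggregated attention or reader scores
aligned_student = [3.0, 1.0, 0.0]
misaligned_student = [0.0, 1.0, 3.0]
loss_aligned = distillation_loss(teacher, aligned_student)
loss_misaligned = distillation_loss(teacher, misaligned_student)
```

Only the retriever scores would receive gradients in training; the teacher scores are treated as fixed targets.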

As shown by Izacard and Grave (2021a), knowledge distillation from the downstream ML model to the retrieval model can be done iteratively, as follows:

(15) $\omega^{(t+1)} = \arg\min_{\omega} \frac{1}{|T|} \sum_{(x,y) \in T} \mathcal{L}\left(f_{\theta^{(t)}}\left(x; g_{\omega}\right), y\right)$

where the retrieval model in iteration (or epoch) $t+1$ is optimized based on the parameters of the predictive model at iteration $t$, and $\mathcal{L}$ is the downstream loss function.

7.2. Predictive Model Optimization

7.2.1. No REML Optimization

Similar to retrieval models, predictive models may also be used as ‘black-box’ systems without REML-specific training. For instance, a wide range of query expansion approaches, such as Rocchio’s algorithm (Rocchio, 1971), relevance models (Lavrenko and Croft, 2001), and the divergence minimization model (Zhai and Lafferty, 2001), expand the queries based on the appearance of terms and concepts in the retrieval results. Using pre-trained large language models in a zero-shot setting is another example that has received considerable attention in recent years (Shi et al., 2024; Salemi et al., 2024b).

7.2.2. Independent Optimization

Predictive models in REML can be optimized independent of the retrieval model’s parameters. For instance, we can optimize predictive models by assuming that the retrieval model is optimal (i.e., retrieving ground truth relevant documents). In this case, the optimization of predictive model $f_\theta$ can be modeled as:

(16) $\theta^{*} = \arg\min_{\theta} \frac{1}{|T|} \sum_{(x,y) \in T} \mathcal{L}\left(f_{\theta}\left(x; g_{\mathrm{opt}}\right), y\right)$

where $\mathcal{L}$ is the loss function for the downstream task. For instance, a number of open-domain QA models are optimized to extract or generate answers given the question and the gold (ground truth) passage (Chen et al., 2017a). Some may relax the optimality assumption of retrieval models and inject non-relevant documents into the ground truth set. These documents can be sampled either randomly or from the output of a retrieval model other than $g_\omega$.
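The relaxation described above, where non-relevant documents are mixed into the gold set, might look as follows in a data-preparation step. The function `build_training_context`, the uniform random sampling of distractors, and the fixed seed are assumptions for illustration; sampling distractors from another retriever's output would replace the `rng.sample` call.

```python
import random
from typing import List


def build_training_context(
    gold: List[str], candidates: List[str], num_negatives: int, seed: int = 0
) -> List[str]:
    """Mix gold passages with randomly sampled non-gold passages."""
    rng = random.Random(seed)  # local generator keeps the example reproducible
    pool = [doc for doc in candidates if doc not in gold]
    negatives = rng.sample(pool, min(num_negatives, len(pool)))
    context = list(gold) + negatives
    rng.shuffle(context)  # avoid leaking the gold passage through its position
    return context


collection = ["gold passage", "noise a", "noise b", "noise c", "noise d"]
context = build_training_context(["gold passage"], collection, num_negatives=2)
```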

7.2.3. Conditional Optimization

Alternatively, predictive models can be trained conditioned on the retrieval model’s performance. Without loss of generality, this can be seen as an iterative process, where the predictive model is optimized in one iteration and the retrieval model is optionally optimized in the next iteration. With this formulation, the parameters of a predictive model at iteration $t$ can be obtained as follows:

(17) $\theta^{(t)} = \arg\min_{\theta} \frac{1}{|T|} \sum_{(x,y) \in T} \mathcal{L}\left(f_{\theta}\left(x; g_{\omega^{(t)}}\right), y\right)$
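The iterative view behind Equations (15) and (17) amounts to alternating minimization: fix the retriever and fit the predictor, then fix the predictor and fit the retriever. The toy sketch below uses scalar parameters, a hand-written loss, and exhaustive search over small grids, all of which are assumptions made purely to keep the example runnable; real systems use gradient-based training for each step.

```python
from typing import Callable, List, Sequence, Tuple


def alternate(
    loss: Callable[[float, float], float],
    thetas: Sequence[float],
    omegas: Sequence[float],
    theta0: float,
    omega0: float,
    rounds: int,
) -> Tuple[float, float, List[float]]:
    """Alternate between the predictor step and the retriever step."""
    theta, omega = theta0, omega0
    history: List[float] = []
    for _ in range(rounds):
        # predictor step (cf. Eq. 17): retriever parameters held fixed
        theta = min(thetas, key=lambda th: loss(th, omega))
        # retriever step (cf. Eq. 15): predictor parameters held fixed
        omega = min(omegas, key=lambda om: loss(theta, om))
        history.append(loss(theta, omega))
    return theta, omega, history


def toy_loss(theta: float, omega: float) -> float:
    """Hypothetical downstream loss coupling the two components."""
    return (theta - omega) ** 2 + (omega - 2.0) ** 2


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
theta_hat, omega_hat, history = alternate(toy_loss, grid, grid, 0.0, 0.0, rounds=4)
```

Each step can only keep or lower the loss because the current value is always among the candidates, but the procedure may stop at a point that is not the joint optimum, which is one motivation for the joint optimization discussed next.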

7.3. Joint Optimization of Retrieval and Predictive Models

7.3.1. Joint Multi-Task Optimization

Retrieval and predictive models can be trained jointly. Joint optimization can be modeled end-to-end (explained later in this section) or through multi-task learning. In joint multi-task optimization, for any training instance, both the retrieval results and the predictive model parameters will be updated. For instance, FiD-Light (Hofstätter et al., 2023) generates the identifiers of documents with a positive provenance score in addition to the output text for retrieval-augmented text generation tasks. The generated document IDs are then used for re-ranking the result list. Therefore, this can be seen as a joint optimization of re-ranking and generation.

7.3.2. End-to-End Optimization

Following the risk minimization framework, end-to-end optimization in REML can be modeled as follows:

(18) $\theta^{*}, \omega^{*} = \arg\min_{\theta, \omega} \frac{1}{|T|} \sum_{(x,y) \in T} \mathcal{L}\left(f_{\theta}\left(x; g_{\omega}\right), y\right)$

where both parameter sets $\theta$ and $\omega$ are updated simultaneously by optimizing an appropriate loss function for the downstream machine learning task. End-to-end optimization of REML, however, can be challenging. This is mostly due to the top-$k$ item selection process of information access models in REML, which makes the end-to-end REML model non-differentiable. Existing work makes simplifying assumptions to turn the optimization into a differentiable process. For instance, the RAG model from Lewis et al. (2020b) restricts the marginalization over retrieved documents to a set of pre-selected documents. A similar approach was later utilized by Sachan et al. (2021) for open-domain question answering. In addition to marginalization, RetGen (Zhang et al., 2022a) and EMDR2 (Singh et al., 2021) make a document independence assumption and compute the loss function as a summation over each individual document.
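The marginalization idea can be made concrete with a small numeric sketch: the output likelihood is a weighted sum over a fixed candidate set, with weights given by a softmax over retrieval scores, so the loss depends smoothly on those scores. The function `marginal_nll`, the softmax normalization over only the pre-selected candidates, and the use of whole-output probabilities are simplifying assumptions and not a reproduction of any cited model.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_nll(retrieval_scores: Sequence[float], gen_probs: Sequence[float]) -> float:
    """Negative log of sum_z p(z | x) * p(y | x, z) over pre-selected documents z.

    retrieval_scores: retriever scores for the k candidate documents
    gen_probs: probability the predictive model gives the target y with each document
    """
    doc_probs = softmax(retrieval_scores)
    marginal = sum(pz * py for pz, py in zip(doc_probs, gen_probs))
    return -math.log(marginal)


gen_probs = [0.2, 0.8]                            # the second document is more helpful
uniform_loss = marginal_nll([0.0, 0.0], gen_probs)
better_loss = marginal_nll([0.0, 2.0], gen_probs)  # retriever favors the helpful document
worse_loss = marginal_nll([2.0, 0.0], gen_probs)
```

Raising the score of the document that helps the predictor lowers the loss, which is the signal that lets the retriever be trained from the downstream objective without differentiating through a hard top-$k$ selection.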

8. Evaluation

Evaluation refers to how retrieval components are benchmarked. evaluation constructs: extrinsic evaluation on the downstream task, intrinsic evaluation, attribution/support/explainability, efficiency given two retrieval models gωsubscript𝑔𝜔g_{\omega}italic_g start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT and gωsubscript𝑔superscript𝜔g_{\omega^{\prime}}italic_g start_POSTSUBSCRIPT italic_ω start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT, how can we determine if gωgωsucceedssubscript𝑔𝜔subscript𝑔superscript𝜔g_{\omega}\succ g_{\omega^{\prime}}italic_g start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT ≻ italic_g start_POSTSUBSCRIPT italic_ω start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT intrinsically (e.g., using relevance judgments on retrieval results)? given two retrieval models gωsubscript𝑔𝜔g_{\omega}italic_g start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT and gωsubscript𝑔superscript𝜔g_{\omega^{\prime}}italic_g start_POSTSUBSCRIPT italic_ω start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT, how can we determine if gωgωsucceedssubscript𝑔𝜔subscript𝑔superscript𝜔g_{\omega}\succ g_{\omega^{\prime}}italic_g start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT ≻ italic_g start_POSTSUBSCRIPT italic_ω start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT extrinsically (e.g., using L oder s𝒮𝑠𝒮s\in\mathcal{S}italic_s ∈ caligraphic_S)?

Our goal in evaluation is to understand whether a change to the system, including a full replacement, is better than keeping the status quo. For example, we might be interested in knowing whether changing the search component improves predictive performance. We will refer to this evaluation metric as $\mu$, whose arguments will be explained shortly. We compute the expected metric value with respect to a distribution over some population $\mathop{\textrm{Pr}}(\mathcal{X})$, which is ideally the same distribution used for training data.

We classify evaluation as either extrinsic, looking at the final performance of the predictive model, or intrinsic, looking at the performance of a component of the system using a local measure of quality rather than predictive model performance (Sparck-Jones and Galliers, 1996). An intrinsic evaluation of a model can be an efficient approximation for an extrinsic evaluation or can measure some independent value such as resource consumption.

8.1. Extrinsic evaluation

Across these settings, we are most often interested in the expected value of the metric for a system. That is, for a model $\mathcal{M}=\langle f_{\theta},g_{\omega}\rangle$ and evaluation data $\textrm{L}_{\text{test}}$, we compute,

(19) $\displaystyle\mathop{\mathbb{E}}[\mu(\mathcal{M}(x))]=\frac{1}{|\textrm{L}_{\text{test}}|}\sum_{x\in\textrm{L}_{\text{test}}}\mu(\mathcal{M}(x))$
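Equation (19) amounts to averaging the metric over the test instances. A minimal sketch follows, assuming that each test instance carries its target $y$ alongside the input $x$ (the equation leaves the target implicit) and using exact match as a stand-in for $\mu$. The names `expected_metric`, `toy_model`, and the toy data are hypothetical.

```python
from statistics import mean


def expected_metric(model, metric, test_data):
    """Average metric(model(x), y) over the (x, y) pairs of the test set, as in Eq. (19)."""
    return mean(metric(model(x), y) for x, y in test_data)


def exact_match(prediction, target):
    """A simple choice of mu: 1.0 if the prediction equals the target, else 0.0."""
    return 1.0 if prediction == target else 0.0


# Hypothetical stand-in for a full REML model M = <f, g>.
toy_answers = {
    "capital of France?": "Paris",
    "2 + 2?": "4",
    "largest planet?": "Saturn",  # deliberately wrong
}


def toy_model(x):
    return toy_answers.get(x, "")


test_data = [
    ("capital of France?", "Paris"),
    ("2 + 2?", "4"),
    ("largest planet?", "Jupiter"),
    ("author of Hamlet?", "Shakespeare"),
]

score = expected_metric(toy_model, exact_match, test_data)
```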

When evaluating a system extrinsically, we can pose hypotheses about relative system performance in several ways (Guu et al., 2020; Petroni et al., 2021; Lewis et al., 2020b). In non-overlapping system comparison, given two model tuples $\mathcal{M}=\langle f_{\theta},g_{\omega}\rangle$ and $\mathcal{M}'=\langle f_{\theta'},g_{\omega'}\rangle$, determine if $\mathop{\mathbb{E}}[\mu(\mathcal{M}(x))]>\mathop{\mathbb{E}}[\mu(\mathcal{M}'(x))]$.
In fixed retrieval model comparison, given two model tuples $\mathcal{M}=\langle f_{\theta},g_{\omega}\rangle$ and $\mathcal{M}'=\langle f_{\theta'},g_{\omega}\rangle$, determine if $\mathop{\mathbb{E}}[\mu(\mathcal{M}(x))]>\mathop{\mathbb{E}}[\mu(\mathcal{M}'(x))]$. As a special case, we can consider $g_{\omega^{*}}$, an optimal ranker according to some intrinsic criteria; this allows us to examine whether a system can effectively incorporate relevant items.
In fixed predictive model comparison, given two model tuples $\mathcal{M}=\langle f_{\theta},g_{\omega}\rangle$ and $\mathcal{M}'=\langle f_{\theta},g_{\omega'}\rangle$, determine if $\mathop{\mathbb{E}}[\mu(\mathcal{M}(x))]>\mathop{\mathbb{E}}[\mu(\mathcal{M}'(x))]$. In this case, we can consider $f_{\theta^{*}}$, an optimal predictive model according to some intrinsic criteria; this allows us to examine whether a system can effectively retrieve relevant items.
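The three comparison styles differ only in which component of the tuple is held fixed. The sketch below makes this concrete with toy components; the retrievers `g_oracle` and `g_weak`, the predictors `f_reader` and `f_ignore`, and the four-item corpus are all hypothetical, and a real comparison would also need a significance test, which is omitted here.

```python
def run(f, g, x):
    """Apply the REML model M = <f, g>: retrieve with g, then predict with f."""
    return f(x, g(x))


def expected_metric(f, g, metric, test_data):
    """Mean of metric(M(x), y) over the test set."""
    return sum(metric(run(f, g, x), y) for x, y in test_data) / len(test_data)


def compare(system_a, system_b, metric, test_data):
    """Return E[mu(M(x))] - E[mu(M'(x))]; a positive value favors system_a."""
    (f_a, g_a), (f_b, g_b) = system_a, system_b
    return (expected_metric(f_a, g_a, metric, test_data)
            - expected_metric(f_b, g_b, metric, test_data))


# Toy corpus mapping each query to the single item that answers it.
CORPUS = {"q1": "a1", "q2": "a2", "q3": "a3", "q4": "a4"}


def g_oracle(x):
    """Stands in for an optimal ranker: always returns the answering item."""
    return [CORPUS[x]]


def g_weak(x):
    """Retrieves the answering item for only half of the queries."""
    return [CORPUS[x]] if x in ("q1", "q2") else []


def f_reader(x, items):
    """Copies the answer from the retrieved items when any are available."""
    return items[0] if items else "unknown"


def f_ignore(x, items):
    """Ignores retrieval and always predicts the same answer."""
    return "a1"


def exact_match(prediction, target):
    return float(prediction == target)


test_data = list(CORPUS.items())

# Fixed predictive model comparison: same f, different g.
fixed_predictive = compare((f_reader, g_oracle), (f_reader, g_weak), exact_match, test_data)
# Fixed retrieval model comparison: same g (here the oracle), different f.
fixed_retrieval = compare((f_reader, g_oracle), (f_ignore, g_oracle), exact_match, test_data)
# Non-overlapping system comparison: both components differ.
non_overlapping = compare((f_reader, g_weak), (f_ignore, g_oracle), exact_match, test_data)
```

In the fixed retrieval case, pairing both predictors with the oracle retriever isolates whether a predictor can incorporate relevant items, which is the special case described above.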

8.2. Intrinsic evaluation

REML systems comprise numerous components, each capable of individual assessment. Intrinsic evaluation of a component involves comparing systems based on their isolated performance with respect to that component. Nevertheless, such systems’ most important components are the retrieval and predictive models.

8.2.1. Intrinsic evaluation of retrieval

Intrinsic evaluation of a retrieval model focuses on comparing systems according to their isolated retrieval performance. In this case, assuming single-turn retrieval, we can pose two styles of hypothesis. In non-overlapping system comparison, given two model tuples $\mathcal{M}=\langle f_{\theta},g_{\omega}\rangle$ and $\mathcal{M}'=\langle f_{\theta'},g_{\omega'}\rangle$, determine if $\mathop{\mathbb{E}}[\mu(g_{\omega}(f_{\theta}(x)))]>\mathop{\mathbb{E}}[\mu(g_{\omega'}(f_{\theta'}(x)))]$, where, with some abuse of notation, $f_{\theta}(x)$ and $f_{\theta'}(x)$ consider only the query processing for $g_{\omega}$ and $g_{\omega'}$.
In fixed query processing comparison, given two model tuples $\mathcal{M}=\langle f_{\theta},g_{\omega}\rangle$ and $\mathcal{M}'=\langle f_{\theta},g_{\omega'}\rangle$, determine if $\mathop{\mathbb{E}}[\mu(g_{\omega}(f_{\theta}(x)))]>\mathop{\mathbb{E}}[\mu(g_{\omega'}(f_{\theta}(x)))]$. The choice of metric $\mu$ depends on the task but should be some retrieval metric, unless the retrieval result is not a ranking. Such metrics require some relevance estimate for each item. In the case of REML, this can come from:

  1. Explicit labels gathered from human raters. This requires instances, targets, and items to be interpretable. The ‘provenance labels’ in the KILT benchmark for tasks such as Natural Questions (Kwiatkowski et al., 2019) and ELI5 (Fan et al., 2019) can be thought of as such labels.

  2. Inferred labels from the target. For example, in QA, we could compute the similarity between a retrieved item and the target. ‘Context Relevance’ from RAGAS (Es et al., 2024) and ARES (Saad-Falcon et al., 2024) can be thought of as a variant of this case.

  3. Attributed labels from the model prediction. For example, in QA, if a model generates an answer correctly, we can try to attribute the answer correctness to each of the retrieved items. This method, drawing inspiration from eRAG (Salemi and Zamani, 2024a), assesses the retrieval model’s performance by quantifying the contribution of each retrieved document towards achieving the correct answer.
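The second and third label sources can be sketched in a few lines. The example below derives inferred labels from token overlap with the target and attributed labels by scoring the prediction made from each item alone, then feeds either label list into a standard retrieval metric. The overlap threshold, the `toy_predictor`, and the example texts are assumptions made for illustration; this is a rough analogue of the ideas above, not the procedure of RAGAS, ARES, or eRAG.

```python
def token_overlap(item, target):
    """Fraction of target tokens that also appear in the item."""
    item_tokens = set(item.lower().split())
    target_tokens = target.lower().split()
    if not target_tokens:
        return 0.0
    return sum(t in item_tokens for t in target_tokens) / len(target_tokens)


def inferred_labels(ranking, target, threshold=0.5):
    """Inferred labels: an item is relevant if it overlaps enough with the target."""
    return [1 if token_overlap(item, target) >= threshold else 0 for item in ranking]


def attributed_labels(ranking, x, target, predictor, metric):
    """Attributed labels: score the prediction made from each retrieved item alone."""
    return [metric(predictor(x, [item]), target) for item in ranking]


def reciprocal_rank(labels):
    """Retrieval metric: 1 / rank of the first item with a positive label."""
    for rank, label in enumerate(labels, start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


def exact_match(prediction, target):
    return float(prediction == target)


def toy_predictor(x, items):
    """Hypothetical reader: answers only when the full name appears in its input items."""
    text = " ".join(items).lower()
    return "william shakespeare" if "william shakespeare" in text else "unknown"


x = "who wrote hamlet"
target = "william shakespeare"
ranking = [
    "hamlet is a tragedy set in denmark",
    "william shakespeare wrote hamlet around 1600",
    "shakespeare was born in stratford",
]

inferred = inferred_labels(ranking, target)
attributed = attributed_labels(ranking, x, target, toy_predictor, exact_match)
rr_inferred = reciprocal_rank(inferred)
rr_attributed = reciprocal_rank(attributed)
```

The two label sources disagree on the third item: it overlaps with the target, yet the toy predictor cannot produce the correct answer from it alone.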

8.2.2. Intrinsic evaluation of consumption

Intrinsic evaluation of consumption focuses on comparing systems according to their isolated ability to translate retrieval results into effective predictions. Although extrinsic evaluation measures the effectiveness of the system in general, intrinsic evaluation of consumption focuses on whether a prediction is attributable to retrieval results (e.g., versus information already in the consumption model parameters). In fixed retrieval comparison, given two model tuples $\mathcal{M}=\langle f_{\theta},g_{\omega}\rangle$ and $\mathcal{M}'=\langle f_{\theta'},g_{\omega}\rangle$, determine if $\mathop{\mathbb{E}}[\mu(f_{\theta}(\mathcal{R}_{\omega,x}),\mathcal{R}_{\omega,x})]>\mathop{\mathbb{E}}[\mu(f_{\theta'}(\mathcal{R}_{\omega,x}),\mathcal{R}_{\omega,x})]$.
The choice of metric $\mu(\tilde{y},\tilde{\mathcal{R}})$ depends on the task but measures whether the prediction $\tilde{y}$ is related to the retrieval results $\tilde{\mathcal{R}}$. ‘Faithfulness’ from RAGAS (Es et al., 2024) and ARES (Saad-Falcon et al., 2024) can be thought of as a variant of this case.
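As a rough illustration of a metric of the form $\mu(\tilde{y},\tilde{\mathcal{R}})$, the sketch below scores a prediction by the fraction of its tokens that occur in the retrieval results and uses that score in a fixed retrieval comparison. This lexical proxy is an assumption chosen for simplicity; it is not the faithfulness measure of RAGAS or ARES, and the two toy predictors are hypothetical.

```python
def support_score(prediction, retrieved_items):
    """Fraction of prediction tokens that occur in at least one retrieved item."""
    pred_tokens = prediction.lower().split()
    if not pred_tokens:
        return 0.0
    evidence = set()
    for item in retrieved_items:
        evidence.update(item.lower().split())
    return sum(tok in evidence for tok in pred_tokens) / len(pred_tokens)


def compare_consumers(f_a, f_b, retrieval_results):
    """Fixed retrieval comparison: both predictors consume the same retrieval results.

    Returns the mean difference in support score; a positive value favors f_a.
    """
    diffs = [support_score(f_a(r), r) - support_score(f_b(r), r)
             for r in retrieval_results]
    return sum(diffs) / len(diffs)


def f_grounded(items):
    """Hypothetical predictor that copies its answer from the first retrieved item."""
    return items[0]


def f_parametric(items):
    """Hypothetical predictor that answers from its parameters, ignoring retrieval."""
    return "paris is nice"


retrieval_results = [
    ["the eiffel tower is in paris"],
    ["mount fuji is in japan"],
]

gap = compare_consumers(f_grounded, f_parametric, retrieval_results)
```

Note that the score says nothing about whether the prediction is correct; it only measures how much of the prediction is traceable to the retrieved items, which is the distinction from extrinsic evaluation drawn above.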

8.3. Datasets

Every REML system is tailored to a specific set of tasks. In the literature, various benchmarks and datasets assess these systems from different angles. Broadly, datasets fall into two categories: (1) those that support only extrinsic evaluation, assessing REML systems solely on end-to-end performance, and (2) those that additionally provide retrieval relevance labels, so that retrieval performance can be evaluated alongside end-to-end performance. Table 5 lists the most commonly used datasets and benchmarks in the literature.

| Evaluation (§8.3) | Task | Datasets | Corpus |
|---|---|---|---|
| End-to-End | Entity Related QA | PopQA (Mallen et al., 2023), EntityQuestions (Sciavolino et al., 2021) | Wikipedia |
| End-to-End | Current Events Related QA | RealtimeQA (Kasai et al., 2023) | News Websites |
| End-to-End | Science Related Multiple-choice QA | ARC (Clark et al., 2018) | Subset of Web |
| End-to-End | Science Related QA | Qasper (Dasigi et al., 2021) | Scientific Articles |
| End-to-End | Story Related Long-form QA | NarrativeQA (Kočiský et al., 2018) | A Long Story |
| End-to-End | Query-based Summarization | QMSum (Zhong et al., 2021) | A Meeting Transcript |
| End-to-End | Personalized Classification and Generation | LaMP (Salemi et al., 2024b) | A User Profile |
| End-to-End & Retrieval | Open-domain Multi-Hop QA | 2WikiMultiHopQA (Ho et al., 2020), HotpotQA (Yang et al., 2018; Petroni et al., 2021) | Wikipedia |
| End-to-End & Retrieval | Open-domain Short-form QA | Natural Questions (Kwiatkowski et al., 2019; Petroni et al., 2021), TriviaQA (Joshi et al., 2017; Petroni et al., 2021), StrategyQA (Geva et al., 2021) | Wikipedia |
| End-to-End & Retrieval | Open-domain Long-form QA | ELI5 (Fan et al., 2019; Petroni et al., 2021), ASQA (Gao et al., 2023b) | Wikipedia |
| End-to-End & Retrieval | Dialogue Generation | Wizard of Wikipedia (Dinan et al., 2019; Petroni et al., 2021) | Wikipedia |
| End-to-End & Retrieval | Slot Filling | ZeroShot RE (Levy et al., 2017; Petroni et al., 2021), T-REx (Elsahar et al., 2018; Petroni et al., 2021) | Wikipedia |
| End-to-End & Retrieval | Entity Linking | AIDA CoNLL-YAGO (Hoffart et al., 2011; Petroni et al., 2021), WNED-WIKI/CWEB (Alani et al., 2018; Petroni et al., 2021) | Wikipedia |
| End-to-End & Retrieval | Fact Verification | FEVER (Thorne et al., 2018; Petroni et al., 2021) | Wikipedia |
| End-to-End & Retrieval | Open-domain Visual QA | OK-VQA (Marino et al., 2019; Qu et al., 2021) | Wikipedia |
| End-to-End & Retrieval | Open-domain Visual QA | FVQA (Wang et al., 2017) | A Supporting Facts Set |

Table 5. Datasets available for training and evaluating REML systems (not an exhaustive list). Some focus on end-to-end evaluation, while others provide retrieval evaluation labels.

9. Future Directions

Considering the extensive efforts made thus far in designing REML systems, this section presents new ideas for future work aimed at enhancing various aspects of these systems. Specifically, we propose several future directions for each of the previously discussed sections.

9.1. Querying

9.1.1. Query with Instruction

Recent advancements in instruction tuning for LLMs have demonstrated substantial improvements in performance across downstream tasks (Wei et al., 2022; Mishra et al., 2022; Wang et al., 2022). Moreover, recent research on retrieval with instructions has surpassed competitive baselines, showcasing superior performance in terms of retrieval efficiency (Asai et al., 2023b). With that in mind, developing transformation functions for query generation that produce task- and query-specific instructions alongside the query could significantly enhance the retrieval model’s capacity to fulfill the requirements of the predictive model.

9.1.2. Retrieval System Aware Query Generation

Most existing query generation, transformation, and decomposition functions operate without taking into account the type and configuration of the retrieval model. However, a query generation component that considers the specifications of the retrieval model can produce more effective queries tailored to that model’s requirements. For instance, when using BM25 as the retrieval model, which emphasizes term matching, queries whose terms align closely with relevant documents can enhance retrieval performance. Conversely, for dense retrieval models that prioritize semantic similarity between queries and documents, generating queries with greater semantic alignment to relevant documents will be more advantageous.
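One very simple way to picture retrieval-system-aware query generation is a query builder that branches on the retriever type. In the hypothetical sketch below, a term-matching retriever receives only content-bearing keywords, while a dense retriever receives the full natural-language question. The stopword list, the `retriever_type` values, and the function name are all assumptions for illustration; a learned query generator would replace these hand-written rules.

```python
# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "was",
             "who", "what", "which", "did", "to", "for"}


def generate_query(question, retriever_type):
    """Produce a query string suited to the given (hypothetical) retriever type."""
    if retriever_type == "sparse":
        # Term-matching retrievers such as BM25: keep only content-bearing terms.
        words = question.lower().rstrip("?").split()
        return " ".join(w for w in words if w not in STOPWORDS)
    if retriever_type == "dense":
        # Dense retrievers: keep the full question so its meaning can be embedded.
        return question
    raise ValueError(f"unknown retriever type: {retriever_type}")


question = "Who was the first person to walk on the Moon?"
sparse_query = generate_query(question, "sparse")
dense_query = generate_query(question, "dense")
```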

9.1.3. Dissociated Interface between Retrieval and Predictive Model

Most REML systems use natural language for communication between predictive and retrieval models. In models like kNN-LM (Khandelwal et al., 2020), which use hidden representations of the predictive model as queries and keys, the representations all come from the predictive model. However, relying solely on natural language or on the representation of one model may not be optimal. An alternative is to train the retrieval and predictive models jointly to learn a shared hidden space, enabling more effective communication. This approach can convey information more efficiently and enhance the interaction between the models, leading to better performance and coordination.

9.2. Searching

9.2.1. Predictive Model Aware Retrieval Systems

The personalization literature in IR can motivate the design of retrieval models that are tailored to a specific predictive model. For example, multiple predictive models may be served by a few retrieval models (Salemi and Zamani, 2024b). In this case, the retrieval models can store metadata about the predictive models they serve, opening up opportunities to tailor the retrieval results to each predictive model or group of predictive models.

9.2.2. Redefining Relevancy

Related to predictive model aware IR systems, current IR models are built around the relevance of a document, defined by whether the document helps satisfy a user’s information need. However, in REML, the consumers of the documents are predictive models, not human users (Salemi and Zamani, 2024a). Therefore, relevancy should be redefined to incorporate the usefulness of the documents in predicting a targeted output.

9.3. Presentation & Consumption

9.3.1. Task-Specialized Presentation and Consumption

Similar to how task-specific retrieval is beneficial for distinguishing between fact verification, entity linking, QA, and so on, it will likely be better to use a document representation specific to the task at hand. This may materialize as a prompt with task-specific instructions, task-specific intermediate steps (including explanation for how the document is relevant), or even task-specific embeddings of documents.

9.3.2. Proactive REML

In practice, it is beneficial not only to answer the immediate question posed by the query, but also to address potential follow-up questions (Samarinas and Zamani, 2024; Liao et al., 2023). Follow-up questions can transition to a new topic (e.g., reserving a hotel after booking a flight) or dive deeper into part of the initial answer (e.g., not only did Dave Grohl form the Foo Fighters, he was previously a drummer for Nirvana).

9.4. Storing

9.4.1. Shared collection

Figure 2 depicts a single $f_{\theta}$ interacting with multiple information access systems. However, it is also possible to map multiple $f_{\theta_{i}}$ to a single information access system. In this situation, where multiple predictive models share a single collection, the ability of models to learn what is and is not important to store becomes critical, since pushing irrelevant content to the shared storage can reduce its usefulness and degrade the performance of the other $f_{\theta_{i}}$ sharing the same collection.

9.4.2. Storage staleness

While training a REML model, updates to the retriever often make the previously built storage stale. Although there have been many attempts to work around this issue, no study has found a fundamental solution. This persistent challenge calls for further research into adaptive storage mechanisms that can dynamically align with retriever updates, ensuring data integrity and model efficiency.

9.5. Evaluation

9.5.1. Transparent Intrinsic evaluations

Although there have been a few attempts at systematic evaluation of REML models (Saad-Falcon et al., 2024; Es et al., 2024; Salemi and Zamani, 2024a), most of them use LLMs to perform the evaluation. These metrics lack transparency in that they do not provide valid reasoning for the scores they assign. Research on transparent evaluation of REML systems is therefore an important future direction.

9.5.2. Fairness, Trustworthiness, and Transparency of Output

Although these topics are receiving growing attention, they are mainly evaluated in non-REML settings (e.g., text generation without retrieval augmentation). These evaluation criteria can open different research directions tailored to REML applications. Identifying where unfairness can arise (during retrieval and/or consumption) and checking whether it propagates to the final output of the system is one such direction. Similar directions can be explored in model trustworthiness and transparency research.

9.6. Optimization

9.6.1. Effective and Efficient End-to-End Optimization

The non-differentiable nature of some components in REML, or of their interaction, makes it challenging to optimize REML systems end-to-end. Existing approaches are based on simplifying assumptions, and developing more accurate and robust approximations for end-to-end REML optimization is an important research direction. Moreover, reward-based optimization of these systems, based on both human and AI feedback, is relatively underexplored. A better understanding of exploration and exploitation of the information items provided by the information access system is required.

9.6.2. Learning from Online and Session-based Feedback

Interaction between the predictive and information access models can be sequential. Simple forms of such sequential interaction already exist in the context of multi-hop question answering. Using the feedback provided by the predictive model and its users during an inference session to adjust the REML output is critical for developing effective interactive REML systems.

9.6.3. Efficient Approximation of Feedback for Optimization

Both end-to-end and conditional optimization approaches require feedback from different components of the REML system. As the field moves towards larger and more expensive models, developing efficient and accurate feedback approximations could substantially reduce the cost of REML training. This would not only reduce the monetary cost associated with REML training, but would also speed up research progress and lead to more sustainable and environmentally friendly systems.

9.6.4. One Information Access and Multiple Predictive Models

Most research focuses on developing a REML system for one task. On the other hand, information access systems can serve multiple predictive models, similar to the current search engines that serve billions of users, as studied in Salemi and Zamani (2024b). Optimizing information access components that provide service to multiple predictive models, aggregating and calibrating feedback across predictive models, and “personalizing” the retrieval result lists for each predictive model are important directions for future research.

10. Conclusion

In this work, we surveyed the current literature on retrieval-enhanced machine learning (REML) and synthesized it into consistent mathematical concepts, providing researchers with a formalized framework for REML. Additionally, by bridging information retrieval research and REML, we identify new opportunities and open avenues for future studies in this emerging research paradigm.

References

  • Aksitov et al. (2023) Renat Aksitov, Chung-Ching Chang, David Reitter, Siamak Shakeri, and Yunhsuan Sung. 2023. Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models. arXiv:2302.05578 [cs.CL]
  • Alani et al. (2018) Harith Alani, Zhaochen Guo, and Denilson Barbosa. 2018. Robust named entity disambiguation with random walks. Semant. Web 9, 4 (jan 2018), 459–479. https://doi.org/10.3233/SW-170273
  • Alon et al. (2022) Uri Alon, Frank F. Xu, Junxian He, Sudipta Sengupta, Dan Roth, and Graham Neubig. 2022. Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval. In ICML 2022 Workshop on Knowledge Retrieval and Language Models. https://openreview.net/forum?id=ZJZmKGM6UB
  • Arcadinho et al. (2022) Samuel David Arcadinho, David Aparicio, Hugo Veiga, and Antonio Alegria. 2022. T5QL: Taming language models for SQL generation. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 276–286. https://doi.org/10.18653/v1/2022.gem-1.23
  • Arnab et al. (2021) Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. ViViT: A Video Vision Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 6836–6846.
  • Asai et al. (2022) Akari Asai, Matt Gardner, and Hannaneh Hajishirzi. 2022. Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 2226–2243. https://doi.org/10.18653/v1/2022.naacl-main.162
  • Asai et al. (2023a) Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. 2023a. Retrieval-based Language Models and Applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts). Association for Computational Linguistics, Toronto, Canada, 41–46. https://doi.org/10.18653/v1/2023.acl-tutorials.6
  • Asai et al. (2023b) Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. 2023b. Task-aware Retrieval with Instructions. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 3650–3675. https://doi.org/10.18653/v1/2023.findings-acl.225
  • Asai et al. (2024) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=hSyW5go0v8
  • Attar and Fraenkel (1977) R. Attar and A. S. Fraenkel. 1977. Local Feedback in Full-Text Retrieval Systems. J. ACM 24, 3 (jul 1977), 397–417. https://doi.org/10.1145/322017.322021
  • Baek et al. (2023) Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. In Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE). Association for Computational Linguistics, Toronto, Canada, 78–106. https://doi.org/10.18653/v1/2023.nlrse-1.7
  • Bahri et al. (2020) Dara Bahri, Yi Tay, Che Zheng, Donald Metzler, and Andrew Tomkins. 2020. Choppy: Cut Transformer for Ranked List Truncation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 1513–1516. https://doi.org/10.1145/3397271.3401188
  • Bertsch et al. (2023) Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley. 2023. Unlimiformer: Long-Range Transformers with Unlimited Length Input. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=lJWUJWLCJo
  • Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning. PMLR, 2206–2240.
  • Burtsev et al. (2021) Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, and Grigory V. Sapunov. 2021. Memory Transformer. arXiv:2006.11527 [cs.CL]
  • Cai et al. (2023) Yucheng Cai, Hong Liu, Zhijian Ou, Yi Huang, and Junlan Feng. 2023. Knowledge-Retrieval Task-Oriented Dialog Systems with Semi-Supervision. arXiv:2305.13199 [cs.CL]
  • Campos et al. (2016) Daniel Fernando Campos, Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, Li Deng, and Bhaskar Mitra. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. 30th Conference on Neural Information Processing Systems, NIPS (2016).
  • Carlini et al. (2021) Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21). 2633–2650.
  • Chen et al. (2017a) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 1870–1879. https://doi.org/10.18653/v1/P17-1171
  • Chen et al. (2017b) Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017b. A Survey on Dialogue Systems: Recent Advances and New Frontiers. SIGKDD Explor. Newsl. 19, 2 (nov 2017), 25–35. https://doi.org/10.1145/3166054.3166058
  • Chen et al. (2023d) Hung-Ting Chen, Fangyuan Xu, Shane Arora, and Eunsol Choi. 2023d. Understanding Retrieval Augmentation for Long-Form Question Answering. arXiv:2310.12150 [cs.CL]
  • Chen et al. (2023b) Jifan Chen, Grace Kim, Aniruddh Sriram, Greg Durrett, and Eunsol Choi. 2023b. Complex Claim Verification with Evidence Retrieved in the Wild. arXiv:2305.11859 [cs.CL]
  • Chen et al. (2023a) Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Cohen. 2023a. Re-Imagen: Retrieval-Augmented Text-to-Image Generator. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=XSEBx0iSjFQ
  • Chen et al. (2023c) Wenhu Chen, Pat Verga, Michiel de Jong, John Wieting, and William W. Cohen. 2023c. Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Dubrovnik, Croatia, 1597–1610. https://doi.org/10.18653/v1/2023.eacl-main.117
  • Chen et al. (2022) Xiang Chen, Lei Li, Ningyu Zhang, Xiaozhuan Liang, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2022. Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., 23908–23922. https://proceedings.neurips.cc/paper_files/paper/2022/file/97011c648eda678424f9292dadeae72e-Paper-Conference.pdf
  • Chen et al. (2018) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential Recommendation with User Memory Networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (Marina Del Rey, CA, USA) (WSDM ’18). Association for Computing Machinery, New York, NY, USA, 108–116. https://doi.org/10.1145/3159652.3159668
  • Cheng et al. (2024) Xin Cheng, Di Luo, Xiuying Chen, Lemao Liu, Dongyan Zhao, and Rui Yan. 2024. Lift yourself up: Retrieval-augmented text generation with self-memory. Advances in Neural Information Processing Systems 36 (2024).
  • Chuang et al. (2023) Yung-Sung Chuang, Wei Fang, Shang-Wen Li, Wen-tau Yih, and James Glass. 2023. Expand, Rerank, and Retrieve: Query Reranking for Open-Domain Question Answering. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 12131–12147. https://doi.org/10.18653/v1/2023.findings-acl.768
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457 [cs.AI]
  • Cramer (2021) Patrick Cramer. 2021. AlphaFold2 and the future of structural biology. Nature Structural & Molecular Biology 28, 9 (2021), 704–705.
  • Croft and Harper (1979) W. B. Croft and D. J. Harper. 1979. Using Probabilistic Models of Document Retrieval Without Relevance Information. J. of Documentation 35, 4 (1979), 285–295.
  • Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 2978–2988. https://doi.org/10.18653/v1/P19-1285
  • Das et al. (2020) Rajarshi Das, Ameya Godbole, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2020. A Simple Approach to Case-Based Reasoning in Knowledge Bases. In Automated Knowledge Base Construction. https://openreview.net/forum?id=AEY9tRqlU7
  • Das et al. (2019) Rajarshi Das, Ameya Godbole, Dilip Kavarthapu, Zhiyu Gong, Abhishek Singhal, Mo Yu, Xiaoxiao Guo, Tian Gao, Hamed Zamani, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step Entity-centric Information Retrieval for Multi-Hop Question Answering. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering. Association for Computational Linguistics, Hong Kong, China, 113–118. https://doi.org/10.18653/v1/D19-5816
  • Dasigi et al. (2021) Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 4599–4610. https://doi.org/10.18653/v1/2021.naacl-main.365
  • De Jong et al. (2023) Michiel De Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua Ainslie, Sumit Sanghai, Fei Sha, and William W. Cohen. 2023. Pre-Computed Memory or on-the-Fly Encoding? A Hybrid Approach to Retrieval Augmentation Makes the Most of Your Compute. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML’23). JMLR.org, Article 290, 14 pages.
  • Dehghani et al. (2017) Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. 2017. Neural Ranking Models with Weak Supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (Shinjuku, Tokyo, Japan) (SIGIR ’17). Association for Computing Machinery, New York, NY, USA, 65–74. https://doi.org/10.1145/3077136.3080832
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • Dinan et al. (2019) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-Powered Conversational Agents. In International Conference on Learning Representations. https://openreview.net/forum?id=r1l73iRqKm
  • Dong et al. (2023) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A Survey on In-context Learning. arXiv:2301.00234 [cs.CL]
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. https://openreview.net/forum?id=YicbFdNTTy
  • Dou et al. (2023) Longxu Dou, Yan Gao, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Jian-Guang Lou, and Dechen Zhan. 2023. UniSAr: A unified structure-aware autoregressive language model for text-to-SQL semantic parsing. International Journal of Machine Learning and Cybernetics 14, 12 (2023), 4361–4376.
  • Douze et al. (2024) Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. arXiv:2401.08281 [cs.LG]
  • Drozdov et al. (2023) Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, and Denny Zhou. 2023. Compositional Semantic Parsing with Large Language Models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=gJW8hSGBys8
  • Elsahar et al. (2018) Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan. https://aclanthology.org/L18-1544
  • Eric et al. (2017) Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-Value Retrieval Networks for Task-Oriented Dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, Saarbrücken, Germany, 37–49. https://doi.org/10.18653/v1/W17-5506
  • Es et al. (2024) Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, St. Julians, Malta, 150–158. https://aclanthology.org/2024.eacl-demo.16
  • Fan et al. (2021) Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2021. Augmenting Transformers with KNN-Based Composite Memory for Dialog. Transactions of the Association for Computational Linguistics 9 (03 2021), 82–99. https://doi.org/10.1162/tacl_a_00356
  • Fan et al. (2019) Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 3558–3567. https://doi.org/10.18653/v1/P19-1346
  • Fernández and Veloso (2006) Fernando Fernández and Manuela Veloso. 2006. Probabilistic Policy Reuse in a Reinforcement Learning Agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS ’06). ACM, New York, NY, USA, 720–727.
  • Formal et al. (2021) Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 2288–2292. https://doi.org/10.1145/3404835.3463098
  • Freitag and Al-Onaizan (2017) Markus Freitag and Yaser Al-Onaizan. 2017. Beam Search Strategies for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation. Association for Computational Linguistics, Vancouver, 56–60. https://doi.org/10.18653/v1/W17-3207
  • Gao et al. (2022) Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, Ying Nian Wu, and Prem Natarajan. 2022. A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering. arXiv:2201.05299 [cs.CV]
  • Gao et al. (2023a) Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. 2023a. RARR: Researching and Revising What Language Models Say, Using Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 16477–16508. https://aclanthology.org/2023.acl-long.910
  • Gao et al. (2023b) Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023b. Enabling Large Language Models to Generate Text with Citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 6465–6488. https://doi.org/10.18653/v1/2023.emnlp-main.398
  • Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics 9 (2021), 346–361.
  • Glass et al. (2021) Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, and Alfio Gliozzo. 2021. Robust Retrieval Augmented Generation for Zero-shot Slot Filling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 1939–1949. https://doi.org/10.18653/v1/2021.emnlp-main.148
  • Goyal et al. (2022) Anirudh Goyal, Abram Friesen, Andrea Banino, Theophane Weber, Nan Rosemary Ke, Adrià Puigdomènech Badia, Arthur Guez, Mehdi Mirza, Peter C Humphreys, Ksenia Konyushova, Michal Valko, Simon Osindero, Timothy Lillicrap, Nicolas Heess, and Charles Blundell. 2022. Retrieval-Augmented Reinforcement Learning. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162). PMLR, 7740–7765.
  • Grave et al. (2017) Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017. Improving Neural Language Models with a Continuous Cache. In International Conference on Learning Representations. https://openreview.net/forum?id=B184E5qee
  • Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. arXiv:1410.5401 [cs.NE]
  • Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538, 7626 (2016), 471–476.
  • Gui et al. (2022) Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander Hauptmann, Yonatan Bisk, and Jianfeng Gao. 2022. KAT: A Knowledge Augmented Transformer for Vision-and-Language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 956–968. https://doi.org/10.18653/v1/2022.naacl-main.70
  • Gulcehre et al. (2017) Caglar Gulcehre, Sarath Chandar, and Yoshua Bengio. 2017. Memory Augmented Neural Networks with Wormhole Connections. arXiv:1701.08718 [cs.LG]
  • Guo et al. (2020a) Jiafeng Guo, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W. Bruce Croft, and Xueqi Cheng. 2020a. A Deep Look into neural ranking models for information retrieval. Information Processing & Management 57, 6 (2020), 102067. https://doi.org/10.1016/j.ipm.2019.102067
  • Guo et al. (2020b) Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020b. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. arXiv:1908.10396 [cs.LG]
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 368, 10 pages.
  • Hashemi et al. (2020) Helia Hashemi, Hamed Zamani, and W. Bruce Croft. 2020. Guided Transformer: Leveraging Multiple External Sources for Representation Learning in Conversational Search. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 1131–1140. https://doi.org/10.1145/3397271.3401061
  • Hashemi et al. (2021) Helia Hashemi, Hamed Zamani, and W. Bruce Croft. 2021. Learning Multiple Intent Representations for Search Queries. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 669–679. https://doi.org/10.1145/3459637.3482445
  • He et al. (2021) Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. 2021. Efficient Nearest Neighbor Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 5703–5714. https://doi.org/10.18653/v1/2021.emnlp-main.461
  • Ho et al. (2020) Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online), 6609–6625. https://doi.org/10.18653/v1/2020.coling-main.580
  • Hoffart et al. (2011) Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust Disambiguation of Named Entities in Text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Edinburgh, Scotland, UK., 782–792. https://aclanthology.org/D11-1072
  • Hofstätter et al. (2023) Sebastian Hofstätter, Jiecao Chen, Karthik Raman, and Hamed Zamani. 2023. FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1437–1447.
  • Hokamp and Liu (2017) Chris Hokamp and Qun Liu. 2017. Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 1535–1546. https://doi.org/10.18653/v1/P17-1141
  • Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations. https://openreview.net/forum?id=rygGQyrFvH
  • Hu et al. (2019) J. Edward Hu, Huda Khayrallah, Ryan Culkin, Patrick Xia, Tongfei Chen, Matt Post, and Benjamin Van Durme. 2019. Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 839–850. https://doi.org/10.18653/v1/N19-1090
  • Hu et al. (2023) Linmei Hu, Zeyi Liu, Ziwang Zhao, Lei Hou, Liqiang Nie, and Juanzi Li. 2023. A survey of knowledge enhanced pre-trained language models. IEEE Transactions on Knowledge and Data Engineering (2023).
  • Hui et al. (2022a) Kai Hui, Tao Chen, Zhen Qin, Honglei Zhuang, Fernando Diaz, Mike Bendersky, and Don Metzler. 2022a. Retrieval Augmentation for T5 Re-ranker using External Sources. arXiv:2210.05145 [cs.IR]
  • Hui et al. (2022b) Kai Hui, Honglei Zhuang, Tao Chen, Zhen Qin, Jing Lu, Dara Bahri, Ji Ma, Jai Gupta, Cicero Nogueira dos Santos, Yi Tay, and Donald Metzler. 2022b. ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference. In Findings of the Association for Computational Linguistics: ACL 2022. Association for Computational Linguistics, Dublin, Ireland, 3747–3758. https://doi.org/10.18653/v1/2022.findings-acl.295
  • Humphreys et al. (2022) Peter Conway Humphreys, Arthur Guez, Olivier Tieleman, Laurent Sifre, Theophane Weber, and Timothy P Lillicrap. 2022. Large-Scale Retrieval for Reinforcement Learning. In Advances in Neural Information Processing Systems.
  • Huo et al. (2023) Siqing Huo, Negar Arabzadeh, and Charles Clarke. 2023. Retrieving Supporting Evidence for Generative Question Answering. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (Beijing, China) (SIGIR-AP ’23). Association for Computing Machinery, New York, NY, USA, 11–20. https://doi.org/10.1145/3624918.3625336
  • Izacard and Grave (2021a) Gautier Izacard and Edouard Grave. 2021a. Distilling Knowledge from Reader to Retriever for Question Answering. In International Conference on Learning Representations. https://openreview.net/forum?id=NTEz-6wysdb
  • Izacard and Grave (2021b) Gautier Izacard and Edouard Grave. 2021b. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, Online, 874–880. https://doi.org/10.18653/v1/2021.eacl-main.74
  • Izacard et al. (2024) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2024. Atlas: few-shot learning with retrieval augmented language models. J. Mach. Learn. Res. 24, 1, Article 251 (mar 2024), 43 pages.
  • Jeong et al. (2024) Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. 2024. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. arXiv:2403.14403 [cs.CL]
  • Jiang et al. (2024) Fan Jiang, Tom Drummond, and Trevor Cohn. 2024. Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision. arXiv:2402.16508 [cs.CL]
  • Jiang et al. (2023) Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active Retrieval Augmented Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 7969–7992. https://doi.org/10.18653/v1/2023.emnlp-main.495
  • Jin et al. (2024) Qiao Jin, Yifan Yang, Qingyu Chen, and Zhiyong Lu. 2024. GeneGPT: Augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics 40, 2 (2024), btae075.
  • Jing et al. (2022) Baoyu Jing, Si Zhang, Yada Zhu, Bin Peng, Kaiyu Guan, Andrew Margenot, and Hanghang Tong. 2022. Retrieval Based Time Series Forecasting. arXiv:2209.13525 [cs.AI]
  • Joachims (2002) Thorsten Joachims. 2002. Optimizing Search Engines Using Clickthrough Data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Alberta, Canada) (KDD ’02). Association for Computing Machinery, New York, NY, USA, 133–142. https://doi.org/10.1145/775047.775067
  • Joachims et al. (2017) Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (Cambridge, United Kingdom) (WSDM ’17). Association for Computing Machinery, New York, NY, USA, 781–789. https://doi.org/10.1145/3018661.3018699
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 1601–1611. https://doi.org/10.18653/v1/P17-1147
  • Ju et al. (2022) Mingxuan Ju, Wenhao Yu, Tong Zhao, Chuxu Zhang, and Yanfang Ye. 2022. Grape: Knowledge Graph Enhanced Passage Reader for Open-domain Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 169–181. https://doi.org/10.18653/v1/2022.findings-emnlp.13
  • Kang et al. (2023) Minki Kang, Jin Myung Kwak, Jinheon Baek, and Sung Ju Hwang. 2023. Knowledge Graph-Augmented Language Models for Knowledge-Grounded Dialogue Generation. arXiv:2305.18846 [cs.CL]
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550
  • Kasai et al. (2023) Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Velocity Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. 2023. RealTime QA: What’s the Answer Right Now? In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=HfKOIPCvsv
  • Kassner and Schütze (2020) Nora Kassner and Hinrich Schütze. 2020. BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 3424–3430. https://doi.org/10.18653/v1/2020.findings-emnlp.307
  • Khandelwal et al. (2021) Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest Neighbor Machine Translation. In International Conference on Learning Representations. https://openreview.net/forum?id=7wCBOfJ8hJM
  • Khandelwal et al. (2020) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=HklBjCEKvH
  • Khashabi et al. (2017) Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2017. Learning What is Essential in Questions. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Association for Computational Linguistics, Vancouver, Canada, 80–89. https://doi.org/10.18653/v1/K17-1010
  • Khattab et al. (2024) Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2024. DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=sY5N0zY5Od
  • Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 39–48. https://doi.org/10.1145/3397271.3401075
  • Khramtsova et al. (2023) Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, Xi Wang, and Guido Zuccon. 2023. Selecting which Dense Retriever to use for Zero-Shot Search. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 223–233.
  • Khramtsova et al. (2024) Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, and Guido Zuccon. 2024. Leveraging LLMs for Unsupervised Dense Retriever Ranking. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1307–1317. https://doi.org/10.1145/3626772.3657798
  • King and Flanigan (2023) Brendan King and Jeffrey Flanigan. 2023. Diverse Retrieval-Augmented In-Context Learning for Dialogue State Tracking. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 5570–5585. https://aclanthology.org/2023.findings-acl.344
  • Kočiský et al. (2018) Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics 6 (2018), 317–328. https://doi.org/10.1162/tacl_a_00023
  • Komeili et al. (2022) Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. Internet-Augmented Dialogue Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 8460–8478. https://doi.org/10.18653/v1/2022.acl-long.579
  • Kumar et al. (2016) Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. In International Conference on Machine Learning. PMLR, 1378–1387.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 7 (2019), 452–466. https://doi.org/10.1162/tacl_a_00276
  • Labruna et al. (2024) Tiziano Labruna, Jon Ander Campos, and Gorka Azkune. 2024. When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively. arXiv preprint arXiv:2404.19705 (2024).
  • Lample et al. (2019) Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. Large Memory Layers with Product Keys. In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/9d8df73a3cfbf3c5b47bc9b50f214aff-Paper.pdf
  • Lan et al. (2023) Tian Lan, Deng Cai, Yan Wang, Heyan Huang, and Xian-Ling Mao. 2023. Copy is All You Need. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=CROlOA9Nd8C
  • Lavrenko et al. (2002) Victor Lavrenko, Martin Choquette, and W. Bruce Croft. 2002. Cross-lingual relevance models. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Tampere, Finland) (SIGIR ’02). Association for Computing Machinery, New York, NY, USA, 175–182. https://doi.org/10.1145/564376.564408
  • Lavrenko and Croft (2001) Victor Lavrenko and W. Bruce Croft. 2001. Relevance based language models. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. ACM Press, 120–127.
  • Lazaridou et al. (2022) Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv:2203.05115 [cs.CL]
  • Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6086–6096. https://doi.org/10.18653/v1/P19-1612
  • Lee et al. (2018) Sangho Lee, Jinyoung Sung, Youngjae Yu, and Gunhee Kim. 2018. A memory network approach for story-based temporal summarization of 360 videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1410–1419.
  • Levy et al. (2017) Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-Shot Relation Extraction via Reading Comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Association for Computational Linguistics, Vancouver, Canada, 333–342. https://doi.org/10.18653/v1/K17-1034
  • Lewis et al. (2020a) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020a. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
  • Lewis et al. (2020b) Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020b. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
  • Li et al. (2023) Cheng Li, Mingyang Zhang, Qiaozhu Mei, Yaqing Wang, Spurthi Amba Hombaiah, Yi Liang, and Michael Bendersky. 2023. Teach LLMs to Personalize – An Approach inspired by Writing Education. arXiv:2308.07968 [cs.CL]
  • Li et al. (2022) Zonglin Li, Ruiqi Guo, and Sanjiv Kumar. 2022. Decoupled Context Processing for Context Augmented Language Modeling. In Advances in Neural Information Processing Systems. https://openreview.net/forum?id=02dbnEbEFn
  • Liao et al. (2023) Lizi Liao, Grace Hui Yang, and Chirag Shah. 2023. Proactive Conversational Agents in the Post-ChatGPT World. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 3452–3455. https://doi.org/10.1145/3539618.3594250
  • Lien et al. (2024) Yen-Chieh Lien, Hamed Zamani, and Bruce Croft. 2024. Generalized Weak Supervision for Neural Information Retrieval. ACM Trans. Inf. Syst. 42, 5, Article 121 (apr 2024), 26 pages. https://doi.org/10.1145/3647639
  • Lin et al. (2024) Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, and Lin-shan Lee. 2024. SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering. arXiv preprint arXiv:2401.13463 (2024).
  • Lin and Byrne (2022) Weizhe Lin and Bill Byrne. 2022. Retrieval Augmented Visual Question Answering with Outside Knowledge. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 11238–11254. https://aclanthology.org/2022.emnlp-main.772
  • Lin et al. (2023) Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, and Bill Byrne. 2023. Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=IWWWulAX7g
  • Liu et al. (2022) Linqing Liu, Minghan Li, Jimmy Lin, Sebastian Riedel, and Pontus Stenetorp. 2022. Query Expansion Using Contextual Clue Sampling with Language Models. arXiv:2210.07093 [cs.CL]
  • Liu (2009) Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Found. Trends Inf. Retr. 3, 3 (mar 2009), 225–331. https://doi.org/10.1561/1500000016
  • Lv and Zhai (2009) Yuanhua Lv and ChengXiang Zhai. 2009. Positional language models for information retrieval. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Boston, MA, USA) (SIGIR ’09). Association for Computing Machinery, New York, NY, USA, 299–306. https://doi.org/10.1145/1571941.1571994
  • Lyu et al. (2023) Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao, Sebastian Schelter, and Ce Zhang. 2023. Improving Retrieval-Augmented Large Language Models via Data Importance Learning. arXiv:2307.03027 [cs.LG]
  • Madaan et al. (2022) Aman Madaan, Niket Tandon, Peter Clark, and Yiming Yang. 2022. Memory-assisted prompt editing to improve GPT-3 after deployment. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2833–2861. https://doi.org/10.18653/v1/2022.emnlp-main.183
  • Madaan et al. (2024) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36 (2024).
  • Maekawa et al. (2024) Seiji Maekawa, Hayate Iso, Sairam Gurajada, and Nikita Bhutani. 2024. Retrieval Helps or Hurts? A Deeper Dive into the Efficacy of Retrieval Augmentation to Language Models. arXiv:2402.13492 [cs.CL]
  • Majumder et al. (2024) Bodhisattwa Prasad Majumder, Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. 2024. CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization. https://openreview.net/forum?id=d5DGVHMdsC
  • Malkov and Yashunin (2020) Yu A. Malkov and D. A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 4 (apr 2020), 824–836. https://doi.org/10.1109/TPAMI.2018.2889473
  • Mallen et al. (2023) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 9802–9822. https://doi.org/10.18653/v1/2023.acl-long.546
  • Mao et al. (2021) Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. 2021. Generation-Augmented Retrieval for Open-Domain Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4089–4100. https://doi.org/10.18653/v1/2021.acl-long.316
  • Marino et al. (2019) Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3195–3204.
  • Martins et al. (2022) Pedro Henrique Martins, Zita Marinho, and Andre Martins. 2022. ∞-former: Infinite Memory Transformer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 5468–5485. https://doi.org/10.18653/v1/2022.acl-long.375
  • Meng et al. (2024) Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi, and Maarten de Rijke. 2024. Ranked List Truncation for Large Language Model-based Re-Ranking. In SIGIR.
  • Menick et al. (2022) Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. 2022. Teaching language models to support answers with verified quotes. arXiv:2203.11147 [cs.CL]
  • Metzler and Croft (2005) Donald Metzler and W. Bruce Croft. 2005. A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Salvador, Brazil) (SIGIR ’05). Association for Computing Machinery, New York, NY, USA, 472–479. https://doi.org/10.1145/1076034.1076115
  • Mialon et al. (2023) Grégoire Mialon, Roberto Dessi, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Roziere, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. 2023. Augmented Language Models: a Survey. Transactions on Machine Learning Research (2023). https://openreview.net/forum?id=jh7wH2AzKK
  • Min et al. (2023) Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2023. Nonparametric Masked Language Modeling. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 2097–2118. https://doi.org/10.18653/v1/2023.findings-acl.132
  • Min et al. (2019) Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop Reading Comprehension through Question Decomposition and Rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6097–6109. https://doi.org/10.18653/v1/P19-1613
  • Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-Task Generalization via Natural Language Crowdsourcing Instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 3470–3487. https://doi.org/10.18653/v1/2022.acl-long.244
  • Mitra and Craswell (2018) Bhaskar Mitra and Nick Craswell. 2018. An Introduction to Neural Information Retrieval. Found. Trends Inf. Retr. 13, 1 (dec 2018), 1–126. https://doi.org/10.1561/1500000061
  • Musa et al. (2019) Ryan Musa, Xiaoyan Wang, Achille Fokoue, Nicholas Mattei, Maria Chang, Pavan Kapanipathi, Bassem Makni, Kartik Talamadupula, and Michael Witbrock. 2019. Answering Science Exam Questions Using Query Reformulation with Background Knowledge. In Automated Knowledge Base Construction (AKBC). https://openreview.net/forum?id=HJxYZ-5paX
  • Nakano et al. (2022) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2022. WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09332 [cs.CL]
  • Nekvinda and Dušek (2022) Tomáš Nekvinda and Ondřej Dušek. 2022. AARGH! End-to-end Retrieval-Generation for Task-Oriented Dialog. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue. Association for Computational Linguistics, Edinburgh, UK, 283–297. https://doi.org/10.18653/v1/2022.sigdial-1.29
  • Newman et al. (2023) Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, and Kyle Lo. 2023. A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 3194–3212. https://doi.org/10.18653/v1/2023.emnlp-main.193
  • Ni et al. (2019) Jianmo Ni, Chenguang Zhu, Weizhu Chen, and Julian McAuley. 2019. Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Open-domain Question Answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 335–344. https://doi.org/10.18653/v1/N19-1030
  • Nogueira and Cho (2019) Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019).
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., 27730–27744. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
  • Pan et al. (2023) Xiaoman Pan, Wenlin Yao, Hongming Zhang, Dian Yu, Dong Yu, and Jianshu Chen. 2023. Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=a2jNdqE2102
  • Park et al. (2023) Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (San Francisco, CA, USA) (UIST ’23). Association for Computing Machinery, New York, NY, USA, Article 2, 22 pages. https://doi.org/10.1145/3586183.3606763
  • Parton et al. (2008) Kristen Parton, Kathleen R. McKeown, James Allan, and Enrique Henestroza. 2008. Simultaneous multilingual search for translingual information retrieval. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (Napa Valley, California, USA) (CIKM ’08). Association for Computing Machinery, New York, NY, USA, 719–728. https://doi.org/10.1145/1458082.1458179
  • Perez et al. (2020) Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised Question Decomposition for Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 8864–8880. https://doi.org/10.18653/v1/2020.emnlp-main.713
  • Petroni et al. (2023) Fabio Petroni, Samuel Broscheit, Aleksandra Piktus, Patrick Lewis, Gautier Izacard, Lucas Hosseini, Jane Dwivedi-Yu, Maria Lomeli, Timo Schick, Michele Bevilacqua, et al. 2023. Improving Wikipedia verifiability with AI. Nature Machine Intelligence 5, 10 (2023), 1142–1148.
  • Petroni et al. (2021) Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2021. KILT: a Benchmark for Knowledge Intensive Language Tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 2523–2544. https://doi.org/10.18653/v1/2021.naacl-main.200
  • Ponte and Croft (1998) Jay M. Ponte and W. Bruce Croft. 1998. A Language Modeling Approach to Information Retrieval. In SIGIR’98 (Melbourne, Australia). 275–281.
  • Post and Vilar (2018) Matt Post and David Vilar. 2018. Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 1314–1324. https://doi.org/10.18653/v1/N18-1119
  • Pradeep et al. (2023) Ronak Pradeep, Kai Hui, Jai Gupta, Adam Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, and Vinh Tran. 2023. How Does Generative Retrieval Scale to Millions of Passages? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 1305–1321. https://doi.org/10.18653/v1/2023.emnlp-main.83
  • Qi et al. (2019) Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. 2019. Answering Complex Open-domain Questions Through Iterative Query Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 2590–2602. https://doi.org/10.18653/v1/D19-1261
  • Qin et al. (2023) Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv:2307.16789 [cs.AI]
  • Qu et al. (2020) Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W. Bruce Croft, and Mohit Iyyer. 2020. Open-Retrieval Conversational Question Answering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 539–548. https://doi.org/10.1145/3397271.3401110
  • Qu et al. (2021) Chen Qu, Hamed Zamani, Liu Yang, W. Bruce Croft, and Erik Learned-Miller. 2021. Passage Retrieval for Outside-Knowledge Visual Question Answering. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 1753–1757. https://doi.org/10.1145/3404835.3462987
  • Rae et al. (2016) Jack W Rae, Jonathan J Hunt, Tim Harley, Ivo Danihelka, Andrew Senior, Greg Wayne, Alex Graves, and Timothy P Lillicrap. 2016. Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (Barcelona, Spain) (NIPS’16). Curran Associates Inc., Red Hook, NY, USA, 3628–3636.
  • Rae et al. (2020) Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive Transformers for Long-Range Sequence Modelling. In International Conference on Learning Representations. https://openreview.net/forum?id=SylKikSYDH
  • Raghu et al. (2021) Dinesh Raghu, Shantanu Agarwal, Sachindra Joshi, and Mausam. 2021. End-to-End Learning of Flowchart Grounded Task-Oriented Dialogs. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 4348–4366. https://doi.org/10.18653/v1/2021.emnlp-main.357
  • Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics 11 (2023), 1316–1331.
  • Ramos et al. (2023) Rita Ramos, Bruno Martins, and Desmond Elliott. 2023. LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 1635–1651. https://aclanthology.org/2023.findings-acl.104
  • Robertson et al. (1995) Stephen Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. 1995. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC-3). Gaithersburg, MD: NIST, 109–126.
  • Rocchio (1971) J. J. Rocchio. 1971. Relevance Feedback in Information Retrieval. In The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall.
  • Rubin and Berant (2023) Ohad Rubin and Jonathan Berant. 2023. Long-range Language Modeling with Self-retrieval. arXiv:2306.13421 [cs.CL]
  • Saad-Falcon et al. (2024) Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics, Mexico City, Mexico, 338–354. https://aclanthology.org/2024.naacl-long.20
  • Sachan et al. (2021) Devendra Sachan, Mostofa Patwary, Mohammad Shoeybi, Neel Kant, Wei Ping, William L. Hamilton, and Bryan Catanzaro. 2021. End-to-End Training of Neural Retrievers for Open-Domain Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 6648–6662. https://doi.org/10.18653/v1/2021.acl-long.519
  • Salemi et al. (2023a) Alireza Salemi, Juan Altmayer Pizzorno, and Hamed Zamani. 2023a. A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 110–120. https://doi.org/10.1145/3539618.3591629
  • Salemi et al. (2024a) Alireza Salemi, Surya Kallumadi, and Hamed Zamani. 2024a. Optimization Methods for Personalizing Large Language Models through Retrieval Augmentation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 752–762. https://doi.org/10.1145/3626772.3657783
  • Salemi et al. (2024b) Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2024b. LaMP: When Large Language Models Meet Personalization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL ’24). Association for Computational Linguistics.
  • Salemi et al. (2023b) Alireza Salemi, Mahta Rafiee, and Hamed Zamani. 2023b. Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval (Taipei, Taiwan) (ICTIR ’23). Association for Computing Machinery, New York, NY, USA, 169–176. https://doi.org/10.1145/3578337.3605137
  • Salemi and Zamani (2024a) Alireza Salemi and Hamed Zamani. 2024a. Evaluating Retrieval Quality in Retrieval-Augmented Generation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 2395–2400. https://doi.org/10.1145/3626772.3657957
  • Salemi and Zamani (2024b) Alireza Salemi and Hamed Zamani. 2024b. Towards a Search Engine for Machines: Unified Ranking for Multiple Retrieval-Augmented Large Language Models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 741–751. https://doi.org/10.1145/3626772.3657733
  • Salton and Buckley (1988) Gerard Salton and Christopher Buckley. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Inf. Process. Manage. 24, 5 (Aug. 1988), 513–523. https://doi.org/10.1016/0306-4573(88)90021-0
  • Samarinas and Zamani (2024) Chris Samarinas and Hamed Zamani. 2024. ProCIS: A Benchmark for Proactive Retrieval in Conversations. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 830–840. https://doi.org/10.1145/3626772.3657869
  • Sanabria et al. (2023) Ramon Sanabria, Ondrej Klejch, Hao Tang, and Sharon Goldwater. 2023. Acoustic Word Embeddings for Untranscribed Target Languages with Continued Pretraining and Learned Pooling. arXiv preprint arXiv:2306.02153 (2023).
  • Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-Learning with Memory-Augmented Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (New York, NY, USA) (ICML’16). JMLR.org, 1842–1850.
  • Sarthi et al. (2024) Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning. 2024. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=GN921JHCRw
  • Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=Yacmpz84TH
  • Schuster et al. (2024) Tal Schuster, Adam Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William Cohen, and Donald Metzler. 2024. SEMQA: Semi-Extractive Multi-Source Question Answering. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics, Mexico City, Mexico, 1363–1381. https://aclanthology.org/2024.naacl-long.74
  • Sciavolino et al. (2021) Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, and Danqi Chen. 2021. Simple Entity-Centric Questions Challenge Dense Retrievers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 6138–6148. https://doi.org/10.18653/v1/2021.emnlp-main.496
  • Sen et al. (2023) Priyanka Sen, Sandeep Mavadia, and Amir Saffari. 2023. Knowledge Graph-augmented Language Models for Complex Question Answering. In Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE). Association for Computational Linguistics, Toronto, Canada, 1–8. https://doi.org/10.18653/v1/2023.nlrse-1.1
  • Shi et al. (2024) Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2024. REPLUG: Retrieval-Augmented Black-Box Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics, Mexico City, Mexico, 8364–8377. https://aclanthology.org/2024.naacl-long.463
  • Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=vAElhFcKW6
  • Shrestha et al. (2024) Robik Shrestha, Yang Zou, Qiuyu Chen, Zhiheng Li, Yusheng Xie, and Siqi Deng. 2024. FairRAG: Fair Human Generation via Fair Retrieval Augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11996–12005.
  • Shuster et al. (2022a) Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, and Jason Weston. 2022a. Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 373–393. https://aclanthology.org/2022.findings-emnlp.27
  • Shuster et al. (2022b) Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, and Jason Weston. 2022b. BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv:2208.03188 [cs.CL]
  • Singh et al. (2021) Devendra Singh, Siva Reddy, Will Hamilton, Chris Dyer, and Dani Yogatama. 2021. End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 25968–25981. https://proceedings.neurips.cc/paper_files/paper/2021/file/da3fde159d754a2555eaa198d2d105b2-Paper.pdf
  • Song and Croft (1999) Fei Song and W. Bruce Croft. 1999. A general language model for information retrieval. In Proceedings of the Eighth International Conference on Information and Knowledge Management (Kansas City, Missouri, USA) (CIKM ’99). Association for Computing Machinery, New York, NY, USA, 316–321. https://doi.org/10.1145/319950.320022
  • Sparck-Jones and Galliers (1996) Karen Sparck-Jones and Julia R. Galliers. 1996. Evaluating Natural Language Processing Systems: An Analysis and Review. Springer-Verlag, Berlin, Heidelberg.
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. Advances in neural information processing systems 28 (2015).
  • Sumers et al. (2024) Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. 2024. Cognitive Architectures for Language Agents. Transactions on Machine Learning Research (2024). https://openreview.net/forum?id=1i6ZCvflQJ Survey Certification.
  • Sun et al. (2023) Weiwei Sun, Lingyong Yan, Xinyu Ma, Pengjie Ren, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. CoRR abs/2304.09542 (2023). https://doi.org/10.48550/ARXIV.2304.09542 arXiv:2304.09542
  • Tay et al. (2022) Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W Cohen, and Donald Metzler. 2022. Transformer Memory as a Differentiable Search Index. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., 21831–21843. https://proceedings.neurips.cc/paper_files/paper/2022/file/892840a6123b5ec99ebaab8be1530fba-Paper-Conference.pdf
  • Thomas et al. (2023) Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2023. Large language models can accurately predict searcher preferences. arXiv:2309.10621 [cs.IR]
  • Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. 2022. LaMDA: Language Models for Dialog Applications. arXiv:2201.08239 [cs.CL]
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 809–819. https://doi.org/10.18653/v1/N18-1074
  • Thulke et al. (2021) David Thulke, Nico Daheim, Christian Dugast, and Hermann Ney. 2021. Efficient Retrieval Augmented Generation from Unstructured Knowledge for Task-Oriented Dialog. In Workshop on DSTC9, AAAI.
  • Trivedi et al. (2023) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 10014–10037. https://doi.org/10.18653/v1/2023.acl-long.557
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. In Advances in Neural Information Processing Systems, Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/file/29921001f2f04bd3baee84a12e98098f-Paper.pdf
  • Vu et al. (2023) Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. 2023. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. arXiv:2310.03214 [cs.CL]
  • Wan et al. (2022) Zhongwei Wan, Yichun Yin, Wei Zhang, Jiaxin Shi, Lifeng Shang, Guangyong Chen, Xin Jiang, and Qun Liu. 2022. G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 6585–6597. https://doi.org/10.18653/v1/2022.emnlp-main.441
  • Wang et al. (2024b) Dingmin Wang, Qiuyuan Huang, Matthew Jackson, and Jianfeng Gao. 2024b. LLM Agent - Retrieve What You Need: A Mutual Learning Framework for Open-domain Question Answering. Transactions of the Association for Computational Linguistics (TACL) (April 2024), 247–263. https://www.microsoft.com/en-us/research/publication/retrieve-what-you-need-a-mutual-learning-framework-for-open-domain-question-answering/
  • Wang et al. (2023c) Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023c. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291 (2023).
  • Wang et al. (2023d) Liang Wang, Nan Yang, and Furu Wei. 2023d. Query2doc: Query Expansion with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 9414–9423. https://doi.org/10.18653/v1/2023.emnlp-main.585
  • Wang et al. (2017) Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. 2017. FVQA: Fact-based Visual Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 10 (2017), 2413–2427.
  • Wang et al. (2023a) Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. 2023a. Augmenting Language Models with Long-Term Memory. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=BryMFPQ4L6
  • Wang et al. (2023b) Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. 2023b. Visually-Augmented Language Modeling. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=8IN-qLkl215
  • Wang et al. (2022) Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5085–5109. https://doi.org/10.18653/v1/2022.emnlp-main.340
  • Wang et al. (2024a) Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried. 2024a. CodeRAG-Bench: Can Retrieval Augment Code Generation? arXiv:2406.14497 [cs.SE] https://arxiv.org/abs/2406.14497
  • Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations. https://openreview.net/forum?id=gEZrGCozdqR
  • Weston et al. (2015) Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory Networks. arXiv:1410.3916 [cs.AI]
  • White et al. (2003) Ryen W White, Joemon M Jose, Ian Ruthven, GM Rauterberg, M Menozzi, and J Wesson. 2003. A granular approach to web search result presentation. In 9th IFIP TC13 International Conference on Human-Computer Interaction. Interact 2003.
  • Wu et al. (2022b) Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. 2022b. MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition. arXiv preprint arXiv:2201.08383 (2022).
  • Wu and Mooney (2022) Jialin Wu and Raymond Mooney. 2022. Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 8061–8072. https://doi.org/10.18653/v1/2022.emnlp-main.551
  • Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively Summarizing Books with Human Feedback. arXiv:2109.10862
  • Wu et al. (2022a) Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, and Zhou Yu. 2022a. Memformer: A Memory-Augmented Transformer for Sequence Modeling. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022. Association for Computational Linguistics, Online only, 308–318. https://aclanthology.org/2022.findings-aacl.29
  • Wu et al. (2022c) Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. 2022c. Memorizing Transformers. In International Conference on Learning Representations. https://openreview.net/forum?id=TrjbxzRcnf-
  • Wu et al. (2022d) Yuxiang Wu, Yu Zhao, Baotian Hu, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2022d. An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5184–5196. https://aclanthology.org/2022.emnlp-main.346
  • Xiong et al. (2021) Wenhan Xiong, Xiang Li, Srini Iyer, Jingfei Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Scott Yih, Sebastian Riedel, Douwe Kiela, and Barlas Oguz. 2021. Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval. In International Conference on Learning Representations. https://openreview.net/forum?id=EMHoBG0avc1
  • Yadav et al. (2020) Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2020. Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4514–4525. https://doi.org/10.18653/v1/2020.acl-main.414
  • Yadegari et al. (2022) Mostafa Yadegari, Ehsan Kamalloo, and Davood Rafiei. 2022. Detecting Frozen Phrases in Open-Domain Question Answering. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 1990–1996. https://doi.org/10.1145/3477495.3531793
  • Yamada et al. (2021) Ikuya Yamada, Akari Asai, and Hannaneh Hajishirzi. 2021. Efficient Passage Retrieval with Hashing for Open-domain Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, Online, 979–986. https://doi.org/10.18653/v1/2021.acl-short.123
  • Yang et al. (2024b) Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. 2024b. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. Advances in Neural Information Processing Systems 36 (2024).
  • Yang et al. (2024a) Linyao Yang, Hongyang Chen, Zhao Li, Xiao Ding, and Xindong Wu. 2024a. Give Us the Facts: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling. arXiv:2306.11489
  • Yang et al. (2022) Sitan Yang, Carson Eisenach, and Dhruv Madeka. 2022. MQRetNN: Multi-Horizon Time Series Forecasting with Retrieval Augmentation. arXiv:2207.10517 [cs.LG]
  • Yang and Seo (2020) Sohee Yang and Minjoon Seo. 2020. Is Retriever Merely an Approximator of Reader? arXiv:2010.10999
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 2369–2380. https://doi.org/10.18653/v1/D18-1259
  • Yasunaga et al. (2023) Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Retrieval-Augmented Multimodal Language Modeling. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML ’23). JMLR.org, Article 1659, 15 pages.
  • Yogatama et al. (2021) Dani Yogatama, Cyprien de Masson d’Autume, and Lingpeng Kong. 2021. Adaptive Semiparametric Language Models. Transactions of the Association for Computational Linguistics 9 (April 2021), 362–373. https://doi.org/10.1162/tacl_a_00371
  • Yu et al. (2022a) Donghan Yu, Chenguang Zhu, Yuwei Fang, Wenhao Yu, Shuohang Wang, Yichong Xu, Xiang Ren, Yiming Yang, and Michael Zeng. 2022a. KG-FiD: Infusing Knowledge Graph in Fusion-in-Decoder for Open-Domain Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 4961–4974. https://doi.org/10.18653/v1/2022.acl-long.340
  • Yu et al. (2022b) Wenhao Yu, Chenguang Zhu, Zaitang Li, Zhiting Hu, Qingyun Wang, Heng Ji, and Meng Jiang. 2022b. A Survey of Knowledge-Enhanced Text Generation. ACM Comput. Surv. 54, 11s, Article 227 (Nov. 2022), 38 pages. https://doi.org/10.1145/3512467
  • Yu et al. (2022c) Wenhao Yu, Chenguang Zhu, Zhihan Zhang, Shuohang Wang, Zhuosheng Zhang, Yuwei Fang, and Meng Jiang. 2022c. Retrieval Augmentation for Commonsense Reasoning: A Unified Approach. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 4364–4377. https://aclanthology.org/2022.emnlp-main.294
  • Zamani and Bendersky (2024) Hamed Zamani and Michael Bendersky. 2024. Stochastic RAG: End-to-End Retrieval-Augmented Generation through Expected Utility Maximization. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 2641–2646. https://doi.org/10.1145/3626772.3657923
  • Zamani et al. (2018) Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (Torino, Italy) (CIKM ’18). Association for Computing Machinery, New York, NY, USA, 497–506. https://doi.org/10.1145/3269206.3271800
  • Zamani et al. (2022) Hamed Zamani, Fernando Diaz, Mostafa Dehghani, Donald Metzler, and Michael Bendersky. 2022. Retrieval-Enhanced Machine Learning. In Proceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Zaremba and Sutskever (2016) Wojciech Zaremba and Ilya Sutskever. 2016. Reinforcement Learning Neural Turing Machines - Revised. arXiv:1505.00521 [cs.LG]
  • Zeng et al. (2024) Hansi Zeng, Chen Luo, Bowen Jin, Sheikh Sarwar, Tianxin Wei, and Hamed Zamani. 2024. Scalable and Effective Generative Information Retrieval. In Proceedings of the 2024 ACM Web Conference (Singapore, Singapore) (WWW ’24). Association for Computing Machinery, New York, NY, USA.
  • Zhai and Lafferty (2001) Chengxiang Zhai and John Lafferty. 2001. Model-Based Feedback in the Language Modeling Approach to Information Retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management (Atlanta, Georgia, USA) (CIKM ’01). Association for Computing Machinery, New York, NY, USA, 403–410. https://doi.org/10.1145/502585.502654
  • Zhang et al. (2023) Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 2471–2484. https://doi.org/10.18653/v1/2023.emnlp-main.151
  • Zhang et al. (2022b) Jing Zhang, Xiaokang Zhang, Jifan Yu, Jian Tang, Jie Tang, Cuiping Li, and Hong Chen. 2022b. Subgraph Retrieval Enhanced Model for Multi-hop Knowledge Base Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 5773–5784. https://doi.org/10.18653/v1/2022.acl-long.396
  • Zhang et al. (2022a) Yizhe Zhang, Siqi Sun, Xiang Gao, Yuwei Fang, Chris Brockett, Michel Galley, Jianfeng Gao, and Bill Dolan. 2022a. RetGen: A Joint Framework for Retrieval and Grounded Text Generation Modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 11739–11747.
  • Zhang et al. (2024) Zihan Zhang, Meng Fang, and Ling Chen. 2024. RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering. arXiv:2402.16457 [cs.CL] https://arxiv.org/abs/2402.16457
  • Zhao et al. (2024) Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2024. ExpeL: LLM Agents Are Experiential Learners. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19632–19642.
  • Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. arXiv:2303.18223
  • Zhong et al. (2021) Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev. 2021. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 5905–5921. https://doi.org/10.18653/v1/2021.naacl-main.472
  • Zhong et al. (2022) Zexuan Zhong, Tao Lei, and Danqi Chen. 2022. Training Language Models with Memory Augmentation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5657–5673. https://aclanthology.org/2022.emnlp-main.382
  • Zhou et al. (2022) Ben Zhou, Kyle Richardson, Xiaodong Yu, and Dan Roth. 2022. Learning to Decompose: Hypothetical Question Decomposition Based on Comparable Texts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2223–2235. https://doi.org/10.18653/v1/2022.emnlp-main.142
  • Zhou et al. (2023) Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao Jiang, and Graham Neubig. 2023. DocPrompting: Generating Code by Retrieving the Docs. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=ZTCxT2t2Ru
  • Zhu et al. (2021) Yunchang Zhu, Liang Pang, Yanyan Lan, Huawei Shen, and Xueqi Cheng. 2021. Adaptive Information Seeking for Open-Domain Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 3615–3626. https://doi.org/10.18653/v1/2021.emnlp-main.293