Better RAG using Relevant Information Gain

Marc Pickett, Jeremy Hartman, Ayan Kumar Bhowmick, Raquib-ul Alam, Aditya Vempaty
Emergence AI
Contact: [email protected]

Abstract

A common way to extend the memory of large language models (LLMs) is by retrieval augmented generation (RAG), which inserts text retrieved from a larger memory into an LLM’s context window. However, the context window is typically limited to several thousand tokens, which limits the number of retrieved passages that can inform a model’s response. For this reason, it’s important to avoid occupying context window space with redundant information by ensuring a degree of diversity among retrieved passages. At the same time, the information should also be relevant to the current task. Most prior methods that encourage diversity among retrieved results, such as Maximal Marginal Relevance (MMR), do so by incorporating an objective that explicitly trades off diversity and relevance. We propose a novel, simple optimization metric based on relevant information gain, a probabilistic measure of the total information relevant to a query for a set of retrieved results. By optimizing this metric, diversity organically emerges from our system. When used as a drop-in replacement for the retrieval component of a RAG system, this method yields state-of-the-art performance on question answering tasks from the Retrieval Augmented Generation Benchmark (RGB), outperforming existing metrics that directly optimize for relevance and diversity. Code is available at https://github.com/EmergenceAI/dartboard.

1 Introduction

A limitation of transformer-based Large Language Models (LLMs) is that the number of tokens is bounded by the transformer’s context window, which is typically in the thousands. This is often insufficient for representing large texts, such as novels and corporate documentation. A common way to mitigate this constraint is via retrieval augmented generation (RAG), in which a relatively small subset of relevant passages is retrieved from a larger database and inserted into an LLM’s context window Gao et al. (2024). Typically, this process involves applying a similarity metric, such as cosine similarity, to (precomputed) embeddings of passages and the embedding of a query. Using this metric, many systems then use K-nearest-neighbors or a fast approximation with a vector database such as FAISS Douze et al. (2024). Importantly, K-nearest-neighbors Bijalwan et al. (2014) and related methods (such as a cross-encoder reranker Nogueira and Cho (2020)) simply return the passages with the highest individual relevance, without regard to whether the information in the passages is redundant. Given the premium value on LLM context-window real estate, it’s important to make the best use of this limited resource by minimizing redundancy while maintaining relevance.

To appreciate the importance of minimizing redundancy in a RAG context, consider a toy database of facts and the two possible sets of retrieval results in Table 1, for the same query, “Tell me some facts about sharks.” Both sets of retrieved results are highly relevant to the query, but only the second set is diverse enough to support a satisfactory answer.

Shark dataset

Sharks are boneless.

Sharks do not have any bones.

Sharks have no bones.

Sharks have excellent vision.

Sharks are very fierce.

Sharks are apex predators.

Query: Tell me some facts about sharks
Retrieval results 1:
Sharks are boneless.
Sharks have no bones.
Sharks do not have any bones.
Retrieval results 2:
Sharks are boneless.
Sharks have excellent vision.
Sharks are apex predators.

Table 1: A toy database of shark facts (top) and two possible sets of retrieval results for the same query (bottom).

A family of methods from the Information Retrieval literature attempts to address the general issue of diversity in retrieved results by introducing a measure that explicitly balances diversity and relevance Carbonell and Goldstein (1998). In this paper, we propose a more principled method, Dartboard, that instead seeks to directly accomplish what previous methods are indirectly aiming for - maximize the total amount of information relevant for a given query in a set of $k$ results. The intuition behind Dartboard is simple - we assume that one passage is the “correct” one for a given query. Our system is allowed $k$ “guesses” and it aims to maximize the relevance score of its most relevant guess. Since the best guess is not known ahead of time, this score is weighted by the probability of that guess being the most relevant. This objective is sufficient to encourage diversity in the guesses. This is because a redundant guess does little to increase the relevance of the most relevant guess.

The main contributions of this paper are 3-fold:

• We introduce the Dartboard algorithm, a principled retrieval method based on optimizing a simple metric of total information gain relevant to a given query (§2).
• We demonstrate the effectiveness of Dartboard on Retrieval-Augmented Generation Benchmark (RGB) Chen et al. (2023), a closed-domain question answering task. This benchmark consists of a retrieval component, and an end-to-end question-answering component. We show that the Dartboard algorithm, when used as the retrieval component, outperforms all existing baselines at both the component level and at end-to-end level (§3.1).
• We show that instead of directly encouraging diversity, diversity naturally emerges by optimizing this metric (§A.5).

2 Dartboard

The Dartboard algorithm is based on the following analogy illustrated in Figure 1: Suppose that we have a cooperative two-player game where a dartboard is covered with a random collection of points. Player 1 is given one of these points arbitrarily as the target. Player 1 then throws her dart aiming for the target, and it lands somewhere on the board. Where it lands is the query. Player 2 sees where Player 1’s dart landed (the query), but doesn’t know where the actual target is. Player 2 then picks $k$ of the points on the board. The true target is revealed, and the score (which the players are trying to minimize) is the distance from the target to the closest guess. Note that to minimize the score, Player 2 would not want to put all his guesses right next to each other. Also, Player 2 should take into account how accurate Player 1’s throws are in general. In our implementation, Player 1’s accuracy is modeled by a Gaussian distribution with standard deviation $\sigma$ .

Figure 1: A visualization of Dartboard. The query is represented by the red star. All points are represented by blue dots. The five dots highlighted by grey background are the query’s 5 nearest neighbors, while the dots circled in green are the five points selected by the Dartboard algorithm (numbered in the order selected by the greedy algorithm). The concentric red circles are spaced at multiples of $\sigma$, which represents the standard deviation of our uncertainty for the query’s accuracy. Note the possible redundancy by naive k-nearest-neighbors, which ignores points above or to the right of the query.

More formally, Player 1 selects a target $T$ from a set of all points $A$ and gives a query $q$ . Then Player 2 makes a set of guesses $G\subseteq A$ , resulting in a score $s\left(G,q,A,\sigma\right)$ which is given as:

$$s\left(G,q,A,\sigma\right)=\sum_{t\in A}P\left(T=t|q,\sigma\right)\min_{g\in G}D\left(t|g\right)\tag{1}$$

where $D$ is a distance function. For $d$ dimensional vectors, $A\subseteq\mathbb{R}^{d}$ ; under some assumptions, we can use a Gaussian kernel for the distance functions. For example, we can set $P\left(T=t|q,\sigma\right)=\mathcal{N}\left(q,t,\sigma\right)$ . Thus, our equation becomes:

$$s\left(G,q,A,\sigma\right)\propto-\sum_{t\in A}\mathcal{N}\left(q,t,\sigma\right)\max_{g\in G}\mathcal{N}\left(t,g,\sigma\right)\tag{2}$$
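To make Equation 2 concrete, the following sketch scores two guess sets on a toy one-dimensional board. It is an illustration under stated assumptions, not the authors' implementation: points are scalars, the distance is the absolute difference, the Gaussian kernel is left unnormalized, and the names `gaussian` and `relevant_information` are hypothetical. The function returns the sum in Equation 2 without the leading minus sign, so larger values are better.

```python
import math

def gaussian(a, b, sigma):
    """Unnormalized Gaussian kernel between two scalar points (assumed form)."""
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))

def relevant_information(guesses, query, points, sigma):
    """Sum over candidate targets t of N(q, t) * max over guesses g of N(t, g)."""
    return sum(
        gaussian(query, t, sigma) * max(gaussian(t, g, sigma) for g in guesses)
        for t in points
    )

# Toy board: three near-duplicate points close to the query, plus outliers.
points = [-1.0, -0.9, 0.1, 0.15, 0.2, 1.0]
query = 0.0
sigma = 0.5

clustered = [0.1, 0.15, 0.2]  # the three nearest neighbours of the query
spread = [-0.9, 0.15, 1.0]    # one guess per group of points

score_clustered = relevant_information(clustered, query, points, sigma)
score_spread = relevant_information(spread, query, points, sigma)
print(score_clustered, score_spread)
```

On this toy board the spread-out set scores higher than the three nearest neighbours, because the redundant guesses barely raise the inner maximum for targets that are already covered.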

2.1 The Dartboard Algorithm

The Dartboard Algorithm aims to minimize the score in Equation 2 (equivalently, to maximize the sum on its right-hand side) given a distance metric. In practice, we build our set $G$ greedily, which works well because it avoids a combinatorial search and allows reuse of previous answers (since the top-$k$ results are a subset of the top-$(k+1)$ results). We begin by triaging the top-$K$ passages $A^{\prime}$ from our initial dataset of passages $A$ using $K$-nearest-neighbors based on cosine similarity. We use a linear search, but sub-linear methods such as FAISS Douze et al. (2024) could also be used for this initial ranking. Our search is a simple greedy optimization method with two changes: (a) we stay in log space to avoid numerical underflow, and (b) we reuse the results ($maxes$) from previous loops to avoid recomputing the maximums. The detailed algorithm is given in Algorithm 1 in Appendix A.1. In Appendix A.3, we also show how to adapt Dartboard to use a cross-encoder based reranker (resulting in two methods called Dartboard crosscoder and Dartboard hybrid), and Appendix A.4 shows that Dartboard generalizes KNN and MMR retrieval algorithms Onal et al. (2015).

3 Experiments

We tested Dartboard on benchmark datasets from Chen et al. (2023), from which we used two types of closed-domain question answering. In the simple question answering case, a query is answerable from a single passage retrieved from the corpus. For example, consider the query When is the premiere of ‘Carole King & James Taylor: Just Call Out My Name’?. On the other hand, in the information integration case, a query would require multiple passages to be retrieved to answer the query. For example, consider the query Who is the director of ‘Carole King & James Taylor: Just Call Out My Name’ and when is its premiere?. We modified this benchmark for our setup in the following way. The original benchmark contains “positive” and “negative” labeled passages for each query. The positive passages are useful for answering, while the negative ones are related but ineffective in answering the query. Since we are interested in the retrieval component of this task, we merged the positive and negative passages for all queries into a single collection of $11,641$ passages for the $300$ simple question answering test cases and $5,701$ passages for the $100$ information integration test cases. The evaluation is otherwise identical apart from the retrieval component. Note that the innovation of Dartboard is solely on the retrieval component. Therefore, we keep the rest of the RAG pipeline fixed. In particular, we do not modify the prompting of LLMs or try to optimize passage embeddings.

Given a query and the full set of thousands of passage embeddings, we measured both a direct retrieval score and the overall end-to-end performance of the system with the only change being the retrieval algorithm. For the direct retrieval score, we computed the Normalized Discounted Cumulative Gain (NDCG) score Wang et al. (2013) on retrieving any one of the “positive” passages relevant to a specific query. In the information integration case, the positive passages were split into positive ones for each component of the question. Therefore, in this case, we calculated the NDCG score for retrieving at least one positive passage for each component of the query. For the end-to-end score, given an LLM’s response to the query (generated from retrieved passages), we use the same evaluation as Chen et al. (2023), which does a string match of the response on a set of correct answers, marking each response as either correct or incorrect.
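For readers unfamiliar with NDCG, the sketch below computes the standard binary-relevance NDCG@k with a log2 rank discount. It is an assumption-laden illustration: the text above does not spell out the exact variant used in the experiments (for example, how "any one of the positive passages" is credited), and the function names `dcg` and `ndcg_at_k` are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(retrieved_ids, positive_ids, k):
    """Binary-relevance NDCG@k: gain 1 for a positive passage, 0 otherwise."""
    gains = [1.0 if pid in positive_ids else 0.0 for pid in retrieved_ids[:k]]
    ideal = [1.0] * min(k, len(positive_ids))  # best case: positives ranked first
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# A positive passage at rank 2 is discounted relative to one at rank 1.
print(ndcg_at_k(["p7", "p3"], {"p3"}, k=2))
print(ndcg_at_k(["p3", "p7"], {"p3"}, k=2))
```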

Some of the methods (described in Appendix A.2), including Dartboard, have tunable parameters. For instance, Maximal Marginal Relevance (MMR) has a diversity parameter that varies from 0 to 1. We performed a grid search over these parameters, reporting the best results for each method.

3.1 Results

From the results shown in Table 2, we observe that, on every metric and for both tasks, the best-scoring Dartboard variant outperforms all non-oracle baseline retrieval methods.

Method	Simple QA	Simple NDCG	Integrated QA	Integrated NDCG
Oracle	89.3%	1.000	36%	.826
D-H (ours)	85.6%	0.973	41%	.609
D-CC (ours)	84.3%	0.971	42%	.595
D-CS (ours)	83.0%	0.975	36%	.545
MMR Crosscoder	84.3%	0.971	40%	.598
MMR Cossim	81.0%	0.974	36%	.541
KNN Crosscoder	84.3%	0.968	36%	.580
KNN Cossim	80.0%	0.973	25%	.514
Empty	3.3%	0.000	3%	.000
Random	3.3%	0.044	2%	.028

Table 2: Results for the Dartboard retrieval system on the QA benchmarks using $k=5$. For methods with tunable parameters (Dartboard and MMR), the best score over a parameter sweep is reported.

Figure 2 shows the performance of different retrieval methods on the end-to-end QA task (simple) as the parameters vary. Although Dartboard Crosscoder (D-CC) and Dartboard hybrid (D-H) are fairly robust to a range of $\sigma$ values, the best performance is achieved for Dartboard hybrid with $\sigma=0.096$ (See Appendix A.2 for baselines).

4 Related Work

MMR Carbonell and Goldstein (1998) retrieves documents that are both relevant to the query and dissimilar to previously retrieved documents. It combines a relevance score (e.g., from BM25) with a novelty score that penalizes documents similar to those already retrieved. It has been used extensively for building recommendation systems Xia et al. (2015); Wu et al. (2023) as well as for summarization tasks Agarwal et al. (2022); Adams et al. (2022). However, MMR suffers from a few limitations. First, MMR requires a diversity parameter to control the balance between relevance and novelty. This parameter is often dataset-specific and requires careful tuning, which makes it impractical for real-world applications. Second, MMR can favor exact duplicates of previously retrieved documents, as they retain a high relevance score while minimally impacting the average novelty score (see Appendix A.7).
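As a point of reference, the textbook greedy form of MMR can be sketched as follows. This is an illustration under assumptions rather than the baseline code used in the experiments: it uses cosine similarity for both relevance and redundancy, penalizes the maximum similarity to already selected items, and the names `cosine`, `mmr`, and `lam` (the relevance weight, so the diversity weight is `1 - lam`) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query, docs, k, lam):
    """Greedy MMR: returns indices into docs, in selection order."""
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = (1.0, 0.0)
docs = [(1.0, 0.1), (1.0, 0.12), (0.6, 0.8)]  # docs 0 and 1 are near-duplicates
print(mmr(query, docs, k=2, lam=1.0))  # pure relevance: [0, 1]
print(mmr(query, docs, k=2, lam=0.3))  # diversity-heavy: [0, 2]
```

The outcome depends entirely on `lam`, which is the tuning burden described above.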

KNN retrieves documents based on their similarity to a query embedding Dharani and Aroquiaraj (2013); Bijalwan et al. (2014). While efficient, KNN often suffers from redundancy as nearby documents in the embedding space tend to be semantically similar Taunk et al. (2019). This can lead to a retrieved set dominated by passages conveying the same information with slight variations.

Several recent works have explored incorporating diversity objectives into retrieval models Angel and Koudas (2011); Li et al. (2015); Fromm et al. (2021). These approaches often involve complex optimization functions or require additional training data for diversity estimation. For example, Learning-to-Rank with Diversity methods leverage learning-to-rank frameworks that incorporate diversity objectives directly into the ranking function. This allows for the optimization of both relevance and diversity during the ranking process. However, these approaches often require large amounts of labeled training data for diversity, which can be expensive and time-consuming to obtain Wasilewski and Hurley (2016); Yan et al. (2021). Bandit-based approaches model document selection as a multi-armed bandit problem Hofmann et al. (2011); Wang et al. (2021). The model explores different retrieval strategies and receives feedback based on the relevance and diversity of the retrieved passages. These approaches can be effective but can be computationally expensive for large-scale retrieval tasks.

RAG models have also been extended to incorporate diversity objectives. For example, RAG with Dense Passage Retrieval retrieves a large number of candidate passages Cuconasu et al. (2024); Reichman and Heck (2024); Siriwardhana et al. (2023). It then employs a two-stage selection process: first selecting a diverse subset based on novelty scores, then selecting the most relevant passages from this subset. While effective, this approach requires careful tuning of the selection thresholds.

5 Discussion

In this paper, we introduce Dartboard, a principled retrieval algorithm that implicitly encourages diversity of retrieved passages by optimizing for relevant information gain. We demonstrate that Dartboard outperforms existing state-of-the-art retrieval algorithms on both retrieval and end-to-end QA tasks. We view this work as an initial step for a more general line of work that optimizes information gain during retrieval, especially in the context of RAG systems. In future work, we plan to investigate Dartboard for other retrieval tasks, such as suggestion generation (see Appendix A.6).

6 Limitations

We have not done a systematic investigation of the run time of Dartboard. In the worst case, Dartboard is quadratic in the number of ranked passages. However, in practice, Dartboard hybrid typically runs in a fraction of a second when ranking a set of $100$ passages (based on cosine similarity with the query); note that a full cross-encoder based MMR/Dartboard needs to run the cross-encoder $10,000$ times and can take several seconds. This retrieval time is minimal compared to the time required for an LLM to process the retrieved passages and generate an answer.

Our experimental results are limited to a single benchmark and a single LLM, i.e., ChatGLM Hou et al. (2024). It remains to be seen whether our results would generalize to other benchmarks and LLMs. We plan to investigate this in future work.

One shortcoming of our method (also shared by MMR) is that it requires a hyperparameter that affects how much diversity is encouraged. While we show that Dartboard is robust to the choice of this hyperparameter, it would be ideal to have a method that does not require manual tuning. As part of future work, we plan to investigate methods that automatically adapt to the context of the query. For example, the hyperparameter could be set based on a held-out validation set.

Another topic for future work is to investigate if it is also possible for $\sigma$ to vary depending on the type of query. For example, a query like “Tell me facts about The Beatles” would warrant a broader range of passages than a query like “Tell me facts about George Harrison”.

Another shortcoming of our approach is that our evaluation protocol is limited. Our evaluation is based on an exact string match of the output answer generated from the LLM with a set of possible answers. For example, for one question, the generated output answer is considered correct if it contains the exact string ‘January 2 2022’, ‘Jan 2, 2022’, etc., but would be considered incorrect if it only contains ‘January 2nd, 2022’. However, we left the benchmark as is (modulo our modifications mentioned above) so that our method is easily comparable to that of others.
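The brittleness of substring matching is easy to reproduce. The snippet below is a minimal sketch of such a check, assuming a plain case-sensitive substring test; the benchmark's own evaluation script may differ in details such as normalization, and `is_correct` is a hypothetical name.

```python
def is_correct(response, accepted_answers):
    """Mark a response correct if it contains any accepted answer verbatim."""
    return any(answer in response for answer in accepted_answers)

accepted = ["January 2 2022", "Jan 2, 2022"]

print(is_correct("It premiered on Jan 2, 2022.", accepted))        # True
print(is_correct("It premiered on January 2nd, 2022.", accepted))  # False
```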

Finally, although the initial cosine-similarity-based Dartboard method is principled, the hybrid variation of Dartboard is less so. It compares logits from a cross-encoder with the cosine similarity of a different embedding model, which is akin to comparing apples with oranges, though it works well in our empirical results.

References

Adams et al. (2022) David Adams, Gandharv Suri, and Yllias Chali. 2022. Combining state-of-the-art models with maximal marginal relevance for few-shot and zero-shot multi-document summarization. arXiv preprint arXiv:2211.10808.
Agarwal et al. (2022) Abhishek Agarwal, Shanshan Xu, and Matthias Grabmair. 2022. Extractive summarization of legal decisions using multi-task learning and maximal marginal relevance. arXiv preprint arXiv:2210.12437.
Angel and Koudas (2011) Albert Angel and Nick Koudas. 2011. Efficient diversity-aware search. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 781–792.
Bijalwan et al. (2014) Vishwanath Bijalwan, Vinay Kumar, Pinki Kumari, and Jordan Pascual. 2014. KNN based machine learning approach for text and document mining. International Journal of Database Theory and Application, 7(1):61–70.
Carbonell and Goldstein (1998) Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336.
Chen et al. (2023) Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2023. Benchmarking large language models in retrieval-augmented generation. Preprint, arXiv:2309.01431.
Cuconasu et al. (2024) Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. 2024. The power of noise: Redefining retrieval for RAG systems. arXiv preprint arXiv:2401.14887.
Dharani and Aroquiaraj (2013) T Dharani and I Laurence Aroquiaraj. 2013. Content based image retrieval system using feature classification with modified knn algorithm. arXiv preprint arXiv:1307.4717.
Douze et al. (2024) Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. Preprint, arXiv:2401.08281.
Fromm et al. (2021) Michael Fromm, Max Berrendorf, Sandra Obermeier, Thomas Seidl, and Evgeniy Faerman. 2021. Diversity aware relevance learning for argument search. In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28–April 1, 2021, Proceedings, Part II 43, pages 264–271. Springer.
Gao et al. (2024) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-augmented generation for large language models: A survey. Preprint, arXiv:2312.10997.
Hofmann et al. (2011) Katja Hofmann, Shimon Whiteson, Maarten de Rijke, et al. 2011. Contextual bandits for information retrieval. In NIPS 2011 Workshop on Bayesian optimization, experimental design, and bandits, Granada, volume 12, page 2011.
Hou et al. (2024) Zhenyu Hou, Yilin Niu, Zhengxiao Du, Xiaohan Zhang, Xiao Liu, Aohan Zeng, Qinkai Zheng, Minlie Huang, Hongning Wang, Jie Tang, et al. 2024. ChatGLM-RLHF: Practices of aligning large language models with human feedback. arXiv preprint arXiv:2404.00934.
Li et al. (2015) Jianqiang Li, Chunchen Liu, Bo Liu, Rui Mao, Yongcai Wang, Shi Chen, Ji-Jiang Yang, Hui Pan, and Qing Wang. 2015. Diversity-aware retrieval of medical records. Computers in Industry, 69:81–91.
Nogueira and Cho (2020) Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage re-ranking with BERT. Preprint, arXiv:1901.04085.
Onal et al. (2015) Kezban Dilek Onal, Ismail Sengor Altingovde, and Pinar Karagoz. 2015. Utilizing word embeddings for result diversification in tweet search. In Information Retrieval Technology: 11th Asia Information Retrieval Societies Conference, AIRS 2015, Brisbane, QLD, Australia, December 2-4, 2015. Proceedings 11, pages 366–378. Springer.
Reichman and Heck (2024) Benjamin Reichman and Larry Heck. 2024. Retrieval-augmented generation: Is dense passage retrieval retrieving? arXiv preprint arXiv:2402.11035.
Siriwardhana et al. (2023) Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. 2023. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics, 11:1–17.
Taunk et al. (2019) Kashvi Taunk, Sanjukta De, Srishti Verma, and Aleena Swetapadma. 2019. A brief review of nearest neighbor algorithm for learning and classification. In 2019 international conference on intelligent computing and control systems (ICCS), pages 1255–1260. IEEE.
Wang et al. (2021) Huazheng Wang, Yiling Jia, and Hongning Wang. 2021. Interactive information retrieval with bandit feedback. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2658–2661.
Wang et al. (2013) Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A theoretical analysis of NDCG type ranking measures. In Conference on learning theory, pages 25–54. PMLR.
Wasilewski and Hurley (2016) Jacek Wasilewski and Neil Hurley. 2016. Incorporating diversity in a learning to rank recommender system. In The twenty-ninth international flairs conference.
Wu et al. (2023) Chun-Ho Wu, Yue Wang, and Jie Ma. 2023. Maximal marginal relevance-based recommendation for product customisation. Enterprise Information Systems, 17(5):1992018.
Xia et al. (2015) Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2015. Learning maximal marginal relevance model via directly optimizing diversity evaluation measures. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pages 113–122.
Yan et al. (2021) Le Yan, Zhen Qin, Rama Kumar Pasumarthi, Xuanhui Wang, and Michael Bendersky. 2021. Diversification-aware learning to rank using distributed representation. In Proceedings of the Web Conference 2021, pages 127–136.

Appendix A Appendix

A.1 Dartboard Algorithm Details

The full algorithm for Dartboard is described in Algorithm 1.

    # Natural log of Gaussian pdf.
    function LogNorm($\mu$, $\sigma$)
        return $-\ln\left(\sigma\right)-\frac{1}{2}\ln\left(2\pi\right)-\frac{\mu^{2}}{2\sigma^{2}}$
    # $q$: the query.
    # $A$: set of all points.
    # $K$: number of points to triage.
    # $k$: number of points to return.
    # $\sigma$: the standard deviation of the Gaussians, a measure of spread.
    function Dartboard($q$, $A$, $K$, $k$, $\sigma$)
        # Triage and get the distances.
        $A^{\prime},ids\leftarrow$ KNN($q$, $A$, $K$)    # Triage using KNN.
        $D\leftarrow$ dists($A^{\prime}$, $A^{\prime}$, $\sigma$)    # $K$ by $K$ distance matrix.
        $Q\leftarrow$ dists($q$, $A^{\prime}$, $\sigma$)    # Distance from each $a\in A^{\prime}$ to $q$.
        # Work in log space for numerical stability.
        # Note that $D$ and $Q$ are now log probabilities, not distances.
        $D\leftarrow$ LogNorm($D$, $\sigma$)
        $Q\leftarrow$ LogNorm($Q$, $\sigma$)
        # Greedily seed and search.
        # We only track the last addition’s contribution.
        $m\leftarrow\operatorname*{arg\,max}_{i}\left(Q\right)$
        $maxes\leftarrow D_{m}$
        $ret\leftarrow\left[ids_{m}\right]$
        # Incrementally add until we have $k$ elements.
        while $\left|ret\right|<k$
            $newmax\leftarrow\max\left(maxes,D\right)$
            $scores\leftarrow$ LogSumExp($newmax+Q$)
            # Get the best candidate.
            $m\leftarrow\operatorname*{arg\,max}_{i}\left(scores\right)$
            $maxes\leftarrow newmax_{m}$
            $ret\leftarrow$ append($ret$, $ids_{m}$)
        return $ret$

Algorithm 1: Dartboard
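The pseudocode can be turned into a small runnable sketch. The version below is a pure-Python reading of Algorithm 1 under several assumptions that are not taken from the paper: the distance is one minus cosine similarity, the triage is a linear scan, already selected candidates are skipped explicitly, and all function and parameter names (`cosine_distance`, `log_norm`, `log_sum_exp`, `dartboard`, `triage_k`) are hypothetical. The authors' released code should be treated as the reference.

```python
import math

def cosine_distance(a, b):
    """Assumed distance: one minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def log_norm(dist, sigma):
    """Natural log of a zero-mean Gaussian pdf evaluated at dist."""
    return -math.log(sigma) - 0.5 * math.log(2 * math.pi) - dist ** 2 / (2 * sigma ** 2)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def dartboard(query, points, triage_k, k, sigma):
    """Greedy Dartboard selection; returns indices into points."""
    # Triage: keep the triage_k points closest to the query (linear scan).
    ids = sorted(range(len(points)), key=lambda i: cosine_distance(query, points[i]))[:triage_k]
    cand = [points[i] for i in ids]
    n = len(cand)
    # Work in log space: these are log probabilities, not distances.
    log_q = [log_norm(cosine_distance(query, c), sigma) for c in cand]
    log_d = [[log_norm(cosine_distance(a, b), sigma) for b in cand] for a in cand]
    # Seed with the candidate most likely to be the target.
    best = max(range(n), key=lambda i: log_q[i])
    maxes = log_d[best][:]
    chosen = [best]
    while len(chosen) < min(k, n):
        best, best_score, best_maxes = None, -math.inf, None
        for i in range(n):
            if i in chosen:  # assumption: never re-pick a selected candidate
                continue
            new_maxes = [max(m, d) for m, d in zip(maxes, log_d[i])]
            score = log_sum_exp([m + q for m, q in zip(new_maxes, log_q)])
            if score > best_score:
                best, best_score, best_maxes = i, score, new_maxes
        maxes = best_maxes  # reuse the running maxima in the next round
        chosen.append(best)
    return [ids[i] for i in chosen]

# Toy data: unit vectors at the given angles (degrees); the query is at 0.
angles = [5, 6, 7, -20, 30]
points = [(math.cos(math.radians(t)), math.sin(math.radians(t))) for t in angles]
query = (1.0, 0.0)

knn_ids = sorted(range(len(points)), key=lambda i: cosine_distance(query, points[i]))[:3]
dartboard_ids = dartboard(query, points, triage_k=5, k=3, sigma=0.05)
print(knn_ids)        # [0, 1, 2]: three near-duplicates
print(dartboard_ids)  # [0, 3, 4]: one point per direction
```

The inner loop makes the quadratic cost in the number of triaged passages visible: each of the up to $k$ rounds scans every remaining candidate against every target.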

A.2 Baselines

In this section, we briefly describe the different variations of Dartboard as well as the competing retrieval methods that we use to compare the performance of Dartboard in Table 2 in the main paper. All methods that rely on using the cross-encoder first use KNN to retrieve the top $100$ passages.

• Dartboard cossim (D-CS): This is the variation of the proposed Dartboard method that relies on using cosine similarity for ranking passages.
• Dartboard crosscoder (D-CC): This is the variation of the proposed Dartboard method that relies on using cross-encoder based similarity.
• Dartboard hybrid (D-H): This is the variation of the proposed Dartboard method that relies on using cross-encoder for the Gaussian kernel $\mathcal{N}\left(q,t,\sigma\right)$ and cosine similarity for the Gaussian kernel $\mathcal{N}\left(t,g,\sigma\right)$.
• KNN cossim: This is the variation of K-nearest neighbors algorithm that relies on using cosine similarity.
• KNN crosscoder: This is the variation of K-nearest neighbors algorithm that relies on using cross-encoder similarity.
• MMR cossim: This is the variation of the Maximal Marginal Relevance method that relies on using cosine similarity.
• MMR crosscoder: This is the variation of the Maximal Marginal Relevance method that relies on using cross-encoder similarity.
• Empty: This is a method that involves no retrieval step but uses just the LLM to generate the answer for a given query.
• Oracle: This method retrieves only the “positive” labeled passages. For the information integration case, we retrieve positive passages for each component of the query up to $k$. If the number of positive passages is less than $k$, we use the negative passages to fill in the rest.
• Random: This method randomly retrieves $k$ passages from the full passage set.

A.3 Modification for cross-encoder based reranker

Cross-encoder-based reranking has been shown to outperform embedding-based approaches such as cosine similarity Nogueira and Cho (2020), as it uses the full computational power of a transformer model, rather than being limited to simple vector operations. We have proposed two variations of Dartboard, namely Dartboard Crosscoder and Dartboard Hybrid, based on how we compute the cross-encoder scores for the Gaussian kernels in Equation 2 given in the main paper. For the Dartboard Crosscoder variation, we use the cross-encoder score $C\left(q,t\right)$ before computing the Gaussian kernel for both $\mathcal{N}\left(q,t,\sigma\right)$ and $\mathcal{N}\left(t,g,\sigma\right)$ in Equation 2. Note that the cross-encoder score is asymmetric, so we simply average the two possible ways to compute the cross-encoder score for $\mathcal{N}\left(t,g,\sigma\right)$ , i.e., $\frac{1}{2}\left(C\left(t,g\right)+C\left(g,t\right)\right)$ . For $\mathcal{N}\left(q,t,\sigma\right)$ , we are only interested in the likelihood of $t$ given $q$ , so we only use the cross-encoder score $C\left(q,t\right)$ .

However, the cross-encoder is computationally expensive to run for all $K^{2}$ pairs of triaged passages. Hence, we rely on the Dartboard hybrid variation, wherein we use the cross-encoder score only for the Gaussian kernel $\mathcal{N}\left(q,t,\sigma\right)$, whereas we use cosine similarity for the Gaussian kernel $\mathcal{N}\left(t,g,\sigma\right)$.

A.4 Dartboard generalizes KNN and MMR

The Dartboard algorithm can be viewed as a generalization of the traditional retrieval algorithms, KNN and MMR. In order to verify this claim, let us look at the score presented in Equation 1 in the main paper. When Player 1 has perfect aim, or in other words, $\sigma\to 0$, $P(T=t|q,\sigma)$ tends to a point mass distribution at $t=q$, and hence the score becomes

$$s\left(G,q,A,\sigma\right)\to\min_{g\in G}D\left(q|g\right)\tag{3}$$

where $D$ is the distance function as before. If the chosen distance function is proportional to the similarity measure, this is nothing but the KNN algorithm. On the other hand, when the chosen distance function is the weighted sum of the similarity between query and guess, and dissimilarity between current guess and past guesses, it reduces to the MMR algorithm.

A.5 Dartboard inherently promotes diversity

In Figure 3, we show the diversity of the retrieved passages from RGB for both Dartboard and MMR, measured as one minus the average cosine similarity between pairs of retrieved passages. While MMR explicitly encourages diversity, Dartboard does not. However, we observe from the figure that as the parameter $\sigma$ increases, the diversity of the retrieved passages also increases. This implies that by optimizing the relevant information gain metric, Dartboard inherently ensures diversity in the set of retrieved passages.
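The diversity measure described above (one minus the average pairwise cosine similarity) can be computed as in the following sketch. The names `cosine` and `diversity` are hypothetical, and returning 0.0 for fewer than two passages is an assumption made only so that the function is total.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def diversity(vectors):
    """One minus the mean cosine similarity over all unordered pairs."""
    pairs = list(combinations(vectors, 2))
    if not pairs:  # assumption: a set with fewer than two items has no diversity
        return 0.0
    return 1.0 - sum(cosine(a, b) for a, b in pairs) / len(pairs)

redundant = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
varied = [(1.0, 0.0), (0.0, 1.0)]
print(diversity(redundant), diversity(varied))
```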

A.6 Example of a generative use of Dartboard

Below is an example query with a set of candidate responses. Considering the cross-encoder based variations, the passages retrieved by Dartboard are highly diverse, whereas those retrieved by KNN are highly redundant:

Query: Do you want to watch soccer?

Candidates:
 1: Absolutely!
 2: Affirmative!
 3: I don’t know!
 4: I’d love to!
 5: Maybe later.
 6: Maybe!
 7: Maybe...
 8: No thanks.
 9: No way!
10: No, I don’t wanna do dat.
11: No, thank you!
12: No, thank you.
13: Not right now.
14: Not today.
15: Perhaps..
16: Sure!
17: Yeah!
18: Yes!
19: Yes, please can we?
20: Yes, please!
21: Yes, please.
22: Yes, we ought to!
23: Yes, we shall!
24: Yes, we should!

KNN crosscoder:
  18: Yes!
  21: Yes, please.
  20: Yes, please!

Dartboard crosscoder:
  18: Yes!
   7: Maybe...
  12: No, thank you.

A.7 Dartboard does not allow for the possibility of exact duplicates

The “max” in Equation 2 given in the main paper ensures that the same vector (passage) is not selected twice (unless all non-duplicate/unique passages have been exhausted) in case of Dartboard. This is in contrast to MMR, which can select the same vector (passage).

Here is an example where MMR produces exact duplicates. Consider the scenario when our passage database consists of the vectors $\{(2,1),(2,1),(1,2),(0,1)\}$ (with a duplicate $(2,1)$ ). Now if we use cosine similarity based scoring, and set diversity to $.5$ for $k=3$ in case of MMR, the bag that maximizes the score for probe $(2,1)$ for MMR is $\{(0,1),(2,1),(2,1)\}$ , which has an exact duplicate passage vector $(2,1)$ . This verifies that MMR can allow for exact duplicates, which can increase the MMR score because it decreases the average distance to the query, while (possibly) only marginally decreasing the diversity.

On the contrary, in case of Dartboard, an exact duplicate passage vector will add zero information i.e. it would not increase the chances of hitting the target. So it will not be selected for retrieval until all other non-duplicate options are exhausted.
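The MMR example above can be checked by brute force. The exact set-level MMR score is not spelled out in the text, so the sketch below assumes one plausible form: a weighted sum of the average cosine similarity to the probe and one minus the average pairwise cosine similarity within the bag, with the diversity weight set to $.5$. The names `cosine` and `bag_score` are hypothetical.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def bag_score(bag, probe, vectors, diversity):
    """Assumed set-level MMR score: average relevance traded off against average novelty."""
    relevance = sum(cosine(probe, vectors[i]) for i in bag) / len(bag)
    pairs = list(combinations(bag, 2))
    similarity = sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
    return (1 - diversity) * relevance + diversity * (1 - similarity)

vectors = [(2, 1), (2, 1), (1, 2), (0, 1)]  # index 1 duplicates index 0
probe = (2, 1)

# Enumerate every bag of three passages (by index) and keep the best one.
best = max(
    combinations(range(len(vectors)), 3),
    key=lambda bag: bag_score(bag, probe, vectors, diversity=0.5),
)
print([vectors[i] for i in best])  # [(2, 1), (2, 1), (0, 1)]
```

Under this assumed score, the bag containing both copies of $(2,1)$ wins, which agrees with the example in the text.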

A.8 More results

In Figure 4, we show the relation between NDCG score and final end-to-end performance on the question answering (QA) task.