VRSD: Rethinking Similarity and Diversity for Retrieval in Large Language Models

Hang Gao
Department of Computer Science
Rutgers University
New Brunswick, NJ 08901
[email protected]
Yongfeng Zhang
Department of Computer Science
Rutgers University
New Brunswick, NJ 08901
[email protected]

Abstract

Vector retrieval algorithms are vital for semantic queries in the evolving landscape of Large Language Models (LLMs). Retrieving vectors that simultaneously meet criteria for both similarity and diversity significantly enhances the capabilities of LLM-based agents. Despite the widespread use of Maximal Marginal Relevance (MMR) in retrieval scenarios with relevance and diversity requirements, fluctuations caused by variations in its parameter $\lambda$ complicate the determination of the optimization trajectory in vector spaces, thus obscuring the direction of enhancement. Moreover, there is a lack of robust theoretical analysis of the similarity and diversity constraints in retrieval processes. This paper introduces a novel approach to characterizing both constraints through the relationship between the sum vector and the query vector. The proximity of these vectors addresses the similarity constraint, while requiring the individual vectors within the sum vector to align with the query vector from different directions satisfies the diversity constraint. We also formulate a new combinatorial optimization problem: selecting $k$ vectors from a set of candidates such that their sum vector maximally aligns with the query vector, a problem we demonstrate to be NP-complete. This establishes the profound difficulty of pursuing similarity and diversity simultaneously in vector retrieval and lays a theoretical groundwork for further research. Additionally, we present the heuristic algorithm Vectors Retrieval with Similarity and Diversity (VRSD), which not only has a definitive optimization goal and eschews the need for preset parameters but also offers a modest reduction in time complexity compared to MMR. Empirical validation further confirms that VRSD significantly surpasses MMR across various datasets.

1 Introduction

Vector retrieval algorithms are crucial for semantic queries and have become increasingly integral to the deployment of Large Language Models (LLMs). Effective interaction with LLMs frequently necessitates the provision of relevant or similar examples to elicit enhanced responses [17]. The introduction of Retrieval Augmented Generation (RAG) has notably advanced the capabilities in knowledge-intensive tasks [16], underscoring the growing importance of retrieval methods. Empirical evidence suggests that employing the BM25 algorithm to select examples from the training set markedly improves LLM performance over random selection [17, 19]. Moreover, leveraging existing text embedding models for example retrieval often surpasses BM25, particularly in specific contexts [26, 31]. The advent of Dense Retrieval, which employs dense vectors for semantic matching in latent spaces [5, 15], represents an evolution over traditional sparse retrieval methods like BM25 by utilizing the robust modeling capabilities of pre-trained language models to learn relevance functions [8]. Innovations such as applying the dual encoder framework [11] and dynamic listwise distillation [27] have further refined the effectiveness of dense retrieval techniques. Subsequent enhancements in semantic parsing and in-context learning [24], facilitated by feedback from LLMs [29], have enabled more precise example selection and improved answer accuracy. Despite ongoing advancements in retrieval methods, the broadening application scope of LLMs necessitates retrieval approaches that balance relevance with diversity, specifically a relevance-focused diversity rather than an unrestricted diversity. Additionally, the RAG framework’s ability to augment the LLMs’ external data access also underscores the need for simple yet efficient algorithms that can streamline the retrieval process.

Considering the balance between similarity and diversity, Maximal Marginal Relevance (MMR) [4] is an effective heuristic algorithm that has been widely applied in vector retrieval practice. Aiming to achieve an optimal balance, MMR incorporates a parameter, $\lambda$, which adjusts the relative weight of relevance and diversity. Nevertheless, this method is not always effective; in different scenarios, $\lambda$ needs to take different values, which cannot be known in advance. Recent research [29, 32] has explored using LLMs to enhance retrieval results and also suggests selecting a set of examples from a combinatorial optimization perspective, rather than selecting examples one by one, as the in-context examples can influence each other. In light of this, we propose using the sum vector to characterize both similarity and diversity in vector retrieval. Simply put, this involves maximizing the similarity between the sum vector of the selected vectors and the query vector, which imposes a similarity constraint. At the same time, from a geometric perspective, the requirement for the sum vector to be similar to the query vector means that the selected vectors approach the query vector from different directions, thus imposing a diversity constraint. Additionally, the idea of considering the similarity between the sum vector and the query vector is analogous to the famous finding in word2vec (king - man + woman = queen) [22], as both involve obtaining complex semantic similarities through simple vector arithmetic. Therefore, using the sum vector to characterize similarity and diversity constraints not only considers similarity while reducing redundancy but also enhances the complementarity among retrieval results.

Consequently, we define a new combinatorial optimization problem: selecting several vectors from a set of candidate vectors such that the similarity between the sum vector of the selected vectors and the query vector is maximized. However, contrary to its intuitive and straightforward appearance, this is a highly challenging problem. We prove that this problem is NP-complete by reducing the subset sum problem to it, revealing theoretically that simultaneously pursuing similarity and diversity in vector retrieval is extremely difficult. This novel combinatorial optimization problem, of independent theoretical interest, establishes a solid theoretical foundation for future research. Subsequently, we present a heuristic algorithm to solve the proposed problem. This algorithm has a clear optimization objective, requires no preset parameters, and has a slightly lower time complexity than the MMR algorithm. Our experimental studies also demonstrate that the new algorithm significantly outperforms the MMR algorithm across various datasets. Additionally, given that similarity measures in vector retrieval typically include cosine similarity, inner product distance, and Euclidean distance, and considering that vectors in LLM applications are usually normalized, the results obtained using these measures in vector retrieval are consistent. Consequently, the discussion on vector similarity in this paper uses cosine similarity. In summary, our work makes the following contributions:

• We propose using the similarity between the sum vector and the query vector to characterize similarity and diversity constraints in vector retrieval. We formulate a novel optimization problem where we seek to select several vectors from a set of candidates such that the similarity between the sum vector of the selected vectors and the query vector is maximized.

• We demonstrate that our optimization problem is NP-complete, theoretically revealing the extreme difficulty of simultaneously pursuing similarity and diversity in vector retrieval.

• For the NP-complete combinatorial optimization problem we propose, we provide a heuristic algorithm, VRSD. We experimentally study our algorithm on several datasets, and our results show that our algorithm significantly outperforms the classic MMR algorithm.

The remainder of this paper is organized as follows: Section 2 provides a review of related work in the field. Section 3 defines the problem under investigation, examines its computational complexity, and proposes a heuristic algorithm for addressing it. Section 4 presents an experimental evaluation of our heuristic algorithm. Finally, Section 5 concludes the paper and discusses potential directions for future research.

2 Related Work

2.1 Retrieval and Large Language Model

Retrieval methods in Large Language Models (LLMs) have gained traction, particularly due to their pivotal role in open-domain question answering, evidenced by seminal contributions in the field [5, 9, 14, 25, 10]. The introduction of Retrieval Augmented Generation (RAG) further underscored the significance of these methods across knowledge-intensive tasks [16], notably enhancing generation capabilities for open-domain queries [20]. This spurred the adoption of techniques such as K-Nearest-Neighbor (KNN) in diverse applications, ranging from the customization of multilingual models in machine translation [12] to improving the prediction of rare patterns in LLMs [13, 1]. Continued advancements in retrieval techniques have focused on identifying highly informative examples to augment in-context learning, thereby enabling LLM-based systems to achieve significant performance improvements with minimal examples [3]. Early on, traditional sparse retrieval methods like BM25 [28]—an extension of TF-IDF—were utilized to refine in-context learning [17]. Subsequently, the integration of LLMs’ intrinsic capabilities [30] and Sentence-BERT (SBERT) [26] facilitated the retrieval of highly pertinent examples for prompt integration. The advent of dense retrievers signified a methodological enhancement in retrieval from a machine learning perspective, and the incorporation of feedback signals with contrastive learning has yielded more effective retrieval systems [29, 32]. Recent innovations like UPRISE [6] and PRAC [23] have further optimized the performance of in-context learning by retrieving demonstrations directly from training data. Despite these advances, most retrieval methods still treat each candidate independently, which can lead to suboptimal outcomes due to the interaction effects among in-context examples, resulting in a lack of diversity. 
Given the expanding applications of LLMs, diversity becomes increasingly crucial, as complex inference tasks often require undefined example types. Incorporating a diverse range of examples enriches LLMs’ learning processes, facilitating more innovative and robust responses, especially for complex open-ended questions. Furthermore, a relevance-focused diversity retrieval method could help mitigate the impact of malicious information as LLMs continue to scale and assimilate more societal data. Even if some malicious information is inadvertently retrieved, the inclusion of diverse data within the same batch can enable LLMs to fully comprehend the query’s intent and provide accurate responses. Additionally, as LLMs are tasked with processing ever-greater volumes of information, the necessity for a straightforward and efficient retrieval method intensifies, minimizing the dependency on resource-intensive preparatory tasks.

2.2 Maximal Marginal Relevance

To enhance retrieval processes by accounting for both relevance and diversity, the Maximal Marginal Relevance (MMR) algorithm was introduced [4]. MMR addresses the balance between relevance and diversity in traditional retrieval and summarization methods by employing "marginal relevance" as an evaluation metric. This metric is defined as a linear combination of independently measured relevance and novelty, formulated as Eq.1:

$$\text{MMR}=\arg\max_{d_{i}\in R\setminus S}\left[\lambda\cdot\text{Sim}_{1}(d_{i},q)-(1-\lambda)\cdot\max_{d_{j}\in S}\text{Sim}_{2}(d_{i},d_{j})\right]. \tag{1}$$
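The greedy selection implied by Eq.1 can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `cosine` and `mmr_select` are hypothetical, both $\text{Sim}_{1}$ and $\text{Sim}_{2}$ are assumed to be cosine similarity, and the redundancy term is assumed to be 0 while $S$ is still empty.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors (assumed non-zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmr_select(query, candidates, k, lam=0.5):
    """Greedy MMR selection following Eq.1; returns indices into `candidates`."""
    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            relevance = cosine(candidates[i], query)
            # Redundancy: largest similarity to any already selected vector.
            # Taken as 0 while S is empty (an assumption of this sketch).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1.0 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected
```

With `lam=1.0` the redundancy term vanishes and the routine reduces to ranking by relevance alone, while smaller values of `lam` penalize candidates that are close to vectors already in $S$.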

Figure 1: An analysis of the Maximal Marginal Relevance. (a) The candidate vectors are located on different sides of the query vector. (b) The candidate vectors are located on the same side of the query vector.

The challenge lies in selecting an appropriate $\lambda$ to achieve the desired balance between relevance and diversity, particularly in high-dimensional vector spaces where the impact of varying $\lambda$ is less predictable. This variability in $\lambda$ leads to fluctuations in retrieval results, resulting in unpredictable consequences, which can be illustrated by a simple example. Commonly, $\lambda$ is preset at a value of 0.5 in many MMR implementations, a choice that stems from the algorithm’s foundational design. It is important to note that at $\lambda=1$, the algorithm exclusively prioritizes relevance, while at $\lambda=0$, it focuses entirely on diversity. Let us examine the performance of the MMR algorithm at the typical midpoint setting of $\lambda=0.5$. For clarity and ease of comprehension, we model the retrieval process within a two-dimensional vector space, though the principles observed are equally applicable to more complex, higher-dimensional scenarios.

As illustrated in Figure 1(a), consider $q$ as the query vector, and $d_{0}$ to $d_{3}$ as candidate vectors that surpass the relevance threshold, collectively represented as $R=\{d_{0},d_{1},d_{2},d_{3}\}$, with $S$ initially empty. Utilizing the MMR algorithm, $d_{0}$ is first selected due to its highest relevance to $q$, determined using cosine similarity as a measure. Subsequently, $d_{3}$ is chosen over $d_{1}$, despite $d_{1}$ having a smaller angle with $q$ and thus greater direct relevance. The selection of $d_{3}$ is influenced by the fact that the similarity between $d_{1}$ and $d_{0}$ significantly surpasses that between $d_{3}$ and $d_{0}$, resulting in a higher MMR value for $d_{3}$ as per the formula.

However, as depicted in Figure 1(b), with $q$ serving as the query vector and $R=\{d_{0},d_{1},d_{2},d_{3}\}$ representing the initial set of candidate vectors, $d_{0}$ is first selected due to its maximal relevance to $q$. The selection process using the MMR algorithm proceeds as follows: with $\lambda=0.5$, $S=\{d_{0}\}$, and $R\setminus S=\{d_{1},d_{2},d_{3}\}$, the formula can be articulated as Eq.2:

$$\text{MMR}=\arg\max_{i=1,2,3}\left[0.5\cdot\left(\text{Sim}_{1}(d_{i},q)-\text{Sim}_{2}(d_{i},d_{0})\right)\right]. \tag{2}$$

Given that $d_{0}$, $d_{1}$, $d_{2}$, and $d_{3}$ are positioned on the same side relative to $q$, and assuming both $\text{Sim}_{1}$ and $\text{Sim}_{2}$ denote cosine similarity, let $\theta$ represent the angle between $d_{0}$ and $q$, and $x$ denote the angle between $d_{i}$ (i.e., $d_{1},d_{2},d_{3}$) and $d_{0}$, so that the angle between $d_{i}$ and $q$ is $x+\theta$. Thus, we obtain Eq.3:

$$\text{MMR}=\arg\max_{i=1,2,3}\left[0.5\cdot\left(\cos(d_{i},q)-\cos(d_{i},d_{0})\right)\right]=\arg\max_{i=1,2,3}\left[0.5\cdot\left(\cos(x+\theta)-\cos(x)\right)\right] \tag{3}$$

Consider the function $f(x)=\cos(x+\theta)-\cos(x)$, whose derivative is $f^{\prime}(x)=-\sin(x+\theta)+\sin(x)$, and assume that $x$ and $x+\theta$ both lie within $(0,\pi/2)$. Since the sine function is strictly increasing on this interval and $\theta>0$, we have $\sin(x+\theta)>\sin(x)$. Consequently, $f^{\prime}(x)<0$, indicating that for vectors on the same side of $q$, their MMR values decrease as the angle with $q$ increases. Thus, following the selection of $d_{0}$, the subsequent choices are $d_{1}$, then $d_{2}$, and so on. This sequence suggests that relevance predominantly influences the selection outcome.
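The monotonicity claim can be checked numerically. The sketch below evaluates $f$ and $f^{\prime}$ on a grid; the value $\theta=10^{\circ}$ and the helper names `mmr_gap` and `mmr_gap_derivative` are illustrative assumptions, and a grid check is of course not a proof.

```python
import numpy as np


def mmr_gap(x, theta):
    """f(x) = cos(x + theta) - cos(x), the bracketed term of Eq.3 without the 0.5 factor."""
    return np.cos(x + theta) - np.cos(x)


def mmr_gap_derivative(x, theta):
    """f'(x) = -sin(x + theta) + sin(x)."""
    return -np.sin(x + theta) + np.sin(x)


theta = np.radians(10.0)  # hypothetical angle between d_0 and q
# Sample x so that both x and x + theta stay strictly inside (0, pi/2).
xs = np.linspace(0.01, np.pi / 2 - theta - 0.01, 200)

# Is the derivative negative at every grid point?
print(bool(np.all(mmr_gap_derivative(xs, theta) < 0)))
# Is f strictly decreasing from one grid point to the next?
print(bool(np.all(np.diff(mmr_gap(xs, theta)) < 0)))
```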

The real challenge in vector retrieval emerges when $\lambda\neq 0.5$ . The selection among candidate vectors $d_{1}$ , $d_{2}$ , and $d_{3}$ hinges critically on both $\lambda$ and $\theta$ , complicating the determination of the most appropriate candidate. This dependency means that different query vectors and the distribution of initial candidate vectors require varying $\lambda$ values to achieve optimal performance. Consequently, it is impractical to predict the value of $\lambda$ in advance or to ascertain a precise direction for optimization. This issue becomes even more pronounced in higher-dimensional vector spaces, where the perturbations induced by changing $\lambda$ complicate the identification of an optimal adjustment direction. This inherent complexity underscores the need for adaptive retrieval strategies that dynamically adjust $\lambda$ based on the characteristics of the query and candidate vector distributions.

3 Vectors retrieval with similarity and diversity

3.1 Problem definition and complexity analysis

To address the problem of selecting a subset of vectors from a set of candidate vectors that satisfy both similarity and diversity requirements, we refer to the MMR algorithm and several LLM-based algorithms, typically considering the following premises: Firstly, the candidate vectors are identified from the entire set of vectors ($size=N$) using similarity metrics, resulting in a subset of vectors ($size=n$). Consequently, this set of candidate vectors inherently exhibits a relatively high degree of similarity to the query vector. Secondly, within these $n$ candidate vectors, the vector most similar to the query vector is typically selected first, as is the case with the MMR algorithm and others. This approach is favored because, in applications such as in-context learning with LLMs, examples with the highest similarity to the query are generally the most helpful.

As previously mentioned, while algorithms like MMR are widely applied in practice, these studies often lack a robust and reliable theoretical model. In other words, many approaches employ heuristic strategies or machine learning methods to arrive at a solution without providing a rigorous formal description and analysis of similarity and diversity from a theoretical perspective. Therefore, based on the aforementioned premises, we propose using the sum vector to characterize both similarity and diversity in vector retrieval. The definition of the sum vector is as follows:

Definition 1.

The Sum Vector: Given $k$ vectors $d_{1},d_{2},...,d_{k}$ , the sum vector $d$ is the sum of these $k$ vectors.

Specifically, we aim to maximize the similarity between the sum vector of the selected $k$ vectors and the query vector. On one hand, maximizing the similarity of the sum vector to the query vector imposes a similarity constraint. On the other hand, from a geometric perspective, ensuring the sum vector is similar to the query vector means that the selected vectors approach the query vector from different directions, thus imposing a diversity constraint. Therefore, using the sum vector to characterize similarity and diversity allows us to model complex semantic similarity and diversity through simple vector addition operations. Next, we define the problem of vectors retrieval as follows:

Definition 2.

The problem of Vectors Retrieval with Similarity and Diversity (VRSD): Given a query vector $q$ and a set of candidate vectors $R=\{d_{0},d_{1},...,d_{n-1}\}$ (where $d_{0}$ is the vector with the highest similarity to the query vector $q$), $d_{0}$ is selected first because of its highest similarity. The problem is then to select $k-1$ vectors ($d^{\prime}_{1},d^{\prime}_{2},...,d^{\prime}_{k-1}$) from the remaining vectors such that the cosine similarity between the sum vector $d=d_{0}+d^{\prime}_{1}+d^{\prime}_{2}+...+d^{\prime}_{k-1}$ and $q$ is maximized.
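To make the objective in Definition 2 concrete, the following sketch solves small instances exactly by enumerating every choice of $k-1$ additional vectors. It is only an illustration: the names `cosine` and `vrsd_exact` are hypothetical, the running time grows combinatorially with $n$, and sum vectors are assumed to be non-zero.

```python
from itertools import combinations

import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors (assumed non-zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def vrsd_exact(query, candidates, k):
    """Exhaustive search for Definition 2; returns (indices, best cosine)."""
    sims = [cosine(d, query) for d in candidates]
    first = int(np.argmax(sims))  # d_0: the most similar candidate, fixed first
    rest = [i for i in range(len(candidates)) if i != first]
    best_subset, best_cos = None, -np.inf
    for combo in combinations(rest, k - 1):
        total = np.array(candidates[first], dtype=float)
        for i in combo:
            total = total + candidates[i]
        c = cosine(total, query)
        if c > best_cos:
            best_subset, best_cos = [first, *combo], c
    return best_subset, best_cos
```

Because the search visits every subset of size $k-1$, it is practical only for very small candidate sets, which is consistent with the hardness result discussed in this section.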

The vector $d_{0}$, characterized by its maximal similarity to the query vector $q$, establishes an initial constraint on similarity. The ensuing optimization objective strives to maximize the cosine similarity between the sum vector of all selected vectors and $q$. This process necessitates the selection of vectors that not only converge towards $q$ from diverse dimensions but also exhibit significant diversity and complementarity. However, upon further examination of the above problem, we find that it is NP-complete. Below, we provide a theoretical proof. Since the vector $d_{0}$, with the highest similarity, is initially selected, the subsequent selection of $k-1$ vectors must have the maximum cosine similarity with $q-d_{0}$. That is, maximizing the similarity between the sum vector $d=d_{0}+d^{\prime}_{1}+d^{\prime}_{2}+...+d^{\prime}_{k-1}$ (where $d^{\prime}_{1},d^{\prime}_{2},...,d^{\prime}_{k-1}$ represent the $k-1$ vectors selected subsequently) and $q$ is equivalent to maximizing the similarity between $d^{\prime}=d^{\prime}_{1}+d^{\prime}_{2}+...+d^{\prime}_{k-1}$ and $q-d_{0}$. To this end, we define a decision problem, namely:

Definition 3.

The decision problem of vectors retrieval: Given a set of candidate vectors $R$ and a query vector $q$ , can $k$ vectors be selected from $R$ such that the cosine similarity between the sum vector of these $k$ vectors and the query vector $q$ equals 1? We denote instances of this vectors retrieval problem as $(R,q,k)$ .

Next, we prove that this decision problem is NP-complete. To keep the proof concise, we further restrict the components of vectors to integers. The proof strategy is to reduce the subset sum problem [7] to this decision problem.

Definition 4.

The subset sum problem: Given an integer set $T$ and another integer $t$ , does there exist a non-empty subset whose sum of elements equals $t$ ? We denote instances of the subset sum problem as $(T,t)$ .

For the convenience of proof, we also need to define a modified version of the subset sum problem, called the $k$ -subset sum problem.

Definition 5.

$k$ -subset sum problem: Given an integer set $T$ and another integer $t$ , does there exist a non-empty subset of size $k$ (i.e., the cardinality of the subset is $k$ ), whose sum of elements equals $t$ ? We denote instances of the $k$ -subset sum problem as $(T,t,k)$ .

Lemma 1.

The $k$ -subset sum problem is NP-complete.

Proof.

We reduce the subset sum problem (Def.4) to the $k$-subset sum problem (Def.5).

1. Clearly, the $k$ -subset sum problem is polynomial-time verifiable.

2. Reducing the subset sum problem to the $k$ -subset sum problem.

For any instance of the subset sum problem $(T,t)$ , we can transform it into $|T|$ instances of the $k$ -subset sum problem, i.e., $(T,t,1),(T,t,2),\ldots,(T,t,|T|)$ . If any of these $|T|$ instances of the $k$ -subset sum problem has a yes answer, then the answer to the subset sum problem is yes. If all answers to these $|T|$ instances of the $k$ -subset sum problem are no, then the answer to the subset sum problem is also no. Therefore, if the $k$ -subset sum problem can be solved in polynomial time, then the subset sum problem can also be solved in polynomial time. Hence, the $k$ -subset sum problem is NP-complete. ∎

Now it is time to prove the NP-completeness of the decision problem of vectors retrieval.

Theorem 1.

The decision problem of vectors retrieval is NP-complete.

Proof.

We reduce the $k$-subset sum problem (Def.5) to the decision problem of vectors retrieval (Def.3).

1. The answer to vectors retrieval is polynomial-time verifiable. If the answer provides $k$ vectors, we can simply add these $k$ vectors and then calculate whether the cosine similarity between the sum vector and the query vector $q$ equals 1. This verification can be done in polynomial time.

2. Reducing the $k$ -subset sum problem to the decision problem of vectors retrieval.

For any instance of the $k$ -subset sum problem $(T,t,k)$ , let $T=\{t_{1},t_{2},\ldots,t_{n}\}$ . We construct the set of vectors $R$ and the query vector $q$ as Eq.4:

$$R=\{[t_{1},1],[t_{2},1],\ldots,[t_{n},1]\},\quad q=[t,k] \tag{4}$$

The decision problem of vectors retrieval $(R,q,k)$ asks whether there exist $k$ vectors such that the sum vector (denoted as $d$) of these vectors and the query vector $q$ have a cosine similarity of 1. According to the definition of cosine similarity, $cos\_similarity=\frac{d\cdot q}{|d|\cdot|q|}$. The cosine similarity between $d$ and $q$ equals 1 if and only if $d=\alpha q$ for some constant $\alpha>0$. Therefore, if vectors retrieval provides an affirmative answer $d=\alpha q$, we obtain Eq.5:

$$d=[t^{\prime}_{1},1]+[t^{\prime}_{2},1]+\ldots+[t^{\prime}_{k},1]=\alpha[t,k]\Rightarrow[(t^{\prime}_{1}+\ldots+t^{\prime}_{k}),k]=\alpha[t,k]. \tag{5}$$

Here $[t^{\prime}_{1},1],\ldots,[t^{\prime}_{k},1]$ are the selected $k$ vectors. Comparing the second components gives $k=\alpha k$, so $\alpha=1$; comparing the first components then gives $t^{\prime}_{1}+\ldots+t^{\prime}_{k}=t$. Thus, this provides an affirmative answer to the $k$-subset sum problem instance $(T,t,k)$. Conversely, if the instance $(T,t,k)$ has a solution, the corresponding $k$ vectors sum to $[t,k]=q$, whose cosine similarity with $q$ is 1; hence, if vectors retrieval provides a negative answer, then the answer to the $k$-subset sum problem is also negative. The above reduction can clearly be completed in polynomial time. Therefore, the decision problem of vectors retrieval is NP-complete. ∎
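The construction used in the proof can be illustrated with a short sketch. The function names below are hypothetical, and the decision procedure is a brute-force search included only to show that the two instances agree on small examples. For two-dimensional integer vectors, a cosine similarity of exactly 1 is tested without floating point: the vectors must be parallel (zero cross product) and point the same way (positive dot product).

```python
from itertools import combinations


def reduce_k_subset_sum(T, t, k):
    """Map a k-subset sum instance (T, t, k) to a retrieval instance (R, q, k)."""
    R = [(ti, 1) for ti in T]  # each integer t_i becomes the vector [t_i, 1]
    q = (t, k)                 # the query vector is [t, k]
    return R, q, k


def cosine_is_one(d, q):
    """Exact test for cos(d, q) == 1 on 2-D integer vectors."""
    cross = d[0] * q[1] - d[1] * q[0]
    dot = d[0] * q[0] + d[1] * q[1]
    return cross == 0 and dot > 0


def retrieval_decision(R, q, k):
    """Brute-force decision procedure (exponential time; illustration only)."""
    for combo in combinations(R, k):
        d = (sum(v[0] for v in combo), sum(v[1] for v in combo))
        if cosine_is_one(d, q):
            return True
    return False
```

For example, with $T=\{3,34,4,12,5,2\}$, $t=9$, and $k=2$, the subset $\{4,5\}$ sums to 9, so the vectors $[4,1]$ and $[5,1]$ sum to $[9,2]=q$ and the retrieval instance is answered affirmatively.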

3.2 Heuristic algorithm for vectors retrieval

Since the vectors retrieval problem $(R,q,k)$ is NP-complete, heuristic methods are needed to derive feasible solutions. Specifically, given a set of candidate vectors with high similarity, the objective is to select $k$ vectors that maximize the cosine similarity between the sum vector of the $k$ selected vectors and the query vector. We propose a new algorithm denoted as Vectors Retrieval with Similarity and Diversity (VRSD). VRSD initially selects the vector most similar to the query vector and then iteratively selects additional vectors from the remaining candidates. In each iteration, it chooses the candidate whose addition to the cumulative sum of all previously selected vectors maximizes the cosine similarity between that sum and the query vector, continuing this process until $k$ vectors are chosen. Further details about the VRSD algorithm can be found in Algorithm 1 and Figure 2.

Algorithm 1 Vectors Retrieval with Similarity and Diversity (VRSD)

Input: Candidate vector set $R=\{d_{0},d_{1},\ldots,d_{n-1}\}$, query vector $q$, where $d_{0}$ is the vector from all $d_{i}$ that has the highest cosine similarity with $q$, and constant $k$

Output: $k$ vectors including $d_{0}$, such that the cosine similarity between the sum vector of these $k$ vectors and $q$ is maximized.

$S=\{d_{0}\}$
$R=R\setminus\{d_{0}\}$
for $i=1$ to $k-1$ do
    $s=\sum S$  $\triangleright$ Sum of all vectors in $S$
    $\text{maxCos}=0$
    $p=\text{null}$  $\triangleright$ Initialize $p$ to a null vector or equivalent
    for $v$ in $R$ do
        $t=s+v$  $\triangleright$ Temporary vector for comparison
        if $\cos(t,q)>\text{maxCos}$ then
            $\text{maxCos}=\cos(t,q)$
            $p=v$
        end if
    end for
    $S=S\cup\{p\}$  $\triangleright$ Add $p$ to the set $S$
    $R=R\setminus\{p\}$  $\triangleright$ Remove $p$ from $R$
end for
return $S$
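A runnable sketch of Algorithm 1 in Python with NumPy is given below. The names `cosine` and `vrsd_select` are hypothetical, and the sketch deviates from the pseudocode in three small ways that are assumptions of this illustration: the running maximum starts at negative infinity rather than 0, so a vector is always chosen even when every cosine is non-positive; the sum of $S$ is maintained incrementally instead of being recomputed; and a zero sum vector is scored as the worst case.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity; a zero vector is scored as -1.0 (a choice of this sketch)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else -1.0


def vrsd_select(query, candidates, k):
    """Greedy selection in the spirit of Algorithm 1; returns indices into `candidates`."""
    sims = [cosine(d, query) for d in candidates]
    first = int(np.argmax(sims))  # d_0: the candidate most similar to q
    selected = [first]
    remaining = [i for i in range(len(candidates)) if i != first]
    running_sum = np.array(candidates[first], dtype=float)  # s = sum of S
    while remaining and len(selected) < k:
        best_idx, best_cos = None, -np.inf
        for i in remaining:
            # t = s + v: the sum vector if candidate i were added
            c = cosine(running_sum + candidates[i], query)
            if c > best_cos:
                best_idx, best_cos = i, c
        selected.append(best_idx)
        remaining.remove(best_idx)
        running_sum = running_sum + candidates[best_idx]
    return selected
```

On a toy two-dimensional example with the query along the x-axis and unit candidates at 10°, 20°, 30°, and -25°, the routine first takes the 10° vector and then the -25° vector, because their sum points closer to the query than any same-side pair does.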

3.3 Time complexity analysis of VRSD algorithm

As depicted in Algorithm 1, the time complexity of the VRSD algorithm is $O(k\times|R|)=O(k\times n)$ cosine similarity evaluations, in addition to the initial step of selecting the $n$ candidate vectors from the entire set of vectors (size = $N$) based on similarity. Given that $N\gg n>k$, the computational load of the subsequent steps in Algorithm 1 is minimal in comparison with that initial step. The MMR algorithm, which also selects $k$ vectors from $|R|$ candidates, requires two rounds of similarity calculations in each iteration, as depicted in Eq.1: one for each candidate vector against the query vector and one against the set of already selected vectors $S$. Thus, the complexity for MMR becomes $O(k\times|R|\times|S|)$, which is bounded above by $O(k\times|R|^{2})=O(k\times n^{2})$ since $|S|<|R|$, indicating a marginally higher computational demand compared to VRSD.

4 Experiments

4.1 Experiments detail

We evaluated the VRSD algorithm on three publicly available datasets of different categories and compared it with the MMR algorithm at $\lambda$ values of 0, 0.5, and 1:

• ARC-DA [2]: A dataset of direct-answer science questions derived from the ARC multiple-choice questions. Each example contains a question and multiple answers.

• OpenBookQA [21]: A dataset of multiple-choice science questions, which probe the understanding of science facts and the application of these facts to novel situations. Each example contains a question, multiple choices, and an answer.

• Puzzle [18]: A question answering dataset whose questions are lateral thinking puzzles. Each example contains a question and an answer.

For each item in each dataset, we concatenate the question part with its corresponding answer, subsequently selecting 20% of these concatenated items to form the test set, wherein the question parts are isolated. Items designated for the test set are excluded from the original dataset for subsequent experiments, where four examples are retrieved for each test question. Our evaluation focuses on two primary aspects: retrieval quality and answer quality. Retrieval quality is assessed by aggregating the four vectors retrieved using either VRSD or MMR into a sum vector, denoted as $d_{\text{VRSD}}$ or $d_{\text{MMR}}$ respectively, which reflects the vectorial direction from which the examples approach the query vector $q$. We compute the cosine similarity between the sum vectors and the query vector as $\cos(d_{\text{VRSD}},q)$ and $\cos(d_{\text{MMR}},q)$. The comparison includes counting instances where $\cos(d_{\text{VRSD}},q)$ exceeds $\cos(d_{\text{MMR}},q)$, termed the VRSD win rate, and calculating the maximum difference between these cosine similarities over all queries in each test set. Additionally, we compute the mean cosine similarity values for these vectors. Such metrics are instrumental in elucidating the algorithms’ capacity to balance similarity and diversity. For answer quality assessment, we reconstruct prompts by concatenating the original sentences corresponding to the four retrieved vectors with the initial question, and input these into LLMs. The responses generated by the open-source LLM open-mistral-7b and the closed-source LLM gpt-3.5-turbo are then compared with standard answers using the ROUGE-L metric for the ARC-DA and Puzzle datasets, and Accuracy for OpenBookQA. This evaluation helps us ascertain the efficacy of the retrieved examples in providing accurate answers.
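The retrieval-quality metrics described above can be computed from the per-query cosine similarities as in the following sketch. The function name `retrieval_quality` and the dictionary keys are hypothetical; the sketch assumes that a tie does not count as a VRSD win and that the maximum difference is the signed quantity $\cos(d_{\text{VRSD}},q)-\cos(d_{\text{MMR}},q)$, neither of which is specified beyond the description in the text.

```python
import numpy as np


def retrieval_quality(cos_vrsd, cos_mmr):
    """Summarize per-query cosines of the VRSD and MMR sum vectors with q."""
    cos_vrsd = np.asarray(cos_vrsd, dtype=float)
    cos_mmr = np.asarray(cos_mmr, dtype=float)
    diff = cos_vrsd - cos_mmr
    return {
        # Fraction of queries where VRSD is strictly better (ties are not wins).
        "vrsd_win_rate": float(np.mean(diff > 0)),
        # Largest signed gap in favor of VRSD over all queries.
        "max_diff": float(np.max(diff)),
        "mean_vrsd": float(np.mean(cos_vrsd)),
        "mean_mmr": float(np.mean(cos_mmr)),
    }
```

Calling `retrieval_quality` once per dataset and per $\lambda$ setting would yield one row of a table with win rate, maximum difference, and mean columns.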

4.2 Experiments Results

Table 1: Algorithms’ performance on different datasets. Max-diff displays the maximum difference between $\cos(d_{\text{VRSD}},q)$ and $\cos(d_{\text{MMR}},q)$ across all queries in the test set. Mean displays the mean value of $\cos(d_{\text{VRSD}},q)$ and $\cos(d_{\text{MMR}},q)$ in the test set. Boldface indicates the algorithm with the highest mean value for each dataset.

Dataset	Algorithm	VRSD win rate	Max-diff	Mean
ARC-DA	VRSD	-	-	0.740
	MMR( $\lambda$ =0)	97.7%	0.160	0.696
	MMR( $\lambda$ =0.5)	92.5%	0.108	0.720
	MMR( $\lambda$ =1)	95.3%	0.158	0.710
OpenBookQA	VRSD	-	-	0.833
	MMR( $\lambda$ =0)	97.3%	0.135	0.809
	MMR( $\lambda$ =0.5)	92.6%	0.101	0.822
	MMR( $\lambda$ =1)	96.8%	0.102	0.812
Puzzle	VRSD	-	-	0.592
	MMR( $\lambda$ =0)	100%	0.161	0.537
	MMR( $\lambda$ =0.5)	90%	0.052	0.576
	MMR( $\lambda$ =1)	100%	0.132	0.577

Table 2: Algorithm’s performance under different LLMs. Boldface indicates the algorithm with the highest score under different models.

	ARC-DA (ROUGE-L)		OpenBookQA (Accuracy)		Puzzle (ROUGE-L)	
Algorithm	gpt-3.5-turbo	open-mistral-7b	gpt-3.5-turbo	open-mistral-7b	gpt-3.5-turbo	open-mistral-7b
VRSD	0.371	0.233	0.789	0.534	0.213	0.198
MMR( $\lambda$ =0)	0.355	0.216	0.767	0.508	0.206	0.198
MMR( $\lambda$ =0.5)	0.364	0.218	0.772	0.507	0.202	0.188
MMR( $\lambda$ =1)	0.347	0.222	0.780	0.510	0.188	0.186

Table 1 presents the results of retrieval quality, indicating that the win rate of VRSD over MMR is at least 90% across all datasets and $\lambda$ settings. This outcome suggests that VRSD not only retrieves examples more relevant to the original query but also better satisfies the diversity requirements. The maximum difference in cosine similarity is at least 0.05 in every setting, with the smallest value, 0.052, occurring on Puzzle at $\lambda=0.5$. Additionally, the mean value of cosine similarity between the sum vector and the query vector is consistently higher for VRSD than for MMR. VRSD generally demonstrates superior retrieval efficacy from diverse perspectives. Table 2 shows the answer quality results, with VRSD achieving the highest score in every column (tied with MMR at $\lambda=0$ for open-mistral-7b on Puzzle), thus indicating that the examples retrieved by VRSD enhance the LLM’s understanding of the query and facilitate the generation of the desired answers. The superior performance of the closed-source LLM over the open-source LLM can be attributed to the former’s enhanced understanding and reasoning capabilities. Overall, VRSD exhibits superior performance to MMR in both example retrieval and query answering, fulfilling the requirements of both similarity and diversity without the need for parameter adjustments, thereby highlighting the pronounced advantages of VRSD.

It is worth noting that the mean values of $\cos(d_{\text{VRSD}},q)$ and $\cos(d_{\text{MMR}},q)$ on the Puzzle dataset are lower than on the other datasets. In addition, the gap between the open-source and closed-source models is less pronounced there, and VRSD reaches a 100% win rate against MMR for $\lambda$ = 0 and $\lambda$ = 1. These effects may be attributable to the small size of the dataset. Nevertheless, given that lateral-puzzle questions require LLMs to comprehend the query and derive insights from various angles in the retrieved examples, the Puzzle dataset remains a valuable tool for assessing our algorithm.

5 Conclusion

Motivated by the difficulty of tuning the parameter $\lambda$ in MMR, this work aims to improve how LLMs retrieve similar and diverse examples. We introduce a novel approach that characterizes both constraints through the relationship between the sum vector and the query vector: the proximity of the two vectors captures similarity, while requiring the individual vectors within the sum to align divergently with the query vector captures diversity. We show that selecting $k$ vectors whose sum vector maximally aligns with the query vector is NP-complete, and we propose the heuristic VRSD algorithm, which outperforms MMR across various metrics and is also more efficient, thereby enabling the retrieval of higher-quality examples. Our work underscores the inherent difficulty of pursuing similarity and diversity simultaneously in vector retrieval and establishes a theoretical foundation for further research.

The proposed combinatorial optimization problem is of independent interest from both theoretical and practical standpoints, so further exploration and refinement of the heuristic algorithm is a valuable avenue for future research. We aim to explore two specific aspects: (1) heuristic algorithms with lower time complexity that retrieve high-quality examples more efficiently, and (2) adaptable heuristic algorithms that remain robust regardless of dataset size or problem type. We believe that as retrieval algorithms that emphasize both similarity and diversity continue to improve, the scope of tasks that LLM-based agents can manage will expand, yielding increasingly satisfactory results.
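The sum-vector objective summarized above can be illustrated with a short greedy sketch in Python. It repeatedly adds the candidate that keeps the running sum vector best aligned with the query, so the first pick is simply the candidate most similar to the query. This is a simplified illustration written under that assumption and is not claimed to reproduce the VRSD algorithm exactly as specified in this paper; the function names `cosine` and `greedy_sum_vector_selection` are hypothetical.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, returning -1.0 if either vector has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else -1.0


def greedy_sum_vector_selection(candidates, query, k):
    """Greedily choose k candidate indices whose sum vector aligns with the query.

    At each step, the candidate that maximizes cos(current_sum + candidate, query)
    is added. Returns the chosen indices and the final cosine similarity.
    """
    candidates = np.asarray(candidates, dtype=float)
    query = np.asarray(query, dtype=float)
    remaining = list(range(len(candidates)))
    chosen = []
    total = np.zeros_like(query)
    for _ in range(min(k, len(remaining))):
        # Score every remaining candidate by the alignment of the updated sum.
        best = max(remaining, key=lambda i: cosine(total + candidates[i], query))
        chosen.append(best)
        remaining.remove(best)
        total = total + candidates[best]
    return chosen, cosine(total, query)


# Illustrative data: two near-duplicate vectors and one complementary vector.
example_candidates = [[1.0, 0.1], [1.0, 0.2], [0.1, 1.0]]
example_query = [1.0, 1.0]
picked, score = greedy_sum_vector_selection(example_candidates, example_query, k=2)
```

In this toy example the second pick is the complementary vector rather than the near-duplicate of the first pick, because only the complementary vector pulls the sum toward the query direction. No parameter comparable to $\lambda$ is needed.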

References

[1] Uri Alon, Frank Xu, Junxian He, Sudipta Sengupta, Dan Roth, and Graham Neubig. Neuro-symbolic language modeling with automaton-augmented retrieval. In International Conference on Machine Learning, pages 468–485. PMLR, 2022.
[2] Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. Think you have solved direct-answer question answering? Try ARC-DA, the direct-answer AI2 reasoning challenge. arXiv preprint arXiv:2102.03315, 2021.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[4] Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336, 1998.
[5] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, pages 1870–1879. Association for Computational Linguistics (ACL), 2017.
[6] Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Weiwei Deng, and Qi Zhang. UPRISE: Universal prompt retrieval for improving zero-shot evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12318–12337, 2023.
[7] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[9] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020.
[10] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In EACL 2021-16th Conference of the European Chapter of the Association for Computational Linguistics, pages 874–880. Association for Computational Linguistics, 2021.
[11] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020.
[12] Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Nearest neighbor machine translation. In International Conference on Learning Representations, 2020.
[13] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2019.
[14] Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48, 2020.
[15] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, 2019.
[16] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[17] Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, 2022.
[18] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2023.
[19] Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbrasaite, and Vincent Y Zhao. Dr.ICL: Demonstration-retrieved in-context learning. arXiv preprint arXiv:2305.14128, 2023.
[20] Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 2021.
[21] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018.
[22] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013.
[23] Ercong Nie, Sheng Liang, Helmut Schmid, and Hinrich Schütze. Cross-lingual retrieval augmented prompt for low-resource languages. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
[24] Panupong Pasupat, Yuan Zhang, and Kelvin Guu. Controllable semantic parsing via retrieval augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7683–7698, 2021.
[25] Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5835–5847, 2021.
[26] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
[27] Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. RocketQAv2: A joint training method for dense passage retrieval and passage re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2825–2835, 2021.
[28] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
[29] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, 2022.
[30] Richard Shin, Christopher Lin, Sam Thomson, Charles Chen Jr, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7699–7715, 2021.
[31] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
[32] Liang Wang, Nan Yang, and Furu Wei. Learning to retrieve in-context examples for large language models. arXiv preprint arXiv:2307.07164, 2023.