VRSD: Rethinking Similarity and Diversity for Retrieval in Large Language Models

Hang Gao
Department of Computer Science
Rutgers University
New Brunswick, NJ 08901
[email protected]
Yongfeng Zhang
Department of Computer Science
Rutgers University
New Brunswick, NJ 08901
[email protected]
Abstract

Vector retrieval algorithms are vital for semantic queries in the evolving landscape of Large Language Models (LLMs). Retrieving vectors that simultaneously meet criteria for both similarity and diversity significantly enhances the capabilities of LLM-based agents. Despite the widespread use of Maximal Marginal Relevance (MMR) in retrieval scenarios with relevance and diversity requirements, fluctuations caused by variations in the parameter $\lambda$ within MMR complicate the determination of the optimization trajectory in vector spaces, thus obscuring the direction of enhancement. Moreover, there is a lack of a robust theoretical analysis for the constraints of similarity and diversity in retrieval processes. This paper introduces a novel approach to characterizing both constraints through the relationship between the sum vector and the query vector. The proximity of these vectors addresses the similarity constraint, while requiring the individual vectors within the sum vector to align with the query vector from different directions satisfies the diversity constraint. We also formulate a new combinatorial optimization problem: selecting $k$ vectors from a set of candidates such that their sum vector maximally aligns with the query vector, a problem we demonstrate to be NP-complete. This establishes the profound difficulty of pursuing similarity and diversity simultaneously in vector retrieval and lays a theoretical groundwork for further research. Additionally, we present the heuristic algorithm Vectors Retrieval with Similarity and Diversity (VRSD), which not only has a definitive optimization goal and eschews the need for preset parameters but also offers a modest reduction in time complexity compared to MMR. Empirical validation further confirms that VRSD significantly surpasses MMR across various datasets.

1 Introduction

Vector retrieval algorithms are crucial for semantic queries and have become increasingly integral to the deployment of Large Language Models (LLMs). Effective interaction with LLMs frequently necessitates the provision of relevant or similar examples to elicit enhanced responses [17]. The introduction of Retrieval Augmented Generation (RAG) has notably advanced the capabilities in knowledge-intensive tasks [16], underscoring the growing importance of retrieval methods. Empirical evidence suggests that employing the BM25 algorithm to select examples from the training set markedly improves LLM performance over random selection [17, 19]. Moreover, leveraging existing text embedding models for example retrieval often surpasses BM25, particularly in specific contexts [26, 31]. The advent of Dense Retrieval, which employs dense vectors for semantic matching in latent spaces [5, 15], represents an evolution over traditional sparse retrieval methods like BM25 by utilizing the robust modeling capabilities of pre-trained language models to learn relevance functions [8]. Innovations such as the dual encoder framework [11] and dynamic listwise distillation [27] have further refined the effectiveness of dense retrieval techniques. Subsequent enhancements in semantic parsing and in-context learning [24], facilitated by feedback from LLMs [29], have enabled more precise example selection and improved answer accuracy. Despite ongoing advancements in retrieval methods, the broadening application scope of LLMs necessitates retrieval approaches that balance relevance with diversity; specifically, a relevance-focused diversity rather than an unrestricted diversity. Additionally, the RAG framework’s ability to augment the LLMs’ external data access also underscores the need for simple yet efficient algorithms that can streamline the retrieval process.

Considering the balance between similarity and diversity, Maximal Marginal Relevance (MMR) [4] is an effective heuristic algorithm that has been widely applied in vector retrieval practice. To achieve an optimal balance, MMR incorporates a parameter, $\lambda$, whose value adjusts the relative weight of relevance and diversity. Nevertheless, this method is not always effective; different scenarios require different values of $\lambda$, which cannot be known in advance. Recent research [29, 32] has explored using LLMs to enhance retrieval results and also suggests selecting a set of examples from a combinatorial optimization perspective, rather than selecting examples one by one, as in-context examples can influence each other. In light of this, we propose using the sum vector to characterize both similarity and diversity in vector retrieval. Simply put, this involves maximizing the similarity between the sum vector of the selected vectors and the query vector, which imposes a similarity constraint. At the same time, from a geometric perspective, the requirement for the sum vector to be similar to the query vector means that the selected vectors approach the query vector from different directions, thus imposing a diversity constraint. Additionally, the idea of considering the similarity between the sum vector and the query vector is analogous to the famous finding in word2vec (king - man + woman = queen) [22], as both involve obtaining complex semantic similarities through simple vector arithmetic. Therefore, using the sum vector to characterize similarity and diversity constraints not only considers similarity while reducing redundancy but also enhances the complementarity among retrieval results.

Consequently, we define a new combinatorial optimization problem: selecting several vectors from a set of candidate vectors such that the similarity between the sum vector of the selected vectors and the query vector is maximized. However, contrary to its intuitive and straightforward appearance, this is a highly challenging problem. We prove that this problem is NP-complete by reducing the subset sum problem to it, revealing theoretically that simultaneously pursuing similarity and diversity in vector retrieval is extremely difficult. This novel combinatorial optimization problem, of independent theoretical interest, establishes a solid theoretical foundation for future research. Subsequently, we present a heuristic algorithm to solve the proposed problem. This algorithm has a clear optimization objective, requires no preset parameters, and has a slightly lower time complexity than the MMR algorithm. Our experimental studies also demonstrate that the new algorithm significantly outperforms the MMR algorithm across various datasets. Additionally, given that similarity measures in vector retrieval typically include cosine similarity, inner product distance, and Euclidean distance, and considering that vectors in LLM applications are usually normalized, the results obtained using these measures in vector retrieval are consistent. Consequently, the discussion on vector similarity in this paper uses cosine similarity. In summary, our work makes the following contributions:

  • We propose using the similarity between the sum vector and the query vector to characterize similarity and diversity constraints in vector retrieval. We formulate a novel optimization problem where we seek to select several vectors from a set of candidates such that the similarity between the sum vector of the selected vectors and the query vector is maximized.

  • We demonstrate that our optimization problem is NP-complete, theoretically revealing the extreme difficulty of simultaneously pursuing similarity and diversity in vector retrieval.

  • For the NP-complete combinatorial optimization problem we propose, we provide a heuristic algorithm, VRSD. We experimentally study our algorithm on several datasets, and our results show that our algorithm significantly outperforms the classic MMR algorithm.

The remainder of this paper is organized as follows: Section 2 provides a review of related work in the field. Section 3 defines the problem under investigation, examines its computational complexity, and proposes a heuristic algorithm for addressing it. Section 4 presents an experimental evaluation of our heuristic algorithm. Finally, Section 5 concludes the paper and discusses potential directions for future research.

2 Related Work

2.1 Retrieval and Large Language Model

Retrieval methods in Large Language Models (LLMs) have gained traction, particularly due to their pivotal role in open-domain question answering, evidenced by seminal contributions in the field [5, 9, 14, 25, 10]. The introduction of Retrieval Augmented Generation (RAG) further underscored the significance of these methods across knowledge-intensive tasks [16], notably enhancing generation capabilities for open-domain queries [20]. This spurred the adoption of techniques such as K-Nearest-Neighbor (KNN) in diverse applications, ranging from the customization of multilingual models in machine translation [12] to improving the prediction of rare patterns in LLMs [13, 1]. Continued advancements in retrieval techniques have focused on identifying highly informative examples to augment in-context learning, thereby enabling LLM-based systems to achieve significant performance improvements with minimal examples [3]. Early on, traditional sparse retrieval methods like BM25 [28]—an extension of TF-IDF—were utilized to refine in-context learning [17]. Subsequently, the integration of LLMs’ intrinsic capabilities [30] and Sentence-BERT (SBERT) [26] facilitated the retrieval of highly pertinent examples for prompt integration. The advent of dense retrievers signified a methodological enhancement in retrieval from a machine learning perspective, and the incorporation of feedback signals with contrastive learning has yielded more effective retrieval systems [29, 32]. Recent innovations like UPRISE [6] and PRAC [23] have further optimized the performance of in-context learning by retrieving demonstrations directly from training data. Despite these advances, most retrieval methods still treat each candidate independently, which can lead to suboptimal outcomes due to the interaction effects among in-context examples, resulting in a lack of diversity. 
Given the expanding applications of LLMs, diversity becomes increasingly crucial, as complex inference tasks often require undefined example types. Incorporating a diverse range of examples enriches LLMs’ learning processes, facilitating more innovative and robust responses, especially for complex open-ended questions. Furthermore, a relevance-focused diversity retrieval method could help mitigate the impact of malicious information as LLMs continue to scale and assimilate more societal data. Even if some malicious information is inadvertently retrieved, the inclusion of diverse data within the same batch can enable LLMs to fully comprehend the query’s intent and provide accurate responses. Additionally, as LLMs are tasked with processing ever-greater volumes of information, the necessity for a straightforward and efficient retrieval method intensifies, minimizing the dependency on resource-intensive preparatory tasks.

2.2 Maximal Marginal Relevance

To enhance retrieval processes by accounting for both relevance and diversity, the Maximal Marginal Relevance (MMR) algorithm was introduced [4]. MMR addresses the balance between relevance and diversity in traditional retrieval and summarization methods by employing "marginal relevance" as an evaluation metric. This metric is defined as a linear combination of independently measured relevance and novelty, formulated as Eq. 1:

$$\text{MMR}=\arg\max_{d_i\in R\setminus S}\left[\lambda\cdot\text{Sim}_1(d_i,q)-(1-\lambda)\cdot\max_{d_j\in S}\text{Sim}_2(d_i,d_j)\right]. \tag{1}$$
Figure 1: An analysis of the Maximal Marginal Relevance. (a) The candidate vectors are located on different sides of the query vector. (b) The candidate vectors are located on the same side of the query vector.
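The greedy selection rule in Eq. 1 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `cosine`, `unit`, and `mmr_select`, the 2-D example angles, and the convention that the maximum over an empty $S$ counts as 0 are all assumptions made here, and both $\text{Sim}_1$ and $\text{Sim}_2$ are taken to be cosine similarity.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(deg):
    """2-D unit vector at the given angle in degrees (illustration helper)."""
    r = np.deg2rad(deg)
    return np.array([np.cos(r), np.sin(r)])


def mmr_select(query, candidates, k, lam=0.5):
    """Greedy MMR per Eq. 1: return candidate indices in selection order."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            relevance = cosine(candidates[i], query)
            # Assumption: the max over an empty S counts as 0.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1.0 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


q = unit(0.0)
# Hypothetical layout: d_0, d_1, d_2 on one side of q, d_3 on the other.
cands = [unit(10.0), unit(20.0), unit(30.0), unit(-25.0)]

print(mmr_select(q, cands, k=3, lam=0.5))  # [0, 3, 1]
print(mmr_select(q, cands, k=3, lam=1.0))  # [0, 1, 3]
```

With these assumed angles, $\lambda=0.5$ picks the opposite-side vector second, while $\lambda=1$ falls back to a pure relevance ranking, which shows how strongly the outcome depends on $\lambda$.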

The challenge lies in selecting an appropriate $\lambda$ to achieve the desired balance between relevance and diversity, particularly in high-dimensional vector spaces where the impact of varying $\lambda$ is less predictable. This variability in $\lambda$ leads to fluctuations in retrieval results, resulting in unpredictable consequences, which can be illustrated by a simple example. Commonly, $\lambda$ is preset at a value of 0.5 in many MMR implementations, a choice that stems from the algorithm’s foundational design. It is important to note that at $\lambda=1$, the algorithm exclusively prioritizes relevance, while at $\lambda=0$, it focuses entirely on diversity. Let us examine the performance of the MMR algorithm at the typical midpoint setting of $\lambda=0.5$. For clarity and ease of comprehension, we model the retrieval process within a two-dimensional vector space, though the principles observed are equally applicable to more complex, higher-dimensional scenarios.

As illustrated in Figure 1(a), consider $q$ as the query vector and $d_0$ to $d_3$ as candidate vectors that surpass the relevance threshold, collectively represented as $R=\{d_0,d_1,d_2,d_3\}$, with $S$ initially empty. Under the MMR algorithm, $d_0$ is selected first because it has the highest relevance to $q$, measured by cosine similarity. Subsequently, $d_3$ is chosen over $d_1$, despite $d_1$ having a smaller angle with $q$ and thus greater direct relevance. The reason is that the similarity between $d_1$ and $d_0$ significantly exceeds that between $d_3$ and $d_0$, so the redundancy penalty on $d_1$ is larger and $d_3$ obtains the higher MMR value under Eq. 1.

However, as depicted in Figure 1(b), with $q$ serving as the query vector and $R=\{d_0,d_1,d_2,d_3\}$ representing the initial set of candidate vectors, $d_0$ is first selected due to its maximal relevance to $q$. The selection process using the MMR algorithm proceeds as follows: with $\lambda=0.5$, $S=\{d_0\}$, and $R\setminus S=\{d_1,d_2,d_3\}$, the formula can be written as Eq. 2:

$$\text{MMR}=\arg\max_{i=1,2,3}\left[0.5\cdot\left(\text{Sim}_1(d_i,q)-\text{Sim}_2(d_i,d_0)\right)\right]. \tag{2}$$

Given that $d_0$, $d_1$, $d_2$, and $d_3$ are positioned on the same side of $q$, and assuming both $\text{Sim}_1$ and $\text{Sim}_2$ denote cosine similarity, let $\theta$ represent the angle between $d_0$ and $q$, and let $x$ denote the angle between $d_i$ ($i=1,2,3$) and $d_0$, so that the angle between $d_i$ and $q$ is $x+\theta$. Thus, we obtain Eq. 3:

$$\text{MMR}=\arg\max_{i=1,2,3}\left[0.5\cdot\left(\cos(d_i,q)-\cos(d_i,d_0)\right)\right]=\arg\max_{i=1,2,3}\left[0.5\cdot\left(\cos(x+\theta)-\cos(x)\right)\right] \tag{3}$$

Consider the function $f(x)=\cos(x+\theta)-\cos(x)$, whose derivative is $f'(x)=-\sin(x+\theta)+\sin(x)$, and assume that $x$ and $x+\theta$ lie within $(0,\pi/2)$. Because the sine function is increasing on this interval, $\sin(x)<\sin(x+\theta)$ and consequently $f'(x)<0$, indicating that for vectors on the same side of $q$, their MMR values decrease as the angle with $q$ increases. Thus, following the selection of $d_0$, the subsequent choices are $d_1$, then $d_2$, and so on. This sequence suggests that relevance predominantly influences the selection outcome.
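The monotonicity argument can be checked numerically. The sketch below evaluates $f(x)=\cos(x+\theta)-\cos(x)$ on a grid and ranks three same-side candidates by their score from Eq. 3; the value of $\theta$ and the candidate angles are hypothetical values chosen only for illustration.

```python
import numpy as np


def f(x, theta):
    """Bracketed term of Eq. 3, without the constant factor 0.5."""
    return np.cos(x + theta) - np.cos(x)


theta = np.deg2rad(10.0)                      # assumed angle between d_0 and q
xs = np.deg2rad(np.array([5.0, 15.0, 30.0]))  # assumed angles of d_1, d_2, d_3 from d_0
scores = 0.5 * f(xs, theta)

# Check that f is strictly decreasing while x and x + theta stay in (0, pi/2).
grid = np.linspace(0.01, np.pi / 2 - theta - 0.01, 200)
is_decreasing = bool(np.all(np.diff(f(grid, theta)) < 0))

# Candidate numbers (1-based) sorted by descending MMR score.
order = [int(i) + 1 for i in np.argsort(-scores)]

print(is_decreasing)
print(order)
```

Under these assumptions the ranking follows the angular order of the candidates, matching the conclusion that relevance dominates when all candidates lie on the same side of $q$.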

The real challenge in vector retrieval emerges when $\lambda\neq 0.5$. The selection among candidate vectors $d_1$, $d_2$, and $d_3$ hinges critically on both $\lambda$ and $\theta$, complicating the determination of the most appropriate candidate. This dependency means that different query vectors and the distribution of initial candidate vectors require varying $\lambda$ values to achieve optimal performance. Consequently, it is impractical to predict the value of $\lambda$ in advance or to ascertain a precise direction for optimization. This issue becomes even more pronounced in higher-dimensional vector spaces, where the perturbations induced by changing $\lambda$ complicate the identification of an optimal adjustment direction. This inherent complexity underscores the need for adaptive retrieval strategies that dynamically adjust $\lambda$ based on the characteristics of the query and candidate vector distributions.

3 Vectors retrieval with similarity and diversity

3.1 Problem definition and complexity analysis

To address the problem of selecting a subset of vectors from a set of candidate vectors that satisfy both similarity and diversity requirements, we refer to the MMR algorithm and several LLM-based algorithms, typically considering the following premises: Firstly, the candidate vectors are identified from the entire set of vectors (of size $N$) using similarity metrics, resulting in a subset of vectors (of size $n$). Consequently, this set of candidate vectors inherently exhibits a relatively high degree of similarity to the query vector. Secondly, within these $n$ candidate vectors, the vector most similar to the query vector is typically selected first, as is the case with the MMR algorithm and others. This approach is favored because, in applications such as in-context learning with LLMs, examples with the highest similarity to the query are generally the most helpful.

As previously mentioned, while algorithms like MMR are widely applied in practice, these studies often lack a robust and reliable theoretical model. In other words, many approaches employ heuristic strategies or machine learning methods to arrive at a solution without providing a rigorous formal description and analysis of similarity and diversity from a theoretical perspective. Therefore, based on the aforementioned premises, we propose using the sum vector to characterize both similarity and diversity in vector retrieval. The definition of the sum vector is as follows:

Definition 1.

The Sum Vector: Given $k$ vectors $d_1,d_2,\ldots,d_k$, the sum vector $d$ is the sum of these $k$ vectors, i.e., $d=d_1+d_2+\ldots+d_k$.

Specifically, we aim to maximize the similarity between the sum vector of the selected $k$ vectors and the query vector. On one hand, maximizing the similarity of the sum vector to the query vector imposes a similarity constraint. On the other hand, from a geometric perspective, ensuring the sum vector is similar to the query vector means that the selected vectors approach the query vector from different directions, thus imposing a diversity constraint. Therefore, using the sum vector to characterize similarity and diversity allows us to model complex semantic similarity and diversity through simple vector addition operations. Next, we define the problem of vectors retrieval as follows:
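A small 2-D example may help make the geometric intuition concrete. The sketch below compares the sum vector of two candidates lying on the same side of the query with the sum vector of two candidates lying on opposite sides; the helper names and the angles are assumptions chosen for illustration.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(deg):
    """2-D unit vector at the given angle in degrees (illustration helper)."""
    r = np.deg2rad(deg)
    return np.array([np.cos(r), np.sin(r)])


q = unit(0.0)
same_side = [unit(10.0), unit(20.0)]     # both candidates on one side of q
both_sides = [unit(10.0), unit(-25.0)]   # candidates on either side of q

# The sum of two unit vectors points along their angular bisector.
sim_same = cosine(np.sum(same_side, axis=0), q)   # bisector at 15 degrees
sim_both = cosine(np.sum(both_sides, axis=0), q)  # bisector at -7.5 degrees

print(sim_same, sim_both)
```

Although the vector at $-25^\circ$ is individually less similar to $q$ than the one at $20^\circ$, pairing it with the vector at $10^\circ$ yields a sum vector closer to $q$, which is the complementarity effect described above.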

Definition 2.

The problem of Vectors Retrieval with Similarity and Diversity (VRSD): Given a query vector $q$ and a set of candidate vectors $R=\{d_0,d_1,\ldots,d_{n-1}\}$ (where $d_0$ is the vector with the highest similarity to the query vector $q$), $d_0$ is selected first because of its highest similarity. The problem is then how to select $k-1$ vectors ($d'_1,d'_2,\ldots,d'_{k-1}$) from the remaining vectors such that the cosine similarity between the sum vector $d=d_0+d'_1+d'_2+\ldots+d'_{k-1}$ and $q$ is maximized.
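To make the objective in Definition 2 concrete, the following sketch solves VRSD exactly by exhaustive search. It is not the heuristic proposed in this paper; it is a brute-force baseline written for illustration, with the function name `vrsd_brute_force` and the example angles being assumptions. It enumerates all $\binom{n-1}{k-1}$ subsets, so it is only usable for very small $n$.

```python
from itertools import combinations

import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def vrsd_brute_force(query, candidates, k):
    """Fix d_0 (most similar candidate), then try every (k-1)-subset of the rest."""
    cands = np.asarray(candidates, dtype=float)
    sims = [cosine(c, query) for c in cands]
    d0 = int(np.argmax(sims))
    rest = [i for i in range(len(cands)) if i != d0]
    best_idx, best_sim = None, -np.inf
    for combo in combinations(rest, k - 1):
        idx = [d0, *combo]
        # Objective of Definition 2: cosine between the sum vector and q.
        sim = cosine(cands[idx].sum(axis=0), query)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx, best_sim


# Hypothetical 2-D candidates at 10, 20, 30 and -25 degrees from the query.
angles = np.deg2rad([10.0, 20.0, 30.0, -25.0])
cands = np.stack([np.cos(angles), np.sin(angles)], axis=1)
q = np.array([1.0, 0.0])

idx2, sim2 = vrsd_brute_force(q, cands, k=2)
idx3, sim3 = vrsd_brute_force(q, cands, k=3)
print(idx2, idx3)  # [0, 3] [0, 1, 3]
```

In this toy instance the optimal sets include the opposite-side candidate, since it pulls the sum vector back towards $q$.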

The vector $d_0$, characterized by its maximal similarity to the query vector $q$, establishes an initial constraint on similarity. The ensuing optimization objective strives to maximize the cosine similarity between the sum vector of all selected vectors and $q$. This process necessitates the selection of vectors that not only converge towards $q$ from diverse dimensions but also exhibit significant diversity and complementarity. However, upon further examination of the above problem, we find that it is NP-complete. Below, we provide a theoretical proof. Since the vector $d_0$, with the highest similarity, is initially selected, the subsequent $k-1$ selected vectors must have the maximum cosine similarity with $q-d_0$. That is, maximizing the similarity between the sum vector $d=d_0+d'_1+d'_2+\ldots+d'_{k-1}$ (where $d'_1,d'_2,\ldots,d'_{k-1}$ are the $k-1$ vectors selected subsequently) and $q$ is equivalent to maximizing the similarity between $d'=d'_1+d'_2+\ldots+d'_{k-1}$ and $q-d_0$. To this end, we define a decision problem, namely:

Definition 3.

The decision problem of vectors retrieval: Given a set of candidate vectors $R$ and a query vector $q$, can $k$ vectors be selected from $R$ such that the cosine similarity between the sum vector of these $k$ vectors and the query vector $q$ equals 1? We denote instances of this vectors retrieval problem as $(R,q,k)$.

Next, we prove that this decision problem is NP-complete. To keep the proof concise, we further restrict the components of vectors to integers. The proof strategy is to reduce the subset sum problem [7] to this decision problem.

Definition 4.

The subset sum problem: Given an integer set $T$ and another integer $t$, does there exist a non-empty subset whose sum of elements equals $t$? We denote instances of the subset sum problem as $(T,t)$.

For the convenience of proof, we also need to define a modified version of the subset sum problem, called the $k$-subset sum problem.

Definition 5.

$k$-subset sum problem: Given an integer set $T$ and another integer $t$, does there exist a non-empty subset of size $k$ (i.e., the cardinality of the subset is $k$), whose sum of elements equals $t$? We denote instances of the $k$-subset sum problem as $(T,t,k)$.

Lemma 1.

The $k$-subset sum problem is NP-complete.

Proof.

We reduce the subset sum problem (Def. 4) to the $k$-subset sum problem (Def. 5).

1. Clearly, the $k$-subset sum problem is polynomial-time verifiable.

2. Reducing the subset sum problem to the $k$-subset sum problem.

For any instance of the subset sum problem $(T,t)$, we can transform it into $|T|$ instances of the $k$-subset sum problem, i.e., $(T,t,1),(T,t,2),\ldots,(T,t,|T|)$. If any of these $|T|$ instances has a yes answer, then the answer to the subset sum problem is yes. If the answers to all $|T|$ instances are no, then the answer to the subset sum problem is also no. Since only $|T|$ queries are needed, this is a polynomial-time (Turing) reduction. Therefore, if the $k$-subset sum problem can be solved in polynomial time, then the subset sum problem can also be solved in polynomial time. Hence, the $k$-subset sum problem is NP-complete. ∎
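The structure of this reduction can be sketched in Python. The $k$-subset sum oracle below is a brute-force stand-in (exponential time) used only to make the example runnable; the function names and the sample set are assumptions. The point of the sketch is that subset sum is decided with exactly $|T|$ oracle calls.

```python
from itertools import combinations


def k_subset_sum(T, t, k):
    """Brute-force oracle: does some subset of T with exactly k elements sum to t?"""
    return any(sum(c) == t for c in combinations(T, k))


def subset_sum(T, t):
    """Decide subset sum by querying the oracle for k = 1, ..., |T|."""
    return any(k_subset_sum(T, t, k) for k in range(1, len(T) + 1))


T = [3, 34, 4, 12, 5, 2]     # hypothetical instance
print(subset_sum(T, 9))      # True, e.g. 4 + 5
print(subset_sum(T, 30))     # False
```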

Now it is time to prove the NP-completeness of the decision problem of vectors retrieval.

Theorem 1.

The decision problem of vectors retrieval is NP-complete.

Proof.

We reduce the $k$-subset sum problem (Def. 5) to the decision problem of vectors retrieval (Def. 3).

1. The answer to vectors retrieval is polynomial-time verifiable. If the answer provides $k$ vectors, we can simply add these $k$ vectors and then check whether the cosine similarity between the sum vector and the query vector $q$ equals 1. This verification can be done in polynomial time.

2. Reducing the $k$-subset sum problem to the decision problem of vectors retrieval.

For any instance of the $k$-subset sum problem $(T,t,k)$, let $T=\{t_1,t_2,\ldots,t_n\}$. We construct the set of vectors $R$ and the query vector $q$ as Eq. 4:

R={[t1,1],[t2,1],,[tn,1]},q=[t,k]formulae-sequence𝑅subscript𝑡11subscript𝑡21subscript𝑡𝑛1𝑞𝑡𝑘R=\{[t_{1},1],[t_{2},1],\ldots,[t_{n},1]\},q=[t,k]\ italic_R = { [ italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , 1 ] , [ italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , 1 ] , … , [ italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , 1 ] } , italic_q = [ italic_t , italic_k ] (4)

The decision problem of vectors retrieval $(R,q,k)$ asks whether there exist $k$ vectors such that the sum vector (denoted as $d$) of these vectors and the query vector $q$ have a cosine similarity of 1. According to the definition of cosine similarity, $\text{cos\_similarity}=\frac{d\cdot q}{|d|\cdot|q|}$. The cosine similarity between $d$ and $q$ equals 1 if and only if $d=\alpha q$, where $\alpha$ is a positive constant. Therefore, if vectors retrieval provides an affirmative answer $d=\alpha q$, we obtain Eq. 5:

$$d=[t'_1,1]+[t'_2,1]+\ldots+[t'_k,1]=\alpha[t,k]\;\Rightarrow\;[(t'_1+\ldots+t'_k),k]=\alpha[t,k]. \tag{5}$$

Here $[t'_1,1],\ldots,[t'_k,1]$ are the selected $k$ vectors. Comparing the second components gives $k=\alpha k$, so $\alpha=1$ and $t'_1+\ldots+t'_k=t$. Thus, this provides an affirmative answer to the $k$-subset sum problem instance $(T,t,k)$. Conversely, if vectors retrieval provides a negative answer, then the answer to the $k$-subset sum problem is also negative, since any subset of size $k$ summing to $t$ would yield the sum vector $[t,k]=q$. The above reduction can clearly be completed in polynomial time. Therefore, the decision problem of vectors retrieval is NP-complete. ∎
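
To make the construction in Eq. 4 concrete, the following Python sketch builds $(R,q,k)$ from a $k$-subset sum instance and decides it by brute force. The function names are hypothetical. The exhaustive search is used only for illustration (it is exponential and not part of the proof), and the test for a cosine similarity of 1 uses exact integer arithmetic that is valid only for the two-dimensional integer vectors produced by this construction.

```python
from itertools import combinations


def build_instance(T, t, k):
    """Map a k-subset sum instance (T, t, k) to (R, q, k) as in Eq. 4."""
    R = [(ti, 1) for ti in T]
    q = (t, k)
    return R, q, k


def cosine_is_one(d, q):
    """Exact 2-D integer test: cos(d, q) = 1 iff d is a positive multiple of q."""
    cross = d[0] * q[1] - d[1] * q[0]  # zero iff d and q are parallel
    dot = d[0] * q[0] + d[1] * q[1]    # positive iff they point the same way
    return cross == 0 and dot > 0


def vectors_retrieval_decision(R, q, k):
    """Brute force: does some k-subset of R have a sum vector with cosine 1 to q?"""
    for subset in combinations(R, k):
        d = (sum(v[0] for v in subset), sum(v[1] for v in subset))
        if cosine_is_one(d, q):
            return True
    return False


if __name__ == "__main__":
    R, q, k = build_instance([3, 34, 4, 12, 5, 2], 9, 2)
    print(vectors_retrieval_decision(R, q, k))  # 4 + 5 = 9 with two elements
```

Because every constructed vector has second component 1, the sum of any $k$ of them has second component $k$, which is what forces $\alpha=1$ in the argument above.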

3.2 Heuristic algorithm for vectors retrieval

Since the vectors retrieval problem $(R,q,k)$ is NP-complete, heuristic methods are needed to derive feasible solutions. Specifically, given a set of candidate vectors with high similarity, the objective is to select $k$ vectors that maximize the cosine similarity between the sum vector of the $k$ selected vectors and the query vector. We propose a new algorithm denoted as Vectors Retrieval with Similarity and Diversity (VRSD). VRSD initially selects the vector most similar to the query vector and then iteratively selects additional vectors from the remaining candidates. In each iteration, it chooses the candidate whose addition to the cumulative sum of all previously selected vectors maximizes the cosine similarity between that sum and the query vector, continuing this process until $k$ vectors are chosen. Further details about the VRSD algorithm can be found in Algorithm 1 and Fig. 2.


Figure 2: An illustration of how VRSD works for $(R,q,3)$. (1) The vector $d_0$ is selected first because it has the maximum cosine similarity (MCS) with the query vector $q$; $d_0$ is then removed from $R$. (2) The vector $d_y$ is selected second because, when added to $d_0$, their sum vector has the maximum cosine similarity with $q$ among the unselected vectors; it is also subsequently removed from $R$. (3) The vector $d_z$ is selected third because the sum vector has the maximum cosine similarity with $q$ after $d_z$ is added to it.
Algorithm 1 Vectors Retrieval with Similarity and Diversity (VRSD)
Input: Candidate vector set $R=\{d_0,d_1,\ldots,d_{n-1}\}$, query vector $q$, where $d_0$ is the vector among all $d_i$ that has the highest cosine similarity with $q$, and constant $k$.
Output: $k$ vectors including $d_0$, such that the cosine similarity between the sum vector of these $k$ vectors and $q$ is maximized.
$S=\{d_0\}$
$R=R\setminus\{d_0\}$
for $i=1$ to $k-1$ do
    $s=\sum S$ ▷ Sum of all vectors in $S$
    $\text{maxCos}=0$
    $p=\text{null}$ ▷ Initialize $p$ to a null vector or equivalent
    for $v$ in $R$ do
        $t=s+v$ ▷ Temporary vector for comparison
        if $\cos(t,q)>\text{maxCos}$ then
            $\text{maxCos}=\cos(t,q)$
            $p=v$
        end if
    end for
    $S=S\cup\{p\}$ ▷ Add $p$ to the set $S$
    $R=R\setminus\{p\}$ ▷ Remove $p$ from $R$
end for
return $S$
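
A minimal pure-Python sketch of the greedy procedure in Algorithm 1 is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the first vector $d_0$ is found inside the function instead of being supplied, a running sum replaces the recomputation of $\sum S$, and the best candidate is taken with an arg-max, so a vector is selected even when every cosine similarity is non-positive (Algorithm 1 initializes maxCos to 0).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def vrsd(candidates, q, k):
    """Greedy selection following Algorithm 1; returns indices into `candidates`."""
    remaining = list(range(len(candidates)))
    # d_0: the candidate most similar to the query.
    first = max(remaining, key=lambda i: cosine(candidates[i], q))
    selected = [first]
    remaining.remove(first)
    s = list(candidates[first])  # running sum of the selected vectors
    while len(selected) < k and remaining:
        # Pick the candidate whose addition brings the sum vector closest to q.
        best = max(
            remaining,
            key=lambda i: cosine([a + b for a, b in zip(s, candidates[i])], q),
        )
        selected.append(best)
        remaining.remove(best)
        s = [a + b for a, b in zip(s, candidates[best])]
    return selected


if __name__ == "__main__":
    q = [1.0, 1.0]
    candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
    print(vrsd(candidates, q, 2))  # [3, 2]
```

In this toy example, ranking by similarity alone would return indices 3 and 1, whereas the greedy sum-vector criterion picks index 2 second, because it pulls the sum vector toward $q$ from the other side.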

3.3 Time complexity analysis of VRSD algorithm

As depicted in Algorithm 1, the time complexity of the VRSD algorithm is $k\times|R|=k\times n$, where the $n$ candidate vectors are first selected from the entire set of vectors (size $N$) based on similarity. Given that $N\gg n>k$, the computational load of the subsequent steps in Algorithm 1 is minimal in comparison. The MMR algorithm, which also selects $k$ vectors from $|R|$ candidates, requires two maximum calculations as depicted in Eq. 1: one over each candidate vector against the query vector, and one over each candidate against the set of already selected vectors $S$. Thus, the complexity for MMR becomes $k\times|R|\times|S|$, which is bounded above by $k\times|R|^2=k\times n^2$ because $|S|\le|R|$, indicating a higher computational demand than VRSD.
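
As a rough worked count, the two loops can be tallied in units of one similarity evaluation. The first line follows directly from the loop bounds of Algorithm 1. The second line is an illustrative assumption about a naive evaluation of MMR, in which each remaining candidate is compared once with $q$ and once with every already selected vector, and is not a statement about any particular implementation.

```latex
% VRSD: iteration i (i = 1, ..., k-1) scans the n - i remaining candidates
\sum_{i=1}^{k-1} (n-i) \;=\; (k-1)\,n - \frac{k(k-1)}{2} \;<\; k\,n

% MMR (naive, assumed): iteration i compares each of the n - i remaining
% candidates with q once and with the i selected vectors
\sum_{i=1}^{k-1} (n-i)(i+1) \;\le\; (k-1)\,n\,k \;<\; k^{2} n \;\le\; k\,n^{2}
\qquad (k \le n)
```

Both counts ignore the vector dimension, which multiplies each similarity evaluation by the same factor.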

4 Experiments

4.1 Experiments detail

We evaluated the VRSD algorithm on three publicly available datasets of different categories and compared it with the MMR algorithm for $\lambda$ values of 0, 0.5, and 1:

  • ARC-DA [2]: A dataset of direct-answer science questions derived from the ARC multiple-choice questions. Each example contains a question and multiple answers.

  • OpenBookQA [21]: A dataset of multiple-choice science questions, which probe the understanding of science facts and the application of these facts to novel situations. Each example contains a question, multiple choices, and an answer.

  • Puzzle [18]: A question answering dataset whose questions are lateral thinking puzzles. Each example contains a question and an answer.

For each item in each dataset, we concatenate the question with its corresponding answer and then select 20% of these concatenated items to form the test set, from which only the question parts are used as queries. Items designated for the test set are excluded from the original dataset for subsequent experiments, where four examples are retrieved for each test question. Our evaluation focuses on two primary aspects: retrieval quality and answer quality. Retrieval quality is assessed by aggregating the four vectors retrieved using either VRSD or MMR into a sum vector, denoted as $d_{\text{VRSD}}$ and $d_{\text{MMR}}$ respectively, which reflects the vectorial direction from which the examples approach the query vector $q$. We compute the cosine similarity between the sum vectors and the query vector as $\cos(d_{\text{VRSD}},q)$ and $\cos(d_{\text{MMR}},q)$. The comparison includes counting the instances where $\cos(d_{\text{VRSD}},q)$ exceeds $\cos(d_{\text{MMR}},q)$, termed the VRSD win rate, and calculating the maximum difference between these cosine similarities over all queries in each test set. Additionally, we compute the mean cosine similarity values for these vectors. Such metrics are instrumental in elucidating the algorithms' capacity to balance similarity and diversity.
For answer quality assessment, we reconstruct prompts by concatenating the original sentences corresponding to the four retrieved vectors with the initial question, and input these into LLMs. The responses generated by the open-source LLM open-mistral-7b and the closed-source LLM gpt-3.5-turbo are then compared with the standard answers using the ROUGE-L metric for the ARC-DA and Puzzle datasets, and Accuracy for OpenBookQA. This evaluation helps us ascertain the efficacy of the retrieved examples in providing accurate answers.
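
The three retrieval-quality metrics described above (VRSD win rate, maximum difference, and mean cosine similarity) can be computed from per-query cosine similarities as in the sketch below. The function name and the input numbers are hypothetical, and the maximum difference is assumed here to be the largest signed value of $\cos(d_{\text{VRSD}},q)-\cos(d_{\text{MMR}},q)$, which the text does not state explicitly.

```python
def retrieval_quality(cos_vrsd, cos_mmr):
    """Summarize per-query cosine similarities for VRSD and MMR.

    cos_vrsd[i] and cos_mmr[i] are cos(d_VRSD, q_i) and cos(d_MMR, q_i)
    for the same test query q_i.
    """
    if not cos_vrsd or len(cos_vrsd) != len(cos_mmr):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(cos_vrsd)
    diffs = [a - b for a, b in zip(cos_vrsd, cos_mmr)]
    return {
        "win_rate": sum(d > 0 for d in diffs) / n,  # share of queries VRSD wins
        "max_diff": max(diffs),                     # assumed: signed difference
        "mean_vrsd": sum(cos_vrsd) / n,
        "mean_mmr": sum(cos_mmr) / n,
    }


if __name__ == "__main__":
    # Made-up values for four queries, for illustration only.
    print(retrieval_quality([0.8, 0.7, 0.9, 0.6], [0.75, 0.72, 0.8, 0.5]))
```

Ties are not counted as wins in this sketch, matching the wording "exceeds" in the description.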

4.2 Experiments Results

Table 1: Algorithm performance on different datasets. Max-diff displays the maximum difference between $\cos(d_{\text{VRSD}},q)$ and $\cos(d_{\text{MMR}},q)$ across all queries in the test set. Mean displays the mean value of $\cos(d_{\text{VRSD}},q)$ and $\cos(d_{\text{MMR}},q)$ in the test set. Boldface indicates the algorithm with the highest mean value for each dataset.

| Dataset | Algorithm | VRSD win rate | Max-diff | Mean |
|---|---|---|---|---|
| ARC-DA | VRSD | - | - | **0.740** |
| ARC-DA | MMR ($\lambda$=0) | 97.7% | 0.160 | 0.696 |
| ARC-DA | MMR ($\lambda$=0.5) | 92.5% | 0.108 | 0.720 |
| ARC-DA | MMR ($\lambda$=1) | 95.3% | 0.158 | 0.710 |
| OpenBookQA | VRSD | - | - | **0.833** |
| OpenBookQA | MMR ($\lambda$=0) | 97.3% | 0.135 | 0.809 |
| OpenBookQA | MMR ($\lambda$=0.5) | 92.6% | 0.101 | 0.822 |
| OpenBookQA | MMR ($\lambda$=1) | 96.8% | 0.102 | 0.812 |
| Puzzle | VRSD | - | - | **0.592** |
| Puzzle | MMR ($\lambda$=0) | 100% | 0.161 | 0.537 |
| Puzzle | MMR ($\lambda$=0.5) | 90% | 0.052 | 0.576 |
| Puzzle | MMR ($\lambda$=1) | 100% | 0.132 | 0.577 |

Table 2: Algorithm performance under different LLMs. Boldface indicates the algorithm with the highest score under each model.

| Algorithm | ARC-DA (gpt-3.5-turbo) | ARC-DA (open-mistral-7b) | OpenBookQA (gpt-3.5-turbo) | OpenBookQA (open-mistral-7b) | Puzzle (gpt-3.5-turbo) | Puzzle (open-mistral-7b) |
|---|---|---|---|---|---|---|
| VRSD | **0.371** | **0.233** | **0.789** | **0.534** | **0.213** | **0.198** |
| MMR ($\lambda$=0) | 0.355 | 0.216 | 0.767 | 0.508 | 0.206 | **0.198** |
| MMR ($\lambda$=0.5) | 0.364 | 0.218 | 0.772 | 0.507 | 0.202 | 0.188 |
| MMR ($\lambda$=1) | 0.347 | 0.222 | 0.780 | 0.510 | 0.188 | 0.186 |

Table 1 presents the results of retrieval quality, indicating that the win rate of VRSD over MMR is at least 90% across all datasets and $\lambda$ settings. This outcome suggests that VRSD not only retrieves examples more relevant to the original query but also better satisfies the diversity requirements. The maximum difference between the two cosine similarities is at least 0.05 in every setting (the smallest value, 0.052, occurs on Puzzle with $\lambda=0.5$). Additionally, the mean cosine similarity between the sum vector and the query vector is higher for VRSD than for MMR in every setting. VRSD generally demonstrates superior retrieval efficacy from diverse perspectives. Table 2 shows the answer quality results, with VRSD achieving the highest score for every dataset and model (tied with MMR at $\lambda=0$ on Puzzle with open-mistral-7b), thus indicating that the examples retrieved by VRSD enhance the LLM's understanding of the query and facilitate the generation of the desired answers. The superior performance of the closed-source LLM over the open-source LLM can be attributed to the former's enhanced understanding and reasoning capabilities. Overall, VRSD exhibits superior performance to MMR in both example retrieval and query answering, fulfilling the requirements of both similarity and diversity without the need for parameter adjustments, thereby highlighting the advantages of VRSD.

It is worth noting that, compared to the other datasets, the mean values of $\cos(d_{\text{VRSD}},q)$ and $\cos(d_{\text{MMR}},q)$ in the Puzzle dataset are relatively low. Additionally, the difference in results between the open-source and closed-source models is less pronounced, and VRSD achieves a 100% win rate against MMR with $\lambda=0$ and $\lambda=1$, which may be attributed to the small size of the dataset. Nevertheless, given that lateral thinking puzzles require LLMs to comprehend the query and derive insights from various angles in the retrieved examples, the Puzzle dataset remains a valuable tool for assessing our algorithm.

5 Conclusion

In this work, given the complexity of parameter adjustment in MMR, we aim to enhance how LLMs retrieve similar and diverse examples by introducing a novel approach that characterizes both constraints through the relationship between the sum vector and the query vector. This method requires individual vectors within the sum to align divergently with the query vector, thereby satisfying the diversity constraint. We demonstrate that this problem is NP-complete, and we propose the VRSD algorithm, which outperforms MMR across various metrics while also being more efficient, thereby enabling the retrieval of higher-quality examples. Our work underscores the inherent challenges of simultaneously pursuing similarity and diversity in vector retrieval and establishes a solid theoretical foundation for further research. The proposed combinatorial optimization problem holds independent interest from both theoretical and practical standpoints, indicating that further exploration and refinement of the heuristic algorithm would constitute a valuable avenue for future research. In terms of advancing the heuristic algorithm, we aim to explore two specific aspects: (1) the development of heuristic algorithms with lower time complexity that can retrieve high-quality examples more efficiently, and (2) the creation of adaptable heuristic algorithms that remain robust regardless of dataset size or problem type. We believe that as retrieval algorithms that emphasize both similarity and diversity continue to improve, the scope of tasks that LLM-based agents can manage will expand, yielding increasingly satisfactory results.

References

  • [1] Uri Alon, Frank Xu, Junxian He, Sudipta Sengupta, Dan Roth, and Graham Neubig. Neuro-symbolic language modeling with automaton-augmented retrieval. In International Conference on Machine Learning, pages 468–485. PMLR, 2022.
  • [2] Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. Think you have solved direct-answer question answering? try arc-da, the direct-answer ai2 reasoning challenge. arXiv preprint arXiv:2102.03315, 2021.
  • [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [4] Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336, 1998.
  • [5] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. In 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, pages 1870–1879. Association for Computational Linguistics (ACL), 2017.
  • [6] Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Weiwei Deng, and Qi Zhang. Uprise: Universal prompt retrieval for improving zero-shot evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12318–12337, 2023.
  • [7] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
  • [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  • [9] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020.
  • [10] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In EACL 2021-16th Conference of the European Chapter of the Association for Computational Linguistics, pages 874–880. Association for Computational Linguistics, 2021.
  • [11] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020.
  • [12] Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Nearest neighbor machine translation. In International Conference on Learning Representations, 2020.
  • [13] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2019.
  • [14] Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48, 2020.
  • [15] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, 2019.
  • [16] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • [17] Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, 2022.
  • [18] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. In The Twelfth International Conference on Learning Representations, 2023.
  • [19] Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbrasaite, and Vincent Y Zhao. Dr. icl: Demonstration-retrieved in-context learning. arXiv preprint arXiv:2305.14128, 2023.
  • [20] Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 2021.
  • [21] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018.
  • [22] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013.
  • [23] Ercong Nie, Sheng Liang, Helmut Schmid, and Hinrich Schütze. Cross-lingual retrieval augmented prompt for low-resource languages. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
  • [24] Panupong Pasupat, Yuan Zhang, and Kelvin Guu. Controllable semantic parsing via retrieval augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7683–7698, 2021.
  • [25] Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. Rocketqa: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5835–5847, 2021.
  • [26] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
  • [27] Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. Rocketqav2: A joint training method for dense passage retrieval and passage re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2825–2835, 2021.
  • [28] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
  • [29] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, 2022.
  • [30] Richard Shin, Christopher Lin, Sam Thomson, Charles Chen Jr, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7699–7715, 2021.
  • [31] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
  • [32] Liang Wang, Nan Yang, and Furu Wei. Learning to retrieve in-context examples for large language models. arXiv preprint arXiv:2307.07164, 2023.