GenSco: Can Question Decomposition based Passage Alignment improve Question Answering?

Barah Fazili1* Koustava Goswami2 Natwar Modani2 Inderjeet Nair3
1 Indian Institute of Technology Bombay
2 Adobe Research India
3 University of Michigan, Ann Arbor
* Research work conducted during internship at Adobe Research India.
Research work conducted when at Adobe Research India.
[email protected]
{koustavag,nmodani}@adobe.com [email protected]
Abstract

Retrieval augmented generation (RAG) with large language models (LLMs) for Question Answering (QA) entails furnishing relevant context within the prompt to help the LLM generate an answer. During generation, inaccuracies or hallucinations frequently occur due to two primary factors: inadequate or distracting context in the prompts, and the inability of LLMs to effectively reason through the facts. In this paper, we investigate whether providing aligned context via a carefully selected passage sequence leads to better answer generation by the LLM for multi-hop QA. We introduce “GenSco”, a novel approach of selecting passages based on the predicted decomposition of the multi-hop questions. The framework consists of two distinct LLMs: (i) a Generator LLM, which is used for question decomposition and final answer generation; (ii) an auxiliary open-sourced LLM, used as the scorer, to semantically guide the Generator for passage selection. The generator is invoked only once for the answer generation, resulting in a cost-effective and efficient approach (for question decomposition, the generator is called $O(N)$ times, where $N$ is the maximum number of hops along the greedy path; refer to Section 3 for details). We evaluate on three broadly established multi-hop question answering datasets: 2WikiMultiHop, Adversarial HotPotQA and MuSiQue, and achieve an absolute gain of 15.1 and 5.9 points in Exact Match score with respect to the best performing baselines over MuSiQue and 2WikiMultiHop respectively.




1 Introduction

Retrieval augmented generation (RAG) for question answering typically involves presenting the model with a “context” supporting the ground truth answer, such as a paragraph from Wikipedia, alongside the posed question. This task offers a measurable means to assess the language comprehension and reasoning capabilities of an NLP system Hermann et al. (2015); Xiao et al. (2019); Rajpurkar et al. (2016). Earlier approaches predominantly concentrated on conducting this reasoning within a single context Liu et al. (2018); Seo et al. (2018); Wang et al. (2017). Thanks to recent advancements in Deep Learning techniques Lan et al. (2020), machines have now surpassed human performance on datasets like SQuAD 2.0 Rajpurkar et al. (2016). The recent progress in single-hop QA tasks has spurred interest in a more challenging and practical form of QA, Multi-Hop Question Answering (MHQA). Multi-step reasoning involves asking one or more preliminary questions before getting to the final answer, with each preliminary step feeding into the subsequent step, forming a reasoning chain Mavi et al. (2022). Consequently, techniques proven effective in MHQA can be integrated into tasks such as sentence fusion Geva et al. (2019); Weiss et al. (2021), abstractive summarization Nayeem et al. (2018), event occurrence time prediction Wang et al. (2021), multi-document summarization Ma et al. (2021), and timeline summarization Yu et al. (2021), all of which require information synthesis across multiple documents.

While LLMs produce strong results for most natural language understanding tasks, finding the best way to provide instructions to LLMs to perform MHQA remains an active area of research.

Factual inaccuracy occurs when the model lacks the requisite supporting data to generate an accurate response, often arising from its unfamiliarity with specific entities, attributes, or events. Although this kind of inaccuracy is simple, it constitutes the bulk of errors in the model generations Zheng et al. (2023). For multi-hop QA, there has been some work on language model prompting for multi-hop passage re-ranking, where the passage relevance scores are computed as conditional likelihoods using the LLM Khalifa et al. (2023). Another thread of research focuses on providing simplified queries to LLMs by performing Question Decomposition Patel et al. (2022). Question decomposition has been explored either by involving a human in the loop Patel et al. (2022) or by getting the model to respond to subquestions before rendering the final answer Radhakrishnan et al. (2023); Schlag et al. (2023); Yao et al. (2023a). However, generating answers at every iteration for each of the decomposed questions over a longer sequence of documents is time consuming and tedious. In fact, there is a high risk of hallucination if the passage selection goes wrong.

To tackle this problem, we explore Question Decomposition for passage retrieval instead of invoking the LLM for answer generation at each step, resulting in lower latency and higher efficiency.

We combine the two kinds of approaches, those leveraging instruction-tuned LLMs to compute relevance scores in passage reranking and those generating simpler questions via Question Decomposition, to design an effective passage sequence selection method for MHQA. We propose GenSco, which leverages an open-source LLM as a scorer guiding the Generator LLM (assumed to be a black box) for passage sequence selection before Answer Generation. Our intuition behind choosing two separate LLMs is that the scorer module complements the generator LLM in terms of semantics and grounded knowledge.

In our proposed approach, we start with an empty context. We ask the generator LLM to generate a sub-question from the question, the context collected up to now (initially empty), and the sub-questions generated up to now (again, initially empty). Given the generated sub-question, we rank all the candidate passages based on negative log-likelihood using the scorer LLM. We add the passage with the best score to the context (and the generated sub-question to the list of sub-questions) and ask the generator LLM to generate the next sub-question. When a stopping criterion is met, we send the context (in the order it was accumulated) and the question as part of a suitable prompt to the generator LLM for final answer generation.

To the best of our knowledge, we are the first to propose Question Decomposition for selecting a passage sequence for multi-hop question answering. Note that this is different from simple reranking since there are two aspects to passage selection with our method: not only do we identify the most relevant passages, but we also render them in an order that is consistent with the reasoning steps implied by the multi-hop question. We empirically show that GenSco retrieves relevant passages with high precision and that the order in which the passages are presented to the downstream LLM also contributes to achieving higher accuracy. To summarize, our main contributions in this paper are: (i) We introduce a novel, inference-only (hence data-efficient) greedy approach called “GenSco” for passage sequence selection in Multi-Hop Question Answering and achieve an absolute gain of 5.9 and 15.1 points in Exact Match score with respect to the best performing baselines for the 2WikiMultiHop Ho et al. (2020) and MuSiQue Trivedi et al. (2022) datasets respectively; (ii) Apart from the superior downstream QA performance, “GenSco” also achieves high precision on the passage retrieval task, effectively mitigating hallucination in the LLM responses.

2 Related Work

2.1 LLMs for Information Retrieval

Researchers have experimented with in-domain few-shot examples with LLMs to generate queries Dai et al. (2022); Bonifacio et al. (2022); Boytsov et al. (2023). Thereafter, neural retrieval models are fine-tuned over the dataset enhanced with these generated queries.

While LLMs have previously been used to score paragraphs based on their relevance to a certain query Sachan et al. (2022), directly applying them to determine pertinent paragraphs for a complex query requiring intricate reasoning results in sub-optimal performance. Our novel approach of breaking down a complex query into a set of simpler elemental queries allows our method to leverage the generative ability of LLMs for accurate retrieval and better answering performance. Zhang et al. (2023) introduce beam retrieval for multi-hop QA, optimizing learnable parameters across all hops. In contrast, our method is inference-only, eliminating the need for training or training examples.

2.2 Complex Task Decomposition

One of the well-known techniques for MHQA is to decompose the complex query into simpler sub-questions, answer them and then combine the results to get the overall answer Fu et al. (2021). Recently, researchers have proposed deconstructing a complex problem into a series of simpler sub-problems before feeding them to LLMs Zhou et al. (2022); Press et al. (2022).

Despite significant advancements in LLMs that have enhanced their reasoning capabilities and reduced the disparity between machine and human intelligence, using them directly to answer sub-questions might result in inaccurate responses due to the lack of appropriate knowledge. Thus, instead of directly solving the sub-problems via prompting, we leverage LLMs to find the right context for each of the sub-problems. This is in line with the conclusions drawn by Zheng et al. (2023): providing fine-grained external knowledge can boost the truthfulness of LLMs in answering a question.

3 Methodology

Figure 1: GenSco: the subquestion at each level is generated using the subQ-Gen module, and the Scorer module is invoked for selecting the passage (greedy algorithm). The sequence of passages is then passed as context to G to generate the final answer (bottom).

We introduce “GenSco” for passage selection in retrieval augmented Multi-Hop Question Answering (refer to Figure 1). GenSco leverages two different LLMs (scorer S and generator G) in a greedy exploration of a passage tree. To improve the alignment of the selected passages with the reasoning chain implied by the multi-hop question, GenSco refines passage selection by using the generator for question decomposition in tandem with the scorer for passage selection at each step.

For a given question $Q$ and a set of already retrieved passages $P=[p_1,p_2,\ldots,p_k]$, we construct a tree of passages where each node along the greedy path is expanded to $k$ child nodes $n_{i,1},\ldots,n_{i,k}$, each corresponding to one of the $k$ passages in $P$. (The first element in the subscript indicates the level of the node in the tree, 0-indexed starting from the root node, and the next element represents the index of the passage included at this node.) Node $n_{i,j}$ represents the history of passages included on the path from the root to its parent node along with the passage $p_j$ included at the $i^{th}$ level.

The subquestion $q_i$ for each level $i$ in the tree is drawn out of the generator LLM G using few-shot prompting. More detail on the prompt structure is available in the supplementary material. The nodes at level $i$ are evaluated using scorer S for relevance to the subquestion $q_i$, which presumably captures the $i^{th}$ reasoning step for the multi-hop question $Q$. The node with the best score is chosen from among the $k$ candidates at each level and is then used to create the next batch of candidates, up to the level of the last subquestion. Since G could generate subquestions indefinitely, an upper limit on the number of levels that can be explored must be specified; in addition to setting a rough upper bound estimated for the dataset, we introduce a novel automated stopping criterion leveraging the scorer model. A broad outline is presented in Algorithm 1, while Section 3.1 and Section 3.2 expand on the two primary modules of the GenSco technique.

Algorithm 1 GenSco
G: Generator LLM, S: Scorer LLM
for (P, Q, A) in data do
    $P : [p_1, p_2, \ldots, p_k]$
    for i in 1 to max-levels do
        if i = 1 then
            $q[i] \leftarrow G(Q, shots')$
        else
            $q[i] \leftarrow G(Q, shots', q[1:i-1], P'[1:i-1])$
        end if
        if stop then
            break
        end if
        $P'[i] \leftarrow S(q[i], P)$
    end for
    $A' \leftarrow G(Q, P', shots)$
    Compute_Metrics$(A', A, P)$
end for
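The inner loop of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: `generate_subq` stands in for a call to the generator G, `nll` stands in for the scorer S, and `end_keyword` is a placeholder for the end-of-decomposition keyword; none of these names come from the paper, and the likelihood-based stopping test of Section 3.1 is left out for brevity.

```python
from typing import Callable, List, Sequence


def gensco_select(
    question: str,
    passages: Sequence[str],
    generate_subq: Callable[[str, List[str], List[str]], str],  # hypothetical wrapper around G
    nll: Callable[[str, List[str]], float],  # hypothetical wrapper around S
    max_levels: int,
    end_keyword: str = "<END>",  # assumed keyword, not taken from the paper
) -> List[str]:
    """Greedily build an ordered passage sequence, one passage per subquestion."""
    subqs: List[str] = []
    selected: List[str] = []
    for _ in range(max_levels):
        subq = generate_subq(question, subqs, selected)
        # Stop on the end-of-decomposition keyword or on a repeated subquestion.
        if subq == end_keyword or subq in subqs:
            break
        # Score every candidate appended to the passages chosen so far;
        # the lowest negative log-likelihood of the subquestion wins.
        best = min(passages, key=lambda p: nll(subq, selected + [p]))
        subqs.append(subq)
        selected.append(best)
    return selected
```

The returned list is the ordered context that would be passed, together with the original question, to the generator for the single answer-generation call.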

3.1 Question Decomposition

For decomposition, the multi-hop question $Q$ is provided to G along with the previously generated subquestions and the corresponding passages selected for each subquestion. The prompt also includes an instruction to flag the end of decomposition, by generating a specified keyword, when no more subquestions can be created for $Q$. Apart from relying on G to flag the end of decomposition or detecting a repeated subquestion, we propose a stopping criterion based on the log-likelihood scores from S: the scorer model S is instructed to generate a multi-hop question given the decomposition, and we evaluate the following expression, where $nll_S$ refers to the negative log-likelihood from scorer S:

$$nll_S(p(Q \mid C, q_1, q_2, \ldots, q_i)) > nll_S(p(Q \mid C, q_1, q_2, \ldots, q_{i-1})), \qquad C = P'[j],\ j \in [1, i-1] \qquad (1)$$

When the likelihood given the decomposition including the generated subquestion $q_i$ falls below the likelihood obtained while excluding $q_i$ from the prompt, we stop exploring more passages. Note that the prompt to S also includes $C$, the concatenation of the passages selected for $q_1,\ldots,q_{i-1}$, i.e., $P'_1,\ldots,P'_{i-1}$. This is included to give the model enough context to synthesize (indirectly score) the composite question $Q$. The prompt is designed only to extract likelihood scores from the scorer S as a proxy for the significance of the new subquestion $q_i$ in the decomposition of $Q$; we therefore ignore the question generated here by S.
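As a minimal sketch of the test in inequality (1), assume a hypothetical callable `nll_question(context, subqs)` that returns the scorer's negative log-likelihood of the fixed multi-hop question $Q$ given the selected passages and a list of subquestions. How that value is extracted from S is not shown here.

```python
from typing import Callable, List


def likelihood_stop(
    nll_question: Callable[[List[str], List[str]], float],  # hypothetical scorer wrapper
    context: List[str],
    subqs: List[str],
    new_subq: str,
) -> bool:
    """Return True when adding new_subq makes Q less likely, as in inequality (1)."""
    with_new = nll_question(context, subqs + [new_subq])
    without_new = nll_question(context, subqs)
    # A higher negative log-likelihood means a lower likelihood of Q.
    return with_new > without_new
```

Both calls use the same context $C$, so the comparison isolates the contribution of the candidate subquestion.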

3.2 Passage selection

Once the subquestion $q_m$ has been formulated for level $m$, we proceed to expand the most favorable node from the previous level $(m-1)$ into $k$ child nodes. Each of these child nodes, denoted as $n_{m,i}$, represents the concatenated sequence of passages chosen along the greedy path from the root node to the parent of $n_{m,i}$, followed by the passage $p_i$. For each child node at level $m$, a score is computed relative to the subquestion $q_m$ using the scorer LLM S. The LLM S is instructed to produce a question based on the passage sequence represented by each node at level $m$, yielding $k$ distinct scores (one per child node) for subquestion $q_m$ being posed as the output text sequence.

The score for node $n_{m,i}$ is the negative log-likelihood of the token sequence $q_m$ in the output distribution of S, given the already selected passages ($P'[:m-1]$) plus passage $p_i$ as input to the scorer S.

$$score(n_{m,i}) = nll_S(p(q_m \mid P'[:m-1], P[i])), \qquad i \in [1, k] \qquad (2)$$
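To make the score in equation 2 concrete, the sketch below computes a sequence-level negative log-likelihood from per-token logits and picks the candidate with the lowest value. It assumes that `logits[t]` holds the scorer's next-token logits at the position of the $t$-th subquestion token; obtaining those logits from S is outside the sketch, and the function names are illustrative.

```python
import math
from typing import Sequence


def sequence_nll(logits: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Sum over target tokens of -log softmax(logits[t])[target_ids[t]]."""
    total = 0.0
    for row, tok in zip(logits, target_ids):
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tok]
    return total


def best_candidate(nlls: Sequence[float]) -> int:
    """Index of the child node with the lowest negative log-likelihood."""
    return min(range(len(nlls)), key=lambda i: nlls[i])
```

In use, `sequence_nll` would be evaluated once per child node $n_{m,i}$, and `best_candidate` applied to the resulting $k$ values.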

After the passage selection is done, the generator is called with the multi-hop question and the selected passage sequence in a few-shot setting to generate the answer.

$$\tilde{A} = G(Q, P') \qquad (3)$$

Here $P'$ denotes the sequence of passages chosen following a greedy approach, with the final level being either the maximum permitted number of levels or the level determined by the stopping conditions, whichever is reached first.

Note that in our log-likelihood expressions, we compute the probability of the question conditioned on the passages rather than the other way around, which may seem less intuitive since it is the passages (and not the question) that we are scoring. As has been discussed in prior work Sachan et al. (2023), either of the two forms could be used theoretically (using Bayes' rule and taking passage priors as uniform, which is a reasonable assumption for reranking). This form is more practical (and faster) since passages tend to be much longer than the question and the likelihood of longer sequences could approach zero.
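The equivalence mentioned above can be written out as a short derivation. For readability the conditioning on the already selected passages $P'[:m-1]$ is dropped, and $p(p_i)$ denotes the prior over the $k$ candidate passages, assumed uniform:

```latex
\begin{align*}
p(p_i \mid q_m) &= \frac{p(q_m \mid p_i)\, p(p_i)}{p(q_m)}
  && \text{(Bayes' rule)} \\
&\propto p(q_m \mid p_i)
  && \text{(uniform prior } p(p_i) = 1/k\text{; } p(q_m) \text{ does not depend on } i\text{)} \\
\arg\max_i \; p(p_i \mid q_m) &= \arg\max_i \; p(q_m \mid p_i)
  = \arg\min_i \; nll_S(p(q_m \mid p_i))
\end{align*}
```

Ranking passages by the negative log-likelihood of the subquestion therefore gives the same ordering as ranking by the passage posterior, under the uniform-prior assumption.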

4 Experiments

4.1 Datasets

We conducted experiments on three different datasets: (i) 2WikiMultiHop Ho et al. (2020); (ii) Adversarial HotPot Yang et al. (2018); Ye and Durrett (2022); and (iii) MuSiQue Trivedi et al. (2022). More details of the datasets can be found in Section A of the Appendix.

4.2 Baselines

We start with the retrieval methods BM25 Robertson and Walker (1994) and GTR Ni et al. (2021): the most relevant passages are extracted and fed to the LLM for answer generation. We compare against Verify-and-Edit Zhao et al. (2023a) (CoT-SC + VE), which focuses on post-editing ‘Chain of Thought’-style reasoning, Iter-Retgen Shao et al. (2023), ReAct Yao et al. (2023b) and Self-Ask Press et al. (2023). Both ReAct and Self-Ask are approaches that involve the decomposition of complex/multi-hop questions using Large Language Model (LLM) prompting techniques. We also compare with a cross-encoder (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) trained for MS Marco Passage Ranking. Our approach, GenSco, is an inference-only method and hence not directly comparable to the full fine-tuning setup. We draw inspiration from different approaches researchers have tried to make zero-shot/few-shot settings effective for MHQA Gao et al. (2024). A detailed explanation can be found in the Appendix.

4.3 Experiment Details

For 2WikiMultiHop, as the dataset has at most 5 supporting passages per instance, we select the top-5 passages for each of the implemented baselines (top three rows against each dataset in Table 1). The passages are then provided as context in the prompt to the LLM for QA. As part of the prompt, we also provide two instances from the train set for our few-shot prompting of the Generator LLM. For Adversarial HotPot, we provide the top-2 passages using each of the three implemented baseline methods, prompting the Generator LLM in the same way but with 4 shots. For MuSiQue, we augment the prompt with the top-4 passages for each of the three implemented baselines and include 3 shots of train instances in the prompt.

For 2WikiMultiHop, GenSco explores the search tree to a limit of 5 levels, considering that each instance has a maximum of 5 supporting passages. For prompting the Generator LLM in GenSco, we include two instances from the training set, as for the baseline retrieval methods. Similarly, for the Adversarial HotPot and MuSiQue datasets, we use 4 shots and 3 shots of training instances respectively, as for the implemented baselines. Verify-and-Edit Zhao et al. (2023a), however, uses six shots for both its datasets.
We use GPT3.5 (text-davinci-003) as the generator LLM for all techniques, as has been used for the Verify-and-Edit baseline. For GenSco, the 3B version of Open-llama (Open LLAMA on Hugging Face) is taken as the scorer LLM. We implement two main variations of our GenSco approach. The first variant, named GenSco-stop, includes the scorer log-likelihood based condition in inequality (1) as an additional stopping criterion. The second variant, called GenSco-max, stops either when it detects the specified end-of-decomposition keyword in the model output or when it identifies a redundant subquestion (already generated for the instance). Note that GenSco-stop incorporates the likelihood-based stopping criterion as a third alternative beyond the two criteria in GenSco-max to stop the greedy search.
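For reference, the per-dataset settings described in this section can be collected in a small configuration mapping. The key names are illustrative and not part of any released code; `None` marks a value that this section does not state.

```python
# Settings as described in Section 4.3; key names are hypothetical.
EXPERIMENT_SETUP = {
    "2WikiMultiHop": {"baseline_top_k": 5, "few_shot": 2, "gensco_max_levels": 5},
    "AdvHotPot": {"baseline_top_k": 2, "few_shot": 4, "gensco_max_levels": None},  # limit not stated here
    "MuSiQue": {"baseline_top_k": 4, "few_shot": 3, "gensco_max_levels": None},  # limit not stated here
}

GENERATOR_MODEL = "text-davinci-003"  # generator LLM used for all techniques
```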

The scores of CoT-SC+VE Zhao et al. (2023a), ReAct Yao et al. (2023b), Self-Ask Press et al. (2023) and Iter-Retgen Shao et al. (2023) for all three datasets have been directly taken from prior work Zhao et al. (2023a); Shao et al. (2023) while all other methods listed in the tables are locally evaluated.

4.3.1 Metrics

The predicted answers are evaluated for correctness given the ground truth answers in the datasets. We evaluate Exact Match, F1, Precision and Recall based on the implementation in prior work Zhao et al. (2023b) which simply normalizes the two strings by removing punctuation, lower-casing the text, etc., before matching them word by word for Exact Match (EM) binary score for each instance. Precision computes the fraction of words in the predicted answer overlapping with the ground truth answer. Similarly, Recall measures the fraction of words in the ground truth matching the predicted answer. F1 score is computed in the standard way based on the precision and recall values.
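A minimal sketch of these answer-level metrics is given below. The normalisation here only lower-cases and strips punctuation; the referenced implementation may apply further steps (the text above says "etc."), so this should be read as an approximation and not the exact evaluation script.

```python
import string
from collections import Counter
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Lower-case, drop punctuation and split on whitespace (assumed normalisation)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def answer_metrics(prediction: str, truth: str) -> Dict[str, float]:
    pred, gold = normalize(prediction), normalize(truth)
    # Multiset overlap between predicted and ground-truth words.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"em": float(pred == gold), "precision": precision, "recall": recall, "f1": f1}
```

Dataset-level scores would then be averages of these per-instance values.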

4.4 Results

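| Dataset | Method | EM | delta EM | F1 | PR | RE |
|---|---|---|---|---|---|---|
| 2WikiMultiHop | BM25+FS | 27.8 | | 33.99 | 33.01 | 36.9 |
| 2WikiMultiHop | GTR+FS | 34.39 | | 43.7 | 42.72 | 48.55 |
| 2WikiMultiHop | Cross-encoder+FS | 33.1 | | 39.4 | 40.1 | 47.54 |
| 2WikiMultiHop | CoT-SC + VE | 37.2 | | - | - | - |
| 2WikiMultiHop | ReAct | 28.0 | | 38.5 | - | - |
| 2WikiMultiHop | Self-Ask | 37.3 | | 48.8 | - | - |
| 2WikiMultiHop | Iter-Retgen-6 | 35.5 | | 48.1 | - | - |
| 2WikiMultiHop | GenSco-stop+FS | 41.4 | | 51.23 | 49.52 | 56.09 |
| 2WikiMultiHop | GenSco-max+FS | 43.2 | 5.9 | 53.24 | 51.08 | 58.28 |
| AdvHotPot | BM25+FS | 46.75 | | 57.27 | 58.23 | 58.23 |
| AdvHotPot | GTR+FS | 53.57 | | 63.36 | 64.29 | 63.36 |
| AdvHotPot | Cross-encoder+FS | 52.9 | | 60.1 | 59.7 | 59.4 |
| AdvHotPot | CoT-SC + VE | 56.8 | | - | - | - |
| AdvHotPot | GenSco-stop+FS | 55.52 | -1.28 | 62.14 | 63.08 | 64.03 |
| AdvHotPot | GenSco-max+FS | 54.87 | | 61.71 | 62.65 | 62.65 |
| MuSiQue | BM25+FS | 24.6 | | 29.2 | 29.7 | 31.2 |
| MuSiQue | GTR+FS | 25.4 | | 30.2 | 34.7 | 39.1 |
| MuSiQue | Cross-encoder+FS | 25.7 | | 28.2 | 25.2 | 34.1 |
| MuSiQue | ReAct | 23.4 | | 37.0 | - | - |
| MuSiQue | Self-Ask | 27.6 | | 41.5 | - | - |
| MuSiQue | Iter-Retgen-7 | 26.1 | | 42 | - | - |
| MuSiQue | GenSco-stop+FS | 39.1 | | 42.1 | 39.7 | 44.7 |
| MuSiQue | GenSco-max+FS | 42.7 | 15.1 | 46.1 | 44.2 | 47.9 |

Table 1: Accuracy of various methods in terms of Exact Match (EM), F1-score (F1), Precision (PR) and Recall (RE). delta EM is the maximum EM of the two GenSco variants minus the maximum baseline EM, reported on the row of the better GenSco variant. FS indicates few-shot prompting of the Generator LLM.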
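The delta EM column can be recomputed from the EM values in Table 1. The snippet below is a small arithmetic check; the helper name `delta_em` is illustrative.

```python
def delta_em(gensco_em, baseline_em):
    """Best GenSco EM minus best baseline EM, rounded to two decimals."""
    return round(max(gensco_em) - max(baseline_em), 2)


# EM values copied from Table 1.
two_wiki = delta_em([41.4, 43.2], [27.8, 34.39, 33.1, 37.2, 28.0, 37.3, 35.5])
adv_hotpot = delta_em([55.52, 54.87], [46.75, 53.57, 52.9, 56.8])
musique = delta_em([39.1, 42.7], [24.6, 25.4, 25.7, 23.4, 27.6, 26.1])
```

This gives 43.2 - 37.3 = 5.9 for 2WikiMultiHop, 55.52 - 56.8 = -1.28 for AdvHotPot and 42.7 - 27.6 = 15.1 for MuSiQue, matching the table.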

In this section, we discuss the performance of our methodology on the three datasets and compare it with the baseline systems. Refer to Table 1 for the results.

4.4.1 2WikiMultiHop

Performance across the various correctness metrics shows the improvements made by GenSco in comparison to all baseline systems. The jump in performance holds consistently across the metrics, indicating the effectiveness of GenSco for passage selection. Both our maximum-hop based method (GenSco-max) and the additional stopping-criterion based approach (GenSco-stop) outperform all baseline systems by a large margin, validating the importance of sequential alignment between the retrieved passages and the question in the context of multi-hop Question Answering.

4.4.2 Adversarial HotPot

Here, we observe superior performance with significantly less variability across all baseline models. GenSco does not outperform but is on par with the best-performing baselines, which can be attributed to the dataset’s relatively smaller passage set. The advantages of our greedy search are not as pronounced because the provided passage set (for each question) is already small, thus limiting further gains with filtering/selection.

4.4.3 MuSiQue

With minimal potential reasoning shortcuts, this dataset is designed to require connected reasoning from the model. Over this challenging dataset with harder distractor passages, our proposed methods achieve large improvements across all four metrics over all the reported baseline systems. This is likely due to the presence of a sufficient number of candidate passages (up to 20) for each instance, allowing our algorithm to effectively explore a greedy selection. We observe that GenSco tends to excel when there are around 10 or more candidate passages on average per instance.

In the rest of the experiments, we only use 2WikiMultiHop for further analysis.

5 Discussion

5.1 What happens without the question decomposition?

To gauge the role of question decomposition in GenSco, we set up an alternate implementation with the following two modifications to the GenSco-max algorithm:

  1. Instead of using the subquestion at the level to compute the log-likelihood expression for passage selection in equation 2, we use the original multi-hop question to compute the scores:

    $$score(n_{m,i}) = nll_S(p(Q \mid P'[1:m-1], P[i])), \qquad i \in [1, k] \qquad (4)$$

  2. For the stopping criterion, we check if the best scoring paragraph at any level has already been selected along the greedy path, i.e., if $P'[m] \in P'[1:m-1]$.

This variant is called GenSco-no-QD in Table 2, where we compare the correctness scores with two of the baseline methods and variants of GenSco on a subset of 250 instances from 2WikiMultiHop. The numbers drop severely with respect to both GenSco-stop and GenSco-max but are still better than BM25.
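The two modifications can be sketched as a short loop. As before, `nll` is a hypothetical wrapper around the scorer S that returns the negative log-likelihood of a question given a passage sequence; the function name is illustrative.

```python
from typing import Callable, List, Sequence


def select_without_decomposition(
    question: str,
    passages: Sequence[str],
    nll: Callable[[str, List[str]], float],  # hypothetical scorer wrapper
    max_levels: int,
) -> List[str]:
    """Greedy selection that scores the full question Q at every level (equation 4)."""
    selected: List[str] = []
    for _ in range(max_levels):
        best = min(passages, key=lambda p: nll(question, selected + [p]))
        # Modification 2: stop once the best passage is already on the greedy path.
        if best in selected:
            break
        selected.append(best)
    return selected
```

No call to the generator is needed during selection in this variant, since no subquestions are produced.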

| Method | EM | F1 | PR | RE |
|---|---|---|---|---|
| BM25+FS | 24.1 | 29.76 | 28.8 | 32.64 |
| GTR+FS | 34.0 | 43.2 | 41.28 | 47.04 |
| GenSco-stop+FS | 41.6 | 49.92 | 48.96 | 54.72 |
| GenSco-max+FS | 44.4 | 52.86 | 50.98 | 57.58 |
| GenSco-no-QD+FS | 31.97 | 38.4 | 37.44 | 41.28 |

Table 2: Baseline and Variations of GenSco on 250 instances from 2WikiMultiHop (FS here stands for ‘FewShot’)

5.2 Correctness vs Faithfulness

| Method | 2WikiMultiHop | AdvHotPot |
|---|---|---|
| BM25+FS | 73.21 | 86.98 |
| GTR+FS | 83.6 | 88.05 |
| GenSco-stop+FS | 75.55 | 75.37 |
| GenSco-max+FS | 76 | 75.76 |

Table 3: Hallucination results (Metric: K-Precision)

We also evaluated the faithfulness of the model responses in comparison to the baseline retrieval methods. K-precision Adlakha et al. (2023) quantifies how well a model’s response is grounded within the provided passages. Consequently, if the generated answer shares more words with the passages, the K-precision value will be higher, regardless of its correctness. While GenSco methods are more accurate with respect to the ground truth answers, the GTR method achieves a higher K-precision value (Table 3) for 2WikiMultiHop. Figure 4 shows a scatter plot over a set of 960 instance responses for 2WikiMultiHop from GenSco-stop and the GTR based method to observe the correlation between K-precision (indicating faithfulness) and F1 (which is a correctness metric). Surprisingly, a higher K-precision score does not necessarily indicate increased correctness, as the data points are almost uniformly dispersed across the F1 axis for high K-precision values. The Pearson correlation coefficients between the two metrics are hence low for both GTR and GenSco-stop: 0.138 and 0.238, respectively. Typical RAG methods, such as those using GTR, lack flexibility because they rely on a fixed number of passages (k) for each instance, typically setting k to the estimated maximum number of hops required across the instances in the dataset, which may result in a higher value of K-precision. In contrast, our GenSco methods tailor the number of passages to each multi-hop query (based on the decomposition), potentially leading to lower K-precision. However, the correctness measures, which are our primary concern, are consistently higher for our method, indicating the importance of supplying only relevant passages to the generator LLM.
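To illustrate the two quantities compared here, the sketch below implements one token-overlap reading of K-precision (the share of response tokens that also occur in the provided passages) together with a plain Pearson correlation. The whitespace tokenisation and multiset matching are assumptions and may differ from the implementation of Adlakha et al. (2023).

```python
import math
from collections import Counter
from typing import List, Sequence


def k_precision(response: str, passages: List[str]) -> float:
    """Fraction of response tokens that also occur in the passages (assumed definition)."""
    resp = response.lower().split()
    if not resp:
        return 0.0
    knowledge = Counter(" ".join(passages).lower().split())
    overlap = sum((Counter(resp) & knowledge).values())
    return overlap / len(resp)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applying `pearson` to per-instance K-precision and F1 values is the kind of computation behind the coefficients reported above.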

Figure 2: Histogram of delta (number of supporting passages - number of passages retrieved by GenSco-stop) for subsets of data with 1,2 and 4 supporting passages for 2WikiMultiHop dataset (left to right, top to bottom)
Figure 3: Performance across different sized subsets of the 2WikiMultiHop dataset.

5.2.1 Retrieval Performance

Method F1 Precision Recall
GTR 56.2 42.46 88.57
GenSco-stop 83.32 93.51 78.35
Table 4: Retrieval Performance of GTR and GenSco-stop on 2WikiMultiHop dataset

2WikiMultiHop also specifies which passages support each question-answer pair and which are merely distractors. We use this information to contrast the retrieval performance of GenSco-stop with GTR. Table 4 shows that GenSco-stop has a lower recall but a much higher precision, indicating that GenSco retrieves fewer irrelevant passages. Figure 2 gives a rough idea of the difference between the number of hops taken by GenSco-stop (the number of subquestions generated) and the actual number of supporting passages specified in 2WikiMultiHop. As seen in the plot, the absolute value of delta mostly lies between 0 and 2, showing little divergence of GenSco from the ground truth.
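The retrieval metrics in Table 4 and the delta in Figure 2 can be sketched per instance as follows. This is an illustrative sketch: passages are identified by hypothetical integer ids, and how per-instance scores are aggregated over the dataset is not specified here.

```python
def retrieval_scores(retrieved, supporting):
    """Precision, recall and F1 of retrieved passage ids against the supporting ids."""
    retrieved_set, supporting_set = set(retrieved), set(supporting)
    hits = len(retrieved_set & supporting_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(supporting_set) if supporting_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def delta(retrieved, supporting):
    """Number of supporting passages minus number of retrieved passages (as in Figure 2)."""
    return len(supporting) - len(retrieved)


# A fixed top-5 retriever that finds one of two supporting passages.
print(retrieval_scores([8, 10, 9, 4, 6], [8, 1]))
# A selection that returns exactly the two supporting passages.
print(retrieval_scores([8, 1], [8, 1]))  # (1.0, 1.0, 1.0)
```

A fixed top-k retriever pays for every extra passage in precision, which is consistent with the gap between the two methods in Table 4.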

5.3 Does the order of the retrieved passages really matter?

Since GenSco retrieves the passages in alignment with the implicit reasoning implied by the multi-hop question, the order in which these passages should be input to the Generator LLM is derived as part of our algorithm. To assess whether the order of passages plays any role in the MHQA task, we performed a simple experiment in which the retrieved passage sequences for GenSco-stop and GenSco-max are randomly shuffled before being fed to the Generator LLM. Table 5 shows that the correctness scores degrade on randomizing the order, with the largest drop for GenSco-max (EM falls from 44.4 to 38). This experiment supports the intuition behind GenSco, which is proposed not just as a passage selection technique but also as a technique that identifies the passage sequence. The retrieval baselines are not equipped to determine the order, since passage relevance is computed with respect to the entire question as opposed to its specific parts/subquestions.
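The shuffling control described above amounts to permuting the selected passage sequence before building the prompt. A minimal sketch follows; the function name and the use of a fixed seed are assumptions for illustration, as the seeding used in the experiment is not stated.

```python
import random


def shuffle_passages(passages, seed):
    """Return a randomly permuted copy of the selected passages (input is left untouched)."""
    rng = random.Random(seed)  # local generator, so global random state is not affected
    shuffled = list(passages)
    rng.shuffle(shuffled)
    return shuffled


selected = ["passage for subquestion 1", "passage for subquestion 2", "passage for subquestion 3"]
control = shuffle_passages(selected, seed=0)
# `control` contains the same passages as `selected`, possibly in a different order.
```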

5.4 Is the performance consistent over different sized subsets of the data?

To check whether the performance remains consistent over different sizes of data subsets, we plot each of our four correctness metrics, averaged over the instances, in Figure 3. The aggregate values look fairly stable for GenSco-stop across various splits of the 2WikiMultiHop dataset.

Method EM F1 PR RE
GenSco-stop+FS 41.6 49.92 48.96 54.72
GenSco-stop shuffled+FS 38.4 47.33 45.47 51.97
GenSco-max+FS 44.4 52.86 50.98 57.58
GenSco-max shuffled+FS 38 45.7 43.90 50.18
Table 5: Scores with shuffling the selected passages before prompting the Generator for QA on 2WikiMultiHop (on 250 instances)
Figure 4: Scatter plot of answers for 2WikiMultiHop

6 Conclusion

We introduce an inference technique for passage sequence selection in multi-hop question answering, outperforming baseline retrieval methods, including recent SOTA systems, on multiple datasets. Experimental results demonstrate that the proposed approach not only captures pertinent passages but also offers a logical sequence in which the passages can be effectively processed in LLM prompts for multi-hop QA. Please note that our approach does not substitute the initial stage of passage retrieval from documents. Instead, it offers a finer-level filtering (and reordering) process over a set of already retrieved passages. This approach can therefore be used on top of mainstream retrieval systems for a more refined passage sequence selection ahead of LLM generation for MHQA.

7 Limitations

  1. For the generator LLM, we opted for GPT-3.5, a commercial LLM, as the free alternatives did not demonstrate comparable performance during our experiments. However, with the introduction of recent comparable open source models it is possible that future research could investigate the use of free LLMs instead.

  2. Although our method is comparable with competitive baselines even on datasets with a small set of retrieved passages, we observe that it mostly thrives where we have an average of 10 or more candidate passages to choose from. However, this approach can complement traditional retrieval systems by offering a more precise selection of passage sequences before LLM generation for Multi-Hop Question Answering (MHQA).

References

Appendix A Dataset

We conducted experiments on three different datasets. (i) 2WikiMultiHop Ho et al. (2020) offers a collection of 1000 multi-hop question-answer pairs in the validation set, besides 6 pairs for few-shot learning. For each instance, 10 passages are provided, with some passages being pertinent to answering the question while others act as distractors. We also evaluated the technique on smaller contexts, and hence worked on (ii) Adversarial HotPot Yang et al. (2018); Ye and Durrett (2022), a smaller dataset comprising 308 data instances that provides 4 passages for every question-answer pair. For each instance, 2 passages serve as supporting evidence, while the remaining 2 act as distractors. (iii) MuSiQue Trivedi et al. (2022): following prior work Shao et al. (2023); Press et al. (2023), we evaluate on the 1252 questions from the MuSiQue dev set categorized as 2-hop. (The 3-hop and 4-hop questions are reported to be too intricate; even the authors of the paper found them challenging to comprehend at times Press et al. (2023).)

Appendix B Prompt templates

B.1 For Question Answering

Table 6 shows the structure of the template used for generating an answer using an instruction and few shots. The concatenated passage sequence along with the multihop question to be answered are then added.

Answer the question given the context. Here are a few examples:
Question: Which film was released earlier, Kistimaat or I’M Taraneh, 15?
Context: Kistimaat is a 2014 Bangladeshi action film directed by Ashiqur Rahman and produced by Tiger Media Limited and The Abhi Pictures.The film features Arifin Shuvoo and Achol Akhe in lead roles while Misha Sawdagor plays the main antagonist in the film.The film is about a police officer and his fight against corruption.The film was released on Eid al- Adha, 6 October 2014, and was a commercial success.The movie was inspired by the 2010 Hindi film “Dabangg".I’m Taraneh, 15 is a 2002 Iranian film directed by Rasul Sadrameli.The film was selected as the Iranian entry for the Best Foreign Language Film at the 75th Academy Awards, but it did not make the final shortlist.
Answer: I’M Taraneh, 15
More shots can follow
Question: Question goes here
Context: Passages go here
Answer:
Table 6: Main prompt for drawing out the answer from the Generator
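The template in Table 6 can be assembled programmatically along these lines. This is a sketch only: the function name, the dictionary keys for the shots, and the single-space separator between passages are assumptions, not details given in the paper.

```python
QA_INSTRUCTION = "Answer the question given the context. Here are a few examples:"


def build_qa_prompt(shots, question, passages, separator=" "):
    """Fill the Table 6 template: instruction, few-shot examples, then the target question.

    `shots` is a list of dicts with the (hypothetical) keys 'question', 'context', 'answer'.
    `passages` must already be in the order chosen by the selection method.
    """
    lines = [QA_INSTRUCTION]
    for shot in shots:
        lines.append(f"Question: {shot['question']}")
        lines.append(f"Context: {shot['context']}")
        lines.append(f"Answer: {shot['answer']}")
    lines.append(f"Question: {question}")
    lines.append(f"Context: {separator.join(passages)}")
    lines.append("Answer:")  # left open for the Generator to complete
    return "\n".join(lines)


shots = [{"question": "Example question?", "context": "Example context.", "answer": "Example answer"}]
prompt = build_qa_prompt(shots, "Target question?", ["First passage.", "Second passage."])
```

Because the passages are joined in the order given, the ordering produced by the selection step carries through to the prompt unchanged.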

B.2 For Question Decomposition

Table 7 shows the template used for drawing out the next subquestion from the Generator LLM, conditioned on the history of subquestions and the corresponding selected passages.

I am going to give you a question. I want to decompose it into a series of subquestions. Each subquestion should be self-contained with all the information necessary to solve it. Make sure not to decompose more than necessary or have any trivial subquestions. Do not repeat any subquestion. You’ll be evaluated on the simplicity, conciseness, and correctness of your decompositions. If no more subquestion could be drawn, please generate“<FIN></FIN>". Here are a couple of examples:
Question: What are the other books from the author of “The Good Earth"?
Subquestion 1: Who is the author of the book “The Good Earth"?
Subcontext 1: The author of the book “The Good Earth" is Pearl S. Buck.
Subquestion 2: What are the titles of books written by Pearl S. Buck other than “The Good Earth"?
Question: Which movie came out first “Spiderman 2" or “Batman Begins"?
Subquestion 1: When was the release date of the movie “Spiderman 2"?
Subcontext 1: Spiderman 2 is a 2004 American superhero film based on the Marvel Comics character of the same name.
Subquestion 2: When was the release date of the movie “Batman Begins"?
Question: Question goes here
Subquestion 1: Subquestion 1 generated earlier goes here
Subcontext 1: Passage selected for Subquestion 1 goes here
k-2 pairs of follow up Subquestions and Subcontexts
Subquestion k:
Table 7: Prompt for drawing out the next subquestion from the Generator.
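A sketch of how the history-conditioned prompt in Table 7 might be filled in at step k is given below. The function names and the handling of the instruction text are assumptions; only the line layout and the `<FIN></FIN>` marker come from the template.

```python
FIN_MARKER = "<FIN></FIN>"


def build_decomposition_prompt(instruction_and_shots, question, history):
    """Fill the Table 7 template.

    `instruction_and_shots` is the fixed instruction text plus the worked examples.
    `history` is a list of (subquestion, selected_passage) pairs generated so far.
    The prompt ends with an open 'Subquestion k:' line for the Generator to complete.
    """
    lines = [instruction_and_shots, f"Question: {question}"]
    for index, (subquestion, passage) in enumerate(history, start=1):
        lines.append(f"Subquestion {index}: {subquestion}")
        lines.append(f"Subcontext {index}: {passage}")
    lines.append(f"Subquestion {len(history) + 1}:")
    return "\n".join(lines)


def is_finished(generated_text):
    """True when the Generator signals that no further subquestion can be drawn."""
    return FIN_MARKER in generated_text


history = [("Who is the author of the book?", "The author of the book is ...")]
prompt = build_decomposition_prompt("<instruction and examples>", "Original question?", history)
# `prompt` now ends with the open line "Subquestion 2:".
```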

B.3 For Stopping criterion

The prompt for checking the stopping criterion (using the Scorer LLM) is shown in Table 8. Note that this prompt is invoked twice: once conditioned on the decomposition up to and including subquestion k-1, and once conditioned on the decomposition up to and including subquestion k. No shots are used here.

Generate a question given its complete decomposition into subquestions along with the context containing answers to these subquestions.
Context: Concatenation of Passages selected for Subquestions[1:k]
Decomposition: Concatenation of Subquestions [1:k]
Question:
Table 8: Prompt for computing the value of the introduced stopping criterion using S.
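The two invocations of the Table 8 prompt can be sketched as below. The function names and the space separator used for concatenation are assumptions; how the Scorer's outputs for the two prompts are turned into a stopping decision is defined in Section 3 and is not reproduced here.

```python
STOP_INSTRUCTION = (
    "Generate a question given its complete decomposition into subquestions "
    "along with the context containing answers to these subquestions."
)


def build_stopping_prompt(subquestions, passages, separator=" "):
    """Fill the Table 8 template for a prefix of the decomposition."""
    if len(subquestions) != len(passages):
        raise ValueError("expected one selected passage per subquestion")
    return "\n".join([
        STOP_INSTRUCTION,
        f"Context: {separator.join(passages)}",
        f"Decomposition: {separator.join(subquestions)}",
        "Question:",
    ])


def stopping_prompts(subquestions, passages):
    """Return the prompts for prefixes [1:k-1] and [1:k], where k = len(subquestions)."""
    k = len(subquestions)
    if k < 2:
        raise ValueError("need at least two subquestions to compare prefixes")
    previous = build_stopping_prompt(subquestions[:k - 1], passages[:k - 1])
    current = build_stopping_prompt(subquestions, passages)
    return previous, current
```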

B.4 For Passage selection

Prompt for scoring (using Scorer) the passages based on the generated subquestion is shown in Table 9. Again, no shots are included (and the question generated by the scorer here is ignored). The scorer is prompted here only to draw out the likelihood scores from Scorer for the subquestion produced by the Generator LLM.

Generate a question based on the context.
Context: The passage to be scored goes here
Question:
Table 9: Prompt for computing the relevance scores using S.
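Passage selection with the Table 9 prompt can be sketched with a pluggable scorer. Here `log_likelihood(prompt, continuation)` is a hypothetical callable standing in for the Scorer LLM, and the sketch assumes that a larger returned value means a more relevant passage; the sign convention and any normalization of the actual scores are not specified in this section.

```python
def build_scoring_prompt(passage):
    """Fill the Table 9 template for one candidate passage."""
    return f"Generate a question based on the context.\nContext: {passage}\nQuestion:"


def select_passage(subquestion, passages, log_likelihood):
    """Score every passage for the subquestion and return (best_index, scores).

    `log_likelihood` is a hypothetical function (prompt, continuation) -> float
    that wraps the Scorer model; larger is assumed to mean more relevant.
    """
    scores = [log_likelihood(build_scoring_prompt(p), subquestion) for p in passages]
    best_index = max(range(len(passages)), key=lambda i: scores[i])
    return best_index, scores


def toy_log_likelihood(prompt, continuation):
    """Stand-in scorer for demonstration: counts continuation words seen in the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(word in prompt_words for word in continuation.lower().split()))


passages = [
    "Peter Levin is an American director of film, television and theatre.",
    "The One and Only Ivan is an upcoming American fantasy drama film directed by Thea Sharrock",
]
best, scores = select_passage(
    "Who is the director of the film The One And Only Ivan (Film)?", passages, toy_log_likelihood
)
print(best)  # 1
```

Keeping the scorer behind a plain callable makes it easy to swap the toy word-overlap function for a real language model without touching the selection logic.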

Appendix C Sample Trace of subquestions generated for GenSco-stop

Table 10 shows the 2-hop question logically processed by GenSco-stop. Relevance scores with respect to the subquestion are provided for each passage. The most relevant passage is selected for the first subquestion which feeds into generating the next subquestion. Similarly, the most relevant passage for the next subquestion is selected and subsequently the answer for the multi-hop question is generated. GTR fails to retrieve one of the two relevant passages in top-5 and responds with an incorrect answer taken from a distractor passage.

Question: What is the place of birth of the director of film The One And Only Ivan (Film)?
Subquestion 1: Who is the director of the film The One And Only Ivan (Film)?
1 Thea Sharrock (born 1976) is an English theatre and film director… -0.307
2 Peter Levin is an American director of film, television and theatre. -0.334
3 The One and Only is a 1978 comedy film starring Henry Winkler, directed by Carl Reiner and written by Steve Gordon. -0.488
4 Andrei Virgil Ivan( born 4 January 1997) is a Romanian professional footballer who plays as a forward for Universitatea Craiova. -0.200
5 Katherine Alice Applegate( born October 9, 1956) is an American young adult… -0.654
6 Radu Ivan( born 17 July 1969) is a Romanian judoka who competed at three Olympic Games. -0.070
7 Ian Barry is an Australian director of film and TV. -0.230
8 The One and Only Ivan is an upcoming American fantasy drama film directed by Thea Sharrock,… -0.988
9 Dávid Ivan( born 26 February 1995) is a Slovak professional footballer who plays as a midfielder for Serie B club Chievo. -0.296
10 Marian Ivan( born 1 June 1969 in Bucharest) is a retired Romanian footballer… -0.228
Subquestion 2: What is the place of birth of Thea Sharrock?
1 Thea Sharrock (born 1976) is an English theatre and film director… -1.617
2 Peter Levin is an American director of film, television and theatre. -0.101
3 The One and Only is a 1978 comedy film starring Henry Winkler, directed by Carl Reiner and written by Steve Gordon. -0.594
4 Andrei Virgil Ivan( born 4 January 1997) is a Romanian professional footballer who plays as a forward for Universitatea Craiova. -0.213
5 Katherine Alice Applegate( born October 9, 1956) is an American young adult… -0.481
6 Radu Ivan( born 17 July 1969) is a Romanian judoka who competed at three Olympic Games. -0.207
7 Ian Barry is an Australian director of film and TV. -0.104
8 The One and Only Ivan is an upcoming American fantasy drama film directed by Thea Sharrock,… -1.164
9 Dávid Ivan( born 26 February 1995) is a Slovak professional footballer who plays as a midfielder for Serie B club Chievo. -0.347
10 Marian Ivan( born 1 June 1969 in Bucharest) is a retired Romanian footballer… -0.328
Answer (with the passages(two passages in bold) above selected using GenSco-stop): London, England
1 The One and Only Ivan is an upcoming American fantasy drama film directed by Thea Sharrock…
2 Marian Ivan( born 1 June 1969 in Bucharest) is a retired Romanian footballer…
3 Dávid Ivan( born 26 February 1995) is a Slovak professional footballer who plays as a midfielder for Serie B club Chievo.
4 Andrei Virgil Ivan( born 4 January 1997) is a Romanian professional footballer who plays as a forward for Universitatea Craiova.
5 Radu Ivan( born 17 July 1969) is a Romanian judoka who competed at three Olympic Games.
Answer (with passages above selected using GTR): Bucharest
Table 10: Trace of an example from 2WikiMultiHop using GenSco-stop method and GTR. The generated subquestions and selected passages are bolded for GenSco-stop. Passages are truncated for space limits.

Appendix D Computational cost

Given that we use GPT-3.5 as the Generator, our computational efficiency is constrained by the maximum number of API calls allowed by OpenAI within a given time frame. For each query, we perform up to O(k) inference operations (where k represents the number of passages) using LLAMA-2 and GPT-3.5, and finally, we call GPT-3.5 once to generate an answer to the multi-hop question. Replacing GPT-3.5 with a competitive open-source large language model (LLM) for generation in GenSco could potentially reduce the turnaround time and warrants further exploration in future work.

Method EM F1 PR RE
GenSco-stop+FS 41.6 49.92 48.96 54.72
GenSco-max+FS 44.4 52.86 50.98 57.58
GenSco-stop+FS temp 0.5 41.60 50.67 48.76 55.45
GenSco-max+FS temp 0.5 45.20 53.54 51.62 58.32
Table 11: Variations of GenSco on 250 instances from 2WikiMultiHop (FS here stands for ‘FewShot’) with different temperatures

Appendix E Temperature

Setting the temperature (for the Generator) to 0.5 (see Table 11) gives small gains on most of the correctness metrics: all four improve for GenSco-max, while for GenSco-stop F1 and recall improve, EM is unchanged and precision dips slightly. With a temperature < 1, the model becomes more deterministic, leading to more focused responses, which suits the QA task at hand. Note that the default temperature is set to 0 wherever not mentioned otherwise.
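The effect of a temperature below 1 can be illustrated with the standard temperature-scaled softmax. This is a generic sketch of the textbook formulation with made-up logits; it is not a description of how the Generator's API implements sampling internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually treated as greedy decoding")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # illustrative values only
default = softmax_with_temperature(logits, 1.0)
sharper = softmax_with_temperature(logits, 0.5)
# The most likely token receives more probability mass at temperature 0.5 than at 1.0.
print(sharper[0] > default[0])  # True
```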

Appendix F Can just longer context solve the problem instead?

Even though language models can now take 128k tokens or more Ding et al. (2024) as context in the prompt, this does not solve the problem of reasoning over long documents. As of now, multi-hop reasoning/QA remains a challenge even for SOTA LLMs Mavi et al. (2022). Effectively aligning the correct passages (in the prompt) remains a critical challenge that GenSco can address to prevent hallucinations due to distracting passages in the prompt. Therefore, we posit that GenSco can serve as a fundamental solution for LLMs, regardless of their maximum allowed context lengths.

Appendix G GenSco is an inference method, hence not directly comparable to the full finetuning setup

GenSco scores are not directly comparable with the leaderboards for these datasets, since GenSco is only an inference method that does not rely on the respective training subsets. Fine-tuning Large Language Models (LLMs) in different settings, especially commercial LLMs, can be challenging and sometimes impractical. We draw inspiration from the different approaches researchers have tried to make zero-shot/few-shot settings effective for MHQA Gao et al. (2024). In such approaches, including our work, passage selection is done by leveraging LLMs with in-context learning for Multihop Question Answering (MHQA). The SOTA models on the leaderboard for 2WikiMultiHop, and similarly for the other reported datasets, are trained on the respective train sets, while we propose an inference method that assumes no access to the training examples. This is why the best numbers on the dataset leaderboards are not directly comparable with our setting, whereas other inference-based approaches in prior work are.