Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach

Zhouyu Jiang , Mengshu Sun , Lei Liang , Zhiqiang Zhang
Ant Group
{jiangzhouyu.jzy,mengshu.sms,leywar.liang,lingyao.zzq}@antgroup.com
Abstract

Multi-hop question answering is a challenging task with distinct industrial relevance, and Retrieval-Augmented Generation (RAG) methods based on large language models (LLMs) have become a popular approach to tackle this task. Owing to the potential inability to retrieve all necessary information in a single iteration, a series of iterative RAG methods has been recently developed, showing significant performance improvements. However, existing methods still face two critical challenges: context overload resulting from multiple rounds of retrieval, and over-planning and repetitive planning due to the lack of a recorded retrieval trajectory. In this paper, we propose a novel iterative RAG method called ReSP, equipped with a dual-function summarizer. This summarizer compresses information from retrieved documents, targeting both the overarching question and the current sub-question concurrently. Experimental results on the multi-hop question-answering datasets HotpotQA and 2WikiMultihopQA demonstrate that our method significantly outperforms the state-of-the-art, and exhibits excellent robustness concerning context length.



1 Introduction

Open-domain question answering is a task that involves providing factual responses based on extensive documents (Voorhees and Tice, 2000; Zhang et al., 2023), and it has significant applications in high-profile industry scenarios such as intelligent assistants and generative search engines (OpenAI, 2024; Gemini-Team, 2024). Multi-hop question answering is a common and challenging sub-task within this field, requiring the system to integrate information from multiple sources to complete multi-step reasoning and answer questions (Mavi et al., 2024).

With the rapid development of large language models (LLMs), retrieval-augmented generation (RAG) based on these LLMs has become a popular method for addressing open-domain question answering (Siriwardhana et al., 2023; Lewis et al., 2020; Shi et al., 2024). The typical RAG process involves using a retriever to recall documents from a corpus that are relevant to a given query and using these documents as context inputs for the LLMs to generate responses. However, when dealing with multi-hop question answering, conventional RAG techniques frequently fall short of aggregating all the critical information within a singular retrieval iteration, leading to incomplete or incorrect answers. Consequently, a new genre of iterative RAG methods that incorporate question planning has recently been developed (Shao et al., 2023; Trivedi et al., 2023; Asai et al., 2024). These methods assess after each retrieval whether the information at hand is sufficient for answering the question. If it is not, the methods generate a sub-question for the next step and perform another retrieval, iterating this process until the question can be satisfactorily answered. Owing to the employment of multiple retrieval iterations, iterative RAG has achieved a notable improvement in multi-hop question-answering scenarios compared to the single-round RAG approaches.

However, existing iterative RAG methods still encounter two principal challenges when handling multi-hop question answering. Firstly, due to multiple rounds of retrieval, iterative RAG methods have to deal with a longer context than single-round RAG methods, which introduces more noise from the documents and increases the risk of the model missing key information during response generation (Liu et al., 2024). Secondly, current iterative RAG methods depend heavily on the model’s interpretation of retrieved documents for decision-making and lack a concrete record of the navigated trajectory. This makes it difficult for the model to discern whether the information needed to answer the overarching question has been sufficiently gathered and whether a sub-question has already been retrieved, leading to two issues: an over-planning scenario, wherein the iterative process does not stop even though sufficient information has been retrieved, and a repetitive planning scenario, wherein a sub-question that has already been retrieved is generated again (Yao et al., 2023).

The two challenges mentioned previously are primarily related to the effective processing of information obtained during the retrieval phase. To tackle this, drawing inspiration from query-focused summarization (Dang, 2006; Xu and Lapata, 2020), we introduce the ReSP (Retrieve, Summarize, Plan) approach. This method not only condenses but also functionally decomposes the information accrued in each retrieval episode. Specifically, we integrate a novel LLM-based summarizer within the established iterative RAG framework and refine the iterative process. The summarizer undertakes dual roles: firstly, it compiles a summary of corroborative information from the retrieved documents for the overarching target question, termed the global evidence memory; secondly, it crafts a response for the current sub-question based on the retrieved documents, termed the local pathway memory. At the inception of each iteration, the accumulated global evidence memory and local pathway memory are combined as contextual inputs for the model’s evaluation. Should the information be evaluated as adequate, the procedure advances to the generation of the final response; otherwise, a new sub-question is formulated, with the requirement that the model must not generate previously retrieved sub-questions.

Our experimental findings reveal that, under uniform experimental settings, ReSP markedly surpasses a range of current single-round and iterative RAG approaches when evaluated on multi-hop question-answering benchmarks such as HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). Notably, it improves F1 by 4.1 points over the state-of-the-art (SOTA) on HotpotQA and by 5.9 points over the SOTA on 2WikiMultihopQA. Furthermore, we conducted a series of in-depth comparative studies to discern the effect of model size on performance and confirmed that ReSP is more robust to context length than other RAG methods.

In conclusion, our contributions are as follows:

  • Targeting the multi-hop question-answering scenario, we propose an innovative iterative RAG approach that incorporates query-focused summarization to tackle the context overload problem resulting from multiple rounds of retrieval. In particular, we have refined the summarizer’s function to compress information aimed at both the overarching question and the current sub-question, thereby mitigating over-planning and repetitive planning.

  • Experimental results show that our approach significantly surpasses existing methods in performance, and it exhibits considerable robustness to variations in context length.

2 Related Works

Retrieval-Augmented Generation. Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving relevant documents from external databases and integrating them into the generation process (Lewis et al., 2020; Khandelwal et al., 2020; Izacard and Grave, 2021; Jiang et al., 2024). Recent work can be divided into single-round RAG (Kim et al., 2024; Xu et al., 2024; Shi et al., 2024) and iterative RAG (Trivedi et al., 2023; Shao et al., 2023; Jiang et al., 2023; Asai et al., 2024) based on the number of retrieval rounds. In multi-hop question-answering scenarios, iterative RAG often achieves better results because it allows for detailed decomposition of the question. However, due to the increased number of iterations, iterative RAG faces challenges in long-context processing.

Figure 1: The ReSP framework consists of four modules: Reasoner, Retriever, Summarizer, and Generator. The reasoner makes decisions based on the current memory queues, determining whether to exit the iteration and generate a response or to produce a sub-question for further iteration. The retriever searches the corpus based on the sub-question provided by the reasoner (for the first iteration, the sub-question is the same as the overarching question, thus the reasoner is bypassed). The summarizer performs dual summarization on the retrieved documents, extracting information relevant to both the overarching question Q and the current sub-question Q*, and stores the summaries in the global evidence memory and local pathway memory queues respectively. The generator produces answer A based on the information in the memory queues.

3 Methodology

Figure 1 illustrates our ReSP framework, which consists of four components: Reasoner, Retriever, Summarizer, and Generator. Essentially, Reasoner, Summarizer, and Generator are all based on a fine-tuning-free LLM, designed to execute specific tasks using an array of carefully selected prompt engineering techniques. For specific prompt templates, please see Appendix A. Our main contribution lies in the Summarizer, while the design of the other modules is similar to that of conventional iterative RAG methods.

3.1 Dual-Function Summarizer

As mentioned earlier, our goal is to address issues of context overload and redundant planning. To tackle context overload, a straightforward approach is to employ summarization to compress information. However, even with summarization, the model still lacks an explicit record of the planning path, which does not resolve the issue of redundant planning. During the iterative process, over-planning could arise if summaries overlook information crucial for directly addressing the overarching question, or repetitive planning might occur if the information difference between different rounds of summaries is not significant. Therefore, a more sophisticated design of the summarizer is necessary to distinguish the functions of various pieces of information.

Drawing on the idea of query-focused summarization, we have designed a dual-function summarizer. Confronted with retrieved documents, this summarizer concurrently executes two tasks: producing summaries of supporting information pertinent to the overarching question and generating responses for the current sub-question, while managing two distinct memory queues–the global evidence memory and the local pathway memory. Summaries related to the overarching question are deposited into the global evidence memory, serving to explicitly signal the model to cease iterations when information is enough, thus mitigating the risk of over-planning. Concurrently, responses for the current sub-question, alongside the sub-question itself, are stored in the local pathway memory. This explicitly guides the model’s recognition of the progress in the question planning path as well as the sub-questions that have been historically retrieved, preventing repetitive planning.
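The two memory queues described above can be pictured with a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `MemoryQueues`, its method names, and the choice to record a sub-question even when no answer was found are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MemoryQueues:
    """Hypothetical container for the two queues (names are assumptions)."""

    # Summaries of evidence supporting the overarching question.
    global_evidence: List[str] = field(default_factory=list)
    # (sub-question, answer) pairs; answer is None when none was found.
    local_pathway: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def add_global(self, summary: str) -> None:
        self.global_evidence.append(summary)

    def add_local(self, sub_question: str, answer: Optional[str]) -> None:
        self.local_pathway.append((sub_question, answer))

    def asked_sub_questions(self) -> List[str]:
        """Sub-questions already retrieved, used to avoid repetitive planning."""
        return [question for question, _ in self.local_pathway]

    def combined(self) -> str:
        """Concatenate both queues into one context string for the reasoner."""
        lines = list(self.global_evidence)
        for question, answer in self.local_pathway:
            lines.append(f"{question} {answer}" if answer is not None else question)
        return "\n".join(lines)
```

Keeping the two queues separate is what lets the reasoner see both "what is known about the overarching question" and "which sub-questions have already been tried" as distinct signals.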

3.2 Summary-Enhanced Iterative RAG Process

Here we delineate the refined iterative RAG workflow that incorporates the dual-function summarizer.

Given a query Q and a document corpus D, we initially deploy a retriever to identify the K documents from D that are most relevant to Q. These documents are then directed into the summarizer for summary creation and memory queue updates (note that during the first iteration, the sub-question is essentially the overarching question, so there is no generation of response for the current sub-question). Subsequently, the contents of the two memory queues are concatenated to provide context input for the reasoner, which is responsible for determining whether the current information is sufficient to address the overarching question. Should it be adequate, the iterative process is halted, and the memory queues are utilized as context for the generator to produce the final answer. If the information is insufficient, the reasoner generates a subsequent sub-question Q* that is distinct from previously retrieved sub-questions based on the current context, prompting the next iteration round.
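The workflow above can be sketched as a short control loop. This is an illustrative outline under stated assumptions rather than the paper's code: every callable (`retrieve`, `summarize_global`, `answer_sub_question`, `is_sufficient`, `plan`, `generate`) is a hypothetical stand-in for an LLM or retriever call, the default cap of 3 iterations is taken from the experimental setup, and answering the sub-question from the retrieved documents follows the description in the main text.

```python
from typing import Callable, List, Optional, Tuple


def resp_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    summarize_global: Callable[[str, List[str]], str],
    answer_sub_question: Callable[[str, List[str]], Optional[str]],
    is_sufficient: Callable[[str, str], bool],
    plan: Callable[[str, str, List[str]], str],
    generate: Callable[[str, str], str],
    max_iterations: int = 3,
) -> str:
    """Iterate retrieve -> summarize -> judge/plan, then generate the answer."""
    global_evidence: List[str] = []
    local_pathway: List[Tuple[str, Optional[str]]] = []
    sub_question = question  # first round: the reasoner is bypassed
    context = ""
    for _ in range(max_iterations):
        docs = retrieve(sub_question)
        # Summary targeted at the overarching question.
        global_evidence.append(summarize_global(question, docs))
        # Answer targeted at the current sub-question (skipped in round one).
        if sub_question != question:
            local_pathway.append(
                (sub_question, answer_sub_question(sub_question, docs))
            )
        context = "\n".join(
            global_evidence
            + [f"{q} {a or ''}".strip() for q, a in local_pathway]
        )
        if is_sufficient(question, context):
            break
        # The planner sees what was already asked so it can avoid repeats.
        asked = [question] + [q for q, _ in local_pathway]
        sub_question = plan(question, context, asked)
    # Reached on success or when the iteration cap is hit.
    return generate(question, context)
```

Passing the list of already-asked sub-questions to `plan` is one way to express the requirement that the reasoner must not regenerate a previously retrieved sub-question; in the paper this constraint is imposed through the prompt.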

4 Experimental Settings and Results

4.1 Datasets

We conduct experiments on two multi-hop question-answering benchmark datasets: HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). Following the open-sourced RAG toolkit FlashRAG (Jin et al., 2024), we employ its preprocessed dataset format. For each dataset, we utilize the first 1,000 entries from the original development set for testing. We report the token-level F1 score of answer strings to evaluate the quality of the generation.
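For readers unfamiliar with the metric, token-level F1 can be computed as below. This is a minimal sketch that assumes the common SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); the exact normalization used by the evaluation toolkit is not stated in this paper, and the function names are illustrative.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against one gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "the actor Charlie Murphy" against the gold answer "Charlie Murphy" has precision 2/3 and recall 1, giving an F1 of 0.8.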

4.2 Experimental Setup

In our main experiment, we employ Llama3-8B-instruct (AI@Meta, 2024) as the base model and E5-base-v2 (Wang et al., 2022) as the retriever, while utilizing Wikipedia data from December 2018 as the retrieval corpus. For the model and data links, please refer to the FlashRAG open-source repository (https://github.com/RUC-NLPIR/FlashRAG).

The model’s maximum input length is set to 12,000, and the maximum output length is set to 200. For each query, we retrieve the top-5 documents from the corpus based on vector similarity. The maximum number of iterations is set to 3. If the process has not terminated after 3 iterations, the model proceeds directly to the final response generation. All experiments are conducted on 4 NVIDIA A100 GPUs.
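Top-k retrieval by vector similarity can be illustrated with a few lines of code. This is a toy sketch: it assumes cosine similarity over precomputed embeddings (the text says only "vector similarity"), uses plain Python lists in place of a real dense index, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (document index, score) for the k most similar documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With k = 5 and at most 3 iterations, an iterative method sees up to 15 retrieved documents per question, which is why the context handed to the model grows much faster than in single-round RAG.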

4.3 Baselines

We select representative methods from single-round RAG and iterative RAG as baselines for comparison.

Single-round RAG: Standard RAG directly generates answers based on all retrieved documents. SuRe (Kim et al., 2024) constructs and ranks summaries of the retrieved passages for each of the multiple answer candidates. RECOMP (Xu et al., 2024) compresses retrieved documents into textual summaries before in-context integration. REPLUG (Shi et al., 2024) prepends each retrieved document separately to the input context and ensembles output probabilities from different passes.

Iterative RAG: Iter-RetGen (Shao et al., 2023) leverages the model output from the previous iteration as a specific context to help retrieve more relevant knowledge. IRCoT (Trivedi et al., 2023) guides the retrieval with Chain-of-Thought (CoT) (Wei et al., 2022) and in turn uses retrieved results to improve CoT.

4.4 Main Results

Method       | Pipeline  | HotpotQA | 2Wiki
-------------|-----------|----------|------
Standard RAG | Single    | 38.6     | 20.1
SuRe         | Single    | 33.4     | 20.6
RECOMP       | Single    | 37.5     | 32.4
REPLUG       | Single    | 31.2     | 21.1
Iter-RetGen  | Iterative | 38.3     | 21.6
IRCoT        | Iterative | 43.1     | 32.4
ReSP (ours)  | Iterative | 47.2     | 38.3
Table 1: Performance comparison on HotpotQA and 2WikiMultihopQA. We report the token-level F1 score of answer strings. All methods utilize fine-tuning-free Llama3-8B-instruct for generation.

Our results on HotpotQA and 2WikiMultihopQA are displayed in Table 1. First, we notice that iterative RAG, in particular IRCoT, demonstrates significant performance improvements compared to single-round RAG. This suggests that conducting multiple rounds of retrieval can indeed capture information more comprehensively and produce more accurate responses. Second, within single-round RAG, RECOMP, which incorporates summarization, is the strongest method on 2WikiMultihopQA (32.4 F1) and stays close to standard RAG on HotpotQA (37.5 vs. 38.6 F1), indicating that summarization is an effective method of information compression even within single-round RAG. These findings support the rationale behind our approach, which combines multi-round retrieval with summarization.

Our method, ReSP, achieves significant improvements on both datasets, outperforming the SOTA by 4.1 F1 points on HotpotQA and 5.9 F1 points on 2WikiMultihopQA, and surpassing a range of existing iterative RAG methods. This demonstrates the effectiveness of the approach we propose.

Reasoner            | Summarizer          | Generator           | HotpotQA      | 2Wiki
--------------------|---------------------|---------------------|---------------|--------------
Llama3-8B-Instruct  | Llama3-8B-Instruct  | Llama3-8B-Instruct  | 47.2          | 38.3
Llama3-70B-Instruct | Llama3-8B-Instruct  | Llama3-8B-Instruct  | 48.8 (+1.6pt) | 37.2 (-1.1pt)
Llama3-8B-Instruct  | Llama3-70B-Instruct | Llama3-8B-Instruct  | 47.3 (+0.1pt) | 34.1 (-4.2pt)
Llama3-8B-Instruct  | Llama3-8B-Instruct  | Llama3-70B-Instruct | 48.2 (+1.0pt) | 38.7 (+0.4pt)
Table 2: Impact of base model size on different modules.

5 Empirical Analysis

To further analyze ReSP, we conduct comparative experiments to investigate the impact of the base model size on modules’ performance. Additionally, we examine the robustness of the method to context length. Through case studies, we demonstrate that ReSP can mitigate the issues of over-planning and repetitive planning.

5.1 Impact of the base model size

ReSP features a modular design, wherein each module works independently, allowing for the use of different base models to collaborate. To provide empirical guidance for model selection in practical applications, we test how the size of the model affects the performance of each module.

We substitute Llama3-70B-Instruct for Llama3-8B-Instruct and use this larger model as the base for the Reasoner, Summarizer, and Generator modules respectively, comparing the effect changes with the original results. Table 2 presents our experimental results.
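The per-module changes reported in Table 2 are simple differences from the all-8B baseline. The short sketch below recomputes them from the scores in the table; the dictionary layout and the function name `deltas` are illustrative choices, not part of the original experiments.

```python
from typing import Dict

# F1 scores copied from Table 2.
BASELINE: Dict[str, float] = {"HotpotQA": 47.2, "2Wiki": 38.3}
WITH_70B: Dict[str, Dict[str, float]] = {
    "Reasoner": {"HotpotQA": 48.8, "2Wiki": 37.2},
    "Summarizer": {"HotpotQA": 47.3, "2Wiki": 34.1},
    "Generator": {"HotpotQA": 48.2, "2Wiki": 38.7},
}


def deltas(
    baseline: Dict[str, float], swapped: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """F1 change when one module uses the 70B model instead of the 8B model."""
    return {
        module: {
            dataset: round(score - baseline[dataset], 1)
            for dataset, score in scores.items()
        }
        for module, scores in swapped.items()
    }


if __name__ == "__main__":
    for module, change in deltas(BASELINE, WITH_70B).items():
        print(module, change)
```

Only the generator row is positive on both datasets, which is the observation the recommendations later in this section rest on.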

Firstly, regarding the reasoner module, the changes are inconsistent across the two datasets, with an improvement on HotpotQA but a decline on 2WikiMultihopQA. The reason for this inconsistency is that 2WikiMultihopQA has questions with more logical hops compared to HotpotQA. A larger model is likely to give more detailed planning steps, leading to a failure to obtain all the necessary information to answer the question within the set maximum number of iterations, hence causing a drop in performance.

Secondly, for the summarizer module, we observe that using a larger model size does not result in performance improvements; in fact, there is a significant decline on 2WikiMultihopQA. Upon reviewing the summarization logs, we find that Llama3-70B-Instruct is more lenient in discerning relevance. It tends to extract information that seems related but is actually irrelevant to the question, which can disrupt the planning and ultimately the generation of responses.

Lastly, regarding the generator module, we observe consistent improvements when using Llama3-70B-Instruct, which suggests that even when provided with clear evidence, a model with stronger semantic comprehension capabilities still aids in generating more accurate responses.

In summary, in real-world applications, it is advisable to allocate a larger base model to the generator, as long as the available resources allow for it. However, for the reasoner module, if the allowable number of iterations is low, there is no need to use a larger base model. The summarizer also does not require a larger base model.

5.2 Robustness to context length

To determine whether ReSP can address the issue of context overload, we adjust the number of documents retrieved per iteration and observe the changes in performance. Our comparison involves using standard RAG and IRCoT as control groups. These experiments are carried out on the HotpotQA dataset. Results are shown in Figure 2.

Figure 2: Bar chart of the performance variations of different RAG methods with varying numbers of retrieved documents per iteration. We report the token-level F1 score obtained from testing on the HotpotQA dataset.

Firstly, when the number of documents retrieved per iteration k is set to 3, all three methods experience a notable decrease in performance. This indicates that the adequacy of information retrieved in a single round significantly affects both single-round and iterative RAG methods. Therefore, in practical applications, k should be large enough that each round retrieves adequate information.

Issue: Over-planning
Question: Victor Varnada directed Twisted Fortune, which starred which brother of Eddie Murphy?
IRCoT:
  Round 0 Thought: "The text does not mention Twisted Fortune, it talks about Eddie Murphy’s career and his brother Charlie Murphy. What’s the next step?"
ReSP:
  Round 0 Thought: [Global Evidence Memory]: "Victor Varnado directed the black comedy Twisted Fortune, which starred Charlie Murphy, a brother of Eddie Murphy."
    -> [Judgement]: "Yes"
    -> [Response]: "Charlie Murphy"

Issue: Repetitive planning
Question: What is Rachelle Amy Beinart’s role in the film that follows a group of rebels on a mission?
IRCoT:
  Round 0 Thought: "Rachelle Amy Beinart is not mentioned in any of the provided Wikipedia articles."
  Round 1 Thought: "Rachelle Amy Beinart is not mentioned."
ReSP:
  Round 0 Thought: [Sub-question]: "Who is Rachelle Amy Beinart?"
  Round 1 Thought: [Sub-question]: "What film mentioned in the provided passages features a group of rebels on a mission?"

Table 3: Case studies comparing IRCoT and ReSP. They are conducted to provide evidence of ReSP’s capability in addressing the issues of over-planning and repetitive planning.

When k is greater than 5, the standard RAG and IRCoT exhibit a performance degradation trend. Particularly, IRCoT, which utilizes iterative retrieval, suffers from a more significant performance drop due to the accumulation of retrieved information. This demonstrates that context overload has a pervasive impact on existing RAG methods.

Our method demonstrates exceptional robustness to context length, delivering consistent performance regardless of whether k is set to 15 or 5. This is because we extract key information in each iteration, effectively maintaining a stable and concise context for the generator. As a result, the generator remains unaffected by changes in the length of retrieved documents during the process.

5.3 Case Study

We present evidence of ReSP addressing over-planning and repetitive planning by comparing cases of IRCoT and ReSP on two questions, as shown in Table 3.

In the first case, the retrieved documents are identical for both methods since the initial retrieval question is the same. However, the two models make different decisions about what to do next. IRCoT, which integrates information processing and planning in one step, faces a more complex task. It misses information related to "Twisted Fortune", which leads the model to decide that further retrieval is needed, resulting in over-planning. On the other hand, ReSP accurately and comprehensively acquires the supporting facts related to the overarching question through the summarizer. Consequently, the reasoner determines that the question can be answered, and the generator produces the correct response, thereby halting the iteration after the first round of retrieval.

In the second case, at the end of the first round of retrieval, both models make similar decisions due to the absence of information related to the main subject, "Rachelle Amy Beinart" in the documents: they both query for documents related to "Rachelle Amy Beinart". However, due to limitations in document coverage or retriever capability, no relevant documents on "Rachelle Amy Beinart" are found. At this point, the two models propose different sub-questions. Lacking a recorded retrieval trajectory, IRCoT can only make judgments based on the current information, thus continuing to query "Rachelle Amy Beinart", leading to repetitive planning. In contrast, ReSP, which avoids outputting previously retrieved sub-questions, adjusts the retrieval subject and produces a sub-question about "film that follows a group of rebels on a mission", thereby avoiding repetitive planning.

6 Conclusion

In this work, we propose an iterative RAG approach that incorporates query-focused summarization. By employing a dual-function summarizer to simultaneously compress information from retrieved documents targeting the overarching question and the current sub-question, we address the context overload and redundant planning issues commonly encountered in multi-hop question answering. Experimental results demonstrate that our method significantly outperforms other single-round and iterative RAG methods. Furthermore, we hope that our empirical analysis will aid the community in practical applications.

References

Appendix A Prompt Templates of Modules

The prompt templates of different modules in ReSP are shown in Table 4.

Module: Reasoner
Function: Judge
Prompt: Judging based solely on the current known information and without allowing for inference, are you able to completely and accurately respond to the question Overarching question? \nKnown information: Combined memory queues. \nIf you can, please reply with ’Yes’ directly; if you cannot and need more information, please reply with ’No’ directly.

Module: Reasoner
Function: Plan
Prompt: You serve as an intelligent assistant, adept at facilitating users through complex, multi-hop reasoning across multiple documents. Please understand the information gap between the currently known information and the target problem.Your task is to generate one thought in the form of question for next retrieval step directly. DON’T generate the whole thoughts at once!\n DON’T generate thought which has been retrieved.\n [Known information]: Combined memory queues\n[Target question]: Overarching question\n[You Thought]:

Module: Summarizer
Function: Global Evidence
Prompt: Passages: docs\nYour job is to act as a professional writer. You will write a good-quality passage that can support the given prediction about the question only based on the information in the provided supporting passages. Now, let’s start. After you write, please write [DONE] to indicate you are done. Do not write a prefix (e.g., ’Response:’) while writing a passage.\nQuestion:Overarching question\nPassage:

Module: Summarizer
Function: Local Pathway
Prompt: Judging based solely on the current known information and without allowing for inference, are you able to respond completely and accurately to the question Sub-question? \nKnown information: Combined memory queues. If yes, please reply with ’Yes’, followed by an accurate response to the question Sub-question, without restating the question; if no, please reply with ’No’ directly.

Module: Generator
Function: Response Generation
Prompt: Answer the question based on the given reference.\nOnly give me the answer and do not output any other words.\nThe following are given reference: Combined memory queues\nQuestion: Overarching question

Table 4: The prompt templates of different modules in ReSP. The input parameters include Overarching question: the input question; Sub-question: the current sub-question generated by the reasoner; Combined memory queues: the concatenated content of the global evidence memory and local pathway memory queues for each iteration; docs: the documents retrieved from a single round of retrieval.
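To make the parameter substitution concrete, the sketch below fills the Reasoner's Judge template and interprets the reply. It is an illustration only: the placeholder names `{question}` and `{memory}`, the helper names, the newline-joined concatenation of the two queues, and the prefix-based parsing of the reply are assumptions, and straight quotes are used in place of the typographic ones in the table.

```python
from typing import Sequence

# Judge template from Table 4 with hypothetical named placeholders.
JUDGE_TEMPLATE = (
    "Judging based solely on the current known information and without "
    "allowing for inference, are you able to completely and accurately "
    "respond to the question {question}? \n"
    "Known information: {memory}. \n"
    "If you can, please reply with 'Yes' directly; if you cannot and need "
    "more information, please reply with 'No' directly."
)


def build_judge_prompt(
    question: str,
    global_evidence: Sequence[str],
    local_pathway: Sequence[str],
) -> str:
    """Concatenate both memory queues and substitute them into the template."""
    memory = "\n".join(list(global_evidence) + list(local_pathway))
    return JUDGE_TEMPLATE.format(question=question, memory=memory)


def parse_judgement(reply: str) -> bool:
    """Treat a reply that starts with 'yes' (case-insensitive) as sufficient."""
    return reply.strip().lower().startswith("yes")
```

A reply of "Yes" would stop the iteration and hand the same combined memory to the Generator prompt; any other reply would trigger the Plan prompt.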