
Evaluating LLMs’ Inherent Multi-hop Reasoning Ability

Jian Wu1    Linyi Yang2     Zhen Wang1
Manabu Okumura1     Yue Zhang2
1Tokyo Institute of Technology    2School of Engineering, Westlake University
[email protected],[email protected], [email protected]
[email protected], [email protected]
Abstract

While Large Language Models (LLMs) excel in question-answering (QA) tasks, their ability to perform multi-step reasoning that integrates multiple pieces of evidence in multi-hop QA (MHQA) tasks remains underexplored. LLMs sometimes generate answers that rely on internal memory rather than on reasoning over the given context, which raises concerns about how well existing evaluations measure real reasoning ability. Counterfactual QA tasks can separate internal memory from reasoning ability, but focusing solely on final-QA performance without evaluating the multi-step reasoning process is insufficient for reporting LLMs’ real reasoning ability. Current MHQA benchmarks are factual and annotated on open-source corpora such as Wikipedia; although useful for evaluating multi-step reasoning, they are limited by potential data contamination in the LLM pre-training stage. To address this issue, we introduce the Inherent Reasoning Evaluation (IRE) method, a novel evaluation approach that jointly evaluates LLMs’ chain-of-reasoning performance on the first knowledge-edited counterfactual multi-hop QA data, which is built by editing the original Wikipedia passages to reduce data contamination risks. IRE comprehensively assesses reasoning chains through sub-QA and final-QA evaluations. Our comparisons reveal significant performance gaps for several LLMs between Wikipedia-based benchmarks and IRE, suggesting data contamination issues in existing benchmarks. We believe that the IRE benchmark will enhance and facilitate trustworthy LLM evaluations.

1 Introduction

Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in improving Large Language Models’ (LLMs) reasoning performance by integrating multiple pieces of evidence from a given context into the generative process, so as to provide accurate and relevant responses [1, 2]. However, it remains unclear whether LLMs perform reasoning, to what extent they are capable of multi-step reasoning, and how to evaluate the reasoning chain [3, 4]. Previous multi-hop QA (MHQA) benchmarks such as HotpotQA [5] and 2WikiMultihopQA [6] can be used to evaluate LLMs’ multi-step reasoning through final-QA and evidence retrieval evaluation, but these two evaluations alone are not sufficient for interpreting the multi-step reasoning ability of LLMs [7, 8, 9]. In this work, we explore whether LLMs follow the desired reasoning chain to reach their prediction and how well they do so.

Previously, [10] introduced a sub-QA dataset derived from HotpotQA [5] and conducted experiments focused on sub-question reasoning. Their results reveal that strong QA models such as DFGN [11], DecompRC [12], and CogQA [13] can correctly answer multi-hop questions but often do so through an incorrect reasoning chain. This observation highlights the importance of evaluating LLMs’ multi-step reasoning abilities by assessing their performance on sub-QA. Moreover, previous MHQA benchmarks such as HotpotQA and 2WikiMultihopQA are factual and Wikipedia-based, and therefore face the challenge of data contamination, since LLMs are already pre-trained on large open-source corpora such as Wikipedia. There is a risk that benchmark data was memorized during the pre-training stage, which may inflate LLMs’ reasoning performance on MHQA tasks and complicate the assessment of their real reasoning abilities. For instance, on the top left of Figure LABEL:comparison, the question "Who was the captain of the Argentine team that was born in 1987?" is input into ChatGPT with and without the related context. ChatGPT outputs the same answer in both cases, demonstrating that it can provide the same response from memory regardless of the context provided.

Counterfactual QA is an effective method for disentangling LLMs’ internal memory from their reasoning abilities [14, 15, 16, 17, 18] because the given questions have answers that do not exist in the real world (counterfactual), preventing LLMs from relying solely on their internal memory to generate responses [19]. For example, the question "Which position did the coach play that serves the Chinese national basketball team and was born in 1997?" at the bottom of Figure LABEL:comparison is counterfactual because no person born in 1997 has served as a coach of the Chinese national basketball team. ChatGPT fails to generate a correct answer from memory without the context and can only answer the question correctly when given the context. This observation indicates that reasoning over counterfactual contexts steers LLMs toward the provided context rather than their pre-trained internal knowledge. Current counterfactual QA benchmarks [14, 15, 16, 17], although designed to evaluate LLMs’ real reasoning abilities, primarily assess the single-step reasoning performance of RAG systems and do not provide a comprehensive evaluation of multi-step reasoning. To evaluate LLMs’ real multi-step reasoning abilities more comprehensively, it is vital to establish a continuous and accessible reasoning path.

To address the aforementioned limitations of factual MHQA and counterfactual QA benchmarks, we introduce IRE (Inherent and multi-step Reasoning Evaluation), which reports LLMs’ real multi-step reasoning performance by jointly scoring the reasoning chain, treating sub-QA performance as equally important as final-QA performance. To conduct experiments in a fair setting, we propose a knowledge-edited counterfactual MHQA dataset, synthesized from Wikipedia passages that LLMs may have been exposed to, to comprehensively and objectively assess LLMs’ inherent multi-step reasoning ability. Figure LABEL:comparison contrasts our synthesized counterfactual multi-step reasoning evaluation with single-step reasoning evaluation and with previous Wikipedia-based MHQA evaluation, such as HotpotQA [5], 2WikiMultihopQA [6] and MuSiQue [20].

We conduct extensive experiments and use IRE to comprehensively and objectively report LLMs’ real reasoning ability, based on equal amounts of data from previous Wikipedia-based MHQA benchmarks, which serve as control groups, and our synthesized counterfactual MHQA data. The experimental results reveal two types of inflated performance in LLMs: 1) an obvious performance gap between Wikipedia-based MHQA benchmarks and our synthesized data; 2) inflated performance due to a low proportion of correct reasoning chains, with GPT-4 achieving only 36.3% correct reasoning chains on our 2-hop dataset. Additionally, we observe that incorporating sub-questions into the prompt as part of the reasoning chain is an effective way to improve model performance, which makes it a promising research direction for enhancing LLMs’ reasoning abilities. To the best of our knowledge, we are the first to introduce counterfactual data into the evaluation of the multi-step reasoning ability of LLMs, finding a significant gap between LLMs’ performance on factual Wikipedia-based MHQA data and on our synthesized counterfactual MHQA data. Furthermore, we reveal the risk of data contamination and the urgency of evaluating LLMs’ real ability under a different paradigm. The randomly selected Wikipedia-based MHQA data and our synthesized counterfactual MHQA data are available at https://anonymous.4open.science/r/LLM_inherent_multi_step_eval-3818/.

2 Related Work

Retrieval Augmented Generation RAG improves LLMs’ responses [21] and also mitigates hallucinations, thereby enhancing the models’ credibility [22]. [23] designs a RAG system for multi-hop question answering and claim verification tasks. These tasks require the extraction of evidence from two or more documents to produce a correct answer. [24] proposes a Multi-hop RAG benchmark, which consists of a large collection of multi-hop queries, ground-truth answers, and the corresponding supporting evidence. Multi-hop RAG requires an LLM to reason over the evidence and answer multi-hop queries. However, LLMs’ memorized knowledge sometimes conflicts with the given context, emphasizing the importance of correcting LLMs’ generations with new facts. [14, 15, 16, 17, 18] propose counterfactual QA benchmarks that separate LLMs’ parametric (internal) knowledge from contextual (external) knowledge, constraining LLMs to reason strictly over the given context by editing the contextual information or prompts. These works motivate us to explore LLMs’ real reasoning ability by having them reason in counterfactual contexts. However, counterfactual QA datasets still assess only final-QA performance and lack an evaluation of the reasoning process.

Multi-hop QA Datasets Multi-hop QA requires more than one reasoning step over multiple paragraphs to answer a question [25, 5, 6, 20]. Notably, [10] introduced a human-validated sub-question dataset derived from the HotpotQA dataset [5] and investigated in detail the models’ ability to reason through sub-questions. Their findings revealed that notable models like DFGN [26], DecompRC [12], and CogQA [13] exhibit deficiencies in resolving sub-questions, even though they may successfully answer the overarching multi-hop question. Moreover, Wikipedia-based MHQA datasets face the challenge of data contamination, which makes it hard to evaluate the reasoning ability of LLMs objectively and truthfully. Data contamination, i.e., the presence of test data from downstream tasks in the training data of LLMs, is a major issue in measuring LLMs’ real performance on other tasks. For example, HotpotQA [5], 2WikiMultihopQA [6], and MuSiQue [20] can be applied to evaluate the multi-step reasoning performance of LLMs. Typically, evaluating LLMs on MHQA datasets involves using RAG to retrieve and reason over context with a single step of retrieval. However, single-step retrieval can yield insufficient context for complex questions, as it provides a limited scope of information [2]. [4] proposes a framework that defines good reasoning chains in terms of Correctness and Informativeness, to capture whether a previous reasoning step helps the current reasoning step and the final answer. Such a framework, although useful, still faces the challenge of data contamination.

Benchmarking Data Leakage [27] surveys recent work on detecting data contamination and releases a Python library named LLMSanitize that implements major contamination detection methods. A handful of recent studies have provided several strategies, methods, and benchmarks for detecting contamination without the need to access pre-training data [28, 29, 30, 31]. However, these data contamination detection benchmarks need to be updated dynamically as LLMs develop and pre-training data expands. Such dynamic maintenance is time-consuming and laborious, whereas our proposed benchmark IRE, based on knowledge editing, is fixed and keeps the test data clean. To statically and quantitatively measure the extent of LLMs’ data contamination, [32] proposes a detection pipeline that computes perplexity and N-gram accuracy to evaluate potential data leakage. [33] designs a new Grade School Math 1000 (GSM1k) benchmark that mirrors the GSM8k benchmark [34] to evaluate LLMs’ mathematical reasoning ability. Clean-Eval [35] is a benchmark for assessing the inflated performance of LLMs; its experimental results show a drop in performance for GPT-3.5 and Llama-2 on the Clean-Eval data, indicating a data contamination problem. [36] proposes a detection pipeline based on the perplexity and N-gram accuracy gap between the previous MathQA dataset and synthesized data, pinpointing the potential data leakage problem. These measurements underscore the urgency of a paradigm shift in how we approach the development and evaluation of LLMs. To properly and objectively investigate multi-step reasoning abilities, it is crucial to disentangle LLMs’ inherent memory from their reasoning capabilities. [37] introduces a novel benchmark for the counterfactual QA task by editing six common reasoning schemes in the real world, but it still lacks multi-step reasoning evaluation for LLMs.

3 Inherent Reasoning Evaluation

In this section, we describe the pipeline for synthesizing counterfactual MHQA data and our novel evaluation method. As shown in Figure LABEL:Annotation_Framework, the framework has three main parts: 1) an LLM such as GPT-4 acts as a passage annotator that rewrites passages from Wikipedia (https://www.wikipedia.org/), with human evaluation and feedback; 2) the LLM automatically annotates counterfactual QA pairs, again with human evaluation and feedback; 3) after obtaining the synthesized counterfactual MHQA data, we use IRE to evaluate several strong LLMs and report their real multi-step reasoning performance.

3.1 Data Construction for IRE

3.1.1 Data Collection and Passage Rewriting

We first randomly select 300 Wikipedia passages as the raw context. Inspired by recent studies on LLMs’ ability to aid human annotation [38, 39] and by the counterfactual data augmentation framework [40], we design a pipeline for automatically annotating Wikipedia passages into counterfactual passages. Given a raw Wikipedia passage, the LLM is required to act as a passage annotator and perform named entity, noun phrase, and synonym replacement. We then translate the replaced text into Chinese and back-translate it into English. For example, the words in red in the annotated passage in Figure LABEL:Annotation_Framework are the replaced named entities, noun phrases, and synonyms, and the new counterfactual passage is rewritten from the original Wikipedia passage. After obtaining the new counterfactual passage, human experts manually check the data quality in two respects: 1) grammar, checking whether the annotated passage has grammatical issues; and 2) novelty, making sure the annotated passage is new to LLMs. Since LLMs are pre-trained on open-source corpora collected from the Internet, we manually search the Internet for the key information of the annotated passage, e.g., the characters, times, events, causes, processes, and results.

3.1.2 QA-pair Annotation and Checking

Secondly, we also use LLMs to generate new multi-hop questions that fit the rewritten passages. To make sure the generated QA pairs are correct and related to the given passages, we check the QA pairs from two perspectives: 1) grammar; 2) answerability, i.e., whether the question is related to the passage and whether the answer can be reasoned from the passage based on the generated question. To evaluate LLMs’ performance on multi-hop questions of different complexity, we annotate 3 complex questions for each passage: one 2-hop, one 3-hop, and one 4-hop question. LLMs are also required to generate the corresponding sub-questions along with each multi-hop question for reasoning chain evaluation. For example, the middle part of Figure LABEL:Annotation_Framework shows the re-annotated context, the newly generated multi-hop question, the sub-questions, and the intermediate answers. The annotated answers are all short answer spans, following the settings of HotpotQA, 2WikiMultihopQA, and MuSiQue [5, 6, 20]. We use EM and F1 scores to measure the LLMs’ output. After obtaining the rewritten passages with the corresponding QA pairs and sub-QA pairs, we also check the logic of the whole passage and the QA pairs. We expect the passage to be coherent and the answers to be derivable from the passage based on the given questions. The prompts for data annotation and more annotated examples are shown in Appendix A.

3.1.3 Dataset Analysis and Statistics

Table 1 shows the statistics of our dataset. We re-annotated 300 unique Wikipedia passages, each with three multi-hop questions (one 2-hop, one 3-hop, and one 4-hop), for a total of 3600 unique QA pairs: 900 multi-hop questions and 2700 sub-questions with corresponding answers. The 2700 sub-questions are the decompositions of the 900 multi-hop questions. Following the settings of previous LLM evaluation benchmarks [41, 42], we treat all 3600 QA pairs as the test set. Following benchmarks such as HotpotQA [5] and 2WikiMultihopQA [6], we propose a taxonomy of fine-grained question types, with examples, commonly used in multi-hop QA, illustrated in Table 8. In the HotpotQA, 2WikiMultihopQA, and MuSiQue datasets, multi-hop questions are usually of two types, or a combination of the two: Bridge and Comparison. A Bridge question requires finding the bridge entity that connects the sub-questions, while a Comparison question compares two or more entities across parallel sub-questions. We also focus on these two types when annotating multi-hop questions.
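The dataset sizes quoted above follow from simple counting. The short sketch below reproduces them, assuming (as the evaluation section states) that every N-hop question is decomposed into exactly N sub-questions; the variable names are illustrative only.

```python
# Counting sketch for the dataset statistics reported in Table 1.
# Assumption: an N-hop question has exactly N sub-questions.
PASSAGES = 300
HOPS = (2, 3, 4)  # one question of each hop count per passage

multi_hop_questions = PASSAGES * len(HOPS)
sub_questions_per_subset = {hops: PASSAGES * hops for hops in HOPS}
sub_questions = sum(sub_questions_per_subset.values())
total_qa_pairs = multi_hop_questions + sub_questions

print(multi_hop_questions)       # 900
print(sub_questions_per_subset)  # {2: 600, 3: 900, 4: 1200}
print(sub_questions)             # 2700
print(total_qa_pairs)            # 3600
```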

3.1.4 Human Agreement on Data Annotation

After reviewing and checking the annotated data manually, we randomly sample 300 instances from our synthesized data (100 each for 2-hop, 3-hop, and 4-hop) and assign each instance to two reviewers (expert human annotators). We ask them to select and discard the annotated data that poorly follows the given guidelines. The experts first check the grammar of the annotated questions, sub-questions, sub-answers, and answers. They then search the Internet for the knowledge contained in the annotated passages to see whether that content has appeared before. According to our statistics, the agreement rate between the annotators on the randomly selected IRE samples is 94%, and the human agreement rates are 96%, 92%, and 95% on the 2-hop, 3-hop, and 4-hop datasets, respectively. This suggests that our synthesized instances follow the annotation guidelines well and achieve high agreement among expert annotators. To illustrate data quality quantitatively, we also use GPT-4 to score each selected instance from two perspectives: correctness and informativeness. Each instance is assigned 1 or 0 on each perspective, meaning correct or incorrect, and informative or not informative. Correctness indicates whether the answer can be reasoned from the given question and context, while informativeness indicates whether the QA pairs and the context are related.


| 2 hop dataset | Value | 3 hop dataset | Value | 4 hop dataset | Value | Whole data property | Value |
|---|---|---|---|---|---|---|---|
| 2 hop QA pairs | 300 | 3 hop QA pairs | 300 | 4 hop QA pairs | 300 | Unique Passages | 300 |
| Sub-QA pairs | 600 | Sub-QA pairs | 900 | Sub-QA pairs | 1200 | Total QA pairs | 3600 |
| Correctness | 94 | Correctness | 93 | Correctness | 92 | Sentences per data (Avg) | 38.42 |
| Informativeness | 85 | Informativeness | 86 | Informativeness | 82 | Inter-annotator Agreement | 94% |

Table 1: The statistics of each subset of our synthesized data and the whole dataset.

| Benchmarks | Data Type | Data Source | Task | Reasoning Chain |
|---|---|---|---|---|
| HotpotQA | Factual | Wikipedia | Multi-hop/final-QA | |
| 2WikiMultihopQA | Factual | Wikipedia | Multi-hop/final-QA | |
| MuSiQue | Factual | Wikipedia | Multi-hop/final-QA | |
| DisentQA | Counterfactual & Factual | Natural Questions | Single-hop/final-QA | |
| IfQA | Counterfactual | Crowdsourcing | Open Domain/final-QA | |
| Ours | Counterfactual | Rewriting Wikipedia | Multi-hop/final-QA/sub-QA | |

Table 2: Differences between our synthesized data and previous factual and counterfactual QA evaluation benchmarks.

3.2 Evaluation Method

Multi-hop QA evaluation refers to finding answers to complex questions that require multiple reasoning steps over the given passages. We employ three evaluation methods to assess the correctness of LLM-generated MHQA responses: sub-question answering evaluation, reasoning chain evaluation, and the joint performance of sub-QA and MHQA.

Sub-QA Evaluation

This part is the basis of the whole experiment and of all evaluation results. Following reading comprehension evaluation [43], we assess model performance through lexical matching with two widely used metrics: we employ F1 and EM scores to evaluate the answers to sub-questions, as in the single-hop QA task.
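As a concrete reference for the two lexical-matching metrics, the sketch below computes EM and token-level F1 for a single prediction. It assumes the answer normalization commonly used in reading-comprehension evaluation (lowercasing, removing punctuation and English articles); the paper does not specify its normalization, so that step and the function names are assumptions.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization; the paper does not state which one it uses.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A longer prediction that contains the gold span: EM fails, F1 gives partial credit.
print(exact_match("4 August 2002", "2002"))  # 0.0
print(f1_score("4 August 2002", "2002"))     # 0.5
```

Dataset-level EM and F1 would then be averages of these per-example scores over all sub-questions or final questions.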

Reasoning Chain Evaluation of Multi-hop QA

Table 2 illustrates the differences between our evaluation method and previous evaluation methods for counterfactual QA and factual MHQA datasets. To interpret the behavior of existing LLMs on each hop of the reasoning process required for multi-hop questions, and to determine their ability to answer simple questions, we follow the experimental setting proposed by [10]. For example, in the 2-hop dataset, each instance contains a 2-hop question, 2 sub-questions, 2 intermediate answers, and a final answer. To understand whether LLMs reach correct answers by following the right reasoning chain, we calculate the proportions of correct and incorrect reasoning chains to compare LLMs’ reasoning performance. Each question or sub-question has two possible outcomes, correct or incorrect, so an N-hop question with its N sub-questions has $2^{N+1}$ different reasoning chains. Due to space limitations, we collect correctness statistics only for the 2-hop dataset, over $q_{sub1}$, $q_{sub2}$, and $q$, and show the percentage of each of the 8 reasoning chains produced by LLMs.
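The chain categories can be enumerated mechanically. The sketch below lists the $2^{N+1}$ correct/wrong patterns for an N-hop question and tallies the share of examples in each pattern; the per-example outcomes in the usage lines are hypothetical, and the function names are illustrative rather than taken from the paper.

```python
from collections import Counter
from itertools import product


def chain_patterns(n_hops: int) -> list[tuple[str, ...]]:
    """All correct ('c') / wrong ('w') patterns over N sub-questions plus the final question."""
    return list(product("cw", repeat=n_hops + 1))


def chain_distribution(outcomes: list[tuple[bool, ...]]) -> dict[tuple[str, ...], float]:
    """Percentage of examples per pattern.

    Each outcome is (sub_1_correct, ..., sub_N_correct, final_correct).
    """
    n_hops = len(outcomes[0]) - 1
    counts = Counter(tuple("c" if ok else "w" for ok in outcome) for outcome in outcomes)
    total = len(outcomes)
    return {p: 100.0 * counts.get(p, 0) / total for p in chain_patterns(n_hops)}


# Hypothetical outcomes for four 2-hop examples (not data from the paper).
toy_outcomes = [
    (True, True, True),     # right chain, right answer
    (True, False, True),    # right answer through a wrong chain
    (False, False, False),  # everything wrong
    (True, True, True),
]
distribution = chain_distribution(toy_outcomes)
print(len(chain_patterns(2)))          # 8
print(distribution[("c", "c", "c")])   # 50.0
```

With `product("cw", repeat=3)` the patterns come out in the order c c c, c c w, c w c, and so on, which is the row order used for the 2-hop statistics.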

The Joint Performance of Sub-QA and Multi-hop QA

Previous MHQA benchmarks are traditionally evaluated with the EM or F1 score on the final answer [43, 5, 6], which captures only part of the picture. Previous MHQA systems and LLMs are treated as black boxes, and we cannot tell how they arrive at the final answer. Hence, final-answer evaluation is limited, as it does not consider whether a system answers the sub-questions correctly. To understand the impact of sub-QA on MHQA, we introduce a joint performance that combines the evaluation of sub-QA performance and MHQA performance. The details of computing the joint scores are given in Appendix B.

4 Experiments

We conduct comprehensive experiments and evaluate different LLMs by using IRE to answer the following questions: 1) Do LLMs show a performance gap between the Wikipedia-based factual MHQA datasets and our synthesized counterfactual MHQA data? 2) When inputting counterfactual questions, how do LLMs perform in terms of their reasoning ability? 3) How do sub-questions affect the performance of LLMs? 4) How do LLMs perform on reasoning chain evaluation? These investigations aim to shed light on the capabilities and limitations of LLMs when dealing with counterfactual MHQA and multi-step reasoning tasks.

4.1 Experiment Settings

We evaluate LLMs on 900 randomly selected instances from Wikipedia-based MHQA datasets (300 from HotpotQA [5], 300 from 2WikiMultihopQA [6], and 300 from MuSiQue [20]) and on our counterfactual MHQA data (divided into 2-hop, 3-hop, and 4-hop subsets). We use proprietary LLMs in our experiments; to enhance reproducibility, we set the temperature to 0, and all reported results are averages over three runs. For baselines, we adopt GPT-4 [44], GPT-3.5 [45], text-davinci-003, Bing Chat, and GEMINI-pro [46]. To decouple LLMs’ internal memory from their reasoning ability and have LLMs draw answers from the given passage as much as possible, we design a prompt that requires LLMs to answer only on the basis of the given context. The QA prompt is also shown in Appendix A.

4.2 Experimental Results

Reasoning VS Memorization

The comparison between the selected Wikipedia-based MHQA data and our counterfactual MHQA data is shown in Table 3. LLMs show a performance gap between the selected data and ours. Taking GPT-4 as an example, it achieves high EM and F1 scores on HotpotQA (69.9 and 82.3, respectively), which are even close to those of well-finetuned small QA models. On our 2-hop dataset, however, its EM and F1 scores decline sharply (53.1 and 62.8), and GPT-4 performs even worse on the 3-hop and 4-hop datasets. Since our synthesized data contains new, previously unseen knowledge, our results more objectively reflect the real reasoning performance of LLMs. In light of these results, LLMs appear to achieve inflated performance on the Wikipedia-based MHQA datasets, possibly because data contamination lets them rely on memory rather than reasoning.

Table 3: Performance gap between Wikipedia-based factual multi-hop QA datasets and our 2-hop, 3-hop, and 4-hop counterfactual MHQA data. The table reveals that LLMs show an obvious performance gap between previous datasets and our data. The performance is measured by EM and F1 scores with a zero-shot setting.

| Source | Dataset | GPT-4 EM | GPT-4 F1 | GPT-3.5 EM | GPT-3.5 F1 | GEMINI-pro EM | GEMINI-pro F1 | text-davinci-003 EM | text-davinci-003 F1 | Bing Chat EM | Bing Chat F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Wiki | HotpotQA | 69.9±1.5 | 82.3±1.3 | 58.6±0.9 | 69.1±1.1 | 58.2±1.3 | 68.4±1.3 | 50.3±0.9 | 61.4±0.8 | 68.1±0.6 | 78.3±1.2 |
| Wiki | 2Wiki | 59.7±1.4 | 67.4±2.7 | 56.3±0.9 | 67.6±0.8 | 48.5±1.6 | 58.5±0.9 | 42.3±1.4 | 53.9±1.5 | 58.9±0.5 | 69.9±0.5 |
| Wiki | MuSiQue | 57.3±1.9 | 65.4±2.9 | 49.3±0.8 | 63.2±1.5 | 41.3±1.5 | 54.5±0.7 | 40.2±0.9 | 51.0±1.3 | 49.6±1.1 | 64.1±0.8 |
| IRE | 2-hop | 53.1±0.4 | 62.8±1.3 | 40.6±0.7 | 56.7±0.5 | 35.0±0.7 | 45.3±1.6 | 32.6±0.9 | 48.5±0.8 | 41.9±0.8 | 53.4±0.9 |
| IRE | 3-hop | 44.5±0.3 | 56.4±1.7 | 37.7±0.5 | 50.9±1.1 | 29.6±0.5 | 42.7±0.9 | 27.8±0.9 | 46.3±0.8 | 39.6±1.1 | 49.4±1.2 |
| IRE | 4-hop | 42.3±0.6 | 53.5±0.9 | 32.5±1.2 | 44.6±0.8 | 26.1±1.1 | 35.3±1.2 | 24.8±0.8 | 44.1±0.9 | 30.7±0.9 | 42.2±0.7 |

| Model | Setting | 2 hop EM | 2 hop F1 | 3 hop EM | 3 hop F1 | 4 hop EM | 4 hop F1 |
|---|---|---|---|---|---|---|---|
| GPT-4 | w/o Sub-Q | 43.8±0.2 | 65.2±0.3 | 41.4±0.2 | 61.6±0.4 | 38.1±0.5 | 48.9±0.3 |
| GPT-4 | w Sub-Q | 53.2±0.5 | 67.7±0.6 | 44.5±0.2 | 64.5±0.3 | 42.1±0.1 | 53.1±0.4 |
| GPT-3.5 | w/o Sub-Q | 34.3±0.2 | 51.3±0.1 | 32.7±0.3 | 48.6±0.3 | 31.2±0.5 | 41.7±0.4 |
| GPT-3.5 | w Sub-Q | 40.4±0.5 | 56.9±0.4 | 37.5±0.1 | 50.2±0.2 | 33.5±0.2 | 45.9±0.6 |
| GEMINI-pro | w/o Sub-Q | 25.2±0.5 | 55.2±0.7 | 20.8±0.6 | 38.7±0.4 | 14.1±0.2 | 31.0±0.4 |
| GEMINI-pro | w Sub-Q | 34.6±0.5 | 64.2±0.5 | 27.3±0.2 | 42.1±0.2 | 25.9±0.2 | 33.8±0.3 |
| text-davinci-003 | w/o Sub-Q | 24.1±0.7 | 48.8±0.3 | 22.1±0.4 | 45.6±0.2 | 20.0±0.1 | 42.1±0.2 |
| text-davinci-003 | w Sub-Q | 32.3±0.4 | 52.7±0.4 | 27.3±0.3 | 46.4±0.3 | 24.2±0.5 | 42.8±0.7 |
| Bing Chat | w/o Sub-Q | 37.2±0.2 | 62.4±0.5 | 33.3±0.5 | 54.2±0.5 | 29.3±0.4 | 48.7±0.3 |
| Bing Chat | w Sub-Q | 41.8±0.5 | 66.8±0.5 | 40.1±0.2 | 58.4±0.5 | 30.4±0.3 | 49.7±0.3 |

Table 4: In the ablation study of the MHQA task, we remove the sub-question information from the prompt and only ask LLMs to get the final answer.

Sub-QA Evaluation

The plots in Figure 4 show the performance of LLMs on questions with different numbers of hops. Across the three plots, we find that as the number of hops increases, the complexity of the multi-hop questions also increases, leading to a decrease in LLMs’ performance. Figure 3 shows that LLMs also suffer from error propagation: when a sub-question is answered incorrectly, the following one is also affected. Consequently, the performance on Sub_Q2 is worse than that on Sub_Q1. Tables 10 and 11 in Appendix C further report the sub-QA performance of LLMs on the 3-hop and 4-hop datasets.

| Model | 2 hop F1 RC | 2 hop EM RC | 3 hop F1 RC | 3 hop EM RC | 4 hop F1 RC | 4 hop EM RC |
|---|---|---|---|---|---|---|
| GPT-4 | 0.7 | 1.5 | 1.1 | 1.6 | 2.3 | 4.2 |
| GPT-3.5 | 1.7 | 2.4 | 2.7 | 3.2 | 3.6 | 5.8 |
| GEMINI-pro | 2.1 | 3.9 | 4.6 | 8.7 | 5.4 | 9.5 |
| text-davinci-003 | 2.4 | 2.9 | 3.9 | 5.2 | 5.5 | 7.4 |
| Bing Chat | 0.9 | 1.9 | 4.2 | 8.4 | 4.7 | 8.9 |

Table 5: LLMs’ joint performance on the whole reasoning chain. The scores are the average scores of three experiment results.

Reasoning Chain Evaluation

In this part, due to space limitations, we calculate the proportions of reasoning chains on the 2-hop dataset only. We follow the setting of [10] in calculating the percentage of correct and incorrect answers. Table 6 shows the reasoning chain evaluation results. The green row shows the percentage of examples whose multi-hop question is answered correctly through the right reasoning chain. The red rows show the percentage of examples whose multi-hop question is answered correctly but through an incorrect reasoning chain. Across these examples, we notice that LLMs reach the correct final answer through the right reasoning chain in only a low percentage of cases. There is also a large proportion of incorrect final answers, as shown in rows 2, 4, 6, and 8. Taking GPT-3.5 as an example, the right reasoning chain accounts for only 13.3%, although the model shows relatively high QA performance in the previous tables. Correct final answers reached through an incorrect reasoning chain account for 17.7% (the sum of the three red rows), and total failure cases account for 69% (the sum of rows 2, 4, 6, and 8), a substantial share of the whole dataset. We conclude that LLMs follow the right reasoning chain in only a small proportion of cases and that their high performance is inflated by the considerable proportion of incorrect reasoning chains.
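The aggregate figures quoted for GPT-3.5 can be re-derived from the per-row percentages in Table 6. The sketch below copies two columns of that table (rows ordered as $q_{sub1}$, $q_{sub2}$, $q$) and groups them into the three categories discussed above; the function and variable names are illustrative.

```python
# Row order of Table 6: outcome of (sub-question 1, sub-question 2, final question).
PATTERNS = ["ccc", "ccw", "cwc", "cww", "wcc", "wcw", "wwc", "www"]

# Percentages copied from the GPT-4 and GPT-3.5 columns of Table 6.
TABLE6 = {
    "GPT-4": [36.3, 12.3, 2.0, 25.3, 5.7, 3.7, 0.3, 14.3],
    "GPT-3.5": [13.3, 9.3, 6.7, 24.3, 3.7, 3.7, 7.3, 31.7],
}


def summarize(model: str) -> tuple[float, float, float]:
    """Return (right chain, right answer via wrong chain, wrong final answer) in percent."""
    rows = dict(zip(PATTERNS, TABLE6[model]))
    right_chain = rows["ccc"]
    wrong_chain_right_answer = sum(
        value for pattern, value in rows.items()
        if pattern.endswith("c") and pattern != "ccc"
    )
    wrong_answer = sum(value for pattern, value in rows.items() if pattern.endswith("w"))
    return (
        round(right_chain, 1),
        round(wrong_chain_right_answer, 1),
        round(wrong_answer, 1),
    )


print(summarize("GPT-3.5"))  # (13.3, 17.7, 69.0)
print(summarize("GPT-4"))    # (36.3, 8.0, 55.6)
```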

Joint Performance

The joint F1 RC and joint EM RC scores in Table 5 are the evaluation results for the whole reasoning chain. We find that as the reasoning chain grows longer, the performance of LLMs drops quickly. For example, Bing Chat achieves performance comparable to GPT-4 (0.7 joint F1 RC) on 2-hop questions, with a joint F1 RC score of 0.9. However, on 3-hop questions, the joint F1 RC and joint EM RC scores of Bing Chat are 4.2 and 8.4. On the 4-hop dataset, Bing Chat gets joint F1 RC and joint EM RC scores of 4.7 and 8.9, respectively. Since the joint performance is a negative log, larger scores mean worse performance on the reasoning chain. We conclude that LLMs’ reasoning ability decreases as the reasoning chain length increases.
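To make the "negative log, larger is worse" reading concrete, here is a purely illustrative sketch. The exact joint score is defined in Appendix B, which is not reproduced here, so the form below (negative log of the product of per-step scores in [0, 1], with a small floor `eps` to avoid log(0)) is an assumption and may differ from the authors’ formula.

```python
import math


def joint_negative_log(step_scores: list[float], eps: float = 1e-6) -> float:
    """Hypothetical joint score: -log of the product of sub-QA and final-QA scores.

    Assumed form for illustration only; see Appendix B for the actual definition.
    """
    return -sum(math.log(max(score, eps)) for score in step_scores)


# Hypothetical per-step scores (sub-questions followed by the final question).
two_hop = joint_negative_log([0.8, 0.8, 0.8])
four_hop = joint_negative_log([0.8, 0.8, 0.8, 0.8, 0.8])
one_bad_step = joint_negative_log([0.8, 0.1, 0.8])

print(two_hop < four_hop)      # True: longer chains accumulate more penalty
print(two_hop < one_bad_step)  # True: one weak step dominates the score
```

Under this assumed form a perfect chain scores 0, every additional imperfect step adds to the total, and a single poorly answered step pushes the score up sharply, which matches the qualitative trend in Table 5.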

| $q_{sub1}$ | $q_{sub2}$ | $q$ | GPT-4 | GPT-3.5 | GEMINI-pro | text | Bing Chat |
|---|---|---|---|---|---|---|---|
| c | c | c | 36.3 | 13.3 | 15.0 | 17.3 | 28.3 |
| c | c | w | 12.3 | 9.3 | 9.0 | 10.7 | 7.7 |
| c | w | c | 2.0 | 6.7 | 5.3 | 7.7 | 6.0 |
| c | w | w | 25.3 | 24.3 | 14.7 | 25.0 | 16.3 |
| w | c | c | 5.7 | 3.7 | 5.3 | 6.7 | 2.3 |
| w | c | w | 3.7 | 3.7 | 5.3 | 3.7 | 3.0 |
| w | w | c | 0.3 | 7.3 | 13.3 | 8.7 | 5.0 |
| w | w | w | 14.3 | 31.7 | 32.3 | 30.3 | 31.3 |

Table 6: Categorical EM statistics (%) of sub-question evaluation for the five LLMs on our 2-hop dataset. Under the first three columns, c stands for correct and w stands for wrong. For example, the third row shows the percentage of questions where models correctly answer both the 2-hop question and the first sub-question but incorrectly answer the second sub-question. We abbreviate text-davinci-003 as text.

4.3 Ablation Study

To evaluate the impact of sub-questions on LLMs, we conduct an ablation study in which we remove the sub-questions from the prompts and measure final-answer performance. The results in Table 4 indicate that directly asking an LLM a multi-hop question with the corresponding passage yields much lower performance than adding sub-questions that require the LLM to reason step by step. For example, according to Table 4, removing the sub-questions decreases GPT-4’s F1 and EM scores on the 2-hop dataset by 2.5 and 9.4, respectively. The results show that sub-questions help LLMs improve final-QA performance.
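The gains quoted from Table 4 can be recomputed for any model and hop count. The sketch below copies the mean EM/F1 values from Table 4 (dropping the ± deviations) and reports the difference between the settings with and without sub-questions; the helper name `sub_question_gain` is illustrative.

```python
# Mean (EM, F1) per hop count (2, 3, 4), copied from Table 4 without the ± deviations.
HOPS = (2, 3, 4)
TABLE4 = {
    "GPT-4": {
        "without": [(43.8, 65.2), (41.4, 61.6), (38.1, 48.9)],
        "with": [(53.2, 67.7), (44.5, 64.5), (42.1, 53.1)],
    },
    "GPT-3.5": {
        "without": [(34.3, 51.3), (32.7, 48.6), (31.2, 41.7)],
        "with": [(40.4, 56.9), (37.5, 50.2), (33.5, 45.9)],
    },
    "GEMINI-pro": {
        "without": [(25.2, 55.2), (20.8, 38.7), (14.1, 31.0)],
        "with": [(34.6, 64.2), (27.3, 42.1), (25.9, 33.8)],
    },
    "text-davinci-003": {
        "without": [(24.1, 48.8), (22.1, 45.6), (20.0, 42.1)],
        "with": [(32.3, 52.7), (27.3, 46.4), (24.2, 42.8)],
    },
    "Bing Chat": {
        "without": [(37.2, 62.4), (33.3, 54.2), (29.3, 48.7)],
        "with": [(41.8, 66.8), (40.1, 58.4), (30.4, 49.7)],
    },
}


def sub_question_gain(model: str, hops: int) -> tuple[float, float]:
    """(EM gain, F1 gain) from adding sub-questions to the prompt."""
    index = HOPS.index(hops)
    em_without, f1_without = TABLE4[model]["without"][index]
    em_with, f1_with = TABLE4[model]["with"][index]
    return round(em_with - em_without, 1), round(f1_with - f1_without, 1)


print(sub_question_gain("GPT-4", 2))  # (9.4, 2.5)
for model in TABLE4:
    print(model, [sub_question_gain(model, hops) for hops in HOPS])
```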

4.4 Error Analysis

We select a total of 20 incorrect final answers generated by GPT-4 on the 4-hop dataset to illustrate how LLMs make decisions on multi-step reasoning tasks. We first examine where the errors occur among the sub-answers and final answers. Among the 20 incorrect final answers, 9 result from a wrong answer to the first sub-question. Of the remaining 11, 5 result from a wrong answer to the second sub-question. Another 4 are caused by wrong third or fourth sub-answers. From this analysis, we estimate that roughly half of the incorrect final answers stem from reasoning that goes wrong at the very first step. We also select 20 correct final answers generated by GPT-4 and find that about 4 of them are reached through incorrect reasoning chains (wrong sub-answers), revealing that LLMs sometimes follow an incorrect reasoning chain and still reach the correct final answer.

4.5 Insights of LLM Multi-step Reasoning Evaluation

Drawing on the LLMs’ multi-step reasoning performance, their limitations, and the error analysis, we offer the following insights:

Exact Matching

While exact matching is a simple and effective method for MHQA evaluation, it struggles when answers contain abbreviations or alternative expressions. For example, in our synthesized counterfactual MHQA data, if the golden answer to the question "When did Africa’s second public FM radio station launch?" is "2002" and the answer generated by GPT-4 is "4 August 2002", exact match scores the answer as wrong even though it contains the correct year. All the answers generated by LLMs are subject to this issue. Thus, a more universal QA evaluation score is urgently needed for evaluating LLMs’ reasoning performance.
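To make the mismatch concrete, the sketch below scores the example above with a SQuAD-style exact match and a token-level F1. The normalization steps (lowercasing, stripping punctuation and English articles) and the function names are assumptions for illustration, not necessarily the scoring script used in this work.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("4 August 2002", "2002"))  # 0.0
print(token_f1("4 August 2002", "2002"))     # 0.5
```

Token-level F1 gives partial credit for the shared token "2002", whereas exact match gives none, which is the gap described above.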

LLMs’ multi-step reasoning

Although the experimental results demonstrate that LLMs can perform multi-step reasoning to a certain extent, they remain sensitive to prompts and to the impact of additional context, especially the sub-questions. Providing sub-questions as additional information in prompts can help guide LLMs to reason in the correct direction and yields relatively strong performance.

Reasoning chain Evaluation (Joint F1 RC and Joint EM RC in this work)

The advantage of our reasoning evaluation method is that we jointly consider sub-QA and final-QA performance: when LLMs bypass an incorrect reasoning chain and still reach a correct final answer, the scores remain very low. However, this evaluation method is easily influenced by the LLMs’ performance on the first sub-question, since the sub-questions are answered sequentially. If the first sub-question is answered incorrectly, the following sub-questions and the final question are also affected, leading to a very low score. How to answer sub-questions more accurately remains to be explored.

5 Conclusion

In this work, we present IRE, a novel evaluation method that assesses the multi-step reasoning ability of LLMs via multi-hop QA and sub-QA and jointly computes scores over the whole reasoning chain. To disentangle LLMs’ memory from their reasoning ability, we design a framework that automatically synthesizes counterfactual data, with human review, to serve the evaluation. Although LLMs perform relatively well on QA tasks, their performance drops on multi-hop questions based on new, counterfactual knowledge. In addition, their high performance is inflated, benefiting from a high proportion of incorrect reasoning chains. We hope our work can facilitate future research on developing faithful knowledge editing methods.

References

  • [1] Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting for retrieval-augmented large language models. ArXiv, abs/2305.14283, 2023.
  • [2] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. ArXiv, abs/2312.10997, 2023.
  • [3] Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. ArXiv, abs/2212.10403, 2022.
  • [4] Archiki Prasad, Swarnadeep Saha, Xiang Zhou, and Mohit Bansal. Receval: Evaluating reasoning chains via correctness and informativeness. In Conference on Empirical Methods in Natural Language Processing, 2023.
  • [5] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing, 2018.
  • [6] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.
  • [7] Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. ArXiv, abs/2208.14271, 2022.
  • [8] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. ArXiv, abs/2301.13379, 2023.
  • [9] Miles Turpin, Julian Michael, Ethan Perez, and Sam Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. ArXiv, abs/2305.04388, 2023.
  • [10] Yixuan Tang, Hwee Tou Ng, and Anthony Tung. Do multi-hop question answering systems know how to answer the single-hop sub-questions? In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty, editors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3244–3249, Online, April 2021. Association for Computational Linguistics.
  • [11] Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. Dynamically fused graph network for multi-hop reasoning. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6140–6150, Florence, Italy, July 2019. Association for Computational Linguistics.
  • [12] Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. Multi-hop reading comprehension through question decomposition and rescoring. In ACL, 2019.
  • [13] Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. Cognitive graph for multi-hop reading comprehension at scale. ArXiv, abs/1905.05460, 2019.
  • [14] Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, and Sanjiv Kumar. Large language models with controllable working memory. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 1774–1793, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [15] Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, and Omri Abend. DisentQA: Disentangling parametric and contextual knowledge with counterfactual question answering. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10056–10070, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [16] Wenxuan Zhou, Sheng Zhang, Hoifung Poon, and Muhao Chen. Context-faithful prompting for large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14544–14556, Singapore, December 2023. Association for Computational Linguistics.
  • [17] Kevin Wu, Eric Wu, and James Zou. How faithful are rag models? quantifying the tug-of-war between rag and llms’ internal prior. 2024.
  • [18] Wenhao Yu, Meng Jiang, Peter Clark, and Ashish Sabharwal. Ifqa: A dataset for open-domain question answering under counterfactual presuppositions. arXiv preprint arXiv:2305.14010, 2023.
  • [19] Rongwu Xu, Zehan Qi, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge conflicts for llms: A survey. ArXiv, abs/2403.08319, 2024.
  • [20] H. Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2021.
  • [21] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, T. W. Hennigan, Saffron Huang, Lorenzo Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and L. Sifre. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, 2021.
  • [22] Tianyu Gao, Ho-Ching Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. ArXiv, abs/2305.14627, 2023.
  • [23] O. Khattab, Christopher Potts, and Matei A. Zaharia. Baleen: Robust multi-hop reasoning at scale via condensed retrieval. ArXiv, abs/2101.00436, 2021.
  • [24] Yixuan Tang and Yi Yang. Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. ArXiv, abs/2401.15391, 2024.
  • [25] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In North American Chapter of the Association for Computational Linguistics, 2019.
  • [26] Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. Dynamically fused graph network for multi-hop reasoning. ArXiv, abs/1905.06933, 2019.
  • [27] Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, and Shafiq R. Joty. How much are llms contaminated? a comprehensive survey and the llmsanitize library. 2024.
  • [28] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. ArXiv, abs/2310.16789, 2023.
  • [29] Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, and Samuel Dooley. Data contamination through the lens of time. ArXiv, abs/2310.10628, 2023.
  • [30] Shahriar Golchin and Mihai Surdeanu. Data contamination quiz: A tool to detect and estimate contamination in large language models. ArXiv, abs/2311.06233, 2023.
  • [31] Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. Dyval: Graph-informed dynamic evaluation of large language models. arXiv preprint arXiv:2309.17167, 2023.
  • [32] Ruijie Xu, Zengzhi Wang, Run-Ze Fan, and Pengfei Liu. Benchmarking benchmark leakage in large language models. arXiv preprint arXiv:2404.18824, 2024.
  • [33] Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, and Summer Yue. A careful examination of large language model performance on grade school arithmetic. 2024.
  • [34] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. ArXiv, abs/2110.14168, 2021.
  • [35] Wenhong Zhu, Hong ping Hao, Zhiwei He, Yunze Song, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, and Hongyuan Lu. Clean-eval: Clean evaluation on contaminated large language models. ArXiv, abs/2311.09154, 2023.
  • [36] Ruijie Xu, Zengzhi Wang, Run-Ze Fan, and Pengfei Liu. Benchmarking benchmark leakage in large language models. 2024.
  • [37] Wenyue Hua, Jiang Guo, Mingwen Dong, He Zhu, Patrick Ng, and Zhiguo Wang. Propagation and pitfalls: Reasoning-based assessment of knowledge editing through counterfactual tasks. ArXiv, abs/2401.17585, 2024.
  • [38] Max Bartolo, Tristan Thrush, Sebastian Riedel, Pontus Stenetorp, Robin Jia, and Douwe Kiela. Models in the loop: Aiding crowdworkers with generative annotation assistants. In North American Chapter of the Association for Computational Linguistics, 2021.
  • [39] Petter Törnberg. Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. ArXiv, abs/2304.06588, 2023.
  • [40] Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  • [41] Cunxiang Wang, Sirui Cheng, Qipeng Guo, Zhikun Xu, Bowen Ding, Yidong Wang, Xiangkun Hu, Zheng Zhang, and Yue Zhang. Evaluating open-qa evaluation. 2023.
  • [42] Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, and Yue Zhang. Novelqa: A benchmark for long-range novel question answering. ArXiv, abs/2403.12766, 2024.
  • [43] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Conference on Empirical Methods in Natural Language Processing, 2016.
  • [44] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [45] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155, 2022.
  • [46] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [47] Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441, 2022.
  • [48] Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246, 2023.

Appendix A Prompts and Examples

When evaluating large language models, prompting is a brittle process in which small modifications to the prompt can cause large variations in model predictions; significant effort should therefore be dedicated to carefully designing the prompt for a given task [47, 48]. In this study, we investigate zero-shot performance on our benchmark. To eliminate randomness, we manually select one demonstration for each task, ensuring that all tasks are covered.

We give our designed input examples for three different tasks in Table 7 to help readers understand our implementation. The original and rewritten passages are shown in Table 9, and the annotated multi-hop questions are shown in Table 8.

Prompts of NER, Noun Phrase and Adjective replacement
Prompt: Now you are a passage annotator, you need to recognize all the named entities, noun phrases, and adjectives from the given [CONTEXT], then translate the passage into Chinese and translate to English. Please output the response in JSON format {Passage: String}
[CONTEXT] The given context.

Prompts of Question Generation
Example: One-shot example with multi-hop QA pairs, Sub-QA pairs, and passage.
Prompt: Now you are a multi-hop question generation machine, given an example of 2 hop question and its sub-questions, sub-answers, and final answer is [2 hop question],[Sub-Questions],[Sub-Answers] and [Final Answer], you need to generate a new 2 hop multi-hop question same with the given example and its sub-questions, sub-answers and final answer from the given [Context]. Please follow the sentence structure of give examples and output the response in JSON format {2 hop question: String, sub-questions: List, sub-answers:List, final answer:String}:
[2 hop question] The given example of 2 hop question.
[Sub-Questions] The given example of sub-questions.
[Sub-Answers] The given example of sub-answers.
[Final Answer] The given example of final answer.
[CONTEXT] The given passage

Prompts of QA
Prompt: You are a QA test machine, you need to answer the [Question] from given the [Context], and you only need to come out with the correct answer without other words. Let’s think step by step, and please output the answer to the [Question] in the format of: {Final Answer: String}.
[QUESTION] The given question.
[CONTEXT] The given passage.

Table 7: The prompt template of passage rewriting and question generation. We here take 2-hop data annotation as the example. [WORDS] denotes the information we should give.
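As a minimal sketch of how the QA prompt in Table 7 might be filled in and its `{Final Answer: String}` output parsed, consider the helpers below. `build_qa_prompt` and `parse_final_answer` are hypothetical names introduced for illustration; the regex-based parser assumes the model follows the requested output format, and no model API is called.

```python
import re
from typing import Optional

# QA prompt text from Table 7; doubled braces are literal braces for str.format.
QA_TEMPLATE = (
    "You are a QA test machine, you need to answer the [Question] from given "
    "the [Context], and you only need to come out with the correct answer "
    "without other words. Let's think step by step, and please output the "
    "answer to the [Question] in the format of: {{Final Answer: String}}. "
    "[QUESTION] {question} [CONTEXT] {context}"
)


def build_qa_prompt(question: str, context: str) -> str:
    """Insert the question and passage into the QA template."""
    return QA_TEMPLATE.format(question=question, context=context)


def parse_final_answer(response: str) -> Optional[str]:
    """Extract the answer from '{Final Answer: ...}', or None if absent."""
    match = re.search(r'\{\s*"?Final Answer"?\s*:\s*(.*?)\s*\}', response, re.S)
    if match is None:
        return None
    return match.group(1).strip().strip('"')


prompt = build_qa_prompt(
    "When did Africa's second public FM radio station launch?",
    "Permission is Africa's second public FM radio station, launched on 4 August 2002.",
)
print(prompt)
print(parse_final_answer("{Final Answer: 4 August 2002}"))
```

Returning `None` for malformed responses keeps unparseable outputs distinguishable from wrong answers when scoring.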
Table 8: Examples of annotated different question types and question hops. We emphasize keywords for their respective categories.

| Question Type | Hop | Multi-hop Question |
| --- | --- | --- |
| Bridge | 2 hop | When was the actor who played Helen in FBC series The Murder born? |
| Bridge | 3 hop | Who were the learners of the people that was the principal violist in the Fioba Symphony Band and instructed music to Michard Rokney? |
| Bridge | 4 hop | Which is later, the birthday of Zephyr Bolt-Anderson or the time that 2060 Kingdom of Azkaban ATP Conqueror occurred in Gleeful Peak, Atlantis? |
| Comparison | 2 hop | Where is the Blue Falls Empire located and what products are it responsible for importing? |
| Comparison | 3 hop | Which is later, the opening time of Gold or the opening time of the Mad Book in 2006? |
| Comparison | 4 hop | Was the release of the movie Ocean Secrets before or after Echoes of Tomorrow & Victoria Wright? |

Table 9: The original passages with the rewritten passages. The first and third rows of the table are the original passages, and the second and fourth rows show the corresponding rewritten passages.

| Title | Passages |
| --- | --- |
| Radio City (Indian radio station) | Radio City is India’s first private FM radio station and was started on 3 July 2001. It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow, and New Delhi (since 2003). It plays Hindi, English, and regional songs. It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006, and in Visakhapatnam in October 2007. Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features. The Radio station currently plays a mix of Hindi and Regional music. Abraham Thomas is the CEO of the company. |
| Permission (African radio station) | Permission is Africa’s second public FM radio station, launched on 4 August 2002. It broadcasts on 95.1 (previously 95.0 in most cities) megahertz from Baili (where it was launched in 2006), Hanwi (first launched in 2008), Shuyu, and Sadem (since 2004-09). It plays Japanese, Chinese, and folk songs. It started in Hindu in April 2007, in Beuge on 8 August 2007, and in Adler in November 2008. The Permission recently forayed into Old Business in June 2009 with the launch of a music portal - BoatPermission.com, which offers music-related news, videos, songs, and other music-related features. The Permission currently plays a mix of Japanese and folk music. Amma is the founder of the company. |
| Lights Out Paris | Lights Out Paris is the first studio album by American hip-hop artist Sims, a member of Minneapolis indie hip-hop collective Doomtree. It was released July 28, 2005, on Doomtree Records and includes guest appearances from P.O.S, Crescent Moon, and Toki Wright, among others. The album was re-released with four remixes and five songs from Sims’ "False Hopes Four" on vinyl in June 2015. |
| Brilliant | Brilliant is the first studio album by Australian Shout artist Allen, a member of London indie Shout collective Die. It was published on 29 October 2006 on Die Records and features guest appearances from Lucia, Lisa, and Bill, among others. The album was relisted on vinyl in July 2016, along with seven remixes and nine tracks from Allen’s Right. |

Figure 3: The performance change of F1 score and EM scores when answering 2 sub-questions on the 2-hop dataset.

Appendix B Joint Computation

We here list the details of computing the joint scores over the whole reasoning chain. For an $N$-hop question and its $N$ sub-questions, given the precisions and recalls on the MHQA ($P^{\text{(MHQA)}}, R^{\text{(MHQA)}}$) and on the Sub-QA ($P^{(\text{sub\_}qa^{1})}, R^{(\text{sub\_}qa^{1})}$), …, ($P^{(\text{sub\_}qa^{N})}, R^{(\text{sub\_}qa^{N})}$), respectively, we calculate the joint performance as:

$$
\begin{aligned}
P^{\text{(joint)}} &= P^{\text{(MHQA)}}\, P^{(\text{sub\_}qa^{1})} \cdots P^{(\text{sub\_}qa^{N})}, \\
R^{\text{(joint)}} &= R^{\text{(MHQA)}}\, R^{(\text{sub\_}qa^{1})} \cdots R^{(\text{sub\_}qa^{N})}, \\
\text{Joint F}_1\ RC &= -\log \frac{2 P^{\text{(joint)}} R^{\text{(joint)}}}{P^{\text{(joint)}} + R^{\text{(joint)}}},
\end{aligned}
$$

where Joint F$_1$ $RC$ denotes the joint F1 performance of the reasoning chain.

Given the EM scores on the MHQA ($EM^{\text{(MHQA)}}$) and on the Sub-QA ($EM^{(\text{sub\_}qa^{1})}$, …, $EM^{(\text{sub\_}qa^{N})}$), we calculate:

$$
\text{Joint EM}\ RC = -\log \frac{2\, EM^{\text{(MHQA)}} \cdots EM^{(\text{sub\_}qa^{N})}}{EM^{\text{(MHQA)}} + \cdots + EM^{(\text{sub\_}qa^{N})}},
$$

where Joint EM $RC$ denotes the joint EM performance of the reasoning chain.
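The Joint F1 RC computation in this appendix can be sketched in a few lines. The function name `joint_f1_rc`, the use of the natural logarithm, and the choice to return infinity when the joint F1 is zero are assumptions made for illustration; the appendix does not specify the log base or the zero case.

```python
import math
from typing import Sequence


def joint_f1_rc(precisions: Sequence[float], recalls: Sequence[float]) -> float:
    """Negative log of the F1 built from the joint precision and recall.

    Each sequence holds the MHQA value followed by the N sub-QA values,
    all expressed as fractions in [0, 1].
    """
    if len(precisions) != len(recalls) or not precisions:
        raise ValueError("need equally long, non-empty precision and recall lists")
    p_joint = math.prod(precisions)  # P(joint): product over MHQA and sub-QA
    r_joint = math.prod(recalls)     # R(joint): product over MHQA and sub-QA
    if p_joint == 0.0 or r_joint == 0.0:
        return math.inf  # assumption: a zero joint F1 maps to +infinity
    joint_f1 = 2 * p_joint * r_joint / (p_joint + r_joint)
    return -math.log(joint_f1)  # assumption: natural logarithm


# Hypothetical 2-hop example: MHQA score followed by two sub-QA scores.
score = joint_f1_rc([0.8, 0.9, 0.5], [0.8, 0.9, 0.5])
print(round(score, 4))
```

Because the joint precision and recall are products, a single poorly answered sub-question pulls down the F1 of the whole chain. Under the formula as written, a perfect chain yields 0 and weaker chains yield larger values.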

Appendix C Performance Analysis

| 3 hop | Q1 EM | Q1 F1 | Q2 EM | Q2 F1 | Q3 EM | Q3 F1 |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 70.9±0.3 | 80.8±0.6 | 59.7±0.3 | 74.9±0.4 | 58.1±0.2 | 68.8±0.5 |
| GPT-3.5 | 43.0±0.7 | 56.4±0.7 | 38.6±0.1 | 49.3±0.2 | 29.0±0.3 | 40.6±0.2 |
| GEMINI-pro | 5.8±0.4 | 33.8±0.5 | 4.4±0.6 | 30.8±0.5 | 4.1±0.7 | 31.5±0.9 |
| text-davinci-003 | 23.3±0.8 | 42.4±0.3 | 20.5±0.4 | 33.7±0.3 | 19.5±0.5 | 29.6±0.6 |
| Bing Chat | 7.2±0.9 | 34.0±0.6 | 5.8±0.7 | 31.5±0.5 | 3.1±0.6 | 32.3±0.4 |

Table 10: The LLM evaluation on the IRE 3-hop dataset. We here measure the sub-QA task and compare the performance between each hop. $Q_i$ denotes the $i$-th sub-question.

| 4 hop | Q1 EM | Q1 F1 | Q2 EM | Q2 F1 | Q3 EM | Q3 F1 | Q4 EM | Q4 F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 60.9±0.4 | 66.7±0.3 | 56.4±0.5 | 62.6±0.4 | 28.4±0.2 | 58.7±0.4 | 23.1±0.2 | 56.3±0.3 |
| GPT-3.5 | 40.7±0.4 | 46.9±0.3 | 30.1±0.2 | 36.3±0.2 | 20.2±0.1 | 47.2±0.5 | 14.7±0.4 | 44.8±0.2 |
| GEMINI-pro | 14.9±0.5 | 39.2±0.1 | 10.4±0.5 | 38.3±0.4 | 9.1±0.6 | 34.9±0.4 | 7.2±0.3 | 29.5±0.6 |
| text-davinci-003 | 19.8±0.2 | 39.2±0.4 | 19.2±0.5 | 30.7±0.6 | 18.8±0.7 | 28.6±0.6 | 18.5±0.7 | 27.8±0.2 |
| Bing Chat | 20.8±0.2 | 39.4±0.4 | 16.9±0.2 | 37.1±0.3 | 6.2±0.5 | 35.8±0.4 | 5.5±0.7 | 35.1±0.3 |

Table 11: The LLM performance on IRE 4 hop dataset. We here measure the sub-qa task and compare the performance between each hop.
Figure 4: The performance change of EM and F1 scores when answering from 2 hop questions to 4 hop questions.
Figure 5: Performance gap between Wikipedia-based factual multi-hop QA datasets and our 2-hop, 3-hop, and 4-hop counterfactual MHQA data of table 3. The line charts reveal that LLMs show an obvious performance gap between previous datasets and our counterfactual QA data.

As a quantitative complement to the Sub-QA evaluation, we list here the results of LLMs’ performance on the 3-hop and 4-hop datasets. The LLMs’ reasoning performance drops dramatically across hops; e.g., in Table 10, GPT-4 achieves 70.9 EM and 80.8 F1 on sub-question 1 but only 59.7 EM and 74.9 F1 on sub-question 2, and 58.1 EM and 68.8 F1 on sub-question 3. In Table 11, we further find that on 4-hop questions the results show a cliff-like descent from sub-question 2 to sub-question 3 in EM (e.g., GPT-4 falls from 56.4 to 28.4 EM), while GPT-3.5 drops from 46.9 F1 on sub-question 1 to 36.3 F1 on sub-question 2.
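The hop-to-hop changes discussed here can be read off Table 10 with a short calculation. The sketch below uses GPT-4's mean scores on the 3-hop dataset (the ± deviations are omitted); the `hop_deltas` helper is a hypothetical name introduced for illustration.

```python
# GPT-4 sub-question scores on the 3-hop dataset, copied from Table 10.
gpt4_3hop = {"EM": [70.9, 59.7, 58.1], "F1": [80.8, 74.9, 68.8]}


def hop_deltas(scores):
    """Score change between consecutive sub-questions, rounded to one decimal."""
    return [round(b - a, 1) for a, b in zip(scores, scores[1:])]


for metric, scores in gpt4_3hop.items():
    print(metric, hop_deltas(scores))
```

Negative values indicate a drop from one sub-question to the next, so the largest single-step decline for each metric is easy to spot.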

Appendix D Limitations

In this paper, we focus on evaluating LLMs’ real multi-step reasoning ability on our annotated counterfactual MHQA data. Although LLMs show an obvious performance gap between previous factual MHQA datasets and our dataset, the size of our dataset remains to be expanded. Exact Match (EM) also remains a challenging metric for reporting QA performance, because it does not reflect LLMs’ real performance when the expression of an answer varies.