MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition

J Wu, L Yang, M Okumura, Y Zhang - arXiv preprint arXiv:2402.11924, 2024 - arxiv.org
Although Large Language Models (LLMs) have shown strong performance on Multi-hop Question Answering (MHQA) tasks, their true reasoning ability remains underexplored. Current LLM QA evaluation benchmarks have shown limitations, including 1) data contamination: the evaluation data may have been exposed to LLMs during the pretraining stage; and 2) neglect of reasoning-chain evaluation. We therefore introduce an LLM MHQA evaluation benchmark, the first QA benchmark built on new, previously unseen knowledge obtained by editing the off-the-shelf HotpotQA dataset. In addition, we annotate and evaluate the reasoning chain in the form of sub-questions and intermediate answers corresponding to each multi-hop question. We observe that 1) LLMs show a performance gap between the original HotpotQA and our edited data, suggesting that current MHQA benchmarks carry a risk of data contamination that makes it hard to evaluate LLMs' performance objectively and scientifically; and 2) LLMs produce the right reasoning chain only a small percentage of the time, e.g., GPT-4 gets only 36.3% of reasoning chains right. We believe this new multi-hop QA evaluation benchmark and novel evaluation method will facilitate the development of trustworthy LLM evaluation on the MHQA task.
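The reasoning-chain evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's actual scoring code: the data schema (lists of per-hop answer strings) and the normalized exact-match criterion are assumptions chosen for clarity. The key idea is that a chain counts as correct only when every intermediate answer matches, not just the final one.

```python
# Hypothetical sketch of reasoning-chain scoring for multi-hop QA.
# A chain is judged right only if EVERY hop's intermediate answer matches
# the gold annotation, so a correct final answer reached via a wrong
# intermediate step still fails.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a loose exact match."""
    return " ".join(text.lower().split())

def chain_correct(predicted: list[str], gold: list[str]) -> bool:
    """Return True only when all hops (including the final answer) match."""
    if len(predicted) != len(gold):
        return False
    return all(normalize(p) == normalize(g) for p, g in zip(predicted, gold))

# Example: the final answer is right, but hop 1 is wrong, so the chain fails.
gold = ["Christopher Nolan", "London"]
print(chain_correct(["Steven Spielberg", "London"], gold))   # False
print(chain_correct(["Christopher Nolan", "London"], gold))  # True
```

Under this stricter metric, chain-level accuracy is necessarily no higher than final-answer accuracy, which is why the reported 36.3% chain accuracy for GPT-4 can sit well below its answer-only scores.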