MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition

J Wu, L Yang, M Okumura, Y Zhang - arXiv preprint arXiv:2402.11924, 2024 - arxiv.org
Although Large Language Models (LLMs) have shown strong performance on Multi-hop Question Answering (MHQA) tasks, their true reasoning ability remains underexplored. Current LLM QA evaluation benchmarks have shown limitations, including 1) data contamination: the evaluation data may have been exposed to LLMs during the pretraining stage; and 2) neglect of reasoning-chain evaluation. We therefore introduce an LLM MHQA evaluation benchmark, the first QA benchmark built on new, previously unseen knowledge obtained by editing the off-the-shelf HotpotQA dataset. In addition, we annotate and evaluate the reasoning chain in the form of sub-questions and intermediate answers corresponding to each multi-hop question. We observe that 1) LLMs show a performance gap between the original HotpotQA and our edited data, suggesting that current MHQA benchmarks carry a risk of data contamination that makes it hard to evaluate LLMs' performance objectively and scientifically; and 2) LLMs produce the right reasoning chain only a small percentage of the time, e.g., GPT-4 gets only 36.3% of reasoning chains right. We believe this new multi-hop QA evaluation benchmark and novel evaluation method will facilitate the development of trustworthy LLM evaluation on the MHQA task.
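The reasoning-chain evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's actual scoring code: the data schema (lists of per-hop answer strings) and the normalized exact-match criterion are assumptions chosen for clarity. The key idea is that a chain counts as correct only when every intermediate answer matches, not just the final one.

```python
# Hypothetical sketch of reasoning-chain scoring for multi-hop QA.
# A chain is judged right only if EVERY hop's intermediate answer matches
# the gold annotation, so a correct final answer reached via a wrong
# intermediate step still fails.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a loose exact match."""
    return " ".join(text.lower().split())

def chain_correct(predicted: list[str], gold: list[str]) -> bool:
    """Return True only when all hops (including the final answer) match."""
    if len(predicted) != len(gold):
        return False
    return all(normalize(p) == normalize(g) for p, g in zip(predicted, gold))

# Example: the final answer is right, but hop 1 is wrong, so the chain fails.
gold = ["Christopher Nolan", "London"]
print(chain_correct(["Steven Spielberg", "London"], gold))   # False
print(chain_correct(["Christopher Nolan", "London"], gold))  # True
```

Under this stricter metric, chain-level accuracy is necessarily no higher than final-answer accuracy, which is why the reported 36.3% chain accuracy for GPT-4 can sit well below its answer-only scores.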