Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering

Rachneet Sachdeva^♠ Yixiao Song^♡ Mohit Iyyer^♡ Iryna Gurevych^♠
^♠Ubiquitous Knowledge Processing Lab (UKP Lab),
Department of Computer Science and Hessian Center for AI (hessian.AI),
Technical University of Darmstadt
^♡University of Massachusetts Amherst
www.ukp.tu-darmstadt.de

Abstract

Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, which makes their faithful evaluation challenging. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 4.7k span-level error annotations for five different error types by expert annotators, along with preference judgments. Using our collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans with incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-Informed Refinement, that uses signals from the learned feedback model to refine generated answers, which we show reduces hallucination and improves answer quality. Furthermore, humans find answers generated by our approach comprehensive and strongly prefer them ($84\%$) over the baseline answers. To further research, we open-source our data and code: https://github.com/UKPLab/arxiv2024-lfqa-hallucination

1 Introduction

Figure 1: An overview of our data collection process. Based on our defined aspects, we collect expert human judgments for question-answer pairs on the Reddit platform and their corresponding answers from GPT-4.

Long-form question answering (LFQA) provides comprehensive, user-friendly, and in-depth responses to complex questions by leveraging state-of-the-art large language models (LLMs) and retriever components Krishna et al. (2021); Nakano et al. (2021). While LLMs generate plausible and convincing answers, they also frequently produce factually inconsistent, irrelevant, and incomplete content Goyal and Durrett (2020); Laban et al. (2022); Menick et al. (2022); Ji et al. (2022), which limits their applicability in real-world applications.

Simplistic evaluation metrics such as BLEU Papineni et al. (2002) and ROUGE Lin (2004) do not align with human experts’ judgments on long-form answers Wang et al. (2022). Many aspects of LFQA, including factuality, completeness, and relevance, require evaluation, which motivates our focus on span-level fine-grained error detection. While previous studies have focused on evaluating factual errors in long-form text generation (Lee et al., 2022; Min et al., 2023; Li et al., 2023; Muhlgay et al., 2023), other aspects of evaluation, such as response completeness and relevance, have been largely overlooked, even though incomplete or irrelevant responses can mislead and confuse users.

LLMs make many errors in LFQA, and detecting these errors requires expert human annotation Gillick and Liu (2010); Iskender et al. (2020); Wang et al. (2022). Recent work from Xu et al. (2023a) reports that aspects such as factuality, relevance, completeness, structure, references, and accessibility are essential for evaluating long-form answers. However, no prior LFQA study examines these errors at the span level, even though span-level error annotation and categorization have been important for evaluating and improving systems in other generation tasks such as machine translation Freitag et al. (2021). We fill this gap by collecting HaluQuestQA, a dataset of LFQA answers that expert annotators label at the span level with five different error types (question misconception, factuality, completeness, relevance, and helpful references), in addition to preference judgments, as shown in Figure 1.

Next, we train an automatic feedback model on this dataset that predicts erroneous answer spans with incomplete information and provides associated explanations. The feedback model provides fine-grained feedback in the form of error location (sentence level), error reason, and confidence score without the aid of a reference text Xu et al. (2023b). Finally, we propose Error-Informed Refinement, a prompt-based approach that uses signals from the feedback model to refine generated answers Madaan et al. (2023), which we show reduces hallucination and improves answer quality.
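The feedback-then-refine flow described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `SentenceFeedback` record, the `build_refinement_prompt` helper, the prompt wording, the 0-to-1 confidence scale, and the `min_confidence` threshold are all hypothetical names and choices made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceFeedback:
    """One feedback item (hypothetical structure, not the paper's schema)."""
    sentence_index: int  # 0-based position of the flagged sentence in the answer
    reason: str          # free-text justification for the error
    confidence: float    # assumed to lie in [0, 1]


def build_refinement_prompt(
    question: str,
    sentences: List[str],
    feedback: List[SentenceFeedback],
    min_confidence: float = 0.5,  # assumed threshold, chosen for illustration
) -> str:
    """Assemble a refinement prompt from sentence-level feedback."""
    # Keep only confident feedback items that point at an existing sentence.
    kept = [
        f for f in feedback
        if f.confidence >= min_confidence and 0 <= f.sentence_index < len(sentences)
    ]
    lines = [
        f"Question: {question}",
        "Answer: " + " ".join(sentences),
        "Feedback:",
    ]
    if not kept:
        lines.append("- No errors detected.")
    for item in sorted(kept, key=lambda f: f.sentence_index):
        quoted = sentences[item.sentence_index]
        lines.append(f'- Sentence {item.sentence_index + 1} ("{quoted}"): {item.reason}')
    lines.append("Rewrite the answer so that it addresses every feedback item.")
    return "\n".join(lines)
```

In this sketch, the resulting string would be passed to whatever refinement model is in use; low-confidence feedback is dropped so that the refiner is not asked to fix sentences the feedback model is unsure about.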

Our contributions can be summarized as follows: (1) We release HaluQuestQA, a dataset of span-level error annotations on pairs of human-written and model-generated answers. Our data analysis shows that long-form answers lack comprehensiveness and provide unhelpful references; (2) We train a feedback model to detect span-level errors aligned with expert human judgments; (3) We propose Error-Informed Refinement, an approach to refine LLM-generated answers with fine-grained feedback provided by our learned model. Our approach consistently outperforms baselines that use coarse-grained feedback (lacking detailed error justifications) and reduces hallucinations.

2 Related Work

Human evaluation.

Prior work Krishna et al. (2021) has shown that human evaluation for LFQA tasks is challenging due to long answer lengths, and that expert annotators are required to evaluate them effectively. Xu et al. (2023a) hire (non-)expert annotators and identify nine multi-faceted aspects for meaningful LFQA evaluation. While some of these fine-grained aspects, such as factuality Goyal and Durrett (2020); Laban et al. (2022), coherence Goyal et al. (2022), and completeness Tang et al. (2024), have been studied to investigate hallucinations in dialogue summarization tasks, ours is among the first works to study LFQA-centric properties such as question misconception, factuality, relevance, completeness, and helpful references at the span level.

Detecting and Mitigating Hallucinations in LLMs.

Increasing focus on the reliability of LLMs has led to the development of explainable evaluation metrics Zhong et al. (2022); Fu et al. (2023) to detect errors in LLM generations. Xu et al. (2023b) present InstructScore, an explainable metric based on LLaMA Touvron et al. (2023a), to obtain detailed error analysis for LLM-generated text. However, most of the current evaluation metrics require hard-to-obtain gold references. Recent work proposes a reference-free evaluation metric, TigerScore Jiang et al. (2023b), that can locate, categorize, and explain errors across various text generation tasks, including summarization, translation, and LFQA. While LLM-based metrics can detect diverse errors, it is not always feasible to have an external evaluator during real-time inference; hence, sampling-based approaches Chen et al. (2023); Manakul et al. (2023); Malon and Zhu (2024) have been proposed, wherein consistency across multiple sampled model outputs is used as a measure of factuality.
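The idea behind sampling-based checks can be sketched in a few lines of Python. This is a toy illustration only: the unigram-overlap scorer and the function names `tokens` and `inconsistency_score` are assumptions made here, and the cited methods may rely on considerably stronger scorers.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def inconsistency_score(sentence: str, samples: List[str]) -> float:
    """Return 1 minus the mean fraction of the sentence's tokens found in each sample.

    A score near 0 means the sampled outputs support the sentence;
    a score near 1 means they do not (a possible hallucination signal).
    """
    if not samples:
        raise ValueError("at least one sampled output is required")
    sent = tokens(sentence)
    if not sent:
        return 0.0  # nothing to check in an empty sentence
    support = [len(sent & tokens(sample)) / len(sent) for sample in samples]
    return 1.0 - sum(support) / len(support)
```

The design choice to average over samples reflects the intuition stated above: content that the model reproduces across independent samples is treated as more likely to be factual than content that appears only once.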

Reinforcement learning from human feedback (RLHF) Ziegler et al. (2019), a framework to incorporate human feedback to align LMs, has been used to reduce undesirable LLM generations Ouyang et al. (2022); Bai et al. (2022a, b). Wu et al. (2023b) propose fine-grained RLHF, a framework that enables learning reward models associated with span-level human feedback on different error types. However, training multiple reward models is complex and compute-intensive. A recent alignment technique, direct preference optimization (DPO) Rafailov et al. (2023), bypasses the reward modeling step in RLHF and has been used to fine-tune LMs for factuality using preference ranking over model responses Tian et al. (2023). Human feedback has also been used to train feedback models Wang et al. (2023); Xu et al. (2024) to guide the refinement of LLM outputs Madaan et al. (2023); Welleck et al. (2023), improving answer quality. However, these feedback models are either not trained to provide fine-grained error feedback or rely on the ground truth passage to detect errors, which may not always be accessible for open-domain QA tasks. Our work annotates fine-grained errors in LFQA and uses this data to train a reference-free feedback model for sentence-level error detection with justifications. We further propose a prompt-based approach to refine answers with feedback, enhancing their comprehensiveness.
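For readers unfamiliar with DPO, the objective from Rafailov et al. (2023) is reproduced here for reference. The notation is not defined elsewhere in this paper and is introduced only for this equation: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference policy, $(x, y_w, y_l)$ is a prompt with a preferred and a dispreferred response drawn from a preference dataset $\mathcal{D}$, $\sigma$ is the logistic function, and $\beta$ is a temperature-like hyperparameter.

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

The loss depends only on log-probability ratios of the two policies, which is why no separate reward model needs to be trained.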

3 HaluQuestQA (HQ²A)

Prior LFQA evaluations with non-expert Nakano et al. (2021) and expert Xu et al. (2023a) annotators collect preference judgments over model responses. However, overall preference is not indicative of fine-grained errors in LFQA. As a first step, we annotate span-level errors in long-form answers, with explanations from domain experts.

3.1 Hiring Annotators

We recruit domain experts on Prolific’s academic annotation platform for seven domains (physics, chemistry, biology, technology, economics, history, and law). The expert selection is based on age (22–32), demographics (US and UK), education (undergraduate or graduate degree in the target domain), and native language (English). For each target domain, we first conduct a small pilot comprising ten samples, where, given a question and two candidate answers, the experts evaluate the answers and mark the incorrect spans based on our defined evaluation criteria (Section 3.2). Based on the pilot results, we choose three experts per domain and give each of them a large-scale study containing 35-50 question-answer pairs. In total, we collect expert judgments for 698 questions.

Category (# samples)	Preference: Human	Preference: Model	Krippendorff’s $\alpha$
Physics (94)	33%	67%	0.01
Chemistry (96)	22%	78%	0.20
Biology (110)	25%	75%	0.36
Technology (110)	16%	84%	0.53
Economics (110)	14%	86%	0.31
History (92)	9%	91%	0.52
Law (86)	16%	84%	0.59
Average	19.29%	80.71%	0.36
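The "Average" row can be checked with a short Python snippet. This is only a sanity check written for this table: the values are copied from the rows above, and the snippet assumes the reported averages are unweighted means over the seven domains, which the rounded figures are consistent with.

```python
# Per-domain values copied from the table above:
# (number of samples, % human answers preferred, Krippendorff's alpha).
rows = {
    "Physics": (94, 33, 0.01),
    "Chemistry": (96, 22, 0.20),
    "Biology": (110, 25, 0.36),
    "Technology": (110, 16, 0.53),
    "Economics": (110, 14, 0.31),
    "History": (92, 9, 0.52),
    "Law": (86, 16, 0.59),
}

n_domains = len(rows)
total_samples = sum(samples for samples, _, _ in rows.values())

# Unweighted (macro) averages over domains; this averaging scheme is an assumption.
avg_human = sum(human for _, human, _ in rows.values()) / n_domains
avg_model = 100 - avg_human  # each row's two preference shares sum to 100%
avg_alpha = sum(alpha for _, _, alpha in rows.values()) / n_domains

print(f"samples={total_samples} human={avg_human:.2f}% "
      f"model={avg_model:.2f}% alpha={avg_alpha:.2f}")
```

The per-domain sample counts also sum to the 698 questions reported in the text.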

Aspect	Question	Answer A	Answer B	Expert Justification
Question Misconception	How do people in prison get so jacked with terrible nutrition and no supplements?	(human) […] While prison food generally doesn’t taste great, it does have calories and protein. That can also be supplemented with the stuff they can buy on commissary (fish packs, chicken packs, cheese, […]	(model) […] In prison, inmates often have a lot of free time, which they can use to work out and develop a consistent exercise routine. They may not have access to a fully-equipped gym, but […]	This is an assumption, not every prison offers terrible food without nutritional value, also lots of "easy" food choices (lentils, eggs, bread, meat,..) offer high amounts of protein which is needed for weight gain.
Factuality	Why can you sleep through the night without going to the bathroom?	(human) Your body produces a hormone called (vasopressin) antidiuretic hormone, or adh, which causes the body to produce less urine at night. It’s the reason […]	(model) During sleep, your body produces less urine […] Additionally, the hormone vasopressin helps to reduce urine production at night. This combination of decreased […]	The urine production is decreased due to the hormone, not additionally. The more concentrated urine is because you cannot drink while sleeping.
Relevance	Why is it taboo to talk about people’s wages?	(human) Because companies used to be able to fire people […] discussing your wages is a federally protected right now in the USA, so by all means discuss discuss discuss! […]	(model) Talking about wages is considered taboo because it can create tension and jealousy among coworkers […]. Employers often discourage discussions to prevent potential […]	This does not contribute to the answer.
Completeness	Can anyone explain the differences between copyright and trademark?	(human) A trademark protects a brand’s symbol or logo. A copyright protects content. So the ac/dc logo with the lightning bolt would be trademarked. […]	(model) […] Copyright protects creative works like books, music, and art, […]. Trademark, on the other hand, protects brand identifiers like logos, […]	The answer fails to mention the broader scope of copyright protection, which includes creative works beyond just music and lyrics.
References	How does software that can detect text generated by ChatGPT work?	(human) […] output has a high degree of regularity […] It’s like comparing a handknit scarf to an industrial machine’s knit scarf: one might have small mistakes that show it was made by a human hand, whereas the other is almost perfect because the machine follows an exact pattern.	(model) ChatGPT detection software works by analyzing textual patterns and language use to determine if the text was generated by an AI like ChatGPT. Essentially, these programs compare the input text with known characteristics of AI-generated text […]	This example is well suited for the explanation in the paragraph. It gives a clear representation of how the software detects ChatGPT text in my head.

Dataset	Error span	Accuracy ( $\uparrow$ )	Consistency Score ( $\uparrow$ )
HQ²A	Different	$\textbf{38.56}\pm 0.93\%$	$0.71\pm 0.02$
	Adjacent	$24.18\pm 0.92\%$	$0.82\pm 0.01$
	Exact	$37.25\pm 0.00\%$	$\textbf{0.86}\pm 0.01$

Dataset	Approach	TigerScore: % Hallucinated samples ( $\downarrow$ )	TigerScore: Hallucination score ( $\downarrow$ )	Error Correction: Precision ( $\uparrow$ )	Error Correction: Recall ( $\uparrow$ )	Error Correction: F1 ( $\uparrow$ )
HQ²A	Human feedback	$2.61\pm 0.92$	$0.09\pm 0.01$	$0.86\pm 0.04$	$\textbf{1.00}\pm 0.00$	$0.94\pm 0.02$
	Baseline	$19.61$	$0.63$	-	-	-
	Improve	$1.31\pm 0.92$	$0.05\pm 0.04$	$\textbf{1.00}\pm 0.00$	$0.93\pm 0.05$	$0.97\pm 0.02$
	Generic	$1.31\pm 0.92$	$0.05\pm 0.03$	$0.97\pm 0.04$	$0.97\pm 0.05$	$0.97\pm 0.02$
	EIR	$\textbf{0.65}\pm 0.92$	$\textbf{0.03}\pm 0.04$	$0.97\pm 0.04$	$\textbf{1.00}\pm 0.00$	$\textbf{0.98}\pm 0.02$
	EIR w/ DPO	$4.57\pm 2.44$	$0.07\pm 0.02$	$0.90\pm 0.08$	$0.87\pm 0.05$	$0.88\pm 0.06$
ASQA	Baseline	$34.81$	$1.20$	-	-	-
	Improve	$20.85\pm 1.00$	$0.68\pm 0.03$	$0.70\pm 0.02$	$0.71\pm 0.01$	$0.70\pm 0.01$
	Generic	$18.67\pm 0.52$	$0.61\pm 0.01$	$0.72\pm 0.01$	$0.75\pm 0.01$	$0.74\pm 0.00$
	EIR	$\textbf{16.63}\pm 0.41$	$0.51\pm 0.02$	$\textbf{0.73}\pm 0.00$	$\textbf{0.82}\pm 0.02$	$\textbf{0.77}\pm 0.01$
	EIR w/ DPO	$22.61\pm 0.26$	$\textbf{0.45}\pm 0.01$	$0.64\pm 0.00$	$0.77\pm 0.01$	$0.71\pm 0.00$
ELI5	Baseline	$22.93$	$0.82$	-	-	-
	Improve	$10.05\pm 0.18$	$0.36\pm 0.02$	$0.75\pm 0.00$	$0.86\pm 0.00$	$0.80\pm 0.00$
	Generic	$6.06\pm 0.23$	$0.22\pm 0.01$	$0.84\pm 0.01$	$0.91\pm 0.00$	$0.87\pm 0.00$
	EIR	$\textbf{3.81}\pm 0.30$	$\textbf{0.13}\pm 0.01$	$\textbf{0.88}\pm 0.01$	$\textbf{0.96}\pm 0.01$	$\textbf{0.92}\pm 0.01$
	EIR w/ DPO	$5.71\pm 0.25$	$\textbf{0.13}\pm 0.00$	$0.83\pm 0.00$	$0.94\pm 0.01$	$0.88\pm 0.00$

Dataset	Approach	Comprehensiveness ( $\uparrow$ )	Preference ( $\uparrow$ )
HQ²A	Baseline	0.00%	7.84%
	Refined	100%	92.16%
ASQA	Baseline	82.00%	40.00%
	Refined	100%	60.00%
ELI5	Baseline	38.00%	0.00%
	Refined	100%	100%

Dataset (# samples)	Instruct Model	TigerScore: % Hallucinated samples ( $\downarrow$ )	TigerScore: Hallucination score ( $\downarrow$ )	SelfCheck Consistency ( $\downarrow$ )
HQ²A (70)	LLaMA2-7B	$18.57\pm 0.00$	$\textbf{0.60}\pm 0.00$	$0.166\pm 0.014$
	LLaMA2-7B + DPO	$\textbf{15.71}\pm 0.00$	$0.66\pm 0.00$	$\textbf{0.162}\pm 0.015$
	Mistral-7B	$20.00\pm 0.00$	$0.57\pm 0.00$	$\textbf{0.266}\pm 0.011$
	Mistral-7B + DPO	$\textbf{17.14}\pm 0.00$	$\textbf{0.54}\pm 0.00$	$0.285\pm 0.011$
ASQA (948)	LLaMA2-7B	$\textbf{26.58}\pm 1.49$	$\textbf{0.86}\pm 0.06$	$0.187\pm 0.014$
	LLaMA2-7B + DPO	$28.41\pm 1.06$	$0.89\pm 0.02$	$\textbf{0.178}\pm 0.006$
	Mistral-7B	$62.09\pm 0.35$	$2.08\pm 0.01$	$0.578\pm 0.003$
	Mistral-7B + DPO	$\textbf{60.80}\pm 0.56$	$\textbf{2.03}\pm 0.01$	$\textbf{0.555}\pm 0.008$
ELI5_general (1000)	LLaMA2-7B	$9.93\pm 1.05$	$0.32\pm 0.04$	$0.133\pm 0.001$
	LLaMA2-7B + DPO	$\textbf{9.33}\pm 0.66$	$\textbf{0.29}\pm 0.03$	$\textbf{0.130}\pm 0.004$
	Mistral-7B	$29.97\pm 0.97$	$0.90\pm 0.04$	$0.327\pm 0.003$
	Mistral-7B + DPO	$\textbf{22.77}\pm 1.03$	$\textbf{0.72}\pm 0.03$	$\textbf{0.319}\pm 0.011$
ELI5_science (1000)	LLaMA2-7B	$\textbf{9.47}\pm 0.47$	$0.31\pm 0.02$	$\textbf{0.137}\pm 0.003$
	LLaMA2-7B + DPO	$\textbf{9.47}\pm 0.76$	$\textbf{0.30}\pm 0.00$	$0.139\pm 0.004$
	Mistral-7B	$34.10\pm 0.94$	$1.07\pm 0.02$	$0.320\pm 0.004$
	Mistral-7B + DPO	$\textbf{29.03}\pm 1.51$	$\textbf{0.95}\pm 0.04$	$\textbf{0.297}\pm 0.010$
ELI5_history (1000)	LLaMA2-7B	$9.63\pm 0.59$	$0.30\pm 0.02$	$\textbf{0.188}\pm 0.005$
	LLaMA2-7B + DPO	$\textbf{7.60}\pm 0.08$	$\textbf{0.22}\pm 0.01$	$0.189\pm 0.005$
	Mistral-7B	$26.23\pm 0.38$	$0.79\pm 0.02$	$0.363\pm 0.016$
	Mistral-7B + DPO	$\textbf{22.17}\pm 1.31$	$\textbf{0.69}\pm 0.04$	$\textbf{0.345}\pm 0.013$
