Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering

Rachneet Sachdeva Yixiao Song Mohit Iyyer Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP Lab),
Department of Computer Science and Hessian Center for AI (hessian.AI),
Technical University of Darmstadt
University of Massachusetts Amherst
www.ukp.tu-darmstadt.de
Abstract

Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, which makes their faithful evaluation challenging. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 4.7k span-level error annotations for five different error types by expert annotators, along with preference judgments. Using our collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans with incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-informed refinement, that uses signals from the learned feedback model to refine generated answers, which we show reduces hallucination and improves answer quality. Furthermore, humans find answers generated by our approach comprehensive and highly prefer them (84%) over the baseline answers. To further research, we open-source our data and code: https://github.com/UKPLab/arxiv2024-lfqa-hallucination

1 Introduction

Figure 1: An overview of our data collection process. Based on our defined aspects, we collect expert human judgments for question-answer pairs on the Reddit platform and their corresponding answers from GPT-4.

Long-form question answering (LFQA) provides comprehensive, user-friendly, and in-depth responses to complex questions by leveraging state-of-the-art large language models (LLMs) and retriever components Krishna et al. (2021); Nakano et al. (2021). While LLMs generate plausible and convincing answers, they also frequently produce factually inconsistent, irrelevant, and incomplete content Goyal and Durrett (2020); Laban et al. (2022); Menick et al. (2022); Ji et al. (2022), which limits their use in real-world applications.

Simplistic evaluation metrics such as BLEU Papineni et al. (2002) and ROUGE Lin (2004) do not align with human experts’ judgments on long-form answers Wang et al. (2022). There are many aspects of LFQA – factuality, completeness, and relevance – that require evaluation, motivating us to focus on span-level fine-grained error detection. While previous studies have focussed on evaluating factual errors in long-form text generation (Lee et al., 2022; Min et al., 2023; Li et al., 2023; Muhlgay et al., 2023), other aspects of evaluation, such as response completeness and relevance – which can potentially mislead and confuse users – have been largely overlooked.

LLMs make many errors in LFQA that require expert human annotation to detect Gillick and Liu (2010); Iskender et al. (2020); Wang et al. (2022). Recent work from Xu et al. (2023a) reports that aspects such as factuality, relevance, completeness, structure, references, and accessibility are essential for evaluating long-form answers. There are no prior studies for LFQA that examine these errors at the span level. Span-level error annotation and categorization have been important for evaluating and improving systems in other generation tasks such as machine translation Freitag et al. (2021). We fill this gap by collecting HaluQuestQA, a dataset of LFQA answers annotated at the span level by expert annotators with five different error types (question misconception, factuality, completeness, relevance, and helpful references), in addition to preference judgments, as shown in Figure 1.

Next, we train an automatic feedback model on this dataset that predicts erroneous answer spans with incomplete information and provides associated explanations. The feedback model provides fine-grained feedback in the form of error location (sentence level), error reason, and confidence score without the aid of a reference text Xu et al. (2023b). Finally, we propose Error-Informed Refinement, a prompt-based approach that uses signals from the feedback model to refine generated answers Madaan et al. (2023), which we show reduces hallucination and improves answer quality.

Our contributions can be summarized as follows: (1) We release HaluQuestQA, a dataset of span-level error annotations on pairs of human-written and model-generated answers. Our data analysis shows that long-form answers lack comprehensiveness and provide unhelpful references; (2) We train a feedback model to detect span-level errors aligned with expert human judgments; (3) We propose Error-informed refinement, an approach to refine LLM-generated answers with fine-grained feedback provided by our learned model. Our approach consistently outperforms baselines utilizing coarse-grained feedback (lacking detailed error justifications), reducing hallucinations.

2 Related Work

Human evaluation.

Prior work Krishna et al. (2021) has shown that human evaluation for LFQA tasks is challenging due to long answer lengths, and expert annotators are required to evaluate them effectively. Xu et al. (2023a) hire (non-)expert annotators and identify nine multi-faceted aspects for meaningful LFQA evaluation. While some of these fine-grained aspects, such as factuality Goyal and Durrett (2020); Laban et al. (2022), coherence Goyal et al. (2022), and completeness Tang et al. (2024), have been studied to investigate hallucinations in dialogue summarization tasks, ours is amongst the first works to study LFQA-centric properties such as question misconception, factuality, relevance, completeness, and helpful references, at the span-level.

Detecting and Mitigating Hallucinations in LLMs.

Increasing focus on the reliability of LLMs has led to the development of explainable evaluation metrics Zhong et al. (2022); Fu et al. (2023) to detect errors in LLM generations. Xu et al. (2023b) present InstructScore, an explainable metric based on LLaMA Touvron et al. (2023a), to obtain detailed error analysis for LLM-generated text. However, most of the current evaluation metrics require hard-to-obtain gold references. Recent work proposes a reference-free evaluation metric, TigerScore Jiang et al. (2023b), that can locate, categorize, and explain errors across various text generation tasks, including summarization, translation, and LFQA. While LLM-based metrics can detect diverse errors, it is not always feasible to have an external evaluator during real-time inference; hence, sampling-based approaches Chen et al. (2023); Manakul et al. (2023); Malon and Zhu (2024) have been proposed, wherein consistency across multiple sampled model outputs is used as a measure of factuality.

Reinforcement learning with human feedback (RLHF) Ziegler et al. (2019), a framework to incorporate human feedback to align LMs, has been used to reduce undesirable LLM generations Ouyang et al. (2022); Bai et al. (2022a, b). Wu et al. (2023b) propose fine-grained RLHF, a framework that enables learning reward models associated with span-level human feedback on different error types. However, training multiple reward models is complex and compute-intensive. A recent alignment technique, direct preference optimization (DPO) Rafailov et al. (2023) bypasses the reward modeling step in RLHF and has been used to fine-tune LMs for factuality using preference ranking over model responses Tian et al. (2023). Human feedback has also been used to train feedback models Wang et al. (2023); Xu et al. (2024) to guide the refinement of LLM outputs Madaan et al. (2023); Welleck et al. (2023), improving answer quality. However, these feedback models are either not trained to provide fine-grained error feedback or rely on the ground truth passage to detect errors, which may not always be accessible for open-domain QA tasks. Our work aims to annotate fine-grained errors in LFQA, using this data to train a reference-free feedback model for sentence-level error detection with justifications. We further propose a prompt-based approach to refine answers with feedback, enhancing their comprehensiveness.

3 HaluQuestQA (HQ2A)

Prior LFQA evaluations with non-expert Nakano et al. (2021) and expert Xu et al. (2023a) annotators collect preference judgments over model responses. However, overall preference is not indicative of fine-grained errors in LFQA. As a first step, we annotate span-level errors in long-form answers, with explanations from domain experts.

3.1 Hiring Annotators

We recruit domain experts on Prolific’s academic annotation platform for the seven domains shown in Table 1. The expert selection is based on age (22-32), demographics (US and UK), education (undergraduate or graduate degree in the target domain), and native language (English). For each target domain, we first conduct a small pilot comprising ten samples, where, given a question and two candidate answers, the experts evaluate the answers and mark the incorrect spans based on our defined evaluation criteria (Section 3.2). Based on the pilot results, we choose three experts per domain and give each of them a large-scale study containing 35-50 question-answer pairs. In total, we collect expert judgments for 698 questions.

| Category (# samples) | Preference: Human | Preference: Model | Krippendorff’s $\alpha$ |
|---|---|---|---|
| Physics (94) | 33% | 67% | 0.01 |
| Chemistry (96) | 22% | 78% | 0.20 |
| Biology (110) | 25% | 75% | 0.36 |
| Technology (110) | 16% | 84% | 0.53 |
| Economics (110) | 14% | 86% | 0.31 |
| History (92) | 9% | 91% | 0.52 |
| Law (86) | 16% | 84% | 0.59 |
| Average | 19.29% | 80.71% | 0.36 |

Table 1: Overview of HaluQuestQA and expert answer preferences, with experts’ agreement on a smaller subset (~15%) calculated using Krippendorff’s alpha.

3.2 Task Setup

We evaluate two answers (human and model-generated) to the same question. This setting enables us to identify errors made by humans and state-of-the-art LFQA systems. We chose GPT-4 (gpt-4-0314) as the LFQA model to evaluate since previous work Bhat et al. (2023) has shown it to outperform existing open-source LLMs (LLaMA and Alpaca Taori et al. (2023)) in reasoning and inferring from long context. Since this model has likely seen training data up to September 2021, it may have already seen the ELI5 dataset released by Fan et al. (2019) during its pre-training. Thus, we scrape more recent questions from the r/explainlikeimfive subreddit posted between November 2022 and March 2023. Questions on ELI5 are classified into domains via the FLAIR label (a tag containing post information), which lets us perform domain-specific analysis. For unclassified categories (like History and Law), we cluster the OTHER category questions (not in pre-defined ELI5 domains) using K-means clustering Selim and Ismail (1984) and identify the domain-specific questions. For each domain, we sample between 100 and 200 questions with their highest-voted answer, whose length ranges between 50 and 500 words (more details in Appendix A).
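The answer selection step described above can be sketched as follows. This is a minimal illustration, not the authors' scraping code: the `Answer` class and `pick_answer` function are hypothetical names, word counts use whitespace splitting, and we assume that a question is dropped when its top-voted answer falls outside the 50-500 word range (the paper does not state whether a lower-ranked answer would be used instead).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    """Hypothetical container for one scraped answer."""
    text: str
    votes: int


def pick_answer(answers: List[Answer],
                min_words: int = 50,
                max_words: int = 500) -> Optional[Answer]:
    """Return the highest-voted answer if its length is within bounds.

    Assumption: if the top answer is too short or too long, the question
    is skipped (None is returned) rather than falling back to another answer.
    """
    if not answers:
        return None
    top = max(answers, key=lambda a: a.votes)
    n_words = len(top.text.split())  # whitespace word count (assumption)
    return top if min_words <= n_words <= max_words else None
```

A question whose `pick_answer` result is `None` would simply be excluded from the sampling pool under these assumptions.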

To obtain the model-generated answers, we zero-shot prompt the GPT-4 model (Section B.1). We host the annotation task on the INCEpTION platform Klie et al. (2018) and evaluate the following aspects (detailed annotation guidelines are provided in Appendix F):

  1. Question misconception: False assumptions made within the given question.

  2. Factuality: Accuracy and correctness of the answer as per verifiable facts.

  3. Relevance: Specificity and meaningfulness of the answer.

  4. Completeness: Answer comprehensiveness ensuring all question aspects are addressed.

  5. References: (Un)helpful examples, analogies, and external references (websites or links) in the answer.

| Aspect | Question | Answer A | Answer B | Expert Justification |
|---|---|---|---|---|
| Question Misconception | How do people in prison get so jacked with terrible nutrition and no supplements? | (human) […] While prison food generally doesn’t taste great, it does have calories and protein. That can also be supplemented with the stuff they can buy on commissary (fish packs, chicken packs, cheese, […] | (model) […] In prison, inmates often have a lot of free time, which they can use to work out and develop a consistent exercise routine. They may not have access to a fully-equipped gym, but […] | This is an assumption, not every prison offers terrible food without nutritional value, also lots of "easy" food choices (lentils, eggs, bread, meat,..) offer high amounts of protein which is needed for weight gain. |
| Factuality | Why can you sleep through the night without going to the bathroom? | (human) Your body produces a hormone called (vasopressin) antidiuretic hormone, or adh, which causes the body to produce less urine at night. It’s the reason […] | (model) During sleep, your body produces less urine […] Additionally, the hormone vasopressin helps to reduce urine production at night. This combination of decreased […] | The urine production is decreased due to the hormone, not additionally. The more concentrated urine is because you cannot drink while sleeping. |
| Relevance | Why is it taboo to talk about people’s wages? | (human) Because companies used to be able to fire people […] discussing your wages is a federally protected right now in the USA, so by all means discuss discuss discuss! […] | (model) Talking about wages is considered taboo because it can create tension and jealousy among coworkers […]. Employers often discourage discussions to prevent potential […] | This does not contribute to the answer. |
| Completeness | Can anyone explain the differences between copyright and trademark? | (human) A trademark protects a brand’s symbol or logo. A copyright protects content. So the ac/dc logo with the lightning bolt would be trademarked. […] | (model) […] Copyright protects creative works like books, music, and art, […]. Trademark, on the other hand, protects brand identifiers like logos, […] | The answer fails to mention the broader scope of copyright protection, which includes creative works beyond just music and lyrics. |
| References | How does software that can detect text generated by ChatGPT work? | (human) […] output has a high degree of regularity […] It’s like comparing a handknit scarf to an industrial machine’s knit scarf: one might have small mistakes that show it was made by a human hand, whereas the other is almost perfect because the machine follows an exact pattern. | (model) ChatGPT detection software works by analyzing textual patterns and language use to determine if the text was generated by an AI like ChatGPT. Essentially, these programs compare the input text with known characteristics of AI-generated text […] | This example is well suited for the explanation in the paragraph. It gives a clear representation of how the software detects ChatGPT text in my head. |

Table 2: Examples of expert annotated errors in long-form answers based on the defined evaluation criteria.

Based on the defined evaluation criteria, annotators identify and highlight mistakes in the question or answers with free-form justifications, in addition to overall answer preference. Examples from HaluQuestQA are shown in Table 2.

3.3 Quantitative Analysis

Experts prefer GPT-4 answers over human answers.

As shown in Table 1, experts display a strong preference (80.7%) for model-generated answers from GPT-4 compared to human answers. Potentially, humans prefer fluent answers, and LLMs are known to optimize for fluency Wu et al. (2023a); Coyne and Sakaguchi (2023). Moreover, the preference of our annotators is corroborated by similar findings in summarization Liu et al. (2023) and LFQA Xu et al. (2023a), which show that GPT-3 answers score higher than human answers.

Science questions are challenging for LLMs.

Model-generated answers are strongly preferred by experts in history, law, technology, and economics (>80%). In contrast, the science domains are more challenging, with a preference for model answers ranging between 60%-80%.

Expert (dis)agreement.

In Table 1, we report Krippendorff’s alpha Hayes and Krippendorff (2007) as a measure of agreement for experts’ overall answer preference. Our expert annotators achieve moderate agreement in technology, history, and law, fair agreement in biology and economics, and slight agreement in physics and chemistry (interpretation of agreement follows Wong et al. (2021)). We emphasize that the disagreement between experts is not a failure of our evaluation. Instead, it highlights the challenges of identifying fine-grained errors in answers, which in turn affect overall preference. Moreover, prior work reports similar findings for human disagreement in LFQA evaluation Xu et al. (2023a).
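For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is an illustration under assumptions, not the authors' evaluation script: the function name is hypothetical, the nominal variant is assumed (the paper does not state which distance metric was used), and each item is represented as the list of preference labels it received.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, List, Sequence


def krippendorff_alpha_nominal(units: Sequence[List[Hashable]]) -> float:
    """Krippendorff's alpha for nominal data.

    units: one inner list per item (e.g. per question), holding the labels
    assigned by the annotators who rated that item.
    """
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        # Every ordered pair of ratings within an item counts 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # only one label in use: alpha is undefined
    return 1.0 - (n - 1) * observed / expected
```

With labels such as `"human"` and `"model"` per question, values near 1 indicate strong agreement, values near 0 indicate chance-level agreement, and negative values indicate systematic disagreement.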

Answer scoring.

We score human and model answers on our defined evaluation criteria to understand how experts’ answer preferences diverge across different domains. For each of the question misconception and reference aspects, the score $\mathcal{S}=1$ when the question has no misconceptions and the references, if provided, help answer the question; otherwise, $\mathcal{S}=0$. For aspects of factuality, relevance, and completeness, we calculate $\mathcal{S}$ as:

$$\mathcal{S} = 1 - \left(\frac{\text{\# Error sentences}}{\text{Total \# of sentences}}\right)$$

For calculating the overall answer scores, we leave out the question misconception scores because this aspect pertains to the question. We sum the other aspect scores and include the overall answer preference scores ($\mathcal{S}=1$ if preferred) to get the final score. Finally, we normalize this score between 0 and 1. In Figure 2, we report the fine-grained aspect scores for human and model answers across different domains and discuss our findings below.
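The scoring rules above can be sketched in a few lines. This is a minimal illustration with hypothetical function names. One assumption is made explicit: the text does not specify how the summed score is normalized, so the sketch divides by the number of components (four aspects plus the preference indicator), which maps the sum onto the range 0 to 1.

```python
def ratio_score(num_error_sentences: int, num_sentences: int) -> float:
    """S = 1 - (# error sentences / total # of sentences)."""
    if num_sentences <= 0:
        raise ValueError("an answer must contain at least one sentence")
    return 1.0 - num_error_sentences / num_sentences


def overall_score(factuality: float, relevance: float, completeness: float,
                  references: float, preferred: bool) -> float:
    """Sum the aspect scores and the preference indicator, then normalize.

    Question misconception is left out because it pertains to the question.
    Assumption: normalization divides by the number of components.
    """
    parts = [factuality, relevance, completeness, references,
             1.0 if preferred else 0.0]
    return sum(parts) / len(parts)
```

For example, an answer with 2 incomplete sentences out of 8 receives a completeness score of `ratio_score(2, 8)`, i.e. 0.75.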

Figure 2: Comparison of fine-grained scores of the human-written and model-generated answers for different evaluation criteria. The last figure (with red boundary) shows the averaged and normalized overall scores. A higher score represents fewer errors in the answers.

Questions from technology and economics are biased.

Ambiguous and misinformed questions can lead to undesirable answers Cole et al. (2023); Kim et al. (2023). Therefore, fair answer scoring requires prior estimation of question quality. For this, we utilize the question misconception aspect and find that questions from all evaluated domains contain misconceptions arising from the user’s bias or misinformation. This is especially prominent in technology and economics, where ~40% of the questions are misinformed; users lack the domain knowledge needed to ask the right questions.

Answers lack comprehensiveness and provide unhelpful references.

We observe that human-written and model-generated answers score high on factuality and relevance aspects, meaning most of the information provided in the answers is verifiable, trustworthy, and related to the question. Interestingly, the answers score low on the completeness and references aspects, lacking important information and providing web references and examples that are not useful, as per the experts’ judgments. Specifically, models hallucinate and provide incorrect or made-up web links. In contrast, human answers digress from the topic, providing irrelevant information that leads to undesirable conclusions.

Overall, model answers score better than the human answers in all the evaluated domains. While this is due to their better performance over humans on the considered aspects, we believe that the persuasive nature of model answers Salvi et al. (2024) also plays a crucial role in their higher preference.

4 Hallucination Mitigation

In Section 3.3, we have shown that LFQA answers lack comprehensiveness and omit helpful information. Therefore, we train a feedback model to identify erroneous answer spans with incomplete information and provide free-form error justifications. Our approach, Error-Informed Refinement, uses this feedback to refine answers and improve their overall quality without human intervention.

Figure 3: A pictorial view of our Error-informed refinement approach. The feedback model takes as input a question-answer pair and outputs span-level errors with justifications and a consistency score. The refinement model uses this feedback to improve the original answer.

4.1 Error Feedback Model

Given an input question and an LFQA response, the feedback model generates a label, [Complete] or [Incomplete], for every sentence $1 \ldots n$ in the response and gives associated reasons for the incomplete sentences (see Figure 3). We model this as a sequence-to-sequence task and fine-tune a LLaMA2-13B model Touvron et al. (2023b).

Fine-tuning.

Training the feedback model requires high-quality error annotations with justifications. To this end, we utilize our HQ2A dataset and extract QA pairs with errors in the completeness aspect. For every extracted sample, we segment the answer into sentences and mark every sentence with the [Complete] or [Incomplete] tag along with the expert’s justifications. The final dataset consists of 509 samples split into train (90%) and test (10%) sets. We train the model with batch size 4, learning rate $2 \times 10^{-5}$, and sequence length 1024 for 5 epochs. We list the prompts used in Section B.2.

Inference.

The trained feedback model hallucinates web references in about 20% of test samples. This likely occurs because the training data includes web references in expert error justifications, which the model struggles to replicate coherently. To combat this, we opt for a sampling-based approach Malon and Zhu (2024) to provide more consistent feedback. The intuition is that trustworthy details and references should appear in many other generated samples. Hence, during the decoding step, we use nucleus sampling Holtzman et al. (2020) with $p=0.9$, sample 20 responses from the feedback model, and check their consistency in two stages: 1) Tag consistency: This pertains to the consistency of span-level tag predictions, complete or incomplete, for each sampled response. The tag consistency score is calculated by counting the number of other sampled responses that match the tag sequence of each sampled output and averaging over the total number of samples. Formally, if the sampled tag predictions $p_1, \ldots, p_n$ consist of tag sequences $t_1, \ldots, t_n$, where $t_i$ is a list of tag predictions for every span, the score for sample $i$ is

$$\mathcal{S}_{\mathcal{TC}} = \frac{1}{n}\sum_{s=1}^{n} 1_{t_i = t_s} \tag{1}$$

where $1_{t_i = t_s}$ is 1 if the tag sequence $t_i$ is the same as tag sequence $t_s$ and 0 if not. The samples with the highest score are selected for the next stage. 2) Reason consistency: We assess the consistency of justifications given for the incomplete spans from the remaining samples. Specifically, we count the number of other sampled justifications from the LLM that matched each token of each sampled output and score each justification by the average count per token. Formally, if the sampled justifications $j_1, \ldots, j_n$ consist of words $w_i^k$, $k = 1 \ldots m_i$, the score of sample $i$ is

$$\mathcal{S}_{\mathcal{RC}} = \frac{1}{m_i}\sum_{k=1}^{m_i}\sum_{s=1}^{n} 1_{w_i^k \in j_s} \tag{2}$$

where $1_{w_i^k \in j_s}$ is 1 if token $w_i^k$ is in the justification $j_s$ and 0 if not. Finally, we select the sample output with the highest score as the feedback for the refinement model. After sampling, we notice a 50% reduction in reference hallucinations, down to ~5-10% of test set samples.
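The two-stage selection in Equations (1) and (2) can be sketched as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, justifications are tokenized by whitespace (the paper does not specify the tokenizer), the sums run over all samples including sample $i$ itself, as written in the equations, and reason consistency is computed only among the samples that survive the first stage.

```python
from typing import List, Sequence, Tuple


def tag_consistency(tag_seqs: Sequence[Tuple[str, ...]]) -> List[float]:
    """Equation (1): fraction of samples whose tag sequence equals t_i."""
    n = len(tag_seqs)
    return [sum(1 for t_s in tag_seqs if t_s == t_i) / n for t_i in tag_seqs]


def reason_consistency(justifications: Sequence[str]) -> List[float]:
    """Equation (2): average, per token, of the number of justifications
    that contain the token."""
    token_lists = [j.split() for j in justifications]  # assumed tokenizer
    token_sets = [set(tokens) for tokens in token_lists]
    scores = []
    for tokens in token_lists:
        if not tokens:
            scores.append(0.0)  # empty justification: nothing to match
            continue
        hits = sum(1 for w in tokens for s in token_sets if w in s)
        scores.append(hits / len(tokens))
    return scores


def select_feedback(samples: Sequence[Tuple[Sequence[str], str]]) -> int:
    """Return the index of the sampled feedback chosen by both stages.

    samples: (tag sequence, justification text) pairs, one per sampled response.
    """
    tc = tag_consistency([tuple(tags) for tags, _ in samples])
    best = max(tc)
    keep = [i for i, score in enumerate(tc) if score == best]  # stage 1
    rc = reason_consistency([samples[i][1] for i in keep])     # stage 2
    return keep[max(range(len(keep)), key=lambda k: rc[k])]
```

A justification containing an idiosyncratic web link shares few tokens with the other samples, so it receives a lower reason-consistency score and is less likely to be selected.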

4.2 Error-Informed Refinement (EIR)

Our approach is shown in Figure 3 and consists of two main components: an error feedback model (Section 4.1) and a refinement model. Given an input prompt $x_i$ and a corresponding human-written or model-generated response $y_i$, the feedback model $\mathcal{E}$ generates a targeted feedback $f_i$ that represents the quality of $y_i$ in free-form natural language. Finally, the refinement model uses $x_i$, $y_i$, and $f_i$, generating a refined and improved output response $\hat{y}_i$. The following sections describe our approach in more detail.

Refinement Model.

Our experiments use the LLaMA2-13B chat LLM and its DPO-optimized version (see Appendix C) as the refinement models. In each case, the model is zero-shot prompted with the fine-grained error feedback received from the error detection model. We also experiment with two strong feedback baselines: 1) Improve: The refinement model is zero-shot prompted to improve the answer without any feedback provided. 2) Generic: The refinement model is zero-shot prompted to improve the answer with generic error feedback that asks the model to provide a more complete and accurate answer. We list the prompts used in Section B.3.
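The difference between the three refinement settings can be illustrated with a small prompt builder. This is only a sketch: the actual prompts are listed in Section B.3, and the wording, the function name `build_refinement_prompt`, and the mode names below are all hypothetical placeholders chosen for illustration.

```python
from typing import Optional

# Hypothetical wording; the prompts actually used are given in Section B.3.
GENERIC_FEEDBACK = "Provide a more complete and accurate answer."


def build_refinement_prompt(question: str, answer: str, mode: str,
                            feedback: Optional[str] = None) -> str:
    """Assemble a zero-shot refinement prompt for one of three settings.

    mode: "improve" (no feedback), "generic" (fixed coarse feedback),
          or "eir" (fine-grained feedback from the feedback model).
    """
    header = f"Question: {question}\nAnswer: {answer}\n"
    if mode == "improve":
        return header + "Improve the answer."
    if mode == "generic":
        return (header + f"Feedback: {GENERIC_FEEDBACK}\n"
                + "Improve the answer using the feedback.")
    if mode == "eir":
        if not feedback:
            raise ValueError("the 'eir' setting needs feedback-model output")
        return (header + f"Feedback: {feedback}\n"
                + "Improve the answer using the feedback.")
    raise ValueError(f"unknown mode: {mode}")
```

Only the `"eir"` setting passes sentence-level tags and justifications to the refinement model; the two baselines differ in whether any feedback text is present at all.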

Datasets & Evaluation Metrics.

We test our error-informed refinement approach on three datasets: HQ2A with span-level error annotations for answer completeness, ASQA Stelmakh et al. (2022), and ELI5 Fan et al. (2019). The ASQA dataset consists of 6K ambiguous factoid questions with long-form answers synthesized from multiple sources to resolve the ambiguities. ELI5 consists of 270K long-form answers covering general topics from the subreddits "explainlikeimfive", "askscience", and "AskHistorians" on Reddit.

We evaluate the refined answers using TigerScore, a trained reference-free metric that identifies errors in LLM-generated text and assigns a score based on error severity. Specifically, we use the LLaMA-7B trained version of TigerScore, which highly correlates with humans for error detection in LFQA tasks Jiang et al. (2023b) while being much less expensive than human evaluation. Furthermore, we evaluate the error correction capabilities of our refinement approach using precision, recall, and F1. Lastly, we conduct a human evaluation to evaluate the comprehensiveness and preference of the refined answers compared to gold answers.

5 Results

We explore several research questions: 1) Can our learned feedback model detect errors in LFQA systems and help in the downstream answer refinement task? 2) Does fine-grained feedback produce better quality LFQA answers than coarse-grained feedback? 3) Does fine-grained feedback help mitigate hallucinations and improve the comprehensiveness of LFQA answers? 4) Are comprehensive answers from our approach preferred by humans?

5.1 Detecting Errors via Feedback Model

Since detecting erroneous spans in long-form answers is hard, we measure the accuracy of our feedback model in three different settings: model-detected erroneous spans are entirely different from (different), adjacent to (adjacent), or identical to (exact) the human-annotated spans. In Table 3, we show the sentence-level error detection accuracy of the feedback model as compared to the strong human baseline. The feedback model detects the exact and adjacent error spans with a combined accuracy of 61%. However, it is important to note that the model gives high consistency scores when confident in its predictions. A consistency score less than 0.80 means that the model is unsure in its error prediction feedback, while a score above 0.85 shows that the prediction highly aligns with humans.

| Dataset | Error span | Accuracy ($\uparrow$) | Consistency Score ($\uparrow$) |
|---|---|---|---|
| HQ2A | Different | $\mathbf{38.56} \pm 0.93\%$ | $0.71 \pm 0.02$ |
| | Adjacent | $24.18 \pm 0.92\%$ | $0.82 \pm 0.01$ |
| | Exact | $37.25 \pm 0.00\%$ | $\mathbf{0.86} \pm 0.01$ |

Table 3: Accuracy of our feedback model in detecting sentence-level errors compared to the expert error annotations. The feedback model predictions closely align with humans at consistency scores above $0.80$.
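One way to operationalize the three settings in Table 3 is sketched below. The matching rule is an assumption made for illustration, since the text does not define it precisely: a prediction counts as exact if any predicted sentence index coincides with an annotated one, as adjacent if none coincide but some predicted index is within one sentence of an annotated one, and as different otherwise. The function name `span_match` is hypothetical.

```python
from typing import Iterable


def span_match(predicted: Iterable[int], gold: Iterable[int]) -> str:
    """Categorize predicted error sentences against expert annotations.

    predicted, gold: sentence indices flagged as [Incomplete].
    Assumed rule: overlap -> "exact"; off by one sentence -> "adjacent";
    anything else (including no prediction) -> "different".
    """
    pred, ref = set(predicted), set(gold)
    if pred & ref:
        return "exact"
    if any(abs(p - g) == 1 for p in pred for g in ref):
        return "adjacent"
    return "different"
```

Accuracy per setting would then be the share of test samples falling into each category.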

We further evaluate our error feedback model by comparing the gap in the downstream LFQA refinement task when we use human-annotated error feedback. This evaluation measures the effectiveness of our feedback model in guiding the refinement of long-form answers and reducing hallucinations. In Table 4, we present the refinement performance of our feedback model as compared to the expert human feedback on HQ2A. We find that our feedback model’s performance is very competitive, reducing hallucinated samples by 2% and improving F1 score by 4% compared to the expert human feedback. This result validates the effectiveness of our feedback model in refining LFQA answers.

| Dataset | Approach | Tigerscore: % hallucinated samples (↓) | Tigerscore: hallucination score (↓) | Error correction: Precision (↑) | Error correction: Recall (↑) | Error correction: F1 (↑) |
| --- | --- | --- | --- | --- | --- | --- |
| HQ2A | Human feedback | 2.61 ± 0.92 | 0.09 ± 0.01 | 0.86 ± 0.04 | **1.00** ± 0.00 | 0.94 ± 0.02 |
| | Baseline | 19.61 | 0.63 | - | - | - |
| | Improve | 1.31 ± 0.92 | 0.05 ± 0.04 | **1.00** ± 0.00 | 0.93 ± 0.05 | 0.97 ± 0.02 |
| | Generic | 1.31 ± 0.92 | 0.05 ± 0.03 | 0.97 ± 0.04 | 0.97 ± 0.05 | 0.97 ± 0.02 |
| | EIR | **0.65** ± 0.92 | **0.03** ± 0.04 | 0.97 ± 0.04 | **1.00** ± 0.00 | **0.98** ± 0.02 |
| | EIR w/ DPO | 4.57 ± 2.44 | 0.07 ± 0.02 | 0.90 ± 0.08 | 0.87 ± 0.05 | 0.88 ± 0.06 |
| ASQA | Baseline | 34.81 | 1.20 | - | - | - |
| | Improve | 20.85 ± 1.00 | 0.68 ± 0.03 | 0.70 ± 0.02 | 0.71 ± 0.01 | 0.70 ± 0.01 |
| | Generic | 18.67 ± 0.52 | 0.61 ± 0.01 | 0.72 ± 0.01 | 0.75 ± 0.01 | 0.74 ± 0.00 |
| | EIR | **16.63** ± 0.41 | 0.51 ± 0.02 | **0.73** ± 0.00 | **0.82** ± 0.02 | **0.77** ± 0.01 |
| | EIR w/ DPO | 22.61 ± 0.26 | **0.45** ± 0.01 | 0.64 ± 0.00 | 0.77 ± 0.01 | 0.71 ± 0.00 |
| ELI5 | Baseline | 22.93 | 0.82 | - | - | - |
| | Improve | 10.05 ± 0.18 | 0.36 ± 0.02 | 0.75 ± 0.00 | 0.86 ± 0.00 | 0.80 ± 0.00 |
| | Generic | 6.06 ± 0.23 | 0.22 ± 0.01 | 0.84 ± 0.01 | 0.91 ± 0.00 | 0.87 ± 0.00 |
| | EIR | **3.81** ± 0.30 | **0.13** ± 0.01 | **0.88** ± 0.01 | **0.96** ± 0.01 | **0.92** ± 0.01 |
| | EIR w/ DPO | 5.71 ± 0.25 | **0.13** ± 0.00 | 0.83 ± 0.00 | 0.94 ± 0.01 | 0.88 ± 0.00 |

Table 4: Results of the quality of answers refined through coarse-grained and fine-grained feedback. We include two baselines using coarse-grained feedback: Improve and Generic for all the datasets. Additionally, we include the results for expert human feedback on our collected test set.

5.2 Fine- vs. Coarse-grained Feedback

Table 4 shows the quality of answers refined using different forms of feedback plus the baseline quality of answers from the datasets. We observe that inadequate feedback deteriorates the quality of generation. While directly prompting the refinement model (Improve) performs better than the baseline, prompting with more targeted feedback (Generic) consistently outperforms the Improve approach and generates better quality LFQA answers. This highlights the importance of providing detailed feedback to the refinement model.

In contrast, providing fine-grained feedback from our error detection model (EIR) outperforms coarse-grained feedback and even fine-grained human feedback (on HQ2A), delivering consistent improvements in reducing hallucinated samples and hallucination scores by ∼3% and Δ∼38%, respectively, and improving F1 scores by ∼5% over all the evaluated datasets. Using our DPO-aligned refinement model does not reduce the hallucinated samples. However, it achieves the best hallucination score on ASQA and ELI5, showing that optimization helps correct major errors in the answers. We show further evidence of the role of alignment in reducing hallucinations in Section E.1.
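The aggregate improvements quoted above can be approximately recovered from the mean values in Table 4. The sketch below assumes that the gains are averaged over both coarse-grained baselines (Improve and Generic) and all three datasets, with the hallucination score reported as a relative reduction; this averaging scheme is an assumption made for illustration, and the helper name is hypothetical.

```python
# Mean values copied from Table 4:
# (% hallucinated samples, hallucination score, F1)
TABLE4 = {
    "HQ2A": {"Improve": (1.31, 0.05, 0.97), "Generic": (1.31, 0.05, 0.97), "EIR": (0.65, 0.03, 0.98)},
    "ASQA": {"Improve": (20.85, 0.68, 0.70), "Generic": (18.67, 0.61, 0.74), "EIR": (16.63, 0.51, 0.77)},
    "ELI5": {"Improve": (10.05, 0.36, 0.80), "Generic": (6.06, 0.22, 0.87), "EIR": (3.81, 0.13, 0.92)},
}


def average_gains(table):
    """Average EIR gains over the coarse-grained baselines and datasets."""
    sample_drops, score_drops, f1_gains = [], [], []
    for rows in table.values():
        eir_samples, eir_score, eir_f1 = rows["EIR"]
        for baseline in ("Improve", "Generic"):
            samples, score, f1 = rows[baseline]
            sample_drops.append(samples - eir_samples)       # absolute, in points
            score_drops.append((score - eir_score) / score)  # relative reduction
            f1_gains.append(eir_f1 - f1)                     # absolute
    n = len(sample_drops)
    return sum(sample_drops) / n, sum(score_drops) / n, sum(f1_gains) / n


sample_drop, score_drop, f1_gain = average_gains(TABLE4)
print(f"hallucinated samples: -{sample_drop:.1f} points")   # -2.7 points
print(f"hallucination score:  -{100 * score_drop:.1f} %")   # -37.7 %
print(f"F1:                   +{100 * f1_gain:.1f} points") # +4.8 points
```

Rounded to whole numbers, these values match the ∼3%, ∼38%, and ∼5% figures reported in the text.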

5.3 Human Evaluation

To test the comprehensiveness and overall quality of the answers generated using our refinement approach, we hire three annotators and perform a human evaluation on a subset of 50 samples each from HQ2A, ASQA, and ELI5 datasets.

Table 5 shows the results of our human evaluation of the original and refined answers. Annotators find the answers produced by our approach comprehensive, meaning all the questions are answered thoroughly without omitting important information. However, a comprehensive answer does not necessarily mean a better answer. Therefore, we also evaluate the overall preference of our answers, incorporating factors such as factuality and relevance compared to the baseline answers. We observe that annotators significantly prefer the refined answers (∼84%) across all the datasets, indicating their factual correctness and relevance. We provide details on the human agreement in Section E.2.

| Dataset | Approach | Comprehensiveness (↑) | Preference (↑) |
| --- | --- | --- | --- |
| HQ2A | Baseline | 0.00 % | 7.84 % |
| | Refined | 100 % | 92.16 % |
| ASQA | Baseline | 82.00 % | 40.00 % |
| | Refined | 100 % | 60.00 % |
| ELI5 | Baseline | 38.00 % | 0.00 % |
| | Refined | 100 % | 100 % |

Table 5: Human evaluation results on the comprehensiveness and preference of refined answers over the baseline answers from three datasets.
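As a quick check of the aggregate preference, the following sketch averages the per-dataset preference for refined answers from Table 5. It assumes that the ∼84% figure is the unweighted mean over the three datasets, which is an assumption made for illustration.

```python
# Preference for refined answers, copied from Table 5 (in %).
preference_refined = {"HQ2A": 92.16, "ASQA": 60.00, "ELI5": 100.00}

# Unweighted mean across datasets (assumed aggregation).
mean_preference = sum(preference_refined.values()) / len(preference_refined)
print(round(mean_preference, 2))  # 84.05
```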

6 Conclusion

In this work, we introduce HaluQuestQA, a dataset of expert human judgments on fine-grained errors (question misconception, factuality, relevance, completeness, and references) in LFQA. Using our dataset, we analyze the pitfalls of human and model long-form answers, identifying issues with comprehensiveness and unhelpful references. To address these, we propose Error-informed refinement, an approach that uses signals from our learned feedback model to refine LLM responses. Our feedback model outperforms baseline feedback models and expert human feedback in guiding answer refinement and reducing hallucinations. A human evaluation confirms the effectiveness of our approach, with participants finding our refined answers more comprehensive and preferable to baseline outputs.

Limitations

Despite providing an in-depth analysis of hallucinations in human-written and model-generated responses, our work focuses only on the LFQA task. Thus, we encourage future work to apply our findings to other tasks such as summarization and translation. We study a diverse but limited scope of long-form answers drawn from online community platforms. Questions from other domains, such as education or commerce, may raise different issues and might need to be evaluated in a different way.

Our trained error detection model shows high correlation with human annotations but relies on a high consistency of model outputs. The model may hallucinate if the consistency score is low (< 0.80). Training larger models with more high-quality data is an interesting direction for future work. Lastly, in our refinement approach, we only experiment with the instruction-tuned variant of LLaMA2. Models with better or worse instruction-following capabilities may give different results, and improving the refinement process is a promising future direction for mitigating hallucinations.

Ethics and Broader Impact Statement

The expert annotation data collection protocol has been determined to be exempt from review by an institutional review board (IRB). All the collected data will be publicly available under the CC BY-SA 4.0 license. We hire annotators on the academic annotation platform Prolific and gather no sensitive user information except demographics and annotator performance data. We examined the collected data and ascertained that it contains no toxic or harmful content.

Acknowledgements

This research work is funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. Yixiao Song and Mohit Iyyer are supported by the award IIS-2312949 from the National Science Foundation (NSF).

We thank Sukannya Purkayastha and Haritz Puerto for their insightful feedback on the paper and Manika Arvind Arora for the valuable feedback on the annotation setup. Lastly, we are grateful to our dedicated annotators who helped create the HaluQuestQA dataset.

References

  • Bai et al. (2022a) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, Benjamin Mann, and Jared Kaplan. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR, abs/2204.05862.
  • Bai et al. (2022b) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 2022b. Constitutional AI: harmlessness from AI feedback. CoRR, abs/2212.08073.
  • Bhat et al. (2023) Meghana Moorthy Bhat, Rui Meng, Ye Liu, Yingbo Zhou, and Semih Yavuz. 2023. Investigating answerability of llms for long-form question answering. CoRR, abs/2309.08210.
  • Chen et al. (2023) Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. 2023. Universal self-consistency for large language model generation. CoRR, abs/2311.17311.
  • Cole et al. (2023) Jeremy R. Cole, Michael J. Q. Zhang, Daniel Gillick, Julian Eisenschlos, Bhuwan Dhingra, and Jacob Eisenstein. 2023. Selectively answering ambiguous questions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 530–543. Association for Computational Linguistics.
  • Coyne and Sakaguchi (2023) Steven Coyne and Keisuke Sakaguchi. 2023. An analysis of gpt-3’s performance in grammatical error correction. CoRR, abs/2303.14342.
  • Fan et al. (2019) Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy. Association for Computational Linguistics.
  • Freitag et al. (2021) Markus Freitag, George F. Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. CoRR, abs/2104.14478.
  • Fu et al. (2023) Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire. CoRR, abs/2302.04166.
  • Gillick and Liu (2010) Dan Gillick and Yang Liu. 2010. Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 148–151, Los Angeles. Association for Computational Linguistics.
  • Goyal and Durrett (2020) Tanya Goyal and Greg Durrett. 2020. Evaluating factuality in generation with dependency-level entailment. CoRR, abs/2010.05478.
  • Goyal et al. (2022) Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2022. SNaC: Coherence error detection for narrative summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 444–463, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Hayes and Krippendorff (2007) Andrew F. Hayes and Klaus Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77–89.
  • Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Iskender et al. (2020) Neslihan Iskender, Tim Polzehl, and Sebastian Möller. 2020. Best practices for crowd-based evaluation of German summarization: Comparing crowd, expert and automatic evaluation. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 164–175, Online. Association for Computational Linguistics.
  • Ji et al. (2022) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. CoRR, abs/2202.03629.
  • Jiang et al. (2023a) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023a. Mistral 7b. CoRR, abs/2310.06825.
  • Jiang et al. (2023b) Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, and Wenhu Chen. 2023b. Tigerscore: Towards building explainable metric for all text generation tasks.
  • Kim et al. (2023) Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang. 2023. Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 996–1009. Association for Computational Linguistics.
  • Klie et al. (2018) Jan-Christoph Klie, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho, and Iryna Gurevych. 2018. The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pages 5–9, Santa Fe, New Mexico. Association for Computational Linguistics.
  • Krishna et al. (2021) Kalpesh Krishna, Aurko Roy, and Mohit Iyyer. 2021. Hurdles to progress in long-form question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4940–4957, Online. Association for Computational Linguistics.
  • Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  • Laban et al. (2022) Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177.
  • Lee et al. (2022) Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Factuality enhanced language models for open-ended text generation. In Advances in Neural Information Processing Systems.
  • Li et al. (2023) Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. Halueval: A large-scale hallucination evaluation benchmark for large language models. CoRR, abs/2305.11747.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Liu et al. (2023) Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, and Dragomir Radev. 2023. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 4140–4170. Association for Computational Linguistics.
  • Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  • Malon and Zhu (2024) Christopher Malon and Xiaodan Zhu. 2024. Self-consistent decoding for more factual open responses. CoRR, abs/2403.00696.
  • Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9004–9017. Association for Computational Linguistics.
  • Menick et al. (2022) Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, H. Francis Song, Martin J. Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. 2022. Teaching language models to support answers with verified quotes. CoRR, abs/2203.11147.
  • Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. CoRR, abs/2305.14251.
  • Muhlgay et al. (2023) Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, and Yoav Shoham. 2023. Generating benchmarks for factuality evaluation of language models. CoRR, abs/2307.06908.
  • Nakano et al. (2021) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. Webgpt: Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  • Salvi et al. (2024) Francesco Salvi, Manoel Horta Ribeiro, Riccardo Gallotti, and Robert West. 2024. On the conversational persuasiveness of large language models: A randomized controlled trial. CoRR, abs/2403.14380.
  • Selim and Ismail (1984) Shokri Z. Selim and M. A. Ismail. 1984. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intell., 6(1):81–87.
  • Stelmakh et al. (2022) Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. 2022. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Tang et al. (2024) Liyan Tang, Igor Shalyminov, Amy Wing mei Wong, Jon Burnsky, Jake W. Vincent, Yu’an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, and Kathleen McKeown. 2024. Tofueval: Evaluating hallucinations of llms on topic-focused dialogue summarization.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  • Tian et al. (2023) Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, and Chelsea Finn. 2023. Fine-tuning language models for factuality. CoRR, abs/2311.08401.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  • Wang et al. (2022) Shufan Wang, Fangyuan Xu, Laure Thompson, Eunsol Choi, and Mohit Iyyer. 2022. Modeling exemplification in long-form question answering via retrieval. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2079–2092, Seattle, United States. Association for Computational Linguistics.
  • Wang et al. (2023) Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2023. Shepherd: A critic for language model generation. CoRR, abs/2308.04592.
  • Welleck et al. (2023) Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2023. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Wong et al. (2021) Ka Wong, Praveen K. Paritosh, and Lora Aroyo. 2021. Cross-replication reliability - an empirical approach to interpreting inter-rater reliability. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 7053–7065. Association for Computational Linguistics.
  • Wu et al. (2023a) Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael R. Lyu. 2023a. Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark. CoRR, abs/2303.13648.
  • Wu et al. (2023b) Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2023b. Fine-grained human feedback gives better rewards for language model training. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  • Xu et al. (2023a) Fangyuan Xu, Yixiao Song, Mohit Iyyer, and Eunsol Choi. 2023a. A critical evaluation of evaluations for long-form question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3225–3245, Toronto, Canada. Association for Computational Linguistics.
  • Xu et al. (2024) Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, and Markus Freitag. 2024. Llmrefine: Pinpointing and refining large language models via fine-grained actionable feedback.
  • Xu et al. (2023b) Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Wang, and Lei Li. 2023b. INSTRUCTSCORE: towards explainable text generation evaluation with automatic feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 5967–5994. Association for Computational Linguistics.
  • Zhong et al. (2022) Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 2023–2038. Association for Computational Linguistics.
  • Ziegler et al. (2019) Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. CoRR, abs/1909.08593.

Appendix A Data Analysis

This section presents additional insights on our HaluQuestQA (HQ2A) dataset.

A.1 Answer Length Distribution

Figure 4 compares the length distribution of human-written and model-generated answers. We observe that the lengths of human and model answers are comparable, which allows for a fair evaluation. Across all domains, the length of collected answers ranges between 50 and 500 words, with an average length of 100 words.

Refer to caption
Figure 4: Answer length distribution of human-written and model-generated answers (H/M) in our expert-annotated dataset.

A.2 Overall Answer Preference

In Figure 5, we plot the word frequency distribution of the free-form answer justifications provided by our expert annotators. Apart from our considered evaluation aspects, we observe that the annotators also find an answer's clarity, conciseness, and ease of understanding helpful in deciding the overall best answer. We encourage future LFQA research to consider these aspects in their evaluation.

Refer to caption
Figure 5: Distribution of the top 50 most common words mentioned by our expert annotators in their overall answer justifications. The size and color of the bubble represent the word frequency and importance, respectively. The green and orange colors denote the important evaluated and non-evaluated aspects, respectively, while blue depicts the generic terms used in answer justifications.

Appendix B Prompts

This section lists the prompts for data collection, training the error detection model, and refining answers using our Error-informed approach.

B.1 Data Collection

We prompt GPT-4 in a zero-shot manner to generate responses to questions asked on the Reddit platform, as shown in Section 3.1. We use the default generation parameters of the OpenAI API, except for temperature=0.1 and max_tokens=1.5*(human_answer_length). We specifically instruct the model to generate a response of similar length to the corresponding human response on Reddit, so that model-generated and human-written answers can be compared fairly on our defined evaluation criteria.

f"""Your task is to answer a question by providing a clear and concise explanation of a complex concept in a way that is accessible for laypeople. The question was posted on the Reddit forum Explain Like I’m Five (r/explainlikeimfive). Please keep in mind that the question is not literally meant for 5-year-olds, so you should not answer the question in a way that you are talking to a child. Your answer should be around {human_answer_length} words and should break down the concept into understandable parts, providing relevant examples or analogies where appropriate. You should also aim to make your explanation easy to follow, using clear and concise language throughout. Your answer should maintain accuracy and clarity. When appropriate, you can start with one sentence summarizing the main idea of the answer.

Question: {question}

Answer (around {human_answer_length} words): """
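As a rough illustration of how the template and the length-dependent token budget fit together, the sketch below fills the prompt and derives max_tokens from the human answer length. The names `PROMPT_TEMPLATE` and `build_generation_request`, the abbreviated template text, the `"gpt-4"` model string, counting length in whitespace-separated words, and truncating the budget to an integer are all assumptions made for illustration. The returned dictionary only mirrors the parameters named above and is not tied to a particular client call.

```python
# Abbreviated stand-in for the full prompt (assumption: the real template
# contains the complete instruction text from the listing).
PROMPT_TEMPLATE = (
    "Your task is to answer a question ... "
    "Your answer should be around {human_answer_length} words ...\n\n"
    "Question: {question}\n\n"
    "Answer (around {human_answer_length} words): "
)


def build_generation_request(question: str, human_answer: str) -> dict:
    """Fill the prompt and derive generation parameters (hypothetical helper)."""
    # Assumption: answer length is measured in whitespace-separated words.
    human_answer_length = len(human_answer.split())
    prompt = PROMPT_TEMPLATE.format(
        question=question, human_answer_length=human_answer_length
    )
    return {
        "model": "gpt-4",  # assumed model identifier
        "prompt": prompt,
        "temperature": 0.1,
        # max_tokens = 1.5 * human_answer_length, truncated to an integer.
        "max_tokens": int(1.5 * human_answer_length),
    }


request = build_generation_request(
    question="Why is the sky blue?",
    human_answer="Sunlight scatters off air molecules and blue light scatters the most",
)
print(request["max_tokens"], request["temperature"])
```

Tying the token budget to the human answer length keeps the two answers comparable in size, which is the stated motivation for the length instruction in the prompt.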

B.2 Feedback Model

We use expert error annotations for the completeness aspect from our HQ2A dataset to train our feedback model. In Table 7, we show an example prompt used to train our feedback model. Given an instruction and an input question-answer pair, the output is a sentence-level prediction of answer completeness with detailed justifications.

f"""### Instruction: When given a question and answer statements, evaluate whether each given statement provides sufficient information for answering the question. Use the ’[Incomplete]’ tag to indicate answer incompleteness, and ’[Complete]’ tag to indicate completeness, with reasons. Please note that the answer can have single, multiple or no incomplete statements.

### Input:
Question: Can anyone explain the differences between copyright and trademark?
Answer:
1. A trademark protects a brand’s symbol or logo.
2. A copyright protects content.
3. So the ac/dc logo with the lightning bolt would be trademarked.
4. The music and lyrics to thunderstruck would be copyrighted.
5. Edit: eli10 addendum: just to be clear, the content of a copyright can also be an image.
6. So the album cover to thunderstruck’s album, razor’s edge, would be copyrighted because it is artistic content owned by someone, but doesn’t identify ac/dc as a whole.

### Response:
1. [Complete]
2. [Incomplete] Reasons: The answer fails to mention the broader scope of copyright protection, which includes creative works beyond just music and lyrics.
3. [Complete]
4. [Complete]
5. [Complete]
6. [Complete]
"""
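To use the sentence-level predictions downstream, the response string has to be turned back into structured feedback. The following is a minimal sketch of such a parser, assuming the model follows the `N. [Complete]` / `N. [Incomplete] Reasons: ...` pattern of the example above. The names `SentenceFeedback` and `parse_feedback` are hypothetical and not taken from the released code.

```python
import re
from typing import List, NamedTuple, Optional


class SentenceFeedback(NamedTuple):
    index: int             # 1-based sentence number in the answer
    label: str             # "Complete" or "Incomplete"
    reason: Optional[str]  # justification, if the model gave one


# One item looks like "2. [Incomplete] Reasons: <free text>".
_ITEM = re.compile(
    r"(\d+)\.\s*\[(Complete|Incomplete)\]"                # number and tag
    r"(?:\s*Reasons:\s*(.*?))?"                           # optional reason
    r"(?=\s*\d+\.\s*\[(?:Complete|Incomplete)\]|\s*$)",   # next item or end
    re.DOTALL,
)


def parse_feedback(response: str) -> List[SentenceFeedback]:
    """Parse the feedback model's response into per-sentence records."""
    return [
        SentenceFeedback(int(num), label, reason.strip() if reason else None)
        for num, label, reason in _ITEM.findall(response)
    ]


example = (
    "1. [Complete] 2. [Incomplete] Reasons: The answer fails to mention "
    "the broader scope of copyright protection. 3. [Complete]"
)
feedback = parse_feedback(example)
incomplete = [f for f in feedback if f.label == "Incomplete"]
print([f.index for f in incomplete])
```

The reasons collected from the `Incomplete` items are the kind of signal that a refinement prompt can consume.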

B.3 Refinement Model

As detailed in Section 4.2, the refinement model uses coarse-grained feedback (Improve and Generic) and fine-grained feedback from the learned error detection model to refine input answers. We list the prompts used for Improve, Generic, and incorporating fine-grained feedback in Table 8, Table 9, and Table 10, respectively.

f"""
Answer the following question: "{question}"
Your answer is: "{answer}".
Please improve your answer.
Your improved answer:

"""

f"""
Answer the following question: "{question}"
Your answer is: "{answer}".
The answer is not complete.
Please improve your answer.
Your improved answer:

"""

f"""
Answer the following question: "{question}"
Your answer is: "{answer}".
The answer is not complete because: "{reason}".
Please improve your answer.
Your improved answer:

"""

# reasons are given as:
# 1. Reason 1
# 2. Reason 2
# …
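The three refinement prompts differ only in the feedback sentence, so they can be assembled by one helper. The sketch below is an illustration under assumptions: the function name `build_refinement_prompt`, the `mode` values, and the exact placement of line breaks are hypothetical, while the wording of each sentence follows the prompts in this section.

```python
from typing import Optional, Sequence


def build_refinement_prompt(
    question: str,
    answer: str,
    mode: str = "fine-grained",
    reasons: Optional[Sequence[str]] = None,
) -> str:
    """Assemble an Improve, Generic, or fine-grained refinement prompt."""
    head = (
        f'Answer the following question: "{question}"\n'
        f'Your answer is: "{answer}".\n'
    )
    tail = "Please improve your answer.\nYour improved answer:\n"

    if mode == "improve":
        feedback = ""  # no feedback at all
    elif mode == "generic":
        feedback = "The answer is not complete.\n"
    elif mode == "fine-grained":
        if not reasons:
            raise ValueError("fine-grained mode needs at least one reason")
        # Reasons are passed as a numbered list: "1. ...", "2. ...", and so on.
        numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(reasons, start=1))
        feedback = f'The answer is not complete because: "{numbered}".\n'
    else:
        raise ValueError(f"unknown mode: {mode}")

    return head + feedback + tail


prompt = build_refinement_prompt(
    question="What is the difference between copyright and trademark?",
    answer="A copyright protects content.",
    reasons=["The answer does not mention trademarks.", "No example is given."],
)
print(prompt)
```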

Appendix C Mitigating Hallucinations with Preference Optimization

While language models acquire large amounts of world knowledge and strong reasoning skills from unsupervised training over massive web corpora, aligning them with human expectations is often hard. Model alignment techniques like DPO allow us to use preference data directly to optimize the language model by recasting the RL-based objective used in existing RLHF methods as an objective that can be optimized with a simple binary cross-entropy loss. This greatly simplifies the process of refining LLMs. The following paragraphs detail how we use DPO to reduce LLM hallucinations.

Implementation details.

We model data from HQ2A as a preference dataset where every question has a chosen and a rejected response selected by expert annotators based on the given evaluation criteria. Using this dataset, we fine-tune the LLaMA2-7B-chat Touvron et al. (2023b) and Mistral-7B-instruct-v0.1 Jiang et al. (2023a) models with the DPO algorithm. We use batch_size = 16, warmup_ratio = 0.1, learning_rate = 2e-5, num_epochs = 5, beta = 0.1, and max_length = 1024 for training the models.

Due to compute limitations, we train the LLaMA2-13B-chat model on our preference dataset using LoRA Hu et al. (2022). We use the following training parameters: r = 256, alpha = 128, lora_dropout = 0.05, learning_rate = 5e-5, beta = 0.1, and max_length = 1024, and train the model for 5 epochs.
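To make the binary cross-entropy form of the objective concrete, the sketch below evaluates the standard DPO loss for a single preference pair, using beta = 0.1 as in our setup. It follows the usual DPO formulation rather than our training code: the inputs are summed sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model, and the function name `dpo_loss` and the example numbers are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(margin)."""
    # Log-ratios of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) = softplus(-m), computed in a numerically stable way.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy raises the chosen response and lowers the rejected one: loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A smaller beta weakens the pull away from the reference model, since the same log-ratio gap then produces a smaller margin.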

Datasets & Evaluation Metrics.

We experiment with three datasets: HQ2A, ASQA Stelmakh et al. (2022), and ELI5 Fan et al. (2019). The HQ2A dataset consists of 698 high-quality long-form question-answer pairs split into train (80%), dev (10%), and test (10%) sets. The ASQA dataset consists of 6K ambiguous factoid questions with long-form answers synthesized from multiple sources to resolve the ambiguities. ELI5 consists of 270K long-form answers covering general topics from the subreddits "explainlikeimfive", "askscience", and "AskHistorians" on the Reddit platform.
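For reference, the 80/10/10 split of 698 pairs can be worked out as below. The rounding rule is an assumption, since the exact procedure is not stated here: the dev and test sizes are rounded to the nearest integer and the remainder is assigned to train. The helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_total: int, dev_frac: float = 0.10, test_frac: float = 0.10):
    """Return (train, dev, test) sizes; train takes whatever remains."""
    n_dev = round(n_total * dev_frac)
    n_test = round(n_total * test_frac)
    n_train = n_total - n_dev - n_test
    return n_train, n_dev, n_test


train, dev, test = split_sizes(698)
print(train, dev, test)  # 558 70 70
```

Under this assumption the test set has 70 questions, which matches the HQ2A sample count reported with the evaluation results.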

We report the quality of the generated long-form answers using TigerScore Jiang et al. (2023b), a trained reference-free evaluation metric to pinpoint mistakes in the LLM-generated text. TigerScore detects hallucinations in the input text and assigns a hallucination score based on the severity of the error detected. Specifically, we use the LLaMA-7B trained version of TigerScore, which highly correlates with humans for error detection in LFQA tasks Jiang et al. (2023b). We also measure the factual correctness of the generated answers using sample-based consistency metrics Manakul et al. (2023). Following their approach, we zero-shot prompt a LLaMA-13B-chat model to check if the $i^{th}$ sentence in the original answer is supported by the sampled answer $S^{n}$ and return a score $x_{i}^{n}$ using the mapping: {"Yes: 1.0", "No: 0.0", "N/A: 0.5"}. The final consistency score is then calculated as:

$$S_{\text{Prompt}}(i) = \frac{1}{N} \sum_{n=1}^{N} x_{i}^{n}$$
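The score for a single sentence is a plain average over the sampled answers. The sketch below computes it from a list of judgments, using the mapping exactly as stated above. The names `ANSWER_TO_SCORE` and `sentence_consistency` are hypothetical, and the judgments are assumed to arrive already normalized to "Yes", "No", or "N/A".

```python
from typing import Sequence

# Mapping from the prompted model's verdict to a numeric score x_i^n.
ANSWER_TO_SCORE = {"Yes": 1.0, "No": 0.0, "N/A": 0.5}


def sentence_consistency(judgments: Sequence[str]) -> float:
    """S_Prompt(i): mean score of sentence i over the N sampled answers."""
    if not judgments:
        raise ValueError("at least one sampled answer is required")
    scores = [ANSWER_TO_SCORE[j] for j in judgments]
    return sum(scores) / len(scores)


# Sentence i checked against N = 4 sampled answers.
print(sentence_consistency(["Yes", "No", "N/A", "Yes"]))  # 0.625
```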

Appendix D Training, Infrastructure and Runtime

We use a server with 8 NVIDIA A100 Tensor Core GPUs, each with 80GB VRAM, to run all our experiments. Each experiment required, at most, two A100 GPUs. Fine-tuning the LLaMA2-13B feedback model took 6 hours on 2 A100 GPUs using our HQ2A dataset. LoRA fine-tuning of the LLaMA2-13B-chat refinement model took 2 hours on a single A100 GPU using the preference data from HQ2A. Refining answers with our Error-Informed Refinement approach took 0.5, 3, and 23 hours for the HQ2A, ASQA, and ELI5 datasets, respectively, on a single A100 GPU. The evaluation of the refined answers with TigerScore (LLaMA-7B) utilized the vLLM inference library Kwon et al. (2023) and took approximately 1, 15, and 30 minutes for the HQ2A, ASQA, and ELI5 datasets, respectively, on a single A100 GPU.

Appendix E Additional Results

E.1 Aligning LLMs

Table 11 shows the results for training language models with DPO using our collected preference annotations. Our preference-tuned models outperform the strong baseline models and reduce hallucinated generations in all the evaluation settings except the LLaMA model on the ASQA dataset. We hypothesize that this is due to the ambiguous nature of questions in the ASQA dataset that can have multiple correct answers.

We also observe that the models become more robust and generate more consistent responses after preference-tuning. The only exception is the Mistral model on our held-out test set, which has lower response consistency. We believe this is likely due to the conservative nature of DPO-trained models wherein, during sampling, it can refrain from answering a question in some cases and not in others, leading to a lower consistency score.

| Dataset (# samples) | Instruct Model | TigerScore: % Hallucinated samples (↓) | TigerScore: Hallucination score (↓) | SelfCheck Consistency (↓) |
|---|---|---|---|---|
| HQ2A (70) | LLaMA2-7B | 18.57 ± 0.00 | **0.60** ± 0.00 | 0.166 ± 0.014 |
| | LLaMA2-7B + DPO | **15.71** ± 0.00 | 0.66 ± 0.00 | **0.162** ± 0.015 |
| | Mistral-7B | 20.00 ± 0.00 | 0.57 ± 0.00 | **0.266** ± 0.011 |
| | Mistral-7B + DPO | **17.14** ± 0.00 | **0.54** ± 0.00 | 0.285 ± 0.011 |
| ASQA (948) | LLaMA2-7B | **26.58** ± 1.49 | **0.86** ± 0.06 | 0.187 ± 0.014 |
| | LLaMA2-7B + DPO | 28.41 ± 1.06 | 0.89 ± 0.02 | **0.178** ± 0.006 |
| | Mistral-7B | 62.09 ± 0.35 | 2.08 ± 0.01 | 0.578 ± 0.003 |
| | Mistral-7B + DPO | **60.80** ± 0.56 | **2.03** ± 0.01 | **0.555** ± 0.008 |
| ELI5_general (1000) | LLaMA2-7B | 9.93 ± 1.05 | 0.32 ± 0.04 | 0.133 ± 0.001 |
| | LLaMA2-7B + DPO | **9.33** ± 0.66 | **0.29** ± 0.03 | **0.130** ± 0.004 |
| | Mistral-7B | 29.97 ± 0.97 | 0.90 ± 0.04 | 0.327 ± 0.003 |
| | Mistral-7B + DPO | **22.77** ± 1.03 | **0.72** ± 0.03 | **0.319** ± 0.011 |
| ELI5_science (1000) | LLaMA2-7B | **9.47** ± 0.47 | 0.31 ± 0.02 | **0.137** ± 0.003 |
| | LLaMA2-7B + DPO | **9.47** ± 0.76 | **0.30** ± 0.00 | 0.139 ± 0.004 |
| | Mistral-7B | 34.10 ± 0.94 | 1.07 ± 0.02 | 0.320 ± 0.004 |
| | Mistral-7B + DPO | **29.03** ± 1.51 | **0.95** ± 0.04 | **0.297** ± 0.010 |
| ELI5_history (1000) | LLaMA2-7B | 9.63 ± 0.59 | 0.30 ± 0.02 | **0.188** ± 0.005 |
| | LLaMA2-7B + DPO | **7.60** ± 0.08 | **0.22** ± 0.01 | 0.189 ± 0.005 |
| | Mistral-7B | 26.23 ± 0.38 | 0.79 ± 0.02 | 0.363 ± 0.016 |
| | Mistral-7B + DPO | **22.17** ± 1.31 | **0.69** ± 0.04 | **0.345** ± 0.013 |

Table 11: Results of aligning LLMs with DPO using our collected answer preference data. We measure the hallucinations using TigerScore and the consistency of model outputs using SelfCheckGPT.

E.2 Human Evaluation

This section presents additional details of our human evaluation of the answers refined with our Error-informed feedback approach. In Table 12, we present the agreement of our annotators on two evaluation metrics: comprehensiveness and overall answer preference. The annotators strongly agree that the refined answers are comprehensive, i.e., the answer contains all the required information as asked by the question. For the overall answer preference compared to the baseline, we observe weak agreement between annotators, primarily due to the low agreement value on the ASQA dataset. We hypothesize that the annotators struggle to align on ASQA due to the ambiguous nature of the questions in this dataset, which may have multiple correct answers, and choosing between two answers is difficult.

| Dataset | Comprehensiveness (↑) | Preference (↑) |
|---|---|---|
| HQ2A | 0.70 | 0.31 |
| ASQA | 0.86 | 0.02 |
| ELI5 | 0.92 | 0.61 |
| Average | 0.83 | 0.31 |

Table 12: Agreement of annotators on the comprehensiveness and preference of refined answers over the baseline answers from three datasets.
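The Average row is consistent with an unweighted mean over the three datasets, which the short check below reproduces. Treating the average as unweighted is an assumption inferred from the reported numbers, and the variable names are illustrative.

```python
# Per-dataset agreement: (comprehensiveness, preference), copied from Table 12.
agreement = {
    "HQ2A": (0.70, 0.31),
    "ASQA": (0.86, 0.02),
    "ELI5": (0.92, 0.61),
}

n = len(agreement)
avg_comprehensiveness = sum(c for c, _ in agreement.values()) / n
avg_preference = sum(p for _, p in agreement.values()) / n

print(round(avg_comprehensiveness, 2), round(avg_preference, 2))  # 0.83 0.31
```

The preference average is pulled down mainly by ASQA, whose agreement of 0.02 is far below the other two datasets.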

Appendix F Annotation Guidelines

We have previously described our data collection setup in Section 3.3. This section provides additional details on the annotation interface, detailed task instructions, and annotation procedure.

F.1 Annotation Interface

In Figure 6, we show the interface for collecting expert error annotations on LFQA answers. For every question, experts see a human-written and model-generated answer (randomized order). Our expert annotators must select the evaluation layer (top right) and highlight the error span in the question or answer, giving justifications with web references, wherever applicable. After annotating for all the evaluation criteria, experts judge the better answer and mark it in the left pane, giving reasons for their preference.

Figure 6: Screenshot of annotation interface for collecting expert error annotations on LFQA answers.

F.2 Task Instructions

We provide experts with detailed task instructions for evaluating answers according to the defined evaluation criteria. We go through every evaluation aspect in depth, defining it and giving annotation examples for clarification, as detailed in the next paragraphs.

1) Question Misconception.

You should select a span of text in the question that contains a misconception or false assumption. The question is repeated twice. You only need to select the span in one repetition. If you select such spans, we would like you to indicate in your reason (obligatorily):

  • whether the answers reject or correct the misconception/false assumption,

  • if no answer rejects/corrects it, please explain in your reason why that is a misconception/false assumption (preferably with references).

Example:

Question: Why is it so important for humans to have a balanced nutrition but not for animals? Most animals have a fairly simple diet, carnivores eat only meat their whole life, cows eat exclusively grass etc. So why are human bodies so picky and need a balance of protein, fat, carbs etc from different sources to perform well?

2) Factuality.

You should select a span of text in the answers that is factually incorrect. If you select such spans, we would like you to (obligatorily):

  • preferably give references (e.g., credible websites, academic papers, or books) that show the content is factually wrong, or

  • give examples that show the content is factually wrong.

Example:

Question: Why is it so important for humans to have a balanced nutrition but not for animals? Most animals have a fairly simple diet, carnivores eat only meat their whole life, cows eat exclusively grass etc. So why are human bodies so picky and need a balance of protein, fat, carbs etc from different sources to perform well?

Answer: Animals generally have a simpler diet than humans. For example, carnivores only eat meat, while cows only eat grass

Reason: This is a reductionist view of animal nutrition as it doesn’t consider how animals have evolved and the complexities of the food chain. For example, lions are carnivores that only eat meat but they eat the stomach of zebras that contain grass/plants and are able to digest it.

3) Relevance.

You should select a span of text in the answers that is irrelevant to answering the question. Removing such content should not affect the overall quality of an answer. If you select such spans, we would like you to (obligatorily):

  • explain why the selected text is not relevant to answering the question.

Example:

Question: What is happening when you get migraines that cause you to lose part of your vision for a short time?

Answer: My wife gets these. An ocular migraine is a condition where the blood vessels in the optic nerve at the back of your eye tighten and swell, resulting in various visual distortions. While classic migraines usually result in intense headaches, sensitivity to light and loud sounds, or nausea, ocular migraines are not necessarily painful.

Reason: Answer contains irrelevant information (writer’s wife having them, migraine may not be painful). The person’s wife’s personal health condition doesn’t provide useful information to the question, and the question doesn’t ask about whether ocular migraines are painful or not.

4) Completeness.

You should: (a) select a span of text in the answer that does not offer enough details, or (b) select the label Answer 1 or Answer 2 if some relevant information that should be included in the answer is missing. If you select such spans, we would like you to (obligatorily):

  • offer the details or relevant information that you think should be included. References from credible sources are encouraged.

Example:

Question: Why does alcohol make your throat or stomach feel warm when drinking?

Answer: There are a few reasons why alcohol might make your throat or stomach feel warm. first, alcohol is a solvent, which means it can dissolve other substances. when you drink alcohol, it can break down the barrier between your stomach and your throat, causing a burning sensation. second, alcohol can irritate the lining of your stomach, causing inflammation. this can also lead to a burning feeling. finally, when you drink alcohol, your blood vessels dilate, or widen. this increased blood flow can cause a feeling of warmth.

Reason: The answer should give more details on what the barrier is. [Offer relevant information based on your knowledge, for example, what the barrier is.]

5) References/example/analogy.

You should select a span of text in the answer that is an example/analogy or a reference to an external credible source (e.g., websites, papers, or books). If you select such spans, we would like you to (obligatorily):

  • indicate whether the example/analogy/credible source is helpful, and

  • give a reason on why it is helpful/not helpful.

Example:

Question: How does a speaker vibrate at multiple frequencies simultaneously to deliver sounds to our ears? I’ll try and explain what I mean.. say a song is playing - how do the vocals, drums, guitars etc which all have different frequencies (aka a plate is moving back & forth at a rate to deliver these sound waves) play at the same time? Surely the plate can’t vibrate to deliver two or more frequencies at once?

Answer: Imagine an ocean with a consistent wave. It flows up and down, with equal distance between the two waves at any time. Now imagine I push a larger, shorter wave into this ocean. The two waves will collide, resulting in some new wave pattern. This new wave pattern is a combination of those two waves. Speakers work similarly. If I combine two sound waves, I get a new combination wave that sounds different.

Reason: I like the analogy with the ocean waves, and due to how visual the explanation is it is easier to understand in my opinion.

Answer Preference.

Based on the five criteria you have gone through, please weigh all good and bad points of each answer and choose the one that is better overall. Writing a motivation to support your choice is obligatory.

  1. When writing your motivation, please refer to the five layers of evaluation.

  2. If there are other aspects you used to make your decision but are not in the evaluation layers, please mention them in the reason.

  3. If you have quotations from the answers, please indicate which answer the quotations are from.

  4. Here are some aspects for you to consider (not obligatorily):

    • Nice example/analogy, to the point, generic, concise, informative, useful, well structured, easy to follow …

Overall Requirement.

The overall task requirements are summarized below. Please read them carefully to avoid redoing the task.

  1. You have to highlight spans in both question answers for these aspects and give reason why you highlight a span for an aspect.

  2. Mark as many spans as necessary.

  3. Please be objective in your reasons and avoid using phrases like “I believe” or “I think”.

  4. Your reasons should be informative and succinct.

  5. Please use declarative sentences and avoid using questions in your reasons.

  6. Products like ChatGPT or BARD are absolutely not allowed.

F.3 Annotation Procedure

The expert annotators spend around 15-20 minutes per question, highlighting the demanding nature of this task. We accordingly pay £10/hour and provide a bonus of £10 for good-quality annotations, resulting in a total cost of £3000 to collect expert judgments for 698 questions. The annotators understand that we will use their annotated data for research purposes. We show a screenshot of an expert annotated answer in Figure 7.

Figure 7: Screenshot of an expert annotated answer on the INCEpTION platform.
Table 6: Zero-shot prompt for GPT-4 to generate long-form answers to questions asked on the ELI5 subreddit on the Reddit platform.
Table 7: An example prompt used for training LLaMA2-13B model for error feedback.
Table 8: Zero-shot prompt for LLaMA2-13B-chat model to refine long-form answers without feedback from the error detection model (Improve).
Table 9: Zero-shot prompt for LLaMA2-13B-chat model to refine long-form answers with generic feedback (Generic).
Table 10: Zero-shot prompt for LLaMA2-13B-chat model to refine long-form answers with error feedback from the error detection model.