Navigating the Noisy Crowd: Finding Key Information for Claim Verification

Haisong Gong1,2, Huanhuan Ma1,2, Qiang Liu1,2, Shu Wu1,2, Liang Wang1,2
Abstract

Claim verification is a task that involves assessing the truthfulness of a given claim based on multiple evidence pieces. Using large language models (LLMs) for claim verification is a promising approach. However, simply feeding all the evidence pieces to an LLM and asking if the claim is factual does not yield good results. The challenge lies in the noisy nature of both the evidence and the claim: evidence passages typically contain irrelevant information, with the key facts hidden within the context, while claims often convey multiple aspects simultaneously. To navigate this “noisy crowd” of information, we propose EACon (Evidence Abstraction and Claim Deconstruction), a framework designed to find key information within evidence and verify each aspect of a claim separately. EACon first finds keywords from the claim and employs fuzzy matching to select relevant keywords for each raw evidence piece. These keywords serve as a guide to extract and summarize critical information into abstracted evidence. Subsequently, EACon deconstructs the original claim into subclaims, which are then verified against both abstracted and raw evidence individually. We evaluate EACon using two open-source LLMs on two challenging datasets. Results demonstrate that EACon consistently and substantially improves LLMs’ performance in claim verification.

Figure 1: Architecture of EACon. The input is a claim and raw evidence, and the output is the predicted veracity of the claim. EACon extracts keywords from the claim and uses fuzzy matching to select keywords for each piece of the raw evidence. These selected keywords are then used to summarize the raw evidence into abstracted evidence. EACon then deconstructs the claim into subclaims, which are verified against both the raw and abstracted evidence using a zero-shot approach.

1 Introduction

The ease of creating and sharing information has led to a surge in misinformation within society, spanning from social media to prominent events like the U.S. Presidential debates, disrupting societal norms (Bakir and McStay 2018). Consequently, the automated verification of information accuracy has become paramount. One critical aspect of this is claim verification, which involves using models to evaluate the truthfulness of a given statement (claim) based on multiple evidence pieces (Guo, Schlichtkrull, and Vlachos 2022).

Claim verification can be viewed as a type of Natural Language Inference (NLI) task. Prior studies have delved into techniques such as fine-tuning pre-trained language models and utilizing graph neural networks to establish relationships between evidence in claim verification (Ma et al. 2023; Gong et al. 2024). With recent advancements in large language models (LLMs) (Zhao et al. 2023), leveraging these models for claim verification holds significant promise.

Despite the potential of LLMs, applying them directly to claim verification by simply feeding all the evidence pieces and asking if a claim is factual falls short in yielding satisfactory outcomes. Even advanced methods, such as leveraging in-context examples through few-shot learning or enhancing LLM reasoning via strategies like Chain of Thought (CoT) (Wei et al. 2022) or complex reasoning chains (Fu et al. 2022), do not consistently improve claim verification outcomes (Hu et al. 2023). This is because the task of claim verification necessitates not only reasoning abilities but also the capacity to handle the inherently “noisy” nature of evidence and claims, which both direct LLM applications and these prompt techniques struggle to address effectively.

In the case of “noisy” evidence, an evidence piece may be rife with irrelevant information, while the key information occupies only a small portion and is buried deep within the context. The model must therefore be able to sift through the noise and extract the pertinent information from the “noisy evidence crowd.” “Noisy” claims, on the other hand, are often expressed in a convoluted manner, encompassing multiple aspects simultaneously rather than presenting a concise, atomic statement. Such claims pose challenges for the direct application of LLMs, because LLMs tend to compare the overall semantic meaning of the evidence and the claim and overlook minor details. In claim verification, however, even a minor inaccuracy should render a claim false, irrespective of the overall semantic coherence.

To address this challenge, we propose the EACon (Evidence Abstraction and Claim Deconstruction) framework. EACon extracts and summarizes the key information from the raw evidence into abstracted evidence to aid LLM verification. It also deconstructs the claim into subclaims, allowing each aspect of the claim to be checked in detail, increasing the likelihood of identifying errors. In this framework, we design a keyword-based technique to extract keywords from the claim and use fuzzy matching to select relevant keywords as guidance to conduct evidence abstraction. This keyword-guided strategy mitigates the impact of conflicts between inaccurate claims and evidence content, while selecting relevant keywords by fuzzy matching aids in reducing the LLM’s tendency to generate content not conveyed by evidence, as illustrated in Figure 2. Furthermore, for complex scenarios, we provide the LLM with contextual information about the original claim during the subclaim verification stage, further improving the model’s performance.

In summary, our key contributions include:

  • We highlight the key challenge in claim verification as navigating the “noisy crowd” of claim and evidence information, which hampers the performance of LLMs in claim verification.

  • We propose the EACon framework, which extracts and summarizes the key information from raw evidence into abstracted evidence based on selected keywords and deconstructs the claim into subclaims for verification.

  • We demonstrate the effectiveness of EACon on the HOVER and FEVEROUS-S datasets, using two open-source LLMs (Vicuna-13B and Mixtral-8x7B). The results show that EACon can consistently and substantially improve LLMs’ performance in claim verification.

2 Related Work

Claim Verification

Traditional methods for claim verification can be categorized into two main approaches. The first approach employs pre-trained language models fine-tuned specifically for claim verification. These models either concatenate the evidence and claims into a single input (Aly et al. 2021; Thorne et al. 2018; Hu et al. 2022) or process each piece of evidence separately and then aggregate the results (Soleimani, Monz, and Worring 2020; Jiang, Pradeep, and Lin 2021; Gi, Fang, and Tsai 2021). The second approach utilizes graph neural networks to capture complex semantic interactions through evidence graphs (Gi, Fang, and Tsai 2021; Zhao et al. 2020; Liu et al. 2020; Zhong et al. 2020; Chen et al. 2022b; Gong et al. 2024). Recent studies have explored leveraging the reasoning abilities of LLMs in verification tasks. For example, ProgramFC (Pan et al. 2023b) employs LLMs to generate reasoning programs that guide the verification process, while EX-FEVER (Ma et al. 2023) elicits LLMs’ capability to generate textual explanations for claim verification results. Factscore (Min et al. 2023) proposes fine-grained atomic evaluation for long text inputs. Pan et al. (2023a); Chen et al. (2022a); Li et al. (2023); Rani et al. (2023) propose to generate a series of questions or queries for claim verification. However, none of these methods address the “noisy” problem of both evidence and claim information, which our method focuses on.

Large Language Model Reasoning

The reasoning capabilities of LLMs form the cornerstone for LLM-based verification tasks. In recent years, in-context learning, popularized by the few-shot prompting approach of Brown et al. (2020), has enabled models to generalize tasks from a few examples. The reasoning ability of LLMs has been further enhanced through various strategies, such as chain-of-thought prompting (CoT) (Wei et al. 2022; Wang et al. 2022; Kojima et al. 2022), which improves reasoning by generating intermediate steps in problem-solving. Fu et al. (2022) propose selecting complex reasoning examples as prompts to boost LLMs’ reasoning performance. However, these strategies do not consistently enhance performance in claim verification tasks (Hu et al. 2023). Our method does not focus on enhancing the reasoning abilities of LLMs to solve claim verification tasks. Instead, it aims to improve claim verification performance by reducing “noise” through evidence abstraction and claim deconstruction, thereby leveraging the LLMs’ strengths more effectively.

Figure 2: Illustration of different methods for prompting LLM to abstract evidence. From left to right: (I) Prompting LLM based on claim leads to incorrect output due to conflicting claim and evidence. (II) Prompting LLM with all keywords may result in generating content not supported by the evidence. (III) Our proposed method using selected keywords leads to correct output.

3 Method

In this section, we introduce the details of our proposed framework, EACon. Generally, EACon is composed of three major components: Evidence Abstraction, Claim Deconstruction, and Subclaim Verification. Both Evidence Abstraction and Claim Deconstruction are designed to address the “noisy crowd” problem for claim verification. After these two preparatory steps, the final Subclaim Verification component verifies each subclaim and produces the overall result. The architecture of EACon is shown in Figure 1.

3.1 Task Formulation

The objective of claim verification is to determine the veracity of a given claim based on multiple pieces of evidence. Typically, each piece of evidence is a sentence or paragraph drawn from sources like Wikipedia. Mathematically, given a claim $c$ and an evidence set containing $n$ pieces of evidence, $\mathcal{E}=\{e_1,e_2,\cdots,e_n\}$, the task is to find a model $\hat{p}=f(c,\mathcal{E})$ that outputs the predicted veracity $\hat{p}$, where $\hat{p}$ is either True or False.

3.2 Evidence Abstraction

Evidence Abstraction serves as the first part of EACon. It involves processing each raw piece of evidence to extract useful information and eliminate noisy information. Instead of naively prompting an LLM to perform extraction and summarization tasks, we designed a keyword-based method.

As shown in Part I of Figure 2, if an LLM is prompted to summarize the evidence against a claim that is inherently false, the conflict between the claim and the evidence can lead to subpar results. To alleviate this problem, EACon extracts keywords from the claim, which encapsulate the claim’s essence without introducing any biases from the claim itself. This approach helps in circumventing potential conflicts. However, as shown in Part II of Figure 2, if all the keywords are used to summarize a piece of evidence, the redundant keywords may lead the LLM to output content not conveyed by the evidence. To address this, we designed a Keyword Selection procedure to remove the irrelevant keywords, resulting in better summarization results. The effectiveness of the keyword-based design is empirically validated in Section 4.5. The following subsections describe the three steps of the Evidence Abstraction process: Keyword Extraction, Keyword Selection, and Evidence Summarization.

Keyword Extraction

Keyword Extraction is the initial step in the Evidence Abstraction process, aimed at identifying essential keywords from the claim. These keywords, including important nouns, verbs, and phrases, serve as a guide for extracting information from evidence in subsequent sections. We instruct an LLM to perform keyword extraction and provide examples to aid its understanding of the task and output formatting. The process can be formally described as:

$$\{k_1,k_2,\cdots,k_m\}=\arg\max p(\mathcal{K}\mid T_{KE},c;\theta_{LLM}) \tag{1}$$

where $k_i$ is the $i$th keyword selected by the LLM from the potential keywords set $\mathcal{K}$, and $m$ keywords are selected in total for the claim $c$. $\theta_{LLM}$ represents the LLM model, and $T_{KE}$ is the prompt template used for Keyword Extraction:

Task Description: Extract key components such as important verbs, nouns, and phrases from the provided sentence. Focus on identifying and highlighting the most relevant elements.
Instructions: Carefully read the input sentence. Identify and list the significant verbs, nouns, and pertinent phrases. Ensure the output succinctly encapsulates the essence of the input by focusing on these key components.
Examples:
Input: Spam is canned cooked meat by Hormel Foods Corporation is never used to make a popular snack and lunch food in Hawaii.
Output: spam, canned cooked meat, Hormel Foods Corporation, used, popular snack, lunch food, Hawaii.
[More Examples]
Given the following input and keywords, provide a concise and factual summary based on the examples above. Exclude any information not directly related to the keywords.
Input: [Claim] ($c$)
Output:
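The template above asks the LLM for a comma-separated keyword list, so the response has to be split back into individual keywords before Keyword Selection. The following is a minimal sketch of that step, not the authors' implementation: the function names `parse_keywords` and `extract_keywords`, the `llm` callable, and the choice to drop duplicates and trailing periods are all assumptions made for illustration.

```python
from typing import Callable, List


def parse_keywords(llm_output: str) -> List[str]:
    """Split a comma-separated keyword line into a clean list (hypothetical helper)."""
    # Assumption: only the first line of the response holds the answer.
    lines = llm_output.strip().splitlines()
    first_line = lines[0] if lines else ""
    keywords: List[str] = []
    seen = set()
    for part in first_line.split(","):
        keyword = part.strip().rstrip(".").strip()
        # Skip empty fragments and case-insensitive duplicates, preserving order.
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            keywords.append(keyword)
    return keywords


def extract_keywords(claim: str, template: str, llm: Callable[[str], str]) -> List[str]:
    """Fill the [Claim] slot of a prompt template and parse the LLM response.

    `llm` is any function mapping a prompt string to a response string
    (an assumption; the paper does not specify an interface).
    """
    prompt = template.replace("[Claim]", claim)
    return parse_keywords(llm(prompt))


example_output = (
    "spam, canned cooked meat, Hormel Foods Corporation, used, "
    "popular snack, lunch food, Hawaii."
)
example_keywords = parse_keywords(example_output)
```

Stripping the trailing period would also truncate abbreviations such as "Corp."; a production parser would need a more careful rule.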

Keyword Selection

We have obtained $m$ keywords from the claim $c$. Given the $i$th piece of evidence $e_i$, the second step is to identify which of these $m$ keywords are related to $e_i$. Because using all the keywords to guide evidence summarization is undesirable, as described in the previous section and Figure 2, we employ fuzzy matching as a low-cost and efficient way to perform this selection.

Fuzzy matching can be used to evaluate the similarity between a keyword and a piece of evidence. Specifically, we use two functions from the fuzzywuzzy package (https://github.com/seatgeek/thefuzz), namely partial_ratio and token_set_ratio. The partial_ratio function computes the similarity (edit distance) of the best matching substring of the evidence to the input keyword, while the token_set_ratio function determines the similarity score of the intersection of unique tokens between the input keyword and the evidence piece. These functions are applied to compare each keyword $k_j$, $j\in\{1,2,\cdots,m\}$, with the $i$th piece of evidence $e_i$. To prevent the omission of potentially relevant keywords, a keyword is selected if either of its similarity scores exceeds a preset threshold. Mathematically:

$$\mathcal{S}_i=\{k_j \mid \texttt{partial\_ratio}(k_j,e_i)>t_1 \text{ or } \texttt{token\_set\_ratio}(k_j,e_i)>t_2,\ 1\leq j\leq m\} \tag{2}$$

where $\mathcal{S}_i$ denotes the set of selected keywords for the $i$th piece of evidence $e_i$, and $t_1$ and $t_2$ are the preset thresholds for the two fuzzy matching functions.
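The selection rule in Eq. (2) amounts to a filter over the claim keywords with two scoring functions and two thresholds. The sketch below illustrates that rule under stated assumptions: `select_keywords` is a hypothetical name, and `toy_partial` and `toy_token_set` are deliberately simplified stand-ins (exact substring match and token overlap), not the fuzzywuzzy implementations. The default thresholds of 60 follow the values reported in Section 4.3.

```python
import re
from typing import Callable, List, Sequence

# A scorer maps (keyword, evidence) to a similarity in [0, 100].
Scorer = Callable[[str, str], float]


def select_keywords(
    keywords: Sequence[str],
    evidence: str,
    partial_scorer: Scorer,
    token_set_scorer: Scorer,
    t1: float = 60.0,
    t2: float = 60.0,
) -> List[str]:
    """Keep a keyword if either similarity score exceeds its threshold (Eq. 2)."""
    return [
        k
        for k in keywords
        if partial_scorer(k, evidence) > t1 or token_set_scorer(k, evidence) > t2
    ]


# Simplified stand-ins for demonstration only (assumptions, not fuzzywuzzy).
def toy_partial(keyword: str, evidence: str) -> float:
    """100 if the keyword occurs verbatim (case-insensitive) in the evidence, else 0."""
    return 100.0 if keyword.lower() in evidence.lower() else 0.0


def toy_token_set(keyword: str, evidence: str) -> float:
    """Percentage of the keyword's unique tokens that also occur in the evidence."""
    kw_tokens = set(re.findall(r"\w+", keyword.lower()))
    ev_tokens = set(re.findall(r"\w+", evidence.lower()))
    if not kw_tokens:
        return 0.0
    return 100.0 * len(kw_tokens & ev_tokens) / len(kw_tokens)


evidence = "Spam musubi is a popular snack and lunch food in Hawaii."
keywords = ["spam", "Hormel Foods Corporation", "popular snack", "Hawaii"]
selected = select_keywords(keywords, evidence, toy_partial, toy_token_set)
```

With the package installed, `fuzz.partial_ratio` and `fuzz.token_set_ratio` from `thefuzz` can be passed as the two scorers in place of the toy functions; their scores on this example may differ from the stand-ins.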

Evidence Summarization

After obtaining the selected keywords set $\mathcal{S}_i$ corresponding to the $i$th piece of evidence $e_i$, our goal is to extract the information centered around these keywords within the evidence and discard the irrelevant. This extracted information is intended to be the most useful for verifying the claim. Essentially, extracting information centered on keywords is akin to uncovering the relationships between these keywords. Since meaningful relationships generally exist between multiple keywords, we focus our summaries on evidence containing at least two relevant keywords. Evidence with $|\mathcal{S}_i|<2$ will not be summarized, as we deem it unlikely to provide sufficient useful information. As before, we prompt the LLM to serve as the extractor and summarizer. Additionally, we equip the prompt with some examples to help the LLM better understand the compositional task and format its output. Formally, this process can be described as:

$$a_i=\arg\max p(a_i\mid T_{ES},\mathcal{S}_i,e_i;\theta_{LLM}),\quad |\mathcal{S}_i|\geq 2 \tag{3}$$

where $a_i$ is the abstracted evidence from the raw evidence $e_i$, and $\mathcal{S}_i$ is the selected keywords set. $T_{ES}$ is the prompt template used for Evidence Summarization:

Task Description: Extract and summarize key information from sentences based on specified keywords. The output should be concise, directly related to the keywords, and devoid of extraneous details.
Instructions: Carefully read the provided input sentence. Use the specified keywords to guide your extraction of information. Generate a summary that includes only the facts directly associated with the keywords.
Examples:
Input: Spam msubi is a popular snack and lunch food in Hawaii composed of a slice of grilled Spam on top of a block of rice, wrapped together with nori in the traditional of Japanese ‘omusubi’.
Keywords: spam, popular snack, lunch food, Hawaii.
Output: Spam is popular snack and lunch food in Hawaii.
[More Examples]
Based on the following input, identify and list the key components as demonstrated in the examples.
Input: [Raw Evidence] ($e_i$)
Keywords: [Selected Keywords] ($\mathcal{S}_i$)
Output:

We perform the Keyword Selection and Evidence Summarization procedures for each piece of evidence, resulting in a set of abstracted evidence denoted as $\mathcal{A}=\{a_1,\cdots,a_{\sum_i [|\mathcal{S}_i|\geq 2]}\}$, where $[\cdot]$ is the indicator function, so $\mathcal{A}$ contains one entry for each piece of evidence with at least two selected keywords. This set $\mathcal{A}$ is then combined with the raw evidence set $\mathcal{E}$ for subsequent verification tasks. The integration is essential because $\mathcal{A}$ encapsulates key elements crucial for directly validating the truthfulness of the claim, but may not contain all the information. By supplementing the raw evidence set, $\mathcal{A}$ can be considered an enhancement of the original evidence, providing a more concise and focused representation. Furthermore, it offers a shortcut for the LLM, reducing the complexity of subsequent inference processes.
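The gating rule (summarize only evidence with at least two selected keywords) and the combination of abstracted and raw evidence can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names, the `summarize` callable standing in for the LLM call with the summarization template, and the choice to place abstracted evidence before raw evidence are assumptions.

```python
from typing import Callable, List, Sequence


def abstract_evidence(
    evidence: Sequence[str],
    selected: Sequence[Sequence[str]],
    summarize: Callable[[str, Sequence[str]], str],
    min_keywords: int = 2,
) -> List[str]:
    """Summarize each evidence piece whose selected keyword set has >= min_keywords items.

    `selected[i]` holds the keywords chosen for `evidence[i]`; `summarize`
    is a hypothetical stand-in for the LLM summarization call.
    """
    if len(evidence) != len(selected):
        raise ValueError("evidence and selected must have the same length")
    abstracted = []
    for e_i, s_i in zip(evidence, selected):
        # Evidence with fewer than two relevant keywords is skipped.
        if len(s_i) >= min_keywords:
            abstracted.append(summarize(e_i, s_i))
    return abstracted


def combine_evidence(abstracted: Sequence[str], evidence: Sequence[str]) -> List[str]:
    """Form the verification input from abstracted plus raw evidence.

    Ordering (abstracted first) is an assumption made for this sketch.
    """
    return list(abstracted) + list(evidence)


# Toy run with a fake summarizer that just echoes the keywords.
raw = ["e1", "e2", "e3"]
picked = [["spam", "Hawaii"], ["spam"], []]
fake_summarize = lambda e, s: "SUM(" + e + "): " + ", ".join(s)
abstracted = abstract_evidence(raw, picked, fake_summarize)
combined = combine_evidence(abstracted, raw)
```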

3.3 Claim Deconstruction

Claim Deconstruction is the second component of EACon. It takes the original claim as input and generates several subclaims that focus on different aspects. Simply asking an LLM to judge the claim’s truthfulness is insufficient, as LLMs tend to compare the overall semantic meaning rather than scrutinize minor details. However, for claim verification, even the slightest error should result in the claim being judged as false, even if the semantic meaning remains largely unchanged. Relying solely on LLM judgments can only address obvious errors, failing to meet the objectives of comprehensive claim verification. By deconstructing the claim into subclaims, we can leverage LLMs to individually verify different aspects and details, thereby increasing the likelihood of identifying errors. As before, we prompt an LLM to deconstruct the claim into subclaims:

$$\{u_1,u_2,\cdots,u_r\}=\arg\max p(\mathcal{U}\mid T_{CD},c;\theta_{LLM}) \tag{4}$$

where $u_i$ denotes the $i$th subclaim from the potential subclaim set $\mathcal{U}$. $T_{CD}$ is the prompt template used for Claim Deconstruction:

Task Description: Dissect a given claim into multiple atomic statements. These statements should be complete in meaning, devoid of uncertain pronouns, and retain all original details. Each atomic statement should stand alone and be independently verifiable.
Examples:
Claim: Spam is canned cooked meat by Hormel Foods Corporation is never used to make a popular snack and lunch food in Hawaii.
Output: \n #1 Spam is a canned cooked meat product manufactured by Hormel Foods Corporation. \n #2 Spam is not used to make a popular snack and lunch food in Hawaii.
[More Examples]
Here is the claim given to you. Your answer should follow the format of above demonstrations. Each atomic statement should stand alone and be independently verifiable with as least pronouns as possible. Give your answer only, no explanation.
Claim: [Claim] ($c$)
Output:
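Because the template requests subclaims in a numbered "#1 ... #2 ..." format, one line each, the response can be parsed with a small regular expression. The helper below is a hypothetical sketch of such a parser; the name `parse_subclaims` and the fallback to the original claim when no numbered line is found are assumptions, not part of the described method.

```python
import re
from typing import List, Optional

# Matches lines such as " #2 Spam is not used ..." and captures the statement.
_SUBCLAIM_LINE = re.compile(r"\s*#\d+\s*(.*\S)")


def parse_subclaims(llm_output: str, fallback: Optional[str] = None) -> List[str]:
    """Extract numbered subclaims from an LLM response, one per line.

    If no numbered line is found and `fallback` is given (for example the
    original claim), return it as the single subclaim. This fallback is an
    assumption made for the sketch.
    """
    subclaims = []
    for line in llm_output.splitlines():
        match = _SUBCLAIM_LINE.match(line)
        if match:
            subclaims.append(match.group(1))
    if not subclaims and fallback:
        return [fallback]
    return subclaims


response = (
    "\n #1 Spam is a canned cooked meat product manufactured by Hormel Foods Corporation. "
    "\n #2 Spam is not used to make a popular snack and lunch food in Hawaii."
)
subclaims = parse_subclaims(response)
```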

3.4 Subclaim Verification

The last step of EACon is Subclaim Verification. After obtaining the set of subclaims $\{u_1,u_2,\cdots,u_r\}$, the truthfulness of the original claim can be verified by checking each subclaim individually. If any subclaim is false, the original claim is deemed false. The original claim is considered true only if all subclaims are correct. Mathematically, this can be represented as:

$$\hat{p}=\begin{cases}\text{False} & \text{if } \exists i,\ f(u_i,\mathcal{A}\cup\mathcal{E})=\text{False}\\ \text{True} & \text{otherwise}\end{cases} \tag{5}$$

where $\hat{p}$ is the veracity prediction of the claim $c$, and $\mathcal{A}$ and $\mathcal{E}$ are the abstracted evidence set and the raw evidence set, respectively. $f$ is the function used to verify the truthfulness of each subclaim $u_i$. We implement $f$ using an LLM in a zero-shot manner, consistent with prior work (Pan et al. 2023b). Mathematically, this can be written as:

$$f(u_i,\mathcal{A}\cup\mathcal{E})=\arg\max p(p_i\mid T_{SV},u_i,\mathcal{A}\cup\mathcal{E},c;\theta_{LLM}) \tag{6}$$

where $p_i$, which is either True or False, represents the veracity prediction of the $i$th subclaim $u_i$, and $T_{SV}$ is the prompt template used for Subclaim Verification:

Given golden evidence: [Abstracted Evidence & Raw Evidence] ($\mathcal{A}\cup\mathcal{E}$) In the saying of [Claim] ($c$). Based on the golden evidence. Is it true that [Subclaim] ($u_i$)? (Yes or No)

The prompt segment “In the saying of [Claim] ($c$).” is optional. The decision to incorporate the context of the original claim for subclaim verification depends on the complexity of the claim. In our experiments, we observe that for complex claims, incorporating the original claim as context is more beneficial for verification. Further discussion on the optional prompt segment will be provided in the experimental section.
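The aggregation rule in Eq. (5), together with the optional claim context, can be sketched as below. This is a minimal illustration under assumptions: `build_sv_prompt` and `verify_claim` are hypothetical names, `llm` is any prompt-to-text callable, and treating every answer that does not start with "yes" as False is a choice made for this sketch rather than something the paper specifies.

```python
from typing import Callable, Optional, Sequence


def build_sv_prompt(
    subclaim: str, evidence: Sequence[str], claim: Optional[str] = None
) -> str:
    """Assemble a verification prompt; the claim context is included only if given."""
    parts = ["Given golden evidence: " + " ".join(evidence)]
    if claim is not None:
        # Optional segment: the original claim as context.
        parts.append(f"In the saying of {claim}")
    parts.append("Based on the golden evidence.")
    parts.append(f"Is it true that {subclaim.rstrip('. ')}? (Yes or No)")
    return " ".join(parts)


def verify_claim(
    subclaims: Sequence[str],
    evidence: Sequence[str],
    llm: Callable[[str], str],
    claim: Optional[str] = None,
) -> bool:
    """Return False as soon as any subclaim is judged false, True otherwise (Eq. 5)."""
    for subclaim in subclaims:
        answer = llm(build_sv_prompt(subclaim, evidence, claim))
        # Assumption: anything other than a leading "yes" counts as False.
        if not answer.strip().lower().startswith("yes"):
            return False
    return True


# Toy verifier that rejects one specific subclaim.
def fake_llm(prompt: str) -> str:
    return "No" if "Is it true that Spam is not used" in prompt else "Yes"


toy_evidence = ["Spam musubi is a popular snack in Hawaii."]
verdict_false = verify_claim(
    ["Spam is made by Hormel.", "Spam is not used in Hawaii."], toy_evidence, fake_llm
)
verdict_true = verify_claim(["Spam is made by Hormel."], toy_evidence, fake_llm)
```

Note that an empty list of subclaims yields True under this rule, so a caller may want to fall back to verifying the original claim in that case.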

4 Experiment

| Group | Model | HOVER-2 | HOVER-3 | HOVER-4 | FEVEROUS-S |
|---|---|---|---|---|---|
| Pretrained/Fine-tuned Models | BERT-FC | 53.40 | 50.90 | 50.86 | 74.71 |
| | LisT5 | 56.15 | 53.76 | 51.67 | 77.88 |
| | RoBERTa-NLI | 74.62 | 62.23 | 57.98 | 88.28 |
| | DeBERTaV3-NLI | 77.22 | 65.98 | 60.49 | 91.98 |
| | MULTIVERS | 68.86 | 59.87 | 55.67 | 86.03 |
| Vicuna | Zero-Shot | 64.08 | 64.63 | 59.59 | 81.69 |
| | Few-Shot | 63.02 | 62.18 | 56.81 | 78.65 |
| | ProgramFC | 66.07 | 60.35 | 56.74 | 87.51 |
| | + EACon (our method) | 68.55 (+4.47) | 66.43 (+1.8) | 63.42 (+3.83) | 89.37 (+7.68) |
| Mixtral | Zero-Shot | 67.86 | 64.03 | 62.09 | 85.06 |
| | Few-Shot | 66.59 | 63.59 | 62.55 | 88.49 |
| | ProgramFC | 59.97 | 61.75 | 59.82 | 81.76 |
| | + EACon (our method) | 73.17 (+5.31) | 69.40 (+5.37) | 67.78 (+5.69) | 89.52 (+4.46) |

Table 1: Comparison of baseline models on subsets of the HOVER dataset and the FEVEROUS-S dataset in terms of Macro-F1 score. HOVER-2 represents the 2-hop subset of the HOVER dataset. Numbers in parentheses show the improvement over zero-shot performance when our method is applied to the backbone LLMs, as the verification process of EACon is also zero-shot.

4.1 Dataset

In line with existing research, we have selected two publicly available datasets to assess the performance of EACon. Evaluation is carried out using the validation set. The chosen datasets are HOVER (Jiang et al. 2020) and FEVEROUS-S (Aly et al. 2021).

  • HOVER The HOVER dataset comprises claims that necessitate verification through multiple pieces of evidence and multi-hop reasoning. It is organized into three subsets, each corresponding to a different level of reasoning complexity based on the number of hops. Specifically, the two-hop subset (HOVER-2) consists of 1,126 claims, the three-hop subset (HOVER-3) comprises 1,835 claims, and the four-hop (HOVER-4) subset includes 1,039 claims.

  • FEVEROUS-S FEVEROUS is a fact-checking dataset designed to validate claims using both structured and unstructured data sources. Our experimentation is focused on a subset of FEVEROUS, known as FEVEROUS-S, which exclusively involves claims that rely on unstructured data. In terms of claim complexity, it is noted that the claims in the HOVER dataset exhibit higher complexity compared to those in FEVEROUS-S.

Given our emphasis on the claim verification task, all experiments are executed using the evidence provided within the dataset (referred to as golden evidence). The performance is assessed using the Macro-F1 score as the evaluation metric.
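For reference, Macro-F1 is the unweighted mean of the per-class F1 scores, here over the two classes True and False. The following sketch computes it in plain Python on a toy example; the function names are illustrative, and the toy labels are made up for demonstration and are not taken from either dataset. The tables report the score as a percentage (multiplied by 100).

```python
from typing import Hashable, Sequence


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], cls: Hashable) -> float:
    """F1 score of one class: 2*TP / (2*TP + FP + FN), or 0 if the class never occurs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    classes: Sequence[Hashable] = (True, False),
) -> float:
    """Unweighted average of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels (illustrative only).
gold = [True, True, False, False]
pred = [True, False, False, False]
score = macro_f1(gold, pred)
# Class True:  TP=1, FP=0, FN=1 -> F1 = 2/3
# Class False: TP=2, FP=1, FN=0 -> F1 = 4/5
```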

4.2 Baselines

EACon is a versatile framework that can be adapted to various existing large language models. In order to ensure credibility and inclusivity, we have selected two open-source LLMs with differing parameter sizes as the foundational backbone for EACon. These models are Vicuna-13B (Chiang et al. 2023) and Mixtral-8x7B (Jiang et al. 2024). Our experimentation includes zero-shot and few-shot trials using these backbone models. Furthermore, we conduct experiments with these two language models within another framework, ProgramFC (Pan et al. 2023b), which prompts LLMs to generate and execute programs for the purpose of claim verification.

The following pretrained or fine-tuned models are also considered as baseline models:

  • BERT-FC (Soleimani, Monz, and Worring 2020): Pretrained BERT model (Devlin et al. 2019) tailored for fact-checking tasks.

  • LisT5 (Jiang, Pradeep, and Lin 2021): Pretrained T5 model (Raffel et al. 2020) specialized for fact-checking tasks.

  • RoBERTa-NLI (Nie et al. 2020): Pretrained RoBERTa-large model (Liu et al. 2019) fine-tuned on four natural language inference datasets.

  • DeBERTaV3-NLI (He, Gao, and Chen 2021): Pretrained DeBERTaV3 model fine-tuned on FEVER (Thorne et al. 2018) and four natural language inference datasets.

  • MULTIVERS (Wadden et al. 2022): A LongFormer model (Beltagy, Peters, and Cohan 2020) fine-tuned on the FEVER dataset.

4.3 Implementation Details

In the Keyword Selection process, both similarity score thresholds ($t_{1}$ and $t_{2}$) are set to 60 (maximum is 100) to ensure the retention of important keywords.
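
To make the role of a 0-to-100 similarity threshold concrete, here is a minimal sketch of threshold-based keyword filtering. It is an assumption-laden illustration: Python's standard `difflib.SequenceMatcher` stands in for the fuzzy matcher, the sliding-window comparison and the helper names `similarity` and `select_keywords` are hypothetical, and only a single threshold is shown (how $t_{1}$ and $t_{2}$ are each applied follows the method description, not this sketch).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stand-in for the fuzzy matcher)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_keywords(keywords, evidence, threshold=60.0):
    """Keep keywords whose best-matching evidence span reaches the threshold."""
    tokens = evidence.split()
    selected = []
    for kw in keywords:
        n = len(kw.split())
        # Compare the keyword against every window of n consecutive words.
        windows = [
            " ".join(tokens[i:i + n])
            for i in range(max(1, len(tokens) - n + 1))
        ]
        best = max(similarity(kw, w) for w in windows)
        if best >= threshold:
            selected.append(kw)
    return selected


evidence = "The Seattle Seahawks hired a new head coach in 2010."
keywords = ["Seattle Seahawks", "Cleveland Browns"]
selected = select_keywords(keywords, evidence)
```

A moderate threshold such as 60 tolerates spelling variants and partial matches, while still discarding keywords that have no counterpart in a given evidence piece.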

EACon conducts the verification process in a zero-shot manner but includes in-context examples in the prompts for Evidence Abstraction and Claim Deconstruction. To ensure fair experimentation, the model does not use any examples that are not utilized by the baseline models. Few-shot experiments with backbone models use the same examples as prompts in ProgramFC. Examples for Evidence Abstraction and Claim Deconstruction are rephrased from ProgramFC to suit task requirements. Subclaim Verification uses the optional prompt component $T_{SV}$ for the HOVER dataset but not for the FEVEROUS-S dataset. A more detailed discussion on $T_{SV}$ is provided in Section 4.5.

Since open-source models are utilized, all experiments are conducted on a local machine server equipped with an AMD EPYC 7742 (256) @ 2.250GHz CPU and NVIDIA RTX 3090 (24G) GPUs (Vicuna-13B experiments require two GPUs, while Mixtral-8x7B necessitates a minimum of five GPUs). A temperature of 0.05 is utilized to reduce randomness, while all other hyperparameters in sampling output of LLMs remain default.
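
The effect of a low sampling temperature can be illustrated with the standard temperature-scaled softmax. The sketch below uses made-up logits and is not tied to either backbone model; it only shows why a temperature of 0.05 makes sampling nearly deterministic.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # hypothetical next-token logits
default_probs = softmax_with_temperature(logits, 1.0)
low_temp_probs = softmax_with_temperature(logits, 0.05)
```

With these logits, the top token receives a little over half of the probability mass at temperature 1.0, whereas at 0.05 it receives almost all of it, so repeated runs rarely diverge.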

CD $\text{EA}_{v}$ $\text{EA}_{m}$ HOVER-2 HOVER-3 HOVER-4
× × × 64.08 64.63 59.59
× × 66.90 65.61 62.46
× × 64.25 64.98 61.52
× × 63.99 65.15 63.90
× 68.55 66.43 63.42
× 66.25 66.97 64.23
Table 2: Ablation study on EACon using Vicuna as the subclaim verifier. CD refers to Claim Deconstruction with Vicuna, while $\text{EA}_{v}$/$\text{EA}_{m}$ denotes Evidence Abstraction with Vicuna/Mixtral. Macro-F1 scores are reported.

4.4 Overall Performance

Table 1 presents the results of our method and various baseline models. The results show that EACon consistently and substantially improves model performance across both datasets, using either Vicuna or Mixtral as the backbone model.

Compared to pretrained/fine-tuned models, zero-shot LLMs do not exhibit an advantage, especially compared to DeBERTaV3. However, applying our proposed framework, EACon, to these LLMs yields a more pronounced advantage on complex tasks. On simpler datasets like FEVEROUS-S and HOVER-2 (2-hop reasoning), EACon-equipped models perform comparably to pretrained/fine-tuned models, but on more complex tasks like HOVER-3 and HOVER-4, EACon-equipped LLMs show a distinct advantage.

In comparison to other LLM-based approaches for claim verification, our framework demonstrates superior stability. The few-shot technique does not consistently improve performance, aligning with prior research (Hu et al. 2023). ProgramFC’s reliance on LLMs’ program generation and execution capabilities makes it less adaptable and more sensitive to intermediate errors than EACon.

4.5 Ablation Study

The Impact of Evidence Abstraction and Claim Deconstruction

To navigate the “noisy” crowd of evidence and claim, EACon contains two key components: Evidence Abstraction and Claim Deconstruction. In this section, we conduct ablation studies to understand the contribution of each component. Removing the Evidence Abstraction component eliminates the use of the abstracted evidence set $\mathcal{A}$ in verification, while removing Claim Deconstruction results in direct assessment of the claim’s truthfulness without generating subclaims. We show the results of these ablation experiments with Vicuna as the subclaim verifier in Table 2.

The results indicate that utilizing either the Evidence Abstraction or Claim Deconstruction component independently leads to improvements in the backbone LLM’s performance in claim verification. Combining both components further enhances the model’s performance. Furthermore, we observe that the choice of the LLM used for the Evidence Abstraction component also affects the model’s performance. Specifically, using Mixtral for Evidence Abstraction enhances the large language model’s ability to evaluate complex claims more significantly than using the Vicuna model.

Model HOVER-2 HOVER-3 HOVER-4
Full Model 68.55 66.43 63.42
w/o Keyword 64.62 63.97 60.48
w/o Selection 63.74 63.73 62.78
w/o Raw 65.80 65.56 60.17
Table 3: Macro-F1 scores of different Evidence Abstraction settings using Vicuna as the backbone model. “w/o Keyword” indicates abstraction without keyword guidance, relying solely on the claim. “w/o Selection” indicates abstraction using all keywords without the Keyword Selection process. “w/o Raw” indicates using solely the abstracted evidence set $\mathcal{A}$ for verification without the raw evidence set $\mathcal{E}$.

The Rationale Behind Keyword Selection

We employ a keyword-based method in Evidence Abstraction. As elucidated in Section 3.2 and Figure 2, extracting keywords from the claim to guide evidence abstraction serves to preempt potential conflicts between claim and evidence content. Selecting relevant keywords by fuzzy matching reduces LLMs’ tendency to generate content not supported by the evidence. To further assess this methodology, we examine Evidence Abstraction performed without keyword guidance (w/o Keyword) and without Keyword Selection (w/o Selection). In the w/o Keyword scenario, the LLM summarizes raw evidence based solely on the claim (Part I of Figure 2). In the w/o Selection scenario, the LLM uses all keywords for guidance (Part II of Figure 2). Results are presented in Table 3.

As shown in the table, the performance of EACon significantly deteriorates when either keyword guidance or keyword selection is omitted. This highlights the crucial role of selecting keywords as guidance in enhancing the effectiveness of Evidence Abstraction.

Model HOVER-2 HOVER-3 HOVER-4 FS-S
EACon w/ Claim 68.55 66.43 63.42 82.53
EACon w/o Claim 68.37 62.57 55.7 89.37
Table 4: Macro-F1 scores of different Subclaim Verification settings using Vicuna as the backbone model. “w/ Claim” indicates the use of the optional part in the prompt $T_{SV}$, which mentions the original claim. “w/o Claim” indicates its absence. FS-S refers to the FEVEROUS-S dataset.

Effectiveness of Concatenating Raw Evidence Set $\mathcal{E}$

In Evidence Abstraction, the abstracted evidence set $\mathcal{A}$ is treated as an augmentation of the raw evidence set $\mathcal{E}$. Table 3 shows that using only $\mathcal{A}$ (w/o Raw) performs worse than using the combined set $\mathcal{A}\cup\mathcal{E}$ (Full Model). This gap arises because the keyword-based abstraction method, while capturing crucial information, may overlook hard-to-identify details. Concatenating $\mathcal{E}$ and $\mathcal{A}$ is therefore an effective strategy.

Analysis of Optional Claim Context in Subclaim Verification

In Section 3.4, we mentioned that the prompt used in the Subclaim Verification process includes an optional segment: “In the saying of [Claim] ($c$)”. In our EACon experiments, we included this optional component for the HOVER dataset but not for the FEVEROUS-S dataset. The presence of this optional component significantly impacts the Subclaim Verification step.

As shown in Table 4, including the original claim improves performance in complex reasoning scenarios such as HOVER-3 and HOVER-4, yields only a marginal gain on HOVER-2, and lowers performance on the simpler FEVEROUS-S dataset (82.53 with the claim versus 89.37 without). Complex datasets may feature intricate logical relationships in claims. For example, a claim with nested logic such as “The coach, who worked with the Seattle Seahawks, was an employee of the Cleveland Browns” could lead the LLM to produce the subclaim “The coach was an employee of the Cleveland Browns,” which no longer identifies which coach is meant. In such cases, providing the full claim as context is crucial. In essence, for straightforward scenarios, minimizing additional context makes the best use of the LLM’s reasoning abilities in subclaim verification; in complex scenarios, providing more context proves more effective, which aligns with common-sense judgment.
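
The optional claim segment can be sketched as a prompt builder with a toggle. This is a hypothetical illustration: the function name `build_subclaim_prompt`, the question wording, and the placeholder evidence strings are assumptions, and only the phrase “In the saying of [Claim]” comes from the prompt described in Section 3.4. The sketch also concatenates abstracted and raw evidence into a single context.

```python
from typing import List, Optional


def build_subclaim_prompt(
    subclaim: str,
    raw_evidence: List[str],
    abstracted_evidence: List[str],
    claim: Optional[str] = None,
) -> str:
    """Assemble a zero-shot verification prompt for one subclaim."""
    # Abstracted evidence augments, rather than replaces, the raw evidence.
    context = "\n".join(abstracted_evidence + raw_evidence)
    # Optional segment mentioning the original claim.
    prefix = f'In the saying of "{claim}", ' if claim else ""
    # Hypothetical question wording, not the paper's exact template.
    question = f"{prefix}is it true that {subclaim.rstrip('.')}?"
    return f"{context}\n\n{question} Answer True or False."


claim = (
    "The coach, who worked with the Seattle Seahawks, "
    "was an employee of the Cleveland Browns."
)
subclaim = "The coach was an employee of the Cleveland Browns."
raw = ["Raw evidence passage (placeholder)."]
abstracted = ["Abstracted evidence sentence (placeholder)."]

with_claim = build_subclaim_prompt(subclaim, raw, abstracted, claim=claim)
without_claim = build_subclaim_prompt(subclaim, raw, abstracted)
```

Under this sketch, the HOVER setting corresponds to passing `claim`, and the FEVEROUS-S setting corresponds to omitting it.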

5 Conclusion

In this paper, we introduce the EACon framework to enhance LLMs in the claim verification task. We address the challenge posed by the “noisy crowd” of evidence and claims, which can negatively impact LLMs’ performance. To this end, we propose Evidence Abstraction to extract essential information from noisy evidence and Claim Deconstruction to verify distinct aspects of the original claim individually. We present an abstraction method based on selected keywords to mitigate conflicts between claims and evidence, reducing the risk of generating unsupported content during evidence abstraction. We also examine the impact of incorporating the original claim into the subclaim verification process. Our validation on two datasets using two open-source LLMs shows the effectiveness of the EACon framework.

References

  • Aly et al. (2021) Aly, R.; Guo, Z.; Schlichtkrull, M. S.; Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Cocarascu, O.; and Mittal, A. 2021. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Bakir and McStay (2018) Bakir, V.; and McStay, A. 2018. Fake news and the economy of emotions: Problems, causes, solutions. Digital journalism, 6(2): 154–175.
  • Beltagy, Peters, and Cohan (2020) Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The long-document transformer. ArXiv preprint, abs/2004.05150.
  • Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Chen et al. (2022a) Chen, J.; Sriram, A.; Choi, E.; and Durrett, G. 2022a. Generating literal and implied subquestions to fact-check complex claims. ArXiv preprint, abs/2205.06938.
  • Chen et al. (2022b) Chen, Z.; Hui, S. C.; Zhuang, F.; Liao, L.; Li, F.; Jia, M.; and Li, J. 2022b. EvidenceNet: Evidence Fusion Network for Fact Verification. In WWW, 2636–2645. ACM.
  • Chiang et al. (2023) Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
  • Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics.
  • Fu et al. (2022) Fu, Y.; Peng, H.; Sabharwal, A.; Clark, P.; and Khot, T. 2022. Complexity-Based Prompting for Multi-Step Reasoning. ArXiv preprint, abs/2210.00720.
  • Gi, Fang, and Tsai (2021) Gi, I.-Z.; Fang, T.-Y.; and Tsai, R. T.-H. 2021. Verdict Inference with Claim and Retrieved Elements Using RoBERTa. In Aly, R.; Christodoulopoulos, C.; Cocarascu, O.; Guo, Z.; Mittal, A.; Schlichtkrull, M.; Thorne, J.; and Vlachos, A., eds., Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), 60–65. Dominican Republic: Association for Computational Linguistics.
  • Gong et al. (2024) Gong, H.; Xu, W.; Wu, S.; Liu, Q.; and Wang, L. 2024. Heterogeneous Graph Reasoning for Fact Checking over Texts and Tables. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 100–108.
  • Guo, Schlichtkrull, and Vlachos (2022) Guo, Z.; Schlichtkrull, M.; and Vlachos, A. 2022. A survey on automated fact-checking. Transactions of the Association for Computational Linguistics, 10: 178–206.
  • He, Gao, and Chen (2021) He, P.; Gao, J.; and Chen, W. 2021. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. ArXiv preprint, abs/2111.09543.
  • Hu et al. (2022) Hu, N.; Wu, Z.; Lai, Y.; Liu, X.; and Feng, Y. 2022. Dual-channel evidence fusion for fact verification over texts and tables. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5232–5242.
  • Hu et al. (2023) Hu, X.; Chen, J.; Li, X.; Guo, Y.; Wen, L.; Yu, P. S.; and Guo, Z. 2023. Do Large Language Models Know about Facts? ArXiv preprint, abs/2310.05177.
  • Jiang et al. (2024) Jiang, A. Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D. S.; Casas, D. d. l.; Hanna, E. B.; Bressand, F.; et al. 2024. Mixtral of experts. ArXiv preprint, abs/2401.04088.
  • Jiang, Pradeep, and Lin (2021) Jiang, K.; Pradeep, R.; and Lin, J. 2021. Exploring Listwise Evidence Reasoning with T5 for Fact Verification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 402–410. Online: Association for Computational Linguistics.
  • Jiang et al. (2020) Jiang, Y.; Bordia, S.; Zhong, Z.; Dognin, C.; Singh, M.; and Bansal, M. 2020. HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification. In Findings of the Association for Computational Linguistics: EMNLP 2020, 3441–3460. Online: Association for Computational Linguistics.
  • Kojima et al. (2022) Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35: 22199–22213.
  • Li et al. (2023) Li, M.; Peng, B.; Galley, M.; Gao, J.; and Zhang, Z. 2023. Self-checker: Plug-and-play modules for fact-checking with large language models. ArXiv preprint, abs/2305.14623.
  • Liu et al. (2019) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. Roberta: A robustly optimized bert pretraining approach. ArXiv preprint, abs/1907.11692.
  • Liu et al. (2020) Liu, Z.; Xiong, C.; Sun, M.; and Liu, Z. 2020. Fine-grained Fact Verification with Kernel Graph Attention Network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7342–7351. Online: Association for Computational Linguistics.
  • Ma et al. (2023) Ma, H.; Xu, W.; Wei, Y.; Chen, L.; Wang, L.; Liu, Q.; and Wu, S. 2023. Ex-fever: A dataset for multi-hop explainable fact verification. ArXiv preprint, abs/2310.09754.
  • Min et al. (2023) Min, S.; Krishna, K.; Lyu, X.; Lewis, M.; Yih, W.-t.; Koh, P. W.; Iyyer, M.; Zettlemoyer, L.; and Hajishirzi, H. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. ArXiv preprint, abs/2305.14251.
  • Nie et al. (2020) Nie, Y.; Williams, A.; Dinan, E.; Bansal, M.; Weston, J.; and Kiela, D. 2020. Adversarial NLI: A New Benchmark for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4885–4901. Online: Association for Computational Linguistics.
  • Pan et al. (2023a) Pan, L.; Lu, X.; Kan, M.-Y.; and Nakov, P. 2023a. QACHECK: A Demonstration System for Question-Guided Multi-Hop Fact-Checking. ArXiv preprint, abs/2310.07609.
  • Pan et al. (2023b) Pan, L.; Wu, X.; Lu, X.; Luu, A. T.; Wang, W. Y.; Kan, M.-Y.; and Nakov, P. 2023b. Fact-Checking Complex Claims with Program-Guided Reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 6981–7004.
  • Raffel et al. (2020) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1–67.
  • Rani et al. (2023) Rani, A.; Tonmoy, S.; Dalal, D.; Gautam, S.; Chakraborty, M.; Chadha, A.; Sheth, A.; and Das, A. 2023. FACTIFY-5WQA: 5W Aspect-based Fact Verification through Question Answering. ArXiv preprint, abs/2305.04329.
  • Soleimani, Monz, and Worring (2020) Soleimani, A.; Monz, C.; and Worring, M. 2020. BERT for Evidence Retrieval and Claim Verification. In ECIR (2), volume 12036 of Lecture Notes in Computer Science, 359–366. Springer.
  • Thorne et al. (2018) Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; and Mittal, A. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 809–819. New Orleans, Louisiana: Association for Computational Linguistics.
  • Wadden et al. (2022) Wadden, D.; Lo, K.; Wang, L. L.; Cohan, A.; Beltagy, I.; and Hajishirzi, H. 2022. MultiVerS: Improving scientific claim verification with weak supervision and full-document context. In Findings of the Association for Computational Linguistics: NAACL 2022, 61–76.
  • Wang et al. (2022) Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; and Zhou, D. 2022. Self-consistency improves chain of thought reasoning in language models. ArXiv preprint, abs/2203.11171.
  • Wei et al. (2022) Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824–24837.
  • Zhao et al. (2020) Zhao, C.; Xiong, C.; Rosset, C.; Song, X.; Bennett, P. N.; and Tiwary, S. 2020. Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Zhao et al. (2023) Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. 2023. A survey of large language models. ArXiv preprint, abs/2303.18223.
  • Zhong et al. (2020) Zhong, M.; Liu, P.; Chen, Y.; Wang, D.; Qiu, X.; and Huang, X. 2020. Extractive Summarization as Text Matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6197–6208. Online: Association for Computational Linguistics.