Navigating the Noisy Crowd: Finding Key Information for Claim Verification

Haisong Gong^1,2, Huanhuan Ma^1,2, Qiang Liu^1,2, Shu Wu^1,2, Liang Wang^1,2

Abstract

Claim verification is a task that involves assessing the truthfulness of a given claim based on multiple evidence pieces. Using large language models (LLMs) for claim verification is a promising approach. However, simply feeding all the evidence pieces to an LLM and asking if the claim is factual does not yield good results. The challenge lies in the noisy nature of both the evidence and the claim: evidence passages typically contain irrelevant information, with the key facts hidden within the context, while claims often convey multiple aspects simultaneously. To navigate this “noisy crowd” of information, we propose EACon (Evidence Abstraction and Claim Deconstruction), a framework designed to find key information within evidence and verify each aspect of a claim separately. EACon first finds keywords from the claim and employs fuzzy matching to select relevant keywords for each raw evidence piece. These keywords serve as a guide to extract and summarize critical information into abstracted evidence. Subsequently, EACon deconstructs the original claim into subclaims, which are then verified against both abstracted and raw evidence individually. We evaluate EACon using two open-source LLMs on two challenging datasets. Results demonstrate that EACon consistently and substantially improves LLMs’ performance in claim verification.

Figure 1: Architecture of EACon. The input is a claim and raw evidence, and the output is the predicted veracity of the claim. EACon extracts keywords from the claim and uses fuzzy matching to select keywords for each piece of the raw evidence. These selected keywords are then used to summarize the raw evidence into abstracted evidence. EACon then deconstructs the claim into subclaims, which are verified against both the raw and abstracted evidence using a zero-shot approach.

1 Introduction

The ease of creating and sharing information has led to a surge in misinformation within society, spanning from social media to prominent events like the U.S. Presidential debates, disrupting societal norms (Bakir and McStay 2018). Consequently, the automated verification of information accuracy has become paramount. One critical aspect of this is claim verification, which involves using models to evaluate the truthfulness of a given statement (claim) based on multiple evidence pieces (Guo, Schlichtkrull, and Vlachos 2022).

Claim verification can be viewed as a type of Natural Language Inference (NLI) task. Prior studies have delved into techniques such as fine-tuning pre-trained language models and utilizing graph neural networks to establish relationships between evidence in claim verification (Ma et al. 2023; Gong et al. 2024). With recent advancements in large language models (LLMs) (Zhao et al. 2023), leveraging these models for claim verification holds significant promise.

Despite the potential of LLMs, applying them directly to claim verification by simply feeding all the evidence pieces and asking if a claim is factual falls short in yielding satisfactory outcomes. Even advanced methods, such as leveraging in-context examples through few-shot learning or enhancing LLM reasoning via strategies like Chain of Thought (CoT) (Wei et al. 2022) or complex reasoning chains (Fu et al. 2022), do not consistently improve claim verification outcomes (Hu et al. 2023). This is because the task of claim verification necessitates not only reasoning abilities but also the capacity to handle the inherently “noisy” nature of evidence and claims, which both direct LLM applications and these prompt techniques struggle to address effectively.

In the case of “noisy” evidence, an evidence piece may be rife with irrelevant information, while the key information occupies only a small portion and is hidden deeply within the context. This necessitates the model to possess the capability to sift through the noise and extract the pertinent information from the “noisy evidence crowd.” On the other hand, “noisy” claims are often expressed in a convoluted manner, encompassing multiple aspects simultaneously rather than presenting a concise, atomic statement. These “noisy” claims pose challenges for the direct application of LLMs. This is because LLMs typically tend to compare the overall semantic meaning between the evidence and claim, overlooking minor details. However, in the realm of claim verification, even minor inaccuracies should render a claim false, irrespective of the overall semantic coherence.

To address this challenge, we propose the EACon (Evidence Abstraction and Claim Deconstruction) framework. EACon extracts and summarizes the key information from the raw evidence into abstracted evidence to aid LLM verification. It also deconstructs the claim into subclaims, allowing each aspect of the claim to be checked in detail, increasing the likelihood of identifying errors. In this framework, we design a keyword-based technique to extract keywords from the claim and use fuzzy matching to select relevant keywords as guidance to conduct evidence abstraction. This keyword-guided strategy mitigates the impact of conflicts between inaccurate claims and evidence content, while selecting relevant keywords by fuzzy matching aids in reducing the LLM’s tendency to generate content not conveyed by evidence, as illustrated in Figure 2. Furthermore, for complex scenarios, we provide the LLM with contextual information about the original claim during the subclaim verification stage, further improving the model’s performance.

In summary, our key contributions include:

• We highlight the key challenge in claim verification as navigating the “noisy crowd” of claim and evidence information, which hampers the performance of LLMs in claim verification.

• We propose the EACon framework, which extracts and summarizes the key information from raw evidence into abstracted evidence based on selected keywords and deconstructs the claim into subclaims for verification.

• We demonstrate the effectiveness of EACon on the HOVER and FEVEROUS-S datasets, using two open-source LLMs (Vicuna-13B and Mixtral-8x7B). The results show that EACon can consistently and substantially improve LLMs’ performance in claim verification.

2 Related Work

Claim Verification

Traditional methods for claim verification can be categorized into two main approaches. The first approach employs pre-trained language models fine-tuned specifically for claim verification. These models either concatenate the evidence and claims into a single input (Aly et al. 2021; Thorne et al. 2018; Hu et al. 2022) or process each piece of evidence separately and then aggregate the results (Soleimani, Monz, and Worring 2020; Jiang, Pradeep, and Lin 2021; Gi, Fang, and Tsai 2021). The second approach utilizes graph neural networks to capture complex semantic interactions through evidence graphs (Gi, Fang, and Tsai 2021; Zhao et al. 2020; Liu et al. 2020; Zhong et al. 2020; Chen et al. 2022b; Gong et al. 2024). Recent studies have explored leveraging the reasoning abilities of LLMs in verification tasks. For example, ProgramFC (Pan et al. 2023b) employs LLMs to generate reasoning programs that guide the verification process, while EX-FEVER (Ma et al. 2023) elicits LLMs’ capability to generate textual explanations for claim verification results. Factscore (Min et al. 2023) proposes fine-grained atomic evaluation for long text inputs. Pan et al. (2023a), Chen et al. (2022a), Li et al. (2023), and Rani et al. (2023) propose generating a series of questions or queries for claim verification. However, none of these methods addresses the “noisy” nature of both evidence and claim information, which our method focuses on.

Large Language Model Reasoning

The reasoning capabilities of LLMs form the cornerstone for LLM-based verification tasks. In recent years, in-context learning, popularized by the few-shot prompting approach of Brown et al. (2020), has enabled models to generalize tasks from a few examples. The reasoning ability of LLMs has been further enhanced through various strategies, such as chain-of-thought prompting (CoT) (Wei et al. 2022; Wang et al. 2022; Kojima et al. 2022), which improves reasoning by generating intermediate steps in problem-solving. Fu et al. (2022) propose selecting complex reasoning examples as prompts to boost LLMs’ reasoning performance. However, these strategies do not consistently enhance performance in claim verification tasks (Hu et al. 2023). Our method does not focus on enhancing the reasoning abilities of LLMs to solve claim verification tasks. Instead, it aims to improve claim verification performance by reducing “noise” through evidence abstraction and claim deconstruction, thereby leveraging the LLMs’ strengths more effectively.

3 Method

In this section, we introduce the details of our proposed framework, EACon. Generally, EACon is composed of three major components: Evidence Abstraction, Claim Deconstruction, and Subclaim Verification. Both Evidence Abstraction and Claim Deconstruction are designed to address the “noisy crowd” problem for claim verification. After these two preparatory steps, the final Subclaim Verification component verifies each subclaim and produces the overall result. The architecture of EACon is shown in Figure 1.

3.1 Task Formulation

The objective of claim verification is to determine the veracity of a given claim based on multiple pieces of evidence. Typically, each piece of evidence is a sentence or paragraph drawn from sources like Wikipedia. Mathematically, given a claim $c$ and an evidence set containing $n$ pieces of evidence $\mathcal{E}=\{e_{1},e_{2},\cdots,e_{n}\}$, the task is to find a model $\hat{p}=f(c,\mathcal{E})$ that outputs the predicted veracity $\hat{p}\in\{\text{True},\text{False}\}$.

3.2 Evidence Abstraction

Evidence Abstraction serves as the first part of EACon. It involves processing each raw piece of evidence to extract useful information and eliminate noisy information. Instead of naively prompting an LLM to perform extraction and summarization tasks, we designed a keyword-based method.

As shown in Part I of Figure 2, if an LLM is prompted to summarize the evidence against a claim that is inherently false, the conflict between the claim and the evidence can lead to subpar results. To alleviate this problem, EACon extracts keywords from the claim, which encapsulate the claim’s essence without introducing any biases from the claim itself. This approach helps in circumventing potential conflicts. However, as shown in Part II of Figure 2, if all the keywords are used to summarize a piece of evidence, the redundant keywords may lead the LLM to output content not conveyed by the evidence. To address this, we designed a Keyword Selection procedure to remove the irrelevant keywords, resulting in better summarization results. The effectiveness of the keyword-based design is empirically validated in Section 4.5. The following subsections describe the three steps of the Evidence Abstraction process: Keyword Extraction, Keyword Selection, and Evidence Summarization.

Keyword Extraction

Keyword Extraction is the initial step in the Evidence Abstraction process, aimed at identifying essential keywords from the claim. These keywords, including important nouns, verbs, and phrases, serve as a guide for extracting information from evidence in subsequent sections. We instruct an LLM to perform keyword extraction and provide examples to aid its understanding of the task and output formatting. The process can be formally described as:

$$\{k_{1},k_{2},\cdots,k_{m}\}=\arg\max p(\mathcal{K}|T_{KE},c;\theta_{LLM}) \tag{1}$$

where $k_{i}$ is the $i$th keyword selected by the LLM from the potential keywords set $\mathcal{K}$, and $m$ keywords are selected in total for the claim $c$. $\theta_{LLM}$ represents the LLM model, and $T_{KE}$ is the prompt template used for Keyword Extraction.

Keyword Selection

We have obtained $m$ keywords from the claim $c$ . Given the $i$ th piece of evidence $e_{i}$ , the second step is to identify which of these $m$ keywords are related to $e_{i}$ . Since we do not want to use all the keywords as the evidence summary guidance, as described in the previous section and Figure 2, we employ fuzzy matching as a low-cost and efficient way to implement this task.

Fuzzy matching can be used to evaluate the similarity between a keyword and a piece of evidence. Specifically, we use two functions from the fuzzywuzzy package (https://github.com/seatgeek/thefuzz), namely partial_ratio and token_set_ratio. The partial_ratio function computes the similarity (edit distance) of the best matching substring of the evidence to the input keyword, while the token_set_ratio function determines the similarity score of the intersection of unique tokens between the input keyword and the evidence piece. These functions are applied to compare each keyword $k_{j},j\in\{1,2,\cdots,m\}$ with the $i$th piece of evidence $e_{i}$. To prevent the omission of potentially relevant keywords, those where either of the similarity scores exceeds a preset threshold will be selected. Mathematically:

$$\mathcal{S}_{i}=\{k_{j}\mid\texttt{partial\_ratio}(k_{j},e_{i})>t_{1}\text{ or }\texttt{token\_set\_ratio}(k_{j},e_{i})>t_{2},\ 1\leq j\leq m\} \tag{2}$$

where $\mathcal{S}_{i}$ denotes the set of keywords selected for the $i$th piece of evidence $e_{i}$, and $t_{1}$ and $t_{2}$ are the preset thresholds for the two fuzzy matching functions.
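The selection rule in Eq. (2) can be illustrated with a minimal Python sketch. The two scorers below are simplified stand-ins built on the standard library, not the implementations in the fuzzy-matching package; the function names `simple_partial_ratio`, `simple_token_overlap`, and `select_keywords` are hypothetical and introduced only for this illustration. In practice, the package's own `partial_ratio` and `token_set_ratio` would presumably be passed in as the two scorers.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]  # returns a similarity score in [0, 100]


def simple_partial_ratio(keyword: str, evidence: str) -> float:
    """Stand-in scorer (assumption): best similarity between the keyword and
    any evidence window of the same length. Not the package implementation."""
    k, e = keyword.lower(), evidence.lower()
    if not k or not e:
        return 0.0
    if len(k) >= len(e):
        return 100.0 * SequenceMatcher(None, k, e, autojunk=False).ratio()
    best = 0.0
    for start in range(len(e) - len(k) + 1):
        window = e[start:start + len(k)]
        best = max(best, SequenceMatcher(None, k, window, autojunk=False).ratio())
    return 100.0 * best


def simple_token_overlap(keyword: str, evidence: str) -> float:
    """Stand-in scorer (assumption): share of keyword tokens found in the evidence."""
    k_tokens = set(keyword.lower().split())
    e_tokens = set(evidence.lower().split())
    if not k_tokens:
        return 0.0
    return 100.0 * len(k_tokens & e_tokens) / len(k_tokens)


def select_keywords(
    keywords: Sequence[str],
    evidence: str,
    t1: float = 60.0,
    t2: float = 60.0,
    partial: Scorer = simple_partial_ratio,
    token_set: Scorer = simple_token_overlap,
) -> List[str]:
    """Eq. (2): keep k_j if either score exceeds its threshold."""
    return [
        k for k in keywords
        if partial(k, evidence) > t1 or token_set(k, evidence) > t2
    ]


evidence_piece = "The coach worked with the Seattle Seahawks"
selected = select_keywords(["Seattle Seahawks", "coach", "Zurich"], evidence_piece)
```

Using a logical "or" between the two tests mirrors the stated goal of not omitting potentially relevant keywords: a keyword survives if either measure considers it related to the evidence.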

Evidence Summarization

After obtaining the selected keywords set $\mathcal{S}_{i}$ corresponding to the $i$th piece of evidence $e_{i}$, our goal is to extract the information centered around these keywords within the evidence and discard the irrelevant content. This extracted information is intended to be the most useful for verifying the claim. Essentially, extracting information centered on keywords is akin to uncovering the relationships between these keywords. Since meaningful relationships generally exist between multiple keywords, we focus our summaries on evidence containing at least two relevant keywords. Evidence with $|\mathcal{S}_{i}|<2$ will not be summarized, as we deem it unlikely to provide sufficient useful information. As in Keyword Extraction, we prompt the LLM to serve as the extractor and summarizer. Additionally, we equip the prompt with some examples to help the LLM better understand the compositional task and format its output. Formally, this process can be described as:

$$a_{i}=\arg\max p(a_{i}|T_{ES},\mathcal{S}_{i},e_{i};\theta_{LLM}),\quad|\mathcal{S}_{i}|\geq 2 \tag{3}$$

where $a_{i}$ is the abstracted evidence from the raw evidence $e_{i}$, $\mathcal{S}_{i}$ is the selected keywords set, and $T_{ES}$ is the prompt template used for Evidence Summarization.

We perform the Keyword Selection and Evidence Summarization procedures for each piece of evidence, resulting in a set of abstracted evidence denoted as $\mathcal{A}=\{a_{i}\mid|\mathcal{S}_{i}|\geq 2\}$. This set $\mathcal{A}$ is then combined with the raw evidence set $\mathcal{E}$ for subsequent verification tasks. The integration is essential because $\mathcal{A}$ encapsulates key elements crucial for directly validating the truthfulness of the claim, but may not contain all the information. By supplementing the raw evidence set, $\mathcal{A}$ can be considered an enhancement of the original evidence, providing a more concise and focused representation. Furthermore, it offers a shortcut for the LLM, reducing the complexity of subsequent inference processes.
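The gating condition $|\mathcal{S}_{i}|\geq 2$ and the combination of $\mathcal{A}$ with $\mathcal{E}$ can be sketched as follows. This is an illustration under stated assumptions: `summarize` is a hypothetical callable standing in for the LLM call with the Evidence Summarization prompt, and placing the abstracted evidence before the raw evidence is an arbitrary choice made here, since the ordering is not specified in the text.

```python
from typing import Callable, List, Sequence

# Hypothetical signature for the LLM summarizer: (raw evidence, selected keywords) -> summary
Summarizer = Callable[[str, Sequence[str]], str]


def abstract_evidence(
    evidence: Sequence[str],
    selected_keywords: Sequence[Sequence[str]],
    summarize: Summarizer,
    min_keywords: int = 2,
) -> List[str]:
    """Build the abstracted evidence set A: summarize e_i only when |S_i| >= 2."""
    abstracted = []
    for e_i, s_i in zip(evidence, selected_keywords):
        if len(s_i) < min_keywords:
            continue  # too few related keywords; this piece is not summarized
        abstracted.append(summarize(e_i, s_i))
    return abstracted


def verification_context(evidence: Sequence[str], abstracted: Sequence[str]) -> List[str]:
    """Combine A and E for verification (ordering here is an assumption)."""
    return list(abstracted) + list(evidence)


def fake_summarizer(e_i: str, s_i: Sequence[str]) -> str:
    """Toy replacement for the LLM so the sketch runs without a model."""
    return "SUM(" + ", ".join(s_i) + ")"


raw = ["e1 text", "e2 text", "e3 text"]
keywords_per_piece = [["a", "b"], ["a"], []]
abstracted_set = abstract_evidence(raw, keywords_per_piece, fake_summarizer)
context = verification_context(raw, abstracted_set)
```

Only the first piece of evidence is summarized in this toy run, because the other two have fewer than two selected keywords; all three raw pieces still reach the verifier through the combined context.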

3.3 Claim Deconstruction

Claim Deconstruction is the second component of EACon. It takes the original claim as input and generates several subclaims that focus on different aspects. Simply asking an LLM to judge the claim’s truthfulness is insufficient, as LLMs tend to compare the overall semantic meaning rather than scrutinize minor details. However, for claim verification, even the slightest error should result in the claim being judged as false, even if the semantic meaning remains largely unchanged. Relying solely on LLM judgments can only address obvious errors, failing to meet the objectives of comprehensive claim verification. By deconstructing the claim into subclaims, we can leverage LLMs to individually verify different aspects and details, thereby increasing the likelihood of identifying errors. As before, we prompt an LLM to deconstruct the claim into subclaims:

$$\{u_{1},u_{2},\cdots,u_{r}\}=\arg\max p(\mathcal{U}|T_{CD},c;\theta_{LLM}) \tag{4}$$

where $u_{i}$ denotes the $i$th subclaim from the potential subclaim set $\mathcal{U}$, and $T_{CD}$ is the prompt template used for Claim Deconstruction.

3.4 Subclaim Verification

The last step of EACon is Subclaim Verification. After obtaining the set of subclaims $\{u_{1},u_{2},\cdots,u_{r}\}$ , the truthfulness of the original claim can be verified by checking each subclaim individually. If any subclaim is false, the original claim is deemed false. The original claim is considered true only if all subclaims are correct. Mathematically, this can be represented as:

$$\hat{p}=\begin{cases}\text{False}&\text{if }\exists i,\ f(u_{i},\mathcal{A}\cup\mathcal{E})=\text{False}\\ \text{True}&\text{otherwise}\end{cases} \tag{5}$$

where $\hat{p}$ is the veracity prediction of the claim $c$, $\mathcal{A}$ and $\mathcal{E}$ are the abstracted evidence set and the raw evidence set, and $f$ is the function used to verify the truthfulness of each subclaim $u_{i}$. We implement $f$ using an LLM in a zero-shot manner, consistent with prior work (Pan et al. 2023b). Mathematically, this can be written as:

$$f(u_{i},\mathcal{A}\cup\mathcal{E})=\arg\max p(p_{i}|T_{SV},u_{i},\mathcal{A}\cup\mathcal{E},c;\theta_{LLM}) \tag{6}$$

where $p_{i}\in\{\text{True},\text{False}\}$ represents the veracity prediction of the $i$th subclaim $u_{i}$, and $T_{SV}$ is the prompt template used for Subclaim Verification.

The segment of the prompt highlighted in dark color is optional. The decision to incorporate the context of the original claim for subclaim verification depends on the complexity of the claim. In our experiments, we observe that for complex claims, incorporating the original claim as context is more beneficial for verification. Further discussion on the optional prompt segment will be provided in the experimental section.
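The aggregation rule in Eq. (5) amounts to a logical conjunction over subclaim verdicts. A minimal sketch follows, assuming a hypothetical `verify_subclaim` callable that plays the role of $f$ (in EACon this is a zero-shot LLM call); the substring-based `toy_verifier` is only a placeholder so that the example runs without a model.

```python
from typing import Callable, Optional, Sequence

# Hypothetical signature for f: (subclaim, context A ∪ E, optional original claim) -> verdict
SubclaimVerifier = Callable[[str, Sequence[str], Optional[str]], bool]


def verify_claim(
    subclaims: Sequence[str],
    context: Sequence[str],
    verify_subclaim: SubclaimVerifier,
    claim: Optional[str] = None,
) -> bool:
    """Eq. (5): the claim is False if any subclaim is judged False, True otherwise.

    `claim` is passed through only when the optional claim-context segment is used.
    With an empty subclaim list this returns True, matching the 'otherwise' branch.
    """
    return all(verify_subclaim(u, context, claim) for u in subclaims)


def toy_verifier(subclaim: str, context: Sequence[str], claim: Optional[str]) -> bool:
    """Placeholder for the LLM: a subclaim counts as supported if it appears verbatim."""
    return any(subclaim.lower() in piece.lower() for piece in context)


ctx = ["The coach worked with the Seattle Seahawks"]
supported = verify_claim(["The coach worked with the Seattle Seahawks"], ctx, toy_verifier)
refuted = verify_claim(
    [
        "The coach worked with the Seattle Seahawks",
        "The coach was an employee of the Cleveland Browns",
    ],
    ctx,
    toy_verifier,
)
```

Because `all` stops at the first False verdict, a single unsupported detail is enough to reject the whole claim, which is the behavior the deconstruction step is designed to enable.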

4 Experiment

Category	Model	HOVER-2	HOVER-3	HOVER-4	FEVEROUS-S
Pretrained/Fine-tuned Models	BERT-FC	53.40	50.90	50.86	74.71
	LisT5	56.15	53.76	51.67	77.88
	RoBERTa-NLI	74.62	62.23	57.98	88.28
	DeBERTaV3-NLI	77.22	65.98	60.49	91.98
	MULTIVERS	68.86	59.87	55.67	86.03
Vicuna	Zero-Shot	64.08	64.63	59.59	81.69
	Few-Shot	63.02	62.18	56.81	78.65
	ProgramFC	66.07	60.35	56.74	87.51
	+ EACon (our method)	68.55 (+4.47)	66.43 (+1.80)	63.42 (+3.83)	89.37 (+7.68)
Mixtral	Zero-Shot	67.86	64.03	62.09	85.06
	Few-Shot	66.59	63.59	62.55	88.49
	ProgramFC	59.97	61.75	59.82	81.76
	+ EACon (our method)	73.17 (+5.31)	69.40 (+5.37)	67.78 (+5.69)	89.52 (+4.46)

Table 1: Comparison of baseline models on subsets of the HOVER dataset and the FEVEROUS-S dataset in terms of Macro-F1 score. HOVER-2 represents the 2-hop subset of the HOVER dataset. Numbers in parentheses show the improvement over zero-shot performance when our method is applied to the backbone LLMs, as the verification process of EACon is also zero-shot.

4.1 Dataset

In line with existing research, we have selected two publicly available datasets to assess the performance of EACon. Evaluation is carried out using the validation set. The chosen datasets are HOVER (Jiang et al. 2020) and FEVEROUS-S (Aly et al. 2021).

• HOVER: The HOVER dataset comprises claims that necessitate verification through multiple pieces of evidence and multi-hop reasoning. It is organized into three subsets, each corresponding to a different level of reasoning complexity based on the number of hops. Specifically, the two-hop subset (HOVER-2) consists of 1,126 claims, the three-hop subset (HOVER-3) comprises 1,835 claims, and the four-hop (HOVER-4) subset includes 1,039 claims.

• FEVEROUS-S: FEVEROUS is a fact-checking dataset designed to validate claims using both structured and unstructured data sources. Our experimentation is focused on a subset of FEVEROUS, known as FEVEROUS-S, which exclusively involves claims that rely on unstructured data. In terms of claim complexity, it is noted that the claims in the HOVER dataset exhibit higher complexity compared to those in FEVEROUS-S.

Given our emphasis on the claim verification task, all experiments are executed using the evidence provided within the dataset (referred to as golden evidence). The performance is assessed using the Macro-F1 score as the evaluation metric.
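For reference, the Macro-F1 score averages the per-class F1 over the two veracity labels, so both classes count equally regardless of their frequency. The sketch below is a plain-Python illustration with hypothetical function names; it returns a value in $[0,1]$, whereas the tables in this paper appear to report the score multiplied by 100.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[bool], y_pred: Sequence[bool], cls: bool) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Unweighted mean of the F1 scores of the True and False classes."""
    classes = (True, False)
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels: one True claim is misclassified as False.
gold = [True, True, False, False]
pred = [True, False, False, False]
score = macro_f1(gold, pred)  # F1(True) = 2/3, F1(False) = 4/5, mean = 11/15
```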

4.2 Baselines

EACon is a versatile framework that can be adapted to various existing large language models. In order to ensure credibility and inclusivity, we have selected two open-source LLMs with differing parameter sizes as the foundational backbone for EACon. These models are Vicuna-13B (Chiang et al. 2023) and Mixtral-8x7B (Jiang et al. 2024). Our experimentation includes zero-shot and few-shot trials using these backbone models. Furthermore, we conduct experiments with these two language models within another framework, ProgramFC (Pan et al. 2023b), which prompts LLMs to generate and execute programs for the purpose of claim verification.

The following pretrained or fine-tuned models are also considered as baseline models:

• BERT-FC (Soleimani, Monz, and Worring 2020): Pretrained BERT model (Devlin et al. 2019) tailored for fact-checking tasks.

• LisT5 (Jiang, Pradeep, and Lin 2021): Pretrained T5 model (Raffel et al. 2020) specialized for fact-checking tasks.

• RoBERTa-NLI (Nie et al. 2020): Pretrained RoBERTa-large model (Liu et al. 2019) fine-tuned on four natural language inference datasets.

• DeBERTaV3-NLI (He, Gao, and Chen 2021): Pretrained DeBERTaV3 model fine-tuned on FEVER (Thorne et al. 2018) and four natural language inference datasets.

• MULTIVERS (Wadden et al. 2022): A LongFormer model (Beltagy, Peters, and Cohan 2020) fine-tuned on the FEVER dataset.

4.3 Implementation Details

In the Keyword Selection process, both similarity score thresholds ( $t_{1}$ and $t_{2}$ ) are set to $60$ (maximum is 100) to ensure the retention of important keywords.

EACon conducts the verification process in a zero-shot manner but includes in-context examples in the prompts for Evidence Abstraction and Claim Deconstruction. To ensure fair experimentation, the model does not use any examples that are not utilized by the baseline models. Few-shot experiments with backbone models use the same examples as prompts in ProgramFC. Examples for Evidence Abstraction and Claim Deconstruction are rephrased from ProgramFC to suit task requirements. Subclaim Verification includes the optional claim-context segment of the prompt $T_{SV}$ for the HOVER dataset but not for the FEVEROUS-S dataset. A more detailed discussion of this optional segment is provided in Section 4.5.

Since open-source models are utilized, all experiments are conducted on a local machine server equipped with an AMD EPYC 7742 (256) @ 2.250GHz CPU and NVIDIA RTX 3090 (24G) GPUs (Vicuna-13B experiments require two GPUs, while Mixtral-8x7B necessitates a minimum of five GPUs). A temperature of 0.05 is utilized to reduce randomness, while all other hyperparameters in sampling output of LLMs remain default.

CD	$\textbf{EA}_{v}$	$\textbf{EA}_{m}$	HOVER-2	HOVER-3	HOVER-4
$\times$	$\times$	$\times$	64.08	64.63	59.59
✓	$\times$	$\times$	66.90	65.61	62.46
$\times$	✓	$\times$	64.25	64.98	61.52
$\times$	$\times$	✓	63.99	65.15	63.90
✓	✓	$\times$	68.55	66.43	63.42
✓	$\times$	✓	66.25	66.97	64.23

Table 2: Ablation study on EACon using Vicuna as the subclaim verifier. CD refers to Claim Deconstruction with Vicuna, while $\text{EA}_{v}$ and $\text{EA}_{m}$ denote Evidence Abstraction with Vicuna and Mixtral, respectively. Macro-F1 scores are reported.

4.4 Overall Performance

Table 1 presents the results of our method and various baseline models. The data show that EACon consistently and substantially improves model performance over the zero-shot backbone across both datasets, using either Vicuna or Mixtral as the backbone model.

Compared to pretrained/fine-tuned models, zero-shot LLMs do not exhibit an advantage, especially compared to DeBERTaV3. However, applying our proposed framework, EACon, to these LLMs demonstrates a more pronounced advantage in complex tasks. On simpler datasets like FEVEROUS-S and HOVER-2 (2-hop reasoning), EACon-equipped models perform comparably to pretrained/fine-tuned models. But for more complex tasks like HOVER-3 and HOVER-4, EACon-equipped LLMs show a distinct advantage.

In comparison to other LLM-based approaches for claim verification, our model demonstrates superior stability. The few-shot technique does not consistently improve performance, aligning with prior research (Hu et al. 2023). ProgramFC’s reliance on LLMs’ program generation and execution capabilities makes it less adaptable and more sensitive to intermediate errors than our model.

4.5 Ablation Study

The Impact of Evidence Abstraction and Claim Deconstruction

To navigate the “noisy” crowd of evidence and claim, EACon contains two key components: Evidence Abstraction and Claim Deconstruction. In this section, we conduct ablation studies to understand the contribution of each component. Removing the Evidence Abstraction component eliminates the use of the abstracted evidence set $\mathcal{A}$ in verification, while removing Claim Deconstruction results in direct assessment of the claim’s truthfulness without generating subclaims. We show the results of these ablation experiments with Vicuna as the subclaim verifier in Table 2.

The results indicate that using either the Evidence Abstraction or the Claim Deconstruction component on its own improves the backbone LLM’s performance in nearly all settings; the only exception is Mixtral-based Evidence Abstraction alone on HOVER-2 (63.99 vs. 64.08). Combining both components generally improves performance further. We also observe that the choice of the LLM used for the Evidence Abstraction component affects the model’s performance. Specifically, using Mixtral for Evidence Abstraction yields higher scores on the more complex HOVER-3 and HOVER-4 subsets than using Vicuna.

Model	HOVER-2	HOVER-3	HOVER-4
Full Model	68.55	66.43	63.42
w/o Keyword	64.62	63.97	60.48
w/o Selection	63.74	63.73	62.78
w/o Raw	65.80	65.56	60.17

Table 3: Macro-F1 scores of different Evidence Abstraction settings using Vicuna as the backbone model. “w/o Keyword” indicates abstraction without keyword guidance, relying solely on the claim. “w/o Selection” indicates abstraction using all keywords without the Keyword Selection process. “w/o Raw” indicates using solely the abstracted evidence set $\mathcal{A}$ for verification without the raw evidence set $\mathcal{E}$.

The Rationale Behind Keyword Selection

We employ a keyword-based method in Evidence Abstraction. As elucidated in Section 3.2 and Figure 2, extracting keywords from the claim to guide evidence abstraction serves to preempt potential conflicts between claim and evidence content. Selecting relevant keywords by fuzzy matching reduces LLMs’ tendency to generate content not supported by the evidence. To further assess this methodology, we examine Evidence Abstraction performed without keyword guidance (w/o Keyword) and without Keyword Selection (w/o Selection). In the w/o Keyword scenario, the LLM summarizes raw evidence based solely on the claim (Part I of Figure 2). In the w/o Selection scenario, the LLM uses all keywords for guidance (Part II of Figure 2). Results are presented in Table 3.

As shown in the table, the performance of EACon significantly deteriorates when either keyword guidance or keyword selection is omitted. This highlights the crucial role of selecting keywords as guidance in enhancing the effectiveness of Evidence Abstraction.

Model	HOVER-2	HOVER-3	HOVER-4	FS-S
EACon w/ Claim	68.55	66.43	63.42	82.53
EACon w/o Claim	68.37	62.57	55.70	89.37

Table 4: Macro-F1 scores of different Subclaim Verification settings using Vicuna as the backbone model. “w/ Claim” indicates the use of the optional part in the prompt $T_{SV}$, which mentions the original claim. “w/o Claim” indicates its absence. FS-S refers to the FEVEROUS-S dataset.

Effectiveness of Concatenating Raw Evidence Set $\mathcal{E}$

In Evidence Summarization, the abstracted evidence set $\mathcal{A}$ is considered an augmentation to the raw evidence set. Analyzing Table 3, it is apparent that using only the abstracted evidence set $\mathcal{A}$ (w/o Raw) results in suboptimal performance compared to using the combined set $\mathcal{A}\cup\mathcal{E}$ (Full Model). This discrepancy arises because the keyword-based abstraction method, while capturing crucial information, may overlook hard-to-identify details. Therefore, the strategy of concatenating $\mathcal{E}$ and $\mathcal{A}$ proves to be an effective approach.

Analysis of Optional Claim Context in Subclaim Verification

In Section 3.4, we mentioned that the prompt used in the Subclaim Verification process includes an optional segment: “In the saying of [Claim] ( $c$ )”. In our EACon experiments, we included this optional component for the HOVER dataset but not for the FEVEROUS-S dataset. The presence of this optional component significantly impacts the Subclaim Verification step.

As shown in Table 4, including the optional segment improves performance in complex reasoning scenarios such as HOVER-3 (66.43 vs. 62.57) and HOVER-4 (63.42 vs. 55.7), yields only a marginal gain on HOVER-2 (68.55 vs. 68.37), and lowers performance on the simpler FEVEROUS-S dataset (82.53 vs. 89.37). Complex datasets may feature intricate logical relationships in claims. For example, a nested claim such as “The coach, who worked with the Seattle Seahawks, was an employee of the Cleveland Browns” could lead the LLM to produce the subclaim “The coach was an employee of the Cleveland Browns,” which on its own no longer specifies which coach is meant. In such cases, providing the original claim as context is crucial. In essence, for straightforward scenarios, minimizing additional contextual information better leverages the LLM’s reasoning abilities in subclaim verification. Conversely, in complex scenarios, offering more context proves more effective, aligning with common-sense judgment.

5 Conclusion

In this paper, we introduce the EACon framework to enhance LLMs in the claim verification task. We address the challenge posed by the “noisy crowd” of evidence and claims, which can negatively impact LLMs’ performance. To this end, we propose Evidence Abstraction to extract essential information from noisy evidence and Claim Deconstruction to verify distinct aspects of the original claim individually. We present an abstraction method based on selected keywords to mitigate conflicts between claims and evidence, reducing the risk of generating unsupported content during evidence abstraction. We also examine the impact of incorporating the original claim into the subclaim verification process. Our validation on two datasets using two open-source LLMs shows the effectiveness of the EACon framework.
