RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

Robert Friel
Galileo Technologies Inc.
[email protected]
Masha Belyi
Galileo Technologies Inc.
[email protected]
Atindriyo Sanyal
Galileo Technologies Inc.
[email protected]
Abstract

Retrieval-Augmented Generation (RAG) has become a standard architectural pattern for incorporating domain-specific knowledge into user-facing chat applications powered by Large Language Models (LLMs). RAG systems are characterized by (1) a document retriever that queries a domain-specific corpus for context information relevant to an input query, and (2) an LLM that generates a response based on the provided query and context. However, comprehensive evaluation of RAG systems remains a challenge due to the lack of unified evaluation criteria and annotated datasets. In response, we introduce RAGBench: the first comprehensive, large-scale RAG benchmark dataset of 100k examples. It covers five unique industry-specific domains and various RAG task types. RAGBench examples are sourced from industry corpora such as user manuals, making it particularly relevant for industry applications. Further, we formalize the TRACe evaluation framework: a set of explainable and actionable RAG evaluation metrics applicable across all RAG domains. We release the labeled dataset at https://huggingface.co/datasets/rungalileo/ragbench. RAGBench explainable labels facilitate holistic evaluation of RAG systems, enabling actionable feedback for continuous improvement of production applications. Through extensive benchmarking, we find that LLM-based RAG evaluation methods struggle to compete with a fine-tuned DeBERTa model on the RAG evaluation task. We identify areas where existing approaches fall short and propose the adoption of RAGBench with TRACe towards advancing the state of RAG evaluation systems.

* Equal contributions.

1 Introduction

Despite remarkable reasoning and conversational abilities, out-of-the-box pre-trained Large Language Models (LLMs) struggle to reason about out-of-domain, knowledge-intensive queries [21, 14]. In response, Retrieval-Augmented Generation (RAG) systems [21, 20] are becoming increasingly popular in user-facing dialogue applications [35]. Generally, RAG systems comprise a retriever component that queries relevant documents from an in-domain corpus and a downstream LLM generator model that incorporates the retrieved documents along with the original user query to output an informed response. The additional context helps ground the LLM in factual information and has been shown to boost performance on knowledge-intensive tasks [21].

Still, when used in production settings, RAG systems are prone to hallucinations as the generator model struggles to retrieve relevant information from the context [1, 31, 7]. In the absence of a one-size-fits-all approach, application-specific RAG systems must be fine-tuned for optimal performance on domain-specific tasks. However, the choice of retriever and generator models for each application is complex and has serious implications for overall system quality and costs. With numerous commercial and open-source generative LLMs readily available (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) and many variable parameters in the RAG system design (Figure 1), tuning an optimal system for a particular RAG application involves iterative evaluation of multiple configurations. This motivates the need for automated RAG evaluation solutions.

In response, automated RAG evaluation systems like RAGAS [9] and TruLens [37] have emerged. These systems adopt a zero-shot LLM prompt-based approach to predict a set of curated RAG evaluation metrics. However, the lack of unified RAG benchmarks makes it difficult to compare approaches against each other. Each new study designs a new dataset, often employing LLMs as generators and labelers [9, 33, 4], which renders them irreproducible. A few benchmarks like RGB [4], AttributionBench [22], and RAGTruth [41] have been proposed recently, but they are small in size and target a disjoint set of labels. The exact RAG evaluation criteria also vary from study to study. ARES [33] and RAGAS [9] define a context relevance metric to evaluate the quality of the retrieved documents, along with answer relevance and faithfulness to evaluate the quality of the generative model. Others have explored additional metrics such as correctness [1], noise rejection, and robustness [4], to name a few. Finally, most studies evaluate on small in-domain evaluation datasets that are specific to each new application [33, 34, 9, 1, 4], leaving cross-domain generalization an open question.

In this work we propose RAGBench: a comprehensive dataset for training and benchmarking RAG evaluation models. RAGBench comprises data sourced from multiple domains along with a comprehensive suite of evaluation metrics. Specifically, we adopt existing metric definitions for context relevance and answer faithfulness [9, 33], and introduce two new metrics: context utilization and answer completeness. We argue that this new suite of metrics better describes the overall RAG system performance, with the potential to provide granular, actionable insights to the RAG practitioner.

We evaluate state-of-the-art LLMs and existing RAG evaluation systems on RAGBench. We find that, while few-shot LLM judges perform equally well across domains and task types, they still under-perform compared to a fine-tuned DeBERTa-large model. We motivate future work to leverage these data for advancing RAG evaluation approaches and improving on the proposed benchmark.

2 Related Work

RAG benchmarks

Numerous general LLM evaluation benchmarks, such as ChatbotArena [46], have been proposed in past work. However, human preference datasets, constructed through pairwise comparisons, have limitations. While these data are appropriate for fine-tuning general-purpose LLM judges, they are insufficient for building RAG evaluation systems because preference judgements under-represent important RAG dimensions like factuality and completeness of the response [13].

ChatRAGBench [24] is a recent initiative that is similar in intent to our work in that it contributes a large-scale unified RAG benchmark. However, ChatRAGBench only contains ground truth responses and lacks the granular component-specific labels that we release with RAGBench. As future work, we can consider annotating ChatRAGBench with the schema proposed in this paper, to further scale RAGBench.

RAGTruth [41] is another recent effort at a RAG benchmark. RAGTruth combines QA, Data-to-Text, and Summarization RAG data with human-annotated hallucinated spans in the response. While it is an excellent benchmark for hallucination detection, it does not offer the level of granularity we present with RAGBench that is necessary to understand the RAG system as a whole.

RAG evaluation

Recently, several parallel efforts have proposed approaches to automated RAG evaluation. In RAGAS [9], the authors query an LLM-judge (GPT-3.5) with a curated prompt to evaluate context relevance, answer relevance and faithfulness of a RAG response. Next, Saad-Falcon et al. [33] propose ARES, a framework for fine-tuning smaller NLI models to predict the same metrics. This approach benefits from fine-tuning, though domain-specific annotated validation sets are required for each domain adaptation. In parallel, Chen et al. [4] develop a heuristic system to probe LLMs’ robustness to noisy and irrelevant context documents, and Adlakha et al. [1] explore heuristic algorithms to estimate RAG correctness and faithfulness. The lack of established RAG benchmarks makes it difficult to compare these approaches against each other. We aim to address this limitation by introducing RAGBench.

Finetuned RAG evaluation models

Fine-tuned LLM judges are another common way to approach the LLM evaluation task [17, 44, 41]. A number of studies also leverage small, fine-tuned Natural Language Inference (NLI) models for RAG hallucination detection [2, 22, 33]. NLI models measure the degree of entailment between a premise and a hypothesis, which has been successfully repurposed for evaluating LLM response attribution in RAG settings. In this work, we train and evaluate an NLI model for RAG evaluation using RAGBench. The fine-tuned model not only outperforms LLM judges in hallucination/attribution detection but also excels on the new RAG evaluation metrics we propose.

Figure 1: RAG system workflow, with highlighted variable parameters: (1) Context format and length, (2) retriever model, (3) number of retrieved documents, and (4) generation model.

3 RAGBench Construction

3.1 Component Datasets

RAGBench is a collection of real-world datasets that span different domains and RAG task types. We source data from open-book Question-Answer (QA) datasets (CovidQA [27], PubmedQA [15], HotpotQA [42], MS Marco [29], CUAD [12], EManual [28], TechQA [3], FinQA [5], TAT-QA [47], ExpertQA [26], HAGRID [16]), as well as one that was specifically adapted for RAG (DelucionQA [34]). We transform all 12 component datasets to a standardized RAG format with consistent annotations. To best represent real-world RAG scenarios, we vary a number of parameters to construct the benchmark: the source domain, number of context documents, context token length, and the response generator model. Figure 1 illustrates where these variable parameters fall in the RAG pipeline.

Source Domains

RAGBench comprises five distinct domains: bio-medical research (PubmedQA, CovidQA), general knowledge (HotpotQA, MS Marco, HAGRID, ExpertQA), legal contracts (CUAD), customer support (DelucionQA, EManual, TechQA), and finance (FinQA, TAT-QA). We select these specific domains based on availability of data and applicability to real-world RAG applications across different industry verticals. For detailed descriptions of each component data source, refer to Appendix 9.2.

Context Token Length

Context token length in RAGBench ranges from 100 to 11k tokens, which we report in Table 1. Notably, CUAD documents feature long contexts of up to 11k tokens each, compared to the relatively short context in PubMedQA.

Table 1: RAGBench component datasets.
| Dataset | Domain | Document Source | Question Source | #docs | doc length | #Train | #Dev | #Test |
|---|---|---|---|---|---|---|---|---|
| PubMedQA | biomedical research | research abstracts | automated heuristics | 4 | 99 | 19.5k | 2.5k | 2.5k |
| CovidQA-RAG | biomedical research | research papers | expert | 4 | 122 | 2.5k | 534 | 492 |
| HotpotQA | general knowledge | Wikipedia | crowd-sourced | 4 | 126 | 3.7k | 847 | 776 |
| MS Marco | general knowledge | web pages | user web queries | 10 | 94 | 3.7k | 790 | 839 |
| HAGRID | general knowledge | Wikipedia | expert | 3 | 153 | 2.0k | 322 | 1.3k |
| ExpertQA | general knowledge | Google search | expert | 3 | 548 | 1.6k | 202 | 203 |
| CUAD | legal | legal contracts | expert | 1 | 11k | 1.5k | 506 | 508 |
| DelucionQA | customer support | Jeep manual | LLM | 3 | 296 | 1.5k | 177 | 182 |
| EManual | customer support | TV manual | annotator | 3 | 165 | 1k | 132 | 132 |
| TechQA | customer support | Technotes | tech forums | 5 | 1.8k | 1.2k | 302 | 310 |
| FinQA | finance | earning reports | expert | 3 | 310 | 12k | 1.7k | 2.2k |
| TAT-QA | finance | financial reports | expert | 5 | 96 | 26k | 3.2k | 3.2k |
| Total | | | | | | 78k | 12k | 11k |

Task Types

We curate RAGBench to include a variety of difficult RAG task types. Customer support datasets simulate a common application of RAG in industry settings. FinQA and TAT-QA require numerical reasoning over hybrid tabular and text data. HotpotQA, CovidQA, and PubMedQA necessitate retrieval and reasoning over multiple context docs. The CUAD dataset is a challenging addition to RAGBench for several reasons: (i) it represents a difficult and highly specialized real-world domain in which off-the-shelf pre-trained LLMs struggle to perform well [25], and (ii) it is equally challenging in a RAG context due to the very long context lengths of legal contract documents.

Question Sources

All component datasets include domain-specific questions that represent real-world user queries about various topics. Questions for DelucionQA, HotpotQA, and EManual are crowd-sourced; questions for CovidQA, CUAD, HAGRID, ExpertQA, and FinQA are composed by domain experts; MS Marco is sourced from real-world user web search queries; likewise, TechQA questions are user queries posted on IBM technical forums; PubMedQA is the only dataset with automatically-generated questions from research article titles.

Response Generation

For each component dataset we generate responses with LLMs. Exceptions to this are the HAGRID and ExpertQA datasets, which contain LLM-generated responses in the original data. To introduce variability into the dataset, we generate two responses per input with different models: GPT-3.5 (gpt-3.5-0125) and Claude 3 Haiku. Both are proprietary models that are offered at a reasonable price point (https://openai.com/api/pricing/, https://www.anthropic.com/api), which we believe makes them suitable candidates for generating real-world RAG responses. For CUAD we only generate responses with Claude 3 Haiku due to prohibitively long context lengths that exceed the GPT-3.5 16k token limit. To encourage a diverse distribution of labels in RAGBench, we use a basic prompt (Appendix 9.3) that does not explicitly require the model to stick to the provided context when generating the response. We set the temperature to 1.0 for generation.

Data Splits

We split each component dataset into train, validation, and test sets, ensuring there is no overlap in queries across splits from the same data source. RAGBench totals 100k samples, split across train, validation, and test sets. Component dataset statistics are reported in Table 1.
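The query-disjoint constraint can be illustrated with a small sketch. This is not the authors' splitting code: the function name `split_by_query`, the field names, and the 10%/10% validation and test fractions are all hypothetical, chosen only to show how grouping by query keeps the two generated responses for the same input in the same split.

```python
import random
from collections import defaultdict

def split_by_query(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign every example that shares a query to the same split.

    The fractions and the seed are illustrative assumptions.
    """
    by_query = defaultdict(list)
    for ex in examples:
        by_query[ex["query"]].append(ex)

    queries = sorted(by_query)            # deterministic order before shuffling
    random.Random(seed).shuffle(queries)

    n_test = int(len(queries) * test_frac)
    n_dev = int(len(queries) * dev_frac)
    assignment = {
        "test": queries[:n_test],
        "validation": queries[n_test:n_test + n_dev],
        "train": queries[n_test + n_dev:],
    }
    # Expand each split from queries back to the examples they own.
    return {
        name: [ex for q in qs for ex in by_query[q]]
        for name, qs in assignment.items()
    }

# Toy data: 20 queries, each answered by two hypothetical generators.
examples = [
    {"query": f"question {i}", "generator": g}
    for i in range(20)
    for g in ("model_a", "model_b")
]
splits = split_by_query(examples)
```

Splitting on unique queries rather than on individual rows is what prevents a query seen during training from reappearing at test time with a different response.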

RAGBench Statistics

RAGBench component datasets contain between 1% and 20% hallucinations. ExpertQA, CovidQA, and MS Marco contain the highest fraction of hallucinated responses (12%, 16%, and 13%, respectively), while CUAD, FinQA, and TAT-QA contain the least (about 1% for each). We visualize distributions of relevance, utilization, and completeness scores in Figure 2.

Figure 2: Distributions of relevance, utilization, and completeness labels in RAGBench. Y-axis is normalized to visualize densities.

3.2 TRACe Evaluation Framework

We propose a suite of four comprehensive metrics to evaluate the quality of the retriever and the response generator components of RAG. An optimal RAG system must balance accuracy and efficiency. The retriever should precisely return all the necessary information to address the user query, avoiding any superfluous data. The generator must effectively utilize the retrieved information, ensuring the response is strictly based on the provided context without introducing any hallucinations in the output.

Towards comprehensive evaluation of the above-mentioned criteria, we introduce the TRACe evaluation framework to measure uTilization, Relevance, Adherence, and Completeness of a RAG system. Utilization, Adherence, and Completeness measure the quality of the generator. Adherence here is synonymous with previously proposed answer faithfulness, groundedness, and attribution, all terms used in the literature to measure how well an LLM output adheres to a source of factual information. Relevance measures the quality of the retriever output with respect to the query. Below we formalize the definition of each metric.

Definitions

Let $D$ be a set of context documents $\{d_1, \dots, d_n\}$ retrieved for a RAG input query. We define a set of relevant tokens in $d_i$ as $R_i = \{t_1, \dots, t_r\}$. $R_i$ encodes information in context document $d_i$ that is useful for answering the query. Similarly, we define $U_i = \{t_1, \dots, t_u\}$ as the set of utilized tokens in document $d_i$, which reflect information that the generation model is using to produce a response. Refer to Figure 3 for a visual representation of relevant and utilized spans. $Len(x)$ measures the length of strings in $x$, which can be interpreted as character length, token length, or sentence length. For calculating ground-truth metrics, we employ sentence length, since it aligns best with our annotation schema (Section 3.3). However, token or character length may also be suitable for other use cases.

Figure 3: Example of RAG Question, Context, and Response. Relevant context spans are highlighted, and utilized spans are underlined.

Context Relevance

Context Relevance is defined in [9, 33] as the fraction of the retrieved context that is relevant to the input query. Low relevance points to an inefficient retriever that supplies excess information to the generation model. Long context inputs into the generator may accrue unnecessary costs, as well as compromise the quality of the generated output. We measure relevance of context document $d_i$ as:

$$\text{document relevance} = \frac{Len(R_i)}{Len(d_i)} \tag{1}$$

Example-level relevance can be aggregated over all context documents in the example as:

$$\text{example relevance} = \frac{\sum_{i=1}^{|D|} Len(R_i)}{\sum_{i=1}^{|D|} Len(d_i)} \tag{2}$$
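As a minimal sketch of Equations 1 and 2, the snippet below interprets Len as a sentence count, matching the ground-truth convention stated in the Definitions. The function names and the toy documents are hypothetical, and relevant sentences are represented by their indices within each document.

```python
def document_relevance(doc_sentences, relevant_idx):
    """Equation 1 with Len() interpreted as a sentence count."""
    return len(set(relevant_idx)) / len(doc_sentences)

def example_relevance(docs, relevant):
    """Equation 2: pool relevant and total sentence counts over all documents."""
    total_relevant = sum(len(set(r)) for r in relevant)
    total = sum(len(d) for d in docs)
    return total_relevant / total

docs = [
    ["s0", "s1", "s2", "s3"],              # document 1: 4 sentences
    ["s0", "s1", "s2", "s3", "s4", "s5"],  # document 2: 6 sentences
]
relevant = [[0, 1], [3]]  # indices of relevant sentences per document

per_doc = [document_relevance(d, r) for d, r in zip(docs, relevant)]
overall = example_relevance(docs, relevant)  # 3 relevant of 10 sentences
```

Note that the example-level score pools lengths across documents, so it is generally not the mean of the document-level scores: longer documents carry more weight.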

Context Utilization

Context Utilization is a new metric introduced in TRACe. We aim to measure the fraction of the retrieved context that is used by the generator to produce the response. Low Utilization in combination with low Relevance points to a greedy retriever, while low Utilization alone points to a weak generator that fails to leverage the provided context efficiently. Document-level and example-level Utilization are defined as:

$$\text{document utilization} = \frac{Len(U_i)}{Len(d_i)} \qquad \text{example utilization} = \frac{\sum_{i=1}^{|D|} Len(U_i)}{\sum_{i=1}^{|D|} Len(d_i)} \tag{3}$$
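The diagnostic reading of Utilization and Relevance described above can be sketched as a simple decision rule. The cutoff `low=0.2` and the returned messages are hypothetical; the paper does not prescribe a threshold for what counts as "low".

```python
def diagnose(relevance, utilization, low=0.2):
    """Map example-level scores to the failure modes described in the text.

    `low` is a hypothetical cutoff chosen only for illustration.
    """
    if utilization >= low:
        return "no utilization issue flagged"
    if relevance < low:
        # Low Utilization together with low Relevance: greedy retriever.
        return "retriever returns too much irrelevant context"
    # Low Utilization alone: the generator ignores usable context.
    return "generator under-uses relevant context"

print(diagnose(relevance=0.1, utilization=0.05))
print(diagnose(relevance=0.8, utilization=0.05))
```

In practice a practitioner would likely calibrate such a cutoff per application, since typical Relevance and Utilization values depend on document length and the number of retrieved documents.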

Completeness

Completeness is another new metric we introduce to measure how well the response incorporates all the relevant information in the context. Note that this is different from Utilization; it is possible to have high Relevance and high Utilization, but low Completeness when the generator utilizes irrelevant information in the context to produce a low-quality response. Completeness for document $d_i$ is calculated as the fraction of utilized substrings among all relevant substrings:

$$\text{completeness} = \frac{Len(R_i \cap U_i)}{Len(R_i)} \tag{4}$$

This definition can be extended to the example level by considering all relevant and utilized substrings across all context documents.
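A minimal sketch of example-level Completeness follows, again counting sentences. Relevant and utilized sentences are represented as hypothetical (document index, sentence index) pairs so that spans from different documents never collide. Returning `None` when no sentence is relevant is our assumption; the text does not specify that case.

```python
def completeness(relevant, utilized):
    """Example-level completeness over (doc_index, sentence_index) pairs."""
    relevant, utilized = set(relevant), set(utilized)
    if not relevant:
        # Undefined when nothing in the context is relevant (assumed handling).
        return None
    return len(relevant & utilized) / len(relevant)

relevant = {(0, 0), (0, 1), (1, 3)}  # three relevant sentences
utilized = {(0, 1), (1, 4)}          # one of them was used, plus an irrelevant one
score = completeness(relevant, utilized)
```

Here the response draws on only one of the three relevant sentences, so Completeness is one third even though the generator did utilize part of the context.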

Adherence

Adherence is designed to detect hallucinations in RAG responses. Our definition of Adherence is synonymous with answer faithfulness [9, 33], groundedness [37], and attribution [32]. For alignment with existing hallucination detection approaches, we define example-level adherence as a boolean indicating whether or not all parts of the response are grounded in the context. However, in our annotation schema (Section 3.3) we also define $A_i = \{t_1, \dots, t_a\}$ as the set of response tokens that are supported by the context to enable granular Adherence evaluation.
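The relationship between granular and example-level Adherence can be sketched as follows. The function name and the representation of supported sentences as indices are hypothetical; the only rule taken from the text is that a response is adherent when every part of it is grounded in the context.

```python
def adherence_labels(num_response_sentences, supported_idx):
    """Return per-sentence support flags and the example-level boolean."""
    supported = set(supported_idx)
    per_sentence = [i in supported for i in range(num_response_sentences)]
    # Note: all() is True for an empty response; handle that case upstream.
    return per_sentence, all(per_sentence)

# A three-sentence response whose middle sentence is unsupported.
per_sentence, adherent = adherence_labels(3, [0, 2])
```

A single unsupported sentence is enough to mark the whole response as a hallucination, while the per-sentence flags show where the unsupported content is.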

3.3 LLM annotator

We prompt GPT-4 (gpt-4-0125-preview) to produce ground truth Adherence, Relevance, and Utilization labels for input (documents, query, response) tuples in RAGBench. Completeness is easily derived from span-level Relevance and Utilization annotations, thus we don’t request explicit annotations for it.

For high quality labels, we use proven techniques like chain of thought [40] that have been shown to maximize the correlation between GPT-4 and human judgements [43, 46]. For relevance and utilization we request the LLM-annotator to directly identify relevant and utilized sub-strings in the input documents. For adherence, we instruct the LLM to identify which response sentences, if any, are supported by the provided context. We can then derive an example-level boolean adherence label by checking if all response sentences are supported. The exact prompt used for annotation is provided in Appendix 9.4. We apply post-processing steps to ensure high quality, reliable annotations from our GPT-labeler, which we outline in Appendix 9.5. We further validate our annotation approach in Section 4, and discuss the limitations of using an LLM-annotator in Section 8.

RAGBench raw annotations contain token-level labels for utilization and relevance, which are converted to TRACe metrics using the equations in Section 3.2. We encourage future work on automated evaluators to predict the raw token-level labels, like relevant and utilized spans, rather than predicting the example-level scores directly, which are less interpretable for the end user.

Table 2: Ranking of Simulated RAG Systems. We evaluate GPT-4-turbo annotations on simulated RAG datasets from Saad-Falcon et al. [33]. The data from each source are synthetically augmented to create sets with increasing degrees of context relevance (Rel) and answer adherence (Adh). We annotate 500 samples from each set and rank them according to the average context relevance and answer adherence metrics. We report Kendall’s tau to evaluate the agreement between GPT-4-turbo rankings and ground truth (higher is better).
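| Metric | NQ Rel | NQ Adh | HotpotQA Rel | HotpotQA Adh | WoW Rel | WoW Adh | FEVER Rel | FEVER Adh |
|---|---|---|---|---|---|---|---|---|
| Kendall’s Tau binary | 1.0 | 0.83 | 0.87 | 1.0 | 1.0 | 0.89 | 1.0 | 0.78 |
| Kendall’s Tau continuous | 0.94 | - | 0.73 | - | 1.0 | - | 0.77 | - |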

4 Annotation Validation

We validate our metric formulations and labeling approach on simulated RAG datasets of varying quality. We use mock RAG datasets generated by Saad-Falcon et al. [33] for this analysis. Their RAG validation set is sampled from KILT [30], including the Natural Questions (NQ) [18], HotpotQA [42], FEVER [36], and Wizard of Wikipedia (WoW) [8] datasets. The authors synthetically generate systems of varying quality by adjusting the ratio of relevant documents and responses in the data. We sample 500 examples from each simulated RAG dataset and annotate them as described in Section 3.3. Next, we calculate average annotated context relevance and adherence scores for each dataset and use those to rank the mock systems. We compare our rankings to ground truth with the Kendall rank correlation (Kendall’s $\tau$) metric, which evaluates the agreement between two sets of ranks on a scale from -1 (perfect disagreement) through 0 (no association) to 1 (perfect agreement).
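The ranking comparison can be sketched with a small pure-Python implementation of Kendall's rank correlation. This is the tau-a variant, which ignores ties; the text does not say which variant was used, so treat that choice as an assumption. The quality levels and annotated averages below are made-up numbers for four hypothetical mock systems.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # Tied pairs (s == 0) count as neither concordant nor discordant.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical mock systems, described by their true fraction of relevant contexts.
true_quality = [0.2, 0.4, 0.6, 0.8]
# Hypothetical average annotated relevance per system.
annotated = [0.15, 0.45, 0.40, 0.70]
tau = kendall_tau(true_quality, annotated)
```

In this toy case the annotations swap the second and third systems, so five of the six system pairs are ordered correctly and one is not.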

As shown in Table 2, the GPT-4 annotations achieve high Kendall’s $\tau$, ranging from 0.78 to 1. For a fair comparison with the ground truth labels, we derive binary context relevance labels from the GPT-4 annotations by thresholding the example Relevance score (Equation 2) at 0. For comparison, we also report ranking results with our more granular example-level Relevance scores that range from 0 to 1. We find that these metrics produce a different ranking (see the lower Kendall’s $\tau$ in Table 2), which we attribute to the metrics capturing differences in retrieved context length across the different examples.

5 Experiments

5.1 LLM Judge

We benchmark a few LLM evaluators on RAGBench: (1) zero-shot GPT-3.5-judge, where we query GPT-3.5 with our annotation prompt, (2) RAGAS [9], and (3) TruLens [37]. RAGAS employs a series of few-shot prompts to GPT-3.5 to measure answer groundedness (Adherence) and Context Relevance metrics. TruLens is another zero-shot prompting approach that measures answer faithfulness (Adherence) and Context Relevance.

5.2 Fine-tuned Judge

We fine-tune a DeBERTa-v3-Large [10] NLI checkpoint (https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli) from Laurer et al. [19] with one key architecture modification: we add a shallow prediction head for each of the output RAG metrics, which allows us to compute all TRACe metrics in a single forward pass. This is both cost-effective and enables transfer learning from head to head through back-propagation down to the shared base layers. Each prediction head is a single-layer feed-forward net that acts on the token-level output of the last DeBERTa layer.

We attach two heads on the context tokens to estimate Relevance and Utilization probabilities, and another head on the response tokens to estimate Adherence. For training, we broadcast sentence-level annotations to tokens, and tune to maximize token-level probabilities of Relevant, Utilized, and Adherent spans. At inference, we impose a probability threshold of 0.5 to predict Relevant, Utilized, and Adherent spans and calculate TRACe metrics using Equations 2, 3, and 4. For comparison with existing hallucination detection approaches, we also aggregate Adherence probabilities across the entire response to produce an example-level response adherence label. For details about training and hyperparameters, refer to Appendix 9.6.
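The label broadcasting and the inference-time thresholding can be sketched without any model code. The helper names are hypothetical, the per-token probabilities stand in for the output of one prediction head, and two details are assumptions not stated in the text: that a probability exactly at 0.5 counts as positive, and that the resulting fraction is computed over tokens.

```python
def broadcast_to_tokens(sentence_labels, sentence_token_counts):
    """Copy each sentence-level label onto every token of that sentence."""
    token_labels = []
    for label, n_tokens in zip(sentence_labels, sentence_token_counts):
        token_labels.extend([label] * n_tokens)
    return token_labels

def predicted_fraction(token_probs, threshold=0.5):
    """Fraction of tokens whose predicted probability reaches the threshold."""
    flagged = [p >= threshold for p in token_probs]
    return sum(flagged) / len(flagged)

# Training targets: sentence 0 (3 tokens) is relevant, sentence 1 (2 tokens) is not.
targets = broadcast_to_tokens([1, 0], [3, 2])

# Inference: hypothetical per-token relevance probabilities from one head.
probs = [0.9, 0.8, 0.6, 0.2, 0.4]
relevance = predicted_fraction(probs)
```

The same two steps would apply to the Utilization head on context tokens and the Adherence head on response tokens, each with its own targets and probabilities.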

5.3 Evaluation

Our granular annotation schema allows for various evaluation setups. For example, we could evaluate either span-level or example/response-level predictions. For easy comparison with existing RAG evaluation approaches that are less granular, we report area under the receiver operating characteristic curve (AUROC) on the response-level hallucination detection task, and root mean squared error (RMSE) for example-level context Relevance and Utilization predictions.
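Both reported metrics are simple to compute, as the following pure-Python sketch shows. The AUROC is computed by its pairwise interpretation rather than by integrating a curve, and the labels and scores are made-up values, with 1 marking a hallucinated response.

```python
import math

def rmse(predicted, target):
    """Root mean squared error between predicted and ground-truth scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

hallucinated = [1, 0, 0, 1, 0]                   # hypothetical ground truth
hallucination_score = [0.9, 0.2, 0.6, 0.5, 0.1]  # hypothetical judge scores
auc = auroc(hallucination_score, hallucinated)   # 5 of 6 positive/negative pairs ordered correctly

relevance_error = rmse([0.3, 0.5], [0.2, 0.7])
```

An AUROC of 0.5 corresponds to a judge that ranks hallucinated and adherent responses no better than chance, which helps put values near 0.5 into perspective.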

Table 3: Benchmark evaluation on test splits. Reporting AUROC for predicting hallucinated responses (Hal), RMSE for predicting Context Relevance (Rel) and Utilization (Util). indicates statistical significance at 95% confidence intervals, measured by bootstrap comparing the top and second-best results. RAGAS and TruLens do not evaluate Utilization.
| Dataset | GPT-3.5 Hal↑ | GPT-3.5 Rel↓ | GPT-3.5 Util↓ | RAGAS Hal↑ | RAGAS Rel↓ | RAGAS Util↓ | TruLens Hal↑ | TruLens Rel↓ | TruLens Util↓ | DeBERTa Hal↑ | DeBERTa Rel↓ | DeBERTa Util↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PubMedQA | 0.51 | 0.21 | 0.16 | 0.54 | 0.37 | - | 0.62 | 0.45 | - | 0.80 | 0.26 | 0.17 |
| CovidQA-RAG | 0.57 | 0.18 | 0.11 | 0.58 | 0.17 | - | 0.62 | 0.58 | - | 0.77 | 0.19 | 0.11 |
| HotpotQA | 0.59 | 0.11 | 0.08 | 0.62 | 0.14 | - | 0.64 | 0.73 | - | 0.85 | 0.11 | 0.08 |
| MS Marco | 0.65 | 0.23 | 0.11 | 0.63 | 0.25 | - | 0.62 | 0.61 | - | 0.70 | 0.22 | 0.10 |
| HAGRID | 0.58 | 0.22 | 0.15 | 0.62 | 0.22 | - | 0.67 | 0.69 | - | 0.81 | 0.20 | 0.13 |
| ExpertQA | 0.55 | 0.31 | 0.23 | 0.57 | 0.28 | - | 0.70 | 0.60 | - | 0.87 | 0.18 | 0.11 |
| DelucionQA | 0.57 | 0.18 | 0.10 | 0.70 | 0.22 | - | 0.55 | 0.64 | - | 0.64 | 0.15 | 0.10 |
| EManual | 0.54 | 0.17 | 0.11 | 0.57 | 0.27 | - | 0.61 | 0.64 | - | 0.76 | 0.13 | 0.13 |
| TechQA | 0.51 | 0.10 | 0.05 | 0.52 | 0.12 | - | 0.57 | 0.70 | - | 0.86 | 0.08 | 0.04 |
| FinQA | 0.57 | 0.10 | 0.13 | 0.57 | 0.06 | - | 0.53 | 0.79 | - | 0.81 | 0.10 | 0.10 |
| TAT-QA | 0.52 | 0.20 | 0.17 | 0.63 | 0.18 | - | 0.59 | 0.72 | - | 0.83 | 0.27 | 0.23 |
| CUAD | 0.51 | 0.27 | 0.11 | 0.66 | 0.19 | - | 0.40 | 0.66 | - | 0.80 | 0.24 | 0.10 |

6 Discussion

Table 3 reports results on test splits of each RAGBench component dataset. We compare baseline LLM methods with a fine-tuned DeBERTa encoder that was trained on the full RAGBench train split.

LLMs underperform on the RAG evaluation task

We observe that the fine-tuned DeBERTa model outperforms the few/zero-shot LLM-judge baselines on most datasets. While GPT-3.5 and RAGAS are competitive with DeBERTa on a few Relevance and Utilization scores, DeBERTa achieves the highest hallucination-detection AUROC on every dataset except DelucionQA. Despite the versatility of LLM judges across various tasks, their lack of specialization necessitates fine-tuning for optimal results. Future work may focus on fine-tuning LLM judges to close the gap between DeBERTa and GPT-4 evaluation performance. In Appendix 9.7, we demonstrate that, despite its small size, the fine-tuned DeBERTa model does generalize to out-of-domain RAG datasets in the same way that LLM-based approaches do.

Estimating Context Relevance is Difficult

As shown in Table 3, Relevance RMSE scores are generally higher than those for Utilization, indicating a greater difficulty in the relevance prediction task. Utilization can be assessed through a straightforward semantic comparison between the context and the response. In contrast, relevance is a more intricate metric. Due to the nature of RAG, the majority of retrieved documents are semantically related to the query. However, mere semantic similarity is insufficient. The model must ascertain whether the provided context includes specific information necessary to accurately answer the question. Thus, the task inherently involves deriving the correct answer, followed by assessing what information in the context may be used to arrive at that answer.

7 Conclusion

In this paper we introduce RAGBench, a large-scale dataset composed of real-world RAG examples intended for training and benchmarking RAG evaluation models. Additionally, we formulate TRACe, a RAG evaluation framework comprising four metrics: uTilization, Relevance, Adherence, and Completeness. TRACe standardizes the evaluation process, offering a consistent and systematic approach to measuring RAG system performance across various dimensions.

We benchmark existing RAG evaluation frameworks using RAGBench and demonstrate that LLM-judges struggle to compete with a fine-tuned RAG evaluation expert model. Future work may involve fine-tuning larger expert models to explore the potential for narrowing the performance gap between these models and the ground truth.

Our contributions address the need for standardized benchmarks and methodologies, enabling more precise and actionable insights into the strengths and weaknesses of different RAG systems. This, in turn, will facilitate the iterative improvement of RAG models, driving forward the capabilities of retrieval-augmented generation in real-world applications.

8 Limitations

LLM Annotations

Though LLMs demonstrate high correlations with human judgements on a variety of tasks [6, 11], using them as a singular source of ground truth remains controversial [23]. At the same time, human judgements of LLM outputs are also prone to inconsistencies and bias. In [13], the authors find that human evaluators are often misled by the assertiveness and complexity of the LLM model output, which leads them to underestimate the rate of factuality errors in LLM responses.

In this work, we acknowledge the potential for noise and bias in RAGBench resulting from automated GPT-4-turbo annotations, and the concerns about the potential transmission of such biases into subsequent RAG systems. One way to address this in the future may be to replace the GPT-4 annotator with an LLM "jury" as suggested in [38]. By aggregating judgements from diverse models, this approach can help reduce the noise and bias in the output judgements at low cost.
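As an illustration of the aggregation idea, the sketch below combines boolean judgements from several judge models by majority vote. Majority voting is only one possible aggregation rule and is an assumption of this sketch, as are the function name and the tie-breaking choice; we do not claim it matches the procedure in [38].

```python
from collections import Counter


def jury_verdict(votes):
    """Majority vote over boolean judgements from several judge models.

    Ties (and an empty jury) resolve to False, i.e. "not supported".
    This tie-breaking rule is an assumption of the sketch.
    """
    counts = Counter(bool(v) for v in votes)
    return counts[True] > counts[False]


# Hypothetical judgements from three different judge models.
verdict = jury_verdict([True, True, False])
```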

References

  • Adlakha et al. [2023] V. Adlakha, P. BehnamGhader, X. H. Lu, N. Meade, and S. Reddy. Evaluating correctness and faithfulness of instruction-following models for question answering. arXiv preprint arXiv:2307.16877v1, 2023.
  • Bohnet et al. [2023] B. Bohnet, V. Q. Tran, P. Verga, R. Aharoni, D. Andor, L. B. Soares, M. Ciaramita, J. Eisenstein, K. Ganchev, J. Herzig, K. Hui, T. Kwiatkowski, J. Ma, J. Ni, L. S. Saralegui, T. Schuster, W. W. Cohen, M. Collins, D. Das, D. Metzler, S. Petrov, and K. Webster. Attributed question answering: Evaluation and modeling for attributed large language models, 2023.
  • Castelli et al. [2020] V. Castelli, R. Chakravarti, S. Dana, A. Ferritto, R. Florian, M. Franz, D. Garg, D. Khandelwal, S. McCarley, M. McCawley, M. Nasr, L. Pan, C. Pendus, J. Pitrelli, S. Pujar, S. Roukos, A. Sakrajda, A. Sil, R. Uceda-Sosa, T. Ward, and R. Zhang. The TechQA dataset. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1269–1278, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.117. URL https://aclanthology.org/2020.acl-main.117.
  • Chen et al. [2023] J. Chen, H. Lin, X. Han, and L. Sun. Benchmarking large language models in retrieval-augmented generation. arXiv preprint arXiv:2309.01431, 2023.
  • Chen et al. [2021] Z. Chen, W. Chen, C. Smiley, S. Shah, I. Borova, D. Langdon, R. Moussa, M. Beane, T.-H. Huang, B. Routledge, and W. Y. Wang. FinQA: A dataset of numerical reasoning over financial data. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.300. URL https://aclanthology.org/2021.emnlp-main.300.
  • Chiang and Lee [2023] C.-H. Chiang and H.-y. Lee. Can large language models be an alternative to human evaluations? In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.870. URL https://aclanthology.org/2023.acl-long.870.
  • Chiesurin et al. [2023] S. Chiesurin, D. Dimakopoulos, M. A. Sobrevilla Cabezudo, A. Eshghi, I. Papaioannou, V. Rieser, and I. Konstas. The dangers of trusting stochastic parrots: Faithfulness and trust in open-domain conversational question answering. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 947–959, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.60. URL https://aclanthology.org/2023.findings-acl.60.
  • Dinan et al. [2019] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston. Wizard of wikipedia: Knowledge-powered conversational agents, 2019.
  • Es et al. [2024] S. Es, J. James, L. Espinosa Anke, and S. Schockaert. RAGAs: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, Mar. 2024.
  • He et al. [2023] P. He, J. Gao, and W. Chen. DeBERTav3: Improving deBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=sE7-XhLxHA.
  • He et al. [2024] X. He, Z. Lin, Y. Gong, A.-L. Jin, H. Zhang, C. Lin, J. Jiao, S. M. Yiu, N. Duan, and W. Chen. Annollm: Making large language models to be better crowdsourced annotators, 2024.
  • Hendrycks et al. [2021] D. Hendrycks, C. Burns, A. Chen, and S. Ball. Cuad: An expert-annotated nlp dataset for legal contract review. NeurIPS, 2021.
  • Hosking et al. [2024] T. Hosking, P. Blunsom, and M. Bartolo. Human feedback is not gold standard. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=7W3GLNImfS.
  • Huang and Huang [2024] Y. Huang and J. Huang. A survey on retrieval-augmented text generation for large language models, 2024.
  • Jin et al. [2019] Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu. PubMedQA: A dataset for biomedical research question answering. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, Hong Kong, China, Nov. 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1259. URL https://aclanthology.org/D19-1259.
  • Kamalloo et al. [2023] E. Kamalloo, A. Jafari, X. Zhang, N. Thakur, and J. Lin. Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution, 2023.
  • Kim et al. [2024] S. Kim, J. Suk, S. Longpre, B. Y. Lin, J. Shin, S. Welleck, G. Neubig, M. Lee, K. Lee, and M. Seo. Prometheus 2: An open source language model specialized in evaluating other language models, 2024.
  • Kwiatkowski et al. [2019] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.
  • Laurer et al. [2022] M. Laurer, W. van Atteveldt, A. Casas, and K. Welbers. Less annotating, more classifying – addressing the data scarcity issue of supervised machine learning with deep transfer learning and bert - nli. Open Science Framework Preprint, 2022. URL https://osf.io/74b8k.
  • Lee et al. [2019] K. Lee, M.-W. Chang, and K. Toutanova. Latent retrieval for weakly supervised open domain question answering. In A. Korhonen, D. Traum, and L. Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1612. URL https://aclanthology.org/P19-1612.
  • Lewis et al. [2020] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
  • Li et al. [2024] Y. Li, X. Yue, Z. Liao, and H. Sun. Attributionbench: How hard is automatic attribution evaluation? arXiv preprint arXiv:2402.15089v1, 2024.
  • Li et al. [2023] Z. Li, H. Zhu, Z. Lu, and M. Yin. Synthetic data generation with large language models for text classification: Potential and limitations. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.647. URL https://aclanthology.org/2023.emnlp-main.647.
  • Liu et al. [2024] Z. Liu, W. Ping, R. Roy, P. Xu, C. Lee, M. Shoeybi, and B. Catanzaro. Chatqa: Building gpt-4 level conversational qa models. arXiv preprint arXiv:2401.10225, 2024.
  • Magesh et al. [2024] V. Magesh, F. Surani, M. Dahl, M. Suzgun, C. D. Manning, and D. E. Ho. Hallucination-free? assessing the reliability of leading ai legal research tools, 2024.
  • Malaviya et al. [2024] C. Malaviya, S. Lee, S. Chen, E. Sieber, M. Yatskar, and D. Roth. Expertqa: Expert-curated questions and attributed answers, 2024.
  • Möller et al. [2020] T. Möller, A. Reina, R. Jayakumar, and M. Pietsch. COVID-QA: A question answering dataset for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, July 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.nlpcovid19-acl.18.
  • Nandy et al. [2021] A. Nandy, S. Sharma, S. Maddhashiya, K. Sachdeva, P. Goyal, and N. Ganguly. Question answering over electronic devices: A new benchmark dataset and a multi-task learning based QA framework. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4600–4609, Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.392. URL https://aclanthology.org/2021.findings-emnlp.392.
  • Nguyen et al. [2016] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng. Ms marco: A human generated machine reading comprehension dataset. November 2016. URL https://www.microsoft.com/en-us/research/publication/ms-marco-human-generated-machine-reading-comprehension-dataset/.
  • Petroni et al. [2021] F. Petroni, A. Piktus, A. Fan, P. Lewis, M. Yazdani, N. De Cao, J. Thorne, Y. Jernite, V. Karpukhin, J. Maillard, V. Plachouras, T. Rocktäschel, and S. Riedel. KILT: a benchmark for knowledge intensive language tasks. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.200. URL https://aclanthology.org/2021.naacl-main.200.
  • Rashkin et al. [2021] H. Rashkin, D. Reitter, G. S. Tomar, and D. Das. Increasing faithfulness in knowledge-grounded dialogue with controllable features. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 704–718, Online, Aug. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.58. URL https://aclanthology.org/2021.acl-long.58.
  • Rashkin et al. [2023] H. Rashkin, V. Nikolaev, M. Lamm, L. Aroyo, M. Collins, D. Das, S. Petrov, G. S. Tomar, I. Turc, and D. Reitter. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840, 12 2023.
  • Saad-Falcon et al. [2024] J. Saad-Falcon, O. Khattab, C. Potts, and M. Zaharia. Ares: An automated evaluation framework for retrieval-augmented generation systems. arXiv preprint arXiv:2311.09476v2, 2024.
  • Sadat et al. [2023] M. Sadat, Z. Zhou, L. Lange, J. Araki, A. Gundroo, B. Wang, R. Menon, M. Parvez, and Z. Feng. Delucionqa: Detecting hallucinations in domain-specific question answering. pages 822–835, 01 2023. doi: 10.18653/v1/2023.findings-emnlp.59.
  • Siriwardhana et al. [2023] S. Siriwardhana, R. Weerasekera, E. Wen, T. Kaluarachchi, R. Rana, and S. Nanayakkara. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics, 11:1–17, 2023. doi: 10.1162/tacl_a_00530. URL https://aclanthology.org/2023.tacl-1.1.
  • Thorne et al. [2018] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In M. Walker, H. Ji, and A. Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. URL https://aclanthology.org/N18-1074.
  • Trulens [2023] Trulens, 2023. https://www.trulens.org/.
  • Verga et al. [2024] P. Verga, S. Hofstatter, S. Althammer, Y. Su, A. Piktus, A. Arkhangorodsky, M. Xu, N. White, and P. Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models, 2024.
  • Wang et al. [2020] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. M. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. M. Raymond, D. S. Weld, O. Etzioni, and S. Kohlmeier. CORD-19: The COVID-19 open research dataset. In K. Verspoor, K. B. Cohen, M. Dredze, E. Ferrara, J. May, R. Munro, C. Paris, and B. Wallace, editors, Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, July 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.nlpcovid19-acl.1.
  • Wei et al. [2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
  • Wu et al. [2023] Y. Wu, J. Zhu, S. Xu, K. Shum, C. Niu, R. Zhong, J. Song, and T. Zhang. Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models, 2023.
  • Yang et al. [2018] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
  • Ye et al. [2024] S. Ye, D. Kim, S. Kim, H. Hwang, S. Kim, Y. Jo, J. Thorne, J. Kim, and M. Seo. FLASK: Fine-grained language model evaluation based on alignment skill sets. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=CYmF38ysDa.
  • Yue et al. [2023] X. Yue, B. Wang, Z. Chen, K. Zhang, Y. Su, and H. Sun. Automatic evaluation of attribution by large language models. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4615–4635, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.307. URL https://aclanthology.org/2023.findings-emnlp.307.
  • Zhang et al. [2022] X. Zhang, N. Thakur, O. Ogundepo, E. Kamalloo, D. Alfonso-Hermelo, X. Li, Q. Liu, M. Rezagholizadeh, and J. Lin. Making a miracl: Multilingual information retrieval across a continuum of languages, 2022.
  • Zheng et al. [2023] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=uccHPGDlao.
  • Zhu et al. [2021] F. Zhu, W. Lei, Y. Huang, C. Wang, S. Zhang, J. Lv, F. Feng, and T.-S. Chua. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3277–3287, Online, Aug. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.254. URL https://aclanthology.org/2021.acl-long.254.

9 Appendix

9.1 RAGBench Code and Data

We release RAGBench data on Hugging Face: https://huggingface.co/datasets/rungalileo/ragbench. Refer to the model card and documentation there.

We publish our inference and evaluation code on GitHub: https://github.com/rungalileo/ragbench/tree/main/ragbench.

9.2 RAGBench Dataset Details

RAGBench is sourced from publicly released academic and industry datasets. As far as we know, none of the component datasets contain personally identifiable information or offensive content.

PubMedQA [15]

PubMedQA is a collection of PubMed research abstracts with corresponding yes/no/maybe questions paired with each abstract. The original dataset comprises 3 subsets: PQA-L, PQA-U, and PQA-A, with 1k, 60k, and 210k abstracts, respectively. For all subsets, the question is derived from the title of the PubMed article using rule-based heuristics. Long answers are automatically derived from the last sentence of the abstract for PQA-L and PQA-U, and PQA-L answers are further reviewed by expert annotators and annotated as yes/no/maybe. PQA-A comprises exclusively automatically generated questions and short answers.

For RAGBench we utilize the PQA-U subset and re-frame it from QA into a RAG task. To simulate RAG, we take the already segmented context chunks of the PQA-U abstracts and encode them into a vector DB with OpenAI embeddings. The size of the resulting DB is 200k. We retrieve 4 chunks for each PQA-U question using FAISS with Euclidean distance as the similarity function. We ignore the responses and labels in the original dataset and generate new responses with an LLM.
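The retrieval step can be sketched as a nearest-neighbour search under Euclidean distance. The sketch below uses toy 2-dimensional vectors and a brute-force search in plain Python as a stand-in for OpenAI embeddings and a FAISS index; the function names and data are hypothetical and this is not the code used to build RAGBench.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_top_k(query_vec, chunk_vecs, k=4):
    """Return indices of the k chunks closest to the query.

    Ranking by squared distance gives the same order as ranking by
    Euclidean distance, so the square root is skipped.
    """
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: squared_l2(query_vec, chunk_vecs[i]),
    )
    return ranked[:k]


# Toy "embeddings" for six chunks and one question.
chunks = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.5, 0.5], [5.0, 5.0]]
query = [0.0, 0.0]
top4 = retrieve_top_k(query, chunks, k=4)
```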

CovidQA-RAG

CovidQA-RAG is a combination of 2k expert-annotated questions sourced from COVID-QA [27] and a vector database of 250,000 100-word passages built by Siriwardhana et al. [35]. Both questions and answers are sourced from the CORD-19 [39] collection of research articles about COVID-19.

We embed the questions and database passages with OpenAI embeddings and retrieve up to N passages for each COVID-QA question from the vector database using FAISS with Euclidean distance as the similarity function and max_distance=0.25. We generate responses for each resulting RAG (context, question) instance with an LLM.
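The effect of the distance cutoff is that a question may receive fewer than N passages. A minimal sketch of such thresholded retrieval follows, again with toy vectors and brute-force search rather than FAISS. Whether the cutoff applies to the squared or unsquared distance is not stated here, so applying it to the unsquared Euclidean distance is an assumption, as are the function name and the data.

```python
import math


def retrieve_within(query_vec, passage_vecs, n, max_distance=0.25):
    """Return up to n passage indices, nearest first, whose Euclidean
    distance to the query does not exceed max_distance."""
    scored = [(math.dist(query_vec, vec), i) for i, vec in enumerate(passage_vecs)]
    kept = sorted((dist, i) for dist, i in scored if dist <= max_distance)
    return [i for _, i in kept[:n]]


# Toy "embeddings": the second passage lies beyond the cutoff.
passages = [[0.1, 0.0], [0.3, 0.0], [0.0, 0.2], [0.0, 0.05]]
question = [0.0, 0.0]
hits = retrieve_within(question, passages, n=3)
```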

HotpotQA [42]

HotpotQA comprises 113K crowd-sourced question-answer pairs sourced from Wikipedia. Each pair is associated with a set of related context passages from one or multiple Wikipedia pages. The dataset is constructed in a way that requires multi-hop reasoning over multiple context documents to arrive at the answer, which renders it a valuable candidate for our benchmark. We sample data from the dev-distractor split, which contains up to 8 distractor context documents per sample. We downsample the context documents to 4 per example, making sure to include the document containing the response. We treat the context passages in HotpotQA as RAG context documents, and generate responses for each (context, question) instance with an LLM.
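The downsampling step can be sketched as follows: keep the document(s) needed for the answer and fill the remaining slots with randomly chosen distractors. The function name, the random seed, and the choice to preserve the original document order are assumptions of this sketch.

```python
import random


def downsample_context(docs, gold_indices, k=4, seed=0):
    """Keep k documents, always including those at gold_indices.

    Remaining slots are filled with randomly sampled distractors.
    The kept documents stay in their original order.
    """
    rng = random.Random(seed)
    gold = sorted(set(gold_indices))[:k]
    distractors = [i for i in range(len(docs)) if i not in gold]
    n_extra = min(k - len(gold), len(distractors))
    extra = rng.sample(distractors, n_extra)
    return [docs[i] for i in sorted(gold + extra)]


# Hypothetical example: ten context documents, the answer is in doc7.
documents = [f"doc{i}" for i in range(10)]
kept = downsample_context(documents, gold_indices=[7])
```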

MS Marco [29]

MS Marco is an open-domain question answering dataset sourced from Bing search engine user query logs. Each question is associated with 10 context passages retrieved via Bing web search. Human annotators compose a response based on the provided context documents, and label the documents utilized in the response as relevant. We sample data from the original version of the dataset, comprising 80k train, 10k validation, and 10k test samples. As with other datasets, we ignore the human annotated answers and generate responses with an LLM in a RAG setting.

CUAD [12]

CUAD is a collection of commercial legal contracts with expert annotated questions and responses. The contracts are sourced from a public legal contract library (EDGAR) and range from 1 to 100 pages in length. Experts in the legal domain compose multiple questions per contract and label the relevant parts of the contract that are useful for answering the questions. There are 21k questions pertaining to 510 documents in total. The questions are very specific to each contract, thus we don’t perform additional retrieval over the contract corpus, and form RAG examples with 1 context contract each for our benchmark. Due to high annotation costs associated with long-context RAG, we sample 5 questions per document. As with other datasets, we generate responses with an LLM in a RAG setting.

DelucionQA [34]

DelucionQA is a domain-specific RAG dataset leveraging Jeep’s 2023 Gladiator model manual as the source of knowledge. The questions and answers are automatically generated by large language models. RAG context passages are retrieved from the Jeep car manual via both sparse and dense retrieval methods to add variance in the sample distribution. Further, MTurk workers annotate whether or not responses are supported by the context.

Upon closer inspection, we found only 1 relevant passage associated with each question in the DelucionQA dataset. To make the dataset more challenging for RAGBench, we build a vector database from the 1,046 context passages in DelucionQA and retrieve up to 3 context documents per question from it. We use text-embedding-ada-002 embeddings from OpenAI to build the database. There are 913 unique questions in DelucionQA. For each resulting (context, question) sample, we generate responses with an LLM.

EManual [28]

EManual is a question-answering dataset comprising consumer electronic device manuals and realistic questions about them composed by human annotators. The subset made available at the time of writing amounts to 659 unique questions about the Samsung Smart TV/remote and the accompanying user manual, segmented into 261 chunks. To form a RAG dataset, we embed the manual segments into a vector database with OpenAI embeddings and retrieve up to 3 context documents per question from it. For each resulting (context, question) sample, we generate responses with an LLM.

TechQA [3]

TechQA is a collection of real-world user questions posted on IBM Developer and DeveloperWorks forums, along with 50 technical support documents relating to each question. The documents are sourced from a database of 800k technical documents that support accepted answers on the tech forums. The authors release 1.4k questions, split between train, validation, and test sets. The data are curated such that a fraction of the questions in each split is unanswerable given the information in the linked documents, which makes it a good candidate for RAGBench. To reduce annotation costs, we sub-sample the data down to 10 documents per question, making sure to include the document containing the answer, when applicable. We use the provided splits with (context document, question) examples and generate responses for each with an LLM.

FinQA [5]

FinQA is a QA dataset of financial report passages and associated questions. Questions are curated such that numerical reasoning over multiple unstructured and tabular inputs is required to arrive at the answer. FinQA totals 8,281 financial QA pairs, split between train, validation, and test splits. We retain the original splits and generate 2 LLM responses for each context-query example in FinQA.

TAT-QA [47]

TAT-QA is another financial QA dataset that requires numerical reasoning over tables and text. The data are sourced from 500 financial reports released on https://www.annualreports.com/. Expert annotators with background in finance annotate question-answer pairs based on the available documents. We leverage the full dataset (13k train, 1.6k validation and test) but generate new responses with LLMs for RAGBench.

HAGRID [16]

HAGRID is a QA dataset built on top of MIRACL [45], a multi-lingual information-retrieval dataset. HAGRID passes questions and relevant context documents from MIRACL through an LLM to produce a response for each example in the dataset. Annotators then rate the response on informativeness and attribution dimensions. The original context documents are sourced from Wikipedia and associated questions are generated by expert annotators. Since HAGRID already contains LLM-generated responses, we directly use them and don’t generate additional responses for RAGBench.

ExpertQA [26]

ExpertQA is a collection of curated questions from domain experts in various fields of science, arts, and law. The dataset also contains expert-curated passages relevant to each question, alongside LLM-generated responses. As with HAGRID, we leverage the LLM-generated responses in ExpertQA directly for our RAG dataset.

9.3 Response Generation Prompt

We use the following prompt template to generate LLM responses for each sample in RAGBench. Context documents, separated by line breaks, along with the question are slotted in for each generation sample.

    Use the following pieces of context to answer the question.

    {documents}

    Question: {question}
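Filling the template above amounts to simple string substitution. The sketch below joins the context documents with line breaks and slots them in together with the question; the constant and function names, and the example inputs, are hypothetical.

```python
# Template text copied from the prompt shown above.
RESPONSE_PROMPT = (
    "Use the following pieces of context to answer the question.\n\n"
    "{documents}\n\n"
    "Question: {question}"
)


def build_prompt(documents, question):
    """Slot line-break-separated context documents and the question
    into the response generation template."""
    return RESPONSE_PROMPT.format(
        documents="\n".join(documents),
        question=question,
    )


# Hypothetical inputs for illustration only.
prompt = build_prompt(["First passage.", "Second passage."], "What is stated?")
```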

9.4 GPT Labeling Prompt

We use the following prompt template to generate annotations with GPT-4:

I asked someone to answer a question based on one or more documents.
Your task is to review their response and assess whether or not each sentence
in that response is supported by text in the documents. And if so, which
sentences in the documents provide that support. You will also tell me which
of the documents contain useful information for answering the question, and
which of the documents the answer was sourced from.

Here are the documents, each of which is split into sentences. Alongside each
sentence is associated key, such as ’0a.’ or ’0b.’ that you can use to refer
to it:

‘‘‘
{documents}
‘‘‘

The question was:
‘‘‘
{question}
‘‘‘

Here is their response, split into sentences. Alongside each sentence is
associated key, such as ’a.’ or ’b.’ that you can use to refer to it. Note
that these keys are unique to the response, and are not related to the keys
in the documents:

‘‘‘
{answer}
‘‘‘

You must respond with a JSON object matching this schema:

‘‘‘
{{
  "relevance_explanation": string,
  "all_relevant_sentence_keys": [string],
  "overall_supported_explanation": string,
  "overall_supported": boolean,
  "sentence_support_information": [
    {{
      "response_sentence_key": string,
      "explanation": string,
      "supporting_sentence_keys": [string],
      "fully_supported": boolean
    }},
  ],
  "all_utilized_sentence_keys": [string]
}}
‘‘‘
The relevance_explanation field is a string explaining which documents
contain useful information for answering the question. Provide a step-by-step
breakdown of information provided in the documents and how it is useful for
answering the question.

The all_relevant_sentence_keys field is a list of all document sentences keys
(e.g. ’0a’) that are revant to the question. Include every sentence that is
useful and relevant to the question, even if it was not used in the response,
or if only parts of the sentence are useful. Ignore the provided response when
making this judgement and base your judgement solely on the provided documents
and question. Omit sentences that, if removed from the document, would not
impact someone’s ability to answer the question.

The overall_supported_explanation field is a string explaining why the response
*as a whole* is or is not supported by the documents. In this field, provide a
step-by-step breakdown of the claims made in the response and the support (or
lack thereof) for those claims in the documents. Begin by assessing each claim
separately, one by one; don’t make any remarks about the response as a whole
until you have assessed all the claims in isolation.

The overall_supported field is a boolean indicating whether the response as a
whole is supported by the documents. This value should reflect the conclusion
you drew at the end of your step-by-step breakdown in overall_supported_explanation.

In the sentence_support_information field, provide information about the support
*for each sentence* in the response.

The sentence_support_information field is a list of objects, one for each sentence
in the response. Each object MUST have the following fields:
- response_sentence_key: a string identifying the sentence in the response.
This key is the same as the one used in the response above.
- explanation: a string explaining why the sentence is or is not supported by the
documents.
- supporting_sentence_keys: keys (e.g. ’0a’) of sentences from the documents that
support the response sentence. If the sentence is not supported, this list MUST
be empty. If the sentence is supported, this list MUST contain one or more keys.
In special cases where the sentence is supported, but not by any specific sentence,
you can use the string "supported_without_sentence" to indicate that the sentence
is generally supported by the documents. Consider cases where the sentence is
expressing inability to answer the question due to lack of relevant information in
the provided contex as "supported_without_sentence". In cases where the sentence
is making a general statement (e.g. outlining the steps to produce an answer, or
summarizing previously stated sentences, or a transition sentence), use the
sting "general".In cases where the sentence is correctly stating a well-known fact,
like a mathematical formula, use the string "well_known_fact". In cases where the
sentence is performing numerical reasoning (e.g. addition, multiplication), use
the string "numerical_reasoning".
- fully_supported: a boolean indicating whether the sentence is fully supported by
the documents.
  - This value should reflect the conclusion you drew at the end of your step-by-step
  breakdown in explanation.
  - If supporting_sentence_keys is an empty list, then fully_supported must be false.
  - Otherwise, use fully_supported to clarify whether everything in the response
  sentence is fully supported by the document text indicated in supporting_sentence_keys
  (fully_supported = true), or whether the sentence is only partially or incompletely
  supported by that document text (fully_supported = false).

The all_utilized_sentence_keys field is a list of all sentences keys (e.g. ’0a’) that
were used to construct the answer. Include every sentence that either directly supported
the answer, or was implicitly used to construct the answer, even if it was not used
in its entirety. Omit sentences that were not used, and could have been removed from
the documents without affecting the answer.

You must respond with a valid JSON string.  Use escapes for quotes, e.g. ‘\\"‘, and
newlines, e.g. ‘\\n‘. Do not write anything before or after the JSON string. Do not
wrap the JSON string in backticks like ‘‘‘ or ‘‘‘json.

As a reminder: your task is to review the response and assess which documents contain
useful information pertaining to the question, and how each sentence in the response
is supported by the text in the documents.\
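A response that follows the schema requested in the prompt above can be checked mechanically before it is used. The following is a minimal validation sketch; the field names and types are copied from the schema, while the function name and error messages are hypothetical and this is not the released annotation code.

```python
import json

# Field names and JSON types taken from the schema in the prompt.
TOP_LEVEL_FIELDS = {
    "relevance_explanation": str,
    "all_relevant_sentence_keys": list,
    "overall_supported_explanation": str,
    "overall_supported": bool,
    "sentence_support_information": list,
    "all_utilized_sentence_keys": list,
}
SENTENCE_FIELDS = {
    "response_sentence_key": str,
    "explanation": str,
    "supporting_sentence_keys": list,
    "fully_supported": bool,
}


def schema_errors(raw):
    """Return a list of schema problems found in a raw model response.

    An empty list means the response parsed and every required field
    is present with the expected type.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]
    errors = []
    for name, expected in TOP_LEVEL_FIELDS.items():
        if not isinstance(obj.get(name), expected):
            errors.append(f"missing or mistyped field: {name}")
    sentences = obj.get("sentence_support_information")
    if isinstance(sentences, list):
        for i, item in enumerate(sentences):
            if not isinstance(item, dict):
                errors.append(f"sentence entry {i} is not an object")
                continue
            for name, expected in SENTENCE_FIELDS.items():
                if not isinstance(item.get(name), expected):
                    errors.append(f"sentence entry {i}: missing or mistyped field: {name}")
    return errors
```

Extra keys are deliberately not reported here, since they can be dropped in a later clean-up pass.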

9.5 Annotation Post-Processing Steps

As shown in Appendix 9.4, we request very detailed annotations with explanations from GPT-4-turbo. We rely on chain-of-thought prompting [40] and redundancy to encourage high-quality labels from the annotator model.

For Adherence, we request both response-level and sentence-level annotations, which we compare in post-processing to identify inconsistencies where GPT-4 disagrees with its own judgements. For example, if GPT-4 labels a response as supported by the context as a whole, but identifies no supporting information for one or more claims in the response, we send the example for re-annotation. We re-annotate conflicting examples up to 3 times, after which a small fraction (<2%) of the data still conflicts. After manual inspection, we find that the majority of the remaining conflicts arise from partially hallucinated sentences that are somewhat, but not fully, grounded in the context. We leverage the sentence-level "fully_supported" boolean annotation to identify and resolve such cases. According to our annotation schema, we treat all partially supported sentences as hallucinations.
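The consistency check described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names `overall_supported` and `sentence_support_information`, the `annotate` callable, and the reading of "up to 3 times" as three total attempts are all assumptions. Only `supporting_sentence_keys` and `fully_supported` come from the prompt in Appendix 9.4.

```python
from typing import Any, Callable

# Assumed field names (hypothetical); only "supporting_sentence_keys" and
# "fully_supported" are taken from the annotation prompt.
RESPONSE_FIELD = "overall_supported"
SENTENCES_FIELD = "sentence_support_information"

Annotation = dict[str, Any]


def sentence_is_supported(sentence: Annotation) -> bool:
    """A sentence counts as supported only if it cites evidence and is fully supported."""
    return bool(sentence["supporting_sentence_keys"]) and bool(sentence["fully_supported"])


def is_consistent(annotation: Annotation) -> bool:
    """True when the response-level label agrees with the sentence-level labels."""
    all_supported = all(sentence_is_supported(s) for s in annotation[SENTENCES_FIELD])
    return bool(annotation[RESPONSE_FIELD]) == all_supported


def annotate_with_retries(
    annotate: Callable[[str], Annotation], example: str, max_attempts: int = 3
) -> Annotation:
    """Call the (hypothetical) annotator until its labels agree or attempts run out."""
    annotation = annotate(example)
    for _ in range(max_attempts - 1):
        if is_consistent(annotation):
            break
        annotation = annotate(example)
    return annotation


def resolve_adherence(annotation: Annotation) -> bool:
    """Final label from sentence-level evidence; partial support counts as hallucination."""
    return all(sentence_is_supported(s) for s in annotation[SENTENCES_FIELD])
```

Examples that remain inconsistent after the last attempt would then be labeled with `resolve_adherence`, which lets the sentence-level `fully_supported` flags decide.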

Because the TRACe metrics are interrelated, we observe qualitatively that the extra measures taken for Adherence also improve the quality and stability of the relevance and utilization labels.

In the final post-processing step, we remove any off-schema keys that GPT-4-turbo sometimes injects into the response. For example, it will occasionally misspell "supporting_sentence_keys" as "supported_sentence_keys" and/or introduce completely new fields into the output JSON. We algorithmically find and remove or replace such annotation errors.
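One way such a cleanup could look is sketched below. The allowed key set is an assumption limited to the sentence-level fields visible in the prompt excerpt, and the alias table contains only the misspelling mentioned above; the function name `clean_record` is hypothetical.

```python
from typing import Any

# Assumed schema for one sentence-level record (fields seen in the prompt excerpt).
ALLOWED_KEYS = {"explanation", "supporting_sentence_keys", "fully_supported"}

# Known misspellings mapped to the correct key.
KEY_ALIASES = {"supported_sentence_keys": "supporting_sentence_keys"}


def clean_record(record: dict[str, Any]) -> dict[str, Any]:
    """Rename known misspelled keys and drop any key outside the schema."""
    # Keep every key that is already on-schema.
    cleaned = {key: value for key, value in record.items() if key in ALLOWED_KEYS}
    # Recover values stored under a misspelled key, unless the correct key exists.
    for wrong, right in KEY_ALIASES.items():
        if wrong in record and right not in cleaned:
            cleaned[right] = record[wrong]
    return cleaned
```

When both the correct and the misspelled key are present, this sketch keeps the correctly spelled one; that tie-breaking rule is a design choice made here, not something the text specifies.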

9.6 DeBERTa model training

We train the model on a Google Cloud Platform A100 GPU instance for 3 epochs with an initial learning rate of $5^{-6}$ for the base model layers and $2^{-5}$ for the heads, with warmup and linear decay.
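A warmup-then-linear-decay schedule of this kind can be sketched as a multiplier applied to each parameter group's initial learning rate. The number of warmup steps and the decay-to-zero endpoint are assumptions, since the text does not state them; the function name is hypothetical, and its shape follows the common linear schedule found in training libraries.

```python
def linear_schedule_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier in [0, 1] applied to a group's initial learning rate at `step`."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the initial learning rate.
        return step / max(1, warmup_steps)
    # Linear decay from the initial learning rate down to 0 (assumed endpoint).
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def group_learning_rates(
    step: int, initial_rates: dict[str, float], warmup_steps: int, total_steps: int
) -> dict[str, float]:
    """Scale every group's initial rate (e.g. base layers vs. heads) by the same factor."""
    factor = linear_schedule_factor(step, warmup_steps, total_steps)
    return {name: rate * factor for name, rate in initial_rates.items()}
```

Because both groups share one factor, the ratio between the head and base-layer learning rates stays constant throughout training.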

9.7 OOD DeBERTa

We evaluate the generalizability of a fine-tuned DeBERTa model to Out-of-Domain (OOD) data. For this evaluation, we train DeBERTa on the general knowledge subset of RAGBench. This subset includes academic datasets that are less aligned with real-world industry use cases (e.g. compared to the customer service subset). With the exception of FinQA and TAT-QA, we find that the model still achieves reasonable generalization to the other domains in RAGBench. FinQA and TAT-QA are the two financial numerical reasoning datasets in RAGBench. The tabular nature of the FinQA and TAT-QA datasets contributes to the poor performance of the OOD model, since this format is not seen during training.

Table 4: Comparison of DeBERTa tuned on the full RAGBench train split vs. DeBERTaOOD, which was tuned on the General Knowledge train subset.

              DeBERTa                DeBERTaOOD
Dataset       Hal↑   Rel↓   Util↓    Hal↑   Rel↓   Util↓
PubMedQA      0.80   0.26   0.17     0.68   0.21   0.16
CovidQA-RAG   0.77   0.19   0.11     0.76   0.19   0.14
HotpotQA      0.85   0.11   0.08     0.87   0.10   0.09
MS Marco      0.70   0.22   0.10     0.68   0.21   0.10
HAGRID        0.81   0.20   0.13     0.82   0.20   0.13
ExpertQA      0.87   0.18   0.11     0.85   0.18   0.11
DelucionQA    0.64   0.15   0.10     0.65   0.16   0.11
EManual       0.76   0.13   0.13     0.71   0.14   0.14
TechQA        0.86   0.08   0.04     0.76   0.09   0.07
FinQA         0.81   0.10   0.10     0.67   0.09   0.08
TAT-QA        0.83   0.27   0.23     0.75   0.19   0.18
CUAD          0.80   0.24   0.10     0.76   0.25   0.11
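To read the Hal columns of Table 4 side by side, the per-dataset change from the full model to the OOD model can be computed directly. The values below are copied from the table; the helper name `hal_deltas` is hypothetical.

```python
# Hal scores from Table 4 as (DeBERTa, DeBERTaOOD) pairs.
HAL = {
    "PubMedQA": (0.80, 0.68),
    "CovidQA-RAG": (0.77, 0.76),
    "HotpotQA": (0.85, 0.87),
    "MS Marco": (0.70, 0.68),
    "HAGRID": (0.81, 0.82),
    "ExpertQA": (0.87, 0.85),
    "DelucionQA": (0.64, 0.65),
    "EManual": (0.76, 0.71),
    "TechQA": (0.86, 0.76),
    "FinQA": (0.81, 0.67),
    "TAT-QA": (0.83, 0.75),
    "CUAD": (0.80, 0.76),
}


def hal_deltas(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Change in Hal score (OOD minus full), rounded to the table's precision."""
    return {name: round(ood - full, 2) for name, (full, ood) in scores.items()}


deltas = hal_deltas(HAL)
largest_drop = min(deltas, key=deltas.get)
print(largest_drop, deltas[largest_drop])  # FinQA -0.14
```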