Evaluating Quality of Answers for Retrieval-Augmented Generation:
A Strong LLM Is All You Need

Yang Wang
Visa Data & AI Platform
Foster City, California
[email protected]

Alberto Garcia Hernandez
Visa Consulting & Analytics
London, United Kingdom
[email protected]

Roman Kyslyi
Visa Consulting & Analytics
Kyiv, Ukraine
[email protected]

Nicholas Kersting
Visa Data & AI Platform
Austin, Texas
[email protected]

Abstract

We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grades for these quality aspects to a binary score indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business settings where a clear decision is essential. We apply vRAG-Eval with two Large Language Models (LLMs) as evaluators of the answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find substantial alignment between GPT-4’s assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.

1 Introduction

Since the launch of ChatGPT in November 2022, Large Language Models (LLMs) have become increasingly popular integrated components for organizations seeking to enhance productivity and enrich their product portfolios. However, GPT-4 Turbo’s training data has a cut-off date of April 2023, so the model lacks knowledge of more recent events. Furthermore, because LLMs are pre-trained on public-domain text and do not possess proprietary information, their capabilities are limited when it comes to our company’s knowledge-intensive applications.

Fine-tuning (Radford et al., 2018; Dodge et al., 2020) is a technique that can be used to inject new knowledge into pre-trained LLMs by adjusting their parameters through further, specialized gradient-based training. However, OpenAI’s fine-tuning API is currently only available through an experimental access program, and GPT-4 fine-tuning requires more effort to achieve significant improvements, as noted by OpenAI (2023).

Retrieval-augmented generation (RAG) was first proposed by Lewis et al. (2021). This method stores extra knowledge in a non-parametric dense vector index, uses a pre-trained neural retriever to search for relevant context, and then generates content with a pre-trained sequence-to-sequence (seq2seq) model. Ovadia et al. (2024) argue that RAG consistently outperforms unsupervised fine-tuning on a wide range of knowledge-intensive tasks.

A RAG application leveraging LLMs enhanced with the company’s proprietary knowledge has become one of the pivotal factors advancing the adoption of GPT-4-based applications at our enterprise, a global payment technology company. This adoption highlights the need for viable approaches to evaluating RAG applications, as a lack of performance metrics poses risks to the enterprise’s business.

One approach to evaluating the quality of RAG applications focuses on their distinctive structure: they consist of a Retriever model and a Generator model. Accordingly, studies have used metrics such as context relevance and answer relevance to evaluate the two components separately, and then assess answer faithfulness between them (Saad-Falcon et al., 2024; Es et al., 2023).

The other paradigm for evaluating RAG applications treats them as traditional question answering systems. Researchers have proposed metrics such as SAS, a semantic answer similarity estimation (Risch et al., 2021), and F1/EM scores (Wang et al., 2024) to evaluate these applications on public question answering (QA) datasets. Zheng et al. (2023) propose using strong LLMs such as GPT-4 to evaluate other LLMs. They argue that traditional benchmarks cannot effectively align quality measurement with human preferences in open-ended tasks.

In some companies, RAG applications are mostly developed for a closed-domain, closed-ended setting, where employees seek factual answers based on proprietary knowledge. Meanwhile, they often develop external-facing RAG applications that allow clients to search for authoritative information that is not necessarily available on the Internet. Therefore, it is crucial to establish definitive answer quality evaluation practices that are suitable for these businesses’ needs.

To study this, we introduce a benchmark dataset called VND-Bench, comprising 155 high-quality questions across 14 subject areas in a proprietary payment network data domain. We also collect their corresponding answers from a communication channel on an internal employee collaboration platform to serve as ground truth labels. We build a vanilla RAG application, tRAG, as a test bed and ask it to answer the 155 questions. We then request five human experts, GPT-4, and Llama 3 (Meta, 2024) to assess tRAG’s answer quality based on the labels and vRAG-Eval, a set of grading instructions and a rubric. Our results show that GPT-4’s evaluation agrees with human evaluators’ scores at a rate of 83% based on a binary accept/reject category.

We summarize our contributions as follows:

  • We design a RAG evaluation mechanism, vRAG-Eval, which includes a set of grading instructions and a rubric that measures answer quality in three aspects: correctness, completeness, and honesty, resulting in one single score. It can be readily converted to a binary accept/reject decision that suits business settings.

  • We build a high-quality benchmark dataset, VND-Bench, with 155 question-answer pairs that cover 14 subject areas in a proprietary payment network data domain.

  • We conduct experiments using LLMs as quality evaluators of a RAG application in a closed-domain, closed-ended, single-turn QA setting and find that GPT-4’s grading agrees with human experts’ opinions at a rate of 83% on accept/reject decisions.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 explains the motivation, problem setting, and our method of experimentation. Section 4 presents our experimental results. Sections 5 and 6 suggest future research directions and offer concluding remarks.

2 Related Work

Embedding and semantic similarity Zhang et al. (2020) propose the BERTScore metric as an automatic evaluation method for text generation tasks. Unlike traditional lexical approaches that rely on word matching, BERTScore aggregates the cosine similarities between the token embeddings of two sentences: the reference answer and the QA system-generated answer. Risch et al. (2021) introduce SAS, a metric designed to estimate the semantic similarity between ground-truth annotations and answers generated by a QA model. They find that semantic similarity metrics, especially those based on contemporary transformer models (Reimers and Gurevych, 2019), align more closely with human assessments than conventional lexical similarity metrics do.
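For reference, the greedy-matching form of BERTScore in Zhang et al. (2020) can be written as below. Here $x$ is the tokenized reference answer, $\hat{x}$ is the tokenized generated answer, and the token embeddings $\mathbf{x}_i$ and $\hat{\mathbf{x}}_j$ are assumed to be pre-normalized so that inner products equal cosine similarities. The optional importance weighting is omitted in this sketch.

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j,
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j,
\qquad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```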

LLM as a judge Two projects closely related to our research are LLM-as-a-Judge (Zheng et al., 2023) and A Case Study on the Databricks Documentation Bot (Leng et al., 2023). To evaluate LLM-based chat assistants, Zheng et al. (2023) examine the usage and limitations of LLMs as judges for open-ended questions. This setting may not be suitable for most RAG applications in enterprises, where users typically seek definitive answers. Leng et al. (2023) employ LLMs to generate grades as a weighted composite score of Correctness, Comprehensiveness, and Readability. While they report an alignment rate of 80% with human scores on individual factors, their grading system lacks a crucial element: one single metric that provides a decision-making-ready quality score. In business-centric settings, this limitation becomes particularly noticeable.

Reference-free evaluation In scenarios where human annotations are not readily available, Es et al. (2023) present RAGAS, a framework for automating the assessment of RAG systems. This framework considers RAG’s 2-module composition and proposes three key metrics: Context Relevance to evaluate the Retriever, Answer Relevance to assess the Generator, and Faithfulness to measure the coherence between the two modules. Similarly, ARES (Saad-Falcon et al., 2024) evaluates along the same 3 dimensions to assess the quality of RAG components. While not entirely reference-free, ARES leverages a few hundred human annotations during evaluation and employs a lightweight fine-tuned LLM as a judge, rather than relying on a frozen LLM.

3 Motivation and Problem Setting

Prior research has shown that LLMs are powerful tools and have become increasingly popular for evaluating RAG answer quality. However, existing studies have largely focused on open-ended settings or on public domains. The objective of our work is to determine the reliability of LLMs in a real-world business setting where informed decision-making is a top priority.

3.1 Preparation

We first develop a test bed RAG application, tRAG. Its proprietary knowledge base consists of a corpus of 18 PDF documents, totaling 15M tokens, which includes payment processing specifications, API reference guides, data standard manuals, and others.

During the preprocessing stage, for each document $d$ in the knowledge corpus, we split its content into chunks such that $d = \{c_i\}_{i=1}^{n}$, and vectorize each text chunk $c_i$ using an embedding model $M$ (OpenAI, 2022), where $v_i = M(c_i)$. Then we store the key-value pair $\langle v_i, c_i \rangle$ into a local vector store instance $S$. Appendix A illustrates the high-level workflow and configuration details.
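The preprocessing stage can be sketched in a few lines of Python. This is a minimal illustration, not tRAG’s implementation: it assumes the chunk size of 1000 from Appendix A is measured in characters (the unit is not stated), `toy_embed` is a hypothetical stand-in for the embedding model $M$, and a plain list plays the role of the vector store $S$.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def split_into_chunks(text: str, chunk_size: int = 1000, overlap: float = 0.25) -> List[str]:
    """Split text into fixed-size character chunks that overlap by a fraction."""
    if chunk_size <= 0 or not 0 <= overlap < 1:
        raise ValueError("chunk_size must be positive and overlap must be in [0, 1)")
    # Each new chunk starts `step` characters after the previous one.
    step = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for M: normalized letter frequencies (26 dimensions)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    total = sum(counts) or 1.0
    return [value / total for value in counts]


def build_store(documents: List[str], embed: Callable[[str], Vector]) -> List[Tuple[Vector, str]]:
    """Return the list of (v_i, c_i) pairs for every chunk of every document."""
    store = []
    for document in documents:
        for chunk in split_into_chunks(document):
            store.append((embed(chunk), chunk))
    return store
```

In practice `embed` would wrap a call to the actual embedding model, and the pairs would be written to an indexed vector store rather than a list.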

Next, we curate a benchmark dataset, VND-Bench, by collecting question-answer pairs directly from an internal employee communication channel. This channel is frequently used by a diverse range of data users across the organization to seek help in understanding specific topics related to a major Payment Network’s operations and transactional data semantics. We collect a total of 155 closed-ended questions along with their corresponding answers, which we consider to be the ground truth for our experiment. When multiple people respond to a question on the channel, we aggregate their answers into one single response. The questions are categorized into the subject areas listed in Table 1.

1. Acceptance 8. Issuing
2. Authentication 9. Master Data
3. Authorization 10. OCT
4. Clearing and Settlement 11. Other
5. Commercial 12. Processing
6. Dispute 13. Product
7. Fraud 14. Token
Table 1: Subject areas of the questions considered in the experiment.

3.2 Retrieval, Augmentation, and Generation

For each question $Q$ in the VND-Bench dataset, tRAG embeds the text into a fixed-length dense vector using model $M$: $q = M(Q)$, then conducts a distance search $\Delta$ within the knowledge database $S$ and returns the top $K$ most relevant document chunks matching the query.

$$K = \underset{i}{\mathrm{argmin}}\,\Delta(v_i, q)$$

Here $K$ also denotes the set of indices of the retrieved chunks. During the Augmentation stage, tRAG concatenates those top $K$ chunks as the context $C$:

$$C = \{c_j \mid j \in K\}$$

tRAG then combines the context with the question to form a prompt, $Q \oplus C$, which is sent to the answer generator.

We use GPT-4 as tRAG’s answer generator model $G$. The resulting response, which constitutes the model’s inference output, is evaluated in our experiment. Appendix B illustrates tRAG’s single-turn question answering workflow.
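The retrieval and augmentation steps can be illustrated with the following sketch. It is written under stated assumptions: it performs an exact brute-force search with the squared L2 distance and a top-K of 3 (both listed in Appendix A), whereas tRAG uses an HNSW index, and the prompt layout in `build_prompt` is hypothetical because the paper only specifies the combination $Q \oplus C$.

```python
import heapq
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_l2(u: Vector, v: Vector) -> float:
    """Squared L2 distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def retrieve_top_k(q: Vector, store: List[Tuple[Vector, str]], k: int = 3) -> List[str]:
    """Return the k chunks whose vectors are closest to the query vector q."""
    nearest = heapq.nsmallest(k, store, key=lambda pair: squared_l2(pair[0], q))
    return [chunk for _, chunk in nearest]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Concatenate the retrieved chunks into a context and combine it with the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion:\n{question}"
```

The resulting prompt string would then be sent to the generator model; that call is omitted here because it depends on the deployment.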

3.3 Grading Method

We compile a fielded dataset containing 155 entries with the following columns: Subject Area, Question, Label, tRAG’s Answer. Additionally, we design a grading rubric as illustrated in Table 2 to evaluate tRAG’s Answer.

1: The response is not aligned with the Label or is off-topic; includes hallucination.
2: The response admits it cannot provide an answer or lacks context; honest.
3: The response is relevant but contains notable discrepancies or inaccuracies.
4: The response is acceptable, sufficient but not exhaustive.
5: The response is fully accurate and comprehensive, based on the Label.
Table 2: Grading rubric

Each score assesses a distinct aspect of answer quality. Specifically, although neither an incorrect answer nor an "I don’t know" response provides value, we argue that hallucination may cause significant harm to mission-critical businesses. Consequently, we penalize hallucination with the lowest score of 1, while a response that honestly acknowledges it cannot answer earns a score of 2.

A correct answer warrants a score of 4, while a RAG application’s extra effort to provide supplementary details is commended with a score of 5, reflecting the value of answer completeness. When a score of 3 is assigned, it indicates that the answer quality is borderline, potentially containing false information.

LLMs grade answer quality in a zero-shot manner: a fixed template, which is part of vRAG-Eval, is used to populate the prompt for each question. Table 3 shows the grading template.

You are an AI assistant. In the following task, you are given a Question,
a RAG application’s response, and a Ground-truth Answer referred to as ’Label’.
Assess how well the RAG application’s response aligns with the Label,
using the grading rubric below:
[Start of Grading Rubric]
{rubric}
[End of Grading Rubric]
Treat the Label as the definitive answer. Present your final score in the format: "[[score]]",
followed by your justification. Example:
Score: [[3]], Reason: [[The RAG’s response partially aligns with the Label
but with some discrepancies]].
[Start of User Question]
{question}
[End of User Question]
[Start of Label]
{label}
[End of Label]
[Start of RAG’s Application Response]
{tRAG answer}
[End of RAG’s Application Response]
Table 3: Prompt template used for LLM grading.
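As an illustration of how such a template might be used, the sketch below fills an abbreviated version of the Table 3 prompt and parses the "Score: [[...]], Reason: [[...]]" reply. The abbreviated template text, the placeholder name `answer`, and the function names are our own assumptions, and the call to the LLM itself is left out.

```python
import re
from typing import Optional, Tuple

# Abbreviated form of the Table 3 template; placeholder names are illustrative.
TEMPLATE = (
    "Assess how well the RAG application's response aligns with the Label,\n"
    "using the grading rubric below:\n"
    "[Start of Grading Rubric]\n{rubric}\n[End of Grading Rubric]\n"
    "[Start of User Question]\n{question}\n[End of User Question]\n"
    "[Start of Label]\n{label}\n[End of Label]\n"
    "[Start of RAG's Application Response]\n{answer}\n[End of RAG's Application Response]"
)

SCORE_PATTERN = re.compile(r"Score:\s*\[\[([1-5])\]\]")
# Greedy match so that masked tokens such as [field_name] inside the reason survive.
REASON_PATTERN = re.compile(r"Reason:\s*\[\[(.*)\]\]", re.DOTALL)


def build_grading_prompt(rubric: str, question: str, label: str, answer: str) -> str:
    """Populate the grading template for one question."""
    return TEMPLATE.format(rubric=rubric, question=question, label=label, answer=answer)


def parse_grade(response: str) -> Tuple[Optional[int], Optional[str]]:
    """Extract the 1-5 score and the justification; None where a part is missing."""
    score_match = SCORE_PATTERN.search(response)
    reason_match = REASON_PATTERN.search(response)
    score = int(score_match.group(1)) if score_match else None
    reason = reason_match.group(1).strip() if reason_match else None
    return score, reason
```

Returning `None` for a missing score makes malformed replies easy to count and re-submit instead of silently treating them as a grade.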

In our experiment, inputs to the evaluators are a sequence of tRAG’s answers $\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_k$ together with a sequence of labels $y_1, y_2, \cdots, y_k$. The quality scores given by evaluators can be generally denoted as $S_E := P(\hat{Y} = Y)$, where $E \in \{\text{GPT-4}, \text{Llama 3}, \text{Human}\}$. Appendix C depicts the workflow of the grading experiments.

To ensure reproducibility and prevent potential grading hallucinations, we fix the temperature parameter for each LLM call at $T = 0.0$.

4 Experiments

We ask GPT-4, Llama 3 8B, and human experts at our company to independently evaluate the quality of tRAG’s answers. Their assessment is guided by the vRAG-Eval instructions and grading rubric.

4.1 GPT-4

vRAG-Eval instructs that grading responses should be in the format of “Score: [[score]], Reason: [[explanation]]”. GPT-4 demonstrates the capability to strictly adhere to the instructions. For example:

  • Score: [[2]], Reason: [[The RAG’s response provides general insights into the user’s queries but does not align with the specific answer provided in the Label. ...]]

Appendix D shows an anomaly transaction question and its Label, and Appendix E shows tRAG’s answer together with the full text of GPT-4’s grading response.

We prompt GPT-4 to explain the justification for each grade, which allows us to understand the LLM’s reasoning. In our earlier design of vRAG-Eval, we explicitly emphasized Correctness, Completeness, and Honesty. It became evident that the GPT-4 evaluator partially deviated from the Label and graded answers based on the language model’s own definitions of those quality metrics. For instance, as illustrated in Appendix F, tRAG’s answer to the same question was graded 4 instead of 2:

  • Rating: [[4]], Reason: [[The RAG application’s answer is acceptable and provides a general understanding of the possible reasons for the user’s queries. ...]]

This example underscores the importance of developing clear and distinct grading criteria tailored to individual businesses’ specific requirements, contrasting with vague guidelines that are commonly applied in public domains.

4.2 Human

We gather human grades of tRAG’s answer quality from experts, who are experienced data and machine learning practitioners employed by our company.

The 155 Q&A pairs in VND-Bench are collected in two phases. tRAG’s answers to the first 52 questions are graded by three human experts, while the next set of 103 answers are graded by two others.

To obtain a robust and representative view of human opinion, we take the median of the three human experts’ scores for each of the first 52 answers, and randomly sample from the other two experts’ assessments for the remaining 103 scores.
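The consolidation of human scores can be sketched as follows. The function names are hypothetical, and the random selection is assumed to happen independently for each answer, which is one possible reading of the procedure described above.

```python
import random
from statistics import median
from typing import List, Sequence


def consolidate_three(scores_by_expert: Sequence[Sequence[int]]) -> List[int]:
    """Median of the experts' scores for each answer (one inner sequence per expert)."""
    return [int(median(per_answer)) for per_answer in zip(*scores_by_expert)]


def consolidate_two(scores_a: Sequence[int], scores_b: Sequence[int], rng: random.Random) -> List[int]:
    """For each answer, keep the score of one of the two experts chosen at random."""
    return [rng.choice(pair) for pair in zip(scores_a, scores_b)]
```

With three graders the median always equals one of the scores actually given, so the consolidated value stays on the 1 to 5 rubric scale; passing a seeded `random.Random` makes the sampling repeatable.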

The score distributions are illustrated in Figure 1 and Figure 2 respectively.

Figure 1: Distribution of human grading scores for tRAG’s answers to the first 52 questions.
Figure 2: Distribution of human grading scores for tRAG’s answers to the next 103 questions.

4.3 Compare GPT-4 with Human

The overall exact-match agreement between GPT-4 and human evaluators across the five quality score levels defined by the vRAG-Eval rubric is 29.7%. Figure 3 compares the two score distributions at each level.

Figure 3: Score distribution comparison at each quality level.

A closer analysis of grading samples reveals that the two evaluators may differ subtly on the exact quality level to assign. For example, the human evaluators grade tRAG’s answer to the transaction code question shown in Appendix G at quality level 2, while GPT-4 assigns 1:

  • Score: [[1]], Reason: [[The RAG’s response does not align with the Label. ...]]

Appendix H displays the full text of GPT-4’s grading justification.

In the vRAG-Eval grading rubric, the criterion for a score of 2 is “The response admits it cannot provide an answer or lacks context; honest.” Meanwhile, as the GPT-4 evaluator explains, it judges that tRAG’s answer does not directly address the question, so it bases its assessment on “1: The response is not aligned with the Label or is off-topic; includes hallucination.” We also believe that GPT-4’s explanation justifies the score of 1. Both evaluators evidently agree that tRAG’s response is not a good answer to the question, yet each grades it on a different basis of incorrectness.

Leng et al. (2023) confirm that both humans and LLMs struggle to hold the same standard for the same score when grading at high precision. However, the definitions of “high precision” and “low precision” are subjective and can vary based on individual perspectives. To address this issue, we propose aligning with the typical thumbs-up and thumbs-down gestures used in chat applications. This is also appropriate in business settings, where a binary quality scale effectively predicts whether an answer would be accepted or rejected by clients.

Therefore, we convert the 5-level grading scale to binary: scores 1, 2, and 3 map to answer rejected, and scores 4 and 5 map to answer accepted. On this binary scale, we observe an 82.6% agreement rate between GPT-4 and human evaluators. We also find that GPT-4 and human evaluators both place tRAG’s answer quality at the lower end, with reject rates of 77% and 72% respectively. Note that tRAG is a preliminary application that serves as a test bed. Nevertheless, this result highlights the risk of deploying RAG applications without answer quality control measures in place, which could have adverse consequences for an enterprise’s business operations.
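The two agreement rates reported in this section, exact-match on the 5-level scale and match on the binary accept/reject scale, can be computed as in the sketch below. The function names are our own and the scores in the usage example are made up for illustration; they are not VND-Bench data.

```python
from typing import Sequence


def to_binary(score: int) -> bool:
    """Map a rubric score to accept (True for 4 or 5) or reject (False for 1, 2, or 3)."""
    return score >= 4


def agreement_rate(scores_a: Sequence[int], scores_b: Sequence[int], binary: bool = False) -> float:
    """Fraction of answers on which two evaluators agree, per level or on accept/reject."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score sequences must be non-empty and of equal length")
    if binary:
        matches = sum(to_binary(a) == to_binary(b) for a, b in zip(scores_a, scores_b))
    else:
        matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Made-up scores: the evaluators disagree on exact levels but agree on accept/reject.
human_scores = [1, 2, 4, 5, 3]
llm_scores = [2, 1, 4, 4, 3]
exact = agreement_rate(human_scores, llm_scores)              # 2 of 5 answers match
binary = agreement_rate(human_scores, llm_scores, binary=True)  # 5 of 5 answers match
```

The toy example shows how disagreements that stay within the reject band (1 versus 2) or within the accept band (4 versus 5) lower the exact-match rate without affecting the binary rate.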

4.4 Llama 3

Llama 3 is an open-source LLM available in two variants: 8B and 70B parameters. Due to our infrastructure constraints, we are only able to utilize the smaller 8B version.

Similar to the GPT-4 experiment, we ask Llama 3 to assess the quality of tRAG’s responses based on vRAG-Eval’s grading instructions and rubric. Details of results can be found in Appendix I.

The overall agreement between Llama 3 8B and human evaluators across the five quality score levels is 23.9%. When we convert the 5-level grades to answer rejected/accepted categories, we observe only 36.8% alignment.

We hypothesize that, at this relatively small parameter size, Llama 3 8B may not be capable enough to serve as a reliable evaluator that assesses answer quality on par with human experts.

5 Future Directions

While LLMs continue to grow in size and intelligence, utilizing powerful LLMs as an automated alternative to human expert graders is a promising and prominent area of research. This approach complements traditional lexical and semantic measurements and demonstrates exceptional explainability and human preference alignment capabilities. We identify several areas for potential future research and practice directions:

Enterprise’s Values In designing the vRAG-Eval grading system, we incorporate honesty into the grading rubric to reflect that human evaluators prefer answers that admit uncertainty over hallucination. We also conduct preliminary experiments on the subject of "verbalized probability" to explore the potential of using self-expressed confidence levels as a factor in measuring an LLM evaluator’s own honesty. Additionally, we consider integrating more ethical aspects, such as inclusiveness and fairness, into the grading metrics to ensure that the system is both effective and responsible.

More Language Support The VND-Bench dataset currently only includes Q&A pairs in English, yet our enterprise is a global company that supports businesses worldwide. We consider extending the dataset to include other languages, reflecting the company’s value of inclusiveness and diversity.

Domain Adaptation We curate the VND-Bench dataset with a focus on transactional data domain knowledge. We consider building benchmark datasets in other areas and studying vRAG-Eval’s applicability to those business priorities.

Few-shot Evaluation. Our experiments demonstrate that the GPT-4 evaluator exhibits high agreement rates in accept/reject decisions. We consider analyzing the score discrepancy at each of the five grading levels to better understand the relationship between human and LLM evaluators’ reasoning. By leveraging "few-shot learning", we hypothesize that we can improve the alignment of LLM evaluators with human preferences at a finer granularity.

More Question Categories. The VND-Bench dataset currently focuses on factual knowledge questions in the transactional data domain. We consider expanding to include math and logical reasoning questions, which are essential for evaluating business-centric RAG applications. By developing high-order grading capabilities, we can better support the company’s growth and innovation.

Llama 3 (70B). We contemplate quantizing the 70B model to 4-bit integers and evaluating its grading agreement with human experts. This experiment could explore the possibility of open-source LLMs becoming more economical and less risky alternatives, suitable for deployment on commodity hardware.

6 Conclusion

In this paper, we thoroughly study evaluating Retrieval-Augmented Generation (RAG) applications in a high-stakes business environment where answer quality is paramount.

We design a novel system to evaluate answer correctness, completeness, and honesty through a single 5-level grading scale. Furthermore, we cast the fine-grained scores into accepted or rejected categories. This mapping leads to a clearer, more direct comparison between human evaluators and LLMs.

We observe a remarkably high agreement rate of 82.6% between GPT-4 and human evaluators, which underscores the potential of LLMs in understanding and aligning with human judgement criteria through exceptional explainability.

Furthermore, our study reveals a significant disparity between Llama 3 8B’s evaluation and human grading, highlighting the importance of selecting a strong LLM.

Acknowledgments

We thank Ranjan Dutta, Toni Wang, and Ajit Patil for their thoughtful discussions and assistance with data annotation. We acknowledge Ping Zhu, Chi Wo Chung, Jiayin Zheng, Mohammad Rahman and their team for providing robust OpenAI API access. We appreciate our managers Salila Khilani, Dmitrii Krainov, and Yu Gu for their support. We also thank Stephen Shin, Nissa Strottman, Allee McDermott, Kathleen Patel, Tuesday Uhland, and Devon Grant for their support on compliance matters.

References

Appendix A tRAG: Knowledge Preprocessing Workflow and Configurations

[Uncaptioned image]
Chunk size: 1000
Chunk overlap: 25%
Embedding model: text-embedding-ada-002
Indexing: Hierarchical Navigable Small World (HNSW)
Distance: Squared L2 norm
Top-K: 3
Embeddings: 13644

Appendix B tRAG: Document Retrieval and Answer Generation Workflow

[Uncaptioned image]

Appendix C Answer Quality Evaluation Workflow

[Uncaptioned image]

Appendix D An Anomaly Transaction Question and Its Label

Question:
I have following queries related to data in [anomaly_tranx_table_name].

There are around [number_of_records] records in this table where [purchase_date_field_name] is greater than [post_date_field_name]. May I know what are the scenarios where a purchase date can be greater than [anomaly] posted date?

There are records where same [tranx_id_field_name] have more than one record. For ref, [tranx_id_field_name] [tranx_id_value] has [number_of_entries] entries in dedupe table and each record has different [tranx_amount_field_name] reported. May I know why there can be more than 1 entry for the same [tranx_id_field_name]

Label:
[tranx_id_field_name] may be used to tie multiple clearing records to one authorisation, as is the case with split shipments (one order placed online will have original authorisation, followed by multiple shipments, each one cleared separately)
Table A1: Note that we retain spelling errors in the original label text. Exact table names, column names, and values are masked in square brackets for compliance purposes.

Appendix E tRAG’s Answer to the Anomaly Transaction Question and GPT-4’s Grading

tRAG’s Answer:
Based on the context provided, I can provide some general insights into your queries. However, please note that the specific reasons for these scenarios in your data may require a more detailed understanding of the data and the business rules applied.

1. Scenarios where purchase date ([purchase_date_field_name]) is greater than [anomaly] posted date ([post_date_field_name]): This could occur due to time zone differences, system errors, or late reporting of transactions. For instance, a transaction could be made in one time zone and recorded in another, causing the purchase date to appear later than the [anomaly] posted date. It could also be due to delays in transaction posting or reporting.

2. Multiple entries for the same transaction ID ([tranx_id_field_name]) with different [anomaly] transaction amounts ([tranx_amount_field_name]): This could be due to multiple instances of [anomaly] associated with the same transaction, or updates to the [anomaly] amount as more information becomes available. It could also be a result of data errors or duplications.

Without more specific information about the data and the business rules applied, it’s difficult to provide a more precise explanation. I would recommend further investigation into these records and consultation with the team responsible for data management and [anomaly] detection.

Assessment:
Score: [[2]], Reason: [[The RAG’s response provides general insights into the user’s queries but does not align with the specific answer provided in the Label. The Label explains that multiple entries for the same transaction ID could be due to split shipments, which is not mentioned in the RAG’s response. The RAG’s response also does not provide a specific reason for the scenario where the purchase date is greater than the fraud posted date, which is not addressed in the Label. Therefore, the RAG’s response lacks the specific context provided in the Label.]]

Appendix F Early Design of Grading Instructions, Rubric, and GPT-4’s Assessment

Please act as an impartial judge and evaluate the quality of the answer provided by a RAG application.
Your evaluation should consider correctness, completeness, and honesty.
You will be given a reference answer, and the RAG application’s answer,
then grade the quality of the RAG application’s answer based on rubric below:
[The Start of Grading Rubric]
{rubric}
[The End of Grading Rubric]
Be as objective as possible. Output your final verdict by strictly following this format: "[[rating]]",
then providing your explanation.
For example: "Rating: [[5]], Reason: [[The answer is correct and comprehensive]].

Rubric:
1: completely incorrect, hallucination
2: admits it cannot answer or lack of context, honest
3: pertinent but contains noticeable errors or inaccuracies
4: acceptable answer, adequate but not comprehensive
5: fully accurate and exhaustive

GPT-4’s Assessment:
Rating: [[4]], Reason: [[The RAG application’s answer is acceptable and provides a general understanding of the possible reasons for the user’s queries. However, it does not specifically address the scenario of multiple clearing records tied to one authorization, as mentioned in the reference answer. The RAG application’s answer is not incorrect, but it is not as comprehensive as the reference answer.]]

Appendix G A Transaction Code Question and Label

Question:
I would like to browse transactions - invoice from [tranx_code_value]. Could you please advice if the data is availabe somewhere.

Label:
[tranx_code_value]s are carried in [tranx_detail_table_name] [tranx_code_field_name] = [tranx_code_value]. [tranx_code_value]s are used for all types of information transferance such as additional hotel portfolio information that could not be fit in the [tranx_component_record_type] record.

Appendix H tRAG’s Answer to the Transaction Code Question and GPT-4’s grading justification

tRAG’s Answer:
Based on the context provided, it appears that the data for [tranx_code_value] - [tranx_component_record_type] - Commercial Card - Invoice (Line Detail) is contained within specific tables. However, the exact location or access method for these tables is not specified in the context. As a data engineer, I would typically expect this data to be stored in a secure database or data warehouse, accessible to authorized personnel or systems. If you are an authorized user, you may need to use a specific tool or application to browse this data, or you may need to write a query or script to extract the data from the database. If you are not sure how to access this data, I would recommend reaching out to your organization’s IT or data management team for assistance. They should be able to provide you with the necessary access and instructions. Please note that this is a general advice based on typical data management practices and the actual process may vary depending on your organization’s specific systems and policies.

GPT-4’s Grading:
Score: [[1]], Reason: [[The RAG’s response does not align with the Label. The Label specifically mentions that [tranx_code_value]s are carried in [tranx_detail_table_name] with [tranx_code_field_name] = [tranx_code_value], but the RAG’s response does not provide this information. Instead, it gives a general advice on how to access data, which is not asked in the question.]]
Table A2: tRAG’s answer and GPT-4’s grading justification.

Appendix I Llama 3

Llama 3 8B assigns a quality score of 4 to 80% of tRAG’s answers, and a score of 3 to just one answer. The score distribution is depicted in the left-hand figure of Figure A1.

Figure A1: Left: Distribution of Llama 3 8B grading scores for tRAG’s answers. Right: The heatmap illustrates the correlation between quality scores and grading confidence.

The skewness leads us to speculate that, at this relatively small parameter size, Llama 3 8B may not be capable of discerning answer quality in borderline cases, and that it may avoid assigning extreme scores when it is not confident in its grading. To test this hypothesis, we add an instruction, shown in Table A3, to elicit Llama 3’s confidence level. This approach, termed “verbalized probability” by Lin et al. (2022), allows the LLM to express its confidence level in natural language without the use of model logits.

Present your final score in the format: "[[score]]",
followed by your confidence level of the grading in the range of 0 to 100,
with 100 being very confident and 0 being not sure about your grading at all.
At the end, disclose your justification. Example:
Score: [[3]], Confidence: [[50]], Reason: [[The RAG’s response partially aligns with the Label but with some discrepancies]].
Table A3: Additional instruction asking Llama 3 to disclose its confidence level.
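A reply that follows the Table A3 format can be parsed with a small helper such as the one below. This is an illustrative sketch: the function name is hypothetical, and replies that deviate from the requested format, or report a confidence outside 0 to 100, are simply rejected.

```python
import re
from typing import Optional, Tuple

GRADE_PATTERN = re.compile(
    r"Score:\s*\[\[([1-5])\]\]\s*,\s*Confidence:\s*\[\[(\d{1,3})\]\]"
)


def parse_score_and_confidence(response: str) -> Optional[Tuple[int, int]]:
    """Return (score, confidence) from a grading reply, or None if it is malformed."""
    match = GRADE_PATTERN.search(response)
    if match is None:
        return None
    score, confidence = int(match.group(1)), int(match.group(2))
    if not 0 <= confidence <= 100:
        return None  # verbalized confidence must lie in the requested range
    return score, confidence
```

Collecting these pairs over all graded answers gives the data needed to tabulate confidence against score.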

The right-hand figure in Figure A1 illustrates the correlation between the quality scores of answers and the confidence of the 8B evaluator.

Hendrycks et al. (2021) show that LLMs should be calibrated. Nevertheless, we argue that the observation is consistent with our hypothesis: when the model’s grading confidence is relatively low (at 80%), Llama 3 8B tends to assign a quality score of 4 as a "default" choice.