Evaluating Quality of Answers for Retrieval-Augmented Generation:
A Strong LLM Is All You Need
Abstract
We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grades for these quality aspects to a binary score indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business settings where a clear decision is essential. We apply vRAG-Eval to two Large Language Models (LLMs), which evaluate the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find substantial alignment between GPT-4’s assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.
Yang Wang Visa Data & AI Platform Foster City, California [email protected] Alberto Garcia Hernandez Visa Consulting & Analytics London, United Kingdom [email protected]
Roman Kyslyi Visa Consulting & Analytics Kyiv, Ukraine [email protected] Nicholas Kersting Visa Data & AI Platform Austin, Texas [email protected]
1 Introduction
Since the launch of ChatGPT in November 2022, Large Language Models (LLMs) have become increasingly popular components for organizations seeking to enhance productivity and enrich their product portfolios. However, GPT-4 Turbo’s training data has a cut-off date of April 2023, so the model lacks knowledge of more recent events. Furthermore, because LLMs are pre-trained on public-domain text and do not possess proprietary information, their capabilities are limited when it comes to our company’s knowledge-intensive applications.
Fine-tuning (Radford et al., 2018; Dodge et al., 2020) is a technique that can be used to inject new knowledge into pre-trained LLMs by updating their parameters through further, specialized training. However, OpenAI’s fine-tuning API is currently only available through an experimental access program, and GPT-4 fine-tuning requires more effort to achieve significant improvements, as noted by OpenAI (2023).
Retrieval-augmented generation (RAG) was originally proposed by Lewis et al. (2021). This method stores extra knowledge in a non-parametric dense vector index and uses a pre-trained neural retriever to search for relevant context, which a pre-trained sequence-to-sequence (seq2seq) model then uses to generate content. Ovadia et al. (2024) argue that RAG consistently outperforms unsupervised fine-tuning on a wide range of knowledge-intensive tasks.
A RAG application leveraging LLMs enhanced with the company’s proprietary knowledge has become one of the pivotal factors advancing the adoption of GPT-4-based applications at our enterprise, a global payment technology company. This highlights the need for viable approaches to evaluating RAG applications, as a lack of performance metrics poses risks to the enterprise’s business and may have negative consequences.
One approach to evaluating the quality of RAG applications focuses on their unique characteristic: they consist of a Retriever model and a Generator model. Studies following this approach use metrics such as context relevance and answer relevance to evaluate the two components separately, and then assess the answer faithfulness between them (Saad-Falcon et al., 2024; Es et al., 2023).
The other paradigm for evaluating RAG applications treats them as traditional question answering systems. Researchers have proposed metrics such as SAS, a semantic answer similarity estimation (Risch et al., 2021), and F1/EM scores (Wang et al., 2024) to evaluate these applications on public question answering (QA) datasets. Zheng et al. (2023) propose using strong LLMs such as GPT-4 to evaluate other LLMs. They argue that traditional benchmarks cannot effectively align quality measurement with human preferences in open-ended tasks.
In some companies, RAG applications are mostly developed for a closed-domain, closed-ended setting, where employees seek factual answers based on proprietary knowledge. Meanwhile, they often develop external-facing RAG applications that allow clients to search for authoritative information that is not necessarily available on the Internet. Therefore, it is crucial to establish definitive answer quality evaluation practices that are suitable for these businesses’ needs.
To study this, we introduce a benchmark dataset called VND-Bench, comprising 155 high-quality questions across 14 subject areas in a proprietary payment network data domain. We also collect their corresponding answers from a communication channel on an internal employee collaboration platform to serve as ground truth labels. We build a vanilla RAG application, tRAG, as a test bed and ask it to answer the 155 questions. We then request five human experts, GPT-4, and Llama 3 (Meta, 2024) to assess tRAG’s answer quality based on the labels and vRAG-Eval, a set of grading instructions and a rubric. Our results show that GPT-4’s evaluation agrees with human evaluators’ scores at a rate of 83% based on a binary accept/reject category.
We summarize our contributions as follows:
- We design a RAG evaluation mechanism, vRAG-Eval, which includes a set of grading instructions and a rubric that measures answer quality in three aspects: correctness, completeness, and honesty, resulting in one single score. It can be readily converted to a binary accept/reject decision that suits business settings.
- We build a high-quality benchmark dataset, VND-Bench, with 155 question-answer pairs that cover 14 subject areas in the data domain.
- We conduct experiments using LLMs as a RAG application’s quality evaluators in a closed-domain, closed-ended, single-turn QA setting and find that GPT-4’s grading agrees with human experts’ opinions at a rate of 83% in terms of accept or reject decisions.
The rest of the paper is organized as follows. In section 2, we introduce related work. In section 3, we explain the motivation, problem setting, and our method of experimentation. In section 4, we provide our experimental results. We suggest future research directions in section 5 and make concluding remarks in section 6.
2 Related Work
Embedding and semantic similarity Zhang et al. (2020) propose the BERTScore metric as an automatic evaluation method for text generation tasks. Unlike traditional lexical approaches that rely on word matching, BERTScore sums the cosine similarities between the token embeddings of two sentences: the reference answer and the QA system-generated answer. Risch et al. (2021) introduce SAS, a metric designed to estimate the semantic similarity between ground-truth annotations and answers generated by a QA model. They find that semantic similarity metrics, especially those based on contemporary transformer models (Reimers and Gurevych, 2019), align more closely with human assessments than conventional lexical similarity metrics do.
LLM as a judge Two projects closely related to our research are LLM-as-a-Judge (Zheng et al., 2023) and A Case Study on the Databricks Documentation Bot (Leng et al., 2023). To evaluate LLM-based chat assistants, Zheng et al. (2023) examine the usage and limitations of LLMs as judges for open-ended questions. This setting may not be suitable for most RAG applications in enterprises, where users typically seek definitive answers. Leng et al. (2023) employ LLMs to generate grades as a weighted composite score of Correctness, Comprehensiveness, and Readability. While they report an alignment rate of 80% with human scores on individual factors, their grading system lacks a crucial element: one single metric that provides a decision-making-ready quality score. In business-centric settings, this limitation becomes particularly noticeable.
Reference-free evaluation In scenarios where human annotations are not readily available, Es et al. (2023) present RAGAS, a framework for automating the assessment of RAG systems. This framework considers RAG’s 2-module composition and proposes three key metrics: Context Relevance to evaluate the Retriever, Answer Relevance to assess the Generator, and Faithfulness to measure the coherence between the two modules. Similarly, ARES (Saad-Falcon et al., 2024) evaluates along the same 3 dimensions to assess the quality of RAG components. While not entirely reference-free, ARES leverages a few hundred human annotations during evaluation and employs a lightweight fine-tuned LLM as a judge, rather than relying on a frozen LLM.
3 Motivation and Problem Setting
Prior research has shown that LLMs are powerful tools that are becoming increasingly popular for evaluating RAG answer quality. However, existing studies have largely focused on open-ended settings or public domains. The objective of our work is to determine the reliability of LLMs in a real-world business setting where informed decision-making is a top priority.
3.1 Preparation
We first develop a test bed RAG application, tRAG. Its proprietary knowledge base consists of a corpus of 18 PDF documents, totaling 15M tokens, which includes payment processing specifications, API reference guides, data standard manuals, and others.
During the preprocessing stage, we split the content of each document in the knowledge corpus into chunks and vectorize each text chunk using an embedding model (OpenAI, 2022). We then store each resulting key-value pair of embedding vector and text chunk in a local vector store instance. Appendix A illustrates the high-level workflow and configuration details.
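The chunking step can be sketched as follows. This is a minimal illustration rather than the tRAG implementation: it assumes that the chunk size of 1000 and the 25% overlap listed in Appendix A are measured in characters (the unit is not stated), and `split_into_chunks` is a hypothetical helper name.

```python
def split_into_chunks(text: str, chunk_size: int = 1000,
                      overlap_ratio: float = 0.25) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share
    overlap_ratio of the chunk size (assumed here to be in characters)."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be positive and 0 <= overlap_ratio < 1")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk would then be passed to the embedding model and stored together with its vector; overlapping windows reduce the chance that a relevant passage is cut in half at a chunk boundary.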
Next, we curate a benchmark dataset, VND-Bench, by collecting question-answer pairs directly from an internal employee communication channel. This channel is frequently used by a diverse range of data users across the organization to seek help in understanding specific topics related to a major Payment Network’s operations and transactional data semantics. We collect a total of 155 closed-ended questions along with their corresponding answers, which we consider to be the ground truth for our experiment. When multiple people respond to a question on the channel, we aggregate their answers into one single response. The questions are categorized into the subject areas listed in Table 1.
1. Acceptance | 8. Issuing |
2. Authentication | 9. Master Data |
3. Authorization | 10. OCT |
4. Clearing and Settlement | 11. Other |
5. Commercial | 12. Processing |
6. Dispute | 13. Product |
7. Fraud | 14. Token |
3.2 Retrieval, Augmentation, and Generation
For each question in the VND-Bench dataset, tRAG embeds the text into a fixed-length dense vector using the embedding model, then conducts a distance search within the knowledge vector store and returns the top-K most relevant document chunks matching the query.
During the Augmentation stage, tRAG concatenates those top-K chunks into the context and combines the context with the question to form a prompt. This prompt is then sent to the answer generator.
We utilize the GPT-4 model as tRAG’s answer generator. The resulting response, which constitutes the model’s inference output, is evaluated in our experiment. Appendix B illustrates tRAG’s single-turn question answering workflow.
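The retrieval and augmentation steps can be sketched as below. This is an illustrative sketch under stated assumptions, not tRAG’s code: it uses a brute-force scan instead of the HNSW index listed in Appendix A, toy vectors instead of real embeddings, and a prompt layout that is our own guess. The names `squared_l2`, `retrieve_top_k`, and `build_prompt` are hypothetical; the squared L2 distance and Top-K of 3 follow Appendix A.

```python
def squared_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_top_k(query_vec, store, k=3):
    """Return the k chunk texts whose vectors are closest to query_vec.

    store is a list of (vector, chunk_text) pairs; a real system would
    query an approximate nearest-neighbor index instead of sorting.
    """
    ranked = sorted(store, key=lambda pair: squared_l2(query_vec, pair[0]))
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, chunks):
    """Concatenate retrieved chunks into a context and append the question.
    The exact prompt wording is an assumption for illustration."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The prompt returned by `build_prompt` is what would be sent to the answer generator.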
3.3 Grading Method
We compile a fielded dataset containing 155 entries with the following columns: Subject Area, Question, Label, tRAG’s Answer. Additionally, we design a grading rubric as illustrated in Table 2 to evaluate tRAG’s Answer.
Score | Description |
---|---|
1: | The response is not aligned with the Label or is off-topic; includes hallucination. |
2: | The response admits it cannot provide an answer or lacks context; honest. |
3: | The response is relevant but contains notable discrepancies or inaccuracies. |
4: | The response is acceptable, sufficient but not exhaustive. |
5: | The response is fully accurate and comprehensive, based on the Label. |
Each score assesses a distinct aspect of answer quality. Although incorrect answers and responses that admit not knowing both lack value, we argue that hallucination may cause significant harm to mission-critical businesses. Consequently, we penalize hallucination with the lowest score of 1, while a response that honestly acknowledges it does not know the answer earns a score of 2.
A correct answer warrants a score of 4, while a RAG application’s extra effort to provide supplementary details is rewarded with a score of 5, reflecting the value of answer completeness. A score of 3 indicates that the answer quality is borderline and that the answer potentially contains false information.
LLMs grade answer quality in a zero-shot manner: a constant template, which is part of vRAG-Eval, is used to populate the prompt for each question. Table 3 illustrates the grading template.
You are an AI assistant. In the following task, you are given a Question,
a RAG application’s response, and a Ground-truth Answer referred to as ’Label’.
Assess how well the RAG application’s response aligns with the Label,
using the grading rubric below:
[Start of Grading Rubric]
{rubric}
[End of Grading Rubric]
Treat the Label as the definitive answer. Present your final score in the format: "[[score]]",
followed by your justification. Example:
Score: [[3]], Reason: [[The RAG’s response partially aligns with the Label
but with some discrepancies]].
[Start of User Question]
{question}
[End of User Question]
[Start of Label]
{label}
[End of Label]
[Start of RAG’s Application Response]
{tRAG answer}
[End of RAG’s Application Response]
In our experiment, the inputs to the evaluators are the sequence of tRAG’s answers together with the corresponding sequence of labels. Each evaluator returns one quality score per answer, an integer from 1 to 5 as defined by the rubric. Appendix C depicts the workflow of the grading experiments.
To ensure reproducibility and prevent potential grading hallucinations, we keep the temperature parameter fixed for every LLM call.
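Because the template asks for the score in a fixed "Score: [[score]], Reason: [[...]]" format, an evaluator’s raw response can be turned into a numeric grade with a small parser. The following is a minimal sketch; `parse_grade` and the regular expression are our own illustrative choices and are not taken from the vRAG-Eval implementation.

```python
import re

# Matches e.g. 'Score: [[3]], Reason: [[some explanation]]'.
# DOTALL lets the reason span multiple lines.
GRADE_PATTERN = re.compile(
    r"Score:\s*\[\[(\d)\]\]\s*,\s*Reason:\s*\[\[(.*)\]\]", re.DOTALL
)


def parse_grade(response: str) -> tuple[int, str]:
    """Extract (score, reason) from an evaluator response.

    Raises ValueError if the response does not follow the expected
    format or the score falls outside the 1-5 rubric range.
    """
    match = GRADE_PATTERN.search(response)
    if match is None:
        raise ValueError("response does not follow the grading format")
    score = int(match.group(1))
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} is outside the 1-5 rubric")
    return score, match.group(2).strip()
```

Raising an error on malformed output, rather than silently defaulting to a score, makes it easy to spot evaluators that do not adhere to the instructions.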
4 Experiments
We ask GPT-4, Llama 3 8B, and human experts at our company to evaluate the quality of tRAG’s answers independently. Their assessment is guided by the vRAG-Eval instructions and grading rubric.
4.1 GPT-4
vRAG-Eval instructs that grading responses should be in the format of “Score: [[score]], Reason: [[explanation]]”. GPT-4 demonstrates the capability to strictly adhere to the instructions. For example:
- Score: [[2]], Reason: [[The RAG’s response provides general insights into the user’s queries but does not align with the specific answer provided in the Label. ...]]
Appendix D shows an anomaly transaction question, and Appendix E illustrates the full text of GPT-4’s grading response.
We prompt GPT-4 to explain the justification behind each grade, which allows us to understand the LLM’s reasoning. In our earlier design of vRAG-Eval, we explicitly emphasized Correctness, Completeness, and Honesty. It became evident that the GPT-4 evaluator partially deviated from the Label and instead graded answers based on the language model’s own definitions of those quality metrics. For instance, as illustrated in Appendix F, tRAG’s answer to the same question was graded 4 instead of 2:
- Rating: [[4]], Reason: [[The RAG application’s answer is acceptable and provides a general understanding of the possible reasons for the user’s queries. ...]]
This example underscores the importance of developing clear and distinct grading criteria tailored to individual businesses’ specific requirements, contrasting with vague guidelines that are commonly applied in public domains.
4.2 Human
We gather human grades of tRAG’s answer quality from experts, who are experienced data and machine learning practitioners employed by our company.
The 155 Q&A pairs in VND-Bench are collected in two phases. tRAG’s answers to the first 52 questions are graded by three human experts, while the remaining 103 answers are graded by two other experts.
To obtain a robust and representative view of human opinion, we take the median of the three experts’ scores for each of the first 52 answers, and for each of the remaining 103 answers we randomly pick one of the two experts’ scores.
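The aggregation of human scores can be sketched as follows. This is a minimal sketch under one reading of the procedure: the median is taken per answer across the three experts, and for the two-expert set one of the two scores is picked at random per answer. The function names are hypothetical and the scores in the usage comment are made up for illustration.

```python
import random
from statistics import median


def aggregate_three(scores_by_expert):
    """Per-answer median over three experts' score lists.

    With three integer scores the median is always one of the given
    scores, so converting it to int loses nothing.
    """
    return [int(median(triple)) for triple in zip(*scores_by_expert)]


def aggregate_two(scores_a, scores_b, rng):
    """For each answer, randomly keep one of the two experts' scores.

    rng is a random.Random instance so that the sampling can be seeded.
    """
    return [rng.choice(pair) for pair in zip(scores_a, scores_b)]


# Illustrative usage with made-up scores:
# aggregate_three([[1, 4, 5], [2, 4, 3], [3, 5, 3]])
# aggregate_two([1, 4], [2, 4], random.Random(0))
```

The median is robust to a single outlying grader, which is not possible with only two graders; random selection avoids introducing fractional scores through averaging.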
![Refer to caption](extracted/5712610/images/3humans_score_distribution.png)
![Refer to caption](extracted/5712610/images/2humans_score_distribution.png)
4.3 Compare GPT-4 with Human
The overall exact-match agreement between GPT-4 and human evaluators on the 5 quality score levels defined by the vRAG-Eval rubric is 29.7%. Figure 3 shows the comparison at each level.
![Refer to caption](extracted/5712610/images/human_vs_gpt4_each_level.png)
A closer analysis of grading samples reveals that the two evaluators may differ subtly on the exact quality level to assign. For example, the human evaluators grade tRAG’s answer to a transaction code question, shown in Appendix G, at quality level 2, while GPT-4 assigns 1:
- Score: [[1]], Reason: [[The RAG’s response does not align with the Label. ...]]
Appendix H displays the full text of GPT-4’s grading justification.
The vRAG-Eval grading rubric defines level 2 as “2: The response admits it cannot provide an answer or lacks context; honest.” GPT-4’s explanation, however, indicates that it considers tRAG’s answer not to address the question directly, so it bases its assessment on “1: The response is not aligned with the Label or is off-topic; includes hallucination.” We believe that GPT-4’s explanation justifies the score of 1. Both evaluators evidently agree that tRAG’s response is not a good answer to the question, yet each grades it on a different basis of incorrectness.
Leng et al. (2023) confirm that both humans and LLMs struggle to hold the same standard for the same score when grading at high precision. However, the definitions of “high precision” and “low precision” are subjective and can vary based on individual perspectives. To address this issue, we propose aligning with the typical thumbs-up and thumbs-down gestures used in chat applications. This is also appropriate in business settings, where a binary quality scale effectively predicts whether an answer will be accepted or rejected by clients.
Therefore, we convert the 5-level grading scale to a binary one: scores 1, 2, and 3 map to answer rejected, and scores 4 and 5 map to answer accepted. On this binary scale, we observe an 82.6% agreement rate between GPT-4 and human evaluators. GPT-4 and human evaluators also agree that tRAG’s answer quality is at the lower end, with reject rates of 77% and 72% respectively. Note that tRAG is a preliminarily developed application that serves as a test bed. Nevertheless, the result highlights how serious the issue could be if RAG applications were deployed blindly without answer quality control measures in place. This could have adverse consequences for any enterprise’s business operations.
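The binary conversion and the two agreement rates can be computed as in the following sketch. The threshold follows the mapping described above (1 to 3 rejected, 4 and 5 accepted); the helper names and the toy scores in the usage comment are our own and are not data from the experiment.

```python
def to_binary(score: int) -> str:
    """Map a 1-5 rubric score to an accept/reject decision."""
    return "accept" if score >= 4 else "reject"


def agreement_rate(scores_a, scores_b, binary=False):
    """Fraction of answers on which two evaluators agree.

    With binary=False the 5-level scores must match exactly; with
    binary=True only the accept/reject decisions must match.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    if binary:
        scores_a = [to_binary(s) for s in scores_a]
        scores_b = [to_binary(s) for s in scores_b]
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Illustrative usage with made-up scores:
# human, llm = [1, 2, 4, 5, 3], [2, 2, 5, 4, 4]
# agreement_rate(human, llm)               # exact-level agreement
# agreement_rate(human, llm, binary=True)  # accept/reject agreement
```

The toy example shows why the two rates can differ sharply: graders who disagree between 1 and 2, or between 4 and 5, still agree on the accept or reject decision. For scale, 82.6% of the 155 answers corresponds to about 128 matching decisions.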
4.4 Llama 3
Llama 3 is an open-source LLM available in two variants: 8B and 70B parameters. Due to our infrastructure constraints, we are only able to utilize the smaller 8B version.
Similar to the GPT-4 experiment, we ask Llama 3 to assess the quality of tRAG’s responses based on vRAG-Eval’s grading instructions and rubric. Details of results can be found in Appendix I.
The overall exact-match agreement between Llama 3 8B and human evaluators on the 5 quality score levels is 23.9%. After converting the 5-level grades to the answer rejected/accepted categories, we observe only 36.8% alignment.
We hypothesize that, at this relatively small parameter size, the 8B model is not capable enough to serve as a reliable evaluator that assesses answer quality on par with human experts.
5 Future Directions
While LLMs continue to grow in size and intelligence, utilizing powerful LLMs as an automated alternative to human expert graders is a promising and prominent area of research. This approach complements traditional lexical and semantic measurements and demonstrates exceptional explainability and human preference alignment capabilities. We identify several areas for potential future research and practice directions:
Enterprise’s Values In designing the vRAG-Eval grading system, we incorporate honesty into the grading rubric to reflect that human evaluators prefer answers that admit uncertainty over hallucination. We also conduct preliminary experiments on the subject of "verbalized probability" to explore the potential of using self-expressed confidence levels as a factor in measuring an LLM evaluator’s own honesty. Additionally, we consider integrating more ethical aspects, such as inclusiveness and fairness, into the grading metrics to ensure that the system is both effective and responsible.
More Language Support The VND-Bench dataset currently only includes Q&A pairs in English, yet our enterprise is a global company that supports businesses worldwide. We consider extending the dataset to include other languages, reflecting the company’s value of inclusiveness and diversity.
Domain Adaptation We curate the VND-Bench dataset, focusing on transactional data domain knowledge. We consider building benchmark datasets in other areas and study vRAG-Eval’s applicability to explore its potential in those business priorities.
Few-shot Evaluation. Our experiments demonstrate that the GPT-4 evaluator exhibits high agreement rates in accept/reject decisions. We consider analyzing the score discrepancy at each of the five grading levels to better understand the relationship between the reasoning of human and LLM evaluators. We hypothesize that "few-shot learning" can improve the alignment of LLM evaluators with human preferences at a finer granularity.
More Question Categories. The VND-Bench dataset currently focuses on factual knowledge questions in the transactional data domain. We consider expanding to include math and logical reasoning questions, which are essential for evaluating business-centric RAG applications. By developing high-order grading capabilities, we can better support the company’s growth and innovation.
Llama 3 (70B). We contemplate quantizing the 70B model to 4-bit integers and evaluating its grading agreement with human experts. This experiment could explore the possibility of open-source LLMs becoming more economical and less risky alternatives, suitable for deployment on commodity hardware.
6 Conclusion
In this paper, we thoroughly study evaluating Retrieval-Augmented Generation (RAG) applications in a high-stakes business environment where answer quality is paramount.
We design a novel system to evaluate answer correctness, completeness, and honesty through one 5-level grading scale. Furthermore, we cast the fine-grained scores into accepted or rejected categories. This mapping leads to a clearer, more direct comparison between human evaluators and LLMs.
We observe a remarkably high agreement rate of 82.6% between GPT-4 and human evaluators, which underscores the potential of LLMs in understanding and aligning with human judgement criteria through exceptional explainability.
Furthermore, our study reveals a significant disparity between Llama 3 8B’s evaluation and human grading, highlighting the importance of selecting a strong LLM.
Acknowledgments
We thank Ranjan Dutta, Toni Wang, and Ajit Patil for their thoughtful discussions and assistance with data annotation. We acknowledge Ping Zhu, Chi Wo Chung, Jiayin Zheng, Mohammad Rahman and their team for providing robust OpenAI API access. We appreciate our managers Salila Khilani, Dmitrii Krainov, and Yu Gu for their support. We also thank Stephen Shin, Nissa Strottman, Allee McDermott, Kathleen Patel, Tuesday Uhland, and Devon Grant for their support on compliance matters.
References
- Dodge et al. (2020) Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv:2002.06305.
- Es et al. (2023) Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2023. Ragas: Automated evaluation of retrieval augmented generation. arXiv:2309.15217.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In ICLR 2021.
- Leng et al. (2023) Quinn Leng, Kasey Uhlenhuth, and Alkis Polyzotis. 2023. Best practices for llm evaluation of rag applications. Databricks Engineering Blog.
- Lewis et al. (2021) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv:2005.11401.
- Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Teaching models to express their uncertainty in words. arXiv:2205.14334.
- Meta (2024) Meta. 2024. Introducing meta llama 3: The most capable openly available llm to date.
- OpenAI (2022) OpenAI. 2022. New and improved embedding model.
- OpenAI (2023) OpenAI. 2023. New models and developer products announced at devday.
- Ovadia et al. (2024) Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. 2024. Fine-tuning or retrieval? comparing knowledge injection in llms. arXiv:2312.05934.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Risch et al. (2021) Julian Risch, Timo Möller, Julian Gutsch, and Malte Pietsch. 2021. Semantic answer similarity for evaluating question answering models. arXiv:2108.06130.
- Saad-Falcon et al. (2024) Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. Ares: An automated evaluation framework for retrieval-augmented generation systems. arXiv:2311.09476.
- Wang et al. (2024) Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, and Bryan Catanzaro. 2024. Instructretro: Instruction tuning post retrieval-augmented pretraining. arXiv:2310.07713.
- Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In ICLR 2020.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
Appendix A tRAG: Knowledge Preprocessing Workflow and Configurations
![[Uncaptioned image]](extracted/5712610/images/preprocessing.png)
Parameter | Value |
---|---|
Chunk size | 1000 |
Chunk overlap | 25% |
Embedding model | text-embedding-ada-002 |
Indexing | Hierarchical Navigable Small World (HNSW) |
Distance | Squared L2 Norm |
Top-K | 3 |
Embeddings | 13644 |
Appendix B tRAG: Document Retrieval and Answer Generation Workflow
![[Uncaptioned image]](extracted/5712610/images/rag.png)
Appendix C Answer Quality Evaluation Workflow
![[Uncaptioned image]](extracted/5712610/images/qa.png)
Appendix D An Anomaly Transaction Question and Its Label
Question | Label |
---|---|
I have following queries related to data in [anomaly_tranx_table_name]. There are around [number_of_records] records in this table where [purchase_date_field_name] is greater than [post_date_field_name]. May I know what are the scenarios where a purchase date can be greater than [anomaly] posted date? There are records where same [tranx_id_field_name] have more than one record. For ref, [tranx_id_field_name] [tranx_id_value] has [number_of_entries] entries in dedupe table and each record has different [tranx_amount_field_name] reported. May I know why there can be more than 1 entry for the same [tranx_id_field_name] | [tranx_id_field_name] may be used to tie multiple clearing records to one authorisation, as is the case with split shipments (one order placed online will have original authorisation, followed by multiple shipments, each one cleared separately) |
Appendix E tRAG’s Answer to the Anomaly Transaction Question and GPT-4’s Grading
tRAG’s Answer | Assessment |
---|---|
Based on the context provided, I can provide some general insights into your queries. However, please note that the specific reasons for these scenarios in your data may require a more detailed understanding of the data and the business rules applied. 1. Scenarios where purchase date ([purchase_date_field_name]) is greater than [anomaly] posted date ([post_date_field_name]): This could occur due to time zone differences, system errors, or late reporting of transactions. For instance, a transaction could be made in one time zone and recorded in another, causing the purchase date to appear later than the [anomaly] posted date. It could also be due to delays in transaction posting or reporting. 2. Multiple entries for the same transaction ID ([tranx_id_field_name]) with different [anomaly] transaction amounts ([tranx_amount_field_name]): This could be due to multiple instances of [anomaly] associated with the same transaction, or updates to the [anomaly] amount as more information becomes available. It could also be a result of data errors or duplications. Without more specific information about the data and the business rules applied, it’s difficult to provide a more precise explanation. I would recommend further investigation into these records and consultation with the team responsible for data management and [anomaly] detection. | Score: [[2]], Reason: [[The RAG’s response provides general insights into the user’s queries but does not align with the specific answer provided in the Label. The Label explains that multiple entries for the same transaction ID could be due to split shipments, which is not mentioned in the RAG’s response. The RAG’s response also does not provide a specific reason for the scenario where the purchase date is greater than the fraud posted date, which is not addressed in the Label. Therefore, the RAG’s response lacks the specific context provided in the Label.]] |
Appendix F Early Design of Grading Instructions, Rubric, and GPT-4’s Assessment
Please act as an impartial judge and evaluate the quality of the answer provided by a RAG application.
Your evaluation should consider correctness, completeness, and honesty.
You will be given a reference answer, and the RAG application’s answer,
then grade the quality of the RAG application’s answer based on rubric below:
[The Start of Grading Rubric]
{rubric}
[The End of Grading Rubric]
Be as objective as possible. Output your final verdict by strictly following this format: "[[rating]]",
then providing your explanation.
For example: "Rating: [[5]], Reason: [[The answer is correct and comprehensive]].
1: completely incorrect, hallucination |
2: admits it cannot answer or lack of context, honest |
3: pertinent but contains noticeable errors or inaccuracies |
4: acceptable answer, adequate but not comprehensive |
5: fully accurate and exhaustive |
Rating: [[4]], Reason: [[The RAG application’s answer is acceptable and provides a general understanding of the possible reasons for the user’s queries. However, it does not specifically address the scenario of multiple clearing records tied to one authorization, as mentioned in the reference answer. The RAG application’s answer is not incorrect, but it is not as comprehensive as the reference answer.]]
Appendix G A Transaction Code Question and Label
Question | Label |
I would like to browse transactions - invoice from [tranx_code_value]. Could you please advice if the data is availabe somewhere. | [tranx_code_value]s are carried in [tranx_detail_table_name] [tranx_code_field_name] = [tranx_code_value]. [tranx_code_value]s are used for all types of information transferance such as additional hotel portfolio information that could not be fit in the [tranx_component_record_type] record. |
Appendix H tRAG’s Answer to the Transaction Code Question and GPT-4’s grading justification
tRAG’s Answer | GPT-4’s Grading |
Based on the context provided, it appears that the data for [tranx_code_value] - [tranx_component_record_type] - Commercial Card - Invoice (Line Detail) is contained within specific tables. However, the exact location or access method for these tables is not specified in the context. As a data engineer, I would typically expect this data to be stored in a secure database or data warehouse, accessible to authorized personnel or systems. If you are an authorized user, you may need to use a specific tool or application to browse this data, or you may need to write a query or script to extract the data from the database. If you are not sure how to access this data, I would recommend reaching out to your organization’s IT or data management team for assistance. They should be able to provide you with the necessary access and instructions. Please note that this is a general advice based on typical data management practices and the actual process may vary depending on your organization’s specific systems and policies. | Score: [[1]], Reason: [[The RAG’s response does not align with the Label. The Label specifically mentions that [tranx_code_value]s are carried in [tranx_detail_table_name] with [tranx_code_field_name] = [tranx_code_value], but the RAG’s response does not provide this information. Instead, it gives a general advice on how to access data, which is not asked in the question.]] |
Appendix I Llama 3
Llama 3 8B assigns a quality score of 4 to 80% of tRAG’s answers, and a score of 3 to just one answer. The score distribution is depicted in the left-hand figure of Figure A1.
![Refer to caption](extracted/5712610/images/llama3.png)
![Refer to caption](extracted/5712610/images/llama3correlation.png)
The skewness leads us to speculate that, at this relatively small parameter size, the 8B model may not be capable enough to discern answer quality in borderline cases, and that it may avoid assigning extreme scores when it is not confident in its grading. To test the hypothesis, we add an instruction, shown in Table A3, to elicit Llama 3’s confidence level. This approach, termed “verbalized probability” by Lin et al. (2022), allows the LLM to express its confidence level in natural language without the use of model logits.
Present your final score in the format: "[[score]]",
followed by your confidence level of the grading in the range of 0 to 100, with 100 being very confident and 0 being not sure about your grading at all. At the end, disclose your justification. Example: Score: [[3]], Confidence: [[50]], Reason: [[The RAG’s response partially aligns with the Label but with some discrepancies]].
The right-hand figure in Figure A1 illustrates the correlation between the quality scores of answers and the confidence of the 8B evaluator.
Hendrycks et al. (2021) show that LLMs should be calibrated. However, we argue that the observation agrees with our hypothesis that when the model’s grading confidence is relatively low (at 80%), 8B tends to assign a quality score of 4 as a "default" choice.
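The relationship between verbalized confidence and assigned score can be summarized with a short script such as the one below. It is a minimal sketch that assumes responses follow the "Score: [[..]], Confidence: [[..]]" format of Table A3; the function names are hypothetical and the pairs in the usage comment are made up rather than taken from the experiment.

```python
import re
from collections import defaultdict

# Matches e.g. 'Score: [[3]], Confidence: [[50]], Reason: [[...]]'.
CONFIDENCE_PATTERN = re.compile(
    r"Score:\s*\[\[(\d)\]\]\s*,\s*Confidence:\s*\[\[(\d{1,3})\]\]"
)


def parse_score_and_confidence(response: str) -> tuple[int, int]:
    """Extract (score, confidence) from a response in the Table A3 format."""
    match = CONFIDENCE_PATTERN.search(response)
    if match is None:
        raise ValueError("response does not follow the expected format")
    score, confidence = int(match.group(1)), int(match.group(2))
    if not (1 <= score <= 5 and 0 <= confidence <= 100):
        raise ValueError("score or confidence is out of range")
    return score, confidence


def mean_confidence_by_score(pairs):
    """Average verbalized confidence for each assigned quality score."""
    buckets = defaultdict(list)
    for score, confidence in pairs:
        buckets[score].append(confidence)
    return {score: sum(vals) / len(vals) for score, vals in buckets.items()}


# Illustrative usage with made-up pairs:
# mean_confidence_by_score([(4, 80), (4, 90), (5, 100)])
```

Grouping confidence by score in this way is one simple check of whether a particular score, such as 4, is disproportionately associated with lower confidence.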