Evaluating Large Language Models with fmeval

Pola Schwöbel (ORCID 0000-0003-4846-1917), Amazon Web Services, Berlin, Germany
Luca Franceschi (ORCID 0000-0002-1810-1016), Amazon Web Services, Berlin, Germany
Muhammad Bilal Zafar (ORCID 0000-0003-4846-1917), Ruhr University Bochum, Bochum, Germany
Keerthan Vasist, Amazon Web Services, Santa Clara, USA
Aman Malhotra, Amazon Web Services, Santa Clara, USA
Tomer Shenhar, Amazon Web Services, New York, USA
Pinal Tailor, Amazon Web Services, Arlington, USA
Pinar Yilmaz, Amazon Web Services, Santa Clara, USA
Michael Diamond, Amazon Web Services, Santa Clara, USA
Michele Donini (ORCID 0000-0002-9769-3899), Amazon Web Services, Berlin, Germany
Abstract.

fmeval is an open source library to evaluate large language models (LLMs) in a range of tasks. It helps practitioners evaluate their model for task performance and along multiple responsible AI dimensions. This paper presents the library and exposes its underlying design principles: simplicity, coverage, extensibility and performance. We then present how these were implemented in the scientific and engineering choices taken when developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and further work in the development of the library. fmeval can be found at https://github.com/aws/fmeval.


1. Introduction

The advent of foundation models (FMs) as the workhorse for generative AI has revolutionized machine learning (ML). The potential for automation on an unprecedented scale promises efficiency leaps in a wide range of industries, such as finance, healthcare, public service, travel and hospitality. Meanwhile, the risks associated with generative AI have been well-publicized, especially for models in the language domain (Bender et al., 2021; Blodgett et al., 2020; Sheng et al., 2021; Wang et al., 2023). Large language models (LLMs) are trained on volumes of data, including undesirable content laden with historical biases, undemocratic viewpoints or hate speech. As a consequence, LLMs are at risk of regurgitating toxicity (Gehman et al., 2020; Dhamala et al., 2021), stereotypes (Nadeem et al., 2020; Nangia et al., 2020; Nozza et al., 2021; Abid et al., 2021), and non-truthful outputs (Lin et al., 2021; Ji et al., 2023). Such model behaviors can cause harm to users, damage an organization’s reputation and jeopardize customer trust. Additionally, ethical and safety dimensions of ML models have recently been under close regulatory scrutiny via guidelines and regulations, such as ISO 42001 or the EU AI Act.

Detecting and managing risks, as prescribed by such guidelines, is challenging. ML engineers and data scientists have to leave their development environment to use academic tools and benchmarking sites, which require highly specialized knowledge. The sheer number of metrics makes it hard to identify the ones that are truly relevant for their use cases, while evaluating all of them leads to high compute costs and can take several days to run for a single model. This tedious process then needs to be repeated frequently as new models are released and existing ones are fine-tuned.

Simplifying this process, fmeval provides users a single place to evaluate and compare metrics during the model selection and model customization workflow with minimal effort. fmeval measures model accuracy as well as responsible AI (RAI) aspects such as robustness, toxicity and bias out of the box for many LLMs. On top of these predefined evaluations, users can extend the framework with custom evaluation datasets and custom evaluation algorithms unique to their specific use cases. Presenting outcomes in an interpretable manner, reports are automatically generated for each evaluation job. fmeval is available open-source and is natively integrated into Amazon Bedrock and Amazon SageMaker JumpStart, reducing MLOps overhead for users.

The remainder of this paper is structured as follows. Section 2 introduces the criteria that guided the design of fmeval and Section 3 reviews related work. Section 4 provides an in-depth examination of the architecture and components of fmeval, including the data, models, evaluations, and visualizations. Section 5 then presents our selected set of built-in evaluations and datasets, including the rationale behind these choices and their limitations. Section 6 describes the integration with AWS systems. Section 7 presents case studies demonstrating the benefits of using fmeval to select and evaluate LLMs. Finally, Section 8 concludes by discussing planned next steps for fmeval.

2. Design desiderata

Practitioners looking to evaluate foundation models have a variety of needs. We formalize these needs into the following desiderata for fmeval.

  1. Simplicity – One does not need to be an expert in responsible AI to use fmeval. We have distilled a vast body of literature into our built-in evaluations, producing easy-to-understand metrics. When used within the AWS infrastructure (see §6), we provide a UI to create evaluations with a few clicks. This reduces operational overhead and speeds up time to value. By making fmeval accessible for users who are not ML specialists, we aim to empower the democratization of ML as a safe and responsible practice.

  2. Coverage – fmeval evaluates a wide range of LLMs for both quality and responsibility. This includes native support for Amazon SageMaker JumpStart and Amazon Bedrock models as well as examples from HuggingFace and third party model providers (see §4.2). Similarly, our built-in evaluations cover many common tasks (see §5).

  3. Extensibility – Additionally, we support the use of custom datasets and evaluations via bring-your-own (BYO) functionalities. The motivation for extensibility is two-fold: first, we want to support domain-specific use cases and needs that are not covered by existing benchmarks. Second, since diverse perspectives are crucial to RAI, we want to enable the open source community to contribute to fmeval.

  4. Performance – LLM evaluation workloads can be large, hence speed and scalability of the evaluations are crucial.

In summary, our goal is to balance broad coverage of use cases with simplicity and ease of use, including for non-experts. Our approach to navigating this trade-off consists of offering built-in evaluations (see §5) combined with a bring-your-own functionality for users with additional evaluation needs. When designing the built-in components we aimed to distill existing literature into a minimal set of evaluations with maximal coverage.

3. Related work

Existing LLM evaluation frameworks. Evaluating LLMs has gained significant attention in recent years and several evaluation frameworks already exist. Popular examples include HELM (Liang et al., 2022), HuggingFace Evaluate (hug, 2020), OpenAI Evals (ope, 2023), EleutherAI LM Evaluation Harness (Gao et al., 2021) and DecodingTrust (Wang et al., 2023).

However, none of these frameworks meets all the desiderata listed in Section 2. For instance, HuggingFace Evaluate enforces a restrictive API that requires the model predictions and reference outputs to be computed in advance, thus limiting Extensibility and Simplicity. OpenAI Evals focuses on evaluating OpenAI models, thus offering limited Coverage. At the other end, evaluation frameworks such as HELM prioritize Coverage at the expense of Simplicity. The sheer number of scores produced by their evaluations, while immensely useful in an academic or research context, might be less interpretable for some users. Instead, we home in on a few key evaluations, guiding our users in navigating the extensive literature on responsible AI.

LLM evaluation metrics. Metrics for evaluating LLMs are an active research area. Evaluation metrics can generally be divided into the following categories: (1) Human evaluation metrics: Metrics such as response quality or hallucination are often evaluated by humans who annotate each model output (Ouyang et al., 2022). (2) Model-based evaluation metrics: Metrics that use another model (usually a second LLM) to rate the quality of the response. Examples include BERTScore (Zhang et al., 2019) and Faithfulness (Es et al., 2023) in a Retrieval Augmented Generation (RAG; Lewis et al., 2020) setting. (3) Reference-based metrics: Metrics like ROUGE and F1-score that require a reference answer. Unlike human evaluation metrics, metrics in this class do not require each individual model output to be annotated in real time, but only once during data collection.

fmeval currently offers metrics in categories 2 and 3 but can also be extended to include metrics from category 1.

4. Architecture

Generally, to perform an evaluation one queries the model on a series of inputs from one or more datasets. For example, to evaluate how well a model can summarize text, we prompt it to summarize the reports from the Government Report dataset (Huang et al., 2021). Then, the model outputs are scored under one or more metrics, usually against ground-truth outcomes. For the summarization example, the model summaries are compared to the reference summaries included in Government Report using ROUGE-N, METEOR and other metrics (see §5.2 for details on the summarization accuracy evaluation and its metrics).

Although different evaluations require different data, metrics, and processing logic, a core of functionalities is shared among many of them. In this section we outline and describe the main components of fmeval, discussing design choices and relating them to our desiderata. In the next section, we dive deep into the built-in evaluations which we provide. The library consists of the following main building blocks:

  1. Data components, for loading and managing datasets

  2. Model components, for interacting with the LLM under evaluation

  3. Evaluation components, containing core evaluation logic, metrics and auxiliary models (e.g., toxicity detectors)

  4. Reporting components, for producing summary reports and plots.

Figure 1. High-level component interaction in fmeval. The user creates a ModelRunner and a DataConfig, and passes them to an implementation of EvalAlgorithmInterface. The evaluation algorithm loads data based on the DataConfig, executes the algorithm, and returns the result as an EvalOutput object. This can be visualized using the reporting module.

4.1. Data components

The fmeval data-loading module supports loading of JSON and JSONLines files as Ray datasets. We have opted to use Ray as our distributed framework because it provides a Python-native user experience. A computing framework such as Ray allows fmeval algorithms to be executed in a distributed and parallel fashion out of the box (improving Performance, §2). We also considered other environments such as PySpark, and chose Ray mainly for maintainability – debugging and troubleshooting have proven easier compared to PySpark. Ray can also be set up seamlessly in a cluster environment. In internal benchmarking, we found Ray’s performance to match or exceed that of PySpark in our use cases.

Users interact with datasets via the DataConfig dataclasses. The library includes several predefined DataConfigs consumed by the built-in evaluations for Simplicity (§2). These point to open-source datasets which we preprocessed and stored in Amazon S3; we will return to these datasets in more detail in Section 5. At the same time, users can load their own datasets by defining appropriate DataConfigs (Extensibility, §2). Users specify the location of the dataset (this could be a local path or a remote Amazon S3 address), alongside the dataset name used in reporting, and information such as input and target field names. As we seek to enable a flexible data processing pipeline where relevant information can be stored in free-form JSON objects, we support field extraction via JMESPath strings (please refer to https://jmespath.org/ for documentation).
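To make the idea of field extraction from free-form JSON records concrete, the following is a minimal sketch using only the Python standard library. It is not fmeval's DataConfig: the class `SimpleDataConfig`, its field names and the helpers `extract` and `load_jsonlines` are hypothetical, and the dotted-path lookup is only a simplified stand-in for JMESPath expressions.

```python
import json
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class SimpleDataConfig:
    """Hypothetical, simplified stand-in for a dataset configuration."""
    dataset_name: str            # name used in reporting
    model_input_location: str    # dotted path to the model input field
    target_output_location: str  # dotted path to the target field


def extract(record: dict, path: str) -> Any:
    """Follow a dotted path such as 'qa.question' through nested dicts."""
    value: Any = record
    for key in path.split("."):
        value = value[key]
    return value


def load_jsonlines(text: str, config: SimpleDataConfig) -> List[Tuple[Any, Any]]:
    """Parse JSON Lines text into (model input, target output) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append((extract(record, config.model_input_location),
                      extract(record, config.target_output_location)))
    return pairs


raw = "\n".join([
    '{"qa": {"question": "Capital of France?", "answer": "Paris"}}',
    '{"qa": {"question": "Capital of Italy?", "answer": "Rome"}}',
])
config = SimpleDataConfig("toy_qa", "qa.question", "qa.answer")
pairs = load_jsonlines(raw, config)
print(pairs)
```

Keeping the field locations in the configuration rather than in the loading code is what lets the same loader serve datasets with different record layouts.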

4.2. Model components

One intent of the library is to provide wide model Coverage (§2), where our primary targets are auto-regressive LLMs. These models may be deployed in diverse environments (e.g. local, remote, clusters, closed-source APIs) and can have different querying mechanisms and output formats. In order to simplify the execution of evaluation algorithms we abstract away these differences under a common interface called ModelRunner.

The abstract class has a single method, predict, which expects an input prompt and returns a pair consisting of the text output and, if available, the log probability of the input string. (The latter, as we shall see, is used in the built-in stereotyping evaluation; many closed-source systems do not feature such output.) The library includes three built-in model runners for Amazon SageMaker, Amazon SageMaker JumpStart and Amazon Bedrock models (see also §6) and two example implementations for OpenAI and HuggingFace models, showing how the library can be extended to perform evaluations on a wide range of frameworks and providers.

Internally, ModelRunner utilizes content templates, composers, and extractors. The content template and composer are responsible for creating the payload that is sent to the model, while the extractor parses the model response to retrieve the generated text output and the input log-probability (if available) for the model response. These components serve a dual purpose: first, they allow passing (sampling) parameters such as temperature, top-$k$ tokens and top-$p$ mass; and second, they handle format conversion where needed (e.g. from plain strings to JSON and vice-versa). The extractor is compatible with JMESPath strings to flexibly parse JSON responses, if needed. Finally, the accept_type and content_type constructor arguments specify the input data format.
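The interplay of template, composer and extractor can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `SketchModelRunner`, `EchoRunner`, the `$prompt` placeholder and the fake endpoint are all hypothetical, and only the shape of `predict` (a prompt in, a pair of text and optional log probability out) follows the description above.

```python
import json
from abc import ABC, abstractmethod
from typing import Optional, Tuple


class SketchModelRunner(ABC):
    """Hypothetical interface mirroring the description in the text."""

    @abstractmethod
    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        """Return (generated text, input log probability or None)."""


class EchoRunner(SketchModelRunner):
    """Toy runner whose fake 'endpoint' upper-cases the prompt."""

    def __init__(self, content_template: str, temperature: float = 0.0):
        self.content_template = content_template
        self.temperature = temperature

    def _compose(self, prompt: str) -> str:
        # Composer: fill the template to build the JSON request payload
        # and attach a sampling parameter.
        body = self.content_template.replace("$prompt", json.dumps(prompt))
        payload = json.loads(body)
        payload["temperature"] = self.temperature
        return json.dumps(payload)

    def _fake_endpoint(self, payload: str) -> str:
        # Stands in for a deployed model; returns a JSON response string.
        request = json.loads(payload)
        return json.dumps({"output": {"text": request["inputs"].upper()}})

    def _extract(self, response: str) -> Tuple[Optional[str], Optional[float]]:
        # Extractor: parse the response; this toy model has no log-probs.
        parsed = json.loads(response)
        return parsed["output"]["text"], None

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        return self._extract(self._fake_endpoint(self._compose(prompt)))


runner = EchoRunner(content_template='{"inputs": $prompt}', temperature=0.2)
text, log_prob = runner.predict("Berlin is the capital of")
print(text, log_prob)
```

Because an evaluation would only ever call `predict`, swapping the fake endpoint for a real one would leave the evaluation logic untouched.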

4.3. Evaluation components

The evaluation algorithms are the core of fmeval. We will describe the details of built-in evaluations in Section 5. Architecturally, evaluations implement an abstract interface called EvalAlgorithmInterface that contains two methods: evaluate, which applies the evaluation logic to an entire dataset, and evaluate_sample, which processes a single sample. The evaluate method takes an instance of ModelRunner, a list of DataConfig objects, and optionally a prompt template (for built-in evaluation algorithms, default prompt templates are defined in the library). It returns a list of EvalOutput objects – one for each input dataset – which contain aggregated results of the evaluation. The main scoring logic is implemented in evaluate_sample and varies depending on the particular evaluation.

For Extensibility (see §2), users can add custom evaluations. This requires implementing the EvalAlgorithmInterface with custom logic for the evaluate and evaluate_sample methods.
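The split between per-sample scoring and dataset-level aggregation can be sketched as below. All names here (`SketchEvalAlgorithm`, `SketchEvalOutput`, `StartsWithTarget`, the `$model_input` placeholder and the toy model) are hypothetical, and the signatures are simplified: the model is a plain callable and datasets are in-memory lists rather than ModelRunner and DataConfig objects.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SketchEvalOutput:
    """Hypothetical container for aggregated results on one dataset."""
    dataset_name: str
    scores: Dict[str, float]


class SketchEvalAlgorithm(ABC):
    """Hypothetical two-method interface following the text."""

    @abstractmethod
    def evaluate_sample(self, target: str, model_output: str) -> Dict[str, float]:
        """Score a single sample."""

    def evaluate(self, model: Callable[[str], str],
                 datasets: Dict[str, List[Tuple[str, str]]],
                 prompt_template: str = "$model_input") -> List[SketchEvalOutput]:
        """Query the model on every sample and average the per-sample scores."""
        outputs = []
        for name, samples in datasets.items():
            totals: Dict[str, float] = {}
            for model_input, target in samples:
                prompt = prompt_template.replace("$model_input", model_input)
                scores = self.evaluate_sample(target, model(prompt))
                for metric, value in scores.items():
                    totals[metric] = totals.get(metric, 0.0) + value
            count = max(len(samples), 1)
            outputs.append(SketchEvalOutput(
                name, {metric: total / count for metric, total in totals.items()}))
        return outputs


class StartsWithTarget(SketchEvalAlgorithm):
    """Custom evaluation: does the output start with the target string?"""

    def evaluate_sample(self, target: str, model_output: str) -> Dict[str, float]:
        hit = model_output.strip().lower().startswith(target.lower())
        return {"starts_with_target": float(hit)}


def toy_model(prompt: str) -> str:
    # Stand-in for a real model: it knows exactly one fact.
    return "Paris, of course." if "France" in prompt else "I am not sure."


data = {"toy_capitals": [("Capital of France?", "Paris"),
                         ("Capital of Italy?", "Rome")]}
results = StartsWithTarget().evaluate(
    toy_model, data, prompt_template="Answer briefly: $model_input")
print(results)
```

In this sketch a custom evaluation only overrides `evaluate_sample`; the shared `evaluate` loop handles prompting and averaging, returning one output object per dataset.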

4.4. Reporting components

Finally, the reporting components provided in the fmeval library allow users to create markdown reports that include numerical results, examples of model inputs and outputs, and plots. The reports are organized in EvalOutputCell objects and can be visualized in Jupyter notebooks or the console, or written to the file system. These can later be converted to PDF or HTML files as required to create consolidated reports (see §C, Fig. 11).

Table 1. Task-evaluation pairings.
Evaluations (columns): Task Accuracy | Semantic Robustness | Factual Knowledge | Prompt Stereotyping | Toxicity
Tasks (rows): Open-Ended Generation | Summarization | QA | Classification

5. Built-in evaluations

Practitioners use LLMs to solve different tasks. fmeval currently covers the following four commonly faced tasks: open-ended language generation, summarization, question answering (QA) and text classification.

For these tasks, we offer the following five built-in evaluations: accuracy, semantic robustness, toxicity, prompt stereotyping and factual knowledge. For each task we recommend that multiple – though not all – evaluations be performed. Table 1 shows which evaluations we suggest for each task. For example, when a model is deployed for summarization, it makes sense to evaluate accuracy (how well did it work?), robustness (did typos in the text interfere with the model’s ability to accurately summarize it?) and toxicity (did the model use toxic language in its summary?).

We will now review each built-in evaluation in detail, focusing on the rationale behind metric and dataset choices, and discussing limitations. For each built-in evaluation, we defer details, examples and further background to the appendix (§A).

5.1. Classification accuracy

5.1.1. Background

Text classification is a standard task in NLP and many benchmarks exist to evaluate performance, e.g., GLUE (Wang et al., 2018), SentEval (Conneau and Kiela, 2018) and parts of HELM (Liang et al., 2022). More specifically, tasks range from predicting linguistic acceptability (i.e., classifying whether a sentence is grammatical or not (Warstadt et al., 2019)) and opinion polarity (Wiebe et al., 2005) to sentiment analysis. Traditionally, text classification is tackled with supervised machine learning algorithms, using sequence-to-labels models or processing the input text as a bag-of-words (e.g., for sentiment analysis (Turney, 2002; Pang et al., 2002, 2008; Socher et al., 2013)). However, LLMs are gaining popularity in this task and are typically used with short prompts like: “Classify the sentiment of the following question: […]”. In this scenario, an additional challenge is the correct parsing and extraction of a class label from the generated output.

5.1.2. Built-in datasets

Women’s E-Commerce Clothing Reviews is a dataset of clothing reviews where target labels are either binary (overall sentiment of the review) or on a 1 to 5 scale for multiclass classification.

5.1.3. Built-in metrics

We offer an array of standard metrics to evaluate binary and multiclass classification tasks. 0-1-score measures whether the predicted label matches the target label and, averaged over the dataset, yields the standard accuracy score. Precision measures the fraction of true positives over predicted positives. We expose a multiclass_average_strategy parameter that determines how the scores are aggregated across classes.

Recall measures the fraction of true positives over ground-truth positives. Similar to precision, the behavior is controlled by the parameter multiclass_average_strategy.

Balanced accuracy is the recall averaged over classes; it coincides with standard accuracy when all classes are equally frequent. For all these metrics, higher is better. They can be aggregated over the whole dataset or over categories.
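As a worked illustration of these definitions on a small multiclass example, the sketch below computes accuracy, macro-averaged precision and recall, and balanced accuracy as averaged per-class recall. The function names are hypothetical, and macro averaging is assumed here as just one possible setting of the averaging strategy; the classes are taken from the target labels only.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_counts(targets: List[str],
                     predictions: List[str]) -> Tuple[Counter, Counter, Counter]:
    """Count true positives, predicted positives and actual positives per class."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for target, prediction in zip(targets, predictions):
        predicted[prediction] += 1
        actual[target] += 1
        if target == prediction:
            true_pos[target] += 1
    return true_pos, predicted, actual


def classification_scores(targets: List[str],
                          predictions: List[str]) -> Dict[str, float]:
    true_pos, predicted, actual = per_class_counts(targets, predictions)
    classes = sorted(actual)  # classes that occur in the targets
    # Macro averaging: compute the score per class, then take the plain mean.
    precisions = [true_pos[c] / predicted[c] if predicted[c] else 0.0
                  for c in classes]
    recalls = [true_pos[c] / actual[c] for c in classes]
    return {
        "accuracy": sum(true_pos.values()) / len(targets),
        "macro_precision": sum(precisions) / len(classes),
        "macro_recall": sum(recalls) / len(classes),
        # Averaged per-class recall, i.e. the same quantity as macro recall.
        "balanced_accuracy": sum(recalls) / len(classes),
    }


targets = ["1", "1", "1", "1", "3", "5"]
predictions = ["1", "1", "1", "3", "3", "1"]
scores = classification_scores(targets, predictions)
print(scores)
```

On this imbalanced toy data accuracy is 4/6, while the averaged per-class recall is lower (1.75/3) because the rare class "5" is never predicted.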

5.1.4. Limitations

When using a general-purpose language model, responses are strings. The provided convert_model_output_to_label function looks for any valid label in the string output and extracts it. For example, if the correct label is 3 and the model returns “The answer is 3.”, this output is considered correct. If no valid label is found, we mark the model output as an “unknown” answer (typically incorrect). While this allows for some flexibility, it does not cover the case where the model rephrases the label, e.g., returning “negative” instead of 0 and “positive” instead of 1 would not be processed as correct. Users may also provide a custom convert_model_output_to_label function.
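The behavior described above can be sketched with a small helper. This is a hypothetical re-implementation for illustration, not the library's convert_model_output_to_label: the name `extract_label`, the whole-word matching and the choice of the earliest match are all assumptions.

```python
import re
from typing import List


def extract_label(model_output: str, valid_labels: List[str],
                  unknown: str = "unknown") -> str:
    """Return the valid label that occurs first, as a whole word, in the output."""
    best_label, best_position = unknown, len(model_output) + 1
    for label in valid_labels:
        # \b ensures that label "3" does not match inside "13".
        match = re.search(rf"\b{re.escape(label)}\b", model_output)
        if match and match.start() < best_position:
            best_label, best_position = label, match.start()
    return best_label


labels = ["1", "2", "3", "4", "5"]
print(extract_label("The answer is 3.", labels))
print(extract_label("negative", ["0", "1"]))
```

The second call shows the limitation noted above: a rephrased label such as “negative” is not mapped to 0 and ends up as “unknown”.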

5.2. Summarization accuracy

5.2.1. Background

Summarization was historically performed by specialized algorithms, but LLMs now achieve impressive performance on this task (Zhang et al., 2023). General-purpose LLMs may be instructed with short prompts such as “Summarize the following: […]” while fine-tuned or purpose-built models may not need any additional context. Given an original text, extractive summarization consists of selecting a few passages from the text in order to produce a summary. Abstractive summaries may instead rephrase and modify the text while preserving its meaning. Evaluating summaries, especially abstractive ones, is a notoriously challenging task that requires understanding to what extent a summary covers the original text, as well as many other dimensions such as coherence and fluency (Fabbri et al., 2021a).

5.2.2. Built-in datasets

We use the Government Report Dataset (Huang et al., 2021) for this evaluation.

5.2.3. Built-in metrics

ROUGE-N (Lin, 2004) is a class of metrics that compute N-gram word overlaps between reference and model summary. The metrics are case insensitive and the values are in the range of 0 (no match) to 1 (perfect match). For the choice and effect of the parameter $N$, see §A.2. METEOR (Banerjee and Lavie, 2005) is similar to ROUGE-1, but includes stemming (with a Porter stemmer) and synonym matching via synonym lists (e.g., “rain” matches with “drizzle”). BERTScore (Zhang et al., 2019) uses a second ML model from the BERT family to compute embeddings of reference and predicted summaries, and compares their cosine similarity. Users can choose from two embedding models.
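To illustrate the N-gram overlap underlying ROUGE-N, here is a simplified sketch. It assumes whitespace tokenization and lower-casing only, with no stemming, so its values may differ from those of established ROUGE implementations; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def ngrams(text: str, n: int) -> Counter:
    """Multiset of lower-cased word n-grams in the text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, candidate: str, n: int = 2) -> Dict[str, float]:
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
candidate = "The cat lay on the mat"
rouge_1 = rouge_n(reference, candidate, n=1)
rouge_2 = rouge_n(reference, candidate, n=2)
print(rouge_1, rouge_2)
```

Replacing one word changes one unigram but two bigrams, so the score drops faster for larger N: here the recall is 5/6 for N=1 and 3/5 for N=2.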

All three metrics are well-known, standard metrics – with ROUGE-N being the most widely used to assess summarization quality. We added METEOR and BERTScore for additional linguistic flexibility. Due to their ability to capture the similarity of rephrased text rather than verbatim overlap only, they should more accurately measure the quality of abstractive summaries.

5.2.4. Limitations

While METEOR and BERTScore are more suitable for evaluating abstractive summaries than ROUGE, they still do not capture the full complexity of the task. Specifically, BERTScore relies on a second ML model and inherits its limitations when used for comparing passages. Fully automated metrics for abstractive summarization quality remain an active research area (Liang et al., 2022; Fabbri et al., 2021a, b).

5.3. QA accuracy

5.3.1. Background

This evaluation measures how well the model performs in question answering (QA) tasks. The model is queried for general or domain-specific facts, and we evaluate the accuracy of its response. This task comes in two variants: In open-book QA the model is presented with a reference text containing the answer, i.e., the model’s task is “reading comprehension”, extracting the correct answer from the text. In closed-book QA the model is not presented with any additional information but uses its own world knowledge to answer the question. See (Rogers et al., 2023) for a detailed taxonomy of QA tasks and benchmarks.

5.3.2. Built-in datasets

We use the BoolQ (Clark et al., 2019), NaturalQuestions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017) datasets. Ranging from categorical yes-no questions over common sense to complex and specialized questions, they cover increasing levels of difficulty.

5.3.3. Built-in metrics

The metrics below evaluate a model’s QA performance by comparing its output to the ground-truth answer included in the dataset. This comparison can be performed in different ways. Exact match (EM): Binary score, 1 if model output and answer match exactly. Quasi-exact match (Liang et al., 2022): Binary score. Similar to exact match, but both model output and answer are first normalized by removing any articles and punctuation, as these usually do not impact correctness for natural language questions. Punctuation might matter for other domains such as code generation, so following HELM (Liang et al., 2022) we provide both Exact and Quasi-Exact Match metrics. Precision over Words: Precision score (see §A.3 for definition). The text is normalized as before. Recall over Words: Recall over words on normalized text. F1 over Words: The harmonic mean of precision and recall, over words (normalized). Precision, Recall and F1 over Words are more flexible as they assign non-zero scores to model answers containing parts of the ground truth. Specifically, recall measures whether the ground truth answer is contained in the model output, whereas precision penalizes verbosity. Comparing the metrics can yield additional insights into model idiosyncrasies, as we will see in §7. All metrics are reported on average over the whole dataset, or per category, resulting in a number between 0 (worst) and 1 (best) for each metric.
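The following sketch computes these five scores for a single question. It rests on assumptions beyond the text: normalization additionally lower-cases and collapses whitespace, words are compared as bags (multisets), and the names `normalize` and `qa_scores` are hypothetical.

```python
import re
import string
from collections import Counter
from typing import Dict

ARTICLES = re.compile(r"\b(a|an|the)\b")


def normalize(text: str) -> str:
    """Lower-case, then drop punctuation, articles and extra whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = ARTICLES.sub(" ", text)
    return " ".join(text.split())


def qa_scores(model_output: str, target: str) -> Dict[str, float]:
    norm_out, norm_tgt = normalize(model_output), normalize(target)
    out_words, tgt_words = Counter(norm_out.split()), Counter(norm_tgt.split())
    common = sum((out_words & tgt_words).values())  # shared words
    precision = common / sum(out_words.values()) if out_words else 0.0
    recall = common / sum(tgt_words.values()) if tgt_words else 0.0
    f1 = 2 * precision * recall / (precision + recall) if common else 0.0
    return {
        "exact_match": float(model_output == target),
        "quasi_exact_match": float(norm_out == norm_tgt),
        "precision_over_words": precision,
        "recall_over_words": recall,
        "f1_over_words": f1,
    }


verbose = qa_scores("The answer is the Imperial Family.", "the Imperial Family")
concise = qa_scores("The Imperial Family.", "the Imperial Family")
print(verbose)
print(concise)
```

The verbose answer has recall 1.0 but precision 0.5, showing how precision penalizes verbosity, while the concise answer fails exact match only because of casing and punctuation and passes quasi-exact match.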

5.3.4. Limitations

The built-in metrics are based on comparing predicted and reference answers word for word. Hence, they may be less reliable for questions with linguistically ambiguous answers, e.g. those where the answer can be rephrased without modifying its meaning. An example from the NaturalQuestions dataset is the question “Who lives in the imperial palace in Tokyo” with the answer “the Imperial Family”. Other valid answers might be “the Emperor of Japan” or “the Emperor and their family”. Those would not be recorded as correct. However, for most questions in the built-in datasets the answers are unambiguous, e.g. country, city or individuals’ names.

5.4. Factual knowledge

5.4.1. Background

This evaluation measures the ability of language models to reproduce facts about the real world. The evaluation queries the model with prompts like “Berlin is the capital of” and “Tata Motors is a subsidiary of” and compares the model generation with one or more reference answers. The prompts are divided into different knowledge categories such as capitals and subsidiaries. This evaluation was proposed by Petroni et al. (2019).

5.4.2. Built-in datasets

We use the T-REx (Elsahar et al., 2018) dataset for this evaluation which is extracted from Wikipedia. For details see § A.4.

5.4.3. Built-in metrics

This evaluation outputs a single binary metric which evaluates to 1 if the reference answer is found within the generation, and to 0 otherwise.
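A minimal sketch of this containment check is given below. The function name is hypothetical, and the case-insensitive comparison and the acceptance of any one of several reference answers are assumptions made for illustration.

```python
from typing import List


def factual_knowledge_score(generation: str, references: List[str]) -> int:
    """Return 1 if any reference answer occurs in the generation, else 0."""
    text = generation.lower()
    return int(any(reference.lower() in text for reference in references))


print(factual_knowledge_score("Germany, a country in Europe.", ["Germany"]))
print(factual_knowledge_score("France", ["Germany", "Deutschland"]))
```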

5.4.4. Limitations

This evaluation relies on knowledge extracted from Wikipedia which might be incomplete, out of date or inaccurate. The evaluation also requires comparing the model generation to reference answer(s), leading to similar issues as discussed in the QA evaluation in § 5.3.

5.5. Prompt stereotyping

5.5.1. Background

Prompt stereotyping is one of many ways to measure algorithmic bias in LLMs (Blodgett et al., 2020). Many bias evaluations are task specific. For example, in the coreference resolution task, it is common to test whether the gender of a pronoun impacts whether the model can correctly identify its reference (Rudinger et al., 2018; Zhao et al., 2018) (e.g., is the model more likely to resolve a “he” pronoun to doctor and a “she” pronoun to nurse?). For sentiment analysis, authors test whether the model associates different occupations, genders, names and countries with positive or negative sentiments (Huang et al., 2019; Dhamala et al., 2021). Meeting the Coverage desideratum (§2), we instead opt for a more general bias evaluation for language generation. We measure whether the model encodes stereotypes by measuring the probability it assigns to more or less stereotypical sentences. This follows the evaluation paradigm of Nangia et al. (2020).

5.5.2. Built-in dataset

CrowS-Pairs (Nangia et al., 2020): This dataset provides crowdsourced sentence pairs (i.e., more and less stereotypical sentences) for the categories race/color, gender identity, sexual orientation, religion, age, nationality, disability, physical appearance and socioeconomic status along which stereotyping is measured.

5.5.3. Built-in metrics

The LLM is presented with two sentences: the more stereotypical sentence $S_{\text{more}}$ and the less stereotypical sentence $S_{\text{less}}$. We compute two metrics, both based on comparing the sentence probabilities $p(S_{\text{more}})$ and $p(S_{\text{less}})$ under the model. Is-biased: Measures whether $p(S_{\text{more}}) > p(S_{\text{less}})$ for each sentence pair. The binary is-biased metric is averaged over the whole dataset and per category to produce the final prompt stereotyping score (Nangia et al., 2020; Touvron et al., 2023; Workshop et al., 2022). After averaging, a value between 0 and 1 is obtained. A value of 1 indicates that the model always prefers the more stereotypical sentence while 0 means that it never prefers the more stereotypical sentence. An unbiased model prefers both at equal rates, corresponding to a score of 0.5. Log-probability-difference: A more fine-grained, numerical score indicating how much the model stereotypes on each pair.
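The two metrics can be sketched from per-pair log probabilities as follows. The log probabilities in the example are made up for illustration, the function name is hypothetical, and the sign convention for the difference (log probability of the more stereotypical sentence minus that of the less stereotypical one) is an assumption.

```python
from typing import Dict, List, Tuple


def prompt_stereotyping(log_probs: List[Tuple[float, float]]) -> Dict[str, object]:
    """log_probs holds (log p(S_more), log p(S_less)) for each sentence pair."""
    differences = [more - less for more, less in log_probs]
    # The logarithm is monotone, so p(S_more) > p(S_less) iff the difference > 0.
    is_biased = [float(difference > 0) for difference in differences]
    return {
        "log_probability_difference": differences,
        "is_biased": is_biased,
        "prompt_stereotyping": sum(is_biased) / len(is_biased),
    }


# Illustrative (made-up) log probabilities for four sentence pairs.
pairs = [(-10.0, -12.5), (-8.0, -7.5), (-15.0, -15.5), (-9.0, -11.0)]
result = prompt_stereotyping(pairs)
print(result)
```

In this toy example the model prefers the more stereotypical sentence in three of four pairs, giving a score of 0.75, above the 0.5 expected of an unbiased model.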

5.5.4. Limitations

CrowS-Pairs measures U.S.-typical stereotypes. Specifically, the bias categories are taken from the US Equal Employment Opportunity Commission’s list of protected categories and the sentence pairs are produced by Amazon Mechanical Turk workers in the United States. Other stereotypes prevail in other countries. Additionally, the CrowS-Pairs dataset has been found to be noisy (Blodgett et al., 2021), a consequence of being crowd-sourced. Some sentence pairs are low-quality or invalid. To address both limitations, users can bring in their own paired dataset to perform the prompt stereotyping evaluation on different data, or their own full evaluation if they opt to change the bias evaluation paradigm entirely. For improved Simplicity (§2), we aim to extend our built-in evaluations to distributional biases in language generation. Those will include geographical biases (Schwöbel et al., 2023) and gender or race vs. occupation biases (Rae et al., 2021; Liang et al., 2022).

5.6. Toxicity

5.6.1. Background

Toxicity in NLP refers to obscenity, hate speech, insults or any other type of harmful language (Dixon et al., 2018; Vidgen et al., 2019). The toxicity evaluation aims at assessing and quantifying the level of toxic content in text generated by LLMs. This is done to prevent models from outputting toxic language, thus averting harm to individuals, reputational damage to organizations and polarization, among other harms. Detecting and quantifying toxicity is a challenging problem in NLP due to its high degree of subjectivity, cultural diversity, nuanced and context-dependent meaning, and ethical complexity. We use toxicity detector models to score the toxicity of passages and run the toxicity evaluation on all tasks except classification. For open-ended generation we further provide two built-in datasets designed to elicit toxic responses.

5.6.2. Built-in datasets

Real Toxicity Prompts (Gehman et al., 2020): a collection of truncated sentence snippets from the web. It contains a subset of samples marked as “Real Toxicity Prompts-Challenging” that are likely to elicit toxic generation. BOLD (Dhamala et al., 2021): a series of prompts aimed at testing for biased and toxic generation across profession, gender, race, religion, and political ideology.

5.6.3. Built-in toxicity detectors

We support UnitaryAI Detoxify-unbiased (Hanu and Unitary team, 2020) and ToxiGen-RoBERTa. Both models are based on a RoBERTa architecture (Liu et al., 2019). The first is a multi-label classifier trained on Toxic Comment Classification Challenge and Jigsaw Unintended Bias in Toxicity Classification (Borkan et al., 2019) with the following labels: toxicity, severe toxicity, obscenity, threat, insult, sexual explicitness and identity attack. The second is a binary classifier fine-tuned on the ToxiGen dataset (Hartvigsen et al., 2022). All scores are between 0 and 1, where lower is better.

5.6.4. Limitations

The concept of toxicity may vary culturally and by context. Our toxicity evaluation employs a model to score the likelihood that generated passages include toxic content. These models are unlikely to detect toxicity in all cases and contexts. We refer the reader to the original works for further discussions on limitations of the respective models and other considerations.

5.7. Semantic robustness

5.7.1. Background

This evaluation measures how sensitive the model is to small semantic-preserving changes in the input. When reading, humans have a remarkable ability to understand written text even when it contains typographical errors or typos. In a similar vein, we expect that introducing a small error in the input should not have a big impact on the model output. Semantic robustness is a meta-evaluation—it is computed differently depending on the base evaluation. When the task is open-ended generation, we test whether the model’s response changes when we perturb the input, as we will detail in § 5.7.3. For all other downstream tasks, we follow (Liang et al., 2022) and evaluate whether task performance degrades when typos are introduced.

5.7.2. Built-in datasets

Built-in datasets depend on the selected base task. For the base tasks classification, summarization and QA we use the respective built-in datasets (see §5.1.2, §5.2.2 and §5.3.2). For the open-ended generation task, we use the T-REx dataset from the Factual Knowledge evaluation (§5.4.2), and additionally use the BOLD and WikiText-2 datasets. See §A.7 for details.

5.7.3. Built-in metrics

For all tasks except open-ended generation, this evaluation consists of just one metric, performance change, which measures how much the model performance changes as a result of semantic-preserving perturbations to the input. How performance is measured depends on the task. For classification, the performance score is the binary indicator of whether or not the model answer is correct. For summarization, the performance scores are ROUGE-N, METEOR and BERTScore, and the performance change is measured once for each of the three scores. For QA, the scores are Exact Match, Quasi Exact Match and F1 over Words (see §5.1.3, §5.2.3 and §5.3.3 for the respective task metrics). For open-ended generation, instead of measuring a difference in performance, we measure the change in model output using the Word Error Rate.

For details on types of perturbations, a description of Word Error Rate, and the computation of performance change, see § A.7.
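As a minimal sketch of the two quantities named above, the snippet below computes a Word Error Rate between two model outputs and an average performance change over several perturbed inputs. It uses the textbook definition of Word Error Rate (word-level edit distance divided by the number of reference words) and assumes that performance change is the mean difference between the original and the perturbed scores; the exact definitions used by the library are given in § A.7 and may differ. The function names are illustrative and not part of fmeval.

```python
from statistics import mean
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def performance_change(original_score: float, perturbed_scores: Sequence[float]) -> float:
    """Mean drop in score over perturbed inputs (assumed aggregation, see lead-in)."""
    return mean(original_score - score for score in perturbed_scores)


# Open-ended generation: compare the output for the original prompt with the
# output for a perturbed prompt (both strings are made up for illustration).
wer = word_error_rate("a b c d", "a x c")

# Other tasks: absolute and relative drop for one hypothetical example.
delta = performance_change(0.8, [0.7, 0.6])
relative_delta = delta / 0.8
```

A Word Error Rate of 0 means the perturbation did not change the output at all; larger values indicate a less robust model.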

6. Deep integrations in AWS

Figure 2. High-level system architecture of Amazon SageMaker FM Evaluations. A dataset and a configuration file serve as input to a batch processing job that produces evaluation results. These outputs are stored in a filesystem, and visualized in Amazon SageMaker Studio.

To improve performance (see §2), fmeval can be run as part of Amazon SageMaker. It leverages SageMaker’s processing job APIs and executes batch processing jobs on a cluster of AWS compute instances in order to process large amounts of data in parallel. Each processing job has an associated cluster of fully managed SageMaker compute instances running the specified container image, provisioned specifically for the processing job. The container used to run fmeval on SageMaker is a thin wrapper around the library.

Users can interact with the library by running their evaluations for models hosted on SageMaker through Amazon SageMaker Studio (see Figure 2). They can call fmeval programmatically in notebooks or through MLOps orchestration tools like Amazon SageMaker Pipelines. Users can also run evaluations through the Evaluation interface (UI) under Jobs. When using the Evaluation UI, evaluations with built-in datasets or custom datasets can be set up with a few clicks. Specifically, for SageMaker JumpStart LLMs, SageMaker applies default model and prompt settings, so that evaluation reports can be created in minutes without MLOps expertise (see §C, Figure 9 for a screenshot of the UI). Summary results are displayed directly in SageMaker Studio (see §C, Figure 10). A detailed evaluation report in PDF format, with insights and examples of the highest and lowest scoring prompts, is written to Amazon S3 (see §C, Figure 11).

7. Case study

7.1. Choosing a model for a QA task

(a) Accuracy, higher is better (↑).
(b) Robustness, measured as performance drop on perturbation, lower is better (↓).
(c) Toxicity, lower is better (↓).
Figure 3. Built-in metrics for task accuracy and robustness in the QA evaluation (see §5.3), on average over the built-in datasets. Toxicity is reported on the RealToxicityPrompts-Challenging subset (see §5.6.2). See §B for per-dataset results.

A common use case for fmeval is picking the best model for a given task, either to deploy directly or to build upon and customize, e.g., by fine-tuning. We pick question answering (QA) as an example. fmeval is used to benchmark models against each other and pick one. Table 2 (two left columns) lists the candidate models and the mode of access (i.e., Amazon SageMaker JumpStart, Amazon Bedrock or third party APIs). The models have been anonymized.

Model name   Access                            Toxicity
model-1      Third party API                   0.53
model-2      Third party API                   0.47
model-3      Amazon Bedrock API                0.07
model-4      Amazon SageMaker JumpStart API    1.55
model-5      Amazon SageMaker JumpStart API    1.52
model-6      Amazon SageMaker JumpStart API    1.47
model-7      Amazon SageMaker JumpStart API    0.88
Table 2. Candidate models and toxicity results. Toxicity results are summed over the seven categories.
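To make the aggregation in Table 2 concrete, the following sketch sums seven per-category scores into one total and ranks the candidates by the totals reported in the table. The category keys mirror the labels listed in §5.6.3 but are illustrative spellings, the per-category values are made-up placeholders rather than measured results, and the helper name total_toxicity is hypothetical. Because each category score lies between 0 and 1, the total lies between 0 and 7, which is why several entries in Table 2 exceed 1.

```python
# Toxicity totals copied from Table 2 (sum over the seven categories).
table_2_toxicity = {
    "model-1": 0.53,
    "model-2": 0.47,
    "model-3": 0.07,
    "model-4": 1.55,
    "model-5": 1.52,
    "model-6": 1.47,
    "model-7": 0.88,
}

# Illustrative keys for the seven labels named in the detector description.
CATEGORIES = (
    "toxicity",
    "severe_toxicity",
    "obscenity",
    "threat",
    "insult",
    "sexual_explicitness",
    "identity_attack",
)


def total_toxicity(category_scores: dict) -> float:
    """Sum the per-category scores; each score is expected to lie in [0, 1]."""
    missing = set(CATEGORIES) - set(category_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(category_scores[name] for name in CATEGORIES)


# Placeholder per-category averages for one hypothetical model (not real results).
example_scores = {
    "toxicity": 0.10,
    "severe_toxicity": 0.01,
    "obscenity": 0.05,
    "threat": 0.01,
    "insult": 0.06,
    "sexual_explicitness": 0.02,
    "identity_attack": 0.03,
}
example_total = total_toxicity(example_scores)

# Rank the candidates from least to most toxic using the Table 2 totals.
ranking = sorted(table_2_toxicity, key=table_2_toxicity.get)
least_toxic = ranking[0]
```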

The relevant evaluations for the QA task (see Table 1) are Task Accuracy, Semantic Robustness and Toxicity. We run the fmeval evaluations as is, i.e., without modifying default values such as prompt templates. Evaluating base models under the default settings is expected to yield a lower bound on the best possible performance after prompt engineering and other tuning has been performed. We run each evaluation on 100 samples from each of the built-in datasets BoolQ, NaturalQuestions and TriviaQA (see §5.3.2) for task accuracy and robustness, as well as Real Toxicity Prompts and BOLD for toxicity (see §5.6.2). Figure 3 shows the results, aggregated over the built-in datasets for each task.

For task accuracy, Figure 3 (a) shows that the closed source models model-1, model-2 and model-3 perform best under all metrics. Then, there is a group of models that achieve high recall but score much lower under the other metrics: model-4, model-5 and model-6. To analyze this disparity qualitatively, we investigate the model_outputs obtained from fmeval in §B.1. The three models for which recall and the other metrics differ vastly give non-standard answers that would likely be considered invalid by a human. Evaluating with more than one metric surfaced this unexpected behavior, which a user might wish to tackle via prompt engineering, or which might lead the user to exclude these models altogether.

For robustness, Figure 3 (b) shows the absolute performance drop ($\Delta$-score, see §5.7). The $\Delta$-scores follow the accuracy scores in trend, with model-2 being the most robust in relative terms (i.e., $\frac{\Delta\text{-score}}{\text{score}}$ is smallest for almost all scores).

For toxicity, none of the models produce significantly toxic outputs on the QA datasets. Hence, we additionally evaluate toxicity on the built-in datasets BOLD, RealToxicityPrompts and RealToxicityPrompts-Challenging (§5.6.2). The latter is known to elicit toxic responses from models; we plot results on this dataset in Figure 3 (c). model-3 fares best in this evaluation (see Table 2 for numerical results).

7.1.1. Additional evaluation: open-book QA

The built-in evaluation for QA Accuracy evaluates performance in a closed-book setting, i.e., the model solves the task using its world knowledge only. In open-book QA, the model additionally has access to a reference text. The task then consists of extracting the correct answer from the reference text. This reading comprehension task can be a good proxy for model performance in a system where the model has access to additional information, e.g., in a RAG setup.

We implement open-book QA using fmeval’s standard QA evaluation and the BYO dataset functionality. Specifically, we modify the built-in QA datasets to contain the reference. Here is an example prompt from the BoolQ dataset: “Respond to the following question. Valid answers are “True” or “False”. Is there a difference between sweating and perspiring?”. For the open-book task, this is modified to: “Perspiration, also known as sweating, is the production of fluids secreted by the sweat glands in the skin of mammals. Respond to the following question. Valid answers are “True” or “False”. Is there a difference between sweating and perspiring?”. Implementation-wise, we save the updated dataset locally in JSONLines format and update the dataset_uri field of the DataConfig with the local file path (experiment scripts will be released on publication). Due to context length limitations for some of the models, we filter the modified datasets for records with fewer than 4000 characters in question and reference combined. We exclude model-6 from this evaluation since its context length is only 1024.
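The dataset preparation described above can be sketched with the standard library alone. The snippet prepends the reference passage to each question, drops records whose passage and question together reach 4000 characters, and writes the result as JSON Lines. The field names ("passage", "question", "answer"), the helper names and the answer value of the example record are assumptions made for illustration; this is not the authors' experiment script.

```python
import json
import tempfile
from pathlib import Path

MAX_CHARS = 4000  # keep records with fewer than 4000 characters, as stated in the text


def to_open_book(record: dict) -> dict:
    """Prepend the reference passage to the question prompt."""
    return {
        "question": f"{record['passage']} {record['question']}",
        "answer": record["answer"],
    }


def write_open_book_dataset(records, path) -> int:
    """Write the filtered, modified records as JSON Lines; return how many were kept."""
    kept = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            if len(record["passage"]) + len(record["question"]) >= MAX_CHARS:
                continue  # too long for some of the evaluated models
            handle.write(json.dumps(to_open_book(record), ensure_ascii=False) + "\n")
            kept += 1
    return kept


records = [
    {
        "passage": "Perspiration, also known as sweating, is the production of fluids "
                   "secreted by the sweat glands in the skin of mammals.",
        "question": 'Respond to the following question. Valid answers are "True" or '
                    '"False". Is there a difference between sweating and perspiring?',
        "answer": "False",  # filled in for illustration only
    },
    # Hypothetical over-long record that the length filter removes.
    {"passage": "x" * 4000, "question": "Is this too long?", "answer": "True"},
]

out_path = Path(tempfile.mkdtemp()) / "boolq_open_book.jsonl"
kept = write_open_book_dataset(records, out_path)
# out_path would then be supplied as the dataset_uri of the DataConfig.
```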

Figure 4. Open-book QA Accuracy, higher is better (\uparrow).

Performance in the open-book task is similar in trend to closed-book (Figure 4). Performance overall improves by 15.6% on average, since the additional information makes the task easier to solve. model-3 is the exception with a 6% performance decrease in the open-book compared to the closed-book task.

In summary, our example user has identified two strong candidate models for their application: model-2 and model-3. On the one hand, model-2 performed better on task accuracy and robustness evaluations, and incorporated the additional information passed in the open-book task successfully. On the other hand, model-3 exhibited lower levels of toxicity. This toy experiment illustrates the usage of the library. In a real world use case there might, of course, be other factors in the final decision. These may include cost, infrastructure requirements or the desire to employ an open source model that can easily be modified.

8. Limitations and further work

The task of evaluating large language models is as multifaceted as their use cases are, and the research landscape is constantly evolving. fmeval addresses a range of well-known use cases and evaluations out of the box, and empowers users to customize evaluations for their own needs – balancing Simplicity, Coverage and Extensibility (see §2). Further work on the library can roughly be divided into the following groups.

First, new built-in evaluations should be added to improve coverage, e.g., tests for hallucinations. To this end, we will continue to engage with the open source community and AWS customers in order to identify the most common evaluation needs.

Similarly, only English model evaluations are supported out of the box. This limitation mainly stems from the built-in datasets which are English only, as well as a few metrics that are language specific (e.g., METEOR which relies on language-specific resources such as stemmers, §5.2.3).

Second, entirely new evaluation paradigms could be considered. We currently focus on benchmarking, i.e., evaluating a model on static datasets and comparing against ground truth (or running model answers against a toxicity detector, see §5.6). Benchmarking is usually performed before the model is deployed or fine-tuned, or at fixed intervals after deployment to monitor quality. An alternative evaluation paradigm is to evaluate model inputs or outputs for toxicity, misinformation or similar in real time, referred to as guardrailing. fmeval could be extended to include guardrailing. We are also planning to extend the library to allow for system-wide evaluation in the context of RAG.

9. Conclusions

We have presented fmeval, an open source library that enables practitioners to evaluate LLMs across a variety of tasks and responsible AI dimensions. The library is organized around the principles of Simplicity, Coverage, Extensibility and Performance. We have explained the reasoning behind these principles and have shown how they guided the scientific and engineering decisions that have been made during its development. After outlining the library’s architecture, we have highlighted its usage within AWS infrastructure, which requires little to no coding compared to using the standalone library. To demonstrate its functionalities, we have then presented a case study in which fmeval is used to select a model for a question answering task, in both open-book and closed-book settings. Using the library, our example user has identified two suitable candidates for their task: the most successful model in task accuracy and robustness, and the safest model in terms of toxic outputs. We conclude by reflecting on current limitations of the library and opportunities for future enhancement.

References

  • hug (2020) 2020. HuggingFace Evaluate: A library for easily evaluating machine learning models and datasets. Github. https://github.com/huggingface/evaluate.
  • ope (2023) 2023. OpenAI Evals. Github. https://github.com/openai/evals.
  • Abid et al. (2021) Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 298–306.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72.
  • Bender et al. (2021) Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big?. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 610–623.
  • Blodgett et al. (2020) Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. arXiv preprint arXiv:2005.14050 (2020).
  • Blodgett et al. (2021) Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. 2021. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 1004–1015.
  • Borkan et al. (2019) Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2019. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion proceedings of the 2019 world wide web conference. 491–500.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044 (2019).
  • Conneau and Kiela (2018) Alexis Conneau and Douwe Kiela. 2018. Senteval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449 (2018).
  • Dhamala et al. (2021) Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 862–872.
  • Dhole et al. (2021) Kaustubh D Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, et al. 2021. Nl-augmenter: A framework for task-sensitive natural language augmentation. arXiv preprint arXiv:2112.02721 (2021).
  • Dixon et al. (2018) Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 67–73.
  • Elsahar et al. (2018) Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga (Eds.). European Language Resources Association (ELRA), Miyazaki, Japan. https://aclanthology.org/L18-1544
  • Es et al. (2023) Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2023. Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217 (2023).
  • Fabbri et al. (2021a) Alexander R Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021a. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics 9 (2021), 391–409.
  • Fabbri et al. (2021b) Alexander R Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2021b. QAFactEval: Improved QA-based factual consistency evaluation for summarization. arXiv preprint arXiv:2112.08542 (2021).
  • Gao et al. (2021) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation. https://doi.org/10.5281/zenodo.5371628
  • Gehman et al. (2020) Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462 (2020).
  • Hanu and Unitary team (2020) Laura Hanu and Unitary team. 2020. Detoxify. Github. https://github.com/unitaryai/detoxify.
  • Hartvigsen et al. (2022) Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. 2022. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509 (2022).
  • Huang et al. (2021) Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. 2021. Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112 (2021).
  • Huang et al. (2019) Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. 2019. Reducing sentiment bias in language models via counterfactual evaluation. arXiv preprint arXiv:1911.03064 (2019).
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 (2017).
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
  • Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022).
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81.
  • Lin et al. (2021) Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958 (2021).
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Nadeem et al. (2020) Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456 (2020).
  • Nangia et al. (2020) Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R Bowman. 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133 (2020).
  • Nozza et al. (2021) Debora Nozza, Federico Bianchi, Dirk Hovy, et al. 2021. HONEST: Measuring hurtful sentence completion in language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
  • Pang et al. (2008) Bo Pang, Lillian Lee, et al. 2008. Opinion mining and sentiment analysis. Foundations and Trends® in information retrieval 2, 1–2 (2008), 1–135.
  • Pang et al. (2002) Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. arXiv preprint cs/0205070 (2002).
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language Models as Knowledge Bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong Kong, China, 2463–2473. https://doi.org/10.18653/v1/D19-1250
  • Porter (1980) Martin F Porter. 1980. An algorithm for suffix stripping. Program 14, 3 (1980), 130–137.
  • Rae et al. (2021) Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446 (2021).
  • Rogers et al. (2023) Anna Rogers, Matt Gardner, and Isabelle Augenstein. 2023. Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension. Comput. Surveys 55, 10 (2023), 1–45.
  • Rudinger et al. (2018) Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301 (2018).
  • Schwöbel et al. (2023) Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cédric Archambeau, and Danish Pruthi. 2023. Geographical Erasure in Language Generation. arXiv preprint arXiv:2310.14777 (2023).
  • Sheng et al. (2021) Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2021. Societal biases in language generation: Progress and challenges. arXiv preprint arXiv:2105.04054 (2021).
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing. 1631–1642.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
  • Turney (2002) Peter D Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. arXiv preprint cs/0212032 (2002).
  • Vidgen et al. (2019) Bertie Vidgen, Alex Harris, Dong Nguyen, Rebekah Tromble, Scott Hale, and Helen Margetts. 2019. Challenges and frontiers in abusive content detection. In Proceedings of the third workshop on abusive language online. Association for Computational Linguistics.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
  • Wang et al. (2023) Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. arXiv preprint arXiv:2306.11698 (2023).
  • Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7 (2019), 625–641.
  • Wiebe et al. (2005) Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language resources and evaluation 39 (2005), 165–210.
  • Workshop et al. (2022) BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022).
  • Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 (2019).
  • Zhang et al. (2023) Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. 2023. Benchmarking large language models for news summarization. arXiv preprint arXiv:2301.13848 (2023).
  • Zhao et al. (2018) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876 (2018).

Appendix A Built-in Evaluations – Details

A.1. Classification accuracy

A.1.1. Datasets

Women’s E-Commerce Clothing Reviews consists of approximately 23k clothing reviews, each with a review text and numerical scores. The task is to predict the score from the text, and it comes in two versions. For the binary classification task the model predicts whether or not the customer recommends the product (1 is recommended, 0 is not recommended). For the multiclass classification task, a numerical rating on a scale from 1 (worst) to 5 (best) is predicted. For a more fine-grained analysis, the class name variable indicates the category or type of garment, e.g. “Pants” or “Dresses”. There are 21 categories.

A.1.2. Metrics

Classification performance is measured with four metrics, each of which is explained with an example computation below. Throughout the examples we will use the following toy dataset:

      Review text                         True label   Class name   Predicted label
#1    Delicious cake! Would buy again.    3            brownie      3
#2    Tasty cake! Recommended.            2            pound cake   2
#3    Terrible! Got food poisoning.       1            pound cake   2

Classification accuracy is computed as predicted_label == true_label. This metric is computed for each datapoint as well as on average over the whole dataset. Example computation: The accuracies are $[1, 1, 0]$ for the example above, $\frac{2}{3}$ on average.

Precision is defined as true positives / (true positives + false positives). This metric is computed once for the whole dataset. The multiclass_average_strategy parameter determines how the scores are aggregated across classes in the multiclass classification setting. Options are {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, with default 'micro'. In the default case 'micro', the metric is calculated globally across all classes by counting the total true positives, false negatives and false positives. See the scikit-learn documentation for the other options.
Example computation (for multiclass_average_strategy='micro'): Examples #1 and #2 are true positives. #3 is a false positive (2 is predicted even though it is not correct). Hence, the precision is $\frac{2}{3}$.

Recall, defined as true positives / (true positives + false negatives), is computed once for the whole dataset. It has the parameter multiclass_average_strategy with the same meaning as for precision.
Example computation (for multiclass_average_strategy='micro'): Examples #1 and #2 are true positives. #3 is a false negative (the model fails to predict the correct label 1). Hence, the recall is $\frac{2}{3}$.

Balanced classification accuracy is the same as accuracy in the binary case, otherwise computed as the averaged recall per class. This metric is computed once for the whole dataset.
Example computation: Recall for class 1 is 0 (the model misses all the 1s). Recall for class 2 is 1 (the model misses no 2). Recall for class 3 is 1 (the model misses no 3). Hence, the balanced accuracy is $\frac{1+1+0}{3}=\frac{2}{3}$.

All four metrics take values between 0 (worst) and 1 (best). They are reported over the whole dataset as well as per category (i.e., by “Class Name” in the built-in Women’s E-Commerce Clothing Reviews dataset).
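The example computations above can be reproduced with a few lines of plain Python. This is a sketch for the three-row toy dataset only, written from the definitions in this section; it is not the library's implementation, which according to the text follows the scikit-learn averaging options. Variable names are illustrative.

```python
from collections import Counter

# Toy dataset from the table above (examples #1, #2, #3).
true_labels = [3, 2, 1]
predicted_labels = [3, 2, 2]

# Classification accuracy: per datapoint and averaged over the dataset.
per_example_accuracy = [int(p == t) for p, t in zip(predicted_labels, true_labels)]
accuracy = sum(per_example_accuracy) / len(true_labels)

# Count true positives, false positives and false negatives per class.
tp, fp, fn = Counter(), Counter(), Counter()
for predicted, true in zip(predicted_labels, true_labels):
    if predicted == true:
        tp[true] += 1
    else:
        fp[predicted] += 1  # the predicted class was wrong
        fn[true] += 1       # the true class was missed

# 'micro' averaging: pool the counts over all classes before dividing.
total_tp = sum(tp.values())
micro_precision = total_tp / (total_tp + sum(fp.values()))
micro_recall = total_tp / (total_tp + sum(fn.values()))

# Balanced accuracy in the multiclass case: mean of the per-class recalls.
true_classes = sorted(set(true_labels))
per_class_recall = {c: tp[c] / (tp[c] + fn[c]) for c in true_classes}
balanced_accuracy = sum(per_class_recall.values()) / len(true_classes)
```

All four aggregate values come out as 2/3 on this toy dataset, in agreement with the worked examples; on realistic data the four metrics generally differ.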

A.2. Summarization

A.2.1. Datasets

Government Report Dataset (Huang et al., 2021) is a dataset for long-form summarization benchmarking. This dataset features articles of more than 9K words on average. The reference summaries for this dataset also tend to be long, with an average length of 553 words.

A.2.2. Metrics

ROUGE-N (Lin, 2004) are a class of recall and F-measure based metrics that compute N-gram word overlaps between reference and model summary. The metrics are case insensitive and the values are in the range of 0 (no match) to 1 (perfect match). Users can specify the $N$ parameter, sometimes called order of the metric, specifically:

  • $N=1$ matches single words (unigrams) and is recall-based;

  • $N=2$ matches word pairs (bigrams) and is recall-based;

  • $N=L$ matches the longest common subsequence and is an F-measure. For computing the longest common subsequence, order is accounted for, but consecutiveness is discounted. E.g., for prediction = “It rains today” and reference = “It rains again today” we have that LongestCommonSubsequence(prediction, reference) = 3.
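The three variants listed above can be sketched as follows. The code lowercases and splits on whitespace, which is a simplification of the tokenization used by real ROUGE implementations, and it assumes an equally weighted F-measure (F1) for the longest common subsequence variant; both choices are assumptions made for illustration, and the function names are not part of fmeval.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(prediction: str, reference: str, n: int) -> float:
    """Fraction of reference n-grams that also occur in the prediction."""
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((pred & ref).values()) / total


def lcs_length(prediction: str, reference: str) -> int:
    """Length of the longest common subsequence of words (order matters, gaps allowed)."""
    a, b = prediction.lower().split(), reference.lower().split()
    prev = [0] * (len(b) + 1)
    for word_a in a:
        curr = [0]
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """F1 of LCS-based precision and recall (equal weighting is an assumption)."""
    lcs = lcs_length(prediction, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(prediction.split())
    recall = lcs / len(reference.split())
    return 2 * precision * recall / (precision + recall)


prediction, reference = "It rains today", "It rains again today"
lcs = lcs_length(prediction, reference)            # 3, as in the example above
rouge_1 = rouge_n_recall(prediction, reference, 1)
rouge_2 = rouge_n_recall(prediction, reference, 2)
rouge_l = rouge_l_f1(prediction, reference)
```

For this pair, all three prediction words appear in the four-word reference, while only one of the three reference bigrams ("it rains") is recovered, which illustrates how higher orders penalize inserted or missing words more strongly.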

Users can further preprocess predictions and references with the Porter stemmer to strip word suffixes (Porter, 1980). For example, “raining” or “rained” are mapped into “rain”.
METEOR (Banerjee and Lavie, 2005) is similar to ROUGE-1, but always includes Porter-stemming and synonym matching via synonym lists (e.g. “rain” matches with “drizzle”).
BERTScore (Zhang et al., 2019) uses a second ML model (from the BERT family) to compute embeddings of reference and predicted summaries and compares their cosine similarity. This score may account for additional linguistic flexibility over ROUGE and METEOR since semantically similar sentences may be embedded closer to each other. We support a choice of two models for computing embeddings, which users can specify via the parameter model_name: one of “microsoft/deberta-xlarge-mnli” (default, the model with the best correlation to human labellers according to https://github.com/Tiiiger/bert_score) and “roberta-large-mnli”.

A.2.3. Example

| Reference | Model Summary | Metric | Value |
|---|---|---|---|
| It is fall. | It is autumn. | ROUGE-2 | 0.67 |
| | | METEOR | 0.99 |
| | | BERTScore | 0.98 |
| | It is summer. | ROUGE-2 | 0.67 |
| | | METEOR | 0.64 |
| | | BERTScore | 0.93 |

Table 3. Summarization example.

As a toy example, consider a text about the weather. The ground truth reference summary is given as “It is fall.”. Consider two different model summaries, “It is autumn.” and “It is summer.”. The word overlap is the same for both predictions (2 out of 3 words match, green in Table 3); this is reflected in the same ROUGE-2 score for both summaries. However, since autumn and fall are synonymous, the first summary is clearly better. METEOR and BERTScore pick up on this difference by matching words that are similar in meaning (marked yellow in Table 3). As a consequence, they adequately rate the first summary higher (0.99 and 0.98) than the second (0.64 and 0.93, respectively).

A.3. QA Accuracy

A.3.1. Built-in Datasets

BoolQ (Clark et al., 2019) is a dataset consisting of ~16K question-passage-answer triplets. The questions are categorical in the sense that they can be answered with yes/no, and the answer is contained in the passage. The questions are provided anonymously and unsolicited by users of the Google search engine, and afterwards paired with a paragraph from a Wikipedia article containing the answer. As outlined above, providing the passage is optional depending on whether the open-book or closed-book case should be evaluated.
NaturalQuestions (Kwiatkowski et al., 2019) is a dataset consisting of ~320K question-passage-answer triplets. Similar to BoolQ, the questions are naturally occurring questions extracted from Google queries. In our implementation, the passages are extracts from Wikipedia articles (referred to as “long answers” in the original dataset).
TriviaQA (Joshi et al., 2017) is a dataset consisting of 95K question-answer pairs with on average six supporting evidence documents per question, leading to ~650K question-passage-answer triplets. The questions are authored by trivia enthusiasts and the evidence documents are independently gathered.

A.3.2. Built-in Metrics

The metrics below evaluate a model’s QA performance by comparing its predicted answers to the given ground truth answers in different ways. We first introduce the metrics and then illustrate them with an example.

Exact match (EM): Binary score, 1 if model output and answer match exactly, else 0.

Quasi-exact match: Binary score. As above, but both model output and answer are first normalized by removing any articles and punctuation. E.g., the score is also 1 for predicted answers “Antarctic.” or “the Antarctic”.

Precision over words: Numerical score between 0 (worst) and 1 (best) that is computed as follows: precision = true positives / (true positives + false positives). True positives are the words in the model output that are also contained within the expected answer. False positives are the words in the model output that are not contained within the expected answer. Intuitively, this measures whether the model output only contains correct words (i.e., precision penalizes verbosity). The text is normalized as before.

Recall over words: Numerical score between 0 (worst) and 1 (best) that is computed as follows: recall = true positives / (true positives + false negatives). True positives are defined as before; false negatives are words that are missing from the model output but are included in the ground truth. Intuitively, this measures whether the correct answer is included in the model output; recall does not penalize verbosity. Again, the text is normalized first.

F1 over words: Numerical score between 0 (worst) and 1 (best). The F1-score is the harmonic mean of precision and recall: $F_1 = 2\,(\text{precision} \cdot \text{recall})/(\text{precision} + \text{recall})$. The text is normalized as before.

A.3.3. Example

We illustrate the metric computations with an example from the NaturalQuestions dataset in Table 4. The question is “Where is the world’s largest ice sheet located today?”, the ground truth answer is “Antarctic”.

The reference and model response differ, hence Exact Match evaluates to 0. For Quasi-Exact Match, articles are removed, hence the metric evaluates to 1. For recall, we obtain recall = true positives / (true positives + false negatives) = 1 / (1 + 0) = 1. Precision is true positives / (true positives + false positives) = 1 / (1 + 1) = 1/2. Lastly, for F1 we have $F_1 = 2\,(\text{precision} \cdot \text{recall})/(\text{precision} + \text{recall}) = 2\,(\frac{1}{2}\cdot 1)/(\frac{1}{2}+1)=\frac{2}{3}$.

| Reference | Model Response | Metric | Value |
|---|---|---|---|
| Antarctic | the Antarctic | Exact Match | 0 |
| | | Quasi-Exact Match | 1 |
| | | Precision over Words | 1/2 |
| | | Recall over Words | 1 |
| | | F1 over Words | 2/3 |

Table 4. QA example.
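The values in Table 4 can be reproduced with the following sketch. All function names are hypothetical and this is not the library's code. One assumption deserves attention: the sketch removes articles and punctuation only for quasi-exact match, while the word-level scores use lower-cased, whitespace-separated words, because that is the choice that yields the precision of 1/2 shown in the table.

```python
import re
import string

ARTICLES = re.compile(r"\b(a|an|the)\b")


def normalize(text):
    """Lower-case, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = ARTICLES.sub(" ", text)
    return " ".join(text.split())


def exact_match(output, answer):
    return int(output == answer)


def quasi_exact_match(output, answer):
    return int(normalize(output) == normalize(answer))


def word_scores(output, answer):
    """Precision, recall and F1 over sets of lower-cased words (sketch assumption)."""
    out_words = set(output.lower().split())
    ans_words = set(answer.lower().split())
    tp = len(out_words & ans_words)   # words in both output and answer
    fp = len(out_words - ans_words)   # extra words in the output
    fn = len(ans_words - out_words)   # answer words missing from the output
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


reference, response = "Antarctic", "the Antarctic"
em = exact_match(response, reference)
qem = quasi_exact_match(response, reference)
precision, recall, f1 = word_scores(response, reference)
print(em, qem, precision, recall, round(f1, 3))
```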

A.4. Factual Knowledge

The T-REx (Elsahar et al., 2018) dataset consists of knowledge triplets extracted from Wikipedia. The triplets take the form (subject, predicate, object), for instance, (Berlin, capital of, Germany) or (Tata Motors, subsidiary of, Tata Group). We convert these predicates to prompts, e.g., “Berlin is the capital of” (expected answer: Germany) and “Tata Motors is a subsidiary of” (expected answer: Tata Group).

The T-REx data consists of over 600 predicates. However, many predicates are too broad to form questions with precise answers, e.g., (Laozi, is a, Chinese classic text) and (Basic Input/Output System, type of, firmware). For this reason, we manually select predicates. We inspected the top 100 predicates in the T-REx 10K sample and the predicates used by Petroni et al. (2019), and dropped the ones that are too general. We also merged similar predicates, e.g., “profession” and “occupation”. The final selection consists of ~32K prompts from the following 15 knowledge categories:

  1. Capitals: <subject> is the capital of

  2. Founders: <subject> was founded by

  3. Director: <subject> was directed by

  4. Country: The country <subject> is located in is

  5. Profession: The profession of <subject> is

  6. Team: <subject> played for

  7. Developer: <subject> is developed by

  8. Owner: <subject> is owned by

  9. Official Language: The official language of <subject> is

  10. Author: <subject> is written by

  11. Tributary: <subject> is a tributary of

  12. Creator: <subject> is created by

  13. Named After: <subject> is named after

  14. Manufacturer: <subject> is manufactured by

  15. Subsidiary: <subject> is a subsidiary of

A.4.1. Built-in Metric

This evaluation outputs a single binary metric. The metric value is 1 if the lower-cased expected answer is contained anywhere within the lower-cased model response. For instance, consider the prompt “Berlin is the capital of” with the expected answer “Germany”. If the model generation is “Germany, and is also its most populous city”, then the metric evaluates to 1.

Some subject / predicate pairs can have more than one expected answer. Consider for instance (Bloemfontein, capital, South Africa) and (Bloemfontein, capital, Free State Province): the city Bloemfontein is the capital of both South Africa and Free State Province. In such cases, either answer is considered correct.
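The containment check, including the case of several acceptable answers, can be sketched as follows. The function name `factual_knowledge_score` and the second and third example responses are hypothetical.

```python
def factual_knowledge_score(model_response, expected_answers):
    """Return 1 if any lower-cased expected answer occurs anywhere in the lower-cased response."""
    response = model_response.lower()
    return int(any(answer.lower() in response for answer in expected_answers))


# Prompt: "Berlin is the capital of"
score_berlin = factual_knowledge_score(
    "Germany, and is also its most populous city", ["Germany"]
)
# Prompt: "Bloemfontein is the capital of" (two acceptable answers)
score_bloemfontein = factual_knowledge_score(
    "the Free State Province", ["South Africa", "Free State Province"]
)
# A response that contains no expected answer scores 0.
score_miss = factual_knowledge_score("France", ["Germany"])
print(score_berlin, score_bloemfontein, score_miss)
```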

A.5. Stereotyping

A.5.1. Built-in Dataset

CrowS-Pairs (Nangia et al., 2020): This dataset provides 1,508 crowdsourced sentence pairs for the different categories along which stereotyping is to be measured. The above example is from the “gender/gender identity” category.

A.5.2. Built-in Metrics

We compute two metrics, both based on comparing the sentence probabilities $p(S_{\text{more}})$ and $p(S_{\text{less}})$. $p$ is computed by the language model (LM). For autoregressive (sometimes called causal) LMs such as models from the GPT family, it is computed token-by-token, i.e.

$$
\begin{aligned}
p(\text{My mom spent all day cooking for Thanksgiving}) &= p(\text{My}) \cdot p(\text{mom} \mid \text{My}) \cdot p(\text{spent} \mid \text{My mom}) \cdot \ldots \\
&\quad \cdot p(\text{Thanksgiving} \mid \text{My mom spent all day cooking for})
\end{aligned}
$$

Is-biased: Binary score, measuring whether $p(S_{\text{more}}) > p(S_{\text{less}})$ for each sentence pair $(S_{\text{more}}, S_{\text{less}})^{i}$, $i=1,\ldots,1508$. The is-biased metric is reported on average over the whole dataset (as well as per category), to produce the final prompt stereotyping score from the literature (Nangia et al., 2020; Touvron et al., 2023; Workshop et al., 2022). After averaging the 1508 binary values, a numerical value between 0 and 1 is obtained. 1 indicates that the model always prefers the more stereotypical sentence, while 0 means that it never prefers the more stereotypical sentence. An unbiased model prefers both at equal rates, corresponding to a score of 0.5.

Log-probability-difference: Numerical score, measuring the log-ratio $\log\left[\frac{p(S_{\text{more}})}{p(S_{\text{less}})}\right]=\log p(S_{\text{more}})-\log p(S_{\text{less}})$. This number indicates how much the model stereotypes on each pair. The log-probability-difference score is reported in addition to the binary score for each sentence pair. This more fine-grained score can be used, for example, to extract those sentence pairs where the model stereotyped the most.
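Both scores follow directly from per-token conditional probabilities, as the sketch below illustrates. The function names and all probability values are hypothetical; in practice the per-token probabilities would come from the evaluated LM. Summing log-probabilities instead of multiplying probabilities is a common way to avoid numerical underflow on long sentences.

```python
import math


def sentence_log_prob(token_probs):
    """Log-probability of a sentence given its per-token conditional probabilities."""
    return sum(math.log(p) for p in token_probs)


def stereotyping_scores(pairs):
    """pairs: list of (token_probs_more, token_probs_less), one tuple per sentence pair."""
    diffs = [sentence_log_prob(more) - sentence_log_prob(less) for more, less in pairs]
    is_biased = [int(d > 0) for d in diffs]  # 1 iff p(S_more) > p(S_less)
    return diffs, is_biased, sum(is_biased) / len(is_biased)


# Hypothetical per-token probabilities for two sentence pairs.
pairs = [
    ([0.5, 0.4, 0.2], [0.5, 0.4, 0.1]),  # model prefers the more stereotypical sentence
    ([0.3, 0.2], [0.3, 0.4]),            # model prefers the less stereotypical sentence
]
diffs, is_biased, dataset_score = stereotyping_scores(pairs)
print(is_biased, dataset_score)
```

With one pair in each direction, the averaged is-biased score is 0.5, the value that corresponds to an unbiased model.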

A.6. Toxicity

A.6.1. Built-in Datasets

Real Toxicity Prompts (Gehman et al., 2020) is a dataset of 100k truncated sentence snippets from the web. Prompts marked as “challenging” have been found by the authors to consistently lead to the generation of toxic continuations by the tested models (GPT-1, GPT-2, GPT-3, CTRL, CTRL-WIKI). We divide the dataset into two parts based on this attribute. Some prompts of this dataset may contain toxic content.
BOLD (Dhamala et al., 2021) is a large-scale dataset that consists of 23,679 English prompts extracted from Wikipedia. It is aimed at testing biased and toxic text generation across five domains: profession, gender, race, religion, and political ideology.

A.6.2. Built-in toxicity detectors

UnitaryAI Detoxify-unbiased (Hanu and Unitary team, 2020) is a multilabel text classifier trained on the Toxic Comment Classification Challenge and Jigsaw Unintended Bias in Toxicity Classification (Borkan et al., 2019). It outputs a score from 0 (no toxicity detected) to 1 (toxicity detected) to measure general toxicity, and six additional scores that reflect specific types of toxic content: severe toxicity, obscenity, threat, insult, sexual explicitness and identity attack.
ToxiGen-RoBERTa (Hartvigsen et al., 2022) is a binary text classifier fine-tuned on the ToxiGen dataset (Hartvigsen et al., 2022), a dataset of generated passages which contains sentences with implicit and subtle toxic content pertaining to 13 minority groups alongside benign sentences.

Both models have a RoBERTa text classifier architecture (Liu et al., 2019).

A.7. Semantic Robustness

A.7.1. Built-in Datasets

Built-in datasets depend on the selected base task. For the base tasks classification, summarization and QA we use the respective built-in datasets (see §5.1.2, §5.2.2 and §5.3.2). For the open-ended generation task, we use the T-REx dataset from the Factual Knowledge evaluation (§5.4.2), as well as two additional datasets: The BOLD (Dhamala et al., 2021) dataset consists of 23,679 prompts extracted from Wikipedia articles. The prompts are divided into five main categories: gender, political ideology, profession, race and religious ideology. Each category is further subdivided into subcategories. E.g., profession is divided into scientific occupations, engineering branches, etc.

The WikiText-2 dataset consists of 44,836 Good and Featured articles from Wikipedia. To create prompts, we broke each article down into sentences and extracted the first 6 tokens from each sentence as the prompt.

A.7.2. Built-in metrics

As described in §5.7.3, this evaluation measures the change in model output as a result of semantics-preserving perturbations, where the metrics are task-dependent.

Types of perturbations. Assume that the input to the model is “A quick brown fox jumps over the lazy dog.” Then the evaluation will make one of the following three perturbations, adapted from NL-Augmenter (Dhole et al., 2021).

  1. Butter Fingers: Typos introduced due to hitting an adjacent keyboard key, e.g., “W quick brmwn fox jumps over the lazy dig.”

  2. Random Upper Case: Changing randomly selected letters to upper case, e.g., “A qUick brOwn fox jumps over the lazY dog.”

  3. Whitespace Add Remove: Randomly adding and removing whitespace from the input, e.g., “A q uick bro wn fox ju mps overthe lazy dog.”
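Two of the three perturbations can be sketched with the standard library alone. This is an illustrative approximation, not the NL-Augmenter or fmeval implementation; the function names and the parameters `fraction`, `remove_prob`, `add_prob` and `seed` are hypothetical. Butter Fingers is omitted because it additionally requires a keyboard adjacency map.

```python
import random


def random_upper_case(text, fraction=0.1, seed=0):
    """Upper-case each character independently with probability `fraction`."""
    rng = random.Random(seed)
    return "".join(c.upper() if rng.random() < fraction else c for c in text)


def whitespace_add_remove(text, remove_prob=0.1, add_prob=0.05, seed=0):
    """Randomly drop existing whitespace and insert spaces after other characters."""
    rng = random.Random(seed)
    out = []
    for c in text:
        if c.isspace():
            # Keep the whitespace character unless it is selected for removal.
            if rng.random() >= remove_prob:
                out.append(c)
        else:
            out.append(c)
            # Occasionally insert an extra space after a non-whitespace character.
            if rng.random() < add_prob:
                out.append(" ")
    return "".join(out)


original = "A quick brown fox jumps over the lazy dog."
upper = random_upper_case(original)
spaced = whitespace_add_remove(original)
print(upper)
print(spaced)
```

Both perturbations leave the underlying character sequence intact up to case and whitespace, which is why they can be regarded as meaning-preserving.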

Measuring output change. For all tasks except open-ended generation, we measure the change in task-related performance after perturbations. The computation of performance change is as follows: Let $y$ be the model output on the original, unperturbed input and $s$ be the corresponding accuracy score, e.g., ROUGE score when the task is summarization or classification accuracy when the task is classification. The evaluation then generates $P$ perturbed versions of the input. Let the model outputs for the perturbed inputs be $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_P$ and the corresponding accuracy scores be $\bar{s}_1, \bar{s}_2, \ldots, \bar{s}_P$.

Then the performance change is the average difference between the original score $s$ and the scores on the perturbed inputs $\bar{s}_i$, that is:

$$
\Delta s = \frac{1}{P}\sum_{i=1}^{P} |s - \bar{s}_i| \tag{1}
$$

The scores on original and perturbed inputs, $s$ and $\bar{s}_i$, are the accuracy scores of the selected task, except for open-ended generation, where the difference is measured using the Word Error Rate metric.
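Equation (1) amounts to a mean absolute difference, as the following sketch shows. The function name `performance_change` and the scores are hypothetical.

```python
def performance_change(original_score, perturbed_scores):
    """Mean absolute difference between the original score and each perturbed score."""
    return sum(abs(original_score - s) for s in perturbed_scores) / len(perturbed_scores)


# Hypothetical accuracy scores for one input and P = 4 perturbed versions of it.
s = 0.8
s_bar = [0.8, 0.6, 0.7, 0.5]
delta_s = performance_change(s, s_bar)
print(round(delta_s, 3))
```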

Word Error Rate ($wer$) is used to measure output changes in the open-ended generation task. In this task, no accuracy score is assigned to the model outputs. Instead of computing the difference in scores $|s - \bar{s}_i|$ in the $\Delta s$ formula, we thus measure the difference in model generations via the word error rate $wer(y, \bar{y}_i)$. $wer$ computes the changes (insertions, deletions, substitutions) that need to be made to the first input to convert it to the second. For example, if the two inputs are “this is a cat” and “this is a cat”, $wer = 0$. If the two inputs are “this is a cat” and “this is a dog”, $wer = 0.25$. This is because 1 out of 4, that is, 25% of the words need to be changed for the two sentences to be identical.
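A word-level edit distance reproduces the two examples. The sketch below is one common formulation and not necessarily the one used by the library: it assumes whitespace tokenization and normalization by the number of words in the first input. The function name `word_error_rate` and the perturbed outputs are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer_same = word_error_rate("this is a cat", "this is a cat")
wer_changed = word_error_rate("this is a cat", "this is a dog")

# Averaging over perturbed generations, in analogy to Equation (1).
y = "this is a cat"
y_bar = ["this is a cat", "this is a dog"]
mean_wer = sum(word_error_rate(y, yb) for yb in y_bar) / len(y_bar)
print(wer_same, wer_changed, mean_wer)
```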

Appendix B Case Study – Detailed Results

B.1. Qualitative Analysis of Model Outputs

To qualitatively analyze the disparity between recall and the other metrics for model-4, model-5 and model-6, we investigate the model_outputs obtained from the library. Here is an example:

  • model_input=“Respond to the following question. Valid answers are “True” or “False”. Are garlic and onion in the same family?” composed of

  • prompt=“Are garlic and onion in the same family?”, a single question from the BoolQ dataset, and

  • prompt_template=“Respond to the following question. Valid answers are “True” or “False”. $prompt”.
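The composition of model_input from prompt and prompt_template can be mimicked with Python's standard `string.Template`, which uses the same `$prompt` placeholder syntax. This is only an illustration of the substitution and makes no claim about how the library composes prompts internally.

```python
from string import Template

prompt_template = Template(
    'Respond to the following question. Valid answers are "True" or "False". $prompt'
)
prompt = "Are garlic and onion in the same family?"

# Substitute the dataset question for the $prompt placeholder.
model_input = prompt_template.substitute(prompt=prompt)
print(model_input)
```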

We find that the three models for which recall and the other metrics differ vastly give non-standard answers; example model outputs are collected in Table 5. model-4 and model-5 tend to repeat the input text. Hence, the correct answer (here: “True”) is always included and recall evaluates to 1. For model-6 we regularly observe double answers such as “False True”. Despite their ambiguity, these answers also score perfect recall, since they always include the correct answer (either “True” or “False”). In this case, evaluating more than one metric has surfaced unexpected behavior that a user might wish to tackle with methods such as prompt engineering, or that might lead one to rule out these models.

| Input | Model | Model Response |
|---|---|---|
| Respond to the following question. Valid answers are “True” or “False”. Are garlic and onion in the same family? | model-4 | False.Respond to the following question. Valid answers are ”True” or ”False”. Are garlic and onion in the same family? Answer: False. Garlic and onion are in the same family, but they are not closely related. |
| | model-5 | True. Respond to the following question. Valid answers are ”True” or ”False”. Are garlic and onion in the same family? Answer: True. True False https://bigthoughtwritingservices.com/wp-content/uploads/2020/05/logo, |
| | model-6 | False True, |

Table 5. Example model outputs. The correct answer is “True”. Erratic model behaviors that yield high recall values despite being incorrect or invalid are marked in red.
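Why a double answer scores perfect recall but not perfect precision can be seen in a small sketch. The function name `precision_recall` is hypothetical, and the sketch assumes lower-cased, whitespace-separated words compared as sets.

```python
def precision_recall(output, answer):
    """Word-level precision and recall over sets of lower-cased words."""
    out_words = set(output.lower().split())
    ans_words = set(answer.lower().split())
    tp = len(out_words & ans_words)
    precision = tp / len(out_words) if out_words else 0.0
    recall = tp / len(ans_words) if ans_words else 0.0
    return precision, recall


# A double answer contains the correct word, so recall is 1, but the extra
# word lowers precision.
double_answer = precision_recall("False True", "True")
clean_answer = precision_recall("True", "True")
print(double_answer, clean_answer)
```

Comparing the two metrics side by side therefore flags verbose or hedging outputs that recall alone would rate as correct.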

Finally, we report the additional results from our set of evaluations in Figures 5, 6, 7, and 8.

Figure 5. QA Accuracy results on the three built-in datasets.

Figure 6. Robustness results on the three built-in datasets.

Figure 7. Toxicity results on the three built-in QA datasets.

Figure 8. Toxicity results on the open-ended generation datasets.

Appendix C User Interface

In this section we show the UI for the evaluations of models hosted on SageMaker as well as an excerpt from the generated PDF report. The evaluation can be launched in Amazon SageMaker Studio through the Evaluation interface (see §6). Figure 9 presents a screenshot of the UI to define the evaluation to run. A summary of the results is displayed directly in the SageMaker Studio as shown in Figure 10.

Figure 9. Creating an evaluation via the SageMaker Studio UI.

Figure 10. Results in the SageMaker Studio UI.

Figure 11. Excerpt from the generated PDF report (first page).