Foundational Autoraters:
Taming Large Language Models for
Better Automatic Evaluation

Tu Vu Co-lead (equal contribution) Google DeepMind Kalpesh Krishna Co-lead (equal contribution) Google
Salaheddin Alzubi
UMass Amherst
Chris Tar Google DeepMind Manaal Faruqui Google Yun-Hsuan Sung Google DeepMind
Abstract

As large language models (LLMs) advance, it becomes more challenging to reliably evaluate their output due to the high costs of human evaluation. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data like GPT-4 and Claude-3 on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative model trained exclusively on permissively licensed data, outperforming both GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more computationally efficient approach using a novel tail-patch fine-tuning strategy to optimize our FLAMe multitask mixture for reward modeling evaluation (FLAMe-Opt-RM), offering competitive RewardBench performance while requiring approximately 25× fewer training datapoints. Overall, our FLAMe variants outperform all popular proprietary LLM-as-a-Judge models we consider across 8 out of 12 autorater evaluation benchmarks, encompassing 53 quality assessment tasks, including RewardBench and LLM-AggreFact. Finally, our analysis reveals that FLAMe is significantly less biased than these LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.

1 Introduction

Figure 1: Our FLAMe-24B variants outperform popular proprietary LLM-as-a-Judge models like GPT-4 and Claude-3 on many held-out autorater evaluation benchmarks, including RewardBench. Notably, FLAMe-RM, with an overall accuracy of 87.8%, is the top-performing generative model trained solely on permissively licensed data on RewardBench, surpassing both GPT-4-0125 (85.9%) and GPT-4o (84.7%).

The increasing power and versatility of large language models (LLMs) bring with them a growing challenge: How can we reliably evaluate their long-form outputs? Recent research suggests a promising solution: these models themselves, after undergoing large-scale multitask instruction tuning, can generalize to follow new human instructions (Mishra et al., 2022; Wei et al., 2022; Sanh et al., 2022; Longpre et al., 2023; Chung et al., 2024), making them suitable for use as autoraters of model outputs. This is particularly appealing because human evaluation, though crucial for assessing model performance, is limited by subjectivity (Krishna et al., 2023a), variability among raters (Karpinska et al., 2021), and the high costs of extensive evaluations (Min et al., 2023; Vu et al., 2023; Wei et al., 2024).

To align LLM autoraters with human preferences, training on human judgments is crucial (Ouyang et al., 2022). However, obtaining these judgments is both costly and time-consuming. Collecting existing human evaluations from previous research seems promising but faces challenges such as lack of standardization, diverse evaluation criteria, inadequate documentation, data privacy, and proprietary concerns. Alternatively, using model outputs for autorater training offers consistency (Jiang et al., 2024b; Kim et al., 2024b) but also carries risks, including reinforcing biases and hallucinations (Gudibande et al., 2023; Muennighoff et al., 2023). Additionally, it may violate terms of use for proprietary LLM services, which prohibit using their models’ outputs to develop competing models (see https://openai.com/policies/terms-of-use and https://policies.google.com/terms/generative-ai).

To address these limitations, we curated and standardized human evaluations from prior research to create FLAMe, a collection of 102 quality assessment tasks comprising more than 5.3M total human judgments (Section 3). FLAMe spans a wide variety of task types, from assessing machine translation quality to evaluating how well AI assistants follow user instructions. We hypothesized that training on this large and diverse data collection would enable LLM autoraters to learn robust, generalized patterns of human judgment, minimizing the impact of noisy or low-quality human judgments.

For transparency and reproducibility, we use only publicly available human evaluation data with permissive licenses from previous studies (Section 3.2). To overcome challenges in collecting such data, which rarely adhere to a particular standard and often lack documentation, we thoroughly examined the associated research (Section 3.4) and additionally consulted with the original authors to address ambiguities or inconsistencies (spending 3-4 hours per dataset).

Figure 2: All our quality assessment tasks are formulated into a unified text-to-text format with manually crafted task definitions and evaluation instructions. We format training examples as input-target pairs, where the input includes task-specific context, and the target contains the expected human evaluations.

We train LLM autoraters using supervised, multitask fine-tuning on our data collection. Inspired by T5’s unified task format (Raffel et al., 2020), we convert all our quality assessment tasks into a text-to-text format with manually crafted task definitions and evaluation instructions. All training examples are formulated as input-target pairs, where the input includes task-specific context, and the target contains the expected human evaluations (see Figure 2). This approach facilitates effective transfer learning across tasks, enabling our models to interpret and respond to various tasks consistently. Additionally, our task format is simple, intuitive, and easily accommodates new tasks.

Our approach can be viewed as developing general-purpose LLM autoraters that can perform various quality assessment tasks. We demonstrate that training an instruction-tuned LLM, i.e., PaLM-2-24B (Anil et al., 2023), on our FLAMe collection significantly improves generalization to a wide variety of held-out tasks, outperforming models like GPT-4, Claude-3, and Llama-3 on many tasks. This suggests that our large-scale multitask instruction tuning approach effectively equips the model with general-purpose quality assessment capabilities.

Motivated by these results, we further investigate the impact of using FLAMe as a powerful starting point for fine-tuning on targeted downstream applications, using reward modeling evaluation as a case study (FLAMe-RM). Specifically, we further fine-tune FLAMe for only 50 steps on a mixture of four datasets with human pairwise preference judgments, covering chat, reasoning, and safety. Our resulting FLAMe-RM-24B model significantly improves FLAMe’s performance on RewardBench (Lambert et al., 2024), achieving an overall accuracy of 87.8% (up from 86.0%). Notably, it is the top-performing generative model trained solely on permissively licensed data, outperforming both GPT-4-0125 (85.9%) and GPT-4o (84.7%); see Figure 1.

Additionally, we present FLAMe-Opt-RM, a computationally efficient method that optimizes our FLAMe multitask mixture for targeted reward modeling evaluation. Using a novel tail-patch fine-tuning technique, we analyze the impact of each dataset on specific RewardBench distributions, allowing us to determine the optimal proportions of individual datasets in our multitask mixture. By fine-tuning the initial instruction-tuned PaLM-2-24B checkpoint on this optimized mixture for only 5000 steps, we obtain competitive RewardBench performance (87.0%) compared to FLAMe (86.0%), using approximately 25× fewer training datapoints.

Overall, our FLAMe variants outperform all popular proprietary LLM-as-a-Judge models we consider across 8 out of 12 autorater evaluation benchmarks (1 held-in and 11 held-out), encompassing 53 quality assessment tasks, including benchmarks like RewardBench and LLM-AggreFact (Tang et al., 2024).

Finally, we investigate whether biases exist in our LLM autoraters, a common criticism of LLM-as-a-Judge autoraters (Section 6.1), and their potential utility for AI development, particularly in identifying high-quality model responses (Section 6.2). Our analysis results reveal that FLAMe variants are significantly less biased than popular LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark (Koo et al., 2023), showing more robustness to changes in pairwise ordering, response length, and irrelevant context. Additionally, we find that FLAMe effectively re-ranks LLM responses to Python programming prompts in the HumanEval benchmark (Chen et al., 2021), improving pass@1 by 6-10% across settings.

In summary, our main contributions are:

  1. Data Collection: We curated and standardized human evaluations from permissively licensed datasets to create a collection of 100+ diverse quality assessment tasks comprising 5M+ human judgments. To facilitate future research, we will make our data collection publicly available.

  2. LLM Autoraters: We demonstrate the effectiveness of using our multitask mixture both in training general-purpose LLM autoraters (FLAMe) and optimizing LLM autoraters for targeted downstream applications (FLAMe-RM and FLAMe-Opt-RM). Our LLM autoraters outperform all popular proprietary LLM-as-a-Judge models we consider across 8 out of 12 autorater evaluation benchmarks, covering 53 quality assessment tasks, including RewardBench and LLM-AggreFact.

  3. Computationally Efficient Multitask Training: We introduce a computationally efficient method using a novel fine-tuning strategy to optimize our multitask mixture for targeted distributions, achieving competitive performance with significantly less compute.

Our work demonstrates the potential of accessible AI solutions, which we hope will spur more fundamental research into reusable human evaluations and the development of effective and efficient LLM autoraters.

2 Related Work

Below, we discuss existing literature in the space of autoraters, drawing connections to FLAMe.

Automatic Evaluation Metrics:

Traditional metrics like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) assess the lexical overlap between model output and human references. In the BERT (Devlin et al., 2019) era, several methods use pretrained models to measure the distributional similarity (Zhao et al., 2019; Zhang et al., 2020) or token probabilities (Thompson and Post, 2020; Yuan et al., 2021). A line of work explores statistical methods to measure the divergence between two text distributions (Gehrmann et al., 2019; Pillutla et al., 2021). Other work fine-tunes pretrained models on human ratings to create automatic evaluation metrics for specific tasks, including machine translation (Sellam et al., 2020; Rei et al., 2020; Fernandes et al., 2023), text summarization (Durmus et al., 2020; Deutsch et al., 2021; Goyal and Durrett, 2021), question answering (Chen et al., 2020; Lin et al., 2022), and text simplification (Maddela et al., 2023). Unlike these task-specific evaluation metrics, FLAMe is trained on various fine-grained quality assessment tasks and can be prompted during inference to tackle novel tasks.

LLM-as-a-Judge Autoraters:

With the advent of LLMs like ChatGPT, recent work has used these models as judges (Liu et al., 2023a; Fu et al., 2024; Bai et al., 2023; Wang et al., 2023a; Chiang et al., 2023; Chiang and Lee, 2023; Bubeck et al., 2023) to evaluate LLM capabilities on various benchmarks, including AlpacaEval (Li et al., 2023b; Dubois et al., 2024), MT-Bench (Zheng et al., 2023), and WildBench (Lin et al., 2024). However, LLM-as-a-Judge autoraters are often found to favor their own generated responses (Liu et al., 2023a; Panickssery et al., 2024; Liu et al., 2023b; Bai et al., 2023), exhibiting “cognitive” biases towards aspects like length, order, and entity preference (Koo et al., 2023). In contrast, our models are trained on a large, diverse collection of human evaluations, allowing them to learn unbiased, generalized patterns of human judgment (Section 6.1). Unlike LLM-as-a-Judge autoraters, our models are not tasked with evaluating their own responses, avoiding self-preference bias.

General-purpose LLM Autoraters:

Recent work has explored training general-purpose LLM autoraters.  Jiang et al. (2024b) introduce TIGERScore, a Llama-2 model trained on GPT-4 generated error analysis data across several tasks, including summarization, translation, data2text, long-form QA, and instruction-following. Similar approaches include InstructScore (Xu et al., 2023b), Prometheus (Kim et al., 2024a), and Prometheus-2 (Kim et al., 2024b). Unlike these efforts, our approach relies solely on open-source human evaluations instead of model outputs. We show that FLAMe significantly outperforms Prometheus-2 on RewardBench (see Table 2).

Reward Models:

Our work relates to the development of reward models (RMs) used for aligning LLMs to human preferences via reinforcement learning with human feedback (RLHF) (Ouyang et al., 2022; Korbak et al., 2023). In RLHF, human preference data is either used to train stand-alone discriminative RMs, or directly fed into LLMs via algorithms like DPO (Rafailov et al., 2024) or SLiC-HF (Zhao et al., 2023). While we evaluate our models as RMs in our RewardBench experiments (Section 5), there are key distinctions: (1) RMs primarily train on pairwise preference data, whereas our models use diverse task types in a unified format (a notable exception is RLAIF (Bai et al., 2022b), which asks the model to critique its responses based on a constitution); (2) RMs optimize for overall preference, while our models can be prompted to judge specific aspects of model responses (e.g., safety).

3 The FLAMe Collection

At a high level, we fine-tune instruction-tuned LLMs on our multitask mixture of standardized human evaluations (102 tasks, 5.3M human judgments). This data collection is meticulously curated to encompass human evaluations across a broad spectrum of LLM capabilities (Section 3.2-3.3). We manually crafted task definitions and evaluation instructions, reformatting all tasks into a unified text-to-text format (Section 3.4).

3.1 Task Definition

We use the term “task” to refer to a specific assignment for the model, which involves presenting a text (e.g., a machine-generated summary) alongside its context (the original article) and instructing the model to evaluate one or more aspects of the text based on provided evaluation criteria (see Figure 2). Each task has distinct definitions and evaluation guidelines. It is possible to derive different tasks from the same dataset. For example, HelpSteer (Wang et al., 2023b) includes human annotations for different attributes of model responses such as Helpfulness, Correctness, Coherence, Complexity, and Verbosity, allowing us to create distinct tasks, each focused on a specific attribute. Additionally, tasks with similar definitions and evaluation criteria but sourced from different datasets are treated as distinct tasks. Based on this definition, the FLAMe collection has a total of 102 distinct tasks.

3.2 Principles for Data Collection

We adhere to the following principles while choosing our datasets:

Public, Open-source Datasets:

For transparency and reproducibility, we use only permissively licensed datasets from HuggingFace Datasets (Lhoest et al., 2021), TensorFlow Datasets (https://www.tensorflow.org/datasets), or the original authors’ GitHub repositories.

Human-labeled Annotations:

We exclusively use datasets with human-labeled annotations, avoiding those generated by models like GPT-4 due to potential inaccuracies and legal concerns raised in recent research (Gudibande et al., 2023; Muennighoff et al., 2023).

Figure 3: A breakdown of our FLAMe data collection by task type, with each slice representing the percentage of datapoints (out of 5.3M) for that specific task type. More than half of FLAMe focuses on standard pairwise (“Which of the two responses is better?”) and pointwise (“Rate the response on a Likert scale.”) evaluation tasks. The rest of FLAMe focuses on custom classification (e.g., “Is the summary fully attributable to the source article? (Yes/No)”) and open-ended generation (e.g., “Explain why response A is better than response B.”) evaluation tasks.

Various Task Types:

To enhance the generalizability of our models, we gather datasets from a diverse range of task types. These include (see breakdown in Figure 3):

  1. Pairwise Evaluation: Tasks that involve comparing two responses at a time to determine a preference (e.g., “Which response, A or B, is more helpful?”).

  2. Pointwise Evaluation: Tasks that involve evaluating specific attributes of individual responses independently (e.g., “Please rate the overall coherence of the response on a 5-point Likert scale.”).

  3. Classification: Tasks that involve categorizing individual responses into predefined categories (e.g., “Does the model output follow the instructions? (Yes/No)”).

  4. Open-ended Evaluation: Tasks that require free-form, unrestricted answers (e.g., “Is the summary fully attributable to the source article? Provide a short explanation.”).

Various LLM Capabilities:

We choose datasets from literature that assess diverse LLM capabilities, including factuality, safety, reasoning, instruction-following, long-form generation, creativity, attribution, coding, etc. (see Section 3.3).

3.3 LLM Capabilities Covered by FLAMe

Following the principles outlined in Section 3.2, we curated a comprehensive data collection of 5.3M datapoints, spanning 102 training tasks (with an additional 53 tasks reserved for evaluation, as detailed in Section 5.1).  Appendix 8 contains the list of datasets used in our study. Our data collection encompasses key capabilities of contemporary LLMs, as outlined below (see breakdown in Figure 4).

Figure 4: A breakdown of our FLAMe data collection by LLM capability, with each slice representing the percentage of datapoints (out of 5.3M) for that specific LLM capability. We focus on the standard evaluation pillars regularly used in LLM evaluation: general response quality, factuality, safety, coding, and math. Additionally, we add some non-evaluation instruction tuning data (like LIMA) to help preserve the general-purpose instruction-following capabilities of FLAMe.

General Response Quality:

To evaluate LLM response quality, we use various datasets that measure attributes like helpfulness, coherence, fluency, creativity, complexity, and verbosity. These include: Summary Comparisons (SummFeedback) (Stiennon et al., 2020), LMSYS Chatbot Arena conversations (Zheng et al., 2023), HH RLHF Helpfulness (Bai et al., 2022a), WebGPT (Nakano et al., 2021), SummEval (Fabbri et al., 2021), News Summary Evaluation (Goyal et al., 2022), SHP (Ethayarajh et al., 2022), BeaverTails Helpfulness (Ji et al., 2023), SEAHORSE (Clark et al., 2023), HelpSteer (Wang et al., 2023b), etc. Additionally, to measure LLM instruction-following capabilities, we include datasets like GENIE (Khashabi et al., 2022), InstruSum (Liu et al., 2024), and riSum (Skopek et al., 2023).

Factuality/Attribution:

To address the increasing importance of measuring hallucinations in generated LLM responses, we incorporate several datasets that evaluate the factual accuracy of responses and their grounding, measuring whether claims are supported by source documents. These include: XSum Hallucination (Maynez et al., 2020), QAGS (Wang et al., 2020), WikiBio Hallucination (Manakul et al., 2023), FRANK (Pagnoni et al., 2021), FactScore (Min et al., 2023), VitaminC (Schuster et al., 2021), HaluEval (Li et al., 2023a), Q2 (Honovich et al., 2021), FaithDial (Dziri et al., 2022a), DialFact (Gupta et al., 2022), BEGIN (Dziri et al., 2022b), MNLI (Williams et al., 2018), etc. We reformulate natural language inference (MNLI) as quality assessment because it naturally aligns with attribution.

Mathematical Reasoning:

We construct datasets to help FLAMe differentiate between correct and incorrect solutions to mathematical problems. We leverage PRM800K (Lightman et al., 2024) and extract human vs incorrect LLM-generated solutions, as well as pairs of (correct, incorrect) LLM-generated solutions.

Coding:

In addition to natural language evaluation, we also train FLAMe to perform code evaluation. We utilize Code Contests (Li et al., 2022a), CommitPack (Muennighoff et al., 2023), and COFFEE (Moon et al., 2023) to construct pairs of (correct, buggy) programs in response to coding problems or GitHub issues. The model is trained to select the correct program or fix from each pair. Our training data covers popular programming languages, such as Python, JavaScript, Java, C++, Go, and Rust.

Safety:

Developing safe and harmless AI assistants for broad public use is increasingly important. To facilitate safety evaluation, we train FLAMe to identify more helpful and harmless responses. Our training data includes tasks from sources like HH RLHF Harmlessness (Bai et al., 2022a), HH RLHF Red Teaming (Ganguli et al., 2022), BeaverTails QA-Classification and Harmlessness (Ji et al., 2023).

Instruction Tuning:

Finally, to help preserve the instruction-following capabilities of our models, we incorporate instruction tuning data from datasets with human-written responses. These include: LIMA (Zhou et al., 2023), PRM800K IF (Lightman et al., 2024), and TULU-2 (Ivison et al., 2023). For PRM800K IF, we train the model to produce the ground truth solution for each problem. For TULU-2, we only use instruction tuning subsets with human-written responses, including FLAN, CoT, Open Assistant 1, Science literature, and Hardcoded (see Section 2 in Ivison et al., 2023 for details).

3.4 Unified Task Format

After carefully selecting our training datasets (Section 3.2-3.3), we process and standardize them into a unified text-to-text format. This preprocessing step typically takes about 3-4 hours per dataset and involves several key tasks:

  1. Comprehensive Review and Author Consultations: We carefully review the associated research and additionally consult with the original authors to clarify ambiguities or inconsistencies.

  2. Data Collection: We collect all relevant data files from the corresponding HuggingFace Datasets, TensorFlow Datasets, or GitHub repositories.

  3. Data Extraction: We identify and extract specific data fields containing quality assessments conducted by human annotators.

  4. Task Definitions and Evaluation Instructions: We meticulously create detailed task definitions and evaluation instructions for each quality assessment task, ensuring consistency and standardization. To maintain alignment with the original evaluation criteria, we adhere to any available instructions provided to the original human annotators. Our instructions help the model identify the input and output formats, as well as understand the specific aspects it should assess.

  5. Text-to-Text Format Conversion: Finally, we reformat all tasks as text-to-text tasks (see Figure 2). Task definitions, evaluation instructions, and desired output fields are listed under an INSTRUCTIONS block. Input field values and target field values are placed under CONTEXT and EVALUATION blocks, respectively. This flexible text-to-text format is easily adaptable to a wide range of quality assessment tasks.
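The conversion in the last step can be illustrated with a minimal sketch. The INSTRUCTIONS, CONTEXT, and EVALUATION block names come from the description above; the function name `format_example`, the field labels (`title`, `description`, `output_fields`), and the exact line layout are hypothetical choices made for illustration and are not taken from the released format.

```python
from typing import Dict, List


def format_example(task_definition: str,
                   instructions: str,
                   output_fields: List[str],
                   context: Dict[str, str],
                   evaluation: Dict[str, str]) -> Dict[str, str]:
    """Render one human judgment as an (input, target) text pair.

    The three block names follow the paper's description; the field
    labels inside each block are assumptions made for this sketch.
    """
    instruction_block = "\n".join([
        "INSTRUCTIONS:",
        f"title: {task_definition}",
        f"description: {instructions}",
        f"output_fields: {', '.join(output_fields)}",
    ])
    # Input field values go under CONTEXT.
    context_block = "\n".join(
        ["CONTEXT:"] + [f"{name}: {value}" for name, value in context.items()]
    )
    # Target field values are what the model should produce under EVALUATION.
    target = "\n".join(f"{name}: {evaluation[name]}" for name in output_fields)
    model_input = f"{instruction_block}\n\n{context_block}\n\nEVALUATION:\n"
    return {"input": model_input, "target": target}


example = format_example(
    task_definition="Pairwise helpfulness",
    instructions="Choose the more helpful response.",
    output_fields=["choice"],
    context={"prompt": "Hi", "response_a": "Hello", "response_b": "Go away"},
    evaluation={"choice": "A"},
)
print(example["input"] + example["target"])
```

Because every task is reduced to the same three blocks, adding a new task under this sketch only requires a new task definition and a mapping from the dataset's fields to context and evaluation entries.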

4 Model

We now leverage our large and diverse multitask mixture of quality assessment tasks to train general-purpose LLM autoraters, which can be prompted during inference to perform various tasks. We train three model variants: FLAMe, which is trained with examples-proportional mixture weights (Raffel et al., 2020); FLAMe-RM, which is initialized with FLAMe and slightly fine-tuned on a balanced mixture of four pairwise evaluation datasets, spanning chat, reasoning, and safety (Section 4.2); and FLAMe-Opt-RM, which is trained with reward modeling optimized mixture weights, determined using a tail-patch fine-tuning strategy (Section 4.3).

4.1 Training General-purpose LLM Autoraters (FLAMe)

We start with a baseline training approach by using supervised multitask training to train an instruction-tuned PaLM-2-24B model on our multitask mixture for a fixed number of 30K training steps. We employ examples-proportional mixture weights, capped at a maximum of $2^{16}$ per task to avoid oversampling large datasets. Our resulting FLAMe model significantly improves generalization to a diverse array of held-out tasks, outperforming models like GPT-4, Claude-3, and Llama-3 on many tasks (see Figure 1 and Table 1). These findings support our hypothesis that large-scale multitask instruction tuning effectively equips the model with general-purpose quality assessment capabilities. However, we find that this approach is not optimal for specialized downstream applications like reward modeling evaluation, which motivates our approaches targeting specific downstream distributions (Section 4.2 and Section 4.3).
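As a rough illustration of capped examples-proportional mixing in the sense of Raffel et al. (2020), the sketch below computes per-task sampling rates as min(size, cap) divided by the sum of capped sizes. The function name `mixture_rates` and the task sizes are hypothetical; only the cap of 2^16 comes from the text.

```python
from typing import Dict

CAP = 2 ** 16  # per-task cap stated in the text


def mixture_rates(task_sizes: Dict[str, int], cap: int = CAP) -> Dict[str, float]:
    """Return the sampling probability of each task.

    Each task's weight is its number of examples, clipped at `cap`,
    so very large datasets cannot dominate the mixture.
    """
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    return {name: size / total for name, size in capped.items()}


# Hypothetical task sizes, chosen only to show the effect of the cap.
rates = mixture_rates({"big": 1_000_000, "mid": 65_536, "small": 32_768})
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

In this toy mixture the million-example task is sampled at the same rate as the 65,536-example task, which is the intended effect of the cap.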

4.2 Fine-tuning FLAMe for Reward Modeling Evaluation (FLAMe-RM)

Motivated by our findings with FLAMe, we delve deeper into the potential of FLAMe as a powerful starting point for further fine-tuning on specific downstream applications. We focus on reward modeling evaluation as a case study. We create FLAMe-RM by fine-tuning FLAMe on a mixture of four pairwise evaluation datasets, equally mixed, spanning chat, reasoning, and safety. These include: HelpSteer (Wang et al., 2023b), PRM800K (Lightman et al., 2024), CommitPack (Muennighoff et al., 2023), and HH-RLHF Harmlessness (Bai et al., 2022a). Since FLAMe is already trained on these datasets, we only fine-tune it for 50 steps. The resulting FLAMe-RM model significantly improves the original FLAMe’s RewardBench overall score from 86.0% to 87.8% accuracy. Remarkably, FLAMe-RM-24B is the top-performing generative model trained exclusively on permissively licensed data, surpassing both GPT-4-0125 (85.9%) and GPT-4o (84.7%); see Figure 1 and Table 1.

4.3 Optimizing FLAMe Multitask Mixture for Reward Modeling Evaluation (FLAMe-Opt-RM)

Figure 5: A comparison of FLAMe-Opt-RM and FLAMe in early training stages (first 5000 steps) based on RewardBench Chat Hard and Safety performance. FLAMe-Opt-RM, with optimized mixture weights, achieves significantly higher Chat Hard and Safety scores faster than FLAMe. For reference, FLAMe achieves Chat Hard and Safety scores of 66.2 and 88.5, respectively, at 30K training steps.

While our vanilla FLAMe mixture with examples-proportional mixing performs well across many tasks, it requires extensive training to attain strong performance on certain specialized downstream applications, for example, RewardBench (see Figure 5). We attribute this to suboptimal mixture weights that undersample beneficial tasks during training. To address this, we introduce a novel tail-patch ablation strategy that analyzes the impact of each dataset on targeted distributions. This allows us to find the optimal proportions of individual datasets in our multitask mixture, efficiently optimizing all mixing weight hyperparameters at once. By fine-tuning the initial instruction-tuned PaLM-2-24B checkpoint on this optimized mixture for only 5000 steps, we achieve RewardBench performance (87.0%) that is competitive with our baseline FLAMe approach (86.0%) while using approximately 25× fewer training datapoints.

Here, we directly optimized our multitask mixture based on RewardBench performance changes due to its lack of a development set. In early experiments, we observed weak correlations between RewardBench performance and performance on our other held-out tasks across model variants, preventing us from creating a reliable “proxy” development set. We emphasize that our goal here is not to achieve state-of-the-art RewardBench results but instead to demonstrate how our multitask mixture can be optimized for targeted distributions. We found that longer training and/or additional fine-tuning, as is done for FLAMe-RM, further improved our RewardBench performance, though we did not submit these FLAMe-Opt-RM results to the official RewardBench leaderboard. Furthermore, FLAMe-Opt-RM’s robust performance across other held-out tasks (see Table 1) indicates that we have not overfitted to RewardBench, affirming the broad applicability of FLAMe-Opt-RM across diverse tasks.

Tail-patch Ablations to Determine Beneficial Tasks:

Setting the right mixing weight for each individual training task in our multitask mixture is non-trivial due to the large number of tasks. Instead, we examine the impact of each task on targeted distributions, and then use this information for weight assignment. First, we select a checkpoint that is partially trained on our vanilla mixture, showing decent but not optimal performance across RewardBench categories. (We hypothesize that using a partially trained checkpoint, rather than the initial one, is better for tail-patch ablations, since the model has already been exposed to multitask data and is familiar with its overall distribution.) Then, we perform a brief fine-tuning stage (“tail-patch”) exclusively on each individual training task, limited to 3000 training steps. We posit that training on a beneficial task would bridge the gap between fair and optimal performance. We note that this is a one-time procedure per downstream application and can be done with smaller models to further reduce computational costs.

A Re-weighted Mixture Based on Tail-patch Ablations: After training a tail-patch on each task, we rate how helpful each training task is to each category of RewardBench using one of four ratings: Helpful (+2, performance significantly improves and remains stable), Somewhat helpful (+1, performance slightly improves), No clear effect (0, performance is nearly unchanged), Harmful (-1, performance is significantly worse). We then organize tasks into seven bundles: Generally helpful (tasks with the highest total ratings, $\geq 5$ in our study), Category-specific, one for each of the five RewardBench categories (most beneficial tasks for a specific category where performance crosses a threshold $\tau$), and Others for the remaining tasks. (We separate Math and Coding for the Reasoning category, and use thresholds of $\tau = 95\%, 66\%, 99.8\%, 84\%, 85\%$ for Chat, Chat Hard, Math, Coding, and Safety, respectively.)

We assign a fixed mixing weight for each bundle: $w_{general}$ = 100K for Generally helpful, $w_{specific}$ = 30K for each Category-specific bundle, and $w_{others}$ = 3K for Others. A task can belong to more than one bundle; in this case, its final weight is the sum of the mixture weights from all the bundles it belongs to. For example, if a task is generally helpful and specifically beneficial for both Chat Hard and Safety, it contributes $w_t = w_{general} + 2 \times w_{specific}$ to our final mixture. An exception to this rule: we prioritize the top two most helpful tasks in three categories with suboptimal performance (Chat Hard, Coding, and Safety), each with a fixed weight of $w_{top\_specific}$ = 200K. These weight values were initially set based on our intuition and were not extensively tuned. FLAMe-Opt-RM is initialized with the initial instruction-tuned PaLM-2-24B and then fine-tuned using our re-weighted multitask mixture.
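The weight assignment can be sketched as follows. The four constants and the total-rating cutoff of 5 come from the text; the function name `task_weight` and its arguments are hypothetical. The sketch also assumes that the 200K priority weight replaces the summed weight rather than adding to it, which is one reading of "an exception to this rule", and that the category-specific bundle memberships have already been decided by the threshold test on tail-patch performance.

```python
from typing import Dict, Iterable

W_GENERAL = 100_000       # Generally helpful bundle
W_SPECIFIC = 30_000       # each Category-specific bundle
W_OTHERS = 3_000          # Others bundle
W_TOP_SPECIFIC = 200_000  # top-two tasks in Chat Hard, Coding, Safety
GENERAL_MIN_TOTAL = 5     # total rating needed for "Generally helpful"


def task_weight(ratings: Dict[str, int],
                specific_categories: Iterable[str],
                top_two_priority: bool = False) -> int:
    """Mixing weight of one training task.

    ratings: tail-patch rating (+2, +1, 0, or -1) per RewardBench category.
    specific_categories: categories whose threshold the task crossed
        (assumed to be computed elsewhere).
    top_two_priority: whether the task is one of the prioritized top-two
        tasks; assumed here to override the summed weight.
    """
    if top_two_priority:
        return W_TOP_SPECIFIC
    weight = 0
    if sum(ratings.values()) >= GENERAL_MIN_TOTAL:
        weight += W_GENERAL
    weight += W_SPECIFIC * len(set(specific_categories))
    # A task in no helpful bundle falls into "Others".
    return weight if weight > 0 else W_OTHERS


# The example from the text: generally helpful, plus Chat Hard and Safety.
ratings = {"Chat": 1, "Chat Hard": 2, "Math": 0, "Coding": 0, "Safety": 2}
print(task_weight(ratings, ["Chat Hard", "Safety"]))
```

For the task in the text's example this yields 100K + 2 × 30K = 160K.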

4.4 Training Details

We initialize both FLAMe and FLAMe-Opt-RM with the PaLM-2-24B model (Anil et al., 2023), instruction-tuned on the Flan collection (Chung et al., 2024; Longpre et al., 2023), and train for 30K and 5K steps, respectively. FLAMe is then further fine-tuned for 50 steps to create FLAMe-RM. Our models are trained using T5X (Roberts et al., 2023) with the Adam optimizer (Kingma and Ba, 2015), a learning rate of 0.0001, and a dropout rate of 0.05. FLAMe is trained on 256 Cloud TPU chips with a batch size of 32, whereas FLAMe-RM and FLAMe-Opt-RM use 128 Cloud TPU chips with a batch size of 8. (Footnote 9: https://cloud.google.com/tpu/docs/v5e-training, https://cloud.google.com/tpu/docs/v3)
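As a rough back-of-the-envelope check, the step counts and batch sizes above imply the number of training examples each variant sees. The sketch below assumes that "batch size" counts examples per optimizer step and that nothing else (such as packing or gradient accumulation) changes the count; neither assumption is confirmed by the text.

```python
def examples_seen(steps: int, batch_size: int) -> int:
    """Number of training examples processed, assuming one batch per step."""
    return steps * batch_size


flame = examples_seen(30_000, 32)        # FLAMe: 30K steps, batch size 32
flame_rm_extra = examples_seen(50, 8)    # additional FLAMe-RM fine-tuning
flame_opt_rm = examples_seen(5_000, 8)   # FLAMe-Opt-RM: 5K steps, batch size 8

# Ratio of FLAMe to FLAMe-Opt-RM training examples under these assumptions.
ratio = flame / flame_opt_rm
```

Under these assumptions the ratio is 24, which is in line with the approximately 25× reduction in training datapoints quoted in the abstract for FLAMe-Opt-RM.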

5 Experiments

Having discussed our FLAMe variants and their implementations in Section 4, we now present our main experiments. We compare FLAMe to several popular LLM-as-a-Judge autoraters (Section 5.2) using an evaluation suite that includes 12 autorater benchmarks: 1 held-in and 11 held-out, covering a total of 53 quality assessment tasks (Section 5.1). Overall, we find that our FLAMe variants, trained exclusively on permissively licensed data, outperform LLMs trained on proprietary data like GPT-4 and Claude-3 on 8 out of 12 benchmarks (Section 5.3).

5.1 Evaluation Datasets

Our goal is to measure the general quality assessment capabilities of FLAMe. As such, we evaluate our models using a diverse set of held-in and held-out tasks. We cast each task into our unified text-to-text format (Section 3.4) and prompt our models to perform the task. For benchmarks with multiple categories (e.g., RewardBench and LLM-AggreFact), we use the same prompt instructions across categories. To reduce model API costs, we randomly sample 256 examples per evaluation task, except for RewardBench, where we report results on the full evaluation sets. (Footnote 10: For tasks with fewer than 256 examples, we use the full evaluation set.)
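The subsampling rule described above (256 random examples per task, or the full set when fewer are available) can be sketched as follows. The function name, the fixed seed, and the use of Python's `random` module are assumptions made for illustration; the text does not describe the actual sampling code.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(examples: Sequence[T], k: int = 256, seed: int = 0) -> List[T]:
    """Return at most `k` examples, sampled without replacement.

    Tasks with `k` or fewer examples are returned in full.
    The seed is a hypothetical choice made for reproducibility.
    """
    if len(examples) <= k:
        return list(examples)
    rng = random.Random(seed)
    return rng.sample(list(examples), k)


large_task = subsample(range(1000))  # 256 examples
small_task = subsample(range(10))    # all 10 examples
```

Fixing the seed keeps the evaluated subset identical across models, so score differences are not caused by different samples.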

5.1.1 Held-in Evaluations

HelpSteer (Wang et al., 2023b):

We assess FLAMe’s performance on rating helpfulness, correctness, coherence, complexity, and verbosity, using the HelpSteer validation set.

5.1.2 Held-out Evaluations

RewardBench (Lambert et al., 2024):

RewardBench is a widely used benchmark for assessing reward models, focusing on their capabilities and safety. It involves pairwise preference tasks where reward models choose the better response between two options based on a given prompt. RewardBench encompasses four main categories aimed at evaluating specific desired capabilities in LLMs: Chat, Chat Hard, Reasoning (Math + Coding), and Safety. The benchmark incorporates 23 individual datasets. (Footnote 11: We excluded the “Prior sets” of RewardBench because three out of the four datasets were used in training FLAMe.)

LLM-AggreFact (Tang et al., 2024):

LLM-AggreFact is a benchmark for measuring the grounding capabilities of autoraters. Given a reference document and a claim, the autorater determines if the claim is fully supported by the document. This holistic benchmark combines 10 attribution datasets used in recent studies on LLM factuality.

Other Evaluation Benchmarks:

In addition to RewardBench and LLM-AggreFact, we evaluate FLAMe on a diverse array of other held-out pairwise comparison and pointwise evaluation benchmarks, including: Summary Comparisons (SummFeedback) (Stiennon et al., 2020) (Footnote 12: During training, we used only pairwise ratings from the dataset and reserved pointwise ratings for evaluation.); Helpful, Honest, and Harmless Alignment (HHH) (Askell et al., 2021); AlpacaFarm (Dubois et al., 2023); Paraphrase Evaluation (Dipper) (Krishna et al., 2023b); Sequence Continuation Preference (RankGen) (Krishna et al., 2022); Poem Preference (CoPoet) (Chakrabarty et al., 2022); Literary Translation Comparisons (LitTrans) (Karpinska and Iyyer, 2023); Long-form QA Evaluation (LFQAEval) (Xu et al., 2023a); and Text Continuation Preference (ContrSearch) (Su and Xu, 2022). None of the tasks in these benchmarks were included in our training data.

5.2 Evaluated Models

We evaluate several popular LLM-as-a-Judge models as baselines, including: Llama-3-70B-Instruct (Meta, 2024), Mixtral 8×7B (Jiang et al., 2024a), Claude-3-Opus (Anthropic, 2024), GPT-3.5-turbo-0125 (OpenAI, 2024a), GPT-4-0125 (OpenAI, 2024b), and OpenAI’s current flagship model GPT-4o (OpenAI, 2024c). (Footnote 13: For fair comparison, we use the same FLAMe prompt instructions when evaluating LLM-as-a-Judge baselines. For better reproducibility, we set the temperature to 0 and generate up to 1024 tokens across all models.) We also compare our results with several models on the official RewardBench leaderboard, notably Gemini-1.5-Pro (Reid et al., 2024), Prometheus-2-8×7B (Kim et al., 2024b), and NVIDIA’s Nemotron-4-340B-Reward and Llama-3-70B-SteerLM-RM (Wang et al., 2024).

We evaluate all three of our FLAMe variants: FLAMe, FLAMe-RM, and FLAMe-Opt-RM, as described in Sections 4.1 to 4.3. Additionally, we include the initial instruction-tuned PaLM-2-24B checkpoint, which has not been trained on our FLAMe data, to separate the impact of instruction tuning from that of FLAMe training.

5.3 Main Results

Model RewardBench LLM-AggreFact SummFeedback AlpacaFarm RankGen CoPoet ContrSearch HHH Dipper LitTrans LFQAEval HelpSteer
Llama-3-70B-Instruct 76.1 76.1 50.8 53.9 65.6 53.6 53.1 91.9 42.8 60.5 71.1 39.7
Mixtral-8×7B 77.8 73.8 43.8 55.1 63.3 52.9 56.6 90.0 42.2 61.7 71.5 34.0
GPT-3.5-turbo-0125 64.5 70.0 15.6 55.5 58.2 49.0 57.5 85.5 45.0 54.3 69.9 32.0
Claude-3-Opus 80.7 79.2 31.6 49.6 55.1 49.0 45.1 94.6 50.6 71.1 71.1 41.3
GPT-4-0125 85.9 80.6 46.5 49.6 62.5 56.9 55.8 94.6 45.0 67.6 77.0 37.9
GPT-4o 84.7 80.2 30.9 50.4 66.0 55.6 57.5 92.3 45.6 72.7 75.0 40.1
our models
PaLM-2-24B 62.9 54.8 13.3 52.3 58.2 54.2 46.0 85.5 48.3 62.5 70.3 20.0
FLAMe-24B 86.0 81.1 48.0 58.2 62.1 53.6 69.9 91.4 48.3 67.2 74.2 48.4
FLAMe-RM-24B 87.8 80.8 53.1 57.8 65.2 57.5 57.5 91.0 47.8 67.6 72.7 46.6
FLAMe-Opt-RM-24B 87.0 80.2 52.3 53.1 69.5 52.9 48.7 89.1 48.3 69.5 69.5 35.9

Table 1: Performance of FLAMe compared to LLM-as-a-Judge baselines across a wide variety of autorater evaluation benchmarks. Overall, FLAMe variants outperform all popular proprietary LLM-as-a-Judge models on 8 out of 12 benchmarks, including RewardBench and LLM-AggreFact. See Section 5.1 for the sources of our evaluation benchmarks.
Model Average Chat Chat Hard Safety Reasoning
custom classifiers on the official RewardBench leaderboard
Nemotron-4-340B-Reward 92.2 95.8 87.1 92.2 93.6
Cohere May 2024 89.5 96.4 71.3 92.7 97.7
Llama3-70B-SteerLM-RM 89.0 91.3 80.3 93.7 90.6
generative models on the official RewardBench leaderboard
GPT-3.5-turbo-0125 64.5 92.2 44.5 62.3 59.1
Prometheus-2-8×7B 75.3 93.0 47.1 83.5 77.4
Llama-3-70B-Instruct 76.0 97.6 58.9 69.2 78.5
Mixtral-8×7B 77.8 95.0 64.0 73.4 78.7
Claude-3-Opus 80.7 94.7 60.3 89.1 78.7
Gemini-1.5-Flash 82.1 92.2 63.5 87.7 85.1
GPT-4o 84.7 96.6 70.4 86.7 84.9
GPT-4-0125 85.9 95.3 74.3 87.2 86.9
Gemini-1.5-Pro 88.1 92.3 80.6 87.5 92.0
our generative autorater models
PaLM-2-24B 62.9 89.9 61.2 55.3 45.2
FLAMe-24B 86.0 94.7 66.2 88.5 94.7
FLAMe-RM-24B 87.8 92.2 75.7 89.6 93.8
FLAMe-Opt-RM-24B 87.0 92.2 77.0 86.2 92.5
Table 2: A comparison of FLAMe with other generative models on the official RewardBench leaderboard. FLAMe-RM-24B achieves the best overall performance (87.8%) among generative models trained solely on permissively licensed data.
Model Overall LLM-FactVerify Wiki-FactVerify Summarization Long-form QA
GPT-3.5-turbo-0125 70.0 80.1 71.1 64.6 65.4
Mixtral-8×7B 73.8 73.8 50.8 78.1 76.6
Llama-3-70B-Instruct 76.1 75.3 58.4 80.3 77.7
Claude-3-Opus 79.2 78.6 70.6 83.8 75.0
GPT-4o 80.2 79.6 71.6 85.0 76.0
GPT-4-0125 80.6 79.6 71.6 85.3 77.3
our models
PaLM-2-24B 54.8 34.4 28.9 68.2 71.7
FLAMe-24B 81.1 82.3 77.7 85.3 72.7
FLAMe-RM-24B 80.8 82.6 77.2 85.4 70.9
FLAMe-Opt-RM-24B 80.2 77.6 81.2 84.7 74.8
Table 3: LLM-AggreFact performance across four common use-cases: LLM-FactVerify (ClaimVerify + FactCheck + Reveal), Wiki-FactVerify (WiCE), Summarization (AggreFact + TofuEval), and Long-form QA (ExpertQA + LFQA). FLAMe variants outperform all tested LLM-as-a-Judge models in three out of the four use-cases. FLAMe-24B achieves the highest overall performance of 81.1, while the next-best model GPT-4-0125 scores 80.6.

Table 1 shows our main results across all evaluation benchmarks. RewardBench and LLM-AggreFact results are shown in Table 2 and Table 3, respectively. Below, we first provide an overview of these results before analyzing them in more detail:

FLAMe Variants Outperform All LLM-as-a-Judge Baselines on 8 out of 12 Benchmarks:

Our results in Table 1 suggest that FLAMe variants, despite being trained solely on permissively licensed datasets, perform strongly across various evaluation benchmarks. Remarkably, our models outperform all state-of-the-art LLM-as-a-Judge models trained on proprietary data on 8 out of 12 benchmarks. FLAMe variants exceed the next-best model by a large margin on several held-out benchmarks, including: ContrSearch (69.9 vs 57.5 for GPT-4o/GPT-3.5-turbo-0125), RankGen (69.5 vs 66.0 for GPT-4o), AlpacaFarm (58.2 vs 55.5 for GPT-3.5-turbo-0125), SummFeedback (53.1 vs 50.8 for Llama-3-70B-Instruct), and RewardBench (87.8 vs 85.9 for GPT-4-0125). Unsurprisingly, our models also obtain the best held-in performance on HelpSteer (48.4 vs 41.3 for Claude-3-Opus). On the other hand, FLAMe variants lag behind proprietary models on several benchmarks, including HHH (91.4 vs 94.6 for GPT-4-0125/Claude-3-Opus), LitTrans (69.5 vs 72.7 for GPT-4o), and LFQAEval (74.2 vs 77.0 for GPT-4-0125), suggesting that these proprietary models may have been optimized for these capabilities. Interestingly, GPT-4-0125 outperforms GPT-4o on 6 out of 12 benchmarks, including RewardBench, despite GPT-4o achieving a higher rank on the official LMSYS leaderboard (Chiang et al., 2024). Finally, FLAMe provides significant gains over the initial instruction-tuned PaLM-2-24B across almost all benchmarks, highlighting the benefits of FLAMe training. Overall, our results demonstrate FLAMe’s robust generalization to held-out tasks, showcasing its effectiveness as a versatile LLM autorater.
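The "8 out of 12" count can be reproduced directly from the scores in Table 1. The sketch below is an illustrative check rather than the authors' evaluation code: it takes the best FLAMe variant and the best baseline per benchmark and counts strict wins. The variable names are hypothetical, and excluding the PaLM-2-24B starting checkpoint from the FLAMe side is an assumption.

```python
BENCHMARKS = [
    "RewardBench", "LLM-AggreFact", "SummFeedback", "AlpacaFarm",
    "RankGen", "CoPoet", "ContrSearch", "HHH",
    "Dipper", "LitTrans", "LFQAEval", "HelpSteer",
]

# Scores copied from Table 1, in the column order above.
BASELINES = {
    "Llama-3-70B-Instruct": [76.1, 76.1, 50.8, 53.9, 65.6, 53.6, 53.1, 91.9, 42.8, 60.5, 71.1, 39.7],
    "Mixtral-8x7B":         [77.8, 73.8, 43.8, 55.1, 63.3, 52.9, 56.6, 90.0, 42.2, 61.7, 71.5, 34.0],
    "GPT-3.5-turbo-0125":   [64.5, 70.0, 15.6, 55.5, 58.2, 49.0, 57.5, 85.5, 45.0, 54.3, 69.9, 32.0],
    "Claude-3-Opus":        [80.7, 79.2, 31.6, 49.6, 55.1, 49.0, 45.1, 94.6, 50.6, 71.1, 71.1, 41.3],
    "GPT-4-0125":           [85.9, 80.6, 46.5, 49.6, 62.5, 56.9, 55.8, 94.6, 45.0, 67.6, 77.0, 37.9],
    "GPT-4o":               [84.7, 80.2, 30.9, 50.4, 66.0, 55.6, 57.5, 92.3, 45.6, 72.7, 75.0, 40.1],
}
FLAME_VARIANTS = {
    "FLAMe-24B":        [86.0, 81.1, 48.0, 58.2, 62.1, 53.6, 69.9, 91.4, 48.3, 67.2, 74.2, 48.4],
    "FLAMe-RM-24B":     [87.8, 80.8, 53.1, 57.8, 65.2, 57.5, 57.5, 91.0, 47.8, 67.6, 72.7, 46.6],
    "FLAMe-Opt-RM-24B": [87.0, 80.2, 52.3, 53.1, 69.5, 52.9, 48.7, 89.1, 48.3, 69.5, 69.5, 35.9],
}


def best_per_benchmark(scores):
    """Column-wise maximum over all models in `scores`."""
    return [max(column) for column in zip(*scores.values())]


# Benchmarks where the best FLAMe variant strictly beats every baseline.
wins = [
    name
    for name, ours, theirs in zip(
        BENCHMARKS,
        best_per_benchmark(FLAME_VARIANTS),
        best_per_benchmark(BASELINES),
    )
    if ours > theirs
]
```

With these numbers, `wins` contains eight benchmarks; the four where a baseline is ahead are HHH, Dipper, LitTrans, and LFQAEval.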

FLAMe Variants Are Among The Most Powerful Generative Models on RewardBench:

Our results in Table 2 indicate that FLAMe variants are among the top-performing generative models on the official RewardBench leaderboard, achieving strong performance across all categories: Chat, Chat Hard, Safety, and Reasoning. Notably, FLAMe-RM-24B achieves an overall score of 87.8%, the best performance among generative models trained solely on permissively licensed data, surpassing both GPT-4-0125 (85.9%) and GPT-4o (84.7%). As of July 15, 2024, FLAMe-RM-24B ranks second among generative models (below Gemini-1.5-Pro) and sixth among all models (spanning various model types such as custom classifier, generative, sequence classifier, and DPO) on the official RewardBench leaderboard. (Footnote 14: https://huggingface.co/spaces/allenai/reward-bench) While RewardBench is a widely used benchmark for evaluating reward models, we identified issues with length and token bias during our evaluations. We provide an analysis of bias in RewardBench in Appendix 9.
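As a consistency check on Table 2, the Average column appears to match the unweighted mean of the four category scores. This is an observation from the reported numbers, not a statement of how the leaderboard computes its score; the sketch below uses the FLAMe-RM-24B row.

```python
from statistics import mean

# FLAMe-RM-24B category scores from Table 2.
flame_rm_categories = {
    "Chat": 92.2,
    "Chat Hard": 75.7,
    "Safety": 89.6,
    "Reasoning": 93.8,
}

# Unweighted mean over the four categories (an assumption about the metric).
flame_rm_average = mean(flame_rm_categories.values())
```

The result is about 87.8, which agrees with the reported overall score for FLAMe-RM-24B.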

FLAMe Attains the Best Performance on LLM-AggreFact:

Finally, Table 3 presents our attribution results on LLM-AggreFact (Tang et al., 2024), categorized into four common use-cases: 1) LLM-FactVerify: fact verification of LLM-generated responses, 2) Wiki-FactVerify: evaluating correctness of Wikipedia claims, 3) Summarization: assessing faithfulness of summaries, and 4) Long-form QA: evaluating long-form answers to questions. FLAMe variants outperform all other models in three out of the four categories (LLM-FactVerify, Wiki-FactVerify, and Summarization). FLAMe-24B achieves the highest overall performance of 81.1, while the next-best baseline model GPT-4-0125 obtains a score of 80.6. In long-form QA attribution evaluation, our best model FLAMe-Opt-RM underperforms compared to GPT-4-0125 (74.8 vs 77.3), aligning with our findings in Table 1.

6 Further Analysis of FLAMe

In this section, we provide an analysis to elucidate some interesting aspects of our models. We depart from the usual focus on analyzing the effect of factors like model size, data size, and data quality in multitask learning, which have been extensively studied in recent work on multitask learning and instruction tuning (Raffel et al., 2020; Longpre et al., 2023). Instead, we explore potential biases inherent in our LLM autoraters. Additionally, we demonstrate the potential utility of FLAMe for AI development, such as sampling high-quality responses.

6.1 Autorater Bias Analysis

Autorater Avg. (↓) Order (↓) Compassion (↓) Length (↓) Egocentric (↓) Bandwagon (↓) Attention (↓)
Random 0.30 0.50 0.50 0.00 0.25 0.25 0.25
baselines reported in Koo et al. (2023)
Falcon-40B 0.31 0.77 0.27 0.09 0.05 0.28 0.40
Cohere-54B 0.41 0.50 0.65 0.10 0.27 0.82 0.14
Llama-2-70B 0.19 0.61 0.26 0.12 0.06 0.04 0.03
InstructGPT 0.45 0.38 0.48 0.16 0.28 0.85 0.54
ChatGPT 0.45 0.41 0.66 0.13 0.58 0.86 0.06
GPT-4 0.31 0.23 0.79 0.06 0.78 0.00 0.00
our models
FLAMe-24B 0.13 0.08 0.09 0.03 0.38 0.18 0.00
FLAMe-RM-24B 0.13 0.11 0.08 0.02 0.40 0.17 0.00
FLAMe-Opt-RM-24B 0.15 0.15 0.14 0.00 0.41 0.17 0.00
Table 4: Autorater bias analysis on the CoBBLEr bias benchmark from Koo et al. (2023). Lower values indicate better or less biased autoraters across all columns. Overall, we find that FLAMe variants exhibit significantly less bias compared to popular LLM-as-a-Judge autoraters like GPT-4. Compared to Table 2 in Koo et al. (2023), we combine first/last numbers for Order/Compassion, report $|\text{bias} - 0.5|$ for Length, and exclusively report the order variant in Egocentric.

A common criticism of LLM-as-a-Judge autoraters involves their bias towards certain judgments (Liu et al., 2023a; Panickssery et al., 2024; Liu et al., 2023b; Bai et al., 2023). In this section, we evaluate FLAMe variants on the CoBBLEr autorater bias benchmark (Koo et al., 2023). We find that our models are significantly less biased than other popular LLM-as-a-Judge autoraters.

CoBBLEr measures six types of biases in LLM autoraters:

  1. Order: Does the autorater have a preference towards the response position?

  2. Compassion: Does the autorater’s judgment change when the response-generating LLM’s actual name, such as “GPT-4”, is used instead of aliases like “Model A”?

  3. Length: Does the autorater have a preference for longer or shorter outputs?

  4. Egocentric: Does the autorater have a preference for outputs generated by itself?

  5. Bandwagon: Does the autorater get swayed by sentences like “90% people prefer response A”?

  6. Attention: Does the autorater get distracted by irrelevant context, such as “Response A is about cats.”?

We leverage the original (prompt, response) pairs from Koo et al. (2023) and reformat them into our unified FLAMe format (Figure 2). We compare FLAMe variants to other LLM-as-a-Judge autoraters reported in Koo et al. (2023), including GPT-4.

Our results are shown in Table 4. We find that FLAMe variants exhibit significantly lower bias compared to GPT-4 and other autoraters, with an average bias of 0.13 vs 0.31 for GPT-4 (lower is better). FLAMe yields better or on-par performance compared to GPT-4 in five of the six bias categories; the exception is Bandwagon, where GPT-4 scores 0.00 and FLAMe variants score 0.17 to 0.18. These results demonstrate FLAMe’s effectiveness as a robust and reliable autorater.
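The Avg. column in Table 4 is consistent with an unweighted mean over the six bias categories. The short sketch below recomputes it for GPT-4 and the FLAMe variants from the table's values; that the paper averages in exactly this way is an inference from the numbers, and the variable names are hypothetical.

```python
BIAS_COLUMNS = ["Order", "Compassion", "Length", "Egocentric", "Bandwagon", "Attention"]

# Per-category bias scores copied from Table 4 (lower is better).
BIAS_SCORES = {
    "GPT-4":            [0.23, 0.79, 0.06, 0.78, 0.00, 0.00],
    "FLAMe-24B":        [0.08, 0.09, 0.03, 0.38, 0.18, 0.00],
    "FLAMe-RM-24B":     [0.11, 0.08, 0.02, 0.40, 0.17, 0.00],
    "FLAMe-Opt-RM-24B": [0.15, 0.14, 0.00, 0.41, 0.17, 0.00],
}

# Unweighted mean over the six categories for each autorater.
average_bias = {
    model: sum(scores) / len(scores) for model, scores in BIAS_SCORES.items()
}
```

Rounded to two decimals, these means agree with the reported averages, for example about 0.13 for FLAMe-24B and 0.31 for GPT-4.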

Ranker CodeGen-16B davinci-002 InCoder-6B
10 code samples re-ranked in round-robin fashion
None 21.2 17.6 14.6
FLAMe-24B 31.1 22.6 22.0
FLAMe-RM-24B 29.9 23.2 21.3
FLAMe-Opt-RM-24B 29.3 18.3 16.5
Oracle 46.9 63.4 29.3
Table 5: Pass@1 performance on the HumanEval coding benchmark (Chen et al., 2021). Re-ranking code samples with FLAMe variants significantly improves performance across models.

6.2 Using FLAMe to Re-rank Decoded Outputs

Finally, we explore the application of our LLM autoraters in selecting optimal outputs from multiple responses, a method known as “Best-of-N” sampling (Nakano et al., 2021; Krishna et al., 2022). Using FLAMe for re-ranking, we assess its impact on code generation performance with the HumanEval Python programming benchmark (Chen et al., 2021). We conduct experiments by re-ranking 10 code samples generated by three models: OpenAI’s davinci-002, InCoder-6B (Fried et al., 2023), and CodeGen-16B (Nijkamp et al., 2023) using a round-robin competition, and then measuring performance with the top-ranked code sample. (Footnote 15: We use relatively weak LLMs from Chen et al. (2023) for two main reasons: (1) to assess the potential benefits of re-ranking with FLAMe, and (2) HumanEval has been extensively used to develop newer LLMs.) Results in Table 5 show that FLAMe provides significant gains in pass@1 accuracy across all three models. Notably, FLAMe improves CodeGen-16B’s pass@1 from 21.2 to 31.1, closing nearly 40% of the gap to the Oracle ranker (46.9).
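A round-robin re-ranking of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: `prefer_first` is a hypothetical stand-in for a pairwise autorater call, each pair is judged once in a single order, and ties in win count keep the original sample order. The text does not specify these details.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def round_robin_rerank(
    candidates: Sequence[str],
    prefer_first: Callable[[str, str], bool],
) -> List[str]:
    """Rank candidates by the number of pairwise comparisons they win.

    `prefer_first(a, b)` returns True when the judge prefers `a` over `b`.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefer_first(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # sorted() is stable, so ties keep the original candidate order.
    order = sorted(range(len(candidates)), key=lambda idx: -wins[idx])
    return [candidates[idx] for idx in order]


def gap_closed(baseline: float, reranked: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap recovered by re-ranking."""
    return (reranked - baseline) / (oracle - baseline)


# Toy judge for demonstration only: prefers the shorter string.
ranked = round_robin_rerank(["aaa", "a", "aa"], lambda a, b: len(a) < len(b))

# CodeGen-16B numbers from Table 5: None, FLAMe-24B, and Oracle rankers.
codegen_gap = gap_closed(21.2, 31.1, 46.9)
```

With 10 samples this requires 45 pairwise judgments per problem. The `codegen_gap` value is about 0.385, which corresponds to the "nearly 40%" figure quoted above.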

7 Conclusion

We introduce FLAMe, a family of foundational autorater models that can perform various quality assessment tasks. FLAMe is trained on a large and diverse collection of curated and standardized human evaluations derived exclusively from permissively licensed datasets. We demonstrate FLAMe’s strong zero-shot generalization abilities, outperforming models trained on proprietary data like GPT-4 and Claude-3 on many held-out tasks. FLAMe can also effectively serve as a powerful starting point for further downstream fine-tuning. Our FLAMe-RM variant, which is fine-tuned for reward modeling evaluation, is among the top-performing generative models on RewardBench, despite being trained solely on permissively licensed data, outperforming both GPT-4-0125 and GPT-4o. Additionally, we present a more computationally efficient approach using a novel tail-patch fine-tuning strategy to optimize our FLAMe multitask mixture for targeted distributions, offering competitive performance with significantly less compute. Our FLAMe variants outperform popular proprietary LLM-as-a-Judge models across 8 out of 12 autorater evaluation benchmarks, covering 53 quality assessment tasks, including RewardBench and LLM-AggreFact. Finally, our analysis shows that FLAMe exhibits significantly lower bias compared to popular LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.

Limitations and Future Work

Evaluating LLMs is challenging due to evolving evaluation standards and the need to assess new LLM capabilities. Expanding our data collection with open-source contributions could address this issue. Additionally, our models, trained primarily on English data with a context length of 2048 tokens, might not perform well on multilingual (Freitag et al., 2021) or long-context (Kim et al., 2024c; Karpinska et al., 2024) quality assessment tasks. In future releases, we plan to include training on more multilingual datasets with longer context lengths. Finally, in this work, we train our models in a supervised multitask fashion. Exploring alternative training approaches such as RLHF and DPO is a promising direction for future work.

Ethical Considerations and Risks

All considerations and risks outlined by prior work for pretrained and instruction-tuned LLMs (Chowdhery et al., 2022; Anil et al., 2023) apply to LLM autoraters. We recommend following standard practice for responsible development of these models (Achiam et al., 2023; Gemini et al., 2023; Reid et al., 2024). Additionally, LLM autoraters raise new risks due to increased quality assessment capabilities. First, our models can inherit and amplify biases from human evaluations, leading to unfair or discriminatory outcomes. For instance, the model may replicate biases related to race, gender, or other sensitive attributes from the training data, potentially harming certain groups. Second, overreliance on LLM autoraters risks automating decisions that need human understanding and empathy. To mitigate these risks, transparency in model development and use, along with robust measures like bias audits, data anonymization, and incorporating diverse perspectives, is essential for promoting fairness, accountability, and trustworthiness.

Acknowledgments

We are grateful to Jie Ren, Denny Zhou, and Tania Bedrax-Weiss for their comments on this manuscript. We thank Mohit Iyyer, Daniel Cer, Elizabeth Clark, Jeremiah Liu, Balaji Lakshminarayanan, Clara Huiyi Hu, Aliaksei Severyn, Adam Sadovsky, Yonghui Wu, Quoc Le, Slav Petrov, Séb Arnold, Taylan Bilal, Noah Constant, Colin Raffel, Nan Hua, Marzena Karpinska, Yixiao Song, Tuhin Chakrabarty, the Gemini model quality team, the Descartes team at Google, and the UMass NLP group for useful discussions and valuable feedback at different stages of this project. We thank the authors of the datasets used in this work, especially Niklas Muennighoff, Hyungjoo Chae, Mounica Maddela, Tanya Goyal, and Yuanhao Wu, for their helpful suggestions and for answering our questions. Finally, we thank Grady Simon, Chung-Ching Chang, Sho Kannan, Gustavo Hernandez Abrego, and the T5X team for their assistance with the codebase, implementation, and computational resources.


References

  • Achiam et al. (2023) J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. URL https://arxiv.org/abs/2303.08774.
  • Anil et al. (2023) R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023. URL https://arxiv.org/abs/2305.10403.
  • Anthropic (2024) A. Anthropic. Introducing the next generation of claude, 2024. URL https://www.anthropic.com/news/claude-3-family.
  • Askell et al. (2021) A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021. URL https://arxiv.org/abs/2112.00861.
  • Bai et al. (2022a) Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a. URL https://arxiv.org/abs/2204.05862.
  • Bai et al. (2022b) Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b. URL https://arxiv.org/abs/2212.08073.
  • Bai et al. (2023) Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu, K. Zeng, Y. Xiao, H. Lyu, J. Zhang, J. Li, and L. Hou. Benchmarking foundation models with language-model-as-an-examiner. In Advances in Neural Information Processing Systems 36 (NeurIPS), volume 36, pages 78142–78167, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/f64e55d03e2fe61aa4114e49cb654acb-Paper-Datasets_and_Benchmarks.pdf.
  • Bubeck et al. (2023) S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. URL https://arxiv.org/abs/2303.12712.
  • Camburu et al. (2018) O.-M. Camburu, T. Rocktäschel, T. Lukasiewicz, and P. Blunsom. e-snli: Natural language inference with natural language explanations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31 (NeurIPS), volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/4c7a167bb329bd92580a99ce422d6fa6-Paper.pdf.
  • Cer et al. (2017) D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In S. Bethard, M. Carpuat, M. Apidianaki, S. M. Mohammad, D. Cer, and D. Jurgens, editors, Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, 2017. URL https://aclanthology.org/S17-2001.
  • Chakrabarty et al. (2022) T. Chakrabarty, V. Padmakumar, and H. He. Help me write a poem - instruction tuning as a vehicle for collaborative poetry writing. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6848–6863, 2022. URL https://aclanthology.org/2022.emnlp-main.460.
  • Chen et al. (2020) A. Chen, G. Stanovsky, S. Singh, and M. Gardner. MOCHA: A dataset for training and evaluating generative reading comprehension metrics. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6521–6532, 2020. URL https://aclanthology.org/2020.emnlp-main.528.
  • Chen et al. (2023) B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and W. Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=ktrw68Cmu9c.
  • Chen et al. (2021) M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
  • Chiang and Lee (2023) C.-H. Chiang and H.-y. Lee. Can large language models be an alternative to human evaluations? In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 15607–15631, 2023. URL https://aclanthology.org/2023.acl-long.870.
  • Chiang et al. (2023) W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  • Chiang et al. (2024) W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132, 2024. URL https://arxiv.org/abs/2403.04132.
  • Chowdhery et al. (2022) A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311.
  • Chung et al. (2024) H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei. Scaling instruction-finetuned language models. Journal of Machine Learning Research (JMLR), 25(70):1–53, 2024. URL http://jmlr.org/papers/v25/23-0870.html.
  • Clark et al. (2023) E. Clark, S. Rijhwani, S. Gehrmann, J. Maynez, R. Aharoni, V. Nikolaev, T. Sellam, A. Siddhant, D. Das, and A. Parikh. SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9397–9413, 2023. URL https://aclanthology.org/2023.emnlp-main.584.
  • Deutsch et al. (2021) D. Deutsch, T. Bedrax-Weiss, and D. Roth. Towards question-answering as an automatic metric for evaluating the content quality of a summary. Transactions of the Association for Computational Linguistics (TACL), 9:774–789, 2021. URL https://aclanthology.org/2021.tacl-1.47.
  • Devlin et al. (2019) J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4171–4186, 2019. URL https://aclanthology.org/N19-1423.
  • Dou et al. (2022a) Y. Dou, M. Forbes, R. Koncel-Kedziorski, N. A. Smith, and Y. Choi. Is GPT-3 text indistinguishable from human text? scarecrow: A framework for scrutinizing machine text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 7250–7274, May 2022a. URL https://aclanthology.org/2022.acl-long.501.
  • Dou et al. (2022b) Y. Dou, C. Jiang, and W. Xu. Improving large-scale paraphrase acquisition and generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9301–9323, 2022b. URL https://aclanthology.org/2022.emnlp-main.631.
  • Dubois et al. (2023) Y. Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. S. Liang, and T. B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. In Advances in Neural Information Processing Systems 36 (NeurIPS), volume 36, pages 30039–30069, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/5fc47800ee5b30b8777fdd30abcaaf3b-Paper-Conference.pdf.
  • Dubois et al. (2024) Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024. URL https://arxiv.org/abs/2404.04475.
  • Durmus et al. (2020) E. Durmus, H. He, and M. Diab. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5055–5070, 2020. URL https://aclanthology.org/2020.acl-main.454.
  • Dziri et al. (2022a) N. Dziri, E. Kamalloo, S. Milton, O. Zaiane, M. Yu, E. M. Ponti, and S. Reddy. FaithDial: A faithful benchmark for information-seeking dialogue. Transactions of the Association for Computational Linguistics (TACL), 10:1473–1490, 2022a. URL https://aclanthology.org/2022.tacl-1.84.
  • Dziri et al. (2022b) N. Dziri, H. Rashkin, T. Linzen, and D. Reitter. Evaluating attribution in dialogue systems: The BEGIN benchmark. Transactions of the Association for Computational Linguistics (TACL), 10:1066–1083, 2022b. URL https://aclanthology.org/2022.tacl-1.62.
  • Ethayarajh et al. (2022) K. Ethayarajh, Y. Choi, and S. Swayamdipta. Understanding dataset difficulty with $\mathcal{V}$-usable information. In Proceedings of the 39th International Conference on Machine Learning (ICML), volume 162 of Proceedings of Machine Learning Research (PMLR), pages 5988–6008, 2022. URL https://proceedings.mlr.press/v162/ethayarajh22a.html.
  • Fabbri et al. (2021) A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics (TACL), 9:391–409, 2021. URL https://aclanthology.org/2021.tacl-1.24.
  • Fernandes et al. (2023) P. Fernandes, D. Deutsch, M. Finkelstein, P. Riley, A. Martins, G. Neubig, A. Garg, J. Clark, M. Freitag, and O. Firat. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. In P. Koehn, B. Haddow, T. Kocmi, and C. Monz, editors, Proceedings of the Eighth Conference on Machine Translation (WMT), pages 1066–1083, 2023. URL https://aclanthology.org/2023.wmt-1.100.
  • Freitag et al. (2021) M. Freitag, G. Foster, D. Grangier, V. Ratnakar, Q. Tan, and W. Macherey. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics (TACL), 9:1460–1474, 2021. URL https://aclanthology.org/2021.tacl-1.87.
  • Fried et al. (2023) D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, S. Yih, L. Zettlemoyer, and M. Lewis. Incoder: A generative model for code infilling and synthesis. In The Eleventh International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=hQwb-lbM6EL.
  • Fu et al. (2024) J. Fu, S.-K. Ng, Z. Jiang, and P. Liu. GPTScore: Evaluate as you desire. In K. Duh, H. Gomez, and S. Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 6556–6576, 2024. URL https://aclanthology.org/2024.naacl-long.365.
  • Ganguli et al. (2022) D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022. URL https://arxiv.org/abs/2209.07858.
  • Gehrmann et al. (2019) S. Gehrmann, H. Strobelt, and A. Rush. GLTR: Statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL), pages 111–116, 2019. URL https://aclanthology.org/P19-3019.
  • Gemini et al. (2023) Gemini Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. URL https://arxiv.org/abs/2312.11805.
  • Goyal and Durrett (2021) T. Goyal and G. Durrett. Annotating and modeling fine-grained factuality in summarization. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 1449–1462, 2021. URL https://aclanthology.org/2021.naacl-main.114.
  • Goyal et al. (2022) T. Goyal, J. J. Li, and G. Durrett. News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356, 2022. URL https://arxiv.org/abs/2209.12356.
  • Gudibande et al. (2023) A. Gudibande, E. Wallace, C. Snell, X. Geng, H. Liu, P. Abbeel, S. Levine, and D. Song. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717, 2023. URL https://arxiv.org/abs/2305.15717.
  • Gupta et al. (2022) P. Gupta, C.-S. Wu, W. Liu, and C. Xiong. DialFact: A benchmark for fact-checking in dialogue. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3785–3801, 2022. URL https://aclanthology.org/2022.acl-long.263.
  • Honovich et al. (2021) O. Honovich, L. Choshen, R. Aharoni, E. Neeman, I. Szpektor, and O. Abend. $q^{2}$: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7856–7870, 2021. URL https://aclanthology.org/2021.emnlp-main.619.
  • Ivison et al. (2023) H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, J. Jang, D. Wadden, N. A. Smith, I. Beltagy, et al. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702, 2023. URL https://arxiv.org/abs/2311.10702.
  • Iyer et al. (2017) S. Iyer, N. Dandekar, and K. Csernai. First Quora Dataset release: Question pairs, 2017. URL https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs.
  • Ji et al. (2023) J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y. Wang, and Y. Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. In Advances in Neural Information Processing Systems 36 (NeurIPS), volume 36, pages 24678–24704, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/4dbb61cb68671edc4ca3712d70083b9f-Paper-Datasets_and_Benchmarks.pdf.
  • Jiang et al. (2024a) A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024a. URL https://arxiv.org/abs/2401.04088.
  • Jiang et al. (2024b) D. Jiang, Y. Li, G. Zhang, W. Huang, B. Y. Lin, and W. Chen. TIGERScore: Towards building explainable metric for all text generation tasks. Transactions on Machine Learning Research (TMLR), 2024b. ISSN 2835-8856. URL https://openreview.net/forum?id=EE1CBKC0SZ.
  • Karpinska and Iyyer (2023) M. Karpinska and M. Iyyer. Large language models effectively leverage document-level context for literary translation, but critical errors persist. In P. Koehn, B. Haddow, T. Kocmi, and C. Monz, editors, Proceedings of the Eighth Conference on Machine Translation (WMT), pages 419–451, 2023. URL https://aclanthology.org/2023.wmt-1.41.
  • Karpinska et al. (2021) M. Karpinska, N. Akoury, and M. Iyyer. The perils of using Mechanical Turk to evaluate open-ended text generation. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1265–1285, 2021. URL https://aclanthology.org/2021.emnlp-main.97.
  • Karpinska et al. (2024) M. Karpinska, K. Thai, K. Lo, T. Goyal, and M. Iyyer. One thousand and one pairs: A "novel" challenge for long-context language models. arXiv preprint arXiv:2406.16264, 2024. URL https://arxiv.org/abs/2406.16264.
  • Khashabi et al. (2022) D. Khashabi, G. Stanovsky, J. Bragg, N. Lourie, J. Kasai, Y. Choi, N. A. Smith, and D. Weld. GENIE: Toward reproducible and standardized human evaluation for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11444–11458, 2022. URL https://aclanthology.org/2022.emnlp-main.787.
  • Kim et al. (2024a) S. Kim, J. Shin, Y. Cho, J. Jang, S. Longpre, H. Lee, S. Yun, S. Shin, S. Kim, J. Thorne, and M. Seo. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations (ICLR), 2024a. URL https://openreview.net/forum?id=8euJaTveKw.
  • Kim et al. (2024b) S. Kim, J. Suk, S. Longpre, B. Y. Lin, J. Shin, S. Welleck, G. Neubig, M. Lee, K. Lee, and M. Seo. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535, 2024b. URL https://arxiv.org/abs/2405.01535.
  • Kim et al. (2024c) Y. Kim, Y. Chang, M. Karpinska, A. Garimella, V. Manjunatha, K. Lo, T. Goyal, and M. Iyyer. Fables: Evaluating faithfulness and content selection in book-length summarization. arXiv preprint arXiv:2404.01261, 2024c. URL https://arxiv.org/abs/2404.01261.
  • Kingma and Ba (2015) D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. URL https://arxiv.org/abs/1412.6980.
  • Koo et al. (2023) R. Koo, M. Lee, V. Raheja, J. I. Park, Z. M. Kim, and D. Kang. Benchmarking cognitive biases in large language models as evaluators. arXiv preprint arXiv:2309.17012, 2023. URL https://arxiv.org/abs/2309.17012.
  • Korbak et al. (2023) T. Korbak, K. Shi, A. Chen, R. V. Bhalerao, C. Buckley, J. Phang, S. R. Bowman, and E. Perez. Pretraining language models with human preferences. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research (PMLR), pages 17506–17533, 2023. URL https://proceedings.mlr.press/v202/korbak23a.html.
  • Krishna et al. (2021) K. Krishna, A. Roy, and M. Iyyer. Hurdles to progress in long-form question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4940–4957, 2021. URL https://aclanthology.org/2021.naacl-main.393.
  • Krishna et al. (2022) K. Krishna, Y. Chang, J. Wieting, and M. Iyyer. RankGen: Improving text generation with large ranking models. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 199–232, 2022. URL https://aclanthology.org/2022.emnlp-main.15.
  • Krishna et al. (2023a) K. Krishna, E. Bransom, B. Kuehl, M. Iyyer, P. Dasigi, A. Cohan, and K. Lo. LongEval: Guidelines for human evaluation of faithfulness in long-form summarization. In A. Vlachos and I. Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023a. URL https://aclanthology.org/2023.eacl-main.121.
  • Krishna et al. (2023b) K. Krishna, Y. Song, M. Karpinska, J. Wieting, and M. Iyyer. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. In Advances in Neural Information Processing Systems 36 (NeurIPS), volume 36, pages 27469–27500, 2023b. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/575c450013d0e99e4b0ecf82bd1afaa4-Paper-Conference.pdf.
  • Lambert et al. (2024) N. Lambert, V. Pyatkin, J. Morrison, L. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024. URL https://arxiv.org/abs/2403.13787.
  • Lhoest et al. (2021) Q. Lhoest, A. Villanova del Moral, Y. Jernite, A. Thakur, P. von Platen, S. Patil, J. Chaumond, M. Drame, J. Plu, L. Tunstall, J. Davison, M. Šaško, G. Chhablani, B. Malik, S. Brandeis, T. Le Scao, V. Sanh, C. Xu, N. Patry, A. McMillan-Major, P. Schmid, S. Gugger, C. Delangue, T. Matussière, L. Debut, S. Bekman, P. Cistac, T. Goehringer, V. Mustar, F. Lagunas, A. Rush, and T. Wolf. Datasets: A community library for natural language processing. In H. Adel and S. Shi, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), pages 175–184, 2021.
  • Li et al. (2023a) J. Li, X. Cheng, X. Zhao, J.-Y. Nie, and J.-R. Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6449–6464, 2023a. URL https://aclanthology.org/2023.emnlp-main.397.
  • Li et al. (2024) R. Li, T. Patel, and X. Du. PRD: Peer rank and discussion improve large language model based evaluations. Transactions on Machine Learning Research (TMLR), 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=YVD1QqWRaj.
  • Li et al. (2023b) X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b.
  • Li et al. (2022a) Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022a. URL https://www.science.org/doi/abs/10.1126/science.abq1158.
  • Li et al. (2022b) Z. Li, P. Sharma, X. H. Lu, J. Cheung, and S. Reddy. Using interactive feedback to improve the accuracy and explainability of question answering systems post-deployment. In Findings of the Association for Computational Linguistics: ACL 2022 (ACL Findings), pages 926–937, 2022b. URL https://aclanthology.org/2022.findings-acl.75.
  • Lightman et al. (2024) H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=v8L0pN6EOi.
  • Lin et al. (2024) B. Y. Lin, Y. Deng, K. Chandu, F. Brahman, A. Ravichander, V. Pyatkin, N. Dziri, R. L. Bras, and Y. Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770, 2024. URL https://arxiv.org/abs/2406.04770.
  • Lin (2004) C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop of Text Summarization Branches Out (WS), pages 74–81, July 2004. URL https://aclanthology.org/W04-1013.
  • Lin et al. (2022) S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3214–3252, 2022. URL https://aclanthology.org/2022.acl-long.229.
  • Liu et al. (2023a) Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2511–2522, 2023a. URL https://aclanthology.org/2023.emnlp-main.153.
  • Liu et al. (2023b) Y. Liu, N. S. Moosavi, and C. Lin. Llms as narcissistic evaluators: When ego inflates evaluation scores. arXiv preprint arXiv:2311.09766, 2023b. URL https://arxiv.org/abs/2311.09766.
  • Liu et al. (2024) Y. Liu, A. Fabbri, J. Chen, Y. Zhao, S. Han, S. Joty, P. Liu, D. Radev, C.-S. Wu, and A. Cohan. Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization. In K. Duh, H. Gomez, and S. Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024 (NAACL Findings), pages 4481–4501, 2024. URL https://aclanthology.org/2024.findings-naacl.280.
  • Longpre et al. (2023) S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and A. Roberts. The flan collection: Designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research (PMLR), pages 22631–22648, 2023. URL https://proceedings.mlr.press/v202/longpre23a.html.
  • Maddela et al. (2023) M. Maddela, Y. Dou, D. Heineman, and W. Xu. LENS: A learnable evaluation metric for text simplification. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 16383–16408, 2023. URL https://aclanthology.org/2023.acl-long.905.
  • Manakul et al. (2023) P. Manakul, A. Liusie, and M. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9004–9017, 2023. URL https://aclanthology.org/2023.emnlp-main.557.
  • Maynez et al. (2020) J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1906–1919, 2020. URL https://aclanthology.org/2020.acl-main.173.
  • Meta (2024) Meta AI. Introducing meta llama 3: The most capable openly available llm to date. Meta AI, 2024. URL https://ai.meta.com/blog/meta-llama-3/.
  • Min et al. (2023) S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 12076–12100, 2023. URL https://aclanthology.org/2023.emnlp-main.741.
  • Mishra et al. (2022) S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3470–3487, 2022. URL https://aclanthology.org/2022.acl-long.244.
  • Moon et al. (2023) S. Moon, Y. Song, H. Chae, D. Kang, T. Kwon, K. T.-i. Ong, S.-w. Hwang, and J. Yeo. Coffee: Boost your code llms by fixing bugs with feedback. arXiv preprint arXiv:2311.07215, 2023. URL https://arxiv.org/abs/2311.07215.
  • Muennighoff et al. (2023) N. Muennighoff, Q. Liu, A. Zebaze, Q. Zheng, B. Hui, T. Y. Zhuo, S. Singh, X. Tang, L. Von Werra, and S. Longpre. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023. URL https://arxiv.org/abs/2308.07124.
  • Nakano et al. (2021) R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021. URL https://arxiv.org/abs/2112.09332.
  • Nijkamp et al. (2023) E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.
  • OpenAI (2024a) OpenAI. GPT-3.5 Turbo, 2024a. URL https://platform.openai.com/docs/models/gpt-3-5-turbo.
  • OpenAI (2024b) OpenAI. GPT-4 Turbo and GPT-4, 2024b. URL https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4.
  • OpenAI (2024c) OpenAI. Hello GPT-4o, 2024c. URL https://openai.com/index/hello-gpt-4o.
  • Ouyang et al. (2022) L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 27730–27744, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
  • Pagnoni et al. (2021) A. Pagnoni, V. Balachandran, and Y. Tsvetkov. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4812–4829, 2021. URL https://aclanthology.org/2021.naacl-main.383.
  • Panickssery et al. (2024) A. Panickssery, S. R. Bowman, and S. Feng. Llm evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076, 2024. URL https://arxiv.org/abs/2404.13076.
  • Papineni et al. (2002) K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, 2002. URL https://aclanthology.org/P02-1040.
  • Parekh et al. (2021) Z. Parekh, J. Baldridge, D. Cer, A. Waters, and Y. Yang. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (EACL), pages 2855–2870, 2021. URL https://aclanthology.org/2021.eacl-main.249.
  • Pillutla et al. (2021) K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun, S. Welleck, Y. Choi, and Z. Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems 34 (NeurIPS), volume 34, pages 4816–4828, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/260c2432a0eecc28ce03c10dadc078a4-Paper.pdf.
  • Rafailov et al. (2024) R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS), volume 36, 2024. URL https://openreview.net/forum?id=HPuSIXJaa9.
  • Raffel et al. (2020) C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21(140):1–67, 2020. URL https://jmlr.org/papers/v21/20-074.html.
  • Rei et al. (2020) R. Rei, C. Stewart, A. C. Farinha, and A. Lavie. COMET: A neural framework for MT evaluation. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, 2020. URL https://aclanthology.org/2020.emnlp-main.213.
  • Reid et al. (2024) M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. URL https://arxiv.org/abs/2403.05530.
  • Roberts et al. (2023) A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, C. Hawthorne, A. Lewkowycz, A. Salcianu, M. van Zee, J. Austin, S. Goodman, L. B. Soares, H. Hu, S. Tsvyashchenko, A. Chowdhery, J. Bastings, J. Bulian, X. Garcia, J. Ni, A. Chen, K. Kenealy, K. Han, M. Casbon, J. H. Clark, S. Lee, D. Garrette, J. Lee-Thorp, C. Raffel, N. Shazeer, M. Ritter, M. Bosma, A. Passos, J. Maitin-Shepard, N. Fiedel, M. Omernick, B. Saeta, R. Sepassi, A. Spiridonov, J. Newlan, and A. Gesmundo. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research (JMLR), 24(377):1–8, 2023. URL http://jmlr.org/papers/v24/23-0795.html.
  • Sanh et al. (2022) V. Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. L. Scao, S. Biderman, L. Gao, T. Wolf, and A. M. Rush. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4.
  • Schuster et al. (2021) T. Schuster, A. Fisch, and R. Barzilay. Get your vitamin C! robust fact verification with contrastive evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 624–643, 2021. URL https://aclanthology.org/2021.naacl-main.52.
  • Sellam et al. (2020) T. Sellam, D. Das, and A. Parikh. BLEURT: Learning robust metrics for text generation. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 7881–7892, 2020. URL https://aclanthology.org/2020.acl-main.704.
  • Skopek et al. (2023) O. Skopek, R. Aralikatte, S. Gooding, and V. Carbune. Towards better evaluation of instruction-following: A case-study in summarization. In J. Jiang, D. Reitter, and S. Deng, editors, Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 221–237, 2023. URL https://aclanthology.org/2023.conll-1.16.
  • Stiennon et al. (2020) N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems 33 (NeurIPS), volume 33, pages 3008–3021, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf.
  • Su and Xu (2022) Y. Su and J. Xu. An empirical study on contrastive search and contrastive decoding for open-ended text generation. arXiv preprint arXiv:2211.10797, 2022. URL https://arxiv.org/abs/2211.10797.
  • Tang et al. (2024) L. Tang, P. Laban, and G. Durrett. Minicheck: Efficient fact-checking of llms on grounding documents. arXiv preprint arXiv:2404.10774, 2024. URL https://arxiv.org/abs/2404.10774.
  • Thompson and Post (2020) B. Thompson and M. Post. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 90–121, 2020. URL https://aclanthology.org/2020.emnlp-main.8.
  • Vu et al. (2023) T. Vu, M. Iyyer, X. Wang, N. Constant, J. Wei, J. Wei, C. Tar, Y.-H. Sung, D. Zhou, Q. Le, et al. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214, 2023. URL https://arxiv.org/abs/2310.03214.
  • Wang et al. (2020) A. Wang, K. Cho, and M. Lewis. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5008–5020, 2020. URL https://aclanthology.org/2020.acl-main.450.
  • Wang et al. (2023a) J. Wang, Y. Liang, F. Meng, Z. Sun, H. Shi, Z. Li, J. Xu, J. Qu, and J. Zhou. Is ChatGPT a good NLG evaluator? a preliminary study. In Y. Dong, W. Xiao, L. Wang, F. Liu, and G. Carenini, editors, Proceedings of the 4th New Frontiers in Summarization Workshop (NewSum), pages 1–11, 2023a. URL https://aclanthology.org/2023.newsum-1.1.
  • Wang et al. (2023b) Z. Wang, Y. Dong, J. Zeng, V. Adams, M. N. Sreedhar, D. Egert, O. Delalleau, J. P. Scowcroft, N. Kant, A. Swope, et al. Helpsteer: Multi-attribute helpfulness dataset for steerlm. arXiv preprint arXiv:2311.09528, 2023b. URL https://arxiv.org/abs/2311.09528.
  • Wang et al. (2024) Z. Wang, Y. Dong, O. Delalleau, J. Zeng, G. Shen, D. Egert, J. J. Zhang, M. N. Sreedhar, and O. Kuchaiev. Helpsteer2: Open-source dataset for training top-performing reward models. arXiv preprint arXiv:2406.08673, 2024. URL https://arxiv.org/abs/2406.08673.
  • Warstadt et al. (2019) A. Warstadt, A. Singh, and S. R. Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics (TACL), 7:625–641, 2019. URL https://aclanthology.org/Q19-1040.
  • Wei et al. (2022) J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.
  • Wei et al. (2024) J. Wei, C. Yang, X. Song, Y. Lu, N. Hu, D. Tran, D. Peng, R. Liu, D. Huang, C. Du, et al. Long-form factuality in large language models. arXiv preprint arXiv:2403.18802, 2024. URL https://arxiv.org/abs/2403.18802.
  • Williams et al. (2018) A. Williams, N. Nangia, and S. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (NAACL), pages 1112–1122, 2018. URL https://aclanthology.org/N18-1101.
  • Wu et al. (2023a) Y. Wu, J. Zhu, S. Xu, K. Shum, C. Niu, R. Zhong, J. Song, and T. Zhang. Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. arXiv preprint arXiv:2401.00396, 2023a. URL https://arxiv.org/abs/2401.00396.
  • Wu et al. (2023b) Z. Wu, Y. Hu, W. Shi, N. Dziri, A. Suhr, P. Ammanabrolu, N. A. Smith, M. Ostendorf, and H. Hajishirzi. Fine-grained human feedback gives better rewards for language model training. In Advances in Neural Information Processing Systems 36 (NeurIPS), volume 36, pages 59008–59033, 2023b. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/b8c90b65739ae8417e61eadb521f63d5-Paper-Conference.pdf.
  • Xu et al. (2022) F. Xu, J. J. Li, and E. Choi. How do we answer complex questions: Discourse structure of long-form answers. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3556–3572, 2022. URL https://aclanthology.org/2022.acl-long.249.
  • Xu et al. (2023a) F. Xu, Y. Song, M. Iyyer, and E. Choi. A critical evaluation of evaluations for long-form question answering. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 3225–3245, 2023a. URL https://aclanthology.org/2023.acl-long.181.
  • Xu et al. (2023b) W. Xu, D. Wang, L. Pan, Z. Song, M. Freitag, W. Wang, and L. Li. INSTRUCTSCORE: Towards explainable text generation evaluation with automatic feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5967–5994, 2023b. URL https://aclanthology.org/2023.emnlp-main.365.
  • Yu et al. (2023) X. Yu, S. Min, L. Zettlemoyer, and H. Hajishirzi. CREPE: Open-domain question answering with false presuppositions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 10457–10480, 2023. URL https://aclanthology.org/2023.acl-long.583.
  • Yuan et al. (2021) W. Yuan, G. Neubig, and P. Liu. Bartscore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems 34 (NeurIPS), volume 34, pages 27263–27277, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf.
  • Zhang et al. (2020) T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/forum?id=SkeHuCVFDr.
  • Zhang et al. (2019) Y. Zhang, J. Baldridge, and L. He. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (NAACL), pages 1298–1308, 2019. URL https://aclanthology.org/N19-1131.
  • Zhao et al. (2019) W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, 2019. URL https://aclanthology.org/D19-1053.
  • Zhao et al. (2023) Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh, and P. J. Liu. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023. URL https://arxiv.org/abs/2305.10425.
  • Zheng et al. (2023) L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems 36 (NeurIPS), volume 36, pages 46595–46623, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.
  • Zhou et al. (2023) C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy. LIMA: Less is more for alignment. In Advances in Neural Information Processing Systems 36 (NeurIPS), volume 36, pages 55006–55021. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ac662d74829e4407ce1d126477f4a03a-Paper-Conference.pdf.

Appendix

8 List of Training Datasets in FLAMe

Table 7 shows the list of datasets used in our study.

9 Analyzing Length and Token Bias in RewardBench

In this section, we provide an analysis of length (Appendix 9.1) and token (Appendix 9.2) bias issues identified in the RewardBench benchmark. Given these issues, we encourage future work to evaluate LLM autoraters on a wide variety of benchmarks (such as our evaluation suite in Section 5), rather than relying solely on RewardBench.

9.1 Length Bias in RewardBench

Table 6 highlights length bias in RewardBench. Overall, RewardBench shows significant imbalance across categories regarding length: Chat Hard, Math, and Coding favor shorter outputs, while Chat leans towards longer outputs. An adversarial submission might strategically select longer or shorter outputs based on prompt categories to achieve higher scores, without necessarily reflecting a genuinely strong preference model.

| RewardBench Category | % Preference for Longer Outputs |
|---|---|
| Chat | 79.1% |
| Chat Hard | 29.6% |
| Math | 6.5% |
| Coding | 35.7% |
| Safety | 41.9% |
Table 6: A summary of length bias in RewardBench. Overall, we find that four out of five RewardBench categories show a strong preference towards either longer or shorter outputs.
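The length statistic summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the paper: the record fields `category`, `chosen`, and `rejected` are hypothetical, length is measured in characters (the paper does not state the unit), and ties are counted as "not longer".

```python
from collections import defaultdict


def longer_preference_rate(examples):
    """Per category, the fraction of pairs whose preferred response is longer.

    Each example is a dict with hypothetical keys "category", "chosen"
    (preferred response) and "rejected". Length is measured in characters,
    which is an assumption; equal-length pairs count as "not longer".
    """
    longer = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["category"]] += 1
        if len(ex["chosen"]) > len(ex["rejected"]):
            longer[ex["category"]] += 1
    return {cat: longer[cat] / total[cat] for cat in total}


def length_heuristic_accuracy(rates):
    """Accuracy of a rule that, per category, always picks the longer
    response or always picks the other one, whichever matches more labels."""
    return {cat: max(rate, 1.0 - rate) for cat, rate in rates.items()}
```

If the Table 6 percentages are read as the share of pairs whose preferred response is longer, and ties are ignored, such a length-only rule would reach roughly 79.1% on Chat and 93.5% on Math without modeling preferences at all.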

9.2 Token Bias in RewardBench

Besides length bias, we identified token bias in the Math and Safety categories of RewardBench. In Safety, preferred responses leaned significantly towards phrases like “I’m sorry”, which suggest hedged responses: the word “sorry” appeared nearly 23% more frequently in preferred responses than in non-preferred ones. Similarly, in the Math split, tokens such as “i”, “can”, “need”, “to”, and “find” were found predominantly in rejected responses.
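One way to surface this kind of token bias is to compare how often each token occurs in preferred versus rejected responses. The sketch below is illustrative only: the regex tokenizer and the choice of statistic (the difference in the fraction of responses containing a token) are assumptions, and the paper does not specify how its 23% figure was computed.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplistic lowercase word tokenizer (an assumption, not the paper's).
    return re.findall(r"[a-z']+", text.lower())


def response_frequency(responses):
    """Count, for each token, how many responses contain it at least once."""
    counts = Counter()
    for response in responses:
        counts.update(set(tokenize(response)))
    return counts


def token_bias(pairs):
    """Map each token to (share of preferred responses containing it)
    minus (share of rejected responses containing it).

    `pairs` is a list of (preferred, rejected) strings. Positive values
    mark tokens over-represented in preferred responses.
    """
    chosen = response_frequency(p for p, _ in pairs)
    rejected = response_frequency(r for _, r in pairs)
    n = len(pairs)
    vocab = set(chosen) | set(rejected)
    return {tok: (chosen[tok] - rejected[tok]) / n for tok in vocab}
```

Sorting the result by value puts tokens like “sorry” near the top if they are concentrated in preferred responses, and tokens concentrated in rejected responses near the bottom.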

| Capability | Dataset | Source | Output Format |
|---|---|---|---|
| General Response Quality | BeaverTails Helpfulness | Ji et al. (2023) | Pairwise |
| | HH RLHF Helpfulness | Bai et al. (2022a) | Pairwise |
| | Hurdles LFQA | Krishna et al. (2021) | Pairwise |
| | LMSYS Chatbot Arena conversations | Zheng et al. (2023) | Pairwise |
| | MAUVE | Pillutla et al. (2021) | Pairwise |
| | News Summary Evaluation | Goyal et al. (2022) | Pairwise |
| | PRD | Li et al. (2024) | Pairwise |
| | SHP | Ethayarajh et al. (2022) | Pairwise |
| | HelpSteer | Wang et al. (2023b) | Pairwise, Pointwise |
| | Summary Comparisons | Stiennon et al. (2020) | Pairwise, Pointwise |
| | GENIE | Khashabi et al. (2022) | Pairwise, Pointwise, Generative |
| | Fine-grained RLHF | Wu et al. (2023b) | Pairwise, Classification |
| | InstruSum | Liu et al. (2024) | Pairwise, Classification |
| | WebGPT | Nakano et al. (2021) | Pairwise, Generative |
| | LENS | Maddela et al. (2023) | Pointwise |
| | SummEval | Fabbri et al. (2021) | Pointwise |
| | riSum | Skopek et al. (2023) | Pointwise, Classification |
| | FeedbackQA | Li et al. (2022b) | Pointwise, Generative |
| | CoLA | Warstadt et al. (2019) | Classification |
| | SEAHORSE | Clark et al. (2023) | Classification |
| | CREPE | Yu et al. (2023) | Classification, Generative |
| | Scarecrow | Dou et al. (2022a) | Classification, Generative |
| | Validity LFQA | Xu et al. (2022) | Classification, Generative |
| Factuality/Attribution | MOCHA | Chen et al. (2020) | Pointwise |
| | Sentence Similarity - C×C | Parekh et al. (2021) | Pointwise |
| | Sentence Similarity - STS-B | Cer et al. (2017) | Pointwise |
| | WikiBio Hallucination | Manakul et al. (2023) | Pointwise |
| | BEGIN | Dziri et al. (2022b) | Classification |
| | DialFact | Gupta et al. (2022) | Classification |
| | FActScore | Min et al. (2023) | Classification |
| | FRANK | Pagnoni et al. (2021) | Classification |
| | FaithDial | Dziri et al. (2022a) | Classification |
| | HaluEval | Li et al. (2023a) | Classification |
| | MNLI | Williams et al. (2018) | Classification |
| | MultiPIT | Dou et al. (2022b) | Classification |
| | PAWS | Zhang et al. (2019) | Classification |
| | Q2 | Honovich et al. (2021) | Classification |
| | QAGS | Wang et al. (2020) | Classification |
| | QQP | Iyer et al. (2017) | Classification |
| | VitaminC | Schuster et al. (2021) | Classification |
| | RAGTruth | Wu et al. (2023a) | Classification |
| | ESNLI | Camburu et al. (2018) | Classification, Generative |
| | XSum Hallucination | Maynez et al. (2020) | Generative |
| Mathematical Reasoning | PRM800K | Lightman et al. (2024) | Pairwise |
| Coding | Code Contests | Li et al. (2022a) | Pairwise |
| | COFFEE | Moon et al. (2023) | Pairwise |
| | CommitPack | Muennighoff et al. (2023) | Pairwise |
| | CommitPack - Bugs | Muennighoff et al. (2023) | Pairwise |
| Safety | BeaverTails Harmlessness | Ji et al. (2023) | Pairwise |
| | HH RLHF Harmlessness | Bai et al. (2022a) | Pairwise |
| | HH RLHF Red Teaming | Bai et al. (2022a) | Pointwise |
| | BeaverTails QA-Classification | Ji et al. (2023) | Classification |
| Instruction Tuning | LIMA | Zhou et al. (2023) | Generative |
| | PRM800K IF | Lightman et al. (2024) | Generative |
| | TULU-2 | Ivison et al. (2023) | Generative |
Table 7: A complete list of training datasets in our FLAMe collection, including their output formats and categorized capabilities. We derive multiple tasks from certain datasets. For example, HelpSteer (Wang et al., 2023b) includes human annotations for different attributes of model responses such as Helpfulness, Correctness, Coherence, Complexity, and Verbosity, allowing us to create distinct tasks, each focused on a specific attribute.
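As a rough illustration of deriving several tasks from one dataset, the sketch below splits a single multi-attribute annotation into one pointwise example per attribute. It is not the FLAMe data pipeline: the record keys, the task names, and the prompt wording are all hypothetical, and only the attribute names come from the caption above.

```python
# Attribute names as listed in the Table 7 caption for HelpSteer.
ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")


def split_into_attribute_tasks(record):
    """Turn one annotated (prompt, response) record into per-attribute tasks.

    `record` is a dict with hypothetical keys "prompt", "response", and one
    numeric score per attribute. Attributes missing from the record are
    skipped. The prompt template is an illustrative placeholder.
    """
    tasks = []
    for attr in ATTRIBUTES:
        if attr not in record:
            continue
        tasks.append({
            "task": f"helpsteer_{attr}",  # hypothetical task name
            "input": (
                f"Rate the {attr} of the response.\n"
                f"Prompt: {record['prompt']}\n"
                f"Response: {record['response']}"
            ),
            "target": str(record[attr]),
        })
    return tasks
```

Each returned item focuses on a single attribute, so the same underlying annotation can contribute up to five distinct training tasks.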