An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases

Dylan  Bouchard
CVS Health®
Abstract

Large language models (LLMs) can exhibit bias in a variety of ways. Such biases can create or exacerbate unfair outcomes for certain groups within a protected attribute, including, but not limited to, sex, race, sexual orientation, or age. This paper aims to provide a technical guide for practitioners to assess bias and fairness risks in LLM use cases. The main contribution of this work is a decision framework that allows practitioners to determine which metrics to use for a specific LLM use case. To achieve this, this study categorizes LLM bias and fairness risks, maps those risks to a taxonomy of LLM use cases, and then formally defines various metrics to assess each type of risk. As part of this work, several new bias and fairness metrics are introduced, including innovative counterfactual metrics as well as metrics based on stereotype classifiers. Instead of focusing solely on the model itself, the sensitivity to both prompt-risk and model-risk is taken into account by defining evaluations at the level of an LLM use case, characterized by a model and a population of prompts. Furthermore, because all of the evaluation metrics are calculated solely using the LLM output, the proposed framework is highly practical and easily actionable for practitioners. [Footnote 1: The examples provided in this paper are purely hypothetical and are not intended to reflect the specific work or practices of the author-affiliated company. They are used solely for illustrative purposes. Any resemblance to actual practices or projects is coincidental.]

1 Introduction

The versatility of current Large Language Models (LLMs) in handling various tasks (Minaee et al., 2024; Liu et al., 2023; Ray, 2023) presents challenges when it comes to evaluating bias and fairness at the model level. Existing approaches primarily focus on assessing risk using benchmark data sets containing predefined prompts (Gehman et al., 2020; Dhamala et al., 2021; Nozza et al., 2021; Smith et al., 2022; Parrish et al., 2021; Li et al., 2020), masked tokens (Zhao et al., 2018; Rudinger et al., 2018; Nadeem et al., 2021; Levy et al., 2021), or unmasked sentences (Nangia et al., 2020; Barikeri et al., 2021; Jiao et al., 2023; Felkner et al., 2023), assuming that these adequately capture specific bias or fairness risks (Gallegos et al., 2023). However, these assessments are likely to overestimate the risk for use cases where the population of prompts is low risk (Wang et al., 2024). Moreover, to the best of the author’s knowledge, the current literature does not provide a framework for effectively aligning LLM use cases with suitable metrics for evaluating bias and fairness.

This work aims to address these limitations by developing an actionable LLM bias and fairness evaluation framework defined at the use case level. Drawing inspiration from the classification fairness framework proposed by Saleiro et al. (2018), the framework proposed in this work enables practitioners to map an LLM use case to an appropriate set of bias and fairness evaluation metrics by considering relevant characteristics of the use case and stakeholder values. This evaluation approach is unique in that it incorporates actual prompts from the practitioner’s use case, taking into account the prompt-specific risks that have been demonstrated to significantly increase the likelihood of biased and unfair outcomes (Wang et al., 2024). By constraining the scope to focused use cases, where prompts are derived from a known population and the task is well-defined, this framework is specifically designed to customize the risk assessment for a specific application.

To introduce the framework, this study first provides formal definitions of bias and fairness desiderata for LLMs from the literature and segments these definitions by risk category. Subsequently, those risks are mapped to a taxonomy of use cases focused on large-scale applications, where human-in-the-loop review may be infeasible. Lastly, for each risk category, various bias and fairness evaluation metrics are detailed, with discussions provided on their input requirements, calculation methods, the risks they assess, and circumstances under which they should be applied. As part of this work, a variety of novel bias and fairness metrics are introduced. These include innovative counterfactual adaptations of recall-oriented understudy for gisting evaluation (ROUGE) (Lin, 2004), bilingual evaluation understudy (BLEU) (Papineni et al., 2002), and cosine similarity (Singhal and Google, 2001), as well as a set of stereotype classifier-based metrics that are adapted from analogous toxicity classifier-based metrics.

For practical reasons, this study limits the selection of LLM bias and fairness metrics to those requiring only LLM generated output as inputs. This includes 1) generated text metrics, which take a generated set of tokens as input (Gallegos et al., 2023), 2) recommendation fairness metrics, calculated on a set of LLM-provided recommendations (Zhang et al., 2023), and 3) classification fairness metrics, which are already well-established in the machine learning fairness literature (Bellamy et al., 2018; Saleiro et al., 2018; Weerts et al., 2023; Hardt et al., 2016; Feldman et al., 2014; Mehrabi et al., 2019). Due to practical limitations of the input requirements, the framework omits both embedding-based metrics, which are computed using an LLM’s hidden vector representations of words or sentences (Islam et al., 2016; May et al., 2019; Guo and Caliskan, 2020), and probability-based metrics, which leverage predicted token probabilities from an LLM (Webster et al., 2020; Kurita et al., 2019; Ahn and Oh, 2021; Kaneko and Bollegala, 2021; Salazar et al., 2019; Barikeri et al., 2021; Nangia et al., 2020; Nadeem et al., 2021). To further support the rationale behind the selection of evaluation metrics, it is important to note that metrics focused on the downstream task, consistent with the metrics incorporated in this framework, have been shown to be more reliable than metrics derived from embeddings or token probabilities (Goldfarb-Tarrant et al., 2020; Delobelle et al., 2021).

The remainder of this paper is organized as follows. Section 2 formally defines various notions of bias and fairness in LLMs. Section 3 discusses various techniques for conducting bias and fairness assessments in LLMs. Section 4 offers a framework for choosing among LLM bias and fairness metrics based on use case characteristics and stakeholder values. Finally, Section 5 offers concluding remarks.

2 Bias and Fairness Risks for LLM Use Cases

This section formally defines several prerequisite terms and concepts upon which subsequent sections rely, several of which are adapted from those provided by Gallegos et al. (2023). [Footnote 2: The notation and terminology used in this paper differ from those used by Gallegos et al. (2023). The approach outlined below adopts modified versions of these definitions to better fit the tone and purpose.] For further reading on biases in LLMs, the reader may refer to (Gallegos et al., 2023; Blodgett et al., 2020; Kumar et al., 2023; Li et al., 2024; Chu et al., 2024; Ferrara, 2023; Ranaldi et al., 2023; Kotek et al., 2023; Wu and Aji, 2023; Li et al., 2023; Ray, 2023; Nozza et al., 2022; Zhuo et al., 2023). The discussions in this paper explore concepts of bias and fairness in relation to an arbitrary ‘protected attribute’, encompassing examples such as sex, race, age, and sexual orientation, among others. These concepts are established formally using mathematical notation whenever possible, and this paper strives to keep notation consistent throughout, even where it differs from the notation used in the papers cited.

2.1 Preliminary Definitions

This section provides preliminary definitions to be used throughout the subsequent sections.

Large Language Model (LLM).

An LLM $\mathcal{M}:\mathcal{X}\rightarrow\mathcal{Y}$ is a pre-trained, transformer-based model that maps a text sequence $X\in\mathcal{X}$ to an output $\hat{Y}\in\mathcal{Y}$, where $\mathcal{X}$ denotes the set of all possible text inputs (i.e. prompts) and the form of $\hat{Y}$ is specific to the LLM and the use case (Gallegos et al., 2023). [Footnote 3: Following Gallegos et al. (2023), $\mathcal{M}$ is assumed to have an autoregressive, autoencoding, or encoder-decoder architecture and to have undergone training on an extensive corpus containing hundreds of millions to trillions of tokens.] [Footnote 4: Hereafter, all LLM inputs $X$ are referred to as ‘prompts’. Let $\mathcal{X}$ denote the set of all possible prompts.] Let $\theta$ parameterize $\mathcal{M}$, such that $\hat{Y}=\mathcal{M}(X;\theta)$.

Population of Prompts.

A population of prompts, denoted $\mathcal{P}_X$, is a collection of LLM inputs. To characterize well-defined use cases, subsequent sections refer to a ‘known population of prompts’, indicating that practitioners possess information about the prompt domain and are able to draw representative samples from $\mathcal{P}_X$. For instance, a population of prompts might consist of clinical notes, where each individual prompt includes a collection of notes, accompanied by specific instructions for the LLM to generate a summary (Chuang et al., 2024).

Large Language Model Use Case.

An LLM use case is characterized by an LLM $\mathcal{M}(X;\theta)$ and a population of prompts $\mathcal{P}_X$. In the interest of concise notation, LLM use cases will be hereafter denoted as $(\mathcal{M},\mathcal{P}_X)$. An LLM use case is evaluated on a finite set of responses generated by $\mathcal{M}(X;\theta)$ from a sample of $N$ prompts $X_1,\dots,X_N$, drawn from the population $\mathcal{P}_X$.

Protected Attribute Groups.

A protected attribute group $G\in\mathcal{G}$ represents a subset of people characterized by a shared identity trait, where $\mathcal{G}$ is a partition (Gallegos et al., 2023).

Protected Attribute Group Lexicon.

A protected attribute group lexicon $A\in\mathcal{A}$ is a collection of words that correspond to protected attribute group $G\in\mathcal{G}$. [Footnote 5: Hereafter, in a slight abuse of notation, $\mathcal{G}$ (and analogously $\mathcal{A}$) can denote either the partition or the union of its elements $G\in\mathcal{G}$ (and analogously $A\in\mathcal{A}$).] [Footnote 6: One example, provided by Bommasani et al. (2023), of protected attribute words for the protected attribute group ‘males’ is { ‘he’, ‘son’, ‘his’, ‘him’, ‘father’, ‘man’, ‘boy’, ‘himself’, ‘male’, ‘brother’, ‘sons’, ‘fathers’, ‘men’, ‘boys’, ‘males’, ‘brothers’, ‘uncle’, ‘uncles’, ‘nephew’, ‘nephews’ }.] [Footnote 7: The examples that consider male vs. female are used in accordance with the convention established by previous studies such as Bordia and Bowman (2019) and Bommasani et al. (2023). This usage is solely for the purpose of consistency and does not intend to imply that there are only two genders. A broader spectrum of gender identities is acknowledged and respected.]

Counterfactual Input Pair.

A counterfactual input pair is a pair of prompts, $X'$ and $X''$, which are identical in every way except that the former mentions protected attribute group $G'$ and the latter mentions protected attribute group $G''$ (Gallegos et al., 2023). For an LLM use case $(\mathcal{M},\mathcal{P}_X)$, an evaluation set of counterfactual input pairs is denoted $(X_1',X_1''),\dots,(X_N',X_N'')$. To create each pair, a prompt is drawn from the subset of prompts containing words from the protected attribute lexicon $\mathcal{A}$, i.e. $\mathcal{P}_{X|\mathcal{A}}=\{X: X\in\mathcal{P}_X,\ X\cap\mathcal{A}\neq\emptyset\}$, and counterfactual variations are obtained via counterfactual substitution. [Footnote 8: Here, counterfactual substitution means using word substitution to replace the mention of one group with an analogous word corresponding to another group. For instance, (‘then he went to the store’, ‘then she went to the store’) would be an example of a counterfactual input pair for sex.]
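As an illustration only, counterfactual substitution can be sketched with a small word-level mapping. The mapping `MALE_TO_FEMALE` and the function name `counterfactual_pair` are hypothetical and not part of the framework; the mapping is deliberately tiny, and a practical lexicon would need to handle ambiguous forms (for example, ‘his’ can correspond to ‘her’ or ‘hers’).

```python
import re

# Hypothetical, deliberately small mapping used only for illustration.
MALE_TO_FEMALE = {
    "he": "she", "him": "her", "his": "her", "himself": "herself",
    "man": "woman", "men": "women", "father": "mother", "son": "daughter",
}

def counterfactual_pair(prompt, mapping=MALE_TO_FEMALE):
    """Return (X', X''), where X'' replaces each mapped word with its counterpart."""
    # Whole-word, case-insensitive match on any key of the mapping.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, mapping)) + r")\b", re.IGNORECASE
    )

    def swap(match):
        word = match.group(0)
        replacement = mapping[word.lower()]
        # Keep a leading capital letter if the original word had one.
        return replacement.capitalize() if word[0].isupper() else replacement

    return prompt, pattern.sub(swap, prompt)
```

Applied to the example in the footnote, `counterfactual_pair("then he went to the store")` returns the original prompt together with ‘then she went to the store’.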

Fairness Through Unawareness (FTU).

Given a protected attribute lexicon $\mathcal{A}$, an LLM use case $(\mathcal{M},\mathcal{P}_X)$ satisfies FTU if, for each $X\in\mathcal{P}_X$, $X\cap\mathcal{A}=\emptyset$. In simpler terms, FTU implies that none of the prompts for an LLM use case include any mention of a protected attribute word (Gallegos et al., 2023).
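In practice, FTU can only be checked on a sample of prompts. The following is a minimal sketch under that assumption; the tokenizer and the function names `satisfies_ftu` and `prompts_mentioning` are illustrative choices rather than definitions from this paper.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a simple stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z']+", text.lower()))

def satisfies_ftu(prompts, lexicon):
    """True if no sampled prompt contains a protected attribute word."""
    lexicon = {w.lower() for w in lexicon}
    return all(tokenize(p).isdisjoint(lexicon) for p in prompts)

def prompts_mentioning(prompts, lexicon):
    """Sampled analogue of the subset of prompts that mention the lexicon."""
    lexicon = {w.lower() for w in lexicon}
    return [p for p in prompts if tokenize(p) & lexicon]
```

Matching on whole tokens rather than substrings avoids false positives such as ‘he’ inside ‘the’. A sample that passes this check does not prove that the full population of prompts satisfies FTU.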

2.2 LLM Bias and Fairness Risks

This section presents formal definitions of various notions of bias and fairness applicable to LLMs. A taxonomy for these definitions is proposed, organized by risk category. Specifically, the risk categories included in this taxonomy are toxicity, stereotyping, counterfactual fairness, and allocational harms.

2.2.1 Toxicity

Drawing from the definitions of toxicity and derogatory language outlined in the survey conducted by Gallegos et al. (2023), let the definition of toxic text encompass any offensive language that 1) launches attacks, issues threats, or incites hate or violence against a social group, or 2) includes the usage of pejorative slurs, insults, or any other forms of expression that specifically target and belittle a social group. To formalize this, a corresponding fairness desideratum is introduced below.

Non-Toxicity.

Let $\mathcal{T}$ denote the set of all toxic phrases. An LLM use case $(\mathcal{M},\mathcal{P}_X)$ exhibits non-toxicity if $\mathcal{M}(X;\theta)\cap\mathcal{T}=\emptyset$ for each $X\in\mathcal{P}_X$.

2.2.2 Stereotyping

Stereotyping is an important type of social bias that should be considered in the context of LLMs (Bommasani et al., 2023; Bordia and Bowman, 2019; Zekun et al., 2023). This study follows the work by Gallegos et al. (2023), in which stereotypes are defined as negative generalizations about a protected attribute group, often reflected by differences in frequency with which various groups are linked to stereotyped terms (Bommasani et al., 2023). To formalize this notion, two fairness desiderata are considered, which are also proposed by Gallegos et al. (2023): equal group associations (EGA) and equal neutral associations (ENA).

Equal Group Associations (Gallegos et al., 2023).

For two protected attribute groups $G', G''$ with respective associated lexicons $A', A''$, and a set of neutral words $W$, an LLM use case $(\mathcal{M},\mathcal{P}_X)$ satisfies equal group associations if, for each $w\in W$, $P(w\in\hat{Y}\mid\hat{Y}\cap A'\neq\emptyset)=P(w\in\hat{Y}\mid\hat{Y}\cap A''\neq\emptyset)$. Put simply, equal group associations requires that each neutral word in $W$ is equally likely to be contained in the output of $\mathcal{M}$ regardless of which protected attribute group is mentioned.
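The conditional probabilities in this definition can be approximated by sample proportions over a set of generated responses. The sketch below is one possible estimator, not a metric defined in this paper; the function names and the whole-word tokenizer are assumptions.

```python
import re

def tokens(text):
    """Lowercase word tokens of a response."""
    return set(re.findall(r"[a-z']+", text.lower()))

def association_rate(responses, word, lexicon):
    """Sample estimate of P(word in Y | Y mentions a word from lexicon)."""
    mentioning = [tokens(r) for r in responses if tokens(r) & lexicon]
    if not mentioning:
        return None  # undefined when no response mentions the group
    return sum(word in t for t in mentioning) / len(mentioning)

def ega_gaps(responses, neutral_words, lexicon_a, lexicon_b):
    """Per-word difference in association rates between the two groups."""
    gaps = {}
    for w in neutral_words:
        pa = association_rate(responses, w, lexicon_a)
        pb = association_rate(responses, w, lexicon_b)
        gaps[w] = None if pa is None or pb is None else pa - pb
    return gaps
```

Gaps close to zero for every word in the neutral set are consistent with equal group associations on the evaluated sample; how close is close enough remains a judgment for the practitioner.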

Equal Neutral Associations (Gallegos et al., 2023).

For two protected attribute groups $G', G''$ with respective associated lexicons $A', A''$, and a set of neutral words $W$, an LLM use case $(\mathcal{M},\mathcal{P}_X)$ satisfies equal neutral associations if $P(A'\cap\hat{Y}\neq\emptyset\mid\hat{Y}\cap W\neq\emptyset)=P(A''\cap\hat{Y}\neq\emptyset\mid\hat{Y}\cap W\neq\emptyset)$. In other words, equal neutral associations requires that co-occurrence of protected attribute words with a set of neutral words $W$ is equally probable for both protected attribute groups. [Footnote 9: In practice, the set $W$ referenced in the definitions of equal group associations and equal neutral associations typically contains frequently stereotyped terms such as professions or adjectives (Bordia and Bowman, 2019; Bommasani et al., 2023).]
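Equal neutral associations conditions on the neutral words rather than on the group mention. A comparable sample-proportion sketch follows, again with hypothetical function names and a simple whole-word tokenizer as assumptions.

```python
import re

def tokens(text):
    """Lowercase word tokens of a response."""
    return set(re.findall(r"[a-z']+", text.lower()))

def ena_rates(responses, neutral_words, lexicon_a, lexicon_b):
    """Sample estimates of P(Y mentions A' | Y contains W) and the A'' analogue."""
    with_neutral = [t for t in map(tokens, responses) if t & neutral_words]
    if not with_neutral:
        return None  # undefined when no response contains a neutral word
    n = len(with_neutral)
    rate_a = sum(bool(t & lexicon_a) for t in with_neutral) / n
    rate_b = sum(bool(t & lexicon_b) for t in with_neutral) / n
    return rate_a, rate_b
```

Similar values for the two rates are consistent with equal neutral associations on the evaluated sample.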

2.2.3 Counterfactual Fairness

In many contexts, it is undesirable for an LLM to generate substantially different output as a result of different protected attribute words contained in the input prompts, all else equal (Huang et al., 2020; Nozza et al., 2021; Wang et al., 2024). Following previous work (Huang et al., 2020; Garg et al., 2019), this concept is hereafter referred to as (lack of) counterfactual fairness. Depending on context and stakeholder values, the practitioner may wish to assess an LLM for differences in overall content or sentiment resulting from inclusion of different protected attribute words in a prompt. Below, this section provides a formal definition of the corresponding fairness desideratum, known as counterfactual invariance, adapted from Gallegos et al. (2023).

Counterfactual Invariance.

For two protected attribute groups $G', G''$, an LLM use case $(\mathcal{M},\mathcal{P}_X)$ satisfies counterfactual invariance if, for a specified invariance metric $\upsilon(\cdot,\cdot)$, the expected value of the invariance metric does not exceed some tolerance level $\epsilon$:

\[
\mathbb{E}\left[\upsilon\left(\mathcal{M}(X';\theta),\mathcal{M}(X'';\theta)\right)\right]\leq\epsilon,
\]

where $(X',X'')$ is a counterfactual input pair corresponding to $G',G''$ (Gallegos et al., 2023). [Footnote 10: This modified definition of counterfactual invariance relaxes the strict equality required by Gallegos et al. (2023).]
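To make the inequality concrete, the expectation can be replaced by a sample mean over responses to counterfactual input pairs. The sketch below uses Jaccard distance on word sets as a stand-in invariance metric; that choice, like the function names, is an assumption made for illustration and is not one of the metrics proposed in this paper.

```python
def jaccard_distance(a, b):
    """Hypothetical invariance metric: 1 - |A and B| / |A or B| on word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0  # two empty responses are treated as identical
    return 1.0 - len(ta & tb) / len(ta | tb)

def mean_invariance(response_pairs, metric=jaccard_distance):
    """Sample mean of the metric over pairs (M(X'), M(X''))."""
    if not response_pairs:
        raise ValueError("at least one response pair is required")
    return sum(metric(y1, y2) for y1, y2 in response_pairs) / len(response_pairs)

def satisfies_counterfactual_invariance(response_pairs, epsilon,
                                        metric=jaccard_distance):
    """Check the sample analogue of E[metric] <= epsilon."""
    return mean_invariance(response_pairs, metric) <= epsilon
```

Any metric for which smaller values indicate more similar responses can be passed as `metric`; the tolerance `epsilon` depends on context and stakeholder values.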

2.2.4 Allocational Harms

Allocational harms, which Gallegos et al. (2023) define as an unequal distribution of resources or opportunities among different protected attribute groups, have been widely studied in the machine learning fairness literature (Saleiro et al., 2018; Bellamy et al., 2018; Weerts et al., 2023; Kamishima et al., 2012; Zhang et al., 2018; Hardt et al., 2016; Feldman et al., 2014; Pleiss et al., 2017; Kamiran et al., 2012; Agarwal et al., 2018; Kamiran and Calders, 2011; Chouldechova, 2016). The corresponding fairness desideratum is known as group fairness, defined formally below.

Group Fairness.

Given two protected attribute groups $G', G''$ and a tolerance level $\epsilon$, an LLM use case $(\mathcal{M},\mathcal{P}_X)$ satisfies group fairness if

\[
\left|B(\mathcal{M}(X;\theta)\mid G')-B(\mathcal{M}(X;\theta)\mid G'')\right|\leq\epsilon,
\]

where $B$ is a statistical performance metric (e.g. false negative rate) applied to $\mathcal{M}$, conditioned on membership in a protected attribute group (Gallegos et al., 2023). Here, conditioning on $G$ implies calculating $B$ on the subset of input prompts that either contain a direct mention of group $G$ or, in the case of person-level prompt granularity, correspond to individuals belonging to group $G$. Note that the choice of $B$ will depend on context and stakeholder values.
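As a worked illustration, the sketch below takes $B$ to be the false negative rate and computes the absolute gap between two groups. It assumes binary labels (1 for the positive class), that ground-truth labels are available, and that each prompt has already been assigned to a group; the function names are hypothetical.

```python
def false_negative_rate(y_true, y_pred):
    """Share of actual positives (label 1) that were predicted negative (label 0)."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    if not positives:
        raise ValueError("false negative rate is undefined without positives")
    return sum(p == 0 for _, p in positives) / len(positives)

def group_gap(y_true, y_pred, groups, group_a, group_b,
              metric=false_negative_rate):
    """Absolute difference |B(. | G') - B(. | G'')| for a chosen metric B."""
    def subset(g):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        return [y_true[i] for i in idx], [y_pred[i] for i in idx]

    return abs(metric(*subset(group_a)) - metric(*subset(group_b)))
```

The use case satisfies group fairness on the evaluated sample when `group_gap(...)` is at most the chosen tolerance $\epsilon$. Other choices of $B$ can be supplied through the `metric` argument.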

2.3 Mapping Bias and Fairness Risks to LLM Use Cases

Section 2.1 establishes the characterization of a well-defined LLM use case based on a model and a known population of prompts. [Footnote 11: Note that the use case segmentation proposed in this work is by task, which can be controlled through various means such as fine-tuning of the LLM, providing examples with few-shot prompting, or incorporating instructions in system or user prompts.] To set up the decision framework, this section segments use cases by task, according to the following three categories: 1) text generation and summarization, 2) classification, and 3) recommendation. Descriptions and examples are provided in Table 1.

The bias and fairness evaluation framework proposed in this work is intended for large-scale applications, in which the volume of generated responses makes exhaustive human review impractical. It is important to note that, for scenarios in which the practitioner manually evaluates each generated output, the evaluations proposed here may be unnecessary if concerns related to bias and fairness can be effectively addressed by the individual who is reviewing the outputs.

Table 1: Taxonomy of LLM Use Cases and Associated Bias/Fairness Risks
| Use Case Category | Description | Examples | Bias/Fairness Risk |
| --- | --- | --- | --- |
| Text Generation and Summarization | LLM generates text outputs that are not constrained to a predefined set of classes or list elements | Create personalized outreach messages to individuals; Summarize clinical notes | Toxic text, stereotypes*, counterfactual fairness* |
| Classification | LLM classifies a text input among a predefined set of classes | Classify intent of customer support inquiries to assign assistance; Classify customer feedback as positive or negative to assign follow-ups | Allocational harms** |
| Recommendation | LLM generates lists of recommendations | Generate lists of recommended products; Generate lists of recommended news articles | Counterfactual fairness* |

*Risk is applicable if FTU is not satisfied. Counterfactual fairness may not be relevant in certain contexts.
**Risk is applicable if text inputs correspond to a protected attribute.

Text Generation and Summarization.

First, consider use cases where an LLM generates text outputs that are not constrained to a predefined set of classes (e.g., positive vs. negative) or list elements (e.g., products to recommend). For the sake of brevity, this group of use cases will be hereafter referred to as “text generation and summarization,” acknowledging that this category can encompass additional use cases including, but not limited to, machine translation, retrieval augmented generation (RAG), and question-answering. An example of a text generation use case could be utilizing an LLM to compose personalized messages for customer outreach. Similarly, a summarization use case may involve employing an LLM to extract pertinent information or provide summaries from clinical notes. These use cases carry the potential risk of generating toxic text in their outputs. Moreover, if these use cases fail to meet the criteria of FTU, meaning that the prompts include references to protected attributes, they also pose the risk of perpetuating stereotypes or exhibiting counterfactual unfairness.

Classification.

LLMs have been widely used for text classification (Sun et al., 2023; Widmann and Wich, 2022; Bonikowski et al., 2022; Howard and Ruder, 2018; Sun et al., 2019; Chai et al., 2020; Chen et al., 2020; Lin et al., 2021). In the context of bias and fairness, it is important to distinguish whether the text inputs can be mapped to a protected attribute, either by containing direct mentions of a protected attribute group, or in the case of person-level prompt granularity, corresponding to individuals belonging to certain protected attribute groups. For instance, one example of a person-level classification use case could involve utilizing an LLM to classify customer feedback as positive or negative in order to assign appropriate follow-ups. Similar to traditional person-level classification challenges in machine learning, these use cases present the risk of allocational harms. On the other hand, classification use cases that do not involve person-level data and satisfy FTU are not subject to these bias and fairness risks.

Recommendation.

Recommendation is another potential application of LLMs (Bao et al., 2023; Gao et al., 2023), such as using an LLM to recommend products to customers. Zhang et al. (2023) show that LLMs used as recommendation engines can discriminate when exposed to protected attribute information. Given this concern, it follows that LLM recommendation use cases pose the risk of counterfactual unfairness if they do not satisfy FTU.

3 Bias and Fairness Evaluation Metrics

This section introduces a range of evaluation metrics segmented by the use case task and its applicable bias and fairness risks. The proposed framework encompasses three distinct use case categories: 1) text generation and summarization, 2) classification, and 3) recommendation. For each category, bias and fairness evaluation metrics are included that address the applicable risks.

Accounting for prompt-specific risk is important to accurately reflect the risk of a specific use case, as Wang et al. (2024) find in their evaluations that toxicity probability is 26 to 101 times higher when dealing with toxic prompts compared to non-toxic prompts. Moreover, Goldfarb-Tarrant et al. (2020) and Delobelle et al. (2021) have demonstrated that evaluation metrics which take into account the LLM’s task provide a more accurate reflection of the associated risk compared to metrics based on embeddings or token probabilities. Accordingly, the objective of this work is to assess the risks associated with a particular use case, taking into consideration not only the LLM used but also the task at hand and the prompt population. Following the characterization of a use case based on a model and a known population of prompts, as established in Section 2.1, each metric definition presented below is contextualized within an evaluation sample of size $N$ drawn from a known population of prompts $\mathcal{P}_X$.

3.1 Bias Metrics for Text Generation and Summarization Use Cases

Gallegos et al. (2023) propose a detailed taxonomy of evaluation metrics for bias evaluations in LLMs. They partition these metrics into three main categories: embedding-based metrics, probability-based metrics, and generated-text metrics. While embedding- and probability-based metrics require access to an LLM’s upstream architecture, generated-text metrics instead treat an LLM like a black box and can be easily calculated from LLM output alone. Hence, given their ease of application, only generated-text metrics are included in the framework proposed here.

While Gallegos et al. (2023) further segment generated-text metrics into three subcategories (namely distribution metrics, classifier metrics, and lexicon metrics), this work segments these metrics according to the risk taxonomy outlined in Section 2.2. In particular, generated-text bias and fairness metrics are segmented into the following categories: toxicity metrics, stereotype metrics, and counterfactual fairness metrics. Toxicity metrics leverage a pre-trained toxicity classifier, such as Perspective API [Footnote 12: https://perspectiveapi.com], to assign a toxicity score to an LLM’s output (Chowdhery et al., 2022; Lees et al., 2022; Wang et al., 2024; Bommasani et al., 2023; Gehman et al., 2020). Stereotype metrics assess the relative co-occurrence of stereotype words with protected attribute words (Bordia and Bowman, 2019; Bommasani et al., 2023) or leverage a pre-trained stereotype classifier to assign a stereotype score to an LLM’s output (Zekun et al., 2023). Lastly, counterfactual fairness metrics assess fairness through a causal lens by exploiting variation in mentions of protected attributes in otherwise-identical prompts (Huang et al., 2020). Select metrics are detailed in each category below. [Footnote 13: Note that the discussion of generated-text bias metrics in this paper is not exhaustive. Rather, the intention is to select an appropriate set of evaluation metrics to address each risk to be assessed. Notable metrics omitted from this framework include HONEST (Nozza et al., 2021), score parity (Sicilia and Alikhani, 2023), Full Gen Bias (Smith et al., 2022), regard score (Sheng et al., 2019), demographic representations (Bommasani et al., 2023), psycholinguistic norms (Dhamala et al., 2021), and gender polarity (Dhamala et al., 2021).]

3.1.1 Toxicity Metrics

Toxicity metrics leverage a pre-trained toxicity classifier to assess general toxicity that is not specific to a particular protected attribute. Given a toxicity classifier $T:\mathcal{Y}\rightarrow[0,1]$, this section defines toxicity metrics in the context of an LLM use case $(\mathcal{M},\mathcal{P}_{X})$. Specifically, these metrics are evaluated on a set of $m$ responses per prompt, $\{\hat{Y}_{ij}\}_{j=1}^{m}$, generated by $\mathcal{M}$, from a sample of prompts $X_{1},...,X_{N}$ drawn from $\mathcal{P}_{X}$. In practice, variation in responses for the same prompt can be achieved by choosing a non-zero temperature for the LLM or by utilizing various random seeds (Wang et al., 2024). Below, formal definitions of three toxicity metrics are provided: Expected Maximum Toxicity, Toxicity Probability, and Toxicity Fraction. All three metrics range in values from 0 to 1, with smaller values indicating a greater degree of fairness.

Expected Maximum Toxicity (EMT) (Gehman et al., 2020).

EMT estimates the maximum predicted toxicity probability among the top $m$ generations:

$$EMT=\frac{1}{N}\sum_{i=1}^{N}\max_{1\leq j\leq m}T(\hat{Y}_{ij}).$$

Note that the standard choice of $m$ for this metric is $m=25$ (Gehman et al., 2020; Wang et al., 2024).

Toxicity Probability (TP) (Gehman et al., 2020).

TP is calculated as the empirical probability of having at least one toxic prediction ($T(\hat{Y})\geq 0.5$) among the top $m$ generations:

$$TP=\frac{1}{N}\sum_{i=1}^{N}I(\max_{1\leq j\leq m}T(\hat{Y}_{ij})\geq 0.5).$$

As with EMT, the standard choice of $m$ for this metric is $m=25$ (Gehman et al., 2020; Wang et al., 2024).

Toxicity Fraction (TF) (Bommasani et al., 2023).

TF is defined as the fraction of generations that are classified as toxic:

$$TF=\frac{1}{N}\frac{1}{m}\sum_{i=1}^{N}\sum_{j=1}^{m}I(T(\hat{Y}_{ij})\geq 0.5).$$

This metric effectively estimates the likelihood that responses generated by $\mathcal{M}$ on prompts from $\mathcal{P}_{X}$ contain toxic text (Bommasani et al., 2023). Note that while the standard choice of $m$ for this metric is $m=1$ (Bommasani et al., 2023), a larger value of $m$ may be preferred in practice if sampling a large $N$ is infeasible.
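To make the three definitions concrete, the following minimal Python sketch computes EMT, TP, and TF from a precomputed table of classifier scores. The function name `toxicity_metrics` and the nested-list input layout (`scores[i][j]` holding $T(\hat{Y}_{ij})$) are illustrative assumptions, not part of the framework itself; obtaining the scores from an actual toxicity classifier is left out.

```python
from typing import Dict, Sequence


def toxicity_metrics(
    scores: Sequence[Sequence[float]], threshold: float = 0.5
) -> Dict[str, float]:
    """Compute EMT, TP, and TF from scores[i][j] = T(Y_hat_ij).

    Assumes (for this sketch) that every prompt has the same number
    m >= 1 of scored responses, so that N * m responses exist in total.
    """
    n_prompts = len(scores)
    if n_prompts == 0 or any(len(row) == 0 for row in scores):
        raise ValueError("need at least one scored response per prompt")

    # Maximum toxicity score over the m responses of each prompt.
    max_per_prompt = [max(row) for row in scores]

    # EMT: average of the per-prompt maxima.
    emt = sum(max_per_prompt) / n_prompts
    # TP: share of prompts with at least one response at or above the threshold.
    tp = sum(s >= threshold for s in max_per_prompt) / n_prompts
    # TF: share of all responses at or above the threshold.
    n_responses = sum(len(row) for row in scores)
    tf = sum(s >= threshold for row in scores for s in row) / n_responses
    return {"EMT": emt, "TP": tp, "TF": tf}


# Hypothetical scores for N = 2 prompts with m = 2 responses each.
example_scores = [[0.25, 0.75], [0.125, 0.25]]
print(toxicity_metrics(example_scores))
```

For these hypothetical scores, the per-prompt maxima are 0.75 and 0.25, so EMT is 0.5, TP is 0.5 (one of two prompts has a toxic response), and TF is 0.25 (one of four responses is toxic).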

3.1.2 Stereotype Metrics

Stereotype metrics aim to identify harmful stereotypes specific to protected attributes that might be present in an LLM’s output. Because these metrics rely on mentions of protected attribute groups, these metrics may be unnecessary if FTU is satisfied for an LLM use case. (Note that stereotype risk, while low, may still exist even if FTU is satisfied for an LLM use case.) Among stereotype metrics, this work distinguishes between metrics based on co-occurrence of protected attribute words and stereotypical words, and metrics that leverage a stereotype classifier.

3.1.2.1 Co-occurrence-Based Metrics

This section outlines a set of metrics that assess stereotype risk based on relative co-occurrence of protected attribute words with neutral words of interest. In particular, formal definitions of two co-occurrence-based stereotype metrics are provided: Co-Occurrence Bias Score and Stereotypical Associations.

Co-Occurrence Bias Score (COBS) (Bordia and Bowman, 2019).

Given two protected attribute groups $G^{\prime},G^{\prime\prime}$ with associated sets of protected attribute words $A^{\prime},A^{\prime\prime}$, a set of stereotypical words $W$, and an LLM use case $(\mathcal{M},\mathcal{P}_{X})$, the full calculation of COBS is as follows:

$$P(w|A)=\frac{\sum_{i=1}^{N}cooccur(w,A|\hat{Y}_{i})/\sum_{i=1}^{N}\sum_{\tilde{w}\in\hat{Y}_{i}}cooccur(\tilde{w},A|\tilde{Y}_{i})}{\sum_{i=1}^{N}\sum_{a\in A}C(a,\hat{Y}_{i})/\sum_{i=1}^{N}\sum_{\tilde{w}\in\tilde{Y}_{i}}C(\tilde{w},\hat{Y}_{i})}$$
$$COBS=\frac{1}{|W|}\sum_{w\in W}\log\frac{P(w|A^{\prime})}{P(w|A^{\prime\prime})}.$$

Above, $C(x,\hat{Y}_{i})$ denotes the count of $x$ in $\hat{Y}_{i}$ and $\tilde{Y}_{i}$ represents the LLM output $\hat{Y}_{i}$, with words from $A^{\prime}\cup A^{\prime\prime}$ and stop words excluded. The co-occurrence function $cooccur(w,A|\hat{Y})$ computes a weighted count of words from $A$ that are found within a context window centered around $w$, each time $w$ appears in $\hat{Y}$. To specify the co-occurrence function, Bordia and Bowman (2019) define a fixed or infinite context window centered at a target word $w$. In the case of a fixed context window (Bordia and Bowman (2019) recommend including 10 words before and 10 words after), each word in the context window receives equal weight. In the case of an infinite context window, each word receives weight $\beta^{dist}$, where $dist$ denotes the number of tokens between the two tokens of interest (Bordia and Bowman (2019) use $\beta=0.95$). In both cases, only words from the context window included in $A$ are included in the weighted count.

In words, COBS computes the relative likelihood that an LLM $\mathcal{M}$ generates output having co-occurrence of $w\in W$ with $A^{\prime}$ versus $A^{\prime\prime}$. This metric has a range of possible values of $[-\infty,\infty]$, with values closer to 0 signifying a greater degree of fairness.
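The COBS calculation can be sketched in Python as follows, using the fixed context window variant with equal weights. This is a simplified illustration under several assumptions: responses are pre-tokenized lists of lowercase words, the helper names (`cooccur`, `conditional_prob`, `cobs`) are hypothetical, and a `ValueError` is raised whenever a count is zero, since the handling of zero probabilities is not specified above.

```python
import math
from typing import List, Sequence, Set


def cooccur(word: str, attr_words: Set[str], tokens: Sequence[str],
            window: int = 10) -> int:
    """Count attribute words within `window` tokens of each occurrence of `word`."""
    count = 0
    for pos, token in enumerate(tokens):
        if token != word:
            continue
        context = list(tokens[max(0, pos - window):pos])
        context += list(tokens[pos + 1:pos + 1 + window])
        count += sum(1 for c in context if c in attr_words)
    return count


def conditional_prob(word: str, attr_words: Set[str],
                     responses: List[List[str]], excluded: Set[str],
                     window: int = 10) -> float:
    """Estimate P(w|A) as a ratio of normalized co-occurrence to normalized count."""
    cooccur_word = cooccur_all = attr_count = word_count = 0
    for tokens in responses:
        # Y_tilde: the response without attribute words and stop words.
        kept = [t for t in tokens if t not in excluded]
        cooccur_word += cooccur(word, attr_words, tokens, window)
        cooccur_all += sum(cooccur(t, attr_words, tokens, window) for t in set(kept))
        attr_count += sum(1 for t in tokens if t in attr_words)
        word_count += len(kept)
    if 0 in (cooccur_word, cooccur_all, attr_count, word_count):
        raise ValueError("P(w|A) is zero or undefined for this sample")
    return (cooccur_word / cooccur_all) / (attr_count / word_count)


def cobs(responses: List[List[str]], attrs_a: Set[str], attrs_b: Set[str],
         stereotype_words: Set[str], stop_words: Set[str], window: int = 10) -> float:
    """Average log-ratio of P(w|A') to P(w|A'') over the stereotype words."""
    excluded = attrs_a | attrs_b | stop_words
    logs = [
        math.log(conditional_prob(w, attrs_a, responses, excluded, window)
                 / conditional_prob(w, attrs_b, responses, excluded, window))
        for w in sorted(stereotype_words)
    ]
    return sum(logs) / len(logs)


# Hypothetical toy sample: "doctor" co-occurs with "he" twice and "she" once.
toy = [s.split() for s in ["he is a doctor", "he is a doctor",
                           "she is a doctor", "she is a nurse"]]
print(cobs(toy, {"he"}, {"she"}, {"doctor"}, {"is", "a"}))
```

In this toy sample, $P(\text{doctor}|A^{\prime})=2$ and $P(\text{doctor}|A^{\prime\prime})=1$, so the score equals $\log 2$, reflecting a stronger association of "doctor" with the first group.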

Stereotypical Associations (SA) (Bommasani et al., 2023).

Consider a set of protected attribute groups $\mathcal{G}$, an associated set of protected attribute lexicons $\mathcal{A}$, and an associated set of stereotypical words $W$. Additionally, let $C(x,\hat{Y})$ denote the number of times that the word $x$ appears in the output $\hat{Y}$, $I(\cdot)$ denote the indicator function, $P^{\text{ref}}$ denote a reference distribution, and $TVD$ denote total variation distance. (The reference distribution recommended by Bommasani et al. (2023) is the uniform distribution. Total variation distance measures the distance between probability distributions.) For a given LLM $\mathcal{M}(X;\theta)$ and a sample of prompts $X_{1},...,X_{N}$ drawn from $\mathcal{P}_{X}$, the full computation of SA is as follows:

$$\gamma(w|A^{\prime})=\sum_{a\in A^{\prime}}\sum_{i=1}^{N}C(a,\hat{Y}_{i})I(C(w,\hat{Y}_{i})>0)$$
$$\pi(w|A^{\prime})=\frac{\gamma(w|A^{\prime})}{\sum_{A\in\mathcal{A}}\gamma(w|A)}$$
$$P^{(w)}=\{\pi(w|A^{\prime}):A^{\prime}\in\mathcal{A}\}$$
$$SA=\frac{1}{|W|}\sum_{w\in W}TVD(P^{(w)},P^{\text{ref}}).$$

In words, SA measures the relative co-occurrence of a set of stereotypically associated words across protected attribute groups. SA ranges in value from 0 to 1, where smaller values indicate greater fairness.
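The SA computation can be illustrated with the short Python sketch below. It assumes pre-tokenized lowercase responses, a uniform reference distribution, and total variation distance computed as half the $L_1$ distance between the two discrete distributions. The function name and the choice to skip stereotype words that never co-occur with any lexicon (for which $\pi$ is undefined) are assumptions of this sketch, not prescriptions of the metric.

```python
from typing import Dict, List, Set


def stereotypical_associations(
    responses: List[List[str]],
    lexicons: Dict[str, Set[str]],
    stereotype_words: Set[str],
) -> float:
    """Average TVD between P^(w) and the uniform distribution over groups."""
    n_groups = len(lexicons)
    tvds = []
    for w in sorted(stereotype_words):
        # gamma(w|A'): attribute-word counts in responses that contain w.
        gamma = {
            group: sum(
                sum(1 for t in tokens if t in lexicon)
                for tokens in responses
                if w in tokens
            )
            for group, lexicon in lexicons.items()
        }
        total = sum(gamma.values())
        if total == 0:
            # Assumption: pi(w|A') is undefined here, so the word is skipped.
            continue
        # Total variation distance to the uniform reference distribution.
        tvd = 0.5 * sum(abs(gamma[g] / total - 1 / n_groups) for g in lexicons)
        tvds.append(tvd)
    if not tvds:
        raise ValueError("no stereotype word co-occurs with any lexicon")
    return sum(tvds) / len(tvds)


# Hypothetical toy sample with two groups and one stereotype word.
lex = {"male": {"he", "him"}, "female": {"she", "her"}}
balanced = [s.split() for s in ["he is a doctor", "she is a doctor"]]
print(stereotypical_associations(balanced, lex, {"doctor"}))
```

In the balanced toy sample, "doctor" co-occurs equally often with both lexicons, so $P^{(w)}$ matches the uniform reference and the score is 0.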

3.1.2.2 Metrics Leveraging a Stereotype Classifier

It has been shown that stereotype classifiers can be an effective tool for assessing stereotype risk in LLM use cases (Zekun et al., 2023). (Zekun et al. (2023) refer to https://huggingface.co/Narrativa/distilroberta-finetuned-stereotype-detection.) This section introduces three new metrics by extending the toxicity metrics outlined in Section 3.1.1, leveraging a pre-trained stereotype classifier $St:\mathcal{Y}\rightarrow[0,1]$ rather than a toxicity classifier. Namely, these metrics include: Expected Maximum Stereotype, Stereotype Probability, and Stereotype Fraction. All three metrics range in values from 0 to 1, with smaller values indicating a greater degree of fairness.

Expected Maximum Stereotype (EMS).

EMS, analogous to EMT, estimates the maximum predicted stereotype probability among the top $m$ generations:

$$EMS=\frac{1}{N}\sum_{i=1}^{N}\max_{1\leq j\leq m}St(\hat{Y}_{ij}).$$

Following the convention of EMT, practitioners may wish to use $m=25$ for this metric.

Stereotype Probability (SP).

Analogous to TP, SP is calculated as the empirical probability of having at least one stereotype prediction ($St(\hat{Y})\geq 0.5$) among the top $m$ generations:

$$SP=\frac{1}{N}\sum_{i=1}^{N}I(\max_{1\leq j\leq m}St(\hat{Y}_{ij})\geq 0.5).$$

To be consistent with the convention of TP, practitioners may wish to use $m=25$ for this metric.

Stereotype Fraction (SF).

SF, presented as an extension of TF, measures the fraction of generations that are predicted to contain a stereotype:

$$SF=\frac{1}{N}\frac{1}{m}\sum_{i=1}^{N}\sum_{j=1}^{m}I(St(\hat{Y}_{ij})\geq 0.5),$$

effectively estimating the likelihood that responses generated by $\mathcal{M}$ on prompts from $\mathcal{P}_{X}$ contain stereotypes. Note that while the standard choice of $m$ for the analogous toxicity metric, TF, is $m=1$ (Bommasani et al., 2023), a larger value of $m$ may be preferred in practice if sampling a large $N$ is infeasible.

3.1.3 Counterfactual Fairness Metrics

Counterfactual metrics aim to assess differences in LLM output when different protected attributes are mentioned in input prompts, all else equal. Given two protected attribute groups $G^{\prime},G^{\prime\prime}$, these metrics are defined in the context of an LLM use case $(\mathcal{M},\mathcal{P}_{X})$. In particular, these metrics are evaluated on a sample of counterfactual response pairs $(\hat{Y}_{1}^{\prime},\hat{Y}_{1}^{\prime\prime}),...,(\hat{Y}_{N}^{\prime},\hat{Y}_{N}^{\prime\prime})$ generated by $\mathcal{M}$, from a sample of counterfactual input pairs $(X_{1}^{\prime},X_{1}^{\prime\prime}),...,(X_{N}^{\prime},X_{N}^{\prime\prime})$ drawn from $\mathcal{P}_{X|\mathcal{A}}$. In practice, counterfactual substitution can be achieved by leveraging a mapping of one protected attribute group lexicon to another. For instance, a female-to-male lexicon mapping could include substitutions such as 'she': 'he', 'hers': 'his', 'her': 'him', 'herself': 'himself', 'female': 'male', 'females': 'males', 'woman': 'man', 'women': 'men', 'girl': 'boy', 'girls': 'boys', 'daughter': 'son', 'daughters': 'sons', 'mother': 'father', 'mothers': 'fathers', 'sister': 'brother', 'sisters': 'brothers', 'aunt': 'uncle', 'aunts': 'uncles', 'niece': 'nephew', 'nieces': 'nephews', 'lady': 'gentleman', 'ladies': 'gentlemen', 'grandmother': 'grandfather', 'grandmothers': 'grandfathers'. It is important to note that the mapping does not need to be exhaustive, as long as it has sufficient coverage to generate a large sample of counterfactual input pairs. Note that, in scenarios where a large $N$ is infeasible, practitioners may opt to generate multiple response pairs per counterfactual input pair.
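As a rough illustration of lexicon-based counterfactual substitution, the Python sketch below builds counterfactual input pairs from a list of prompts. The mapping shown is only a subset of the example substitutions, and the function names `substitute` and `counterfactual_pairs` are hypothetical. Whole-word, case-insensitive matching is one possible design choice; a word-level mapping of this kind cannot resolve grammatical ambiguity (for example, possessive versus object uses of 'her'), so generated pairs may warrant spot checks.

```python
import re
from typing import Dict, List, Tuple

# Subset of a female-to-male lexicon mapping, for illustration only.
FEMALE_TO_MALE: Dict[str, str] = {
    "she": "he", "hers": "his", "her": "him", "herself": "himself",
    "woman": "man", "women": "men", "girl": "boy", "girls": "boys",
    "mother": "father", "sister": "brother", "daughter": "son",
}


def substitute(prompt: str, mapping: Dict[str, str]) -> str:
    """Replace whole-word matches of mapping keys, keeping initial capitalization."""
    words = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE
    )

    def _replace(match: "re.Match[str]") -> str:
        original = match.group(0)
        replacement = mapping[original.lower()]
        return replacement.capitalize() if original[0].isupper() else replacement

    return pattern.sub(_replace, prompt)


def counterfactual_pairs(
    prompts: List[str], mapping: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Keep only prompts that mention the source group, paired with their counterfactual."""
    pairs = []
    for prompt in prompts:
        counterfactual = substitute(prompt, mapping)
        if counterfactual != prompt:
            pairs.append((prompt, counterfactual))
    return pairs


prompts = ["The woman said she was tired.", "The weather is nice."]
print(counterfactual_pairs(prompts, FEMALE_TO_MALE))
```

Prompts without any word from the source lexicon, such as the second example prompt, produce no pair and are dropped from the counterfactual sample.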

These metrics, which are categorized into counterfactual similarity metrics and counterfactual sentiment metrics, respectively quantify the differences in text similarity and sentiment by leveraging the variations in LLM output observed across counterfactual input pairs. Due to their reliance on mentions of protected attributes in input prompts, if FTU is satisfied for an LLM use case, these metrics need not be used.

3.1.3.1 Counterfactual Similarity

Counterfactual similarity metrics measure the similarity in outputs generated from counterfactual input pairs according to a specified invariance metric $\upsilon$, i.e. $\upsilon(\mathcal{M}(X^{\prime};\theta),\mathcal{M}(X^{\prime\prime};\theta))$. These metrics effectively assess whether the LLM use case satisfies the counterfactual invariance property defined in Section 2.2. One such example of $\upsilon$ is exact match (Rajpurkar et al., 2016), but Gallegos et al. (2023) argue that this metric is too strict. This work introduces three less stringent counterfactual similarity metrics: Counterfactual ROUGE-L, Counterfactual BLEU, and Counterfactual Cosine Similarity, which are extensions of state-of-the-art text similarity metrics (Minaee et al., 2024; Lin, 2004; Papineni et al., 2002; Singhal and Google, 2001; Gomaa and Fahmy, 2013). While the first two assess similarity using token-sequence overlap, the third assesses similarity using sentence embeddings. All three metrics range in values from 0 to 1, with larger values indicating a greater degree of fairness.

Counterfactual ROUGE-L (CROUGE-L).

This work introduces CROUGE-L, defined as the average ROUGE-L score (Lin, 2004) over counterfactually generated output pairs. The full calculation of CROUGE-L is as follows:

$$r_{i}^{\prime}=\frac{LCS(\hat{Y}_{i}^{\prime},\hat{Y}_{i}^{\prime\prime})}{len(\hat{Y}_{i}^{\prime})}$$
$$r_{i}^{\prime\prime}=\frac{LCS(\hat{Y}_{i}^{\prime\prime},\hat{Y}_{i}^{\prime})}{len(\hat{Y}_{i}^{\prime\prime})}$$
$$CROUGE\text{-}L=\frac{1}{N}\sum_{i=1}^{N}\frac{2r_{i}^{\prime}r_{i}^{\prime\prime}}{r_{i}^{\prime}+r_{i}^{\prime\prime}},$$

where $LCS(\cdot,\cdot)$ denotes the longest common subsequence of tokens between two LLM outputs, and $len(\hat{Y})$ denotes the number of tokens in an LLM output. The CROUGE-L metric effectively uses ROUGE-L to assess similarity as the longest common subsequence (LCS) relative to generated text length.

Given its reliance on matching token sequences, practitioners should mask protected attribute words in counterfactual output pairs before computing CROUGE-L. For instance, suppose, for the counterfactual input pair $(X^{\prime},X^{\prime\prime})=(\text{`What did he do next', `What did she do next'})$, an LLM generates the output pair $(\hat{Y}^{\prime},\hat{Y}^{\prime\prime})=(\text{`then he drove his car to work', `then she drove her car to work'})$. In this context, these two responses are effectively identical. Masking the tokens $\{\text{`he', `she', `his', `her'}\}$ accomplishes this computationally.
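The effect of masking on CROUGE-L can be checked with a small Python sketch. Whitespace tokenization, the `[MASK]` placeholder, and the function names are assumptions made for illustration; a pair with no common subsequence is assigned a score of 0 here to avoid division by zero.

```python
from typing import List, Sequence, Set, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def mask(tokens: List[str], attr_words: Set[str], mask_token: str = "[MASK]") -> List[str]:
    """Replace protected attribute words with a shared placeholder token."""
    return [mask_token if t.lower() in attr_words else t for t in tokens]


def crouge_l(pairs: List[Tuple[str, str]], attr_words: Set[str]) -> float:
    """Average harmonic mean of r' and r'' over counterfactual output pairs."""
    scores = []
    for y1, y2 in pairs:
        t1 = mask(y1.split(), attr_words)
        t2 = mask(y2.split(), attr_words)
        lcs = lcs_length(t1, t2)
        if lcs == 0:
            scores.append(0.0)  # assumption: no overlap scores 0
            continue
        r1, r2 = lcs / len(t1), lcs / len(t2)
        scores.append(2 * r1 * r2 / (r1 + r2))
    return sum(scores) / len(scores)


pair = [("then he drove his car to work", "then she drove her car to work")]
print(crouge_l(pair, {"he", "she", "his", "her"}))  # with masking
print(crouge_l(pair, set()))                        # without masking
```

With masking, the two outputs become identical token sequences and the score is 1. Without masking, only five of the seven tokens match in order, so the score drops to 5/7 even though the responses differ only in the protected attribute words.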

Counterfactual BLEU (CBLEU).

This work introduces CBLEU, defined as the average BLEU score (Papineni et al., 2002) over counterfactually generated output pairs. The full calculation of CBLEU is as follows:

$$precision_{b}(\hat{Y}_{i}^{\prime},\hat{Y}_{i}^{\prime\prime})=\frac{\sum_{snt\in\hat{Y}_{i}^{\prime}}\sum_{b\text{-}gram\in snt}\min(C(b\text{-}gram,\hat{Y}_{i}^{\prime}|\hat{Y}_{i}^{\prime\prime}),C(b\text{-}gram,\hat{Y}_{i}^{\prime\prime}))}{\sum_{\tilde{snt}\in\hat{Y}_{i}^{\prime}}\sum_{b\text{-}gram\in\tilde{snt}}C(b\text{-}gram,\hat{Y}_{i}^{\prime})}$$
$$BLEU(\hat{Y}_{i}^{\prime},\hat{Y}_{i}^{\prime\prime})=\min(1,\exp\{1-\frac{len(\hat{Y}_{i}^{\prime\prime})}{len(\hat{Y}_{i}^{\prime})}\})(\prod_{b=1}^{4}precision_{b}(\hat{Y}_{i}^{\prime},\hat{Y}_{i}^{\prime\prime}))^{1/4}$$
$$CBLEU=\frac{1}{N}\sum_{i=1}^{N}\min(BLEU(\hat{Y}_{i}^{\prime},\hat{Y}_{i}^{\prime\prime}),BLEU(\hat{Y}_{i}^{\prime\prime},\hat{Y}_{i}^{\prime})),$$

where $snt$ denotes a sentence in an LLM output, $len(\hat{Y})$ denotes the number of tokens in an LLM output, $C(b\text{-}gram,\hat{Y}_{i}^{\prime})$ denotes the number of times $b\text{-}gram$ appears in $\hat{Y}_{i}^{\prime}$, and $C(b\text{-}gram,\hat{Y}_{i}^{\prime}|\hat{Y}_{i}^{\prime\prime})$ denotes the number of times $b\text{-}gram$ appears in $\hat{Y}_{i}^{\prime}$ given that it also appears in $\hat{Y}_{i}^{\prime\prime}$ (Papineni et al., 2002; goo, ). For symmetry, the minimum of these two BLEU scores for each counterfactual pair is obtained before averaging.
For the same reasons as with CROUGE-L, practitioners should mask protected attribute words in counterfactual output pairs before computing CBLEU.
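The CBLEU computation can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not part of the definition above: outputs are tokenized on whitespace, each output is treated as a single sentence, n-gram matches are clipped as in standard BLEU, and protected attribute words are assumed to have been masked already. The function names `ngrams`, `bleu`, and `cbleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on whitespace tokens (assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: credit an n-gram at most as often as the reference has it.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # the geometric mean is zero if any precision is zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: min(1, exp(1 - len(reference) / len(candidate))).
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


def cbleu(pairs):
    """Average, over counterfactual pairs, of the smaller of the two BLEU directions."""
    return sum(min(bleu(a, b), bleu(b, a)) for a, b in pairs) / len(pairs)
```

Short outputs with no shared 4-gram score zero under this sketch, so smoothing may be worth considering for brief responses.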

Counterfactual Cosine Similarity (CCS).

Given a sentence transformer $\mathbf{V}:\mathcal{Y}\xrightarrow{}\mathbb{R}^{d}$, this work defines CCS as:

$$CCS=\frac{1}{N}\sum_{i=1}^{N}\frac{\mathbf{V}(Y_{i}^{\prime})\cdot\mathbf{V}(Y_{i}^{\prime\prime})}{\lVert\mathbf{V}(Y_{i}^{\prime})\rVert\lVert\mathbf{V}(Y_{i}^{\prime\prime})\rVert},$$

i.e. the average cosine similarity (Singhal and Google, 2001) between counterfactually generated output pairs for an LLM use case.
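As a minimal sketch of CCS, the snippet below assumes the sentence transformer has already been applied, so each counterfactual pair is given as two embedding vectors (plain lists of floats). The names `cosine_similarity` and `ccs` are hypothetical, and the choice of embedding model is left open.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ccs(embedding_pairs):
    """Average cosine similarity over counterfactual embedding pairs."""
    sims = [cosine_similarity(u, v) for u, v in embedding_pairs]
    return sum(sims) / len(sims)


# Toy 2-d embeddings: one identical pair and one orthogonal pair.
example = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
score = ccs(example)
```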

3.1.3.2 Counterfactual Sentiment Bias

Counterfactual sentiment metrics measure the sentiment consistency across counterfactually generated pairs of output. To achieve this, these metrics leverage a pre-trained sentiment classifier $Sm:\mathcal{Y}\xrightarrow[]{}[0,1]$. This section outlines two counterfactual sentiment metrics: Strict Counterfactual Sentiment Parity, proposed by Huang et al. (2020), and an extension of this metric introduced in this work called Weak Counterfactual Sentiment Parity.\footnote{Huang et al. (2020) use Google Cloud sentiment API and a BERT-based sentiment classifier.} Both metrics have a range of values of $[0,1]$, with smaller values indicating a higher degree of fairness.

Strict Counterfactual Sentiment Parity (SCSP) (Huang et al., 2020).

SCSP calculates the Wasserstein-1 distance (Jiang et al., 2019) between the output distributions of a sentiment classifier applied to counterfactually generated LLM outputs:

$$\mathcal{W}_{1}=\mathbb{E}_{\tau\sim\mathcal{U}(0,1)}|P(Sm(\hat{Y}^{\prime})>\tau)-P(Sm(\hat{Y}^{\prime\prime})>\tau)|,$$

where $\mathcal{U}(0,1)$ denotes the uniform distribution. Above, $\mathbb{E}_{\tau\sim\mathcal{U}(0,1)}$ is calculated empirically on a sample of counterfactual response pairs $(\hat{Y}_{1}^{\prime},\hat{Y}_{1}^{\prime\prime}),...,(\hat{Y}_{N}^{\prime},\hat{Y}_{N}^{\prime\prime})$ generated by $\mathcal{M}$, from a sample of counterfactual input pairs $(X_{1}^{\prime},X_{1}^{\prime\prime}),...,(X_{N}^{\prime},X_{N}^{\prime\prime})$ drawn from $\mathcal{P}_{X|\mathcal{A}}$.
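One way to carry out the empirical calculation is sketched below. It assumes the sentiment scores for the two groups of outputs are already available as equal-length lists of values in $[0,1]$. For two empirical distributions of equal size, the expectation over $\tau$ equals the mean absolute difference between the sorted scores, which is what the hypothetical function `scsp` computes.

```python
def scsp(scores_a, scores_b):
    """Wasserstein-1 distance between two equal-size samples of sentiment scores.

    Assumes both lists hold classifier scores in [0, 1], one per output.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("expected one score per output in each counterfactual pair")
    # Compare the i-th smallest score of one sample with the i-th smallest of the other.
    sorted_a, sorted_b = sorted(scores_a), sorted(scores_b)
    return sum(abs(x - y) for x, y in zip(sorted_a, sorted_b)) / len(sorted_a)
```

Because the scores are sorted before comparison, the result depends only on the two score distributions, not on which outputs are paired with each other.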

Weak Counterfactual Sentiment Parity (WCSP).

This study presents WCSP, defined as the discrepancy in predicted sentiment rates by a sentiment classifier when applied to counterfactually generated LLM outputs, given a threshold $\tau$ for binarizing sentiment scores:

$$WCSP=\frac{1}{N}\sum_{i=1}^{N}|P(Sm(\hat{Y}_{i}^{\prime})>\tau)-P(Sm(\hat{Y}_{i}^{\prime\prime})>\tau)|.$$

In practice, practitioners may select an appropriate value of $\tau$ depending on stakeholder values and the sentiment classifier being used.
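The sketch below illustrates one reading of WCSP, in which $P(\cdot)$ is taken to be the empirical rate over the sample, so the metric reduces to the gap between the shares of outputs whose score exceeds $\tau$ in each group. This reading follows the prose description of "predicted sentiment rates" and is an assumption. The function name `wcsp` and the default threshold of 0.5 are hypothetical.

```python
def wcsp(scores_a, scores_b, tau=0.5):
    """Absolute gap in the share of outputs whose sentiment score exceeds tau.

    Assumes scores are classifier outputs in [0, 1]; tau=0.5 is an illustrative default.
    """
    rate_a = sum(1 for s in scores_a if s > tau) / len(scores_a)
    rate_b = sum(1 for s in scores_b if s > tau) / len(scores_b)
    return abs(rate_a - rate_b)


# Two of four outputs exceed tau for the first group, one of four for the second.
gap = wcsp([0.9, 0.8, 0.3, 0.1], [0.9, 0.4, 0.3, 0.1])
```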

3.2 Fairness Metrics for Classification Use Cases

It is well-established that classification models can produce unfair outcomes for certain protected attribute groups (Saleiro et al., 2018; Bellamy et al., 2018; Weerts et al., 2023; Feldman et al., 2014; Hardt et al., 2016; Mehrabi et al., 2019). Let a classification LLM use case be defined as an LLM tasked with classification, denoted as $\mathcal{M}^{(c)}$, and a population of prompts $\mathcal{P}_{X}$. Note that the metrics introduced in this section are confined to binary classification use cases, where $\mathcal{M}^{(c)}:\mathcal{X}\xrightarrow{}\{0,1\}$, with the understanding that evaluating fairness for multiclass classification is a straightforward extension from the binary case (Rouzot et al., 2023).

For the remainder of this section, assume each prompt in $\mathcal{P}_{X}$ for a given classification LLM use case corresponds to a protected attribute group. Under this assumption, traditional machine learning fairness metrics (Bellamy et al., 2018; Saleiro et al., 2018; Weerts et al., 2023) may be applied (Czarnowska et al., 2021). Accordingly, these metrics are defined on binary predictions $\hat{Y}_{1},...,\hat{Y}_{N}$, generated from a sample of prompts $X_{1},...,X_{N}\in\mathcal{P}_{X}$, with some metrics also incorporating corresponding ground truth values $Y_{1},...,Y_{N}$. These metrics effectively assess group fairness (see Section 2.2), with choice of statistical outcome measure $B$ depending on stakeholder values (e.g. the relative cost of false negatives vs. false positives).

This section distinguishes between representation fairness metrics, calculated using only predictions, and error-based fairness metrics, calculated using both predictions and ground truth values. The formulation of each fairness metric involves calculating the absolute difference between a pair of group-level metrics. This calculation yields a range of values between 0 and 1, where smaller values signify a higher level of fairness.\footnote{Note that while this work considers group fairness in terms of differences between protected attribute groups, as in Gallegos et al. (2023); Bellamy et al. (2018), analogous metrics can be calculated as ratios (Saleiro et al., 2018).}

3.2.1 Representation Fairness Metrics for Binary Classification

Representation fairness metrics aim to determine whether protected attribute groups are adequately represented in the positive predictions generated by a classifier. It is recommended that practitioners reserve this set of metrics for classification LLM use cases for which group-level predicted prevalence rates, i.e. the proportion of predictions belonging to the positive class, should be approximately equal.\footnote{For example, when predicting which applicants are qualified for a job it may be desirable to target equal rates of positive predictions for males and females. However, when predicting certain diseases, this may not be a desirable model behavior.} A single representation fairness metric is defined below: Demographic Parity.

Demographic Parity (DP) (Dwork et al., 2011).

DP calculates the absolute difference in group-level predicted prevalence rates:

$$DP=|P(\hat{Y}=1|G=G^{\prime})-P(\hat{Y}=1|G=G^{\prime\prime})|,$$

where $\hat{Y}$ denotes a model prediction and $P(\cdot)$ denotes the empirical probability based on predictions generated from a sample of prompts drawn from $\mathcal{P}_{X}$.
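A minimal sketch of the DP calculation follows. It assumes binary predictions and a parallel list of group labels, one per prompt. The function name `demographic_parity` and the labels `"a"` and `"b"` are hypothetical.

```python
def demographic_parity(preds, groups, group_a, group_b):
    """Absolute difference in positive-prediction rates between two groups.

    preds: binary predictions (0 or 1); groups: group label for each prediction.
    """
    preds_a = [p for p, g in zip(preds, groups) if g == group_a]
    preds_b = [p for p, g in zip(preds, groups) if g == group_b]
    rate_a = sum(preds_a) / len(preds_a)
    rate_b = sum(preds_b) / len(preds_b)
    return abs(rate_a - rate_b)


# Group "a" receives positive predictions 3 times in 4, group "b" once in 4.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
dp = demographic_parity(preds, groups, "a", "b")
```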

3.2.2 Error-Based Fairness Metrics for Binary Classification

Error-based fairness metrics aim to determine whether disparities in model performance exist across protected attribute groups. For error-based fairness, two metrics focused on false negatives, False Negative Rate Difference and False Omission Rate Difference, and two metrics focused on false positives, False Positive Rate Difference and False Discovery Rate Difference, are introduced. Following Saleiro et al. (2018), it is recommended that the choice between these two types of error-based fairness metrics be informed by whether the interventions assigned by the model are assistive (meaning false negatives are undesirable) or punitive (meaning false positives are undesirable) in nature.

False Negative Rate Difference (FNRD) (Bellamy et al., 2018).

FNRD measures the absolute difference in group-level false negative rates:

$$FNRD=|P(\hat{Y}=0|Y=1,G=G^{\prime})-P(\hat{Y}=0|Y=1,G=G^{\prime\prime})|,$$

where $Y$ denotes the ground truth value corresponding to $\hat{Y}$ and $P(\cdot)$ denotes the empirical probability based on predictions generated from a sample of prompts drawn from $\mathcal{P}_{X}$. Note that the false negative rate measures the proportion of actual positives $(Y=1)$ that are falsely classified as negative $(\hat{Y}=0)$. FNRD is equivalent to the equal opportunity difference metric proposed by Hardt et al. (2016).

False Omission Rate Difference (FORD) (Bellamy et al., 2018).

FORD measures the absolute difference in group-level false omission rates:

$$FORD=|P(Y=1|\hat{Y}=0,G=G^{\prime})-P(Y=1|\hat{Y}=0,G=G^{\prime\prime})|,$$

where $Y$ denotes the ground truth value corresponding to $\hat{Y}$ and $P(\cdot)$ denotes the empirical probability based on predictions generated from a sample of prompts drawn from $\mathcal{P}_{X}$. Instead of concentrating on actual positives, the false omission rate calculates the percentage of predicted negatives $(\hat{Y}=0)$ that are misclassified. Thus, similar to the FNRD, a higher FORD indicates a greater difference in the likelihood of false negatives across groups.

False Positive Rate Difference (FPRD) (Bellamy et al., 2018).

FPRD measures the absolute difference in group-level false positive rates:

$$FPRD=|P(\hat{Y}=1|Y=0,G=G^{\prime})-P(\hat{Y}=1|Y=0,G=G^{\prime\prime})|,$$

where $Y$ denotes the ground truth value corresponding to $\hat{Y}$ and $P(\cdot)$ denotes the empirical probability based on predictions generated from a sample of prompts drawn from $\mathcal{P}_{X}$. Note that the false positive rate measures the percentage of actual negatives $(Y=0)$ being incorrectly predicted as positive $(\hat{Y}=1)$.

False Discovery Rate Difference (FDRD) (Bellamy et al., 2018).

FDRD measures the absolute difference in group-level false discovery rates:

$$FDRD=|P(Y=0|\hat{Y}=1,G=G^{\prime})-P(Y=0|\hat{Y}=1,G=G^{\prime\prime})|,$$

where $Y$ denotes the ground truth value corresponding to $\hat{Y}$ and $P(\cdot)$ denotes the empirical probability based on predictions generated from a sample of prompts drawn from $\mathcal{P}_{X}$. Rather than considering actual negatives, the false discovery rate calculates the proportion of predicted positives $(\hat{Y}=1)$ that are incorrectly classified. Hence, as with FPRD, a higher FDRD indicates a larger disparity in the likelihood of false positives across groups.
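The four error-based metrics share the same structure, so they can be sketched together from per-group confusion counts. The snippet assumes binary labels and predictions with a parallel list of group labels. The names `error_rates` and `error_rate_differences` are hypothetical, and returning NaN when a denominator is empty is a design choice of this sketch rather than part of the definitions.

```python
def _safe_div(num, den):
    """Return num / den, or NaN when the conditioning event never occurs."""
    return num / den if den else float("nan")


def error_rates(y_true, y_pred):
    """Group-level error rates from binary ground truth and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "FNR": _safe_div(fn, fn + tp),  # P(pred = 0 | true = 1)
        "FOR": _safe_div(fn, fn + tn),  # P(true = 1 | pred = 0)
        "FPR": _safe_div(fp, fp + tn),  # P(pred = 1 | true = 0)
        "FDR": _safe_div(fp, fp + tp),  # P(true = 0 | pred = 1)
    }


def error_rate_differences(y_true, y_pred, groups, group_a, group_b):
    """Absolute between-group differences: FNRD, FORD, FPRD, FDRD."""
    def subset(group):
        rows = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
        return [t for t, _ in rows], [p for _, p in rows]

    rates_a = error_rates(*subset(group_a))
    rates_b = error_rates(*subset(group_b))
    return {name + "D": abs(rates_a[name] - rates_b[name]) for name in rates_a}
```

For an assistive use case one would typically read off `FNRD` and `FORD` from the returned dictionary, and for a punitive one `FPRD` and `FDRD`.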

3.2.3 Multiclass Fairness Metrics

For multiclass classifiers, the framework proposed here follows the fairness guidelines provided by Rouzot et al. (2023) and hence recommends conducting class-wise fairness assessments using the appropriate binary classification fairness metrics, as per Sections 3.2.1 and 3.2.2, on each of the ‘sensitive’ classes. In particular, Rouzot et al. (2023) characterize sensitive classes as outcomes having significant impact on the lives of individuals to whom the model is applied.
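A class-wise assessment can be sketched by binarizing the multiclass predictions one sensitive class at a time and then applying a binary metric. The example below uses the positive-rate gap (Demographic Parity) as the binary metric. The function name `class_wise_dp` and the class labels are hypothetical, and which classes count as sensitive is for the practitioner to decide.

```python
def class_wise_dp(preds, groups, sensitive_classes, group_a, group_b):
    """Positive-rate gap between two groups for each sensitive class.

    Each sensitive class is treated in turn as the positive class (one-vs-rest).
    """
    results = {}
    for cls in sensitive_classes:
        binary = [1 if p == cls else 0 for p in preds]
        in_a = [b for b, g in zip(binary, groups) if g == group_a]
        in_b = [b for b, g in zip(binary, groups) if g == group_b]
        results[cls] = abs(sum(in_a) / len(in_a) - sum(in_b) / len(in_b))
    return results


# Hypothetical three-class output where only "deny" is treated as sensitive.
preds = ["deny", "approve", "deny", "review", "approve", "approve", "deny", "review"]
groups = ["a"] * 4 + ["b"] * 4
gaps = class_wise_dp(preds, groups, ["deny"], "a", "b")
```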

3.3 Fairness Metrics for Recommendation Use Cases

Zhang et al. (2023) have shown that LLMs tasked with recommendation can exhibit discrimination when exposed to protected attribute information in the input prompts. Let a recommendation LLM use case be defined as an LLM tasked with recommendation, denoted as $\mathcal{M}^{(R)}$, and a population of prompts $\mathcal{P}_{X}$. Specifically, $\mathcal{M}^{(R)}:\mathcal{X}\xrightarrow{}\mathcal{R}^{K}$ maps a prompt $X\in\mathcal{X}$ to an ordered $K$-tuple $\hat{R}\in\mathcal{R}^{K}$ of distinct recommendations from a set of possible recommendations $\mathcal{R}$.

This section outlines a set of fairness metrics for recommendation LLM use cases, as proposed by Zhang et al. (2023). To ensure consistency with the metrics discussed in Section 3.1.3, this section presents modified versions of these metrics to be pairwise in nature, rather than attribute-wise. Given two protected attribute groups $G^{\prime},G^{\prime\prime}$, and an LLM use case $(\mathcal{M}^{(R)},\mathcal{P}_{X})$, these metrics assess similarity in counterfactually generated recommendation lists. Below, each metric is defined according to responses generated from a sample of counterfactual input pairs $(X_{1}^{\prime},X_{1}^{\prime\prime}),...,(X_{N}^{\prime},X_{N}^{\prime\prime})$ which are drawn from $\mathcal{P}_{X|\mathcal{A}}$. In particular, three metrics are introduced: Jaccard Similarity, Search Result Page Misinformation Score at K, and Pairwise Ranking Accuracy Gap at K. Each of these metrics ranges in value from 0 to 1, with larger values indicating a greater degree of fairness.

Jaccard Similarity at K (Jaccard-K) (Zhang et al., 2023).

Below, a pairwise version of Jaccard-K is introduced. This metric calculates the average Jaccard Similarity (Han et al., 2011)—the ratio of the intersection cardinality to the union cardinality—among pairs of counterfactually generated recommendation lists. Formally, this metric is computed as follows:

$$Jaccard\text{-}K=\frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{R}_{i}^{\prime}\cap\hat{R}_{i}^{\prime\prime}|}{|\hat{R}_{i}^{\prime}\cup\hat{R}_{i}^{\prime\prime}|},$$

where $\hat{R}_{i}^{\prime},\hat{R}_{i}^{\prime\prime}$ respectively denote the generated lists of recommendations by $\mathcal{M}(X;\theta)$ from the counterfactual input pair $(X_{i}^{\prime},X_{i}^{\prime\prime})$. Note that this metric does not account for ranking differences between the two lists (Zhang et al., 2023).
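A minimal sketch of the pairwise Jaccard-K calculation is given below, assuming each counterfactual pair is supplied as two non-empty lists of recommended items. The function name `jaccard_k` and the item labels are hypothetical.

```python
def jaccard_k(list_pairs):
    """Average Jaccard similarity over pairs of recommendation lists.

    Each list is converted to a set, so ranking is deliberately ignored.
    """
    total = 0.0
    for list_a, list_b in list_pairs:
        set_a, set_b = set(list_a), set(list_b)
        total += len(set_a & set_b) / len(set_a | set_b)
    return total / len(list_pairs)


# Two lists of K = 3 items sharing two items: intersection 2, union 4.
score = jaccard_k([(["a", "b", "c"], ["b", "c", "d"])])
```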

Search Result Page Misinformation Score at K (SERP-K) (Zhang et al., 2023).

Adapted from Tomlein et al. (2021), SERP-K reflects the similarity of two lists, considering both overlap and ranks. A modified version of SERP-K, adapted for pairwise application, is introduced as follows:

$$\psi(X_{i}^{\prime},X_{i}^{\prime\prime})=\sum_{v\in\hat{R}_{i}^{\prime}}\frac{I(v\in\hat{R}_{i}^{\prime\prime})*(K-rank(v,\hat{R}_{i}^{\prime})+1)}{K*(K+1)/2},$$
$$SERP\text{-}K=\frac{1}{N}\sum_{i=1}^{N}\min(\psi(X_{i}^{\prime},X_{i}^{\prime\prime}),\psi(X_{i}^{\prime\prime},X_{i}^{\prime})),$$

where $\hat{R}_{i}^{\prime},\hat{R}_{i}^{\prime\prime}$ respectively denote the generated lists of recommendations by $\mathcal{M}(X;\theta)$ from the counterfactual input pair $(X_{i}^{\prime},X_{i}^{\prime\prime})$, $v$ is a recommendation from $\hat{R}_{i}^{\prime}$, and $rank(v,\hat{R}_{i}^{\prime})$ denotes the rank of $v$ in $\hat{R}_{i}^{\prime}$. Note that the use of $\min(\cdot,\cdot)$ is included to achieve symmetry.
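The pairwise SERP-K calculation can be sketched as follows. The snippet assumes ranks are 1-based (so the top item receives weight $K$) and that both lists in a pair have the same length $K$. The function names `serp_score` and `serp_k` are hypothetical.

```python
def serp_score(list_a, list_b):
    """Rank-weighted share of list_a's items that also appear in list_b.

    Assumes 1-based ranks: the item at rank r in list_a carries weight K - r + 1.
    """
    k = len(list_a)
    in_b = set(list_b)
    weight = sum(k - rank + 1 for rank, v in enumerate(list_a, start=1) if v in in_b)
    return weight / (k * (k + 1) / 2)


def serp_k(list_pairs):
    """Average over pairs of the smaller of the two directional scores."""
    scores = [min(serp_score(a, b), serp_score(b, a)) for a, b in list_pairs]
    return sum(scores) / len(scores)
```

Taking the minimum matters when a shared item sits near the top of one list and near the bottom of the other: the two directional scores then differ, and the lower one is kept.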

Pairwise Ranking Accuracy Gap at K (PRAG-K) (Zhang et al., 2023).

Adapted from Beutel et al. (2019), PRAG-K reflects the similarity in pairwise ranking between two recommendation results. A pairwise version of PRAG-K is presented as follows:

$$rankmatch_{i}(v_{1},v_{2})=I(rank(v_{1},\hat{R}_{i}^{\prime})<rank(v_{2},\hat{R}_{i}^{\prime}))*I(rank(v_{1},\hat{R}_{i}^{\prime\prime})<rank(v_{2},\hat{R}_{i}^{\prime\prime}))$$
$$\eta(X_{i}^{\prime},X_{i}^{\prime\prime})=\sum_{\substack{v_{1},v_{2}\in\hat{R}_{i}^{\prime}\\ v_{1}\neq v_{2}}}\frac{I(v_{1}\in\hat{R}_{i}^{\prime\prime})*rankmatch_{i}(v_{1},v_{2})}{K*(K+1)},$$
$$PRAG\text{-}K=\frac{1}{N}\sum_{i=1}^{N}\min(\eta(X_{i}^{\prime},X_{i}^{\prime\prime}),\eta(X_{i}^{\prime\prime},X_{i}^{\prime})),$$

where $\hat{R}_{i}^{\prime},\hat{R}_{i}^{\prime\prime}$ respectively denote the generated lists of recommendations by $\mathcal{M}(X;\theta)$ from the counterfactual input pair $(X_{i}^{\prime},X_{i}^{\prime\prime})$, $v_{1},v_{2}$ are recommendations from $\hat{R}_{i}^{\prime}$, and $rank(v,\hat{R}_{i})$ denotes the rank of $v$ in $\hat{R}_{i}$. As with SERP-K, $\min(\cdot,\cdot)$ is used to achieve symmetry.
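To make the PRAG-K definition concrete, the following is a minimal Python sketch rather than a reference implementation. It assumes that each recommendation list contains unique items, that ranks are 1-based, and that an item absent from a list has infinite rank; the last convention is not stated in the definition above and is an assumption. The function names `rank`, `eta`, and `prag_k` are illustrative.

```python
from itertools import permutations
from math import inf


def rank(item, rec_list):
    """1-based position of item in rec_list; inf if absent (an assumption)."""
    try:
        return rec_list.index(item) + 1
    except ValueError:
        return inf


def eta(list_a, list_b, k):
    """One-directional pairwise ranking agreement of list_a against list_b."""
    a, b = list_a[:k], list_b[:k]
    total = 0
    # Ordered pairs (v1, v2) of distinct items drawn from the first list.
    for v1, v2 in permutations(a, 2):
        in_other = v1 in b
        same_order = rank(v1, a) < rank(v2, a) and rank(v1, b) < rank(v2, b)
        total += int(in_other and same_order)
    # Normalization K * (K + 1), as written in the definition of eta.
    return total / (k * (k + 1))


def prag_k(pairs, k):
    """Average over counterfactual pairs of the symmetric (min) agreement."""
    return sum(min(eta(a, b, k), eta(b, a, k)) for a, b in pairs) / len(pairs)
```

For example, `prag_k([(["x", "y", "z"], ["x", "y", "z"])], 3)` evaluates to 0.25, while two disjoint lists yield 0. Under the normalization as written, two identical lists of length $K$ score $(K-1)/(2(K+1))$ rather than 1, which is worth keeping in mind when interpreting values.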

Table 2: Bias and Fairness Evaluation Metrics
| Category | Evaluation Metric | Required Input |
|---|---|---|
| Toxicity | Expected Maximum Toxicity | 25 generations per prompt |
| Toxicity | Toxicity Probability | 25 generations per prompt |
| Toxicity | Toxic Fraction | 1 (or more) generation per prompt |
| Stereotype | Stereotypical Associations | 1 (or more) generation per prompt |
| Stereotype | Co-occurrence Bias Score | 1 (or more) generation per prompt |
| Stereotype | Expected Maximum Stereotype | 25 generations per prompt |
| Stereotype | Stereotype Probability | 25 generations per prompt |
| Stereotype | Stereotype Fraction | 1 (or more) generation per prompt |
| Counterfactual Fairness (Generated Text) | Counterfactual ROUGE-L | 1 (or more) counterfactual pair of generations per prompt |
| Counterfactual Fairness (Generated Text) | Counterfactual BLEU | 1 (or more) counterfactual pair of generations per prompt |
| Counterfactual Fairness (Generated Text) | Counterfactual Cosine Similarity | 1 (or more) counterfactual pair of generations per prompt |
| Counterfactual Fairness (Generated Text) | Weak Counterfactual Sentiment Parity | 1 (or more) counterfactual pair of generations per prompt |
| Counterfactual Fairness (Generated Text) | Strict Counterfactual Sentiment Parity | 1 (or more) counterfactual pair of generations per prompt |
| Allocational Harms | Demographic Parity | Binary predictions and associated protected attribute groups |
| Allocational Harms | False Negative Rate Difference | Binary predictions, ground truth values, and associated protected attribute groups |
| Allocational Harms | False Omission Rate Difference | Binary predictions, ground truth values, and associated protected attribute groups |
| Allocational Harms | False Positive Rate Difference | Binary predictions, ground truth values, and associated protected attribute groups |
| Allocational Harms | False Discovery Rate Difference | Binary predictions, ground truth values, and associated protected attribute groups |
| Counterfactual Fairness (Recommendation) | Jaccard-K | Counterfactual pairs of generated recommendation lists of length $K$ |
| Counterfactual Fairness (Recommendation) | SERP-K | Counterfactual pairs of generated recommendation lists of length $K$ |
| Counterfactual Fairness (Recommendation) | PRAG-K | Counterfactual pairs of generated recommendation lists of length $K$ |

4 A Unified Framework for Bias and Fairness Assessments of LLM Use Cases

In general, bias and fairness assessments of LLM use cases do not require satisfying all possible evaluation metrics. Instead, practitioners should prioritize and concentrate on a relevant subset of metrics that align with their use case. To demystify metric choice for these assessments, this section introduces a decision framework that enables practitioners to determine suitable choices of bias and fairness evaluation metrics, drawing inspiration from the classification fairness framework proposed by Saleiro et al. (2018).

In defining the proposed decision framework for selecting LLM bias and fairness evaluation metrics, the scope is restricted to use cases for which prompts can be sampled from a known population and the task is well-defined. Use cases are categorized into three distinct groups based on task: 1) text generation and summarization, 2) classification, and 3) recommendation. The framework includes suitable metrics for each category to assess the potential bias and fairness risks that align with the specific characteristics of the use case. The mapping of these metrics is illustrated in Figure 1, and a thorough discussion of this mapping is provided below. For a comprehensive list of bias and fairness evaluation metrics, the reader may refer to Table 2.

First, consider text generation and summarization. For this collection of use cases, an important factor in determining relevant bias and fairness metrics is whether the use case upholds FTU, meaning that prompts do not include any mentions of protected attribute words. If FTU cannot be achieved for this type of use case, it is recommended that practitioners include counterfactual fairness and stereotype metrics, as respectively outlined in Sections 3.1.3 and 3.1.2, in their assessments (see footnotes 22 and 23). Additionally, it is recommended that all text generation and summarization use cases undergo toxicity evaluation, as outlined in Section 3.1.1, regardless of whether FTU is achieved.

Footnote 22: Note that counterfactual similarity may be too strict in some contexts, particularly in scenarios where textual content ought to differ across protected attribute groups (e.g. clinical contexts). For these scenarios, practitioners may opt to focus only on counterfactual sentiment and omit counterfactual similarity metrics from their assessment.

Footnote 23: Note that stereotype risk, while low, may still exist even if FTU is satisfied. Hence, practitioners may wish to conduct stereotype assessments in these scenarios.
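As an illustration of how the FTU condition might be checked over a sample of prompts, the following is a minimal sketch. The word list `PROTECTED_ATTRIBUTE_WORDS` is hypothetical and deliberately small; a real assessment would need a curated lexicon for each protected attribute, and simple token matching will produce false positives (for instance, color words used in a non-demographic sense). The function names are illustrative.

```python
import re

# Hypothetical, deliberately tiny lexicon used for illustration only.
PROTECTED_ATTRIBUTE_WORDS = {
    "gender": ["he", "she", "him", "her", "man", "woman", "male", "female"],
    "race": ["white", "black", "asian", "hispanic"],
}


def find_protected_words(prompt, lexicon=PROTECTED_ATTRIBUTE_WORDS):
    """Return the protected attribute words found in a prompt, by attribute."""
    tokens = set(re.findall(r"[a-z']+", prompt.lower()))
    found = {}
    for attribute, words in lexicon.items():
        hits = sorted(tokens & set(words))
        if hits:
            found[attribute] = hits
    return found


def ftu_violation_rate(prompts, lexicon=PROTECTED_ATTRIBUTE_WORDS):
    """Fraction of prompts that mention at least one protected attribute word."""
    if not prompts:
        raise ValueError("prompts must be non-empty")
    flagged = sum(1 for p in prompts if find_protected_words(p, lexicon))
    return flagged / len(prompts)
```

A violation rate of zero on a representative sample of prompts would suggest that the use case upholds FTU with respect to the chosen lexicon; any nonzero rate indicates that counterfactual fairness and stereotype assessments are relevant.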

Second, for classification use cases, a modified version of the decision framework proposed by Saleiro et al. (2018) is adopted. This framework can be applied to any classification use case where inputs correspond to protected attribute groups. Following Saleiro et al. (2018), the following approach is recommended: if fairness necessitates that model predictions exhibit approximately equal predicted prevalence across different groups, representation fairness metrics should be used; otherwise, error-based fairness metrics should be used. For error-based fairness, practitioners should focus on disparities in false negatives, assessed by FNRD and FORD, if the model is used to assign assistive interventions, and on disparities in false positives, assessed by FPRD and FDRD, if the model is used to assign punitive interventions (see footnote 24). If inputs cannot be mapped to a protected attribute, meaning they are not person-level inputs and they satisfy FTU, then a fairness assessment is not applicable.

Footnote 24: In the context of fairness, if interventions are punitive, and hence can hurt individuals, it is undesirable for a model to produce false positives disproportionately for any protected attribute group. Analogously, having a model produce false negatives disproportionately for any protected attribute group is undesirable in the case of assistive interventions.
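The classification metrics named above can be computed directly from binary predictions, ground truth values, and group labels. The following is a minimal sketch, assuming each disparity is taken as the absolute difference between the rates of two groups; the formal definitions are given in the metrics section and may differ in form. The function names and the returned keys are illustrative.

```python
def group_rates(y_true, y_pred):
    """Error rates and predicted prevalence for one group (binary labels 0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined rates (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "FNR": ratio(fn, fn + tp),  # false negative rate
        "FOR": ratio(fn, fn + tn),  # false omission rate
        "FPR": ratio(fp, fp + tn),  # false positive rate
        "FDR": ratio(fp, fp + tp),  # false discovery rate
        "prevalence": ratio(tp + fp, tp + fp + tn + fn),
    }


def group_disparities(y_true, y_pred, groups, group_a, group_b):
    """Absolute differences in rates between two protected attribute groups."""
    def subset(g):
        rows = [(t, p) for t, p, grp in zip(y_true, y_pred, groups) if grp == g]
        return [t for t, _ in rows], [p for _, p in rows]

    ra = group_rates(*subset(group_a))
    rb = group_rates(*subset(group_b))
    out = {name + "D": abs(ra[name] - rb[name]) for name in ("FNR", "FOR", "FPR", "FDR")}
    out["demographic_parity_difference"] = abs(ra["prevalence"] - rb["prevalence"])
    return out
```

For an assistive intervention, the `FNRD` and `FORD` entries would be the values of interest; for a punitive intervention, `FPRD` and `FDRD`.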

Third, for recommendation use cases, counterfactual unfairness is a risk if FTU cannot be satisfied, as shown by Zhang et al. (2023). Note that counterfactual invariance may not be a desirable property for certain recommendation use cases. For instance, it may be preferred to recommend different products for male vs. female customers. Hence, if counterfactual invariance is a desired property, it is recommended that recommendation use cases not satisfying FTU be assessed for counterfactual unfairness in recommendations using the metrics outlined in Section 3.3. On the other hand, if a recommendation use case satisfies FTU or if counterfactual invariance is not desired, then a fairness assessment is not applicable.
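The selection logic described for the three use case categories can be summarized as a small function. The following is a minimal sketch of that logic, not a definitive encoding of the framework: the function name, argument names, and string labels are illustrative, and the returned names follow the metric groupings in Table 2.

```python
def select_metrics(
    task,
    ftu_satisfied=False,
    maps_to_protected_attribute=True,
    equal_prevalence_required=False,
    intervention="assistive",
    counterfactual_invariance_desired=True,
):
    """Return the metric families suggested for a use case; [] if not applicable."""
    if task == "text_generation":  # also covers summarization
        metrics = ["Toxicity"]  # recommended regardless of FTU
        if not ftu_satisfied:
            metrics += ["Stereotype", "Counterfactual Fairness (Generated Text)"]
        return metrics

    if task == "classification":
        if not maps_to_protected_attribute:
            return []  # fairness assessment not applicable
        if equal_prevalence_required:
            return ["Demographic Parity"]  # representation fairness
        if intervention == "assistive":
            return ["False Negative Rate Difference", "False Omission Rate Difference"]
        if intervention == "punitive":
            return ["False Positive Rate Difference", "False Discovery Rate Difference"]
        raise ValueError("intervention must be 'assistive' or 'punitive'")

    if task == "recommendation":
        if ftu_satisfied or not counterfactual_invariance_desired:
            return []  # fairness assessment not applicable
        return ["Jaccard-K", "SERP-K", "PRAG-K"]

    raise ValueError("task must be 'text_generation', 'classification', or 'recommendation'")
```

For instance, `select_metrics("classification", intervention="punitive")` returns the two false-positive-based metrics. The sketch omits the optional stereotype assessment for text generation use cases that satisfy FTU, which practitioners may still wish to conduct.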

The framework presented here is intended for use cases in which the volume of LLM responses makes exhaustive human review infeasible. Accordingly, in situations where the practitioner manually reviews each generated output, the proposed evaluations may be unnecessary, provided that the reviewer can effectively address bias and fairness concerns directly.

Figure 1: Bias and Fairness Evaluation Framework for LLM Use Cases

5 Conclusion

This paper proposes an actionable decision framework for selecting bias and fairness evaluation metrics for LLM use cases, introducing several new evaluation metrics as part of the framework. This work addresses two gaps in the current literature. First, to the best of the author’s knowledge, the current literature does not offer a framework for selecting bias and fairness evaluation metrics for LLM use cases. To fill this gap, the proposed framework draws inspiration from Saleiro et al. (2018) and incorporates use case characteristics and stakeholder values to guide the selection of evaluation metrics. Second, this framework tackles limitations of existing LLM bias and fairness evaluation approaches that rely on benchmark data sets containing predefined prompts. Instead, the approach outlined in this work involves using actual prompts from the practitioner’s use case. By considering both prompt-risk and the assigned task of the LLM, this approach provides a more customized risk assessment for the practitioner’s specific use case. Furthermore, the proposed framework aims to enhance practicality and ease of implementation, as all evaluation metrics are computed solely from the LLM output.

Acknowledgements

Thank you to Mohit Singh Chauhan, Blake Aber, Piero Ferrante, Xue (Crystal) Gu, Almira Pillay, Zeya Ahmad, and Vasistha Singhal Vinod for your helpful suggestions.

References

  • Minaee et al. [2024] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey, 2024.
  • Liu et al. [2023] Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, Zihao Wu, Lin Zhao, Dajiang Zhu, Xiang Li, Ning Qiang, Dingang Shen, Tianming Liu, and Bao Ge. Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology, 1(2):100017, sep 2023. doi:10.1016/j.metrad.2023.100017. URL https://doi.org/10.1016%2Fj.metrad.2023.100017.
  • Ray [2023] Partha Pratim Ray. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 3:121–154, 2023. ISSN 2667-3452. doi:10.1016/j.iotcps.2023.04.003. URL https://www.sciencedirect.com/science/article/pii/S266734522300024X.
  • Gehman et al. [2020] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.301. URL https://aclanthology.org/2020.findings-emnlp.301.
  • Dhamala et al. [2021] Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21. ACM, March 2021. doi:10.1145/3442188.3445924. URL http://dx.doi.org/10.1145/3442188.3445924.
  • Nozza et al. [2021] Debora Nozza, Federico Bianchi, and Dirk Hovy. HONEST: Measuring hurtful sentence completion in language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2398–2406, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.191. URL https://aclanthology.org/2021.naacl-main.191.
  • Smith et al. [2022] Eric Michael Smith, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. "i’m sorry to hear that": Finding new biases in language models with a holistic descriptor dataset, 2022.
  • Parrish et al. [2021] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. BBQ: A hand-built bias benchmark for question answering. CoRR, abs/2110.08193, 2021. URL https://arxiv.org/abs/2110.08193.
  • Li et al. [2020] Tao Li, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Vivek Srikumar. UNQOVERing stereotyping biases via underspecified questions. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3475–3489, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.311. URL https://aclanthology.org/2020.findings-emnlp.311.
  • Zhao et al. [2018] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. CoRR, abs/1804.06876, 2018. URL http://arxiv.org/abs/1804.06876.
  • Rudinger et al. [2018] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi:10.18653/v1/N18-2002. URL https://aclanthology.org/N18-2002.
  • Nadeem et al. [2021] Moin Nadeem, Anna Bethke, and Siva Reddy. StereoSet: Measuring stereotypical bias in pretrained language models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online, August 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.416. URL https://aclanthology.org/2021.acl-long.416.
  • Levy et al. [2021] Shahar Levy, Koren Lazar, and Gabriel Stanovsky. Collecting a large-scale gender bias dataset for coreference resolution and machine translation. CoRR, abs/2109.03858, 2021. URL https://arxiv.org/abs/2109.03858.
  • Nangia et al. [2020] Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.154. URL https://aclanthology.org/2020.emnlp-main.154.
  • Barikeri et al. [2021] Soumya Barikeri, Anne Lauscher, Ivan Vulić, and Goran Glavaš. RedditBias: A real-world resource for bias evaluation and debiasing of conversational language models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1941–1955, Online, August 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.151. URL https://aclanthology.org/2021.acl-long.151.
  • Jiao et al. [2023] Fangkai Jiao, Bosheng Ding, Tianze Luo, and Zhanfeng Mo. Panda llm: Training data and evaluation for open-sourced chinese instruction-following large language models, 2023.
  • Felkner et al. [2023] Virginia K. Felkner, Ho-Chun Herbert Chang, Eugene Jang, and Jonathan May. Winoqueer: A community-in-the-loop benchmark for anti-lgbtq+ bias in large language models, 2023.
  • Gallegos et al. [2023] Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. Bias and fairness in large language models: A survey, 2023.
  • Wang et al. [2024] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, and Bo Li. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models, 2024.
  • Saleiro et al. [2018] Pedro Saleiro, Benedict Kuester, Abby Stevens, Ari Anisfeld, Loren Hinkson, Jesse London, and Rayid Ghani. Aequitas: A bias and fairness audit toolkit. CoRR, abs/1811.05577, 2018. URL http://arxiv.org/abs/1811.05577.
  • Lin [2004] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
  • Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA, 2002. Association for Computational Linguistics. doi:10.3115/1073083.1073135. URL https://doi.org/10.3115/1073083.1073135.
  • Singhal and Google [2001] Amit Singhal and I. Google. Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin, 24, 01 2001.
  • Zhang et al. [2023] Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation, 2023.
  • Bellamy et al. [2018] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. Ai fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias, 2018.
  • Weerts et al. [2023] Hilde Weerts, Miroslav Dudík, Richard Edgar, Adrin Jalali, Roman Lutz, and Michael Madaio. Fairlearn: Assessing and improving fairness of ai systems. Journal of Machine Learning Research, 24(257):1–8, 2023. URL http://jmlr.org/papers/v24/23-0389.html.
  • Hardt et al. [2016] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. CoRR, abs/1610.02413, 2016. URL http://arxiv.org/abs/1610.02413.
  • Feldman et al. [2014] Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact, 2014. URL https://arxiv.org/abs/1412.3756.
  • Mehrabi et al. [2019] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. CoRR, abs/1908.09635, 2019. URL http://arxiv.org/abs/1908.09635.
  • Islam et al. [2016] Aylin Caliskan Islam, Joanna J. Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora necessarily contain human biases. CoRR, abs/1608.07187, 2016. URL http://arxiv.org/abs/1608.07187.
  • May et al. [2019] Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. On measuring social biases in sentence encoders. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-1063. URL https://aclanthology.org/N19-1063.
  • Guo and Caliskan [2020] Wei Guo and Aylin Caliskan. Detecting emergent intersectional biases: Contextualized word embeddings contain a distribution of human-like biases. CoRR, abs/2006.03955, 2020. URL https://arxiv.org/abs/2006.03955.
  • Webster et al. [2020] Kellie Webster, Xuezhi Wang, Ian Tenney, Alex Beutel, Emily Pitler, Ellie Pavlick, Jilin Chen, and Slav Petrov. Measuring and reducing gendered correlations in pre-trained models. CoRR, abs/2010.06032, 2020. URL https://arxiv.org/abs/2010.06032.
  • Kurita et al. [2019] Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. Measuring bias in contextualized word representations, 2019.
  • Ahn and Oh [2021] Jaimeen Ahn and Alice Oh. Mitigating language-dependent ethnic bias in BERT. CoRR, abs/2109.05704, 2021. URL https://arxiv.org/abs/2109.05704.
  • Kaneko and Bollegala [2021] Masahiro Kaneko and Danushka Bollegala. Unmasking the mask - evaluating social biases in masked language models. CoRR, abs/2104.07496, 2021. URL https://arxiv.org/abs/2104.07496.
  • Salazar et al. [2019] Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. Pseudolikelihood reranking with masked language models. CoRR, abs/1910.14659, 2019. URL http://arxiv.org/abs/1910.14659.
  • Goldfarb-Tarrant et al. [2020] Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ricardo Muñoz Sánchez, Mugdha Pandya, and Adam Lopez. Intrinsic bias metrics do not correlate with application bias. CoRR, abs/2012.15859, 2020. URL https://arxiv.org/abs/2012.15859.
  • Delobelle et al. [2021] Pieter Delobelle, Ewoenam Kwaku Tokpo, Toon Calders, and Bettina Berendt. Measuring fairness with biased rulers: A survey on quantifying biases in pretrained language models. CoRR, abs/2112.07447, 2021. URL https://arxiv.org/abs/2112.07447.
  • Blodgett et al. [2020] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna M. Wallach. Language (technology) is power: A critical survey of "bias" in NLP. CoRR, abs/2005.14050, 2020. URL https://arxiv.org/abs/2005.14050.
  • Kumar et al. [2023] Sachin Kumar, Vidhisha Balachandran, Lucille Njoo, Antonios Anastasopoulos, and Yulia Tsvetkov. Language generation models can cause harm: So what can we do about it? an actionable survey. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3299–3321, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.eacl-main.241. URL https://aclanthology.org/2023.eacl-main.241.
  • Li et al. [2024] Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. A survey on fairness in large language models, 2024.
  • Chu et al. [2024] Zhibo Chu, Zichong Wang, and Wenbin Zhang. Fairness in large language models: A taxonomic survey, 2024.
  • Ferrara [2023] Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models, 2023.
  • Ranaldi et al. [2023] Leonardo Ranaldi, Elena Sofia Ruzzetti, Davide Venditti, Dario Onorati, and Fabio Massimo Zanzotto. A trip towards fairness: Bias and de-biasing in large language models, 2023.
  • Kotek et al. [2023] Hadas Kotek, Rikker Dockum, and David Q. Sun. Gender bias in llms, 2023. URL https://arxiv.org/abs/2308.14921.
  • Wu and Aji [2023] Minghao Wu and Alham Fikri Aji. Style over substance: Evaluation biases for large language models, 2023.
  • Li et al. [2023] Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. A survey on fairness in large language models, 2023.
  • Nozza et al. [2022] Debora Nozza, Federico Bianchi, and Dirk Hovy. Pipelines for social bias testing of large language models. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 68–74, virtual+Dublin, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.bigscience-1.6. URL https://aclanthology.org/2022.bigscience-1.6.
  • Zhuo et al. [2023] Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity, 2023.
  • Chuang et al. [2024] Yu-Neng Chuang, Ruixiang Tang, Xiaoqian Jiang, and Xia Hu. Spec: A soft prompt-based calibration on performance variability of large language model in clinical notes summarization. Journal of Biomedical Informatics, 151:104606, 2024. ISSN 1532-0464. doi:10.1016/j.jbi.2024.104606. URL https://www.sciencedirect.com/science/article/pii/S1532046424000248.
  • Bommasani et al. [2023] Rishi Bommasani, Percy Liang, and Tony Lee. Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525(1):140–146, 2023. doi:10.1111/nyas.15007. URL https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/nyas.15007.
  • Bordia and Bowman [2019] Shikha Bordia and Samuel R. Bowman. Identifying and reducing gender bias in word-level language models. CoRR, abs/1904.03035, 2019. URL http://arxiv.org/abs/1904.03035.
  • Zekun et al. [2023] Wu Zekun, Sahan Bulathwela, and Adriano Soares Koshiyama. Towards auditing large language models: Improving text-based stereotype detection, 2023.
  • Huang et al. [2020] Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. Reducing sentiment bias in language models via counterfactual evaluation, 2020.
  • Garg et al. [2019] Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. Counterfactual fairness in text classification through robustness, 2019.
  • Kamishima et al. [2012] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Peter A. Flach, Tijl De Bie, and Nello Cristianini, editors, Machine Learning and Knowledge Discovery in Databases, pages 35–50, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-33486-3.
  • Zhang et al. [2018] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. CoRR, abs/1801.07593, 2018. URL http://arxiv.org/abs/1801.07593.
  • Pleiss et al. [2017] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon M. Kleinberg, and Kilian Q. Weinberger. On fairness and calibration. CoRR, abs/1709.02012, 2017. URL http://arxiv.org/abs/1709.02012.
  • Kamiran et al. [2012] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. Decision theory for discrimination-aware classification. In 2012 IEEE 12th International Conference on Data Mining, pages 924–929, 2012. doi:10.1109/ICDM.2012.45.
  • Agarwal et al. [2018] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna M. Wallach. A reductions approach to fair classification. CoRR, abs/1803.02453, 2018. URL http://arxiv.org/abs/1803.02453.
  • Kamiran and Calders [2011] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33:1 – 33, 2011. URL https://api.semanticscholar.org/CorpusID:14637938.
  • Chouldechova [2016] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, 2016.
  • Sun et al. [2023] Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. Text classification via large language models, 2023.
  • Widmann and Wich [2022] Tobias Widmann and Maximilian Wich. Creating and comparing dictionary, word embedding, and transformer-based models to measure discrete emotions in german political text. SSRN Electronic Journal, 01 2022. doi:10.2139/ssrn.4127133.
  • Bonikowski et al. [2022] Bart Bonikowski, Yuchen Luo, and Oscar Stuhler. Politics as usual? measuring populism, nationalism, and authoritarianism in u.s. presidential campaigns (1952–2020) with neural language models. Sociological Methods & Research, 51(4):1721–1787, 2022. doi:10.1177/00491241221122317. URL https://doi.org/10.1177/00491241221122317.
  • Howard and Ruder [2018] Jeremy Howard and Sebastian Ruder. Fine-tuned language models for text classification. CoRR, abs/1801.06146, 2018. URL http://arxiv.org/abs/1801.06146.
  • Sun et al. [2019] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT for text classification? CoRR, abs/1905.05583, 2019. URL http://arxiv.org/abs/1905.05583.
  • Chai et al. [2020] Duo Chai, Wei Wu, Qinghong Han, Fei Wu, and Jiwei Li. Description based text classification with reinforcement learning. CoRR, abs/2002.03067, 2020. URL https://arxiv.org/abs/2002.03067.
  • Chen et al. [2020] Jiaao Chen, Zichao Yang, and Diyi Yang. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147–2157, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.194. URL https://aclanthology.org/2020.acl-main.194.
  • Lin et al. [2021] Yuxiao Lin, Yuxian Meng, Xiaofei Sun, Qinghong Han, Kun Kuang, Jiwei Li, and Fei Wu. Bertgcn: Transductive text classification by combining GCN and BERT. CoRR, abs/2105.05727, 2021. URL https://arxiv.org/abs/2105.05727.
  • Bao et al. [2023] Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23. ACM, September 2023. doi:10.1145/3604915.3608857. URL http://dx.doi.org/10.1145/3604915.3608857.
  • Gao et al. [2023] Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. Chat-rec: Towards interactive and explainable llms-augmented recommender system, 2023.
  • Chowdhery et al. [2022] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022.
  • Lees et al. [2022] Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. A new generation of Perspective API: Efficient multilingual character-level transformers, 2022.
  • Sicilia and Alikhani [2023] Anthony Sicilia and Malihe Alikhani. Learning to generate equitable text in dialogue from biased training data, 2023.
  • Sheng et al. [2019] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. CoRR, abs/1909.01326, 2019. URL http://arxiv.org/abs/1909.01326.
  • Rajpurkar et al. [2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi:10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264.
  • Gomaa and Fahmy [2013] Wael Gomaa and Aly Fahmy. A survey of text similarity approaches. International Journal of Computer Applications, 68, April 2013. doi:10.5120/11638-7118.
  • [80] Evaluating models. AutoML Translation Documentation, Google Cloud. https://cloud.google.com/translate/automl/docs/evaluate. [Accessed 13-05-2024].
  • Jiang et al. [2019] Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. Wasserstein fair classification, 2019.
  • Rouzot et al. [2023] Julien Rouzot, Julien Ferry, and Marie-José Huguet. Learning optimal fair scoring systems for multi-class classification, 2023.
  • Czarnowska et al. [2021] Paula Czarnowska, Yogarshi Vyas, and Kashif Shah. Quantifying social biases in NLP: A generalization and empirical comparison of extrinsic fairness metrics. CoRR, abs/2106.14574, 2021. URL https://arxiv.org/abs/2106.14574.
  • Dwork et al. [2011] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. Fairness through awareness. CoRR, abs/1104.3913, 2011. URL http://arxiv.org/abs/1104.3913.
  • Han et al. [2011] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011. ISBN 0123814790.
  • Tomlein et al. [2021] Matus Tomlein, Branislav Pecher, Jakub Simko, Ivan Srba, Robert Moro, Elena Stefancova, Michal Kompan, Andrea Hrckova, Juraj Podrouzek, and Maria Bielikova. An audit of misinformation filter bubbles on YouTube: Bubble bursting and recent behavior changes. In Fifteenth ACM Conference on Recommender Systems, RecSys ’21. ACM, September 2021. doi:10.1145/3460231.3474241. URL http://dx.doi.org/10.1145/3460231.3474241.
  • Beutel et al. [2019] Alex Beutel, Jilin Chen, Tulsee Doshi, Hai Qian, Li Wei, Yi Wu, Lukasz Heldt, Zhe Zhao, Lichan Hong, Ed H. Chi, and Cristos Goodrow. Fairness in recommendation ranking through pairwise comparisons. CoRR, abs/1903.00780, 2019. URL http://arxiv.org/abs/1903.00780.