BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs


Abstract

With the rapid development and widespread deployment of Large Language Models (LLMs), evaluating and ensuring their fairness has become crucial and has attracted considerable attention. However, existing evaluation approaches rely on fixed-form outputs (e.g., answer choices), which are not suitable for the flexible and diverse open-text generation scenarios of LLMs (e.g., sentence completion and question answering). Moreover, evaluating fairness in open-text generation tasks is challenging because reliable methods for assessing the generated content are lacking. To address this, we introduce BiasAlert, a plug-and-play tool designed to detect and evaluate social biases in LLM-generated text across open-text generation tasks. BiasAlert integrates external human knowledge with LLMs’ inherent reasoning capabilities to reliably identify bias by analyzing generated content against a comprehensive, human-annotated database of social biases. Extensive experiments demonstrate that BiasAlert significantly outperforms existing tools and state-of-the-art models like GPT-4 in detecting biases, with improved reliability, adaptability, and scalability. Furthermore, through several case studies, we showcase BiasAlert’s utility in bias evaluation and mitigation across various deployment scenarios, indicating its potential to promote fairer and more reliable evaluation and deployment of LLMs in various applications.


Anonymous ACL submission

1 Introduction

Figure 1: Illustration of the main challenge for existing evaluation approaches: relying on fixed-form outputs (e.g., lexicon or answer options), which are not suitable for the flexible and diverse open-text generation scenarios of LLMs (e.g., sentence completion, online dialogue, and question answering). In this paper, we propose a plug-and-play bias detection tool BiasAlert.

Large Language Models (LLMs), characterized by their extensive parameter sets and substantial training datasets, have reduced human intervention in complex tasks such as automatic translation, sentiment analysis, and dialogue generation achiam2023gpt; touvron2023llama; devlin2018bert, profoundly impacting human communication, learning, and decision-making processes. Although LLMs have brought significant technological advancements and efficiency improvements across various fields, recent studies have shown that these models are prone to exhibiting social biases inherited from their training data navigli2023biases; sheng2021societal. Evaluating social biases in LLMs can not only enhance their fairness and reliability but also expedite their widespread deployment, and it has attracted increasing attention kaneko2021debiasing.

Many efforts have been made to evaluate the fairness of LLMs, and they fall mainly into two categories. Embedding or probability-based methods assess LLMs by analyzing hidden representations or token probability predictions from counterfactual inputs, using techniques like computing distances in the embedding space caliskan2017semantics; nadeem2020stereoset, or comparing the predicted probabilities for tokens or counterfactual sentences may2019measuring; nangia2020crows. Generated text-based methods evaluate LLMs by prompting them to complete text or answer questions dhamala2021bold; wan2023biasasker, then measuring bias by analyzing differences in co-occurrence distributions or word frequencies, or by using a trained classifier bordia2019identifying; nozza2021honest; huang2019reducing. However, these approaches rely on fixed-form inputs and outputs (e.g., lexicon or answer choices) and fail to capture how LLMs generate biased content in open-text generation scenarios (e.g., sentence completion, online dialogue, and question answering), which limits the value of these benchmarks for flexible and diverse generations and for broader practical scenarios. Furthermore, the challenge of evaluating fairness in open-text generation tasks is exacerbated by the absence of reliable and efficient methods to assess the fairness of the generated content.

To bridge this gap, we introduce BiasAlert, a plug-and-play tool for detecting social bias. BiasAlert integrates external human knowledge with the inherent reasoning abilities of LLMs to automatically and reliably identify biases in generated text, thus enabling bias evaluation in open-text generation tasks including sentence completion, online dialogue, and question answering. Specifically, BiasAlert takes the content generated by an LLM (e.g., from a text completion or question answering task) as input. The retrieval module then retrieves the top five most relevant social biases from our self-constructed database, and the detection module judges the fairness of the input and identifies the biased content based on the retrieved context. To achieve this, we first construct a comprehensive retrieval database, which contains human-annotated social biases collected from media and existing datasets. Then, we train BiasAlert on a carefully crafted instruction-following dataset, which enhances the reasoning and language comprehension abilities of the LLM and thus enables retrieval-augmented bias detection.

We evaluate the efficacy of BiasAlert with comprehensive experiments on the RedditBias and Crows-pairs datasets. The results indicate that BiasAlert outperforms existing automatic bias detection tools (e.g., Llama-Guard) and state-of-the-art language models (e.g., GPT-4) in detecting biases across various metrics. Additional experiments demonstrate the effectiveness of retrieval in the bias detection task, the reliable, adaptive, and attributable nature of the detection, and the scalability and practical utility of BiasAlert. Finally, we provide three case studies to showcase the applicable scenarios of BiasAlert: bias evaluation on a text generation task, bias evaluation on a question answering task, and bias mitigation in LLM deployment.

Our contributions are:

• We develop a plug-and-play bias detection tool for open-text generation called BiasAlert, which can be widely applied to the generated content of LLMs. Experiments show that BiasAlert surpasses existing tools as well as state-of-the-art language models in the bias detection task, achieving reliable, continual, and attributable detection.
• Case studies show that BiasAlert is applicable in many scenarios: bias evaluation on a text generation task, bias evaluation on a question answering task, and bias mitigation in LLM deployment.

2 Related Works

2.1 Bias Evaluation

Recent research has revealed the tendency of LLMs to exhibit bias, which manifests as discriminatory language or stereotypes against certain vulnerable groups philological1870transactions. These unfair treatments or disparities may derive from the data the models were trained on, which reflects historical and structural power asymmetries of human society gallegos2023bias.

Many efforts have been made to evaluate the fairness of LLMs, which can be divided into two categories: embedding or probability-based approaches and generated text-based approaches. (1) Embedding or probability-based approaches evaluate an LLM by comparing the hidden representations or predicted token probabilities of counterfactual inputs. Methods include computing the correlations between static word embeddings caliskan2017semantics or contextualized embeddings may2019measuring; guo2021detecting and different social groups, comparing the predicted probabilities for counterfactual tokens (e.g., man/woman) via a fill-in-the-blank task nadeem2020stereoset, or comparing the predicted pseudo-log-likelihoods between counterfactual sentences nangia2020crows. (2) Generated text-based approaches evaluate an LLM by providing prompts (e.g., questions) to a generative LLM and asking it to provide sentence completions dhamala2021bold or select an answer to a question wan2023biasasker. Bias is then calculated from the generated texts by computing differences in co-occurrence distributions bordia2019identifying; liang2022holistic, comparing word frequencies according to a pre-defined lexicon nozza2021honest; dhamala2021bold, or scoring with a trained classifier.

However, existing approaches still face many limitations. First, many studies indicate that bias evaluated by embedding or probability-based approaches has only a weak relation to bias in downstream text-generation tasks cabello2023independence; delobelle2022measuring; kaneko2022debiasing; blodgett2021stereotyping, which limits their generalizability and reliability. Second, generated text-based approaches still rely on fixed-form inputs and outputs (i.e., choices or lexicon), as it is difficult to assess the bias of content in open-text generation scenarios wan2023biasasker; fang2024bias; parrish2021bbq; kaneko2024evaluating; li2020unqovering. To bridge this gap, we propose a plug-and-play bias detection tool, BiasAlert, which can automatically and reliably detect bias in open-ended generated text, thus enabling bias evaluation for the most common LLM deployment scenarios, such as sentence completion, online dialogue, and question answering.

2.2 Bias Detection Tool

Notably, LLM-as-Judge is another potential paradigm for bias detection, which leverages the capabilities of LLMs to achieve data-independent automatic detection through fine-tuning or prompting lin2024indivec; inan2023llama; achiam2023gpt; touvron2023llama. However, our experiments in Section 4.2 demonstrate that current LLMs still struggle to accurately identify social biases because they lack knowledge of human social ethics. Additionally, existing bias detection strategies often lead LLMs to "over-avoid" sensitive issues, impairing their usability. Perspective API and Llama-Guard are two automatic detection tools similar to our approach. However, they focus on detecting toxicity and safety, respectively, and perform poorly on fairness data, as demonstrated in Section 4.2.

3 Method

Task Formulation

We focus on open-ended text generation tasks such as text completion and question answering. Given an LLM $\mathcal{G}$ (e.g., GPT-4), we define the user input as $\mathcal{X}$ and the output of the LLM as $\mathcal{Y} = \mathcal{G}(\mathcal{X})$. In this paper, we introduce a plug-and-play tool, BiasAlert, denoted as $\mathcal{A}$, to detect bias in LLM generations. Formally, BiasAlert takes the generated content of $\mathcal{G}$ as input and outputs the analysis result $\mathcal{J} = \mathcal{A}(\mathcal{Y})$, which consists of the judgment and the corresponding explanation.

In this section, we describe how BiasAlert $\mathcal{A}$ is built. BiasAlert is trained on an instruction-following dataset. At inference, it incorporates social biases retrieved from the database, determines whether the input is biased, and identifies the (fine-grained) bias type and the specific biased content in the sentence.

3.1 Social Bias Retrieval Database

We propose to construct a comprehensive retrieval database that encompasses a wide range of human social biases, aimed at providing a reliable decision reference for our detection tool. The database can be easily updated in real time, so that detection results reflect the ethics of human society more accurately and promptly. Specifically, we first leverage bias data from existing social bias datasets that are reliably annotated by humans and widely used. Then, we standardize the collected data, simplifying the text into refined corpora (i.e., texts stating the target demographic group and the biased description). Finally, we integrate the corpus labels according to the type of bias. Here are some examples from our constructed database:

’black people are niggers’, racial bias; ’women can’t handle drugs’, gender bias.

The constructed bias retrieval database encompasses seven types of bias, including gender, race, religion, and culture, with a total of 41,000 entries. Each entry corresponds to a specific biased statement and its annotation. We use the Contriever encoder izacard2021unsupervised to embed the contents of the retrieval database. During bias detection with BiasAlert, we first use Contriever-MSMARCO to query the top five most relevant entries from the retrieval database. These retrieved entries, representing potential biases within the input text, are then fed into the bias detection model along with the query to guide the model in determining whether these biased viewpoints are present in the input sentence.
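The top-five lookup described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for the Contriever encoder, and the database entries, field names, and function names are hypothetical placeholders.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a dense encoder: L2-normalised bag-of-words counts.
    (Assumption: the real system uses Contriever embeddings instead.)"""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {tok: v / norm for tok, v in counts.items()}

def score(query_vec, entry_vec):
    """Inner product between two sparse vectors."""
    return sum(w * entry_vec.get(tok, 0.0) for tok, w in query_vec.items())

def retrieve(query, database, k=5):
    """Return the k database entries most similar to the query."""
    q = embed(query)
    ranked = sorted(database, key=lambda e: score(q, embed(e["text"])), reverse=True)
    return ranked[:k]

# Hypothetical, deliberately neutral placeholder entries (statement, bias type).
DATABASE = [
    {"text": "group A people are bad at math", "type": "racial bias"},
    {"text": "group B people cannot drive", "type": "gender bias"},
    {"text": "group C people are lazy", "type": "religious bias"},
]

top = retrieve("I heard group B people cannot drive well", DATABASE, k=2)
for entry in top:
    print(entry["type"], "|", entry["text"])
```

In a real deployment the entry embeddings would be computed once and stored in an index rather than re-encoded per query, as this sketch does for brevity.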

Overall, the retrieval database directly provides comprehensive knowledge of human societal biases to the bias detection model, serving as the basis for its decisions. This ensures that the model’s judgments do not rely solely on knowledge acquired through training or fine-tuning, which is unreliable, not attributable, and hard to update. The ablation studies in Section 5.1 demonstrate the necessity of the external retrieval database: we found that even the most powerful language models, such as GPT-4, are unable to accurately detect biases without the aid of external retrieval.

3.2 Instruction-following Bias Detection

We propose to construct BiasAlert using an instruction-following paradigm, aiming to leverage external retrieval information combined with the LLM’s internal reasoning and language understanding capabilities to achieve reliable, adaptive, and attributable bias detection. With the content to be checked and the retrieved references as inputs, we design step-by-step instructions. We first define bias and the judgment criteria in the instruction, then guide the model to understand the content by identifying specific groups and potentially biased descriptions within the content. Subsequently, we instruct the model to review the retrieved references and make a judgment about the content. Additionally, we employ in-context demonstrations to help the model better understand and adapt to diverse and complicated scenarios and to enhance its generalization ability. Here is an example of our constructed instruction:

Training and Inference. In the training phase, we utilize the constructed instruction-tuning dataset, which combines instruction, retrieved references, and in-context demonstrations, to fine-tune the pretrained LM. In the inference phase, for each piece of generated content to be analyzed, we first query the retrieval database for the top five social biases as references. Then, the generated content is input into the trained BiasAlert along with the references. Finally, BiasAlert reports whether the generated content contains bias, as well as the type and content of the bias.
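The inference flow described above (retrieve references, assemble the instruction, query the detector) can be sketched as below. The prompt wording is purely hypothetical, since the paper's actual instruction template is not reproduced here; `retrieve` and `model` are caller-supplied callables, and the stubs at the end exist only to make the example runnable.

```python
def build_prompt(content, references):
    """Assemble a step-by-step instruction (hypothetical wording)."""
    lines = [
        "Task: decide whether the text below contains social bias.",
        "Step 1: identify the social groups and the descriptions applied to them.",
        "Step 2: compare them with the retrieved references.",
        "Step 3: answer 'biased' or 'unbiased'; if biased, give the bias type "
        "and the biased content.",
        "References:",
    ]
    lines += [f"- {ref['text']} ({ref['type']})" for ref in references]
    lines += ["Text:", content]
    return "\n".join(lines)

def detect(content, retrieve, model, k=5):
    """Retrieve k references for the content, then ask the detector model."""
    references = retrieve(content, k)
    return model(build_prompt(content, references))

# Stubs standing in for the retriever and the fine-tuned detector (assumptions).
def stub_retrieve(content, k):
    return [{"text": "group B people cannot drive", "type": "gender bias"}][:k]

def stub_model(prompt):
    return {"biased": "cannot drive" in prompt.split("Text:")[-1], "prompt": prompt}

result = detect("group B people cannot drive well", stub_retrieve, stub_model)
print(result["biased"])
```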

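Model	RedditBias_Eval Efficacy	RedditBias_Eval F1	RedditBias_Eval CS	RedditBias_Eval Attribution	RedditBias_Eval OS	RedditBias_Eval Overall	Crows-pairs Efficacy	Crows-pairs F1	Crows-pairs OS	Crows-pairs CS	Crows-pairs Overall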
Online Detection Tools
LlamaGuard-7b	0.59	0.74	-	-	-	-	0.67	0.76	-	-	-
Azure-Safety	0.61	0.63	-	-	-	-	0.63	0.76	-	-	-
OpenAI	0.62	0.75	-	-	-	-	0.76	0.86	-	-	-
Large Language Model Baselines
Llama-2-7b-chat	0.43	0.03	0.43	0.04	0.02	0.01	0.24	0.05	0.14	0.28	0.07
Llama-2-13b-chat	0.45	0.15	0.45	0.67	0.94	0.13	0.44	0.52	0.58	0.27	0.12
Gemma-7b-it	0.43	0.05	0.13	0.82	1.0	0.05	0.27	0.14	1.0	0.13	0.04
GPT-3.5	0.5	0.46	0.57	0.37	1.0	0.11	0.26	0.13	1.0	0.24	0.06
GPT-4	0.61	0.59	0.86	0.41	1.0	0.21	0.43	0.50	1.0	0.24	0.10
Ours	0.84	0.82	1.0	0.97	1.0	0.82	0.70	0.82	1.0	0.5	0.34

Table 1: Evaluation on Bias Detection performance. The best result is in bold and the second best in underline.

4 Experiment

4.1 Experiment Setup

Bias Retrieval Database.

The bias retrieval database is constructed based on the SBIC dataset sap2019social. SBIC has manually structured annotations for 150k social media posts, encompassing over 34k biased posts targeting about 1k social groups. It includes not only explicitly stated biases but also common implied biased meanings. We select the samples annotated with group biases and extract the biased statements together with the annotations of the bias type.

Training Dataset.

The instruction-following dataset is constructed based on RedditBias barikeri2021redditbias, which contains 11k conversational comments from Reddit with human annotations on 4 types of social bias (gender, race, religion, and queerness). We format the comments from RedditBias into standard inputs and extract information from the annotations regarding the presence of bias, the type of bias, the targeted group, and the biased attribute. This information is formatted as the ground-truth output for detecting, classifying, and attributing the biases.

Evaluating Datasets.

We randomly select and reserve 30% of each category of RedditBias as the evaluation dataset. These data are not used to construct the instruction-following dataset and do not overlap with the training dataset, ensuring fair comparisons in the evaluation.
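A per-category hold-out of this kind can be sketched as follows. The field name `category`, the seed, and the rounding rule are assumptions made for illustration; the paper does not specify them.

```python
import random
from collections import defaultdict

def stratified_holdout(samples, frac=0.3, seed=0):
    """Reserve `frac` of each category for evaluation; the rest is for training."""
    by_category = defaultdict(list)
    for sample in samples:
        by_category[sample["category"]].append(sample)

    rng = random.Random(seed)
    train, held_out = [], []
    for items in by_category.values():
        items = items[:]              # copy so the caller's list is not shuffled
        rng.shuffle(items)
        n_eval = round(len(items) * frac)
        held_out += items[:n_eval]
        train += items[n_eval:]
    return train, held_out

# Hypothetical toy data: 10 comments for each of the 4 RedditBias bias types.
samples = [
    {"id": f"{cat}-{i}", "category": cat}
    for cat in ("gender", "race", "religion", "queerness")
    for i in range(10)
]
train, held_out = stratified_holdout(samples)
print(len(train), len(held_out))
```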

Baselines.

We compare BiasAlert with 9 baselines, which fall mainly into two categories: (1) Bias detection tools: Azure Content Safety (https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety), OpenAI Moderation API (https://platform.openai.com/docs/guides/moderation/), and Llama-Guard inan2023llama. (2) Powerful instruction-following LLMs: Llama-2-chat 7B and 13B touvron2023llama, Llama-3-Instruct 8B llama3modelcard, Gemma-it 7B gemma_2024, GPT-3.5 openai2022chatgpt, and GPT-4 Turbo achiam2023gpt. Detailed descriptions of the baselines are given in Appendix A.1.

Evaluating Metrics.

We employ five evaluation metrics to assess the performance of the model: the Efficacy Score measures the percentage of samples that the model correctly identifies as biased or not; the Classification Score (CS) measures the accuracy in recognizing the type of bias; the Attribution Score assesses the accuracy of attributing biases to specific social groups and attributes; the Over-Safety Score (OS) indicates the proportion of usable responses generated by the model, as some LLMs’ protective mechanisms can lead to over-safe responses; the Overall Score is the percentage of responses that are correct in all of bias presence, category, and attribution. Note that we report the Classification Score and Attribution Score only on the data predicted to be biased. Furthermore, only the Efficacy Score is employed for safety detection tools, as they do not support classification and attribution.
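One possible reading of these five metrics is sketched below. Several details are assumptions rather than definitions from the paper: unusable (over-safe) responses are counted as incorrect for Efficacy and Overall, and CS and Attribution are computed on samples that are predicted biased and are also labeled biased, so that a gold type exists to compare against. The record fields are hypothetical.

```python
def safe_ratio(hits, total):
    return hits / total if total else 0.0

def evaluate(golds, preds):
    """Compute the five scores from parallel lists of gold labels and predictions."""
    pairs = list(zip(golds, preds))
    n = len(pairs)
    usable = sum(p["usable"] for _, p in pairs)
    # Presence is correct only for usable responses with the right biased/unbiased call.
    presence_ok = [p["usable"] and p["biased"] == g["biased"] for g, p in pairs]
    # Assumption: type/attribution are scored where prediction and gold are both biased.
    scored = [(g, p) for g, p in pairs if p["usable"] and p["biased"] and g["biased"]]
    type_ok = sum(p["type"] == g["type"] for g, p in scored)
    attr_ok = sum(p["attribution"] == g["attribution"] for g, p in scored)
    all_ok = sum(
        ok and (not g["biased"]
                or (p["type"] == g["type"] and p["attribution"] == g["attribution"]))
        for ok, (g, p) in zip(presence_ok, pairs)
    )
    return {
        "efficacy": safe_ratio(sum(presence_ok), n),
        "classification": safe_ratio(type_ok, len(scored)),
        "attribution": safe_ratio(attr_ok, len(scored)),
        "over_safety": safe_ratio(usable, n),
        "overall": safe_ratio(all_ok, n),
    }

# Hypothetical toy records; attribution is a (group, attribute) pair.
golds = [
    {"biased": True, "type": "gender", "attribution": ("group B", "cannot drive")},
    {"biased": True, "type": "race", "attribution": ("group A", "bad at math")},
    {"biased": False, "type": None, "attribution": None},
    {"biased": False, "type": None, "attribution": None},
]
preds = [
    {"usable": True, "biased": True, "type": "gender", "attribution": ("group B", "cannot drive")},
    {"usable": True, "biased": True, "type": "race", "attribution": ("group A", "lazy")},
    {"usable": True, "biased": False, "type": None, "attribution": None},
    {"usable": False, "biased": None, "type": None, "attribution": None},  # refusal
]
scores = evaluate(golds, preds)
print(scores)
```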

	Module			Performance
Model	Retrieval	Step-by-step Instruction	In-context Demonstration	Efficacy Score	Classification Score	Attribution Score	Overall Score
Llama-2-7b				0.43	0.43	0.04	0.01
		$\checkmark$		0.51	0.64	0.58	0.19
	$\checkmark$	$\checkmark$		0.59	0.76	0.71	0.28
		$\checkmark^{\prime}$		0.74	0.96	0.94	0.67
	$\checkmark$	$\checkmark^{\prime}$		0.83	0.99	0.96	0.79
	$\checkmark$	$\checkmark^{\prime}$	$\checkmark$	0.84	1.00	0.97	0.82
GPT-4				0.61	0.86	0.41	0.21
		$\checkmark$		0.62	0.86	0.75	0.40
	$\checkmark$	$\checkmark$		0.67	0.90	0.85	0.51
	$\checkmark$	$\checkmark$	$\checkmark$	0.69	0.89	0.91	0.56

Table 2: Ablation Study. $\checkmark$: employed. $\checkmark^{\prime}$: instruction-tuned.

Implementation Details.

We use the Llama-2-7b-chat model as the base model of BiasAlert. We set the batch size to 16 and employ the AdamW optimizer loshchilov2017decoupled with a learning rate of 5e-5 and a weight decay of 0.05. The model is trained for 10 epochs with Low-Rank Adaptation (LoRA) hu2021lora applied to all linear modules, with a rank of 16. Training is conducted on 8 RTX 3090 GPUs, each with 24 GB of memory.

4.2 Bias Detection Results

Table 1 displays the comparative results on two bias detection datasets. In terms of the Efficacy Score, almost all baselines struggle to achieve accurate detection, suggesting that the internal knowledge of the models is insufficient for judging human social biases. In comparison, BiasAlert achieves significantly better results than all the baselines on RedditBias, demonstrating the benefit of our framework, which combines external knowledge with internal reasoning capabilities. Regarding the Classification Score and Attribution Score, BiasAlert not only surpasses the baselines but also achieves nearly perfect performance. This indicates that the model’s judgments align with human annotations, supporting the reliability and interpretability of the detection results. On the more challenging Crows-pairs dataset, BiasAlert also performs better than all models except for the OpenAI Moderation API.

5 Analysis and Discussion

5.1 Ablation Study

We conduct a set of ablation studies to evaluate the efficacy of our proposed methods, with results presented in Table 2. First, we investigate the effect of retrieved social bias knowledge on bias detection, conducting experiments on Llama-2-chat (the base model of BiasAlert) and GPT-4. We observe a clear performance gap between the settings with and without retrieval, particularly in terms of the Overall Score. These findings underscore the necessity of external knowledge of human social ethics for LLMs to accurately and reliably detect biases. Furthermore, we find that step-by-step instruction significantly enhances performance, especially in the Classification Score and Attribution Score. This suggests that our step-by-step instructions effectively stimulate the internal reasoning capabilities of LLMs to understand the input generations. Although the improvements from in-context demonstration are relatively modest, the results demonstrate its effectiveness in guiding the LLMs to generate answers that better align with expectations.
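To make the retrieval gap concrete, the short calculation below takes the Overall Score of three Table 2 row pairs that differ only in whether retrieval is enabled and computes the absolute gain. The values are copied from Table 2; the row labels are shorthand introduced here for readability.

```python
# (Overall Score without retrieval, Overall Score with retrieval), from Table 2.
overall = {
    "Llama-2-7b, step-by-step instruction": (0.19, 0.28),
    "Llama-2-7b, instruction-tuned": (0.67, 0.79),
    "GPT-4, step-by-step instruction": (0.40, 0.51),
}

gains = {name: round(with_r - without_r, 2)
         for name, (without_r, with_r) in overall.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

In each pair, adding retrieval raises the Overall Score by roughly 0.1 in absolute terms.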

5.2 Discussion on the Continual Updating of Retrieval Database

Model	Alpaca-7b	GPT-3.5	GPT-Neo-1.3b	GPT-Neo-2.7b	Llama-2-chat-13b	Llama-2-chat-7b	OPT-125m	OPT-2.7b	OPT-6.7b
Text Completion Task
Bias Level	0.02	0	0.04	0.05	0.07	0.06	0.02	0.03	0
Human	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00
Question-answering Task
Bias Level	0.26	0.01	0.08	0.08	0.02	0.01	0.09	0.15	0.16
Human	0.93 $\pm$ 0.03	0.99 $\pm$ 0.01	0.92 $\pm$ 0.04	0.95 $\pm$ 0.00	0.99 $\pm$ 0.01	1.00 $\pm$ 0.00	0.96 $\pm$ 0.03	0.95 $\pm$ 0.03	0.97 $\pm$ 0.02

Table 3: Bias evaluation results on open-text generation tasks.

6 Case Study

We include three applications of BiasAlert: (1) bias evaluation on a text completion task, (2) bias evaluation on a question-answering task, and (3) bias mitigation.

6.1 Bias Evaluation on Text Completion Task with BiasAlert

Setup.

We employ BOLD dhamala2021bold, a text generation dataset that consists of diverse text generation prompts and assesses bias by counting generated words according to a lexicon. We use the prompts from BOLD and employ 6 advanced large language models (LLMs) to generate completions with a length of 200. Finally, we apply BiasAlert to detect bias in the generated completions. Bias is assessed by the ratio of biased generations among all generations, which is reported in Table XX. To validate the reliability of BiasAlert, we sample 40 completions and the corresponding BiasAlert annotations for each LLM and employ crowdsourcers to validate them, with consistency reported in Table 3.

Results.

The BiasAlert values range from 0.00 to 0.07, indicating varying degrees of bias across the models, and the overall bias levels are relatively low on the selected BOLD data. Notably, OPT-6.7b and GPT-3.5 exhibit no detectable bias, with a BiasAlert value smaller than 0.01. On the other hand, Llama-2-13b-chat displays the highest level of bias, with a BiasAlert value of 0.07. Overall, the BiasAlert results indicate that while some models, like OPT-6.7b and GPT-3.5, have effectively minimized bias, others still exhibit moderate levels of bias.

The consistency between the human annotation results and the detection results of BiasAlert is above 0.XX. This demonstrates the utility of BiasAlert in evaluating the fairness of LLMs in text completion tasks.

6.2 Bias Evaluation on Question-answering Task with BiasAlert

Setup.

As there is currently no open-text question-answering dataset for bias evaluation, we employ Beavertails ji2023beavertails, a safety-focused question-answer pair dataset covering 14 harm categories. We use only the question-answer pairs in the discrimination or stereotypes category and take the questions as prompts. These prompts are then input into 10 LLMs, including Alpaca-7b taori2023alpaca, GPT-3.5, GPT-Neo-1.3b gpt-neo, GPT-Neo-2.7b, Llama-2-7b-chat, Llama-2-13b-chat, OPT-125m zhang2022opt, OPT-2.7b, and OPT-6.7b, to generate responses. We use BiasAlert to detect the presence of bias in these responses. The ratios of biased responses among all responses for the different LLMs are reported in Table 3. To validate the reliability of BiasAlert, we sample 40 responses and the corresponding BiasAlert annotations for each LLM and employ crowdsourcers to validate them, with consistency reported.
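The two quantities used in this setup can be sketched as simple ratios. This assumes that "consistency" means the fraction of sampled responses on which the human annotator agrees with BiasAlert's biased/unbiased verdict; the paper does not give the formula, and the flags below are made-up toy values.

```python
def bias_level(flags):
    """Ratio of responses flagged as biased among all responses."""
    return sum(flags) / len(flags)

def consistency(tool_flags, human_flags):
    """Fraction of items where the human label matches the tool's verdict
    (an assumed definition of the reported consistency)."""
    agree = sum(t == h for t, h in zip(tool_flags, human_flags))
    return agree / len(tool_flags)

# Hypothetical verdicts for 10 responses (True = biased).
tool_flags = [True, False, False, True, False, False, False, False, False, False]
human_flags = [True, False, False, False, False, False, False, False, False, False]

level = bias_level(tool_flags)
agreement = consistency(tool_flags, human_flags)
print(level, agreement)
```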

Results.

Table 3 presents the bias scores of the generations from different LLMs based on Beavertails. Each model produces bias to some degree, and the results vary considerably across models. Among them, Alpaca-7b shows the most bias in the generated content, and the OPT series also exhibits a high degree of bias. This indicates that LLMs are prone to inheriting social biases from their training data, which are revealed during text generation. On the other hand, the Llama series and GPT models exhibit lower levels of bias in their generated content. These LLMs tend to decline inappropriate requests, responding with “I’m sorry, but I cannot provide a response …”, which reflects the considerable effort invested in promoting fair development and deployment.

The consistency between the human annotation results and the detection results of BiasAlert is above 0.92. This demonstrates the utility of BiasAlert in evaluating the fairness of LLMs in question-answering tasks.

6.3 Bias Mitigation with BiasAlert

We describe a scenario in which BiasAlert is applied to the online dialogue of LLMs to reduce potentially biased answers. Specifically, in an online dialogue between a human and an LLM, BiasAlert monitors each response from the LLM. If a response exhibits bias, an alert is triggered and the current generation of the LLM is terminated.
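The monitoring loop described above can be sketched as a thin wrapper around the generator. This is an illustration under stated assumptions: `generate` and `detect` are caller-supplied callables, the verdict format and the fallback message are hypothetical, and the stubs only make the example runnable.

```python
def guarded_reply(prompt, generate, detect,
                  fallback="[response withheld: potential social bias detected]"):
    """Generate a response, audit it, and withhold it if the detector flags bias."""
    response = generate(prompt)
    verdict = detect(response)
    if verdict["biased"]:
        # Alert triggered: the biased generation is not returned to the user.
        return {"alert": True, "text": fallback, "verdict": verdict}
    return {"alert": False, "text": response, "verdict": verdict}

# Stubs standing in for the deployed LLM and for BiasAlert (assumptions).
def stub_generate(prompt):
    return "group B people cannot drive" if "drive" in prompt else "Here is a recipe."

def stub_detect(text):
    biased = "cannot drive" in text
    return {"biased": biased, "type": "gender bias" if biased else None}

blocked = guarded_reply("Who can drive?", stub_generate, stub_detect)
passed = guarded_reply("Suggest a dinner.", stub_generate, stub_detect)
print(blocked["alert"], passed["alert"])
```

This sketch audits a complete response; terminating a generation that is still streaming would require checking partial outputs, which is not shown here.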

Setup.

We use the Beavertails dataset, as described in Section 6.2. We sample 40 prompts as input to 8 different LLMs and use BiasAlert to audit the generations. Then we employ crowdsourcers to annotate whether the generations are biased, with and without BiasAlert.

Model	w/o BiasAlert	w/ BiasAlert	Time
GPT-Neo-1.3b	0.125±0.000	0.033±0.014	1.39s
GPT-Neo-2.7b	0.133±0.014	0.025±0.000	1.41s
OPT-2.7b	0.142±0.014	0.025±0.025	1.44s
OPT-6.7b	0.167±0.014	0.042±0.014	1.51s
Alpaca-7b	0.283±0.014	0.042±0.038	1.74s
Llama-2-7b	0.008±0.014	0.000±0.000	1.30s
Llama-2-13b	0.050±0.000	0.017±0.014	1.27s
GPT3.5	0.025±0.000	0.008±0.014	1.31s

Table 4: Bias Mitigation Results without and with BiasAlert.

Results and Utility Analysis.

Table 4 shows that, for different open-source and API-based LLMs, deploying BiasAlert significantly reduces the proportion of biased outputs, which demonstrates the effectiveness of BiasAlert in bias mitigation. Additionally, we report the average time cost for BiasAlert to process one generation when deployed on 2 RTX 3090 GPUs. BiasAlert takes an average of 1.4 seconds to monitor one online response, demonstrating its feasibility for real-world deployment.
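The figures quoted above can be checked directly against Table 4. The snippet below uses the mean values from the table (standard deviations omitted) to compute the average processing time and the absolute drop in biased-output ratio per model; the variable names are introduced here for illustration.

```python
# (ratio without BiasAlert, ratio with BiasAlert, seconds per response), from Table 4.
table4 = {
    "GPT-Neo-1.3b": (0.125, 0.033, 1.39),
    "GPT-Neo-2.7b": (0.133, 0.025, 1.41),
    "OPT-2.7b": (0.142, 0.025, 1.44),
    "OPT-6.7b": (0.167, 0.042, 1.51),
    "Alpaca-7b": (0.283, 0.042, 1.74),
    "Llama-2-7b": (0.008, 0.000, 1.30),
    "Llama-2-13b": (0.050, 0.017, 1.27),
    "GPT3.5": (0.025, 0.008, 1.31),
}

mean_time = round(sum(t for _, _, t in table4.values()) / len(table4), 2)
drops = {name: round(before - after, 3) for name, (before, after, _) in table4.items()}

print("mean time per response (s):", mean_time)
for name, drop in drops.items():
    print(f"{name}: biased ratio reduced by {drop:.3f}")
```

The mean over the eight models is about 1.42 s, consistent with the 1.4 s stated in the text, and the biased ratio decreases for every model.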

7 Conclusion

In this paper, we address the challenges of detecting bias in open-text generation by proposing a plug-and-play bias detection tool: BiasAlert. Our study highlights the benefits of using retrieved external human knowledge, which compensates for the lack of internal knowledge in LLMs and enables reliable and interpretable bias detection. Our empirical results, validated on the RedditBias and Crows-pairs datasets, demonstrate the effectiveness of BiasAlert in bias detection. Our case studies show that BiasAlert can serve as a practical support tool, paving the way for fairer and more reliable evaluation and deployment of LLMs in various applications.

Limitation

The limitations of this work fall primarily into two aspects. First, from the perspective of experimental design, our case studies were conducted by sampling from existing bias datasets, without constructing a comprehensive and large-scale benchmark based on the main application scenarios of BiasAlert. Second, from a methodological perspective, the retrieval database and the retriever we used are somewhat outdated and have not been adapted to the characteristics of bias knowledge. Additionally, when references are retrieved, the model does not assess the quality of the retrieved content, so some retrieved information may not be helpful for the model’s decision-making.

In future work, we plan to integrate existing biased datasets and generate new datasets to address issues such as incomplete bias types, data imbalance among different bias types, and limited applicability to various tasks. This will be used to construct a multi-group, large-scale, multi-scenario, and multi-dimensional benchmark for bias detection. Additionally, we will improve the retrieval framework to ensure the effectiveness of the retrieved data.
