Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models

Rauh, Maribeth; Mellor, John; Uesato, Jonathan; Huang, Po-Sen; Welbl, Johannes; Weidinger, Laura; Dathathri, Sumanth; Glaese, Amelia; Irving, Geoffrey; Gabriel, Iason; Isaac, William; Hendricks, Lisa Anne

arXiv:2206.08325 (cs)

[Submitted on 16 Jun 2022 (v1), last revised 28 Oct 2022 (this version, v2)]

Abstract: Large language models produce human-like text that drives a growing number of applications. However, recent literature and, increasingly, real-world observations have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful. Though work to evaluate language model harms is under way, translating foresight about which harms may arise into rigorous benchmarks is not straightforward. To facilitate this translation, we outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks. We then use these characteristics as a lens to identify trends and gaps in existing benchmarks. Finally, we apply them in a case study of the Perspective API, a toxicity classifier that is widely used in harm benchmarks. Our characteristics provide one piece of the bridge that translates between foresight and effective evaluation.

Comments:	Accepted to NeurIPS 2022 Datasets and Benchmarks Track; 10 pages plus appendix
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Cite as:	arXiv:2206.08325 [cs.CL]
	(or arXiv:2206.08325v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2206.08325

Submission history

From: Lisa Anne Hendricks
[v1] Thu, 16 Jun 2022 17:28:01 UTC (47 KB)
[v2] Fri, 28 Oct 2022 17:55:58 UTC (51 KB)
