ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Zhang, Zhexin; Lu, Yida; Ma, Jingyuan; Zhang, Di; Li, Rui; Ke, Pei; Sun, Hao; Sha, Lei; Sui, Zhifang; Wang, Hongning; Huang, Minlie

arXiv:2402.16444 (cs)

[Submitted on 26 Feb 2024]

Abstract: The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but a comprehensive approach for detecting safety issues in LLMs' responses in an aligned, customizable and explainable manner is still lacking. In this paper, we propose ShieldLM, an LLM-based safety detector that aligns with general human safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective in real-world situations as a safety evaluator for advanced LLMs. We release ShieldLM at this https URL to support accurate and explainable safety detection under various safety standards, contributing to the ongoing efforts to enhance the safety of LLMs.

Comments:	17 pages
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2402.16444 [cs.CL]
	(or arXiv:2402.16444v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2402.16444

Submission history

From: Zhexin Zhang
[v1] Mon, 26 Feb 2024 09:43:02 UTC (9,893 KB)
