Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains

Hung, Chia-Chien; Rim, Wiem Ben; Frost, Lindsay; Bruckner, Lars; Lawrence, Carolin

arXiv:2311.14966 (cs)

[Submitted on 25 Nov 2023]

Abstract: High-risk domains pose unique challenges that require language models to provide accurate and safe responses. Despite the great success of large language models (LLMs), such as ChatGPT and its variants, their performance in high-risk domains remains unclear. Our study delves into an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. To comprehensively assess the capabilities of LLMs, we conduct experiments on six NLP datasets including question answering and summarization tasks within two high-risk domains: legal and medical. Further qualitative analysis highlights the existing limitations inherent in current LLMs when evaluated in high-risk domains. This underscores the essential nature of not only improving LLM capabilities but also prioritizing the refinement of domain-specific metrics, and embracing a more human-centric approach to enhance safety and factual reliability. Our findings advance the field toward the concerns of properly evaluating LLMs in high-risk domains, aiming to steer the adaptability of LLMs in fulfilling societal obligations and aligning with forthcoming regulations, such as the EU AI Act.

Comments:	EMNLP 2023 Workshop on Benchmarking Generalisation in NLP (GenBench)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2311.14966 [cs.CL]
	(or arXiv:2311.14966v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.14966

Submission history

From: Chia-Chien Hung
[v1] Sat, 25 Nov 2023 08:58:07 UTC (266 KB)
