This study evaluated the ability of generative large language models (LLMs) to detect speech recognition errors in radiology reports. A dataset of 3233 CT and MRI reports was assessed by radiologists for speech recognition errors, and errors were categorized as clinically significant or not clinically significant. The performances of five generative LLMs (GPT-3.5-turbo, GPT-4, text-davinci-003, Llama-v2-70B-chat, and Bard) in detecting these errors were compared, using manual error detection as the reference standard, and prompt engineering was used to optimize model performance. GPT-4 demonstrated high accuracy in detecting clinically significant errors (precision, 76.9%; recall, 100%; F1 score, 86.9%) and not clinically significant errors (precision, 93.9%; recall, 94.7%; F1 score, 94.3%). Text-davinci-003 achieved F1 scores of 72% and 46.6% for clinically significant and not clinically significant errors, respectively; GPT-3.5-turbo obtained F1 scores of 59.1% and 32.2%, and Llama-v2-70B-chat scored 72.8% and 47.7%. Bard showed the lowest accuracy, with F1 scores of 47.5% and 20.9%. GPT-4 also effectively identified challenging errors, such as nonsense phrases and internally inconsistent statements. Longer reports, resident dictation, and overnight shifts were associated with higher error rates. In conclusion, advanced generative LLMs show potential for automatic detection of speech recognition errors in radiology reports. Supplemental material is available for this article.
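The abstract does not reproduce the study's prompts or evaluation code. The sketch below illustrates, under assumed prompt wording and model settings, how a generative LLM such as GPT-4 might be queried to flag speech recognition errors in a single report, and how precision, recall, and F1 score could then be computed against manual reference labels. The prompt text, helper names, and parameters are illustrative assumptions, not the published protocol.

    from openai import OpenAI  # assumes the openai Python client (v1+) is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Illustrative prompt; the study's actual engineered prompts are not given in the abstract.
    PROMPT = (
        "You are reviewing a radiology report for speech recognition errors. "
        "List any nonsense phrases or internally inconsistent statements, and label "
        "each as 'clinically significant' or 'not clinically significant'. "
        "If there are no errors, answer 'none'.\n\nReport:\n{report}"
    )

    def flag_errors(report_text: str) -> str:
        """Ask the model whether the report contains speech recognition errors."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
            temperature=0,  # deterministic output for evaluation
        )
        return response.choices[0].message.content

    def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
        """Standard definitions: F1 is the harmonic mean of precision and recall."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Example: precision 76.9% and recall 100% give an F1 score of about 86.9%,
    # matching the reported GPT-4 result for clinically significant errors.
    p, r = 0.769, 1.0
    print(2 * p * r / (p + r))  # ~0.869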
Keywords: CT; Large Language Model; MRI; Machine Learning; Natural Language Processing; Radiology Reports; Speech; Unsupervised Learning.
© RSNA, 2024.