Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs

Radiology. 2024 Dec;313(3):e241668. doi: 10.1148/radiol.241668.

Abstract

Background: Application of multimodal large language models (LLMs) with both textual and visual capabilities has been steadily increasing, but their ability to interpret radiologic images remains in question.

Purpose: To evaluate the accuracy of LLMs, compare it with that of human readers with varying levels of experience, and assess the factors affecting LLM accuracy in answering New England Journal of Medicine Image Challenge cases.

Materials and Methods: Radiologic images of cases from October 13, 2005, to April 18, 2024, were retrospectively reviewed. Using text and image inputs, LLMs (OpenAI's GPT-4 Turbo with Vision [GPT-4V] and GPT-4 Omni [GPT-4o], Google DeepMind's Gemini 1.5 Pro, and Anthropic's Claude 3) provided answers. Human readers (seven junior faculty radiologists, two clinicians, one in-training radiologist, and one medical student), blinded to the published answers, also answered. LLM accuracy with and without image inputs and with short (cases from 2005 to 2015) versus long (cases from 2016 to 2024) text inputs was evaluated in subgroup analyses to determine the effect of these factors. Factors affecting accuracy were assessed with multivariable logistic regression. Accuracy was compared using generalized estimating equations, with multiple comparisons adjusted by Bonferroni correction.

Results: A total of 272 cases were included. GPT-4o achieved the highest overall accuracy among LLMs (59.6%; 162 of 272), outperforming the medical student (47.1%; 128 of 272; P < .001) but not the junior faculty radiologists (80.9%; 220 of 272; P < .001) or the in-training radiologist (70.2%; 191 of 272; P = .003). GPT-4o exhibited similar accuracy regardless of image inputs (without images vs with images, 54.0% [147 of 272] vs 59.6% [162 of 272], respectively; P = .59). Human reader accuracy was unaffected by text length, whereas LLMs demonstrated higher accuracy with long text inputs (all P < .001). Text input length affected LLM accuracy (odds ratio range, 3.2 [95% CI: 1.9, 5.5] to 6.6 [95% CI: 3.7, 12.0]).

Conclusion: LLMs demonstrated substantial accuracy with text and image inputs, outperforming a medical student. However, their accuracy decreased with shorter text inputs, regardless of image input.

© RSNA, 2024. Supplemental material is available for this article.

Publication types

  • Comparative Study

MeSH terms

  • Clinical Competence*
  • Female
  • Humans
  • Image Interpretation, Computer-Assisted / methods
  • Male
  • Retrospective Studies