Assessing GPT-4 multimodal performance in radiological image analysis

Dana Brin; Vera Sorin; Yiftach Barash; Eli Konen; Benjamin S Glicksberg; Girish N Nadkarni; Eyal Klang

doi:10.1007/s00330-024-11035-5

Eur Radiol. 2024 Aug 30. doi: 10.1007/s00330-024-11035-5. Online ahead of print.

Authors

Dana Brin¹ ², Vera Sorin³ ⁴ ⁵, Yiftach Barash³ ⁴ ⁵, Eli Konen³ ⁴, Benjamin S Glicksberg⁶, Girish N Nadkarni⁷ ⁸, Eyal Klang³ ⁴ ⁵ ⁷ ⁸

Affiliations

¹ Department of Diagnostic Imaging, Chaim Sheba Medical Center, Tel Hashomer, Israel.
² Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel.
³ Department of Diagnostic Imaging, Chaim Sheba Medical Center, Tel Hashomer, Israel.
⁴ Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel.
⁵ DeepVision Lab, Chaim Sheba Medical Center, Tel Hashomer, Israel.
⁶ Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, New York, USA.
⁷ Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, New York, USA.
⁸ The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA.

PMID: 39214893
DOI: 10.1007/s00330-024-11035-5

Abstract

Objectives: This study assesses the performance of GPT-4V, a multimodal artificial intelligence (AI) model capable of analyzing both images and text, in interpreting radiological images. It covers a range of modalities, anatomical regions, and pathologies to explore the potential of zero-shot generative AI in enhancing diagnostic processes in radiology.

Methods: We analyzed 230 anonymized emergency room diagnostic images, consecutively collected over 1 week, using GPT-4V. Modalities included ultrasound (US), computed tomography (CT), and X-ray images. The interpretations provided by GPT-4V were then compared with those of senior radiologists to evaluate the accuracy of GPT-4V in recognizing the imaging modality, anatomical region, and pathology present in the images.

Results: GPT-4V identified the imaging modality correctly in 100% of cases (221/221), the anatomical region in 87.1% (189/217), and the pathology in 35.2% (76/216). However, the model's performance varied significantly across modalities: anatomical region identification accuracy was 60.9% (39/64) in US images, 97.0% (98/101) in CT, and 100% (52/52) in X-ray images (p < 0.001). Similarly, pathology identification accuracy was 9.1% (6/66) in US images, 36.4% (36/99) in CT, and 66.7% (34/51) in X-ray images (p < 0.001). These variations indicate inconsistencies in GPT-4V's ability to interpret radiological images accurately.
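
The percentages in the Results paragraph can be recomputed from the reported counts, and the size of the between-modality differences can be checked with a simple test of independence. The sketch below is illustrative only: the abstract does not name the statistical test the authors used, so the Pearson chi-square test here is an assumption, and the function names (percent, chi_square_independence, p_value_two_dof) are hypothetical helpers, not part of the study.

```python
import math

# (correct, total) counts per modality, copied from the Results paragraph.
ANATOMY = {"US": (39, 64), "CT": (98, 101), "X-ray": (52, 52)}
PATHOLOGY = {"US": (6, 66), "CT": (36, 99), "X-ray": (34, 51)}


def percent(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def chi_square_independence(groups):
    """Pearson chi-square statistic and degrees of freedom for a k x 2
    table of (correct, incorrect) counts, one row per modality.
    Assumption: the abstract does not say which test produced its p-values."""
    rows = [(c, t - c) for c, t in groups.values()]
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / grand_total
            stat += (row[j] - expected) ** 2 / expected
    return stat, len(rows) - 1


def p_value_two_dof(stat):
    """With exactly 2 degrees of freedom, the chi-square upper-tail
    probability has the closed form exp(-x / 2)."""
    return math.exp(-stat / 2)


if __name__ == "__main__":
    for task, groups in (("anatomical region", ANATOMY), ("pathology", PATHOLOGY)):
        for modality, (correct, total) in groups.items():
            print(f"{task:17s} {modality:5s} {correct}/{total} = {percent(correct, total)}%")
        stat, dof = chi_square_independence(groups)
        print(f"{task:17s} chi-square = {stat:.1f}, dof = {dof}, p = {p_value_two_dof(stat):.2e}")
```

Under this assumed test, both comparisons give p-values well below 0.001, in line with the reported significance. The per-modality counts also sum to the overall figures (189/217 for anatomical region, 76/216 for pathology).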

Conclusion: While the integration of AI in radiology, exemplified by multimodal GPT-4, offers promising avenues for diagnostic enhancement, the current capabilities of GPT-4V are not yet reliable for interpreting radiological images. This study underscores the necessity for ongoing development to achieve dependable performance in radiology diagnostics.

Clinical relevance statement: Although GPT-4V shows promise in radiological image interpretation, its high diagnostic hallucination rate (> 40%) indicates it cannot be trusted for clinical use as a standalone tool. Improvements are necessary to enhance its reliability and ensure patient safety.

Key points: GPT-4V's capability in analyzing images offers new clinical possibilities in radiology. GPT-4V excels in identifying imaging modalities but demonstrates inconsistent anatomy and pathology detection. Ongoing AI advancements are necessary to enhance diagnostic reliability in radiological applications.

Keywords: Artificial intelligence; Computed tomography (x-ray); Diagnostic imaging; Radiology; Ultrasonography.