Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images

Ziman Chen; Nonhlanhla Chambara; Chaoqun Wu; Xina Lo; Shirley Yuk Wah Liu; Simon Takadiyi Gunda; Xinyang Han; Jingguo Qu; Fei Chen; Michael Tin Cheung Ying

doi:10.1007/s12020-024-04066-x

Endocrine. 2024 Oct 11. doi: 10.1007/s12020-024-04066-x. Online ahead of print.

Affiliations

¹ Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China.
² School of Healthcare Sciences, Cardiff University, Cardiff, UK.
³ Department of Ultrasound, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China.
⁴ Department of Surgery, North District Hospital, Sheung Shui, New Territories, Hong Kong, China.
⁵ Department of Surgery, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories, Hong Kong, China.
⁶ Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China.
⁷ Department of Ultrasound, The Fifth Affiliated Hospital of Sun Yat-sen University, Zhuhai, China.
⁸ Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China.

PMID: 39394537

Abstract

Purpose: Large language models (LLMs) are pivotal in artificial intelligence, demonstrating advanced capabilities in natural language understanding and multimodal interactions, with significant potential in medical applications. This study explores the feasibility and efficacy of LLMs, specifically ChatGPT-4o and Claude 3-Opus, in classifying thyroid nodules using ultrasound images.

Methods: This study included 112 patients with a total of 116 thyroid nodules, comprising 75 benign and 41 malignant cases. Ultrasound images of these nodules were analyzed using ChatGPT-4o and Claude 3-Opus to classify each nodule as benign or malignant. An independent evaluation by a junior radiologist was also conducted. Diagnostic performance was assessed using Cohen's Kappa and receiver operating characteristic (ROC) curve analysis, with pathological diagnosis as the reference standard.
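As a rough illustration of the two metrics named in the Methods, the sketch below computes Cohen's Kappa and a single-threshold AUC from a 2x2 confusion matrix. The abstract does not say how the analysis was implemented, so the function names and the toy counts here are assumptions for illustration only and are not data from the study. For a purely binary benign/malignant call, the ROC curve has one operating point, and the area under it reduces to the mean of sensitivity and specificity.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa for a binary rater against a binary reference."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # proportion of cases where rater and reference agree
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


def binary_auc(tp, fp, fn, tn):
    """AUC of a single-threshold (binary) classifier."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy counts (hypothetical, NOT from the study): 10 malignant, 10 benign.
toy = dict(tp=8, fp=3, fn=2, tn=7)
print(f"kappa = {cohen_kappa(**toy):.3f}")
print(f"AUC   = {binary_auc(**toy):.3f}")
```

A Kappa near 0 means agreement is little better than chance, and an AUC near 0.5 means the same for discrimination.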

Results: ChatGPT-4o demonstrated poor agreement with pathological results (Kappa = 0.116), while Claude 3-Opus showed even lower agreement (Kappa = 0.034). The junior radiologist exhibited moderate agreement (Kappa = 0.450). ChatGPT-4o achieved an area under the ROC curve (AUC) of 57.0% (95% CI: 48.6-65.5%), slightly outperforming Claude 3-Opus (AUC of 52.0%, 95% CI: 43.2-60.9%). In contrast, the junior radiologist achieved a significantly higher AUC of 72.4% (95% CI: 63.7-81.1%). The unnecessary biopsy rates were 41.4% for ChatGPT-4o, 43.1% for Claude 3-Opus, and 12.1% for the junior radiologist.
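The abstract does not define the denominator of the unnecessary biopsy rate. A small consistency check, sketched below, looks for whole-number counts that reproduce the reported percentages. The helper name `matching_counts` is hypothetical, and reading the rate as a fraction of all 116 nodules is an assumption rather than the authors' stated definition.

```python
def matching_counts(rate_percent, n, tol=0.05):
    """Return every integer count k for which k/n rounds to rate_percent (one decimal)."""
    return [k for k in range(n + 1) if abs(100 * k / n - rate_percent) < tol]


# Reported unnecessary biopsy rates from the Results, in percent.
reported = {"ChatGPT-4o": 41.4, "Claude 3-Opus": 43.1, "Junior radiologist": 12.1}

for reader, rate in reported.items():
    # Assumption: denominator is all 116 nodules; 75 (benign only) is tried for comparison.
    print(reader, "n=116:", matching_counts(rate, 116), "n=75:", matching_counts(rate, 75))
```

Under this assumption, each reported rate matches exactly one count out of 116 nodules (48, 50, and 14 respectively), whereas the first two rates match no whole-number count out of the 75 benign nodules.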

Conclusion: While LLMs such as ChatGPT-4o and Claude 3-Opus show promise for future applications in medical imaging, their current use in clinical diagnostics should be approached cautiously due to their limited accuracy.

Keywords: Artificial intelligence; Diagnostic accuracy; Large language model; Thyroid cancer; Ultrasound.

Grants and funding

P0048845/Hong Kong Polytechnic University