Background/Objectives: This study investigated the diagnostic capabilities of two AI-based tools, M4CXR (research-only version) and ChatGPT-4o, in chest X-ray interpretation. M4CXR is a specialized cloud-based system that uses large language models (LLMs) to generate comprehensive radiology reports, while ChatGPT-4o, built on OpenAI's general-purpose GPT-4o model, offers potential in settings with limited radiological expertise. Methods: A total of 826 anonymized chest X-ray images from Inha University Hospital were evaluated. Two experienced radiologists independently assessed the performance of M4CXR and ChatGPT across multiple diagnostic parameters: diagnostic accuracy, false findings, location accuracy, count accuracy, and the presence of hallucinations. Interobserver agreement was quantified using Cohen's kappa coefficient. Results: M4CXR consistently outperformed ChatGPT across all evaluation metrics. For diagnostic accuracy, M4CXR achieved acceptability ratings of approximately 60-62%, compared with 42-45% for ChatGPT. Both systems showed high interobserver agreement, with M4CXR generally displaying stronger consistency. Notably, M4CXR was markedly more accurate in anatomical localization (76-77.5% vs. 36-36.5% for ChatGPT) and produced fewer hallucinations. Conclusions: These findings highlight the complementary potential of the two AI technologies in medical diagnostics. While M4CXR performs more strongly in specialized radiological analysis, integrating both systems could optimize diagnostic workflows. This study emphasizes the role of AI in augmenting rather than replacing human expertise, suggesting that a combined approach leveraging AI capabilities and clinical judgment could improve patient care.
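For reference, the interobserver-agreement statistic named in the Methods is Cohen's kappa; its standard definition (not restated in the abstract itself) is

\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]

where \(p_o\) is the observed proportion of agreement between the two radiologists and \(p_e\) is the proportion of agreement expected by chance.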
Keywords: AI; LLM; chest X-ray.