Performance of Chat Generative Pre-trained Transformer-4o in the Adult Clinical Cardiology Self-Assessment Program

Eur Heart J Digit Health. 2024 Oct 21;6(1):155-158. doi: 10.1093/ehjdh/ztae077. eCollection 2025 Jan.

Abstract

Aims: This study evaluates the performance of OpenAI's latest large language model (LLM), Chat Generative Pre-trained Transformer-4o, on the Adult Clinical Cardiology Self-Assessment Program (ACCSAP).

Methods and results: Chat Generative Pre-trained Transformer-4o was tested on 639 ACCSAP questions; after excluding 45 questions containing video clips, 594 questions remained for analysis. The questions were a mix of text-only and static image-based formats [electrocardiogram (ECG), angiogram, computed tomography (CT) scan, and echocardiogram]. The model was allowed one attempt per question. A further evaluation of image-only questions was performed on 25 questions from the database. Chat Generative Pre-trained Transformer-4o answered 69.2% (411/594) of the questions correctly. Performance was higher on text-only questions (73.9%) than on those requiring image interpretation (55.3%, P < 0.001). The model performed worse on questions involving ECGs, answering 56.5% correctly versus 73.3% for non-ECG questions (P < 0.001). Although the model can interpret medical images in the context of a text-based question, its accuracy varied, showing both strengths and notable gaps in diagnostic performance, and it lacked accuracy when reading images (ECGs, echocardiograms, and angiograms) presented without context.
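The abstract does not name the statistical test behind the reported P values; a standard choice for comparing two proportions is a pooled two-proportion z-test. The sketch below illustrates the text-only vs. image-based comparison, with question counts (444 text-only, 150 image-based) inferred from the reported percentages rather than stated in the abstract; these counts are an assumption.

```python
import math

# Counts inferred from the reported percentages (assumption: 444 text-only
# and 150 image-based questions reproduce 73.9%, 55.3%, and 411/594 overall).
correct_text, n_text = 328, 444   # 328/444 = 73.9%
correct_img, n_img = 83, 150      # 83/150  = 55.3%

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided two-proportion z-test with a pooled standard error."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

z, p = two_proportion_z(correct_text, n_text, correct_img, n_img)
print(f"z = {z:.2f}, P = {p:.1e}")  # P falls well below 0.001
```

Under these inferred counts the test reproduces a P value consistent with the reported P < 0.001.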

Conclusion: Chat Generative Pre-trained Transformer-4o performed moderately well on ACCSAP questions. However, the model's performance remains inconsistent, especially in interpreting ECGs. These findings highlight the potential and current limitations of using LLMs in medical education and clinical decision-making.

Keywords: Artificial intelligence; Large language models; Medical education.