Aims: This study evaluates the performance of OpenAI's latest large language model (LLM), Chat Generative Pre-trained Transformer-4o, on the Adult Clinical Cardiology Self-Assessment Program (ACCSAP).
Methods and results: Chat Generative Pre-trained Transformer-4o was tested on 639 ACCSAP questions, excluding 45 questions containing video clips, resulting in 594 questions for analysis. The questions included a mix of text-based and static image-based [electrocardiogram (ECG), angiogram, computed tomography (CT) scan, and echocardiogram] formats. The model was allowed one attempt per question. Further evaluation of image-only questions was performed on 25 questions from the database. Chat Generative Pre-trained Transformer-4o correctly answered 69.2% (411/594) of the questions. The performance was higher for text-only questions (73.9%) compared with those requiring image interpretation (55.3%, P < 0.001). The model performed worse on questions involving ECGs, with a correct rate of 56.5% compared with 73.3% for non-ECG questions (P < 0.001). Despite its capability to interpret medical images in the context of a text-based question, the model's accuracy varied, demonstrating strengths and notable gaps in diagnostic accuracy. It lacked accuracy in reading images (ECGs, echocardiography, and angiograms) with no context.
Conclusion: Chat Generative Pre-trained Transformer-4o performed moderately well on ACCSAP questions. However, the model's performance remains inconsistent, especially in interpreting ECGs. These findings highlight the potential and current limitations of using LLMs in medical education and clinical decision-making.
Keywords: Artificial intelligence; Large language models; Medical education.
© The Author(s) 2024. Published by Oxford University Press on behalf of the European Society of Cardiology.