Can the large language model ChatGPT-4omni predict outcomes in adult patients with status epilepticus?

Epilepsia. 2024 Dec 26. doi: 10.1111/epi.18215. Online ahead of print.

Abstract

Objective: Large language models (LLMs) have recently gained attention for clinical decision-making and diagnosis. This study evaluates the performance of the recently updated LLM (ChatGPT-4o) in predicting clinical outcomes in patients with status epilepticus and compares its prognostic performance to the Status Epilepticus Severity Score (STESS).

Methods: This retrospective single-center cohort study was performed at the University Hospital Basel (tertiary academic medical center) from January 2005 to December 2022. It included consecutive adult patients (≥18 years of age) with a diagnosis of status epilepticus. The primary outcome was survival at hospital discharge, and the secondary outcome was return to premorbid neurological function at hospital discharge. The performance characteristics of ChatGPT-4o (sensitivity, specificity, Youden Index) were evaluated and compared to those of the STESS.
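
The abstract does not spell out how these performance characteristics are defined. The block below gives the standard textbook definitions, assuming the predicted outcome (survival, or return to premorbid function) is treated as the positive class; the symbols TP, FP, FN, TN, and J are notation introduced here, not taken from the paper.

```latex
\begin{align}
  \text{Sensitivity} &= \frac{TP}{TP + FN}, \\
  \text{Specificity} &= \frac{TN}{TN + FP}, \\
  J &= \text{Sensitivity} + \text{Specificity} - 1 .
\end{align}
```

Under these definitions, J = 0 corresponds to a prediction no better than chance and J = 1 to a perfect one.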

Results: Of 760 patients, 689 (90.7%) survived to discharge, and 317 survivors (41.7% of all patients) regained their premorbid neurological function at discharge. ChatGPT-4o predicted survival in 567 of 760 patients (74.6%), of whom 45 died. ChatGPT-4o predicted death in 193 of 760 patients (25.4%), of whom 167 survived, resulting in a sensitivity of 75.8% and a specificity of 36.6% (Youden Index .12, 95% confidence interval [CI] 0-.28) for predicting survival. ChatGPT-4o predicted return to premorbid neurological function in 249 of 760 patients (32.8%), of whom 112 did not return to their premorbid neurological function. ChatGPT-4o predicted no return to premorbid function in 511 of 760 patients (67.2%), of whom 180 returned to their premorbid function, resulting in a sensitivity of 43.2% and a specificity of 74.7% (Youden Index .18, 95% CI .08-.28) for predicting return to premorbid neurological function. There was no difference in the prognostic performance of ChatGPT-4o and the STESS. A second round of prompting did not increase the predictive performance of ChatGPT-4o.
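
The sensitivity, specificity, and Youden Index can be recomputed from the counts reported above. The following is a minimal sketch, assuming the predicted outcome is the positive class and the standard definitions apply; the class `Confusion` and the helper `from_reported_counts` are hypothetical names used only for illustration, and the confidence intervals are not reproduced because the abstract does not state how they were derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 confusion matrix with the predicted outcome as the positive class."""
    tp: int  # predicted positive, outcome occurred
    fp: int  # predicted positive, outcome did not occur
    fn: int  # predicted negative, outcome occurred
    tn: int  # predicted negative, outcome did not occur

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def youden(self) -> float:
        return self.sensitivity + self.specificity - 1


def from_reported_counts(pred_pos: int, pred_pos_wrong: int,
                         pred_neg: int, pred_neg_wrong: int) -> Confusion:
    """Build the matrix from counts phrased as 'predicted X in N, of whom M did not X'."""
    return Confusion(
        tp=pred_pos - pred_pos_wrong,
        fp=pred_pos_wrong,
        fn=pred_neg_wrong,
        tn=pred_neg - pred_neg_wrong,
    )


# Survival: predicted in 567 (45 died); death predicted in 193 (167 survived).
survival = from_reported_counts(567, 45, 193, 167)
# Return to premorbid function: predicted in 249 (112 did not return);
# no return predicted in 511 (180 returned).
recovery = from_reported_counts(249, 112, 511, 180)

for name, c in (("survival", survival), ("recovery", recovery)):
    print(f"{name}: sensitivity {c.sensitivity:.1%}, "
          f"specificity {c.specificity:.1%}, Youden {c.youden:.2f}")
# survival: sensitivity 75.8%, specificity 36.6%, Youden 0.12
# recovery: sensitivity 43.2%, specificity 74.7%, Youden 0.18
```

The row totals also recover the cohort figures: true positives plus false negatives give 689 survivors and 317 patients who returned to premorbid function.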

Significance: ChatGPT-4o does not reliably predict outcomes in patients with status epilepticus. Clinicians should refrain from using ChatGPT-4o for prognostication in these patients.

Keywords: artificial intelligence; large language models; neurocritical care; outcome prediction; status epilepticus.
