Aim: To evaluate the accuracy of the Emergency Severity Index (ESI) assignments by GPT-4, a large language model (LLM), compared to senior emergency department (ED) nurses and physicians.
Method: An observational study of 100 consecutive adult ED patients was conducted. ESI scores were assigned independently by GPT-4, by triage nurses, and by a senior clinician. The model and the human experts were provided the same patient data.
Results: GPT-4 assigned a lower median ESI score (2.0) than the human evaluators (median 3.0; p < 0.001), suggesting a potential overestimation of patient severity by the LLM. The triage assessment approaches of GPT-4 and the human evaluators also differed, including in how patient age and vital signs were weighted in the ESI assignments.
Conclusion: While GPT-4 offers a novel approach to patient triage, its propensity to overestimate patient severity highlights the need for further development and calibration of LLM tools in clinical environments. The findings underscore both the potential and the limitations of LLMs in clinical decision-making, and support cautious integration of LLMs into healthcare settings.
Reporting method: This study adhered to relevant EQUATOR guidelines for reporting observational studies.
Keywords: AI in healthcare; Emergency Severity Index; GPT‐4; clinical decision support; large language model; patient triage.
© 2024 The Author(s). Journal of Clinical Nursing published by John Wiley & Sons Ltd.