Evaluating Large Language Model-Assisted Emergency Triage: A Comparison of Acuity Assessments by GPT-4 and Medical Experts

J Clin Nurs. 2024 Nov 28. doi: 10.1111/jocn.17490. Online ahead of print.

Abstract

Aim: To evaluate the accuracy of Emergency Severity Index (ESI) assignments made by GPT-4, a large language model (LLM), compared with those made by senior emergency department (ED) nurses and physicians.

Method: An observational study of 100 consecutive adult ED patients was conducted. ESI scores were assigned by GPT-4, by triage nurses, and by a senior clinician. Both the model and the human experts were provided with the same patient data.

Results: GPT-4 assigned a lower median ESI score (2.0) than the human evaluators (median 3.0; p < 0.001). Because lower ESI scores denote higher acuity, this suggests that the LLM may overestimate patient severity. GPT-4 and the human evaluators also differed in their triage assessment approaches, including how patient age and vital signs were weighed in the ESI assignments.
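
The direction of such a difference can be illustrated with a minimal sketch, assuming paired ESI scores for the same patients. The scores below are hypothetical and are not study data, and the helper name compare_triage is an assumption introduced for illustration; the abstract does not state which statistical test produced the reported p-value, so none is reproduced here.

```python
from statistics import median

# Hypothetical ESI scores (1 = most urgent, 5 = least urgent); NOT study data.
model_esi = [2, 2, 3, 1, 2, 3, 2]
human_esi = [3, 3, 3, 2, 3, 4, 2]


def compare_triage(model, human):
    """Summarize paired ESI assignments made for the same patients."""
    if len(model) != len(human):
        raise ValueError("score lists must be paired, one entry per patient")
    # A lower ESI from the model means it rated the patient as more severe.
    over = sum(m < h for m, h in zip(model, human))
    under = sum(m > h for m, h in zip(model, human))
    return {
        "model_median": median(model),
        "human_median": median(human),
        "over_triage": over,
        "under_triage": under,
        "agreement": len(model) - over - under,
    }


summary = compare_triage(model_esi, human_esi)
print(summary)
```

In this toy data the model's median is one ESI level lower than the human median, and every disagreement is in the over-triage direction, which is the pattern the Results describe.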

Conclusion: Although GPT-4 offers a novel methodology for patient triage, its propensity to overestimate patient severity highlights the need for further development and calibration of LLM tools in clinical environments. The findings underscore both the potential and the limitations of LLMs in clinical decision-making and support their cautious integration into healthcare settings.

Reporting method: This study adhered to relevant EQUATOR guidelines for reporting observational studies.

Keywords: AI in healthcare; Emergency Severity Index; GPT-4; clinical decision support; large language model; patient triage.