AI Versus MD: Evaluating the surgical decision-making accuracy of ChatGPT-4

Surgery. 2024 Aug;176(2):241-245. doi: 10.1016/j.surg.2024.04.003. Epub 2024 May 19.

Abstract

Background: ChatGPT-4 is a large language model with possible applications to surgical education. The aim of this study was to investigate the accuracy of ChatGPT-4's surgical decision-making compared with that of general surgery residents and attending surgeons.

Methods: Five clinical scenarios were created from actual patient data based on common general surgery diagnoses. Scripts were developed to sequentially provide clinical information and ask decision-making questions. Responses to the prompts were scored against a standardized rubric for a total of 50 points. Each clinical scenario was run through ChatGPT-4 and sent electronically to all general surgery residents and attendings at a single institution. Scores were compared using Wilcoxon rank sum tests.
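For readers unfamiliar with the statistic used here, the Wilcoxon rank sum (Mann-Whitney U) comparison can be sketched in plain Python. This is a minimal illustration with hypothetical score vectors, not the authors' analysis code or their actual data; the normal approximation to the U distribution is used for the p-value.

```python
import math

def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank sum test via the normal approximation.

    Returns (U statistic for sample `a`, approximate two-sided p-value).
    Ties receive average ranks; suitable only as an illustration for
    small samples like those in this study.
    """
    n1, n2 = len(a), len(b)
    pooled = sorted(a + b)

    def avg_rank(v):
        # average of the first and last 1-based positions of v (tie handling)
        first = pooled.index(v) + 1
        last = len(pooled) - pooled[::-1].index(v)
        return (first + last) / 2

    r1 = sum(avg_rank(v) for v in a)      # rank sum of sample a
    u1 = r1 - n1 * (n1 + 1) / 2           # Mann-Whitney U statistic
    mu = n1 * n2 / 2                      # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    # two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u1, p

# Hypothetical rubric scores (out of 50) for illustration only:
gpt_scores = [40, 39, 40, 38, 41]
junior_scores = [33, 35, 30, 36, 33]
u, p = rank_sum_test(gpt_scores, junior_scores)
```

With these made-up vectors, every ChatGPT-4 score exceeds every junior-resident score, so U takes its maximum value and the approximate p-value falls below .05, mirroring the direction of the reported comparison.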

Results: On average, ChatGPT-4 scored 39.6 points (79.2%, standard deviation ± 0.89 points). A total of five junior residents, 12 senior residents, and five attendings completed the clinical scenarios (resident response rate = 15.9%; attending response rate = 13.8%). On average, the junior residents scored a total of 33.4 (66.8%, standard deviation ± 3.29), senior residents 38.0 (76.0%, standard deviation ± 4.75), and attendings 38.8 (77.6%, standard deviation ± 5.45). ChatGPT-4 scored significantly better than junior residents (P = .009) but was not significantly different from senior residents or attendings. ChatGPT-4 was significantly better than junior residents at identifying the correct operation to perform (P = .0182) and recommending additional workup for postoperative complications (P = .012).

Conclusion: ChatGPT-4 outperformed junior residents and performed comparably to senior residents and attendings when faced with surgical patient scenarios. Large language models, such as ChatGPT, may have the potential to serve as an educational resource for junior residents developing surgical decision-making skills.

Publication types

  • Comparative Study

MeSH terms

  • Clinical Competence*
  • Clinical Decision-Making*
  • General Surgery* / education
  • Humans
  • Internship and Residency*