Background: ChatGPT-4 is a large language model with possible applications to surgical education. The aim of this study was to investigate the accuracy of ChatGPT-4's surgical decision-making compared with that of general surgery residents and attending surgeons.
Methods: Five clinical scenarios were created from actual patient data based on common general surgery diagnoses. Scripts were developed to sequentially provide clinical information and ask decision-making questions. Responses to the prompts were scored against a standardized rubric for a total of 50 points. Each clinical scenario was run through ChatGPT-4 and sent electronically to all general surgery residents and attendings at a single institution. Scores were compared using Wilcoxon rank sum tests.
Results: On average, ChatGPT-4 scored 39.6 points (79.2%, standard deviation ± 0.89 points). A total of five junior residents, 12 senior residents, and five attendings completed the clinical scenarios (resident response rate = 15.9%; attending response rate = 13.8%). On average, the junior residents scored a total of 33.4 (66.8%, standard deviation ± 3.29), senior residents 38.0 (76.0%, standard deviation ± 4.75), and attendings 38.8 (77.6%, standard deviation ± 5.45). ChatGPT-4 scored significantly better than junior residents (P = .009) but was not significantly different from senior residents or attendings. ChatGPT-4 was significantly better than junior residents at identifying the correct operation to perform (P = .0182) and recommending additional workup for postoperative complications (P = .012).
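As an illustration of the arithmetic behind these figures, the following minimal sketch converts rubric scores on the 50-point scale to percentages and computes a two-sided Wilcoxon rank sum (Mann-Whitney) test. It is not the authors' analysis code: the function names are hypothetical, the P value uses a normal approximation without tie or continuity correction (an exact test is generally preferable for samples as small as those reported here), and the individual scores in the usage lines are invented placeholders, since the abstract reports only group means.

```python
from math import sqrt
from statistics import NormalDist

MAX_POINTS = 50  # rubric total stated in the Methods


def percent(score, max_points=MAX_POINTS):
    """Convert a rubric score to a percentage, rounded to one decimal."""
    return round(100 * score / max_points, 1)


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (U statistic for x, two-sided P) via a normal approximation.

    Assumption: no tie or continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])              # rank sum of the first group
    u = w - n1 * (n1 + 1) / 2        # Mann-Whitney U for the first group
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p


# Reported group means, converted to percentages of the 50-point rubric.
reported_means = {"ChatGPT-4": 39.6, "junior": 33.4, "senior": 38.0, "attending": 38.8}
percentages = {group: percent(mean) for group, mean in reported_means.items()}

# Hypothetical individual scores, invented purely to show the call.
placeholder_a = [40, 39, 41, 39, 39]
placeholder_b = [30, 35, 33, 37, 32]
u_stat, p_value = rank_sum_test(placeholder_a, placeholder_b)
```

The percentage conversion reproduces the values quoted above from the reported means; the rank sum output for the placeholder scores carries no meaning for the study itself.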
Conclusion: ChatGPT-4 outperformed junior residents and performed comparably to senior residents and attendings when faced with surgical patient scenarios. Large language models, such as ChatGPT, may serve as an educational resource for junior residents to develop surgical decision-making skills.
Copyright © 2024 Elsevier Inc. All rights reserved.