Educational Limitations of ChatGPT in Neurosurgery Board Preparation

Cureus. 2024 Apr 20;16(4):e58639. doi: 10.7759/cureus.58639. eCollection 2024 Apr.

Abstract

Objective: This study evaluated the potential of Chat Generative Pre-trained Transformer (ChatGPT) as an educational tool for neurosurgery residents preparing for the American Board of Neurological Surgery (ABNS) primary examination.

Methods: Non-imaging questions from the Congress of Neurological Surgeons (CNS) Self-Assessment in Neurological Surgery (SANS) online question bank were input into ChatGPT. Accuracy was evaluated and compared with human performance across subcategories. To quantify ChatGPT's educational potential, the concordance and insight of its explanations were assessed by multiple neurosurgical faculty. Associations among these metrics, as well as with question length, were evaluated.

Results: ChatGPT had an accuracy of 50.4% (1,068/2,120), with the highest and lowest accuracies in the pharmacology (81.2%, 13/16) and vascular (32.9%, 91/277) subcategories, respectively. ChatGPT performed worse than humans overall, as well as in the functional, other, peripheral, radiology, spine, trauma, tumor, and vascular subcategories. There were no subjects in which ChatGPT performed better than humans, and its accuracy was below that required to pass the exam. The mean concordance was 93.4% (198/212) and the mean insight score was 2.7. Accuracy was negatively associated with question length (R²=0.29, p=0.03) but positively associated with both concordance (p<0.001, q<0.001) and insight (p<0.001, q<0.001).

Conclusions: The current study provides the largest and most comprehensive assessment of the accuracy and explanatory quality of ChatGPT in answering ABNS primary exam questions. The findings demonstrate shortcomings regarding ChatGPT's ability to pass, let alone teach, the neurosurgical boards.
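
The percentages reported in the Results can be recomputed from the counts printed alongside them. The short Python sketch below is illustrative only and is not part of the study's analysis; the dictionary keys and the `percent` helper are names chosen here for convenience, and the counts are copied as printed in the abstract.

```python
# Counts (numerator, denominator) copied from the abstract's Results.
# The labels are illustrative names, not terms defined by the study.
reported = {
    "overall accuracy": (1068, 2120),
    "pharmacology accuracy": (13, 16),
    "vascular accuracy": (91, 277),
    "concordance": (198, 212),
}


def percent(numerator, denominator):
    """Return numerator/denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


# Recompute each proportion from its raw counts.
percentages = {name: percent(n, d) for name, (n, d) in reported.items()}

for name, value in percentages.items():
    print(f"{name}: {value:.2f}%")
```

Printing two decimal places makes it easy to compare each recomputed value against the one-decimal figure quoted in the text.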

Keywords: artificial intelligence; large language model; machine learning; medical education; neurosurgery.