The Performance of a Customized Generative Pre-trained Transformer on the American Society for Surgery of the Hand Self-Assessment Examination

Jason C Flynn; Jacob Zeitlin; Sebastian D Arango; Nathaniel Pineda; Andrew J Miller; Tristan B Weir

doi:10.7759/cureus.70205

Cureus. 2024 Sep 25;16(9):e70205. doi: 10.7759/cureus.70205. eCollection 2024 Sep.

Authors

Jason C Flynn¹, Jacob Zeitlin¹, Sebastian D Arango², Nathaniel Pineda³, Andrew J Miller¹, Tristan B Weir¹

Affiliations

¹ Department of Orthopaedic Surgery, Philadelphia Hand to Shoulder Center, Philadelphia, USA.
² Department of Orthopaedics, Philadelphia Hand to Shoulder Center, Philadelphia, USA.
³ Department of Orthopaedic Surgery, Drexel University College of Medicine, Philadelphia, USA.

Abstract

Introduction: Multimodal large language models (MLLMs), such as OpenAI's ChatGPT (San Francisco, CA), have the potential to improve medical care delivery and education, although important shortcomings in accuracy and image interpretation have been noted. The aim of this study was to assess the multimodal performance of a ChatGPT model customized with hand surgery-specific knowledge.

Methods: A customized generative pre-trained transformer (GPT) was trained using peer-reviewed literature recommended by the American Society for Surgery of the Hand (ASSH). Questions were taken from the ASSH 2022 Self-Assessment Examination (SAE). GPT-4 and the customized GPT were asked text-based multiple-choice questions. The customized GPT was also asked image-containing questions, both with and without access to the image(s) associated with each question.

Results: A total of 192 questions were included. The customized GPT answered the 119 text-only questions more accurately than GPT-4 (107 [89.9%] versus 91 [76.5%] correct; P = 0.001). Human examinees answered 87.3% (IQR: 71.6-93.7%) of the same text-based questions correctly. Of the 73 questions with images, the customized GPT answered 55 (75.3%) correctly when given access to the images and 51 (69.9%) when the images were withheld (P = 0.317). Human examinees answered 87.2% (IQR: 79.4-95.4%) of the image-based questions correctly.
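The reported percentages can be checked directly from the counts given in the Results. The following is a minimal sketch of that arithmetic only; the function and variable names are illustrative assumptions, and it does not reproduce the P values, since the abstract does not name the statistical test or give the per-question data that test would require.

```python
# Arithmetic check of the accuracy figures reported in the Results.
# All names here are illustrative; only the counts come from the abstract.

def accuracy_pct(correct: int, total: int) -> float:
    """Return the percentage of correct answers, rounded to one decimal place."""
    return round(100 * correct / total, 1)

TEXT_ONLY_QUESTIONS = 119
IMAGE_QUESTIONS = 73
TOTAL_QUESTIONS = TEXT_ONLY_QUESTIONS + IMAGE_QUESTIONS

accuracy = {
    "custom_gpt_text_only": accuracy_pct(107, TEXT_ONLY_QUESTIONS),
    "gpt4_text_only": accuracy_pct(91, TEXT_ONLY_QUESTIONS),
    "custom_gpt_with_images": accuracy_pct(55, IMAGE_QUESTIONS),
    "custom_gpt_images_withheld": accuracy_pct(51, IMAGE_QUESTIONS),
}

if __name__ == "__main__":
    print(f"Total questions: {TOTAL_QUESTIONS}")
    for label, pct in accuracy.items():
        print(f"{label}: {pct}%")
```

The question counts sum to the stated total of 192, and each computed percentage matches the value given in the abstract.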

Conclusion: Our findings suggest that hand surgery-specific training significantly improves ChatGPT's ability to answer text-based hand surgery questions. ChatGPT remains limited in its ability to interpret images when answering questions about hand conditions. These data show that hand surgeons can create customized GPT models to provide tailored answers to specific questions, which may serve as the foundation for educational and clinical tools.

Keywords: artificial intelligence; chatgpt; hand; natural language processing; orthopaedic.