Faster and better than a physician?: Assessing diagnostic proficiency of ChatGPT in misdiagnosed individuals with neuromyelitis optica spectrum disorder

Kevin Shan; Mahi A Patel; Morgan McCreary; Tom G Punnen; Francisco Villalobos; Lauren M Tardo; Lindsay A Horton; Peter V Sguigna; Kyle M Blackburn; Shanan B Munoz; Katy W Burgess; Tatum M Moog; Alexander D Smith; Darin T Okuda

doi:10.1016/j.jns.2024.123360

J Neurol Sci. 2024 Dec 19:468:123360. doi: 10.1016/j.jns.2024.123360. Online ahead of print.

Affiliations

¹ The University of Texas Southwestern Medical Center, School of Medicine, Dallas, TX, USA.
² The University of Texas Southwestern Medical Center, Department of Neurology, Neuroinnovation Program, Multiple Sclerosis & Neuroimmunology Imaging Program, Dallas, TX, USA; The University of Texas Southwestern Medical Center, Peter O'Donnell Jr. Brain Institute, Dallas, TX, USA.
³ Texas Tech University Health Sciences Center, School of Medicine, Lubbock, TX, USA.
⁴ The University of Texas Southwestern Medical Center, Department of Neurology, Neuroinnovation Program, Multiple Sclerosis & Neuroimmunology Imaging Program, Dallas, TX, USA; The University of Texas Southwestern Medical Center, Peter O'Donnell Jr. Brain Institute, Dallas, TX, USA.

PMID: 39733714

Abstract

Background: Neuromyelitis optica spectrum disorder (NMOSD) is a commonly misdiagnosed condition. Driven by cost-consciousness and technological fluency, distinct generations may gravitate towards healthcare alternatives, including artificial intelligence (AI) models such as ChatGPT (Generative Pre-trained Transformer). Our objective was to evaluate the speed and accuracy of ChatGPT-3.5 (GPT-3.5) in diagnosing people with NMOSD (PwNMOSD) who were initially misdiagnosed.

Methods: Misdiagnosed PwNMOSD were retrospectively identified, and their clinical symptoms and timelines of medically related events were processed through GPT-3.5. For each subject, seven digital derivatives representing different races, ethnicities, and sexes were created and processed identically to evaluate the impact of these variables on accuracy. Scoresheets were used to track diagnostic success and time to diagnosis. The diagnostic speed of GPT-3.5 was evaluated against that of physicians using a Cox proportional hazards model, clustered by subject. Logistic regression was used to estimate the diagnostic accuracy of GPT-3.5 compared with the estimated accuracy of physicians.

Results: Clinical timelines for 68 individuals (59 female; 42 Black/African American, 13 White, 11 Hispanic, 2 Asian; mean age at first symptoms 34.4 years [standard deviation = 15.5 years]) were analyzed and 476 digital simulations were created, yielding 544 conversations for analysis. The instantaneous probability of correct diagnosis was 70.65% less for physicians relative to GPT-3.5 within 240 days of symptom onset (p < 0.0001). The estimated probability of correct diagnosis for GPT-3.5 was 80.88% [95% CI = (76.35%, 99.81%)].
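The counts and the speed comparison reported above can be checked with simple arithmetic. The sketch below is illustrative only and is not the authors' analysis code. It assumes that the 544 conversations are the 68 original cases plus seven derivatives each, and that "70.65% less" is a fractional reduction in the hazard of correct diagnosis, so the corresponding hazard ratio is one minus that fraction. The function names are hypothetical.

```python
# Figures quoted in the Methods and Results paragraphs.
SUBJECTS = 68
DERIVATIVES_PER_SUBJECT = 7          # seven digital derivatives per subject
PHYSICIAN_HAZARD_REDUCTION = 0.7065  # "70.65% less" for physicians vs. GPT-3.5


def conversation_counts(subjects, derivatives_per_subject):
    """Return (simulated, total): derivative cases, and originals plus derivatives."""
    simulated = subjects * derivatives_per_subject
    return simulated, subjects + simulated


def hazard_ratio_from_reduction(reduction):
    """Convert a fractional hazard reduction to a hazard ratio (assumed reading)."""
    return 1.0 - reduction


simulated, total = conversation_counts(SUBJECTS, DERIVATIVES_PER_SUBJECT)

# Physicians relative to GPT-3.5, and the reciprocal (GPT-3.5 relative to physicians).
hr_physician = hazard_ratio_from_reduction(PHYSICIAN_HAZARD_REDUCTION)
hr_gpt = 1.0 / hr_physician

print(f"simulated conversations: {simulated}, total conversations: {total}")
print(f"hazard ratio, physicians vs GPT-3.5: {hr_physician:.4f}")
print(f"hazard ratio, GPT-3.5 vs physicians: {hr_gpt:.2f}")
```

Under these assumptions the counts reproduce the reported 476 simulations and 544 conversations, and the reported reduction corresponds to a hazard ratio of about 0.29 for physicians relative to GPT-3.5 within the 240-day window.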

Conclusion: GPT-3.5 may be of value in recognizing NMOSD. However, the manner in which medical information is conveyed, combined with the potential for inaccuracies, may result in unnecessary psychological stress.

Keywords: ChatGPT; Generation Z; Generative AI; Misdiagnosis; Neuromyelitis optica spectrum disorder.