ChatGPT4's diagnostic accuracy in inpatient neurology: A retrospective cohort study

Heliyon. 2024 Dec 9;10(24):e40964. doi: 10.1016/j.heliyon.2024.e40964. eCollection 2024 Dec 30.

Abstract

Background: Large language models (LLMs) such as ChatGPT-4 (CG4) are proving to be valuable tools in the medical field, not only in facilitating administrative tasks but also in augmenting medical decision-making. LLMs have previously been tested for diagnostic accuracy with expert-generated questions and standardized test data. Among those studies, CG4 consistently outperformed alternative LLMs, including ChatGPT-3.5 (no longer publicly available) and Google Bard (now known as "Google Gemini"). The next logical step was to explore CG4's accuracy within a specific clinical domain. Our study evaluated the diagnostic accuracy of CG4 within an inpatient neurology consultation service.

Methods: We conducted a review of all patients listed on the daily neurology consultation roster at Arrowhead Regional Medical Center in Colton, CA, for all days surveyed until we reached a total of 51 patients, ensuring a complete and representative sample of the patient population. ChatGPT-4, using HIPAA-compliant methodology, received patient data from the Epic EHR as input and was asked to provide an initial list of differential diagnoses, investigations and recommended actions, a final diagnosis, and a treatment plan for each patient. A comprehensiveness scale (an ordinal scale from 0 to 3) was then used to rate how closely the consultants' and CG4's initial diagnoses matched the consultants' final diagnoses. In this proof-of-concept study, we assumed that the neurology consultants' final diagnoses were accurate. To compare performance between the consultant and CG4 groups, we used non-parametric bootstrap resampling to construct 95 % confidence intervals around mean scores, a Fisher's Exact test, a Wilcoxon Rank Sum test, and ordinal logistic regression models.
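As a rough illustration of two of the statistical steps named above, the following Python sketch computes a percentile bootstrap interval for a mean score and a rank-sum (Mann-Whitney U) statistic. The abstract does not publish patient-level scores, so the two score lists are an assumption: they are reconstructed from the counts, success rates, and means reported in the Findings. The resample count, seed, and function names are likewise illustrative choices, not the authors' settings.

```python
import random

# ASSUMPTION: score lists reconstructed from the abstract's summary figures
# (number of "3" scores, success rates, and means); not the authors' raw data.
consultant = [3] * 43 + [2] * 5 + [1] * 1 + [0] * 2
cg4 = [3] * 31 + [2] * 18 + [1] * 2


def bootstrap_mean_ci(scores, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mann_whitney_u(x, y):
    """U statistic for x versus y, counting each tied pair as one half."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)


if __name__ == "__main__":
    for name, scores in (("consultant", consultant), ("CG4", cg4)):
        mean_score = sum(scores) / len(scores)
        low, high = bootstrap_mean_ci(scores)
        print(f"{name}: mean {mean_score:.2f}, 95% CI {low:.2f}-{high:.2f}")
    print("U =", mann_whitney_u(consultant, cg4))
```

With these reconstructed scores the U statistic comes to 1582.5, in line with the reported W = 1583. The bootstrap limits depend on the seed and the number of resamples, so they need not match the published intervals exactly.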

Findings: Our study found that CG4 demonstrated diagnostic accuracy comparable to that of consultant neurologists. The most frequent comprehensiveness score achieved by both groups was "3," with consultant neurologists achieving this score 43 times and CG4 achieving it 31 times. The mean comprehensiveness scores were 2.75 (95 % CI: 2.49-2.90) for the consultant group and 2.57 (95 % CI: 2.31-2.67) for the CG4 group. The success rate for comprehensive diagnoses (a score of "2" or "3") was 94.1 % (95 % CI: 84.1 %-98.0 %) for consultants and 96.1 % (95 % CI: 86.8 %-98.9 %) for CG4, with no statistically significant difference in success rates (p = 1.00). The Wilcoxon Rank Sum Test indicated that the consultant group had a higher likelihood of providing more comprehensive diagnoses (W = 1583, p = 0.02). Ordinal logistic regression models identified significant predictors of diagnostic accuracy, with the consultant diagnosis group showing an odds ratio of 3.68 (95 % CI: 1.28-10.55) for higher value outcomes. Notably, integrating CG4's initial diagnoses with those from consultants could achieve comprehensive diagnostics in all cases, indicating a number needed to treat (NNT) of 17 to attain one additional comprehensive diagnosis.
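The success-rate figures above can be checked with a short calculation. The Python sketch below assumes counts of 48/51 and 49/51 comprehensive diagnoses, which are inferred from the reported 94.1 % and 96.1 % rather than stated in the abstract. It uses a Wilson score interval, a method the abstract does not name but one that agrees with the reported limits to rounding. The function names are hypothetical, and the Fisher's exact test is written out with the standard library only.

```python
from math import comb, sqrt

N_PATIENTS = 51
CONSULTANT_OK = 48  # ASSUMPTION: inferred from 94.1 % of 51
CG4_OK = 49         # ASSUMPTION: inferred from 96.1 % of 51


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(k):
        # Unnormalised hypergeometric weight of a table whose first cell is k.
        return comb(row1, k) * comb(row2, col1 - k)

    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    # Sum every table that is no more probable than the observed one.
    extreme = sum(w for w in map(weight, range(k_min, k_max + 1)) if w <= observed)
    return extreme / comb(row1 + row2, col1)


def number_needed_to_treat(baseline_ok, improved_ok, total):
    """NNT = 1 / absolute gain in success rate, kept in integer counts."""
    return total / (improved_ok - baseline_ok)


if __name__ == "__main__":
    for name, ok in (("consultant", CONSULTANT_OK), ("CG4", CG4_OK)):
        low, high = wilson_ci(ok, N_PATIENTS)
        print(f"{name}: {ok / N_PATIENTS:.1%} (95% CI {low:.1%}-{high:.1%})")
    p = fisher_exact_two_sided(CONSULTANT_OK, N_PATIENTS - CONSULTANT_OK,
                               CG4_OK, N_PATIENTS - CG4_OK)
    print(f"Fisher's exact p = {p:.2f}")
    # Combined consultant + CG4 diagnoses are reported as comprehensive in all 51 cases.
    print("NNT =", number_needed_to_treat(CONSULTANT_OK, N_PATIENTS, N_PATIENTS))
```

Under these assumed counts, the gain from 48/51 to 51/51 is an absolute difference of 3/51, so the NNT is 51/3 = 17, and the two-sided Fisher's exact p-value is 1.00 because the observed table is the most probable one given its margins.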

Interpretation: Our findings suggest that CG4 can serve as a valuable diagnostic tool within the domain of inpatient neurology, providing comprehensive and accurate initial diagnoses comparable to those of consultant neurologists. The use of CG4 might contribute to better patient outcomes by serving as an aid in diagnosis and treatment recommendations, potentially leading to fewer missed diagnoses and quicker diagnostic processes. Ongoing evaluation and strategies to improve LLMs' accuracy remain crucial. Further studies with larger sample sizes and independent third-party evaluations are recommended to confirm these findings and assess the impact of LLMs on patient health.

Keywords: Artificial intelligence (AI); ChatGPT-4 (CG4); Clinical decision support; Differential diagnoses; Epic EHR; Inpatient neurology; Large language models (LLMs); Ordinal logistic regression; Treatment recommendations; diagnostic accuracy.