Performance of large language models on advocating the management of meningitis: a comparative qualitative study

BMJ Health Care Inform. 2024 Feb 2;31(1):e100978. doi: 10.1136/bmjhci-2023-100978.

Abstract

Objectives: We aimed to examine the adherence of large language models (LLMs) to bacterial meningitis guidelines using a hypothetical medical case, highlighting their utility and limitations in healthcare.

Methods: A simulated clinical scenario of a patient with bacterial meningitis secondary to mastoiditis was presented in three independent sessions to each of seven publicly accessible LLMs (Bard, Bing, Claude-2, GPT-3.5, GPT-4, Llama, PaLM). Responses were evaluated for adherence to good clinical practice and two international meningitis guidelines.

Results: A central nervous system infection was identified in 90% of LLM sessions. All sessions recommended imaging, while 81% suggested lumbar puncture. Blood cultures and specific mastoiditis work-up were proposed in only 62% and 38% of sessions, respectively. Only 38% of sessions provided the correct empirical antibiotic treatment, while antiviral treatment and dexamethasone were advised in 33% and 24%, respectively. Misleading statements were generated in 52% of sessions. No significant correlation was found between LLMs' text length and performance (r=0.29, p=0.20). Among all LLMs, GPT-4 demonstrated the best performance.
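
The percentages above can be read as session counts. The following is a minimal sketch, not part of the original analysis: it assumes that every percentage uses the same denominator of 21 sessions (seven LLMs, three sessions each, as stated in the Methods) and that the values were rounded to whole percent. The names `reported` and `implied_count` are hypothetical, and the resulting counts are inferred, not reported by the authors.

```python
# Assumption: 7 LLMs x 3 sessions each = 21 sessions, used as the
# denominator for every percentage in the Results.
N_SESSIONS = 7 * 3

# Percentages as reported in the Results (whole percent).
reported = {
    "CNS infection identified": 90,
    "lumbar puncture suggested": 81,
    "blood cultures proposed": 62,
    "mastoiditis work-up proposed": 38,
    "correct empirical antibiotics": 38,
    "antiviral treatment advised": 33,
    "dexamethasone advised": 24,
    "misleading statements": 52,
}


def implied_count(percent, n=N_SESSIONS):
    """Return the unique count k with round(100 * k / n) == percent, else None."""
    matches = [k for k in range(n + 1) if round(100 * k / n) == percent]
    return matches[0] if len(matches) == 1 else None


# Inferred (not reported) session counts for each finding.
counts = {label: implied_count(pct) for label, pct in reported.items()}

for label, k in counts.items():
    print(f"{label}: {k}/{N_SESSIONS}")
```

Under these assumptions each reported percentage maps to exactly one whole number of sessions, which is consistent with a denominator of 21.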

Discussion: The latest LLMs provide valuable advice on differential diagnosis and diagnostic procedures but vary considerably in the treatment-specific information they give for bacterial meningitis when presented with a realistic clinical scenario. Misleading statements were common, with performance differences attributed to each LLM's unique algorithm rather than output length.

Conclusions: Users must be aware of such limitations and performance variability when considering LLMs as a support tool for medical decision-making. Further research is needed to refine these models' comprehension of complex medical scenarios and their ability to provide reliable information.

Keywords: Artificial intelligence; Decision Making, Computer-Assisted; Disease Management; Patient-Centered Care; Safety Management.

MeSH terms

  • Algorithms
  • Humans
  • Language
  • Mastoiditis*
  • Meningitis, Bacterial* / drug therapy