The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis

William J Waldock; Joe Zhang; Ahmad Guni; Ahmad Nabeel; Ara Darzi; Hutan Ashrafian

doi:10.2196/56532

J Med Internet Res. 2024 Nov 5:26:e56532. doi: 10.2196/56532.

Authors

William J Waldock¹, Joe Zhang¹, Ahmad Guni¹, Ahmad Nabeel², Ara Darzi¹, Hutan Ashrafian²

Affiliations

¹ Imperial College London, London, United Kingdom.
² Institute of Global Health Innovation, Imperial College London, London, United Kingdom.

PMID: 39499913
DOI: 10.2196/56532

Abstract

Background: Large language models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text. However, there is a lack of clarity about the accuracy and capability standards of LLMs in health care examinations.

Objective: We conducted a systematic review of LLM accuracy, as tested under health care examination conditions, as compared to known human performance standards.

Methods: We quantified the accuracy of LLMs in responding to health care examination questions and evaluated the consistency and quality of study reporting. The search covered all papers published in English-language journals up to September 10, 2023, that evaluated any LLM and reported clear LLM accuracy standards. The exclusion criteria were as follows: the assessment was not a health care exam, there was no LLM, there was no evaluation of comparable success accuracy, or the literature was not original research. The literature search included the following Medical Subject Headings (MeSH) terms used in all possible combinations: "artificial intelligence," "ChatGPT," "GPT," "LLM," "large language model," "machine learning," "neural network," "Generative Pre-trained Transformer," "Generative Transformer," "Generative Language Model," "Generative Model," "medical exam," "healthcare exam," and "clinical exam." Sensitivity, accuracy, and precision data were extracted, including relevant CIs.
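As an illustration of what combining the listed search terms could look like, the following minimal Python sketch pairs each model-related term with each examination-related term. The split into two groups and the `AND` query syntax are assumptions made here for illustration; the abstract states only that the terms were used "in all possible combinations" and does not give the actual query strings.

```python
from itertools import product

# Search terms as listed in the Methods. Splitting them into a model-related
# group and an examination-related group is an assumption of this sketch.
model_terms = [
    "artificial intelligence", "ChatGPT", "GPT", "LLM",
    "large language model", "machine learning", "neural network",
    "Generative Pre-trained Transformer", "Generative Transformer",
    "Generative Language Model", "Generative Model",
]
exam_terms = ["medical exam", "healthcare exam", "clinical exam"]

# Hypothetical query format: one quoted model term AND one quoted exam term.
queries = [f'"{m}" AND "{e}"' for m, e in product(model_terms, exam_terms)]

print(len(queries))
print(queries[0])
```

Under this pairing assumption, the 11 model-related terms and 3 examination-related terms yield 33 two-term queries.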

Results: The search identified 1673 relevant citations. After removing duplicate results, 1268 (75.8%) papers were screened by title and abstract, and 32 (2.5%) studies were included for full-text review. Our meta-analysis suggested that LLMs are able to perform with an overall medical examination accuracy of 0.61 (CI 0.58-0.64) and a United States Medical Licensing Examination (USMLE) accuracy of 0.51 (CI 0.46-0.56), while Chat Generative Pretrained Transformer (ChatGPT) can perform with an overall medical examination accuracy of 0.64 (CI 0.60-0.67).
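The screening percentages reported above can be checked with simple arithmetic. The short Python sketch below assumes that 75.8% is taken relative to the 1673 identified citations and that 2.5% is taken relative to the 1268 screened papers; these denominators are inferred from the numbers and are not stated explicitly in the abstract. The helper name `pct` is hypothetical.

```python
def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

identified = 1673  # citations identified by the search
screened = 1268    # papers screened after duplicate removal
included = 32      # studies included for full-text review

# Assumed denominators: identified citations, then screened papers.
screened_pct = pct(screened, identified)
included_pct = pct(included, screened)

# Implied count of removed records, assuming duplicates were the only removals.
removed = identified - screened

print(screened_pct, included_pct, removed)
```

Both percentages match the reported values under these assumed denominators, and the difference between identified and screened records implies 405 removals.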

Conclusions: LLMs offer promise to remediate health care demand and staffing challenges by providing accurate and efficient context-specific information to critical decision makers. For policy and deployment decisions about LLMs to advance health care, we proposed a new framework called RUBRICC (Regulatory, Usability, Bias, Reliability [Evidence and Safety], Interoperability, Cost, and Codesign-Patient and Public Involvement and Engagement [PPIE]). This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services, while respecting patient safety considerations.

Trial registration: OSF Registries osf.io/xqzkw; https://osf.io/xqzkw.

Keywords: AI; LLM; artificial intelligence; clinical commissioning; health care exam; health care examination; health services; large language model; narrative medical response; safety.

©William J Waldock, Joe Zhang, Ahmad Guni, Ahmad Nabeel, Ara Darzi, Hutan Ashrafian. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 05.11.2024.

Publication types

Review