Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a 'passing' score on United States Medical Licensing Examination (USMLE)-style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates strong performance gains across the MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluation framework shows that physicians prefer Med-PaLM 2 answers to physician-generated answers on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. Although specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 answers as comparable to physician answers in safety, demonstrating its growing potential for real-world medical applications.