Standardized patient profile review using large language models for case adjudication in observational research

Martijn J Schuemie; Anna Ostropolets; Aleh Zhuk; Uladzislau Korsik; Seung In Seo; Marc A Suchard; George Hripcsak; Patrick B Ryan

doi:10.1038/s41746-025-01433-4

NPJ Digit Med. 2025 Jan 9;8(1):18. doi: 10.1038/s41746-025-01433-4.

Authors

Martijn J Schuemie^{1,2,3}, Anna Ostropolets^{4,5}, Aleh Zhuk^{4,6}, Uladzislau Korsik^{4,6}, Seung In Seo^{4,7}, Marc A Suchard^{4,8}, George Hripcsak^{4,5}, Patrick B Ryan^{4,9,5}

Affiliations

¹ Observational Health Data Science and Informatics, New York, NY, USA.
² Global Epidemiology Organization, Johnson & Johnson, Titusville, NJ, USA.
³ Department of Biostatistics, UCLA, Los Angeles, CA, USA.
⁴ Observational Health Data Science and Informatics, New York, NY, USA.
⁵ Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA.
⁶ Odysseus Data Services, Cambridge, MA, USA.
⁷ Division of Gastroenterology, Department of Internal Medicine, Kangdong Sacred Heart Hospital, Hallym University College of Medicine, Seoul, Republic of Korea.
⁸ Department of Biostatistics, UCLA, Los Angeles, CA, USA.
⁹ Global Epidemiology Organization, Johnson & Johnson, Titusville, NJ, USA.

Abstract

Using administrative claims and electronic health records for observational studies is common but challenging due to data limitations. Researchers rely on phenotype algorithms, requiring labor-intensive chart reviews for validation. This study investigates whether case adjudication using the previously introduced Knowledge-Enhanced Electronic Profile Review (KEEPER) system with large language models (LLMs) is feasible and could serve as a viable alternative to manual chart review. The task involves adjudicating cases identified by a phenotype algorithm, with KEEPER extracting predefined findings such as symptoms, comorbidities, and treatments from structured data. LLMs then evaluate KEEPER outputs to determine whether a patient truly qualifies as a case. We tested four LLMs including GPT-4, hosted locally to ensure privacy. Using zero-shot prompting and iterative prompt optimization, we found LLM performance, across ten diseases, varied by prompt and model, with sensitivities from 78 to 98% and specificities from 48 to 98%, indicating promise for automating phenotype evaluation.
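
The sensitivity and specificity figures reported above compare LLM adjudications against a reference standard. As a minimal sketch of how such figures are computed, and not the authors' evaluation code, the following assumes each patient has a boolean LLM verdict and a boolean reference label; the function name and the example labels are hypothetical.

```python
def sensitivity_specificity(llm_labels, reference_labels):
    """Return (sensitivity, specificity) of LLM verdicts against reference labels.

    Both arguments are equal-length sequences of booleans, where True means
    the patient is judged to be a true case. Illustrative helper only.
    """
    if len(llm_labels) != len(reference_labels):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(llm_labels, reference_labels))
    tp = sum(1 for pred, ref in pairs if pred and ref)          # true positives
    fn = sum(1 for pred, ref in pairs if not pred and ref)      # false negatives
    tn = sum(1 for pred, ref in pairs if not pred and not ref)  # true negatives
    fp = sum(1 for pred, ref in pairs if pred and not ref)      # false positives

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one reference case and one non-case")

    sensitivity = tp / (tp + fn)  # share of true cases the LLM accepted
    specificity = tn / (tn + fp)  # share of non-cases the LLM rejected
    return sensitivity, specificity


# Made-up labels for six patients (not data from the study).
llm = [True, True, False, True, False, False]
reference = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(llm, reference)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy example the LLM accepts two of the three reference cases and rejects two of the three non-cases, so both measures equal 2/3.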

Grants and funding

R01 LM006910/LM/NLM NIH HHS/United States