Objective: Brief hospital course (BHC) summaries are clinical documents that summarize a patient's hospital stay. While large language models (LLMs) have demonstrated remarkable capabilities in automating real-world tasks, their ability to support healthcare applications such as synthesizing BHCs from clinical notes has not been shown. We introduce a novel preprocessed dataset, the MIMIC-IV-BHC, comprising clinical note and BHC pairs for adapting LLMs to BHC synthesis. Furthermore, we introduce a benchmark of the summarization performance of 2 general-purpose LLMs and 3 healthcare-adapted LLMs.
Materials and methods: Using clinical notes as input, we apply prompting-based (in-context learning) and fine-tuning-based adaptation strategies to 3 open-source LLMs (Clinical-T5-Large, Llama2-13B, and FLAN-UL2) and 2 proprietary LLMs (Generative Pre-trained Transformer [GPT]-3.5 and GPT-4). We evaluate these LLMs across multiple context-length inputs using natural language similarity metrics. We further conduct a clinical reader study with 5 clinicians, comparing clinician-written and LLM-generated BHCs across 30 samples and focusing on their potential to enhance clinical decision-making through improved summary quality. We compare reader preferences for the original and LLM-generated summaries using Wilcoxon signed-rank tests. Finally, we request optional qualitative feedback from clinicians to gain deeper insight into their preferences, and we report the frequency of common themes arising from these comments.
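As a minimal illustrative sketch (not the authors' analysis code), paired reader preference ratings can be compared with a Wilcoxon signed-rank test using SciPy; the ratings below are hypothetical placeholders, not study data.

```python
# Illustrative sketch only: Wilcoxon signed-rank test over paired reader
# preference ratings, as one might run for a clinician reader study.
# The ratings below are hypothetical placeholders, not study data.
from scipy.stats import wilcoxon

# Paired 5-point preference ratings, one pair per evaluated case:
# original clinician-written summary vs LLM-generated summary.
original_scores = [3, 2, 4, 3, 2, 3, 4, 2, 3, 3]
llm_scores = [4, 3, 4, 4, 3, 4, 5, 3, 4, 4]

# Two-sided test of whether the paired differences are symmetric about zero.
stat, p_value = wilcoxon(original_scores, llm_scores)
print(f"Wilcoxon statistic = {stat:.1f}, P = {p_value:.4f}")
```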
Results: The fine-tuned Llama2-13B outperforms the other domain-adapted models on the quantitative evaluation metrics Bilingual Evaluation Understudy (BLEU) and Bidirectional Encoder Representations from Transformers (BERT)-Score. GPT-4 with in-context learning shows greater robustness to increasing context lengths of clinical note inputs than fine-tuned Llama2-13B. Despite comparable quantitative metrics, the reader study reveals a significant preference for summaries generated by GPT-4 with in-context learning over both the fine-tuned Llama2-13B summaries and the original summaries (P<.001), highlighting the need for qualitative clinical evaluation.
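For concreteness, a minimal sketch of computing the reported similarity metrics with the Hugging Face `evaluate` library follows (assuming `pip install evaluate bert-score`); the summary texts are hypothetical placeholders, not MIMIC-IV-BHC data.

```python
# Illustrative sketch only: scoring a generated BHC against a reference
# summary with BLEU and BERTScore via the Hugging Face `evaluate` library.
# Both texts are hypothetical placeholders, not dataset examples.
import evaluate

predictions = ["Patient admitted with pneumonia, treated with antibiotics, discharged stable."]
references = ["The patient was admitted for pneumonia, improved on antibiotics, and was discharged in stable condition."]

bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

bleu_result = bleu.compute(predictions=predictions, references=references)
bert_result = bertscore.compute(predictions=predictions, references=references, lang="en")

print(f"BLEU: {bleu_result['bleu']:.3f}")
print(f"BERTScore F1: {bert_result['f1'][0]:.3f}")
```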
Discussion and conclusion: We release a foundational, clinically relevant dataset, the MIMIC-IV-BHC, and present an open-source benchmark of LLM performance in BHC synthesis from clinical notes. We observe high-quality summarization performance from both the in-context proprietary and fine-tuned open-source LLMs under both quantitative metrics and a qualitative clinical reader study. Our research integrates key elements of the data assimilation pipeline: our methods cover (1) clinical data source integration, (2) data translation, and (3) knowledge creation, while our evaluation strategy paves the way for (4) deployment.
Keywords: electronic health records; information storage and retrieval; machine learning; natural language processing.