Query-Guided Self-Supervised Summarization of Nursing Notes

Ya Gao 1
Hans Moen 1
Saila Koivusalo 2
Miika Koskinen 2
Pekka Marttinen 1
1 Aalto University
2 Helsinki University Hospital

Abstract

Nursing notes, an important component of Electronic Health Records (EHRs), keep track of the progression of a patient’s health status during a care episode. Distilling the key information in nursing notes through text summarization techniques can improve clinicians’ efficiency in understanding patients’ conditions when reviewing nursing notes. However, existing abstractive summarization methods in the clinical setting have often overlooked nursing notes and require the creation of reference summaries for supervision signals, which is time-consuming. In this work, we introduce QGSumm, a query-guided self-supervised domain adaptation framework for nursing note summarization. Using patient-related clinical queries as guidance, our approach generates high-quality, patient-centered summaries without relying on reference summaries for training. Through automatic and manual evaluation by an expert clinician, we demonstrate the strengths of our approach compared to the state-of-the-art Large Language Models (LLMs) in both zero-shot and few-shot settings. Ultimately, our approach provides a new perspective on conditional text summarization, tailored to the specific interests of clinical personnel.

1 Introduction

Electronic Health Records (EHRs) document the events that patients go through during their hospitalization. These records consist of both free-text clinical notes and structured data. Among them, nursing notes are important for keeping track of the progression of patients’ illnesses, changes in health status, as well as the medications and procedures administered (Törnvall and Wilhelmsson, 2008). Nursing notes provide clinicians with comprehensive insights into the patients’ conditions, assisting in formulating next-step treatment and care plans, as well as in writing the final discharge summaries. However, a patient’s care episode may result in a large number of nursing notes, especially for patients suffering from complex health problems, which causes the problem of information overload (Hall and Walton, 2004). Additionally, the information in nursing notes is usually intricate and highly condensed, making it time-consuming for clinicians to understand (Clarke et al., 2013).

In Natural Language Processing (NLP), text summarization techniques can be used to distill the content of nursing notes (Wang et al., 2021) to help clinicians quickly grasp their contents. Automatic clinical note summarization has been extensively studied, with existing approaches categorized into extractive (Pivovarov and Elhadad, 2015; Moen et al., 2016; Tang et al., 2019) and abstractive methods (Zhang et al., 2020b; Liu et al., 2022a; Searle et al., 2023). However, these methods have certain limitations: (1) Extractive methods only retain sentences, important words, or keyphrases from the original note, limiting coherence and fluency of the summary. This presents a challenge for understanding the summarized content. On the other hand, (2) abstractive methods can generate smoother summaries, but most of the existing works on abstractive clinical note summarization require explicit supervision, i.e., a reference summary as the ground truth. Writing the references is time-consuming (O’Donnell et al., 2009), causing a shortage of training data. Moreover, (3) most abstractive clinical note summarization methods focus on a specific type of notes, such as discharge summaries, radiology reports, or the dialogues between doctors and patients, and there is a lack of research on abstractive nursing note summarization.

Figure 1: From a patient’s admission to discharge, multiple nursing notes may be generated. As shown in one artificial nursing note example, the notes could be poorly organized and lack clarity.

To address these limitations, we propose a novel approach for abstractive nursing note summarization, which does not require reference summaries for training. The nature of the nursing documentation poses additional challenges. For example, as shown in Figure 1, information in nursing notes may lack clarity, be poorly organized, and contain medical jargon that often includes non-standard abbreviations. In the absence of supervised learning signals, guiding the summarization model to understand the semantic information in notes and generate a good summary becomes challenging. Some previous text summarization works (Chu and Liu, 2019; Elsahar et al., 2021) adapt strategies in self-supervised learning (Liu et al., 2021) to such a scenario. In their methods, the training objective is to decrease the semantic distance between a summary and the original text based on the assumption that a good summary is capable of reconstructing the source text. However, simply making the semantic representation of the summary close to that of the original document does not allow controlling the generated summaries, which may result in a lack of relevant information. In clinical domains, we have specific requirements for the content of the summary, where the focus should be on the patient’s condition. Thus, methods that rely solely on the semantic similarity may not be fully satisfactory.

A good summary of a clinical note is centered on the patient’s condition. Consequently, when queried about the state of the patient, answers obtained from the summary will be similar to those obtained from the original note.

For instance, a query can concern the probability of a patient’s condition improving in the near future. We can train a model to answer this query using the current nursing note or its summary as input, and if the summary is accurate these two answers should be similar. Accordingly, we design a learning objective that aims at minimizing the discrepancy between the responses from the summary or from the source note to given queries. To further encourage the model to prioritize the patient’s current medical condition, we integrate into the summarization workflow both the patient’s metadata as well as information in the previous notes of the patient recorded on the same admission.

To the best of our knowledge, our proposed framework is the first on abstractive summarization of nursing notes, and there is no previous work on employing the self-supervised learning strategy for clinical note summarization. Our primary contributions are:

  • The study focuses on nursing notes that play a critical role in clinical settings, filling a gap in previous research by introducing a method for abstractive nursing note summarization. Our method’s ability to operate effectively without requiring reference summaries highlights its practical applicability.

  • We propose a novel self-supervised domain adaptation framework. By leveraging patient-related queries to guide the model, we achieve the goal of generating nursing note summaries that prioritize specific content, i.e., patients’ conditions and health status, without the need for ground truth.

  • We conduct a comprehensive automatic evaluation and a manual evaluation by an expert clinician. We compare our approach with state-of-the-art Large Language Models (LLMs) in few- and zero-shot settings. This demonstrates the method’s ability to generate high-quality summaries of nursing notes, and additionally provides an independent evaluation of the common LLMs in this task.

Generalizable Insights about Machine Learning in the Context of Healthcare

In existing healthcare-related NLP research, there is a noticeable gap in addressing nursing notes specifically. Summarizing key patient information within nursing notes could enhance the efficiency of medical personnel’s workflow. We provide a new perspective on obtaining summaries of nursing notes through self-supervised domain adaptation without the need for manually written reference summaries. In unconditional text summarization, the generated summary lacks explicit constraints. On the other hand, most conditional text summarization methods typically require data annotation (Vig et al., 2022) or the extraction of information related to content conditions (Pagnoni et al., 2023) from the source text, making them less suitable for our task. We introduce easily applicable patient-related queries as a way to facilitate conditional text summarization of nursing notes, which ensure that the generated summaries contain information required to respond to the query effectively, and are closely linked to the patient’s key information. Such constraints and guidance make our method highly suitable for the clinical and healthcare field since the resulting summaries are centered on the information nurses and clinicians are most concerned about.

2 Related Work

Key Information Extraction and Summarization from Clinical Notes

Research in this domain focuses on two approaches, extractive and abstractive summarization. The extractive method can preserve faithfulness but results in the inability to paraphrase and difficulties in maintaining coherence. Earlier work primarily used semantic similarity-based techniques (Pivovarov and Elhadad, 2015; Moen et al., 2016). The emergence of Transformer models has shifted the focus to attention-based methods to determine key information in clinical text with an emphasis on explainability (Tang et al., 2019; Reunamo et al., 2022; Kanwal and Rizzo, 2022).

Recently, there has been a notable increase in research on abstractive clinical text summarization. From an application perspective, these methods mainly target discharge summary generation (Shing et al., 2021; Adams et al., 2022; Searle et al., 2023), radiology report summarization (Zhang et al., 2020b; Van Veen et al., 2023), summarization of doctor-patient conversations (Zhang et al., 2021; Krishna et al., 2021; Abacha et al., 2023) and problem list summarization (Gao et al., 2022, 2023). However, unlike our approach, most of these works depend on data annotation or reference summaries for training and domain adaptation.

LLMs demonstrate a remarkable capability in clinical text understanding, leading to interest in investigating their performance in clinical text summarization. Van Veen et al. (2024) extensively analyze the clinical text summarization performance of various LLMs with in-context learning (Lampinen et al., 2022) and QLoRA (Dettmers et al., 2024) adaptation. They compare the performance of LLMs with medical experts, providing insights into the strengths and limitations of LLMs.

Unsupervised and Self-Supervised Abstractive Text Summarization

The scarcity of annotated text for abstractive text summarization tasks has spurred interest in unsupervised and self-supervised text summarization. Previous works have relied on source document reconstruction, operating under the assumption that a good summary should be able to reconstruct the source document or capture its essential content (Chu and Liu, 2019). However, reconstructing an entire text using a summary without any guiding signal or prompt is challenging. In contrast, Yang et al. (2020) leverage the lead bias in news articles, by pretraining a model to predict the leading sentences as the target summary. However, this approach requires specific information distribution and text layout, which is not generally applicable. Some works have proposed two-step approaches to first extract important information or entities in the source text and then perform abstractive summarization with the guidance of this extracted information (Zhong et al., 2022; Ke et al., 2022; Liu et al., 2022b). However, the quality of the generated summary relies on the effectiveness of the extraction process, and developing a reliable extractor may entail significant costs. Zhuang et al. (2022) propose a contrastive learning strategy, using source documents as positive and edited source documents as negative examples. Their training objective aims at maximizing the semantic similarity between generated summaries and positive examples while minimizing those between generated summaries and negative examples. Hosking et al. (2023) propose an attributable opinion summarization system, which encodes sentences as paths through a hierarchical discrete latent space. Given a specific entity, the system can identify its common subpaths that are decoded as the output summary.

3 Methods

Figure 2: The overall architecture. While fine-tuning the encoder (ENC) and decoder (DEC), $\operatorname{DEC^{rec}}$ and the query responder network $\operatorname{R}$ are frozen. Raw patient information is processed by ENC into the embeddings $\mathbf{H}^{PA}$, but this is omitted here for clarity (see text and Fig. 3 for details).

Next, we introduce QGSumm, a novel framework to automatically summarize and refine clinical notes, with a focus on capturing important patient-centered information in a self-supervised fashion (Figure 2). In line with the hypothesis that a good summary yields answers to patient-related queries similar to those obtained from the original note, we propose a self-supervised domain adaptation strategy applied to the base model presented in Section 3.1. This strategy learns from the original nursing notes in a positive-contrastive manner, giving the summaries an ability comparable to that of the original notes to resolve patient-related queries (Section 3.2). Moreover, we aim for the model to maintain focus on the patient’s meta information while also considering temporal aspects during the generation process. To achieve this, we propose two augmentation blocks, detailed in Section 3.3, to enhance the overall performance. Our model summarizes one nursing note at a time, taking into account its context. Assume a patient $PT$ has a sequence of nursing notes $N=\{N_1,N_2,\ldots,N_m\}$ sorted by time. Our objective is to obtain a summary $S_i$ for note $N_i$ from the distribution $P(S_i \mid N_i, PA, \{N_1,\ldots,N_{i-1}\}, U)$, which is conditioned on the patient’s metadata $PA$, information in the past notes $\{N_1,\ldots,N_{i-1}\}$, and the user query $U$ guiding the generation.

3.1 Base Model

The backbone of our framework is an off-the-shelf transformer-based language model with an encoder-decoder structure, denoted by $\operatorname{M}$. Specifically, we leverage a checkpoint $\operatorname{M^{sum}}$ of $\operatorname{M}$ as the base model, which has been fine-tuned for text summarization using publicly available datasets. This allows us to efficiently utilize the extensive resources offered by the pre-trained language model without the effort of training from scratch, and $\operatorname{M^{sum}}$ is already enriched with task-specific knowledge for improved performance in text summarization. However, the capability of $\operatorname{M^{sum}}$ to understand clinical text remains limited. Therefore, additional refinement of $\operatorname{M^{sum}}$ is necessary to enhance its ability to grasp the complicated semantic information within nursing notes.

Let $N_i=[t_1,t_2,\ldots,t_n]$, where $t_k$, $k=1,\ldots,n$, denote the tokens in $N_i$. As a preliminary step, we first train the encoder $\operatorname{ENC}$ of the base model $\operatorname{M^{sum}}$ by reconstructing $N_i$:

$$\mathcal{L}_{rec}(N_i,\operatorname{ENC},\operatorname{DEC^{rec}})=\mathcal{L}_{CrossEntropy}\big(N_i,\operatorname{DEC^{rec}}(\operatorname{ENC}(N_i))\big). \tag{1}$$

Here, $\operatorname{DEC^{rec}}$ is the decoder of the original pretrained model $\operatorname{M}$, which remains frozen during the training. This process empowers the encoder with the ability to understand the semantic information and the clinical knowledge embedded in nursing notes, enabling it to encode the notes more effectively. This preparatory step precedes the primary workflow for nursing note summarization.
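As a rough illustration of the loss in Eq. (1), the following sketch computes a token-level cross-entropy between a note and its reconstruction. It assumes that the frozen decoder emits one probability distribution over the vocabulary per position and that the loss is averaged over tokens; both the averaging and the function name `reconstruction_loss` are assumptions made here for illustration, not details given in the text.

```python
import math

def reconstruction_loss(token_ids, predicted_dists):
    """Mean token-level cross-entropy between a note and its reconstruction.

    token_ids: token indices t_1..t_n of the note N_i.
    predicted_dists: one probability distribution over the vocabulary per
        position, standing in for DEC^rec(ENC(N_i)) (hypothetical input).
    """
    if len(token_ids) != len(predicted_dists):
        raise ValueError("one predicted distribution is needed per token")
    # Negative log-probability assigned to each true token.
    nll = [-math.log(dist[t]) for t, dist in zip(token_ids, predicted_dists)]
    return sum(nll) / len(nll)

# Toy vocabulary of size 3 and a two-token note; each true token gets
# probability 0.5, so the loss equals ln 2.
loss = reconstruction_loss([0, 2], [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]])
print(round(loss, 4))  # 0.6931
```

In an actual training loop only the encoder parameters would receive gradients from this loss, since $\operatorname{DEC^{rec}}$ is kept frozen.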

3.2 Training Objective

Since there is no ground truth summary available, the conventional method to guide the model $\operatorname{M^{sum}}$ through supervised fine-tuning is not feasible. Instead, we adopt a self-supervised strategy to force the model to generate high-quality, patient-centered summaries that can respond to patient-related queries effectively. We introduce a model $\operatorname{R}$, which serves as a query responder. This model has been trained to generate responses to specific queries concerning the patient. For example, if a query pertains to the patient’s readmission status, $\operatorname{R}$ is trained to classify patients based on readmission risk using data available in the patient database.

Given the original nursing note $N_i$ or its summary $S_i$ generated by $\operatorname{M^{sum}}$ as input to the responder $\operatorname{R}$, the training objective is to minimize the discrepancy between the two responses:

$$\min \mathcal{L}_{CrossEntropy}\big(\operatorname{R}(N_i),\operatorname{R}(S_i)\big). \tag{2}$$

This formulation ensures that when responding to a certain patient-related query, using the summary will result in a response similar to that obtained using the original nursing note. To prevent $\operatorname{M^{sum}}$ from generating summaries that are too verbose or that directly “copy-paste” from the original notes, we introduce a length penalty term. Therefore, the final loss function for nursing notes within one batch becomes:

$$\mathcal{L}_{summ}=\frac{1}{K}\sum_{r=1}^{K}\mathcal{L}_{CrossEntropy}\big(\operatorname{R}(N_r),\operatorname{R}(S_r)\big)\times\big(1+\lambda_1 e^{(\alpha-0.5)}\big), \tag{3}$$

where

$$\alpha=\frac{\sum_{r=1}^{K}\operatorname{Len}(S_r)}{\sum_{r=1}^{K}\operatorname{Len}(N_r)}, \tag{4}$$

$K$ is the batch size, and $\operatorname{Len}$ denotes the length of the document in terms of the number of tokens. The hyperparameter $\lambda_1\in[0,1]$ regulates the extent of the penalty. The information flow from $\operatorname{M^{sum}}$ to $\operatorname{R}$ passes through discrete summary tokens, which introduces nondifferentiability into the framework; we resolve it using the straight-through Gumbel-Softmax trick (Bengio et al., 2013; Jang et al., 2017).
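To make the interplay between the response discrepancy and the length penalty in Eqs. (3) and (4) concrete, the sketch below evaluates the batch loss for given responder outputs. It assumes that $\operatorname{R}$ returns a probability distribution over answer classes and that the response to the original note acts as the cross-entropy target; the function names, the `eps` smoothing constant, and the toy numbers are illustrative assumptions rather than parts of the described system.

```python
import math

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between two discrete distributions (target vs. pred)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def summ_loss(note_resp, summ_resp, note_lens, summ_lens, lam1=0.5):
    """Batch loss of Eq. (3) with the length ratio alpha of Eq. (4).

    note_resp / summ_resp: responder outputs R(N_r) and R(S_r) per example.
    note_lens / summ_lens: token counts Len(N_r) and Len(S_r) per example.
    """
    k = len(note_resp)
    alpha = sum(summ_lens) / sum(note_lens)        # Eq. (4)
    penalty = 1.0 + lam1 * math.exp(alpha - 0.5)   # length penalty factor
    mean_ce = sum(cross_entropy(n, s) for n, s in zip(note_resp, summ_resp)) / k
    return mean_ce * penalty

# Same responder outputs, different summary lengths: the longer summary
# (alpha = 0.9) is penalized more than the shorter one (alpha = 0.2).
resp_note = [[0.8, 0.2]]
resp_summ = [[0.7, 0.3]]
short = summ_loss(resp_note, resp_summ, note_lens=[100], summ_lens=[20])
long_ = summ_loss(resp_note, resp_summ, note_lens=[100], summ_lens=[90])
print(short < long_)  # True
```

Because $\alpha$ is computed over the whole batch, the penalty factor is shared by all examples in the batch, so it can be applied once to the averaged cross-entropy.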

3.3 Augmentation Blocks for the Context of the Patient

Figure 3: The proposed Temporal Information Fusion (TIF) block and the Patient Information Augmentation (PIA) block. The figure shows the process of deriving $\mathbf{H}^{dec}$ for generating the $(j+1)$th token in the summary.
Temporal Information Fusion (TIF).

A patient typically has multiple nursing notes ordered in time to document the evolution of her condition. Therefore, the key information crucial for summarizing a patient’s current status is influenced by the context provided by the prior notes. We regard this as temporal information which should be incorporated during summarization to help the model understand the progression of the patient’s condition.

For $N_i$, the embeddings of its previous notes are represented by the embeddings of their respective first tokens, which are special tokens indicating the start of each note. These embeddings are obtained at the last hidden state in the $\operatorname{ENC}$, denoted as $\{\mathbf{h}_1,\mathbf{h}_2,\ldots,\mathbf{h}_{i-1}\}$, where $\mathbf{h}_j\in\mathbb{R}^{d}$ and $d$ is the dimension of the hidden state. We aggregate the representations of the past notes by weighted mean pooling such that the most recent notes receive the largest weight. In practice, we determine initial weights $\beta_j$, $j=1,\ldots,i-1$, for each past note $N_j$ based on its position in the sequence, such that $\beta_1=1,\beta_2=2,\ldots,\beta_{i-1}=i-1$. The weighted mean pooling is performed using normalized weights:

$$\beta'_j=\frac{\beta_j}{\beta_1+\beta_2+\ldots+\beta_{i-1}}, \tag{5}$$
$$\mathbf{h}^{TIF}=\operatorname{MeanPooling}(\beta'_1\mathbf{h}_1,\beta'_2\mathbf{h}_2,\ldots,\beta'_{i-1}\mathbf{h}_{i-1}), \tag{6}$$

where $\mathbf{h}^{TIF}\in\mathbb{R}^{d}$ represents the information fusion of the past notes. As shown in Figure 3, we prepend a special token [TI] at the beginning of the decoder input, representing temporal information with embedding $\mathbf{h}^{TIF}$. Consequently, the initial input to the decoder at the first time step consists of [[TI], [BOS]], where [BOS] is a special token indicating the start of generation. We substitute [TI] with the padding token [PAD] for nursing notes that have no past notes.

The model generates subsequent tokens in the summary in an auto-regressive manner. At each time step, the token produced is appended at the end of the decoder input for generation of subsequent tokens. Therefore, the [TI] token contributes to the generation of each token in the summary, serving as a prompt which consistently guides the model to focus on information about the patient’s past condition.
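The recency-weighted pooling of Eqs. (5) and (6) can be sketched as follows. The sketch assumes that the pooling amounts to a weighted sum with the normalized weights $\beta'_j$ (if MeanPooling additionally divides by the number of past notes, the result differs only by the constant factor $1/(i-1)$). The function names are hypothetical, and returning `None` merely signals that the caller should fall back to [PAD].

```python
def recency_weights(num_past_notes):
    """Normalized weights beta'_j of Eq. (5), with beta_j = j."""
    total = num_past_notes * (num_past_notes + 1) / 2  # 1 + 2 + ... + (i-1)
    return [j / total for j in range(1, num_past_notes + 1)]

def temporal_fusion(past_embeddings):
    """Fuse start-token embeddings h_1..h_{i-1} of the past notes into h^TIF.

    past_embeddings: list of d-dimensional vectors ordered from the oldest
        to the most recent note.
    Returns None when there are no past notes (caller uses [PAD] instead).
    """
    if not past_embeddings:
        return None
    weights = recency_weights(len(past_embeddings))
    dim = len(past_embeddings[0])
    return [
        sum(w * h[c] for w, h in zip(weights, past_embeddings))
        for c in range(dim)
    ]

# Two past notes: the more recent one receives weight 2/3, the older 1/3.
h_tif = temporal_fusion([[0.0, 0.0], [3.0, 6.0]])
print([round(x, 6) for x in h_tif])  # [2.0, 4.0]
```

With linearly increasing weights, the most recent note always dominates the fused representation, while older notes still contribute a diminishing share.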

Patient Information Augmentation (PIA).

We aim at obtaining summaries focusing on the patient’s condition. To aid this, we explicitly incorporate patient-level information into the model through a cross-attention mechanism, which facilitates the interaction of information on different levels of representation learning. A patient’s metadata $PA$ typically comprises basic information recorded for the patient’s admission, including age, gender, existing diagnoses, and performed procedures. We convert this metadata into patient information in natural language (one example presented in Appendix A.1), and then encode it using $\operatorname{ENC}$ to derive the patient embedding $\mathbf{H}^{PA}\in\mathbb{R}^{z\times d}$ for patient $PT$, where $z$ represents the number of tokens in the patient information. The encoder also learns the embedding of the source note, $\mathbf{H}^{enc}\in\mathbb{R}^{n\times d}$, where $n$ denotes the number of tokens in the note given as input to the encoder. On the decoder $\operatorname{DEC}$ side, let us assume the tokens input to the decoder at the current timestep are [[TI], [BOS], $y_1$, …, $y_j$]. Consequently, the hidden representation passed to the $l$th decoder layer is $\mathbf{H}_l^{dec}\in\mathbb{R}^{(j+2)\times d}$. The hidden representation $\mathbf{H}_l^{dec}$ is processed and updated in each decoder layer using the conventional self-attention and cross-attention with $\mathbf{H}^{enc}$. Furthermore, we augment the decoder layer with patient information by also performing cross-attention between the hidden representation $\mathbf{H}_l^{dec}$ and the patient embedding $\mathbf{H}^{PA}$. This facilitates the fusion of patient- and note-level information:

$$\mathbf{H}_{l+1}^{dec}=\operatorname{MHCA}\big(\mathbf{H}^{enc},\operatorname{MHSA}(\mathbf{H}_l^{dec})\big)+\lambda_2\times\operatorname{MHCA}\big(\mathbf{H}^{PA},\operatorname{MHSA}(\mathbf{H}_l^{dec})\big), \tag{7}$$

where $\operatorname{MHCA}$ and $\operatorname{MHSA}$ respectively denote Multi-Head Cross-Attention and Multi-Head Self-Attention (Vaswani et al., 2017). $\lambda_{2}\in[0,1]$ is a hyperparameter that controls the importance of patient meta information. $\mathbf{H}_{l+1}^{dec}$ is the input to the next decoder layer, or if the $l$th layer is the final layer, it is the input to the language modeling head.
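To make the fusion in Eq. (7) concrete, the following is a minimal sketch in plain Python. It is not the implementation used in this work: it assumes a single attention head without learned projections, and it omits the causal mask, residual connections, layer normalization, and feed-forward sublayer of a real BART decoder layer. The function names `attention` and `fused_decoder_layer` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys_values):
    """Single-head scaled dot-product attention (stand-in for MHSA/MHCA).

    Assumption: keys and values are the same vectors and no projections are learned.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[c] for w, kv in zip(weights, keys_values)) for c in range(d)])
    return out


def fused_decoder_layer(h_dec, h_enc, h_pa, lambda_2):
    """Eq. (7): MHCA(H_enc, MHSA(H_dec)) + lambda_2 * MHCA(H_PA, MHSA(H_dec))."""
    assert 0.0 <= lambda_2 <= 1.0
    self_attended = attention(h_dec, h_dec)        # MHSA(H_l^dec)
    note_ctx = attention(self_attended, h_enc)     # cross-attention to the note
    patient_ctx = attention(self_attended, h_pa)   # cross-attention to patient info
    return [[n + lambda_2 * p for n, p in zip(n_row, p_row)]
            for n_row, p_row in zip(note_ctx, patient_ctx)]


# Toy shapes: j + 2 = 3 decoder positions, d = 2.
h_dec = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
h_enc = [[1.0, 0.0], [0.0, 1.0]]
h_pa = [[0.5, 0.5]]
h_next = fused_decoder_layer(h_dec, h_enc, h_pa, lambda_2=0.5)
```

Setting `lambda_2` to 0 recovers a decoder that ignores the patient embedding, which is one way to read the role of this hyperparameter.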

With these two augmentation blocks, the computation of the final decoder state $\mathbf{H}^{dec}\in\mathbb{R}^{(j+2)\times d}$ for generating the $(j+1)$th token in the summary of the note $N_{i}$ is abstracted as:

$$[\mathbf{H}^{enc},\mathbf{H}^{PA}]=\operatorname{ENC}(N_{i},PA),\qquad\mathbf{H}^{dec}=\operatorname{DEC}(\mathbf{H}^{enc},\mathbf{H}^{PA},[[TI],[BOS],y_{1},\ldots,y_{j}]),\tag{8}$$
$$\mathbf{v}=\operatorname{LMH}(\mathbf{H}^{dec}),\qquad\mathbf{v^{\prime}}=\operatorname{ST-GumbelSoftmax}(\mathbf{v}).\tag{9}$$

$\operatorname{LMH}$ (Language Modeling Head) maps $\mathbf{H}^{dec}$ to a probability vector $\mathbf{v}\in\mathbb{R}^{vs}$ over the vocabulary of size $vs$. $\mathbf{v}$ is processed using the straight-through Gumbel-softmax trick, denoted as $\operatorname{ST-GumbelSoftmax}$, resulting in a one-hot vector $\mathbf{v^{\prime}}\in\mathbb{R}^{vs}$ providing the index of the $(j+1)$th token.
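As an illustration of the sampling step in Eq. (9), the sketch below draws a one-hot token vector from a probability vector with Gumbel noise. It is a simplified stand-in and not the code used in this work: plain Python has no automatic differentiation, so the function only returns both the hard one-hot vector and the relaxed soft sample. In an autograd framework, the straight-through estimator would emit the hard vector in the forward pass and propagate gradients through the soft one. The function name and the `temperature` parameter are assumptions.

```python
import math
import random


def st_gumbel_softmax(probs, temperature=1.0, rng=random):
    """Return (hard, soft): a one-hot sample and its relaxed counterpart."""
    eps = 1e-20
    perturbed = []
    for p in probs:
        u = rng.random()
        gumbel = -math.log(-math.log(u + eps) + eps)  # Gumbel(0, 1) noise
        perturbed.append((math.log(p + eps) + gumbel) / temperature)
    # Softmax over the perturbed, temperature-scaled log-probabilities.
    m = max(perturbed)
    exps = [math.exp(x - m) for x in perturbed]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Hard one-hot vector at the position of the largest soft value.
    idx = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == idx else 0.0 for i in range(len(soft))]
    return hard, soft


# Toy vocabulary of size 3; the seed only makes the example repeatable.
hard, soft = st_gumbel_softmax([0.7, 0.2, 0.1], temperature=0.5, rng=random.Random(0))
```

Because the one-hot output can be multiplied with an embedding matrix, the generated token can be fed to a downstream model while the whole pipeline stays differentiable with respect to the soft sample.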

4 Experiments

4.1 Data

We utilize MIMIC-III (Johnson et al., 2016), a widely used real-world EHR database, for our experiments. In MIMIC-III, clinical notes in the “NOTEEVENTS” table are organized by admission, and a single patient may have multiple admissions. Since the information in notes from different admissions of the same patient is discontinuous, we treat notes in each admission independently. We focus on nursing notes within the clinical notes. After preprocessing, filtering, and sampling (details in Appendix A.2), the numbers of nursing notes in the training, validation, and test sets are 149015, 10001, and 3079, and the corresponding numbers of admissions are 13893, 1000, and 1156, respectively.

4.2 Types of Queries

This section presents queries used in our experiments, and more details can be found in Appendix A.3. Two principles are followed when determining the queries: (1) The query should be closely related to the patient and learnable by the query responder $\operatorname{R}$; (2) Data required to train $\operatorname{R}$ should be easily available without excessive data annotation. Below we propose four different types of queries. In each of these, the query responder $\operatorname{R}$ is a classification model, which classifies patients according to a specific aspect of a patient’s status. Part of the training data is used to train the query responder $\operatorname{R}$. When using $\operatorname{R}$ to guide the summarization, we input the summary and the original note to predict the classification probabilities, and minimize the discrepancy between these predictions, as described in Section 3.2. As an additional query, we include a simple baseline by minimizing the semantic distance between the note and its summary, measured by cosine similarity.

Contrastive Next Note Prediction.

Given a nursing note pair $(N_{i},N^{\prime}_{i})$, we regard the query about whether $N^{\prime}_{i}$ is the successor note of $N_{i}$ as a prediction of the patient’s future status. To train the query responder $\operatorname{R}$ for the next note prediction, we create two note pairs for each nursing note, where the positive pair comprises the note and its successor in the sequence, and the negative pair contains the note and a randomly chosen non-consecutive note. If $N_{i}$ is the patient’s last nursing note, we use the patient’s discharge summary and a random note from other patients to construct the positive and negative pairs. The query is formulated as binary classification, and the output of $\operatorname{R}$ is the probability of each pair being the positive pair containing the consecutive notes.
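The pair construction described above can be sketched as follows. This is an illustrative reading rather than the preprocessing code of this work. The function name `build_note_pairs` and its arguments are hypothetical, and two details are assumptions: "non-consecutive" is taken to exclude the note itself and its direct successor, and admissions too short to contain such a note fall back to notes from other patients.

```python
import random


def build_note_pairs(notes, discharge_summary, other_patient_notes, rng=random):
    """Create one positive and one negative ((note, candidate), label) per note.

    `notes` is the chronologically ordered list of nursing notes of one admission.
    """
    pairs = []
    last = len(notes) - 1
    for i, note in enumerate(notes):
        if i < last:
            positive = notes[i + 1]
            # Assumption: exclude the note itself and its direct successor.
            candidates = [n for k, n in enumerate(notes) if k not in (i, i + 1)]
            # Assumption: fall back to other patients' notes for short admissions.
            negative = rng.choice(candidates or other_patient_notes)
        else:
            # Last note: discharge summary as positive, other patient's note as negative.
            positive = discharge_summary
            negative = rng.choice(other_patient_notes)
        pairs.append(((note, positive), 1))
        pairs.append(((note, negative), 0))
    return pairs


# Toy admission with four notes (placeholder strings instead of real text).
pairs = build_note_pairs(["n0", "n1", "n2", "n3"], "discharge", ["x0", "x1"],
                         rng=random.Random(1))
```

Each note contributes exactly one positive and one negative pair, so the resulting binary classification data is balanced by construction.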

Readmission Prediction.

Readmission information is easily retrieved from the hospital’s database and is closely related to the patient. The readmission prediction query is formulated as a 2-class classification task to predict whether the patient will be readmitted within 30 days of discharge, which reflects the patient’s future condition.

Phenotype Classification.

Classifying a patient’s diagnosis status or phenotype is a query to the patient’s current status. Following Harutyunyan et al. (2019), phenotype classification is formulated as a multi-label classification, where ICD-9 diagnosis codes mapped by HCUP CCS code groups (https://hcup-us.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp) are categorized into 25 classes. Therefore, the responder outputs the probability distribution of the phenotype as $[p_{1},p_{2},\ldots,p_{25}]$.

Readmission Prediction and Phenotype Classification.

We investigate the combined utilization of two queries, readmission prediction and phenotype classification, to see if joint guidance is more effective. After obtaining the result of readmission prediction $[p^{r}_{1},p^{r}_{2}]$ and the result of phenotype classification $[p^{c}_{1},\ldots,p^{c}_{25}]$, we integrate them by converting the results into a 50-class probability distribution.

4.3 Experiment Settings

As the base model, we use BART-Large-CNN (https://huggingface.co/facebook/bart-large-cnn), which is a BART model (Lewis et al., 2020) fine-tuned on CNN Daily Mail and specialized in text summarization. As the query responder, we use Clinical-Longformer (Li et al., 2022), chosen for its ability to handle a long context. It is fine-tuned on the selected queries. Hyperparameter settings for QGSumm and the query responder are presented in Appendix A.4.

Since our method is designed for scenarios where reference summaries are unavailable, we compare it with baselines in both zero-shot and few-shot settings: Zero-Shot: BART-Large-CNN (abbreviated as BART-zs), BioMistral-7B (Labrak et al., 2024) (abbreviated as BioMistral-zs), and GPT-4 (OpenAI, 2023); 10-Shot: BART-Large-CNN (abbreviated as BART-fs), Pegasus (Zhang et al., 2020a), and BioMistral-7B (abbreviated as BioMistral-fs). We use summaries generated by GPT-4 for the 10-shot fine-tuning and use one-shot in-context learning when prompting GPT-4 and BioMistral-7B. We also include results from two extractive methods, TextRank (Mihalcea and Tarau, 2004) and Lead-40%, for reference. In TextRank, we utilize MPNet (Song et al., 2020) to obtain sentence embeddings. In Lead-40%, we use the first 40% of the content of the note as a summary. More details about baselines and in-context learning/few-shot adaptation can be found in Appendix B.1.

4.4 Evaluation Metrics

Evaluating the quality of text summarization is challenging (Bhandari et al., 2020), especially when reference summaries are unavailable. Therefore, we employ multiple metrics covering different aspects of the summaries, providing a comprehensive evaluation.

Automatic Evaluation Metrics.

Metrics in the automatic evaluation are divided into three categories: 1) predictiveness, 2) factuality and consistency, and 3) conciseness.

Metrics for predictiveness assess whether the summary adequately contains patient key information, quantified as the ability to predict the patient’s condition using the summary as input. Specifically, we conduct readmission prediction and phenotype classification using summaries from baselines and our method. We employ the summaries generated by different methods to fine-tune the query responder, resulting in multiple predictors, one for each method. For readmission prediction, we report the weighted F1 and F1 of the positive class (“being readmitted”), and for phenotype classification, we report the F1-Macro score.

For consistency and factuality, we consider: (1) UMLS-Recall. This metric measures the biomedical information consistency by comparing the set of medical concepts in the summary with that in the original note. We employ QuickUMLS (Soldaini and Goharian, 2016) to extract Unified Medical Language System (UMLS) biomedical concepts from the nursing note and its summary. Recall is the proportion of concepts in the original note that are present in the summary. (2) UMLS-FDR. FDR denotes False Discovery Rate. Analogously to UMLS-Recall, we compute the proportion of medical concepts in the summary that do not appear in the original note. (3) FactKB. With FactKB (Feng et al., 2023), an evaluation model measuring the factuality of a summary and its original text, we evaluate whether the summaries are consistent with the nursing notes from the perspective of their overall semantic information. (4) BARTScore. This metric evaluates the general consistency of summaries in a text-generation manner using BART, which also considers aspects such as the structure, coherence, and fluency of the summary (Yuan et al., 2021).
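The two concept-based metrics reduce to set operations once the concepts have been extracted. The sketch below assumes the UMLS concept identifiers have already been obtained for the note and the summary (for example with QuickUMLS, which is not called here); the function names and the identifiers in the example are made-up placeholders.

```python
def umls_recall(note_concepts, summary_concepts):
    """Proportion of concepts in the original note that also appear in the summary."""
    note, summary = set(note_concepts), set(summary_concepts)
    if not note:
        return 0.0  # Assumption: define recall as 0 when the note has no concepts.
    return len(note & summary) / len(note)


def umls_fdr(note_concepts, summary_concepts):
    """Proportion of concepts in the summary that do not appear in the original note."""
    note, summary = set(note_concepts), set(summary_concepts)
    if not summary:
        return 0.0  # Assumption: define FDR as 0 when the summary has no concepts.
    return len(summary - note) / len(summary)


# Placeholder concept identifiers, not real UMLS CUIs.
note_ids = {"C001", "C002", "C003", "C004"}
summary_ids = {"C001", "C002", "C999"}
recall = umls_recall(note_ids, summary_ids)  # 2 of 4 note concepts are covered
fdr = umls_fdr(note_ids, summary_ids)        # 1 of 3 summary concepts is unsupported
```

A purely extractive summary can only contain concepts from the note, which is consistent with the near-zero UMLS-FDR that extractive methods obtain.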

Finally, we report the length of the generated summary as a percentage of the original note’s length to assess conciseness. We do not enforce a strict maximum length for baselines because we believe the model should be capable of determining the appropriate length autonomously.

Metrics Used in the Manual Evaluation by a Clinician.

Without a reference summary, automatic evaluation metrics may not fully capture the quality of the summary. Therefore, we invite a clinician to conduct a manual evaluation of the summaries of 25 nursing notes. The clinician evaluates summaries from multiple methods in a blinded and randomized order. Each summary is rated on four aspects: (1) Informativeness: whether the summary adequately captures essential information regarding the patient’s condition in the original note; (2) Fluency: whether the summary is well-written and easy to understand; (3) Consistency: how well the summary aligns with the original nursing note in factuality; (4) Relevance: whether the summary is concise and free of unnecessary information. The score for each aspect ranges from 1 to 5. More detailed grading criteria are presented in Appendix B.2.

5 Results and Discussion

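Column groups: Predictiveness (Readmission Weighted F1, Readmission F1, Phenotype Macro F1); Consistency and Factuality (UMLS-Recall, UMLS-FDR, BARTScore, FactKB); Conciseness (Length).

| Type | Method | Readmission Weighted F1 | Readmission F1 | Phenotype Macro F1 | UMLS-Recall | UMLS-FDR | BARTScore | FactKB | Length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Orig. Notes | 85.2 ± 0.5 | 19.7 ± 1.9 | 28.7 ± 0.5 | - | - | - | - | - |
| Zero-Shot | BART-zs | 78.8 ± 0.4 | 11.1 ± 0.9 | 20.5 ± 0.3 | 36.4 ± 9.0 | 8.70 ± 6.2 | -1.89 ± 0.31 | 0.78 ± 0.16 | 31.9% |
| Zero-Shot | GPT-4 | 85.6 ± 0.6 | 21.5 ± 2.0 | 23.6 ± 0.6 | 59.2 ± 8.3 | 44.2 ± 7.6 | -3.13 ± 0.47 | 0.77 ± 0.17 | 53.6% |
| Zero-Shot | BioMistral-zs | 80.1 ± 0.6 | 10.7 ± 1.3 | 21.4 ± 0.4 | 55.4 ± 9.9 | 50.0 ± 8.7 | -2.80 ± 0.45 | 0.68 ± 0.14 | 69.2% |
| 10-Shot | BART-fs | 82.2 ± 0.5 | 14.4 ± 1.3 | 21.1 ± 0.4 | 52.5 ± 7.3 | 44.5 ± 7.1 | -2.72 ± 0.36 | 0.76 ± 0.15 | 65.0% |
| 10-Shot | BioMistral-fs | 81.7 ± 0.4 | 10.2 ± 1.1 | 22.0 ± 0.4 | 57.2 ± 10.2 | 49.1 ± 7.8 | -2.97 ± 0.43 | 0.70 ± 0.15 | 68.8% |
| 10-Shot | Pegasus | 80.5 ± 0.8 | 12.5 ± 1.8 | 18.3 ± 0.6 | 35.1 ± 8.4 | 52.6 ± 7.7 | -3.07 ± 0.40 | 0.70 ± 0.18 | 57.4% |
| QGSumm | -Similarity | 79.5 ± 0.6 | 12.0 ± 1.2 | 22.4 ± 0.4 | 53.1 ± 7.2 | 20.7 ± 6.7 | -2.22 ± 0.31 | 0.82 ± 0.13 | 51.7% |
| QGSumm | -NextNote | 80.8 ± 0.6 | 11.7 ± 1.4 | 23.2 ± 0.6 | 56.4 ± 8.0 | 35.2 ± 7.1 | -2.32 ± 0.33 | 0.77 ± 0.11 | 49.3% |
| QGSumm | -Readmission | 82.4 ± 0.5 | 18.2 ± 1.6 | 23.9 ± 0.5 | 58.2 ± 7.5 | 22.7 ± 6.5 | -2.30 ± 0.37 | 0.78 ± 0.14 | 46.2% |
| QGSumm | -Phenotype | 81.9 ± 0.5 | 13.4 ± 1.5 | 25.6 ± 0.6 | 58.5 ± 7.4 | 36.2 ± 6.9 | -2.34 ± 0.35 | 0.79 ± 0.13 | 48.0% |
| QGSumm | -Re+Ph | 84.2 ± 0.5 | 17.2 ± 1.6 | 25.1 ± 0.5 | 58.8 ± 7.9 | 24.1 ± 6.4 | -2.26 ± 0.35 | 0.80 ± 0.14 | 48.2% |
| Extractive | Lead-40% | 83.1 ± 0.6 | 12.6 ± 1.5 | 21.7 ± 0.5 | 42.7 ± 6.7 | 0.30 ± 2.6 | -0.87 ± 0.11 | 0.99 ± 0.06 | 40.0% |
| Extractive | TextRank | 81.9 ± 0.7 | 14.4 ± 1.7 | 23.3 ± 0.5 | 58.5 ± 7.9 | 0.08 ± 1.4 | -0.90 ± 0.15 | 0.95 ± 0.12 | 51.9% |

Table 1: Results of automatic evaluation. Lower values are better for Length and UMLS-FDR, higher values for the other metrics. The values after ± denote standard deviation. “Orig. Notes” means using original nursing notes as such for readmission and phenotype prediction. “Re+Ph” means using “Readmission Prediction and Phenotype Classification” as the query. Results from best and 2nd best method under each metric are bolded and underlined. Extractive methods are for reference and not considered in comparison.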

5.1 Results of the Automatic Evaluation

Predictiveness.

Results are shown in Table 1. In the readmission prediction task, GPT-4 performs best, producing summaries that enable more accurate prediction of a patient’s status. Our method also exhibits excellent performance, surpassing all few-shot methods. The main reference for our method is BART-zs, as it is the base model in our method and hence represents performance without the proposed novel components. We see that our method outperforms BART-zs significantly in weighted F1 score (84.2 with the Re+Ph query vs. 78.8) and F1 score of the positive class (18.2 with the Readmission query vs. 11.1). This shows the effectiveness of the adaptation strategy guiding the model with useful queries. Interestingly, we find that using the summary from GPT-4 for this task outperforms using the original notes. Similarly, the summary from our method has performance close to that of using the original notes, highlighting the importance of high-quality summaries. In phenotype classification, our method with the query focusing on patients’ phenotype performs the best, outperforming BART-zs in Macro F1 (25.6 vs. 20.5). Even when using the similarity alone as a guiding signal, our method is still better than BART-zs (22.4 vs. 20.5) or BART-fs (22.4 vs. 21.1). Although specialized in text summarization, Pegasus has weak performance on all predictiveness metrics.

Conciseness, Consistency, and Factuality.

As shown in Table 1, there is an expected trade-off between UMLS-Recall and the summary’s length. Our method strikes a good balance between medical information consistency (measured by UMLS-Recall) and conciseness. GPT-4 captures more medical information, but achieves this with summaries which are less concise. Conversely, BART-zs can produce concise summaries but fails to adequately capture medical concepts. Even if the 10-shot fine-tuning clearly improves predictiveness and UMLS-Recall, BART-fs still struggles to generate a concise summary. Similar to BART-fs, both BioMistral-zs and BioMistral-fs tend to produce summaries that are not concise.

Summaries generated from BART-zs maintain high levels in factuality (measured by UMLS-FDR and FactKB) and general consistency (measured by BARTScore). Our method also has strong performance on relevant metrics. We find that although LLMs, such as GPT-4 and BioMistral, excel in language understanding, they do not perform well on factuality and general consistency. One possible reason is their tendency to rephrase or even expand upon the original notes, potentially introducing inconsistent information. However, the metrics can be influenced by text style and layout, which may cause summaries that are more different from the original note to score relatively lower, even if they are more fluent. For this reason, our model also scores lower than the base model BART-zs on some metrics. The results and limitations will be further discussed in Sections 5.4 and 5.5.

Effectiveness of the Query Guidance.

According to the results shown in Table 1, the performance with different queries varies. We can observe: (1) Regarding predictiveness, employing queries closely related to patients and focusing on readmission and phenotype information yields superior performance compared to other queries. As expected, the method with phenotype-related queries performs the best in phenotype classification, while the method with readmission-related queries is the best in readmission prediction. This highlights the effectiveness of guiding the summarization with queries, and different queries enable the summary to concentrate on different aspects of the original note. (2) Using similarity as guidance can produce summaries that are more similar to the original notes, resulting in higher scores on general consistency and factuality. However, summaries generated under this configuration tend to be longer and often sacrifice predictiveness and informativeness regarding medical concepts, demonstrating the limitations of the unconstrained guidance signal. (3) When employing joint guidance with both readmission and phenotype information, our method consistently achieves excellent performance across all metrics. This indicates that combining different guidance signals can help in producing better summaries, and further research is needed to explore this aspect in depth.

5.2 Results of the Manual Evaluation by a Clinician

To avoid excessive manual work, we select three baselines to include in the manual evaluation: BART-zs, GPT-4, and BioMistral-fs. The justification for selecting these methods is two-fold: First, BART-zs is the base model in our method and hence the main baseline, demonstrating the benefits of the novel components. Second, GPT-4 and BioMistral-fs are well-known strong baselines and they had good performance in the automatic evaluation. We use “Readmission Prediction and Phenotype Classification” as the query for our method. Average scores for each method across four metrics are shown in Figure 4.

QGSumm vs BART-zs.

Our method significantly outperforms the base model BART-zs on all four metrics. This indicates that the proposed domain adaptation strategy enables the model to generate higher-quality summaries from the medical personnel’s perspective, containing refined and important patient information with fewer hallucinations and increased readability. Although on average the summaries generated by our method are longer than those produced by the base model, our method achieves a higher relevance score from the clinician, suggesting the base model struggles to identify key information and focuses on unnecessary details. Our model can effectively enhance this aspect.

QGSumm vs GPT-4 and BioMistral-fs.

Due to the LLMs’ strong language understanding capability and sufficient medical knowledge, GPT-4 and BioMistral-fs can adequately summarize key information in nursing notes, receiving approximately the same average score in Informativeness as our method. Additionally, they excel in generating fluent summaries by rephrasing and clarifying abbreviations, receiving a slightly higher Fluency score from the clinician than our method. However, the rephrasing can introduce factual inconsistencies, and the tendency to infer additional content may reduce factuality. Consequently, our model has a higher average score in Consistency, which is essential in the clinical setting. Furthermore, it generates more concise summaries, leading to a higher average score in Relevance. However, due to the small sample size, the only statistically significant difference in these comparisons was the improvement of our method compared to BioMistral-fs in Consistency, and further work is needed for conclusive results.

Calculating the Significance.

We calculated the significance in Figure 4 using a two-tailed Binomial test on the pairwise win-rates. In detail, we count the number of nursing notes where our method has a score higher or lower than a comparison method and test for the null hypothesis that the win-probability is 0.5.
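The test described above can be reproduced with a few lines of standard-library Python. This is a sketch under the stated reading that ties are discarded before testing; the function names are hypothetical and the scores in the example are made-up placeholders, not the clinician's ratings.

```python
from math import comb


def win_loss_counts(ours, theirs):
    """Count notes where our score is strictly higher (wins) or lower (losses)."""
    wins = sum(a > b for a, b in zip(ours, theirs))
    losses = sum(a < b for a, b in zip(ours, theirs))
    return wins, losses


def two_tailed_binomial_p(wins, losses):
    """Exact two-tailed binomial test of H0: win-probability = 0.5 (ties excluded)."""
    n = wins + losses
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = pmf[wins]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Placeholder ratings for four notes (illustrative only).
wins, losses = win_loss_counts([5, 4, 3, 3], [4, 4, 2, 5])
p_value = two_tailed_binomial_p(wins, losses)
```

With only 25 rated notes and ties removed, the number of informative comparisons per pair of methods is small, so even a consistent advantage may fail to reach significance.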

Figure 4: Results of the manual evaluation by a clinician. The average scores in four metrics across 25 summarized notes are reported for QGSumm, BART-zs, GPT-4, and BioMistral-fs. “*” denotes the result of the significance test.

5.3 Effectiveness of Augmentation Blocks

We analyze the effects of the proposed augmentation blocks through an ablation study. We consider three settings: removing the Patient Information Augmentation block (denoted as w/o PIA); removing the Temporal Information Fusion block (denoted as w/o TIF); removing both blocks (denoted as w/o PIA+TIF). We use “Readmission Prediction and Phenotype Classification” as the query in our method. The results of the ablation study are shown in Table 2. The decreased weighted F1 and macro F1 scores indicate that both augmentation blocks enhance the predictiveness of summaries. This implies that information in patient metadata and previous notes can effectively prompt the inference of the current and future status of patients. Removing the TIF block causes a larger decrease in the F1 scores, suggesting that temporal information is more important than the patient’s metadata in guiding the generation of summaries to focus on the progression of the patient’s status.

On the other hand, the incorporation of patient metadata can lead to more faithful summaries, as removing PIA degrades the performance on UMLS-FDR and FactKB, which are related to factuality, more than removing TIF does. In contrast, the TIF block does not have a significant impact on the factuality. However, according to the UMLS-Recall score, it encourages the model to capture more medical information, thereby improving the consistency of the summary.

Table 2: Results of the ablation study. We present scores of five metrics for QGSumm, and show the change in the value of the metric after removing different augmentation blocks. ↓ denotes a decrease in the score and ↑ denotes an increase. We see that the augmentations are consistently useful.

| Setting | Weighted F1 | Macro F1 | UMLS-Recall | UMLS-FDR | FactKB |
| --- | --- | --- | --- | --- | --- |
| QGSumm | 84.2 | 25.1 | 58.8 | 24.1 | 0.80 |
| w/o PIA | ↓ 2.6 | ↓ 1.4 | ↓ 1.4 | ↑ 4.7 | ↓ 0.04 |
| w/o TIF | ↓ 4.1 | ↓ 1.6 | ↓ 3.8 | ↑ 1.9 | ↓ 0.01 |
| w/o PIA+TIF | ↓ 4.8 | ↓ 2.3 | ↓ 4.4 | ↑ 5.6 | ↓ 0.04 |

5.4 Case Study

Figure 5: One artificial nursing note and its summaries from QGSumm, BART-zs, GPT-4, and BioMistral-fs, respectively.

One artificial nursing note and its corresponding summaries generated by QGSumm, BART-zs, GPT-4, and BioMistral-fs are presented in Figure 5. In the original nursing note, the content highlighted in blue indicates information included in the summary generated by our approach. We can see that our approach captures most of the important patient information. However, some details, such as cardiovascular and respiratory conditions, are overlooked. The summary from BART-zs only covers information from the first half of the nursing note, suggesting a limitation in understanding long contexts. Summaries from GPT-4 and BioMistral-fs contain more patient information but lack conciseness. These models achieve fluency by rephrasing notes and expanding abbreviations. However, BioMistral-fs struggles with maintaining factuality, often reasoning excessively about the patient’s personal information and condition.

5.5 Discussion

User-need-oriented summarization.

A high-quality summary should facilitate efficient understanding of the relevant content for users. In the context of nursing notes, this means the summary should capture the patient’s condition. Our method employs patient-related queries, indirectly ensuring that the summary centers around the patient’s status. The summaries generated with different queries can be seen as coming from distinct conditional distributions and parts of the semantic space. As discussed in Section 5.1, the queries can guide summaries to focus on specific aspects of the original note. Therefore, by selecting appropriate queries, we can control preferences for desired content and adjust granularity, which facilitates a more flexible and user-need-oriented summary generation. For instance, broad queries about the patient’s condition will result in a summary that focuses on the patient’s overall condition, while more detailed queries, such as those regarding cardiovascular health, could produce a summary that focuses on that specific aspect.

Design choices for information augmentation.

One challenge is how to efficiently integrate information into the model while avoiding excessive computational cost. We utilize cross-attention to allow the patient’s metadata to efficiently interact on multiple levels with the process of generating the summary. In contrast, for temporal information in previous notes, using cross-attention in a similar manner might make it difficult for the model to balance attention across the current note, past notes, and patient information, in addition to introducing computational challenges with long sequences of notes. Hence, we adopt a simple but effective strategy: representing the temporal information, obtained by weighted mean pooling from previous note representations, as the first token of the decoder’s input. This strategy is intuitive, as information from previous notes naturally precedes the summary of the current note.
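The temporal strategy described above can be illustrated with a short sketch. It assumes that each previous note is already represented by a single vector and that the pooling weights are given; how those weights are obtained is not restated in this section, so the values below are placeholders. The function names `temporal_token` and `build_decoder_input` are hypothetical.

```python
def temporal_token(prev_note_vectors, weights):
    """Weighted mean pooling of previous note representations into one vector [TI]."""
    total = sum(weights)
    d = len(prev_note_vectors[0])
    return [sum(w * v[c] for w, v in zip(weights, prev_note_vectors)) / total
            for c in range(d)]


def build_decoder_input(ti, bos, summary_token_vectors):
    """Decoder input [[TI], [BOS], y_1, ..., y_j] with j + 2 positions."""
    return [ti, bos] + list(summary_token_vectors)


# Two previous notes with d = 2; the weights are placeholder values.
ti = temporal_token([[1.0, 0.0], [0.0, 1.0]], weights=[1.0, 3.0])
decoder_input = build_decoder_input(ti, bos=[0.0, 0.0], summary_token_vectors=[[0.2, 0.4]])
```

Because the history is compressed into one position, the decoder sequence grows by a single token regardless of how many previous notes exist, which is the computational advantage over attending to all past notes.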

Interpretation of the evaluation metrics.

The metrics used in the automatic evaluation have limitations as they do not conclusively reflect the quality of the summary, and come with trade-offs. For example, a good performance in predictiveness and medical information consistency (UMLS-Recall) may not be due to the high quality of the summary but rather caused by copying the source note, resulting in a lack of conciseness and fluency. Conversely, as the summary becomes more concise, it may become less informative. Furthermore, models used to measure factuality and general consistency have inherent biases. As they are based on general semantics, they are potentially weak at recognizing patient-related information due to the dissimilarity between their training domain and clinical data, and they often prioritize text style and structure. Finally, since BARTScore is derived from the BART model, summaries generated by BART have a bias of scoring relatively higher with this metric. We attempt to mitigate the impact of these limitations by comprehensively considering multiple metrics, and including the manual evaluation by a clinician, but there remains a need for more conclusive evaluation metrics.

Limitations.

(1) Our current approach produces summaries of individual nursing notes, lacking the long context and support for multiple note summarization. (2) There is room for more exploration on the formulation of the clinical queries. We do not employ generative queries but only queries related to the classification of the patient’s status. Also, when investigating the combined effects of multiple queries, further exploration using multi-task learning methods could be beneficial. (3) Due to the workload, the number of notes assessed in the manual evaluation is limited to 25. A larger sample size would allow more conclusive comparisons of the strengths and weaknesses of the methods.

Conclusion.

We presented a novel method for self-supervised nursing note summarization, where the main innovation was the introduction of query guidance, which successfully directed the summaries to include desired content. In the manual evaluation by a professional clinician, our method significantly outperformed a specialized open text summarization model, BART-Large-CNN, in all metrics. Because this model was the base model of our method, the result highlights the usefulness of the novel developments. Of the other baselines, the proprietary GPT-4 had the closest performance to our method and was better than the other baselines. In the automatic evaluation, GPT-4 was better than our method in predictiveness, but, importantly, our method outperformed GPT-4 in factual consistency, having fewer hallucinated facts without sacrificing the correct content. The same trend was seen in the manual evaluation as higher average consistency for our method, although the difference was not statistically significant. Hence, our approach can produce more reliable summaries, clearing obstacles for responsible clinical use of LLMs. From the machine learning perspective, our method demonstrates the feasibility of domain adaptation for pre-trained text summarization models without explicit supervision, and the effectiveness of self-supervised strategies to guide conditional summarization to specific interests.

References

  • Abacha et al. (2023) Asma Ben Abacha, Wen-wai Yim, Yadan Fan, and Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2291–2302, 2023.
  • Adams et al. (2022) Griffin Adams, Han-Chin Shing, Qing Sun, Christopher Winestock, Kathleen Mckeown, and Noémie Elhadad. Learning to revise references for faithful summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4009–4027, 2022.
  • Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Bhandari et al. (2020) Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, 2020.
  • Chu and Liu (2019) Eric Chu and Peter Liu. Meansum: A neural model for unsupervised multi-document abstractive summarization. In International Conference on Machine Learning, pages 1223–1232. PMLR, 2019.
  • Clarke et al. (2013) Martina A Clarke, Jeffery L Belden, Richelle J Koopman, Linsey M Steege, Joi L Moore, Shannon M Canfield, and Min S Kim. Information needs and information-seeking behaviour analysis of primary care physicians and nurses: a literature review. Health Information & Libraries Journal, 30(3):178–190, 2013.
  • Dettmers et al. (2024) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
  • Elsahar et al. (2021) Hady Elsahar, Maximin Coavoux, Jos Rozen, and Matthias Gallé. Self-supervised and controlled multi-document opinion summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1646–1662, 2021.
  • Feng et al. (2023) Shangbin Feng, Vidhisha Balachandran, Yuyang Bai, and Yulia Tsvetkov. Factkb: Generalizable factuality evaluation using language models enhanced with factual knowledge. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 933–952, 2023.
  • Gao et al. (2022) Yanjun Gao, Dmitriy Dligach, Timothy Miller, Dongfang Xu, Matthew MM Churpek, and Majid Afshar. Summarizing patients’ problems from hospital progress notes using pre-trained sequence-to-sequence models. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2979–2991, 2022.
  • Gao et al. (2023) Yanjun Gao, Dmitriy Dligach, Timothy Miller, and Majid Afshar. Overview of the problem list summarization (probsum) 2023 shared task on summarizing patients’ active diagnoses and problems from electronic health record progress notes. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 461–467, 2023.
  • Hall and Walton (2004) Amanda Hall and Graham Walton. Information overload within the health care system: a literature review. Health Information & Libraries Journal, 21(2):102–108, 2004.
  • Harutyunyan et al. (2019) Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific Data, 6(1):96, 2019.
  • Hosking et al. (2023) Tom Hosking, Hao Tang, and Mirella Lapata. Attributable and scalable opinion summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8488–8505, 2023.
  • Huang et al. (2019) Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342, 2019.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR 2017), 2017.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
  • Kanwal and Rizzo (2022) Neel Kanwal and Giuseppe Rizzo. Attention-based clinical note summarization. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, pages 813–820, 2022.
  • Ke et al. (2022) Wenjun Ke, Jinhua Gao, Huawei Shen, and Xueqi Cheng. Consistsum: Unsupervised opinion summarization with the consistency of aspect, sentiment and semantic. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pages 467–475, 2022.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krishna et al. (2021) Kundan Krishna, Sopan Khosla, Jeffrey P Bigham, and Zachary C Lipton. Generating soap notes from doctor-patient conversations using modular summarization techniques. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4958–4972, 2021.
  • Labrak et al. (2024) Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024.
  • Lampinen et al. (2022) Andrew Lampinen, Ishita Dasgupta, Stephanie Chan, Kory Mathewson, Mh Tessler, Antonia Creswell, James McClelland, Jane Wang, and Felix Hill. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 537–563, 2022.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, 2020.
  • Li et al. (2022) Yikuan Li, Ramsey M Wehbe, Faraz S Ahmad, Hanyin Wang, and Yuan Luo. Clinical-longformer and clinical-bigbird: Transformers for long clinical sequences. arXiv preprint arXiv:2201.11838, 2022.
  • Liu et al. (2022a) Fenglin Liu, Bang Yang, Chenyu You, Xian Wu, Shen Ge, Zhangdaihong Liu, Xu Sun, Yang Yang, and David Clifton. Retrieve, reason, and refine: Generating accurate and faithful patient instructions. Advances in Neural Information Processing Systems, 35:18864–18877, 2022a.
  • Liu et al. (2022b) Puyuan Liu, Chenyang Huang, and Lili Mou. Learning non-autoregressive models from search for unsupervised sentence summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7916–7929, 2022b.
  • Liu et al. (2021) Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1):857–876, 2021.
  • Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, 2004.
  • Moen et al. (2016) Hans Moen, Laura-Maria Peltonen, Juho Heimonen, Antti Airola, Tapio Pahikkala, Tapio Salakoski, and Sanna Salanterä. Comparison of automatic summarisation methods for clinical free text notes. Artificial Intelligence in Medicine, 67:25–37, 2016.
  • OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • O’Donnell et al. (2009) Heather C O’Donnell, Rainu Kaushal, Yolanda Barrón, Mark A Callahan, Ronald D Adelman, and Eugenia L Siegler. Physicians’ attitudes towards copy and pasting in electronic note writing. Journal of General Internal Medicine, 24:63–68, 2009.
  • Pagnoni et al. (2023) Artidoro Pagnoni, Alex Fabbri, Wojciech Kryściński, and Chien-Sheng Wu. Socratic pretraining: Question-driven pretraining for controllable summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12737–12755, 2023.
  • Pivovarov and Elhadad (2015) Rimma Pivovarov and Noémie Elhadad. Automated methods for the summarization of electronic health records. Journal of the American Medical Informatics Association, 22:938–947, 2015.
  • Reunamo et al. (2022) Akseli Reunamo, Laura-Maria Peltonen, Reetta Mustonen, Minttu Saari, Tapio Salakoski, Sanna Salanterä, and Hans Moen. Text classification model explainability for keyword extraction–towards keyword-based summarization of nursing care episodes. In MEDINFO 2021: One World, One Health–Global Partnership for Digital Innovation, pages 632–636. IOS Press, 2022.
  • Searle et al. (2023) Thomas Searle, Zina Ibrahim, James Teo, and Richard JB Dobson. Discharge summary hospital course summarisation of in patient electronic health record text with clinical concept guided deep pre-trained transformer models. Journal of Biomedical Informatics, 141:104358, 2023.
  • Shing et al. (2021) Han-Chin Shing, Chaitanya Shivade, Nima Pourdamghani, Feng Nan, Philip Resnik, Douglas Oard, and Parminder Bhatia. Towards clinical encounter summarization: Learning to compose discharge summaries from prior notes. arXiv preprint arXiv:2104.13498, 2021.
  • Soldaini and Goharian (2016) Luca Soldaini and Nazli Goharian. Quickumls: a fast, unsupervised approach for medical concept extraction. In MedIR Workshop, SIGIR, pages 1–4, 2016.
  • Song et al. (2020) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
  • Tang et al. (2019) Matthew Tang, Priyanka Gandhi, Md. Ahsanul Kabir, Christopher Zou, Jordyn Blakey, and Xiao Luo. Progress notes classification and keyword extraction using attention-based deep learning models with bert. arXiv preprint arXiv:1910.05786, 2019.
  • Törnvall and Wilhelmsson (2008) Eva Törnvall and Susan Wilhelmsson. Nursing documentation for communicating and evaluating care. Journal of Clinical Nursing, 17(16):2116–2124, 2008.
  • Van Veen et al. (2023) Dave Van Veen, Cara Van Uden, Maayane Attias, Anuj Pareek, Christian Bluethgen, Malgorzata Polacin, Wah Chiu, Jean-Benoit Delbrouck, Juan Zambrano Chaves, Curtis Langlotz, et al. Radadapt: Radiology report summarization via lightweight domain adaptation of large language models. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 449–460, 2023.
  • Van Veen et al. (2024) Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine, pages 1–9, 2024.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • Vig et al. (2022) Jesse Vig, Alexander Richard Fabbri, Wojciech Kryściński, Chien-Sheng Wu, and Wenhao Liu. Exploring neural models for query-focused summarization. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1455–1468, 2022.
  • Wang et al. (2021) Mengqian Wang, Manhua Wang, Fei Yu, Yue Yang, Jennifer Walker, and Javed Mostafa. A systematic review of automatic text summarization for biomedical literature and ehrs. Journal of the American Medical Informatics Association, 28(10):2287–2297, 2021.
  • Yang et al. (2020) Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, and Eric Darve. Ted: A pretrained unsupervised summarization model with theme modeling and denoising. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1865–1874, 2020.
  • Yuan et al. (2021) Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
  • Zhang et al. (2020a) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR, 2020a.
  • Zhang et al. (2021) Longxiang Zhang, Renato Negrinho, Arindam Ghosh, Vasudevan Jagannathan, Hamid Reza Hassanzadeh, Thomas Schaaf, and Matthew R Gormley. Leveraging pretrained models for automatic summarization of doctor-patient conversations. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3693–3712, 2021.
  • Zhang et al. (2020b) Yuhao Zhang, Derek Merck, Emily Tsai, Christopher D Manning, and Curtis Langlotz. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5108–5120, 2020b.
  • Zhong et al. (2022) Ming Zhong, Yang Liu, Suyu Ge, Yuning Mao, Yizhu Jiao, Xingxing Zhang, Yichong Xu, Chenguang Zhu, Michael Zeng, and Jiawei Han. Unsupervised multi-granularity summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4980–4995, 2022.
  • Zhuang et al. (2022) Haojie Zhuang, Wei Emma Zhang, Jian Yang, Congbo Ma, Yutong Qu, and Quan Z Sheng. Learning from the source document: Unsupervised abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4194–4205, 2022.

Appendix A. Additional Details for Implementation

A.1 Patient Metadata

Figure 6 shows an artificial example of how a natural-language description of a patient is obtained from structured metadata. In the MIMIC-III database, we retrieve patient identifiers (“SUBJECT_ID”), gender information (“GENDER”), and date of birth (“DOB”) from the “PATIENTS” table. Admission identifiers (“HADM_ID”) and admission times (“ADMITTIME”) are obtained from the “ADMISSIONS” table, while diagnosis codes and procedure codes are sourced from the “DIAGNOSES_ICD” and “PROCEDURES_ICD” tables, respectively.

Figure 6: Convert artificial patient metadata to a natural language description.
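
The conversion described above can be sketched as a simple template. This is only an illustration under stated assumptions: the sentence template, the helper name `describe_patient`, and the `DIAGNOSES` and `PROCEDURES` keys (assumed to hold code descriptions that were looked up beforehand) are hypothetical and are not taken from Figure 6.

```python
from datetime import date

def describe_patient(meta: dict) -> str:
    """Render structured patient metadata as a short natural-language description.

    The template below is an assumption made for illustration only.
    """
    dob, adm = meta["DOB"], meta["ADMITTIME"]
    # Age at admission: subtract one year if the birthday has not yet occurred.
    age = adm.year - dob.year - ((adm.month, adm.day) < (dob.month, dob.day))
    # MIMIC-III stores GENDER as "M" or "F".
    gender = {"M": "male", "F": "female"}.get(meta["GENDER"], "patient")
    # Hypothetical keys: human-readable descriptions of the ICD codes.
    diagnoses = "; ".join(meta.get("DIAGNOSES", [])) or "none recorded"
    procedures = "; ".join(meta.get("PROCEDURES", [])) or "none recorded"
    return (
        f"The patient is a {age}-year-old {gender} admitted on {adm.isoformat()}. "
        f"Diagnoses: {diagnoses}. Procedures: {procedures}."
    )

example = {
    "SUBJECT_ID": 1,
    "HADM_ID": 100,
    "GENDER": "M",
    "DOB": date(1950, 6, 15),
    "ADMITTIME": date(2020, 3, 1),
    "DIAGNOSES": ["sepsis", "pneumonia"],
}
description = describe_patient(example)
```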

A.2 Data preprocessing

Following prior research (Harutyunyan et al., 2019; Huang et al., 2019), we filter admission records and nursing notes. First, we exclude specific admission cases: (1) cases of in-hospital mortality and admissions categorized as “NEWBORN”; (2) cases containing diagnosis codes outside the HCUP CCS code groups. We retain only admissions containing clinical notes categorized as “Nursing/other”. Next, we apply a length limit to nursing notes, filtering out those with more than 800 tokens or fewer than 50 tokens. Finally, we filter out admission cases with more than 100 nursing notes. Nursing notes in these cases typically represent out-of-distribution information or are irrelevant to the care episode.
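
The last two filtering steps can be sketched as follows. This is a minimal illustration, not the preprocessing code used in the paper: the function name `filter_notes` is hypothetical, tokens are approximated by whitespace splitting, and the note count per admission is assumed to be taken after the length filter, following the order in which the steps are described.

```python
def filter_notes(notes_by_admission, min_tokens=50, max_tokens=800, max_notes=100):
    """Keep notes with 50 to 800 tokens, then drop admissions with over 100 notes."""
    kept = {}
    for hadm_id, notes in notes_by_admission.items():
        # Length limit: whitespace tokens stand in for the real tokenizer.
        in_range = [n for n in notes if min_tokens <= len(n.split()) <= max_tokens]
        # Drop admissions with too many notes (counted after the length filter).
        if len(in_range) > max_notes:
            continue
        # Admissions left without any note are dropped as well.
        if in_range:
            kept[hadm_id] = in_range
    return kept
```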

We preprocess nursing notes following Huang et al. (2019). In addition, we expand frequently occurring abbreviations that often appear multiple times in a note, such as “pt” (patient), “cv” (cardiovascular), and “resp” (respiratory), to aid the model’s understanding of the notes. By random sampling, we collect 10001 nursing notes from 1000 admissions as the validation set. For the test set, we randomly select 1516 admissions and sample 3079 nursing notes from these admissions. We limit testing to 3079 notes because of the cost of using GPT-4. The nursing notes in the remaining admissions are included in the training set.

A.3 Details of Implementation for the Query Responder

We use nursing notes in the training set to train the query responder $\operatorname{R}$. The data statistics are presented in Table 3.

To address the class imbalance issue in the readmission prediction task, we conduct oversampling for notes in the positive class (“being readmitted”) and undersampling for notes in the negative class (“not being readmitted”). This results in 35000 nursing notes being used for training.
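
A minimal sketch of this kind of resampling is given below. The function name `rebalance` and the equal per-class target are assumptions made for illustration; the paper reports only the resulting total of 35000 notes, not the class ratio or the sampling procedure.

```python
import random

def rebalance(positive, negative, n_per_class, seed=0):
    """Oversample the positive class and undersample the negative class."""
    rng = random.Random(seed)
    if len(positive) >= n_per_class:
        pos = rng.sample(positive, n_per_class)
    else:
        # Oversampling: keep every original note and add random duplicates.
        extra = [rng.choice(positive) for _ in range(n_per_class - len(positive))]
        pos = list(positive) + extra
    # Undersampling: draw without replacement from the majority class.
    neg = rng.sample(negative, min(n_per_class, len(negative)))
    return pos, neg
```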

Table 3: Number of nursing notes used for training, validation, and testing.
| Query | Training | Validation | Testing |
| --- | --- | --- | --- |
| Next Note Prediction | 100000 | 5000 | 17458 |
| Readmission Prediction | 35000 | 10001 | 17458 |
| Phenotype Classification | 149015 | 10001 | 17458 |

Given a nursing note $N$ and its summary $S$ generated by QGSumm:

Contrastive Next Note Prediction.

Two note pairs $(N, N_{pos})$ and $(N, N_{neg})$ are constructed as introduced in Section 4.2.

$$p_{pos} = \operatorname{R}(N, N_{pos}), \qquad p_{neg} = \operatorname{R}(N, N_{neg}), \tag{10}$$
$$p'_{pos} = \operatorname{R}(S, N_{pos}), \qquad p'_{neg} = \operatorname{R}(S, N_{neg}). \tag{11}$$

The learning objective in this case is:

$$\min \mathcal{L}_{CrossEntropy}\big([p_{pos}, p_{neg}], [p'_{pos}, p'_{neg}]\big). \tag{12}$$

Readmission Prediction.

The result of the readmission prediction is in the form of $[p_{pos}, p_{neg}]$, indicating the probabilities of “being readmitted” and “not being readmitted”. The learning objective is the same as Equation 12.

Phenotype Classification.

We obtain the probability distribution of the phenotype from $\operatorname{R}$:

$$[p_1, p_2, \ldots, p_{25}] = \operatorname{R}(N), \qquad [p'_1, p'_2, \ldots, p'_{25}] = \operatorname{R}(S). \tag{13}$$

Consequently, the learning objective is:

$$\min \mathcal{L}_{CrossEntropy}\big([p_1, p_2, \ldots, p_{25}], [p'_1, p'_2, \ldots, p'_{25}]\big). \tag{14}$$
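
To make the objectives in Equations 12 and 14 concrete, the sketch below computes a cross-entropy between the responder’s output on the note and its output on the summary. It assumes, as one common convention, that the note-based output acts as the target distribution and the summary-based output as the prediction; the function name and the example probabilities are hypothetical.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    if len(target) != len(predicted):
        raise ValueError("distributions must have the same length")
    # Clamp predictions away from zero so that log() stays finite.
    return -sum(t * math.log(max(q, eps)) for t, q in zip(target, predicted))

# Hypothetical responder outputs [p_pos, p_neg] for one note and two summaries.
p_note = [0.8, 0.2]
p_summary_close = [0.75, 0.25]  # summary that preserves the predictive signal
p_summary_far = [0.3, 0.7]      # summary that loses it

loss_close = cross_entropy(p_note, p_summary_close)
loss_far = cross_entropy(p_note, p_summary_far)
```

Under this convention, the loss is smaller for the summary whose responder output stays closer to that of the original note, which is the behavior the objective rewards. The same function applies unchanged to the 25-dimensional phenotype output.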

A.4 Hyperparameter Setting

We present the hyperparameter settings of QGSumm and the query responder $\operatorname{R}$ in Table 4. The architectural hyperparameters of the base model are kept the same as in the original configuration of BART-Large-CNN (https://huggingface.co/facebook/bart-large-cnn/blob/main/config.json). We set the maximum length of the summary to 500 tokens, which leaves the model free to determine the appropriate length on its own. We use the Adam optimizer (Kingma and Ba, 2014) to optimize the model.

Table 4: Details of the hyperparameter setting.
| Model | Hyperparameter | Choices |
| --- | --- | --- |
| QGSumm | learning rate | {1e-5, 2e-5, 5e-5, 2e-4, 5e-4} |
| QGSumm | number of training epochs | 3 |
| QGSumm | $\lambda_1$ | {0.1, 0.3, 0.5, 0.7, 0.9, 1.0} |
| QGSumm | $\lambda_2$ | {0.0, 0.1, 0.3, 0.6, 0.8, 1.0} |
| QGSumm | decoder layers being augmented by PIA | {all 12 layers, first 6 layers, last 6 layers} |
| $\operatorname{R}$ | learning rate | {2e-5, 5e-5, 2e-4, 5e-4, 1e-3} |
| $\operatorname{R}$ | number of training epochs for next note prediction | 2 |
| $\operatorname{R}$ | number of training epochs for readmission prediction | 2 |
| $\operatorname{R}$ | number of training epochs for phenotype classification | 3 |

Appendix B. Additional Details for Baselines and Evaluation

B.1 Baselines

Choice of the baselines

  • BART-Large-CNN (Lewis et al., 2020): It is chosen as the base model for its strong performance on text summarization and its lower computational cost compared to its peers. We include it as a baseline to illustrate performance without the proposed novel components.
  • Pegasus (Zhang et al., 2020a): It is a transformer-based pre-trained model specialized in abstractive summarization, widely used as a baseline in studies on text summarization.
  • BioMistral-7B (Labrak et al., 2024): It is an open-source instruction-based LLM adapted from Mistral (Jiang et al., 2023) for the medical domain. It achieves state-of-the-art performance in supervised fine-tuning benchmarks compared to other open-source medical language models.
  • GPT-4 (OpenAI, 2023): It is a proprietary LLM representing the state of the art on general NLP tasks.

Consistent with our method’s settings, the maximum length of the summary is set to 500 tokens.

Prompt Learning for GPT-4 and BioMistral

A prompt example is shown in Figure 7. We employ one-shot in-context learning to prompt GPT-4 and BioMistral-7B, providing one summary example. This strategy can significantly improve the conciseness of the generated summary and ensure its text structure aligns with the requirements.

Figure 7: One prompt example used when testing GPT-4 and BioMistral.

Few-shot Adaptation

We randomly sample 10 nursing notes from the training set and pair them with their corresponding summaries generated by GPT-4 to create the training data for 10-shot fine-tuning of BART-Large-CNN, BioMistral-7B, and Pegasus. For fine-tuning BioMistral-7B, the training data is transformed into instruction-formatted prompts as shown in Figure 7. We fine-tune BART-Large-CNN and Pegasus for 8 and 9 epochs, respectively, with a learning rate of 0.0005. We fine-tune BioMistral-7B using QLoRA adaptation for 7 epochs with a learning rate of 0.0002.

Evaluation on Predictiveness

In our experiments, we assess predictiveness by evaluating the performance of readmission prediction and phenotype classification using summaries from the baselines (8 different methods, including extractive ones) and from different settings of our approach. For instance, to evaluate the predictiveness of summaries from BART-zs, we begin with the query responder $\operatorname{R}$ trained as described in Appendix A.3. We then fine-tune $\operatorname{R}$ and compute the metrics through 10-fold cross-validation using the summaries that BART-zs generates for the nursing notes in the test set. Consequently, we obtain a separate responder for each method, for readmission prediction and phenotype classification respectively.
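
The cross-validation protocol can be sketched as below. This is a generic k-fold index split written for illustration, not the evaluation code used in the paper; it assumes that folds are formed over individual summaries, and it leaves out the fine-tuning of the responder and the metric computation, which would run once per fold.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    # Striding over the shuffled indices gives k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# One split per fold over the 3079 test-set summaries of a single method.
splits = list(k_fold_indices(3079, k=10))
```

Each summary appears in exactly one held-out fold, so the metrics averaged over the 10 folds cover the whole test set once.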

B.2 Grading Criteria of Manual Evaluation

Figure 8: The grading criteria provided to the clinician for manual evaluation.

Figure 8 shows the detailed grading criteria we provide to the clinician for manual evaluation. The score for each metric ranges from 1 to 5.

Figure 9: Effects of various values of $\lambda_1$ (Equation 3) and $\lambda_2$ (Equation 7). (a): $\lambda_1$ affects both the length and the medical information consistency of the generated summary, resulting in changes in the summary’s predictiveness; (b): $\lambda_2$ regulates the importance of the PIA block. We use “Readmission Prediction and Phenotype Classification” as the query. The PIA is applied to all decoder layers.

Appendix C. Additional Experimental Results

Effects of the length penalty coefficient.

The effects of varying the length penalty coefficient are shown in Figure 9(a). As $\lambda_1$ increases, the generated summaries become more concise. However, once $\lambda_1$ exceeds 0.5, there is a notable decrease in medical information consistency, accompanied by a decline in predictiveness. One potential explanation is that, within the range of 0.1 to 0.5, $\lambda_1$ helps refine nursing notes by filtering out unnecessary information, whereas values above 0.5 impose a stricter penalty that causes key patient information to be omitted in exchange for more concise summaries. To strike a balance between conciseness and consistency, we ultimately set $\lambda_1$ to 0.5.

Effects of the importance of patient meta information.

$\lambda_2$ regulates the contribution of patient metadata to the summarization process. Incorporating patient meta information helps maintain the factuality of the summary. However, as shown in Figure 9(b), this influence is not consistently beneficial: the best effect is observed when $\lambda_2$ is set to 0.3. An excessively large $\lambda_2$ causes the model to prioritize patient metadata over the content of the nursing note, which degrades the quality of the generated summary, as reflected in reduced predictiveness. Therefore, we set $\lambda_2$ to 0.3 for our method.