Query-Guided Self-Supervised Summarization of Nursing Notes

Ya Gao ¹
Hans Moen ¹
Saila Koivusalo ²
Miika Koskinen ²
Pekka Marttinen ¹
¹ Aalto University
² Helsinki University Hospital

Abstract

Nursing notes, an important component of Electronic Health Records (EHRs), keep track of the progression of a patient’s health status during a care episode. Distilling the key information in nursing notes through text summarization techniques can improve clinicians’ efficiency in understanding patients’ conditions when reviewing nursing notes. However, existing abstractive summarization methods in the clinical setting have often overlooked nursing notes and require the creation of reference summaries for supervision signals, which is time-consuming. In this work, we introduce QGSumm, a query-guided self-supervised domain adaptation framework for nursing note summarization. Using patient-related clinical queries as guidance, our approach generates high-quality, patient-centered summaries without relying on reference summaries for training. Through automatic and manual evaluation by an expert clinician, we demonstrate the strengths of our approach compared to the state-of-the-art Large Language Models (LLMs) in both zero-shot and few-shot settings. Ultimately, our approach provides a new perspective on conditional text summarization, tailored to the specific interests of clinical personnel.

1 Introduction

Electronic Health Records (EHRs) document the events that patients go through during their hospitalization. These records consist of both free-text clinical notes and structured data. Among them, nursing notes are important for keeping track of the progression of patients’ illnesses, changes in health status, as well as the medications and procedures administered (Törnvall and Wilhelmsson, 2008). Nursing notes provide clinicians with comprehensive insights into the patients’ conditions, assisting in formulating next-step treatment and care plans, as well as in writing the final discharge summaries. However, a patient’s care episode may result in a large number of nursing notes, especially for patients suffering from complex health problems, which causes the problem of information overload (Hall and Walton, 2004). Additionally, the information in nursing notes is usually intricate and highly condensed, making it time-consuming for clinicians to understand (Clarke et al., 2013).

In Natural Language Processing (NLP), text summarization techniques can be used to distill the content of nursing notes (Wang et al., 2021) to help clinicians quickly grasp their contents. Automatic clinical note summarization has been extensively studied, with existing approaches categorized into extractive (Pivovarov and Elhadad, 2015; Moen et al., 2016; Tang et al., 2019) and abstractive methods (Zhang et al., 2020b; Liu et al., 2022a; Searle et al., 2023). However, these methods have certain limitations: (1) Extractive methods only retain sentences, important words, or keyphrases from the original note, limiting coherence and fluency of the summary. This presents a challenge for understanding the summarized content. On the other hand, (2) abstractive methods can generate smoother summaries, but most of the existing works on abstractive clinical note summarization require explicit supervision, i.e., a reference summary as the ground truth. Writing the references is time-consuming (O’Donnell et al., 2009), causing a shortage of training data. Moreover, (3) most abstractive clinical note summarization methods focus on a specific type of notes, such as discharge summaries, radiology reports, or the dialogues between doctors and patients, and there is a lack of research on abstractive nursing note summarization.

Figure 1: From a patient’s admission to discharge, multiple nursing notes may be generated. As shown in one artificial nursing note example, the notes could be poorly organized and lack clarity.

To address these limitations, we propose a novel approach for abstractive nursing note summarization, which does not require reference summaries for training. The nature of the nursing documentation poses additional challenges. For example, as shown in Figure 1, information in nursing notes may lack clarity, be poorly organized, and contain medical jargon that often includes non-standard abbreviations. In the absence of supervised learning signals, guiding the summarization model to understand the semantic information in notes and generate a good summary becomes challenging. Some previous text summarization works (Chu and Liu, 2019; Elsahar et al., 2021) adapt strategies in self-supervised learning (Liu et al., 2021) to such a scenario. In their methods, the training objective is to decrease the semantic distance between a summary and the original text based on the assumption that a good summary is capable of reconstructing the source text. However, simply making the semantic representation of the summary close to that of the original document does not allow controlling the generated summaries, which may result in a lack of relevant information. In clinical domains, we have specific requirements for the content of the summary, where the focus should be on the patient’s condition. Thus, methods that rely solely on the semantic similarity may not be fully satisfactory.

In this paper, we propose a query-guided self-supervised domain adaptation framework for nursing note summarization, named QGSumm. We formulate a learning objective to adapt the text summarization capability of a pretrained language model to the nursing note domain. Our approach is built on a hypothesis:

A good summary of a clinical note is centered on the patient’s condition. Consequently, when queried about the state of the patient, answers obtained from the summary will be similar to those obtained from the original note.

For instance, a query can concern the probability of a patient’s condition improving in the near future. We can train a model to answer this query using the current nursing note or its summary as input, and if the summary is accurate these two answers should be similar. Accordingly, we design a learning objective that aims at minimizing the discrepancy between the responses from the summary or from the source note to given queries. To further encourage the model to prioritize the patient’s current medical condition, we integrate into the summarization workflow both the patient’s metadata as well as information in the previous notes of the patient recorded on the same admission.

To the best of our knowledge, our proposed framework is the first on abstractive summarization of nursing notes, and there is no previous work on employing the self-supervised learning strategy for clinical note summarization. Our primary contributions are:

• The study focuses on nursing notes that play a critical role in clinical settings, filling a gap in previous research by introducing a method for abstractive nursing note summarization. Our method’s ability to operate effectively without requiring reference summaries highlights its practical applicability.

• We propose a novel self-supervised domain adaptation framework. By leveraging patient-related queries to guide the model, we achieve the goal of generating nursing note summaries that prioritize specific content, i.e., patients’ conditions and health status, without the need for ground truth.

• We conduct a comprehensive automatic evaluation and a manual evaluation by an expert clinician. We compare our approach with state-of-the-art Large Language Models (LLMs) in few- and zero-shot settings. This demonstrates the method’s ability to generate high-quality summaries of nursing notes, and additionally provides an independent evaluation of the common LLMs in this task.

Generalizable Insights about Machine Learning in the Context of Healthcare

In existing healthcare-related NLP research, there is a noticeable gap in addressing nursing notes specifically. Summarizing key patient information within nursing notes could enhance the efficiency of medical personnel’s workflow. We provide a new perspective on obtaining summaries of nursing notes through self-supervised domain adaptation without the need for manually written reference summaries. In unconditional text summarization, the generated summary lacks explicit constraints. On the other hand, most conditional text summarization methods typically require data annotation (Vig et al., 2022) or the extraction of information related to content conditions (Pagnoni et al., 2023) from the source text, making them less suitable for our task. We introduce easily applicable patient-related queries as a way to facilitate conditional text summarization of nursing notes, which ensure that the generated summaries contain information required to respond to the query effectively, and are closely linked to the patient’s key information. Such constraints and guidance make our method highly suitable for the clinical and healthcare field since the resulting summaries are centered on the information nurses and clinicians are most concerned about.

2 Related Work

Key Information Extraction and Summarization from Clinical Notes

Research in this domain focuses on two approaches, extractive and abstractive summarization. The extractive method can preserve faithfulness but results in the inability to paraphrase and difficulties in maintaining coherence. Earlier work primarily used semantic similarity-based techniques (Pivovarov and Elhadad, 2015; Moen et al., 2016). The emergence of Transformer models has shifted the focus to attention-based methods to determine key information in clinical text with an emphasis on explainability (Tang et al., 2019; Reunamo et al., 2022; Kanwal and Rizzo, 2022).

Recently, there has been a notable increase in research on abstractive clinical text summarization. From an application perspective, these methods mainly target discharge summary generation (Shing et al., 2021; Adams et al., 2022; Searle et al., 2023), radiology report summarization (Zhang et al., 2020b; Van Veen et al., 2023), summarization of doctor-patient conversations (Zhang et al., 2021; Krishna et al., 2021; Abacha et al., 2023) and problem list summarization (Gao et al., 2022, 2023). However, unlike our approach, most of these works depend on data annotation or reference summaries for training and domain adaptation.

LLMs demonstrate a remarkable capability in clinical text understanding, leading to interest in investigating their performance in clinical text summarization. Van Veen et al. (2024) extensively analyze the clinical text summarization performance of various LLMs with in-context learning (Lampinen et al., 2022) and QLoRA (Dettmers et al., 2024) adaptation. They compare the performance of LLMs with medical experts, providing insights into the strengths and limitations of LLMs.

Unsupervised and Self-Supervised Abstractive Text Summarization

The scarcity of annotated text for abstractive text summarization tasks has spurred interest in unsupervised and self-supervised text summarization. Previous works have relied on source document reconstruction, operating under the assumption that a good summary should be able to reconstruct the source document or capture its essential content (Chu and Liu, 2019). However, reconstructing an entire text using a summary without any guiding signal or prompt is challenging. In contrast, Yang et al. (2020) leverage the lead bias in news articles, by pretraining a model to predict the leading sentences as the target summary. However, this approach requires specific information distribution and text layout, which is not generally applicable. Some works have proposed two-step approaches to first extract important information or entities in the source text and then perform abstractive summarization with the guidance of this extracted information (Zhong et al., 2022; Ke et al., 2022; Liu et al., 2022b). However, the quality of the generated summary relies on the effectiveness of the extraction process, and developing a reliable extractor may entail significant costs. Zhuang et al. (2022) propose a contrastive learning strategy, using source documents as positive and edited source documents as negative examples. Their training objective aims at maximizing the semantic similarity between generated summaries and positive examples while minimizing those between generated summaries and negative examples. Hosking et al. (2023) propose an attributable opinion summarization system, which encodes sentences as paths through a hierarchical discrete latent space. Given a specific entity, the system can identify its common subpaths that are decoded as the output summary.

3 Methods

Next, we introduce QGSumm, a novel framework to automatically summarize and refine clinical notes, with a focus on capturing important patient-centered information in a self-supervised fashion (Figure 2). In line with the hypothesis stated in Section 1, we propose a self-supervised domain adaptation strategy applied to the base model presented in Section 3.1. This strategy treats the original nursing notes as positive references, so that the summaries acquire an ability comparable to that of the original notes to resolve patient-related queries (Section 3.2). Moreover, we want the model to maintain focus on the patient’s meta information while also considering temporal aspects during generation. To achieve this, we propose two augmentation blocks, detailed in Section 3.3. Our model summarizes one nursing note at a time, taking into account its context. Assume a patient $PT$ has a sequence of nursing notes $N=\{N_{1},N_{2},\ldots,N_{m}\}$ sorted by time. Our objective is to obtain a summary $S_{i}$ for note $N_{i}$ from the distribution $P(S_{i}|N_{i},{PA},\{N_{1},\ldots,N_{i-1}\},U)$ , which is conditioned on the patient’s metadata ${PA}$ , information in the past notes $\{N_{1},\ldots,N_{i-1}\}$ , and the user query $U$ guiding the generation.

3.1 Base Model

The backbone of our framework is an off-the-shelf transformer-based language model with an encoder-decoder structure, denoted by $\operatorname{M}$ . Specifically, we leverage a checkpoint $\operatorname{M^{sum}}$ of $\operatorname{M}$ as the base model, which has been fine-tuned for text summarization using publicly available datasets. This allows us to efficiently utilize the extensive resources offered by the pre-trained language model without the effort of training from scratch. Hence, $\operatorname{M^{sum}}$ has been enriched with task-specific knowledge for improved performance in text summarization. However, the capability of $\operatorname{M^{sum}}$ to understand clinical text still remains limited. Therefore, additional refinement of $\operatorname{M^{sum}}$ is necessary to enhance its ability to grasp the complicated semantic information within nursing notes.

Let $N_{i}=[t_{1},t_{2},\ldots,t_{n}]$ , where $t_{k},k=1,\ldots,n$ , denote the tokens in $N_{i}$ . As a preliminary step, we first train the encoder $\operatorname{ENC}$ of the base model $\operatorname{M^{sum}}$ by reconstructing $N_{i}$ :

$$\mathcal{L}_{rec}(N_{i},\operatorname{ENC},\operatorname{DEC^{rec}})=\mathcal{L}_{CrossEntropy}(N_{i},\operatorname{DEC^{rec}}(\operatorname{ENC}(N_{i}))). \tag{1}$$

Here, $\operatorname{DEC^{rec}}$ is the decoder of the original pretrained model $\operatorname{M}$ , which remains frozen during the training. This process empowers the encoder with the ability to understand the semantic information and the clinical knowledge embedded in nursing notes, enabling it to encode the notes more effectively. This preparatory step precedes the primary workflow for nursing note summarization.

3.2 Training Objective

Since there is no ground truth summary available, the conventional method to guide the model $\operatorname{M^{sum}}$ through supervised fine-tuning is not feasible. Instead, we adopt a self-supervised strategy to force the model to generate high-quality, patient-centered summaries that can respond to patient-related queries effectively. We introduce a model $\operatorname{R}$ , which serves as a query responder. This model has been trained to generate responses to specific queries concerning the patient. For example, if a query pertains to the patient’s readmission status, $\operatorname{R}$ is trained to classify patients based on readmission risk using data available in the patient database.

When giving the original nursing note $N_{i}$ or its summary $S_{i}$ generated by $\operatorname{M^{sum}}$ as an input to the responder $\operatorname{R}$ , the training objective is to minimize the discrepancy between the two responses:

$$\operatorname{min}\mathcal{L}_{CrossEntropy}(\operatorname{R}(N_{i}),\operatorname{R}(S_{i})). \tag{2}$$

This formulation ensures that when responding to a certain patient-related query, using the summary will result in a response similar to that obtained using the original nursing note. To prevent $\operatorname{M^{sum}}$ from generating summaries that are too verbose or direct “copy-paste” from the original notes, we introduce a length penalty term. Therefore, the final loss function for nursing notes within one batch becomes:

$$\mathcal{L}_{summ}=\frac{1}{K}\sum_{r=1}^{K}\mathcal{L}_{CrossEntropy}(\operatorname{R}(N_{r}),\operatorname{R}(S_{r}))\times(1+\lambda_{1}e^{(\alpha-0.5)}), \tag{3}$$

where

$$\alpha=\frac{\sum_{r=1}^{K}\operatorname{Len}(S_{r})}{\sum_{r=1}^{K}\operatorname{Len}(N_{r})}, \tag{4}$$

$K$ is the batch size, and $\operatorname{Len}$ denotes the length of the document in terms of the number of tokens. The hyperparameter $\lambda_{1}\in[0,1]$ regulates the extent of the penalty. The information flow from $\operatorname{M^{sum}}$ to $\operatorname{R}$ introduces non-differentiability into the framework, which we resolve using the straight-through Gumbel-Softmax trick (Bengio et al., 2013; Jang et al., 2017).
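To make the interplay between the response discrepancy and the length penalty concrete, the following is a minimal sketch of Equations (3) and (4) in plain Python. It assumes that the responder outputs are already available as probability lists and that the response to the original note acts as the target distribution in the cross-entropy; the function names `cross_entropy` and `summarization_loss` are illustrative and not taken from the original implementation.

```python
import math


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between two discrete distributions (target first)."""
    return -sum(
        t * math.log(max(p, eps)) for t, p in zip(target_probs, predicted_probs)
    )


def summarization_loss(note_responses, summary_responses,
                       note_lengths, summary_lengths, lambda_1=0.5):
    """Batch loss of Eq. (3): mean cross-entropy times the length penalty.

    note_responses / summary_responses: responder outputs R(N_r) and R(S_r),
    one probability list per note in the batch (assumed inputs).
    note_lengths / summary_lengths: token counts Len(N_r) and Len(S_r).
    """
    k = len(note_responses)
    mean_ce = sum(
        cross_entropy(r_note, r_summ)
        for r_note, r_summ in zip(note_responses, summary_responses)
    ) / k
    # Eq. (4): batch-level compression ratio of summaries to notes.
    alpha = sum(summary_lengths) / sum(note_lengths)
    # The penalty grows with alpha, so long summaries are discouraged.
    return mean_ce * (1.0 + lambda_1 * math.exp(alpha - 0.5))
```

Because $\alpha$ is computed once per batch, the penalty is a common factor for all notes in the batch; with $\lambda_{1}=0$ the loss reduces to the plain mean cross-entropy.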

3.3 Augmentation Blocks for the Context of the Patient

Temporal Information Fusion (TIF).

A patient typically has multiple nursing notes ordered in time to document the evolution of her condition. Therefore, the key information crucial for summarizing a patient’s current status is influenced by the context provided by the prior notes. We regard this as temporal information which should be incorporated during summarization to help the model understand the progression of the patient’s condition.

For $N_{i}$ , the embeddings of its previous notes are represented by the embeddings of their respective first tokens, which are special tokens indicating the start of each note. These embeddings are obtained at the last hidden state in the $\operatorname{ENC}$ , denoted as $\{\mathbf{h}_{1},\mathbf{h}_{2},\ldots,\mathbf{h}_{i-1}\}$ , where $\mathbf{h}_{j}\in\mathbb{R}^{d}$ and $d$ is the dimension of the hidden state. We aggregate the representations of the past notes by weighted mean pooling such that the most recent notes receive the largest weight. In practice, we determine initial weights $\beta_{j},j=1,\ldots,i-1$ for each past note $N_{j}$ based on its position in the sequence, such that $\beta_{1}=1,\beta_{2}=2,\ldots,\beta_{i-1}=i-1$ . The weighted mean pooling is performed using normalized weights:

$$\beta'_{j}=\frac{\beta_{j}}{\beta_{1}+\beta_{2}+\ldots+\beta_{i-1}}, \tag{5}$$

$$\mathbf{h}^{TIF}=\operatorname{MeanPooling}(\beta'_{1}\mathbf{h}_{1},\beta'_{2}\mathbf{h}_{2},\ldots,\beta'_{i-1}\mathbf{h}_{i-1}), \tag{6}$$

where $\mathbf{h}^{TIF}\in\mathbb{R}^{d}$ represents the information fusion of the past notes. As shown in Figure 3, we prepend a special token [TI] at the beginning of the decoder input, representing temporal information with embedding $\mathbf{h}^{TIF}$ . Consequently, the initial input to the decoder at the first time step consists of [[TI], [BOS]], where [BOS] is a special token indicating the start of generation. We substitute [TI] with the padding token [PAD] for nursing notes that have no past notes.
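As a small illustration of Equations (5) and (6), the sketch below fuses the first-token embeddings of the past notes with position-based weights. It assumes that the weighted mean pooling amounts to a weighted sum with the normalized weights $\beta'_{j}$ (i.e., no additional division by the number of past notes); the function name `temporal_information_fusion` and the list-of-lists representation are illustrative choices, not part of the original implementation.

```python
def temporal_information_fusion(past_embeddings):
    """Fuse past-note embeddings h_1..h_{i-1} (oldest first) into h^TIF.

    Returns None when there are no past notes; in that case the text
    substitutes the [TI] token with [PAD].
    """
    if not past_embeddings:
        return None
    # Initial weights beta_j = j, so more recent notes weigh more.
    raw_weights = list(range(1, len(past_embeddings) + 1))
    total = sum(raw_weights)
    weights = [b / total for b in raw_weights]  # Eq. (5)
    dim = len(past_embeddings[0])
    # Assumed reading of Eq. (6): weighted sum with normalized weights.
    return [
        sum(w * h[d] for w, h in zip(weights, past_embeddings))
        for d in range(dim)
    ]
```

For three past notes the normalized weights are 1/6, 2/6 and 3/6, so the most recent note contributes half of the fused representation.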

The model generates subsequent tokens in the summary in an auto-regressive manner. At each time step, the token produced is appended at the end of the decoder input for generation of subsequent tokens. Therefore, the [TI] token contributes to the generation of each token in the summary, serving as a prompt which consistently guides the model to focus on information about the patient’s past condition.

Patient Information Augmentation (PIA).

We aim at obtaining summaries focusing on the patient’s condition. To aid this, we explicitly incorporate patient-level information into the model through a cross-attention mechanism, which facilitates the interaction of information on different levels of representation learning. A patient’s metadata PA typically comprise basic information recorded for the patient’s admission, including age, gender, existing diagnoses, and performed procedures. We convert this metadata into patient information in natural language (one example presented in Appendix A.1), and then encode it using $\operatorname{ENC}$ to derive patient embedding $\mathbf{H}^{PA}\in\mathbb{R}^{z\times d}$ for patient $PT$ , where $z$ represents the number of tokens in patient information. The encoder also learns the embedding of the source note, $\mathbf{H}^{enc}\in\mathbb{R}^{n\times d}$ , where $n$ denotes the number of tokens in the note given as input to the encoder. On the decoder $\operatorname{DEC}$ side, let us assume the tokens input to the decoder at the current timestep are [[TI], [BOS], $y_{1}$ ,…, $y_{j}$ ]. Consequently, the hidden representation passed to the $l$ th decoder layer is $\mathbf{H}_{l}^{dec}\in\mathbb{R}^{{(j+2)}\times d}$ . The hidden representation $\mathbf{H}_{l}^{dec}$ is processed and updated in each decoder layer using the conventional self-attention and cross-attention with $\mathbf{H}^{enc}$ . Furthermore, we augment the decoder layer with patient information by performing cross-attention also between the hidden representation $\mathbf{H}_{l}^{dec}$ and the patient embedding $\mathbf{H}^{PA}$ . This facilitates the fusion of patient- and note-level information:

$$\mathbf{H}_{l+1}^{dec}=\operatorname{MHCA}(\mathbf{H}^{enc},\operatorname{MHSA}(\mathbf{H}_{l}^{dec}))+\lambda_{2}\times\operatorname{MHCA}(\mathbf{H}^{PA},\operatorname{MHSA}(\mathbf{H}_{l}^{dec})), \tag{7}$$

where the $\operatorname{MHCA}$ and the $\operatorname{MHSA}$ respectively denote Multi-Head Cross-Attention and Multi-Head Self-Attention (Vaswani et al., 2017). $\lambda_{2}\in[0,1]$ is a hyperparameter to control the importance of patient meta information. $\mathbf{H}_{l+1}^{dec}$ is the input to the next decoder layer, or if the $l$ th layer is the final layer, it is the input to the language modeling head.
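The structure of Equation (7) can be illustrated with a heavily simplified sketch. It assumes a single attention head without learned projections, causal masking, residual connections, layer normalization, or feed-forward sublayers, all of which a real BART decoder layer contains; the names `attend` and `pia_decoder_layer` are hypothetical. The point is only to show how the note-level and patient-level cross-attention outputs are combined with the weight $\lambda_{2}$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Scaled dot-product attention; memory rows act as keys and values."""
    d = len(memory[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in memory
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, memory)) for j in range(d)]
        )
    return outputs


def pia_decoder_layer(h_enc, h_pa, h_dec, lambda_2=0.5):
    """Simplified Eq. (7): note cross-attention plus weighted patient cross-attention."""
    self_attended = attend(h_dec, h_dec)         # stands in for MHSA
    note_context = attend(self_attended, h_enc)  # MHCA over the source note
    patient_context = attend(self_attended, h_pa)  # MHCA over patient info
    return [
        [n + lambda_2 * p for n, p in zip(note_row, patient_row)]
        for note_row, patient_row in zip(note_context, patient_context)
    ]
```

Setting `lambda_2=0.0` recovers the conventional decoder path that attends only to the source note, which makes the contribution of the patient embedding easy to isolate.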

With these two augmentation blocks, the computation of the final decoder state $\mathbf{H}^{dec}\in\mathbb{R}^{(j+2)\times d}$ for generating the $(j+1)$ th token in the summary of the note $N_{i}$ is abstracted as:

$$[\mathbf{H}^{enc},\mathbf{H}^{PA}]=\operatorname{ENC}(N_{i},PA),\quad\mathbf{H}^{dec}=\operatorname{DEC}(\mathbf{H}^{enc},\mathbf{H}^{PA},[\text{[TI]},\text{[BOS]},y_{1},\ldots,y_{j}]), \tag{8}$$

$$\mathbf{v}=\operatorname{LMH}(\mathbf{H}^{dec}),\quad\mathbf{v^{\prime}}=\operatorname{ST-GumbelSoftmax}(\mathbf{v}). \tag{9}$$

$\operatorname{LMH}$ (Language Modeling Head) maps $\mathbf{H}^{dec}$ to a probability vector $\mathbf{v}\in\mathbb{R}^{vs}$ over the vocabulary of size $vs$ . $\mathbf{v}$ is processed using the straight-through Gumbel-Softmax trick, denoted as $\operatorname{ST-GumbelSoftmax}$ , resulting in a one-hot vector $\mathbf{v^{\prime}}\in\mathbb{R}^{vs}$ providing the index of the $(j+1)$ th token.
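The forward pass of the Gumbel-Softmax step can be sketched in plain Python as follows. This is an illustration under assumptions: the input is treated as unnormalized scores (logits), the temperature value is arbitrary, and the function name `gumbel_softmax_sample` is hypothetical. The straight-through part, in which the one-hot vector is used in the forward pass while gradients flow through the relaxed sample, requires an automatic differentiation framework and is therefore only indicated in the comments.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Return (soft, hard): the relaxed sample and its one-hot argmax.

    In a straight-through estimator, `hard` is passed forward to the next
    module, while the backward pass uses the gradient of `soft`.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    index = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == index else 0.0 for i in range(len(soft))]
    return soft, hard
```

Lower temperatures push the relaxed sample closer to the one-hot vector, at the cost of higher-variance gradients.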

4 Experiments

4.1 Data

We use MIMIC-III (Johnson et al., 2016), a widely used real-world EHR database, for our experiments. In MIMIC-III, clinical notes in the “NOTEEVENTS” table are organized by admission, and a single patient may have multiple admissions. Since the information in notes from different admissions of the same patient is discontinuous, we treat the notes of each admission independently. We focus on nursing notes within the clinical notes. After preprocessing, filtering and sampling (details in Appendix A.2), the training, validation and test sets contain 149015, 10001 and 3079 nursing notes, respectively, and the corresponding numbers of admissions are 13893, 1000 and 1156.

4.2 Types of Queries

This section presents queries used in our experiments, and more details can be found in Appendix A.3. Two principles are followed when determining the queries: (1) The query should be closely related to the patient and learnable by the query responder $\operatorname{R}$ ; (2) Data required to train $\operatorname{R}$ should be easily available without excessive data annotation. Below we propose four different types of queries. In each of these, the query responder $R$ is a classification model, which classifies patients according to a specific aspect of a patient’s status. Part of the training data is used to train the query responder $\operatorname{R}$ . When using $\operatorname{R}$ to guide the summarization, we input the summary and the original note to predict the classification probabilities, and minimize the discrepancy between these predictions, as described in Section 3.2. As an additional query, we include a simple baseline by minimizing the semantic distance between the note and its summary, measured by cosine similarity.

Contrastive Next Note Prediction.

Given a nursing note pair $(N_{i},N^{\prime}_{i})$ , we regard the query about whether $N^{\prime}_{i}$ is the successor note of $N_{i}$ as a prediction of the patient’s future status. To train the query responder R for the next note prediction, we create two note pairs for each nursing note, where the positive pair comprises the note and its successor in the sequence, and the negative pair contains the note and a randomly chosen non-consecutive note. If $N_{i}$ is the patient’s last nursing note, we use the patient’s discharge summary and a random note from other patients to construct the positive and negative pairs. The query is formulated as binary classification, and the output of $\operatorname{R}$ is the probability of each pair being the positive pair containing the consecutive notes.
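The pair construction described above can be sketched as follows. Several details are assumptions made for illustration: the negative note is drawn from the same admission, excluding the anchor note and its successor; if no such note exists, the sketch falls back to notes from other patients; and the function name `build_next_note_pairs` is hypothetical.

```python
import random


def build_next_note_pairs(notes, discharge_summary, other_patient_notes, rng=random):
    """Create (anchor, candidate, label) triples for one admission.

    notes: nursing notes of the admission, sorted by time.
    label 1 marks the positive (consecutive) pair, label 0 the negative pair.
    """
    pairs = []
    last_index = len(notes) - 1
    for i, note in enumerate(notes):
        if i < last_index:
            pairs.append((note, notes[i + 1], 1))
            # Assumed sampling pool: same-admission notes that are neither
            # the anchor itself nor its direct successor.
            pool = [n for j, n in enumerate(notes) if j not in (i, i + 1)]
            if not pool:
                pool = other_patient_notes  # fallback, an assumption
            pairs.append((note, rng.choice(pool), 0))
        else:
            # Last note: discharge summary as successor, and a random note
            # from other patients as the negative example.
            pairs.append((note, discharge_summary, 1))
            pairs.append((note, rng.choice(other_patient_notes), 0))
    return pairs
```

Each note thus yields exactly one positive and one negative pair, which keeps the binary classification task balanced.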

Readmission Prediction.

Readmission information is easily retrieved from the hospital’s database and is closely related to the patient. The readmission prediction query is formulated as a 2-class classification task to predict whether the patient will be readmitted within 30 days of discharge, which reflects the patient’s future condition.

Phenotype Classification.

Classifying a patient’s diagnosis status or phenotype is a query about the patient’s current status. Following Harutyunyan et al. (2019), phenotype classification is formulated as a multi-label classification, where ICD-9 diagnosis codes mapped by HCUP CCS code groups (https://hcup-us.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp) are categorized into 25 classes. Therefore, the responder outputs the probability distribution of the phenotype as $[p_{1},p_{2},\ldots,p_{25}]$ .

Readmission Prediction and Phenotype Classification.

We investigate the combined utilization of two queries, readmission prediction, and phenotype classification, to see if joint guidance is more effective. After obtaining the result of readmission prediction $[p^{r}_{1},p^{r}_{2}]$ and the result of phenotype classification $[p^{c}_{1},\ldots,p^{c}_{25}]$ , we integrate them by converting the results into a 50-class probability distribution.
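The exact conversion into a 50-class distribution is not spelled out in this section. One plausible construction, shown here purely as an assumption and not as the authors' confirmed procedure, is to normalize the 25 phenotype probabilities so that they sum to one and then take the outer product with the two readmission probabilities, giving $2\times 25=50$ entries. The function name `combine_query_responses` is hypothetical.

```python
def combine_query_responses(readmission_probs, phenotype_probs):
    """Assumed combination: outer product of two normalized distributions.

    readmission_probs: [p^r_1, p^r_2], expected to sum to 1.
    phenotype_probs: 25 multi-label probabilities, which need not sum to 1.
    """
    total = sum(phenotype_probs)
    if total <= 0:
        raise ValueError("phenotype probabilities must have a positive sum")
    # Normalize the multi-label outputs into a distribution (an assumption).
    phenotype_dist = [p / total for p in phenotype_probs]
    # Entry (a, b) pairs readmission class a with phenotype class b.
    return [r * c for r in readmission_probs for c in phenotype_dist]
```

Under this construction the combined vector sums to one whenever the readmission probabilities do, so it can be compared between note and summary with the same cross-entropy objective as the single-query case.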

4.3 Experiment Settings

As the base model, we use BART-Large-CNN (https://huggingface.co/facebook/bart-large-cnn), which is a BART model (Lewis et al., 2020) fine-tuned on CNN Daily Mail, specialized in text summarization. As the query responder, we use Clinical-Longformer (Li et al., 2022), chosen for its ability to handle a long context. It is fine-tuned on the selected queries. Hyperparameter settings for QGSumm and the query responder are presented in Appendix A.4.

Since our method is designed for scenarios where reference summaries are unavailable, we compare it with baselines in both zero-shot and few-shot settings: Zero-Shot: BART-Large-CNN (abbreviated as BART-zs), BioMistral-7B (Labrak et al., 2024) (abbreviated as BioMistral-zs), and GPT-4 (OpenAI, 2023); 10-Shot: BART-Large-CNN (abbreviated as BART-fs), Pegasus (Zhang et al., 2020a), and BioMistral-7B (abbreviated as BioMistral-fs). We use summaries generated by GPT-4 for the 10-shot fine-tuning and use one-shot in-context learning when prompting GPT-4 and BioMistral-7B. We also include results from two extractive methods, TextRank (Mihalcea and Tarau, 2004) and Lead-40%, for reference. In TextRank, we utilize MPNet (Song et al., 2020) to obtain sentence embeddings. In Lead-40%, we use the first 40% of the content of the note as a summary. More details about baselines and in-context learning/few-shot adaptation can be found in Appendix B.1.
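For reference, the Lead-40% baseline can be sketched in a few lines. The unit of truncation is not specified here, so this sketch assumes whitespace-separated tokens; the function name `lead_summary` is illustrative.

```python
def lead_summary(note, fraction=0.4):
    """Keep the leading `fraction` of the note, counted in whitespace tokens."""
    tokens = note.split()
    # Rounding to the nearest token count is an assumption of this sketch.
    keep = round(len(tokens) * fraction)
    return " ".join(tokens[:keep])
```

Since the output is copied verbatim from the note, this baseline cannot introduce unsupported content, but it also ignores everything in the later part of the note.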

4.4 Evaluation Metrics

Evaluating the quality of text summarization is challenging (Bhandari et al., 2020), especially when reference summaries are unavailable. Therefore, we employ multiple metrics covering different aspects of the summaries, providing a comprehensive evaluation.

Automatic Evaluation Metrics.

Metrics in the automatic evaluation are divided into three categories: 1) predictiveness, 2) factuality and consistency, and 3) conciseness.

Metrics for predictiveness assess whether the summary adequately contains patient key information, quantified as the ability to predict the patient’s condition using the summary as input. Specifically, we conduct readmission prediction and phenotype classification using summaries from baselines and our method. We employ the summaries generated by different methods to fine-tune the query responder, resulting in multiple predictors, one for each method. For readmission prediction, we report the weighted F1 and F1 of the positive class (“being readmitted”), and for phenotype classification, we report the F1-Macro score.
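The three F1 variants mentioned above differ only in how per-class scores are aggregated. The following sketch computes them for a single-label task such as readmission prediction; it is an illustration with hypothetical function names, not the evaluation code used in the experiments. For the multi-label phenotype task, the Macro F1 would instead average the binary F1 of each of the 25 labels.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score of one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f1_report(y_true, y_pred, positive_label=1):
    """Positive-class, support-weighted, and macro-averaged F1."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {label: per_class_f1(y_true, y_pred, label) for label in labels}
    support = {label: sum(1 for t in y_true if t == label) for label in labels}
    weighted = sum(scores[l] * support[l] for l in labels) / len(y_true)
    macro = sum(scores.values()) / len(labels)
    return {
        "positive_f1": scores.get(positive_label, 0.0),
        "weighted_f1": weighted,
        "macro_f1": macro,
    }
```

Because readmissions are the minority class, the weighted F1 is dominated by the negative class, which is why the F1 of the positive class is reported alongside it.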

For consistency and factuality, we consider: (1) UMLS-Recall. This metric measures the biomedical information consistency by comparing the set of medical concepts in the summary with that in the original note. We employ QuickUMLS (Soldaini and Goharian, 2016) to extract Unified Medical Language System (UMLS) biomedical concepts from the nursing note and its summary. Recall is the proportion of concepts in the original note that are present in the summary. (2) UMLS-FDR. FDR denotes False Discovery Rate. Analogously to UMLS-Recall, we compute the proportion of medical concepts in the summary that do not appear in the original note. (3) FactKB. With FactKB (Feng et al., 2023), an evaluation model measuring the factuality of a summary and its original text, we evaluate whether the summaries are consistent with the nursing notes from the perspective of their overall semantic information. (4) BARTScore. This metric evaluates the general consistency of summaries in a text-generation manner using BART, which also considers aspects such as the structure, coherence, and fluency of the summary (Yuan et al., 2021).
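Given the concept sets extracted from a note and its summary, UMLS-Recall and UMLS-FDR reduce to simple set operations. The sketch below assumes the concepts have already been extracted (for example as concept identifiers) and does not call QuickUMLS itself; the function name `umls_recall_and_fdr` and the identifiers used in the usage example are placeholders.

```python
def umls_recall_and_fdr(note_concepts, summary_concepts):
    """Return (recall, fdr) for two collections of concept identifiers."""
    note_set = set(note_concepts)
    summary_set = set(summary_concepts)
    shared = note_set & summary_set
    # Recall: share of the note's concepts that the summary retains.
    recall = len(shared) / len(note_set) if note_set else 0.0
    # FDR: share of the summary's concepts that are absent from the note.
    fdr = len(summary_set - note_set) / len(summary_set) if summary_set else 0.0
    return recall, fdr
```

For example, a note with concepts `{"C1", "C2", "C3", "C4"}` and a summary with `{"C1", "C2", "C9"}` gives a recall of 0.5 and an FDR of one third, since one of the three summary concepts is not supported by the note.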

Finally, we report the length of the generated summary as a percentage of the original note’s length to assess conciseness. We do not enforce a strict maximum length for baselines because we believe the model should be capable of determining the appropriate length autonomously.

Metrics used in the manual evaluation by a clinician

Without a reference summary, automatic evaluation metrics may not fully capture the quality of the summary. Therefore, we invite a clinician to conduct a manual evaluation of the summaries of 25 nursing notes. The clinician evaluates summaries from multiple methods in a blinded and randomized order. Each summary is rated on four aspects: (1) Informativeness: whether the summary adequately captures essential information regarding the patient’s condition in the original note; (2) Fluency: whether the summary is well-written and easy to understand; (3) Consistency: how well the summary aligns with the original nursing note in factuality; (4) Relevance: whether the summary is concise and free of unnecessary information. The score for each aspect ranges from 1 to 5. More detailed grading criteria are presented in Appendix B.2.

5 Results and Discussion

		Predictiveness			Consistency and Factuality				Conciseness
Type	Method	Readmission Weighted F1	Readmission F1	Phenotype Macro F1	UMLS-Recall	UMLS-FDR	BARTScore	FactKB	Length
Orig. Notes		$85.2_{0.5}$	$19.7_{1.9}$	$28.7_{0.5}$	-	-	-	-	-
Zero-Shot	BART-zs	$78.8_{0.4}$	$11.1_{0.9}$	$20.5_{0.3}$	$36.4_{9.0}$	$8.70_{6.2}$	$-1.89_{0.31}$	$0.78_{0.16}$	31.9%
	GPT-4	$85.6_{0.6}$	$21.5_{2.0}$	$23.6_{0.6}$	$59.2_{8.3}$	$44.2_{7.6}$	$-3.13_{0.47}$	$0.77_{0.17}$	53.6%
	BioMistral-zs	$80.1_{0.6}$	$10.7_{1.3}$	$21.4_{0.4}$	$55.4_{9.9}$	$50.0_{8.7}$	$-2.80_{0.45}$	$0.68_{0.14}$	69.2%
10-Shot	BART-fs	$82.2_{0.5}$	$14.4_{1.3}$	$21.1_{0.4}$	$52.5_{7.3}$	$44.5_{7.1}$	$-2.72_{0.36}$	$0.76_{0.15}$	65.0%
	BioMistral-fs	$81.7_{0.4}$	$10.2_{1.1}$	$22.0_{0.4}$	$57.2_{10.2}$	$49.1_{7.8}$	$-2.97_{0.43}$	$0.70_{0.15}$	68.8%
	Pegasus	$80.5_{0.8}$	$12.5_{1.8}$	$18.3_{0.6}$	$35.1_{8.4}$	$52.6_{7.7}$	$-3.07_{0.40}$	$0.70_{0.18}$	57.4%
QGSumm	-Similarity	$79.5_{0.6}$	$12.0_{1.2}$	$22.4_{0.4}$	$53.1_{7.2}$	$20.7_{6.7}$	$-2.22_{0.31}$	$0.82_{0.13}$	51.7%
	-NextNote	$80.8_{0.6}$	$11.7_{1.4}$	$23.2_{0.6}$	$56.4_{8.0}$	$35.2_{7.1}$	$-2.32_{0.33}$	$0.77_{0.11}$	49.3%
	-Readmission	$82.4_{0.5}$	$18.2_{1.6}$	$23.9_{0.5}$	$58.2_{7.5}$	$22.7_{6.5}$	$-2.30_{0.37}$	$0.78_{0.14}$	46.2%
	-Phenotype	$81.9_{0.5}$	$13.4_{1.5}$	$25.6_{0.6}$	$58.5_{7.4}$	$36.2_{6.9}$	$-2.34_{0.35}$	$0.79_{0.13}$	48.0%
	-Re+Ph	$84.2_{0.5}$	$17.2_{1.6}$	$25.1_{0.5}$	$58.8_{7.9}$	$24.1_{6.4}$	$-2.26_{0.35}$	$0.80_{0.14}$	48.2%
Extractive	Lead-40%	$83.1_{0.6}$	$12.6_{1.5}$	$21.7_{0.5}$	$42.7_{6.7}$	$0.30_{2.6}$	$-0.87_{0.11}$	$0.99_{0.06}$	40.0%
	TextRank	$81.9_{0.7}$	$14.4_{1.7}$	$23.3_{0.5}$	$58.5_{7.9}$	$0.08_{1.4}$	$-0.90_{0.15}$	$0.95_{0.12}$	51.9%

Table 1: Results of automatic evaluation. Lower values are better for Length and UMLS-FDR, higher values for the other metrics. The subscripts denote standard deviation. “Orig. Notes” means using original nursing notes as such for readmission and phenotype prediction. “Re+Ph” means using “Readmission Prediction and Phenotype Classification” as the query. Results from the best and second-best method under each metric are bolded and underlined. Extractive methods are for reference and not considered in comparison.

5.1 Results of the Automatic Evaluation

Predictiveness.

Results are shown in Table 1. In the readmission prediction task, GPT-4 performs best, producing summaries that enable more accurate prediction of a patient’s status. Our method also exhibits excellent performance, surpassing all few-shot methods. The main reference for our method is BART-zs, as it is the base model in our method, and hence represents performance without the proposed novel components. We see that our method outperforms BART-zs significantly in weighted F1 score (84.2 vs. 78.8) and F1 score of the positive class (18.2 vs. 11.1). This shows the effectiveness of the adaptation strategy guiding the model with useful queries. Interestingly, we find that using the summary from GPT-4 for this task outperforms using the original notes. Similarly, the summary from our method has performance close to that of using the original notes, highlighting the importance of high-quality summaries. In phenotype classification, our method with the query focusing on patients’ phenotype performs the best, outperforming BART-zs in Macro F1 (25.6 vs. 20.5). Even when using the similarity alone as a guiding signal, our method is still better than BART-zs (22.4 vs. 20.5) and BART-fs (22.4 vs. 21.1). Although specialized in text summarization, Pegasus has weak performance on all predictiveness metrics.

Conciseness, Consistency, and Factuality.

As shown in Table 1, there is an expected trade-off between UMLS-Recall and the summary’s length. Our method strikes a good balance between medical information consistency (measured by UMLS-Recall) and conciseness. GPT-4 captures more medical information, but achieves this with summaries which are less concise. Conversely, BART-zs can produce concise summaries but fails to adequately capture medical concepts. Even if the 10-shot fine-tuning clearly improves predictiveness and UMLS-Recall, BART-fs still struggles to generate a concise summary. Similar to BART-fs, both BioMistral-zs and BioMistral-fs tend to produce summaries that are not concise.

Summaries generated from BART-zs maintain high levels in factuality (measured by UMLS-FDR and FactKB) and general consistency (measured by BARTScore). Our method also has strong performance on relevant metrics. We find that although LLMs, such as GPT-4 and BioMistral, excel in language understanding, they do not perform well on factuality and general consistency. One possible reason is their tendency to rephrase or even expand upon the original notes, potentially introducing inconsistent information. However, the metrics can be influenced by text style and layout, which may cause summaries that are more different from the original note to score relatively lower, even if they are more fluent. For this reason, our model also scores lower than the base model BART-zs on some metrics. The results and limitations will be further discussed in Sections 5.4 and 5.5.

Effectiveness of the Query Guidance.

According to the results shown in Table 1, the performance with different queries varies. We can observe: (1) Regarding predictiveness, employing queries closely related to patients and focusing on readmission and phenotype information yields superior performance compared to other queries. As expected, the method with phenotype-related queries performs the best in phenotype classification, while the method with readmission-related queries is the best in readmission prediction. This highlights the effectiveness of guiding the summarization with queries: different queries enable the summary to concentrate on different aspects of the original note. (2) Using similarity as guidance can produce summaries that are more similar to the original notes, resulting in higher scores on general consistency and factuality. However, summaries generated under this configuration tend to be longer and often sacrifice predictiveness and informativeness regarding medical concepts, demonstrating the limitations of the unconstrained guidance signal. (3) When employing joint guidance with both readmission and phenotype information, our method consistently achieves excellent performance across all metrics. This indicates that combining different guidance signals can help in producing better summaries, and further research is needed to explore this aspect in depth.

5.2 Results of the Manual Evaluation by a Clinician

To avoid excessive manual work, we select three baselines to include in the manual evaluation: BART-zs, GPT-4, and BioMistral-fs. The justification for selecting these methods is two-fold: First, BART-zs is the base model in our method and hence the main baseline, demonstrating the benefits of the novel components. Second, GPT-4 and BioMistral-fs are well-known strong baselines and they had good performance in the automatic evaluation. We use “Readmission Prediction and Phenotype Classification” as the query for our method. Average scores for each method across four metrics are shown in Figure 4.

QGSumm vs BART-zs.

Our method significantly outperforms the base model BART-zs on all four metrics. This indicates that the proposed domain adaptation strategy enables the model to generate higher-quality summaries from the medical personnel’s perspective, containing refined and important patient information with fewer hallucinations and increased readability. Although on average the summaries generated by our method are longer than those produced by the base model, our method achieves a higher relevance score from the clinician, suggesting the base model struggles to identify key information and focuses on unnecessary details. Our model can effectively enhance this aspect.

QGSumm vs GPT-4 and BioMistral-fs.

Due to the LLMs’ strong language understanding capability and sufficient medical knowledge, GPT-4 and BioMistral-fs can adequately summarize key information in nursing notes, receiving approximately the same average score in Informativeness as our method. Additionally, they excel in generating fluent summaries by rephrasing and clarifying abbreviations, receiving a slightly higher Fluency score from the clinician than our method. However, the rephrasing can introduce factual inconsistencies, and the tendency to infer additional content may reduce factuality. Consequently, our model has a higher average score in Consistency, which is essential in the clinical setting. Furthermore, it generates more concise summaries, leading to a higher average score in Relevance. However, due to the small sample size, the only statistically significant difference in these comparisons was the improvement of our method compared to BioMistral-fs in Consistency, and further work is needed for conclusive results.

Calculating the Significance.

We calculated the significance in Figure 4 using a two-tailed Binomial test on the pairwise win-rates. In detail, we count the number of nursing notes where our method has a score higher or lower than a comparison method and test for the null hypothesis that the win-probability is 0.5.
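The test described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes that ties are excluded from the counts (one reading of "higher or lower"), and it uses the common definition of the exact two-tailed p-value as the total probability of all outcomes no more likely than the observed one. The function names and the example scores are hypothetical.

```python
from math import comb


def win_loss_counts(ours, other):
    """Count notes where our score is strictly higher (wins) or lower (losses)."""
    wins = sum(a > b for a, b in zip(ours, other))
    losses = sum(a < b for a, b in zip(ours, other))
    return wins, losses


def two_tailed_binomial_p(wins, losses, p=0.5):
    """Exact two-tailed binomial p-value under the null win-probability p."""
    n = wins + losses
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Sum the probabilities of all outcomes at least as unlikely as the
    # observed one; the small tolerance absorbs floating-point noise.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Hypothetical 1-5 ratings for six notes, for illustration only.
ours = [5, 4, 4, 3, 5, 4]
other = [3, 4, 2, 3, 4, 5]
wins, losses = win_loss_counts(ours, other)
p_value = two_tailed_binomial_p(wins, losses)
print(wins, losses, p_value)
```

With only a handful of non-tied notes the test has little power, which is consistent with the caveat about sample size in the comparison above.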

5.3 Effectiveness of Augmentation Blocks

We analyze the effects of the proposed augmentation blocks through an ablation study. We consider three settings: removing the Patient Information Augmentation block (denoted as w/o PIA); removing the Temporal Information Fusion block (denoted as w/o TIF); removing both blocks (denoted as w/o PIA+TIF). We use “Readmission Prediction and Phenotype Classification” as the query in our method. The results of the ablation study are shown in Table 2. The decreased weighted F1 and macro F1 scores indicate that both augmentation blocks enhance the predictiveness of summaries. This implies that information in patient metadata and previous notes can effectively prompt the inference of the current and future status of patients. Removing the TIF block causes a larger decrease in the F1 scores, suggesting that temporal information is more important than the patient’s metadata in guiding the generation of summaries to focus on the progression of the patient’s status.

On the other hand, incorporating patient metadata leads to more faithful summaries: removing PIA degrades performance on UMLS-FDR and FactKB, the factuality-related metrics, more than removing TIF does. The TIF block, in contrast, has little impact on factuality. However, according to the UMLS-Recall score, it encourages the model to capture more medical information, thereby improving the consistency of the summary.

Table 2: Results of the ablation study. We present scores of five metrics for QGSumm, and show the change in the value of the metric after removing different augmentation blocks. $\downarrow$ denotes a decrease in the score and $\uparrow$ denotes an increase. We see that the augmentations are consistently useful.

	Weighted F1	Macro F1	UMLS-Recall	UMLS-FDR	FactKB
QGSumm	84.2	25.1	58.8	24.1	0.80
w/o PIA	$\downarrow 2.6$	$\downarrow 1.4$	$\downarrow 1.4$	$\uparrow 4.7$	$\downarrow 0.04$
w/o TIF	$\downarrow 4.1$	$\downarrow 1.6$	$\downarrow 3.8$	$\uparrow 1.9$	$\downarrow 0.01$
w/o PIA+TIF	$\downarrow 4.8$	$\downarrow 2.3$	$\downarrow 4.4$	$\uparrow 5.6$	$\downarrow 0.04$

5.4 Case Study

One artificial nursing note and its corresponding summaries generated by QGSumm, BART-zs, GPT-4, and BioMistral-fs are presented in Figure 5. In the original nursing note, the content highlighted in blue indicates information included in the summary generated by our approach. We can see that our approach captures most of the important patient information. However, some details, such as cardiovascular and respiratory conditions, are overlooked. The summary from BART-zs only covers information from the first half of the nursing note, suggesting the limitation in understanding long context. Summaries from GPT-4 and BioMistral-fs contain more patient information but lack conciseness. These models achieve fluency by rephrasing notes and expanding abbreviations. However, BioMistral-fs struggles with maintaining factuality, often excessively reasoning about the patient’s personal information and condition.

5.5 Discussion

User-need-oriented summarization.

A high-quality summary should facilitate efficient understanding of the relevant content for users. In the context of nursing notes, this means the summary should capture the patient’s condition. Our method employs patient-related queries, indirectly ensuring that the summary centers around the patient’s status. The summaries generated with different queries can be seen as coming from distinct conditional distributions and parts of the semantic space. As discussed in Section 5.1, the queries can guide summaries to focus on specific aspects of the original note. Therefore, by selecting appropriate queries, we can control preferences for desired content and adjust granularity, which facilitates a more flexible and user-need-oriented summary generation. For instance, broad queries about the patient’s condition will result in a summary that focuses on the patient’s overall condition, while more detailed queries, such as those regarding cardiovascular health, could produce a summary that focuses on that specific aspect.

Design choices for information augmentation.

One challenge is how to efficiently integrate information into the model while avoiding excessive computational cost. We utilize cross-attention to allow the patient’s metadata to efficiently interact on multiple levels with the process of generating the summary. In contrast, for temporal information in previous notes, using cross-attention in a similar manner might make it difficult for the model to balance attention across the current note, past notes, and patient information, in addition to introducing computational challenges with long sequences of notes. Hence, we adopt a simple but effective strategy: representing the temporal information, obtained by weighted mean pooling from previous note representations, as the first token of the decoder’s input. This strategy is intuitive, as information from previous notes naturally precedes the summary of the current note.
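The temporal strategy described above can be sketched in plain Python. This is only an illustration under stated assumptions: the text does not specify how the pooling weights are chosen, so the recency-style weights below are hypothetical, as are the function names, the toy two-dimensional vectors, and the choice to prepend the pooled vector rather than overwrite an existing start token.

```python
def weighted_mean_pool(note_vectors, weights):
    """Weighted mean of previous-note representations (lists of floats)."""
    total = sum(weights)
    dim = len(note_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, note_vectors)) / total
        for i in range(dim)
    ]


def prepend_temporal_token(temporal_vector, decoder_embeddings):
    """Place the pooled temporal vector as the first decoder input position."""
    return [temporal_vector] + list(decoder_embeddings)


# Toy representations of three previous notes (hypothetical values).
previous_notes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical weights that favor more recent notes.
weights = [1.0, 2.0, 3.0]

temporal = weighted_mean_pool(previous_notes, weights)
decoder_input = prepend_temporal_token(temporal, [[0.5, 0.5], [0.2, 0.1]])
print(temporal, len(decoder_input))
```

Because the previous notes are collapsed into a single vector, the decoder input grows by one position regardless of how many earlier notes exist, which is the computational advantage over attending to every past note.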

Interpretation of the evaluation metrics.

The metrics used in the automatic evaluation have limitations as they do not conclusively reflect the quality of the summary, and come with trade-offs. For example, a good performance in predictiveness and medical information consistency (UMLS-Recall) may not be due to the high quality of the summary but rather caused by copying the source note, resulting in a lack of conciseness and fluency. Conversely, as the summary becomes more concise, it may become less informative. Furthermore, models used to measure factuality and general consistency have inherent biases. As they are based on general semantics, they are potentially weak at recognizing patient-related information due to the dissimilarity between their training domain and clinical data, and they often prioritize text style and structure. Finally, since BARTScore is derived from the BART model, summaries generated by BART are biased toward scoring relatively higher on this metric. We attempt to mitigate the impact of these limitations by comprehensively considering multiple metrics, and including the manual evaluation by a clinician, but there remains a need for more conclusive evaluation metrics.

Limitations.

(1) Our current approach produces summaries of individual nursing notes, lacking long-context modeling and support for multi-note summarization. (2) There is room for more exploration on the formulation of the clinical queries. We do not employ generative queries but only queries related to the classification of the patient’s status. Also, when investigating the combined effects of multiple queries, further exploration using multi-task learning methods could be beneficial. (3) Due to the workload, the number of notes assessed in the manual evaluation is limited to 25. A larger sample size would allow more conclusive comparisons of the strengths and weaknesses of the methods.

Conclusion.

We presented a novel method for self-supervised nursing note summarization, where the main innovation was the introduction of query guidance, which successfully directed the summaries to include desired content. In the manual evaluation by a professional clinician, our method significantly outperformed a specialized open text summarization model, BART-Large-CNN, in all metrics. Because this model was the base model of our method, the result highlights the usefulness of the novel developments. Of the other baselines, the proprietary GPT-4 had the closest performance to our method and was better than the other baselines. In the automatic evaluation, GPT-4 was better than our method in predictiveness, but, importantly, our method outperformed GPT-4 in factual consistency, having fewer hallucinated facts without sacrificing the correct content. The same trend was seen in the manual evaluation as higher average consistency for our method, although the difference was not statistically significant. Hence, our approach can produce more reliable summaries, clearing obstacles for responsible clinical use of LLMs. From the machine learning perspective, our method demonstrates the feasibility of domain adaptation for pre-trained text summarization models without explicit supervision, and the effectiveness of self-supervised strategies to guide conditional summarization to specific interests.

References

Abacha et al. (2023) Asma Ben Abacha, Wen-wai Yim, Yadan Fan, and Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2291–2302, 2023.
Adams et al. (2022) Griffin Adams, Han-Chin Shing, Qing Sun, Christopher Winestock, Kathleen Mckeown, and Noémie Elhadad. Learning to revise references for faithful summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4009–4027, 2022.
Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Bhandari et al. (2020) Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, 2020.
Chu and Liu (2019) Eric Chu and Peter Liu. Meansum: A neural model for unsupervised multi-document abstractive summarization. In International Conference on Machine Learning, pages 1223–1232. PMLR, 2019.
Clarke et al. (2013) Martina A Clarke, Jeffery L Belden, Richelle J Koopman, Linsey M Steege, Joi L Moore, Shannon M Canfield, and Min S Kim. Information needs and information-seeking behaviour analysis of primary care physicians and nurses: a literature review. Health Information & Libraries Journal, 30(3):178–190, 2013.
Dettmers et al. (2024) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
Elsahar et al. (2021) Hady Elsahar, Maximin Coavoux, Jos Rozen, and Matthias Gallé. Self-supervised and controlled multi-document opinion summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1646–1662, 2021.
Feng et al. (2023) Shangbin Feng, Vidhisha Balachandran, Yuyang Bai, and Yulia Tsvetkov. Factkb: Generalizable factuality evaluation using language models enhanced with factual knowledge. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 933–952, 2023.
Gao et al. (2022) Yanjun Gao, Dmitriy Dligach, Timothy Miller, Dongfang Xu, Matthew MM Churpek, and Majid Afshar. Summarizing patients’ problems from hospital progress notes using pre-trained sequence-to-sequence models. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2979–2991, 2022.
Gao et al. (2023) Yanjun Gao, Dmitriy Dligach, Timothy Miller, and Majid Afshar. Overview of the problem list summarization (probsum) 2023 shared task on summarizing patients’ active diagnoses and problems from electronic health record progress notes. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 461–467, 2023.
Hall and Walton (2004) Amanda Hall and Graham Walton. Information overload within the health care system: a literature review. Health Information & Libraries Journal, 21(2):102–108, 2004.
Harutyunyan et al. (2019) Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific Data, 6(1):96, 2019.
Hosking et al. (2023) Tom Hosking, Hao Tang, and Mirella Lapata. Attributable and scalable opinion summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8488–8505, 2023.
Huang et al. (2019) Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342, 2019.
Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR 2017), 2017.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
Kanwal and Rizzo (2022) Neel Kanwal and Giuseppe Rizzo. Attention-based clinical note summarization. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, pages 813–820, 2022.
Ke et al. (2022) Wenjun Ke, Jinhua Gao, Huawei Shen, and Xueqi Cheng. Consistsum: Unsupervised opinion summarization with the consistency of aspect, sentiment and semantic. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pages 467–475, 2022.
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Krishna et al. (2021) Kundan Krishna, Sopan Khosla, Jeffrey P Bigham, and Zachary C Lipton. Generating soap notes from doctor-patient conversations using modular summarization techniques. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4958–4972, 2021.
Labrak et al. (2024) Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024.
Lampinen et al. (2022) Andrew Lampinen, Ishita Dasgupta, Stephanie Chan, Kory Mathewson, Mh Tessler, Antonia Creswell, James McClelland, Jane Wang, and Felix Hill. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 537–563, 2022.
Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, 2020.
Li et al. (2022) Yikuan Li, Ramsey M Wehbe, Faraz S Ahmad, Hanyin Wang, and Yuan Luo. Clinical-longformer and clinical-bigbird: Transformers for long clinical sequences. arXiv preprint arXiv:2201.11838, 2022.
Liu et al. (2022a) Fenglin Liu, Bang Yang, Chenyu You, Xian Wu, Shen Ge, Zhangdaihong Liu, Xu Sun, Yang Yang, and David Clifton. Retrieve, reason, and refine: Generating accurate and faithful patient instructions. Advances in Neural Information Processing Systems, 35:18864–18877, 2022a.
Liu et al. (2022b) Puyuan Liu, Chenyang Huang, and Lili Mou. Learning non-autoregressive models from search for unsupervised sentence summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7916–7929, 2022b.
Liu et al. (2021) Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1):857–876, 2021.
Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, 2004.
Moen et al. (2016) Hans Moen, Laura-Maria Peltonen, Juho Heimonen, Antti Airola, Tapio Pahikkala, Tapio Salakoski, and Sanna Salanterä. Comparison of automatic summarisation methods for clinical free text notes. Artificial Intelligence in Medicine, 67:25–37, 2016.
OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
O’Donnell et al. (2009) Heather C O’Donnell, Rainu Kaushal, Yolanda Barrón, Mark A Callahan, Ronald D Adelman, and Eugenia L Siegler. Physicians’ attitudes towards copy and pasting in electronic note writing. Journal of General Internal Medicine, 24:63–68, 2009.
Pagnoni et al. (2023) Artidoro Pagnoni, Alex Fabbri, Wojciech Kryściński, and Chien-Sheng Wu. Socratic pretraining: Question-driven pretraining for controllable summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12737–12755, 2023.
Pivovarov and Elhadad (2015) Rimma Pivovarov and Noémie Elhadad. Automated methods for the summarization of electronic health records. Journal of the American Medical Informatics Association : JAMIA, 22:938 – 947, 2015.
Reunamo et al. (2022) Akseli Reunamo, Laura-Maria Peltonen, Reetta Mustonen, Minttu Saari, Tapio Salakoski, Sanna Salanterä, and Hans Moen. Text classification model explainability for keyword extraction–towards keyword-based summarization of nursing care episodes. In MEDINFO 2021: One World, One Health–Global Partnership for Digital Innovation, pages 632–636. IOS Press, 2022.
Searle et al. (2023) Thomas Searle, Zina Ibrahim, James Teo, and Richard JB Dobson. Discharge summary hospital course summarisation of in patient electronic health record text with clinical concept guided deep pre-trained transformer models. Journal of Biomedical Informatics, 141:104358, 2023.
Shing et al. (2021) Han-Chin Shing, Chaitanya Shivade, Nima Pourdamghani, Feng Nan, Philip Resnik, Douglas Oard, and Parminder Bhatia. Towards clinical encounter summarization: Learning to compose discharge summaries from prior notes. arXiv preprint arXiv:2104.13498, 2021.
Soldaini and Goharian (2016) Luca Soldaini and Nazli Goharian. Quickumls: a fast, unsupervised approach for medical concept extraction. In MedIR Workshop, SIGIR, pages 1–4, 2016.
Song et al. (2020) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
Tang et al. (2019) Matthew Tang, Priyanka Gandhi, Md. Ahsanul Kabir, Christopher Zou, Jordyn Blakey, and Xiao Luo. Progress notes classification and keyword extraction using attention-based deep learning models with bert. ArXiv, abs/1910.05786, 2019.
Törnvall and Wilhelmsson (2008) Eva Törnvall and Susan Wilhelmsson. Nursing documentation for communicating and evaluating care. Journal of Clinical Nursing, 17(16):2116–2124, 2008.
Van Veen et al. (2023) Dave Van Veen, Cara Van Uden, Maayane Attias, Anuj Pareek, Christian Bluethgen, Malgorzata Polacin, Wah Chiu, Jean-Benoit Delbrouck, Juan Zambrano Chaves, Curtis Langlotz, et al. Radadapt: Radiology report summarization via lightweight domain adaptation of large language models. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 449–460, 2023.
Van Veen et al. (2024) Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine, pages 1–9, 2024.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Vig et al. (2022) Jesse Vig, Alexander Richard Fabbri, Wojciech Kryściński, Chien-Sheng Wu, and Wenhao Liu. Exploring neural models for query-focused summarization. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1455–1468, 2022.
Wang et al. (2021) Mengqian Wang, Manhua Wang, Fei Yu, Yue Yang, Jennifer Walker, and Javed Mostafa. A systematic review of automatic text summarization for biomedical literature and ehrs. Journal of the American Medical Informatics Association, 28(10):2287–2297, 2021.
Yang et al. (2020) Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, and Eric Darve. Ted: A pretrained unsupervised summarization model with theme modeling and denoising. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1865–1874, 2020.
Yuan et al. (2021) Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
Zhang et al. (2020a) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR, 2020a.
Zhang et al. (2021) Longxiang Zhang, Renato Negrinho, Arindam Ghosh, Vasudevan Jagannathan, Hamid Reza Hassanzadeh, Thomas Schaaf, and Matthew R Gormley. Leveraging pretrained models for automatic summarization of doctor-patient conversations. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3693–3712, 2021.
Zhang et al. (2020b) Yuhao Zhang, Derek Merck, Emily Tsai, Christopher D Manning, and Curtis Langlotz. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5108–5120, 2020b.
Zhong et al. (2022) Ming Zhong, Yang Liu, Suyu Ge, Yuning Mao, Yizhu Jiao, Xingxing Zhang, Yichong Xu, Chenguang Zhu, Michael Zeng, and Jiawei Han. Unsupervised multi-granularity summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4980–4995, 2022.
Zhuang et al. (2022) Haojie Zhuang, Wei Emma Zhang, Jian Yang, Congbo Ma, Yutong Qu, and Quan Z Sheng. Learning from the source document: Unsupervised abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4194–4205, 2022.

Appendix A. Additional Details for Implementation

A.1 Patient Metadata

Figure 6 shows one artificial example of how patient information is obtained in natural language from structured metadata. In the MIMIC-III database, we retrieve patient identifiers (“SUBJECT_ID”), gender information (“GENDER”), and date of birth (“DOB”) from the “PATIENTS” table. Admission identifiers (“HADM_ID”) and admission time (“ADMITTIME”) are obtained from the “ADMISSIONS” table, while diagnosis codes and procedure codes are sourced from the “DIAGNOSES_ICD” and “PROCEDURES_ICD” tables, respectively.

A.2 Data Preprocessing

Following prior research (Harutyunyan et al., 2019; Huang et al., 2019), we filter both admission records and nursing notes. First, we exclude specific admission cases: (1) cases of in-hospital mortality and admissions categorized as “NEWBORN”; (2) cases containing diagnosis codes outside the HCUP CCS code groups. We retain only admissions containing clinical notes categorized as “Nursing/other”. Next, we apply a length limit to nursing notes, filtering out those with more than 800 tokens or fewer than 50 tokens. Finally, we filter out admission cases with more than 100 nursing notes. Nursing notes in these cases typically represent out-of-distribution information or are irrelevant to the care episode.
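The two note-level filters can be sketched as follows. This is an illustration under stated assumptions: tokens are counted by whitespace splitting (the tokenizer is not specified here), the per-admission count is taken after the length filter (the order in which the steps are listed), and the names `keep_note` and `filter_admissions` are hypothetical.

```python
def keep_note(text, min_tokens=50, max_tokens=800):
    """Keep notes with 50 to 800 tokens (whitespace tokens are an assumption)."""
    n_tokens = len(text.split())
    return min_tokens <= n_tokens <= max_tokens


def filter_admissions(notes_by_admission, max_notes=100):
    """Apply the length limit, then drop admissions with more than max_notes notes."""
    kept = {}
    for hadm_id, notes in notes_by_admission.items():
        selected = [note for note in notes if keep_note(note)]
        # Drop admissions left with no notes or with too many notes.
        if 0 < len(selected) <= max_notes:
            kept[hadm_id] = selected
    return kept
```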

We preprocess nursing notes following Huang et al. (2019). In addition, we expand certain abbreviations that occur multiple times in each note, such as “pt” (patient), “cv” (cardiovascular), and “resp” (respiratory), to aid the model’s understanding of the notes. By random sampling, we collect 10001 nursing notes from 1000 admissions as the validation set. For the test set, we randomly select 1516 admissions and sample 3079 nursing notes from these admissions. We limit testing to 3079 notes because of the cost of using GPT-4. The nursing notes in the remaining admissions form the training set.
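A minimal sketch of the abbreviation expansion, assuming whole-word, case-insensitive matching. Only the three abbreviations named above are included; the full list and the exact matching rules are not given here, and `expand_abbreviations` is a hypothetical name.

```python
import re

# Only the three examples mentioned in the text; the full list is not given.
ABBREVIATIONS = {"pt": "patient", "cv": "cardiovascular", "resp": "respiratory"}

# \b restricts matches to whole words, so "pts" or "opt" are left untouched.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b", re.IGNORECASE
)


def expand_abbreviations(note):
    """Replace each known abbreviation with its lowercase expansion."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], note)
```

Note that this sketch lowercases the expansion regardless of the original casing, which may or may not match the preprocessing actually used.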

A.3 Details of Implementation for the Query Responder

We use nursing notes in the training set to train the query responder $\operatorname{R}$. The data statistics are presented in Table 3.

To address the class imbalance issue in the readmission prediction task, we conduct oversampling for notes in the positive class (“being readmitted”) and undersampling for notes in the negative class (“not being readmitted”). This results in 35000 nursing notes being used for training.
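The resampling step can be sketched as below, assuming random oversampling with replacement and random undersampling without replacement toward a common per-class size. The per-class split of the 35000 notes is not stated, so `n_per_class` is left as a free parameter, and `rebalance` is a hypothetical helper.

```python
import random


def rebalance(notes, labels, n_per_class, seed=0):
    """Resample every class to exactly n_per_class examples."""
    rng = random.Random(seed)
    out = []
    for cls in sorted(set(labels)):
        pool = [n for n, y in zip(notes, labels) if y == cls]
        if len(pool) >= n_per_class:
            # Majority class: undersample without replacement.
            chosen = rng.sample(pool, n_per_class)
        else:
            # Minority class: keep all examples, then add random duplicates.
            chosen = pool + rng.choices(pool, k=n_per_class - len(pool))
        out.extend((n, cls) for n in chosen)
    rng.shuffle(out)
    return out
```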

Table 3: The number of nursing notes used for training, validation, and testing.

Query	Training	Validation	Testing
Next Note Prediction	100000	5000	17458
Readmission Prediction	35000	10001	17458
Phenotype Classification	149015	10001	17458

Given a nursing note $N$ and its summary $S$ generated by QGSumm:

Contrastive Next Note Prediction.

Two note pairs $(N,N_{pos})$ and $(N,N_{neg})$ are constructed as introduced in Section 4.2.

$$p_{pos}=\operatorname{R}(N,N_{pos}),\quad p_{neg}=\operatorname{R}(N,N_{neg}), \tag{10}$$

$$p^{\prime}_{pos}=\operatorname{R}(S,N_{pos}),\quad p^{\prime}_{neg}=\operatorname{R}(S,N_{neg}). \tag{11}$$

The learning objective in this case is:

$$\min\ \mathcal{L}_{CrossEntropy}([p_{pos},p_{neg}],[p^{\prime}_{pos},p^{\prime}_{neg}]). \tag{12}$$

Readmission Prediction.

The result of the readmission prediction takes the form $[p_{pos},p_{neg}]$, indicating the probabilities of “being readmitted” and “not being readmitted”. The learning objective is the same as Equation 12.

Phenotype Classification.

We obtain the probability distribution of the phenotype from $\operatorname{R}$ :

$$[p_{1},p_{2},...,p_{25}]=\operatorname{R}(N),\quad [p^{\prime}_{1},p^{\prime}_{2},...,p^{\prime}_{25}]=\operatorname{R}(S). \tag{13}$$

Consequently, the learning objective is:

$$\min\ \mathcal{L}_{CrossEntropy}([p_{1},p_{2},...,p_{25}],[p^{\prime}_{1},p^{\prime}_{2},...,p^{\prime}_{25}]). \tag{14}$$
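All three objectives compare the responder's output on the note with its output on the summary. The sketch below illustrates the cross-entropy term for the two-class case, under the assumption that the first argument (from the note $N$) acts as the target and the second (from the summary $S$) as the prediction, and that both are normalized distributions; `cross_entropy` and the `eps` clamp are illustrative choices.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted) between two discrete distributions."""
    # Clamp predictions away from zero to avoid log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Responder output on the note (target) and on two candidate summaries.
p_note = [0.8, 0.2]
loss_faithful = cross_entropy(p_note, [0.8, 0.2])     # summary preserves the signal
loss_uninformative = cross_entropy(p_note, [0.5, 0.5])  # summary loses the signal
```

For a fixed target, the term is smallest when the distribution obtained from the summary equals the one obtained from the note, so minimizing it pushes the summary to retain the information the responder relies on.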

A.4 Hyperparameter Setting

We present the hyperparameter settings of QGSumm and the query responder $\operatorname{R}$ in Table 4. The architecture hyperparameters of the base model are kept the same as in the original configuration of BART-Large-CNN (https://huggingface.co/facebook/bart-large-cnn/blob/main/config.json). We set the maximum summary length to 500 tokens, a loose upper bound that leaves the model free to determine the appropriate length itself. We use the Adam optimizer (Kingma and Ba, 2014) to optimize the model.

Table 4: Details of the hyperparameter setting.

	Hyperparameter	Choices
QGSumm	learning rate	{1e-5, 2e-5, 5e-5, 2e-4, 5e-4}
	number of training epochs	3
	$\lambda_{1}$	{0.1, 0.3, 0.5, 0.7, 0.9, 1.0}
	$\lambda_{2}$	{0.0, 0.1, 0.3, 0.6, 0.8, 1.0}
	decoder layers being augmented by PIA	{all 12 layers, first 6 layers, last 6 layers}
$\operatorname{R}$	learning rate	{2e-5, 5e-5, 2e-4, 5e-4, 1e-3}
	number of training epochs for next note prediction	2
	number of training epochs for readmission prediction	2
	number of training epochs for phenotype classification	3

Appendix B. Additional Details for Baselines and Evaluation

B.1 Baselines

Choice of the baselines

BART-Large-CNN (Lewis et al., 2020): We choose it as the base model for its strong performance on text summarization and its lower computational cost compared with its peers. We include it as a baseline to show performance without the proposed components.

Pegasus (Zhang et al., 2020a): A transformer-based pre-trained model specialized in abstractive summarization and widely used as a baseline in text summarization studies.

BioMistral-7B (Labrak et al., 2024): An open-source instruction-based LLM adapted from Mistral (Jiang et al., 2023) for the medical domain. It achieves state-of-the-art performance on supervised fine-tuning benchmarks compared to other open-source medical language models.

GPT-4 (OpenAI, 2023): A proprietary LLM representing the state of the art on general NLP tasks.

Consistent with our method’s settings, the maximum length of the summary is set to 500 tokens.

Prompt Learning for GPT-4 and BioMistral-7B

A prompt example is shown in Figure 7. We employ one-shot in-context learning to prompt GPT-4 and BioMistral-7B, providing one summary example. This strategy can significantly improve the conciseness of the generated summary and ensure its text structure aligns with the requirements.

Few-shot Adaptation

We randomly sample 10 nursing notes from the training set and pair them with their corresponding summaries generated by GPT-4 to create the training data for 10-shot fine-tuning of BART-Large-CNN, BioMistral-7B, and Pegasus. For BioMistral-7B, the training data is transformed into instruction-formatted prompts as shown in Figure 7. We fine-tune BART-Large-CNN and Pegasus for 8 and 9 epochs, respectively, with a learning rate of 0.0005. We fine-tune BioMistral-7B using QLoRA adaptation for 7 epochs with a learning rate of 0.0002.

Evaluation on Predictiveness

In our experiments, we assess predictiveness by evaluating readmission prediction and phenotype classification performance on summaries from the baselines (8 methods in total, including extractive ones) and from different settings of our approach. For instance, to evaluate the predictiveness of summaries from BART-zs, we begin with the trained query responder $\operatorname{R}$ described in Appendix A.3. We then fine-tune $\operatorname{R}$ and compute the metrics through 10-fold cross-validation on the BART-zs summaries of the nursing notes in the test set. Consequently, we obtain a separate responder for each method and for each of the two tasks.
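The 10-fold protocol can be sketched as an index splitter over the 3079 test-set summaries. This is an illustration only: it assumes a shuffled note-level split (whether folds are grouped by admission is not specified), and `k_fold_indices` is a hypothetical helper rather than the code used in the experiments.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    # Strided slicing gives k disjoint folds whose sizes differ by at most one.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, folds[i]


# One split per fold: fine-tune the responder on `train`, score it on `held_out`.
splits = list(k_fold_indices(3079))
```

Metrics would then be averaged over the ten held-out folds for each summarization method.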

B.2 Grading Criteria of Manual Evaluation

Figure 8 shows the detailed grading criteria we provide to the clinician for manual evaluation. The score for each metric ranges from 1 to 5.

Appendix C. Additional Experimental Results

Effects of the length penalty coefficient.

The effects of varying the length penalty coefficient are shown in Figure 9(a). As $\lambda_{1}$ increases, the generated summaries become more concise. However, once $\lambda_{1}$ exceeds 0.5, medical information consistency drops notably, and predictiveness declines with it. One potential explanation is that within the range of 0.1 to 0.5, $\lambda_{1}$ helps refine nursing notes by filtering out unnecessary information, whereas values above 0.5 impose a stricter penalty that causes key patient information to be omitted in pursuit of more concise summaries. To strike a balance between conciseness and consistency, we ultimately set $\lambda_{1}$ to 0.5.

Effects of the importance of patient meta information.

$\lambda_{2}$ regulates the contribution of patient metadata to the summarization process. The incorporation of patient meta information helps maintain the factuality of the summary. However, as shown in Figure 9(b), this influence is not consistently beneficial, with the optimal effect observed when $\lambda_{2}$ is set to 0.3. Additionally, excessively large $\lambda_{2}$ causes the model to prioritize patient metadata over the content of the nursing note, which degrades the quality of the generated summary, reflected as reduced predictiveness. Therefore, we set $\lambda_{2}$ to 0.3 for our method.