Extraction of sleep information from clinical notes of Alzheimer's disease patients using natural language processing

Sonish Sivarajkumar; Thomas Yu Chow Tam; Haneef Ahamed Mohammad; Samuel Viggiano; David Oniani; Shyam Visweswaran; Yanshan Wang

doi:10.1093/jamia/ocae177

J Am Med Inform Assoc. 2024 Oct 1;31(10):2217-2227. doi: 10.1093/jamia/ocae177.

Authors

Sonish Sivarajkumar¹, Thomas Yu Chow Tam², Haneef Ahamed Mohammad², Samuel Viggiano², David Oniani², Shyam Visweswaran¹ ³, Yanshan Wang¹ ² ³ ⁴

Affiliations

¹ Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, United States.
² Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA 15260, United States.
³ Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, United States.
⁴ Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA 15260, United States.

PMID: 39001795
PMCID: PMC11413436 (available on 2025-07-13)
DOI: 10.1093/jamia/ocae177

Abstract

Objectives: Alzheimer's disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that have been shown to be critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is that the traditional way of acquiring sleep information is time-consuming, inefficient, non-scalable, and limited to patients' subjective experience. We aim to automate the extraction of specific sleep-related patterns, such as snoring, napping, poor sleep quality, daytime sleepiness, night wakings, other sleep problems, and sleep duration, from clinical notes of AD patients. These sleep patterns are hypothesized to play a role in the incidence of AD, and extracting them could provide insight into the relationship between sleep and AD onset and progression.

Materials and methods: A gold standard dataset was created by manually annotating 570 randomly sampled clinical notes from adSLEEP, a corpus of 192 000 de-identified clinical notes of 7266 AD patients retrieved from the University of Pittsburgh Medical Center (UPMC). We developed a rule-based natural language processing (NLP) algorithm, machine learning models, and large language model (LLM)-based NLP algorithms to automate the extraction of sleep-related concepts, including snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration, from the gold standard dataset.
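
As a rough illustration of what rule-based extraction of sleep concepts from a note can look like, the following minimal Python sketch matches a few concepts with regular expressions. The patterns, the `SLEEP_PATTERNS` table, and the `extract_sleep_concepts` function are hypothetical and chosen only for illustration; the abstract does not describe the study's actual rules, and a practical system would also need to handle negation and context.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration only; these are NOT the study's rules.
SLEEP_PATTERNS = {
    "snoring": re.compile(r"\bsnor(?:e|es|ed|ing)\b", re.IGNORECASE),
    "napping": re.compile(r"\bnap(?:s|ped|ping)?\b", re.IGNORECASE),
    "daytime sleepiness": re.compile(
        r"\b(?:daytime sleepiness|sleepy during the day)\b", re.IGNORECASE
    ),
    "night wakings": re.compile(
        r"\b(?:wakes?|waking|woke) up (?:at|during the) night\b", re.IGNORECASE
    ),
    "sleep duration": re.compile(
        r"\bsleeps? (?:about |around )?\d+(?:\.\d+)? ?(?:hours?|hrs?)\b",
        re.IGNORECASE,
    ),
}


def extract_sleep_concepts(note: str) -> Dict[str, List[str]]:
    """Return the matched text spans for each concept found in the note."""
    found: Dict[str, List[str]] = {}
    for concept, pattern in SLEEP_PATTERNS.items():
        matches = [m.group(0) for m in pattern.finditer(note)]
        if matches:
            found[concept] = matches
    return found


# Made-up example note, not taken from the adSLEEP corpus.
note = "Patient snores loudly and naps most afternoons. Sleeps about 5 hours per night."
print(extract_sleep_concepts(note))
```

This sketch does no negation detection, so a phrase such as "denies snoring" would still be counted as a mention of snoring.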

Results: The annotated dataset of 482 patients comprised a predominantly White (89.2%), older adult population with an average age of 84.7 years, where females represented 64.1%, and a vast majority were non-Hispanic or Latino (94.6%). The rule-based NLP algorithm achieved the best F1 score across all sleep-related concepts. In terms of positive predictive value (PPV), the rule-based NLP algorithm achieved the highest PPV scores for daytime sleepiness (1.00) and sleep duration (1.00), while the machine learning models had the highest PPV for napping (0.95) and bad sleep quality (0.86), and LLAMA2 with fine-tuning had the highest PPV for night wakings (0.93) and sleep problem (0.89).
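
For readers less familiar with the metrics reported above, PPV (precision) and F1 follow the standard definitions based on true positives, false positives, and false negatives. The short Python sketch below computes them; the function name `ppv_recall_f1` and the counts in the usage line are hypothetical and are not results from the study.

```python
from typing import Tuple


def ppv_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute PPV (precision), recall (sensitivity), and F1 from raw counts."""
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of PPV and recall.
    f1 = 2 * ppv * recall / (ppv + recall) if (ppv + recall) else 0.0
    return ppv, recall, f1


# Hypothetical counts for one concept, for illustration only.
ppv, recall, f1 = ppv_recall_f1(tp=8, fp=2, fn=2)
print(f"PPV={ppv:.2f} recall={recall:.2f} F1={f1:.2f}")
```

A PPV of 1.00 for a concept therefore means that the extractor produced no false positives for it, which by itself says nothing about how many true mentions were missed.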

Discussion: Although sleep information is infrequently documented in clinical notes, the proposed rule-based and LLM-based NLP algorithms still achieved promising results. In comparison, the machine learning-based approaches did not perform well, which is attributed to the small amount of sleep information in the training data.

Conclusion: The results show that the rule-based NLP algorithm consistently achieved the best performance for all sleep concepts. This study focused on the clinical notes of patients with AD but could be extended to general sleep information extraction for other diseases.

Keywords: Alzheimer’s disease; clinical notes; electronic health records; information extraction; natural language processing; sleep.

MeSH terms

Aged
Alzheimer Disease*
Datasets as Topic
Electronic Health Records
Female
Humans
Machine Learning
Male
Natural Language Processing*
Sleep
Sleep Wake Disorders
