Extracting Critical Information from Unstructured Clinicians' Notes Data to Identify Dementia Severity Using a Rule-Based Approach: Feasibility Study

JMIR Aging. 2024 Sep 24;7:e57926. doi: 10.2196/57926.

Abstract

Background: The severity of Alzheimer disease and related dementias (ADRD) is rarely documented in structured data fields in electronic health records (EHRs). Although this information is important for clinical monitoring and decision-making, it is often undocumented or "hidden" in unstructured text fields and not readily available for clinicians to act upon.

Objective: We aimed to assess the feasibility and potential bias in using keywords and rule-based matching for obtaining information about the severity of ADRD from EHR data.

Methods: We used EHR data from a large academic health care system that included patients with a primary discharge diagnosis of ADRD based on ICD-9 (International Classification of Diseases, Ninth Revision) and ICD-10 (International Statistical Classification of Diseases, Tenth Revision) codes between 2014 and 2019. We first assessed whether any ADRD severity information was present in the EHR and then, where it was present, the documented severity. Clinicians' notes were used to determine the severity of ADRD based on two criteria: (1) scores from the Mini Mental State Examination and Montreal Cognitive Assessment and (2) explicit terms for ADRD severity (eg, "mild dementia" and "advanced Alzheimer disease"). We compiled a list of common ADRD symptoms, cognitive test names, and disease severity terms, refining it iteratively based on previous literature and clinical expertise. We then applied rule-based matching in Python, with standard open-source data analysis libraries, to identify the context in which specific words or phrases were mentioned. Finally, we estimated the prevalence of documented ADRD severity and assessed the performance of our rule-based algorithm.
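As a rough illustration of the two criteria described in the Methods, the following is a minimal sketch of rule-based matching using only Python's standard `re` module. It is not the study's algorithm. The patterns, the score cut-offs in `MILD_CUTOFF`, and the function name `classify_note` are assumptions made for illustration; the abstract does not report the actual keyword list or thresholds.

```python
import re
from typing import Optional

# Illustrative patterns only (assumption): the study's keyword list is not
# given in the abstract.
TERM_RE = re.compile(
    r"\b(mild|moderate|severe|advanced)\s+(?:dementia|alzheimer)",
    re.IGNORECASE,
)
# Matches e.g. "MMSE score of 24/30" or "MoCA: 12/30".
SCORE_RE = re.compile(
    r"\b(MMSE|MoCA)\b\D{0,20}?(\d{1,2})\s*/\s*30",
    re.IGNORECASE,
)

# Hypothetical cut-offs (assumption): a score at or above the value is
# treated as "mild"; anything lower as "moderate_or_severe".
MILD_CUTOFF = {"mmse": 21, "moca": 18}


def classify_note(text: str) -> Optional[str]:
    """Return 'mild', 'moderate_or_severe', or None if nothing is documented."""
    # Criterion 2: explicit severity terms.
    term = TERM_RE.search(text)
    if term:
        return "mild" if term.group(1).lower() == "mild" else "moderate_or_severe"
    # Criterion 1: cognitive test scores.
    score = SCORE_RE.search(text)
    if score:
        test, value = score.group(1).lower(), int(score.group(2))
        return "mild" if value >= MILD_CUTOFF[test] else "moderate_or_severe"
    return None
```

For example, `classify_note("MMSE score of 24/30 today.")` falls through to the score rule and is labeled mild under the assumed cut-off. The sketch ignores context such as negation ("no severe dementia") or family history, which a usable rule set would need to handle.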

Results: We included 9115 eligible patients with more than 65,000 provider notes. Overall, 22.93% (2090/9115) of patients were documented with mild ADRD, 20.87% (1902/9115) were documented with moderate or severe ADRD, and 56.20% (5123/9115) did not have any documentation of the severity of their ADRD. For the task of determining the presence of any ADRD severity information, our algorithm achieved an accuracy of >95%, specificity of >95%, sensitivity of >90%, and an F1-score of >83%. For the specific task of identifying the actual severity of ADRD, the algorithm performed well, with an accuracy of >91%, specificity of >80%, sensitivity of >88%, and an F1-score of >92%. Compared with patients with mild ADRD, those with more advanced ADRD tended to be older, were more likely to be female and Black, and were more likely to have received their diagnoses in primary care or in-hospital settings. Relative to patients with undocumented ADRD severity, those with documented ADRD severity had a similar distribution in terms of sex, race, and rural or urban residence.
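The reported prevalence figures can be checked directly from the counts in the Results, and the performance measures follow the standard confusion-matrix definitions. The sketch below shows both. The helper name `binary_metrics` and the confusion-matrix counts in the usage note are hypothetical; the abstract reports only thresholds (eg, ">95%"), not the underlying counts.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall / true-positive rate
    specificity = tn / (tn + fp)          # true-negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Prevalence recomputed from the patient counts reported in the Results.
total = 9115
counts = {"mild": 2090, "moderate_or_severe": 1902, "undocumented": 5123}
prevalence = {k: round(100 * v / total, 2) for k, v in counts.items()}
```

With made-up counts such as `binary_metrics(tp=90, fp=5, tn=95, fn=10)`, sensitivity is 0.90 and specificity is 0.95. The three reported patient counts sum to 9115, and the recomputed percentages agree with the published 22.93%, 20.87%, and 56.20%.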

Conclusions: Our study demonstrates the feasibility of using a rule-based matching algorithm to identify ADRD severity from unstructured EHR report data. However, it is essential to acknowledge potential biases arising from differences in documentation practices across various health care systems.

Keywords: AD; ADRD; AI; Alzheimer's disease; Alzheimer's disease and related dementias; EHR; EMR; LLM; NLP; PHR; artificial intelligence; deep learning; dementia; electronic medical record; electronic health record; geriatric syndromes; health record; large language model; natural language processing; patient record; personal health record; rule based analysis; unstructured data.

MeSH terms

  • Aged
  • Aged, 80 and over
  • Alzheimer Disease / diagnosis
  • Dementia* / diagnosis
  • Electronic Health Records*
  • Feasibility Studies*
  • Female
  • Humans
  • Male
  • Severity of Illness Index*