Clinical and temporal characterization of COVID-19 subgroups using patient vector embeddings of electronic health records

J Am Med Inform Assoc. 2023 Jan 18;30(2):256-272. doi: 10.1093/jamia/ocac208.

Abstract

Objective: To identify and characterize clinical subgroups of hospitalized Coronavirus Disease 2019 (COVID-19) patients.

Materials and methods: Electronic health records of hospitalized COVID-19 patients at NewYork-Presbyterian/Columbia University Irving Medical Center were temporally sequenced and transformed into patient vector representations using Paragraph Vector models. K-means clustering was performed to identify subgroups.
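The clustering step can be illustrated with a minimal sketch. It assumes patient vectors are already available as lists of floats; the Paragraph Vector embedding step is not reproduced here. The function name `kmeans`, the toy two-dimensional vectors, and the choice of k=2 are hypothetical and for illustration only (the study reports twenty subgroups), and the study's actual implementation and settings are not stated in the abstract.

```python
import random


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors (an assumption;
    # other initialisation schemes are common).
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical toy "patient vectors": two well-separated groups.
toy_vectors = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
centroids, labels = kmeans(toy_vectors, k=2)
```

Each entry of `labels` is the subgroup index assigned to the corresponding vector; with real embeddings, the number of clusters would need to be chosen by the analyst.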

Results: A diverse cohort of 11 313 patients with COVID-19 who were hospitalized between March 2, 2020 and December 1, 2021 was identified (median [IQR] age: 61.2 [40.3-74.3] years; 51.5% female). Twenty subgroups of hospitalized COVID-19 patients, labeled by increasing severity, were characterized by their demographics, conditions, outcomes, and severity (mild-moderate/severe/critical). Subgroup temporal patterns were characterized by the durations in each subgroup, transitions between subgroups, and the complete paths throughout the course of hospitalization.
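The three temporal summaries named above (durations, transitions, and complete paths) can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: it assumes each patient has one subgroup assignment per day of hospitalization, and the patient identifiers, subgroup sequences, and the helper `collapse` are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import groupby

# Hypothetical data: one subgroup label per hospital day for each patient.
daily_subgroups = {
    "patient_a": [3, 3, 7, 7, 7, 2],
    "patient_b": [3, 7, 7, 2],
    "patient_c": [17, 17, 17],
}


def collapse(sequence):
    """Run-length encode a sequence into (subgroup, consecutive_days) pairs."""
    return [(subgroup, len(list(run))) for subgroup, run in groupby(sequence)]


durations = defaultdict(list)  # subgroup -> lengths of each stay, in days
transitions = Counter()        # (from_subgroup, to_subgroup) -> count
paths = Counter()              # full ordered path of subgroups -> count

for sequence in daily_subgroups.values():
    runs = collapse(sequence)
    for subgroup, days in runs:
        durations[subgroup].append(days)
    path = tuple(subgroup for subgroup, _ in runs)
    paths[path] += 1
    # Consecutive pairs along the path are the subgroup-to-subgroup transitions.
    transitions.update(zip(path, path[1:]))
```

Counts such as `transitions` are the kind of input a chord diagram would visualize, and `paths.most_common()` ranks complete paths by prevalence.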

Discussion: Several subgroups had mild-moderate severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections but were hospitalized for underlying conditions (pregnancy, cardiovascular disease [CVD], etc.). Subgroup 7 included solid organ transplant recipients who mostly developed mild-moderate or severe disease. Subgroup 9 had a history of type-2 diabetes, kidney disease, and CVD, and suffered the highest rates of heart failure (45.2%) and end-stage renal disease (80.6%). Subgroup 13 was the oldest (median: 82.7 years) and had mixed severity but high mortality (33.3%). Subgroup 17 had critical disease and the highest mortality (64.6%), with age (median: 68.1 years) being the only notable risk factor. Subgroups 18-20 had critical disease with high complication rates and long hospitalizations (median: 40+ days). All subgroups are detailed in the full text. A chord diagram depicts the most common transitions, and the paths with the highest prevalence, the longest hospitalizations, and the lowest and highest mortalities are presented. Understanding these subgroups and their pathways may aid clinicians in making decisions for better management of, and earlier intervention for, patients.

Keywords: COVID-19; SARS-CoV-2; cluster analysis; unsupervised machine learning.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Aged
  • COVID-19*
  • Cardiovascular Diseases*
  • Electronic Health Records
  • Female
  • Hospitalization
  • Humans
  • Male
  • Middle Aged
  • SARS-CoV-2