Machine learning approaches improve risk stratification for secondary cardiovascular disease prevention in multiethnic patients

Open Heart. 2021 Oct;8(2):e001802. doi: 10.1136/openhrt-2021-001802.

Abstract

Objectives: Identifying high-risk patients is crucial for effective cardiovascular disease (CVD) prevention. It is not known whether electronic health record (EHR)-based machine-learning (ML) models can improve CVD risk stratification compared with a secondary prevention risk score developed from randomised clinical trials (Thrombolysis in Myocardial Infarction Risk Score for Secondary Prevention, TRS 2°P).

Methods: We identified patients with CVD, including atherosclerotic CVD (ASCVD), in a large health system and split them into 80% training and 20% test sets. A rich set of EHR patient features was extracted. ML models were trained to estimate 5-year CVD event risk: random forests (RF), gradient-boosted machines (GBM), extreme gradient-boosted models (XGBoost), and logistic regression with either an L2 penalty or an L1 penalty (Lasso). ML models and TRS 2°P were evaluated by the area under the receiver operating characteristic curve (AUC).
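The evaluation design described in the Methods (an 80/20 split followed by AUC scoring on the held-out set) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the function names `train_test_split` and `auc`, the synthetic cohort, and the two example scores are all assumptions made for the sketch, and no model fitting is shown.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle records and split them into (training, test) lists."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def auc(labels, scores):
    """AUC as the probability that a randomly chosen event case receives a
    higher score than a randomly chosen non-event case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic cohort (hypothetical): each record is (event, informative score,
# uninformative score). The event probability rises with a hidden risk value.
rng = random.Random(42)
cohort = []
for _ in range(1000):
    risk = rng.random()
    event = 1 if rng.random() < 0.25 * risk else 0
    informative = risk + rng.gauss(0, 0.3)   # tracks the hidden risk
    uninformative = rng.random()             # unrelated to the outcome
    cohort.append((event, informative, uninformative))

train, test = train_test_split(cohort, test_fraction=0.2)
y_test = [r[0] for r in test]
print("informative score AUC:  ", round(auc(y_test, [r[1] for r in test]), 2))
print("uninformative score AUC:", round(auc(y_test, [r[2] for r in test]), 2))
```

A score unrelated to the outcome is expected to land near an AUC of 0.5, which is the reference point for interpreting the values reported in the Results.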

Results: The cohort included 32 192 patients (median age 74 years; 46% female, 63% non-Hispanic white and 12% Asian), of whom 23 475 had ASCVD. There were 4010 events over 5 years of follow-up. ML models performed well overall: XGBoost achieved an AUC of 0.70 (95% CI 0.68 to 0.71) in the full CVD cohort and 0.71 (95% CI 0.69 to 0.73) in patients with ASCVD, with comparable performance by GBM, RF and Lasso. TRS 2°P performed poorly in all CVD (AUC 0.51, 95% CI 0.50 to 0.53) and ASCVD (AUC 0.50, 95% CI 0.48 to 0.52) patients. The ML models identified nontraditional predictive variables, including education level and primary care visits.
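The Results report each AUC with a 95% CI. The abstract does not state how those intervals were computed; one common option is a percentile bootstrap over the test set, sketched below under that assumption. The names `auc` and `bootstrap_auc_ci`, the defaults, and the synthetic data are hypothetical.

```python
import random


def auc(labels, scores):
    """Pairwise (Mann-Whitney) AUC; ties between scores count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample patients with replacement,
    recompute the AUC each time, and take the alpha/2 and 1 - alpha/2
    percentiles of the resampled AUCs."""
    if len(set(labels)) < 2:
        raise ValueError("need both events and non-events")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class, so its AUC is undefined
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic example (hypothetical data): a noisy but informative score.
rng = random.Random(7)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(200)]
scores = [y * 0.5 + rng.gauss(0, 0.5) for y in labels]
point = auc(labels, scores)
lo, hi = bootstrap_auc_ci(labels, scores, n_boot=500)
print(f"AUC {point:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

An interval that includes 0.5, as with the TRS 2°P estimate in patients with ASCVD, is compatible with no discrimination at all.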

Conclusions: In a multiethnic real-world population, EHR-based ML approaches significantly improved CVD risk stratification for secondary prevention.

Keywords: coronary artery disease; electronic health records; risk factors.

Publication types

  • Multicenter Study
  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Aged
  • Aged, 80 and over
  • Cardiovascular Diseases / ethnology*
  • Cardiovascular Diseases / prevention & control
  • Electronic Health Records / statistics & numerical data
  • Ethnicity*
  • Female
  • Follow-Up Studies
  • Humans
  • Incidence
  • Machine Learning*
  • Male
  • Middle Aged
  • Retrospective Studies
  • Risk Assessment / methods*
  • Risk Factors
  • Survival Rate / trends
  • Time Factors
  • United States / epidemiology