Machine learning approaches improve risk stratification for secondary cardiovascular disease prevention in multiethnic patients

Ashish Sarraju; Andrew Ward; Sukyung Chung; Jiang Li; David Scheinker; Fàtima Rodríguez

doi:10.1136/openhrt-2021-001802

Open Heart. 2021 Oct;8(2):e001802. doi: 10.1136/openhrt-2021-001802.

Authors

Ashish Sarraju^#¹, Andrew Ward^#², Sukyung Chung³, Jiang Li³, David Scheinker^#⁴,⁵, Fàtima Rodríguez^#⁶

Affiliations

¹ Division of Cardiovascular Medicine and Cardiovascular Institute, Stanford University School of Medicine, Stanford, California, USA.
² Department of Electrical Engineering, Stanford University, Stanford, California, USA.
³ Palo Alto Medical Foundation Research Institute, Palo Alto, California, USA.
⁴ Department of Management Science and Engineering, Stanford University School of Engineering, Stanford, California, USA.
⁵ Division of Pediatric Endocrinology, Stanford University School of Medicine, Stanford, California, USA.
⁶ Division of Cardiovascular Medicine and Cardiovascular Institute, Stanford University School of Medicine, Stanford, California, USA.

^# Contributed equally.

Abstract

Objectives: Identifying high-risk patients is crucial for effective cardiovascular disease (CVD) prevention. It is not known whether electronic health record (EHR)-based machine-learning (ML) models can improve CVD risk stratification compared with a secondary prevention risk score developed from randomised clinical trials (Thrombolysis in Myocardial Infarction Risk Score for Secondary Prevention, TRS 2°P).

Methods: We identified patients with CVD, including atherosclerotic CVD (ASCVD), in a large health system and split them into 80% training and 20% test sets. A rich set of EHR patient features was extracted. ML models were trained to estimate 5-year CVD event risk: random forests (RF), gradient-boosted machines (GBM), extreme gradient-boosted models (XGBoost), and logistic regression with either an L₂ penalty or an L₁ penalty (Lasso). ML models and TRS 2°P were evaluated by the area under the receiver operating characteristic curve (AUC).
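The evaluation design described in the Methods (an 80%/20% split, then scoring by AUC) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the random seed and the toy records are all hypothetical, and the risk scores are a stand-in for the output of any trained model or of TRS 2°P. AUC is computed here from its pairwise definition, the probability that a randomly chosen patient with an event receives a higher score than a randomly chosen patient without one.

```python
import random


def train_test_split_80_20(records, seed=0):
    """Shuffle a copy of the records and split it 80% / 20% (hypothetical helper)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def auc(labels, scores):
    """AUC from its pairwise definition; tied scores count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy records of (risk_score, had_event); illustrative only, not study data.
records = [(i / 100, int(i % 7 == 0)) for i in range(100)]
train, test = train_test_split_80_20(records)
print(len(train), len(test))

# Scoring four toy patients: two with events (label 1), two without (label 0).
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 corresponds to ranking patients no better than chance, which is the scale on which the reported model and TRS 2°P values are compared.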

Results: The cohort included 32 192 patients (median age 74 years, with 46% female, 63% non-Hispanic white and 12% Asian patients and 23 475 patients with ASCVD). There were 4010 events over 5 years of follow-up. ML models demonstrated good overall performance; XGBoost demonstrated AUC 0.70 (95% CI 0.68 to 0.71) in the full CVD cohort and AUC 0.71 (95% CI 0.69 to 0.73) in patients with ASCVD, with comparable performance by GBM, RF and Lasso. TRS 2°P performed poorly in all CVD (AUC 0.51, 95% CI 0.50 to 0.53) and ASCVD (AUC 0.50, 95% CI 0.48 to 0.52) patients. ML identified nontraditional predictive variables including education level and primary care visits.
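The Results report each AUC with a 95% CI, but the abstract does not say how the intervals were obtained. One common option, assumed here purely for illustration, is a percentile bootstrap: resample the test patients with replacement, recompute the AUC each time, and take the 2.5th and 97.5th percentiles. The function names, the number of resamples and the synthetic data below are hypothetical.

```python
import random


def auc(labels, scores):
    """Pairwise AUC; tied scores count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC (an assumed method, not the paper's)."""
    if len(set(labels)) < 2:
        raise ValueError("need both events and non-events")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic test set: one event in four, scores weakly shifted for events.
rng = random.Random(1)
labels = [1 if i % 4 == 0 else 0 for i in range(80)]
scores = [0.3 * y + rng.random() for y in labels]
lower, upper = bootstrap_auc_ci(labels, scores)
print(f"AUC {auc(labels, scores):.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Under this reading, an interval such as 0.50 to 0.53 for TRS 2°P sits at or just above the chance level of 0.5, while the intervals reported for XGBoost lie well above it.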

Conclusions: In a multiethnic real-world population, EHR-based ML approaches significantly improved CVD risk stratification for secondary prevention.

Keywords: coronary artery disease; electronic health records; risk factors.

Publication types

Multicenter Study
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Aged
Aged, 80 and over
Cardiovascular Diseases / ethnology*
Cardiovascular Diseases / prevention & control
Electronic Health Records / statistics & numerical data
Ethnicity*
Female
Follow-Up Studies
Humans
Incidence
Machine Learning*
Male
Middle Aged
Retrospective Studies
Risk Assessment / methods*
Risk Factors
Survival Rate / trends
Time Factors
United States / epidemiology

Grants and funding

K01 HL144607/HL/NHLBI NIH HHS/United States