Background: Researchers routinely evaluate novel biomarkers for incorporation into clinical risk models, weighing tradeoffs between cost, availability, and ease of deployment. For risk assessment in population health initiatives, ideal inputs would be those already available for most patients. We hypothesized that common hematologic markers (eg, hematocrit), available in an outpatient complete blood count without differential, would be useful to develop risk models for cardiovascular events.
Methods: We developed Cox proportional hazards models for predicting heart attack, ischemic stroke, heart failure hospitalization, revascularization, and all-cause mortality. For predictors, we used 10 hematologic indices (eg, hematocrit) from routine laboratory measurements collected from March 2016 to May 2017, along with demographic data and diagnostic codes. As outcomes, we used neural network-based automated event adjudication of 1 028 294 discharge summaries. We trained models on 23 238 patients from one hospital in Boston and evaluated them on 29 671 patients from a second hospital. We assessed calibration using the Brier score and discrimination using Harrell's concordance index. In addition, to determine the utility of high-dimensional interactions, we compared our proportional hazards models with random survival forest models.
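For readers unfamiliar with the discrimination metric named above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is illustrative only and is not the authors' implementation: the function name `harrell_c_index`, its arguments, and the choice to skip pairs with tied follow-up times are assumptions made for this example.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Illustrative Harrell's concordance index (hypothetical helper).

    times:       follow-up time for each patient
    events:      1 if the event was observed, 0 if the patient was censored
    risk_scores: model output, where a higher value means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: pairs with tied times are skipped
        # Let `a` be the patient with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event received the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by time to event, which is the scale on which the concordance indices in the Results are reported.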
Results: Event rates in our cohort ranged from 0.0067 to 0.075 per person-year. Models using only hematology indices had concordance indices ranging from 0.60 to 0.80 on an external validation set and showed the best discrimination when predicting heart failure (0.80 [95% CI, 0.79-0.82]) and all-cause mortality (0.78 [0.77-0.80]). Compared with models trained only on demographic data and diagnostic codes, models that also used hematology indices had better discrimination and calibration. The concordance indices of the resulting models ranged from 0.75 to 0.85, and the improvement in concordance index ranged up to 0.072. Random survival forests showed minimal improvement over proportional hazards models.
Conclusions: We conclude that low-cost, ubiquitous inputs, if biologically informative, can provide population-level readouts of risk.
Keywords: cardiovascular disease; heart failure; hematology; ischemic stroke; machine learning.