Targeted learning with daily EHR data

Oleg Sofrygin; Zheng Zhu; Julie A Schmittdiel; Alyce S Adams; Richard W Grant; Mark J van der Laan; Romain Neugebauer

doi:10.1002/sim.8164

Stat Med. 2019 Jul 20;38(16):3073-3090. doi: 10.1002/sim.8164. Epub 2019 Apr 25.

Authors

Oleg Sofrygin¹ ², Zheng Zhu¹, Julie A Schmittdiel¹, Alyce S Adams¹, Richard W Grant¹, Mark J van der Laan², Romain Neugebauer¹

Affiliations

¹ Division of Research, Kaiser Permanente, Northern California, Oakland, California.
² Division of Biostatistics, University of California, Berkeley, California.

PMID: 31025411
DOI: 10.1002/sim.8164

Abstract

Electronic health records (EHR) data provide a cost- and time-effective opportunity to conduct cohort studies of the effects of multiple time-point interventions in the diverse patient population found in real-world clinical settings. Because the computational cost of analyzing EHR data at daily (or more granular) scale can be quite high, a pragmatic approach has been to partition the follow-up into coarser intervals of pre-specified length (eg, quarterly or monthly intervals). The feasibility and practical impact of analyzing EHR data at a granular scale have not been previously evaluated. We begin to fill this gap by leveraging large-scale EHR data from a diabetes study to develop a scalable targeted learning approach that allows analyses with small intervals. We then study the practical effects of selecting different coarsening intervals on inferences by reanalyzing data from the same large-scale pool of patients. Specifically, we map daily EHR data into four analytic datasets using 90-, 30-, 15-, and 5-day intervals. We apply a semiparametric and doubly robust estimation approach, the longitudinal Targeted Minimum Loss-Based Estimation (TMLE), to estimate the causal effects of four dynamic treatment rules with each dataset, and compare the resulting inferences. To overcome the computational challenges presented by the size of these data, we propose a novel TMLE implementation, the "long-format TMLE," and rely on the latest advances in scalable data-adaptive machine-learning software, xgboost and h2o, for estimation of the TMLE nuisance parameters.
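The coarsening step described in the abstract, mapping daily records into analytic datasets with 90-, 30-, 15-, and 5-day intervals, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record layout, the function name `coarsen`, and the aggregation rule (an interval counts as treated, or as containing an event, if any day within it does) are assumptions made only for illustration, and the abstract does not state how the study summarizes days within an interval.

```python
def coarsen(daily_records, interval_days):
    """Collapse daily rows into one row per (patient, interval).

    daily_records: iterable of (patient_id, day, treated, event) tuples,
    where day counts from 0 at the start of follow-up and treated/event
    are 0/1 indicators. This layout is hypothetical.

    Assumed aggregation rule (not taken from the paper): an interval is
    flagged as treated, or as containing an event, if any of its days is.
    """
    rows = {}
    for patient_id, day, treated, event in daily_records:
        key = (patient_id, day // interval_days)  # index of the interval
        row = rows.setdefault(key, {"treated": 0, "event": 0})
        row["treated"] = max(row["treated"], treated)
        row["event"] = max(row["event"], event)
    return rows


# Toy data: one patient followed for 90 days, treated on days 10-19,
# with an outcome event recorded on day 47.
daily = [
    ("p1", day, int(10 <= day <= 19), int(day == 47))
    for day in range(90)
]

# Build the four analytic datasets named in the abstract.
datasets = {width: coarsen(daily, width) for width in (90, 30, 15, 5)}

for width, rows in datasets.items():
    print(width, "day intervals:", len(rows), "rows")
```

In this toy example the 90-day dataset places treatment and event in the same single interval, while the 30-day dataset separates them into consecutive intervals. Loss of temporal ordering of this kind is one way the choice of interval length could plausibly affect inferences.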

Keywords: EHR; Targeted Minimum Loss-Based Estimation; big data; causal inference; dynamic treatment regimes; machine learning.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Causality
Computer Simulation
Diabetes Mellitus
Electronic Health Records*
Humans
Longitudinal Studies*
Machine Learning
Reproducibility of Results

Grants and funding

R01 AI074345-07/National Institute of Health/International