Double Sampling for Informatively Missing Data in Electronic Health Record-Based Comparative Effectiveness Research

Alexander W Levis; Rajarshi Mukherjee; Rui Wang; Heidi Fischer; Sebastien Haneuse

doi:10.1002/sim.10298

Stat Med. 2024 Dec 30;43(30):6086-6098. doi: 10.1002/sim.10298. Epub 2024 Dec 5.

Authors

Alexander W Levis¹, Rajarshi Mukherjee², Rui Wang² ³, Heidi Fischer⁴, Sebastien Haneuse²

Affiliations

¹ Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.
² Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts.
³ Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts.
⁴ Department of Research and Evaluation, Kaiser Permanente, Pasadena, California, USA.

Abstract

Missing data arise in most applied settings and are ubiquitous in electronic health records (EHR). When data are missing not at random (MNAR) with respect to measured covariates, sensitivity analyses are often considered. These solutions, however, are often unsatisfying in that they are not guaranteed to yield actionable conclusions. Motivated by an EHR-based study of long-term outcomes following bariatric surgery, we consider the use of double sampling as a means to mitigate MNAR outcome data when the statistical goals are estimation and inference regarding causal effects. We describe assumptions that are sufficient for the identification of the joint distribution of confounders, treatment, and outcome under this design. Additionally, we derive efficient and robust estimators of the average causal treatment effect under a nonparametric model and under a model assuming outcomes were, in fact, initially missing at random (MAR). We compare these in simulations to an approach that adaptively estimates based on evidence of violation of the MAR assumption. Finally, we also show that the proposed double sampling design can be extended to handle arbitrary coarsening mechanisms, and derive nonparametric efficient estimators of any smooth full data functional.
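To make the double sampling idea concrete, the following is a minimal toy sketch, not the estimators derived in the paper. It ignores treatment and confounders entirely and only estimates a single outcome mean. It assumes (hypothetically) that a random subset of subjects with initially missing outcomes is followed up with a known probability `followup_prob`, and that follow-up always recovers the outcome. The function name `simulate` and all parameter values are illustrative.

```python
import math
import random


def simulate(n=20000, followup_prob=0.3, seed=0):
    """Toy MNAR example.

    Returns (full-data mean, complete-case mean, double-sampling mean).
    """
    rng = random.Random(seed)
    total_full = 0.0      # sum of all outcomes (not available in practice)
    total_observed = 0.0  # sum over initially observed outcomes
    n_observed = 0
    total_followed = 0.0  # sum over outcomes recovered by follow-up

    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        total_full += y
        # MNAR: the chance of initially observing y depends on y itself.
        p_observed = 1.0 / (1.0 + math.exp(-y))
        if rng.random() < p_observed:
            total_observed += y
            n_observed += 1
        elif rng.random() < followup_prob:
            # Double sampling: a random subset of the initially missing
            # outcomes is recovered through intensive follow-up.
            total_followed += y

    full_mean = total_full / n
    complete_case_mean = total_observed / n_observed
    # Each followed-up outcome is weighted by 1 / followup_prob so that it
    # stands in for all initially missing outcomes.
    double_sampling_mean = (total_observed + total_followed / followup_prob) / n
    return full_mean, complete_case_mean, double_sampling_mean


if __name__ == "__main__":
    full, cc, ds = simulate()
    print(f"full-data mean:       {full:.3f}")
    print(f"complete-case mean:   {cc:.3f}")
    print(f"double-sampling mean: {ds:.3f}")
```

In this toy setup larger outcomes are more likely to be observed initially, so the complete-case mean is biased upward, while the weighted double-sampling mean targets the full-data mean. The paper's setting is more general: it concerns causal effects with confounders and treatment, and its estimators are derived under explicitly stated identification assumptions.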

Keywords: causal inference; double sampling; missing data; semiparametric theory; study design.

MeSH terms

Bariatric Surgery / statistics & numerical data
Comparative Effectiveness Research*
Computer Simulation
Data Interpretation, Statistical
Electronic Health Records* / statistics & numerical data
Humans
Models, Statistical*
Statistics, Nonparametric
