Background: This research investigated statistical methods for estimating and analyzing phenotypes derived from electronic health records (EHR) data, accounting for heterogeneity in available data elements across patients. We first addressed the lack of statistical methods for deriving phenotypes from EHR data when data elements are inconsistently available across patients and this missingness in data elements may be related to the underlying health status of the patient. We then developed statistical methods to reduce bias in analyses of observational data using EHR-derived phenotypes.
Objectives: The objectives of this research were to develop, evaluate, and apply (1) statistical methods for estimating EHR-derived phenotypes (the collection of characteristics describing a patient) when data elements are inconsistently collected, and (2) statistical methods for valid estimation of outcome/exposure associations using such EHR-derived phenotypes.
Methods: We developed a Bayesian latent class approach to EHR-based phenotyping that incorporates the availability of data elements as an additional predictor of the latent phenotype, addressing the dependence of data element availability on the underlying true phenotype. In a series of simulation studies, we compared the performance of this latent phenotype with that of standard computable phenotypes, with and without multiple imputation to address missing data. Probabilistic phenotypes, such as the Bayesian latent phenotype, can be incorporated into subsequent outcome/exposure association studies. However, because these phenotypes are imperfect, standard association estimates will be biased. This is a limitation of any analysis using imperfect, EHR-derived phenotypes. Because phenotyping error can be differential or non-differential, bias can be towards or away from the null. We developed closed-form bias-correction formulas for odds ratios, relative risks, and risk differences estimated using probabilistic phenotypes. Simulation studies compared bias-corrected estimates with naive estimates based on dichotomizing the probabilistic phenotype, the standard approach used in EHR-based analyses. Finally, using data from six sites participating in the PCORI-funded PEDSnet consortium, we investigated the association between development of type 2 diabetes in adolescence and characteristics of the early-life body mass index (BMI) trajectory. This work used the Bayesian latent phenotyping model to characterize type 2 diabetes phenotypes and connected these phenotypes to parameters of the BMI trajectory via a Bayesian joint model. Chart review was not performed for PEDSnet data, so no “gold-standard” phenotypes were available.
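To illustrate the idea of treating data-element availability as evidence about the latent phenotype, the following is a minimal sketch, not the model fitted in this research. It assumes a two-class model with conditionally independent data elements and fixed, hypothetical parameter values. The `PARAMS` dictionary, the function names, and the choice of one diagnosis code plus one biomarker are all illustrative assumptions; a full Bayesian analysis would place priors on these parameters and estimate them from the data.

```python
import math

# Hypothetical, fixed parameters for each latent class (0 = no phenotype,
# 1 = phenotype). These values are assumptions chosen for illustration only.
PARAMS = {
    0: {"prevalence": 0.95, "p_code": 0.02, "p_measured": 0.2,
        "bio_mean": 5.3, "bio_sd": 0.4},
    1: {"prevalence": 0.05, "p_code": 0.70, "p_measured": 0.8,
        "bio_mean": 7.5, "bio_sd": 1.2},
}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_phenotype_prob(has_code, biomarker, params=PARAMS):
    """P(latent phenotype = 1 | observed data elements) for one patient.

    `biomarker` is None when the biomarker was never measured. Whether it
    was measured contributes to the likelihood, so availability itself is
    informative about the latent class.
    """
    weights = []
    for latent_class in (0, 1):
        p = params[latent_class]
        w = p["prevalence"]
        # Contribution of the diagnosis code (present or absent).
        w *= p["p_code"] if has_code else 1.0 - p["p_code"]
        if biomarker is None:
            # Only the availability indicator contributes.
            w *= 1.0 - p["p_measured"]
        else:
            # Availability indicator plus the observed value.
            w *= p["p_measured"] * normal_pdf(biomarker, p["bio_mean"], p["bio_sd"])
        weights.append(w)
    return weights[1] / (weights[0] + weights[1])


p_no_evidence = posterior_phenotype_prob(False, None)
p_code_only = posterior_phenotype_prob(True, None)
p_code_and_high_value = posterior_phenotype_prob(True, 8.0)
```

Under these assumed values, measurement is more likely among patients with the phenotype, so an unmeasured biomarker pulls the posterior probability down relative to a model that ignores availability, while a high measured value pushes it up.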
Results: In simulation studies, the latent class approach had similar sensitivity to a standard computable phenotype (95.9% vs 91.9%) while substantially improving specificity (99.7% vs 90.8%). Multiple imputation of missing biomarker values modestly improved sensitivity relative to a standard computable phenotype (92.3% vs 91.9%) but at the cost of decreased specificity (90.7% vs 90.8%). Results were similar under missing-not-at-random patterns, in which missingness in biomarkers was associated with the underlying true phenotype. When the proposed bias-correction factor was incorporated into outcome/exposure association studies, estimates of all three association parameters (odds ratios, relative risks, and risk differences) were approximately unbiased. In contrast, the naive approach of dichotomizing the probabilistic phenotype was strongly biased towards the null. In an analysis of the association between characteristics of the early-life BMI trajectory and type 2 diabetes in adolescence, we found that children who experienced adiposity rebound between 5 and 9 years had higher odds of subsequent type 2 diabetes than children with age at rebound between 2 and 5 years (odds ratio [OR] for age at adiposity rebound of 5-9 years 1.2, 95% credible interval 1.12-2.35). Children with a predicted BMI at age 9 years in excess of 140% of the 95th percentile for age and sex had the highest odds of adolescent type 2 diabetes (OR 6.22 relative to BMI between 100 and 120% of the 95th percentile, 95% credible interval 4.35-8.17).
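The attenuation seen with a dichotomized phenotype can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes non-differential misclassification, uses the sensitivity and specificity reported above for the standard computable phenotype, and takes hypothetical true phenotype prevalences in two comparison groups. The correction shown is the classical Rogan-Gladen back-calculation, swapped in for illustration; it is not the closed-form bias-correction formula developed in this research.

```python
def observed_prevalence(true_prev, sensitivity, specificity):
    """Expected prevalence of the error-prone phenotype: true positives
    plus false positives."""
    return sensitivity * true_prev + (1.0 - specificity) * (1.0 - true_prev)


def rogan_gladen(observed_prev, sensitivity, specificity):
    """Classical back-correction of an observed prevalence, assuming the
    sensitivity and specificity are known."""
    return (observed_prev - (1.0 - specificity)) / (sensitivity + specificity - 1.0)


def odds_ratio(p1, p0):
    """Odds ratio comparing phenotype prevalence p1 against p0."""
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


# Sensitivity and specificity of the standard computable phenotype, taken
# from the simulation results reported in this abstract.
SENS, SPEC = 0.919, 0.908

# Hypothetical true prevalences in two comparison groups (assumptions).
TRUE_P1, TRUE_P0 = 0.10, 0.05

true_or = odds_ratio(TRUE_P1, TRUE_P0)

# What an analyst would expect to observe with the misclassified phenotype.
obs_p1 = observed_prevalence(TRUE_P1, SENS, SPEC)
obs_p0 = observed_prevalence(TRUE_P0, SENS, SPEC)
naive_or = odds_ratio(obs_p1, obs_p0)

# Back-correct each group's prevalence, then recompute the odds ratio.
corrected_or = odds_ratio(
    rogan_gladen(obs_p1, SENS, SPEC),
    rogan_gladen(obs_p0, SENS, SPEC),
)

print(f"true OR={true_or:.2f}  naive OR={naive_or:.2f}  corrected OR={corrected_or:.2f}")
```

With the same error rates in both groups, the naive odds ratio falls between 1 and the true value, which is the bias towards the null described above. The back-correction recovers the true value exactly here only because the sketch works with expected proportions and known error rates; with sampled data and estimated error rates, some residual error would remain.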
Conclusions: The proposed latent phenotyping approach provides a means of combining data elements with variable availability across patients, and can perform well even when missingness is associated with the underlying health status of the patient. Advanced methodologic approaches have the potential to leverage EHR data to investigate relationships among variables routinely collected in healthcare data. This may be of particular relevance to pediatric research because many phenotypes are rare in children, necessitating use of large cohorts. Judicious use of EHR data in combination with appropriate statistical methods to address their shortcomings has the potential to facilitate efficient studies of rare diseases and increase equitable representation of all children in research.
Limitations: The proposed Bayesian latent phenotype is an unsupervised phenotyping approach that does not incorporate information about gold-standard phenotypes and is therefore dependent on parametric assumptions about the distributions and relationships among covariates. We also did not investigate time-varying phenotypes. Finally, the proposed bias-correction approaches are only approximate. Some residual bias may persist.
Copyright © 2021. University of Pennsylvania. All Rights Reserved.