Real World Data Versus Probability Surveys for Estimating Health Conditions at the State Level

J Surv Stat Methodol. 2024 Nov;12(5):1515-1530. doi: 10.1093/jssam/smae036.

Abstract

Government statistical offices worldwide are under pressure to produce statistics rapidly and for more detailed geographies, to compete with unofficial estimates available from web-based big data sources or from private companies. Commonly suggested sources of improved health information are electronic health records (EHRs) and medical claims data. These data sources are collectively known as real world data (RWD) because they are generated from routine health care processes, and they are available for millions of patients. RWD can clearly provide estimates that are more timely and less expensive to produce, but a key question is whether those estimates are accurate. To test this, we took advantage of a unique health data source that includes a full range of sociodemographic variables and compared estimates weighted using all of those variables with estimates derived when only age and sex are available for weighting (as is common with most RWD sources). We show that not accounting for other variables can produce misleading and quite inaccurate health estimates.

Keywords: All of Us; bias; data defect correlation; diabetes; electronic health records (EHRs); nonprobability surveys.
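
The comparison described in the abstract can be illustrated with a minimal post-stratification sketch. This is not the authors' code or data: the cell shares, sample counts, and case counts below are invented toy numbers, "education" is a hypothetical stand-in for the additional sociodemographic variables, a two-level age group stands in for the age-by-sex cells, and the function name `poststratify` is an assumption made for illustration. The sample deliberately over-represents the high-education cells, so weighting on age alone cannot correct the imbalance.

```python
from collections import defaultdict

# Hypothetical population shares for (age group, education) cells; they sum to 1.
POP_SHARE = {
    ("younger", "low"): 0.3, ("younger", "high"): 0.2,
    ("older", "low"): 0.3, ("older", "high"): 0.2,
}

# Hypothetical nonprobability sample: cell -> (respondents, respondents with the condition).
# High-education cells are over-represented relative to POP_SHARE.
SAMPLE = {
    ("younger", "low"): (100, 10), ("younger", "high"): (400, 16),
    ("older", "low"): (100, 30), ("older", "high"): (400, 64),
}

def poststratify(sample, pop_share, stratum_of):
    """Post-stratified prevalence: population-share-weighted mean of stratum prevalences.

    stratum_of maps a full cell to the coarser stratum actually used for weighting.
    Assumes every stratum with a population share has at least one respondent.
    """
    n = defaultdict(int)
    cases = defaultdict(int)
    share = defaultdict(float)
    for cell, (size, positives) in sample.items():
        n[stratum_of(cell)] += size
        cases[stratum_of(cell)] += positives
    for cell, p in pop_share.items():
        share[stratum_of(cell)] += p
    return sum(share[s] * cases[s] / n[s] for s in share)

# Unweighted sample prevalence.
unweighted = sum(c for _, c in SAMPLE.values()) / sum(n for n, _ in SAMPLE.values())
# Weighting on age group only (the variable assumed available in a typical RWD source).
age_only = poststratify(SAMPLE, POP_SHARE, lambda cell: cell[0])
# Weighting on the full age-by-education cells.
full = poststratify(SAMPLE, POP_SHARE, lambda cell: cell)

print(f"unweighted={unweighted:.3f} age_only={age_only:.3f} full={full:.3f}")
```

In this toy setup the sample is already balanced on age, so the age-only estimate equals the unweighted one, and both fall below the estimate that also weights on education. The direction and size of the gap are properties of the invented numbers, not results from the paper.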