With big data comes big responsibility: Strategies for utilizing aggregated, standardized, de-identified electronic health record data for research

Clin Transl Sci. 2025 Jan;18(1):e70093. doi: 10.1111/cts.70093.

Abstract

Electronic health records (EHRs), though maintained and used for clinical and billing purposes, may provide a wealth of information for research. Currently available sources offer insight into the health histories of well over a quarter of a billion people. Their use, however, is fraught with hazards, including the introduction or reinforcement of biases, unclear disease definitions, risks to patient privacy, ambiguous definitions of covariates or confounders, discrepancies between prescriptions and actual medication usage, the need to introduce other data sources such as vaccination or death records and the ensuing potential for inaccuracy, duplicative records, and difficulty understanding and interpreting the outcomes of data queries. On the other hand, the ability to study rare disorders and to link apparently disparate events is extremely valuable. Strategies for avoiding the worst pitfalls and hewing to conservative interpretations are essential. This article summarizes many of the approaches that have been used to avoid the most common pitfalls and extract the maximum information from aggregated, standardized, and de-identified EHR data. It describes 26 topics in three major areas: (1) 14 topics on design issues for observational studies using EHR data, (2) 7 topics on issues in analyzing EHR data, and (3) 5 topics on reporting studies that use EHR data.

Publication types

  • Review

MeSH terms

  • Big Data*
  • Biomedical Research / standards
  • Biomedical Research / statistics & numerical data
  • Data Anonymization
  • Electronic Health Records* / statistics & numerical data
  • Humans