Background: Electronic health records and many legacy systems contain rich longitudinal data that can be used for research; however, they typically are not readily available.
Materials and methods: At Kaiser Permanente Southern California (KPSC), a research data warehouse (RDW) has been developed and maintained since the late 1990s and widely extended in 2006, aggregating and standardizing data collected from internal and a few external sources. This article provides a high-level overview of the RDW and discusses challenges common to data warehouses or repositories for research use. To demonstrate the application of the data, we report the volume, patient characteristics, and age-adjusted prevalence of selected medical conditions and utilization rates of selected medical procedures.
Results: A total of 105 million person-years of health plan enrollment was recorded in the RDW between 1981 and 2018, with most healthcare utilization data available since early or middle 1990s. Among active enrollees on December 31, 2018, 15% were ≥65 years of age, 33.9% were non-Hispanic white, 43.3% Hispanic, 11.0% Asian, and 8.4% African American, and 34.4% of children (2-17 years old) and 72.1% of adults (≥18 years old) were overweight or obese. The age-adjusted prevalence of asthma, atrial fibrillation, diabetes mellitus, hypercholesteremia, and hypertension increased between 2001 and 2018. Hospitalization and Emergency Department (ED) visit rates appeared lower, and office visit rates seemed higher at KPSC compared to the reported US averages.
Discussion and conclusion: Although the RDW is unique to KPSC, its methodologies and experience may provide useful insights for researchers of other healthcare systems worldwide in the era of big data analysis.
Keywords: chronic disease prevalence; data quality; electronic health record; health care utilization; integration; research data warehouse; standardization.
© The Author(s) 2023. Published by Oxford University Press on behalf of the American Medical Informatics Association.