Machine learning methods for propensity and disease risk score estimation in high-dimensional data: a plasmode simulation and real-world data cohort analysis

Yuchen Guo; Victoria Y Strauss; Martí Català; Annika M Jödicke; Sara Khalid; Daniel Prieto-Alhambra

doi:10.3389/fphar.2024.1395707

Front Pharmacol. 2024 Oct 28:15:1395707. doi: 10.3389/fphar.2024.1395707. eCollection 2024.

Authors

Yuchen Guo¹, Victoria Y Strauss², Martí Català¹, Annika M Jödicke¹, Sara Khalid¹, Daniel Prieto-Alhambra¹,³

Affiliations

¹ Pharmaco- and Device Epidemiology Group, Centre of Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, United Kingdom.
² Boehringer-Ingelheim, Ingelheim, Germany.
³ Department of Medical Informatics, Erasmus Medical Center, Rotterdam, Netherlands.

Abstract

Introduction: Machine learning (ML) methods are promising and scalable alternatives for propensity score (PS) estimation, but their comparative performance in disease risk score (DRS) estimation remains unexplored.

Methods: We studied the performance of ML methods in PS and DRS estimation using two approaches. First, we conducted a cohort study in UK primary care records comparing antihypertensive users with non-users across 69 negative control outcomes. Second, we conducted a plasmode simulation with synthetic treatment and outcome that mimicked the empirical data distributions. We compared four PS and DRS estimation methods: (1) reference: logistic regression including clinically chosen confounders; (2) logistic regression with L1 regularisation (LASSO); (3) multi-layer perceptron (MLP); and (4) Extreme Gradient Boosting (XGBoost). We assessed covariate balance, coverage of the null effect for negative control outcomes (real-world data), and bias, defined as the absolute difference between observed and true effects (plasmode simulation). In total, 632,201 antihypertensive users and non-users were included.
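The three evaluation metrics named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that covariate balance is summarised by the absolute standardised mean difference, that coverage is the share of negative control confidence intervals containing the null (a ratio of 1), and that bias is taken on the log scale. All function names and numbers are hypothetical toy values.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Absolute standardised mean difference of one covariate (assumed balance metric)."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0
    return abs(mean(treated) - mean(control)) / pooled_sd


def null_coverage(conf_intervals, null_value=1.0):
    """Fraction of negative control confidence intervals that contain the null value."""
    covered = sum(1 for lower, upper in conf_intervals if lower <= null_value <= upper)
    return covered / len(conf_intervals)


def absolute_bias(estimated_effect, true_effect):
    """Absolute difference between estimated and true effects on the log scale (assumption)."""
    return abs(math.log(estimated_effect) - math.log(true_effect))


# Toy covariate values for users and non-users (hypothetical numbers).
users = [1.0, 2.0, 3.0, 4.0, 5.0]
non_users = [2.0, 3.0, 4.0, 5.0, 6.0]
smd = standardized_mean_difference(users, non_users)

# Toy 95% confidence intervals for four negative control outcomes (hypothetical).
intervals = [(0.8, 1.2), (1.05, 1.4), (0.6, 0.95), (0.9, 1.1)]
coverage = null_coverage(intervals)

# Toy plasmode scenario: estimated ratio 1.5 against a simulated true ratio of 1.2.
bias = absolute_bias(1.5, 1.2)

print(f"SMD={smd:.3f}, coverage={coverage:.2f}, bias={bias:.3f}")
```

In this framing, a smaller standardised mean difference after adjustment indicates better balance, coverage closer to the nominal 95% suggests less residual confounding, and bias closer to zero means the estimate sits nearer the simulated truth.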

Results: ML methods outperformed the reference method for PS estimation in some scenarios, in terms of both covariate balance and coverage/bias. Of the ML methods, XGBoost performed best. DRS-based methods performed worse than PS-based methods in all tested scenarios.

Discussion: We found that ML methods could be reliable alternatives for PS estimation. ML-based DRS methods performed worse than PS-based ones, likely because of the rarity of outcomes.

Keywords: disease risk scores; machine learning; negative control; observational research; propensity scores; treatment effect.

Grants and funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. Daniel Prieto-Alhambra received funding from the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). Daniel Prieto-Alhambra is funded through a NIHR Senior Research Fellowship (Grant number SRF-2018-11-ST2-004).