Evaluating uses of data mining techniques in propensity score estimation: a simulation study

Soko Setoguchi; Sebastian Schneeweiss; M Alan Brookhart; Robert J Glynn; E Francis Cook

doi:10.1002/pds.1555

Pharmacoepidemiol Drug Saf. 2008 Jun;17(6):546-55. doi: 10.1002/pds.1555.

Authors

Soko Setoguchi¹, Sebastian Schneeweiss, M Alan Brookhart, Robert J Glynn, E Francis Cook

Affiliation

¹ Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02130, USA.

Abstract

Background: In propensity score modeling, it is standard practice to optimize the prediction of exposure status based on the covariate information. In a simulation study, we examined the situations in which analyses based on various types of exposure propensity score (EPS) models, built with data mining techniques such as recursive partitioning (RP) and neural networks (NN), produce unbiased and/or efficient results.

Method: We simulated data for a hypothetical cohort study (n = 2000) with a binary exposure/outcome and 10 binary/continuous covariates with seven scenarios differing by non-linear and/or non-additive associations between exposure and covariates. EPS models used logistic regression (LR) (all possible main effects), RP1 (without pruning), RP2 (with pruning), and NN. We calculated c-statistics (C), standard errors (SE), and bias of exposure-effect estimates from outcome models for the PS-matched dataset.
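The logistic-regression arm of the design described above can be illustrated with a minimal sketch. It is not the authors' simulation code: every coefficient, the random seed, the 0.05 caliper, the greedy 1:1 matching rule, and the crude odds ratio used as the outcome model are assumptions chosen only for illustration, and the RP and NN models are not covered. The function names (`simulate_cohort`, `fit_logistic`, `c_statistic`, `greedy_match`, `log_odds_ratio`) are hypothetical.

```python
import numpy as np

def simulate_cohort(n=2000, n_cov=10, seed=0):
    """Simulate covariates, a binary exposure, and a binary outcome.
    All coefficients are illustrative assumptions, not values from the study."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, n_cov))
    X[:, :5] = (X[:, :5] > 0).astype(float)  # first five covariates made binary
    # Exposure model with a quadratic term (non-linear) and an interaction (non-additive).
    lin_a = (-0.5 + 0.4 * X[:, 0] + 0.3 * X[:, 5]
             + 0.3 * X[:, 5] ** 2 + 0.5 * X[:, 1] * X[:, 6])
    a = rng.binomial(1, 1 / (1 + np.exp(-lin_a)))
    # Outcome model with an assumed conditional log odds ratio of -0.4 for exposure.
    lin_y = -1.5 - 0.4 * a + 0.5 * X[:, 0] + 0.4 * X[:, 5] + 0.3 * X[:, 6]
    y = rng.binomial(1, 1 / (1 + np.exp(-lin_y)))
    return X, a, y

def fit_logistic(X, a, n_iter=25):
    """Main-effects logistic regression by Newton-Raphson; returns fitted propensity scores."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-Z @ beta))
        w = p * (1 - p)
        beta += np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (a - p))
    return 1 / (1 + np.exp(-Z @ beta))

def c_statistic(score, label):
    """Area under the ROC curve via the rank-sum formula (assumes no tied scores)."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n1 = label.sum()
    n0 = len(label) - n1
    return (ranks[label == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def greedy_match(ps, a, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on the propensity score, without replacement."""
    controls = list(np.flatnonzero(a == 0))
    pairs = []
    for t in np.flatnonzero(a == 1):
        if not controls:
            break
        d = np.abs(ps[controls] - ps[t])
        j = int(np.argmin(d))
        if d[j] <= caliper:  # unmatched exposed subjects are dropped
            pairs.append((t, controls.pop(j)))
    return np.array(pairs)

def log_odds_ratio(a, y):
    """Crude log odds ratio and its Woolf standard error from the 2x2 table."""
    n11 = np.sum((a == 1) & (y == 1))
    n10 = np.sum((a == 1) & (y == 0))
    n01 = np.sum((a == 0) & (y == 1))
    n00 = np.sum((a == 0) & (y == 0))
    est = np.log((n11 * n00) / (n10 * n01))
    se = np.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    return est, se

X, a, y = simulate_cohort()
ps = fit_logistic(X, a)
c = c_statistic(ps, a)
pairs = greedy_match(ps, a)
idx = pairs.ravel()
est, se = log_odds_ratio(a[idx], y[idx])
print(f"C = {c:.3f}, matched pairs = {len(pairs)}, log OR = {est:.3f}, SE = {se:.3f}")
```

Repeating this over many seeds and comparing the mean estimate with the assumed true effect would give a rough bias figure for one scenario. Note that the crude odds ratio in the matched sample estimates a marginal effect, so it need not equal the conditional coefficient used to generate the data.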

Results: Data mining techniques yielded higher C than LR (mean: NN, 0.86; RP1, 0.79; RP2, 0.72; and LR, 0.76). SE tended to be greater in models with higher C. Overall bias was small for each strategy, although NN estimates tended to be the least biased. C was not correlated with the magnitude of bias (correlation coefficient [COR] = -0.3, p = 0.1) but was correlated with increased SE (COR = 0.7, p < 0.001).

Conclusions: Effect estimates from EPS models fitted by simple LR were generally robust. NN models generally provided the least numerically biased estimates. C was not associated with the magnitude of bias but was associated with increased SE.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Bias*
Cohort Studies
Computer Simulation
Confounding Factors, Epidemiologic
Data Interpretation, Statistical*
Humans
Logistic Models
Monte Carlo Method
Neural Networks, Computer
Pharmacoepidemiology / methods*
