Propensity score and proximity matching using random forest

Peng Zhao; Xiaogang Su; Tingting Ge; Juanjuan Fan

doi:10.1016/j.cct.2015.12.012

Contemp Clin Trials. 2016 Mar:47:85-92. doi: 10.1016/j.cct.2015.12.012. Epub 2015 Dec 17.

Authors

Peng Zhao¹, Xiaogang Su², Tingting Ge³, Juanjuan Fan⁴

Affiliations

¹ Computational Science Research Center, San Diego State University, San Diego, CA, USA.
² Department of Mathematical Sciences, University of Texas, El Paso, TX, USA.
³ Janssen Research and Development, San Diego, CA, USA.
⁴ Department of Mathematics and Statistics, San Diego State University, San Diego, CA, USA.

Abstract

To draw unbiased inference from observational data, matching methods are often applied to produce treatment and control groups that are balanced on all background variables. The propensity score has been a key component in this research area. However, propensity score based matching methods in the literature have several limitations, including model misspecification and difficulties in handling categorical variables with more than two levels, missing data, and nonlinear relationships. Random forest, which averages outcomes from many decision trees, is nonparametric in nature, straightforward to use, and capable of addressing these issues. More importantly, the precision afforded by random forest (Caruana et al., 2008) may provide a more accurate and less model-dependent estimate of the propensity score. In addition, the proximity matrix, a by-product of the random forest, may naturally serve as a distance measure between observations for use in matching. The proposed random forest based matching methods are applied to data from the National Health and Nutrition Examination Survey (NHANES). Our results show that the proposed methods can produce well-balanced treatment and control groups. We also illustrate that the methods can effectively deal with missing data in covariates.
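The proximity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the usual definition of random forest proximity (the fraction of trees in which two observations fall in the same terminal node) and a simple greedy 1:1 matching rule without replacement, which is an assumption made here for illustration. The leaf assignments are hypothetical toy values; in practice they might come from a fitted forest, for example the leaf indices returned by scikit-learn's `RandomForestClassifier.apply`. The function names `proximity_matrix` and `greedy_match` are likewise hypothetical.

```python
from itertools import product


def proximity_matrix(leaves):
    """Return the n x n proximity matrix.

    leaves[i][t] is the terminal-node (leaf) index of observation i in tree t.
    Proximity of i and j = share of trees in which they land in the same leaf.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i, j in product(range(n), repeat=2):
        shared = sum(a == b for a, b in zip(leaves[i], leaves[j]))
        prox[i][j] = shared / n_trees
    return prox


def greedy_match(prox, treated, controls):
    """Greedy 1:1 matching without replacement (an assumed scheme).

    Each treated unit, in the order given, takes the still-unmatched control
    with the highest proximity; ties go to the lowest control index.
    """
    available = set(controls)
    pairs = {}
    for t in treated:
        if not available:
            break
        best = max(sorted(available), key=lambda c: prox[t][c])
        pairs[t] = best
        available.remove(best)
    return pairs


# Hypothetical leaf assignments: 5 observations, 4 trees.
leaves = [
    [1, 2, 1, 3],  # observation 0 (treated)
    [2, 2, 3, 1],  # observation 1 (treated)
    [1, 2, 1, 1],  # observation 2 (control)
    [2, 1, 3, 1],  # observation 3 (control)
    [3, 3, 2, 2],  # observation 4 (control)
]

prox = proximity_matrix(leaves)
pairs = greedy_match(prox, treated=[0, 1], controls=[2, 3, 4])
print(pairs)  # {0: 2, 1: 3}
```

Here observation 0 shares a leaf with control 2 in three of four trees (proximity 0.75), so the two are paired; observation 1 is paired with control 3 for the same reason. Greedy matching depends on the order in which treated units are processed, so an optimal matching algorithm could give different pairs on real data.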

Keywords: Matching; Observational study; Propensity score; Proximity; Random forest.

Publication types

Observational Study
Research Support, N.I.H., Extramural

MeSH terms

Adult
Aged
Body Mass Index
Case-Control Studies
Data Interpretation, Statistical
Databases, Factual
Female
Humans
Male
Middle Aged
Nutrition Surveys
Obesity / epidemiology*
Propensity Score*
Smoking / epidemiology*
Statistics as Topic
United States / epidemiology
