Variable Selection in the Presence of Missing Data: Imputation-based Methods

Yize Zhao; Qi Long

doi:10.1002/wics.1402

Wiley Interdiscip Rev Comput Stat. 2017 Sep-Oct;9(5):e1402. doi: 10.1002/wics.1402. Epub 2017 May 24.

Authors

Yize Zhao¹, Qi Long²

Affiliations

¹ Department of Healthcare Policy and Research, Weill Cornell Medical College, Cornell University.
² Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania.

Abstract

Variable selection plays an essential role in regression analysis: it identifies important variables that are associated with outcomes and is known to improve the predictive accuracy of the resulting models. Variable selection methods have been widely investigated for fully observed data. In the presence of missing data, however, variable selection methods need to be carefully designed to account for the missing data mechanism and for the statistical technique used to handle the missing data. Since imputation is arguably the most popular method for handling missing data due to its ease of use, statistical methods for variable selection that are combined with imputation are of particular interest. These methods, which are valid under the assumptions of missing at random (MAR) and missing completely at random (MCAR), largely fall into three general strategies. The first strategy applies existing variable selection methods to each imputed dataset and then combines the variable selection results across all imputed datasets. The second strategy applies existing variable selection methods to stacked imputed datasets. The third strategy combines resampling techniques such as the bootstrap with imputation. Despite recent advances, this area remains under-developed and offers fertile ground for further research.

Keywords: MAR; MCAR; MNAR; bootstrap; imputation; missing data; resampling; variable selection.
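
As a rough illustration of the first two strategies named in the abstract, the following Python sketch shows one way the bookkeeping might look. It is not the authors' method. The function names `combine_by_vote` and `stack_imputed`, the `min_fraction` voting threshold, the per-row weight of 1/M, and the variable names in the example are all assumptions made for illustration. The imputation step and the variable selection step are assumed to have been run already by whatever tools the analyst prefers.

```python
from collections import Counter


def combine_by_vote(selected_sets, min_fraction=0.5):
    """Strategy 1 (sketch): combine per-imputation selections by voting.

    `selected_sets` holds one collection of selected variable names per
    imputed dataset. A variable is kept if it was selected in at least
    `min_fraction` of the imputed datasets. The voting rule is an assumed
    combination rule, not one prescribed by the abstract.
    """
    if not selected_sets:
        raise ValueError("need at least one imputed dataset")
    m = len(selected_sets)
    # Count each variable at most once per imputed dataset.
    counts = Counter(v for s in selected_sets for v in set(s))
    return sorted(v for v, c in counts.items() if c / m >= min_fraction)


def stack_imputed(datasets):
    """Strategy 2 (sketch): stack M imputed datasets row-wise.

    Each row gets weight 1/M so that every original subject contributes a
    total weight of 1 to a weighted selection method. This weighting is an
    assumed, commonly used choice.
    """
    if not datasets:
        raise ValueError("need at least one imputed dataset")
    m = len(datasets)
    rows = [row for d in datasets for row in d]
    weights = [1.0 / m] * len(rows)
    return rows, weights


# Hypothetical selections obtained from M = 3 imputed datasets.
per_imputation = [{"age", "bmi"}, {"age", "sbp"}, {"age", "bmi", "sbp"}]
always_selected = combine_by_vote(per_imputation, min_fraction=1.0)
majority_selected = combine_by_vote(per_imputation, min_fraction=0.5)
```

A stricter `min_fraction` keeps only variables that are selected consistently across imputations, while a looser one behaves more like a union of the per-imputation results. The stacked rows and weights returned by `stack_imputed` would then be passed to a selection method that accepts observation weights.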
