Selecting auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, which can lead to biased estimates. We propose random forest analysis and lasso regression as alternative methods for selecting auxiliary variables, particularly when the missing data pattern is nonlinear or otherwise complex (i.e., involves interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods (t-tests, Little's MCAR test, logistic regressions), both in selecting auxiliary variables and in the performance of the selected auxiliary variables when they are incorporated in an analysis with missing data. Both techniques outperformed the traditional methods, providing a promising direction for improving practical methods for handling missing data in statistical analyses.
Keywords: Auxiliary variables; Missing at random; Missing data; Random forest.
© 2024. The Psychonomic Society, Inc.