Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little's MCAR test)

Behav Res Methods. 2024 Dec;56(8):8608-8639. doi: 10.3758/s13428-024-02494-1. Epub 2024 Sep 9.

Abstract

The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are limited, which can lead to biased estimates. We propose random forest analysis and lasso regression as alternative methods for selecting auxiliary variables, particularly when the missing data mechanism is nonlinear or otherwise complex (i.e., interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods (t-tests, Little's MCAR test, logistic regressions), both in selecting auxiliary variables and in the performance of those auxiliary variables when incorporated in an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improving practical methods for handling missing data in statistical analyses.
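
The general idea of screening auxiliary variables by predicting a missingness indicator can be illustrated with a minimal sketch. This is not the authors' procedure or simulation design: the data-generating mechanism, the variable names (x1 to x6), the sample size, and all tuning values (number of trees, minimum leaf size, penalty strength C) are assumptions chosen only for illustration, and scikit-learn is used as one possible implementation. The simulated missingness depends on x1 through a squared term and on x2 linearly; the remaining candidates are noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 3000
names = ["x1", "x2", "x3", "x4", "x5", "x6"]  # hypothetical candidate auxiliaries
X = rng.standard_normal((n, len(names)))       # already on a common scale

# Assumed MAR mechanism (illustrative only): the log-odds of missingness
# depend on x1 through a squared term (nonlinear) and on x2 linearly.
logit = -1.5 + 1.2 * (X[:, 0] ** 2 - 1.0) + 1.5 * X[:, 1]
prob_missing = 1.0 / (1.0 + np.exp(-logit))
r = (rng.random(n) < prob_missing).astype(int)  # 1 = analysis variable missing

# Random forest: rank candidates by impurity-based importance for predicting r.
forest = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=20, random_state=0
)
forest.fit(X, r)
order = np.argsort(forest.feature_importances_)[::-1]
rf_ranked = [names[i] for i in order]

# Lasso (L1-penalized) logistic regression on the same indicator:
# candidates with nonzero coefficients are retained.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.01)
lasso.fit(X, r)
lasso_selected = [nm for nm, b in zip(names, lasso.coef_[0]) if b != 0.0]

print("Random forest ranking:", rf_ranked)
print("Lasso-selected candidates:", lasso_selected)
```

In this sketch the lasso model contains only linear terms, so whether it picks up the squared dependence on x1 depends on which terms the analyst supplies; the random forest needs no such specification. In practice the penalty strength would typically be chosen by cross-validation rather than fixed in advance.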

Keywords: Auxiliary variables; Missing at random; Missing data; Random forest.

MeSH terms

  • Algorithms
  • Computer Simulation
  • Data Interpretation, Statistical
  • Humans
  • Likelihood Functions
  • Models, Statistical
  • Monte Carlo Method*
  • Nonlinear Dynamics
  • Random Forest
  • Regression Analysis