Subsampling based variable selection for generalized linear models

Comput Stat Data Anal. 2023 Aug:184:107740. doi: 10.1016/j.csda.2023.107740. Epub 2023 Mar 11.

Abstract

A novel variable selection method for low-dimensional generalized linear models is introduced. The new approach called AIC OPTimization via STABility Selection (OPT-STABS) repeatedly subsamples the data, minimizes Akaike's Information Criterion (AIC) over a sequence of nested models for each subsample, and includes in the final model those predictors selected in the minimum AIC model in a large fraction of the subsamples. New methods are also introduced to establish an optimal variable selection cutoff over repeated subsamples. An extensive simulation study examining a variety of proposec variable selection methods shows that, although no single method uniformly outperforms the others in all the scenarios considered, OPT-STABS is consistently among the best-performing methods in most settings while it performs competitively for the rest. This is in contrast to other candidate methods which either have poor performance across the board or exhibit good performance in some settings, but very poor in others. In addition, the asymptotic properties of the OPT-STABS estimator are derived, and its root-n consistency and asymptotic normality are proved. The methods are applied to two datasets involving logistic and Poisson regressions.

Keywords: AIC; Screening threshold; Stability selection; Subsampling; Variable selection.