We present an ensemble tree-based algorithm for variable selection in high-dimensional datasets, in settings where a time-to-event outcome is observed with error. The proposed methods are motivated by self-reported outcomes collected in large-scale epidemiologic studies, such as the Women's Health Initiative, and apply equally to imperfect outcomes that arise in other settings, such as data extracted from electronic medical records. To evaluate the performance of the proposed algorithm, we present results from simulation studies that consider both continuous and categorical covariates. We illustrate the approach by using it to discover single nucleotide polymorphisms associated with incident Type II diabetes in the Women's Health Initiative. A freely available R package, icRSF (R Core Team, 2018; Xu et al., 2018), has been developed to implement the proposed methods.
Keywords: High Dimensional Data; Interval Censoring; Random Survival Forests; Self-reports; Variable Selection.