Best holdout assessment is sufficient for cancer transcriptomic model selection

Jake Crawford; Maria Chikina; Casey S Greene

doi:10.1016/j.patter.2024.101115

Patterns (N Y). 2024 Dec 6;5(12):101115. doi: 10.1016/j.patter.2024.101115. eCollection 2024 Dec 13.

Authors

Jake Crawford¹, Maria Chikina², Casey S Greene³,⁴

Affiliations

¹ Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
² Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA.
³ Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA.
⁴ Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA.

Abstract

Guidelines in statistical modeling for genomics hold that simpler models have advantages over more complex ones. Potential advantages include cost, interpretability, and improved generalization across datasets or biological contexts. We directly tested the assumption that small gene signatures generalize better by examining the generalization of mutation status prediction models across datasets (from cell lines to human tumors and vice versa) and biological contexts (holding out entire cancer types from pan-cancer data). We compared two model selection strategies: one based solely on cross-validation performance, and one that combines cross-validation performance with regularization strength. We did not observe that more regularized signatures generalized better. This result held across both generalization problems and for both linear models (LASSO logistic regression) and non-linear ones (neural networks). When the goal of an analysis is to produce generalizable predictive models, we recommend choosing the ones that perform best on held-out data or in cross-validation instead of those that are smaller or more regularized.

Keywords: classifier; gene signature; machine learning; Occam's razor; transcriptomics.
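
The two selection strategies contrasted in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `select_best` and `select_smallest_good`, the per-fold scores, and the one-standard-error threshold are hypothetical and are not taken from the paper, which does not specify in the abstract how cross-validation performance and regularization strength are combined. The dictionary keys stand for a penalty strength where a larger value means a more regularized (smaller) signature.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical per-fold cross-validation scores, keyed by penalty strength.
# Assumption: a larger key means stronger regularization (fewer genes kept).
cv_scores = {
    0.001: [0.80, 0.82, 0.81, 0.79],
    0.01: [0.84, 0.86, 0.85, 0.83],
    0.1: [0.83, 0.86, 0.84, 0.83],
    1.0: [0.70, 0.72, 0.71, 0.69],
}


def select_best(scores):
    """Pick the penalty strength with the highest mean CV score."""
    return max(scores, key=lambda penalty: mean(scores[penalty]))


def select_smallest_good(scores):
    """Pick the most regularized candidate that is still 'good enough'.

    'Good enough' is assumed here to mean a mean CV score within one
    standard error of the best candidate's mean (a one-SE style rule).
    """
    best_folds = scores[select_best(scores)]
    threshold = mean(best_folds) - stdev(best_folds) / sqrt(len(best_folds))
    good = [p for p in scores if mean(scores[p]) >= threshold]
    return max(good)  # largest penalty among the acceptable candidates


print("best CV performance:", select_best(cv_scores))
print("smallest good model:", select_smallest_good(cv_scores))
```

With these made-up scores the two rules disagree: the first returns the penalty with the top mean score, while the second trades a small amount of cross-validation performance for a stronger penalty. The abstract's recommendation corresponds to the first rule when the goal is a generalizable predictive model.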