Estimating biological age from DNA methylation and identifying the relevant biomarkers are active research problems that have predominantly been tackled with black-box penalized regression. Machine learning is used to select a small subset of features from hundreds of thousands of CpG probes and to improve the generalizability that ordinary least squares regression typically lacks. Here, we show that such feature selection lacks biological interpretability and relevance in first- and next-generation clocks, and we clarify the logic by which these clocks systematically exclude biomarkers of aging and age-related disease. Moreover, contrary to the assumption that regularized linear regression is needed to prevent overfitting, we demonstrate that hypothesis-driven selection of biologically relevant features combined with ordinary least squares regression yields accurate, well-calibrated, generalizable clocks with high interpretability. We further demonstrate that the interplay between inflammaging-related shifts in predictor values and the corresponding weights, which we term feature shifts, contributes to the lack of resolution between health and inflammaging in conventional linear models. Lastly, we introduce a method of feature rectification, which aligns these shifts to better separate the age predictions for healthy people from those for patients with various chronic inflammatory diseases.
Keywords: Aging; DNA methylation; DNA methylation clock; Elastic net regression; Feature selection; Forward stepwise selection; L1 penalty; PBMC clock.
© 2025. The Author(s).