Systematic feature selection improves accuracy of methylation-based forensic age estimation in Han Chinese males

Forensic Sci Int Genet. 2018 Jul:35:38-45. doi: 10.1016/j.fsigen.2018.03.009. Epub 2018 Mar 23.

Abstract

Estimating an individual's age from biomarkers can provide key information for forensic investigations. Recent work has shown that DNA methylation at age-associated CpG sites is the most informative biomarker for estimating the age of an unknown donor. Optimal feature selection plays a critical role in determining the performance of the final prediction model. In this study we investigated how well methylation levels at 153 age-associated CpG sites from 21 previously reported genomic regions, measured with the EpiTYPER system, predict age in 390 Han Chinese males aged 15 to 75 years. We conducted a systematic feature selection using stepwise backward multiple linear regression as well as an exhaustive search algorithm. Both approaches identified the same subset of 9 CpG sites, which in linear combination provided the optimal model fit, with a mean absolute deviation (MAD) of 2.89 years and an explained variance (R²) of 0.92. The final model was validated in two independent samples of Han Chinese males (validation set 1: N = 65, MAD = 2.49, R² = 0.95; validation set 2: N = 62, MAD = 3.36, R² = 0.89). Competing models such as support vector machines and artificial neural networks did not noticeably outperform the linear model. Validation set 1 was additionally analyzed by pyrosequencing for cross-platform validation and termed validation set 3. However, directly applying our model, which was built on methylation levels measured with the EpiTYPER system, to the pyrosequencing data gave less accurate results in terms of MAD (validation set 3: N = 65 Han Chinese males, MAD = 4.20, R² = 0.93), suggesting a batch effect between data generation platforms. This batch effect could be partially overcome by a z-score transformation (MAD = 2.76, R² = 0.93).
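
The z-score transformation mentioned above can be illustrated with a minimal sketch. The abstract does not say how the transformation was applied, so this sketch assumes that each CpG site is standardized separately within each platform's dataset. The function name `zscore_columns` and all methylation values are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column (one CpG site) to mean 0 and SD 1 within one dataset.

    Assumption: per-site, per-dataset standardization. The abstract does not
    specify the exact procedure used by the authors.
    """
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [
        [(v - mu) / sd if sd > 0 else 0.0 for v, mu, sd in zip(row, mus, sds)]
        for row in rows
    ]


# Hypothetical methylation fractions: rows are samples, columns are CpG sites.
# The second platform is simulated as a constant per-site offset of the first.
epityper = [[0.20, 0.60], [0.30, 0.70], [0.40, 0.80]]
pyroseq = [[0.25, 0.50], [0.35, 0.60], [0.45, 0.70]]

z_epi = zscore_columns(epityper)
z_pyro = zscore_columns(pyroseq)
```

In this idealized case the two platforms differ only by a per-site shift, so the standardized values coincide. Real platform differences are rarely this simple, which is consistent with the batch effect being only partially overcome.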
Overall, our systematic feature selection identified 9 CpG sites as the optimal subset for forensic age estimation, and the prediction model consisting of these 9 markers demonstrated high potential for forensic practice. An age estimator that implements our prediction model and tolerates missing markers is freely available at http://liufan.big.ac.cn/AgePrediction.
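
As an illustration of the exhaustive search idea, the following sketch fits an ordinary least squares model on every marker subset of a given size and keeps the subset with the lowest MAD. It is not the authors' implementation: the abstract does not describe their algorithm, and a naive enumeration like this one is only practical for small marker panels. All function names and the synthetic data are assumptions made for the example.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= f * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(X, y):
    """Least squares fit with intercept via the normal equations."""
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(k)]
    return solve(ata, aty)


def predict(beta, X):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in X]


def mad(y, yhat):
    """Mean absolute deviation between observed and predicted ages."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)


def best_subset(X, y, size):
    """Try every marker subset of the given size; return (lowest MAD, marker indices)."""
    best = None
    for idx in combinations(range(len(X[0])), size):
        sub = [[row[i] for i in idx] for row in X]
        score = mad(y, predict(fit_ols(sub, y), sub))
        if best is None or score < best[0]:
            best = (score, idx)
    return best


# Synthetic data: age depends only on markers 0 and 2; marker 1 is uninformative.
X = [[0.1, 0.5, 0.3], [0.2, 0.1, 0.1], [0.3, 0.4, 0.4],
     [0.4, 0.2, 0.1], [0.5, 0.6, 0.5], [0.6, 0.3, 0.2]]
y = [20 + 50 * row[0] + 30 * row[2] for row in X]

score, chosen = best_subset(X, y, 2)
```

Here the search selects markers 0 and 2, the pair that generated the synthetic ages. The score is computed on the training data for brevity; a real analysis would evaluate candidate subsets on held-out samples, as the study does with its independent validation sets.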

Keywords: DNA methylation; Feature selection; Forensic DNA phenotyping; Forensic age estimation; Prediction modeling.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adolescent
  • Adult
  • Aged
  • Algorithms
  • China
  • CpG Islands / genetics*
  • DNA Methylation*
  • Ethnicity / genetics*
  • Forensic Genetics / methods*
  • Humans
  • Linear Models
  • Male
  • Middle Aged
  • Neural Networks, Computer
  • Sequence Analysis, DNA
  • Support Vector Machine
  • Young Adult