DNA methylation is a widely investigated epigenetic mark that plays a vital role in tumorigenesis. Advances in high-throughput assays, such as the Infinium 450K platform, provide genome-scale DNA methylation landscapes at single-CpG resolution, and the identification of differentially methylated loci (DML) has become an informative way to deepen our understanding of cancers. However, the extreme imbalance between the number of samples and the number of loci (approximately 1:1,000) makes it difficult to detect differential methylation between diseased and normal samples. In this article, a hybrid approach based on ensemble feature selection for identifying differentially methylated loci (HyDML) is proposed, which incorporates instance perturbation and multiple function models. Experiments on data from The Cancer Genome Atlas showed that HyDML not only identified DML effectively, but also outperformed a single feature selection approach in terms of both classification performance and the robustness of feature selection. Further analysis of the DML indicated that different types of cancer share common patterns, and that the stable DML shared across cancer types have great potential as biomarkers, which may strengthen the confidence of domain experts in carrying out biological validation.
Keywords: DNA methylation; differentially methylated loci; ensemble feature selection; pan-cancers; robustness.