A large collection of single nucleotide polymorphisms (SNPs) has been identified in the human genome, yet the epigenetic influence of SNPs on their neighboring CpG sites remains elusive. A growing body of evidence suggests that locus-specific information, including genomic features and local epigenetic state, may play an important role in the epigenetic readout of SNPs. In this study, we used mouse methylomes with known SNPs to develop statistical models for predicting SNP-associated allele-specific DNA methylation (ASM). ASM has been classified into parent-of-origin-dependent ASM (P-ASM) and sequence-dependent ASM (S-ASM), the latter comprising scattered S-ASM (sS-ASM) and clustered S-ASM (cS-ASM). We found that P-ASM and cS-ASM CpG sites are both enriched in CpG-rich regions, promoters and exons, whereas sS-ASM CpG sites are enriched in simple repeats and in regions with a high frequency of SNPs. Using Lasso-grouped Logistic Regression (LGLR), we selected 21 of 282 genomic and methylation-related features that effectively distinguish cS-ASM CpG sites, and trained classifiers on them with machine learning techniques. Based on 5-fold cross-validation, the logistic regression classifier performed best for cS-ASM prediction, with an accuracy (ACC) of 0.77, an area under the ROC curve (AUC) of 0.84 and a Matthews correlation coefficient (MCC) of 0.54. Lastly, we applied the logistic regression classifier to a human brain methylome and predicted 608 genes associated with cS-ASM. Gene ontology term enrichment analysis indicated that these cS-ASM-associated genes are significantly enriched in the category coding for transcripts with alternative splicing forms. In summary, this study provides an analytical procedure for cS-ASM prediction and sheds new light on the different types of ASM events.
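For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how ACC, MCC and AUC are conventionally computed for a binary classifier such as a cS-ASM versus non-cS-ASM predictor. It is not the study's code: the function names and the example counts in the usage note are hypothetical and purely illustrative, and AUC is computed here with the standard pairwise (rank-based) definition.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(scores, labels):
    """Area under the ROC curve via the pairwise definition:
    the probability that a random positive outscores a random negative,
    counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

As an illustration with made-up counts (not data from this study), `accuracy(40, 40, 10, 10)` and `mcc(40, 40, 10, 10)` summarize a confusion matrix with 40 true positives, 40 true negatives, 10 false positives and 10 false negatives. In a 5-fold cross-validation setting, such metrics would typically be computed on each held-out fold and then averaged.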
Keywords: Allele-specific DNA methylation; Epigenetic variation; Logistic regression classifier; SNP.
Published by Elsevier Inc.