Partition dataset according to amino acid type improves the prediction of deleterious non-synonymous SNPs

Biochem Biophys Res Commun. 2012 Mar 2;419(1):99-103. doi: 10.1016/j.bbrc.2012.01.138. Epub 2012 Feb 4.

Abstract

Many non-synonymous SNPs (nsSNPs) are associated with diseases, and numerous machine learning methods have been applied to train classifiers that distinguish disease-associated nsSNPs from neutral ones. The continuously accumulating nsSNP data allow us to explore better prediction approaches. In this work, we partitioned the training data into 20 subsets according to either the original or the substituted amino acid type at the nsSNP site. Using support vector machines (SVMs), training a classification model on each subset resulted in an overall accuracy of 76.3% or 74.9%, depending on which of the two partition criteria was used, whereas training on the whole dataset achieved an accuracy of only 72.6%. For comparison, the dataset was also randomly divided into 20 subsets, and the corresponding accuracy was only 73.2%. Our results demonstrate that properly partitioning the whole training dataset into subsets, i.e., according to the residue type at the nsSNP site, significantly improves the performance of the trained classifiers, which should be valuable in developing better tools for predicting the disease association of nsSNPs.
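The partition-then-train scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields (`original`, `substituted`, `label`), the helper names, and the toy records are all hypothetical, and a trivial majority-label classifier stands in for the SVM because the abstract does not state which features or kernel were used. The sketch only shows how the data are routed by residue type and how one overall accuracy is pooled across the per-residue models.

```python
from collections import Counter, defaultdict


def partition_by_residue(records, key):
    """Group records by the residue at the nsSNP site.

    key is "original" or "substituted" (hypothetical field names);
    with real data this yields up to 20 subsets, one per amino acid.
    """
    subsets = defaultdict(list)
    for rec in records:
        subsets[rec[key]].append(rec)
    return dict(subsets)


class MajorityClassifier:
    """Placeholder for an SVM: predicts the most common training label."""

    def fit(self, records):
        counts = Counter(rec["label"] for rec in records)
        self.label = counts.most_common(1)[0][0]
        return self

    def predict(self, record):
        return self.label


def train_per_subset(train, key):
    """Train one model per residue-type subset."""
    subsets = partition_by_residue(train, key)
    return {res: MajorityClassifier().fit(recs) for res, recs in subsets.items()}


def overall_accuracy(models, test, key, fallback):
    """Route each test record to its residue's model and pool the results.

    fallback is used for residues that never appeared in training.
    """
    correct = sum(
        models.get(rec[key], fallback).predict(rec) == rec["label"]
        for rec in test
    )
    return correct / len(test)


# Toy, made-up records purely for illustration (not data from the paper).
train = [
    {"original": "C", "substituted": "Y", "label": "disease"},
    {"original": "C", "substituted": "R", "label": "disease"},
    {"original": "C", "substituted": "S", "label": "neutral"},
    {"original": "A", "substituted": "T", "label": "neutral"},
    {"original": "A", "substituted": "V", "label": "neutral"},
    {"original": "A", "substituted": "S", "label": "neutral"},
    {"original": "A", "substituted": "D", "label": "disease"},
]
test = [
    {"original": "C", "substituted": "F", "label": "disease"},
    {"original": "C", "substituted": "G", "label": "disease"},
    {"original": "A", "substituted": "G", "label": "neutral"},
    {"original": "A", "substituted": "P", "label": "disease"},
]

whole_model = MajorityClassifier().fit(train)  # baseline: one model for all data
per_residue = train_per_subset(train, "original")

whole_acc = overall_accuracy({}, test, "original", whole_model)
partitioned_acc = overall_accuracy(per_residue, test, "original", whole_model)
print(f"whole dataset: {whole_acc:.2f}, partitioned by original residue: {partitioned_acc:.2f}")
```

In a realistic setup, `MajorityClassifier` would be replaced by an SVM (for example scikit-learn's `sklearn.svm.SVC`) trained on per-site features, and passing `"substituted"` instead of `"original"` as the key would give the second partition criterion.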

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Artificial Intelligence*
  • Databases, Nucleic Acid
  • Disease / genetics*
  • Genetic Predisposition to Disease
  • Humans
  • Polymorphism, Single Nucleotide*
  • Sequence Analysis, DNA / methods*