A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function

Bioinformatics. 2003 Nov 22;19(17):2199-209. doi: 10.1093/bioinformatics/btg297.

Abstract

Motivation: The large volume of single nucleotide polymorphism data now available motivates the development of methods for distinguishing neutral changes from those that have real biological effects. Here, two machine-learning methods, decision trees and support vector machines (SVMs), are applied to this problem for the first time. In common with most other methods, only non-synonymous changes in protein-coding regions of the genome are considered.

Results: In detailed cross-validation analysis, both learning methods are shown to compete well with existing methods, and to outperform them in some key tests. SVMs show better generalization performance, but decision trees have the advantage of generating interpretable rules with robust estimates of prediction confidence. Including protein structure information is shown to produce more accurate predictors, in agreement with other recent studies, and the effect of using predicted rather than actual structure is evaluated.
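
The general idea of a cross-validated, rule-based classifier with per-rule confidence can be illustrated with a minimal sketch. This is not the authors' software or feature set: the data are synthetic, the single feature (a "conservation score") and every function name are hypothetical, and a one-level decision tree (a stump) stands in for a full decision tree. The SVM comparison is omitted to keep the example dependency-free.

```python
import random


def make_toy_data(n=200, seed=0):
    """Synthetic (score, label) pairs. NOT real SNP data.

    score: hypothetical conservation score in [0, 1] (an assumption)
    label: 1 = change affects function, 0 = neutral change
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Assumption: function-altering changes have higher scores on average.
        centre = 0.7 if label == 1 else 0.3
        score = min(1.0, max(0.0, rng.gauss(centre, 0.15)))
        data.append((score, label))
    return data


def fit_stump(train):
    """Fit the rule 'predict 1 if score > threshold' with fewest training errors."""
    best = None
    for threshold in sorted({s for s, _ in train}):
        left = [y for s, y in train if s <= threshold]
        right = [y for s, y in train if s > threshold]
        if not left or not right:
            continue
        # Left leaf predicts 0, right leaf predicts 1.
        errors = sum(left) + (len(right) - sum(right))
        if best is None or errors < best[0]:
            # Leaf confidence: fraction of training examples in the leaf
            # that carry the class the leaf predicts.
            conf_neutral = 1 - sum(left) / len(left)
            conf_effect = sum(right) / len(right)
            best = (errors, threshold, conf_neutral, conf_effect)
    _, threshold, conf_neutral, conf_effect = best
    return {"threshold": threshold,
            "conf_neutral": conf_neutral,
            "conf_effect": conf_effect}


def predict(stump, score):
    """Return (predicted label, confidence of the leaf that fired)."""
    if score > stump["threshold"]:
        return 1, stump["conf_effect"]
    return 0, stump["conf_neutral"]


def cross_validate(data, k=5, seed=0):
    """k-fold cross-validation; returns the held-out accuracy of each fold."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        stump = fit_stump(train)
        correct = sum(predict(stump, s)[0] == y for s, y in test)
        accuracies.append(correct / len(test))
    return accuracies


if __name__ == "__main__":
    data = make_toy_data()
    stump = fit_stump(data)
    print("rule: effect if score > %.3f" % stump["threshold"])
    print("fold accuracies:", cross_validate(data))
```

The fitted stump reads directly as a rule, and each prediction carries the purity of the leaf that produced it, which is one simple way a tree-based method can attach a confidence estimate to its output. Each fold's accuracy is measured on examples the rule was not fitted to, which is the sense in which cross-validation estimates generalization.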

Availability: Software is available on request from the authors.

Publication types

  • Comparative Study
  • Evaluation Study
  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Algorithms*
  • Animals
  • Artificial Intelligence*
  • Caenorhabditis elegans Proteins / chemistry
  • Caenorhabditis elegans Proteins / genetics
  • Cluster Analysis
  • Gene Expression Profiling / methods*
  • Pattern Recognition, Automated
  • Polymorphism, Single Nucleotide / genetics*
  • Proteins / chemistry*
  • Proteins / genetics*
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Sequence Alignment / methods*
  • Sequence Analysis, Protein / methods*
  • Species Specificity
  • Structure-Activity Relationship

Substances

  • Caenorhabditis elegans Proteins
  • Proteins