A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function

V G Krishnan; D R Westhead

doi:10.1093/bioinformatics/btg297

Bioinformatics. 2003 Nov 22;19(17):2199-209. doi: 10.1093/bioinformatics/btg297.

Authors

V G Krishnan¹, D R Westhead

Affiliation

¹ School of Biochemistry and Molecular Biology, University of Leeds, Leeds LS2 9JT, UK.

PMID: 14630648

Abstract

Motivation: The large volume of single nucleotide polymorphism data now available motivates the development of methods for distinguishing neutral changes from those which have real biological effects. Here, two different machine-learning methods, decision trees and support vector machines (SVMs), are applied for the first time to this problem. In common with most other methods, only non-synonymous changes in protein coding regions of the genome are considered.

Results: In detailed cross-validation analysis, both learning methods are shown to compete well with existing methods, and to out-perform them in some key tests. SVMs show better generalization performance, but decision trees have the advantage of generating interpretable rules with robust estimates of prediction confidence. It is shown that the inclusion of protein structure information produces more accurate methods, in agreement with other recent studies, and the effect of using predicted rather than actual structure is evaluated.

Availability: Software is available on request from the authors.

Publication types

Comparative Study
Evaluation Study
Research Support, Non-U.S. Gov't
Validation Study

MeSH terms

Algorithms*
Animals
Artificial Intelligence*
Caenorhabditis elegans Proteins / chemistry
Caenorhabditis elegans Proteins / genetics
Cluster Analysis
Gene Expression Profiling / methods*
Pattern Recognition, Automated
Polymorphism, Single Nucleotide / genetics*
Proteins / chemistry*
Proteins / genetics*
Reproducibility of Results
Sensitivity and Specificity
Sequence Alignment / methods*
Sequence Analysis, Protein / methods*
Species Specificity
Structure-Activity Relationship

Substances

Caenorhabditis elegans Proteins
Proteins