A tree-based approach for motif discovery and sequence classification

Rui Yan; Paul C Boutros; Igor Jurisica

doi:10.1093/bioinformatics/btr353

Bioinformatics. 2011 Aug 1;27(15):2054-61. doi: 10.1093/bioinformatics/btr353. Epub 2011 Jun 17.

Authors

Rui Yan¹, Paul C Boutros, Igor Jurisica

Affiliation

¹ Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4.

PMID: 21685048
DOI: 10.1093/bioinformatics/btr353

Abstract

Motivation: Pattern discovery algorithms are widely used for the analysis of DNA and protein sequences. Most algorithms have been designed to find overrepresented motifs in sparse datasets of long sequences, and ignore most positional information. We introduce an algorithm optimized to exploit spatial information in sparse-but-populous datasets.

Results: Our algorithm, Tree-based Weighted-Position Pattern Discovery and Classification (T-WPPDC), supports both unsupervised pattern discovery and supervised sequence classification. It identifies positionally enriched patterns using the Kullback-Leibler distance between foreground and background sequences at each position. This spatial information is used to discover positionally important patterns. T-WPPDC then uses a scoring function to discriminate different biological classes. We validated T-WPPDC on an important biological problem: prediction of single nucleotide polymorphisms (SNPs) from flanking sequence. We evaluated 672 separate experiments on 120 datasets derived from multiple species. T-WPPDC outperformed other pattern discovery methods and was comparable to the supervised machine learning algorithms. The algorithm is computationally efficient and largely insensitive to dataset size. It allows arbitrary parameterization and is embarrassingly parallelizable.
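
Illustration: the per-position Kullback-Leibler comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the published T-WPPDC implementation: it assumes equal-length, aligned DNA sequences over the alphabet A, C, G, T, and it uses additive pseudocounts to avoid zero frequencies. The function names (`position_frequencies`, `positional_kl`) and the `pseudocount` parameter are hypothetical and do not come from the paper; the tree construction and the scoring function are not shown because the abstract does not specify them.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: DNA sequences; other symbols (e.g. N) are ignored


def position_frequencies(seqs, pseudocount=1.0):
    """Return one {base: frequency} dict per position for aligned sequences."""
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        # Normalise over alphabet symbols only, so each position sums to 1.
        total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        freqs.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return freqs


def positional_kl(foreground, background, pseudocount=1.0):
    """Kullback-Leibler divergence (in bits) of foreground from background,
    computed independently at each sequence position."""
    fg = position_frequencies(foreground, pseudocount)
    bg = position_frequencies(background, pseudocount)
    if len(fg) != len(bg):
        raise ValueError("foreground and background lengths differ")
    return [
        sum(p[b] * math.log2(p[b] / q[b]) for b in ALPHABET)
        for p, q in zip(fg, bg)
    ]


if __name__ == "__main__":
    # Toy data (hypothetical): positions 0-2 are conserved in the foreground,
    # position 3 is uniform in both sets.
    foreground = ["ACGT", "ACGA", "ACGC", "ACGG"]
    background = ["ACGT", "CATG", "GTAC", "TGCA"]
    for pos, value in enumerate(positional_kl(foreground, background)):
        print(pos, round(value, 4))
```

Positions with a large divergence are those where the foreground base composition departs most from the background, which is the kind of positional signal the abstract describes. How those positions are then combined into patterns and scores is specific to T-WPPDC and is not reproduced here.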

Conclusions: T-WPPDC is a minimally parameterized algorithm for both pattern discovery and sequence classification that directly incorporates positional information. We use it to confirm the predictability of SNPs from flanking sequence, and show that positional information is a key to this biological problem.

Availability: The algorithm, code and data are available at: http://www.cs.utoronto.ca/~juris/data/TWPPDC

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Animals
Artificial Intelligence
Base Sequence
Computational Biology / methods*
DNA / genetics
Humans
Mice
Polymorphism, Single Nucleotide
Sequence Analysis, DNA / methods*

Substances

DNA

Grants and funding

MOP-57903/Canadian Institutes of Health Research/Canada