A tree-based approach for motif discovery and sequence classification

Bioinformatics. 2011 Aug 1;27(15):2054-61. doi: 10.1093/bioinformatics/btr353. Epub 2011 Jun 17.

Abstract

Motivation: Pattern discovery algorithms are widely used for the analysis of DNA and protein sequences. Most algorithms have been designed to find overrepresented motifs in sparse datasets of long sequences, and ignore most positional information. We introduce an algorithm optimized to exploit spatial information in sparse-but-populous datasets.

Results: Our algorithm Tree-based Weighted-Position Pattern Discovery and Classification (T-WPPDC) supports both unsupervised pattern discovery and supervised sequence classification. It identifies positionally enriched patterns using the Kullback-Leibler distance between foreground and background sequences at each position. This spatial information is used to discover positionally important patterns. T-WPPDC then uses a scoring function to discriminate different biological classes. We validated T-WPPDC on an important biological problem: prediction of single nucleotide polymorphisms (SNPs) from flanking sequence. We evaluated 672 separate experiments on 120 datasets derived from multiple species. T-WPPDC outperformed other pattern discovery methods and was comparable to the supervised machine learning algorithms. The algorithm is computationally efficient and largely insensitive to dataset size. It allows arbitrary parameterization and is embarrassingly parallelizable.

Conclusions: T-WPPDC is a minimally parameterized algorithm for both pattern discovery and sequence classification that directly incorporates positional information. We use it to confirm the predictability of SNPs from flanking sequence, and show that positional information is a key to this biological problem.

Availability: The algorithm, code and data are available at: http://www.cs.utoronto.ca/~juris/data/TWPPDC

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Animals
  • Artificial Intelligence
  • Base Sequence
  • Computational Biology / methods*
  • DNA / genetics
  • Humans
  • Mice
  • Polymorphism, Single Nucleotide
  • Sequence Analysis, DNA / methods*

Substances

  • DNA