Fast, sensitive discovery of conserved genome-wide motifs

J Comput Biol. 2012 Feb;19(2):139-47. doi: 10.1089/cmb.2011.0249.

Abstract

Regulatory sites that control gene expression are essential to the proper functioning of cells, and identifying them is critical for modeling regulatory networks. We have developed Magma (Multiple Aligner of Genomic Multiple Alignments), a software tool for multiple-species, multiple-gene motif discovery. Magma identifies putative regulatory sites that are conserved across multiple species and occur near multiple genes throughout a reference genome. Magma takes as input multiple alignments that can include gaps. It uses efficient clustering methods that make it about 70 times faster than PhyloNet, a previous program for this task, with slightly greater sensitivity. We ran Magma on all non-coding DNA conserved between Caenorhabditis elegans and five additional species, about 70 Mbp in total, in under 4 h. We obtained 2,309 motifs with lengths of 6-20 bp, each occurring at least 10 times throughout the genome, which collectively covered about 566 kbp of the genomes, approximately 0.8% of the input. Predicted sites occurred in all types of non-coding sequence but were especially enriched in the promoter regions. Comparisons to several experimental datasets show that Magma motifs correspond to a variety of known regulatory motifs.
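
To make the task concrete, the following is a minimal toy sketch of "conserved across species and recurring throughout the genome". It is not Magma's algorithm: the abstract describes clustering methods that this sketch does not attempt, and it uses exact matching where real motifs are degenerate. The function names `conserved_kmers` and `recurrent_motifs`, the fixed window length `k`, and the representation of each alignment as a list of equal-length strings (reference species first, `-` for gaps) are all assumptions made for illustration.

```python
from collections import Counter


def conserved_kmers(alignment, k):
    """Yield each length-k window of the reference row (row 0) whose
    columns are gap-free and identical in every aligned species.

    Assumes all rows of `alignment` have the same length.
    """
    ref = alignment[0]
    # A column counts as conserved when every row agrees and it is not a gap.
    conserved = [
        len(set(col)) == 1 and col[0] != "-" for col in zip(*alignment)
    ]
    for i in range(len(ref) - k + 1):
        if all(conserved[i:i + k]):
            yield ref[i:i + k]


def recurrent_motifs(alignments, k, min_count):
    """Count conserved k-mers over all alignments and keep those that
    occur at least `min_count` times (hypothetical threshold)."""
    counts = Counter()
    for alignment in alignments:
        counts.update(conserved_kmers(alignment, k))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Three toy two-species alignments; the first has a mismatch, the third a gap.
toy_alignments = [
    ["ACGTGACA", "ACGTGATA"],
    ["TTGTGACC", "TTGTGACC"],
    ["GTGA-", "GTGAC"],
]
motifs = recurrent_motifs(toy_alignments, k=4, min_count=2)
```

In this toy input only `GTGA` is fully conserved in all three alignments, so it is the only k-mer that passes the recurrence threshold. A realistic tool would also need to tolerate mismatches, variable motif lengths, and gaps inside sites, which is where clustering of alignment windows becomes necessary.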

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Animals
  • Base Sequence
  • Binding Sites
  • Caenorhabditis elegans / genetics
  • Caenorhabditis elegans Proteins / genetics
  • Cluster Analysis
  • Computer Simulation
  • Conserved Sequence
  • DNA, Intergenic / genetics
  • Genome, Helminth*
  • Likelihood Functions
  • Models, Genetic*
  • Promoter Regions, Genetic
  • Sequence Alignment
  • Software*
  • Transcription Factors / genetics

Substances

  • Caenorhabditis elegans Proteins
  • DNA, Intergenic
  • Transcription Factors