Regulatory element detection using a probabilistic segmentation model

Proc Int Conf Intell Syst Mol Biol. 2000:8:67-74.

Abstract

The availability of genome-wide mRNA expression data for organisms whose genomes are fully sequenced provides a unique data set from which to decipher how transcription is regulated by the upstream control region of a gene. A new algorithm is presented that decomposes DNA sequence into the most probable "dictionary" of motifs or words. Identification of words is based on a probabilistic segmentation model in which the significance of longer words is deduced from the frequency of shorter words of various lengths. This eliminates the need for a separate set of reference data to define probabilities, and genome-wide applications are therefore possible. For the 6,000 upstream regulatory regions in the yeast genome, the 500 strongest motifs from a dictionary of size 1,200 match at a significance level of 15 standard deviations to a database of cis-regulatory elements. Analysis of sets of genes such as those up-regulated during sporulation reveals many new putative regulatory sites in addition to identifying previously known sites.
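The segmentation idea can be illustrated with a minimal sketch. It assumes, as one plausible reading of the abstract, that a sequence is modeled as a concatenation of words drawn independently from the dictionary, so that the likelihood of the sequence is the sum over all ways of splitting it into dictionary words. The function name `segmentation_likelihood`, the toy dictionary, and its probabilities are hypothetical and are not taken from the paper.

```python
def segmentation_likelihood(seq, word_probs):
    """Sum the probabilities of all ways to split seq into dictionary words.

    Assumption (not from the paper): words are drawn independently, so one
    segmentation has probability equal to the product of its word probabilities.
    """
    max_len = max(len(w) for w in word_probs)
    # z[i] holds the total probability of all segmentations of seq[:i].
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for i in range(1, len(seq) + 1):
        # Try every dictionary word that could end at position i.
        for length in range(1, min(max_len, i) + 1):
            p = word_probs.get(seq[i - length:i])
            if p is not None:
                z[i] += p * z[i - length]
    return z[len(seq)]


# Hypothetical toy dictionary: four single letters plus one longer word.
toy_dictionary = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "AT": 0.2}
likelihood = segmentation_likelihood("ATA", toy_dictionary)
```

In this toy case "ATA" has two segmentations, A|T|A and AT|A, and the recursion adds their probabilities. A longer word earns a place in the dictionary when including it raises the likelihood above what the shorter words alone would give, which is how significance can be judged without a separate reference data set.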

MeSH terms

  • Algorithms*
  • Animals
  • Genome*
  • Humans
  • RNA, Messenger / genetics*
  • Sequence Analysis*

Substances

  • RNA, Messenger