Motivation: A key aspect of elucidating gene regulation in bacterial genomes is identifying the basic units of transcription. We present a method, based on probabilistic language models, that we apply to predict operons, promoters and terminators in the genome of Escherichia coli K-12. Our approach has two key properties: (i) it provides a coherent set of predictions for related regulatory elements of various types and (ii) it takes advantage of both DNA sequence and gene expression data, including expression measurements from intergenic probes.
Results: Our experiments show that the method predicts operons and localizes promoters and terminators with high accuracy. Moreover, models that use both sequence and expression data are more accurate than those that use only one of the two data sources.