Integrating database homology in a probabilistic gene structure model

Pac Symp Biocomput. 1997:232-44.

Abstract

We present an improved stochastic model of genes in DNA, and describe a method for integrating database homology into the probabilistic framework. A generalized hidden Markov model (GHMM) describes the grammar of a legal parse of a DNA sequence. Probabilities are estimated for gene features by using dynamic programming to combine information from multiple sensors. We show how matches to homologous sequences from a database can be integrated into the probability estimation by interpreting the likelihood of a sequence in terms of the bit cost of encoding that sequence given a homology match. We also demonstrate how homology matches in protein databases can be exploited to help identify splice sites. Our experiments show significant improvements in the sensitivity and specificity of gene structure identification when these new features are added to our gene-finding system, Genie. In tests on a standard set of annotated genes, Genie correctly identified 95% of coding nucleotides with a specificity of 91%, and identified 77% of exons exactly.
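
The accuracy figures quoted above can be made concrete with a minimal sketch of how such measures are often computed. This is not Genie's evaluation code. It assumes the convention common in gene-finding evaluations, where sensitivity is the fraction of truly coding nucleotides that are predicted as coding, and specificity is the fraction of predicted coding nucleotides that are truly coding. The function names, the 0/1 label encoding, and the (start, end) exon representation are hypothetical choices made for illustration.

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Return (sensitivity, specificity) from per-nucleotide coding labels.

    Assumed convention (not taken from the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    Labels are truthy for coding positions and falsy for non-coding ones.
    """
    if len(true_coding) != len(predicted_coding):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)        # coding, predicted coding
    fn = sum(1 for t, p in pairs if t and not p)    # coding, missed
    fp = sum(1 for t, p in pairs if p and not t)    # non-coding, predicted coding

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


def exact_exon_fraction(true_exons, predicted_exons):
    """Fraction of annotated exons whose two boundaries are both predicted exactly.

    Exons are hypothetical (start, end) coordinate tuples.
    """
    true_set = set(true_exons)
    if not true_set:
        return 0.0
    return len(true_set & set(predicted_exons)) / len(true_set)


if __name__ == "__main__":
    # Toy labels: 1 = coding, 0 = non-coding (illustrative data only).
    truth = [0, 1, 1, 1, 1, 0, 0, 0]
    guess = [0, 0, 1, 1, 1, 1, 0, 0]
    print(nucleotide_accuracy(truth, guess))

    # One exon matches exactly; the other has a shifted start.
    print(exact_exon_fraction([(10, 50), (80, 120)], [(10, 50), (85, 120)]))
```

Under this convention, a prediction that overlaps an annotated exon but misses either boundary still contributes to the nucleotide-level scores while counting as a miss at the exon level, which is why the exon-level figure is typically lower.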

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Base Sequence*
  • Computer Simulation*
  • DNA / chemistry*
  • DNA / genetics
  • Databases as Topic*
  • Exons
  • Genes*
  • Likelihood Functions
  • Markov Chains
  • Models, Genetic*
  • Probability
  • Programming Languages*
  • Sequence Homology, Nucleic Acid
  • Stochastic Processes

Substances

  • DNA