Noncoding RNA gene detection using comparative sequence analysis

BMC Bioinformatics. 2001;2:8. doi: 10.1186/1471-2105-2-8. Epub 2001 Oct 10.

Abstract

Background: Noncoding RNA genes produce transcripts that exert their function without ever producing proteins. Noncoding RNA gene sequences do not have strong statistical signals, unlike protein coding genes. A reliable general purpose computational genefinder for noncoding RNA genes has been elusive.

Results: We describe a comparative sequence analysis algorithm for detecting novel structural RNA genes. The key idea is to test the pattern of substitutions observed in a pairwise alignment of two homologous sequences. A conserved coding region tends to show a pattern of synonymous substitutions, whereas a conserved structural RNA tends to show a pattern of compensatory mutations consistent with some base-paired secondary structure. We formalize this intuition using three probabilistic "pair-grammars": a pair stochastic context free grammar modeling alignments constrained by structural RNA evolution, a pair hidden Markov model modeling alignments constrained by coding sequence evolution, and a pair hidden Markov model modeling a null hypothesis of position-independent evolution. Given an input pairwise sequence alignment (e.g. from a BLASTN comparison of two related genomes) we classify the alignment into the coding, RNA, or null class according to the posterior probability of each class.
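The final classification step can be illustrated with a minimal sketch. It assumes that each of the three models has already produced a log-likelihood for the alignment; the scoring itself (the pair-SCFG and pair-HMM algorithms) is not shown. The labels "RNA", "COD", and "OTH", the function names, the uniform default prior, and the example scores are all hypothetical choices made for illustration and are not taken from QRNA.

```python
import math


def posterior_probabilities(log_likelihoods, priors=None):
    """Return P(model | alignment) for each model using Bayes' rule.

    log_likelihoods: dict mapping a model label to log P(alignment | model).
    priors: dict mapping a model label to P(model); uniform if omitted
            (an assumption of this sketch, not a claim about QRNA).
    """
    models = list(log_likelihoods)
    if priors is None:
        priors = {m: 1.0 / len(models) for m in models}

    # Log of the joint probability P(alignment, model) for each model.
    log_joint = {m: log_likelihoods[m] + math.log(priors[m]) for m in models}

    # Subtract the maximum before exponentiating to avoid underflow,
    # since alignment log-likelihoods are typically large and negative.
    peak = max(log_joint.values())
    unnormalized = {m: math.exp(v - peak) for m, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {m: u / total for m, u in unnormalized.items()}


def classify(log_likelihoods, priors=None):
    """Return the label with the highest posterior, plus all posteriors."""
    post = posterior_probabilities(log_likelihoods, priors)
    best = max(post, key=post.get)
    return best, post


# Made-up log-likelihoods for one alignment under the three models.
example_scores = {"RNA": -410.2, "COD": -418.9, "OTH": -415.0}
label, post = classify(example_scores)
print(label, {m: round(p, 4) for m, p in post.items()})
```

Working in log space and normalizing after subtracting the maximum keeps the calculation stable even when the raw likelihoods are far too small to represent directly as floating-point numbers.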

Conclusions: We have implemented this approach as a program, QRNA, which we consider to be a prototype structural noncoding RNA genefinder. Tests suggest that this approach detects noncoding RNA genes with a fair degree of reliability.

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Algorithms
  • Animals
  • Base Sequence
  • Bayes Theorem
  • Caenorhabditis / genetics
  • Caenorhabditis elegans / genetics
  • Computational Biology / methods
  • Computational Biology / statistics & numerical data
  • Computer Simulation
  • Escherichia coli / genetics
  • Genome
  • Genome, Bacterial
  • Models, Genetic
  • Molecular Sequence Data
  • RNA, Bacterial / genetics
  • RNA, Helminth / genetics
  • RNA, Untranslated / genetics*
  • Salmonella typhi / genetics
  • Sensitivity and Specificity
  • Sequence Analysis, RNA / methods*
  • Sequence Analysis, RNA / statistics & numerical data

Substances

  • RNA, Bacterial
  • RNA, Helminth
  • RNA, Untranslated