GeneTack: frameshift identification in protein-coding sequences by the Viterbi algorithm

J Bioinform Comput Biol. 2010 Jun;8(3):535-51. doi: 10.1142/s0219720010004847.

Abstract

We describe a new program for ab initio frameshift detection in protein-coding nucleotide sequences. The task is to distinguish same-strand overlapping ORFs that arise from a frameshifted gene from same-strand overlapping ORFs that encompass true overlapping or adjacent genes. The GeneTack program uses a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path, which discriminates between true adjacent genes and adjacent protein-coding regions that only appear to be separate entities because of frameshifts. The program can therefore identify spurious predictions made by a conventional gene-finding program misled by a frameshift. We tested GeneTack, as well as two earlier programs, FrameD and FSFind, on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. The average frameshift prediction accuracy of GeneTack, measured as (Sn + Sp)/2, was higher by a significant margin than that of the two other programs. In addition, the average accuracy of GeneTack compares favorably with that of the FSFind-BLAST program, which uses protein database search to verify predicted frameshifts, even though GeneTack does not use external evidence. GeneTack is freely available at http://topaz.gatech.edu/GeneTack/.
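
To illustrate how a Viterbi path can mark a frameshift, here is a minimal sketch of the generic log-space Viterbi algorithm applied to a toy two-state HMM. It is not GeneTack's model: the state names (`frame0`, `frame1`), the two-letter observation alphabet, all probabilities, and the helper `state_changes` are assumptions invented for illustration. The idea it shows is only that a switch between states along the most likely path can be read as a candidate frameshift position.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for a discrete HMM, using log-probabilities."""
    # score[s] = best log-probability of any state path ending in state s
    score = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for symbol in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


def state_changes(path):
    """Hypothetical helper: positions where the decoded state differs from the previous one."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


def logs(table):
    """Convert a dict of probabilities to a dict of log-probabilities."""
    return {key: math.log(p) for key, p in table.items()}


# Toy model (all values are illustrative assumptions, not GeneTack parameters).
STATES = ["frame0", "frame1"]
LOG_START = logs({"frame0": 0.5, "frame1": 0.5})
LOG_TRANS = {
    "frame0": logs({"frame0": 0.95, "frame1": 0.05}),  # switching frames is rare
    "frame1": logs({"frame0": 0.05, "frame1": 0.95}),
}
LOG_EMIT = {
    "frame0": logs({"a": 0.9, "b": 0.1}),  # "a" stands for a frame0-like signal
    "frame1": logs({"a": 0.1, "b": 0.9}),  # "b" stands for a frame1-like signal
}

_, shifted_path = viterbi("aaaabbbb", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
_, noisy_path = viterbi("aaabaaa", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

In this toy setting, `state_changes(shifted_path)` reports a single switch where the signal changes for good, while the isolated `b` in the second sequence is absorbed as noise because two state switches cost more than one unlikely emission.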

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Base Sequence
  • Computer Simulation
  • Data Interpretation, Statistical
  • Frameshift Mutation / genetics*
  • Models, Genetic*
  • Models, Statistical
  • Molecular Sequence Data
  • Open Reading Frames / genetics*
  • Proteins / genetics*
  • Sequence Analysis, DNA / methods*

Substances

  • Proteins