Combining diverse evidence for gene recognition in completely sequenced bacterial genomes

D Frishman; A Mironov; H W Mewes; M Gelfand

doi:10.1093/nar/26.12.2941

Nucleic Acids Res. 1998 Jun 15;26(12):2941-7. doi: 10.1093/nar/26.12.2941.

Authors

D Frishman¹, A Mironov, H W Mewes, M Gelfand

Affiliation

¹ Munich Information Center for Protein Sequences (MIPS) of the German National Center for Health and Environment (GSF), Am Klopferspitz 18a, 82152 Martinsried, Germany.

Abstract

Analysis of a newly sequenced bacterial genome starts with identification of protein-coding genes. Functional assignment of proteins requires exact knowledge of protein N-termini. We present a new program, ORPHEUS, that identifies candidate genes and accurately predicts gene starts. The analysis starts with a database similarity search and identification of reliable gene fragments. The latter are used to derive statistical characteristics of protein-coding regions and ribosome-binding sites and to predict the complete set of genes in the analyzed genome. In a test on the Bacillus subtilis and Escherichia coli genomes, the program correctly identified 93.3% (B. subtilis) and 96.3% (E. coli) of experimentally annotated genes longer than 100 codons described in the PIR-International database, and for these genes 96.3% (B. subtilis) and 83.9% (E. coli) of starts were predicted exactly. Furthermore, 98.9% (B. subtilis) and 99.1% (E. coli) of genes longer than 100 codons annotated in GenBank were found, and 92.9% and 75.7% of predicted starts, respectively, coincided with the feature table description. Finally, for the complete gene complements of B. subtilis and E. coli, including genes shorter than 100 codons, gene prediction accuracy was 88.9% and 87.1%, respectively, with 94.2% and 76.7% of starts coinciding with the existing annotation.
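
The abstract stresses that finding a gene and finding its exact start are separate problems. The sketch below illustrates why: a single open reading frame usually contains several in-frame candidate start codons, and something beyond the reading frame itself (in ORPHEUS, coding statistics and ribosome-binding sites derived from reliable gene fragments) is needed to choose among them. This is not the ORPHEUS algorithm. The function name `candidate_starts`, the `min_codons` parameter, the forward-strand-only scan, and the choice of ATG, GTG and TTG as start codons are assumptions made for illustration.

```python
# Illustrative only: enumerate in-frame candidate starts for each ORF
# on the forward strand. Not the ORPHEUS method.
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq, min_codons=2):
    """Return one record per stop codon that has at least one usable start.

    Each record gives the reading frame, the 0-based position of the stop
    codon, and the 0-based positions of all in-frame start codons that
    yield at least `min_codons` codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        starts = []  # start codons seen since the previous in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                valid = [s for s in starts if (i - s) // 3 >= min_codons]
                if valid:
                    orfs.append({"frame": frame, "stop": i, "starts": valid})
                starts = []
            elif codon in START_CODONS:
                starts.append(i)
    return orfs


if __name__ == "__main__":
    # One ORF in frame 0 with two competing starts (ATG at 0, GTG at 6).
    print(candidate_starts("ATGAAAGTGCCCTAA"))
```

In this toy sequence the same stop codon is reachable from two different starts, so a predictor that reports only the longest ORF would always pick position 0, whether or not that is the true N-terminus.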

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Bacillus subtilis / genetics
Bacterial Proteins / genetics
Codon, Initiator
Databases, Factual
Escherichia coli / genetics
Genome, Bacterial*
Open Reading Frames
Sequence Alignment / methods*
Sequence Analysis, DNA
Software*

Substances

Bacterial Proteins
Codon, Initiator