Reduce manual curation by combining gene predictions from multiple annotation engines, a case study of start codon prediction

PLoS One. 2013 May 10;8(5):e63523. doi: 10.1371/journal.pone.0063523. Print 2013.

Abstract

Prokaryotic genomes are now sequenced faster than their gene annotations can be manually curated. Automated genome annotation engines (AGEs) provide users with a straightforward and complete solution for predicting open reading frame (ORF) coordinates and function. For many labs, the use of AGEs is therefore essential to decrease the time necessary for annotating a given prokaryotic genome. However, it is not uncommon for AGEs to provide different and sometimes conflicting predictions. Combining multiple AGEs might allow for more accurate predictions. Here we analyzed the ab initio ORF calling performance of different AGEs based on curated genome annotations of eight strains from different bacterial species with GC% ranging from 35-52%. We present a case study which demonstrates a novel way of comparative genome annotation, using combinations of AGEs in a pre-defined order (or path) to predict ORF start codons. The AGE combinations are ordered from high to low specificity, where the specificity is based on the eight genome annotations. For each AGE combination we are able to derive a so-called projected confidence value, which is the average specificity of ORF start codon prediction based on the eight genomes. The projected confidence makes it possible to estimate the likelihood that a particular AGE combination predicts a particular ORF start codon correctly, and to pinpoint ORFs whose start codons are notoriously difficult to predict. We correctly predict start codons for 90.5±4.8% of the genes in a genome (based on the eight genomes) with an accuracy of 81.1±7.6%. Our consensus-path methodology allows a marked improvement over majority voting (9.7±4.4%), and with an optimal path, sensitivity of ORF start prediction is gained while maintaining a high specificity.
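The consensus-path idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation; it rests on several assumptions that the abstract does not confirm: an AGE combination "fires" for an ORF only when every engine in it predicts the same start coordinate, specificity is taken as the fraction of such agreed starts that match the curated start, and ORFs are matched across engines by a shared identifier. The engine names and all function names (`agreed_start`, `combo_specificity`, `build_path`, `predict_start`) are hypothetical.

```python
from statistics import mean


def agreed_start(combo, orf_id, predictions):
    """Return the start coordinate shared by every engine in `combo`.

    Returns None when the engines disagree or when none of them
    predicts this ORF (assumed rule for when a combination applies).
    """
    starts = {predictions.get(engine, {}).get(orf_id) for engine in combo}
    if len(starts) == 1:
        return next(iter(starts))  # may itself be None (no prediction)
    return None


def combo_specificity(combo, predictions, curated):
    """Fraction of agreed starts that match the curated start.

    Assumed definition of 'specificity' for one reference genome;
    only ORFs present in the curated annotation are scored.
    """
    called = correct = 0
    for orf_id, true_start in curated.items():
        start = agreed_start(combo, orf_id, predictions)
        if start is None:
            continue
        called += 1
        correct += start == true_start
    return correct / called if called else 0.0


def build_path(combos, reference_genomes):
    """Order combinations from high to low projected confidence.

    `reference_genomes` is a list of (predictions, curated) pairs.
    Projected confidence is the mean specificity over those genomes.
    """
    scored = [
        (mean(combo_specificity(c, preds, cur) for preds, cur in reference_genomes), c)
        for c in combos
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def predict_start(orf_id, predictions, path):
    """Walk the path and return (start, projected_confidence, combo)
    for the first combination that agrees, or None if none does."""
    for confidence, combo in path:
        start = agreed_start(combo, orf_id, predictions)
        if start is not None:
            return start, confidence, combo
    return None
```

In this sketch, an ORF resolved late in the path (or not at all) carries a low projected confidence, which is one way to flag the start codons that most deserve manual curation.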

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Base Composition
  • Codon, Initiator*
  • Computational Biology / methods*
  • Consensus Sequence
  • Genome, Bacterial
  • Genomics / methods
  • Molecular Sequence Annotation / methods*
  • Open Reading Frames
  • Reproducibility of Results

Substances

  • Codon, Initiator

Grants and funding

Funding for SH’s contribution to this study came from both NIZO and TIFN. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.