We have developed and implemented a method for computational gene identification called GIN (gene identification using neural nets and homology information) that is specifically designed to avoid false-positive predictions. On a benchmark set of 570 vertebrate genes constructed by Burset and Guigo, it correctly predicts 55% of all genes tested, with a specificity of 99% and an overall accuracy of 92%. The method combines homology searches in protein and expressed sequence tag databases with several neural networks designed to recognize start codons, poly(A) signals, stop codons, and splice sites. Predicted exons are assembled into genes using a homology-based scoring function. GIN is able to recognize multiple genes within genomic DNA, as demonstrated by the identification of a globin gene (gamma-globin-1(G)) that has not been annotated as a coding region in the widely used test set of Burset and Guigo. Furthermore, GIN identifies more than 107 other protein hits in noncoding regions and classifies them as possible pseudogenes or splice variants.
Copyright 1998 Academic Press.