Motivation: The hidden Markov model (HMM) is a valuable technique for gene-finding, especially because its flexibility allows various sequence features to be incorporated. Exploiting this flexibility, recent programs for bacterial gene-finding include ribosomal binding site (RBS) information to improve the recognition accuracy of the start codon. We report here our attempt to extend the model to cover the entire transcriptional unit, enabling the prediction of operon structures.
Results: First, we improved the prediction accuracy of coding sequences (CDSs) by employing models of 'typical', 'atypical' and 'negative (false-positive)' classes, as well as models of the RBS and its downstream spacer. The sensitivity of exactly predicting the 204 experimentally confirmed CDSs reached 90.2% in an objective test. Based on the predicted CDSs, the positions of promoters and terminators were then predicted. Our model could exactly recognize 60% of 390 known transcriptional units. Thus, this prediction problem remains far from trivial in terms of both accuracy and significance. We propose it as an open problem in bioinformatics, because ongoing or planned post-sequencing projects will produce abundant data for future improvements.