Prokaryotic gene finding based on physicochemical characteristics of codons calculated from molecular dynamics simulations

Biophys J. 2008 Jun;94(11):4173-83. doi: 10.1529/biophysj.107.116392. Epub 2008 Mar 7.

Abstract

An ab initio model for gene prediction in prokaryotic genomes is proposed, based on physicochemical characteristics of codons calculated from molecular dynamics (MD) simulations. The model requires three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the codon's propensity for protein-nucleic acid interactions. The base pairing and stacking energies for each codon are obtained from recently reported MD simulations on all unique tetranucleotide steps. The third parameter, the interaction propensity, is assigned using the conjugate rule previously proposed to account for the wobble hypothesis with respect to degeneracies in the genetic code. Its values correlate well with ab initio MD-calculated solvation energies and flexibility of codon sequences, as well as with codon usage in genes and amino acid composition frequencies in approximately 175,000 protein sequences in the Swissprot database. Assigning these three parameters to each codon makes it possible to calculate the magnitude and orientation of a cumulative three-dimensional vector for a DNA sequence of any length in each of the six genomic reading frames. Analysis of 372 genomes comprising approximately 350,000 genes shows that the orientations of the gene and nongene vectors are well differentiated, so that genic and nongenic sequences can be distinguished at a level equivalent to or better than that of currently available knowledge-based models trained on empirical data. This result strongly supports the possibility of a unique and useful physicochemical characterization of DNA sequences from codons to genomes.
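
The cumulative-vector idea described in the abstract can be illustrated with a minimal sketch. It assumes that each codon contributes one (x, y, z) triple (base pairing energy, stacking energy, interaction propensity) and that the triples are simply summed along a reading frame; the abstract does not give the exact summation or normalization, so this is an assumption. The function names and the `placeholder_params` table are hypothetical, and its numbers are arbitrary stand-ins, not the published MD-derived values.

```python
import math
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def placeholder_params():
    """Return a hypothetical codon -> (x, y, z) table.

    x, y, z stand in for base pairing energy, stacking energy and
    interaction propensity. The numbers are arbitrary placeholders,
    NOT the values derived from MD simulations in the paper.
    """
    table = {}
    for i, bases in enumerate(product(BASES, repeat=3)):
        codon = "".join(bases)
        gc = sum(b in "GC" for b in codon)
        table[codon] = (-(2.0 + gc), -(1.0 + 0.1 * (i % 7)), 1.0 if i % 2 == 0 else -1.0)
    return table


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def cumulative_vector(seq, frame, params):
    """Sum the per-codon triples of seq, starting at offset frame (0, 1 or 2).

    Codons missing from params (for example those containing N) are skipped.
    """
    x = y = z = 0.0
    for i in range(frame, len(seq) - 2, 3):
        triple = params.get(seq[i:i + 3])
        if triple is None:
            continue
        x += triple[0]
        y += triple[1]
        z += triple[2]
    return (x, y, z)


def six_frame_vectors(seq, params):
    """Cumulative vectors for the three forward and three reverse frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    vectors = {}
    for f in range(3):
        vectors[f"+{f + 1}"] = cumulative_vector(seq, f, params)
        vectors[f"-{f + 1}"] = cumulative_vector(rc, f, params)
    return vectors


def magnitude(v):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(c * c for c in v))


def orientation(v):
    """Unit vector giving the direction of v (zero vector if v is zero)."""
    m = magnitude(v)
    return tuple(c / m for c in v) if m else (0.0, 0.0, 0.0)


if __name__ == "__main__":
    params = placeholder_params()
    for frame, vec in six_frame_vectors("ATGAAACCCGGGTTTTAA", params).items():
        print(frame, vec, orientation(vec))
```

In a sketch like this, a candidate open reading frame would be classified by comparing the orientation of its vector with reference directions for genic and nongenic sequences; how those reference directions are obtained is not stated in the abstract and is left out here.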

Publication types

  • Evaluation Study
  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bacterial Proteins / genetics*
  • Base Sequence
  • Chromosome Mapping / methods*
  • Codon / chemistry*
  • Codon / genetics*
  • Computer Simulation
  • DNA / chemistry*
  • DNA / genetics*
  • Genome, Bacterial / genetics
  • Models, Chemical*
  • Models, Molecular
  • Molecular Sequence Data
  • Open Reading Frames
  • Sequence Analysis, DNA / methods

Substances

  • Bacterial Proteins
  • Codon
  • DNA