GeneMark: Difference between revisions

GeneMark
Developer(s)	Georgia Institute of Technology
Initial release	1993
Operating system	Linux, Windows, and Mac OS
License	Free for academic, non-profit or U.S. Government use
Website	opal.biology.gatech.edu/GeneMark

Revision as of 02:47, 13 May 2015

GeneMark is a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. First developed in 1993, GeneMark was used in 1995 to annotate the first completely sequenced bacterium, Haemophilus influenzae, and in 1996 to annotate the first completely sequenced archaeon, Methanococcus jannaschii. The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding DNA. Parameters of the models are estimated from training sets of sequences of a known type. The central step of the algorithm computes the a posteriori probability that a sequence fragment is protein-coding in one of six possible reading frames (three on the given strand and three on the complementary DNA strand) or is "non-coding".^{[clarification needed]}
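The posterior computation over the seven hypotheses (six coding frames plus "non-coding") can be illustrated with a minimal Python sketch. It is not GeneMark's implementation: the parameter values, the prior, and all function names are hypothetical, and the coding model is reduced to position-specific nucleotide frequencies, the simplest three-periodic (inhomogeneous) model.

```python
import math

# Hypothetical toy parameters, NOT GeneMark's: nucleotide frequencies at codon
# positions 1, 2 and 3 for coding DNA, and one homogeneous distribution for
# non-coding DNA.
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
]
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    """Log-likelihood of seq if its first base sits at codon position frame (0-2)."""
    return sum(math.log(CODING[(i + frame) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq):
    """Log-likelihood of seq under the homogeneous non-coding model."""
    return sum(math.log(NONCODING[b]) for b in seq)


def frame_posteriors(seq, prior_noncoding=0.5):
    """Posterior probability of each of the seven hypotheses via Bayes' rule."""
    rc = reverse_complement(seq)
    prior_coding = (1 - prior_noncoding) / 6  # prior shared equally by six frames
    log_joint = {}
    for f in range(3):
        log_joint[f"+{f + 1}"] = math.log(prior_coding) + log_lik_coding(seq, f)
        log_joint[f"-{f + 1}"] = math.log(prior_coding) + log_lik_coding(rc, f)
    log_joint["non-coding"] = math.log(prior_noncoding) + log_lik_noncoding(seq)
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return {k: math.exp(v - m) / total for k, v in log_joint.items()}
```

Calling `frame_posteriors` on an uppercase A/C/G/T fragment returns a dictionary with keys `+1`, `+2`, `+3`, `-1`, `-2`, `-3` and `non-coding` whose values sum to one.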

GeneMark.hmm

Prokaryotic

The GeneMark.hmm algorithm was designed to improve gene prediction quality by finding exact gene starts. The idea was to integrate the GeneMark models into a hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. In addition, a ribosome binding site model is used to make gene-start predictions more accurate. In evaluations by different groups,^{[by whom?]} GeneMark.hmm was shown to be significantly more accurate than GeneMark in exact gene prediction.^{[citation needed]} Since 1998, GeneMark.hmm and its self-training version GeneMarkS have been the standard tools for gene identification in new prokaryotic genomic sequences, including metagenomes.^{[citation needed]}
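The idea of reading gene boundaries off transitions between hidden states can be sketched with a two-state hidden Markov model decoded by the Viterbi algorithm. This is a simplified illustration under stated assumptions, not the GeneMark.hmm model: the states, all probabilities, and the function names are hypothetical, and neither reading frames nor the ribosome binding site model are included.

```python
import math

STATES = ("non-coding", "coding")

# Hypothetical toy parameters chosen for illustration only.
START = {"non-coding": 0.9, "coding": 0.1}
TRANS = {
    "non-coding": {"non-coding": 0.95, "coding": 0.05},
    "coding": {"non-coding": 0.05, "coding": 0.95},
}
EMIT = {
    "non-coding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable hidden-state path for a non-empty sequence."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t - 1][s] is the best predecessor of state s at position t
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def boundaries(path):
    """Positions where the hidden state changes, i.e. predicted region boundaries."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```

With these toy parameters, a G+C-rich stretch embedded in A+T-rich sequence is decoded as a "coding" segment, and `boundaries` reports the two positions where the state switches.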

Eukaryotic

After the prokaryotic version of GeneMark.hmm was developed, the approach was extended to eukaryotic genomes, where accurate prediction of protein-coding exon boundaries presented a major challenge. The hidden Markov model architecture of eukaryotic GeneMark.hmm consists of hidden states for initial, internal, and terminal exons, introns, intergenic regions and single-exon genes located on both DNA strands. It also includes hidden states for the initiation site and termination site, as well as donor and acceptor splice sites. GeneMark.hmm has been frequently used for annotation of plant and animal genomes.^{[citation needed]}
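The hidden states listed above can be written down as a small transition table. The sketch below covers one strand only, and the allowed transitions are an assumption based on the usual order of gene elements, not the published GeneMark.hmm topology; the state names and the helper function are hypothetical.

```python
# Assumed forward-strand topology for illustration; a full model would
# mirror these states for the complementary strand.
ALLOWED = {
    "intergenic": {"intergenic", "initiation_site"},
    "initiation_site": {"initial_exon", "single_exon_gene"},
    "initial_exon": {"initial_exon", "donor_site"},
    "single_exon_gene": {"single_exon_gene", "termination_site"},
    "donor_site": {"intron"},
    "intron": {"intron", "acceptor_site"},
    "acceptor_site": {"internal_exon", "terminal_exon"},
    "internal_exon": {"internal_exon", "donor_site"},
    "terminal_exon": {"terminal_exon", "termination_site"},
    "termination_site": {"intergenic"},
}


def is_valid_path(path):
    """Check that every consecutive pair of states is an allowed transition."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))
```

Under this assumed topology, an intron can only be left through an acceptor site, so a path that jumps from `intron` directly to `terminal_exon` is rejected.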

Heuristic models

Finding genes accurately in DNA sequences by computer requires models of protein-coding and non-coding regions, derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. A heuristic method for deriving the parameters of inhomogeneous Markov models of protein-coding regions was proposed in 1999.^{[further explanation needed]} This heuristic uses the observation that the parameters of the Markov models used in GeneMark can be approximated by functions of the sequence G+C content.^{[clarification needed]} Therefore, a short DNA sequence sufficient for estimating the genome G+C content (a fragment longer than 400 nucleotides) is also sufficient for deriving the parameters of the Markov models used in GeneMark and GeneMark.hmm.
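A minimal Python sketch of the idea that model parameters can be written as functions of G+C content follows. Only the 400-nucleotide threshold comes from the text. The function names are hypothetical, and the mapping from G+C content to nucleotide frequencies is a deliberately simple stand-in (it assumes equal frequencies of G and C and of A and T), not the functions used by the 1999 heuristic method.

```python
MIN_LENGTH = 400  # fragment must be longer than this, as stated in the text


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / counted


def background_model(gc):
    """Stand-in parameter function: nucleotide frequencies determined by G+C alone."""
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}


def model_from_fragment(seq):
    """Derive the toy model from a fragment longer than MIN_LENGTH nucleotides."""
    if len(seq) <= MIN_LENGTH:
        raise ValueError(f"fragment must be longer than {MIN_LENGTH} nucleotides")
    return background_model(gc_content(seq))
```

The point of the sketch is the shape of the computation: a single number estimated from the fragment determines every parameter of the model, so no training set of known genes is needed.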

Models built by the heuristic approach can be used to find genes in small fragments of anonymous prokaryotic genomes, such as metagenomic sequences, as well as in genomes of organelles, viruses, phages and plasmids. The method can also be used for highly inhomogeneous genomes, where the Markov models must be adjusted to account for local DNA composition. The heuristic method provides evidence that the mutational pressure that shapes G+C content is the driving force in the evolution of codon usage patterns.^{[citation needed]}

Family of gene prediction programs

Bacteria, Archaea and metagenomes

GeneMark-P
GeneMark.hmm-P
GeneMarkS

Eukaryotes

GeneMark-E
GeneMark.hmm-E
GeneMark.hmm-ES

Viruses, phages and plasmids

Heuristic approach

EST and cDNA

GeneMark-E

References

Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry (1993) 17 (2): 123–133.
Lukashin A. and Borodovsky M. "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Research (1998) 26 (4): 1107–1115. doi:10.1093/nar/26.4.1107
Besemer J. and Borodovsky M. "Heuristic approach to deriving models for gene finding." Nucleic Acids Research (1999) 27 (19): 3911–3920. doi:10.1093/nar/27.19.3911
Besemer J., Lomsadze A. and Borodovsky M. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research (2001) 29 (12): 2607–2618. doi:10.1093/nar/29.12.2607
Mills R., Rozanov M., Lomsadze A., Tatusova T. and Borodovsky M. "Improving gene annotation in complete viral genomes." Nucleic Acids Research (2003) 31 (23): 7041–7055. doi:10.1093/nar/gkg878
Besemer J. and Borodovsky M. "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses." Nucleic Acids Research (2005) 33 (Web Server Issue): W451-454. doi:10.1093/nar/gki487
Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research (2005) 33 (20): 6494–6506. doi:10.1093/nar/gki937
Zhu W., Lomsadze A. and Borodovsky M. "Ab initio gene identification in metagenomic sequences." Nucleic Acids Research (2010) 38 (12): e132. doi:10.1093/nar/gkq275

External links

Official website

@@ Line 22: / Line 22: @@
 | status                 =
 | genre                  =
-| license                = [[Proprietary software|Proprietary]] [[commercial software]]; free for non-profit academic or U.S. Government use
+| license                =  Free for academic, non-profit or U.S. Government use
 | website                = {{URL|http://opal.biology.gatech.edu/GeneMark | opal.biology.gatech.edu/GeneMark }}
 }}
