GENSCAN: Difference between revisions

From Wikipedia, the free encyclopedia
Added rather sparse software infobox
Citation bot (talk | contribs)
Add: doi-access. Removed proxy/dead URL that duplicated identifier. | Use this bot. Report bugs. | Suggested by Headbomb | #UCB_toolbar
 
(32 intermediate revisions by 17 users not shown)
Line 10: Line 10:
| website = {{URL|http://genes.mit.edu/GENSCANinfo.html}}
}}
In [[bioinformatics]] '''GENSCAN''' is a [[Computer program|program]] to identify complete [[gene]] structures in genomic [[DNA]]. It is a G[[Hidden Markov model|HMM]]-based program that can be used to [[gene prediction|predict the location of genes]] and their [[exon]]-[[intron]] boundaries in genomic sequences from a variety of organisms. The GENSCAN Web server can be found at [[Massachusetts Institute of Technology|MIT]].<ref>http://genes.mit.edu/GENSCAN.html The GENSCAN Web Server at MIT</ref>
In [[bioinformatics]], '''GENSCAN''' is a [[Computer program|program]] that identifies complete [[gene]] structures in genomic [[DNA]]. It is based on a generalized [[hidden Markov model]] (GHMM) and can be used to [[gene prediction|predict the location of genes]] and their [[exon]]–[[intron]] boundaries in genomic sequences from a variety of organisms. The GENSCAN web server is hosted at [[Massachusetts Institute of Technology|MIT]].<ref>http://genes.mit.edu/GENSCAN.html {{Webarchive|url=https://web.archive.org/web/20130906115338/http://genes.mit.edu/GENSCAN.html |date=2013-09-06 }} The GENSCAN Web Server at MIT</ref>


GENSCAN was developed by [[Christopher Burge]] <ref>Burge, C. B. (1998) Modeling dependencies in pre-mRNA splicing signals. In [[Steven Salzberg|Salzberg, S.]], Searls, D. and Kasif, S., eds. Computational Methods in Molecular Biology, Elsevier Science, Amsterdam, pp. 127-163. ISBN 9780444502049</ref> in the research group of [[Samuel Karlin]] <ref>{{cite pmid|9149143}}</ref><ref>{{cite pmid|9666331}}</ref> Department of Mathematics, [[Stanford University]].
GENSCAN was developed by [[Christopher Burge]] in the research group of [[Samuel Karlin]] at [[Stanford University]].<ref>Burge, C. B. (1998) Modeling dependencies in pre-mRNA splicing signals. In [[Steven Salzberg|Salzberg, S.]], Searls, D. and Kasif, S., eds. Computational Methods in Molecular Biology, Elsevier Science, Amsterdam, pp. 127-163. {{ISBN|978-0-444-50204-9}}</ref><ref name=":0">{{Cite journal
|last1 = Burge
|first1 = Christopher
|authorlink1 = Christopher Burge
|last2 = Karlin
|first2 = Samuel
|authorlink2 = Samuel Karlin
|title = Prediction of complete gene structures in human genomic DNA
|doi = 10.1006/jmbi.1997.0951
|journal = Journal of Molecular Biology
|volume = 268
|issue = 1
|pages = 78–94
|year = 1997
|pmid = 9149143
|url = https://ai.stanford.edu/~serafim/cs262/Papers/GENSCAN.pdf
|url-status = dead
|archiveurl = https://web.archive.org/web/20150620094015/https://ai.stanford.edu/~serafim/cs262/Papers/GENSCAN.pdf
|archivedate = 2015-06-20
|citeseerx = 10.1.1.115.3107
}}</ref><ref>{{Cite journal
|doi = 10.1016/S0959-440X(98)80069-9
|last1 = Burge
|first1 = C.
|authorlink1 = Christopher Burge
|last2 = Karlin
|first2 = S.
|authorlink2 = Samuel Karlin
|title = Finding the genes in genomic DNA
|journal = Current Opinion in Structural Biology
|volume = 8
|issue = 3
|pages = 346–354
|year = 1998
|url =
|pmid = 9666331
|doi-access= free
}}</ref>

==History==
In 2001, human gene prediction began to draw on [[comparative genomics]]. This led to the development of TWINSCAN, an adaptation of GENSCAN with higher accuracy. Other programs, such as N-SCAN, were later developed by further adapting the GHMM.<ref name=":1">{{Cite journal |last=Flicek |first=Paul |date=2007 |title=Gene prediction: compare and CONTRAST |journal=Genome Biology |volume=8 |issue=12 |pages=233 |doi=10.1186/gb-2007-8-12-233 |issn=1474-760X |pmc=2246255 |pmid=18096089 |doi-access=free }}</ref>

As of 2002, GENSCAN remained a popular tool in bioinformatics and had become a standard feature for genomes released on the University of California, Santa Cruz and [[Ensembl]] [[genome browser]]s.<ref name=":1" />

==Implementation==

=== Genomic Model ===
The primary goal in developing GENSCAN's genomic sequence model was to capture both the general and the specific properties of the individual functional units of [[Eukaryote|eukaryotic]] genes (e.g. [[Exon|exons]], [[Intron|introns]], [[Splice site|splice sites]], [[Promoter (genetics)|promoters]]). Particular focus was placed on features recognized by the general transcriptional, splicing and translational machinery that processes most [[Protein-coding genes|protein-coding genes]], as opposed to specialized signals associated with the [[Transcription (biology)|transcription]] or [[RNA splicing|splicing]] of particular genes and [[Gene family|gene families]] (e.g. [[TATA box]]). Coding regions are described by a general three-periodic fifth-order [[Markov model]] rather than by models of specific [[Protein motif|protein motifs]] or database [[Sequence homology|homology]] information. The model also accounts for differences in gene structure and gene density between compositional regions of the human genome.<ref name=":0" />

Because it relies on these features, GENSCAN does not need to reference similar genes in protein sequence databases. Its predictions are therefore complementary to those of homology-based gene identification methods (e.g. querying protein databases with [[BLAST (biotechnology)|BLASTX]]). Overall, the structure of the model used in GENSCAN is similar to that of a generalized [[hidden Markov model]].<ref name=":0" />

=== Features ===
GENSCAN's implementation differs from other programs in several ways. Notably, GENSCAN uses a genomic sequence model of [[double-stranded DNA]], in which genes on both strands are analyzed simultaneously. GENSCAN can also analyze sequences that contain partial genes or no genes, whereas other programs of its time could only analyze sequences containing a single complete gene. These two properties make GENSCAN particularly useful for analyzing longer human genomic sequences. In addition, GENSCAN uses maximal dependence decomposition, a method for modeling functional signals in DNA and protein sequences that allows dependencies between signal positions to be taken into account. In GENSCAN, this method is used to build a model of the donor splice signal that captures dependencies associated with the mechanisms by which donor splice sites are recognized in [[pre-mRNA]] sequences.<ref name=":0" />

GENSCAN can estimate the reliability of each of its predictions using the [[Forward–backward algorithm|forward–backward algorithm]].<ref name=":0" />

GENSCAN is also useful for predicting exon and gene locations in longer human genomic sequences, and several of its features support this. First, it captures differences in gene structure and composition between regions of differing C + G content in the human genome, using separate sets of empirically derived model parameters. Second, as noted above, it can predict multiple genes in a sequence and can handle partial genes and both DNA strands. Lastly, it captures dependencies between signal positions through new models of donor and acceptor splice sites.<ref name=":0" />

=== Efficiency ===
GENSCAN's run time scales almost linearly with sequence length for realistically sized sequences (several kilobases or more), but is quadratic in the worst case.<ref name=":0" />

=== Supplemental Usage ===
GENSCAN, like other gene prediction programs, does not produce results that fully match those of other programs, owing to differences in algorithms, parameters and training sets, among other factors. GENSCAN has therefore been used in methods that combine the results of two gene prediction programs: if either program is confident in a prediction, that prediction is used; if neither program is confident, a prediction is used only when both programs agree on it.<ref name=":2" />

==Accuracy==
Tests were conducted to evaluate the accuracy of GENSCAN on short data sets. One test used the Burset/Guigó data set, which contains 570 vertebrate multi-exon gene sequences. The results are shown in the table below, along with those of other programs tested on the same data set. GENSCAN is generally more accurate than its competitors at both the [[nucleotide]] and exon levels.<ref name=":0" />
{| class="wikitable sortable"
|+GENSCAN Accuracy vs. Other Programs<ref name=":0" />
!Program
!Sequences
!Nucleotide Sensitivity
!Nucleotide Specificity
!Nucleotide Approximate Correlation
!Nucleotide Correlation Coefficient
!Exon Sensitivity
!Exon Specificity
!Exon Average
!Missed Exons
!Wrong Exons
|-
|GENSCAN
|570
|0.93
|0.93
|0.91
|0.92
|0.78
|0.81
|0.80
|0.09
|0.05
|-
|FGENEH
|569
|0.77
|0.88
|0.78
|0.80
|0.61
|0.64
|0.64
|0.15
|0.12
|-
|GeneID
|570
|0.63
|0.81
|0.67
|0.65
|0.44
|0.46
|0.45
|0.28
|0.24
|-
|Genie
|570
|0.76
|0.77
|0.72
|n/a
|0.55
|0.48
|0.51
|0.17
|0.33
|-
|GenLang
|570
|0.72
|0.79
|0.69
|0.71
|0.51
|0.52
|0.52
|0.21
|0.22
|-
|GeneParser2
|562
|0.66
|0.79
|0.67
|0.65
|0.35
|0.40
|0.37
|0.34
|0.17
|-
|GRAIL2
|570
|0.72
|0.87
|0.75
|0.76
|0.36
|0.43
|0.40
|0.25
|0.11
|-
|SORFIND
|561
|0.71
|0.85
|0.73
|0.72
|0.42
|0.47
|0.45
|0.24
|0.14
|-
|Xpound
|570
|0.61
|0.87
|0.68
|0.69
|0.15
|0.18
|0.17
|0.33
|0.13
|-
|GeneID+
|478
|0.91
|0.91
|0.88
|0.88
|0.73
|0.70
|0.71
|0.07
|0.13
|-
|GeneParser3
|478
|0.86
|0.91
|0.86
|0.85
|0.56
|0.58
|0.57
|0.14
|0.09
|}
Furthermore, the table below shows GENSCAN's accuracy on genomic sequences grouped by [[GC-content|C + G content]] and by type of organism. The data show that GENSCAN's accuracy was relatively insensitive to both C + G content and organism type. This further suggests that GENSCAN is robust to factors that would have affected the results of comparable gene prediction programs.<ref name=":0" />
{| class="wikitable sortable"
|+GENSCAN Accuracy for Sequences Organized by C+G Content and Organism<ref name=":0" />
!Subset
!Sequences
!Nucleotide Sensitivity
!Nucleotide Specificity
!Nucleotide Approximate Correlation
!Nucleotide Correlation Coefficient
!Exon Sensitivity
!Exon Specificity
!Exon Average
!Missed Exons
!Wrong Exons
|-
|C + G <40
|86
|0.90
|0.95
|0.90
|0.93
|0.78
|0.87
|0.84
|0.14
|0.05
|-
|C + G 40-50
|220
|0.94
|0.92
|0.91
|0.91
|0.80
|0.82
|0.82
|0.08
|0.05
|-
|C + G 50-60
|208
|0.93
|0.93
|0.90
|0.92
|0.75
|0.77
|0.77
|0.08
|0.05
|-
|C + G >60
|56
|0.97
|0.89
|0.90
|0.90
|0.76
|0.77
|0.76
|0.07
|0.08
|-
|Primates
|237
|0.96
|0.94
|0.93
|0.94
|0.81
|0.82
|0.82
|0.07
|0.05
|-
|Rodents
|191
|0.90
|0.93
|0.89
|0.91
|0.75
|0.80
|0.78
|0.11
|0.05
|-
|Non-mamm. Vert.
|72
|0.93
|0.93
|0.90
|0.93
|0.81
|0.85
|0.84
|0.11
|0.06
|}
A separate test evaluated GENSCAN's accuracy on two GeneParser data sets from which all genes sharing more than 25% amino acid identity with genes in previous GeneParser test sets had been removed. The results, together with those of other programs on the same data sets, are shown in the table below. GENSCAN's accuracy varies little between the Burset/Guigó data set and the GeneParser data sets. The larger differences at some data points (e.g. a CC of 0.98 for high C + G sequences in GeneParser set II versus 0.90 for C + G >60 sequences in Burset/Guigó) may be due to the much smaller size of the GeneParser data sets. The tests on these three data sets were sufficient to support the conclusions above; however, because the data sets are not of realistic size, the reliability and scope of those conclusions are open to question.<ref name=":0" />
{| class="wikitable"
|+GENSCAN vs. Other Programs Prediction Accuracy Under Data Sets I and II<ref name=":0" />
!Program
!GeneID I
!GeneID II
!GRAIL3 I
!GRAIL3 II
!GeneParser2 I
!GeneParser2 II
!GENSCAN I
!GENSCAN II
|-
|All sequences
|
|
|
|
|
|
|
|
|-
|Correlation
|0.69
|0.55
|0.83
|0.75
|0.78
|0.80
|0.93
|0.93
|-
|Sensitivity
|0.69
|0.50
|0.83
|0.68
|0.87
|0.82
|0.98
|0.95
|-
|Specificity
|0.77
|0.75
|0.87
|0.91
|0.76
|0.86
|0.90
|0.94
|-
|Exons Correct
|0.42
|0.33
|0.52
|0.31
|0.47
|0.46
|0.79
|0.76
|-
|Exons Overlapped
|0.73
|0.64
|0.81
|0.58
|0.87
|0.76
|0.96
|0.91
|-
|High C + G
|
|
|
|
|
|
|
|
|-
|Correlation
|0.65
|0.73
|0.88
|0.80
|0.89
|0.71
|0.94
|0.98
|-
|Sensitivity
|0.72
|0.85
|0.87
|0.80
|0.90
|0.65
|1.00
|0.98
|-
|Specificity
|0.73
|0.73
|0.95
|0.88
|0.93
|0.87
|0.91
|0.98
|-
|Exons Correct
|0.38
|0.43
|0.67
|0.50
|0.64
|0.57
|0.76
|0.64
|-
|Exons Overlapped
|0.80
|0.86
|0.89
|0.79
|0.96
|0.79
|1.00
|0.93
|-
|Medium C + G
|
|
|
|
|
|
|
|
|-
|Correlation
|0.67
|0.52
|0.83
|0.75
|0.75
|0.82
|0.93
|0.94
|-
|Sensitivity
|0.65
|0.47
|0.86
|0.68
|0.86
|0.84
|0.97
|0.95
|-
|Specificity
|0.77
|0.76
|0.84
|0.91
|0.70
|0.87
|0.90
|0.95
|-
|Exons Correct
|0.37
|0.29
|0.51
|0.32
|0.41
|0.46
|0.79
|0.79
|-
|Exons Overlapped
|0.67
|0.62
|0.83
|0.28
|0.84
|0.79
|0.96
|0.93
|-
|Low C + G
|
|
|
|
|
|
|
|
|-
|Correlation
|0.81
|0.62
|0.62
|0.62
|0.72
|0.67
|0.92
|0.81
|-
|Sensitivity
|0.82
|0.56
|0.51
|0.45
|0.79
|0.71
|0.93
|0.80
|-
|Specificity
|0.85
|0.71
|0.87
|0.89
|0.75
|0.67
|0.94
|0.84
|-
|Exons Correct
|0.80
|0.47
|0.25
|0.16
|0.40
|0.37
|0.85
|0.68
|-
|Exons Overlapped
|0.85
|0.63
|0.55
|0.42
|0.85
|0.58
|0.85
|0.74
|}
In 1997, GENSCAN was found to be more accurate than previous gene prediction programs. However, it was shown to predict only 10–15% of genes correctly on realistic data sets, so further work was needed.<ref name=":1" /> Because of such inaccuracies, predictions made by GENSCAN and other programs must be verified by comparison with a [[complementary DNA]] (cDNA) sequence, an [[expressed sequence tag]] (EST) sequence, or a known protein sequence.<ref name=":2">{{Cite journal |last1=Rogic |first1=S. |last2=Ouellette |first2=B.F. F. |last3=Mackworth |first3=A. K. |date=2002-08-01 |title=Improving gene recognition accuracy by combining predictions from two gene-finding programs |journal=Bioinformatics |volume=18 |issue=8 |pages=1034–1045 |doi=10.1093/bioinformatics/18.8.1034 |pmid=12176826 |issn=1367-4803|doi-access=free }}</ref>


==References==
{{reflist}}



[[Category:Genetics]]
[[Category:Bioinformatics]]
[[Category:Bioinformatics software]]

Latest revision as of 22:33, 2 December 2023

GENSCAN
Developer(s): Christopher Burge
Available in: English
Type: Bioinformatics tool
Website: genes.mit.edu/GENSCANinfo.html

In bioinformatics, GENSCAN is a program that identifies complete gene structures in genomic DNA. It is based on a generalized hidden Markov model (GHMM) and can be used to predict the location of genes and their exon–intron boundaries in genomic sequences from a variety of organisms. The GENSCAN web server is hosted at MIT.[1]

GENSCAN was developed by Christopher Burge in the research group of Samuel Karlin at Stanford University.[2][3][4]

History


In 2001, human gene prediction began to draw on comparative genomics. This led to the development of TWINSCAN, an adaptation of GENSCAN with higher accuracy. Other programs, such as N-SCAN, were later developed by further adapting the GHMM.[5]

As of 2002, GENSCAN remained a popular tool in bioinformatics and had become a standard feature for genomes released on the University of California, Santa Cruz and Ensembl genome browsers.[5]

Implementation


Genomic Model


The primary goal in developing GENSCAN's genomic sequence model was to capture both the general and the specific properties of the individual functional units of eukaryotic genes (e.g. exons, introns, splice sites, promoters). Particular focus was placed on features recognized by the general transcriptional, splicing and translational machinery that processes most protein-coding genes, as opposed to specialized signals associated with the transcription or splicing of particular genes and gene families (e.g. TATA box). Coding regions are described by a general three-periodic fifth-order Markov model rather than by models of specific protein motifs or database homology information. The model also accounts for differences in gene structure and gene density between compositional regions of the human genome.[3]
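
To illustrate what "three-periodic fifth-order" means, the following is a minimal sketch, not GENSCAN's actual implementation. It assumes that each base is conditioned on the five preceding bases and that a separate probability table is kept for each of the three codon positions. The function names, the pseudocount and the toy training sequence are all hypothetical.

```python
import math
from collections import defaultdict

ORDER = 5            # fifth-order: condition each base on the 5 preceding bases
ALPHABET = "ACGT"


def train_three_periodic(coding_seqs, pseudocount=1.0):
    """Estimate P(base | previous 5 bases, codon position) from in-frame sequences.

    Each training sequence is assumed to start at codon position 0.
    Returns a function prob(phase, context, base).
    """
    # counts[phase][context][base]: one table per codon position (0, 1, 2)
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(ORDER, len(seq)):
            counts[i % 3][seq[i - ORDER:i]][seq[i]] += 1.0

    def prob(phase, context, base):
        row = counts[phase].get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, frame=0):
    """Log-probability of seq, given that seq[0] sits at codon position `frame`."""
    return sum(
        math.log(prob((i + frame) % 3, seq[i - ORDER:i], seq[i]))
        for i in range(ORDER, len(seq))
    )


# Toy training data (hypothetical): a repeated in-frame hexamer.
prob = train_three_periodic(["ATGGCC" * 20])
query = "ATGGCC" * 3
in_frame = log_likelihood(query, prob, frame=0)
shifted = log_likelihood(query, prob, frame=1)
```

Because the tables differ by codon position, the same sequence scores higher when read in the frame it was trained in (`in_frame`) than when read in a shifted frame (`shifted`), which is the property that lets such a model distinguish reading frames.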

Because it relies on these features, GENSCAN does not need to reference similar genes in protein sequence databases. Its predictions are therefore complementary to those of homology-based gene identification methods (e.g. querying protein databases with BLASTX). Overall, the structure of the model used in GENSCAN is similar to that of a generalized hidden Markov model.[3]

Features


GENSCAN's implementation differs from other programs in several ways. Notably, GENSCAN uses a genomic sequence model of double-stranded DNA, in which genes on both strands are analyzed simultaneously. GENSCAN can also analyze sequences that contain partial genes or no genes, whereas other programs of its time could only analyze sequences containing a single complete gene. These two properties make GENSCAN particularly useful for analyzing longer human genomic sequences. In addition, GENSCAN uses maximal dependence decomposition, a method for modeling functional signals in DNA and protein sequences that allows dependencies between signal positions to be taken into account. In GENSCAN, this method is used to build a model of the donor splice signal that captures dependencies associated with the mechanisms by which donor splice sites are recognized in pre-mRNA sequences.[3]

GENSCAN can estimate the reliability of each of its predictions using the forward–backward algorithm.[3]
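
As a rough illustration of the forward–backward algorithm, the sketch below computes per-position posterior state probabilities for a plain two-state hidden Markov model. This is a deliberate simplification: GENSCAN uses a generalized HMM with many more states, and the states, probabilities and sequence here are invented for the example.

```python
def forward_backward(obs, states, start_p, trans_p, emit_p):
    """Return, for each position, the posterior probability of each state."""
    n = len(obs)

    # Forward pass: fwd[t][s] = P(obs[0..t], state at t is s)
    fwd = [{} for _ in range(n)]
    for s in states:
        fwd[0][s] = start_p[s] * emit_p[s][obs[0]]
    for t in range(1, n):
        for s in states:
            fwd[t][s] = emit_p[s][obs[t]] * sum(
                fwd[t - 1][r] * trans_p[r][s] for r in states
            )

    # Backward pass: bwd[t][s] = P(obs[t+1..] | state at t is s)
    bwd = [{} for _ in range(n)]
    for s in states:
        bwd[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(
                trans_p[s][r] * emit_p[r][obs[t + 1]] * bwd[t + 1][r]
                for r in states
            )

    total = sum(fwd[n - 1][s] for s in states)  # P(obs)
    return [{s: fwd[t][s] * bwd[t][s] / total for s in states} for t in range(n)]


# Hypothetical toy model: a C+G-rich "coding" state and an A+T-rich "noncoding" state.
states = ("coding", "noncoding")
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
emit_p = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
posteriors = forward_backward("AATTGCGCGCAATT", states, start_p, trans_p, emit_p)
```

Each entry of `posteriors` sums to 1 over the states, so the value for a state at a given position can be read as a confidence score for that label. The sketch omits the scaling or log-space arithmetic that would be needed to avoid numerical underflow on long sequences.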

GENSCAN is also useful for predicting exon and gene locations in longer human genomic sequences, and several of its features support this. First, it captures differences in gene structure and composition between regions of differing C + G content in the human genome, using separate sets of empirically derived model parameters. Second, as noted above, it can predict multiple genes in a sequence and can handle partial genes and both DNA strands. Lastly, it captures dependencies between signal positions through new models of donor and acceptor splice sites.[3]

Efficiency


GENSCAN's run time scales almost linearly with sequence length for realistically sized sequences (several kilobases or more), but is quadratic in the worst case.[3]

Supplemental Usage


GENSCAN, like other gene prediction programs, does not produce results that fully match those of other programs, owing to differences in algorithms, parameters and training sets, among other factors. GENSCAN has therefore been used in methods that combine the results of two gene prediction programs: if either program is confident in a prediction, that prediction is used; if neither program is confident, a prediction is used only when both programs agree on it.[6]
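
The combination rule described above can be sketched as follows. This is a simplified reading of the rule rather than the published method: exons are represented as hypothetical (start, end) pairs, each program is assumed to attach a confidence score between 0 and 1 to every exon, and the 0.9 threshold is an arbitrary choice made for the example.

```python
def combine_predictions(pred_a, pred_b, threshold=0.9):
    """Combine exon predictions from two programs.

    pred_a, pred_b: dicts mapping (start, end) -> confidence score in [0, 1].
    An exon is accepted if either program is confident in it, or, failing
    that, if both programs predicted exactly the same exon.
    """
    accepted = set()
    # Rule 1: keep any exon that at least one program is confident about.
    for preds in (pred_a, pred_b):
        for exon, score in preds.items():
            if score >= threshold:
                accepted.add(exon)
    # Rule 2: keep remaining exons only when both programs agree on them.
    accepted |= pred_a.keys() & pred_b.keys()
    return sorted(accepted)


# Hypothetical predictions from two programs on the same sequence.
program_a = {(1, 100): 0.95, (200, 300): 0.50, (400, 500): 0.40}
program_b = {(200, 300): 0.60, (600, 700): 0.30}
combined = combine_predictions(program_a, program_b)
```

Here `(1, 100)` is kept because one program is confident in it, `(200, 300)` is kept because both programs predict it, and the two low-confidence exons predicted by only one program are dropped.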

Accuracy


Tests were conducted to evaluate the accuracy of GENSCAN on short data sets. One test used the Burset/Guigó data set, which contains 570 vertebrate multi-exon gene sequences. The results are shown in the table below, along with those of other programs tested on the same data set. GENSCAN is generally more accurate than its competitors at both the nucleotide and exon levels.[3]

GENSCAN Accuracy vs. Other Programs[3]
Program Sequences Nucleotide Sensitivity Nucleotide Specificity Nucleotide Approximate Correlation Nucleotide Correlation Coefficient Exon Sensitivity Exon Specificity Exon Average Missed Exons Wrong Exons
GENSCAN 570 0.93 0.93 0.91 0.92 0.78 0.81 0.80 0.09 0.05
FGENEH 569 0.77 0.88 0.78 0.80 0.61 0.64 0.64 0.15 0.12
GeneID 570 0.63 0.81 0.67 0.65 0.44 0.46 0.45 0.28 0.24
Genie 570 0.76 0.77 0.72 n/a 0.55 0.48 0.51 0.17 0.33
GenLang 570 0.72 0.79 0.69 0.71 0.51 0.52 0.52 0.21 0.22
GeneParser2 562 0.66 0.79 0.67 0.65 0.35 0.40 0.37 0.34 0.17
GRAIL2 570 0.72 0.87 0.75 0.76 0.36 0.43 0.40 0.25 0.11
SORFIND 561 0.71 0.85 0.73 0.72 0.42 0.47 0.45 0.24 0.14
Xpound 570 0.61 0.87 0.68 0.69 0.15 0.18 0.17 0.33 0.13
GeneID+ 478 0.91 0.91 0.88 0.88 0.73 0.70 0.71 0.07 0.13
GeneParser3 478 0.86 0.91 0.86 0.85 0.56 0.58 0.57 0.14 0.09
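
The nucleotide-level columns in the table can be computed from counts of true and false positives and negatives. The sketch below assumes the definitions commonly used in the gene-finding literature, in which "specificity" is the fraction of predicted coding nucleotides that are truly coding (TP / (TP + FP)) rather than the usual statistical definition. The counts in the example are invented and do not correspond to any row of the table.

```python
import math


def nucleotide_metrics(tp, fp, tn, fn):
    """Nucleotide-level accuracy measures from a confusion matrix.

    tp: coding nucleotides predicted as coding
    fp: noncoding nucleotides predicted as coding
    tn: noncoding nucleotides predicted as noncoding
    fn: coding nucleotides predicted as noncoding
    """
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)  # gene-finding convention, see note above
    # Correlation coefficient; undefined if any factor in the root is zero.
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    # Approximate correlation: average of four conditional probabilities,
    # rescaled from [0, 1] to [-1, 1].
    acp = 0.25 * (
        tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)
    )
    ac = 2.0 * (acp - 0.5)
    return {"Sn": sensitivity, "Sp": specificity, "CC": cc, "AC": ac}


# Invented counts for a 1000-nucleotide sequence with 100 coding nucleotides.
metrics = nucleotide_metrics(tp=90, fp=10, tn=890, fn=10)
```

For these counts, sensitivity and specificity are both 0.90, and the correlation coefficient and approximate correlation both come out at about 0.89. The "Exon Average" column in the table appears to be the mean of exon-level sensitivity and specificity.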

Furthermore, the table below shows GENSCAN's accuracy on genomic sequences grouped by C + G content and by type of organism. The data show that GENSCAN's accuracy was relatively insensitive to both C + G content and organism type. This further suggests that GENSCAN is robust to factors that would have affected the results of comparable gene prediction programs.[3]

GENSCAN Accuracy for Sequences Organized by C+G Content and Organism[3]
Subset Sequences Nucleotide Sensitivity Nucleotide Specificity Nucleotide Approximate Correlation Nucleotide Correlation Coefficient Exon Sensitivity Exon Specificity Exon Average Missed Exons Wrong Exons
C + G <40 86 0.90 0.95 0.90 0.93 0.78 0.87 0.84 0.14 0.05
C + G 40-50 220 0.94 0.92 0.91 0.91 0.80 0.82 0.82 0.08 0.05
C + G 50-60 208 0.93 0.93 0.90 0.92 0.75 0.77 0.77 0.08 0.05
C + G >60 56 0.97 0.89 0.90 0.90 0.76 0.77 0.76 0.07 0.08
Primates 237 0.96 0.94 0.93 0.94 0.81 0.82 0.82 0.07 0.05
Rodents 191 0.90 0.93 0.89 0.91 0.75 0.80 0.78 0.11 0.05
Non-mamm. Vert. 72 0.93 0.93 0.90 0.93 0.81 0.85 0.84 0.11 0.06

A separate test evaluated GENSCAN's accuracy on two GeneParser data sets from which all genes sharing more than 25% amino acid identity with genes in previous GeneParser test sets had been removed. The results, together with those of other programs on the same data sets, are shown in the table below. GENSCAN's accuracy varies little between the Burset/Guigó data set and the GeneParser data sets. The larger differences at some data points (e.g. a CC of 0.98 for high C + G sequences in GeneParser set II versus 0.90 for C + G >60 sequences in Burset/Guigó) may be due to the much smaller size of the GeneParser data sets. The tests on these three data sets were sufficient to support the conclusions above; however, because the data sets are not of realistic size, the reliability and scope of those conclusions are open to question.[3]

GENSCAN vs. Other Programs Prediction Accuracy Under Data Sets I and II[3]
Program GeneID I GeneID II GRAIL3 I GRAIL3 II GeneParser2 I GeneParser2 II GENSCAN I GENSCAN II
All sequences
Correlation 0.69 0.55 0.83 0.75 0.78 0.80 0.93 0.93
Sensitivity 0.69 0.50 0.83 0.68 0.87 0.82 0.98 0.95
Specificity 0.77 0.75 0.87 0.91 0.76 0.86 0.90 0.94
Exons Correct 0.42 0.33 0.52 0.31 0.47 0.46 0.79 0.76
Exons Overlapped 0.73 0.64 0.81 0.58 0.87 0.76 0.96 0.91
High C + G
Correlation 0.65 0.73 0.88 0.80 0.89 0.71 0.94 0.98
Sensitivity 0.72 0.85 0.87 0.80 0.90 0.65 1.00 0.98
Specificity 0.73 0.73 0.95 0.88 0.93 0.87 0.91 0.98
Exons Correct 0.38 0.43 0.67 0.50 0.64 0.57 0.76 0.64
Exons Overlapped 0.80 0.86 0.89 0.79 0.96 0.79 1.00 0.93
Medium C + G
Correlation 0.67 0.52 0.83 0.75 0.75 0.82 0.93 0.94
Sensitivity 0.65 0.47 0.86 0.68 0.86 0.84 0.97 0.95
Specificity 0.77 0.76 0.84 0.91 0.70 0.87 0.90 0.95
Exons Correct 0.37 0.29 0.51 0.32 0.41 0.46 0.79 0.79
Exons Overlapped 0.67 0.62 0.83 0.28 0.84 0.79 0.96 0.93
Low C + G
Correlation 0.81 0.62 0.62 0.62 0.72 0.67 0.92 0.81
Sensitivity 0.82 0.56 0.51 0.45 0.79 0.71 0.93 0.80
Specificity 0.85 0.71 0.87 0.89 0.75 0.67 0.94 0.84
Exons Correct 0.80 0.47 0.25 0.16 0.40 0.37 0.85 0.68
Exons Overlapped 0.85 0.63 0.55 0.42 0.85 0.58 0.85 0.74

In 1997, GENSCAN was found to be more accurate than previous gene prediction programs. However, it was shown to predict only 10–15% of genes correctly on realistic data sets, so further work was needed.[5] Because of such inaccuracies, predictions made by GENSCAN and other programs must be verified by comparison with a complementary DNA (cDNA) sequence, an expressed sequence tag (EST) sequence, or a known protein sequence.[6]

References

  1. The GENSCAN Web Server at MIT. http://genes.mit.edu/GENSCAN.html (archived 2013-09-06 at the Wayback Machine).
  2. Burge, C. B. (1998). "Modeling dependencies in pre-mRNA splicing signals". In Salzberg, S.; Searls, D.; Kasif, S. (eds.). Computational Methods in Molecular Biology. Amsterdam: Elsevier Science. pp. 127–163. ISBN 978-0-444-50204-9.
  3. Burge, Christopher; Karlin, Samuel (1997). "Prediction of complete gene structures in human genomic DNA". Journal of Molecular Biology. 268 (1): 78–94. CiteSeerX 10.1.1.115.3107. doi:10.1006/jmbi.1997.0951. PMID 9149143.
  4. Burge, C.; Karlin, S. (1998). "Finding the genes in genomic DNA". Current Opinion in Structural Biology. 8 (3): 346–354. doi:10.1016/S0959-440X(98)80069-9. PMID 9666331.
  5. Flicek, Paul (2007). "Gene prediction: compare and CONTRAST". Genome Biology. 8 (12): 233. doi:10.1186/gb-2007-8-12-233. ISSN 1474-760X. PMC 2246255. PMID 18096089.
  6. Rogic, S.; Ouellette, B. F. F.; Mackworth, A. K. (2002-08-01). "Improving gene recognition accuracy by combining predictions from two gene-finding programs". Bioinformatics. 18 (8): 1034–1045. doi:10.1093/bioinformatics/18.8.1034. ISSN 1367-4803. PMID 12176826.