GAPSCORE: finding gene and protein names one word at a time

Jeffrey T Chang; Hinrich Schütze; Russ B Altman

doi:10.1093/bioinformatics/btg393

Bioinformatics. 2004 Jan 22;20(2):216-25. doi: 10.1093/bioinformatics/btg393.

Affiliation

Jeffrey T Chang: Department of Genetics, Stanford Medical Center, 300 Pasteur Drive, Lane L 301, Mail Code 5120, Stanford, CA 94305-5120, USA.

PMID: 14734313

Abstract

Motivation: New high-throughput technologies have accelerated the accumulation of knowledge about genes and proteins. However, much knowledge is still stored as written natural language text. Therefore, we have developed a new method, GAPSCORE, to identify gene and protein names in text. GAPSCORE scores words based on a statistical model of gene names that quantifies their appearance, morphology and context.

Results: We evaluated GAPSCORE against the Yapex data set and achieved an F-score of 82.5% (83.3% recall, 81.5% precision) for partial matches and 57.6% (58.5% recall, 56.7% precision) for exact matches. Since the method is statistical, users can choose score cutoffs that adjust the performance according to their needs.
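The relationship between the reported precision, recall and F-score, and the effect of moving a score cutoff, can be illustrated with a minimal sketch. It assumes the F-score is the balanced harmonic mean of precision and recall (the abstract does not state the weighting), and the word scores in `toy` are invented for illustration only; `f_score` and `evaluate_at_cutoff` are hypothetical helper names, not part of GAPSCORE.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall (assumed weighting)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate_at_cutoff(scored_words, cutoff):
    """Return (precision, recall, F) when words scoring >= cutoff are called gene names.

    scored_words: list of (score, is_gene_name) pairs; hypothetical data layout.
    """
    predicted = [is_gene for score, is_gene in scored_words if score >= cutoff]
    true_positives = sum(predicted)
    total_positives = sum(is_gene for _, is_gene in scored_words)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall, f_score(precision, recall)


# Exact-match figures from the abstract: 56.7% precision, 58.5% recall.
print(f"exact-match F-score: {f_score(0.567, 0.585):.3f}")

# Invented (score, is_gene_name) pairs, used only to show the trade-off.
toy = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.10, False)]
for cutoff in (0.3, 0.5, 0.9):
    p, r, f = evaluate_at_cutoff(toy, cutoff)
    print(f"cutoff={cutoff:.1f} precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

On the invented data, raising the cutoff trades recall for precision, which is the kind of adjustment the abstract describes; the actual score distribution of GAPSCORE is not given here.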

Availability: GAPSCORE is available at http://bionlp.stanford.edu/gapscore/

Publication types

Comparative Study
Evaluation Study
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, U.S. Gov't, P.H.S.
Validation Study

MeSH terms

Abstracting and Indexing
Algorithms*
Artificial Intelligence
Database Management Systems
Dictionaries as Topic
Genes*
Information Storage and Retrieval / methods*
Natural Language Processing*
Pattern Recognition, Automated*
Periodicals as Topic*
Proteins*
Reproducibility of Results
Sensitivity and Specificity
Software
Terminology as Topic*

Substances

Proteins
