Empirical data on corpus design and usage in biomedical natural language processing

K Bretonnel Cohen; Lynne Fox; Philip V Ogren; Lawrence Hunter

AMIA Annu Symp Proc. 2005;2005:156-60.

Authors

K Bretonnel Cohen¹, Lynne Fox, Philip V Ogren, Lawrence Hunter

Affiliation

¹ Center for Computational Pharmacology, U. of Colorado School of Medicine, USA.

PMID: 16779021
PMCID: PMC1560643

Abstract

This paper describes the design of six publicly available biomedical corpora. We then present usage data for the six corpora. We show that corpora that are carefully annotated with respect to structural and linguistic characteristics and that are distributed in standard formats are more widely used than corpora that are not. These findings have implications for the design of the next generation of biomedical corpora.

MeSH terms

Computational Biology
Databases as Topic*
Information Storage and Retrieval*
Knowledge Bases*
Linguistics
Natural Language Processing*

Grants and funding

R01 LM008111/LM/NLM NIH HHS/United States