Building gold standard corpora for medical natural language processing tasks

Louise Deleger; Qi Li; Todd Lingren; Megan Kaiser; Katalin Molnar; Laura Stoutenborough; Michal Kouril; Keith Marsolo; Imre Solti

AMIA Annu Symp Proc. 2012:2012:144-53. Epub 2012 Nov 3.

Authors

Louise Deleger¹, Qi Li, Todd Lingren, Megan Kaiser, Katalin Molnar, Laura Stoutenborough, Michal Kouril, Keith Marsolo, Imre Solti

Affiliation

¹ Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA.

PMID: 23304283
PMCID: PMC3540456

Abstract

We present the construction of three annotated corpora to serve as gold standards for medical natural language processing (NLP) tasks. Clinical notes from the medical record, clinical trial announcements, and FDA drug labels are annotated. We report high inter-annotator agreement (overall F-measures between 0.8467 and 0.9176) for the annotation of Personal Health Information (PHI) elements for a de-identification task and of medications, diseases/disorders, and signs/symptoms for an information extraction (IE) task. The annotated corpora of clinical trials and FDA labels will be publicly released. To facilitate translational NLP tasks that require cross-corpora interoperability (e.g., clinical trial eligibility screening), their annotation schemas are aligned with a large-scale, NIH-funded clinical text annotation project.
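
As an illustration of the agreement figures reported above, the following is a minimal sketch of a pairwise inter-annotator F-measure. It assumes that annotations are represented as (start, end, label) tuples and that two annotations agree only when span and label match exactly. The abstract does not state the matching criterion, so both the representation and the function name pairwise_f_measure are hypothetical and may differ from the authors' procedure.

```python
from typing import Set, Tuple

# Assumed representation: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def pairwise_f_measure(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """F-measure between two annotators, treating A as the reference.

    Assumption: an annotation counts as agreed only on an exact
    span-and-label match. The score is symmetric in A and B.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # nothing to annotate, trivially in agreement
    matches = len(annotator_a & annotator_b)
    if matches == 0:
        return 0.0
    precision = matches / len(annotator_b)
    recall = matches / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up offsets, not data from the corpora.
a = {(0, 9, "MEDICATION"), (15, 23, "DISEASE"), (30, 38, "SYMPTOM")}
b = {(0, 9, "MEDICATION"), (15, 23, "DISEASE"), (40, 45, "SYMPTOM")}
score = pairwise_f_measure(a, b)
print(f"{score:.4f}")
```

Because swapping the two annotators only exchanges precision and recall, the F-measure is the same in either direction, which is one reason it is commonly used as an agreement measure for span annotation.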

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Clinical Trials as Topic
Drug Labeling
Medical Records
Natural Language Processing*
Software
United States
United States Food and Drug Administration
