Effective grading of termhood in biomedical literature

AMIA Annu Symp Proc. 2005:2005:809-13.

Abstract

The ever-increasing amount of textual information in biomedicine calls for effective procedures for automatic terminology extraction that help biomedical researchers and professionals gather and organize the terminological knowledge encoded in text documents. In this study, we propose a new, linguistically grounded measure for automatically identifying multi-word terms in the biomedical literature. Our approach is based on the limited paradigmatic modifiability of terms and is tested on bigram, trigram and quadgram noun phrases extracted from a 104-million-word text corpus composed of Medline abstracts. Using the UMLS Metathesaurus as a gold standard, we show that our algorithm substantially outperforms the standard term identification measures and, therefore, qualifies as a high-performing building block for any biomedical terminology mining system.
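The abstract does not spell out the measure itself, so the following is only a toy sketch of the general idea behind limited paradigmatic modifiability, not the authors' published formula. It assumes that a candidate n-gram is more term-like when its word slots can rarely be filled by other words: for every subset of slots replaced by a wildcard, it compares the frequency of the exact n-gram with the frequency of the wildcarded pattern and multiplies the ratios. The names `pattern_counts`, `modifiability_score` and `WILDCARD`, and the tiny corpus, are hypothetical and chosen for illustration only.

```python
from collections import Counter
from itertools import combinations

WILDCARD = "*"  # hypothetical placeholder for a slot that may hold any word


def wildcard(gram, slots):
    """Return a copy of gram with the given slot positions replaced by WILDCARD."""
    return tuple(WILDCARD if i in slots else tok for i, tok in enumerate(gram))


def pattern_counts(ngrams):
    """Count each n-gram together with every pattern it matches.

    Each occurrence contributes one count to the exact n-gram (no slots
    wildcarded) and one count to each pattern obtained by wildcarding any
    non-empty subset of its slots.
    """
    counts = Counter()
    for gram in ngrams:
        n = len(gram)
        for k in range(n + 1):
            for slots in combinations(range(n), k):
                counts[wildcard(gram, slots)] += 1
    return counts


def modifiability_score(gram, counts):
    """Toy score in (0, 1]: higher means the slots are harder to modify.

    Assumption: the score is the product, over all non-empty slot subsets,
    of freq(gram) / freq(pattern with those slots wildcarded).
    """
    n = len(gram)
    score = 1.0
    for k in range(1, n + 1):
        for slots in combinations(range(n), k):
            score *= counts[gram] / counts[wildcard(gram, slots)]
    return score


# Invented bigram occurrences, for illustration only.
corpus = (
    [("t", "cell")] * 4
    + [("this", "study")] * 2
    + [("this", "cell")]
    + [("this", "result")]
)
counts = pattern_counts(corpus)

# Rank the distinct candidates from most to least term-like.
for gram in sorted(set(corpus), key=lambda g: modifiability_score(g, counts), reverse=True):
    print(" ".join(gram), modifiability_score(gram, counts))
```

In this toy corpus "t cell" ranks above "this cell" because the first slot of "t cell" is never filled by another word before "cell" as often as "this" is followed by other nouns. A real evaluation along the lines described in the abstract would compare such a ranking against a gold standard such as the UMLS Metathesaurus.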

MeSH terms

  • Algorithms*
  • Databases, Bibliographic
  • Information Management / methods*
  • Information Storage and Retrieval / methods*
  • MEDLINE
  • Natural Language Processing*
  • Terminology as Topic*
  • Unified Medical Language System
  • Vocabulary, Controlled