Really, is medical sublanguage that different? Experimental counter-evidence from tagging medical and newspaper corpora

Stud Health Technol Inform. 2004;107(Pt 1):560-4.

Abstract

We compare the performance of two part-of-speech taggers, both trained on a German newspaper corpus, on mixed types of medical documents. TnT, a tagger based on a statistical language model, outperforms Brill's rule-based tagger and, when supplied with additional lexicon resources, matches state-of-the-art performance figures (close to 97% accuracy) on the medical corpus. We explain this unexpected result by the statistically significant overlap in part-of-speech types between the newspaper training set and the medical test set. At least at that level, sublanguage differences seem to vanish. Thus, statistical off-the-shelf part-of-speech taggers can immediately be reused for medical language processing.
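
The two quantities the abstract relies on, token-level tagging accuracy and the overlap in part-of-speech types between two corpora, can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' evaluation procedure: the abstract does not say how overlap or its significance was measured, so the overlap measure below (summed minimum relative frequency per tag) is a stand-in chosen for simplicity. The function names and the toy tag sequences are hypothetical and are not data from the study.

```python
from collections import Counter


def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold-standard tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def tag_distribution_overlap(tags_a, tags_b):
    """Sum over all tags of the smaller of the two relative frequencies.

    Returns 1.0 when both corpora have identical tag distributions and
    0.0 when they share no tag at all. This measure is an assumption made
    for illustration; the abstract does not specify the one actually used.
    """
    if not tags_a or not tags_b:
        raise ValueError("both tag sequences must be non-empty")
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    n_a, n_b = len(tags_a), len(tags_b)
    all_tags = freq_a.keys() | freq_b.keys()
    return sum(min(freq_a[t] / n_a, freq_b[t] / n_b) for t in all_tags)


# Toy tag sequences (invented, loosely styled after German tagset labels).
newspaper_tags = ["ART", "NN", "VVFIN", "APPR", "ART", "NN", "ADJA", "NN"]
medical_gold = ["ART", "NN", "VVFIN", "ADJA", "NN", "APPR", "NN", "NN"]
medical_predicted = ["ART", "NN", "VVFIN", "ADJA", "NN", "APPR", "NN", "ADJA"]

accuracy = tagging_accuracy(medical_gold, medical_predicted)
overlap = tag_distribution_overlap(newspaper_tags, medical_gold)
print(f"accuracy={accuracy:.3f} overlap={overlap:.3f}")
```

A high overlap value on real corpora would be consistent with the abstract's explanation that a tagger trained on newspaper text transfers well to medical text, although it would not by itself establish statistical significance.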

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Natural Language Processing*
  • Newspapers as Topic
  • Terminology as Topic*