Really, is medical sublanguage that different? Experimental counter-evidence from tagging medical and newspaper corpora

Joachim Wermter; Udo Hahn

Stud Health Technol Inform. 2004;107(Pt 1):560-4.

Authors

Joachim Wermter¹, Udo Hahn

Affiliation

¹ Department of Medical Informatics, University Hospital Freiburg, Freiburg, Germany.

PMID: 15360875

Abstract

We compare the performance of two part-of-speech taggers that were trained on a German newspaper corpus and applied to mixed types of medical documents. TnT, a tagger based on a statistical language model, outperforms Brill's rule-based tagger and, when supplied with additional lexicon resources, matches state-of-the-art performance figures (close to 97% accuracy) on the medical corpus. We explain this unexpected result by focusing on the statistically significant part-of-speech type overlap between the newspaper training set and the medical test set. At least at that level, sublanguage differences seem to vanish. Thus, statistical off-the-shelf part-of-speech taggers can immediately be reused for medical language processing.
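The two quantities the abstract relies on, tagging accuracy and part-of-speech overlap between corpora, can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: the function names `tagging_accuracy` and `tag_distribution_overlap` are hypothetical, the tag sequences are toy data, and the overlap measure (the summed minimum of relative tag frequencies) is only one simple stand-in, not the statistical test the authors used.

```python
from collections import Counter


def tagging_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def tag_distribution_overlap(tags_a, tags_b):
    """Return the overlap of two tag distributions, between 0.0 and 1.0.

    Assumed measure: the sum over all tags of the smaller of the two
    relative frequencies. Identical distributions give 1.0, disjoint
    tag inventories give 0.0.
    """
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    all_tags = freq_a.keys() | freq_b.keys()
    return sum(min(freq_a[t] / total_a, freq_b[t] / total_b) for t in all_tags)


# Toy tag sequences (illustrative only, not data from the study).
gold_tags = ["ART", "NN", "VVFIN", "ART", "NN"]
predicted_tags = ["ART", "NN", "VVFIN", "ART", "ADJA"]
newspaper_tags = ["ART", "NN", "VVFIN", "NN"]
medical_tags = ["ART", "NN", "NN", "ADJA"]

accuracy = tagging_accuracy(gold_tags, predicted_tags)
overlap = tag_distribution_overlap(newspaper_tags, medical_tags)
print(f"accuracy: {accuracy:.2f}, overlap: {overlap:.2f}")
```

A high overlap value on real corpora would be consistent with the abstract's explanation that a tagger trained on newspaper text transfers well to medical text, although it would not by itself establish statistical significance.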

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Natural Language Processing*
Newspapers as Topic
Terminology as Topic*