Improving precision in concept normalization

Pac Symp Biocomput. 2018:23:566-577.

Abstract

Most natural language processing applications exhibit a trade-off between precision and recall. In some use cases, there are reasons to tilt that trade-off toward high precision. Relying on the Zipfian distribution of false positive results, we describe a strategy for increasing precision that uses a variety of pre-processing and post-processing methods, drawing on both knowledge-based and frequentist approaches to modeling language. Building on an existing high-performance biomedical concept recognition pipeline and a previously published manually annotated corpus, we apply this hybrid rationalist/empiricist strategy to concept normalization for eight different ontologies. The approaches that did and did not improve precision varied widely between the ontologies.
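The intuition behind relying on a Zipfian distribution of false positives is that a small number of surface strings account for a large share of the errors, so suppressing only the most frequent offenders can raise precision noticeably. The following is a minimal sketch of that idea as a frequency-based post-processing filter; it is not the pipeline used in the paper. The tuple layout, the function names, the toy data, and the `ONT:` identifiers are all assumptions made for illustration.

```python
from collections import Counter

# Each prediction is (document id, matched surface string, concept id).
# This layout and all identifiers below are hypothetical, for illustration only.

def precision(predictions, gold):
    """Fraction of predictions that also appear in the gold annotations."""
    if not predictions:
        return 0.0
    true_positives = sum(1 for p in predictions if p in gold)
    return true_positives / len(predictions)

def top_false_positive_strings(predictions, gold, k):
    """Return the k surface strings that produce the most false positives."""
    counts = Counter(p[1] for p in predictions if p not in gold)
    return {string for string, _ in counts.most_common(k)}

def filter_predictions(predictions, blocked):
    """Drop every prediction whose surface string is on the block list."""
    return [p for p in predictions if p[1] not in blocked]

# Toy data: one ambiguous string ("lead") causes most of the false positives.
gold = {
    ("d1", "glucose", "ONT:0001"),
    ("d2", "ethanol", "ONT:0002"),
}
predictions = [
    ("d1", "lead", "ONT:0003"),
    ("d2", "lead", "ONT:0003"),
    ("d3", "lead", "ONT:0003"),
    ("d1", "glucose", "ONT:0001"),
    ("d2", "ethanol", "ONT:0002"),
    ("d3", "water", "ONT:0004"),
]

before = precision(predictions, gold)
blocked = top_false_positive_strings(predictions, gold, k=1)
after = precision(filter_predictions(predictions, blocked), gold)

print(f"precision before: {before:.3f}")
print(f"blocked strings:  {sorted(blocked)}")
print(f"precision after:  {after:.3f}")
```

Two caveats apply to this sketch. Blocking a string also discards any true positives that share it, so recall can drop, which is the trade-off the abstract describes. In practice the block list would be derived from development data rather than from the data used for evaluation.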

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Biological Ontologies / statistics & numerical data
  • Computational Biology / methods
  • Data Mining / methods
  • Electronic Health Records / statistics & numerical data
  • False Positive Reactions
  • Humans
  • Natural Language Processing*
  • Precision Medicine / statistics & numerical data
  • PubMed / statistics & numerical data
  • Reproducibility of Results