Mapping annotations with textual evidence using an scLDA model

AMIA Annu Symp Proc. 2011;2011:834-42. Epub 2011 Oct 22.

Abstract

Most of the knowledge regarding genes and proteins is stored in biomedical literature as free text. Extracting information from complex biomedical texts demands techniques capable of inferring biological concepts from local text regions and mapping them to controlled vocabularies. To this end, we present a sentence-based correspondence latent Dirichlet allocation (scLDA) model which, when trained with a corpus of PubMed documents with known GO annotations, performs the following tasks: 1) learning major biological concepts from the corpus, 2) inferring the biological concepts existing within text regions (sentences), and 3) identifying the text regions in a document that provide evidence for the observed annotations. When applied to new gene-related documents, a trained scLDA model is capable of predicting GO annotations and identifying text regions as textual evidence supporting the predicted annotations. This study uses GO annotation data as a testbed; the approach can be generalized to other annotated data, such as MeSH and MEDLINE documents.
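
Task 3 above (picking the sentences that support an annotation) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a trained model has already produced a topic mixture for each sentence and a per-topic distribution over GO terms, and it scores each sentence by the probability that it would generate the annotation, as in a generic correspondence-LDA setup. The function names `evidence_posterior` and `best_evidence_sentence`, and all numbers, are hypothetical and chosen only for illustration.

```python
def evidence_posterior(sentence_topics, topic_term_probs, go_term):
    """Posterior over sentences as the source of one GO annotation.

    sentence_topics: list of per-sentence topic mixtures (each sums to 1).
    topic_term_probs: list of dicts, one per topic, mapping GO term -> probability.
    Assumes a uniform prior over sentences (an assumption of this sketch).
    """
    scores = []
    for theta in sentence_topics:
        # p(term | sentence) = sum_k theta[k] * p(term | topic k)
        score = sum(
            theta[k] * topic_term_probs[k].get(go_term, 0.0)
            for k in range(len(theta))
        )
        scores.append(score)

    total = sum(scores)
    if total == 0.0:
        # No topic supports this term: fall back to a uniform posterior.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def best_evidence_sentence(sentence_topics, topic_term_probs, go_term):
    """Index of the sentence most likely to be evidence for go_term."""
    posterior = evidence_posterior(sentence_topics, topic_term_probs, go_term)
    return max(range(len(posterior)), key=posterior.__getitem__)


# Toy inputs (made up): two sentences, two topics, two GO terms.
toy_sentence_topics = [[0.9, 0.1], [0.2, 0.8]]
toy_topic_term_probs = [
    {"GO:0006915": 0.7, "GO:0006281": 0.3},  # hypothetical topic 0
    {"GO:0006915": 0.1, "GO:0006281": 0.9},  # hypothetical topic 1
]

if __name__ == "__main__":
    for term in ("GO:0006915", "GO:0006281"):
        idx = best_evidence_sentence(toy_sentence_topics, toy_topic_term_probs, term)
        print(term, "-> sentence", idx)
```

With these toy inputs, the first sentence is selected for the first term and the second sentence for the second term, because each sentence is dominated by the topic that favours that term.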

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Artificial Intelligence*
  • Computational Biology / methods*
  • Databases, Genetic
  • Genetics*
  • Natural Language Processing
  • PubMed
  • Terminology as Topic*
  • Vocabulary, Controlled*