Identifying informative subsets of the Gene Ontology with information bottleneck methods

Bioinformatics. 2010 Oct 1;26(19):2445-51. doi: 10.1093/bioinformatics/btq449. Epub 2010 Aug 11.

Abstract

Motivation: The Gene Ontology (GO) is a controlled vocabulary designed to represent the biological concepts pertaining to gene products. This study investigates methods for identifying informative subsets of GO terms in an automatic and objective fashion. This task in turn requires addressing three issues: how to represent the semantic context of GO terms, what metrics are suitable for measuring the semantic differences between terms, and how to identify an informative subset that retains as much as possible of the original semantic information of GO.

Results: We represented the semantic context of a GO term using the word-usage profile associated with the term, which makes it possible to measure the semantic difference between two terms as the difference between their semantic contexts. We further employed information bottleneck methods to automatically identify subsets of GO terms that retain as much as possible of the semantic information in an annotation database. The automatically retrieved informative subsets align well with an expert-picked GO slim subset, cover important concepts and proteins, and enhance literature-based GO annotation.
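
The idea of comparing terms through their word-usage profiles can be illustrated with a minimal sketch. The abstract does not specify the exact profile construction, the divergence measure, or the information bottleneck variant, so everything below is an assumption made for illustration: a profile is taken to be the relative frequency of words in the texts annotated with a term, the semantic difference is taken to be the Jensen-Shannon divergence, and the merge cost follows the standard agglomerative information bottleneck formulation. The function names (word_profile, js_divergence, merge_cost) are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def word_profile(texts):
    """Estimate p(word | term) from the texts annotated with one term.

    Assumption: whitespace tokenization and lowercasing stand in for
    whatever text processing the original study used.
    """
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    q must be non-zero wherever p is non-zero.
    """
    return sum(pv * math.log2(pv / q[w]) for w, pv in p.items() if pv > 0)


def js_divergence(p, q, wp=0.5, wq=0.5):
    """Weighted Jensen-Shannon divergence between two word profiles."""
    vocab = set(p) | set(q)
    mix = {w: wp * p.get(w, 0.0) + wq * q.get(w, 0.0) for w in vocab}
    return wp * kl(p, mix) + wq * kl(q, mix)


def merge_cost(p, q, prior_p, prior_q):
    """Information lost when two terms are merged into one cluster.

    This is the agglomerative information bottleneck merge cost:
    (p(t1) + p(t2)) * JS divergence weighted by the normalized priors.
    """
    total = prior_p + prior_q
    return total * js_divergence(p, q, prior_p / total, prior_q / total)


# Toy example with made-up annotation texts (not data from the paper).
kinase = word_profile(["protein kinase activity", "kinase"])
binding = word_profile(["dna binding", "binding site"])
print(js_divergence(kinase, kinase))            # identical profiles
print(js_divergence(kinase, binding))           # disjoint vocabularies
print(merge_cost(kinase, binding, 0.2, 0.2))    # cost of merging the two
```

Under this formulation, terms with near-identical word usage are cheap to merge, so a greedy procedure that repeatedly merges the cheapest pair tends to keep the terms whose contexts are most distinct. Whether the study used this greedy variant or another information bottleneck algorithm is not stated in the abstract.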

Availability: http://carcweb.musc.edu/TextminingProjects/.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Computational Biology / methods*
  • Databases, Genetic
  • Databases, Protein
  • Information Storage and Retrieval
  • Proteins / classification
  • Proteins / genetics
  • Semantics
  • Terminology as Topic
  • Vocabulary, Controlled*

Substances

  • Proteins