Identifying informative subsets of the Gene Ontology with information bottleneck methods

Bo Jin; Xinghua Lu

doi:10.1093/bioinformatics/btq449

Bioinformatics. 2010 Oct 1;26(19):2445-51. doi: 10.1093/bioinformatics/btq449. Epub 2010 Aug 11.

Authors

Bo Jin¹, Xinghua Lu

Affiliation

¹ Department of Biochemistry and Molecular Biology, Medical University of South Carolina, 174 Ashley Ave, Charleston, SC 29425, USA.

Abstract

Motivation: The Gene Ontology (GO) is a controlled vocabulary designed to represent the biological concepts pertaining to gene products. This study investigates methods for identifying informative subsets of GO terms in an automatic and objective fashion. This task in turn requires addressing three issues: how to represent the semantic context of GO terms, which metrics are suitable for measuring the semantic differences between terms, and how to identify an informative subset that retains as much as possible of the original semantic information of GO.

Results: We represented the semantic context of a GO term using the word-usage profile associated with the term, which makes it possible to measure the semantic difference between two terms as the difference between their semantic contexts. We further employed information bottleneck methods to automatically identify subsets of GO terms that retain as much as possible of the semantic information in an annotation database. The automatically retrieved informative subsets align well with an expert-picked GO slim subset, cover important concepts and proteins, and enhance literature-based GO annotation.

Availability: http://carcweb.musc.edu/TextminingProjects/.
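
The word-usage-profile idea in the Results can be illustrated with a minimal Python sketch. The abstract does not say which divergence or which information bottleneck variant was used, so everything here is an assumption: the profile is a plain normalized word count, the semantic difference is the Jensen-Shannon divergence, and the merge cost is the one used in agglomerative information bottleneck clustering. The function names, term labels, and toy texts are hypothetical.

```python
import math
from collections import Counter


def word_profile(texts):
    """Normalized word-frequency distribution over the texts annotated with one term."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(p, q):
    """KL(p || q) in bits; q must be positive wherever p is positive."""
    return sum(pv * math.log2(pv / q[w]) for w, pv in p.items() if pv > 0)


def js_divergence(p, q, weight_p=0.5, weight_q=0.5):
    """Weighted Jensen-Shannon divergence between two word distributions."""
    vocabulary = set(p) | set(q)
    mixture = {
        w: weight_p * p.get(w, 0.0) + weight_q * q.get(w, 0.0) for w in vocabulary
    }
    return weight_p * kl_divergence(p, mixture) + weight_q * kl_divergence(q, mixture)


def merge_cost(prior_a, profile_a, prior_b, profile_b):
    """Information lost by merging two terms (agglomerative IB merge cost, assumed here)."""
    total = prior_a + prior_b
    return total * js_divergence(
        profile_a, profile_b, prior_a / total, prior_b / total
    )


# Hypothetical toy annotations: term label -> texts associated with that term.
profiles = {
    "term_a": word_profile(["kinase phosphorylates substrate", "kinase activity"]),
    "term_b": word_profile(["kinase binds substrate"]),
    "term_c": word_profile(["membrane transport channel"]),
}

near = js_divergence(profiles["term_a"], profiles["term_b"])
far = js_divergence(profiles["term_a"], profiles["term_c"])
cost_ab = merge_cost(0.5, profiles["term_a"], 0.25, profiles["term_b"])
```

In this toy setting, terms with overlapping vocabularies (`term_a`, `term_b`) are closer than terms with disjoint vocabularies (`term_a`, `term_c`). A greedy procedure that repeatedly merges the pair with the lowest `merge_cost` is one way to shrink a term set while losing little word-level information.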

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Computational Biology / methods*
Databases, Genetic
Databases, Protein
Information Storage and Retrieval
Proteins / classification
Proteins / genetics
Semantics
Terminology as Topic
Vocabulary, Controlled*

Substances

Proteins
