Semantic similarity analysis of protein data: assessment with biological features and issues

Brief Bioinform. 2012 Sep;13(5):569-85. doi: 10.1093/bib/bbr066. Epub 2011 Dec 2.

Abstract

The integration of proteomics data with biological knowledge is a recent trend in bioinformatics. A large amount of biological information is available, spread across different sources and encoded in different ontologies (e.g. Gene Ontology). Annotating existing protein data with biological information may enable the use (and the development) of algorithms that use biological ontologies as a framework to mine annotated data. Recently, many methodologies and algorithms that use ontologies to extract knowledge from data, as well as to analyse the ontologies themselves, have been proposed and applied to other fields. Conversely, the use of such annotations for the analysis of protein data is a relatively novel research area that is becoming increasingly central. Existing approaches range from the definition of similarity among genes and proteins on the basis of their annotating terms, to the definition of novel algorithms that use such similarities for mining protein data on a proteome-wide scale. This work, after defining the main concepts of such analysis, presents a systematic discussion and comparison of the main approaches. Finally, remaining challenges, as well as possible future directions of research, are presented.
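
As an illustration of what similarity "on the basis of the annotating terms" can mean, the following minimal sketch scores two proteins by the overlap of their annotation sets, after extending each set with the ancestors of its terms in the ontology. It is an assumption made for illustration only, not a measure taken from this work: the toy ontology, the term names, the protein names and the function names are all hypothetical, and a plain Jaccard index stands in for the more elaborate measures discussed in the literature.

```python
from typing import Dict, Set

# Hypothetical toy ontology: each term maps to its direct parents.
# Term names are invented and are not real Gene Ontology identifiers.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "ion_binding": {"binding"},
    "kinase": {"catalysis"},
}

# Hypothetical annotations: protein name -> set of annotating terms.
ANNOTATIONS: Dict[str, Set[str]] = {
    "protA": {"ion_binding", "kinase"},
    "protB": {"ion_binding"},
    "protC": {"kinase"},
}


def with_ancestors(terms: Set[str]) -> Set[str]:
    """Return the given terms together with all of their ancestors."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def annotation_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two ancestor-extended annotation sets."""
    extended_a, extended_b = with_ancestors(a), with_ancestors(b)
    union = extended_a | extended_b
    if not union:
        return 0.0
    return len(extended_a & extended_b) / len(union)


if __name__ == "__main__":
    for first, second in [("protA", "protB"), ("protB", "protC")]:
        score = annotation_similarity(ANNOTATIONS[first], ANNOTATIONS[second])
        print(f"{first} vs {second}: {score:.2f}")
```

In this toy setting, proteins that share a specific term score higher than proteins that share only the root term. Extending the sets with ancestors is one simple way to let the ontology structure, and not just exact term matches, contribute to the score.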

MeSH terms

  • Algorithms
  • Data Mining
  • Databases, Protein
  • Molecular Sequence Annotation
  • Natural Language Processing
  • Proteins / chemistry*
  • Proteome / chemistry*
  • Proteomics
  • Semantics*
  • Vocabulary, Controlled

Substances

  • Proteins
  • Proteome