Biological cluster evaluation for gene function prediction

J Comput Biol. 2014 Jun;21(6):428-45. doi: 10.1089/cmb.2009.0129. Epub 2010 Jan 8.

Abstract

Recent advances in high-throughput omics techniques make it possible to decode the function of genes by applying the "guilt-by-association" principle to biologically meaningful clusters of gene expression data. However, the existing frameworks for biological evaluation of gene clusters are hindered by two bottlenecks: (1) the choice of the number of clusters, and (2) external measures that do not take into consideration the structure of the analyzed data and the ontology of the existing biological knowledge. Here, we address the identified bottlenecks by developing a novel framework that allows not only for biological evaluation of gene expression clusters based on existing structured knowledge, but also for prediction of putative gene functions. The proposed framework facilitates propagation of statistical significance at each of the following steps: (1) estimating the number of clusters, (2) evaluating the clusters in terms of novel external structural measures, (3) selecting an optimal clustering algorithm, and (4) predicting gene functions. The framework also includes a method for evaluation of gene clusters based on the structure of the employed ontology. Moreover, our method for obtaining a probabilistic range for the number of clusters is shown to be valid on synthetic data and on available gene expression profiles from Saccharomyces cerevisiae. Finally, we propose a network-based approach for gene function prediction which relies on the clustering with the optimal score and the employed ontology. Our approach effectively predicts gene function on the Saccharomyces cerevisiae data set and is also employed to obtain putative gene functions for an Arabidopsis thaliana data set.
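The "guilt-by-association" principle mentioned in the abstract can be illustrated with a generic sketch. This is not the network-based method of the paper, whose details are not given here. It is a plain hypergeometric enrichment test: a term that is over-represented among the annotated genes of a cluster is transferred to the unannotated genes of the same cluster. The function names, the gene and term labels, the significance level `alpha`, and the absence of any multiple-testing correction are all assumptions made for illustration.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance that at least k of n drawn genes carry a term
    that K of the N background genes carry (hypergeometric tail)."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def predict_by_association(cluster, annotations, background_size, alpha=0.05):
    """Hypothetical helper: find terms enriched in `cluster` and assign
    them to the cluster's unannotated genes.

    cluster         -- list of gene identifiers in one expression cluster
    annotations     -- dict mapping annotated genes to a set of terms
    background_size -- total number of genes that were clustered
    """
    n = len(cluster)
    # Count how often each term occurs in the background and in the cluster.
    background_counts = {}
    for terms in annotations.values():
        for term in terms:
            background_counts[term] = background_counts.get(term, 0) + 1
    cluster_counts = {}
    for gene in cluster:
        for term in annotations.get(gene, ()):
            cluster_counts[term] = cluster_counts.get(term, 0) + 1

    # Keep terms whose enrichment p-value falls below alpha (uncorrected).
    enriched = {}
    for term, k in cluster_counts.items():
        p = hypergeom_sf(k, background_size, background_counts[term], n)
        if p < alpha:
            enriched[term] = p

    # Transfer enriched terms to genes in the cluster with no annotation.
    predictions = {
        gene: set(enriched)
        for gene in cluster
        if gene not in annotations and enriched
    }
    return enriched, predictions


# Toy data: 20 background genes, 4 of them annotated with "ribosome".
toy_annotations = {g: {"ribosome"} for g in ("g1", "g2", "g3", "g4")}
toy_cluster = ["g1", "g2", "g3", "gX"]  # gX has no known function
enriched, predictions = predict_by_association(toy_cluster, toy_annotations, 20)
print(enriched)
print(predictions)
```

In the toy data, three of the four "ribosome" genes fall into a cluster of four genes drawn from a background of twenty, so the term passes the threshold and is proposed for the unannotated gene `gX`. A real analysis would correct for testing many terms and would account for the ontology structure, which is the gap the abstract says the framework addresses.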

Keywords: NP-completeness; algorithms; biochemical networks; combinatorics; computational molecular biology; databases; functional genomics; gene expression.

MeSH terms

  • Arabidopsis / genetics*
  • Datasets as Topic
  • Gene Expression Regulation, Fungal / physiology*
  • Gene Expression Regulation, Plant / physiology*
  • Genes, Fungal / physiology*
  • Genes, Plant / physiology*
  • Saccharomyces cerevisiae / genetics*