A classification-based framework for predicting and analyzing gene regulatory response

Anshul Kundaje; Manuel Middendorf; Mihir Shah; Chris H Wiggins; Yoav Freund; Christina Leslie

doi:10.1186/1471-2105-7-S1-S5

BMC Bioinformatics. 2006 Mar 20;7 Suppl 1(Suppl 1):S5. doi: 10.1186/1471-2105-7-S1-S5.

Authors

Anshul Kundaje¹, Manuel Middendorf, Mihir Shah, Chris H Wiggins, Yoav Freund, Christina Leslie

Affiliation

¹ Department of Computer Science, Columbia University, New York, NY 10027, USA.

Abstract

Background: We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on two kinds of input: the presence of binding site subsequences ("motifs") in the gene's regulatory region, and the expression levels of regulators such as transcription factors ("parents") in the experiment. GeneClass formulates the learning task as a classification problem: predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the AdaBoost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree.
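The labeling and feature construction described above can be sketched as follows. This is a minimal illustration, not the GeneClass implementation: the function names, the symmetric `threshold` value, and the convention of encoding parent states as -1/0/+1 are all assumptions made for the example, and the paper's actual noise model is not reproduced here.

```python
import numpy as np

def discretize(log_ratios, threshold=1.0):
    """Map log expression ratios to +1 (up), -1 (down) or 0 (within noise).

    `threshold` is a hypothetical symmetric noise cutoff chosen for
    illustration; it stands in for a real noise model.
    """
    log_ratios = np.asarray(log_ratios, dtype=float)
    labels = np.zeros(log_ratios.shape, dtype=int)
    labels[log_ratios > threshold] = 1
    labels[log_ratios < -threshold] = -1
    return labels

def motif_parent_feature(motif_present, parent_state, wanted_state):
    """Boolean (gene x experiment) feature for one motif-parent pair.

    The entry for (gene g, experiment e) is True when the motif occurs in
    the regulatory region of g AND the parent is in `wanted_state`
    (+1 or -1) in experiment e.
    """
    motif_present = np.asarray(motif_present, dtype=bool)   # one entry per gene
    parent_state = np.asarray(parent_state)                 # one entry per experiment
    return np.outer(motif_present, parent_state == wanted_state)

labels = discretize([2.0, -1.5, 0.3])
feature = motif_parent_feature([1, 0], [1, -1, 0], wanted_state=1)
```

Under these assumptions, examples labeled 0 would be treated as indistinguishable from noise and left out of training, so the classifier only sees (gene, experiment) pairs labeled +1 or -1.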

Methods: In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a shallower and more interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data.
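To make the idea of computing the boosting loss for all features with matrix operations concrete, here is a hedged sketch. It uses the standard confidence-rated boosting loss for a rule that predicts only where its condition holds and abstains elsewhere, Z = W0 + 2·sqrt(W+ · W-), where W+ and W- are the total weights of positive and negative examples covered by the rule and W0 is the weight of the examples it abstains on. Whether Robust GeneClass uses exactly this form is an assumption, and `near_best` with its `tol` parameter is a hypothetical stand-in for the paper's criterion for grouping correlated features.

```python
import numpy as np

def abstaining_loss(F, y, w):
    """Boosting loss Z for every candidate feature at once.

    F : (n_examples, n_features) 0/1 matrix; F[i, j] = 1 when rule j
        fires on example i (the rule abstains where F[i, j] = 0).
    y : (n_examples,) labels in {+1, -1}.
    w : (n_examples,) non-negative example weights.
    """
    F = np.asarray(F, dtype=float)
    y = np.asarray(y)
    w = np.asarray(w, dtype=float)
    w_plus = F.T @ (w * (y == 1))     # covered weight with label +1
    w_minus = F.T @ (w * (y == -1))   # covered weight with label -1
    w_zero = w.sum() - w_plus - w_minus   # weight the rule abstains on
    return w_zero + 2.0 * np.sqrt(w_plus * w_minus)

def near_best(Z, tol=0.05):
    """Indices of features whose loss is within `tol` of the minimum.

    Hypothetical grouping rule, used here only to illustrate keeping a
    set of features at a node instead of a single best one.
    """
    Z = np.asarray(Z, dtype=float)
    return np.flatnonzero(Z <= Z.min() + tol)

F = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
w = [0.25, 0.25, 0.25, 0.25]
Z = abstaining_loss(F, y, w)
best_set = near_best(Z)
```

Because the covered weights for every feature come from two matrix-vector products, the cost per boosting round is dominated by a pass over the feature matrix rather than by a Python-level loop over features.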

Results: Using the improved scalability of Robust GeneClass, we present larger-scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast (the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors) and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code are available for download from http://www.cs.columbia.edu/compbio/robust-geneclass.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Amino Acid Motifs
Binding Sites
Computational Biology / methods*
Data Interpretation, Statistical
Databases, Protein
Fungal Proteins / chemistry
Gene Expression Profiling / methods*
Gene Expression Regulation*
Heat-Shock Proteins / metabolism
Molecular Chaperones / chemistry
Oligonucleotide Array Sequence Analysis / methods
Saccharomyces cerevisiae / metabolism

Substances

Fungal Proteins
Heat-Shock Proteins
Molecular Chaperones
