A novel method for predicting activity of cis-regulatory modules, based on a diverse training set

Wei Yang; Saurabh Sinha

doi:10.1093/bioinformatics/btw552

Bioinformatics. 2017 Jan 1;33(1):1-7. doi: 10.1093/bioinformatics/btw552. Epub 2016 Sep 7.

Authors

Wei Yang¹, Saurabh Sinha¹

Affiliation

¹ Department of Computer Science, University of Illinois, Urbana-Champaign, Urbana, IL, USA.

Abstract

Motivation: With the rapid emergence of technologies for locating cis-regulatory modules (CRMs) genome-wide, the next pressing challenge is to assign precise functions to each CRM, i.e. to determine the spatiotemporal domains or cell-types where it drives expression. A popular approach to this task is to model the typical k-mer composition of a set of CRMs known to drive a common expression pattern, and assign that pattern to other CRMs exhibiting a similar k-mer composition. This approach does not rely on prior knowledge of transcription factors relevant to the CRM or their binding motifs, and is thus more widely applicable than motif-based methods for predicting CRM activity, but is also prone to false positive predictions.
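
The k-mer composition approach described above can be illustrated with a minimal sketch. It is not taken from the paper: the function names (`kmer_profile`, `mean_profile`, `cosine`), the choice of k = 3, the cosine similarity measure, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def kmer_profile(seq, k=3):
    """Normalized k-mer frequencies of one DNA sequence (hypothetical helper)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def mean_profile(seqs, k=3):
    """Average k-mer profile of a non-empty set of CRMs sharing one expression pattern."""
    summed = Counter()
    for s in seqs:
        for kmer, freq in kmer_profile(s, k).items():
            summed[kmer] += freq
    return {kmer: freq / len(seqs) for kmer, freq in summed.items()}

def cosine(p, q):
    """Cosine similarity between two sparse k-mer profiles."""
    dot = sum(v * q.get(kmer, 0.0) for kmer, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy training set standing in for CRMs with a common expression pattern.
train = ["ACGTACGTACGT", "ACGTACGAACGT"]
model = mean_profile(train, k=3)

# A candidate is assigned the pattern if its composition resembles the model.
sim_close = cosine(kmer_profile("ACGTACGTAC", 3), model)
sim_far = cosine(kmer_profile("GGGGCCCCGGGG", 3), model)
print(sim_close, sim_far)
```

A single similarity score of this kind cannot tell a CRM that resembles the training set specifically from one that resembles CRMs in general, which is one way to read the false positive problem noted above.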

Results: We present a novel strategy to improve the above-mentioned approach: to predict if a CRM drives a specific gene expression pattern, assess how similar the CRM is not only to other CRMs with similar activity but also to CRMs with distinct activities. We use a state-of-the-art statistical method to quantify a CRM's sequence similarity to many different training sets of CRMs, and employ a classification algorithm to integrate these similarity scores into a single prediction of the CRM's activity. This strategy is shown to significantly improve CRM activity prediction over current approaches.
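
The two-stage idea in the Results (score a CRM against many training sets, then let a classifier combine the scores) can be sketched as follows. This is not the IMMBoost code. The abstract does not specify the statistical method or the classifier, so a smoothed dinucleotide model and a small logistic regression are swapped in as stand-ins, and every name and toy sequence below is hypothetical.

```python
from collections import Counter
from math import exp, log

def kmer_counts(seq, k):
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train_model(seqs, k=2, pseudo=1.0):
    """Smoothed k-mer log-probabilities for one training set (stand-in scorer)."""
    counts = Counter()
    for s in seqs:
        counts.update(kmer_counts(s, k))
    total = sum(counts.values()) + pseudo * 4 ** k
    logp = {m: log((c + pseudo) / total) for m, c in counts.items()}
    return {"k": k, "logp": logp, "default": log(pseudo / total)}

def score(seq, model):
    """Mean log-probability per k-mer of seq under one training-set model."""
    counts = kmer_counts(seq, model["k"])
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return sum(c * model["logp"].get(m, model["default"]) for m, c in counts.items()) / n

def features(seq, models):
    """One similarity score per training set: the classifier's input vector."""
    return [score(seq, m) for m in models]

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=200):
    """Plain stochastic gradient descent on the logistic loss (stand-in classifier)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy training sets for two distinct activities (assumed data, not from the paper).
training_sets = {
    "pattern_A": ["ATATATATTA", "TATAATATAT"],
    "pattern_B": ["GCGCGGCCGC", "CGCGCCGGCG"],
}
models = [train_model(seqs) for seqs in training_sets.values()]

# Labelled CRMs for the target activity "pattern_A" (1 = drives it, 0 = does not).
labelled = [("ATTATATAAT", 1), ("TATATTATAA", 1), ("GCCGCGGCGC", 0), ("CGGCGCCGCG", 0)]
X = [features(s, models) for s, _ in labelled]
y = [label for _, label in labelled]
w, b = fit_logistic(X, y)

p_a = predict_proba(features("TATATATATA", models), w, b)
p_b = predict_proba(features("GCGCGCGCGC", models), w, b)
print(p_a, p_b)
```

The point of the design is that the classifier sees the score against the dissimilar training set as well, so a high score for the target pattern can be discounted when the same sequence also scores highly against other patterns.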

Availability and implementation: Our implementation of the new method, called IMMBoost, is freely available as source code at https://github.com/weiyangedward/IMMBoost

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Algorithms*
Animals
Binding Sites
Drosophila melanogaster / genetics
Enhancer Elements, Genetic*
Gene Regulatory Networks*
Genome, Insect*
Genomics / methods*
Sequence Analysis, DNA / methods
Transcription Factors / metabolism

Substances

Transcription Factors

Grants and funding

R01 GM114341/GM/NIGMS NIH HHS/United States