Assessing the model transferability for prediction of transcription factor binding sites based on chromatin accessibility

Sheng Liu; Cristina Zibetti; Jun Wan; Guohua Wang; Seth Blackshaw; Jiang Qian

doi:10.1186/s12859-017-1769-7

BMC Bioinformatics. 2017 Jul 27;18(1):355. doi: 10.1186/s12859-017-1769-7.

Authors

Sheng Liu¹, Cristina Zibetti², Jun Wan¹, Guohua Wang¹, Seth Blackshaw¹,²,³,⁴,⁵, Jiang Qian⁶

Affiliations

¹ Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA.
² Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA.
³ Department of Neurology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA.
⁴ Centre for Human Systems Biology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA.
⁵ Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA.
⁶ Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA.

Abstract

Background: Computational prediction of transcription factor (TF) binding sites in different cell types is challenging. Recent technological developments make it possible to determine genome-wide chromatin accessibility in various cellular and developmental contexts. These chromatin accessibility profiles provide useful information for predicting TF binding events under various physiological conditions. Furthermore, ChIP-Seq analysis has been used to determine genome-wide binding sites for a range of TFs in multiple cell types. Integrating these two types of genomic information can improve the prediction of TF binding events.

Results: We assessed to what extent a model built on other TFs and/or other cell types could be used to predict the binding sites of TFs of interest. A random forest model was built using a set of cell type-independent features, such as the specific sequences recognized by the TFs and evolutionary conservation, as well as cell type-specific features derived from chromatin accessibility data. Our analysis suggested that models learned from other TFs and/or cell lines performed almost as well as the model learned from the target TF in the cell type of interest. Interestingly, models based on multiple TFs performed better than single-TF models. Finally, we proposed a universal model, BPAC, which was generated using ChIP-Seq data from multiple TFs in various cell types.
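As an illustration of how such a transferability comparison could be scored, the following minimal sketch computes the area under the ROC curve for two sets of predictions on the same candidate sites. It is not the authors' code: the function name `roc_auc`, the toy labels, and both score lists are assumptions made up for this example, and the abstract does not state which exact evaluation procedure was used.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: 1 for a bound site, 0 for an unbound site.
    scores: model scores, higher meaning more likely bound.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one bound and one unbound site")
    # Count bound/unbound pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up data: 1 = candidate site overlaps a ChIP-Seq peak, 0 = it does not.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
# Hypothetical scores from a model trained on the target TF and cell type.
target_scores = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.4, 0.6]
# Hypothetical scores from a model trained on a different TF or cell type.
transfer_scores = [0.9, 0.7, 0.4, 0.2, 0.5, 0.1, 0.3, 0.6]

auc_target = roc_auc(labels, target_scores)
auc_transfer = roc_auc(labels, transfer_scores)
print(f"target-trained AUC: {auc_target:.4f}")
print(f"transferred AUC:    {auc_transfer:.4f}")
```

A small gap between the two values on real held-out sites would correspond to the "almost as well" observation above; the numbers here are purely illustrative.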

Conclusion: Integrating chromatin accessibility information with sequence information improves prediction of TF binding. The prediction of TF binding is transferable across TFs and/or cell lines, suggesting that there is a set of universal "rules". A computational tool was developed to predict TF binding sites based on these universal "rules".

Keywords: Chromatin accessibility; Feature selection; Machine learning; Transcription factor binding prediction.

MeSH terms

Algorithms
Area Under Curve
Binding Sites
Cell Line, Tumor
Chromatin / chemistry
Chromatin / metabolism*
Chromatin Assembly and Disassembly
DNA / chemistry
DNA / metabolism
Humans
Models, Genetic*
Protein Binding
ROC Curve
Transcription Factors / chemistry
Transcription Factors / metabolism*

Substances

Chromatin
Transcription Factors
DNA
