Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape

Bioinformatics. 2017 Nov 15;33(22):3575-3583. doi: 10.1093/bioinformatics/btx480.

Abstract

Motivation: An accurate characterization of the transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have made it possible to build binding affinity prediction methods, accurately characterizing the TF-DNA binding affinity landscape remains a challenging problem.

Results: Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as hidden Markov models that capture both position-specific information and long-range dependencies in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses the embedded features to build a predictive model. Our method combines the strengths of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as state-of-the-art binding affinity prediction methods.
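To make the idea of a message passing-like embedding over a chain-structured model more concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the authors' implementation: the function names (`init_params`, `embed_sequence`, `predict_affinity`), the update rule, the ReLU nonlinearity, the embedding dimension, the number of iterations and the random, untrained weights are all assumptions made for illustration. Each position of a DNA sequence is treated as a node in a chain, nodes exchange messages with their neighbors for a fixed number of rounds, and the per-position embeddings are summed into one sequence-level feature vector that feeds a linear predictor.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    idx = [ALPHABET.index(base) for base in seq]
    return np.eye(len(ALPHABET))[idx]


def relu(z):
    return np.maximum(z, 0.0)


def init_params(dim=8, seed=0):
    """Random, untrained weights (hypothetical; a real model would learn them)."""
    rng = np.random.default_rng(seed)
    return {
        "W_obs": 0.1 * rng.standard_normal((dim, len(ALPHABET))),  # observed base -> embedding
        "W_msg": 0.1 * rng.standard_normal((dim, dim)),            # incoming message -> embedding
        "w_out": 0.1 * rng.standard_normal(dim),                   # embedding -> scalar score
    }


def embed_sequence(seq, params, n_iters=2):
    """Embed a sequence by iterating messages along a chain (assumed update rule)."""
    x = one_hot(seq)
    length = x.shape[0]
    dim = params["W_obs"].shape[0]
    zero = np.zeros(dim)
    fwd = np.zeros((length, dim))  # fwd[i]: message sent from position i to i + 1
    bwd = np.zeros((length, dim))  # bwd[i]: message sent from position i to i - 1

    for _ in range(n_iters):
        new_fwd = np.zeros_like(fwd)
        new_bwd = np.zeros_like(bwd)
        for i in range(length):
            from_left = fwd[i - 1] if i > 0 else zero
            from_right = bwd[i + 1] if i < length - 1 else zero
            # A message excludes what was received from its own recipient.
            new_fwd[i] = relu(params["W_obs"] @ x[i] + params["W_msg"] @ from_left)
            new_bwd[i] = relu(params["W_obs"] @ x[i] + params["W_msg"] @ from_right)
        fwd, bwd = new_fwd, new_bwd

    # Per-position embedding combines the observed base with both incoming messages.
    mu = np.zeros((length, dim))
    for i in range(length):
        from_left = fwd[i - 1] if i > 0 else zero
        from_right = bwd[i + 1] if i < length - 1 else zero
        mu[i] = relu(params["W_obs"] @ x[i] + params["W_msg"] @ (from_left + from_right))

    # Sum pooling yields a fixed-size vector regardless of sequence length.
    return mu.sum(axis=0)


def predict_affinity(seq, params):
    """Linear readout on the sequence embedding (an untrained, illustrative score)."""
    return float(params["w_out"] @ embed_sequence(seq, params))
```

With these hypothetical helpers, `predict_affinity("ACGTAC", init_params())` returns a single score for the sequence. In a trained model the weights would be fitted to measured binding affinities; here they are random, so the score carries no biological meaning.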

Availability and implementation: Our program is freely available at https://github.com/ramzan1990/sequence2vec.

Contact: [email protected] or [email protected].

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Algorithms*
  • Binding Sites
  • DNA / chemistry
  • DNA / metabolism*
  • Machine Learning
  • Models, Statistical
  • Protein Binding
  • Sequence Analysis, DNA / methods*
  • Transcription Factors / metabolism*

Substances

  • Transcription Factors
  • DNA