Deciphering the Language of Protein-DNA Interactions: A Deep Learning Approach Combining Contextual Embeddings and Multi-Scale Sequence Modeling

Yu-Chen Liu; Yi-Jing Lin; Yan-Yun Chang; Cheng-Che Chuang; Yu-Yen Ou

doi:10.1016/j.jmb.2024.168769

J Mol Biol. 2024 Nov 15;436(22):168769. doi: 10.1016/j.jmb.2024.168769. Epub 2024 Aug 29.

Authors

Yu-Chen Liu¹, Yi-Jing Lin¹, Yan-Yun Chang¹, Cheng-Che Chuang¹, Yu-Yen Ou²

Affiliations

¹ Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan.
² Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan; Graduate Program in Biomedical Informatics, Yuan Ze University, Chung-Li 32003, Taiwan.

PMID: 39214282

Abstract

Deciphering the mechanisms governing protein-DNA interactions is crucial for understanding key cellular processes and disease pathways. In this work, we present a deep learning approach for predicting DNA-interacting residues from protein sequences. Our method uses the contextual representations learned by pre-trained protein language models, such as ProtTrans, to capture intrinsic biochemical properties and sequence motifs indicative of DNA binding sites. We then integrate these contextual embeddings with a multi-window convolutional neural network architecture, which scans the sequence at varying window sizes to identify both local and global binding patterns. Evaluation on curated benchmark datasets shows that our approach achieves an area under the ROC curve (AUC) of 0.89, a substantial improvement over previous state-of-the-art sequence-based predictors. This result shows the potential of pairing representation learning with deep neural network designs for uncovering the syntax governing protein-DNA interactions directly from primary sequences. Our work provides a computational tool for characterizing DNA-binding mechanisms and highlights the opportunities at the intersection of language modeling, deep learning, and protein sequence analysis. The code and data for this work are publicly available at https://github.com/B1607/DIRP, which should facilitate broader adoption and continued development of these techniques.
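
The multi-window scanning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that). It assumes per-residue embeddings are already available as lists of floats, and it uses random toy values in place of real language-model embeddings. The window sizes (3, 5, 7), the single filter per window, the ReLU activation, and the names `conv1d_same` and `multi_window_features` are all hypothetical choices made for illustration.

```python
import random


def conv1d_same(embeddings, kernel, window):
    """Slide one filter of width `window` over per-residue embeddings.

    Positions outside the sequence are treated as zeros, so the output
    has one value per residue. A ReLU is applied to each value.
    """
    length, dim = len(embeddings), len(embeddings[0])
    half = window // 2
    out = []
    for i in range(length):
        total = 0.0
        for k in range(window):
            j = i + k - half  # residue index covered by kernel row k
            if 0 <= j < length:
                total += sum(kernel[k][d] * embeddings[j][d] for d in range(dim))
        out.append(max(total, 0.0))  # ReLU
    return out


def multi_window_features(embeddings, kernels):
    """Run one filter per window size and concatenate the results per residue.

    `kernels` maps a window size to a (window x dim) weight matrix.
    Returns a list with one feature vector per residue.
    """
    per_window = [conv1d_same(embeddings, kernels[w], w) for w in sorted(kernels)]
    return [list(column) for column in zip(*per_window)]


rng = random.Random(0)
seq_len, emb_dim = 12, 8  # toy sizes; real embeddings are much wider
embeddings = [[rng.gauss(0, 1) for _ in range(emb_dim)] for _ in range(seq_len)]

window_sizes = (3, 5, 7)  # hypothetical; not taken from the paper
kernels = {
    w: [[rng.gauss(0, 0.1) for _ in range(emb_dim)] for _ in range(w)]
    for w in window_sizes
}

features = multi_window_features(embeddings, kernels)
print(len(features), len(features[0]))  # prints "12 3"
```

Each residue ends up with one feature per window size, so narrow windows respond to short local motifs while wider windows summarize more of the surrounding sequence. In a full model these features would presumably feed a per-residue classifier that outputs a binding probability, with many learned filters per window instead of one random filter.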

Keywords: DNA binding proteins; convolutional neural networks; deep learning; multiple windows scanning; pre-trained language model.

MeSH terms

Binding Sites
Computational Biology* / methods
DNA* / chemistry
DNA* / metabolism
DNA-Binding Proteins / chemistry
DNA-Binding Proteins / metabolism
Deep Learning*
Neural Networks, Computer
Protein Binding*

Substances

DNA
DNA-Binding Proteins