Inadequacy of Evolutionary Profiles Vis-a-vis Single Sequences in Predicting Transient DNA-Binding Sites in Proteins

Ajay Arya; Dana Mary Varghese; Ajay Kumar Verma; Shandar Ahmad

doi:10.1016/j.jmb.2022.167640

J Mol Biol. 2022 Jul 15;434(13):167640. doi: 10.1016/j.jmb.2022.167640. Epub 2022 May 18.

Authors

Ajay Arya¹, Dana Mary Varghese¹, Ajay Kumar Verma¹, Shandar Ahmad²

Affiliations

¹ School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
² School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India.

PMID: 35597551

Abstract

Sequence-based prediction of DNA-binding residues in proteins is a widely studied problem, for which machine learning methods with steadily improving predictive power have been developed. Almost all such methods use, as their primary feature set, the concatenated rows of the protein's Position Specific Substitution Matrix (PSSM) within a sliding window. Here we report that these evolutionary profiles are powerful only for identifying conserved binding sites, and fall short for residue positions that undergo binding-to-non-binding transitions between closely related proteins. We created a database of highly similar protein pairs with known protein-DNA complexes and investigated the differential predictability of conserved and transient binding residues within each pair. Retraining machine learning models uniformly, we compared the predictive power of models trained on PSSMs with that of similarly trained models on sparse-encoded single sequences. In these controlled experiments, single-sequence-based models outperformed evolutionary profiles on transient binding site prediction by as much as 8 percentage points. We therefore conclude that PSSM-based models are inadequate for predicting high-specificity DNA-binding residues. These findings are of critical significance for the design of mutant- and species-specific DNA ligands and for homology-based modeling of protein-DNA complexes.
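
The two feature sets compared in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the window half-width of 1, the zero-padding at the termini, the toy sequence, and the made-up PSSM scores are all assumptions chosen only to show how per-residue rows (PSSM rows or sparse one-hot rows) are concatenated within a sliding window.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot_rows(sequence: str) -> List[List[float]]:
    """Sparse (one-hot) encoding: one 20-dimensional row per residue."""
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:  # unknown residues (e.g. 'X') stay all-zero
            row[index] = 1.0
        rows.append(row)
    return rows


def window_features(rows: Sequence[Sequence[float]],
                    half_width: int) -> List[List[float]]:
    """Concatenate the rows in a window of 2*half_width + 1 positions
    centred on each residue.

    Positions beyond either terminus are zero-padded (an assumption;
    the abstract does not say how chain ends are handled).
    """
    if not rows:
        return []
    padding = [0.0] * len(rows[0])
    features = []
    for centre in range(len(rows)):
        vector: List[float] = []
        for position in range(centre - half_width, centre + half_width + 1):
            if 0 <= position < len(rows):
                vector.extend(rows[position])
            else:
                vector.extend(padding)
        features.append(vector)
    return features


# Hypothetical toy input, not taken from the paper's dataset.
sequence = "MKRA"
single_sequence_features = window_features(one_hot_rows(sequence), half_width=1)

# A PSSM has the same shape (one 20-column row per residue), so the same
# windowing applies. These scores are invented purely for illustration.
toy_pssm = [[float((i + j) % 5 - 2) for j in range(20)]
            for i in range(len(sequence))]
pssm_features = window_features(toy_pssm, half_width=1)
```

Each residue ends up with a vector of 3 x 20 = 60 values in either encoding, so the same classifier can be trained on both and the comparison isolates the information content of the features rather than the model.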

Keywords: DNA-binding sites; Protein-DNA interactions; Specificity determining positions; Transient and conserved binding sites.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Binding Sites
Computational Biology / methods
DNA* / metabolism
Databases, Protein
Ligands
Protein Binding
Proteins* / chemistry

Substances

Ligands
Proteins
DNA