Predicting the DNA binding specificity of mutated transcription factors using family-level biophysically interpretable machine learning

Shaoxun Liu; Pilar Gomez-Alcala; Christ Leemans; William J Glassford; Richard S Mann; Harmen J Bussemaker

doi:10.1101/2024.01.24.577115

bioRxiv [Preprint]. 2024 Jan 29:2024.01.24.577115. doi: 10.1101/2024.01.24.577115.

Authors

Shaoxun Liu¹, Pilar Gomez-Alcala¹, Christ Leemans¹, William J Glassford², Richard S Mann²,³, Harmen J Bussemaker¹,³

Affiliations

¹ Department of Biological Sciences, Columbia University, New York, NY, USA.
² Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA.
³ Department of Systems Biology, Columbia University, New York, NY, USA.

Abstract

Sequence-specific interactions of transcription factors (TFs) with genomic DNA underlie many cellular processes. High-throughput in vitro binding assays coupled with computational analysis have made it possible to accurately define such sequence recognition in a biophysically interpretable yet mechanism-agnostic way for individual TFs. The fact that such sequence-to-affinity models are now available for hundreds of TFs provides new avenues for predicting how the DNA binding specificity of a TF changes when its protein sequence is mutated. To this end, we developed an analytical framework based on a tetrahedron embedding that can be applied at the level of a given structural TF family. Using bHLH as a test case, we demonstrate that we can systematically map dependencies between the protein sequence of a TF and base preference within the DNA binding site. We also develop a regression approach to predict the quantitative energetic impact of mutations in the DNA binding domain of a TF on its DNA binding specificity, and perform SELEX-seq assays on mutated TFs to experimentally validate our results. Our results point to the feasibility of predicting the functional impact of disease mutations and allelic variation in the cell-wide TF repertoire by leveraging high-quality functional information across sets of homologous wild-type proteins.
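
The abstract does not spell out how the tetrahedron embedding is constructed. As a minimal sketch of one plausible reading, the base preference at a single binding-site position can be treated as a four-component probability vector over A, C, G, T and placed inside a regular tetrahedron using barycentric coordinates. Everything below is an assumption made for illustration and is not taken from the preprint: the function names, the vertex coordinates, the assignment of bases to vertices, and the sign convention that a larger relative free energy (in units of RT) means weaker binding.

```python
import numpy as np

# Vertices of a regular tetrahedron centred on the origin, one per base.
# Which base sits on which vertex is an arbitrary choice for this sketch.
VERTICES = np.array([
    [ 1.0,  1.0,  1.0],   # A
    [ 1.0, -1.0, -1.0],   # C
    [-1.0,  1.0, -1.0],   # G
    [-1.0, -1.0,  1.0],   # T
])


def energies_to_probabilities(ddg_over_rt):
    """Convert relative free energies (units of RT) to Boltzmann-weighted
    base probabilities. Assumes larger values mean weaker binding."""
    ddg = np.asarray(ddg_over_rt, dtype=float)
    weights = np.exp(-(ddg - ddg.min()))  # shift by the minimum for stability
    return weights / weights.sum()


def tetrahedron_embed(probs):
    """Map a probability vector over (A, C, G, T) to a 3D point inside the
    tetrahedron by taking the probability-weighted average of the vertices."""
    probs = np.asarray(probs, dtype=float)
    return probs @ VERTICES


# Hypothetical example: a position that strongly prefers C.
p = energies_to_probabilities([2.0, 0.0, 3.0, 2.5])
point = tetrahedron_embed(p)
```

Under this construction a position with no base preference lands at the centre of the tetrahedron, and a position that tolerates only one base lands on the corresponding vertex, so differences in base preference between homologous TFs become displacements in three-dimensional space.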

Keywords: DNA binding specificity; basic helix-loop-helix (bHLH) family; biophysically interpretable machine learning; functional impact of missense mutations; transcription factors.

Publication types

Preprint
