Protein embeddings predict binding residues in disordered regions

Laura R Jahn; Céline Marquet; Michael Heinzinger; Burkhard Rost

doi:10.1038/s41598-024-64211-4

Sci Rep. 2024 Jun 12;14(1):13566. doi: 10.1038/s41598-024-64211-4.

Authors

Laura R Jahn^#¹, Céline Marquet^#², Michael Heinzinger¹, Burkhard Rost¹,³,⁴

Affiliations

¹ School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics and Computational Biology, TUM (Technical University of Munich), 85748, Garching/Munich, Germany.
² School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics and Computational Biology, TUM (Technical University of Munich), 85748, Garching/Munich, Germany. [email protected].
³ Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany.
⁴ TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany.

^# Contributed equally.

Abstract

Identifying the binding residues of a protein helps to understand its biological role, as protein function is often defined through the binding of ligands such as other proteins, small molecules, ions, or nucleotides. Methods predicting binding residues often err for intrinsically disordered proteins or regions (IDPs/IDPRs), whose binding regions are often referred to as molecular recognition features (MoRFs). Here, we present a novel machine learning (ML) model trained specifically to predict binding regions in IDPRs. The proposed model, IDBindT5, leverages embeddings from the protein language model (pLM) ProtT5 to reach a balanced accuracy of 57.2 ± 3.6% (95% confidence interval, CI). Assessed on the same data set, this performance does not differ at the 95% CI from that of the state-of-the-art (SOTA) methods ANCHOR2 and DeepDISOBind, which rely on expert-crafted features and evolutionary information from multiple sequence alignments (MSAs). Assessed on other data, methods such as SPOT-MoRF reached higher Matthews correlation coefficients (MCCs). IDBindT5 produces its SOTA-level predictions much faster than other methods, easily enabling full-proteome analyses. Our findings emphasize the potential of pLMs as a promising approach for exploring and predicting features of disordered proteins. The model and a comprehensive manual are publicly available at https://github.com/jahnl/binding_in_disorder.
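
The two per-residue metrics named in the abstract, balanced accuracy and MCC, can be illustrated with a minimal sketch. Balanced accuracy averages sensitivity and specificity, so it does not reward a predictor for simply favouring the more frequent class. The abstract does not state how the 95% confidence interval was obtained; the percentile bootstrap over residues below is an assumption made for illustration, and all function names and the toy labels are hypothetical rather than taken from the IDBindT5 code.

```python
import random
from math import sqrt


def confusion(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = binding, 0 = non-binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; an empty class contributes 0."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (sensitivity + specificity)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0 if the denominator is 0."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Percentile bootstrap 95% interval (assumed procedure, not the paper's)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample residues with replacement and re-score the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot) - 1]


# Toy example: 8 residues, 3 of them binding (hypothetical labels).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
ba = balanced_accuracy(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, balanced_accuracy)
print(f"balanced accuracy = {ba:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
print(f"MCC = {mcc(y_true, y_pred):.3f}")
```

Resampling individual residues treats them as independent; resampling whole proteins would be a more conservative alternative, since residues within one protein are correlated.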

Keywords: Machine learning; Protein binding; Protein binding prediction; Protein disorder; Protein function; Protein language model.

MeSH terms

Binding Sites
Computational Biology / methods
Databases, Protein
Humans
Intrinsically Disordered Proteins* / chemistry
Intrinsically Disordered Proteins* / metabolism
Machine Learning*
Protein Binding*

Substances

Intrinsically Disordered Proteins

Grants and funding

DFG-GZ: RO1320/4-1/Deutsche Forschungsgemeinschaft