Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning

Joseph D Clark; Xuenan Mi; Douglas A Mitchell; Diwakar Shukla

doi:10.1039/d4dd00170b

Digit Discov. 2024 Dec 2. doi: 10.1039/d4dd00170b. Online ahead of print.

Authors

Joseph D Clark¹, Xuenan Mi², Douglas A Mitchell³ ⁴, Diwakar Shukla² ⁵ ⁶ ⁷

Affiliations

¹ School of Molecular and Cellular Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
² Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
³ Department of Biochemistry, Vanderbilt University School of Medicine, Nashville, TN 37232, USA.
⁴ Department of Chemistry, Vanderbilt University, Nashville, TN 37232, USA.
⁵ Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
⁶ Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
⁷ Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

Abstract

Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting the specificity of RiPP biosynthetic enzymes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream prediction of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved prediction of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. We found that a single high-quality data set of substrates and non-substrates for a RiPP biosynthetic enzyme improved substrate prediction for distinct enzymes in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.
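
As a rough illustration of the masked language modeling objective mentioned in the abstract, the sketch below hides a random subset of residues in a peptide and records the original residues as prediction targets. It is not the authors' implementation: the function name `mask_peptide`, the `<mask>` token string, the 15% masking rate (a common convention in BERT-style training), and the example peptide are all assumptions chosen for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask token; real tokenizers define their own


def mask_peptide(seq, mask_prob=0.15, rng=None):
    """Build one masked-language-modeling example from a peptide sequence.

    Returns (tokens, labels):
      tokens: residues of seq, with a random subset replaced by MASK
      labels: the original residue at each masked position, None elsewhere
    """
    if rng is None:
        rng = random.Random()
    tokens, labels = [], []
    for residue in seq:
        if rng.random() < mask_prob:
            # Hide this residue; a model would be trained to recover it.
            tokens.append(MASK)
            labels.append(residue)
        else:
            # Leave the residue visible; it contributes no training target.
            tokens.append(residue)
            labels.append(None)
    return tokens, labels


if __name__ == "__main__":
    # Made-up peptide used only to demonstrate the masking step.
    example = "SWGSCATCSG"
    toks, labs = mask_peptide(example, mask_prob=0.15, rng=random.Random(0))
    print(toks)
    print(labs)
```

In a transfer-learning setup like the one the abstract describes, a model trained on such examples from one enzyme's substrate data would supply embeddings to a separate classifier for another enzyme; that downstream step is not shown here.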