Democratizing protein language models with parameter-efficient fine-tuning

Samuel Sledzieski; Meghana Kshirsagar; Minkyung Baek; Rahul Dodhia; Juan Lavista Ferres; Bonnie Berger

doi:10.1073/pnas.2405840121

Proc Natl Acad Sci U S A. 2024 Jun 25;121(26):e2405840121. doi: 10.1073/pnas.2405840121. Epub 2024 Jun 20.

Authors

Samuel Sledzieski¹ ², Meghana Kshirsagar¹, Minkyung Baek³, Rahul Dodhia¹, Juan Lavista Ferres¹, Bonnie Berger² ⁴

Affiliations

¹ AI for Good Research Lab, Microsoft Corporation, Redmond, WA 98052.
² Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139.
³ Department of Biological Sciences, Seoul National University, Seoul 08826, South Korea.
⁴ Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139.

Abstract

Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt them to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in model size, and there these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics by leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein-protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that for the PPI prediction task, training only the classification head also remains competitive with full FT while using five orders of magnitude fewer parameters, and that each of these methods outperforms state-of-the-art PPI prediction methods with substantially reduced compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All our model adaptation and evaluation code is available open-source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM adaptation to groups with limited computational resources.
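
To illustrate why LoRA needs far fewer trainable parameters than full FT, the following is a minimal sketch of the general LoRA idea, not the authors' implementation. A frozen weight matrix W is adapted as W + (alpha / r) * B A, and only the low-rank factors A and B are trained. The function names, the scaling factor `alpha`, and the dimensions below are assumptions chosen for illustration and are not taken from the paper.

```python
# Illustrative sketch of a LoRA-style low-rank update (assumed names and sizes).
# W is frozen (d_out x d_in); only B (d_out x r) and A (r x d_in) are trainable.

def full_params(d_out, d_in):
    """Trainable parameters when the whole matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, r):
    """Trainable parameters for a rank-r adapter on the same matrix."""
    return r * (d_out + d_in)


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) without forming the product B A."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    d, r = 1024, 8
    print(full_params(d, d))     # 1048576
    print(lora_params(d, d, r))  # 16384
```

For this hypothetical 1024 x 1024 layer with rank 8, the adapter trains 64 times fewer parameters than updating the full matrix. When B is all zeros, `lora_forward` returns exactly W x, so the adapted model starts out identical to the pretrained one.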

Keywords: homooligomer symmetry; parameter-efficient fine-tuning; protein language models; protein–protein interactions; quaternary structure.

MeSH terms

Algorithms
Computational Biology / methods
Humans
Natural Language Processing
Protein Interaction Mapping / methods
Proteins / chemistry
Proteins / metabolism
Proteomics* / methods

Substances

Proteins
