FAPM: Functional Annotation of Proteins using Multi-Modal Models Beyond Structural Modeling

Wenkai Xiang; Zhaoping Xiong; Huan Chen; Jiacheng Xiong; Wei Zhang; Zunyun Fu; Mingyue Zheng; Bing Liu; Qian Shi

doi:10.1093/bioinformatics/btae680

Bioinformatics. 2024 Nov 14:btae680. doi: 10.1093/bioinformatics/btae680. Online ahead of print.

Authors

Wenkai Xiang¹,², Zhaoping Xiong³, Huan Chen⁴, Jiacheng Xiong¹,⁵, Wei Zhang¹,⁵, Zunyun Fu¹, Mingyue Zheng¹,²,⁵, Bing Liu⁴, Qian Shi²

Affiliations

¹ Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China.
² Lingang Laboratory, Shanghai, 200031, China.
³ ProtonUnfold Technology Co., Ltd, Suzhou, 215000, China.
⁴ BioBank, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China.
⁵ University of Chinese Academy of Sciences, Beijing, 100049, China.

PMID: 39540736

Abstract

Motivation: Assigning accurate property labels to proteins, such as functional terms and catalytic activity, is challenging, especially for proteins without homologs and for "tail labels" with few known examples. Previous methods have mainly focused on protein sequence features, overlooking the semantic meaning of the labels themselves.

Results: We introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. The model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and on in-house experimentally annotated phage proteins, which often have few known homologs. Additionally, FAPM's flexibility allows it to incorporate extra text prompts, such as taxonomy information, which improves both its predictive performance and its explainability. This approach offers a promising alternative to current methods that rely on multiple sequence alignment for protein annotation. An online demo is available at https://huggingface.co/spaces/wenkai/FAPM_demo.
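
As a rough illustration of what contrastive alignment between protein and text embeddings can look like, the sketch below computes a symmetric InfoNCE-style loss over paired embedding vectors. The abstract does not state FAPM's loss function or training details, so the function names, the temperature value, and the toy vectors here are assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(protein_embs, text_embs, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss.

    protein_embs[i] and text_embs[i] are assumed to describe the same protein;
    every other pairing in the batch is treated as a negative.
    The temperature of 0.07 is an assumed, commonly used default.
    """
    p = [l2_normalize(v) for v in protein_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(p)

    # Temperature-scaled cosine similarity between every protein and every text.
    sim = [
        [sum(a * b for a, b in zip(p[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the text-to-protein direction

    # Cross-entropy with the matching index as the target, in both directions.
    protein_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_protein = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (protein_to_text + text_to_protein)


# Toy 2-dimensional embeddings: matched pairs point the same way, swapped pairs do not.
proteins = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(proteins, matched_texts)
loss_swapped = symmetric_contrastive_loss(proteins, swapped_texts)
print(loss_matched < loss_swapped)
```

In this toy setup the loss is close to zero when each protein embedding lines up with its own text embedding, and much larger when the pairs are swapped. Minimizing such a loss pulls matching protein and label representations together.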

Supplementary information: Supplementary data are available at Bioinformatics online.