Fast, sensitive detection of protein homologs using deep dense retrieval

Liang Hong; Zhihang Hu; Siqi Sun; Xiangru Tang; Jiuming Wang; Qingxiong Tan; Liangzhen Zheng; Sheng Wang; Sheng Xu; Irwin King; Mark Gerstein; Yu Li

doi:10.1038/s41587-024-02353-6

Nat Biotechnol. 2024 Aug 9. doi: 10.1038/s41587-024-02353-6. Online ahead of print.

Authors

Liang Hong^#¹, Zhihang Hu^#¹, Siqi Sun^#^{2,3}, Xiangru Tang^#⁴, Jiuming Wang^{1,5}, Qingxiong Tan¹, Liangzhen Zheng^{6,7}, Sheng Wang^{6,7}, Sheng Xu^{8,9}, Irwin King¹, Mark Gerstein^{10,11,12,13}, Yu Li^#^{14,15,16,17,18,19}

Affiliations

¹ Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China.
² Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China.
³ Shanghai AI Laboratory, Shanghai, China.
⁴ Department of Computer Science, Yale University, New Haven, CT, USA.
⁵ OneAIM Ltd., Hong Kong SAR, China.
⁶ Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.
⁷ Shanghai Zelixir Biotech Company Ltd., Shanghai, China.
⁸ Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China.
⁹ Shanghai AI Laboratory, Shanghai, China.
¹⁰ Department of Computer Science, Yale University, New Haven, CT, USA.
¹¹ Computational Biology and Bioinformatics Program, Yale University, New Haven, CT, USA.
¹² Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA.
¹³ Department of Statistics and Data Science, Yale University, New Haven, CT, USA.
¹⁴ Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China.
¹⁵ Shanghai AI Laboratory, Shanghai, China.
¹⁶ Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA.
¹⁷ Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
¹⁸ Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹⁹ The Chinese University of Hong Kong Shenzhen Research Institute, Shenzhen, China.

^# Contributed equally.

PMID: 39123049
DOI: 10.1038/s41587-024-02353-6

Abstract

The identification of protein homologs in large databases using conventional methods, such as protein sequence comparison, often misses remote homologs. Here, we offer an ultrafast, highly sensitive method, dense homolog retriever (DHR), for detecting homologs on the basis of a protein language model and dense retrieval techniques. Its dual-encoder architecture generates different embeddings for the same protein sequence and easily locates homologs by comparing these representations. Its alignment-free nature improves speed and the protein language model incorporates rich evolutionary and structural information within DHR embeddings. DHR achieves a >10% increase in sensitivity compared to previous methods and a >56% increase in sensitivity at the superfamily level for samples that are challenging to identify using alignment-based approaches. It is up to 22 times faster than traditional methods such as PSI-BLAST and DIAMOND and up to 28,700 times faster than HMMER. The new remote homologs exclusively found by DHR are useful for revealing connections between well-characterized proteins and improving our knowledge of protein evolution, structure and function.
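The alignment-free retrieval idea described in the abstract can be illustrated with a minimal sketch. This is not the DHR implementation: the `toy_embed` function below is a hypothetical stand-in that counts amino-acid pairs, whereas DHR derives its embeddings from a protein language model, and the sketch uses one shared encoder for queries and database entries rather than a dual-encoder pair. All names, sequences and parameters here are assumptions chosen for illustration. What the sketch does show is the general pattern: embed every database sequence once, embed the query, and rank candidates by vector similarity without computing any alignment.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def toy_embed(seq):
    """Hypothetical encoder (NOT a protein language model).

    Counts adjacent amino-acid pairs into a 400-dimensional vector
    and L2-normalizes it, so a dot product equals cosine similarity.
    """
    vec = [0.0] * (len(AMINO_ACIDS) ** 2)
    for a, b in zip(seq, seq[1:]):
        if a in AA_INDEX and b in AA_INDEX:
            vec[AA_INDEX[a] * len(AMINO_ACIDS) + AA_INDEX[b]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def build_index(database):
    """Embed every database sequence once, ahead of any query."""
    return {name: toy_embed(seq) for name, seq in database.items()}


def retrieve(query_seq, index, top_k=2):
    """Rank database entries by similarity to the query embedding."""
    q = toy_embed(query_seq)
    scored = [
        (name, sum(a * b for a, b in zip(q, v)))
        for name, v in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up example sequences; protC differs from protA by one residue.
database = {
    "protB": "GSSGSSGSSGSSGSSG",
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protC": "MKTAYIAKQRQISFVKAHFSRQ",
}
index = build_index(database)
hits = retrieve("MKTAYIAKQRQISFVKSHFSRQ", index, top_k=2)
for name, score in hits:
    print(f"{name}\t{score:.3f}")
```

Because the database embeddings are computed once and each query costs only one encoding plus a similarity search, this style of retrieval avoids per-pair alignment, which is the property the abstract credits for the speed gain. A real system would likely replace the linear scan with an approximate nearest-neighbor index.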
