Prediction of virus-host associations using protein language models and multiple instance learning

PLoS Comput Biol. 2024 Nov 19;20(11):e1012597. doi: 10.1371/journal.pcbi.1012597. eCollection 2024 Nov.

Abstract

Predicting virus-host associations is essential to determine the specific host species that viruses interact with, and to discover whether new viruses infect humans and animals. Currently, the host of the majority of viruses is unknown, particularly in microbiomes. To address this challenge, we introduce EvoMIL, a deep learning method that predicts the host species for viruses from viral sequences only. It also identifies important viral proteins that significantly contribute to host prediction. The method combines a pre-trained large protein language model (ESM) and attention-based multiple instance learning to allow protein-orientated predictions. Our results show that protein embeddings capture stronger predictive signals than sequence composition features, including amino acids, physicochemical properties, and DNA k-mers. In multi-host prediction tasks, EvoMIL achieves median F1 score improvements of 10.8%, 16.2%, and 4.9% in prokaryotic hosts, and 1.7%, 6.6% and 11.5% in eukaryotic hosts. EvoMIL binary classifiers achieve AUC values over 0.95 for all prokaryotic hosts and roughly 0.8 to 0.9 for eukaryotic hosts. Furthermore, EvoMIL identifies important proteins in the prediction task, capturing key functions involved in virus-host specificity.
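
The attention-based multiple instance learning step can be illustrated with a minimal sketch. It treats a virus as a "bag" of per-protein embedding vectors, scores each protein with a small tanh attention layer, and pools the embeddings with the softmax-normalised scores. This is not the EvoMIL implementation: the function name `attention_mil_pool`, the parameters `V` and `w`, the dimensions, and the random vectors standing in for ESM embeddings are all assumptions made for illustration.

```python
import math
import random


def attention_mil_pool(instances, V, w):
    """Pool a bag of instance vectors into one bag vector with attention.

    instances: list of n vectors (length d), one per protein (hypothetical embeddings)
    V: list of k rows (each length d), hypothetical attention projection
    w: list of k values, hypothetical attention scoring vector
    Returns (bag_vector, attention_weights).
    """
    scores = []
    for h in instances:
        # Project the instance and squash with tanh, then score with w.
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))

    # Numerically stable softmax over the instances in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The bag vector is the attention-weighted sum of instance vectors.
    d = len(instances[0])
    bag = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(d)]
    return bag, weights


# Toy data: random vectors standing in for protein embeddings (assumption).
random.seed(0)
n_proteins, d, k = 5, 8, 4
proteins = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_proteins)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
w = [random.gauss(0, 1) for _ in range(k)]

bag_embedding, attn = attention_mil_pool(proteins, V, w)
# Index of the protein that received the largest attention weight.
top_protein = max(range(n_proteins), key=lambda i: attn[i])
```

In a trained model of this kind, the bag vector would feed a host classifier and the attention weights would give a per-protein importance ranking; here the weights are random, so the ranking carries no biological meaning.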

MeSH terms

  • Animals
  • Computational Biology* / methods
  • Deep Learning
  • Host Microbial Interactions / physiology
  • Host-Pathogen Interactions* / physiology
  • Humans
  • Viral Proteins* / chemistry
  • Viral Proteins* / metabolism
  • Viruses* / metabolism

Substances

  • Viral Proteins

Grants and funding

DL is funded by the European Union’s Horizon 2020 research and innovation program, under the Marie Skłodowska-Curie Actions Innovative Training Networks grant agreement no. 955974 (VIROINF). The authors also acknowledge support from the following grants: the Medical Research Council (MRC, MC_UU_12014/12, MC_UU_00034/5, MR/V01157X/1) to DLR, a Doctoral Training Programme in Precision Medicine studentship for KDL, MR/N013166/1, the Biotechnology and Biological Sciences Research Council (BBSRC, BB/V016067/1) to DLR, FY and KY, and the Engineering and Physical Sciences Research Council (EPSRC, EP/R018634/1) to KY. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.