Virus-host interactions predictor (VHIP): Machine learning approach to resolve microbial virus-host interaction networks

PLoS Comput Biol. 2024 Sep 18;20(9):e1011649. doi: 10.1371/journal.pcbi.1011649. eCollection 2024 Sep.

Abstract

Viruses of microbes are ubiquitous biological entities that reprogram their hosts' metabolisms during infection to produce viral progeny, impacting the ecology and evolution of microbiomes with broad implications for human and environmental health. Advances in genome sequencing have led to the discovery of millions of novel viruses and an appreciation for the great diversity of viruses on Earth. Yet, with knowledge of only "who is there?" we fall short in our ability to infer the impacts of viruses on microbes at population, community, and ecosystem scales. To do this, we need a more explicit understanding of "who do they infect?" Here, we developed a novel machine learning (ML) model, Virus-Host Interaction Predictor (VHIP), to predict virus-host interactions (infection/non-infection) from input virus and host genomes. This ML model was trained and tested on a high-value, manually curated set of 8849 virus-host pairs and their corresponding sequence data. The resulting dataset, 'Virus Host Range network' (VHRnet), is core to VHIP functionality. Each data point underlying VHIP training and testing represents a lab-tested virus-host pair in VHRnet, from which meaningful signals of viral adaptation to host were computed from genomic sequences. VHIP departs from existing virus-host prediction models in its ability to predict multiple interactions rather than a single most likely host or host clade. As a result, VHIP is able to infer the complexity of virus-host networks in natural systems. VHIP predicts interactions between virus-host pairs at the species level with 87.8% accuracy and can be applied to novel viral and host population genomes reconstructed from metagenomic datasets.
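
The contrast between predicting a network of interactions and predicting a single best host can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify which genomic signals VHIP computes or which classifier it uses, so the k-mer composition distance, the fixed threshold, and the function names (kmer_profile, profile_distance, predict_network) are hypothetical stand-ins, not the VHIP implementation. The point is only that each virus-host pair is scored independently, so one virus may be assigned several hosts or none.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3):
    """Return normalized k-mer frequencies of a DNA sequence as a dict."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def predict_network(viruses, hosts, threshold=0.2, k=3):
    """Score every virus-host pair independently and return predicted infections.

    Hypothetical rule: a pair is called an infection when the composition
    distance is at or below the threshold. A trained classifier would
    replace this rule; the threshold here is arbitrary.
    """
    v_prof = {name: kmer_profile(s, k) for name, s in viruses.items()}
    h_prof = {name: kmer_profile(s, k) for name, s in hosts.items()}
    return {
        (v, h)
        for v, h in product(v_prof, h_prof)
        if profile_distance(v_prof[v], h_prof[h]) <= threshold
    }


# Toy sequences (invented, not real genomes).
viruses = {"v1": "ATATATATATAT"}
hosts = {
    "hA": "ATATATATATATAT",
    "hB": "TATATATATATA",
    "hC": "GCGCGCGCGCGC",
}

network = predict_network(viruses, hosts)
print(sorted(network))  # [('v1', 'hA'), ('v1', 'hB')]
```

In this toy case the virus is linked to two hosts at once, which a model that returns only the single most likely host could not express.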

MeSH terms

  • Computational Biology* / methods
  • Host Microbial Interactions / genetics
  • Host Microbial Interactions / physiology
  • Host-Pathogen Interactions* / genetics
  • Host-Pathogen Interactions* / physiology
  • Humans
  • Machine Learning*
  • Viruses* / genetics

Grants and funding

This study is based upon work supported by the National Science Foundation under Grant No. 2055455 (awarded to MBD and LZ) and Grant No. 1813069 (awarded to LZ), and by funding to MBD through the National Oceanic and Atmospheric Administration Great Lakes Omics program, distributed through the University of Michigan Cooperative Institute for Great Lakes Research (NA17OAR4320152). This is CIGLR contribution number 1250. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.