VirRep: a hybrid language representation learning framework for identifying viruses from human gut metagenomes

Genome Biol. 2024 Jul 4;25(1):177. doi: 10.1186/s13059-024-03320-9.

Abstract

Identifying viruses from metagenomes is a common step in exploring the viral composition of the human gut. Here, we introduce VirRep, a hybrid language representation learning framework for identifying viruses from human gut metagenomes. VirRep combines a context-aware encoder and an evolution-aware encoder to improve sequence representation by incorporating k-mer patterns and sequence homologies. Benchmarking on both simulated and real datasets with varying viral proportions demonstrates that VirRep outperforms state-of-the-art methods. When applied to fecal metagenomes from a colorectal cancer cohort, VirRep identifies 39 high-quality viral species associated with the disease, many of which cannot be detected by existing methods.
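
To make the notion of "k-mer patterns" concrete, the following is a minimal illustrative sketch of how a DNA sequence can be split into overlapping k-mers and summarized as a frequency profile. It is not VirRep's implementation: the function names `kmer_tokens` and `kmer_profile`, the default `k = 3`, and the stride of 1 are assumptions chosen for illustration, and the abstract does not specify the tokenization the authors use.

```python
from collections import Counter


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration; k and the stride are assumptions.
    """
    seq = seq.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_profile(seq: str, k: int = 3) -> dict[str, float]:
    """Return relative k-mer frequencies, skipping k-mers with ambiguous bases."""
    valid = set("ACGT")
    tokens = [t for t in kmer_tokens(seq, k) if set(t) <= valid]
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    contig = "ACGTACGTNACG"  # toy sequence, not real data
    print(kmer_tokens(contig, 3))
    print(kmer_profile(contig, 3))
```

In a language-model setting, tokens like these would typically be mapped to integer IDs and fed to an encoder; a frequency profile is only the simplest possible summary of the same information.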

Keywords: Human gut metagenomes; Language representation learning; Virus identification.

MeSH terms

  • Colorectal Neoplasms / genetics
  • Colorectal Neoplasms / virology
  • Feces / virology
  • Gastrointestinal Microbiome*
  • Humans
  • Metagenome*
  • Metagenomics / methods
  • Software
  • Viruses / genetics