vClean: assessing virus sequence contamination in viral genomes

NAR Genom Bioinform. 2025 Jan 7;7(1):lqae185. doi: 10.1093/nargab/lqae185. eCollection 2025 Mar.

Abstract

Recent advances in viral metagenomics and single-virus genomics have improved our ability to obtain draft genomes of environmental viruses. However, these methods can introduce viral sequence contamination into viral genomes when short, fragmented partial sequences are present in the assembled contigs. Such contamination can lead to incorrect analyses, yet practical detection tools are lacking. In this study, we introduce vClean, a novel automated tool that detects contamination in viral genomes. vClean identifies contamination by applying machine learning to the nucleotide sequence features and gene patterns of the input viral genome. Specifically, for tailed double-stranded DNA phages, we sought to make accurate predictions by defining single-copy-like genes and counting their duplications. We evaluated the performance of vClean using simulated datasets derived from complete reference genomes, achieving a binary accuracy of 0.932. When vClean was applied to 4693 genomes of medium or higher quality derived from public ocean metagenomic data, 1604 genomes (34.2%) were identified as contaminated. We also demonstrated that vClean can detect contamination in single-virus genome data obtained from river water. vClean provides a new benchmark for quality control of environmental viral genomes and has the potential to become an essential tool for environmental viral genome analysis.
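The idea of counting duplications of single-copy-like genes can be illustrated with a minimal Python sketch. This is not vClean's implementation or interface: the function name, the marker gene names, and the example annotations are all hypothetical and chosen only for illustration. The sketch simply reports which marker genes occur more than once in a genome, and then re-derives the contaminated fraction reported for the ocean metagenomic dataset.

```python
from collections import Counter


def duplicated_single_copy_genes(gene_hits, single_copy_like):
    """Return marker genes that occur more than once in one genome.

    gene_hits: gene family labels annotated on a genome (hypothetical input).
    single_copy_like: set of gene families expected once per genome (assumed).
    """
    counts = Counter(g for g in gene_hits if g in single_copy_like)
    return {gene: n for gene, n in counts.items() if n > 1}


# Hypothetical marker set and annotations, for illustration only.
single_copy_like = {"terminase_large", "portal", "major_capsid"}
gene_hits = ["terminase_large", "portal", "major_capsid", "portal", "integrase"]

duplicates = duplicated_single_copy_genes(gene_hits, single_copy_like)
print(duplicates)  # {'portal': 2}

# Fraction of genomes flagged as contaminated, using the counts in the abstract.
contaminated_fraction = 1604 / 4693
print(f"{contaminated_fraction:.1%}")  # 34.2%
```

A duplicated marker such as the one above would only be a hint of possible contamination; according to the abstract, vClean combines such gene patterns with nucleotide sequence features in a machine learning model rather than relying on a single count.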

MeSH terms

  • DNA Contamination
  • Genome, Viral* / genetics
  • Machine Learning
  • Metagenomics* / methods
  • Sequence Analysis, DNA / methods
  • Software
  • Viruses / genetics
  • Viruses / isolation & purification