Background: RNA viruses mutate at extremely high rates, forming an intra-host viral population of closely related variants, which allows them to evade the host's immune system and makes them particularly dangerous. Viral outbreaks pose a significant threat for public health, and, in order to deal with it, it is critical to infer transmission clusters, i.e., decide whether two viral samples belong to the same outbreak. Next-generation sequencing (NGS) can significantly help in tackling outbreak-related problems. While NGS data is first obtained as short reads, existing methods rely on assembled sequences. This requires reconstruction of the entire viral population, which is complicated, error-prone and time-consuming.
Results: The experimental validation using sequencing data from HCV outbreaks shows that the proposed algorithm can successfully identify genetic relatedness between viral populations, infer transmission direction, transmission clusters and outbreak sources, as well as decide whether the source is present in the sequenced outbreak sample and identify it.
Conclusions: Introduced algorithm allows to cluster genetically related samples, infer transmission directions and predict sources of outbreaks. Validation on experimental data demonstrated that algorithm is able to reconstruct various transmission characteristics. Advantage of the method is the ability to bypass cumbersome read assembly, thus eliminating the chance to introduce new errors, and saving processing time by allowing to use raw NGS reads.
Keywords: De Bruijn graph; Genetic relatedness; K-mers; Outbreaks investigations; Transmission networks.