Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data

Elife. 2024 Oct 10:13:RP98300. doi: 10.7554/eLife.98300.

Abstract

Variant calling is fundamental in bacterial genomics, underpinning the identification of disease transmission clusters, the construction of phylogenetic trees, and antimicrobial resistance detection. This study presents a comprehensive benchmarking of variant calling accuracy in bacterial genomes using Oxford Nanopore Technologies (ONT) sequencing data. We evaluated three ONT basecalling models and both simplex (single-strand) and duplex (dual-strand) read types across 14 diverse bacterial species. Our findings reveal that deep learning-based variant callers, particularly Clair3 and DeepVariant, significantly outperform traditional methods and even exceed the accuracy of Illumina sequencing, especially when applied to ONT's super-high accuracy model. ONT's superior performance is attributed to its ability to overcome Illumina's errors, which often arise from difficulties in aligning reads in repetitive and variant-dense genomic regions. Moreover, the use of high-performing variant callers with ONT's super-high accuracy data mitigates ONT's traditional errors in homopolymers. We also investigated the impact of read depth on variant calling, demonstrating that 10× depth of ONT super-accuracy data can achieve precision and recall comparable to, or better than, full-depth Illumina sequencing. These results underscore the potential of ONT sequencing, combined with advanced variant calling algorithms, to replace traditional short-read sequencing methods in bacterial genomics, particularly in resource-limited settings.

Keywords: C. jejuni; E. coli; K. pneumoniae; L. monocytogenes; M. tuberculosis; S. aureus; S. enterica; S. pyogenes; bacteria; benchmark; bioinformatics; computational biology; deep learning; infectious disease; microbiology; nanopore; systems biology; variant calling.

Plain language summary

Imagine being part of a public health institution when, suddenly, cases of Salmonella surge across your country. You are facing an outbreak of this foodborne disease, and the clock is ticking. People are consuming a contaminated product that is making them sick; how do you identify related cases, track the source of the infection, and shut down its production? In situations like these, scientists need to tell apart even closely related strains of the same bacterial species. This process, known as variant calling, relies on first analysing (or ‘sequencing’) the genetic information obtained from the bacteria of interest, then comparing it to a reference genome. Currently, two main approaches are available for genome sequencing. Traditional ‘short-read’ technologies tend to be more accurate but less reliable when covering certain types of genomic regions. New ‘long-read’ approaches can bypass these limitations though they have historically been less accurate. Comparison with a reference genome can be performed using a tool known as a variant caller. Many of the most effective ones are now based on artificial intelligence approaches such as deep learning. However, these have primarily been applied to human genomic data so far; it therefore remains unclear whether they are equally useful for bacterial genomes. In response, Hall et al. set out to investigate the accuracy of four deep learning-based and three traditional variant callers on datasets from 14 bacterial species obtained via long-read approaches. Their respective performance was also benchmarked against a more conventional approach representing a standard of accuracy (that is, a popular, non-deep learning variant caller used on short-read datasets). These analyses were performed on a ‘truthset’ established by Hall et al., a collection of validated data that allowed them to assess the performance of the various tools tested. The results show that, in this context, the deep learning variant callers more accurately detected genetic variations compared to the traditional approach. These tools, which could be run on standard laptops, were effective even with low amounts of sequencing data – making them useful even in settings where resources are limited. Variant calling is an essential step in tracking the emergence and spread of disease, identifying new strains of bacteria, and examining their evolution. The findings by Hall et al. should therefore benefit various sectors, particularly clinical and public health laboratories.

MeSH terms

  • Bacteria* / classification
  • Bacteria* / genetics
  • Benchmarking*
  • Deep Learning*
  • Genetic Variation
  • Genome, Bacterial*
  • Genomics / methods
  • High-Throughput Nucleotide Sequencing / methods
  • Nanopore Sequencing* / methods
  • Nanopores