SeroBA: rapid high-throughput serotyping of Streptococcus pneumoniae from whole genome sequence data

Lennard Epping; Andries J van Tonder; Rebecca A Gladstone; The Global Pneumococcal Sequencing Consortium; Stephen D Bentley; Andrew J Page; Jacqueline A Keane

doi:10.1099/mgen.0.000186

Microb Genom. 2018 Jul;4(7):e000186. doi: 10.1099/mgen.0.000186. Epub 2018 Jun 15.

Authors

Lennard Epping¹,², Andries J van Tonder³, Rebecca A Gladstone³, The Global Pneumococcal Sequencing Consortium, Stephen D Bentley³, Andrew J Page²,⁴, Jacqueline A Keane²

Affiliations

¹ Microbial Genomics, Robert Koch Institute, Berlin, Germany.
² Pathogen Informatics, Wellcome Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK.
³ Infection Genomics, Wellcome Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK.
⁴ Quadram Institute, Norwich Research Park, Norwich, UK.

Abstract

Streptococcus pneumoniae is responsible for 240 000-460 000 deaths in children under 5 years of age each year. Accurate identification of pneumococcal serotypes is important for tracking the distribution and evolution of serotypes following the introduction of effective vaccines. Recent efforts have been made to infer serotypes directly from genomic data but current software approaches are limited and do not scale well. Here, we introduce a novel method, SeroBA, which uses a k-mer approach. We compare SeroBA against real and simulated data and present results on the concordance and computational performance against a validation dataset, the robustness and scalability when analysing a large dataset, and the impact of varying the depth of coverage on sequence-based serotyping. SeroBA can predict serotypes, by identifying the cps locus, directly from raw whole genome sequencing read data with 98 % concordance using a k-mer-based method, can process 10 000 samples in just over 1 day using a standard server and can call serotypes at a coverage as low as 15-21×. SeroBA is implemented in Python3 and is freely available under an open source GPLv3 licence from: https://github.com/sanger-pathogens/seroba.

Keywords: Streptococcus pneumoniae; k-mer method; pneumococcal; serotyping; whole genome sequencing.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Alleles
Child, Preschool
Databases, Genetic
Genes, Bacterial
High-Throughput Nucleotide Sequencing / methods*
Humans
Pneumococcal Infections / microbiology*
Polymorphism, Single Nucleotide
Sensitivity and Specificity
Serogroup
Serotyping / methods*
Software*
Streptococcus mitis / genetics*
Streptococcus pneumoniae / classification*
Streptococcus pneumoniae / genetics
Streptococcus pneumoniae / isolation & purification
Whole Genome Sequencing*
