Investigating the genomic epidemiology of major bacterial pathogens is integral to understanding transmission, evolution, colonization, disease, antimicrobial resistance and vaccine impact. Furthermore, the recent accumulation of large numbers of whole genome sequences for many bacterial species facilitates the development of robust genome-wide typing schemes that define the overall bacterial population structure and the lineages within it. Using previously published data, we developed the Pneumococcal Genome Library (PGL), a curated dataset of 30 976 genomes and contextual data for carriage and disease pneumococci recovered between 1916 and 2018 in 82 countries. We leveraged the size and diversity of the PGL to develop a core genome multilocus sequence typing (cgMLST) scheme comprising 1222 loci. Finally, using multilevel single-linkage clustering, we stratified pneumococci into hierarchical clusters based on allelic similarity thresholds and defined these clusters with a taxonomic life identification number (LIN) barcoding system. Together, the PGL, cgMLST scheme and LIN barcodes provide a high-quality genomic resource and fine-scale clustering approaches for the analysis of pneumococcal populations, which support the genomic epidemiology and surveillance of this leading global pathogen.
Keywords: cgMLST; genome library; genotyping; population structure.