One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads

PLoS Comput Biol. 2021 Jan 27;17(1):e1008678. doi: 10.1371/journal.pcbi.1008678. eCollection 2021 Jan.

Abstract

Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.
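One of the effects described above, the dependence of estimated genetic distances on the mapping reference, can be illustrated with a minimal sketch. It assumes that a separate multi-sequence alignment of the same isolates is already available for each reference; the function names, isolate labels and toy sequences below are hypothetical and are not taken from the study, and the code does not reproduce the authors' pipeline.

```python
from __future__ import annotations

from itertools import combinations

# Characters treated as missing data (uncalled base or alignment gap).
MISSING = {"N", "-"}


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both sequences are called and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in MISSING and b not in MISSING
    )


def pairwise_distances(alignment: dict[str, str]) -> dict[tuple[str, str], int]:
    """Pairwise SNP distances for one alignment (one mapping reference)."""
    return {
        (p, q): snp_distance(alignment[p], alignment[q])
        for p, q in combinations(sorted(alignment), 2)
    }


def distances_by_reference(
    alignments: dict[str, dict[str, str]]
) -> dict[tuple[str, str], dict[str, int]]:
    """For each isolate pair, collect the distance obtained with each reference."""
    table: dict[tuple[str, str], dict[str, int]] = {}
    for ref_name, alignment in alignments.items():
        for pair, dist in pairwise_distances(alignment).items():
            table.setdefault(pair, {})[ref_name] = dist
    return table


# Toy alignments (illustrative only): the same three isolates mapped to two
# hypothetical references, "ref_A" and "ref_B".
toy_alignments = {
    "ref_A": {"iso1": "ACGTACGT", "iso2": "ACGTACGA", "iso3": "ACCTACGA"},
    "ref_B": {"iso1": "ACGTNCGT", "iso2": "ACGTACGA", "iso3": "ACGTACGA"},
}

comparison = distances_by_reference(toy_alignments)
for pair, per_ref in sorted(comparison.items()):
    spread = max(per_ref.values()) - min(per_ref.values())
    print(pair, per_ref, "spread:", spread)
```

Pairs whose distance changes with the reference are the ones for which a fixed SNP threshold could include or exclude an isolate depending on the reference chosen.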

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bacteria / classification
  • Bacteria / genetics
  • Chromosome Mapping / methods*
  • Chromosome Mapping / standards*
  • Genome, Bacterial / genetics
  • Genomics / methods*
  • High-Throughput Nucleotide Sequencing / methods*
  • Phylogeny
  • Polymorphism, Single Nucleotide / genetics
  • Sequence Alignment

Grants and funding

This work was partly funded by projects BFU2017-89594R from MICIN (Spanish Government) and PROMETEO2016-0122 (Generalitat Valenciana, Spain). WGS was performed at Servicio de Secuenciación Masiva y Bioinformática de la Fundación para la Investigación Sanitaria y Biomédica de la Comunitat Valenciana (FISABIO) and was co-financed by the European Union through the Operational Program of the European Regional Development Fund (ERDF) of the Valencia Region (Spain) 2014-2020. CV is the recipient of contract FPU2018/02579, BB of contract FPU2016/02139 and CF of FPI contract BES-2015-074204 from MICIN (Spanish Government). LM is supported by a fellowship from Fundación Carolina. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.