Towards pan-genome read alignment to improve variation calling

Daniel Valenzuela; Tuukka Norri; Niko Välimäki; Esa Pitkänen; Veli Mäkinen

doi:10.1186/s12864-018-4465-8

BMC Genomics. 2018 May 9;19(Suppl 2):87. doi: 10.1186/s12864-018-4465-8.

Authors

Daniel Valenzuela¹, Tuukka Norri¹, Niko Välimäki², Esa Pitkänen³, Veli Mäkinen⁴

Affiliations

¹ Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, 00014, Finland.
² Department of Medical and Clinical Genetics, Genome-Scale Biology Program, University of Helsinki, Helsinki, Finland.
³ European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany.
⁴ Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, 00014, Finland.

Abstract

Background: A typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants are typically carried out on short-read data aligned to a single reference, disregarding the underlying variation.

Results: We propose a new unified framework for variant calling with short-read data that uses a pan-genomic reference, a representation of human genetic variation. We provide a modular pipeline that can be seamlessly incorporated into existing sequencing data analysis workflows. Our tool is open source and available online: https://gitlab.com/dvalenzu/PanVC.

Conclusions: Our experiments show that, by replacing a standard human reference with a pan-genomic one, we improve both single-nucleotide variant calling accuracy and short indel calling accuracy over the widely adopted Genome Analysis Toolkit (GATK) in difficult genomic regions.

Keywords: Pan-genome reference; Read alignment; Variation calling.

MeSH terms

Access to Information
Genetic Variation*
Genome, Human
Humans
Internet
Sequence Alignment
Sequence Analysis, DNA / methods*
Software
Workflow