Population-scale detection of non-reference sequence variants using colored de Bruijn graphs

Thomas Krannich; W Timothy J White; Sebastian Niehus; Guillaume Holley; Bjarni V Halldórsson; Birte Kehr

doi:10.1093/bioinformatics/btab749

Bioinformatics. 2022 Jan 12;38(3):604-611. doi: 10.1093/bioinformatics/btab749.

Authors

Thomas Krannich¹, W Timothy J White², Sebastian Niehus³, Guillaume Holley⁴, Bjarni V Halldórsson⁴ ⁵, Birte Kehr¹ ³ ⁶

Affiliations

¹ Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany.
² Google Inc., 8002 Zürich, Switzerland.
³ Regensburg Center for Interventional Immunology (RCI), 93053 Regensburg, Germany.
⁴ deCODE Genetics, Reykjavík 102, Iceland.
⁵ Department of Engineering, School of Technology, Reykjavík University, Reykjavík 102, Iceland.
⁶ Universität Regensburg, 93053 Regensburg, Germany.

Abstract

Motivation: With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention than other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly, which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes.

Results: We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets.
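For readers unfamiliar with the data structure named above, the following is a minimal Python sketch of a colored de Bruijn graph built from per-sample contigs. It is an illustration only and not the PopIns2 implementation: the function names (`canonical`, `build_colored_dbg`), the toy contigs, and the choice of k are all assumptions made here, and the sketch stores only k-mers and their colors, without the graph traversal or merging logic described in the article.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (hypothetical helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_colored_dbg(contigs_by_sample, k):
    """Map each canonical k-mer to the set of samples (colors) that contain it.

    contigs_by_sample: dict of sample name -> list of contig strings (toy input).
    """
    colors = defaultdict(set)
    for sample, contigs in contigs_by_sample.items():
        for contig in contigs:
            # Slide a window of length k over the contig.
            for i in range(len(contig) - k + 1):
                colors[canonical(contig[i:i + k])].add(sample)
    return colors


# Toy example: two samples whose contigs overlap.
toy_contigs = {"s1": ["ACGTAC"], "s2": ["CGTACG"]}
graph = build_colored_dbg(toy_contigs, k=4)
for kmer in sorted(graph):
    print(kmer, sorted(graph[kmer]))
```

In this toy input, k-mers shared by both contigs carry both colors, while a k-mer seen in only one sample keeps a single color. This per-k-mer sample annotation is what distinguishes a colored de Bruijn graph from a plain one.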

Availability and implementation: The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Genome, Human
High-Throughput Nucleotide Sequencing / methods
Humans
Reproducibility of Results
Sequence Analysis, DNA / methods
Software*
