Personalized pangenome references

Jouni Sirén; Parsa Eskandar; Matteo Tommaso Ungaro; Glenn Hickey; Jordan M Eizenga; Adam M Novak; Xian Chang; Pi-Chuan Chang; Mikhail Kolmogorov; Andrew Carroll; Jean Monlong; Benedict Paten

doi:10.1038/s41592-024-02407-2

Nat Methods. 2024 Nov;21(11):2017-2023. doi: 10.1038/s41592-024-02407-2. Epub 2024 Sep 11.

Authors

Jouni Sirén¹, Parsa Eskandar², Matteo Tommaso Ungaro²,³, Glenn Hickey², Jordan M Eizenga², Adam M Novak², Xian Chang², Pi-Chuan Chang⁴, Mikhail Kolmogorov⁵, Andrew Carroll⁴, Jean Monlong²,⁶, Benedict Paten⁷

Affiliations

¹ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
² UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
³ University of Ferrara, Ferrara, Italy.
⁴ Google LLC, Mountain View, CA, USA.
⁵ Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
⁶ Institut de Recherche en Santé Digestive, Université de Toulouse, INSERM, INRA, ENVT, UPS, Toulouse, France.
⁷ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.

PMID: 39261641
DOI: 10.1038/s41592-024-02407-2

Abstract

Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. The approach reduces small-variant genotyping errors fourfold relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
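The idea of sampling local haplotypes according to k-mer counts in the reads can be illustrated with a toy sketch. It is not the vg implementation: the function names, the +1/-1 scoring rule, the value of k, and the example sequences are all assumptions made for illustration. Within one local block, each candidate haplotype is scored by whether its informative k-mers (those not shared by every haplotype) appear in the reads, and the best-scoring haplotypes are kept for the personalized subgraph.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_read_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_haplotypes(haplotypes, reads, k=5, n=2):
    """Pick the n haplotypes in one block that best match the read k-mers.

    Toy scoring rule (an assumption, not the rule used in vg):
    +1 for each informative k-mer seen in the reads, -1 for each one absent.
    """
    read_counts = count_read_kmers(reads, k)
    kmer_sets = {name: kmers(seq, k) for name, seq in haplotypes.items()}
    # k-mers present in every haplotype cannot distinguish between them.
    shared = set.intersection(*kmer_sets.values())
    scores = {}
    for name, ks in kmer_sets.items():
        informative = ks - shared
        scores[name] = sum(1 if read_counts[km] > 0 else -1 for km in informative)
    # Highest score first; ties broken by name for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n], scores


# Hypothetical block with three haplotypes that differ near the middle.
example_haplotypes = {
    "hapA": "ACGTACGTTAGC",
    "hapB": "ACGTACCTTAGC",
    "hapC": "ACGTAGGTTAGC",
}
# Hypothetical reads drawn from hapB.
example_reads = ["ACGTACCTT", "TACCTTAGC"]

selected, scores = sample_haplotypes(example_haplotypes, example_reads, k=5, n=2)
print(selected, scores)
```

In this sketch the haplotype that the reads came from ranks first, and haplotypes whose distinguishing k-mers are absent from the reads are pushed down or excluded. That is the contrast with frequency-based filtering: selection depends on the sample's own reads rather than on how common a variant is in the population.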

MeSH terms

Algorithms
Gene Frequency
Genetic Variation
Genome, Human*
Genomics / methods
Haplotypes
High-Throughput Nucleotide Sequencing / methods
Humans
Software*
