Opening the Black Box of Imputation Software to Study the Impact of Reference Panel Composition on Performance

Genes (Basel). 2023 Feb 4;14(2):410. doi: 10.3390/genes14020410.

Abstract

Genotype imputation is widely used to enrich genetic datasets. The operation relies on panels of known reference haplotypes, typically derived from whole-genome sequencing data. How to choose a reference panel has been widely studied, and it is essential that the panel be well matched to the individuals whose missing genotypes are to be imputed. However, it is also broadly accepted that an imputation panel performs better when it includes diversity (haplotypes from many different populations). We investigate this observation by examining, in fine detail, exactly which reference haplotypes contribute at different regions of the genome. This is achieved using a novel method of inserting synthetic genetic variation into the reference panel in order to track the performance of leading imputation algorithms. We show that while diversity may improve imputation accuracy globally, incorrect genotypes are occasionally imputed following the inclusion of more diverse haplotypes in the reference panel. We also demonstrate a technique for retaining and benefitting from the diversity in the reference panel whilst avoiding these occasional adverse effects on imputation accuracy. Our results elucidate the role of diversity in a reference panel more clearly than previous studies.
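
The tracking idea in the abstract can be illustrated with a small Python sketch. It appends one synthetic marker site per population group to a toy reference panel, so that the marker alleles carried by a copied haplotype show which group it came from. The function names, the toy data, and the nearest-match copying rule are all assumptions made for illustration. Real imputation software uses haplotype-copying hidden Markov models, and this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def add_synthetic_markers(
    panel: Sequence[Sequence[int]], groups: Sequence[str]
) -> Tuple[List[List[int]], List[str]]:
    """Append one synthetic site per group (hypothetical scheme).

    At the site for group g, allele 1 is carried only by haplotypes in g.
    """
    labels = sorted(set(groups))
    tagged = [
        list(hap) + [1 if grp == lab else 0 for lab in labels]
        for hap, grp in zip(panel, groups)
    ]
    return tagged, labels


def nearest_haplotype(
    target: Sequence[object], panel: Sequence[Sequence[int]], observed: Sequence[int]
) -> int:
    """Toy stand-in for imputation: pick the reference haplotype with the
    fewest mismatches to the target at the observed (genotyped) sites."""
    def mismatches(hap: Sequence[int]) -> int:
        return sum(hap[i] != target[i] for i in observed)

    return min(range(len(panel)), key=lambda k: mismatches(panel[k]))


def trace_source(
    target: Sequence[object],
    panel: Sequence[Sequence[int]],
    groups: Sequence[str],
    observed: Sequence[int],
) -> str:
    """Return the group whose synthetic marker is carried by the copied haplotype."""
    tagged, labels = add_synthetic_markers(panel, groups)
    k = nearest_haplotype(target, tagged, observed)
    markers = tagged[k][len(panel[0]):]  # synthetic sites sit after the real ones
    return labels[markers.index(1)]


# Toy panel: three haplotypes over four real sites, from two groups.
panel = [[0, 0, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0]]
groups = ["A", "A", "B"]
# Target genotyped at sites 0 and 2 only; None marks sites to be imputed.
target = [1, None, 1, None]
source_group = trace_source(target, panel, groups, observed=[0, 2])
```

In this toy setting, the synthetic sites behave like ordinary variants from the copying rule's point of view, so reading them back after copying reveals the source group without changing how the copy is chosen.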

Keywords: admixture; genotype imputation; population genetics; rare variants; reference panel.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Gene Frequency
  • Genome-Wide Association Study* / methods
  • Genotype
  • Humans
  • Polymorphism, Single Nucleotide*
  • Software

Grants and funding

This work is funded by the French Ministry of Research for the POPGEN project in the framework of the French initiative for genomic medicine (Plan France Médecine Génomique 2025; PFMG 2025; https://pfmg2025.aviesan.fr/, accessed on 23 December 2021) and by the INSERM cross-cutting program Genomic Variability 2018 GOLD (https://aviesan.fr/gold, accessed on 23 December 2021). AFH is funded by POPGEN; TD is funded by GOLD.