HLA haplotype frequencies are estimated from ambiguous, unphased HLA genotyping data using expectation-maximization (EM) algorithms. Current population genetics methods require an independent EM frequency estimate for each population and assume that each population is in Hardy-Weinberg equilibrium (HWE). This HWE assumption has thus far led to the exclusion of individuals of mixed or unknown ethnic background from reference datasets. Multi-region populations are currently poorly served by stem cell donor registry HLA imputation and matching implementations, because these algorithms cannot incorporate admixture into their population genetics models. To address this unmet need, we have expanded the imputation component of our GRaph IMputation and Matching (GRIMM) framework so that imputation becomes the expectation step of an iterative EM algorithm. Our novel multi-region EM implementation treats region as a Bayesian prior, which enables integration of HLA information from multiple single-region population groups and, for the first time, the inclusion of individuals with ambiguous or mixed ethnic backgrounds. When tested on real datasets of US donor registry HLA typings and on simulated multi-region datasets of ambiguous HLA typings, our multi-region EM produces much higher likelihood values and better haplotype recovery, as measured by Kullback-Leibler divergence, than all other evaluated EM implementations.
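For readers unfamiliar with the baseline, the following is a minimal sketch of a classical single-population EM for haplotype frequencies under HWE. It is not the GRIMM multi-region implementation described here: it has no region prior and no graph-based imputation. The function names (`phase_candidates`, `em_haplotype_frequencies`), the genotype encoding, and the allele labels are assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import product


def phase_candidates(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a tuple of per-locus allele pairs (hypothetical encoding),
    e.g. (("A1", "A2"), ("B7", "B8")).
    """
    first, *rest = genotype
    pairs = set()
    # Fix the orientation of the first locus; try both orientations of the others.
    for flips in product((False, True), repeat=len(rest)):
        hap1, hap2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            if flip:
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        # Sorting makes the pair unordered, so duplicate phasings collapse.
        pairs.add(tuple(sorted((tuple(hap1), tuple(hap2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=100, tol=1e-9):
    """Estimate haplotype frequencies by EM, assuming one population in HWE."""
    candidates = [phase_candidates(g) for g in genotypes]
    haplotypes = sorted({h for pairs in candidates for pair in pairs for h in pair})
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start

    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in candidates:
            # E-step: posterior probability of each phasing under HWE.
            # A heterozygous haplotype pair can arise in two orders, hence the 2.
            weights = [
                freqs[h1] * freqs[h2] * (1 if h1 == h2 else 2) for h1, h2 in pairs
            ]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        new_freqs = {h: counts[h] / (2 * len(genotypes)) for h in haplotypes}
        delta = max(abs(new_freqs[h] - freqs[h]) for h in haplotypes)
        freqs = new_freqs
        if delta < tol:
            break
    return freqs
```

In this sketch the E-step plays the role that imputation plays in the framework above: it distributes each ambiguous typing over its compatible haplotype pairs, and the M-step re-estimates frequencies from those expected counts. Because every individual is weighted by a single set of frequencies, a sample that mixes populations violates the HWE assumption, which is the limitation the multi-region approach is meant to address.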
Keywords: HLA; Haplotype frequencies; Multi-region expectation-maximization algorithm.
Copyright © 2021 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.