Estimation of human leukocyte antigen (HLA) haplotype frequencies from unrelated stem cell donor registries presents a challenge because of large sample sizes and heterogeneity of HLA typing data. For the 14th International HLA and Immunogenetics Workshop, five bioinformatics groups initiated the 'Registry Diversity Component', aiming to cross-validate and improve current haplotype estimation tools. Five datasets were derived from different donor registries and then used as input for five different computer programs for haplotype frequency estimation. Because of issues related to heterogeneity and complexity of HLA typing data identified in the initial phase, the same five implementations, and two new ones, were used on simulated datasets in a controlled experiment where the correct results were known a priori. These datasets were modeled after European haplotype frequencies and contained various fractions of donors with missing HLA-DR typing. We measured the contribution of sampling fluctuation and estimation error to the deviation of the estimated frequencies from their true values, finding equivalent contributions of each for the chosen samples. Because of patient-directed activities, selective prospective typing strategies and the variety and evolution of typing technology, some donors have more complete and better HLA data than others. In this setting, we show that restricting estimation to fully typed individuals introduces biases that could be overcome by including all donors in frequency estimation. Our study underlines the importance of critical review and validation of tools in registry-related activity and provides a sustainable framework for validating the computational tools used. Accurate frequencies are essential for match prediction to improve registry operations and to help more patients identify suitably matched donors.
Keywords: donor registry; expectation-maximization; haplotype frequency estimation; hematopoietic stem-cell transplant; human leukocyte antigen; typing ambiguity.
© 2013 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
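The idea of including donors with missing HLA-DR typing in an expectation-maximization estimate can be illustrated with a minimal sketch. It is not any of the programs compared in the study. It assumes a hypothetical two-locus model (A and DR), Hardy-Weinberg proportions, unambiguous allele-level typing where a locus is typed, and a known list of possible DR alleles. The names `compatible_pairs` and `em_haplotype_frequencies` and the toy donors are invented for illustration.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(a_pair, dr_pair, dr_alleles):
    """Return all ordered haplotype pairs consistent with one donor's typing.

    a_pair is the donor's two A alleles; dr_pair is the two DR alleles,
    or None when DR was not typed (assumption: A is always typed).
    """
    a1, a2 = a_pair
    if dr_pair is None:
        # Missing DR: every ordered combination of DR alleles is possible.
        dr_options = list(product(dr_alleles, repeat=2))
    else:
        d1, d2 = dr_pair
        dr_options = [(d1, d2), (d2, d1)]  # both phase assignments
    pairs = set()
    for x, y in dr_options:
        pairs.add(((a1, x), (a2, y)))
        pairs.add(((a2, y), (a1, x)))
    return pairs


def em_haplotype_frequencies(donors, dr_alleles, iterations=200):
    """Estimate haplotype frequencies by EM from all donors, typed or partial."""
    candidates = [compatible_pairs(a, dr, dr_alleles) for a, dr in donors]
    haplotypes = sorted({h for pairs in candidates for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    for _ in range(iterations):
        counts = defaultdict(float)
        for pairs in candidates:
            # E-step: posterior weight of each compatible haplotype pair.
            weights = {p: freq[p[0]] * freq[p[1]] for p in pairs}
            total = sum(weights.values())
            for (h1, h2), w in weights.items():
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by 2N chromosomes.
        freq = {h: counts[h] / (2 * len(donors)) for h in haplotypes}
    return freq


# Toy data (hypothetical): three fully typed donors and one without DR typing.
donors = [
    (("A1", "A2"), ("DR1", "DR2")),
    (("A1", "A1"), ("DR1", "DR1")),
    (("A2", "A2"), ("DR2", "DR2")),
    (("A1", "A2"), None),
]
freq = em_haplotype_frequencies(donors, ["DR1", "DR2"])
for haplotype, f in sorted(freq.items()):
    print(haplotype, round(f, 4))
```

In this sketch a donor without DR typing still contributes its A alleles to the expected counts, spread over the DR alleles in proportion to the current frequency estimates, rather than being dropped from the sample.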