Semi-supervised machine learning method for predicting homogeneous ancestry groups to assess Hardy-Weinberg equilibrium in diverse whole-genome sequencing studies

Derek Shyr; Rounak Dey; Xihao Li; Hufeng Zhou; Eric Boerwinkle; Steve Buyske; Mark Daly; Richard A Gibbs; Ira Hall; Tara Matise; Catherine Reeves; Nathan O Stitziel; Michael Zody; Benjamin M Neale; Xihong Lin

doi:10.1016/j.ajhg.2024.08.018

Am J Hum Genet. 2024 Oct 3;111(10):2129-2138. doi: 10.1016/j.ajhg.2024.08.018. Epub 2024 Sep 12.

Authors

Derek Shyr¹, Rounak Dey¹, Xihao Li², Hufeng Zhou¹, Eric Boerwinkle³, Steve Buyske⁴, Mark Daly⁵, Richard A Gibbs⁶, Ira Hall⁷, Tara Matise⁸, Catherine Reeves⁹, Nathan O Stitziel¹⁰, Michael Zody⁹, Benjamin M Neale⁵, Xihong Lin¹¹

Affiliations

¹ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.
² Department of Biostatistics, University of North Carolina at Chapel Hill Gillings School of Global Public Health, Chapel Hill, NC 27599, USA.
³ Department of Epidemiology, Human Genetics and Environmental Sciences, The University of Texas Health Science Center at Houston School of Public Health, Houston, TX 77030, USA.
⁴ Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA.
⁵ Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA.
⁶ Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA.
⁷ Center for Genomic Health, Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA.
⁸ Department of Genetics, Rutgers University, Piscataway, NJ 08854, USA.
⁹ New York Genome Center, New York, NY 10013, USA.
¹⁰ Department of Medicine, Washington University School of Medicine, St. Louis, MO 63110, USA; Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA.
¹¹ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Department of Statistics, Harvard University, Cambridge, MA 02115, USA.

PMID: 39270648
PMCID: PMC11480788 (available on 2025-04-03)
DOI: 10.1016/j.ajhg.2024.08.018

Abstract

Large-scale, multi-ethnic whole-genome sequencing (WGS) studies, such as the National Human Genome Research Institute Genome Sequencing Program's Centers for Common Disease Genomics (CCDG), play an important role in increasing diversity for genetic research. Before performing association analyses, assessing Hardy-Weinberg equilibrium (HWE) is a crucial step in quality control procedures to remove low-quality variants and ensure valid downstream analyses. Diverse WGS studies contain ancestrally heterogeneous samples; however, commonly used HWE methods assume that the samples are homogeneous. Therefore, directly applying these methods to the whole dataset can yield statistically invalid results. To account for this heterogeneity, HWE can be tested on subsets of samples that have genetically homogeneous ancestries and the results aggregated at each variant. To facilitate valid HWE subset testing, we developed a semi-supervised learning approach that predicts homogeneous ancestries based on the genotype. This method provides a convenient tool for estimating HWE in the presence of population structure and missing self-reported race and ethnicity in diverse WGS studies. In addition, assessing HWE within the homogeneous ancestries provides reliable HWE estimates that will directly benefit downstream analyses, including association analyses in WGS studies. We applied our proposed method to the CCDG dataset, predicting homogeneous genetic ancestry groups for 60,545 multi-ethnic WGS samples to assess HWE within each group.
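
The subset-testing idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which HWE test or which aggregation rule is used, so the one-degree-of-freedom chi-square test and Fisher's method for combining p-values are assumptions chosen for illustration, and the group names and genotype counts are hypothetical. The sketch shows how two groups that are each in HWE, but with different allele frequencies, appear to violate HWE when pooled.

```python
import math

def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) HWE test from genotype counts at one biallelic variant."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic: no evidence against HWE
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2.0))

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method (an assumed rule).

    The statistic X = -2 * sum(log p) is chi-square with 2k df under the null;
    for even df the upper tail has the closed form used below.
    """
    k = len(pvalues)
    half = -sum(math.log(max(p, 1e-300)) for p in pvalues)  # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return min(1.0, math.exp(-half) * total)

# Hypothetical genotype counts (AA, AB, BB) for one variant in two
# predicted ancestry groups; each group is in HWE on its own.
groups = {
    "group_1": (10, 180, 810),   # allele A frequency 0.1
    "group_2": (810, 180, 10),   # allele A frequency 0.9
}

# Naive approach: pool all samples and test once.
pooled_counts = tuple(sum(c) for c in zip(*groups.values()))
pooled_p = hwe_chisq_pvalue(*pooled_counts)

# Subset approach: test within each group, then aggregate per variant.
group_pvalues = [hwe_chisq_pvalue(*counts) for counts in groups.values()]
combined_p = fisher_combine(group_pvalues)

print(f"pooled p-value:   {pooled_p:.3g}")
print(f"combined p-value: {combined_p:.3g}")
```

In this toy example the pooled test strongly rejects HWE because heterozygotes are depleted relative to the pooled allele frequency, while the within-group tests show no deviation. A variant filter based on the pooled test would therefore discard a variant that is well behaved in every group.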

Keywords: Hardy-Weinberg equilibrium; ancestry; machine learning; multi-ethnic whole-genome sequencing; semi-supervised learning; statistical genetics.

MeSH terms

Ethnicity / genetics
Genetics, Population / methods
Genome, Human
Genome-Wide Association Study / methods
Genotype
Humans
Polymorphism, Single Nucleotide
Supervised Machine Learning*
Whole Genome Sequencing* / methods