All of Us diversity and scale improve polygenic prediction contextually with greatest improvements for under-represented populations

Kristin Tsuo; Zhuozheng Shi; Tian Ge; Ravi Mandla; Kangcheng Hou; Yi Ding; Bogdan Pasaniuc; Ying Wang; Alicia R Martin

doi:10.1101/2024.08.06.606846

bioRxiv [Preprint]. 2024 Aug 6:2024.08.06.606846. doi: 10.1101/2024.08.06.606846.

Authors

Kristin Tsuo¹,²,³, Zhuozheng Shi⁴, Tian Ge²,⁵,⁶,⁷, Ravi Mandla⁴, Kangcheng Hou⁴, Yi Ding⁴, Bogdan Pasaniuc⁴,⁸,⁹,¹⁰,¹¹, Ying Wang¹,²,³, Alicia R Martin¹,²,³

Affiliations

¹ Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.
² Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
³ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁴ Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA 90095, USA.
⁵ Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
⁶ Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
⁷ Center for Precision Psychiatry, Massachusetts General Hospital, Boston, MA, USA.
⁸ Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA.
⁹ Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA.
¹⁰ Institute for Precision Health, University of California, Los Angeles, Los Angeles, CA 90095, USA.
¹¹ Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA.

Abstract

Recent studies have demonstrated that polygenic risk scores (PRS) trained on multi-ancestry data can improve prediction accuracy in groups historically underrepresented in genomic studies, but the availability of linked health and genetic data from large-scale diverse cohorts representative of a wide spectrum of human diversity remains limited. To address this need, the All of Us Research Program (AoU) generated whole-genome sequences of 245,388 individuals who collectively reflect the diversity of the USA. Leveraging this resource and another widely used population-scale biobank, the UK Biobank (UKB), with half a million participants, we developed PRS trained on multi-ancestry and multi-biobank data with up to ~750,000 participants for 32 common, complex traits and diseases across a range of genetic architectures. We then compared the effects of ancestry, PRS methodology, and genetic architecture on PRS accuracy across a held-out subset of ancestrally diverse AoU participants. Due to the more heterogeneous study design of AoU, we found lower average heritability in AoU than in UKB (0.075 vs. 0.165), which limited the maximal achievable PRS accuracy in AoU. Overall, we found that the increased diversity of AoU significantly improved PRS performance in some AoU participants, especially underrepresented individuals, across multiple phenotypes. Notably, maximizing sample size by combining discovery data across AoU and UKB was not the optimal approach for predicting some phenotypes in African ancestry populations; rather, using data from AoU alone for these traits resulted in the greatest accuracy. This was especially true for less polygenic traits with large ancestry-enriched effects, such as neutrophil count (R²: 0.055 using AoU alone vs. 0.035 using the cross-biobank meta-analysis, owing to loci such as DARC). Lastly, we calculated individual-level PRS accuracies rather than grouping by continental ancestry, a critical step towards interpretability in precision medicine. Individualized PRS accuracy decayed linearly as a function of ancestry divergence, but the slope was smaller using multi-ancestry GWAS than using European GWAS. Our results highlight the potential of biobanks with more balanced representations of human diversity to facilitate more accurate PRS for the individuals least represented in genomic studies.
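
The comparison of decay slopes described above can be illustrated with a minimal sketch. The helper `ols_slope_intercept`, the variable names, and all numeric values are hypothetical and invented for illustration; they are not data or code from the study, and the study's actual measure of ancestry divergence and fitting procedure may differ. The sketch fits a straight line to accuracy as a function of genetic distance for two sets of toy values and compares the slopes.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y ~= intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic toy values, invented for illustration only (NOT study results).
# genetic_distance: hypothetical per-individual divergence from the training data.
genetic_distance = [0.0, 0.25, 0.5, 0.75, 1.0]
accuracy_eur_gwas = [0.10, 0.08, 0.06, 0.04, 0.02]    # toy accuracies, European GWAS
accuracy_multi_gwas = [0.10, 0.09, 0.08, 0.07, 0.06]  # toy accuracies, multi-ancestry GWAS

slope_eur, intercept_eur = ols_slope_intercept(genetic_distance, accuracy_eur_gwas)
slope_multi, intercept_multi = ols_slope_intercept(genetic_distance, accuracy_multi_gwas)

# Both slopes are negative (accuracy decays with distance); a smaller
# magnitude means accuracy is lost more slowly as divergence increases.
print(f"European GWAS slope:       {slope_eur:.3f}")
print(f"Multi-ancestry GWAS slope: {slope_multi:.3f}")
```

In this toy setup the multi-ancestry slope has the smaller magnitude, which is the qualitative pattern the abstract reports; the magnitudes themselves carry no meaning.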

Publication types

Preprint
