Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations

Michael Elgart; Genevieve Lyons; Santiago Romero-Brufau; Nuzulul Kurniansyah; Jennifer A Brody; Xiuqing Guo; Henry J Lin; Laura Raffield; Yan Gao; Han Chen; Paul de Vries; Donald M Lloyd-Jones; Leslie A Lange; Gina M Peloso; Myriam Fornage; Jerome I Rotter; Stephen S Rich; Alanna C Morrison; Bruce M Psaty; Daniel Levy; Susan Redline; NHLBI’s Trans-Omics in Precision Medicine (TOPMed) Consortium; Tamar Sofer

doi:10.1038/s42003-022-03812-z

Commun Biol. 2022 Aug 22;5(1):856. doi: 10.1038/s42003-022-03812-z.

Authors

Michael Elgart^#^{1,2}, Genevieve Lyons^#^{3,4}, Santiago Romero-Brufau^{4,5}, Nuzulul Kurniansyah³, Jennifer A Brody⁶, Xiuqing Guo⁷, Henry J Lin⁷, Laura Raffield⁸, Yan Gao⁹, Han Chen^{10,11}, Paul de Vries¹⁰, Donald M Lloyd-Jones¹², Leslie A Lange¹³, Gina M Peloso¹⁴, Myriam Fornage^{10,15}, Jerome I Rotter⁷, Stephen S Rich¹⁶, Alanna C Morrison¹⁰, Bruce M Psaty¹⁷, Daniel Levy^{18,19}, Susan Redline^{3,20}; NHLBI’s Trans-Omics in Precision Medicine (TOPMed) Consortium; Tamar Sofer^{21,22,23}

Collaborator

NHLBI’s Trans-Omics in Precision Medicine (TOPMed) Consortium:
Paul de Vries

Affiliations

¹ Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
² Department of Medicine, Harvard Medical School, Boston, MA, USA.
³ Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
⁴ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
⁵ Department of Medicine, Mayo Clinic, Rochester, MN, USA.
⁶ Cardiovascular Health Research Unit, Department of Medicine, University of Washington, Seattle, WA, USA.
⁷ The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA.
⁸ Department of Genetics, University of North Carolina, Chapel Hill, NC, USA.
⁹ The Jackson Heart Study, University of Mississippi Medical Center, Jackson, MS, USA.
¹⁰ Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA.
¹¹ Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
¹² Department of Preventive Medicine, Northwestern University, Chicago, IL, USA.
¹³ Department of Medicine, University of Colorado Denver, Anschutz Medical Campus, Aurora, CO, USA.
¹⁴ Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA.
¹⁵ Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA.
¹⁶ Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA.
¹⁷ Cardiovascular Health Research Unit, Departments of Medicine, Epidemiology, and Health Services, University of Washington, Seattle, WA, USA.
¹⁸ The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA.
¹⁹ The Framingham Heart Study, Framingham, MA, USA.
²⁰ Department of Medicine, Harvard Medical School, Boston, MA, USA.
²¹ Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
²² Department of Medicine, Harvard Medical School, Boston, MA, USA.
²³ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.

^# Contributed equally.

Abstract

Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this via a machine learning approach, validated in nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard, linear PRS model developed using PRSice, LDpred2, and lassosum2. Combining a PRS as a feature in an XGBoost model results in a relative increase in the percentage variance explained compared to the standard linear PRS model by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Multi-ancestry trained models perform similarly to specific racial/ethnic group trained models and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
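
The percentages quoted in the abstract are relative, not absolute, increases in variance explained. The following minimal Python sketch shows how such a figure could be computed. It assumes that variance explained is measured as 1 - SS_res / SS_tot, which may differ from the exact definition used in the paper. The function names and the example values (10% and 12.2%) are hypothetical illustrations, not results from the study.

```python
def variance_explained(y_true, y_pred):
    """Proportion of variance explained, assumed here to be 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def relative_increase(pve_baseline, pve_new):
    """Relative gain of a new model over a baseline, as a fraction of the baseline."""
    return (pve_new - pve_baseline) / pve_baseline


# Hypothetical values: a linear PRS explaining 10% of the variance and a
# non-linear model explaining 12.2% give a relative increase of about 22%.
gain = relative_increase(0.10, 0.122)
print(f"relative increase: {gain:.0%}")
```

Under this reading, a 100% relative increase means the non-linear model explains twice the variance of the linear PRS, not that it explains all of the variance.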

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Genetic Predisposition to Disease
Genome-Wide Association Study*
Humans
Machine Learning
Multifactorial Inheritance
Polymorphism, Single Nucleotide*

Associated data

figshare/10.6084/m9.figshare.20304135.v1
figshare/10.6084/m9.figshare.20301423.v1
