Correction for population stratification in random forest analysis

Yang Zhao; Feng Chen; Rihong Zhai; Xihong Lin; Zhaoxi Wang; Li Su; David C Christiani

doi:10.1093/ije/dys183

Int J Epidemiol. 2012 Dec;41(6):1798-806. doi: 10.1093/ije/dys183. Epub 2012 Nov 12.

Authors

Yang Zhao¹, Feng Chen, Rihong Zhai, Xihong Lin, Zhaoxi Wang, Li Su, David C Christiani

Affiliation

¹ Environmental and Occupational Medicine and Epidemiology Program, Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, MA, USA.

Abstract

Background: Population structure (PS), including population stratification and admixture, is a significant confounder in genome-wide association studies (GWAS) because it can produce spurious associations. Random forest (RF) has been increasingly applied to GWAS data because of its advantages in analysing high-dimensional genetic data. RF produces importance measures for single nucleotide polymorphisms (SNPs), which are helpful for feature selection. However, if PS is not appropriately corrected, RF tends to assign high importance to disease-unrelated SNPs whose allele or genotype frequencies differ among subpopulations, leading to inaccurate results.

Methods: In this study, the authors propose to correct for the confounding effect of PS by incorporating PS information into the RF analysis. The correction procedure starts by extracting PS information from a large number of structure inference SNPs, using either EIGENSTRAT or a multi-dimensional scaling clustering procedure. The phenotype and genotypes, adjusted for this PS information, are then used as the outcome and predictors in the RF analysis.
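
The procedure described in the Methods can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: it assumes a plain principal component analysis (via SVD) as a stand-in for EIGENSTRAT, assumes that "adjusted" means taking least-squares residuals after regressing on the top components, and uses scikit-learn's RandomForestRegressor as the RF. The function names, the number of components and the forest settings are all hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def top_principal_components(structure_snps, n_components=2):
    """Sample scores on the leading principal components of the
    structure inference SNPs (assumed stand-in for EIGENSTRAT)."""
    centered = structure_snps - structure_snps.mean(axis=0)
    # SVD of the centered genotype matrix; u * s gives the sample scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def residualize(values, covariates):
    """Remove the linear effect of the covariates (plus an intercept)
    from a vector or from every column of a matrix."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def ps_adjusted_importance(phenotype, genotypes, structure_snps,
                           n_components=2, seed=0):
    """Fit an RF on PS-adjusted phenotype and genotypes and return
    the per-SNP importance measures."""
    pcs = top_principal_components(structure_snps, n_components)
    y_adj = residualize(phenotype, pcs)   # adjusted outcome
    x_adj = residualize(genotypes, pcs)   # adjusted predictors
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(x_adj, y_adj)
    return forest.feature_importances_
```

Because the adjusted phenotype is a residual, it is continuous even when the original trait is binary, which is why a regression forest is assumed here.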

Results: Extensive simulations indicate that the importance measure of the causal SNP increases following the PS correction. In an analysis of a real dataset, the proposed correction removes the spurious association between the lactase gene and height.

Conclusion: The authors propose a simple method to correct for PS in RF analysis on GWAS data. Further studies in real GWAS datasets are required to validate the robustness of the proposed approach.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Confounding Factors, Epidemiologic
Data Interpretation, Statistical*
Genetics, Population / methods*
Genome-Wide Association Study / methods*
Humans
Models, Statistical
Polymorphism, Single Nucleotide*
