Correction for population stratification in random forest analysis

Int J Epidemiol. 2012 Dec;41(6):1798-806. doi: 10.1093/ije/dys183. Epub 2012 Nov 12.

Abstract

Background: Population structure (PS), including population stratification and admixture, is a significant confounder in genome-wide association studies (GWAS), as it may produce spurious associations. Random forest (RF) has been increasingly applied in GWAS data analysis because of its advantage in analysing high dimensional genetic data. RF creates importance measures for single nucleotide polymorphisms (SNPs), which are helpful for feature selections. However, if PS is not appropriately corrected, RF tends to give high importance to disease-unrelated SNPs with different frequencies of allele or genotype among subpopulations, leading to inaccurate results.

Methods: In this study, the authors propose to correct for the confounding effect of PS by including the information of PS in RF analysis. The correction procedure starts by extracting the information of PS using EIGENSTRAT or multi-dimensional scaling clustering procedure from a large number of structure inference SNPs. Phenotype and genotypes adjusted by the information of PS are then used as the outcome and predictors in RF analysis.

Results: Extensive simulations indicate that the importance measure of the causal SNP is increased following the PS correction. By analysing a real dataset, the proposed correction removes the spurious association between the lactase gene and height.

Conclusion: The authors propose a simple method to correct for PS in RF analysis on GWAS data. Further studies in real GWAS datasets are required to validate the robustness of the proposed approach.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Confounding Factors, Epidemiologic
  • Data Interpretation, Statistical*
  • Genetics, Population / methods*
  • Genome-Wide Association Study / methods*
  • Humans
  • Models, Statistical
  • Polymorphism, Single Nucleotide*