Two-sample comparison based on prediction error, with applications to candidate gene association studies

Ann Hum Genet. 2007 Jan;71(Pt 1):107-18. doi: 10.1111/j.1469-1809.2006.00306.x.

Abstract

To take advantage of the increasingly available high-density SNP maps across the genome, various tests that compare multilocus genotypes or estimated haplotypes between cases and controls have been developed for candidate gene association studies. Here we view this two-sample testing problem from the perspective of supervised machine learning and propose a new association test. The approach adopts the flexible and easy-to-understand classification tree model as the learning machine, and uses the estimated prediction error of the resulting prediction rule as the test statistic. This procedure not only provides an association test but also generates a prediction rule that can be useful in understanding the mechanisms underlying complex disease. Under the set-up of a haplotype-based transmission/disequilibrium test (TDT) type of analysis, we find through simulation studies that the proposed procedure has the correct type I error rate and is robust to population stratification. The power of the proposed procedure is sensitive to the choice of prediction error estimator. Among commonly used prediction error estimators, the .632+ estimator yields the test with the best overall performance. We also find that the test using the .632+ estimator is more powerful than the standard single-point TDT analysis, Pearson's goodness-of-fit test based on estimated haplotype frequencies, and two haplotype-based global tests implemented in the genetic analysis package FBAT. To illustrate the application of the proposed method in population-based association studies, we use the procedure to study the association between non-Hodgkin lymphoma and the IL10 gene.
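As a rough illustration of the .632+ estimator named above, the following Python sketch combines an apparent (resubstitution) error and a leave-one-out bootstrap error using the standard Efron and Tibshirani weighting. The function names `no_information_rate` and `err_632_plus` are hypothetical, the labels are assumed to be coded 0/1 (control/case), and the abstract does not describe how the authors compute these inputs or calibrate the resulting test, so this is a sketch under those assumptions rather than the paper's implementation.

```python
def no_information_rate(y_true, y_pred):
    """Error rate expected if labels and predictions were independent.

    Assumes binary labels coded 0/1 (hypothetical coding: 0 = control, 1 = case).
    """
    n = len(y_true)
    p1 = sum(y_true) / n  # observed proportion of cases
    q1 = sum(y_pred) / n  # proportion of samples predicted as cases
    return p1 * (1 - q1) + (1 - p1) * q1


def err_632_plus(apparent, loo_boot, gamma):
    """Combine error estimates with the .632+ weighting.

    apparent : error of the rule on its own training data
    loo_boot : leave-one-out bootstrap error
    gamma    : no-information error rate
    """
    # Cap the bootstrap error at the no-information rate.
    loo = min(loo_boot, gamma)
    # Relative overfitting rate, set to 0 outside its meaningful range.
    if loo > apparent and gamma > apparent:
        r = (loo - apparent) / (gamma - apparent)
    else:
        r = 0.0
    # The weight moves from .632 (no overfitting) to 1 (maximal overfitting).
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * apparent + w * loo


if __name__ == "__main__":
    # Made-up numbers purely for illustration; not results from the paper.
    gamma = no_information_rate([0, 0, 1, 1], [0, 1, 1, 1])
    print(err_632_plus(apparent=0.1, loo_boot=0.3, gamma=gamma))
```

In a test of the kind the abstract describes, a value like this would serve as the statistic, with small estimated error indicating that cases and controls can be distinguished. One plausible way to obtain a null distribution is to recompute it under permuted case/control labels, though that is an assumption here, not something the abstract states.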

Publication types

  • Evaluation Study
  • Research Support, N.I.H., Intramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adult
  • Aged
  • Aged, 80 and over
  • Algorithms
  • Artificial Intelligence
  • Case-Control Studies
  • Computer Simulation*
  • Female
  • Genetic Predisposition to Disease*
  • Haplotypes
  • Humans
  • Interleukin-10 / genetics*
  • Lymphoma, Non-Hodgkin / genetics*
  • Middle Aged
  • Models, Genetic*
  • Multifactorial Inheritance
  • Nuclear Family
  • Polymorphism, Single Nucleotide
  • Predictive Value of Tests
  • Sensitivity and Specificity

Substances

  • Interleukin-10