Identification of association between disease and multiple markers via sparse partial least-squares regression

Genet Epidemiol. 2011 Sep;35(6):479-86. doi: 10.1002/gepi.20596. Epub 2011 Jun 15.

Abstract

Although genome-wide association studies have led to the identification of hundreds of genes underlying dozens of traits in recent years, most published studies have primarily used single marker-based analysis. Intuitively, more information may be utilized when multiple markers are jointly analyzed. Therefore, many methods have been proposed in the literature for association analysis between traits and multiple markers. Among these methods, simulation and real data analyses have shown that it is often more effective to first reduce the dimensionality of the markers in a region through principal components analysis of all the markers, and then to perform association analysis between traits and those principal components that account for most of the genetic variation in the region. However, one major limitation of this approach is that the principal components are derived purely from marker genotypes, without consideration of their relevance to traits. Furthermore, these components are constructed as linear combinations of all the markers, even when only a limited number are potentially relevant to traits. In this manuscript, we propose the use of sparse partial least-squares regression to derive components that are linear combinations of only the relevant markers. This approach is able to use information from both traits and marker genotypes. Extensive simulations and real data analyses on a Crohn's disease data set suggest the superiority of this approach over existing methods.
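The contrast drawn in the abstract between principal components and sparse partial least-squares components can be illustrated with a minimal sketch. It is not the authors' implementation: it computes only the first component, uses soft-thresholding of the marker-trait covariance as one common way to induce sparsity, and runs on simulated data. The function names, the tuning parameter `eta`, the quantitative trait, and the simulated genotypes are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, lam):
    # Shrink each entry toward zero by lam; entries smaller than lam become 0.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def first_pca_direction(X):
    # First principal direction: depends on the genotypes only, not the trait.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

def first_sparse_pls_direction(X, y, eta=0.5):
    # Simplified first sparse PLS direction (an illustrative assumption):
    # soft-threshold the marker-trait covariance vector, then normalize.
    # eta in [0, 1) sets the threshold as a fraction of the largest |covariance|.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    z = Xc.T @ yc
    w = soft_threshold(z, eta * np.max(np.abs(z)))
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# Simulated data (hypothetical): 500 subjects, 20 markers coded 0/1/2,
# with only marker 3 influencing a quantitative trait.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = 0.8 * X[:, 3] + rng.normal(size=n)

w_pca = first_pca_direction(X)
w_spls = first_sparse_pls_direction(X, y, eta=0.5)

# Component scores that would be carried into the association test.
t_pca = (X - X.mean(axis=0)) @ w_pca
t_spls = (X - X.mean(axis=0)) @ w_spls

print("markers with nonzero PCA loading: ", np.count_nonzero(w_pca))
print("markers with nonzero SPLS loading:", np.count_nonzero(w_spls))
```

In this sketch the principal direction loads on the markers regardless of the trait, while the thresholded direction keeps only markers whose covariance with the trait is large relative to the strongest one. A full analysis would extract several components, choose the sparsity parameter by cross-validation, and handle a binary disease status rather than the quantitative trait assumed here.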

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Alleles
  • Computer Simulation
  • Genetic Markers
  • Genetic Predisposition to Disease
  • Genome-Wide Association Study / methods*
  • Genotype
  • Humans
  • Inflammatory Bowel Diseases / genetics*
  • Least-Squares Analysis
  • Models, Genetic
  • Models, Statistical
  • Molecular Epidemiology / methods*
  • Principal Component Analysis
  • Regression Analysis

Substances

  • Genetic Markers