Assessment and management of single nucleotide polymorphism genotype errors in genetic association analysis

Pac Symp Biocomput. 2001:18-29. doi: 10.1142/9789814447362_0003.

Abstract

Single nucleotide polymorphisms (SNPs) may be used in case-control designs to test for association between a marker (the SNP) and a disease. However, such designs usually assume that the genotype data are reported without error. We propose a method, the reduced penetrance model method (RPM), that allows for errors in a case-control design, in contrast to the full penetrance model method (FPM), which assumes the data are error-free. Pearson's χ² applied to a 2 × 2 contingency table is the test statistic considered. Additionally, we provide a likelihood method to estimate error rates using SNP genotype data in CEPH pedigrees. We test our method (RPM) against the standard method (FPM) using simulated data. All SNP loci are assumed to have two alleles, coded 1 and 2. We consider three pairs of error rates, two different sample sizes, and two sets of allele frequencies for the SNP locus. SNP genotype data in two populations are simulated under a null hypothesis (allele frequencies equal in both populations) and under an alternative hypothesis (allele frequencies differ between the two populations). The total number of simulations is 24: 12 under the null hypothesis and 12 under the alternative. The significance level threshold is 5%. For the null case, 9/12 (75%) of the simulations show no increase in type I error under RPM, while 3/12 (25%) show a slight increase (rejecting the null for at most 7% of the replicates). There is no increase in the type I error rate for the FPM method, which can also be shown analytically. For the alternative case (power), there is a consistent increase in power for the RPM method as compared to the FPM method, with an average increase of 0.02 for the simulations considered. When sample sizes are large, there is virtually no difference in power between the RPM and FPM methods. Also, the RPM method provides consistently more accurate allele frequency estimates for the various populations.
Our likelihood method to estimate error rates with CEPH pedigrees provides good estimates on average. The largest difference between a true error rate and our average estimated error rate is 0.006. However, there is a fair amount of variability in the estimates, suggesting the need for multiple experiments or larger numbers of CEPH pedigrees. Researchers may use the methods presented in this paper to (1) estimate error rates for their automated genotyping process, and (2) allow for such errors in association analyses, thereby increasing power to detect differences between allele frequencies in case and control populations when errors are present.
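The simulation design described in the abstract can be illustrated with a minimal sketch. It is not the authors' RPM or FPM implementation, whose details the abstract does not give. It only shows how Pearson's χ² on a 2 × 2 table of allele counts behaves when alleles are misread, under an assumed error model in which each allele is independently flipped to the other allele with a fixed probability. All function names, parameter names, and the error model itself are hypothetical choices made for illustration.

```python
import math
import random


def simulate_allele_counts(n_individuals, freq_allele_1, error_rate, rng):
    """Return observed [count of allele 1, count of allele 2] for one population.

    Assumption (not from the paper): each of the 2 * n_individuals alleles is
    drawn independently and then misread as the other allele with
    probability error_rate.
    """
    counts = [0, 0]
    for _ in range(2 * n_individuals):
        allele = 0 if rng.random() < freq_allele_1 else 1
        if rng.random() < error_rate:
            allele = 1 - allele  # genotyping error flips the allele call
        counts[allele] += 1
    return counts


def pearson_chi2_2x2(row_a, row_b):
    """Pearson's chi-square statistic (no continuity correction) and p-value
    for a 2 x 2 table given as two rows of counts."""
    a, b = row_a
    c, d = row_b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence against the null
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def rejection_rate(n_individuals, freq_cases, freq_controls, error_rate,
                   replicates=1000, alpha=0.05, seed=0):
    """Fraction of simulated replicates in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replicates):
        cases = simulate_allele_counts(n_individuals, freq_cases, error_rate, rng)
        controls = simulate_allele_counts(n_individuals, freq_controls, error_rate, rng)
        _, p_value = pearson_chi2_2x2(cases, controls)
        if p_value < alpha:
            rejections += 1
    return rejections / replicates


if __name__ == "__main__":
    # Null case: equal allele frequencies, so the result estimates type I error.
    print("type I error:", rejection_rate(100, 0.3, 0.3, error_rate=0.02))
    # Alternative case: differing allele frequencies, so the result estimates power.
    print("power:", rejection_rate(100, 0.3, 0.4, error_rate=0.02))
```

Calling `rejection_rate` with equal frequencies estimates the type I error rate, and calling it with unequal frequencies estimates power. Varying `error_rate` and `n_individuals` gives a rough sense of how errors and sample size affect the uncorrected test in this simplified setting.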

Publication types

  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Alleles
  • Case-Control Studies
  • Gene Frequency
  • Genotype*
  • Humans
  • Likelihood Functions
  • Linkage Disequilibrium
  • Models, Genetic
  • Pedigree
  • Polymorphism, Single Nucleotide*