Estimation of DNA contamination and its sources in genotyped samples

Genet Epidemiol. 2019 Dec;43(8):980-995. doi: 10.1002/gepi.22257. Epub 2019 Aug 26.

Abstract

Array genotyping is a cost-effective and widely used tool that enables assessment of up to millions of genetic markers in hundreds of thousands of individuals. Genotyping array data are typically highly accurate but sensitive to mixing of DNA samples from multiple individuals before or during genotyping. Contaminated samples can lead to genotyping errors and consequently cause false positive signals or reduce the power of association analyses. Here, we propose a new method to identify contaminated samples and the sources of contamination within a genotyping batch. Through analysis of array intensity and genotype data from intentionally mixed samples and 22,366 samples of the Michigan Genomics Initiative, an ongoing biobank-based study, we show that our method can reliably estimate contamination. We also show that identifying sources of contamination can implicate problematic sample processing steps and guide process improvements. Compared to existing methods, our approach can estimate the proportion of contaminating DNA more accurately, eliminate the need for external databases of allele frequencies, and provide contamination estimates that are more robust to the ancestral origin of the contaminating sample.
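
To illustrate what estimating "the proportion of contaminating DNA" from intensity and genotype data can mean, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes a simple linear mixture model: at each marker, the observed B-allele fraction is a weighted average of the primary and contaminating samples' genotypes (coded 0, 1, or 2 copies of the B allele), with weight alpha on the contaminant. The function names, the simulated data, and the noise level are all hypothetical.

```python
import random


def expected_baf(g_primary, g_contaminant, alpha):
    """Expected B-allele fraction when a fraction alpha of the DNA
    comes from the contaminating sample (assumed linear mixture)."""
    return (1 - alpha) * g_primary / 2 + alpha * g_contaminant / 2


def estimate_alpha(baf, g_primary, g_contaminant):
    """Least-squares estimate of alpha through the origin, using
    baf - g_primary/2 = alpha * (g_contaminant - g_primary)/2."""
    num = 0.0
    den = 0.0
    for b, gp, gc in zip(baf, g_primary, g_contaminant):
        x = (gc - gp) / 2   # how far the contaminant pulls the signal
        y = b - gp / 2      # observed shift away from the clean value
        num += x * y
        den += x * x
    if den == 0:
        raise ValueError("genotypes identical at every marker; "
                         "alpha is not identifiable")
    return num / den


# Simulated example (hypothetical data, fixed seed for reproducibility).
rng = random.Random(0)
n_markers = 5000
true_alpha = 0.05
g_p = [rng.choice([0, 1, 2]) for _ in range(n_markers)]
g_c = [rng.choice([0, 1, 2]) for _ in range(n_markers)]
baf = [expected_baf(p, c, true_alpha) + rng.gauss(0, 0.02)
       for p, c in zip(g_p, g_c)]

alpha_hat = estimate_alpha(baf, g_p, g_c)
print(f"estimated contamination proportion: {alpha_hat:.3f}")
```

In this toy setting the contaminating sample's genotypes are known, which loosely mirrors the idea of searching for sources within the same genotyping batch. Real array data would also need handling of intensity noise, probe-specific bias, and uncertainty in the called genotypes, none of which is modelled here.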

Keywords: DNA contamination; biobank; genome-wide association study; genotyping array; quality control.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • DNA
  • DNA Contamination*
  • Gene Frequency
  • Genetic Markers
  • Genomics / methods
  • Genotype
  • Genotyping Techniques* / methods
  • Humans
  • Polymorphism, Single Nucleotide

Substances

  • Genetic Markers
  • DNA