A robust clustering algorithm for identifying problematic samples in genome-wide association studies

Céline Bellenguez; Amy Strange; Colin Freeman; Wellcome Trust Case Control Consortium; Peter Donnelly; Chris C A Spencer

doi:10.1093/bioinformatics/btr599

Bioinformatics. 2012 Jan 1;28(1):134-5. doi: 10.1093/bioinformatics/btr599. Epub 2011 Nov 3.

Authors

Céline Bellenguez¹, Amy Strange, Colin Freeman; Wellcome Trust Case Control Consortium; Peter Donnelly, Chris C A Spencer

Collaborators

Wellcome Trust Case Control Consortium:
Peter Donnelly, Ines Barroso, Jenefer M Blackwell, Elvira Bramon, Matthew A Brown, Juan P Casas, Aiden Corvin, Panos Deloukas, Audrey Duncanson, Janusz Jankowski, Hugh S Markus, Christopher G Mathew, Colin N A Palmer, Robert Plomin, Anna Rautanen, Stephen J Sawcer, Richard C Trembath, Ananth C Viswanathan, Nicholas W Wood, Chris C A Spencer, Gavin Band, Céline Bellenguez, Colin Freeman, Garrett Hellenthal, Eleni Giannoulatou, Matti Pirinen, Richard Pearson, Amy Strange, Zhan Su, Damjan Vukcevic, Peter Donnelly, Cordelia Langford, Sarah E Hunt, Sarah Edkins, Rhian Gwilliam, Hannah Blackburn, Suzannah J Bumpstead, Serge Dronov, Matthew Gillman, Emma Gray, Naomi Hammond, Alagurevathi Jayakumar, Owen T McCann, Jennifer Liddle, Simon C Potter, Radhi Ravindrarajah, Michelle Ricketts, Matthew Waller, Paul Weston, Sara Widaa, Pamela Whittaker, Ines Barroso, Panos Deloukas, Christopher G Mathew, Jenefer M Blackwell, Matthew A Brown, Aiden Corvin, Chris C A Spencer

Affiliation

¹ Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK.

Abstract

Summary: High-throughput genotyping arrays provide an efficient way to survey single nucleotide polymorphisms (SNPs) across the genome in large numbers of individuals. Downstream analysis of the data, for example in genome-wide association studies (GWAS), often involves statistical models of genotype frequencies across individuals. The complexities of the sample collection process and the potential for errors in the experimental assay can lead to biases and artefacts in an individual's inferred genotypes. Rather than attempting to model these complications, it has become standard practice to remove individuals whose genome-wide data differ from the sample at large. Here we describe a simple but robust statistical algorithm to identify samples with atypical summaries of genome-wide variation. Its use as a semi-automated quality control tool is demonstrated using several summary statistics, selected to identify different potential problems, and it is applied to two different genotyping platforms and sample collections.
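Illustration: the abstract does not describe the internals of the algorithm, so the sketch below is not the authors' method or their R implementation. It substitutes a much simpler robust rule, flagging samples whose summary statistic lies far from the median in units of the scaled median absolute deviation (MAD), only to show what "identifying samples with atypical summaries" can look like in practice. The function name `mad_outliers`, the threshold of 3.5, and the per-sample heterozygosity values are all assumptions made for this example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose robust z-score exceeds `threshold`.

    Stand-in technique (median/MAD rule), NOT the clustering algorithm
    of the paper. `threshold` is an assumed, commonly used cut-off.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # No robust scale can be estimated; flag nothing rather than guess.
        return []
    # 1.4826 makes the MAD comparable to a standard deviation
    # when the bulk of the data is approximately normal.
    scale = 1.4826 * mad
    return [i for i, d in enumerate(abs_dev) if d / scale > threshold]


# Hypothetical per-sample heterozygosity values; the last sample is atypical.
heterozygosity = [0.320, 0.318, 0.322, 0.319, 0.321, 0.317, 0.323, 0.250]
flagged = mad_outliers(heterozygosity)
print("Samples flagged for review:", flagged)
```

Because the median and MAD are barely affected by a few extreme samples, the cut-off is not dragged towards the outliers it is meant to detect, which is the sense in which such a rule is robust. In a semi-automated workflow the flagged indices would be reviewed by an analyst rather than excluded blindly.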

Availability: The algorithm is written in R and is freely available at www.well.ox.ac.uk/chris-spencer

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Cluster Analysis*
Cohort Studies
Female
Genome-Wide Association Study*
Humans
Male
Oligonucleotide Array Sequence Analysis
Polymorphism, Single Nucleotide*
