Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows

Adam Roberts; Leonard McMillan; Wei Wang; Joel Parker; Ivan Rusyn; David Threadgill

doi:10.1093/bioinformatics/btm220

Bioinformatics. 2007 Jul 1;23(13):i401-7. doi: 10.1093/bioinformatics/btm220.

Authors

Adam Roberts¹, Leonard McMillan, Wei Wang, Joel Parker, Ivan Rusyn, David Threadgill

Affiliation

¹ Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA.

PMID: 17646323
DOI: 10.1093/bioinformatics/btm220

Abstract

Motivation: Typical high-throughput genotyping techniques produce numerous missing calls that confound subsequent analyses, such as disease association studies. Common remedies include removing the affected markers and/or samples, or imputing the missing data. On small marker sets, imputation is frequently based on a vote of the K-nearest-neighbor (KNN) haplotypes, but this technique is neither practical nor justifiable for large datasets.
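
To make the KNN vote described above concrete, the following is a minimal brute-force sketch, not code from the paper or from NPUTE. It assumes haplotypes are equal-length lists of allele codes, that `None` marks a missing call, and that distance is a plain mismatch count over all other sites; the names `hamming` and `knn_impute` are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing call (illustrative choice)


def hamming(a, b, skip):
    """Count mismatches between two haplotypes, ignoring site `skip`
    and any site where either call is missing."""
    return sum(
        1
        for i, (x, y) in enumerate(zip(a, b))
        if i != skip and x is not MISSING and y is not MISSING and x != y
    )


def knn_impute(haplotypes, sample, site, k=3):
    """Impute haplotypes[sample][site] by majority vote among the k
    closest haplotypes that have a known call at `site`."""
    target = haplotypes[sample]
    candidates = [
        (hamming(target, h, site), idx)
        for idx, h in enumerate(haplotypes)
        if idx != sample and h[site] is not MISSING
    ]
    if not candidates:
        return MISSING  # nothing to vote with
    candidates.sort()  # nearest first; ties broken by sample index
    votes = Counter(haplotypes[idx][site] for _, idx in candidates[:k])
    return votes.most_common(1)[0][0]


haps = [
    [0, 0, 1, None],  # sample 0 has a missing call at site 3
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
imputed = knn_impute(haps, sample=0, site=3, k=3)
```

Each query here scans every haplotype across every site, which is the cost that becomes impractical as the panel grows.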

Results: We describe a data structure that supports efficient KNN queries over arbitrarily sized, sliding haplotype windows, and evaluate its use for genotype imputation. The performance of our method enables exhaustive exploration over all window sizes and known sites in large (150K, 8.3M) SNP panels. We also compare the accuracy and performance of our methods with competing imputation approaches.
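
The abstract does not specify the data structure, so the sketch below only illustrates one way that windowed nearest-neighbor queries can be made cheap: prefix sums of pairwise mismatches, which give the mismatch count for any window in constant time per pair. This is an assumption made for illustration, not the authors' method; the function names, the `None` missing-call marker, and the symmetric `half_width` window are all hypothetical.

```python
from itertools import accumulate


def mismatch_prefix(a, b):
    """prefix[i] = number of mismatching, fully known sites in a[:i] vs b[:i]."""
    diffs = (
        1 if (x is not None and y is not None and x != y) else 0
        for x, y in zip(a, b)
    )
    return [0] + list(accumulate(diffs))


def window_distance(prefix, site, half_width):
    """Mismatches within `half_width` sites of `site` (clipped to the
    panel ends), excluding `site` itself."""
    n = len(prefix) - 1
    lo = max(0, site - half_width)
    hi = min(n, site + half_width + 1)
    at_site = prefix[site + 1] - prefix[site]
    return prefix[hi] - prefix[lo] - at_site


def build_prefixes(haplotypes, sample):
    """Precompute one prefix array per other haplotype, once per sample."""
    target = haplotypes[sample]
    return {
        idx: mismatch_prefix(target, h)
        for idx, h in enumerate(haplotypes)
        if idx != sample
    }


def nn_impute(haplotypes, prefixes, site, half_width):
    """Copy the call at `site` from the haplotype nearest within the window."""
    best = None
    for idx, prefix in prefixes.items():
        if haplotypes[idx][site] is None:
            continue  # this neighbor cannot supply a call
        d = window_distance(prefix, site, half_width)
        if best is None or d < best[0]:
            best = (d, idx)
    return None if best is None else haplotypes[best[1]][site]


haps = [
    [0, 0, 1, None, 1, 0, 0],  # sample 0, missing call at site 3
    [0, 0, 1, 1, 1, 1, 1],     # matches sample 0 near site 3, differs farther away
    [0, 0, 0, 0, 1, 0, 0],     # differs next to site 3, matches elsewhere
]
prefixes = build_prefixes(haps, sample=0)
narrow = nn_impute(haps, prefixes, site=3, half_width=1)
wide = nn_impute(haps, prefixes, site=3, half_width=3)
```

In this toy panel the narrow window selects the second haplotype and the wide window selects the third, so the imputed call depends on window size. Because the prefix arrays are built once and reused, trying many window sizes at a site adds only constant work per neighbor per size.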

Availability: NPUTE, a free open-source software package, is available for non-commercial use at http://compgen.unc.edu/software.

Publication types

Evaluation Study
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Artifacts*
Chromosome Mapping / methods*
DNA Mutational Analysis / methods*
Genetic Variation / genetics
Pattern Recognition, Automated / methods
Polymorphism, Single Nucleotide / genetics*
Sensitivity and Specificity
Sequence Alignment / methods*
Sequence Analysis, DNA / methods*

Grants and funding

U01CA105417/CA/NCI NIH HHS/United States