Accurate detection and genotyping of SNPs utilizing population sequencing data

Vikas Bansal; Olivier Harismendy; Ryan Tewhey; Sarah S Murray; Nicholas J Schork; Eric J Topol; Kelly A Frazer

doi:10.1101/gr.100040.109

Genome Res. 2010 Apr;20(4):537-45. doi: 10.1101/gr.100040.109. Epub 2010 Feb 11.

Authors

Vikas Bansal¹, Olivier Harismendy, Ryan Tewhey, Sarah S Murray, Nicholas J Schork, Eric J Topol, Kelly A Frazer

Affiliation

¹ Scripps Translational Science Institute, The Scripps Research Institute, La Jolla, CA 92037, USA.

Abstract

Next-generation sequencing technologies have made it possible to sequence targeted regions of the human genome in hundreds of individuals. Deep sequencing represents a powerful approach for the discovery of the complete spectrum of DNA sequence variants in functionally important genomic intervals. Current methods for single nucleotide polymorphism (SNP) detection are designed to detect SNPs from single individual sequence data sets. Here, we describe a novel method SNIP-Seq (single nucleotide polymorphism identification from population sequence data) that leverages sequence data from a population of individuals to detect SNPs and assign genotypes to individuals. To evaluate our method, we utilized sequence data from a 200-kilobase (kb) region on chromosome 9p21 of the human genome. This region was sequenced in 48 individuals (five sequenced in duplicate) using the Illumina GA platform. Using this data set, we demonstrate that our method is highly accurate for detecting variants and can filter out false SNPs that are attributable to sequencing errors. The concordance of sequencing-based genotype assignments between duplicate samples was 98.8%. The 200-kb region was independently sequenced to a high depth of coverage using two sequence pools containing the 48 individuals. Many of the novel SNPs identified by SNIP-Seq from the individual sequencing were validated by the pooled sequencing data and were subsequently confirmed by Sanger sequencing. We estimate that SNIP-Seq achieves a low false-positive rate of approximately 2%, improving upon the higher false-positive rate for existing methods that do not utilize population sequence data. Collectively, these results suggest that analysis of population sequencing data is a powerful approach for the accurate detection of SNPs and the assignment of genotypes to individual samples.

Publication types

Evaluation Study
Research Support, N.I.H., Extramural

MeSH terms

Base Sequence
Chromosomes, Human, Pair 9 / genetics
Data Collection / methods
Data Collection / standards
False Positive Reactions
Gene Expression Profiling
Genetics, Population* / instrumentation
Genetics, Population* / methods
Genetics, Population* / standards
Genetics, Population* / statistics & numerical data
Genome, Human / genetics
Genome-Wide Association Study
Genotype
Humans
Information Storage and Retrieval / methods
Information Storage and Retrieval / standards
Meta-Analysis as Topic
Molecular Sequence Data
Oligonucleotide Array Sequence Analysis
Polymorphism, Single Nucleotide / genetics*
Reproducibility of Results
Sequence Analysis, DNA / instrumentation
Sequence Analysis, DNA / methods*
Validation Studies as Topic
