The advantages of massively parallel sequencing are quickly being realized through the adoption of comprehensive genomic panels across the spectrum of genetic testing. Despite the widespread utilization of next generation sequencing (NGS), a major bottleneck in the implementation and capitalization of this technology remains in the data processing steps, or bioinformatics. Here we describe our approach to defining the limitations of each step in the data processing pipeline by using artificial amplicon data sets to simulate a wide spectrum of genomic alterations. Through this process, we identified limitations of insertion, deletion (indel), and single nucleotide variant (SNV) detection using standard approaches and describe novel strategies to improve overall somatic mutation detection. Using these artificial data sets, we demonstrate that NGS assays can achieve robust mutation detection if the data are processed in a way that does not cause large genomic alterations to land in the unmapped data (i.e., trash). With these pipeline modifications and a new variant caller, AbsoluteVar, we have validated SNV detection at 100% sensitivity and specificity at allele frequencies as low as 4%, and detection of indels as large as 90 bp. Clinical validation of NGS relies on the ability to detect mutations across a wide array of genetic anomalies, and the utility of artificial data sets demonstrates a mechanism to intelligently test a vast array of mutation types.
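The core idea of an artificial data set can be illustrated with a minimal sketch: spike a known SNV into a fraction of simulated amplicon reads, then check whether a naive pileup recovers the expected allele fraction. All names here (`simulate_amplicon_reads`, `observed_allele_fraction`) are hypothetical illustrations, not the authors' actual simulation tooling; real artificial data sets would also model read lengths, quality scores, and indels.

```python
import random

def simulate_amplicon_reads(ref, pos, alt, allele_freq, n_reads, seed=0):
    """Hypothetical sketch: simulate full-length amplicon reads over `ref`,
    spiking an SNV `alt` at 0-based position `pos` into a fraction
    `allele_freq` of reads, giving a known truth set for pipeline testing."""
    rng = random.Random(seed)
    mutant = ref[:pos] + alt + ref[pos + 1:]
    return [mutant if rng.random() < allele_freq else ref
            for _ in range(n_reads)]

def observed_allele_fraction(reads, pos, alt):
    """Naive pileup at `pos`: fraction of reads carrying the alternate base."""
    alt_count = sum(1 for r in reads if r[pos] == alt)
    return alt_count / len(reads)

# Spike an SNV at a 4% allele fraction, matching the detection limit
# reported in the abstract, and verify a simple pileup recovers it.
ref = "ACGTACGTACGTACGTACGT"
reads = simulate_amplicon_reads(ref, pos=7, alt="A", allele_freq=0.04,
                                n_reads=5000, seed=42)
af = observed_allele_fraction(reads, pos=7, alt="A")
```

Indels could be simulated analogously by deleting or inserting a segment of the reference in the mutant template; large events of this kind are exactly the reads that risk landing in the unmapped data if the aligner is not configured to tolerate them.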
Keywords: Next generation sequencing; artificial data set; bioinformatics; sensitivity; validation.
Copyright © 2014 Elsevier Inc. All rights reserved.