Efficient error correction for next-generation sequencing of viral amplicons

Pavel Skums; Zoya Dimitrova; David S Campo; Gilberto Vaughan; Livia Rossi; Joseph C Forbi; Jonny Yokosawa; Alex Zelikovsky; Yury Khudyakov

doi:10.1186/1471-2105-13-S10-S6

BMC Bioinformatics. 2012 Jun 25;13 Suppl 10(Suppl 10):S6. doi: 10.1186/1471-2105-13-S10-S6.

Authors

Pavel Skums¹, Zoya Dimitrova, David S Campo, Gilberto Vaughan, Livia Rossi, Joseph C Forbi, Jonny Yokosawa, Alex Zelikovsky, Yury Khudyakov

Affiliation

¹ Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, Atlanta, GA 30333, USA.

Abstract

Background: Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error-prone, so the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that errors are randomly distributed. A recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence-specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.
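To make the homopolymer dependence concrete, the following minimal Python sketch locates homopolymer runs in a read. It is an illustration only and not part of the published method; the function name `homopolymer_runs` and the default minimum run length of 3 are assumptions chosen for this example.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long. min_len=3 is an assumed default."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    # Hypothetical read with a TTTT run and a CCC run.
    print(homopolymer_runs("ACGTTTTACCCGA"))
```

Positions reported this way could then be used to weight or stratify error estimates by run length, which is the kind of sequence-specific calibration the Background argues for.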

Results: In this paper, we present two new efficient error-correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH) to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.
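The abstract does not give the details of either algorithm, so the sketch below is only a generic illustration of the two underlying ideas, not KEC or ET as published. It assumes that reads are plain strings, that a k-mer seen fewer than `min_count` times is suspicious, and that haplotypes below a fixed `min_fraction` of reads are discarded. All function names, parameters and the toy reads are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmer_positions(read, counts, k, min_count):
    """Start positions of k-mers in `read` seen fewer than min_count times.
    Clusters of such positions hint at a sequencing error (assumption)."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]


def filter_by_frequency(reads, min_fraction):
    """Keep haplotypes whose share of reads is at least min_fraction,
    then renormalize the surviving frequencies to sum to 1."""
    haplotypes = Counter(reads)
    total = sum(haplotypes.values())
    kept = {h: c for h, c in haplotypes.items() if c / total >= min_fraction}
    kept_total = sum(kept.values())
    return {h: c / kept_total for h, c in kept.items()}


if __name__ == "__main__":
    # Toy data: two plausible haplotypes and one singleton read.
    reads = ["ACGTAC"] * 6 + ["ACGTTC"] * 3 + ["ACGAAC"] * 1
    counts = kmer_counts(reads, k=3)
    print(weak_kmer_positions("ACGAAC", counts, k=3, min_count=2))
    print(filter_by_frequency(reads, min_fraction=0.2))
```

In this toy data the singleton read is flagged by its rare k-mers and also falls below the frequency cutoff, so the two approaches agree. A fixed cutoff is used here purely for simplicity.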

Conclusions: Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and the datasets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm.

Publication types

Comparative Study

MeSH terms

Algorithms*
Cluster Analysis
Computational Biology / methods*
DNA, Viral / genetics
Haplotypes
Sequence Analysis, DNA / methods*
Viruses / genetics*

Substances

DNA, Viral