Estimating viral diversity in infected patients can provide insight into pathogen evolution and the emergence of drug resistance. With the widespread adoption of deep sequencing, it is important to develop tools that accurately calculate population diversity from very large datasets. Current methods for estimating diversity that are based on multiple alignments are not practical to apply to such data. In this study, the authors report a novel method (Pairwise Alignment Positional Nucleotide Counting, PAPNC) for estimating population diversity from 454 sequence data. The diversity measurements determined using this method were comparable to those calculated by average pairwise difference (APD) of multiply aligned sequences using MEGA5. Diversities were estimated for 9 patient plasma HIV samples sequenced with Titanium 454 technology and by single-genome sequencing (SGS). Diversities calculated from deep sequencing using PAPNC ranged from 0.002 to 0.021, while APD measurements calculated from SGS data ranged from approximately 0.001 to 0.018, with the difference being attributable to PCR error (which contributed a background diversity of 0.0016 in a control sample). Comparison of APDs estimated from 100 sets of sequences drawn at random from 454-generated data and from the corresponding SGS data showed very close correlation between the two methods, with an R² of 0.96 and an average difference of about 1% (after correction for PCR error). The authors have developed a novel method that is well suited to calculating genetic diversity for large-scale datasets from next-generation sequencing. It can be implemented easily as a function in available variant-calling programs such as SAMtools, or in haplotype reconstruction software, for nucleotide genetic diversity calculation. A Perl script implementing this method is available upon request.
Keywords: HIV-1; Next generation sequencing; Viral population diversity calculation.
Copyright © 2014 Elsevier B.V. All rights reserved.
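
The abstract does not spell out the PAPNC calculation, so the following Python sketch is only an illustration of the general idea suggested by the method's name: tally nucleotides at each reference position from reads that were each aligned pairwise to a reference, then derive an average pairwise difference from those counts without building a multiple alignment. It is not the authors' Perl script. The function names, the `(start, sequence)` read representation, the exclusion of gaps and ambiguous bases, and the choice to average per-site values with equal weight are all assumptions made for this example.

```python
from collections import Counter


def count_nucleotides(aligned_reads, ref_length):
    """Tally A/C/G/T counts at each reference position.

    aligned_reads: iterable of (start, sequence) tuples, where `start` is the
    0-based reference position of the first base and `sequence` is the read in
    reference coordinates (hypothetical input format; gaps written as '-').
    """
    counts = [Counter() for _ in range(ref_length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            # Assumption: gaps and ambiguous bases are simply ignored.
            if base in "ACGT" and 0 <= pos < ref_length:
                counts[pos][base] += 1
    return counts


def positional_diversity(counts, min_coverage=2):
    """Mean per-site probability that two distinct reads differ.

    For a site with n reads and per-base counts k, the fraction of read pairs
    carrying the same base is sum(k * (k - 1)) / (n * (n - 1)); one minus that
    is the per-site pairwise difference. Sites below `min_coverage` are skipped.
    """
    min_coverage = max(min_coverage, 2)  # a pair needs at least two reads
    site_values = []
    for site in counts:
        n = sum(site.values())
        if n < min_coverage:
            continue
        same_pairs = sum(k * (k - 1) for k in site.values())
        site_values.append(1.0 - same_pairs / (n * (n - 1)))
    if not site_values:
        return 0.0
    return sum(site_values) / len(site_values)


if __name__ == "__main__":
    # Toy data: four reads covering a 4-base reference, one read with a
    # single mismatch at the last position.
    reads = [(0, "ACGT"), (0, "ACGT"), (0, "ACGA"), (0, "ACGT")]
    print(positional_diversity(count_nucleotides(reads, ref_length=4)))
```

When every read covers every site, this count-based value equals the average pairwise difference computed over all read pairs, which is why the cost grows with the number of reads times the read length rather than with the number of read pairs. A background estimate from a control sample, such as the 0.0016 reported above, would have to be handled separately; the abstract does not say how that correction was applied.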