An efficient test for comparing sequence diversity between two populations

P B Gilbert; V A Novitsky; M A Montano; M Essex

doi:10.1089/106652701300312904

J Comput Biol. 2001;8(2):123-39. doi: 10.1089/106652701300312904.

Authors

P B Gilbert¹, V A Novitsky, M A Montano, M Essex

Affiliation

¹ Center for Biostatistics in AIDS Research and Department of Biostatistics, Harvard School of Public Health, Boston, 02115, USA.

PMID: 11454301
DOI: 10.1089/106652701300312904

Abstract

We address the problem of comparing interindividual genomic sequence diversity between two populations. Although the methods are general, for concreteness we focus on comparing two human immunodeficiency virus (HIV) infected populations. Suppose that one or more viral isolates are taken from each individual in a sample of persons from each population, and that one or multiple measurements are made on the genetic sequence of a coding region of HIV. Given a definition of genetic distance between sequences, the goal is to test whether the distribution of interindividual distances differs between populations. If distances between all pairs of sequences within each group are used, then data dependencies arising from the use of multiple sequences from the same individuals invalidate the use of a standard two-sample test such as the t-test. Where this problem has been recognized, a typical solution has been to apply a standard test to a reduced dataset composed of one sequence or a consensus sequence from each patient. This procedure has two disadvantages: the conclusion of the test depends on the choice of sequences used, which is often arbitrary, and excluding replicate sequences from the analysis may needlessly sacrifice statistical power. We present a new test free of these drawbacks, based on a statistic that linearly combines all possible standard test statistics calculated from independent sequence subsamples. We describe the statistical power advantages of the test and illustrate its use by application to nucleotide sequence distances measured from HIV-1 infected populations in southern Africa (GenBank accession numbers AF110959 to AF110981) and North America/Europe. The test makes minimal assumptions, is maximally efficient and objective, and is broadly applicable.
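
The idea of combining a statistic over all subsamples can be illustrated with a minimal Python sketch. It is not the authors' exact test: the abstract does not specify the component statistic, the weights, or how significance is assessed, so the sketch makes several assumptions. It takes a "subsample" to mean one sequence chosen per individual, uses the difference in mean interindividual distance as the component statistic, weights every subsample equally, and uses a Hamming proportion as the genetic distance. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def hamming(a, b):
    """Proportion of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_distance(seqs, dist=hamming):
    """Mean distance over all pairs, given one sequence per individual."""
    return mean(dist(a, b) for a, b in combinations(seqs, 2))


def combined_statistic(group1, group2, dist=hamming):
    """Equal-weight average of a two-sample statistic over all subsamples.

    Each group is a list of individuals; each individual is a list of
    replicate sequences. A subsample picks exactly one sequence per
    individual (an assumption of this sketch), so no distance within a
    subsample compares two sequences from the same person.
    """
    diffs = []
    for pick1 in product(*group1):          # every choice in population 1
        for pick2 in product(*group2):      # every choice in population 2
            diffs.append(
                mean_pairwise_distance(pick1, dist)
                - mean_pairwise_distance(pick2, dist)
            )
    return mean(diffs)


# Toy data (hypothetical): three individuals per population,
# some with replicate sequences.
group_a = [["AAAA", "AAAT"], ["AATT"], ["TTTT"]]
group_b = [["AAAA"], ["AAAT", "AAAA"], ["AAAA"]]

stat = combined_statistic(group_a, group_b)
print(f"combined difference in mean interindividual distance: {stat:.3f}")
```

Because every replicate sequence contributes to the average, the result does not depend on an arbitrary choice of one sequence per patient. Enumerating all subsamples grows multiplicatively with the number of replicates, so this brute-force form is only practical for small datasets; a reference distribution for the statistic (for example, by permuting individuals between populations) is not shown.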

Publication types

Comparative Study
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Africa, Southern / epidemiology
Algorithms
Europe / epidemiology
Gene Products, gag / genetics
Genetic Variation*
HIV Infections / epidemiology
HIV Infections / virology*
HIV Long Terminal Repeat / genetics
HIV-1 / genetics*
Models, Genetic*
Models, Statistical
Molecular Sequence Data
North America / epidemiology

Substances

Gene Products, gag

Associated data

GENBANK/AF110959
GENBANK/AF110960
GENBANK/AF110961
GENBANK/AF110962
GENBANK/AF110963
GENBANK/AF110964
GENBANK/AF110965
GENBANK/AF110966
GENBANK/AF110967
GENBANK/AF110968
GENBANK/AF110969
GENBANK/AF110970
GENBANK/AF110971
GENBANK/AF110972
GENBANK/AF110973
GENBANK/AF110974
GENBANK/AF110975
GENBANK/AF110976
GENBANK/AF110977
GENBANK/AF110978
GENBANK/AF110979
GENBANK/AF110980
GENBANK/AF110981
