An efficient test for comparing sequence diversity between two populations

J Comput Biol. 2001;8(2):123-39. doi: 10.1089/106652701300312904.

Abstract

We address the problem of comparing interindividual genomic sequence diversity between two populations. Although the methods are general, for concreteness we focus on comparing two human immunodeficiency virus (HIV) infected populations. From one or more viral isolates taken from each individual in a sample of persons from each population, suppose one or multiple measurements are made on the genetic sequence of a coding region of HIV. Given a definition of genetic distance between sequences, the goal is to test whether the distribution of interindividual distances differs between populations. If distances between all pairs of sequences within each group are used, then data dependencies arising from the use of multiple sequences from the same individual invalidate the use of a standard two-sample test such as the t-test. Where this problem has been recognized, a typical solution has been to apply a standard test to a reduced dataset composed of one sequence or a consensus sequence from each patient. Disadvantages of this procedure are that the conclusion of the test depends on the choice of sequences used, often an arbitrary decision, and that excluding replicate sequences from the analysis may needlessly sacrifice statistical power. We present a new test free of these drawbacks, which is based on a statistic that linearly combines all possible standard test statistics calculated from independent sequence subsamples. We describe statistical power advantages of the test and illustrate its use by application to nucleotide sequence distances measured from HIV-1 infected populations in southern Africa (GenBank accession numbers AF110959-AF110981) and North America/Europe. The test makes minimal assumptions, is maximally efficient and objective, and is broadly applicable.
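The idea of combining statistics over independent sequence subsamples can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not give the weights, the underlying standard statistic, or the null distribution, so the sketch assumes equal weights, uses the difference in mean interindividual distance as the per-subsample statistic, and stops short of a p-value. The function names, the Hamming-proportion distance, and the toy sequences are all hypothetical choices made for illustration.

```python
from itertools import combinations, product
from statistics import mean


def hamming(a, b):
    """Proportion of mismatched positions between two equal-length sequences.
    (Assumed distance; the source leaves the genetic distance unspecified.)"""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_interindividual_distance(seqs, dist=hamming):
    """Mean pairwise distance, given exactly one sequence per individual.
    Requires at least two individuals."""
    return mean(dist(a, b) for a, b in combinations(seqs, 2))


def averaged_diversity(population, dist=hamming):
    """Equal-weight average of the mean interindividual distance over every
    subsample that picks one sequence from each individual.

    `population` is a list of individuals, each a list of sequences.
    Enumeration is exhaustive, so this only suits small inputs."""
    return mean(
        mean_interindividual_distance(pick, dist)
        for pick in product(*population)
    )


def diversity_difference(pop1, pop2, dist=hamming):
    """Difference in subsample-averaged diversity between two populations."""
    return averaged_diversity(pop1, dist) - averaged_diversity(pop2, dist)


# Toy data (made up): each inner list holds one individual's sequences.
pop_a = [["AAAA", "AAAT"], ["AATT"], ["TTTT"]]
pop_b = [["AAAA"], ["AAAA", "AAAT"]]
```

Calling `diversity_difference(pop_a, pop_b)` uses every replicate sequence without ever pairing two sequences from the same individual, which is the property the abstract contrasts with the reduced-dataset approach. Exhaustive enumeration grows as the product of the per-individual sequence counts, so a practical version would need a closed form or sampling.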

Publication types

  • Comparative Study
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Africa, Southern / epidemiology
  • Algorithms
  • Europe / epidemiology
  • Gene Products, gag / genetics
  • Genetic Variation*
  • HIV Infections / epidemiology
  • HIV Infections / virology*
  • HIV Long Terminal Repeat / genetics
  • HIV-1 / genetics*
  • Models, Genetic*
  • Models, Statistical
  • Molecular Sequence Data
  • North America / epidemiology

Substances

  • Gene Products, gag

Associated data

  • GENBANK/AF110959
  • GENBANK/AF110960
  • GENBANK/AF110961
  • GENBANK/AF110962
  • GENBANK/AF110963
  • GENBANK/AF110964
  • GENBANK/AF110965
  • GENBANK/AF110966
  • GENBANK/AF110967
  • GENBANK/AF110968
  • GENBANK/AF110969
  • GENBANK/AF110970
  • GENBANK/AF110971
  • GENBANK/AF110972
  • GENBANK/AF110973
  • GENBANK/AF110974
  • GENBANK/AF110975
  • GENBANK/AF110976
  • GENBANK/AF110977
  • GENBANK/AF110978
  • GENBANK/AF110979
  • GENBANK/AF110980
  • GENBANK/AF110981