Merging microsatellite data

J Comput Biol. 2006 Jul-Aug;13(6):1131-47. doi: 10.1089/cmb.2006.13.1131.

Abstract

Genotype calling procedures vary from laboratory to laboratory for many microsatellite markers. Even within the same laboratory, application of different experimental protocols often leads to ambiguities. The impact of these ambiguities ranges from irksome to devastating. Resolving the ambiguities can increase effective sample size and preserve evidence in favor of disease-marker associations. Because different data sets may contain different numbers of alleles, merging is unfortunately not a simple process of matching alleles one to one. Merging data sets manually is difficult, time-consuming, and error-prone due to differences in genotyping hardware, binning methods, molecular weight standards, and curve fitting algorithms. Merging is particularly difficult if few or no samples occur in common, or if samples are drawn from ethnic groups with widely varying allele frequencies. It is dangerous to align alleles simply by adding a constant number of base pairs to the alleles of one of the data sets. To address these issues, we have developed a Bayesian model and a Markov chain Monte Carlo (MCMC) algorithm for sampling the posterior distribution under the model. Our computer program, MicroMerge, implements the algorithm and almost always accurately and efficiently finds the most likely correct alignment. Common allele frequencies across laboratories in the same ethnic group are the single most important cue in the model. MicroMerge computes the allelic alignments with the greatest posterior probabilities under several merging options. It also reports when data sets cannot be confidently merged. These features are emphasized in our analysis of simulated and real data.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Alleles
  • Bayes Theorem
  • Gene Frequency
  • Genetic Markers*
  • Genotype
  • Markov Chains
  • Microsatellite Repeats*
  • Models, Genetic*
  • Models, Statistical*
  • Monte Carlo Method
  • Software*

Substances

  • Genetic Markers