Haplotype inference from short sequence reads using a population genealogical history model

Pac Symp Biocomput. 2011:288-99. doi: 10.1142/9789814335058_0030.

Abstract

High-throughput sequencing is currently a major transformative technology in biology. In this paper, we study a population genomics problem motivated by the newly available short-read data from high-throughput sequencing. In this problem, we are given short reads collected from individuals in a population, and the objective is to infer haplotypes from the given reads. We first formulate the computational problem of haplotype inference with short reads. Based on a simple probabilistic model of short reads, we present a new approach for inferring haplotypes directly from the given reads (i.e., without first calling genotypes). Our method finds the most likely haplotypes whose local genealogical history can be approximately modeled as a perfect phylogeny. We show that, when there is no recombination, the optimal haplotypes under this objective can be found using integer linear programming for many modest-sized data sets. We then develop a related heuristic method that can handle larger data and also allows recombination. Simulations show that the performance of our method is competitive with alternative approaches.
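
The two ingredients named in the abstract, a probabilistic model of reads and a perfect-phylogeny constraint on haplotypes, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the read representation (a dict mapping site index to observed allele), the uniform `error_rate`, and the assumption that each read comes from either of an individual's two haplotypes with probability 1/2 are all illustrative assumptions, since the abstract does not specify the model. The perfect-phylogeny check uses the standard four-gamete test for binary sites.

```python
from itertools import combinations
from math import log


def has_perfect_phylogeny(haplotypes):
    """Four-gamete test on binary haplotypes (sequences of 0/1).

    Returns True if no pair of sites exhibits all four combinations
    00, 01, 10 and 11, i.e. the haplotypes are compatible with a
    perfect phylogeny (no recombination, no recurrent mutation).
    """
    if not haplotypes:
        return True
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


def read_log_likelihood(reads, hap_pair, error_rate=0.01):
    """Log-likelihood of one individual's reads given two haplotypes.

    Assumed model (hypothetical, for illustration only): each read is
    sampled from either haplotype with probability 1/2, and each
    observed allele is misread independently with probability
    error_rate. Each read is a dict {site_index: observed_allele}.
    """
    total = 0.0
    for read in reads:
        p_read = 0.0
        for hap in hap_pair:
            p = 0.5
            for site, allele in read.items():
                p *= (1.0 - error_rate) if hap[site] == allele else error_rate
            p_read += p
        total += log(p_read)
    return total


# Toy usage: two reads covering two sites of one diploid individual.
reads = [{0: 1, 1: 0}, {0: 0, 1: 1}]
candidate_a = ((1, 0), (0, 1))  # consistent with both reads
candidate_b = ((1, 1), (0, 0))  # requires a sequencing error in each read
better = max([candidate_a, candidate_b],
             key=lambda pair: read_log_likelihood(reads, pair))
```

In this toy example `better` is `candidate_a`, because both reads match one of its haplotypes without invoking an error. A method of the kind described in the abstract would search for the likelihood-maximizing haplotypes across all individuals subject to a check like `has_perfect_phylogeny`; that search (by integer linear programming or by heuristics) is not shown here.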

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Computational Biology
  • Genealogy and Heraldry*
  • Genetics, Population / statistics & numerical data*
  • Haplotypes*
  • High-Throughput Nucleotide Sequencing / statistics & numerical data
  • Humans
  • Models, Genetic
  • Polymorphism, Single Nucleotide
  • Recombination, Genetic
  • Software