Genome-wide nucleotide-level mammalian ancestor reconstruction

Genome Res. 2008 Nov;18(11):1829-43. doi: 10.1101/gr.076521.108. Epub 2008 Oct 10.

Abstract

Recently, attention has turned to the problem of reconstructing complete ancestral sequences from large multiple alignments. Successful generation of these genome-wide reconstructions will facilitate a greater knowledge of the events that have driven evolution. We present a new evolutionary alignment modeler, called "Ortheus," for inferring the evolutionary history of a multiple alignment, in terms of both substitutions and, importantly, insertions and deletions. Based on a multiple sequence probabilistic transducer model of the type proposed by Holmes, Ortheus uses efficient stochastic graph-based dynamic programming methods. Unlike other methods, Ortheus does not rely on a single fixed alignment from which to work. Ortheus is also more scalable than previous methods while being fast, stable, and open source. Large-scale simulations show that Ortheus performs close to optimally on a deep mammalian phylogeny. Simulations also indicate that significant proportions of errors due to insertions and deletions can be avoided by not assuming a fixed alignment. We additionally use a challenging hold-out cross-validation procedure to test the method; using the reconstructions to predict extant sequence bases, we demonstrate significant improvements over using closest extant neighbor sequences. Accompanying this paper, a new, public, and genome-wide set of Ortheus ancestor alignments provides an intriguing new resource for evolutionary studies in mammals. As a first piece of analysis, we attempt to recover "fossilized" ancestral pseudogenes. We confidently find 31 cases in which the ancestral sequence had a more complete sequence than any of the extant sequences.

Publication types

  • Validation Study

MeSH terms

  • Algorithms
  • Animals
  • Base Sequence
  • Computer Simulation
  • DNA / genetics
  • Evolution, Molecular*
  • Fossils
  • Genomics / statistics & numerical data*
  • Humans
  • Mammals / genetics*
  • Models, Statistical
  • Phylogeny
  • Pseudogenes
  • Sequence Alignment / statistics & numerical data
  • Software
  • Stochastic Processes

Substances

  • DNA