Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction

PLoS One. 2017 Aug 11;12(8):e0182238. doi: 10.1371/journal.pone.0182238. eCollection 2017.

Abstract

Phylogenetic trees inferred using commonly used models of sequence evolution are unrooted, but the root position matters both for interpretation and for downstream applications. This issue has long been recognized; however, whether the potential for discordance between the species tree and gene trees impacts methods of rooting a phylogenetic tree has not been extensively studied. In this paper, we introduce a new method of rooting a tree based on its branch length distribution; our method, which minimizes the variance of root-to-tip distances, is inspired by traditional midpoint rerooting and is justified when deviations from the strict molecular clock are random. Like midpoint rerooting, the method can be implemented as a linear-time algorithm. In extensive simulations that consider discordance between gene trees and the species tree, we show that the new method is more accurate than midpoint rerooting, but its accuracy relative to using outgroups to root gene trees depends on the size of the dataset and the level of deviation from the strict clock. We show high levels of error for all methods of rooting estimated gene trees, due to factors that include gene tree discordance, deviations from the clock, and gene tree estimation error. Our simulations, however, did not reveal significant differences between two equivalent methods for species tree estimation that use rooted and unrooted input, namely, STAR and NJst. Nevertheless, our results point to limitations of existing scalable rooting methods.
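To make the criterion in the abstract concrete, the following is a minimal Python sketch of rooting by minimizing the variance of root-to-tip distances. It is not the linear-time algorithm the paper describes: it is a brute-force, quadratic-time illustration that tries every edge and solves for the best root position on that edge in closed form. The nested-dictionary tree representation and the names tip_distances and min_variance_root are assumptions made for this example only.

```python
from statistics import pvariance


def tip_distances(adj, start, blocked):
    """Distances from `start` to every leaf reachable without stepping to `blocked`."""
    dists = {}
    stack = [(start, blocked, 0.0)]
    while stack:
        node, parent, d = stack.pop()
        if len(adj[node]) == 1:  # degree-1 nodes are the tips
            dists[node] = d
        for nbr, w in adj[node].items():
            if nbr != parent:
                stack.append((nbr, node, d + w))
    return dists


def min_variance_root(adj):
    """Brute-force search (illustrative only): returns (variance, u, v, x),
    meaning the root sits on edge u-v at distance x from u."""
    best = None
    seen = set()
    for u in adj:
        for v, length in adj[u].items():
            if (v, u) in seen:  # each undirected edge is visited once
                continue
            seen.add((u, v))
            # A root at distance x from u gives tip distances c_i + s_i * x.
            a = list(tip_distances(adj, u, v).values())           # tips on u's side
            b = [d + length for d in tip_distances(adj, v, u).values()]  # v's side
            c = a + b
            s = [1.0] * len(a) + [-1.0] * len(b)
            n = len(c)
            ms = sum(s) / n
            mc = sum(c) / n
            msc = sum(si * ci for si, ci in zip(s, c)) / n
            # The variance is a convex quadratic in x; take its minimizer,
            # then clamp it so the root stays on the edge.
            x = (ms * mc - msc) / (1.0 - ms * ms)
            x = min(max(x, 0.0), length)
            var = pvariance([ci + si * x for si, ci in zip(s, c)])
            if best is None or var < best[0]:
                best = (var, u, v, x)
    return best


# Hypothetical clock-like tree: cherries (A,B) and (C,D) joined by edge X-Y.
tree = {
    "A": {"X": 1.0}, "B": {"X": 1.0},
    "C": {"Y": 1.0}, "D": {"Y": 1.0},
    "X": {"A": 1.0, "B": 1.0, "Y": 2.0},
    "Y": {"C": 1.0, "D": 1.0, "X": 2.0},
}
result = min_variance_root(tree)
print(result)
```

In this toy tree all four tips are equidistant from the middle of the X-Y edge, so the search places the root there with zero variance, which is the behavior expected under a strict molecular clock. With random deviations from the clock the variance would no longer be zero, but the same search would still return the position that minimizes it.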

MeSH terms

  • Algorithms*
  • Computer Simulation
  • Databases as Topic
  • Genes
  • Phylogeny*
  • Species Specificity

Grants and funding

This work was supported by the National Science Foundation grant IIS-1565862 (https://www.nsf.gov/awardsearch/showAward?AWD_ID=1565862) to SM, UM, and ES. Computations were performed at the San Diego Supercomputer Center (SDSC) through XSEDE allocations; XSEDE is supported by the National Science Foundation grant ACI-1053575 (https://www.nsf.gov/awardsearch/showAward?AWD_ID=1053575).