Treemmer: a tool to reduce large phylogenetic datasets with minimal loss of diversity

BMC Bioinformatics. 2018 May 2;19(1):164. doi: 10.1186/s12859-018-2164-8.

Abstract

Background: Large sequence datasets are difficult to visualize and handle. Moreover, they are often not a random subset of the natural diversity but the product of uncoordinated convenience sampling. Consequently, they can suffer from redundancy and sampling bias.

Results: Here we present Treemmer, a simple tool to evaluate the redundancy of phylogenetic trees and reduce their complexity by eliminating leaves that contribute the least to the tree diversity.
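The leaf-elimination idea can be illustrated with a minimal sketch. This is not Treemmer's implementation: the abstract does not state the selection rule, so the sketch assumes that the closest pair of sibling leaves is the most redundant and drops the member with the shorter terminal branch. The dictionary-based tree encoding, the function name `trim_leaves`, and the toy tree are all hypothetical.

```python
def trim_leaves(edges, target):
    """Greedily drop leaves until at most `target` remain.

    `edges` maps each non-root node to (parent, branch_length).
    Returns the sorted names of the retained leaves.
    Assumption: redundancy is judged only between sibling leaves.
    """
    edges = dict(edges)  # work on a copy; the caller's tree is untouched
    while True:
        # Rebuild the parent -> children index for the current tree.
        children = {}
        for node, (par, _) in edges.items():
            children.setdefault(par, []).append(node)
        leaves = [n for n in edges if n not in children]
        if len(leaves) <= target:
            return sorted(leaves)

        # Find the closest pair of sibling leaves.
        best = None  # (distance, leaf_to_drop, parent)
        for par, kids in children.items():
            tips = sorted((edges[k][1], k) for k in kids if k not in children)
            if len(tips) >= 2:
                dist = tips[0][0] + tips[1][0]
                if best is None or dist < best[0]:
                    best = (dist, tips[0][1], par)
        if best is None:  # no sibling leaf pair left to compare
            return sorted(leaves)

        _, drop, par = best
        del edges[drop]
        remaining = [k for k in children[par] if k != drop]
        # Collapse a non-root parent left with a single child,
        # adding its branch length to that child's branch.
        if len(remaining) == 1 and par in edges:
            grand, par_len = edges.pop(par)
            only = remaining[0]
            edges[only] = (grand, edges[only][1] + par_len)


# Toy tree: ((A:0.1,B:0.2):1.0,(C:0.5,D:0.6):1.0,E:2.0);
toy = {
    "n1": ("root", 1.0), "n2": ("root", 1.0), "E": ("root", 2.0),
    "A": ("n1", 0.1), "B": ("n1", 0.2),
    "C": ("n2", 0.5), "D": ("n2", 0.6),
}
print(trim_leaves(toy, 3))  # ['B', 'D', 'E']
```

In the toy tree, one leaf from each tight pair (A/B, then C/D) is removed first, so the three retained leaves still span all three major branches.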

Conclusions: Treemmer can reduce the size of datasets with different phylogenetic structures and levels of redundancy while maintaining a sub-sample that is representative of the original diversity. Additionally, its behavior can be fine-tuned by incorporating any kind of meta-information, which makes Treemmer particularly useful for empirical studies.

Keywords: Biogeography; Clone elimination; Influenza; Large phylogenetic trees; Redundancy reduction; Representative sample; Sampling bias; Size reduction; Tuberculosis.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Databases, Genetic
  • Humans
  • Influenza A virus / genetics*
  • Information Storage and Retrieval
  • Mycobacterium tuberculosis / genetics*
  • Phylogeny*
  • Software*