Background: Large sequence datasets are difficult to visualize and handle. Additionally, they often do not represent a random subset of the natural diversity, but the result of uncoordinated and convenience sampling. Consequently, they can suffer from redundancy and sampling biases.
Results: Here we present Treemmer, a simple tool to evaluate the redundancy of phylogenetic trees and reduce their complexity by eliminating leaves that contribute the least to the tree diversity.
Conclusions: Treemmer can reduce the size of datasets with different phylogenetic structures and levels of redundancy while maintaining a sub-sample that is representative of the original diversity. Additionally, it is possible to fine-tune the behavior of Treemmer including any kind of meta-information, making Treemmer particularly useful for empirical studies.
Keywords: Biogeography; Clone elimination; Influenza; Large phylogenetic trees; Redundancy reduction; Representative sample; Sampling bias; Size reduction; Tuberculosis.