TreeCluster: Clustering biological sequences using phylogenetic trees

Metin Balaban; Niema Moshiri; Uyen Mai; Xingfan Jia; Siavash Mirarab

doi:10.1371/journal.pone.0221068

PLoS One. 2019 Aug 22;14(8):e0221068. doi: 10.1371/journal.pone.0221068. eCollection 2019.

Authors

Metin Balaban¹, Niema Moshiri¹, Uyen Mai², Xingfan Jia³, Siavash Mirarab⁴

Affiliations

¹ Bioinformatics and Systems Biology Graduate Program, UC San Diego, La Jolla, CA 92093, United States of America.
² Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, United States of America.
³ Department of Mathematics, UC San Diego, La Jolla, CA 92093, United States of America.
⁴ Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA 92093, United States of America.

Abstract

Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Given advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. All three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms are already known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
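
To make constraint (1) concrete, the following is a minimal Python sketch of a bottom-up greedy that cuts branches so that no cluster's diameter (the largest leaf-to-leaf path length within the cluster) exceeds a threshold. It is an illustration written for this summary, not code from TreeCluster: the function name `cluster_max_diameter`, the dictionary-based tree representation, and the parameter name `alpha` are all assumptions. Because it sorts the children of each node, it runs in linear time only when node degrees are bounded, as in binary trees.

```python
def cluster_max_diameter(children, root, alpha):
    """Greedy sketch: cut as few branches as possible so that every
    cluster of leaves has diameter <= alpha.

    children: dict mapping node -> list of (child, branch_length);
              leaves are nodes with no entry (hypothetical representation).
    Returns a list of sets of leaf names.
    """
    # Pre-order traversal; reversing it visits children before parents.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(c for c, _ in children.get(u, []))

    deepest = {}   # deepest[u]: max distance from u to a still-connected leaf below it
    cut = set()    # child endpoints of the branches that were cut
    for u in reversed(order):
        # Distance from u to the deepest connected leaf through each child.
        depths = sorted(
            ((deepest[c] + w, c) for c, w in children.get(u, [])),
            key=lambda t: t[0],
            reverse=True,
        )
        i = 0
        # While the two deepest paths meeting at u exceed alpha,
        # cut the branch leading to the deeper one.
        while len(depths) - i >= 2 and depths[i][0] + depths[i + 1][0] > alpha:
            cut.add(depths[i][1])
            i += 1
        deepest[u] = depths[i][0] if depths else 0.0

    # Leaves that remain connected after the cuts form one cluster each.
    clusters = {}
    stack = [(root, root)]
    while stack:
        u, cid = stack.pop()
        kids = children.get(u, [])
        if not kids:
            clusters.setdefault(cid, set()).add(u)
        for c, _ in kids:
            stack.append((c, c if c in cut else cid))
    return list(clusters.values())


# Toy tree ((A:1,B:1):1,(C:1,D:1):1), with made-up branch lengths.
toy_tree = {
    "root": [("AB", 1.0), ("CD", 1.0)],
    "AB": [("A", 1.0), ("B", 1.0)],
    "CD": [("C", 1.0), ("D", 1.0)],
}
print(sorted(sorted(c) for c in cluster_max_diameter(toy_tree, "root", 2.5)))
```

In the toy tree, A and B are 2 units apart while A and C are 4 units apart, so a threshold of 2.5 keeps each sibling pair together and separates the two pairs. Raising the threshold to 4 or more returns a single cluster.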

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Base Sequence
Cluster Analysis
Computational Biology / methods
Computational Biology / statistics & numerical data*
HIV / classification
HIV / genetics*
HIV Infections / epidemiology
HIV Infections / transmission
HIV Infections / virology
Humans
Microbiota / genetics*
Phylogeny*
Sequence Alignment / statistics & numerical data*
Software

Grants and funding

P30 AI027767/AI/NIAID NIH HHS/United States