Biclustering via sparse clustering

Erika S Helgeson; Qian Liu; Guanhua Chen; Michael R Kosorok; Eric Bair

doi:10.1111/biom.13136


Biometrics. 2020 Mar;76(1):348-358. doi: 10.1111/biom.13136. Epub 2019 Oct 14.

Authors

Erika S Helgeson¹, Qian Liu², Guanhua Chen³, Michael R Kosorok², Eric Bair⁴

Affiliations

¹ Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota.
² Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina.
³ Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin.
⁴ Departments of Endodontics and Biostatistics, University of North Carolina, Chapel Hill, North Carolina.

Abstract

In identifying subgroups of a heterogeneous disease or condition, it is often desirable to identify both the observations and the features which differ between subgroups. For instance, it may be that there is a subgroup of individuals with a certain disease who differ from the rest of the population based on the expression profile for only a subset of genes. Identifying the subgroup of patients and subset of genes could lead to better-targeted therapy. We can represent the subgroup of individuals and genes as a bicluster, a submatrix, $U$, of a larger data matrix, $X$, such that the features and observations in $U$ differ from those not contained in $U$. We present a novel two-step method, SC-Biclust, for identifying $U$. In the first step, the observations in the bicluster are identified to maximize the sum of the weighted between-cluster feature differences. In the second step, features in the bicluster are identified based on their contribution to the clustering of the observations. This versatile method can be used to identify biclusters that differ on the basis of feature means, feature variances, or more general differences. The bicluster identification accuracy of SC-Biclust is illustrated through several simulated studies. Application of SC-Biclust to pain research illustrates its ability to identify biologically meaningful subgroups.

Keywords: biclustering; hierarchical clustering; high-dimensional data; k-means clustering; sparse clustering.
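
The two-step idea in the abstract can be illustrated with a minimal sketch. This is not the authors' SC-Biclust implementation. It assumes a generic sparse 2-means scheme for mean-shift biclusters: nonnegative feature weights $w$ are chosen to maximize the weighted sum of per-feature between-cluster sums of squares subject to $\|w\|_2 \le 1$ and $\|w\|_1 \le s$. The function names, the tuning parameter `s`, the choice of the smaller cluster as the bicluster, and the rule "keep features with nonzero weight" are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def two_means(Z, n_iter=100):
    """Lloyd's algorithm with k=2 and a deterministic farthest-point start."""
    c0 = Z[np.argmax(((Z - Z.mean(0)) ** 2).sum(1))]   # row farthest from the mean
    c1 = Z[np.argmax(((Z - c0) ** 2).sum(1))]          # row farthest from c0
    centers = np.stack([c0, c1]).astype(float)
    labels = np.full(len(Z), -1)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(2)
        new_labels = dist.argmin(1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = Z[labels == k].mean(0)
    return labels


def bcss_per_feature(X, labels):
    """Between-cluster sum of squares of each feature (total minus within)."""
    total = ((X - X.mean(0)) ** 2).sum(0)
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        within += ((Xk - Xk.mean(0)) ** 2).sum(0)
    return total - within


def update_weights(a, s):
    """Maximize sum_j w_j * a_j with w >= 0, ||w||_2 <= 1, ||w||_1 <= s.

    The maximizer is a soft-thresholded, L2-normalized copy of a; the
    threshold is found by bisection.
    """
    a = np.maximum(a, 0.0)

    def w_of(delta):
        w = np.maximum(a - delta, 0.0)
        norm = np.linalg.norm(w)
        return w / norm if norm > 0 else w

    if w_of(0.0).sum() <= s:
        return w_of(0.0)
    lo, hi = 0.0, a.max()
    for _ in range(60):
        mid = (lo + hi) / 2
        if w_of(mid).sum() > s:
            lo = mid
        else:
            hi = mid
    return w_of(hi)


def sc_bicluster_sketch(X, s, n_outer=10):
    """Hypothetical two-step sketch: cluster observations, then pick features."""
    n, p = X.shape
    w = np.full(p, 1.0 / np.sqrt(p))
    for _ in range(n_outer):
        # Step 1: cluster observations using the weighted features.
        labels = two_means(X * np.sqrt(w))
        # Re-weight features by how strongly they separate the two clusters.
        w = update_weights(bcss_per_feature(X, labels), s)
    # Assumption: the smaller cluster is treated as the bicluster.
    k = np.bincount(labels, minlength=2).argmin()
    rows = np.flatnonzero(labels == k)
    # Step 2 (assumed rule): keep features that contribute to the clustering.
    features = np.flatnonzero(w > 0)
    return rows, features, w
```

Calling `sc_bicluster_sketch(X, s=3.0)` on an observations-by-features array returns candidate row indices, candidate feature indices, and the weight vector. Smaller values of `s` give sparser feature sets. A sketch of this form targets only differences in feature means; the variance-based and more general differences mentioned in the abstract would need a different clustering criterion.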

Publication types

Comparative Study
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Analysis of Variance
Biometry / methods*
Cluster Analysis*
Computer Simulation
Data Interpretation, Statistical
Disease / classification*
Disease / etiology*
Humans
Models, Statistical
Normal Distribution
Software
Temporomandibular Joint Disorders / classification
Temporomandibular Joint Disorders / etiology
