Capturing discrete latent structures: choose LDs over PCs

Theresa A Alexander; Rafael A Irizarry; Héctor Corrada Bravo

doi:10.1093/biostatistics/kxab030

Capturing discrete latent structures: choose LDs over PCs

Biostatistics. 2022 Dec 12;24(1):1-16. doi: 10.1093/biostatistics/kxab030.

Authors

Theresa A Alexander¹, Rafael A Irizarry², Héctor Corrada Bravo³

Affiliations

¹ Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
² Biostatistics and Computational Biology, Dana Farber Cancer Institute and Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.
³ Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA and Data Science and Statistical Computing, Genentech, Inc. South San Francisco, CA 94080, USA.

Abstract

High-dimensional biological data collection across heterogeneous groups of samples has become increasingly common, creating high demand for dimensionality reduction techniques that capture underlying structure of the data. Discovering low-dimensional embeddings that describe the separation of any underlying discrete latent structure in data is an important motivation for applying these techniques since these latent classes can represent important sources of unwanted variability, such as batch effects, or interesting sources of signal such as unknown cell types. The features that define this discrete latent structure are often hard to identify in high-dimensional data. Principal component analysis (PCA) is one of the most widely used methods as an unsupervised step for dimensionality reduction. This reduction technique finds linear transformations of the data which explain total variance. When the goal is detecting discrete structure, PCA is applied with the assumption that classes will be separated in directions of maximum variance. However, PCA will fail to accurately find discrete latent structure if this assumption does not hold. Visualization techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), attempt to mitigate these problems with PCA by creating a low-dimensional space where similar objects are modeled by nearby points in the low-dimensional embedding and dissimilar objects are modeled by distant points with high probability. However, since t-SNE and UMAP are computationally expensive, often a PCA reduction is done before applying them which makes it sensitive to PCAs downfalls. Also, tSNE is limited to only two or three dimensions as a visualization tool, which may not be adequate for retaining discriminatory information. The linear transformations of PCA are preferable to non-linear transformations provided by methods like t-SNE and UMAP for interpretable feature weights. Here, we propose iterative discriminant analysis (iDA), a dimensionality reduction technique designed to mitigate these limitations. iDA produces an embedding that carries discriminatory information which optimally separates latent clusters using linear transformations that permit post hoc analysis to determine features that define these latent structures.

Keywords: Bioinformatics; Multivariate data; Statistical modeling.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Research Support, N.I.H., Extramural

MeSH terms

Algorithms*
Humans
Principal Component Analysis

Abstract

Publication types

MeSH terms

Grants and funding