Score matching for differential abundance testing of compositional high-throughput sequencing data

bioRxiv [Preprint]. 2024 Dec 9:2024.12.05.627006. doi: 10.1101/2024.12.05.627006.

Abstract

The class of a-b power interaction models, proposed by Yu et al. (2024), provides a general framework for modeling sparse compositional count data with pairwise feature interactions. This class includes many distributions as special cases and enables modeling of zero counts through power transformations, making it particularly suitable for modern high-throughput sequencing data with excess zeros, including single-cell RNA-Seq and microbial amplicon data. Here, we present an extension of this class of models that allows inclusion of covariate information, thus enabling accurate characterization of covariate dependencies in heterogeneous populations. Combining this model with a tailored differential abundance (DA) test leads to a novel DA testing scheme, cosmoDA, that can reduce false positive detection rate caused by correlated features. cosmoDA uses penalized generalized score matching for parsimonious model fitting. We show on simulated benchmarks that cosmoDA can accurately estimate feature interactions in the presence of population heterogeneity and significantly reduces the false discovery rate when testing for differential abundance of correlated features. Using single-cell and amplicon data, we illustrate cosmoDA's ability to estimate data-adaptive Box-Cox-type data transformations and assess the impact of zero replacement and power transformations on downstream differential abundance results. cosmoDA is available at https://github.com/bio-datascience/cosmoDA.

Publication types

  • Preprint