A heavy-tailed model for analyzing miRNA-seq raw read counts

Stat Appl Genet Mol Biol. 2024 May 29;23(1). doi: 10.1515/sagmb-2023-0016. eCollection 2024 Jan 1.

Abstract

This article addresses the limitations of existing statistical models in analyzing and interpreting highly skewed miRNA-seq raw read counts, which can range from zero to millions. A heavy-tailed model using discrete stable distributions is proposed as a novel approach to better capture the heterogeneity and extreme values commonly observed in miRNA-seq data. Additionally, the parameters of the discrete stable distribution are proposed as an alternative target for differential expression analysis. An R package for computing and estimating the discrete stable distribution is provided. The proposed model is applied to miRNA-seq raw counts from the Norwegian Women and Cancer Study (NOWAC) and The Cancer Genome Atlas (TCGA) databases. The goodness-of-fit is compared with that of the popular Poisson and negative binomial distributions, and the discrete stable distributions are found to give a better fit for both datasets. In conclusion, using discrete stable distributions can potentially lead to more accurate modeling of the underlying biological processes.
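
To make the heavy-tail idea concrete, here is a minimal Python sketch of a discrete stable probability mass function. The abstract does not state which parameterization the article or its R package uses, so this sketch assumes the classical one-sided form with probability generating function G(z) = exp(-lam * (1 - z)^alpha), 0 < alpha <= 1, lam > 0, which reduces to the Poisson distribution when alpha = 1. The function name `discrete_stable_pmf` and its arguments are hypothetical and are not taken from the R package.

```python
import math


def discrete_stable_pmf(alpha, lam, n_max):
    """Return [P(X=0), ..., P(X=n_max)] for the assumed PGF
    G(z) = exp(-lam * (1 - z)**alpha), with 0 < alpha <= 1 and lam > 0.

    Illustrative only: the article's parameterization may differ.
    """
    if not (0.0 < alpha <= 1.0) or lam <= 0.0:
        raise ValueError("need 0 < alpha <= 1 and lam > 0")

    # c[k] = (-1)**k * binom(alpha, k): power-series coefficients of (1 - z)**alpha
    c = [1.0]
    for k in range(1, n_max + 1):
        c.append(c[-1] * (k - 1 - alpha) / k)

    # h[k] = coefficient of z**k in log G(z) = -lam * (1 - z)**alpha
    h = [-lam * ck for ck in c]

    # From G'(z) = h'(z) * G(z): n * p[n] = sum_{k=1..n} k * h[k] * p[n - k]
    p = [math.exp(-lam)]  # p[0] = G(0)
    for n in range(1, n_max + 1):
        p.append(sum(k * h[k] * p[n - k] for k in range(1, n + 1)) / n)
    return p


if __name__ == "__main__":
    n_max = 1000
    poisson_like = discrete_stable_pmf(1.0, 2.0, n_max)  # alpha = 1: Poisson(2)
    heavy = discrete_stable_pmf(0.5, 2.0, n_max)         # alpha < 1: heavy tail
    print("mass above", n_max, "for alpha=1.0:", 1.0 - sum(poisson_like))
    print("mass above", n_max, "for alpha=0.5:", 1.0 - sum(heavy))
```

Under this assumed form, lowering alpha below 1 leaves a non-negligible probability of counts far above the bulk of the data, which is the behavior the abstract attributes to miRNA-seq counts. The recursion costs O(n_max^2) operations, so it is meant for illustration rather than for counts in the millions.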

Keywords: TCGA; breast cancer; discrete stable distributions; extremes; lung cancer; miRNA-seq raw read counts.

MeSH terms

  • Female
  • Gene Expression Profiling / methods
  • Gene Expression Profiling / statistics & numerical data
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • MicroRNAs* / genetics
  • Models, Statistical*
  • Neoplasms / genetics
  • Sequence Analysis, RNA / methods
  • Sequence Analysis, RNA / statistics & numerical data
  • Software

Substances

  • MicroRNAs