A heavy-tailed model for analyzing miRNA-seq raw read counts

Annika Krutto; Therese Haugdahl Nøst; Magne Thoresen

doi:10.1515/sagmb-2023-0016

Stat Appl Genet Mol Biol. 2024 May 29;23(1). doi: 10.1515/sagmb-2023-0016. eCollection 2024 Jan 1.

Authors

Annika Krutto¹, Therese Haugdahl Nøst²,³, Magne Thoresen¹

Affiliations

¹ Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway.
² Department of Community Medicine, UiT The Arctic University of Norway, Tromsø, Norway.
³ Department of Public Health and Nursing, K.G. Jebsen Center for Genetic Epidemiology, UiT The Arctic University of Norway, Trondheim, Norway.

PMID: 38810893
DOI: 10.1515/sagmb-2023-0016

Abstract

This article addresses the limitations of existing statistical models in analyzing and interpreting highly skewed miRNA-seq raw read count data that can range from zero to millions. A heavy-tailed model using discrete stable distributions is proposed as a novel approach to better capture the heterogeneity and extreme values commonly observed in miRNA-seq data. Additionally, the parameters of the discrete stable distribution are proposed as an alternative target for differential expression analysis. An R package for computing and estimating the discrete stable distribution is provided. The proposed model is applied to miRNA-seq raw counts from the Norwegian Women and Cancer Study (NOWAC) and the Cancer Genome Atlas (TCGA) databases. The goodness-of-fit is compared with the popular Poisson and negative binomial distributions, and the discrete stable distributions are found to give a better fit for both datasets. In conclusion, the use of discrete stable distributions is shown to potentially lead to more accurate modeling of the underlying biological processes.
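
As an illustration of the distribution family named in the abstract, the following is a minimal Python sketch of the probability mass function of a discrete stable distribution. It assumes the standard Steutel and van Harn form with probability generating function G(z) = exp(-λ(1 - z)^γ), λ > 0, 0 < γ ≤ 1, which may differ from the parameterization used in the article. It is not the interface of the R package mentioned above; the function name `discrete_stable_pmf` and its arguments are hypothetical. The sketch relies on the fact that this distribution is compound Poisson with Sibuya-distributed jumps, so the probabilities can be obtained by Panjer recursion.

```python
import math


def discrete_stable_pmf(lam, gamma, n_max):
    """Return [P(X=0), ..., P(X=n_max)] for a discrete stable law.

    Assumed form (not taken from the article): PGF exp(-lam * (1 - z)**gamma).
    Hypothetical helper for illustration only.
    """
    if not (lam > 0 and 0 < gamma <= 1):
        raise ValueError("require lam > 0 and 0 < gamma <= 1")

    # Sibuya jump probabilities q[1..n_max], whose PGF is 1 - (1 - z)**gamma.
    q = [0.0] * (n_max + 1)
    if n_max >= 1:
        q[1] = gamma
    for j in range(1, n_max):
        q[j + 1] = q[j] * (j - gamma) / (j + 1)

    # Panjer recursion for a compound Poisson sum with rate lam.
    p = [math.exp(-lam)] + [0.0] * n_max
    for k in range(1, n_max + 1):
        p[k] = (lam / k) * sum(j * q[j] * p[k - j] for j in range(1, k + 1))
    return p


if __name__ == "__main__":
    # gamma = 1 reduces to the Poisson distribution; gamma < 1 gives a heavier tail.
    for gamma in (1.0, 0.5):
        pmf = discrete_stable_pmf(lam=2.0, gamma=gamma, n_max=50)
        print(f"gamma={gamma}: P(X > 50) is approximately {1.0 - sum(pmf):.3e}")
```

With γ = 1 the recursion reproduces the Poisson probabilities, while γ < 1 leaves noticeably more mass beyond any fixed cutoff, which is the heavy-tail behavior the abstract refers to. The recursion costs O(n_max²) operations, so it is suited to illustration rather than to counts in the millions.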

Keywords: TCGA; breast cancer; discrete stable distributions; extremes; lung cancer; miRNA-seq raw read counts.

MeSH terms

Female
Gene Expression Profiling / methods
Gene Expression Profiling / statistics & numerical data
High-Throughput Nucleotide Sequencing / methods
Humans
MicroRNAs* / genetics
Models, Statistical*
Neoplasms / genetics
Sequence Analysis, RNA / methods
Sequence Analysis, RNA / statistics & numerical data
Software

Substances

MicroRNAs