Deciphering gene expression patterns using large-scale transcriptomic data and its applications

Shunjie Chen; Pei Wang; Haiping Guo; Yujie Zhang

doi:10.1093/bib/bbae590

Brief Bioinform. 2024 Sep 23;25(6):bbae590. doi: 10.1093/bib/bbae590.

Authors

Shunjie Chen¹, Pei Wang¹ ², Haiping Guo¹, Yujie Zhang¹

Affiliations

¹ School of Mathematics and Statistics, Henan University, Jinming Avenue, 475004, Kaifeng, China.
² Henan Engineering Research Center for Industrial Internet of Things, Henan University, Mingli Road, 450046, Zhengzhou, China.

PMID: 39541191
DOI: 10.1093/bib/bbae590

Abstract

Gene expression varies stochastically across genders, racial groups, and health statuses. Deciphering these patterns is crucial for identifying informative genes, classifying samples, and understanding diseases such as cancer. This study analyzes 11,252 bulk RNA-seq samples (10,512 cancer tissue samples and 740 normal samples) to explore the expression patterns of 19,156 genes. Additionally, 4,884 single-cell RNA-seq samples are examined. Statistical analysis using 16 probability distributions shows that normal samples display a wider range of distributions than cancer samples. Cancer samples tend to favor asymmetric distributions such as the generalized extreme value, log-normal, and Gaussian mixture distributions. In contrast, certain genes in normal samples exhibit symmetric distributions. Remarkably, more than 95.5% of genes exhibit non-normal distributions, which challenges traditional normality assumptions. Furthermore, distributions differ significantly between bulk and single-cell RNA-seq data. Many cancer driver genes exhibit distinct distribution patterns across sample types, suggesting that distribution characteristics could be used for gene selection and classification. A novel skewness-based metric is proposed to quantify distribution variation across datasets; genes with significant skewness differences are shown to be biologically relevant. Finally, an improved naïve Bayes method that incorporates gene-specific distributions outperforms traditional methods in simulations. This work enhances understanding of gene expression and its application in omics-based gene selection and sample classification.
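To make the idea of comparing distribution shape across sample groups concrete, the following is a minimal Python sketch. It is not the authors' metric, which the abstract does not define. It assumes the simplest possible score, the absolute difference in sample skewness of one gene between two groups. The function names `sample_skewness` and `skewness_difference`, and the simulated "normal-like" and "cancer-like" expression values, are hypothetical and used only for illustration.

```python
import random


def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (biased estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant expression: treat as having no skew
    return m3 / m2 ** 1.5


def skewness_difference(group_a, group_b):
    """Illustrative score (an assumption, not the paper's definition):
    absolute difference in skewness of one gene between two sample groups."""
    return abs(sample_skewness(group_a) - sample_skewness(group_b))


# Simulated expression values for one hypothetical gene.
rng = random.Random(0)
normal_like = [rng.gauss(5.0, 1.0) for _ in range(2000)]           # symmetric
cancer_like = [rng.lognormvariate(1.0, 0.8) for _ in range(2000)]  # right-skewed

print("skewness (symmetric group):   ", round(sample_skewness(normal_like), 3))
print("skewness (right-skewed group):", round(sample_skewness(cancer_like), 3))
print("skewness difference:          ", round(skewness_difference(normal_like, cancer_like), 3))
```

Under this simplified score, a gene whose expression is roughly symmetric in one group and log-normal in the other receives a large value, while a gene with the same shape in both groups scores near zero.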

Keywords: gene expression distribution; gene selection; naïve Bayes; omics data; sample classification; skewness.

MeSH terms

Computational Biology / methods
Gene Expression Profiling* / methods
Gene Expression Regulation, Neoplastic
Humans
Neoplasms* / genetics
Neoplasms* / metabolism
Single-Cell Analysis / methods
Transcriptome*