Modeling and cleaning RNA-seq data significantly improve detection of differentially expressed genes

BMC Bioinformatics. 2022 Nov 16;23(1):488. doi: 10.1186/s12859-022-05023-z.

Abstract

Background: RNA-seq has become a standard technology to quantify mRNA. The measured values usually vary by several orders of magnitude, and while the detection of differences at high values is statistically well grounded, the significance of the differences for rare mRNAs can be weakened by the presence of biological and technical noise.

Results: We have developed a method for cleaning RNA-seq data that improves the detection of differentially expressed genes, specifically genes with low to moderate transcription. Using a data modeling approach, the parameters of randomly distributed mRNA counts are identified, and reads most probably originating from technical noise are removed. We demonstrate that removing this random component leads to a significant increase in the number of detected differentially expressed genes, more significant p-values, and no bias towards low-count genes.
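
The cleaning idea described in the Results can be illustrated with a minimal sketch. It is not the RNAdeNoise implementation: the abstract does not state which distribution is fitted, so the sketch assumes that the random component of the low counts decays exponentially, estimates the decay rate from the count histogram, and subtracts from every gene the count below which a chosen fraction of that random component falls. The function names, the fitting range, and the 0.99 fraction are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def fit_exponential_rate(counts, fit_range=(1, 10)):
    """Estimate the decay rate of the low-count histogram.

    Assumption (not from the abstract): the number of genes with count x
    falls off as exp(-rate * x) within fit_range. The rate is the negated
    slope of a least-squares line through (x, log(frequency)).
    """
    hist = Counter(counts)
    lo, hi = fit_range
    xs = [x for x in range(lo, hi + 1) if hist.get(x, 0) > 0]
    if len(xs) < 2:
        raise ValueError("need at least two non-empty histogram bins to fit")
    ys = [math.log(hist[x]) for x in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def noise_threshold(rate, removed_fraction=0.99):
    """Smallest integer count below which removed_fraction of the
    assumed exponential noise component lies."""
    if rate <= 0:
        raise ValueError("histogram does not decay; no noise model fitted")
    if not 0 < removed_fraction < 1:
        raise ValueError("removed_fraction must be strictly between 0 and 1")
    return math.ceil(-math.log(1 - removed_fraction) / rate)


def denoise(counts, fit_range=(1, 10), removed_fraction=0.99):
    """Subtract the fitted noise threshold from every count, flooring at zero."""
    rate = fit_exponential_rate(counts, fit_range)
    threshold = noise_threshold(rate, removed_fraction)
    return [max(c - threshold, 0) for c in counts], threshold


if __name__ == "__main__":
    # Toy per-gene counts for one sample: many low counts plus a few high ones.
    sample = [1] * 400 + [2] * 200 + [3] * 100 + [4] * 50 + [5] * 25 + [300, 1200]
    cleaned, threshold = denoise(sample)
    print("subtracted per gene:", threshold)
    print("genes left with non-zero counts:", sum(c > 0 for c in cleaned))
```

In this sketch the threshold is derived from each sample's own count distribution rather than chosen by hand, which mirrors the stated goal of avoiding subjectively set minimal-count cutoffs. The authors' actual program is available from the repository cited in the Conclusion.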

Conclusion: Application of RNAdeNoise to our RNA-seq data on polysome profiling and to several published RNA-seq datasets demonstrates its suitability for different organisms and sequencing technologies such as Illumina and BGI, shows improved detection of differentially expressed genes, and removes the need for subjective setting of thresholds for minimal RNA counts. The program, RNA-seq data, resulting gene lists, and examples of use are available in the supplementary data and at https://github.com/Deyneko/RNAdeNoise .

Keywords: Data cleaning; Data filtering; De-noise; Differential expression; RNA-seq; Statistical modeling.

MeSH terms

  • High-Throughput Nucleotide Sequencing*
  • RNA*
  • RNA, Messenger
  • RNA-Seq
  • Sequence Analysis, RNA / methods

Substances

  • RNA
  • RNA, Messenger