Benchmarking association analyses of continuous exposures with RNA-seq in observational studies

Brief Bioinform. 2021 Nov 5;22(6):bbab194. doi: 10.1093/bib/bbab194.

Abstract

Large observational datasets with RNA-seq measured on hundreds to thousands of individuals are becoming available. Many popular software packages for the analysis of RNA-seq data were designed to study differences in expression signatures in experiments with well-defined conditions (exposures). In contrast, observational studies may have varying levels of confounding of transcript-exposure associations; further, exposure measures may range from discrete (exposed, yes/no) to continuous (levels of exposure), often with non-normal distributions. We compare popular software for gene expression analysis (DESeq2, edgeR and limma) as well as linear regression-based analyses for studying the association of continuous exposures with RNA-seq. We developed a computational pipeline that includes transformation, filtering and generation of an empirical null distribution of association P-values, and we apply the pipeline to compute empirical P-values with multiple testing correction. We employ a resampling approach that allows for assessment of false positive detection across methods, power comparison and the computation of quantile empirical P-values. The results suggest that linear regression methods are substantially faster and control false detections better than the other methods, even when the resampling method is used to compute empirical P-values. We provide the proposed pipeline with fast algorithms in an R package, Olivia, and applied it to study the associations of measures of sleep disordered breathing with RNA-seq in peripheral blood mononuclear cells in participants from the Multi-Ethnic Study of Atherosclerosis.
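The idea of computing empirical P-values from an empirical null distribution can be illustrated with a minimal Python sketch. It is not the Olivia implementation or its API: the function names, the log2(count + 0.5) transformation, the low-count filter threshold and the use of simple exposure permutation as the resampling step are all assumptions made for illustration. Because every transcript shares the same degrees of freedom, ranking by |t| is equivalent to ranking by parametric P-value, so the sketch compares |t| statistics directly.

```python
import numpy as np

def t_statistics(log_expr, exposure):
    """t-statistic of the slope in log_expr[i] ~ exposure, for every transcript i."""
    x = exposure - exposure.mean()
    y = log_expr - log_expr.mean(axis=1, keepdims=True)
    sxx = np.sum(x ** 2)
    beta = y @ x / sxx                        # least-squares slope per transcript
    resid = y - np.outer(beta, x)
    df = x.size - 2                           # intercept and slope are estimated
    sigma2 = np.sum(resid ** 2, axis=1) / df  # residual variance per transcript
    return beta / np.sqrt(sigma2 / sxx)

def empirical_p_values(counts, exposure, n_perm=100, seed=0):
    """Empirical P-values against a null pooled over transcripts and permutations.

    Assumption: permuting the exposure stands in for the resampling step.
    """
    rng = np.random.default_rng(seed)
    log_expr = np.log2(counts + 0.5)          # assumed transformation
    observed = np.abs(t_statistics(log_expr, exposure))
    null = np.concatenate([
        np.abs(t_statistics(log_expr, rng.permutation(exposure)))
        for _ in range(n_perm)
    ])
    null.sort()
    # Number of null statistics at least as extreme as each observed statistic.
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    return (n_ge + 1) / (null.size + 1)       # +1 keeps P-values strictly positive

# Simulated data: a skewed (non-normal) exposure and Poisson counts.
rng = np.random.default_rng(1)
n_samples, n_transcripts = 60, 200
exposure = rng.exponential(scale=1.0, size=n_samples)
counts = rng.poisson(lam=50, size=(n_transcripts, n_samples)).astype(float)
counts[0] = rng.poisson(lam=50 * np.exp(0.5 * exposure))  # one planted association

keep = counts.mean(axis=1) >= 10              # hypothetical low-count filter
p_emp = empirical_p_values(counts[keep], exposure)
print(p_emp[:5])
```

Pooling null statistics across transcripts keeps the number of permutations small, at the cost of assuming the null distribution is similar across transcripts. A multiple testing correction would then be applied to `p_emp`.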

Keywords: RNA-seq; continuous exposure; empirical P-values; non-normality; observational studies.

Publication types

  • Multicenter Study
  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Atherosclerosis / epidemiology
  • Atherosclerosis / etiology
  • Atherosclerosis / metabolism
  • Benchmarking / methods*
  • Computer Simulation
  • Disease Susceptibility
  • Genetic Predisposition to Disease
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Mutation
  • Phenotype
  • RNA-Seq*
  • Risk Assessment
  • Risk Factors
  • Sequence Analysis, RNA*
  • Software*
  • Web Browser