Cost effective, experimentally robust differential-expression analysis for human/mammalian, pathogen and dual-species transcriptomics

Amol C Shetty; John Mattick; Matthew Chung; Carrie McCracken; Anup Mahurkar; Scott G Filler; Claire M Fraser; David A Rasko; Vincent M Bruno; Julie C Dunning Hotopp

doi:10.1099/mgen.0.000320

Microb Genom. 2020 Jan;6(1):e000320. doi: 10.1099/mgen.0.000320.

Authors

Amol C Shetty¹; John Mattick¹; Matthew Chung²,¹; Carrie McCracken¹; Anup Mahurkar¹; Scott G Filler³,⁴; Claire M Fraser⁵,¹; David A Rasko²,¹; Vincent M Bruno²,¹; Julie C Dunning Hotopp¹,²,⁶

Affiliations

¹ Institute for Genome Sciences, School of Medicine, University of Maryland, Baltimore, MD 21201, USA.
² Department of Microbiology and Immunology, School of Medicine, University of Maryland, Baltimore, MD 21201, USA.
³ David Geffen School of Medicine at UCLA, Los Angeles, CA 90502, USA.
⁴ Division of Infectious Diseases, Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA 90502, USA.
⁵ Department of Medicine, School of Medicine, University of Maryland, Baltimore, MD 21201, USA.
⁶ Greenebaum Cancer Center, University of Maryland, Baltimore, MD 21201, USA.

Abstract

As sequencing read length has increased, researchers have quickly adopted longer reads for their experiments. Here, we examine 14 pathogen or host-pathogen differential gene expression data sets to assess whether using longer reads is warranted. A variety of data sets was used to assess which genomic attributes might affect the outcome of differential gene expression analysis, including gene density, operons, gene length, number of introns/exons and intron length. No genome attribute was found to influence the data in principal components analysis, hierarchical clustering with bootstrap support, or regression analyses of pairwise comparisons that were undertaken on the same reads, looking at all combinations of paired and unpaired reads trimmed to 36, 54, 72 and 101 bp. Read pairing had the greatest effect when there was little variation in the samples from different conditions or in their replicates (e.g. little differential gene expression). Overall, however, 54 and 72 bp reads were typically most similar. Given differences in costs and mapping percentages, we recommend 54 bp reads for organisms with no or few introns and 72 bp reads for all others. In a third of the data sets, read pairing had absolutely no effect, despite paired reads having twice as much data. Therefore, single-end reads seem robust for differential-expression analyses, but in eukaryotes paired-end reads are likely desirable for analysing splice variants and should be preferred for data sets that are acquired with the intent to be community resources that might be used in secondary data analyses.
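The comparison design described in the abstract (paired and unpaired reads, each trimmed to 36, 54, 72 and 101 bp) can be illustrated with a minimal sketch. The abstract does not state how the trimming was performed, so the code below assumes that the first N bases of each read are kept; the function names and the in-memory FASTQ tuple representation are hypothetical and are not taken from the paper.

```python
from itertools import product

# Read lengths (bp) and layouts named in the abstract.
READ_LENGTHS = (36, 54, 72, 101)
LAYOUTS = ("single", "paired")


def conditions():
    """Enumerate every layout/length combination compared on the same reads."""
    return list(product(LAYOUTS, READ_LENGTHS))


def trim_record(record, length):
    """Trim one FASTQ record to at most `length` bases.

    `record` is a (header, sequence, separator, qualities) tuple.
    Assumption: bases are kept from the start of the read; the abstract
    does not specify which end was trimmed.
    """
    header, seq, sep, qual = record
    return (header, seq[:length], sep, qual[:length])


def trim_reads(records, length):
    """Apply the same trim to every record in an iterable of FASTQ tuples."""
    return [trim_record(r, length) for r in records]


if __name__ == "__main__":
    read = ("@read1", "ACGT" * 30, "+", "I" * 120)  # a made-up 120 bp read
    for layout, length in conditions():
        trimmed = trim_record(read, length)
        print(layout, length, len(trimmed[1]))
```

Trimming one set of reads in silico, rather than sequencing at each length separately, means that every condition is derived from the same underlying data, which is what allows the pairwise comparisons to be made "on the same reads".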

Keywords: RNA-Seq; dual species RNA-Seq; sequencing; transcriptomics.

MeSH terms

Animals
Aspergillus / genetics*
Bacteria / genetics*
Cost-Benefit Analysis
Dogs
Gene Expression Profiling* / economics
Host-Pathogen Interactions / genetics*
Humans
Ixodes / genetics*
Mice
RNA-Seq
Transcriptome
