Motivation: The recognition that transposable elements (TEs) play important roles in many biological processes has elicited growing interest in analyzing sequencing data derived from this 'dark genome'. This is, however, complicated by the highly repetitive nature of these sequences, which requires deploying several problem-specific tools and curating appropriate genome annotations. The pipeline presented here aims to make the analysis of TE sequences and their expression more generally accessible.
Results: The TE-Seq pipeline conducts an end-to-end analysis of RNA sequencing data, examining both genes and TEs. It implements the most current computational methods tailored to TEs and produces a comprehensive analysis of TE expression at the level of both individual elements and TE clades. Furthermore, if supplied with long-read DNA sequencing data, it can assess TE expression from non-reference (polymorphic) loci. As a demonstration, we analyzed RNA-Seq data from proliferating, early senescent, and late senescent lung fibroblasts, and created a custom reference genome and annotations for this cell strain using Nanopore sequencing data. We found that several retrotransposable element (RTE) clades were upregulated in senescence, including non-reference, intact, and potentially active elements.
Availability and implementation: TE-Seq is available as a Snakemake pipeline at https://github.com/maxfieldk/TE-Seq. All software dependencies other than Snakemake and Docker/Singularity are packaged into a container, which the pipeline automatically builds and deploys at runtime.