Transcriptomics and epigenetic data integration learning module on Google Cloud

Nathan A Ruprecht; Joshua D Kennedy; Benu Bansal; Sonalika Singhal; Donald Sens; Angela Maggio; Valena Doe; Dale Hawkins; Ross Campbel; Kyle O'Connell; Jappreet Singh Gill; Kalli Schaefer; Sandeep K Singhal

doi:10.1093/bib/bbae352

Brief Bioinform. 2024 Jul 23;25(Supplement_1):bbae352. doi: 10.1093/bib/bbae352.

Authors

Nathan A Ruprecht¹; Joshua D Kennedy¹,²; Benu Bansal¹; Sonalika Singhal³; Donald Sens³; Angela Maggio⁴; Valena Doe⁵; Dale Hawkins⁵; Ross Campbel⁶; Kyle O'Connell⁶; Jappreet Singh Gill¹; Kalli Schaefer¹; Sandeep K Singhal¹,³

Affiliations

¹ Department of Biomedical Engineering, University of North Dakota, 501 N. Columbia Road Stop 8380, Grand Forks, ND 58202, United States.
² Department of Chemistry and Physics, Drury University, 900 N. Benton Avenue, Springfield, MO 65802, United States.
³ Department of Pathology, University of North Dakota, 1301 N. Columbia Road Stop 9037, Grand Forks, ND 58202, United States.
⁴ Deloitte, Health Data and AI, Deloitte Consulting LLP, 1919 N. Lynn Street, Suite 1500, Arlington, VA 22209, United States.
⁵ Google, Google Cloud, 1900 Reston Metro Plaza, Reston, VA 20190, United States.
⁶ NIH Center for Information Technology (CIT), 6555 Rock Spring Drive, Bethesda, MD 20892, United States.

Abstract

Multi-omics (genomics, transcriptomics, epigenomics, proteomics, metabolomics, etc.) research approaches are vital for understanding the hierarchical complexity of human biology and have proven to be extremely valuable in cancer research and precision medicine. Recent scientific advances have made high-throughput genome-wide sequencing a central focus in molecular research by allowing for the collective analysis of various kinds of molecular biological data from different types of specimens in a single tissue or even at the level of a single cell. Additionally, with the help of improved computational resources and data mining, researchers can integrate data from different multi-omics regimes to identify new prognostic, diagnostic, or predictive biomarkers, uncover novel therapeutic targets, and develop more personalized treatment protocols for patients. For the research community to extract the scientifically and clinically meaningful information from the biological data generated each day more efficiently and with fewer wasted resources, familiarity and comfort with advanced analytical tools, such as Google Cloud Platform, become imperative. This project is an interdisciplinary, cross-organizational effort to provide a guided learning module for integrating transcriptomics and epigenetics data analysis protocols into a comprehensive analysis pipeline for users to implement in their own work, utilizing the cloud computing infrastructure on Google Cloud. The learning module consists of three submodules that guide the user through tutorial examples illustrating the analysis of RNA sequencing (RNA-seq) and reduced-representation bisulfite sequencing (RRBS) data. The examples take the form of breast cancer case studies, and the data sets were obtained from the public Gene Expression Omnibus (GEO) repository.
The first submodule is devoted to transcriptomics analysis with the RNA sequencing data, the second submodule focuses on epigenetics analysis using the DNA methylation data, and the third submodule integrates the two methods for a deeper biological understanding. The modules begin with data collection and preprocessing, with further downstream analysis performed in a Vertex AI Jupyter notebook instance with an R kernel. Analysis results are returned to Google Cloud buckets for storage and visualization, removing the computational strain from local resources. The final product is a start-to-finish tutorial that lets researchers with limited experience in multi-omics integrate transcriptomics and epigenetics data analysis into a comprehensive pipeline for their own biological research.
This manuscript describes the development of a resource module that is part of a learning platform named "NIGMS Sandbox for Cloud-based Learning" (https://github.com/NIGMS/NIGMS-Sandbox). The overall genesis of the Sandbox is described in the editorial NIGMS Sandbox [16] at the beginning of this Supplement. This module delivers learning materials on the integrated analysis of transcriptomics (RNA-seq) and epigenetics (RRBS DNA methylation) data in an interactive format that uses appropriate cloud resources for data access and analyses.
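As a rough illustration of what the third, integrative submodule is aiming at, the sketch below pairs per-gene expression changes with per-gene promoter methylation changes and flags genes where the two move in opposite directions. This is only one common integration heuristic and is an assumption here, not the module's documented method (the module itself works in R). The gene names, values, cutoffs, and the function names `classify` and `integrate` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneCall:
    gene: str
    log2_fold_change: float  # expression change, e.g. tumor vs. normal (toy value)
    methylation_diff: float  # promoter methylation difference in [-1, 1] (toy value)


def classify(call: GeneCall, lfc_cutoff: float = 1.0, meth_cutoff: float = 0.2) -> str:
    """Label a gene by whether methylation and expression change inversely.

    The cutoffs are illustrative assumptions, not values taken from the module.
    """
    if call.methylation_diff >= meth_cutoff and call.log2_fold_change <= -lfc_cutoff:
        return "hypermethylated_downregulated"
    if call.methylation_diff <= -meth_cutoff and call.log2_fold_change >= lfc_cutoff:
        return "hypomethylated_upregulated"
    return "no_inverse_pattern"


def integrate(expression: dict, methylation: dict) -> dict:
    """Classify only the genes measured in both data sets."""
    shared = sorted(set(expression) & set(methylation))
    return {g: classify(GeneCall(g, expression[g], methylation[g])) for g in shared}


# Hypothetical toy inputs standing in for RNA-seq and RRBS summary tables.
toy_expression = {"GENE_A": -2.1, "GENE_B": 1.8, "GENE_C": 0.3, "GENE_D": 2.5}
toy_methylation = {"GENE_A": 0.35, "GENE_B": -0.28, "GENE_C": 0.40, "GENE_E": -0.5}

results = integrate(toy_expression, toy_methylation)
for gene, label in results.items():
    print(gene, label)
```

Genes present in only one of the two inputs (here `GENE_D` and `GENE_E`) are dropped before classification, which mirrors the practical point that integration is only possible where both assays cover the same gene.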

Keywords: DNA methylation; Google Cloud computing; R Bioconductor; epigenomics; multi-omics integration; transcriptomics.

MeSH terms

Cloud Computing*
Computational Biology / methods
Data Mining / methods
Epigenesis, Genetic
Epigenomics* / methods
Gene Expression Profiling / methods
Humans
Software
Transcriptome
