Uncovering co-regulatory modules and gene regulatory networks in the heart through machine learning-based analysis of large-scale epigenomic data

Naima Vahab; Tarun Bonu; Levin Kuhlmann; Mirana Ramialison; Sonika Tyagi

doi:10.1016/j.compbiomed.2024.108068

Comput Biol Med. 2024 Mar:171:108068. doi: 10.1016/j.compbiomed.2024.108068. Epub 2024 Feb 10.

Authors

Naima Vahab¹, Tarun Bonu², Levin Kuhlmann², Mirana Ramialison³, Sonika Tyagi⁴

Affiliations

¹ School of Computational Technologies, RMIT University, Melbourne VIC 3000, Australia; Department of Infectious Diseases, Alfred Hospital, Prahran VIC 3008, Australia.
² Faculty of Information Technology, Monash University, Clayton VIC 3800, Australia.
³ Murdoch Children's Research Institute, Melbourne VIC 3000, Australia.
⁴ School of Computational Technologies, RMIT University, Melbourne VIC 3000, Australia; Department of Infectious Diseases, Alfred Hospital, Prahran VIC 3008, Australia. Electronic address: [email protected].

PMID: 38354497
DOI: 10.1016/j.compbiomed.2024.108068

Abstract

The availability of large-scale epigenomic data from various cell types and conditions has yielded valuable insights for evaluating and learning the features that predict co-binding of transcription factors (TFs). However, prior attempts to develop models predicting motif co-occurrence lacked the scalability needed to analyze arbitrary motif combinations globally or to make cross-species predictions. Moreover, mapping co-regulatory modules (CRMs) to gene regulatory networks (GRNs) is crucial for understanding their underlying function. Currently, no comprehensive pipeline exists for large-scale, rapid, and accurate identification of CRMs and GRNs. In this study, we analyzed and evaluated the TF binding characteristics that facilitate biologically significant co-binding in order to identify all potential clusters of co-binding TFs. We curated the UniBind database, which contains ChIP-Seq data from over 1983 samples and 232 TFs, and implemented two machine learning models to predict CRMs and the potential regulatory networks they operate on. The two models, a convolutional neural network (CNN) and a random forest classifier (RFC), predict co-binding between TFs and were compared using precision-recall and receiver operating characteristic (ROC) curves. The CNN outperformed the RFC in both AUC (0.94 vs. 0.88) and F1 score (0.938 vs. 0.872). The CRMs generated by the clustering algorithm were validated against ChIP-Atlas and MCOT, revealing additional motifs that form CRMs. We predicted 200k CRMs for 50k+ human genes and validated them against recent CRM prediction methods with 100% overlap. We then narrowed our focus to heart-related regulatory motifs, filtering the generated CRMs to report 1784 cardiac CRMs containing at least four cardiac TFs. The identified cardiac CRMs revealed potential novel regulators, such as ARID3A and RXRB for SCAD, as well as known TFs, such as PPARG for F11R. Our findings highlight the importance of the NKX family of transcription factors in cardiac development and provide potential targets for further investigation in cardiac disease.
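
The cardiac filtering step described in the abstract (keeping only CRMs that contain at least four cardiac TFs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `filter_cardiac_crms`, the toy CRM identifiers and memberships, and the short cardiac TF list are all hypothetical, since the study's actual TF list and CRM data format are not given here.

```python
# Illustrative placeholder list of cardiac TFs (assumption, not the study's list).
CARDIAC_TFS = {"NKX2-5", "GATA4", "TBX5", "MEF2C", "HAND2"}


def filter_cardiac_crms(crms, cardiac_tfs, min_cardiac=4):
    """Return {crm_id: cardiac TF count} for CRMs that contain
    at least `min_cardiac` distinct TFs from `cardiac_tfs`."""
    kept = {}
    for crm_id, tfs in crms.items():
        # Count distinct cardiac TFs that are members of this CRM.
        n_cardiac = len(set(tfs) & cardiac_tfs)
        if n_cardiac >= min_cardiac:
            kept[crm_id] = n_cardiac
    return kept


# Toy CRMs (hypothetical identifiers and memberships, for illustration only).
toy_crms = {
    "crm_1": ["NKX2-5", "GATA4", "TBX5", "MEF2C", "ARID3A"],  # 4 cardiac TFs
    "crm_2": ["GATA4", "TBX5", "RXRB"],                       # 2 cardiac TFs
    "crm_3": ["NKX2-5", "GATA4", "TBX5", "MEF2C", "HAND2"],   # 5 cardiac TFs
}

cardiac_crms = filter_cardiac_crms(toy_crms, CARDIAC_TFS)
print(cardiac_crms)
```

With the toy data above, only `crm_1` and `crm_3` pass the threshold. Non-cardiac members such as ARID3A stay attached to a retained CRM, which is how a filter of this kind could surface candidate regulators that co-occur with known cardiac TFs.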

Keywords: Area under the ROC curve; CNN; CRM; Cardiac diseases; Epigenomics; Gene regulation; Gene regulatory networks; MCOT; Machine learning; Random forest; Receiver operating characteristic; Transcription factor.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
DNA-Binding Proteins / genetics
Epigenomics*
Gene Regulatory Networks* / genetics
Heart
Humans
Transcription Factors / genetics
Transcription Factors / metabolism

Substances

Transcription Factors
ARID3A protein, human
DNA-Binding Proteins