The availability of large-scale epigenomic data from diverse cell types and conditions has made it possible to evaluate and learn features that predict the co-binding of transcription factors (TFs). However, prior models for predicting motif co-occurrence lacked the scalability to analyze arbitrary motif combinations globally or to make cross-species predictions. Moreover, mapping co-regulatory modules (CRMs) to gene regulatory networks (GRNs) is crucial for understanding their underlying function. Currently, no comprehensive pipeline exists for large-scale, rapid, and accurate CRM and GRN identification. In this study, we analyzed and evaluated the TF binding characteristics that facilitate biologically significant co-binding in order to identify all potential clusters of co-binding TFs. We curated the UniBind database, containing ChIP-Seq data from over 1983 samples and 232 TFs, and implemented two machine learning models to predict CRMs and the potential regulatory networks they operate on. The two models, a Convolutional Neural Network (CNN) and a Random Forest Classifier (RFC), were used to predict co-binding between TFs and were compared using precision-recall and Receiver Operating Characteristic (ROC) curves. The CNN outperformed the RFC (AUC 0.94 vs. 0.88) and achieved a higher F1 score (0.938 vs. 0.872). The CRMs generated by the clustering algorithm were validated against ChIP-Atlas and MCOT, revealing additional motifs forming CRMs. We predicted 200k CRMs for more than 50k human genes and validated them against recent CRM prediction methods, with 100% overlap. We then focused on heart-related regulatory motifs, filtering the generated CRMs to report 1784 cardiac CRMs containing at least four cardiac TFs. The identified cardiac CRMs revealed potential novel regulators, such as ARID3A and RXRB for SCAD, as well as known TFs, such as PPARG for F11R.
Our findings highlight the importance of the NKX family of transcription factors in cardiac development and provide potential targets for further investigation in cardiac disease.
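The cardiac filtering step described above (retaining CRMs that contain at least four cardiac TFs) can be illustrated with a minimal sketch. This is not the study's pipeline: the function name `filter_cardiac_crms`, the `CARDIAC_TFS` set, the example CRM identifiers, and the choice to count distinct TFs are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set

# Hypothetical cardiac TF set, for illustration only; not the study's list.
CARDIAC_TFS: Set[str] = {"NKX2-5", "GATA4", "TBX5", "MEF2C", "HAND2"}


def filter_cardiac_crms(
    crms: Dict[str, Iterable[str]],
    cardiac_tfs: Set[str] = CARDIAC_TFS,
    min_cardiac: int = 4,
) -> Dict[str, Set[str]]:
    """Keep CRMs whose members include at least `min_cardiac` distinct cardiac TFs.

    Returns a mapping from CRM id to the cardiac TFs found in that CRM.
    Counting distinct TFs (rather than binding sites) is an assumption.
    """
    kept: Dict[str, Set[str]] = {}
    for crm_id, tfs in crms.items():
        hits = set(tfs) & set(cardiac_tfs)  # distinct cardiac TFs in this CRM
        if len(hits) >= min_cardiac:
            kept[crm_id] = hits
    return kept


# Toy input: CRM id -> TFs predicted to co-bind in that module (made-up data).
example_crms = {
    "crm_1": ["NKX2-5", "GATA4", "TBX5", "MEF2C", "CTCF"],
    "crm_2": ["GATA4", "TBX5", "CTCF"],
    "crm_3": ["NKX2-5", "NKX2-5", "GATA4", "TBX5"],
}

cardiac_crms = filter_cardiac_crms(example_crms)
print(sorted(cardiac_crms))  # ['crm_1']
```

In this toy input only `crm_1` passes: `crm_3` lists four entries but only three distinct cardiac TFs. If the intended criterion counts binding sites rather than distinct TFs, the set intersection would need to be replaced with a per-site count.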
Keywords: Area under the ROC curve; CNN; CRM; Cardiac diseases; Epigenomics; Gene regulation; Gene regulatory networks; MCOT; Machine learning; Random forest; Receiver operating characteristic; Transcription factor.
Copyright © 2024 The Authors. Published by Elsevier Ltd. All rights reserved.