LOGOWheat: deep learning-based prediction of regulatory effects for noncoding variants in wheats

Brief Bioinform. 2024 Nov 22;26(1):bbae705. doi: 10.1093/bib/bbae705.

Abstract

Identifying the regulatory effects of noncoding variants remains a significant challenge. The recent accumulation of epigenomic profiling data in wheat provides an opportunity to model the functional impacts of these variants. In this study, we introduce Language of Genome for Wheat (LOGOWheat), a deep learning-based tool designed to predict the regulatory effects of noncoding variants in wheat. LOGOWheat first uses a self-attention-based, contextualized pretrained language model to learn bidirectional representations of the unlabeled wheat reference genome. Epigenomic profiling data are then used to fine-tune the model, enabling it to discern the regulatory code inherent in genomic sequences. Test results suggest that LOGOWheat is highly effective in predicting multiple chromatin features, achieving an average area under the receiver operating characteristic curve (AUROC) of 0.8531 and an average area under the precision-recall curve (AUPRC) of 0.7633. Two case studies demonstrate the main functions of LOGOWheat: scoring and prioritizing causal variants within a given variant set, and constructing an in silico saturated mutagenesis map to discover high-impact sites or functional motifs in a given sequence. Finally, we propose extracting potentially functional variants from the wheat population by integrating evolutionary conservation information. LOGOWheat is available at http://logowheat.cn/.
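
The two functions described in the abstract, variant scoring and in silico saturated mutagenesis, can be illustrated with a minimal sketch. It is not the LOGOWheat implementation: the abstract does not state the scoring formula, so the absolute difference between reference and alternate predictions is an assumption, and `toy_predict`, `variant_score`, `saturation_map`, and `rank_variants` are hypothetical names. `toy_predict` stands in for a fine-tuned model that would return a chromatin-feature probability for a sequence.

```python
from typing import Callable, Dict, List, Tuple

BASES = "ACGT"


def toy_predict(seq: str) -> float:
    """Hypothetical stand-in for a trained model.

    Returns 1.0 if the sequence contains the motif TGACG, else 0.0.
    A real model would return a predicted chromatin-feature probability.
    """
    return 1.0 if "TGACG" in seq else 0.0


def variant_score(ref_seq: str, pos: int, alt: str,
                  predict: Callable[[str], float]) -> float:
    """Score a substitution as |P(alt) - P(ref)| (an assumed scoring rule)."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return abs(predict(alt_seq) - predict(ref_seq))


def saturation_map(ref_seq: str,
                   predict: Callable[[str], float]) -> Dict[Tuple[int, str], float]:
    """Score every possible single-base substitution in ref_seq."""
    scores: Dict[Tuple[int, str], float] = {}
    for pos, ref_base in enumerate(ref_seq):
        for alt in BASES:
            if alt != ref_base:
                scores[(pos, alt)] = variant_score(ref_seq, pos, alt, predict)
    return scores


def rank_variants(variants: List[Tuple[int, str]], ref_seq: str,
                  predict: Callable[[str], float]) -> List[Tuple[int, str, float]]:
    """Return (pos, alt, score) tuples sorted from highest to lowest score."""
    scored = [(pos, alt, variant_score(ref_seq, pos, alt, predict))
              for pos, alt in variants]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Example sequence: the toy motif TGACG occupies positions 2 to 6 (0-based).
example_seq = "AATGACGAA"
example_map = saturation_map(example_seq, toy_predict)
example_ranking = rank_variants([(0, "C"), (4, "T")], example_seq, toy_predict)
```

In this toy setting, substitutions inside the motif receive the highest scores, which is the kind of pattern a saturated mutagenesis map is meant to expose. With a real model, `predict` would typically return one value per chromatin feature, and the per-feature differences would need to be aggregated into a single score.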

Keywords: deep learning; noncoding variants; self-attention; variant score.

MeSH terms

  • Computational Biology / methods
  • Deep Learning*
  • Epigenomics / methods
  • Genetic Variation
  • Genome, Plant
  • Software
  • Triticum* / genetics