OmniGenome: Aligning RNA Sequences with Secondary Structures in Genomic Foundation Models

Heng Yang Department of Computer Science, University of Exeter, EX4 4QF, Exeter, UK Ke Li Department of Computer Science, University of Exeter, EX4 4QF, Exeter, UK

Abstract: The structures of RNA sequences play a vital role in various cellular processes, yet existing genomic foundation models (FMs) struggle with precise sequence-structure alignment because of the exponential number of possible combinations of nucleotide bases. In this study, we introduce OmniGenome, a foundation model that addresses this critical challenge of sequence-structure alignment in RNA FMs. OmniGenome bridges sequences and secondary structures using structure-contextualized modeling, enabling hard in-silico genomic tasks that existing FMs cannot handle, e.g., RNA design tasks. The results on two comprehensive genomic benchmarks show that OmniGenome achieves state-of-the-art performance on complex RNA subtasks. For example, OmniGenome solved 74% of complex puzzles, whereas SpliceBERT solved only 3% of the puzzles. Moreover, OmniGenome solves most of the puzzles within 1 hour, while existing methods usually allocate 24 hours for each puzzle. Overall, OmniGenome establishes wide genomic application cases and offers profound insights into biological mechanisms from the perspective of sequence-structure alignment.

Keywords: Genomic Foundation Model, RNA Design, Secondary Structure Prediction

1 Introduction

RNA, folded from diverse nucleotide sequences, is a critical type of molecule that enables the flow of genetic information from DNA to protein [1, 2]. Similar to natural language texts, nucleotide sequences are regarded as sentences written in the “language” of genomics [3]. Therefore, many studies aim to leverage pretrained genomic foundation models, a.k.a. language models (LMs), to address in-silico RNA modeling tasks, such as secondary structure prediction [4, 5, 6, 7], degradation rate prediction [8, 9], and mRNA vaccine design [10, 11]. The secondary structures of RNA molecules (throughout this work, “structure” refers to “RNA secondary structure”) have a vital impact on various cellular processes [12], such as gene expression [13, 14] and protein translation [15, 16]. However, the performance of existing FMs is unsatisfactory because they only model the RNA sequences without secondary structures. Consequently, leveraging secondary structures in RNA modeling is crucial for FMs to solve complex genomic subtasks, e.g., RNA design [17]. One intuitive way to utilize secondary structures is to build a bidirectional mapping between sequences and structures. We formulate this mapping as sequence-structure alignment, which can be divided into two parts: sequence-to-structure (Seq2Str) and structure-to-sequence (Str2Seq) prediction. Both predictions are challenging and have not been settled in previous works, because sequences and structures are sophisticated due to the exponential combinations of up to thousands of bases. In short, to align sequence and structure in FM-based genomic modeling, we need to address the Seq2Str and Str2Seq predictions simultaneously.

Seq2Str prediction. Secondary structure prediction is a difficult task because the number of possible structures grows exponentially with sequence length, and an identical sequence may fold into different sub-optimal structures because the folding principles of RNA sequences depend on a variety of biological factors [18], as indicated in Fig. 1. Moreover, compared to raw sequences, biologically verified structures are expensive to obtain and process. This causes an imbalance in data scale between RNA sequences and verified structures, making it hard to train an FM for Seq2Str prediction, which requires a tremendous number of secondary structures. However, existing works on RNA modeling generally use the masked language modeling [19] objective adapted from natural language processing (NLP) to pretrain FMs and treat secondary structure prediction as a downstream fine-tuning task [5, 6], which inevitably limits the performance of FMs that have not been pretrained on Seq2Str prediction.

Figure 1: A real example of RNA sequence folding. Sub-figures (a) and (c) show the same sequence folding into different structures. Sub-figures (b) and (c) show that an identical structure can arise from different sequences.

To address the data scarcity of secondary structures in Seq2Str, we leverage ViennaRNA [20] to compute the structures for billions of RNA sequences and introduce Seq2Str prediction into the pretraining phase. The pretraining on billions of structures will build an effective and robust mapping from sequences to structures. Our experiments in Section 3.3 show that the FM can achieve state-of-the-art performance even in zero-shot secondary structure prediction.

Str2Seq prediction. In addition to the data scarcity problem described above for Seq2Str, Str2Seq prediction presents a substantial challenge because one secondary structure can correspond to heterogeneous sequences, as shown in Fig. 1, where an identical structure arises from different sequences. Therefore, it is impossible to reconstruct sequences from structures alone through structure-to-sequence supervised training, because there is no deterministic sequence for each structure. Consequently, Str2Seq prediction has not been explored in existing genomic FMs.

Figure 2: A virtual example of sequence and structure tokenization in FMs. We need to expand the vocabulary to support structure tokenization and embedding, otherwise the structure cannot be recognized as shown in the top sub-figure. We show our Str2Seq modeling paradigm in the bottom sub-figure, where the ‘M’ indicates the masked tokens to be reconstructed by OmniGenome.

To overcome the challenge in Str2Seq prediction, we formulate the Str2Seq as a structure-contextualized sequence reconstruction task. Since RNA secondary structures can be represented in texts, we concatenate the sequences and structure pairs and then mask a small portion of the sequence or structure tokens before feeding them into the FM. In other words, we pretrain the FM to reconstruct the missing nucleotide bases given the context of structures, which will build the Str2Seq mapping ability for the FM. To perform this pretraining objective, we expand the vocabulary to represent the structure and deploy corresponding embeddings in the FM, because regular genomic FMs cannot recognize and embed the structures, as shown in the virtual example in Fig. 2.

Benchmark and Evaluation Results. We conduct two comprehensive in-silico genomic understanding benchmarks to evaluate the performance of our FM. The first benchmark, compiled in this study, is the RNA genomic benchmark (RGB), which contains diverse challenging genomic understanding tasks that benefit from sequence-structure alignment, such as secondary structure prediction. The second benchmark is the plant genomic benchmark (PGB), which contains millions of DNA sequences for evaluating DNA sequence understanding tasks. This benchmark is used to evaluate the generalizability of our FM across diverse species and genomes. Our FM (up to 186M parameters) consistently outperforms existing genomic FMs on both benchmarks, with up to 35% improvement even compared with agro-NT [14], which contains 1 billion parameters.

We also conduct zero-shot Seq2Str and Str2Seq prediction experiments to verify the performance of sequence-structure alignment. As revealed by the experiments in Section 3.2 and Section 3.3, our FM achieves up to an 83% macro-F1 score in zero-shot Seq2Str prediction, i.e., secondary structure prediction, outperforming some fine-tuned FMs and ViennaRNA. In terms of Str2Seq prediction performance, we leverage our FM to solve the complex RNA design task and solve 74% of the complex puzzles of the Eterna [17] V2 benchmark, while state-of-the-art FMs, e.g., SpliceBERT [21], solve only 3% of the puzzles, all of them simple (i.e., the target sequences are very short). Moreover, our FM takes at most 1 hour to solve most of the puzzles, while the majority of RNA design methods can take up to 24 hours to solve each puzzle.

Open-source Resources. Genomic sequence (e.g., RNA and DNA) modeling is still in its early stage, and the related open-source resources are extremely limited and poorly documented. In addition, many of the early works on genomic FMs are not ready to use for various reasons. This problem encourages us to release all the benchmarks, evaluation scripts, FM tutorials, etc. As a result, we have developed an open-source package (https://github.com/yangheng95/OmniGenomeBench) that includes step-by-step tutorials for FM pretraining and downstream task fine-tuning. It provides prepared genomic benchmarks and automates the benchmarking of FMs through an API with only a few lines of code. These resources avoid tedious and time-consuming programming in future genomic FM research.

Finally, the FM proposed in this work is named OmniGenome, which may advance a deep mechanistic understanding of RNA sequences as well as molecular biology and accelerate our ability to design RNA molecules.

2 Methodology

This section delineates the implementation of OmniGenome step by step.

2.1 Data Preprocessing

Figure 3: An example of RNA sequence tokenization. The left sub-figure shows that K-mers and BPE entangle the bases and fail to align the SN-level inputs and outputs. The right sub-figure denotes that only SNT can achieve sequence-structure alignment, such as Seq2Str prediction.

Recent studies [21, 22] have shown that data diversity can enhance FM performance without significantly increasing model capacity. For the OmniGenome pretraining, we collected transcriptome data from the OneKP initiative [23], which compiles large-scale RNA-Seq data from 1,124 plant species. Because the raw biological data are sparse, noisy, and characterized by long sequences, they are far from ready for effective FM training.

To address this problem, we developed a four-step data curation protocol to improve data quality. ▶ Raw RNA-Seq data are often excessively long, with thousands of bases. We first sliced them into segments with a window size of 1,024 bases to provide a sufficient context window for RNA sequence understanding. ▶ To enhance training efficiency and reduce bias, we removed all duplicate sequences. ▶ To tackle incomplete transcriptome data and other noise, we discarded sequences shorter than 50 bases. ▶ To facilitate the sequence-structure alignment training, we used ViennaRNA to obtain the secondary structures of the sequences.
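The four curation steps can be sketched as a small Python generator. This is only an illustration under stated assumptions, not the authors' pipeline: the windows are taken as non-overlapping (the text gives a window size but no stride), the function names `curate` and `vienna_fold` are hypothetical, and the folding step relies on the `RNA.fold` call of the ViennaRNA Python bindings, which must be installed separately.

```python
from typing import Callable, Iterable, Iterator, Tuple

WINDOW = 1024   # window size stated in the text
MIN_LEN = 50    # minimum sequence length stated in the text


def vienna_fold(seq: str) -> str:
    """Fold one sequence with the ViennaRNA Python bindings."""
    import RNA  # imported lazily so the rest of the sketch runs without it
    structure, _mfe = RNA.fold(seq)
    return structure


def curate(raw: Iterable[str],
           fold: Callable[[str], str] = vienna_fold) -> Iterator[Tuple[str, str]]:
    """Yield (sequence, dot-bracket structure) pairs after the four steps."""
    seen = set()
    for record in raw:
        record = record.strip().upper()
        # Step 1: slice into windows (non-overlapping is an assumption).
        for start in range(0, len(record), WINDOW):
            segment = record[start:start + WINDOW]
            # Step 2: drop exact duplicates.
            if segment in seen:
                continue
            seen.add(segment)
            # Step 3: drop segments shorter than 50 bases.
            if len(segment) < MIN_LEN:
                continue
            # Step 4: attach a computed secondary structure.
            yield segment, fold(segment)
```

Passing a different `fold` callable makes it possible to exercise the slicing, deduplication, and length filter without ViennaRNA installed.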

2.2 Tokenization

In genomic sequence modeling, both modeling resolution and performance depend heavily on the tokenization strategy [3, 21]. Because the traditional k-mers [24, 13] and BPE [19, 22] methods combine multiple bases into single tokens, they may compromise modeling resolution and thus fail to capture granular features. To achieve the single nucleotide (SN) resolution necessary for genomic tasks like RNA design and structure prediction, we employed the single nucleotide tokenization (SNT) method [3, 21] (see an illustrative example in Fig. 3). Additionally, to simplify the implementation, we adopted a vocabulary {‘A’, ‘T’, ‘C’, ‘G’, ‘U’, ‘N’, ‘(’, ‘)’, ‘.’} to unify the tokenization of both nucleotide bases and RNA secondary structure information. Following BERT-style models [19], we incorporated special tokens, e.g., ‘<mask>’, to facilitate masked base and structure prediction.
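A minimal sketch of single nucleotide tokenization over this unified vocabulary is given below. The base and structure symbols come from the text; the special tokens other than ‘<mask>’, the token ordering, and the function names are assumptions made for illustration and need not match the released tokenizer.

```python
from typing import List, Optional

BASES = ["A", "T", "C", "G", "U", "N"]
STRUCT = ["(", ")", "."]
# Only <mask> is named in the text; the other special tokens are assumed.
SPECIAL = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + BASES + STRUCT)}


def tokenize(text: str) -> List[str]:
    """One token per character, so position i in equals position i out."""
    return [ch if ch in VOCAB else "<unk>" for ch in text]


def encode(sequence: str, structure: Optional[str] = None) -> List[int]:
    """Encode a sequence, optionally followed by its dot-bracket structure."""
    tokens = ["<cls>"] + tokenize(sequence) + ["<eos>"]
    if structure is not None:
        tokens += tokenize(structure) + ["<eos>"]
    return [VOCAB[t] for t in tokens]


ids = encode("GGGAAACCC", "(((...)))")
```

Because every base maps to exactly one token, the structure label of base i can be read at a fixed offset from the base itself, which k-mers or BPE tokens spanning several bases would not allow.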

Figure 4: The workflow of OmniGenome pretraining. We craft the inputs for three pretraining objectives described in Section 2.3. The outputs are reconstructed sequences based on the context of structure, predicted secondary structure, and unmasked sequences, respectively. The predictions of shadowed tokens are not calculated in the objective functions.

2.3 Pretraining Objectives

As discussed in Section 1, a key desideratum for complex genomic understanding and RNA sequence design is helping the model build an alignment between RNA sequences and their corresponding secondary structures. Bearing this in mind, we designed the following aggregated pretraining objective for OmniGenome:

$$\mathcal{L}_{\texttt{pretrain}}=\mathcal{L}_{\texttt{Str2Seq}}+\mathcal{L}_{\texttt{Seq2Str}}+\mathcal{L}_{\texttt{MRLM}}+\lambda||\theta||_{2}, \tag{1}$$

where $\lambda$ is the $\ell_{2}$ regularization weight and $\theta$ represents the parameters of OmniGenome. The following paragraphs explain the design principles of each loss function used in equation (1).

  • $\mathcal{L}_{\texttt{Str2Seq}}$ is designed to enable OmniGenome to predict bases given partially masked sequences with the context of structures, so that OmniGenome can perceive RNA secondary structure through Str2Seq prediction. To achieve this, we mask 15% of the bases and structure tokens, encouraging the model to infer masked bases (i.e., {‘A’, ‘T’, ‘C’, ‘G’, ‘U’, ‘N’}) and structure tokens (i.e., {‘(’, ‘)’, ‘.’}). Specifically, $\mathcal{L}_{\texttt{Str2Seq}}$ is defined as the classic cross-entropy loss widely used in masked language modeling:

    $$\mathcal{L}_{\texttt{Str2Seq}}=-\frac{1}{|m|}\sum_{i=1}^{m}\log p(x_{i}\mid x_{\setminus i}), \tag{2}$$

    where $m$ is the number of masked nucleotide and structure tokens, and $p(x_{i}\mid x_{\setminus i})$ indicates the probability of predicting the masked token $x_{i}$ based on its context.

  • Complementary to $\mathcal{L}_{\texttt{Str2Seq}}$, $\mathcal{L}_{\texttt{Seq2Str}}$ is designed to enable OmniGenome to perform Seq2Str prediction. Instead of directly feeding the secondary structure into OmniGenome during training, this objective employs the RNA secondary structures as supervised-training labels. Unlike Str2Seq, the Seq2Str prediction is generally deterministic because the optimal structures are unique. This task is implemented as a token-level classification, where the $\mathcal{L}_{\texttt{Seq2Str}}$ loss is defined as the following cross-entropy loss:

    $$\mathcal{L}_{\texttt{Seq2Str}}=-\sum_{i=1}^{N}\sum_{c=1}^{C}s_{ic}\log(\hat{s}_{ic}), \tag{3}$$

    where $s_{ic}$ indicates whether the secondary structure at the $i$-th position has label $c$, and $\hat{s}_{ic}$ is the probability predicted by a linear classifier deployed on OmniGenome. $N$ is the length of an RNA sequence and $C=3$ denotes the number of possible structure labels, i.e., {‘(’, ‘)’, ‘.’}.

  • The last one, $\mathcal{L}_{\texttt{MRLM}}$, is similar to the conventional masked language modeling loss in NLP. It aims to improve the model’s understanding of latent patterns in RNA sequences by masking or randomly replacing 15% of nucleotide bases. The definition of $\mathcal{L}_{\texttt{MRLM}}$ is similar to that of $\mathcal{L}_{\texttt{Str2Seq}}$, but it only considers the prediction of masked nucleotide bases.
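As a worked illustration of equations (2) and (3), the two cross-entropy terms can be computed directly from per-position probability vectors. This is a plain-Python sketch with hypothetical function names, assuming the labels are given as class indices (so the one-hot sum over $c$ in equation (3) reduces to the true class).

```python
import math
from typing import Sequence


def masked_lm_loss(probs: Sequence[Sequence[float]],
                   targets: Sequence[int],
                   masked_positions: Sequence[int]) -> float:
    """Eq. (2): mean negative log-likelihood over the m masked positions."""
    m = len(masked_positions)
    return -sum(math.log(probs[i][targets[i]]) for i in masked_positions) / m


def seq2str_loss(probs: Sequence[Sequence[float]],
                 labels: Sequence[int]) -> float:
    """Eq. (3): cross-entropy summed over all N positions and C = 3 labels.

    With one-hot labels only the true class contributes to the inner sum.
    """
    return -sum(math.log(p[c]) for p, c in zip(probs, labels))


# Two positions, three classes; the true class has probability 0.5 at both.
probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
print(masked_lm_loss(probs, targets=[0, 1], masked_positions=[0, 1]))  # log 2
print(seq2str_loss(probs, labels=[0, 1]))                              # 2 log 2
```

Note that equation (2) averages over the masked positions while equation (3) sums over the whole sequence, so the two terms are on different scales unless they are normalized elsewhere.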

The $\mathcal{L}_{\texttt{Seq2Str}}$ objective would suffer from data leakage if the structure were appended to the input sequence, because the labels would then be visible in the input. In practice, therefore, 70% of RNA sequences are used for structure-contextualized pretraining with the $\mathcal{L}_{\texttt{Str2Seq}}$ objective, while the remaining 30% are used for the other two objectives. This proportion is based on our experimental observations.
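The input construction for the structure-contextualized objective and the 70/30 split can be sketched as follows. The exact sampling scheme is not given in the text, so masking a fixed 15% count per example and routing each example independently are assumptions, and the function names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "<mask>"


def make_str2seq_example(sequence: str, structure: str,
                         mask_rate: float = 0.15,
                         rng: Optional[random.Random] = None
                         ) -> Tuple[List[str], Dict[int, str]]:
    """Concatenate bases and structure tokens, then mask a share of them."""
    rng = rng or random.Random()
    tokens = list(sequence) + list(structure)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    labels = {i: tokens[i] for i in positions}  # targets for the loss
    for i in positions:
        tokens[i] = MASK
    return tokens, labels


def assign_objective(rng: random.Random, str2seq_share: float = 0.7) -> str:
    """Route an example to Str2Seq (70%) or to Seq2Str + MRLM (30%)."""
    return "str2seq" if rng.random() < str2seq_share else "seq2str+mrlm"


tokens, labels = make_str2seq_example("GGGAAACCC", "(((...)))",
                                      rng=random.Random(0))
```

Examples routed to the second branch never see the structure in their input, which is what keeps the Seq2Str labels from leaking.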

2.4 Model Architecture

OmniGenome adopts the classic Transformer encoder architecture with bidirectional multi-head attention. We designed two variants, dubbed OmniGenome$^{52\texttt{M}}$ and OmniGenome$^{186\texttt{M}}$, with 52 and 186 million parameters, respectively. Some key model specifications are summarized in Table 1.

The training specifications are summarized as follows. ▶ The learning rate is set to $5\times 10^{-5}$ and the weight decay is set to 0.01. ▶ We use AdamW as the optimizer with hyperparameters $\beta_{1}=0.9$ and $\beta_{2}=0.999$. ▶ We use a linear decay strategy with a warm-up period of 1,000 steps in the learning rate scheduler. ▶ The batch size is set to 2,048. ▶ No dropout is applied during pretraining, and we use rotary position embeddings [25] to further enhance the model’s scalability to long RNA sequences. ▶ We pretrained for 4 weeks in a distributed training environment with 8 Nvidia RTX 4090 GPUs; its configuration is described in Appendix 6.
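The learning rate schedule described above can be written as a single function. The peak rate and warm-up length come from the text; the total number of training steps is not stated, so `total_steps` is a hypothetical parameter, and decaying linearly to zero is an assumption about the end point.

```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 5e-5, warmup_steps: int = 1000) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Example with a made-up budget of 10,000 steps.
for step in (0, 500, 1000, 5500, 10000):
    print(step, learning_rate(step, total_steps=10_000))
```

The same shape can be obtained in a training framework by scaling the optimizer's base rate with `learning_rate(step, total_steps) / peak_lr`.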

Table 1: Summary of some key model specifications of two OmniGenome variants.
| Models | # of layers | Embedding dimension | Intermediate dimension | # of heads | # of parameters | Modeling length | Position embedding |
| OmniGenome$^{52\texttt{M}}$ | 16 | 480 | 2,400 | 24 | 52M | 1,024 | Rotary |
| OmniGenome$^{186\texttt{M}}$ | 32 | 720 | 2,560 | 30 | 186M | 1,024 | Rotary |
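The parameter counts in Table 1 can be roughly cross-checked from the layer count and the two dimensions. The sketch below assumes a standard encoder layer with four attention projections and a two-matrix (non-gated) feed-forward block, and it ignores biases, normalization, embeddings, and task heads; these details are not specified in the text, so only approximate agreement should be expected.

```python
def encoder_params(layers: int, d_model: int, d_ff: int) -> int:
    """Approximate weight count of a Transformer encoder stack."""
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff   # up- and down-projection, no gating
    return layers * (attention + feed_forward)


small = encoder_params(layers=16, d_model=480, d_ff=2400)
large = encoder_params(layers=32, d_model=720, d_ff=2560)
print(f"{small:,}")  # 51,609,600
print(f"{large:,}")  # 184,320,000
```

Both estimates land within about 1% of the reported 52M and 186M. The head counts also divide the embedding dimensions evenly, giving per-head dimensions of 480 / 24 = 20 and 720 / 30 = 24.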

2.5 Downstream Tasks

2.5.1 Predictive Tasks Fine-tuning

OmniGenome is designed as a general-purpose RNA FM that can be fine-tuned for a diverse set of downstream genomics predictive tasks. In this paper, we constructed a large-scale benchmark suite for RNA FMs. According to the category of genomes, we split the benchmark into two parts.

  • The first part is the RNA genomic benchmark (RGB), a collection of RNA genomic understanding tasks, as shown in Table 7. RGB contains 7 SN-level tasks that are curated in this work or collected from published articles. The purpose of RGB is to benchmark genomic FMs on challenging SN-level modeling tasks such as the detection and repair of SN mutations, mRNA sequence degradation rate prediction, and RNA secondary structure prediction. Due to the lack of a plant RNA benchmark dataset, RGB covers RNA sequences from a variety of species, e.g., plant and human. The sequence length in RGB ranges from 107 to 512, which is enough for most RNA understanding tasks. In summary, these multi-species and SN-level tasks in RGB serve as the first comprehensive benchmark for assessing the RNA sequence modeling capabilities of OmniGenome and its baseline models. For detailed information on each dataset, such as their sources and sizes, please refer to Appendix 10.1.

  • The plant genomic benchmark [14] (PGB; https://huggingface.co/datasets/InstaDeepAI/plant-genomic-benchmark), shown in Table 9, provides a comprehensive suite of DNA genomic tasks designed to evaluate and improve the predictive capabilities of genomic models in plant biology. As a large-scale plant DNA FM benchmark, PGB involves 8 types of DNA modeling subtasks, including critical genomic tasks such as promoter strength prediction and gene expression regression. PGB contains 28 datasets in total with millions of DNA sequences, and the sequence lengths are up to 6000, which is very long for most genomic FMs. Since the original evaluation protocol is not publicly available, we have re-implemented the auto-benchmark for all the PGB subtasks in our package, covering aspects of plant genomics such as gene expression and chromatin accessibility. By integrating diverse genomic tasks, PGB aims to facilitate advanced research and development in plant genomics, offering a robust platform for assessing and improving model performance across different plant species. Due to computational limitations, we randomly sample a maximum of 10k examples from each dataset in PGB to evaluate the FM’s performance.

2.5.2 RNA Sequence Design

One of the difficult tasks addressed by OmniGenome is RNA design, which existing FMs have long neglected because they lack Str2Seq prediction capability. We introduce a naive genetic algorithm based on OmniGenome’s Str2Seq capability and achieve state-of-the-art performance in RNA sequence design by precisely predicting bases according to the structure context. The main steps of the genetic algorithm and the workflow visualization are available in Appendix 9 and Fig. 5, respectively. The parameter settings are detailed in Section 3.2, and the experiment script can be found in the supplemental materials.

3 Experiments

To comprehensively explore the potential of OmniGenome and assess its performance across genomic understanding and RNA design tasks, this section presents experimental evaluations on diverse genomic tasks. We first evaluate the sequence-structure alignment capability of OmniGenome. Next, we evaluate the overall performance of OmniGenome on two comprehensive genomic modeling benchmarks, i.e., the RNA Genomic Benchmark (RGB) and the Plant Genomic Benchmark (PGB). None of the benchmark datasets come from the pretraining database, i.e., OneKP.

3.1 Evaluation Methods

There are no direct counterparts to OmniGenome in plant genomic modeling. However, we can compare it with the recent genomic FMs listed in Table 2 as baselines to help evaluate the performance of OmniGenome. Please refer to Appendix 8 for brief introductions of these FMs. We are aware that some FMs have also been developed for RNA, such as RNA-FM [26], RNA-MSM [27], Uni-RNA [28], 5UTR-LM [16], etc. However, we cannot compare OmniGenome with them because we found their source code very hard to work with or because the models are not publicly available. In the secondary structure prediction tasks, the structures fed into OmniGenome+ are computed by ViennaRNA and differ from the true structures, which avoids data leakage.

Table 2: Brief statistics of RNA and DNA genomic FMs. The numbers of parameters of the different FMs are approximate. Please note that a ‘token’ can contain multiple bases in BPE and k-mers tokenization. Detailed introductions of the following FMs can be found in the original publications.
| Model | Tokenization | # of Params | Pretraining Data | Species | Nucleic Acid |
| DNABERT-2 | BPE | 117M | 32.49B Tokens | Human + 135 Species | DNA |
| NT-V2-100M | k-mers | 96M | 300B Tokens | Human + 850 Species | DNA |
| HyenaDNA-Large | SNT | 47M | ≈3.2B Tokens | Human | DNA |
| Agro-NT-1B | k-mers | 985M | 472.5B Tokens | 48 Edible Plants | DNA |
| SpliceBERT | SNT | 19M | 2M Sequences | Multi-Vertebrates | precursor-mRNA |
| CDSBERT | SNT | 420M | 4M Sequences | 4,069 RNA Families | CDS |
| 3UTRBERT | k-mers | 86M | 20,362 Sequences | Multi-Species | mRNA 3’UTR |
| OmniGenome$^{52\texttt{M}}$ | SNT | 52M | 54.2B Tokens | 1124 Plant Species | mRNA, CDS, UTR |
| OmniGenome$^{186\texttt{M}}$ | SNT | 186M | 54.2B Tokens | 1124 Plant Species | mRNA, CDS, UTR |

3.2 Zero-shot Structure to Sequence Prediction Evaluation

In this experiment, we demonstrate the Str2Seq prediction capability of OmniGenome based on RNA sequence design. We employed the Eterna [17] V2 benchmark, which consists of 100 specified secondary structures. This task aims to design plausible RNA sequences based on the reference structures. Intuitively, higher performance in RNA design tasks indicates better structure-to-sequence prediction capability. We develop a simple genetic algorithm (GA) based on OmniGenome models to solve RNA design puzzles. The implementation details can be found in Fig. 5 in Appendix 9. In our GA implementation, the population size is set to 1000, with 100 iterations, and the mutation rate for each base is 0.5. We set the maximum running time for each puzzle to 1 hour because OmniGenome can easily solve most of the RNA design puzzles within 1 hour, while each puzzle’s time budget is generally 24 hours in existing RNA design methods. We use accuracy as our evaluation metric, meaning that only sequences with entirely correct structures are considered solved. The experimental results are available in Table 3.
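To make the setup concrete, the following is a generic genetic-algorithm skeleton for structure-conditioned design using the hyperparameters quoted above. It is not the algorithm of Appendix 9, which is not reproduced here: the selection and crossover rules are assumptions, `fold` stands for any sequence-to-structure predictor, and `propose` is a hypothetical callable standing in for a model that suggests a base at position `i` given the target structure (by default it samples uniformly).

```python
import random
import time
from typing import Callable, Optional

BASES = "AUCG"


def design(target: str, fold: Callable[[str], str],
           propose: Optional[Callable[[str, str, int], str]] = None,
           population: int = 1000, iterations: int = 100,
           mutation_rate: float = 0.5, time_limit: float = 3600.0,
           rng: Optional[random.Random] = None) -> str:
    """Search for a sequence whose predicted structure equals `target`."""
    rng = rng or random.Random()
    propose = propose or (lambda seq, struct, i: rng.choice(BASES))
    cache = {}

    def distance(seq: str) -> int:
        # Number of positions where the folded structure misses the target.
        if seq not in cache:
            cache[seq] = sum(a != b for a, b in zip(fold(seq), target))
        return cache[seq]

    n = len(target)
    pop = ["".join(rng.choice(BASES) for _ in range(n)) for _ in range(population)]
    deadline = time.monotonic() + time_limit
    for _ in range(iterations):
        pop.sort(key=distance)
        if distance(pop[0]) == 0 or time.monotonic() > deadline:
            break
        parents = pop[:max(2, population // 2)]  # keep the better half
        children = []
        while len(parents) + len(children) < population:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n) if n > 1 else 0  # one-point crossover
            child = list(a[:cut] + b[cut:])
            for i in range(n):
                if rng.random() < mutation_rate:
                    child[i] = propose("".join(child), target, i)
            children.append("".join(child))
        pop = parents + children
    return min(pop, key=distance)
```

A puzzle counts as solved only when the returned sequence has distance zero, which matches the accuracy criterion described above. The quality of `propose` is what separates a structure-aware model from uniform mutation in such a loop.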

Table 3: Performance on the EternaV2 RNA sequence design benchmark. Only the FMs using SNT support the RNA design tasks. The best results are in bold face.
| Model | Token. | EternaV2 Acc |
| RNAInverse | - | 30 |
| SpliceBERT | SNT | 3 |
| CDSBERT | SNT | 0 |
| OmniGenome$^{52\texttt{M}}$ | SNT | 4 |
| OmniGenome$^{186\texttt{M}}$ | SNT | 5 |
| OmniGenome$^{52\texttt{M}}$+ | SNT | 71 |
| OmniGenome$^{186\texttt{M}}$+ | SNT | **74** |

Table 3 shows that the FM-free method RNAInverse solved 30 of the RNA design puzzles, indicating a competent but not optimal capability in sequence design. We select the FMs employing SNT for the comparison experiments, as tokenizers like k-mers and BPE cannot handle SN-level resolution modeling, i.e., masked nucleotide prediction. We cannot compare with HyenaDNA because it does not support masked nucleotide prediction. SpliceBERT and CDSBERT demonstrated negligible proficiency in the RNA sequence design task, solving 3 and 0 puzzles, respectively. This suggests that these FMs cannot precisely predict the bases without any Str2Seq mapping ability. OmniGenome variants without structure context input perform similarly to the baselines, i.e., OmniGenome$^{52\texttt{M}}$ and OmniGenome$^{186\texttt{M}}$ solved 4 and 5 puzzles, respectively. In contrast, OmniGenome$^{52\texttt{M}}$+ and OmniGenome$^{186\texttt{M}}$+ significantly outperformed the other models with 71 and 74 puzzles solved, respectively, underscoring the impact of Str2Seq prediction in sequence-structure alignment, which evidently bolsters the models’ ability to predict bases according to the structure context. We also expect performance to increase with a larger computational budget. These findings provide crucial evidence of the significance of Str2Seq for RNA sequence design.

3.3 Zero-shot Sequence to Structure Prediction Evaluation

This subsection assesses OmniGenome in both the Seq2Str and Str2Seq prediction aspects of sequence-structure alignment. The evaluation of Seq2Str is based on zero-shot secondary structure prediction. We use OmniGenome and OmniGenome+ without fine-tuning to predict the secondary structures of sequences from the test datasets and measure the macro-F1 score, where better structure prediction performance indicates a stronger capability for Seq2Str prediction. We evaluate Str2Seq mapping ability based on masked nucleotide base prediction. This task leverages the FM’s capacity to predict sequences with the context of structure. We randomly mask one base in each sequence from the test datasets and leverage the Str2Seq ability of OmniGenome+ to predict the masked bases in a zero-shot scenario. We assess the model’s accuracy in predicting masked bases both with and without structure context. The experimental results are available in Table 4.
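The two metrics used here can be computed per position as sketched below. How the evaluation handles labels that are absent from a sequence is not stated in the text; this sketch scores such a label as 0, and the function names are hypothetical.

```python
from typing import Sequence


def macro_f1(true: str, pred: str, labels: str = "().") -> float:
    """Unweighted mean of per-label F1 over the three structure tokens."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(true, pred))
        fp = sum(t != label and p == label for t, p in zip(true, pred))
        fn = sum(t == label and p != label for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def masked_accuracy(true_bases: Sequence[str], pred_bases: Sequence[str]) -> float:
    """Share of masked positions whose base was recovered exactly."""
    hits = sum(t == p for t, p in zip(true_bases, pred_bases))
    return hits / len(true_bases)


# One closing bracket is mispredicted as unpaired.
print(round(macro_f1("((..))", "((...)"), 4))  # 0.8222
```

In this example the per-label F1 scores are 1 for ‘(’, 2/3 for ‘)’, and 4/5 for ‘.’, so the macro average is about 0.822 even though five of the six positions are correct.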

Table 4: Performance in zero-shot RNA secondary structure prediction (F1) and masked nucleotide prediction (Acc) across various benchmarks, showcasing the capabilities of OmniGenome in zero-shot learning scenarios without any fine-tuning or domain adaptation.
| Model | Archive2 (F1) | bpRNA (F1) | RNAStralign (F1) | Rfam (F1) | Archive2 (Acc) | bpRNA (Acc) | RNAStralign (Acc) | Rfam (Acc) |
| ViennaRNA | 75.89 | 27.82 | 74.80 | 19.98 | - | - | - | - |
| SpliceBERT | - | - | - | - | 30.04 | 31.40 | 32.10 | 24.50 |
| OmniGenome$^{52\texttt{M}}$ | 71.87 | 28.18 | 74.62 | 16.08 | 27.37 | 26.82 | 27.46 | 27.10 |
| OmniGenome$^{186\texttt{M}}$ | 72.57 | 28.26 | 75.07 | 18.60 | 29.82 | 30.66 | 28.92 | 27.90 |
| OmniGenome$^{52\texttt{M}}$+ | 72.09 | 28.22 | 74.78 | 81.56 | 36.49 | 32.72 | 39.18 | 31.32 |
| OmniGenome$^{186\texttt{M}}$+ | **73.01** | **29.04** | **75.34** | **83.24** | **51.93** | **48.96** | **52.86** | **46.03** |

The results in Table 4 reveal OmniGenome's effectiveness in both secondary structure prediction (i.e., Seq2Str) and masked nucleotide prediction (i.e., Str2Seq). More specifically, OmniGenome$^{52\texttt{M}}$ and OmniGenome$^{186\texttt{M}}$ show impressive performance for zero-shot sequence-structure alignment, while OmniGenome$^{52\texttt{M}}$+ and OmniGenome$^{186\texttt{M}}$+ outperform the base variants of OmniGenome. Even when ablating structure context, OmniGenome$^{186\texttt{M}}$ achieves performance comparable with ViennaRNA. Interestingly, if the structures calculated by ViennaRNA are fed into the model, OmniGenome+ models can predict the structure more precisely and outperform ViennaRNA on most of the secondary structure prediction datasets, e.g., bpRNA, RNAStralign, and Rfam. From the results of masked nucleotide prediction, it is observed that sequence reconstruction performance can be enhanced by the structure context. OmniGenome$^{52\texttt{M}}$+ and OmniGenome$^{186\texttt{M}}$+ significantly outperform baseline FMs like SpliceBERT.

Overall, the results in Table 4 provide a comprehensive evaluation of the FMs' capabilities in sequence-structure alignment. They underscore OmniGenome's performance and robustness in generalizing across different species without prior fine-tuning, highlighting its potential utility in SN-level in-silico genomic modeling.

3.4 RNA Genomic Benchmark

Table 5: The performance of OmniGenome and baseline models on the RGB, with results averaged based on five random seeds. “N.A.” indicates that ViennaRNA is not designed for other predictive genomic tasks.
| Model | mRNA (RMSE) | SNMD (AUC) | SNMR (F1) | Archive2 (F1) | bpRNA (F1) | RNAStralign (F1) |
|---|---|---|---|---|---|---|
| ViennaRNA | N.A. | N.A. | N.A. | 75.89 | 27.82 | 74.80 |
| DNABERT2 | 0.8158 | 49.94 | 15.86 | 59.82 | 43.40 | 65.49 |
| HyenaDNA | 0.8056 | 53.32 | 39.80 | 84.23 | 56.62 | 95.42 |
| NT-V2 | 0.7826 | 50.49 | 26.01 | 79.90 | 56.60 | 90.84 |
| Agro-NT | 0.7830 | 49.99 | 26.38 | 70.13 | 48.71 | 75.21 |
| SpliceBERT | 0.7340 | 58.11 | 46.44 | 89.05 | 69.10 | 96.97 |
| 3UTRBERT | 0.7772 | 50.02 | 24.01 | 78.98 | 56.93 | 92.03 |
| CDSBERT | 0.7468 | 55.03 | 36.16 | 89.34 | 70.01 | 97.15 |
| OmniGenome$^{52\texttt{M}}$ | 0.7191 | 62.44 | 48.91 | 94.98 | 82.34 | 99.01 |
| OmniGenome$^{186\texttt{M}}$ | 0.7164 | 63.81 | 49.80 | 95.20 | 82.48 | 99.12 |
| OmniGenome$^{52\texttt{M}}$+ | 0.7188 | 63.11 | 49.21 | 95.10 | 81.88 | 98.98 |
| OmniGenome$^{186\texttt{M}}$+ | **0.7121** | **64.13** | **49.44** | **95.27** | **83.36** | **99.35** |

The results in Table 5 demonstrate the performance of OmniGenome and its generalizability across various fine-grained RNA genomic modeling tasks. It is observed that the base variants of OmniGenome models achieve better results than both RNA and DNA FM baselines, including Agro-NT and CDSBERT, which contain hundreds of millions of parameters. This is because the existing FMs usually adopt k-mers tokenization that cannot handle single nucleotide resolution tasks, e.g., single nucleotide mutation detection and repair. Because of the Seq2Str pretraining, OmniGenome and OmniGenome+ models exhibit strong results in secondary structure prediction, underscoring OmniGenome's capabilities in SN-level RNA sequence understanding and manipulation.
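The tokenization argument above can be illustrated with a small sketch. It assumes overlapping k-mers with stride 1 (one common scheme; the baselines may use other settings), and the function names are illustrative only. With single-nucleotide tokenization (SNT), each base is exactly one token, so a single-base mutation or mask touches one token. With k-mers, the same base is spread over up to k tokens.

```python
def snt_tokenize(seq):
    """Single-nucleotide tokenization: one token per base."""
    return list(seq)


def kmer_tokenize(seq, k=3, stride=1):
    """k-mer tokenization with a sliding window (assumed stride 1)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def tokens_covering(pos, seq_len, k=3, stride=1):
    """Indices of the k-mer tokens that contain the base at `pos`."""
    starts = range(0, seq_len - k + 1, stride)
    return [j for j, start in enumerate(starts) if start <= pos < start + k]


seq = "ACGUAC"
print(snt_tokenize(seq))             # one token per base
print(kmer_tokenize(seq))            # each token spans three bases
print(tokens_covering(2, len(seq)))  # tokens affected by a change at base 2
```

Under these assumptions, a label or mask attached to one k-mer token cannot be attributed to a single base, which is the limitation the paragraph refers to.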

3.5 Plant Genomic Benchmark

Table 6: Performance of OmniGenome and baseline FMs on PGB. "PolyA" stands for Polyadenylation, "Chrom Acc" for Chromatin Accessibility, "Prom Str" for Promoter Strength, "Term Str" for Terminator Strength, "Splice" for Splice Site, "Gene Exp" for Gene Expression, and "Enh Reg" for Enhancer Region. Results for OmniGenome$^{186\texttt{M}}$+ are excluded due to the time-intensive nature of the experiments.

| Model | PolyA (F1) | LncRNA (F1) | Chrom Acc (F1) | Prom Str (RMSE) | Term Str (RMSE) | Splice (F1) | Gene Exp (RMSE) | Enh Reg (F1) |
|---|---|---|---|---|---|---|---|---|
| DNABERT2 | 41.35 | 72.55 | 61.49 | 0.99 | 0.24 | 45.34 | 14.78 | 36.40 |
| HyenaDNA | 83.11 | 58.21 | 52.20 | 0.88 | 0.26 | 90.28 | 14.76 | 66.17 |
| NT-V2 | 71.26 | 73.08 | 65.71 | 0.81 | 0.27 | 95.05 | 14.69 | 73.89 |
| Agro-NT | 78.89 | 67.24 | 63.27 | 0.94 | 0.78 | 88.45 | 15.56 | 62.83 |
| SpliceBERT | 65.23 | 71.88 | 63.62 | 0.75 | 0.22 | 96.45 | 14.70 | 69.71 |
| 3UTRBERT | 76.48 | 70.75 | 63.71 | 1.04 | 0.36 | 94.44 | 14.87 | 71.67 |
| CDSBERT | 39.72 | 33.06 | 48.95 | 2.19 | 0.59 | 52.20 | 14.77 | 33.93 |
| OmniGenome$^{52\texttt{M}}$ | 85.47 | 75.71 | 64.23 | 0.67 | 0.21 | 97.40 | 14.76 | 68.31 |
| OmniGenome$^{186\texttt{M}}$ | 86.87 | **77.53** | **66.88** | **0.65** | **0.19** | **98.15** | 14.76 | **72.45** |
| OmniGenome$^{52\texttt{M}}$+ | **87.05** | 76.23 | 65.41 | **0.65** | 0.20 | 97.70 | 14.76 | 70.71 |

The PGB is a plant-oriented genomic benchmark. Although the benchmark datasets in PGB are DNA-based tasks, we can still evaluate the performance of OmniGenome and its generalizability on multi-modal (i.e., DNA and RNA) genomic tasks. The results in Table 6 reveal substantial variability in the performance of different FMs, where OmniGenome$^{52\texttt{M}}$ outperformed the baseline models across most tasks, particularly in tasks like Polyadenylation and Splice Site classification, where it achieved higher F1 scores than all baselines. On Enhancer Region classification, NT-V2 obtains the highest F1 score in Table 6. This suggests that OmniGenome's architecture is particularly adept at handling complex genomic sequences. In comparison, existing FMs, e.g., CDSBERT and Agro-NT, showed lower performance with more parameters than OmniGenome. Besides, the performance of OmniGenome$^{52\texttt{M}}$+ suggests that the structure context can further enhance the performance of genomic modeling.

Overall, OmniGenome models achieve state-of-the-art performance on both benchmarks, especially the OmniGenome+ variants. The results underscore the importance of sequence-structure alignment in achieving complex genomic modeling tasks.

4 Related Works

Biological sequence modeling, including DNA, RNA, and protein, has attracted attention in recent years. Protein modeling, e.g., AlphaFold [29, 30, 31] and ESM [32], has been studied for longer than DNA and RNA modeling. In the realm of genomic sequence modeling, several early works aimed at addressing diversified genome downstream subtasks. For instance, DNABERT [33] adapts the architecture of BERT [19] for genomic sequence modeling, showing preliminary performance for in-silico genomic tasks. DNABERT2 [22], a multi-species FM improved based on DNABERT, proposes replacing k-mers tokenization with BPE tokenization to improve model performance. To explore the performance of large-scale FMs, the nucleotide transformers (V1 & V2) [13], Agro-NT [14] and SegmentNT [34] leveraged billions of parameters to boost genomic sequence modeling and achieved promising performance in understanding DNA genome, with model scales up to 2.5 billion and 1 billion parameters, respectively. Agro-NT [14] was pretrained on multi-species edible plant DNA sequences but failed to transfer effectively to RNA sequence modeling in our experiments. To address the modeling capacity problem caused by the remarkable lengths of genomes, there is growing focus on the necessity of long-range sequence modeling and the introduction of autoregressive FMs, namely, HyenaDNA [3] and Evo [2]. In terms of RNA genomic modeling, there have been some preliminary works, such as scBERT [35], RNABERT [36], RNA-FM [26], RNA-MSM [37], and RNAErnie [38], to name a few. However, these methods have only trained the FMs on a limited-scale database, as RNA sequences are generally expensive to obtain. Some FMs focus on specific types of RNA sequences, such as coding sequences (CDS) [39], 5' untranslated regions (5'UTR) [40], 3' untranslated regions (3'UTR) [24], or precursor mRNA sequences [21], thus limiting the models' ability to capture the diversity of RNA sequences. Uni-RNA [28] has been reported to achieve good performance due to the large scale of the model and database. However, it is not open-sourced and cannot be compared in the experiments.

In short, the existing FMs usually neglect the sequence-structure alignment problem in RNA genomic modeling, while 5UTR-LM [16] adopts secondary structure prediction as a pretraining objective to achieve Seq2Str prediction. However, these FMs are not available for Str2Seq mapping and suffer from limited model and data scales that fail to uncover the comprehensive efficacy of sequence-structure alignment on a wide set of genomic tasks. ERNIE-RNA [41] feeds the RNA structure along with the sequence into the model and improves the downstream tasks. However, it also ignores the significance of Str2Seq prediction capability. In a nutshell, none of the existing FMs achieves sequence-structure alignment in both directions.

5 Conclusion

In this paper, we introduced OmniGenome, a foundation model (FM) that addresses the critical challenge of sequence-structure alignment in genomic modeling. Our approach leverages structure-contextualized modeling to achieve accurate predictions for both Seq2Str and Str2Seq tasks. By utilizing the extensive OneKP dataset, we ensured the robustness and transferability of OmniGenome across various genomic tasks. Our experimental results demonstrate that OmniGenome significantly outperforms existing models in zero-shot secondary structure prediction, achieving up to 83% macro-F1 score, and excels in solving 74% of complex RNA design puzzles with remarkable efficiency. These findings validate the efficacy of our approach and highlight the potential of OmniGenome in advancing RNA and broader genomic research.

Acknowledgment

This work was supported in part by the UKRI Future Leaders Fellowship under Grant MR/S017062/1 and MR/X011135/1; in part by NSFC under Grant 62376056 and 62076056; in part by the Royal Society under Grant IES/R2/212077; in part by the EPSRC under Grant 2404317; in part by the Kan Tong Po Fellowship (KTP\R1\231017); and in part by the Amazon Research Award and Alan Turing Fellowship.

References

  • [1] N. Wang, J. Bian, Y. Li, X. Li, S. Mumtaz, L. Kong, and H. Xiong, “Multi-purpose rna language modelling with motif-aware pretraining and type-guided fine-tuning,” Nature Machine Intelligence, pp. 1–10, 2024.
  • [2] E. Nguyen, M. Poli, M. G. Durrant, A. W. Thomas, B. Kang, J. Sullivan, M. Y. Ng, A. Lewis, A. Patel, A. Lou et al., “Sequence modeling and design from molecular to genome scale with evo,” bioRxiv, pp. 2024–02, 2024.
  • [3] E. Nguyen, M. Poli, M. Faizi, A. W. Thomas, C. Birch-Sykes, M. Wornow, A. Patel, C. M. Rabideau, S. Massaroli, Y. Bengio, S. Ermon, S. A. Baccus, and C. Ré, “Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution,” CoRR, vol. abs/2306.15794, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.15794
  • [4] Z. Tan, Y. Fu, G. Sharma, and D. H. Mathews, “Turbofold ii: Rna structural alignment and secondary structure prediction informed by multiple homologs,” Nucleic acids research, vol. 45, no. 20, pp. 11570–11581, 2017.
  • [5] P. Danaee, M. Rouches, M. Wiley, D. Deng, L. Huang, and D. Hendrix, “bprna: large-scale automated annotation and analysis of rna secondary structure,” Nucleic acids research, vol. 46, no. 11, pp. 5381–5394, 2018.
  • [6] D. H. Mathews, “How to benchmark rna secondary structure prediction accuracy,” Methods, vol. 162, pp. 60–67, 2019.
  • [7] I. Kalvari, E. P. Nawrocki, N. Ontiveros-Palacios, J. Argasinska, K. Lamkiewicz, M. Marz, S. Griffiths-Jones, C. Toffano-Nioche, D. Gautheret, Z. Weinberg et al., “Rfam 14: expanded coverage of metagenomic, viral and microrna families,” Nucleic Acids Research, vol. 49, no. D1, pp. D192–D200, 2021.
  • [8] O. Yaish and Y. Orenstein, “Computational modeling of mrna degradation dynamics using deep neural networks,” Bioinformatics, vol. 38, no. 4, pp. 1087–1101, 2022.
  • [9] H. K. Wayment-Steele, W. Kladwang, A. M. Watkins, D. S. Kim, B. Tunguz, W. Reade, M. Demkin, J. Romano, R. Wellington-Oguri, J. J. Nicol et al., “Deep learning models for predicting rna degradation via dual crowdsourcing,” Nature Machine Intelligence, vol. 4, no. 12, pp. 1174–1184, 2022.
  • [10] K. S. Corbett, D. K. Edwards, S. R. Leist, O. M. Abiona, S. Boyoglu-Barnum, R. A. Gillespie, S. Himansu, A. Schäfer, C. T. Ziwawo, A. T. DiPiazza et al., “Sars-cov-2 mrna vaccine design enabled by prototype pathogen preparedness,” Nature, vol. 586, no. 7830, pp. 567–571, 2020.
  • [11] F. Runge, J. K. Franke, D. Fertmann, R. Backofen, and F. Hutter, “Partial rna design,” bioRxiv, 2023. [Online]. Available: https://www.biorxiv.org/content/early/2023/12/30/2023.12.29.573656
  • [12] Y. Wan, M. Kertesz, R. C. Spitale, E. Segal, and H. Y. Chang, “Understanding the transcriptome through rna structure,” Nature Reviews Genetics, vol. 12, no. 9, pp. 641–655, 2011.
  • [13] H. Dalla-Torre, L. Gonzalez, J. Mendoza-Revilla, N. L. Carranza, A. H. Grzywaczewski, F. Oteri, C. Dallago, E. Trop, B. P. de Almeida, H. Sirelkhatim et al., “The nucleotide transformer: Building and evaluating robust foundation models for human genomics,” bioRxiv, pp. 2023–01, 2023.
  • [14] J. Mendoza-Revilla, E. Trop, L. Gonzalez, M. Roller, H. Dalla-Torre, B. P. de Almeida, G. Richard, J. Caton, N. Lopez Carranza, M. Skwark et al., “A foundational large language model for edible plant genomes,” bioRxiv, pp. 2023–10, 2023.
  • [15] B. Frey, A. Celaj, A. Gao, T. Lau, E. Holgersen, A. Lo, V. Lodaya, C. Cole, R. Denroche, C. Spickett et al., “An rna foundation model enables discovery of disease mechanisms and candidate therapeutics,” bioRxiv, pp. 2023–09, 2023.
  • [16] Y. Chu, D. Yu, Y. Li, K. Huang, Y. Shen, L. Cong, J. Zhang, and M. Wang, “A 5’ utr language model for decoding untranslated regions of mrna and function predictions,” Nature Machine Intelligence, pp. 1–12, 2024.
  • [17] J. Lee, W. Kladwang, M. Lee, D. Cantu, M. Azizyan, H. Kim, A. Limpaecher, S. Gaikwad, S. Yoon, A. Treuille et al., “Rna design rules from a massive open laboratory,” Proceedings of the National Academy of Sciences, vol. 111, no. 6, pp. 2122–2127, 2014.
  • [18] I. Tinoco Jr and C. Bustamante, “How rna folds,” Journal of molecular biology, vol. 293, no. 2, pp. 271–281, 1999.
  • [19] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT (1). Association for Computational Linguistics, 2019, pp. 4171–4186.
  • [20] R. Lorenz, S. H. Bernhart, C. Höner zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, and I. L. Hofacker, “Viennarna package 2.0,” Algorithms for molecular biology, vol. 6, pp. 1–14, 2011.
  • [21] K. Chen, Y. Zhou, M. Ding, Y. Wang, Z. Ren, and Y. Yang, “Self-supervised learning on millions of pre-mrna sequences improves sequence-based rna splicing prediction,” bioRxiv, pp. 2023–01, 2023.
  • [22] Z. Zhou, Y. Ji, W. Li, P. Dutta, R. V. Davuluri, and H. Liu, “DNABERT-2: efficient foundation model and benchmark for multi-species genome,” CoRR, vol. abs/2306.15006, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.15006
  • [23] E. J. Carpenter, J. H. Leebens-Mack, and M. S. B. et al., “One thousand plant transcriptomes and the phylogenomics of green plants,” Nature, vol. 574, no. 7780, pp. 679–685, 2019.
  • [24] Y. Yang, G. Li, K. Pang, W. Cao, X. Li, and Z. Zhang, “Deciphering 3’utr mediated gene regulation using interpretable deep representation learning,” bioRxiv, pp. 2023–09, 2023.
  • [25] J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024.
  • [26] J. Chen, Z. Hu, S. Sun, Q. Tan, Y. Wang, Q. Yu, L. Zong, L. Hong, J. Xiao, T. Shen et al., “Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions,” bioRxiv, pp. 2022–08, 2022.
  • [27] Y. Zhang, M. Lang, J. Jiang, Z. Gao, F. Xu, T. Litfin, K. Chen, J. Singh, X. Huang, G. Song et al., “Multiple sequence alignment-based rna language model and its application to structural inference,” Nucleic Acids Research, vol. 52, no. 1, pp. e3–e3, 2024.
  • [28] X. Wang, R. Gu, Z. Chen, Y. Li, X. Ji, G. Ke, and H. Wen, “Uni-rna: universal pre-trained models revolutionize rna research,” bioRxiv, pp. 2023–07, 2023.
  • [29] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli, and D. Hassabis, “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021.
  • [30] R. Evans, M. O’Neill, A. Pritzel, N. Antropova, A. Senior, T. Green, A. Žídek, R. Bates, S. Blackwell, J. Yim, O. Ronneberger, S. Bodenstein, M. Zielinski, A. Bridgland, A. Potapenko, A. Cowie, K. Tunyasuvunakool, R. Jain, E. Clancy, P. Kohli, J. Jumper, and D. Hassabis, “Protein complex prediction with alphafold-multimer,” bioRxiv, 2021. [Online]. Available: https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034
  • [31] J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick et al., “Accurate structure prediction of biomolecular interactions with alphafold 3,” Nature, pp. 1–3, 2024.
  • [32] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, A. dos Santos Costa, M. Fazel-Zarandi, T. Sercu, S. Candido et al., “Language models of protein sequences at the scale of evolution enable accurate structure prediction,” BioRxiv, vol. 2022, p. 500902, 2022.
  • [33] Y. Ji, Z. Zhou, H. Liu, and R. V. Davuluri, “DNABERT: pre-trained bidirectional encoder representations from transformers model for dna-language in genome,” Bioinform., vol. 37, no. 15, pp. 2112–2120, 2021.
  • [34] B. P. de Almeida, H. Dalla-Torre, G. Richard, C. Blum, L. Hexemer, M. Gélard, J. Mendoza-Revilla, P. Pandey, S. Laurent, M. Lopez et al., “Segmentnt: annotating the genome at single-nucleotide resolution with dna foundation models,” bioRxiv, pp. 2024–03, 2024.
  • [35] F. Yang, W. Wang, F. Wang, Y. Fang, D. Tang, J. Huang, H. Lu, and J. Yao, “scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data,” Nat. Mac. Intell., vol. 4, no. 10, pp. 852–866, 2022. [Online]. Available: https://doi.org/10.1038/s42256-022-00534-z
  • [36] M. Akiyama and Y. Sakakibara, “Informative rna base embedding for rna structural alignment and clustering by deep representation learning,” NAR genomics and bioinformatics, vol. 4, no. 1, p. lqac012, 2022.
  • [37] Y. Zhang, F. Ge, F. Li, X. Yang, J. Song, and D.-J. Yu, “Prediction of multiple types of rna modifications via biological language model,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2023.
  • [38] N. Wang, J. Bian, Y. Li, X. Li, S. Mumtaz, L. Kong, and H. Xiong, “Multi-purpose rna language modelling with motif-aware pretraining and type-guided fine-tuning,” Nature Machine Intelligence, pp. 1–10, 2024.
  • [39] L. Hallee, N. Rafailidis, and J. P. Gleghorn, “cdsbert-extending protein language models with codon awareness,” bioRxiv, 2023.
  • [40] Y. Chu, D. Yu, Y. Li, K. Huang, Y. Shen, L. Cong, J. Zhang, and M. Wang, “A 5’ utr language model for decoding untranslated regions of mrna and function predictions,” Nature Machine Intelligence, pp. 1–12, 2024.
  • [41] W. Yin, Z. Zhang, L. He, R. Jiang, S. Zhang, G. Liu, X. Zhang, T. Qin, and Z. Xie, “Ernie-rna: An rna language model with structure-enhanced representations,” bioRxiv, pp. 2024–03, 2024.
  • [42] C. Peng, J. Shang, J. Guan, D. Wang, and Y. Sun, “Viralm: Empowering virus discovery through the genome foundation model,” bioRxiv, pp. 2024–01, 2024.
  • [43] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” CoRR, vol. abs/2001.08361, 2020. [Online]. Available: https://arxiv.org/abs/2001.08361
  • [44] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” CoRR, vol. abs/2203.15556, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2203.15556
  • [45] N. Muennighoff, A. M. Rush, B. Barak, T. L. Scao, A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel, “Scaling data-constrained language models,” CoRR, vol. abs/2305.16264, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.16264

6 Pretraining Environment

The pretraining of OmniGenome was conducted on a dedicated Linux computation node, equipped with 8 Nvidia RTX 4090 GPUs. For distributed model training, we employed version 4.37.1 of the Transformers library alongside version 0.26.1 of the Accelerate library. Our implementation framework of choice for OmniGenome was PyTorch, specifically version 2.0.0. The ViennaRNA version is 2.6.4 in our experiments. While some existing code was adapted for the modules within OmniGenome, the majority of the codebase, such as genomic sequences preprocessing, model pretraining, objective functions, and experiments, was written from scratch.
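For readers who want to reproduce this environment, the version pins above can be checked with a short script. The sketch below is only a convenience and is not part of the OmniGenome codebase. The distribution names (`transformers`, `accelerate`, `torch`, `ViennaRNA`) are assumed to be the PyPI names of the libraries mentioned, and local build tags such as `+cu118` are ignored when comparing.

```python
from importlib import metadata

# Versions reported in this section; the keys are assumed PyPI names.
PINNED = {
    "transformers": "4.37.1",
    "accelerate": "0.26.1",
    "torch": "2.0.0",
    "ViennaRNA": "2.6.4",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(pinned=PINNED, lookup=installed_version):
    """Map each package whose version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # Drop a local build tag such as "+cu118" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `version_mismatches()` with no arguments inspects the current environment; an empty dictionary means all four pins match.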

7 OmniGenome Package

Genomic modeling is still in its early stages, and resources are consequently very scarce. Typically, existing foundation models open-source only the model, without providing the training, fine-tuning, and benchmark evaluation code. To address this issue, we have developed a comprehensive open-source genomic modeling toolkit (https://github.com/yangheng95/OmniGenomeBench) based on OmniGenome. This toolkit aims to provide extensive FM fine-tuning tutorials and a unified automated benchmark evaluation. The main features of the OmniGenome Package are as follows:

  • Fine-Tuning Tutorials: We provide tutorials for fine-tuning on all downstream genomic modeling tasks, including dataset processing, model implementation, and training processes. A fine-tuning example for secondary structure is included, covering both training and demonstration of secondary structure prediction. The fine-tuning tutorials are available at: https://github.com/yangheng95/OmniGenomeBench/blob/master/examples/secondary_structure_prediction_demo.ipynb

  • Automated Benchmark Evaluation: We offer an automated benchmark evaluation interface, which includes the built-in PGB and RGB benchmarks. By predefining the configurations for benchmark evaluation subtasks, such as hyperparameters, our tool supports the automated benchmark evaluation of future FMs and the addition of new benchmarks. The goal of automated benchmark evaluation is to ensure fairness and ease of use. We provide a tutorial on automated evaluation to guide users in benchmark evaluation. The automated benchmarking example is available at: https://github.com/yangheng95/OmniGenomeBench/blob/master/examples/benchmark/batch_rgb_benchmark.py

  • Genomic Repository Hub: We have created a hub for hosting and distributing open-source licensed datasets, model checkpoints, and benchmark evaluations. Additionally, we have designed flexible interfaces to support the sharing of datasets and models within the community. This approach helps mitigate the issue of resource scarcity. The hub will be available soon.

We are in the process of finalizing the necessary documentation and will officially release this tool in the near future.

8 Evaluation Methods

There are no direct counterparts to OmniGenome in plant genomic modeling. However, we can compare it with the following recent genomic FMs as potential baselines to help evaluate the performance of OmniGenome. The brief introductions of the FMs in Table 2 are as follows:

  • ViennaRNA [20]. ViennaRNA is a comprehensive genomic analysis tool that includes a diverse set of interfaces, such as RNAFold and RNAInverse design. ViennaRNA serves as the baseline for RNA structure prediction and RNA design in our experiments.

  • DNABERT2 [22]. DNABERT2 is one of the latest DNA FMs which improves the performance of DNABERT. The main modification of DNABERT2 is the tokenization method, which was changed to BPE from k-mers.

  • HyenaDNA [3]. HyenaDNA is an autoregressive FM optimized for long-range genomic data processing. HyenaDNA is based on the Hyena convolution architecture and capable of handling sequences up to 1M bases in length.

  • Nucleotide Transformer (NT) V2 [13]. The NT FMs were trained on DNA data, including the human reference genome and multi-species DNA sequences. They aim to capture the complex patterns within nucleotide sequences for various genomic applications.

  • Agricultural Nucleotide Transformer (Agro-NT) [14]. Agro-NT is a large-scale DNA FM (1B parameters) akin to the Nucleotide Transformers but with a focus on plant DNA.

  • SpliceBERT [21]. It was trained on 2M precursor messenger RNA (pre-mRNA) sequences and specializes in RNA splicing of pre-mRNA sequences.

  • 3UTRBERT [24]. This model was trained on 20k 3'UTRs for 3'UTR-mediated gene regulation tasks. It uses k-mers tokenization instead of SNT.

  • CDSBERT [39]. It was trained on massive CDS regions within the genome. CDSBERT is adapted from ProteinBERT for predicting protein structures and functions and leverages the alignment between RNA and protein sequences.

  • OmniGenome. OmniGenome is the first genomic FM that highlights the importance of sequence-structure alignment and is available for RNA design tasks.

  • OmniGenome+. OmniGenome+ is an enhanced variant of OmniGenome that feeds both sequences and structures into OmniGenome and aggregates the feature representations to improve modeling ability. A fine-tuning example of OmniGenome+ is provided in the supplemental materials.

We are aware that some FMs are also developed for RNA, such as RNA-FM [26], RNA-MSM [27], Uni-RNA [28], 5UTR-LM [16], etc. However, we cannot compare OmniGenome with them because either we were unable to get their source code to work in our experiments or the models are not publicly available.

9 Genetic Algorithm for RNA Sequence Design

9.1 Genetic Algorithm

Figure 5: The genetic algorithm used for solving RNA design tasks. ‘M’ and ‘A’ are abbreviations for ‘<mask>’ and the predicted bases in this mutation operation, respectively. The most effective component in this algorithm is the structure-based sequence reconstruction based on OmniGenome+.

The working mechanism of our designed genetic algorithm based on OmniGenome+++ is implemented as the following five-step process:

  1. Step 1.

    Given the target RNA secondary structure, we use OmniGenome to generate a set of candidate sequences $\mathcal{P}=\{\mathbf{s}^{i}\}_{i=1}^{N}$.

  2. Step 2.

    If the termination criterion is not met, go to Step 3; otherwise, output the current best sequence $\mathbf{s}^{\ast}=\operatorname*{argmax}_{\mathbf{s}\in\mathcal{P}}f(\mathbf{s})$.

  3. Step 3.

    Based on $\mathcal{P}$, use single-point crossover and mutation to generate a population of offspring sequences $\mathcal{O}=\{\tilde{\mathbf{s}}^{i}\}_{i=1}^{N}$.

  4. Step 4.

    Combine $\mathcal{P}$ and $\mathcal{O}$ to obtain $\mathcal{S}=\mathcal{P}\cup\mathcal{O}$, and use OmniGenome to predict the secondary structure of each sequence in $\mathcal{S}$. Evaluate the fitness values of the sequences in $\mathcal{S}$.

  5. Step 5.

    Sort $\mathcal{S}$ according to the fitness values and preserve the best $N$ sequences to constitute a new $\mathcal{P}$. Return to Step 2.

Note that the fitness value of a sequence $\mathbf{s}$, denoted as $f(\mathbf{s})$, is evaluated as the Hamming distance of the RNA secondary structure predicted by OmniGenome against the target structure. The above genetic algorithm is not terminated until the sequence for the target RNA secondary structure is identified or the allocated computational budget is exhausted.
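The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: `toy_fold` and `random_init` are hypothetical stand-ins for the OmniGenome structure predictor and the OmniGenome-based candidate generator, the mutation operator resamples bases at random instead of masking and predicting them, and the population size, mutation rate, and generation budget are arbitrary. The sketch also scores a sequence by the negative Hamming distance, an assumption made so that a larger value is better, as the argmax in Step 2 requires.

```python
import random
from typing import Callable, List

BASES = "AUCG"

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def crossover(p1: str, p2: str, rng: random.Random) -> str:
    """Single-point crossover: prefix of p1 joined with the suffix of p2."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Resample each base with probability `rate`.
    Assumption: stands in for the mask-and-predict mutation of Figure 5."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in seq)

def design(target: str,
           fold: Callable[[str], str],
           init: Callable[[str, random.Random], str],
           n: int = 20, max_gens: int = 200,
           rate: float = 0.2, seed: int = 0) -> str:
    """Genetic search for a sequence whose predicted structure matches `target`."""
    rng = random.Random(seed)

    def fitness(s: str) -> int:
        # Negative Hamming distance to the target: 0 means solved.
        return -hamming(fold(s), target)

    # Step 1: initial candidate population P.
    pop: List[str] = [init(target, rng) for _ in range(n)]
    for _ in range(max_gens):
        # Step 2: stop as soon as the best candidate folds into the target.
        best = max(pop, key=fitness)
        if fitness(best) == 0:
            return best
        # Step 3: offspring O via single-point crossover and mutation.
        parents = [rng.sample(pop, 2) for _ in range(n)]
        offspring = [mutate(crossover(a, b, rng), rate, rng) for a, b in parents]
        # Steps 4 and 5: evaluate S = P ∪ O and keep the best N sequences.
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:n]
    # Budget exhausted: return the best sequence found so far.
    return max(pop, key=fitness)

# Hypothetical stand-ins so that the sketch runs without any model.
def toy_fold(seq: str) -> str:
    """Toy 'folding' rule: G -> '(', C -> ')', anything else -> '.'."""
    return "".join("(" if b == "G" else ")" if b == "C" else "." for b in seq)

def random_init(target: str, rng: random.Random) -> str:
    """Random sequence with the same length as the target structure."""
    return "".join(rng.choice(BASES) for _ in target)

target = "((..))"
solution = design(target, toy_fold, random_init)
print(solution, toy_fold(solution))
```

Because the best $N$ sequences of the union are always retained, the best fitness in the population never decreases from one generation to the next; replacing the two stand-ins with real model calls leaves the loop itself unchanged.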

Figure 6: Examples for RNA sequence design. We show four examples from the EternaV2 benchmark, where two puzzles (#5 and #11) are correctly solved and two puzzles (#3 and #27) are incomplete. The top four sequences with structures are the reference solutions, and the bottom sequences are obtained by OmniGenome$^{186\texttt{M}}$+. The structures are derived by ViennaRNA, and the red boxes highlight the differences between the reference and nearly solved structures.

9.2 RNA Design Case Study

As demonstrated in the zero-shot experiments in Table 3, OmniGenome+ models achieve state-of-the-art performance. We show several complex examples of puzzles from the EternaV2 benchmark. According to Fig. 6, puzzles #5 and #11 with approximately 200+ bases are solved, while these puzzles are challenging to existing FMs. Even for puzzles that are not completely solved, e.g., puzzles #3 and #27, OmniGenome$^{186\texttt{M}}$+ generates very similar structures, where the nucleotide base difference ratio between the designed structure and the target structure is only $\approx 3\%$. This finding indicates the proficiency of OmniGenome+ models in solving challenging single-nucleotide resolution genomic tasks.

10 Benchmark Suites

10.1 RNA Genomic Benchmark

Details of the RGB can be found in Table 7. RGB contains 7 SN-level tasks that are curated or collected from published articles. The purpose of RGB is to benchmark genomic FMs in challenging SN-level modeling tasks such as detection and repair of SN mutations, mRNA sequence degradation rates, and RNA secondary structure prediction. Due to the lack of a plant RNA benchmark dataset, RGB includes the modeling of RNA sequences from a variety of species, e.g., plant and human. The sequence length in RGB ranges from 107 to 512, which is sufficient for most RNA understanding tasks. In summary, these multi-species and SN-level tasks in RGB serve as the first comprehensive benchmark utilized to assess the RNA sequence modeling capabilities of OmniGenome and its baseline models. The datasets in RGB are briefly introduced as follows:

  • Single-Nucleotide Mutation Detection (SNMD): We developed a plant RNA dataset synthesizing the single-nucleotide mutations. Focused on identifying potential single nucleotide changes, this task is essential for detecting mutations linked to genetic disorders. The SNMD dataset introduces up to 10 random mutations in the original sequences, regardless of variation ratios. Cross-entropy is utilized as the loss function for this binary token classification task.

  • Single-Nucleotide Mutation Repair (SNMR): This task challenges the model to suggest corrective actions at the single nucleotide level, aiding in gene therapy approaches. The SNMR dataset mirrors the SNMD dataset and also uses cross-entropy as the loss function, but it is a 4-way (i.e., A, U, C, G) token classification task.

  • mRNA Degradation Rate Prediction (mRNA): Estimating the decay rate of nucleotides in mRNA sequences, this task is vital for deciphering gene expression and regulation. The dataset originates from the Kaggle COVID-19 vaccine design competition (https://www.kaggle.com/competitions/stanford-covid-vaccine), focusing solely on sequence-based degradation rate prediction and excluding RNA structures. It is a token regression task using MSE as the loss function, with the dataset resplit into training, validation, and testing sets for evaluation.

  • RNA Secondary Structure Prediction (bpRNA & ArchiveII & RNAStralign & Rfam): Aiming to predict RNA folding into secondary structures, this task is fundamental to RNA functionality and interactions. We evaluated OmniGenome on four datasets: bpRNA [5] (TR0, VL0, TS0 sets), ArchiveII [6], RNAStralign [4], and Rfam [7]. Following existing works, we have excluded sequences over 512 bases and complex structures, simplifying to three symbols: ‘(’, ‘.’, and ‘)’. Because of these modifications, results may not be directly comparable with other studies. Cross-entropy serves as the loss function.
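To make the SNMD and SNMR task formats concrete, the following sketch builds one synthetic example from an original sequence. It is an illustration under assumptions rather than the procedure used to build RGB: the function name `make_snm_example` is hypothetical, the number of mutations is drawn uniformly between 1 and 10, and each mutated base is replaced by a different base chosen uniformly at random.

```python
import random
from typing import List, Optional, Tuple

BASES = "AUCG"

def make_snm_example(seq: str,
                     max_mutations: int = 10,
                     rng: Optional[random.Random] = None
                     ) -> Tuple[str, List[int], str]:
    """Return (mutated sequence, SNMD labels, SNMR target).

    SNMD: binary label per token, 1 where the base was mutated.
    SNMR: the target at every token is the original base (4-way classification).
    """
    rng = rng or random.Random()
    # Assumption: at least one and at most `max_mutations` positions are mutated.
    k = rng.randint(1, min(max_mutations, len(seq)))
    positions = set(rng.sample(range(len(seq)), k))
    mutated = list(seq)
    for p in positions:
        # Always pick a base different from the original one.
        mutated[p] = rng.choice([b for b in BASES if b != seq[p]])
    labels = [1 if i in positions else 0 for i in range(len(seq))]
    return "".join(mutated), labels, seq

original = "AUCG" * 50  # a toy sequence of length 200
mutated, snmd_labels, snmr_target = make_snm_example(original, rng=random.Random(0))
print(sum(snmd_labels), "mutated positions")
```

The same mutated sequence serves as the input for both tasks; only the per-token target differs, which is consistent with the statement above that the SNMR dataset mirrors the SNMD dataset.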

Table 7: The brief statistics of subtasks in the RGB. These benchmark datasets are held out or not included in the pretraining database. The numbers of examples in training, validation and testing sets are separated by “/”. indicate the datasets are used for zero-shot performance evaluation only.
Task | Task Type | # of examples | # of classes | Metric | Sequence length | Source
--- | --- | --- | --- | --- | --- | ---
SNMD | Token classification | 8,000/1,000/1,000 | 2 | AUC | 200 | This work
SNMR | Token classification | 8,000/1,000/1,000 | 4 | F1 | 200 | This work
mRNA | Token regression | 1,735/193/192 | - | RMSE | 107 | Kaggle
bpRNA | Token classification | 10,814/1,300/1,305 | 3 | F1 | ≤ 512 | [5]
ArchiveII | Token classification | 2,278/285/285 | 3 | F1 | ≤ 500 | [6]
RNAStralign | Token classification | 17,483/2,186/2,185 | 3 | F1 | ≤ 500 | [4]
Rfam | Token classification | 501,376/62,672/62,672 | 3 | F1 | ≤ 512 | [7]

Please refer to the appendix for the input and output examples of each subtask in RGB. The detailed task descriptions, including the number of examples, classes, evaluation metric, and sequence length, are outlined in Table 7. Each task is carefully curated to reflect the complexity and variety inherent in genomic data, providing a robust framework for assessing the nuanced capabilities of state-of-the-art RNA FMs.

Table 8 shows virtual examples of the different datasets in RGB. Please refer to our supplementary materials for more details on the datasets.

Table 8: The virtual input and output examples in RGB. The “...” represents the parts of the sequences that are omitted for better presentation, and the red color indicates the wrong prediction in classification tasks. In the mRNA dataset, every single nucleotide has three values to predict. Note that “T” and “U” can be regarded as the same symbol in RNA sequences; which one appears depends on the dataset.
Genome Type | Dataset | Examples
--- | --- | ---
RNA | SNMD | Input Sequence: G A G T A ... T T G A G
 | | True Label: 0 0 1 0 0 ... 0 0 1 0 0
 | | Prediction: 0 0 0 0 0 ... 0 0 1 0 0
 | SNMR | Input Sequence: T A C G A ... C T G A T
 | | True Label: T A C A A ... G T A A T
 | | Prediction: T A C A A ... C T G A T
 | mRNA | Input Sequence: G G ... A C
 | | True Label: [0.1,0.3,0.2] [0.8,0.4,0.1] ... [0.9,0.4,0.3] [0.5,0.2,0.6]
 | | Prediction: [0.1,0.3,0.2] [0.8,0.4,0.1] ... [0.9,0.4,0.3] [0.5,0.2,0.6]
 | bpRNA | Input Sequence: G G C G A ... C U U U U
 | | True Label: ( ( ( . . ... . . ) ) )
 | | Prediction: ( ( ( ( . ... . ) ) ) )
 | ArchiveII | Input Sequence: A G U A G ... U U U G C U
 | | True Label: ( ( ( . . ... . . ) ) )
 | | Prediction: ( ( ( . . ... . . ) ) )
 | RNAStralign | Input Sequence: A G U A G ... U U U G C U
 | | True Label: ( ( ( . . ... . . ) ) )
 | | Prediction: ( ( ( . . ... . . ) ) )
 | Rfam | Input Sequence: A G U A G ... U U U G C U
 | | True Label: ( ( ( . . ... . . ) ) )
 | | Prediction: ( ( ( . . ... . . ) ) )

10.2 Plant Genomic Benchmark

The Plant Genomic Benchmark [14] (PGB) provides a comprehensive suite of datasets designed to evaluate and improve the predictive capabilities of genomic models in plant biology. This benchmark, as shown in Table 9, encompasses a range of critical genomic tasks (https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b), including binary classification, single and multi-variable regression, and multi-label classification, addressing various aspects of plant genomics such as RNA processing, gene expression, and chromatin accessibility. By integrating diverse genomic tasks, the PGB aims to facilitate advanced research and development in plant genomics, offering a robust platform for the assessment and enhancement of model performance across different plant species. For a detailed description of PGB, please refer to Agro-NT [14].

Table 9: The genomic tasks in the Plant Genomic Benchmark. This table briefly enumerates each task by name, the number of datasets available, the type of classification or regression analysis required, the range of sequence lengths, and the total number of samples in each dataset. Please find the dataset details of PGB in Agro-NT.
Task | # of datasets | Task Type | Total # of examples | # of classes | Metric | Sequence length
--- | --- | --- | --- | --- | --- | ---
Polyadenylation | 6 | Sequence classification | 738,918 | 2 | F1 | 400
Splice site | 2 | Sequence classification | 4,920,835 | 2 | F1 | 398
LncRNA | 2 | Sequence classification | 58,062 | 6 | F1 | 101-6000
Promoter strength | 2 | Sequence regression | 147,966 | - | RMSE | 170
Terminator strength | 2 | Sequence regression | 106,818 | - | RMSE | 170
Chromatin accessibility | 7 | Multi-label classification | 5,149,696 | 9-19 | F1 | 1,000
Gene expression | 6 | Multi-variable regression | 206,358 | - | RMSE | 6,000
Enhancer region | 1 | Sequence classification | 18,893 | 2 | F1 | 1,000

11 OneKP Initiative

A variety of FMs have been applied to different species, e.g., humans [3, 13], bacteria [2], and viruses [42], which indicates the effectiveness of pretrained FMs on multi-species genomics. In this work, we aim to propose an FM for multi-species plant RNA sequence modeling. To address the scarcity of plant RNA data, we leverage the OneKP initiative [23], which contains transcriptomes of 1,124 plant species. The scale of OneKP enables the development of a more robust and transferable RNA FM.

The 1000 Plant Transcriptomes Initiative (OneKP) was a comprehensive effort aimed at exploring genetic diversity across the green plant kingdom (Viridiplantae), sequencing the RNA from 1,124 (1,342 in other versions) samples that represent over 1000 species, encompassing all major taxa within Viridiplantae. This includes streptophyte and chlorophyte green algae, bryophytes, ferns, angiosperms, and gymnosperms. The initiative’s final or capstone publication presents three major analyses: inferring species trees, identifying whole genome duplications, and detecting gene family expansions. These findings are particularly valuable for plant and evolutionary scientists interested in specific gene families, whether their focus is across the entire green plant tree of life or within more narrowly defined lineages.

The sampling strategy for OneKP was global and collaborative, with samples sourced from a wide range of environments including wild field collections, greenhouses, botanical gardens, laboratory specimens, and algal culture collections. The initiative prioritized the collection of live growing cells, such as young leaves, flowers, or shoots, to ensure a high abundance of expressed genes, though many samples also came from roots and other tissues. RNA extraction was performed using well-established protocols or commercial kits, facilitating the comprehensive analysis of transcribed RNA across this diverse set of species. This monumental effort not only sheds light on plant genetic diversity but also provides a rich data resource for ongoing and future research in plant science and evolutionary biology.

12 Limitations

The limitations of our work are rooted in resource constraints. First, according to the data scaling laws [43, 44, 45], we recognize that the scale of the proposed RNA foundation model remains small, and the parameterization may not be adequate to fully exploit the potential of the OneKP database. Due to resource limitations, we could not pretrain larger models. Moving forward, we aim to train larger-scale foundation models, anticipating that our contributions will accelerate the advancement of DNA and RNA foundation models. Second, while the modeling length of our foundation model satisfies most RNA and DNA sequence processing needs, there are tasks with extremely long sequences. Our future efforts will focus on enhancing the model’s capability to handle such lengthy sequences.

13 Ethics Statement

In this research, we utilized the open OneKP dataset, which does not raise human-related privacy concerns. We ensure that such data is not exploited without fair compensation and acknowledgment of the source communities. Because the pretraining sequences are plant-based genomic data, misuse of the model could pose potential harm to ecological systems; we therefore do not permit the use of our model beyond its intended purposes, such as developing malicious bio-software or designing harmful RNA structures. The models and findings should support, not undermine, the conservation of plant species and their habitats. We adhere to principles of transparency and open science, using datasets that are publicly available and providing clear documentation of our methodologies and findings.

Overall, in conducting this research, we have committed to ethical scientific practices that respect biodiversity and aim to contribute positively to the field of genomic research. We encourage ongoing dialogue around the ethical use of plant RNA sequences and support initiatives that promote the sharing of benefits arising from such research with all stakeholders.