OmniGenome: Aligning RNA Sequences with Secondary Structures in Genomic Foundation Models

Heng Yang Department of Computer Science, University of Exeter, EX4 4QF, Exeter, UK Ke Li Department of Computer Science, University of Exeter, EX4 4QF, Exeter, UK

Abstract: The structures of RNA sequences play a vital role in various cellular processes, while existing genomic foundation models (FMs) have struggled with precise sequence-structure alignment, due to the complexity of exponential combinations of nucleotide bases. In this study, we introduce OmniGenome, a foundation model that addresses this critical challenge of sequence-structure alignment in RNA FMs. OmniGenome bridges the sequences with secondary structures using structure-contextualized modeling, enabling hard in-silico genomic tasks that existing FMs cannot handle, e.g., RNA design tasks. The results on two comprehensive genomic benchmarks show that OmniGenome achieves state-of-the-art performance on complex RNA subtasks. For example, OmniGenome solved $74\%$ of complex puzzles, compared to SpliceBERT which solved only $3\%$ of the puzzles. Besides, OmniGenome solves most of the puzzles within $1$ hour, while the existing methods usually allocate $24$ hours for each puzzle. Overall, OmniGenome establishes wide genomic application cases and offers profound insights into biological mechanisms from the perspective of sequence-structure alignment.

Keywords: Genomic Foundation Model, RNA Design, Secondary Structure Prediction

1 Introduction

RNA, folded from diverse nucleotide sequences, is a critical type of molecule and enables the flow of genetic information from DNA to protein [1, 2]. Similar to natural language texts, nucleotide sequences are regarded as sentences written in the “language” of genomics [3]. Therefore, many studies aim to leverage pretrained genomic foundation models, a.k.a. language models (LMs), to address in-silico RNA modeling tasks, such as secondary structure prediction [4, 5, 6, 7], degradation rate prediction [8, 9], and mRNA vaccine design [10, 11]. The secondary structures of RNA molecules (the term “structure” refers to “RNA secondary structure” throughout this work) have a vital impact on various cellular processes [12], such as gene expression [13, 14] and protein translation [15, 16]. However, the performance of existing FMs is unsatisfactory because they only model the RNA sequences without secondary structures. Consequently, leveraging secondary structures in RNA modeling is crucial for FMs to solve complex genomic subtasks, e.g., RNA design [17]. One intuitive way to utilize secondary structures is to build a bidirectional mapping between sequences and structures. We formulate this mapping as sequence-structure alignment, which can be divided into two parts: sequence-to-structure (Seq2Str) and structure-to-sequence (Str2Seq) prediction. Both predictions are challenging and have not been settled in previous works because the spaces of sequences and structures grow exponentially with lengths of up to thousands of bases. In short, to align sequence and structure in FM-based genomic modeling, we need to address the Seq2Str and Str2Seq predictions simultaneously.

Seq2Str prediction. Secondary structure prediction is a difficult task because the number of possible structures grows exponentially with sequence length, and an identical sequence may fold into different sub-optimal structures because the folding principles of RNA sequences depend on a variety of biological factors [18], as indicated in Fig. 1. Moreover, biologically verified structures are far more expensive to obtain and process than raw sequences. This cost causes an imbalance in scale between RNA sequences and verified structures, making it hard to train an FM for Seq2Str prediction, which requires a tremendous number of secondary structures. However, existing works on RNA modeling generally utilize the masked language modeling [19] objective adapted from natural language processing (NLP) to pretrain FMs and regard secondary structure prediction as a fine-tuning downstream task [5, 6], which inevitably limits the performance of FMs that are not pretrained on Seq2Str prediction.

Figure 1: A real example of RNA sequence folding. Sub-figures (a) and (c) show the same sequence with different structures. Sub-figures (b) and (c) show that an identical structure can arise from different sequences.

To address the data scarcity of secondary structures in Seq2Str, we leverage ViennaRNA [20] to compute the structures for billions of RNA sequences and introduce Seq2Str prediction into the pretraining phase. The pretraining on billions of structures will build an effective and robust mapping from sequences to structures. Our experiments in Section 3.3 show that the FM can achieve state-of-the-art performance even in zero-shot secondary structure prediction.

Str2Seq prediction. In addition to the data scarcity problem aforementioned in Seq2Str, Str2Seq prediction presents a substantial challenge because secondary structures can correspond to heterogeneous sequences as shown in Fig. 1, where the identical structure can be from different sequences. Therefore, it is impossible to reconstruct sequences according to only structures based on structure-to-sequence supervised training because there is no deterministic sequence for each structure. Consequently, the Str2Seq prediction has not been explored in the existing genomic FMs.

To overcome the challenge in Str2Seq prediction, we formulate the Str2Seq as a structure-contextualized sequence reconstruction task. Since RNA secondary structures can be represented in texts, we concatenate the sequences and structure pairs and then mask a small portion of the sequence or structure tokens before feeding them into the FM. In other words, we pretrain the FM to reconstruct the missing nucleotide bases given the context of structures, which will build the Str2Seq mapping ability for the FM. To perform this pretraining objective, we expand the vocabulary to represent the structure and deploy corresponding embeddings in the FM, because regular genomic FMs cannot recognize and embed the structures, as shown in the virtual example in Fig. 2.

Benchmark and Evaluation Results. We evaluate our FM on two comprehensive in-silico genomic understanding benchmarks. The first benchmark, compiled in this study, is the RNA genomic benchmark (RGB), which contains diverse and challenging genomic understanding tasks that benefit from sequence-structure alignment, such as secondary structure prediction. The second benchmark is the plant genomic benchmark (PGB), which contains millions of DNA sequences for evaluating DNA sequence understanding tasks. This benchmark is used to evaluate the generalizability of our FM across diverse species and genomes. On both benchmarks, our FM (up to $186$ M parameters) consistently outperforms existing genomic FMs, with up to $35\%$ improvement even compared with Agro-NT [14], which contains $1$ billion parameters.

We also conduct zero-shot Seq2Str and Str2Seq prediction experiments to verify the performance of sequence-structure alignment. As revealed by the experiments in Section 3.2 and Section 3.3, our FM achieves up to an $83\%$ macro-F1 score in zero-shot Seq2Str prediction, i.e., secondary structure prediction, outperforming some fine-tuned FMs and ViennaRNA. In terms of Str2Seq prediction performance, we leverage our FM to solve the complex RNA design task and solve $74\%$ of the complex puzzles of the Eterna [17] V2 benchmark, while state-of-the-art FMs, e.g., SpliceBERT [21], solve only $3\%$ of the puzzles, all of them simple ones (i.e., the target sequences are very short). Besides, our FM takes at most $1$ hour to solve most of the puzzles, while the majority of RNA design methods can take up to $24$ hours to solve each puzzle.

Open-source Resources. Genomic sequence (e.g., RNA and DNA) modeling is still in its early stage, and the related open-source resources are extremely limited and poorly documented. Besides, many of the early works on genomic FMs are not ready to use for various reasons. This problem encourages us to release all the benchmarks, evaluation scripts, FM tutorials, etc. As a result, we have developed an open-source package (https://github.com/yangheng95/OmniGenomeBench) that includes step-by-step tutorials for FM pretraining and downstream task fine-tuning. It provides prepared genomic benchmarks and automates the benchmarking of FMs through an API that requires only a few lines of code. These resources avoid tedious and time-consuming programming in future genomic FM research.

Finally, the FM proposed in this work is named OmniGenome, which may advance a deep mechanistic understanding of RNA sequences as well as molecular biology and accelerate our ability to design RNA molecules.

2 Methodology

This section delineates the implementation of OmniGenome step by step.

2.1 Data Preprocessing

Recent studies [21, 22] have shown that data diversity can enhance FM performance without significantly increasing model capacity. For the OmniGenome pretraining, we collected transcriptome data from the OneKP initiative [23], which compiles large-scale RNA-Seq data from $1,124$ plant species. Because the raw biological data are sparse, noisy, and characterized by long sequences, they are far from ready for effective FM training.

To address this problem, we developed a four-step data curation protocol to improve data quality. $\blacktriangleright$ Raw RNA-Seq data are often excessively long, with thousands of bases. We first sliced them into segments with a window size of $1,024$ bases to provide a sufficient context window for RNA sequence understanding. $\blacktriangleright$ To enhance training efficiency and reduce bias, we removed all duplicate sequences. $\blacktriangleright$ To tackle incomplete transcriptome data and other noise, we discarded sequences shorter than $50$ bases. $\blacktriangleright$ To facilitate the sequence-structure alignment training, we adopted ViennaRNA to obtain the secondary structures for the sequences.
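The four curation steps can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the non-overlapping windows, the function names, and the pluggable `fold` callable are assumptions, and `unpaired_fold` is only a placeholder that stands in for a real folding tool.

```python
from typing import Callable, Iterable, Iterator

WINDOW = 1024   # slicing window size stated in the text
MIN_LEN = 50    # minimum segment length stated in the text


def slice_windows(seq: str, window: int = WINDOW) -> Iterator[str]:
    """Step 1: cut a long transcript into consecutive windows.

    Non-overlapping windows are an assumption; the text gives no stride.
    """
    for start in range(0, len(seq), window):
        yield seq[start:start + window]


def curate(transcripts: Iterable[str],
           fold: Callable[[str], str]) -> list[tuple[str, str]]:
    """Apply the four steps and return (sequence, structure) pairs."""
    seen: set[str] = set()
    pairs: list[tuple[str, str]] = []
    for transcript in transcripts:
        for segment in slice_windows(transcript.upper()):
            if segment in seen:            # step 2: remove duplicates
                continue
            seen.add(segment)
            if len(segment) < MIN_LEN:     # step 3: discard short fragments
                continue
            pairs.append((segment, fold(segment)))  # step 4: add structure
    return pairs


def unpaired_fold(seq: str) -> str:
    """Hypothetical placeholder: marks every base as unpaired ('.')."""
    return "." * len(seq)
```

With ViennaRNA's Python bindings installed, `fold` could be replaced by a thin wrapper such as `lambda s: RNA.fold(s)[0]`, since `RNA.fold` returns the dot-bracket structure together with its minimum free energy.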

2.2 Tokenization

In the realm of genomic sequence modeling, modeling resolution and performance highly depend on the tokenization strategy [3, 21]. Because the traditional k-mers [24, 13] and BPE [19, 22] methods combine multiple bases into single tokens, they may compromise modeling resolution and thus fail to capture granular features. To achieve the single nucleotide (SN) resolution necessary for genomic tasks like RNA design and structure prediction, we employed the single nucleotide tokenization (SNT) method [3, 21] (see an illustrative example in Fig. 3). Additionally, to simplify the implementation, we adopted a vocabulary $\{$ ‘A’, ‘T’, ‘C’, ‘G’, ‘U’, ‘N’, ‘(’, ‘)’, ‘.’ $\}$ to unify the tokenization of both nucleotide bases and RNA secondary structure information. Following BERT-style models [19], we incorporated special tokens, e.g., ‘<mask>’, to facilitate masked base and structure prediction.
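A minimal sketch of single nucleotide tokenization over the unified vocabulary is given below. The token ids, the `<pad>` token, and the function names are assumptions made for illustration; only the nine base and structure symbols and `<mask>` come from the text.

```python
BASES = ["A", "T", "C", "G", "U", "N"]
STRUCTURE = ["(", ")", "."]
SPECIAL = ["<pad>", "<mask>"]  # '<pad>' is assumed; the text names only '<mask>'

# One id per symbol: every base and every structure character is its own token.
VOCAB = {tok: idx for idx, tok in enumerate(SPECIAL + BASES + STRUCTURE)}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}


def tokenize(text: str) -> list[str]:
    """Split text into single-character tokens, keeping '<mask>' whole."""
    tokens, i = [], 0
    while i < len(text):
        if text.startswith("<mask>", i):
            tokens.append("<mask>")
            i += len("<mask>")
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def encode(text: str) -> list[int]:
    """Map a sequence and/or structure string to token ids."""
    return [VOCAB[tok] for tok in tokenize(text)]


def decode(ids: list[int]) -> str:
    """Map token ids back to the original string."""
    return "".join(ID_TO_TOKEN[i] for i in ids)
```

Because each token covers exactly one position, a masked token always corresponds to exactly one base or one structure symbol, which is what SN-level tasks require.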

2.3 Pretraining Objectives

As discussed in Section 1, a key desideratum for complex genomic understanding and RNA sequence design is helping the model build an alignment between RNA sequences and their corresponding secondary structures. Bearing this in mind, we designed the following aggregated pretraining objective for OmniGenome:

$$\mathcal{L}_{\texttt{pretrain}}=\mathcal{L}_{\texttt{Str2Seq}}+\mathcal{L}_{\texttt{Seq2Str}}+\mathcal{L}_{\texttt{MRLM}}+\lambda\|\theta\|_{2},\tag{1}$$

where $\lambda$ is the $\ell_{2}$ regularization weight and $\theta$ represents the parameters of OmniGenome. The following paragraphs explain the design principles of each loss function used in equation (1).

• $\mathcal{L}_{\texttt{Str2Seq}}$ is designed to enable OmniGenome to predict bases given partially masked sequences with the context of structures, so that OmniGenome can perceive RNA secondary structure by constructing Str2Seq prediction. To achieve this, we mask $15\%$ of the bases and structure tokens, encouraging the model to infer masked bases (i.e., $\{$ ‘A’, ‘T’, ‘C’, ‘G’, ‘U’, ‘N’ $\}$ ) and structure tokens (i.e., $\{$ ‘(’, ‘)’, ‘.’ $\}$ ). Specifically, $\mathcal{L}_{\texttt{Str2Seq}}$ is defined as the classic cross-entropy loss widely used in masked language modeling:

$$\mathcal{L}_{\texttt{Str2Seq}}=-\frac{1}{m}\sum_{i=1}^{m}\log p(x_{i}\mid x_{\setminus i}),\tag{2}$$

where $m$ is the number of masked nucleotide and structure tokens, and $p(x_{i}\mid x_{\setminus i})$ indicates the probability of predicting the masked token $x_{i}$ based on its context.

• Complementary to $\mathcal{L}_{\texttt{Str2Seq}}$ , $\mathcal{L}_{\texttt{Seq2Str}}$ is designed to enable OmniGenome to perform Seq2Str prediction. Instead of directly feeding the secondary structure into OmniGenome during training, this objective employs the RNA secondary structures as supervised-training labels. Unlike Str2Seq, the Seq2Str prediction is generally deterministic because the optimal structures are unique. This task is implemented as a token-level classification, where the $\mathcal{L}_{\texttt{Seq2Str}}$ loss is defined as the following cross-entropy loss:

$$\mathcal{L}_{\texttt{Seq2Str}}=-\sum_{i=1}^{N}\sum_{c=1}^{C}s_{ic}\log(\hat{s}_{ic}),\tag{3}$$

where $s_{ic}$ indicates whether the secondary structure label at the $i$ -th position is $c$ , and $\hat{s}_{ic}$ is the corresponding probability predicted by a linear classifier deployed on OmniGenome. $N$ is the length of an RNA sequence and $C=3$ denotes the number of possible structure labels, i.e., $\{$ ‘(’, ‘)’, ‘.’ $\}$ .

• The last one, $\mathcal{L}_{\texttt{MRLM}}$ , is similar to the conventional masked language modeling loss in NLP. It aims to improve the model’s understanding of latent patterns in RNA sequences by masking or randomly replacing $15\%$ of nucleotide bases. The definition of $\mathcal{L}_{\texttt{MRLM}}$ is similar to that of $\mathcal{L}_{\texttt{Str2Seq}}$ , but it only considers the prediction of masked nucleotide bases.
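The two cross-entropy forms used by these objectives can be written in a few lines of NumPy. This is a sketch under stated assumptions: the function names and array layouts are hypothetical, the logits stand in for the outputs of the FM and of the linear classifier, and at least one position is assumed to be masked.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def masked_lm_loss(logits: np.ndarray, targets: np.ndarray,
                   mask: np.ndarray) -> float:
    """Masked cross-entropy: mean negative log-likelihood over masked tokens.

    logits: (L, V) scores over the vocabulary; targets: (L,) true token ids;
    mask: (L,) with 1 at masked positions and 0 elsewhere.
    """
    logp = log_softmax(logits)
    picked = logp[np.arange(len(targets)), targets]  # log p(x_i | context)
    m = mask.sum()                                   # number of masked tokens
    return float(-(picked * mask).sum() / m)


def seq2str_loss(logits: np.ndarray, labels: np.ndarray) -> float:
    """Token-level classification loss summed over all N positions.

    logits: (N, C) classifier scores with C = 3 structure labels;
    labels: (N,) integer structure labels.
    """
    logp = log_softmax(logits)
    one_hot = np.eye(logits.shape[-1])[labels]       # indicator s_ic
    return float(-(one_hot * logp).sum())
```

Under this sketch, the MRLM loss would reuse `masked_lm_loss` with a mask restricted to nucleotide positions, and the Str2Seq loss would use a mask covering both base and structure positions.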

Appending the structure to the input sequence would leak the labels of the $\mathcal{L}_{\texttt{Seq2Str}}$ objective, so the structure-contextualized inputs cannot be shared across all three objectives. In practice, $70\%$ of RNA sequences are used for structure-contextualized pretraining with the $\mathcal{L}_{\texttt{Str2Seq}}$ objective, while the remaining $30\%$ are used for the other two objectives. This proportion is based on our experimental observations.
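One way to picture the split is a per-example router that decides which objective an RNA sequence serves and builds the matching input. The sketch below is an assumption-laden illustration: the per-example random routing, the absence of a separator token, the dictionary layout, and the mask-only corruption (no random replacement) are simplifications that are not taken from the paper.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, rate, rng):
    """Replace roughly `rate` of the tokens with <mask>.

    Returns (inputs, labels); labels hold the original token at masked
    positions and None elsewhere (None means the position is not scored).
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def build_example(seq, struct, rng, p_str2seq=0.7, rate=0.15):
    """Route one (sequence, structure) pair to a pretraining objective."""
    if rng.random() < p_str2seq:
        # Str2Seq: the structure is part of the input, so it is never used
        # as a classification label for this example.
        inputs, labels = mask_tokens(list(seq) + list(struct), rate, rng)
        return {"objective": "Str2Seq", "inputs": inputs,
                "mlm_labels": labels, "structure_labels": None}
    # Seq2Str + MRLM: the input holds bases only; the structure is the label.
    inputs, labels = mask_tokens(list(seq), rate, rng)
    return {"objective": "Seq2Str+MRLM", "inputs": inputs,
            "mlm_labels": labels, "structure_labels": list(struct)}
```

Keeping the structure out of the input whenever it serves as a label is the point of the split: the classifier can never read its own targets.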

2.4 Model Architecture

OmniGenome adopts the classic Transformer encoder architecture with bidirectional multi-head attention. We designed two variants, dubbed OmniGenome ${}^{52\texttt{M}}$ and OmniGenome ${}^{186\texttt{M}}$ with $52$ and $186$ million parameters respectively. Some key model specifications are summarized in Table 1.

The training specifications are summarized as follows. $\blacktriangleright$ The learning rate is set to $5\times 10^{-5}$ and the weight decay is set to $0.01$ . $\blacktriangleright$ We use AdamW as the optimizer with hyperparameters $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . $\blacktriangleright$ We use a linear decay strategy with a warm-up period of $1,000$ steps in the learning rate scheduler. $\blacktriangleright$ The batch size is set to $2,048$ . $\blacktriangleright$ No dropout is applied during pretraining, and we use rotary position embeddings [25] to further enhance the model’s scalability to long RNA sequences. $\blacktriangleright$ We built a distributed training environment with $8$ NVIDIA RTX $4090$ GPUs and pretrained for $4$ weeks. The configuration of this environment is described in Appendix 6.
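The learning rate schedule described above can be sketched as a pure function of the step index. The peak rate and warm-up length come from the text; `total_steps` is an assumed parameter, since the total number of pretraining steps is not stated, and the linear ramp from zero during warm-up is likewise an assumption.

```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 5e-5, warmup_steps: int = 1000) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)
```

For example, with a hypothetical budget of 10,000 steps the rate reaches $5\times 10^{-5}$ at step 1,000 and falls back to half of that value at step 5,500.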

Table 1: Summary of some key model specifications of two OmniGenome variants.

Models	# of layers	Embedding dimension	Intermediate dimension	# of heads	# of parameters	Modeling length	Position embedding
OmniGenome ${}^{52\texttt{M}}$	$16$	$480$	$2,400$	$24$	$52$ M	$1,024$	Rotary
OmniGenome ${}^{186\texttt{M}}$	$32$	$720$	$2,560$	$30$	$186$ M	$1,024$	Rotary
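As a rough sanity check, the parameter counts in Table 1 can be approximated from the listed dimensions. The estimate below assumes a standard Transformer encoder layer with four attention projection matrices and a two-matrix feed-forward block, and it ignores biases, layer normalization, embeddings, and prediction heads; the function name is hypothetical.

```python
def encoder_weight_count(layers: int, d_model: int, d_ff: int) -> int:
    """Approximate number of weights in the Transformer encoder stack."""
    attention = 4 * d_model * d_model   # query, key, value, output projections
    feed_forward = 2 * d_model * d_ff   # up-projection and down-projection
    return layers * (attention + feed_forward)


small = encoder_weight_count(layers=16, d_model=480, d_ff=2400)
large = encoder_weight_count(layers=32, d_model=720, d_ff=2560)
print(f"{small:,}")  # 51,609,600
print(f"{large:,}")  # 184,320,000

# Each head gets an integer share of the embedding dimension.
assert 480 % 24 == 0 and 720 % 30 == 0
```

Under these assumptions the estimates land close to the reported $52$ M and $186$ M, with the gap plausibly covered by the terms the estimate leaves out.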

2.5 Downstream Tasks

2.5.1 Predictive Tasks Fine-tuning

OmniGenome is designed as a general-purpose RNA FM that can be fine-tuned for a diverse set of downstream genomics predictive tasks. In this paper, we constructed a large-scale benchmark suite for RNA FMs. According to the category of genomes, we split the benchmark into two parts.

• The first one is the RNA genomic benchmark (RGB), a collection of RNA genomic understanding tasks, as shown in Table 7. RGB contains $7$ SN-level tasks that are curated in this work or collected from published articles. The purpose of RGB is to benchmark genomic FMs on challenging SN-level modeling tasks such as detection and repair of SN mutations, mRNA sequence degradation rate prediction, and RNA secondary structure prediction. Due to the lack of a plant RNA benchmark dataset, RGB considers the modeling of RNA sequences from a variety of species, e.g., plant and human. The sequence length in RGB ranges from $107$ to $512$ , which is enough for most RNA understanding tasks. In summary, these multi-species and SN-level tasks in RGB serve as the first comprehensive benchmark for assessing the RNA sequence modeling capabilities of OmniGenome and its baseline models. For detailed information on each dataset, such as their sources and sizes, please refer to Appendix 10.1.
• The plant genomic benchmark [14] (PGB, https://huggingface.co/datasets/InstaDeepAI/plant-genomic-benchmark) shown in Table 9 provides a comprehensive suite of DNA genomic tasks designed to evaluate and improve the predictive capabilities of genomic models in plant biology. As a large-scale plant DNA FM benchmark, PGB involves $8$ types of DNA modeling subtasks, including a range of critical genomic tasks such as promoter strength prediction and gene expression regression. There are $28$ datasets in total with millions of DNA sequences to be evaluated in PGB, and the sequence lengths are up to $6000$ , which is very long for most genomic FMs. Since the original evaluation protocol is not publicly available, we have re-implemented the auto-benchmark for all the subtasks from PGB in our package, covering various aspects of plant genomics such as gene expression and chromatin accessibility. By integrating diverse genomic tasks, PGB aims to facilitate advanced research and development in plant genomics, offering a robust platform for the assessment and enhancement of model performance across different plant species. Due to computational limitations, we randomly sample a maximum of $10$ k examples from each dataset in PGB to evaluate the FM’s performance.

2.5.2 RNA Sequence Design

One of the difficult tasks addressed by OmniGenome is RNA design, which has long been neglected by existing FMs because they lack the Str2Seq prediction capability. We introduce a naive genetic algorithm based on OmniGenome’s Str2Seq capability and achieve state-of-the-art performance in RNA sequence design by precisely predicting bases according to the structure context. The main steps of the genetic algorithm and the workflow visualization are available in Appendix 9 and Fig. 5, respectively. The parameter settings are detailed in Section 3.2, and the experiment script can be found in the supplemental materials.
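To make the idea concrete, a generic genetic-algorithm loop for structure-guided design is sketched below. It is not the algorithm of Appendix 9: the selection rule, the absence of crossover and of a time budget, and the callables `fold` (sequence to dot-bracket structure) and `propose` (a stand-in for the FM's structure-conditioned base prediction) are all assumptions. The defaults mirror the population size, iteration count, and per-base mutation rate reported for the Eterna experiments.

```python
import random
from typing import Callable, Optional


def distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length structures differ."""
    return sum(x != y for x, y in zip(a, b))


def mutate(seq: str, rate: float,
           propose: Callable[[int, str], str], rng: random.Random) -> str:
    """Re-predict each base with probability `rate` using `propose`."""
    return "".join(
        propose(i, seq) if rng.random() < rate else base
        for i, base in enumerate(seq)
    )


def design(target: str,
           fold: Callable[[str], str],
           propose: Callable[[int, str], str],
           population: int = 1000, iterations: int = 100,
           mutation_rate: float = 0.5, seed: int = 0) -> Optional[str]:
    """Search for a sequence whose folded structure equals `target`."""
    rng = random.Random(seed)
    pool = ["".join(rng.choice("ACGU") for _ in target)
            for _ in range(population)]
    for _ in range(iterations):
        # Rank candidates by how far their structure is from the target.
        pool.sort(key=lambda s: distance(fold(s), target))
        if fold(pool[0]) == target:
            return pool[0]            # solved: structure matches exactly
        parents = pool[: max(1, population // 2)]
        children = [mutate(rng.choice(parents), mutation_rate, propose, rng)
                    for _ in range(population - len(parents))]
        pool = parents + children
    best = min(pool, key=lambda s: distance(fold(s), target))
    return best if fold(best) == target else None
```

In this sketch a puzzle counts as solved only when the folded structure matches the target at every position, which corresponds to the exact-match accuracy used for the design benchmark.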

3 Experiments

To comprehensively explore the potential of OmniGenome and assess its performance across genomic understanding and RNA design tasks, this section reports extensive experimental evaluations on diverse genomic tasks. We first evaluate the sequence-structure alignment capability of OmniGenome. Next, we evaluate the overall performance of OmniGenome on two comprehensive genomic modeling benchmarks, i.e., the RNA Genomic Benchmark (RGB) and the Plant Genomic Benchmark (PGB). None of the benchmark datasets comes from the pretraining database, i.e., OneKP.

3.1 Evaluation Methods

There are no direct counterparts to OmniGenome in plant genomic modeling. However, we can compare it with the recent genomic FMs listed in Table 2 as baselines. Please refer to Appendix 8 for brief introductions of these FMs. We are aware that some FMs have also been developed for RNA, such as RNA-FM [26], RNA-MSM [27], Uni-RNA [28], 5UTR-LM [16], etc. However, we cannot compare OmniGenome with them because we were unable to get their source code to work or because the models are not publicly available. In the secondary structure prediction tasks, the structure context fed to OmniGenome $+$ (i.e., the variants that receive structure context as input) is computed by ViennaRNA rather than taken from the true structures, which avoids data leakage.

Table 2: The brief statistics of RNA and DNA genomic FMs. The numbers of parameters of different FMs are approximately calculated. Please note that a ‘token’ can contain multiple bases in BPE and k-mers tokenization. The detailed introduction of the following FMs can be found in the original publications.

Model	Tokenization	# of Params	Pretraining Data	Species	Nucleic Acid
DNABERT-2	BPE	$117$ M	$32.49$ B Tokens	Human + $135$ Species	DNA
NT-V2- $100$ M	k-mers	$96$ M	$300$ B Tokens	Human + $850$ Species	DNA
HyenaDNA-Large	SNT	$47$ M	$\approx 3.2$ B Tokens	Human	DNA
Agro-NT- $1$ B	k-mers	$985$ M	$472.5$ B Tokens	$48$ Edible Plants	DNA
SpliceBERT	SNT	$19$ M	$2$ M Sequences	Multi-Vertebrates	precursor-mRNA
CDSBERT	SNT	$420$ M	$4$ M Sequences	$4,069$ RNA Families	CDS
3UTRBERT	k-mers	$86$ M	$20,362$ Sequences	Multi-Species	mRNA $3$ ’UTR
OmniGenome ${}^{52\texttt{M}}$	SNT	$52$ M	$54.2$ B Tokens	$1124$ Plant Species	mRNA, CDS, UTR
OmniGenome ${}^{186\texttt{M}}$	SNT	$186$ M	$54.2$ B Tokens	$1124$ Plant Species	mRNA, CDS, UTR

3.2 Zero-shot Structure to Sequence Prediction Evaluation

In this experiment, we demonstrate the Str2Seq prediction capability of OmniGenome based on RNA sequence design. We employed the Eterna [17] V2 benchmark, which consists of $100$ specified secondary structures. This task aims to design plausible RNA sequences based on the reference structures. Intuitively, higher performance in RNA design tasks indicates better structure-to-sequence prediction capability. We develop a simple genetic algorithm (GA) based on OmniGenome models to solve RNA design puzzles. The implementation details can be found in Fig. 5 in Appendix 9. Within our GA implementation, the population size is set to $1000$ , the number of iterations to $100$ , and the mutation rate for each base to $0.5$ . We set the maximum running time for each puzzle to $1$ hour because OmniGenome can easily solve most of the RNA design puzzles within $1$ hour, while each puzzle’s time budget is generally $24$ hours in existing RNA design methods. We use accuracy as our evaluation metric, meaning that a puzzle is considered solved only if the designed sequence folds into an entirely correct structure. The experimental results are available in Table 3.

Table 3: Performance on the EternaV2 RNA sequence design benchmark. Only the FMs using SNT support the RNA design tasks. The best results are in bold face.

Model	Token.	EternaV2 Acc (%)
RNAInverse	—	$30$
SpliceBERT	SNT	$3$
CDSBERT	SNT	$0$
OmniGenome ${}^{52\texttt{M}}$	SNT	$4$
OmniGenome ${}^{186\texttt{M}}$	SNT	$5$
OmniGenome ${}^{52\texttt{M}}+$	SNT	$71$
OmniGenome ${}^{186\texttt{M}}+$	SNT	$\mathbf{74}$

It is observed from Table 3 that the FM-free method RNAInverse solved $30$ of the RNA design puzzles, indicating a competent but not optimal capability in sequence design. We select the FMs employing SNT for the comparison experiments, as tokenizers like k-mers and BPE cannot handle SN-level resolution modeling, i.e., masked nucleotide prediction. We cannot compare with HyenaDNA because it does not support masked nucleotide prediction. SpliceBERT and CDSBERT demonstrated negligible proficiency in the RNA sequence design task, solving $3$ and $0$ puzzles, respectively. This suggests that these FMs cannot precisely predict the bases without any Str2Seq mapping ability. OmniGenome variants without structure context input perform similarly to the baselines, i.e., OmniGenome ${}^{52\texttt{M}}$ and OmniGenome ${}^{186\texttt{M}}$ solved $4$ and $5$ puzzles, respectively. In contrast, OmniGenome ${}^{52\texttt{M}}+$ and OmniGenome ${}^{186\texttt{M}}+$ significantly outperformed the other models with $71$ and $74$ puzzles solved, respectively, underscoring the impact of Str2Seq prediction in sequence-structure alignment, which evidently bolsters the models’ ability to predict bases according to the structure context. Besides, we expect performance to increase further with a larger computational budget. These findings provide crucial evidence of the significance of Str2Seq for RNA sequence design.

3.3 Zero-shot Sequence to Structure Prediction Evaluation

This subsection assesses OmniGenome in both Seq2Str and Str2Seq prediction aspects of sequence-structure alignment. The evaluation of Seq2Str is based on zero-shot secondary structure prediction. We use OmniGenome and OmniGenome $+$ without fine-tuning to predict the secondary structures of sequences from the test datasets and measure the macro-F1 score, where better structure prediction performance indicates a stronger capability for Seq2Str prediction. We evaluate Str2Seq mapping ability based on masked nucleotide base prediction. This task leverages the FM’s capacity to predict sequences with the context of structure. We randomly mask one base in the sequences from the test datasets and leverage the Str2Seq ability of OmniGenome $+$ to predict the masked bases in a zero-shot scenario. We assess the model’s accuracy in predicting masked bases both with and without structure context. The experimental results are available in Table 4.
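For reference, the macro-F1 score over the three structure labels can be computed as in the sketch below. Pooling tokens into a single pair of strings and scoring an absent label as $0$ are assumptions; the text does not state whether scores are averaged per sequence or over all tokens.

```python
def macro_f1(true: str, pred: str, labels: str = "().") -> float:
    """Unweighted mean of per-label F1 scores for dot-bracket strings."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(true, pred))
        fp = sum(t != label and p == label for t, p in zip(true, pred))
        fn = sum(t == label and p != label for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a label that never occurs scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every label contributes equally to the average, the score is not dominated by whichever structure symbol happens to be most frequent in a dataset.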

Table 4: Performance in zero-shot RNA secondary structure prediction and masked nucleotide prediction across various benchmarks, showcasing the capabilities of OmniGenome in zero-shot learning scenarios without any fine-tuning or domain adaptation.

	RNA Secondary Structure Prediction				Masked Nucleotide Prediction			
Model	Archive2 (F1)	bpRNA (F1)	RNAStralign (F1)	Rfam (F1)	Archive2 (Acc)	bpRNA (Acc)	RNAStralign (Acc)	Rfam (Acc)
ViennaRNA	$75.89$	$27.82$	$74.80$	$19.98$	—	—	—	—
SpliceBERT	—	—	—	—	$30.04$	$31.40$	$32.10$	$24.50$
OmniGenome ${}^{52\texttt{M}}$	$71.87$	$28.18$	$74.62$	$16.08$	$27.37$	$26.82$	$27.46$	$27.10$
OmniGenome ${}^{186\texttt{M}}$	$72.57$	$28.26$	$75.07$	$18.60$	$29.82$	$30.66$	$28.92$	$27.90$
OmniGenome ${}^{52\texttt{M}}+$	$72.09$	$28.22$	$74.78$	$81.56$	$36.49$	$32.72$	$39.18$	$31.32$
OmniGenome ${}^{186\texttt{M}}+$	$\mathbf{73.01}$	$\mathbf{29.04}$	$\mathbf{75.34}$	$\mathbf{83.24}$	$\mathbf{51.93}$	$\mathbf{48.96}$	$\mathbf{52.86}$	$\mathbf{46.03}$

The results in Table 4 reveal OmniGenome’s effectiveness in both secondary structure prediction (i.e., Seq2Str) and masked nucleotide prediction (i.e., Str2Seq). More specifically, OmniGenome ${}^{52\texttt{M}}$ and OmniGenome ${}^{186\texttt{M}}$ show impressive performance for zero-shot sequence-structure alignment, while OmniGenome ${}^{52\texttt{M}}+$ and OmniGenome ${}^{186\texttt{M}}+$ outperform the base variants of OmniGenome. Even when ablating structure context, OmniGenome ${}^{186\texttt{M}}$ achieves performance comparable with ViennaRNA. Interestingly, if the structures calculated by ViennaRNA are fed into the model, OmniGenome $+$ models can predict the structure more precisely and outperform ViennaRNA on most of the secondary structure prediction datasets, e.g., bpRNA, RNAStralign, and Rfam. From the results of masked nucleotide prediction, it is observed that sequence reconstruction performance can be enhanced by the structure context. OmniGenome ${}^{52\texttt{M}}+$ and OmniGenome ${}^{186\texttt{M}}+$ significantly outperform baseline FMs like SpliceBERT.

Overall, the results in Table 4 provide a comprehensive evaluation of the FMs’ capabilities in sequence-structure alignment. This underscores OmniGenome’s performance and robustness in generalizing across different species without prior fine-tuning, highlighting its potential utility in SN-level in-silico genomic modeling.

3.4 RNA Genomic Benchmark

Table 5: The performance of OmniGenome and baseline models on the RGB, with results averaged based on five random seeds. “N.A.” indicates that ViennaRNA is not designed for other predictive genomic tasks.

Model	mRNA (RMSE)	SNMD (AUC)	SNMR (F1)	Archive2 (F1)	bpRNA (F1)	RNAStralign (F1)
ViennaRNA	N.A.	N.A.	N.A.	75.89	27.82	74.80
DNABERT2	$0.8158$	$49.94$	$15.86$	$59.82$	$43.40$	$65.49$
HyenaDNA	$0.8056$	$53.32$	$39.80$	$84.23$	$56.62$	$95.42$
NT-V2	$0.7826$	$50.49$	$26.01$	$79.90$	$56.60$	$90.84$
Agro-NT	$0.7830$	$49.99$	$26.38$	$70.13$	$48.71$	$75.21$
SpliceBERT	$0.7340$	$58.11$	$46.44$	$89.05$	$69.10$	$96.97$
3UTRBERT	$0.7772$	$50.02$	$24.01$	$78.98$	$56.93$	$92.03$
CDSBERT	$0.7468$	$55.03$	$36.16$	$89.34$	$70.01$	$97.15$
OmniGenome ${}^{52\texttt{M}}$	$0.7191$	$62.44$	$48.91$	$94.98$	$82.34$	$99.01$
OmniGenome ${}^{186\texttt{M}}$	$0.7164$	$63.81$	$49.80$	$95.20$	$82.48$	$99.12$
OmniGenome ${}^{52\texttt{M}}+$	$0.7188$	$63.11$	$49.21$	$95.10$	$81.88$	$98.98$
OmniGenome ${}^{186\texttt{M}}+$	$\mathbf{0.7121}$	$\mathbf{64.13}$	$\mathbf{49.44}$	$\mathbf{95.27}$	$\mathbf{83.36}$	$\mathbf{99.35}$

The results in Table 5 demonstrate the performance of OmniGenome and its generalizability across various fine-grained RNA genomic modeling tasks. It is observed that the base variants of OmniGenome models achieve better results than both RNA and DNA FM baselines, including Agro-NT and CDSBERT, which contain hundreds of millions of parameters. This is because the existing FMs usually adopt k-mers tokenization that cannot handle single nucleotide resolution tasks, e.g., single nucleotide mutation detection and repair. Because of the Seq2Str pretraining, OmniGenome and OmniGenome $+$ models exhibit strong results in secondary structure prediction, underscoring OmniGenome’s capabilities in SN-level RNA sequence understanding and manipulation.

3.5 Plant Genomic Benchmark

Table 6: Performance of OmniGenome and baseline FMs on PGB. “PolyA” stands for Polyadenylation, “Chrom Acc” for Chromatin Accessibility, “Prom Str” for Promoter Strength, “Term Str” for Terminator Strength, “Splice” for Splice Site, “Gene Exp” for Gene Expression, and “Enh Reg” for Enhancer Region. Results for OmniGenome ${}^{186\texttt{M}}+$ are excluded due to the time-intensive nature of the experiments.

Model	PolyA (F1)	LncRNA (F1)	Chrom Acc (F1)	Prom Str (RMSE)	Term Str (RMSE)	Splice (F1)	Gene Exp (RMSE)	Enh Reg (F1)
DNABERT2	$41.35$	$72.55$	$61.49$	$0.99$	$0.24$	$45.34$	$14.78$	$36.40$
HyenaDNA	$83.11$	$58.21$	$52.20$	$0.88$	$0.26$	$90.28$	$14.76$	$66.17$
NT-V2	$71.26$	$73.08$	$65.71$	$0.81$	$0.27$	$95.05$	$14.69$	$73.89$
Agro-NT	$78.89$	$67.24$	$63.27$	$0.94$	$0.78$	$88.45$	$15.56$	$62.83$
SpliceBERT	$65.23$	$71.88$	$63.62$	$0.75$	$0.22$	$96.45$	$14.70$	$69.71$
3UTRBERT	$76.48$	$70.75$	$63.71$	$1.04$	$0.36$	$94.44$	$14.87$	$71.67$
CDSBERT	$39.72$	$33.06$	$48.95$	$2.19$	$0.59$	$52.20$	$14.77$	$33.93$
OmniGenome ${}^{52\texttt{M}}$	$85.47$	$75.71$	$64.23$	$0.67$	$0.21$	$97.40$	$14.76$	$68.31$
OmniGenome ${}^{186\texttt{M}}$	$86.87$	$\mathbf{77.53}$	$\mathbf{66.88}$	$\mathbf{0.65}$	$\mathbf{0.19}$	$\mathbf{98.15}$	$14.76$	$\mathbf{72.45}$
OmniGenome ${}^{52\texttt{M}}+$	$\mathbf{87.05}$	$76.23$	$65.41$	$\mathbf{0.65}$	$0.20$	$97.70$	$14.76$	$70.71$

The PGB is a plant-oriented genomic benchmark. Although the benchmark datasets in PGB are DNA-based tasks, we can still use them to evaluate the performance of OmniGenome and its generalizability across nucleic acid types (i.e., DNA and RNA). The results in Table 6 reveal substantial variability in the performance of different FMs. OmniGenome ${}^{52\texttt{M}}$ outperformed the baseline models on most tasks, particularly Polyadenylation and Splice Site classification, where OmniGenome models achieved the highest F1 scores. On Enhancer Region classification, OmniGenome ${}^{186\texttt{M}}$ is competitive, although NT-V2 obtains the highest F1 score in Table 6. This suggests that OmniGenome’s architecture is adept at handling complex genomic sequences. In comparison, existing FMs, e.g., CDSBERT and Agro-NT, showed lower performance despite having more parameters than OmniGenome. Besides, the performance of OmniGenome ${}^{52\texttt{M}}+$ suggests that the structure context can further enhance the performance of genomic modeling.

Overall, OmniGenome models achieve state-of-the-art performance on both benchmarks, especially for OmniGenome $+$ variants. The results underscore the importance of sequence-structure alignment in achieving complex genomic modeling tasks.

4 Related Works

Biological sequence modeling, covering DNA, RNA, and protein, has attracted attention in recent years. Protein modeling, e.g., AlphaFold [29, 30, 31] and ESM [32], has been studied for longer than DNA and RNA modeling. In the realm of genomic sequence modeling, several early works aimed at addressing diverse genomic downstream subtasks. For instance, DNABERT [33] adapts the architecture of BERT [19] for genomic sequence modeling, showing preliminary performance on in-silico genomic tasks. DNABERT2 [22], a multi-species FM built on DNABERT, replaces k-mers tokenization with BPE tokenization to improve model performance. To explore the performance of large-scale FMs, the Nucleotide Transformers (V1 & V2) [13], Agro-NT [14], and SegmentNT [34] scaled up model size to boost genomic sequence modeling and achieved promising performance in understanding DNA genomes, with the Nucleotide Transformers reaching up to $2.5$ billion parameters and Agro-NT $1$ billion parameters. Agro-NT [14] was pretrained on multi-species edible plant DNA sequences but failed to transfer effectively to RNA sequence modeling in our experiments. To address the modeling capacity problem caused by the remarkable lengths of genomes, there is a growing focus on long-range sequence modeling and on autoregressive FMs, namely HyenaDNA [3] and Evo [2]. In terms of RNA genomic modeling, there have been some preliminary works, such as scBERT [35], RNABERT [36], RNA-FM [26], RNA-MSM [37], and RNAErnie [38]. However, these methods trained the FMs only on limited-scale databases, as RNA sequences are generally expensive to obtain. Some FMs focus on specific types of RNA sequences, such as coding sequences (CDS) [39], 5’ untranslated regions (5’UTR) [40], 3’ untranslated regions (3’UTR) [24], or precursor mRNA sequences [21], which limits their ability to capture the diversity of RNA sequences. Uni-RNA [28] has been reported to achieve good performance thanks to the large scale of its model and database; however, it is not open-sourced and cannot be compared in the experiments.

In short, existing FMs usually neglect the sequence-structure alignment problem in RNA genomic modeling. One exception is 5UTR-LM [16], which adopts secondary structure prediction as a pretraining objective to achieve Seq2Str prediction. However, these FMs do not support Str2Seq mapping and suffer from limited model and data scales, so they fail to uncover the full efficacy of sequence-structure alignment on a wide set of genomic tasks. ERNIE-RNA [41] feeds the RNA structure along with the sequence into the model and improves downstream task performance, but it also overlooks the Str2Seq prediction capability. In summary, none of the existing FMs achieves sequence-structure alignment in both directions.

5 Conclusion

In this paper, we introduced OmniGenome, a foundation model (FM) that addresses the critical challenge of sequence-structure alignment in genomic modeling. Our approach leverages structure-contextualized modeling to achieve accurate predictions for both Seq2Str and Str2Seq tasks. By utilizing the extensive OneKP dataset, we ensured the robustness and transferability of OmniGenome across various genomic tasks. Our experimental results demonstrate that OmniGenome significantly outperforms existing models in zero-shot secondary structure prediction, achieving a macro-F1 score of up to $83\%$, and solves $74\%$ of the complex RNA design puzzles with remarkable efficiency. These findings validate the efficacy of our approach and highlight the potential of OmniGenome in advancing RNA and broader genomic research.

Acknowledgment

This work was supported in part by the UKRI Future Leaders Fellowship under Grant MR/S017062/1 and MR/X011135/1; in part by NSFC under Grant 62376056 and 62076056; in part by the Royal Society under Grant IES/R2/212077; in part by the EPSRC under Grant 2404317; in part by the Kan Tong Po Fellowship (KTP\R1\231017); and in part by the Amazon Research Award and Alan Turing Fellowship.

References

[1] N. Wang, J. Bian, Y. Li, X. Li, S. Mumtaz, L. Kong, and H. Xiong, “Multi-purpose rna language modelling with motif-aware pretraining and type-guided fine-tuning,” Nature Machine Intelligence, pp. 1–10, 2024.
[2] E. Nguyen, M. Poli, M. G. Durrant, A. W. Thomas, B. Kang, J. Sullivan, M. Y. Ng, A. Lewis, A. Patel, A. Lou et al., “Sequence modeling and design from molecular to genome scale with evo,” bioRxiv, pp. 2024–02, 2024.
[3] E. Nguyen, M. Poli, M. Faizi, A. W. Thomas, C. Birch-Sykes, M. Wornow, A. Patel, C. M. Rabideau, S. Massaroli, Y. Bengio, S. Ermon, S. A. Baccus, and C. Ré, “Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution,” CoRR, vol. abs/2306.15794, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.15794
[4] Z. Tan, Y. Fu, G. Sharma, and D. H. Mathews, “Turbofold ii: Rna structural alignment and secondary structure prediction informed by multiple homologs,” Nucleic acids research, vol. 45, no. 20, pp. 11 570–11 581, 2017.
[5] P. Danaee, M. Rouches, M. Wiley, D. Deng, L. Huang, and D. Hendrix, “bprna: large-scale automated annotation and analysis of rna secondary structure,” Nucleic acids research, vol. 46, no. 11, pp. 5381–5394, 2018.
[6] D. H. Mathews, “How to benchmark rna secondary structure prediction accuracy,” Methods, vol. 162, pp. 60–67, 2019.
[7] I. Kalvari, E. P. Nawrocki, N. Ontiveros-Palacios, J. Argasinska, K. Lamkiewicz, M. Marz, S. Griffiths-Jones, C. Toffano-Nioche, D. Gautheret, Z. Weinberg et al., “Rfam 14: expanded coverage of metagenomic, viral and microrna families,” Nucleic Acids Research, vol. 49, no. D1, pp. D192–D200, 2021.
[8] O. Yaish and Y. Orenstein, “Computational modeling of mrna degradation dynamics using deep neural networks,” Bioinformatics, vol. 38, no. 4, pp. 1087–1101, 2022.
[9] H. K. Wayment-Steele, W. Kladwang, A. M. Watkins, D. S. Kim, B. Tunguz, W. Reade, M. Demkin, J. Romano, R. Wellington-Oguri, J. J. Nicol et al., “Deep learning models for predicting rna degradation via dual crowdsourcing,” Nature Machine Intelligence, vol. 4, no. 12, pp. 1174–1184, 2022.
[10] K. S. Corbett, D. K. Edwards, S. R. Leist, O. M. Abiona, S. Boyoglu-Barnum, R. A. Gillespie, S. Himansu, A. Schäfer, C. T. Ziwawo, A. T. DiPiazza et al., “Sars-cov-2 mrna vaccine design enabled by prototype pathogen preparedness,” Nature, vol. 586, no. 7830, pp. 567–571, 2020.
[11] F. Runge, J. K. Franke, D. Fertmann, R. Backofen, and F. Hutter, “Partial rna design,” bioRxiv, 2023. [Online]. Available: https://www.biorxiv.org/content/early/2023/12/30/2023.12.29.573656
[12] Y. Wan, M. Kertesz, R. C. Spitale, E. Segal, and H. Y. Chang, “Understanding the transcriptome through rna structure,” Nature Reviews Genetics, vol. 12, no. 9, pp. 641–655, 2011.
[13] H. Dalla-Torre, L. Gonzalez, J. Mendoza-Revilla, N. L. Carranza, A. H. Grzywaczewski, F. Oteri, C. Dallago, E. Trop, B. P. de Almeida, H. Sirelkhatim et al., “The nucleotide transformer: Building and evaluating robust foundation models for human genomics,” bioRxiv, pp. 2023–01, 2023.
[14] J. Mendoza-Revilla, E. Trop, L. Gonzalez, M. Roller, H. Dalla-Torre, B. P. de Almeida, G. Richard, J. Caton, N. Lopez Carranza, M. Skwark et al., “A foundational large language model for edible plant genomes,” bioRxiv, pp. 2023–10, 2023.
[15] B. Frey, A. Celaj, A. Gao, T. Lau, E. Holgersen, A. Lo, V. Lodaya, C. Cole, R. Denroche, C. Spickett et al., “An rna foundation model enables discovery of disease mechanisms and candidate therapeutics,” bioRxiv, pp. 2023–09, 2023.
[16] Y. Chu, D. Yu, Y. Li, K. Huang, Y. Shen, L. Cong, J. Zhang, and M. Wang, “A 5’ utr language model for decoding untranslated regions of mrna and function predictions,” Nature Machine Intelligence, pp. 1–12, 2024.
[17] J. Lee, W. Kladwang, M. Lee, D. Cantu, M. Azizyan, H. Kim, A. Limpaecher, S. Gaikwad, S. Yoon, A. Treuille et al., “Rna design rules from a massive open laboratory,” Proceedings of the National Academy of Sciences, vol. 111, no. 6, pp. 2122–2127, 2014.
[18] I. Tinoco Jr and C. Bustamante, “How rna folds,” Journal of molecular biology, vol. 293, no. 2, pp. 271–281, 1999.
[19] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT (1). Association for Computational Linguistics, 2019, pp. 4171–4186.
[20] R. Lorenz, S. H. Bernhart, C. Höner zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, and I. L. Hofacker, “Viennarna package 2.0,” Algorithms for molecular biology, vol. 6, pp. 1–14, 2011.
[21] K. Chen, Y. Zhou, M. Ding, Y. Wang, Z. Ren, and Y. Yang, “Self-supervised learning on millions of pre-mrna sequences improves sequence-based rna splicing prediction,” bioRxiv, pp. 2023–01, 2023.
[22] Z. Zhou, Y. Ji, W. Li, P. Dutta, R. V. Davuluri, and H. Liu, “DNABERT-2: efficient foundation model and benchmark for multi-species genome,” CoRR, vol. abs/2306.15006, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.15006
[23] E. J. Carpenter, J. H. Leebens-Mack, and M. S. B. et al., “One thousand plant transcriptomes and the phylogenomics of green plants,” Nature, vol. 574, no. 7780, pp. 679–685, 2019.
[24] Y. Yang, G. Li, K. Pang, W. Cao, X. Li, and Z. Zhang, “Deciphering 3’utr mediated gene regulation using interpretable deep representation learning,” bioRxiv, pp. 2023–09, 2023.
[25] J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024.
[26] J. Chen, Z. Hu, S. Sun, Q. Tan, Y. Wang, Q. Yu, L. Zong, L. Hong, J. Xiao, T. Shen et al., “Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions,” bioRxiv, pp. 2022–08, 2022.
[27] Y. Zhang, M. Lang, J. Jiang, Z. Gao, F. Xu, T. Litfin, K. Chen, J. Singh, X. Huang, G. Song et al., “Multiple sequence alignment-based rna language model and its application to structural inference,” Nucleic Acids Research, vol. 52, no. 1, pp. e3–e3, 2024.
[28] X. Wang, R. Gu, Z. Chen, Y. Li, X. Ji, G. Ke, and H. Wen, “Uni-rna: universal pre-trained models revolutionize rna research,” bioRxiv, pp. 2023–07, 2023.
[29] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli, and D. Hassabis, “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021.
[30] R. Evans, M. O’Neill, A. Pritzel, N. Antropova, A. Senior, T. Green, A. Žídek, R. Bates, S. Blackwell, J. Yim, O. Ronneberger, S. Bodenstein, M. Zielinski, A. Bridgland, A. Potapenko, A. Cowie, K. Tunyasuvunakool, R. Jain, E. Clancy, P. Kohli, J. Jumper, and D. Hassabis, “Protein complex prediction with alphafold-multimer,” bioRxiv, 2021. [Online]. Available: https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034
[31] J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick et al., “Accurate structure prediction of biomolecular interactions with alphafold 3,” Nature, pp. 1–3, 2024.
[32] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, A. dos Santos Costa, M. Fazel-Zarandi, T. Sercu, S. Candido et al., “Language models of protein sequences at the scale of evolution enable accurate structure prediction,” BioRxiv, vol. 2022, p. 500902, 2022.
[33] Y. Ji, Z. Zhou, H. Liu, and R. V. Davuluri, “DNABERT: pre-trained bidirectional encoder representations from transformers model for dna-language in genome,” Bioinform., vol. 37, no. 15, pp. 2112–2120, 2021.
[34] B. P. de Almeida, H. Dalla-Torre, G. Richard, C. Blum, L. Hexemer, M. Gélard, J. Mendoza-Revilla, P. Pandey, S. Laurent, M. Lopez et al., “Segmentnt: annotating the genome at single-nucleotide resolution with dna foundation models,” bioRxiv, pp. 2024–03, 2024.
[35] F. Yang, W. Wang, F. Wang, Y. Fang, D. Tang, J. Huang, H. Lu, and J. Yao, “scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data,” Nat. Mac. Intell., vol. 4, no. 10, pp. 852–866, 2022. [Online]. Available: https://doi.org/10.1038/s42256-022-00534-z
[36] M. Akiyama and Y. Sakakibara, “Informative rna base embedding for rna structural alignment and clustering by deep representation learning,” NAR genomics and bioinformatics, vol. 4, no. 1, p. lqac012, 2022.
[37] Y. Zhang, F. Ge, F. Li, X. Yang, J. Song, and D.-J. Yu, “Prediction of multiple types of rna modifications via biological language model,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2023.
[38] N. Wang, J. Bian, Y. Li, X. Li, S. Mumtaz, L. Kong, and H. Xiong, “Multi-purpose rna language modelling with motif-aware pretraining and type-guided fine-tuning,” Nature Machine Intelligence, pp. 1–10, 2024.
[39] L. Hallee, N. Rafailidis, and J. P. Gleghorn, “cdsbert-extending protein language models with codon awareness,” bioRxiv, 2023.
[40] Y. Chu, D. Yu, Y. Li, K. Huang, Y. Shen, L. Cong, J. Zhang, and M. Wang, “A 5’ utr language model for decoding untranslated regions of mrna and function predictions,” Nature Machine Intelligence, pp. 1–12, 2024.
[41] W. Yin, Z. Zhang, L. He, R. Jiang, S. Zhang, G. Liu, X. Zhang, T. Qin, and Z. Xie, “Ernie-rna: An rna language model with structure-enhanced representations,” bioRxiv, pp. 2024–03, 2024.
[42] C. Peng, J. Shang, J. Guan, D. Wang, and Y. Sun, “Viralm: Empowering virus discovery through the genome foundation model,” bioRxiv, pp. 2024–01, 2024.
[43] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” CoRR, vol. abs/2001.08361, 2020. [Online]. Available: https://arxiv.org/abs/2001.08361
[44] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” CoRR, vol. abs/2203.15556, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2203.15556
[45] N. Muennighoff, A. M. Rush, B. Barak, T. L. Scao, A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel, “Scaling data-constrained language models,” CoRR, vol. abs/2305.16264, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.16264

6 Pretraining Environment

The pretraining of OmniGenome was conducted on a dedicated Linux computation node equipped with $8$ Nvidia RTX $4090$ GPUs. For distributed model training, we employed version $4.37.1$ of the Transformers library alongside version $0.26.1$ of the Accelerate library. OmniGenome is implemented in PyTorch, version $2.0.0$. The ViennaRNA version is $2.6.4$ in our experiments. While some existing code was adapted for the modules within OmniGenome, the majority of the codebase, such as genomic sequence preprocessing, model pretraining, objective functions, and experiments, was written from scratch.
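As a convenience for reproducing this environment, the following minimal sketch compares installed package versions against the versions reported above. The distribution names (`torch`, `transformers`, `accelerate`, `ViennaRNA`) are our assumption about how these libraries are published on PyPI, and the helper names are hypothetical; this is not part of the OmniGenome codebase.

```python
from importlib import metadata

# Versions reported in the text; distribution names are assumed PyPI names.
PINNED = {
    "torch": "2.0.0",
    "transformers": "4.37.1",
    "accelerate": "0.26.1",
    "ViennaRNA": "2.6.4",
}


def base_version(version):
    """Drop a local build suffix such as '+cu118'; pass None through."""
    return None if version is None else version.split("+", 1)[0]


def installed_versions(names):
    """Look up each distribution; missing ones map to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(pinned, installed):
    """Return {name: (wanted, found)} for every package that differs."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pinned.items()
        if base_version(installed.get(name)) != wanted
    }


if __name__ == "__main__":
    # An empty dict means the local environment matches the reported one.
    print(mismatches(PINNED, installed_versions(PINNED)))
```

An empty result indicates that the local environment matches the reported versions; any other result lists the wanted and found version for each differing package.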

7 OmniGenome Package

Genomic modeling is still in its early stages, and resources are consequently very scarce. Typically, existing foundation models open-source only the model, without providing the code for training, fine-tuning, and benchmark evaluation. To address this issue, we have developed a comprehensive open-source genomic modeling toolkit based on OmniGenome (https://github.com/yangheng95/OmniGenomeBench). This toolkit aims to provide extensive FM fine-tuning tutorials and a unified automated benchmark evaluation. The main features of the OmniGenome Package are as follows:

• Fine-Tuning Tutorials: We provide tutorials for fine-tuning on all downstream genomic modeling tasks, including dataset processing, model implementation, and training processes. A fine-tuning example for secondary structure is included, covering both training and demonstration of secondary structure prediction. The fine-tuning tutorials are available at: https://github.com/yangheng95/OmniGenomeBench/blob/master/examples/secondary_structure_prediction_demo.ipynb
• Automated Benchmark Evaluation: We offer an automated benchmark evaluation interface, which includes the built-in PGB and RGB benchmarks. By predefining the configurations for benchmark evaluation subtasks, such as hyperparameters, our tool supports the automated benchmark evaluation of future FMs and the addition of new benchmarks. The goal of automated benchmark evaluation is to ensure fairness and ease of use. We provide a tutorial on automated evaluation to guide users in benchmark evaluation. The automated benchmarking example is available at: https://github.com/yangheng95/OmniGenomeBench/blob/master/examples/benchmark/batch_rgb_benchmark.py
• Genomic Repository Hub: We have created a hub for hosting and distributing open-source licensed datasets, model checkpoints, and benchmark evaluations. Additionally, we have designed flexible interfaces to support the sharing of datasets and models within the community. This approach helps mitigate the issue of resource scarcity. The hub will be available soon.

We are in the process of finalizing the necessary documentation and will officially release this tool in the near future.

8 Evaluation Methods

• ViennaRNA [20]. ViennaRNA is a comprehensive genomic analysis tool that includes a diverse set of interfaces, such as RNAfold for structure prediction and RNAinverse for RNA design. ViennaRNA serves as the baseline for RNA structure prediction and RNA design in our experiments.
• DNABERT2 [22]. DNABERT2 is one of the latest DNA FMs and improves on the performance of DNABERT. The main modification in DNABERT2 is the tokenization method, which was changed from k-mers to BPE.
• HyenaDNA [3]. HyenaDNA is an autoregressive FM optimized for long-range genomic data processing. HyenaDNA is based on the Hyena convolution architecture and is capable of handling sequences up to $1$M bases in length.
• Nucleotide Transformer (NT) V2 [13]. The NT FMs were trained on DNA data, including the human reference genome and multi-species DNA sequences. They aim to capture the complex patterns within nucleotide sequences for various genomic applications.
• Agricultural Nucleotide Transformer (Agro-NT) [14]. Agro-NT is a large-scale DNA FM ($1$B parameters) akin to the Nucleotide Transformers but with a focus on plant DNA.
• SpliceBERT [21]. It was trained on $2$M precursor messenger RNA (pre-mRNA) sequences and specializes in RNA splicing of pre-mRNA sequences.
• 3UTRBERT [24]. This model was trained on $20$k 3’UTRs for 3’UTR-mediated gene regulation tasks. It uses k-mers tokenization instead of single-nucleotide tokenization (SNT).
• CDSBERT [39]. It was trained on massive CDS regions within the genome. CDSBERT is adapted from ProteinBERT for predicting protein structures and functions and leverages the alignment between RNA and protein sequences.
• OmniGenome. OmniGenome is the first genomic FM that highlights the importance of sequence-structure alignment and is available for RNA design tasks.
• OmniGenome $+$. OmniGenome $+$ is an enhanced variant of OmniGenome that feeds both sequences and structures into OmniGenome and aggregates their feature representations to improve modeling ability. A fine-tuning example of OmniGenome $+$ is provided in the supplemental materials.

We are aware that other FMs have also been developed for RNA, such as RNA-FM [26], RNA-MSM [27], Uni-RNA [28], and 5UTR-LM [16]. However, we cannot compare OmniGenome with them because either we were unable to get their source code to work or the models are not publicly available.

9 Genetic Algorithm for RNA Sequence Design

9.1 Genetic Algorithm

Our genetic algorithm, built on OmniGenome $+$, proceeds in the following five steps:

Step $1$. Given the target RNA secondary structure, we use OmniGenome to generate a set of candidate sequences $\mathcal{P}=\{\mathbf{s}^{i}\}_{i=1}^{N}$.
Step $2$. If the termination criterion is not met, go to Step $3$; otherwise, output the current best sequence $\mathbf{s}^{\ast}=\operatorname*{argmax}_{\mathbf{s}\in\mathcal{P}}f(\mathbf{s})$.
Step $3$. Based on $\mathcal{P}$, use single-point crossover and mutation to generate a population of offspring sequences $\mathcal{O}=\{\tilde{\mathbf{s}}^{i}\}_{i=1}^{N}$.
Step $4$. Combine $\mathcal{P}$ and $\mathcal{O}$ to obtain $\mathcal{S}=\mathcal{P}\bigcup\mathcal{O}$, and use OmniGenome to predict the corresponding secondary structures of each sequence in $\mathcal{S}$. Evaluate the fitness values of sequences in $\mathcal{S}$.
Step $5$. Sort $\mathcal{S}$ according to the fitness values and preserve the best $N$ sequences to constitute a new $\mathcal{P}$. Return to Step $2$.

Note that the fitness value of a sequence $\mathbf{s}$, denoted as $f(\mathbf{s})$, is evaluated as the Hamming distance between the RNA secondary structure predicted by OmniGenome and the target structure. The genetic algorithm terminates only when a sequence that folds into the target RNA secondary structure is identified or the allocated computational budget is exhausted.
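The five steps can be illustrated with the following minimal Python sketch. It is not the OmniGenome implementation and rests on several assumptions. The `toy_predict` function is a hypothetical placeholder for the structure predictor (it simply maps G to ‘(’, C to ‘)’, and other bases to ‘.’). The initial population is supplied by the caller rather than generated by OmniGenome. The fitness is taken as the negated Hamming distance so that a larger value is better, in line with the $\operatorname*{argmax}$ in Step $2$. The mutation rate and generation budget are illustrative values.

```python
import random

BASES = "AUCG"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def toy_predict(seq):
    """Hypothetical placeholder for the structure predictor (not OmniGenome)."""
    return "".join({"G": "(", "C": ")"}.get(base, ".") for base in seq)


def design(target, predict, population, max_generations=200,
           mutation_rate=0.1, rng=None):
    """Elitist genetic algorithm following Steps 1-5 (illustrative sketch)."""
    rng = rng or random.Random(0)
    n = len(population)

    def fitness(seq):
        # Assumption: negated Hamming distance, so 0 means the target is matched.
        return -hamming(predict(seq), target)

    population = list(population)                # Step 1: candidates from caller
    for _ in range(max_generations):             # computational budget
        best = max(population, key=fitness)      # Step 2: termination check
        if fitness(best) == 0:
            return best
        offspring = []                           # Step 3: crossover and mutation
        while len(offspring) < n:
            p1, p2 = rng.sample(population, 2)
            cut = rng.randrange(1, len(target))  # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = "".join(
                rng.choice(BASES) if rng.random() < mutation_rate else base
                for base in child
            )
            offspring.append(child)
        combined = population + offspring        # Step 4: merge and evaluate
        combined.sort(key=fitness, reverse=True) # Step 5: keep the best N
        population = combined[:n]
    return max(population, key=fitness)          # budget exhausted


if __name__ == "__main__":
    rng = random.Random(1)
    target = "((((....))))"
    seeds = ["".join(rng.choice(BASES) for _ in target) for _ in range(30)]
    best = design(target, toy_predict, seeds, rng=rng)
    print(best, toy_predict(best))
```

Because Step $5$ always keeps the best $N$ sequences of $\mathcal{P}\bigcup\mathcal{O}$, the best fitness in the population never decreases from one generation to the next, which is why the sketch can return the current best sequence whenever the budget runs out.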

9.2 RNA Design Case Study

As demonstrated in the zero-shot experiments in Table 3, OmniGenome $+$ models achieve state-of-the-art performance. We show several complex examples of puzzles from the EternaV2 benchmark. According to Fig. 6, puzzles #5 and #11, each with more than $200$ bases, are solved, while these puzzles are challenging for existing FMs. Even for puzzles that are not completely solved, e.g., puzzles #3 and #27, OmniGenome ${}^{186\texttt{M}}+$ generates very similar structures, where the nucleotide base difference ratio between the designed structure and the target structure is only $\approx 3\%$. This finding indicates the proficiency of OmniGenome $+$ models in solving challenging single-nucleotide resolution genomic tasks.

10 Benchmark Suites

10.1 RNA Genomic Benchmark

Details of the RGB can be found in Table 7. RGB contains $7$ SN-level tasks that are curated or collected from published articles. The purpose of RGB is to benchmark genomic FMs in challenging SN-level modeling tasks such as detection and repair of SN mutations, mRNA sequence degradation rates, and RNA secondary structure prediction. Due to the lack of a plant RNA benchmark dataset, RGB includes the modeling of RNA sequences from a variety of species, e.g., plant and human. The sequence length in RGB ranges from $107$ to $512$ , which is sufficient for most RNA understanding tasks. In summary, these multi-species and SN-level tasks in RGB serve as the first comprehensive benchmark utilized to assess the RNA sequence modeling capabilities of OmniGenome and its baseline models. The brief introduction of the datasets in RGB is as follows:

• Single-Nucleotide Mutation Detection (SNMD): We developed a plant RNA dataset with synthesized single-nucleotide mutations. Focused on identifying potential single-nucleotide changes, this task is essential for detecting mutations linked to genetic disorders. The SNMD dataset introduces up to $10$ random mutations into each original sequence, regardless of variation ratios. Cross-entropy is used as the loss function for this binary token classification task.
• Single-Nucleotide Mutation Repair (SNMR): This task challenges the model to suggest corrective actions at the single-nucleotide level, aiding gene therapy approaches. The SNMR dataset mirrors the SNMD dataset and is a $4$-way (i.e., A, U, C, G) token classification task with cross-entropy as the loss function.
• mRNA Degradation Rate Prediction (mRNA): Estimating the decay rate of nucleotides in mRNA sequences, this task is vital for deciphering gene expression and regulation. The dataset originates from the Kaggle COVID-19 vaccine design competition (https://www.kaggle.com/competitions/stanford-covid-vaccine); we focus solely on sequence-based degradation rate prediction and exclude RNA structures. It is a token regression task using MSE as the loss function, with the dataset resplit into training, validation, and testing sets for evaluation.
• RNA Secondary Structure Prediction (bpRNA & ArchiveII & RNAStralign & Rfam): Aiming to predict RNA folding into secondary structures, this task is fundamental to RNA functionality and interactions. We evaluated OmniGenome on four datasets: bpRNA [5] (TR0, VL0, TS0 sets), ArchiveII [6], RNAStralign [4], and Rfam [7]. Following existing works, we excluded sequences over $512$ bases and complex structures, simplifying the labels to three symbols: ‘(’, ‘.’, ‘)’. Results may not be directly comparable with other studies due to these modifications. Cross-entropy serves as the loss function.
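To make the relationship between the SNMD and SNMR tasks concrete, the sketch below shows one way such examples could be synthesized from an original sequence. It is an illustration under stated assumptions, not the script used to build the datasets: the function name `make_snm_example` is hypothetical, the number of mutations is drawn uniformly between $1$ and $10$, and each mutated position is replaced by a different base chosen uniformly at random.

```python
import random

BASES = "AUCG"


def make_snm_example(seq, max_mutations=10, rng=None):
    """Return (mutated sequence, SNMD labels, SNMR labels) for one sequence.

    SNMD labels mark mutated positions with 1 (binary token classification).
    SNMR labels are the original bases (4-way token classification).
    """
    rng = rng or random.Random()
    # Assumption: at least one and at most `max_mutations` positions change.
    k = rng.randint(1, min(max_mutations, len(seq)))
    positions = rng.sample(range(len(seq)), k)
    mutated = list(seq)
    for pos in positions:
        # Pick a base that differs from the original one at this position.
        mutated[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    detect_labels = [int(m != o) for m, o in zip(mutated, seq)]
    repair_labels = list(seq)
    return "".join(mutated), detect_labels, repair_labels


if __name__ == "__main__":
    mutated, detect, repair = make_snm_example("AUGC" * 5, rng=random.Random(0))
    print(mutated)
    print(detect)
    print("".join(repair))
```

The same mutated sequence serves as the input of both tasks: SNMD asks where the sequence was changed, while SNMR asks which base should be restored at each position.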

Table 7: The brief statistics of subtasks in the RGB. These benchmark datasets are held out or not included in the pretraining database. The numbers of examples in training, validation and testing sets are separated by “/”. $^{\ast}$ indicates that the dataset is used for zero-shot performance evaluation only.

Task	Task Type	# of examples	# of classes	Metric	Sequence length	Source
SNMD	Token classification	$8,000/1,000/1,000$	$2$	AUC	$200$	This work
SNMR	Token classification	$8,000/1,000/1,000$	$4$	F1	$200$	This work
mRNA	Token regression	$1,735/193/192$	—	RMSE	$107$	Kaggle
bpRNA	Token classification	$10,814/1,300/1,305$	$3$	F1	$\leq 512$	[5]
ArchiveII	Token classification	$2,278/285/285$	$3$	F1	$\leq 500$	[6]
RNAStrAlign	Token classification	$17,483/2,186/2,185$	$3$	F1	$\leq 500$	[4]
Rfam$^{\ast}$	Token classification	$501,376/62,672/62,672$	$3$	F1	$\leq 512$	[7]

The detailed task descriptions, including the task type, number of examples, number of classes, evaluation metric, and sequence length, are outlined in Table 7. Each task is carefully curated to reflect the complexity and variety inherent in genomic data, providing a robust framework for assessing the nuanced capabilities of state-of-the-art RNA FMs.

Table 8 shows virtual (illustrative) input and output examples for the datasets in RGB. Please refer to our supplementary materials for more details on the datasets.

Table 8: The virtual input and output examples in RGB. The “$\dots$” represents the sequences that are omitted for better presentation and the red color indicates the wrong prediction in classification tasks. In the mRNA dataset, all single nucleotides have three values to predict. Note that “T” and “U” can be regarded as the same symbol in RNA sequences and depend on different datasets.

Genome Type	Dataset		Examples
RNA	SNMD	Input Sequence	G A G T A $\dots$ T T G A G
		True Label	0 0 1 0 0 $\dots$ 0 0 1 0 0
		Prediction	0 0 0 0 0 $\dots$ 0 0 1 0 0
	SNMR	Input Sequence	T A C G A $\dots$ C T G A T
		True Label	T A C A A $\dots$ G T A A T
		Prediction	T A C A A $\dots$ C T G A T
	mRNA	Input Sequence	G G $\dots$ A C
		True Label	[0.1,0.3,0.2] [0.8,0.4,0.1] $\dots$ [0.9,0.4,0.3] [0.5,0.2,0.6]
		Prediction	[0.1,0.3,0.2] [0.8,0.4,0.1] $\dots$ [0.9,0.4,0.3] [0.5,0.2,0.6]
	bpRNA	Input Sequence	G G C G A $\dots$ C U U U U
		True Label	( ( ( $\cdot$ $\cdot$ $\dots$ $\cdot$ $\cdot$ ) ) )
		Prediction	( ( ( ( $\cdot$ $\dots$ $\cdot$ ) ) ) )
	Archive2	Input Sequence	A G U A G $\dots$ U U U G C U
		True Label	( ( ( $\cdot$ $\cdot$ $\dots$ $\cdot$ $\cdot$ ) ) )
		Prediction	( ( ( $\cdot$ $\cdot$ $\dots$ $\cdot$ $\cdot$ ) ) )
	RNAStralign	Input Sequence	A G U A G $\dots$ U U U G C U
		True Label	( ( ( $\cdot$ $\cdot$ $\dots$ $\cdot$ $\cdot$ ) ) )
		Prediction	( ( ( $\cdot$ $\cdot$ $\dots$ $\cdot$ $\cdot$ ) ) )
	Rfam	Input Sequence	A G U A G $\dots$ U U U G C U
		True Label	( ( ( $\cdot$ $\cdot$ $\dots$ $\cdot$ $\cdot$ ) ) )
		Prediction	( ( ( $\cdot$ $\cdot$ $\dots$ $\cdot$ $\cdot$ ) ) )
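For the token classification examples above, a token-level macro-F1 score can be computed as sketched below. This is a minimal illustration rather than the evaluation code of the benchmark: we assume that F1 is computed per symbol over aligned positions and then averaged over the three symbols, and that a symbol with no true or predicted occurrence scores $0$. The example strings are made up for illustration and are not taken from the datasets.

```python
def macro_f1(true, pred, labels="(.)"):
    """Token-level macro-F1 over the given label symbols."""
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(true, pred))
        fp = sum(t != lab and p == lab for t, p in zip(true, pred))
        fn = sum(t == lab and p != lab for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        # Assumption: a label that never occurs contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    true_structure = "(((..)))"
    predicted = "((((.)))"  # one unpaired base wrongly predicted as '('
    print(round(macro_f1(true_structure, predicted), 4))
```

In this example the per-symbol F1 values are $6/7$ for ‘(’, $2/3$ for ‘.’, and $1$ for ‘)’, so a single wrong token lowers the macro-F1 more than it lowers plain accuracy, because the rare ‘.’ class is weighted equally with the others.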

10.2 Plant Genomic Benchmark

The Plant Genomic Benchmark [14] (PGB) provides a comprehensive suite of datasets designed to evaluate and improve the predictive capabilities of genomic models in plant biology. This benchmark, as shown in Table 9, encompasses a range of critical genomic tasks (https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b), including binary classification, single and multi-variable regression, and multi-label classification, addressing various aspects of plant genomics such as RNA processing, gene expression, and chromatin accessibility. By integrating diverse genomic tasks, the PGB aims to facilitate advanced research and development in plant genomics, offering a robust platform for the assessment and enhancement of model performance across different plant species. To obtain a detailed description of PGB, please refer to Agro-NT [14].

Table 9: The genomic tasks in the Plant Genomic Benchmark. This table briefly enumerates each task by name, the number of datasets available, the type of classification or regression analysis required, the range of sequence lengths, and the total number of samples in each dataset. Please find the dataset details of PGB in Agro-NT.

Task	# of datasets	Task Type	Total # of examples	# of classes	Metric	Sequence length
Polyadenylation	$6$	Sequence classification	$738,918$	$2$	F1	$400$
Splice site	$2$	Sequence classification	$4,920,835$	$2$	F1	$398$
LncRNA	$2$	Sequence classification	$58,062$	$6$	F1	$101$–$6,000$
Promoter strength	$2$	Sequence regression	$147,966$	—	RMSE	$170$
Terminator strength	$2$	Sequence regression	$106,818$	—	RMSE	$170$
Chromatin accessibility	$7$	Multi-label classification	$5,149,696$	$9$–$19$	F1	$1,000$
Gene expression	$6$	Multi-variable regression	$206,358$	—	RMSE	$6,000$
Enhancer region	$1$	Sequence classification	$18,893$	$2$	F1	$1,000$

11 OneKP Initiative

There has been a variety of FMs utilized in different species, e.g., humans [3, 13], bacteria [2], and viruses [42], which indicates the effectiveness of pretrained FMs on multi-species genomics. In this work, we aim to propose an FM for multi-species plant RNA sequence modeling. We leverage the OneKP initiative [23] to address the scarcity of plant RNA data, which contains $1,124$ species of plant transcriptomes. The scale of OneKP enables the development of a more robust and transferable RNA FM.

The $1000$ Plant Transcriptomes Initiative (OneKP) was a comprehensive effort aimed at exploring genetic diversity across the green plant kingdom (Viridiplantae), sequencing the RNA from $1124$ ( $1342$ in other versions) samples that represent over $1000$ species, encompassing all major taxa within Viridiplantae. This includes streptophyte and chlorophyte green algae, bryophytes, ferns, angiosperms, and gymnosperms. The initiative’s final or capstone publication presents three major analyses: inferring species trees, identifying whole genome duplications, and detecting gene family expansions. These findings are particularly valuable for plant and evolutionary scientists interested in specific gene families, whether their focus is across the entire green plant tree of life or within more narrowly defined lineages.

The sampling strategy for OneKP was global and collaborative, with samples sourced from a wide range of environments including wild field collections, greenhouses, botanical gardens, laboratory specimens, and algal culture collections. The initiative prioritized the collection of live growing cells, such as young leaves, flowers, or shoots, to ensure a high abundance of expressed genes, though many samples also came from roots and other tissues. RNA extraction was performed using well-established protocols or commercial kits, facilitating the comprehensive analysis of transcribed RNA across this diverse set of species. This monumental effort not only sheds light on plant genetic diversity but also provides a rich data resource for ongoing and future research in plant science and evolutionary biology.

12 Limitations

The limitations of our work are rooted in resource constraints. First, in light of the scaling laws for language models [43, 44, 45], we recognize that the scale of the proposed RNA foundation model remains small, and its parameter count may not be adequate to fully exploit the potential of the OneKP database. Due to resource limitations, we could not pretrain larger models. Moving forward, we aim to train larger-scale foundation models, anticipating that our contributions will accelerate the advancement of DNA and RNA foundation models. Second, while the modeling length of our foundation model satisfies most RNA and DNA sequence processing needs, there are tasks with extremely long sequences. Our future efforts will focus on enhancing the model’s capability to handle such lengthy sequences.

13 Ethics Statement

In this research, we utilized the open OneKP dataset, which does not raise human-related privacy concerns. We ensure that such data is not exploited without fair compensation and acknowledgment of the source communities. Although the pretraining sequences are plant-based genomic data, misuse of the model could pose potential harm to ecological systems; we therefore do not permit the use of our model beyond its intended purposes, such as developing malicious bio-software or designing harmful RNA structures. The models and findings should support, not undermine, the conservation of plant species and their habitats. We adhere to principles of transparency and open science, using datasets that are publicly available and providing clear documentation of our methodologies and findings.

Overall, in conducting this research, we have committed to ethical scientific practices that respect biodiversity and aim to contribute positively to the field of genomic research. We encourage ongoing dialogue around the ethical use of plant RNA sequences and support initiatives that promote the sharing of benefits arising from such research with all stakeholders.