D-sORF: Accurate Ab Initio Classification of Experimentally Detected Small Open Reading Frames (sORFs) Associated with Translational Machinery

Biology (Basel). 2024 Jul 26;13(8):563. doi: 10.3390/biology13080563.

Abstract

Small open reading frames (sORFs; <300 nucleotides or <100 amino acids) are widespread across all genomes, and an increasing variety of them appear to be translated from non-genic regions. Over the past few decades, peptides produced from sORFs have been shown to be functional in various organisms, from bacteria to humans. Despite recent advances in next-generation sequencing and proteomics, accurate annotation and classification of sORFs remain a rate-limiting step toward reliable, high-throughput detection of small proteins from non-genic regions. Computational methods based on machine learning cost less than biological experiments and can be used to detect sORFs, laying the groundwork for subsequent biological experiments. We present D-sORF, a machine-learning framework that integrates the statistical nucleotide context and motif information around the start codon to predict coding sORFs. D-sORF scores directly for coding identity and requires only the underlying genomic sequence, without incorporating parameters such as conservation, which, in the case of sORFs, may increase the dispersion of scores within the significantly less conserved non-genic regions. D-sORF achieves 94.74% precision and 92.37% accuracy for small ORFs (using the 99 nt medium-length window). When D-sORF is applied to sORFs associated with ribosomes, its identification of peptide-producing transcripts (annotated by Ensembl IDs) is similar or superior to that of experimental methodologies based on ribosome-sequencing (Ribo-Seq) profiling. In parallel, the recognition of putative negative data, such as intron-containing transcripts that associate with ribosomes, remains remarkably low, indicating that D-sORF could be efficiently applied to filter out false-positive sORFs that arise in Ribo-Seq data from non-productive ribosomal binding or from noise inherent in these protocols.
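To make the sORF definition and the start-codon context concrete, the following Python sketch enumerates candidate sORFs under the <300 nt threshold and extracts a fixed-length window around each start codon. It is an illustration under stated assumptions, not the D-sORF implementation: the function names `find_sorfs` and `start_context` are hypothetical, only ATG starts on the forward strand are considered, ORF length is counted with the stop codon included, and centring the 99 nt window on the start codon is an assumption that the abstract does not specify.

```python
# Illustrative sketch only; not the D-sORF code.
START = "ATG"                      # assumption: canonical start codon only
STOPS = {"TAA", "TAG", "TGA"}      # standard genetic code stop codons
MAX_SORF_NT = 300                  # abstract's definition: sORFs are <300 nt


def find_sorfs(seq, max_nt=MAX_SORF_NT):
    """Yield (start, end) for forward-strand ORFs shorter than max_nt.

    `end` is exclusive and includes the stop codon (an assumption about
    how length is measured). The first ATG in each frame opens an ORF,
    and the next in-frame stop codon closes it.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                end = i + 3
                if end - start < max_nt:
                    yield (start, end)
                start = None       # look for the next ORF in this frame


def start_context(seq, start, window=99):
    """Return `window` nt centred on the start codon at `start`.

    Returns None when the window would run past either end of the
    sequence. Centring on the ATG is a hypothetical choice.
    """
    flank = (window - 3) // 2      # 48 nt on each side for a 99 nt window
    lo, hi = start - flank, start + 3 + flank
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi].upper()
```

Each extracted window could then be passed to a sequence classifier; the features and model that D-sORF itself uses are not described in the abstract and are not reproduced here.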

Keywords: genomic annotation; machine learning; motif prediction; ribosome sequencing; sORF; small open reading frames.

Grants and funding

This study was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 1st Call for HFRI Research Projects to Support Faculty Members and Researchers and Procure High-Value Research Equipment grant (Project Number 2563). Additionally, it was funded by the European Union - NextGenerationEU through Greece 2.0 - National Recovery and Resilience Plan, under the call “Flagship actions in interdisciplinary scientific fields with a special focus on the productive fabric” (ID 16618), project name “Bridging big omic, genetic and medical data for Precision Medicine implementation in Greece” (project code TAEDR-0539180).