Evaluation of strategies for evidence-driven genome annotation using long-read RNA-seq

Alejandro Paniagua; Cristina Agustin-García; Francisco J Pardo-Palacios; Thomas Brown; Maite De Maria; Nancy D Denslow; Camila Mazzoni; Ana Conesa

doi:10.1101/gr.279864.124

Genome Res. 2024 Dec 23:gr.279864.124. doi: 10.1101/gr.279864.124. Online ahead of print.

Authors

Alejandro Paniagua¹, Cristina Agustin-García², Francisco J Pardo-Palacios², Thomas Brown³, Maite De Maria⁴, Nancy D Denslow⁴, Camila Mazzoni⁵, Ana Conesa⁶

Affiliations

¹ Institute for Integrative Systems Biology, Spanish National Research Council, Universitat de València.
² Institute for Integrative Systems Biology, Spanish National Research Council.
³ Leibniz Institute for Zoo and Wildlife Research, Berlin Center for Genomics in Biodiversity Research.
⁴ University of Florida.
⁵ Leibniz Institute for Zoo and Wildlife Research, Berlin Center for Genomics in Biodiversity Research.
⁶ Institute for Integrative Systems Biology, Spanish National Research Council.

PMID: 39715684

Abstract

While long-read sequencing has made producing a draft genome more accessible, the annotation of these new genomes has not advanced at the same pace. Long-read RNA sequencing (lrRNA-seq) offers a promising solution for enhancing gene annotation. In this study, we explore how the sequencing platform (Oxford Nanopore R9.4.1 chemistry or PacBio Sequel II CCS) and the data processing methods influence evidence-driven genome annotation using long reads. Incorporating PacBio transcripts into our annotation pipeline significantly outperformed traditional methods, such as ab initio predictions and short-read-based annotations. We applied this strategy to a nonmodel species, the Florida manatee, and compared our results to the existing short-read-based annotation. At the locus level, the two annotations were highly concordant, with 90% agreement. At the transcript level, however, the agreement was only 35%. We identified 4,906 novel loci, represented by 5,707 isoforms, with 64% of these isoforms matching known sequences in other mammalian species. Overall, our findings underscore the importance of using high-quality curated transcript models in combination with ab initio methods for effective genome annotation.

Published by Cold Spring Harbor Laboratory Press.