PRONAME: a user-friendly pipeline to process long-read nanopore metabarcoding data by generating high-quality consensus sequences

Benjamin Dubois; Mathieu Delitte; Salomé Lengrand; Claude Bragard; Anne Legrève; Frédéric Debode

doi:10.3389/fbinf.2024.1483255

Front Bioinform. 2024 Dec 20:4:1483255. doi: 10.3389/fbinf.2024.1483255. eCollection 2024.

Authors

Benjamin Dubois¹, Mathieu Delitte², Salomé Lengrand², Claude Bragard², Anne Legrève², Frédéric Debode¹

Affiliations

¹ Bioengineering Unit, Life Sciences Department, Walloon Agricultural Research Centre, Gembloux, Belgium.
² Earth and Life Institute - Applied Microbiology, Plant Health, UCLouvain, Louvain-la-Neuve, Belgium.

Abstract

Background: The study of sample taxonomic composition has evolved from direct observation and labor-intensive morphological studies to a range of DNA sequencing methodologies. Most of these studies rely on metabarcoding, which involves amplifying a small, taxonomically informative portion of the genome and sequencing it at high throughput. Recent advances in sequencing technology from Oxford Nanopore Technologies have revolutionized the field by offering portability, affordable cost and long-read sequencing, thereby significantly increasing taxonomic resolution. However, Nanopore sequencing data have a distinctive profile, with a higher error rate than Illumina sequencing, and existing bioinformatics pipelines for such data are scarce and often insufficient; accurate processing of long-read sequences requires specialized tools.

Results: We present PRONAME (PROcessing NAnopore MEtabarcoding data), an open-source, user-friendly pipeline optimized for processing raw Nanopore sequencing data. PRONAME includes precompiled databases for complete 16S sequences (Silva138 and Greengenes2) and a newly developed and curated database dedicated to bacterial 16S-ITS-23S operon sequences. The user can also provide a custom database, thereby enabling the analysis of metabarcoding data for any domain of life. The pipeline significantly improves sequence accuracy by implementing innovative error-correction strategies and by taking advantage of the new sequencing chemistry to produce high-quality duplex reads. Evaluations using a mock community have shown that PRONAME delivers consensus sequences with at least 99.5% accuracy under standard settings (and up to 99.7%), making it a robust tool for genomic analysis of complex multi-species communities.
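
To give a sense of scale for the accuracy figures above, the following minimal Python sketch computes a percent accuracy between a consensus sequence and a reference. It rests on assumptions: it defines accuracy as 100 × (1 − edit distance / reference length), which is one common convention and not necessarily the metric used by the authors, and the function names and toy sequences are hypothetical rather than part of PRONAME. Under this definition, 99.5% accuracy over a 16S sequence of roughly 1,500 bases corresponds to about 7 or 8 errors.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def percent_accuracy(consensus: str, reference: str) -> float:
    """Assumed definition: 100 * (1 - edits / reference length)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * (1.0 - edit_distance(consensus, reference) / len(reference))


# Toy data (hypothetical): a 200-base reference and a consensus with one substitution.
reference = "ACGTACGTAC" * 20
consensus = reference[:100] + "T" + reference[101:]

print(f"{percent_accuracy(consensus, reference):.1f}% accuracy")
```

A single error in 200 bases already lowers the accuracy to about 99.5%, which illustrates how few residual errors the reported consensus sequences contain.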

Conclusion: PRONAME meets the challenges of long-read Nanopore data processing, offering greater accuracy and versatility than existing pipelines. By integrating Nanopore-specific quality filtering, clustering and error correction, PRONAME produces high-precision consensus sequences. This brings the accuracy of Nanopore sequencing close to that of Illumina sequencing, while taking advantage of the benefits of long-read technologies.

Keywords: accuracy; clustering; database; duplex reads; long-read high-throughput sequencing; microbiome; polishing; ribosomal operon.

Associated data

figshare/10.6084/m9.figshare.26380702

Grants and funding

This research was funded by the Belgian Walloon Region ANTAGONIST project (Grant Number D65-1417).