RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes

Daniel H Haft; Azat Badretdin; George Coulouris; Michael DiCuccio; A Scott Durkin; Eric Jovenitti; Wenjun Li; Megdelawit Mersha; Kathleen R O'Neill; Joel Virothaisakun; Françoise Thibaud-Nissen

doi:10.1093/nar/gkad988

Nucleic Acids Res. 2024 Jan 5;52(D1):D762-D769. doi: 10.1093/nar/gkad988.

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase over the past 3 years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.

Published by Oxford University Press on behalf of Nucleic Acids Research 2023.

MeSH terms

Archaea* / genetics
Bacteria* / genetics
Databases, Nucleic Acid* / standards
Databases, Nucleic Acid* / trends
Genome, Archaeal / genetics
Genome, Bacterial / genetics
Internet
Metagenome*
Molecular Sequence Annotation
Proteins / genetics

Substances

Proteins
