NCBI RefSeq: reference sequence standards through 25 years of curation and annotation

Tamara Goldfarb; Vamsi K Kodali; Shashikant Pujar; Vyacheslav Brover; Barbara Robbertse; Catherine M Farrell; Dong-Ha Oh; Alexander Astashyn; Olga Ermolaeva; Diana Haddad; Wratko Hlavina; Jinna Hoffman; John D Jackson; Vinita S Joardar; David Kristensen; Patrick Masterson; Kelly M McGarvey; Richard McVeigh; Eyal Mozes; Michael R Murphy; Susan S Schafer; Alexander Souvorov; Brett Spurrier; Pooja K Strope; Hanzhen Sun; Anjana R Vatsan; Craig Wallin; David Webb; J Rodney Brister; Eneida Hatcher; Avi Kimchi; William Klimke; Aron Marchler-Bauer; Kim D Pruitt; Françoise Thibaud-Nissen; Terence D Murphy

doi:10.1093/nar/gkae1038

Nucleic Acids Res. 2024 Nov 11:gkae1038. doi: 10.1093/nar/gkae1038. Online ahead of print.

Authors

Tamara Goldfarb¹, Vamsi K Kodali¹, Shashikant Pujar¹, Vyacheslav Brover¹, Barbara Robbertse¹, Catherine M Farrell¹,², Dong-Ha Oh¹, Alexander Astashyn¹, Olga Ermolaeva¹, Diana Haddad¹, Wratko Hlavina¹, Jinna Hoffman¹, John D Jackson¹, Vinita S Joardar¹, David Kristensen¹, Patrick Masterson¹, Kelly M McGarvey¹, Richard McVeigh¹, Eyal Mozes¹, Michael R Murphy¹, Susan S Schafer¹, Alexander Souvorov¹, Brett Spurrier¹, Pooja K Strope¹, Hanzhen Sun¹, Anjana R Vatsan¹, Craig Wallin¹, David Webb¹, J Rodney Brister¹, Eneida Hatcher¹, Avi Kimchi¹, William Klimke¹, Aron Marchler-Bauer¹, Kim D Pruitt¹, Françoise Thibaud-Nissen¹, Terence D Murphy¹

Affiliations

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20894, USA.
² Division of Extramural Programs, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.

PMID: 39526381
DOI: 10.1093/nar/gkae1038

Abstract

Reference sequences and annotations serve as the foundation for many lines of research today, from organism and sequence identification to providing a core description of the genes, transcripts and proteins found in an organism's genome. Interpretation of data including transcriptomics, proteomics, sequence variation and comparative analyses based on reference gene annotations informs our understanding of gene function and possible disease mechanisms, leading to new biomedical discoveries. The Reference Sequence (RefSeq) resource created at the National Center for Biotechnology Information (NCBI) leverages both automatic processes and expert curation to create a robust set of reference sequences of genomic, transcript and protein data spanning the tree of life. RefSeq continues to refine its annotation and quality control processes and utilize better quality genomes resulting from advances in sequencing technologies as well as RNA-Seq data to produce high-quality annotated genomes, ortholog predictions across more organisms and other products that are easily accessible through multiple NCBI resources. This report summarizes the current status of the eukaryotic, prokaryotic and viral RefSeq resources, with a focus on eukaryotic annotation, the increase in taxonomic representation and the effect it will have on comparative genomics. The RefSeq resource is publicly accessible at https://www.ncbi.nlm.nih.gov/refseq.

Published by Oxford University Press on behalf of Nucleic Acids Research 2024.
