The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools

Andreas Wilke; Travis Harrison; Jared Wilkening; Dawn Field; Elizabeth M Glass; Nikos Kyrpides; Konstantinos Mavrommatis; Folker Meyer

doi:10.1186/1471-2105-13-141

BMC Bioinformatics. 2012 Jun 21;13:141. doi: 10.1186/1471-2105-13-141.

Authors

Andreas Wilke¹, Travis Harrison, Jared Wilkening, Dawn Field, Elizabeth M Glass, Nikos Kyrpides, Konstantinos Mavrommatis, Folker Meyer

Affiliation

¹ Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA.

Abstract

Background: Computing sequence similarity results is becoming a limiting factor in metagenome analysis. Sequence similarity search results encoded in an open, exchangeable format have the potential to reduce the need for computational reanalysis of these data sets. A prerequisite for sharing similarity results is a common reference.

Description: We introduce a mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource. In addition, we present tools for translating similarity searches into many annotation namespaces, e.g. KEGG or NCBI's GenBank.
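The abstract does not describe the implementation, so the following is only a minimal sketch of the general idea: collapse identical protein sequences from several sources into one entry, then translate a single set of similarity hits into different annotation namespaces. Keying sequences by an MD5 checksum is an assumption made for illustration, and all function names, accessions, and sequences below are hypothetical.

```python
import hashlib
from collections import defaultdict


def sequence_key(sequence: str) -> str:
    """Return a checksum for a protein sequence.

    Assumption: sequences are keyed by the MD5 of their upper-cased residues.
    """
    normalized = sequence.strip().upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_nonredundant(records):
    """Collapse (source, accession, sequence) records to one entry per sequence.

    Returns (sequences, annotations):
      sequences:   key -> sequence
      annotations: key -> {source: [accession, ...]}
    """
    sequences = {}
    annotations = defaultdict(lambda: defaultdict(list))
    for source, accession, sequence in records:
        key = sequence_key(sequence)
        sequences.setdefault(key, sequence.strip().upper())
        annotations[key][source].append(accession)
    return sequences, annotations


def translate_hits(hits, annotations, namespace):
    """Map similarity hits (query, key, score) to accessions in one namespace."""
    translated = []
    for query, key, score in hits:
        for accession in annotations.get(key, {}).get(namespace, []):
            translated.append((query, accession, score))
    return translated


# Hypothetical input: the same protein appears in two sources.
records = [
    ("KEGG", "kegg:example1", "MKTAYIAKQR"),
    ("GenBank", "gb:EXAMPLE1", "mktayiakqr"),
    ("GenBank", "gb:EXAMPLE2", "MSLLTEVETY"),
]
sequences, annotations = build_nonredundant(records)

# One similarity search result, reused for two namespaces.
hits = [("read_1", sequence_key("MKTAYIAKQR"), 52.0)]
kegg_hits = translate_hits(hits, annotations, "KEGG")
genbank_hits = translate_hits(hits, annotations, "GenBank")
```

In this sketch the similarity search is run once against the non-redundant sequences, and each namespace view is then a cheap lookup rather than a new computation.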

Conclusions: The data and tools we present allow the creation of multiple result sets using a single computation, permitting computational results to be shared between groups for large sequence data sets.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Computational Biology
Databases, Nucleic Acid
Databases, Protein*
Metagenomics
Proteins / chemistry
Proteins / genetics
Software*

Substances

Proteins