SeqScrub: a web tool for automatic cleaning and annotation of FASTA file headers for bioinformatic applications

Biotechniques. 2019 Aug;67(2):50-54. doi: 10.2144/btn-2018-0188. Epub 2019 Jun 20.

Abstract

Data consistency is necessary for effective bioinformatic analysis. SeqScrub is a web tool that parses and maintains consistent information about protein and DNA sequences in FASTA file format, checks if records are current, and adds taxonomic information by matching identifiers against entries in authoritative biological sequence databases. SeqScrub provides a powerful yet simple workflow for managing, enriching and exchanging data, which is crucial for establishing a record of provenance for sequences retrieved from broad and varied searches; for example, using BLAST on continually updated genome sequence sets. Headers standardized using SeqScrub can be parsed by a majority of bioinformatic tools, stay uniformly named between collaborators and contain informative labels to aid management of reproducible scientific data. SeqScrub is available at http://bioinf.scmb.uq.edu.au/seqscrub.

Keywords: ancestral sequence reconstruction; data consistency; data curation; data sanitization; taxonomic annotation; web application.
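
The kind of header cleaning the abstract describes can be illustrated with a minimal Python sketch. This is not SeqScrub's implementation: the function names (`extract_identifier`, `clean_header`, `clean_fasta`), the regular expressions, and the `identifier|Taxon_name` output layout are all assumptions made for illustration. The database lookup is replaced by a plain dictionary (`taxonomy_lookup`) so the example runs offline; the real tool matches identifiers against sequence databases instead.

```python
import re

# Assumed patterns, for illustration only (not SeqScrub's actual rules).
# UniProt-style headers look like "sp|P12345|NAME description".
UNIPROT_RE = re.compile(r"^(?:sp|tr)\|([A-Z0-9]+)\|")
# Characters that many downstream tools handle poorly in headers.
ILLEGAL_RE = re.compile(r"[^A-Za-z0-9_.|-]+")


def extract_identifier(header):
    """Return the database identifier from a raw FASTA header line."""
    text = header.lstrip(">").strip()
    match = UNIPROT_RE.match(text)
    if match:
        return match.group(1)
    # Otherwise assume the identifier is the first whitespace-delimited token.
    return text.split()[0] if text else ""


def clean_header(header, taxonomy=None):
    """Rebuild a header as '>identifier' or '>identifier|Taxon_name'."""
    fields = [extract_identifier(header)]
    if taxonomy:
        # Replace runs of problematic characters with a single underscore.
        fields.append(ILLEGAL_RE.sub("_", taxonomy).strip("_"))
    return ">" + "|".join(fields)


def clean_fasta(text, taxonomy_lookup=None):
    """Clean every header in a FASTA string; sequence lines pass through."""
    taxonomy_lookup = taxonomy_lookup or {}  # hypothetical stand-in for a database query
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            ident = extract_identifier(line)
            out.append(clean_header(line, taxonomy_lookup.get(ident)))
        elif line.strip():
            out.append(line.strip())
    return "\n".join(out) + "\n"
```

For example, calling `clean_fasta(raw, {"P12345": "Oryctolagus cuniculus"})` on a record whose header starts with `>sp|P12345|` would rewrite that header as `>P12345|Oryctolagus_cuniculus`, while records without a lookup entry keep only their identifier. Keeping the identifier as the first field is a design choice that lets collaborators trace each sequence back to its source record.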

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Computational Biology / methods*
  • Data Curation / methods*
  • Databases, Genetic*
  • Humans
  • Internet
  • Phylogeny
  • Sequence Analysis / methods
  • Software*