Dictionary-driven protein annotation

Nucleic Acids Res. 2002 Sep 1;30(17):3901-16. doi: 10.1093/nar/gkf464.

Abstract

Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes.
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
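
As a rough illustration of the general idea described in the abstract (patterns from a dictionary are matched against a query, and each match contributes a weighted vote to the positions it covers), the following minimal Python sketch may help. It is not the authors' implementation. The toy dictionary, its labels, its weights, the use of regular-expression syntax for patterns, and the function names annotate and best_labels are all assumptions made for illustration; in particular, the fixed weights shown here do not reproduce the paper's scheme for correcting database over-representation.

```python
import re
from collections import defaultdict

# Hypothetical toy dictionary: pattern -> (annotation label, weight).
# Patterns are written as regular expressions purely for illustration.
TOY_DICTIONARY = {
    "G.GK[ST]": ("binding", 1.0),
    "KST": ("modified-site", 0.4),
    "LL": ("helix", 0.5),
}


def annotate(query, dictionary):
    """Return one dict per query position mapping label -> accumulated weight."""
    scores = [defaultdict(float) for _ in query]
    for pattern, (label, weight) in dictionary.items():
        # A lookahead wrapper lets overlapping occurrences be reported.
        for match in re.finditer(f"(?=({pattern}))", query):
            start = match.start(1)
            end = start + len(match.group(1))
            # Every position covered by the match receives the pattern's vote.
            for pos in range(start, end):
                scores[pos][label] += weight
    return scores


def best_labels(scores):
    """Pick the highest-scoring label at each position (None if uncovered)."""
    return [max(s, key=s.get) if s else None for s in scores]


if __name__ == "__main__":
    query = "MAGSGKSTLL"
    for residue, label in zip(query, best_labels(annotate(query, TOY_DICTIONARY))):
        print(residue, label)
```

Because all patterns are applied in one loop over the query, every kind of label is collected in a single pass, which loosely mirrors the "single pass" property mentioned in the abstract.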

Publication types

  • Comparative Study

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Computational Biology / methods
  • Databases, Protein*
  • Dictionaries, Chemical as Topic*
  • Genomics
  • Humans
  • Internet
  • Molecular Sequence Data
  • Proteins / genetics*
  • Proteome / genetics
  • Sequence Alignment / methods
  • Sequence Homology, Amino Acid
  • Ubiquitin / genetics

Substances

  • Proteins
  • Proteome
  • Ubiquitin