Predicting active site residue annotations in the Pfam database

Jaina Mistry; Alex Bateman; Robert D Finn

doi:10.1186/1471-2105-8-298

BMC Bioinformatics. 2007 Aug 9:8:298. doi: 10.1186/1471-2105-8-298.

Authors

Jaina Mistry¹, Alex Bateman, Robert D Finn

Affiliation

¹ Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

Abstract

Background: Approximately 5% of Pfam families are enzymatic, but only a small fraction of the sequences within these families (<0.5%) have had the residues responsible for catalysis determined. To increase the active site annotations in the Pfam database, we have developed a strict set of rules, chosen to reduce the rate of false positives, which enable the transfer of experimentally determined active site residue data to other sequences within the same Pfam family.

Description: We have created a large database of predicted active site residues. Comparing our active site predictions with those found in UniProtKB, Catalytic Site Atlas, PROSITE and MEROPS, we find that we make many novel predictions. Investigating the small subset of predictions made by these databases that we do not make, we find that the sequences concerned do not meet our strict criteria for prediction. We assessed the sensitivity and specificity of our methodology and estimate that only 3% of our predicted sequences are false positives.

Conclusion: We have predicted 606,110 active site residues, of which 94% are not found in UniProtKB, and have increased the active site annotations in Pfam more than 200-fold. Although implemented for Pfam, the tool we have developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download. Our active site predictions are recalculated at each Pfam release to ensure they are comprehensive and up to date. They provide one of the largest available databases of active site annotation.
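The general idea of transferring annotations through an alignment can be illustrated with a minimal Python sketch. It is not the authors' tool, and the abstract does not spell out the actual rule set. The sketch assumes one simple strict rule: a prediction is made only when a sequence carries exactly the same residue as the experimentally annotated sequence at every active site column. The function names, the toy alignment and the sequence identifiers are all hypothetical, and sequence start offsets (as used in Pfam alignments) are ignored.

```python
GAP_CHARS = "-."


def residue_to_column(aligned_seq, residue_pos):
    """Map a 1-based ungapped residue position to a 0-based alignment column."""
    count = 0
    for col, ch in enumerate(aligned_seq):
        if ch not in GAP_CHARS:
            count += 1
            if count == residue_pos:
                return col
    raise ValueError("residue position beyond sequence length")


def column_to_residue(aligned_seq, col):
    """Map a 0-based alignment column to a 1-based residue position (None for a gap)."""
    if aligned_seq[col] in GAP_CHARS:
        return None
    return sum(1 for ch in aligned_seq[: col + 1] if ch not in GAP_CHARS)


def transfer_active_sites(alignment, source_id, source_sites):
    """Transfer annotated sites from one sequence to aligned sequences.

    Assumed rule (illustrative only): every active site column must hold
    the same residue as the source sequence, otherwise nothing is predicted.
    """
    source = alignment[source_id]
    columns = [residue_to_column(source, pos) for pos in source_sites]
    predictions = {}
    for seq_id, seq in alignment.items():
        if seq_id == source_id:
            continue
        if all(seq[c] not in GAP_CHARS and seq[c].upper() == source[c].upper()
               for c in columns):
            predictions[seq_id] = [column_to_residue(seq, c) for c in columns]
    return predictions


# Hypothetical toy alignment: "known" has experimentally determined sites
# at residues 3 (H) and 4 (D) of its ungapped sequence.
alignment = {
    "known": "MK-HDSLA",
    "cand1": "MKAHDSL-",
    "cand2": "MKAHESLA",
}
predicted = transfer_active_sites(alignment, "known", [3, 4])
print(predicted)
```

In this toy case only `cand1` receives a prediction, because `cand2` has a different residue at one of the two active site columns. Requiring every column to match trades sensitivity for a lower false positive rate, which is the trade-off the abstract describes.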

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Amino Acid Sequence
Binding Sites
Databases, Protein* / trends
Molecular Sequence Data
Predictive Value of Tests
Sequence Alignment / methods
Sequence Alignment / trends
Sequence Homology, Amino Acid
Software Design
