Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores

C A Wilson; J Kreychman; M Gerstein

doi:10.1006/jmbi.2000.3550

J Mol Biol. 2000 Mar 17;297(1):233-49. doi: 10.1006/jmbi.2000.3550.

Authors

C A Wilson¹, J Kreychman, M Gerstein

Affiliation

¹ Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.

PMID: 10704319
DOI: 10.1006/jmbi.2000.3550

Abstract

Measuring in a quantitative, statistical sense the degree to which structural and functional information can be "transferred" between pairs of related protein sequences at various levels of similarity is an essential prerequisite for robust genome annotation. To this end, we performed pairwise sequence, structure and function comparisons on approximately 30,000 pairs of protein domains with known structure and function. Our domain pairs, which are constructed according to the SCOP fold classification, range in similarity from just sharing a fold to being nearly identical. Our results show that traditional scores for sequence and structure similarity have the same basic exponential relationship as observed previously, with structural divergence, measured as RMS deviation, being exponentially related to sequence divergence, measured in percent identity. However, as the scale of our survey is much larger than that of any previous investigation, our results have greater statistical weight and precision. We have been able to express the relationship of sequence and structure similarity using more "modern scores," such as Smith-Waterman alignment scores and probabilistic P-values for both sequence and structure comparison. These modern scores address some of the problems with traditional scores, such as determining a conserved core and correcting for length dependency; they enable us to phrase the sequence-structure relationship in more precise and accurate terms. We found that the basic exponential sequence-structure relationship is very general: the same essential relationship is found in the different secondary-structure classes and is evident in all the scoring schemes. To relate function to sequence and structure we assigned various levels of functional similarity to the domain pairs, based on a simple functional classification scheme.
This scheme was constructed by combining and augmenting annotations in the enzyme and fly functional classifications and comparing subsets of these to the Escherichia coli and yeast classifications. We found sigmoidal relationships between similarity in function and sequence, with clear thresholds for different levels of functional conservation. For pairs of domains that share the same fold, precise function appears to be conserved down to approximately 40% sequence identity, whereas broad functional class is conserved to approximately 25%. Interestingly, percent identity is more effective at quantifying functional conservation than the more modern scores (e.g. P-values). Results of all the pairwise comparisons and our combined functional classification scheme for protein structures can be accessed from a web database at http://bioinfo.mbb.yale.edu/align. Copyright 2000 Academic Press.
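
The two identity thresholds quoted in the abstract can be read as a rough decision rule for annotation transfer between domains that already share a fold. The Python sketch below is illustrative only and is not the authors' method: the function names, the gap-excluding definition of percent identity, and the use of hard cutoffs (the abstract reports sigmoidal relationships with approximate thresholds) are all assumptions made for this example.

```python
# Illustrative sketch, not the paper's implementation.
# Thresholds are the approximate values quoted in the abstract and apply
# only to domain pairs that already share the same fold.
PRECISE_FUNCTION_MIN_IDENTITY = 40.0  # percent, approximate
BROAD_CLASS_MIN_IDENTITY = 25.0       # percent, approximate


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: inputs are two rows of a pairwise alignment of equal
    length, with '-' marking gaps. Other denominators (e.g. the shorter
    sequence length) are also in common use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def transferable_level(identity: float) -> str:
    """Map percent identity to the level of function one might transfer.

    Hard cutoffs are a simplification of the sigmoidal relationship
    described in the abstract.
    """
    if identity >= PRECISE_FUNCTION_MIN_IDENTITY:
        return "precise function"
    if identity >= BROAD_CLASS_MIN_IDENTITY:
        return "broad functional class"
    return "none"
```

For example, `transferable_level(percent_identity("ACDE", "ACDF"))` evaluates the rule for a hypothetical four-residue alignment with three identical columns. In practice the thresholds would be treated as soft guides, since conservation falls off gradually around them.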

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Animals
Computational Biology*
Conserved Sequence / genetics
Databases, Factual
Drosophila melanogaster
Enzymes / chemistry
Enzymes / classification
Enzymes / genetics
Enzymes / metabolism
Genome*
Internet
Molecular Weight
Probability
Protein Folding
Protein Structure, Secondary
Protein Structure, Tertiary
Proteins / chemistry*
Proteins / classification
Proteins / genetics
Proteins / metabolism*
Reproducibility of Results
Sequence Alignment
Software
Structure-Activity Relationship

Substances

Enzymes
Proteins