An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments

J Mol Biol. 2000 Aug 18;301(3):691-711. doi: 10.1006/jmbi.2000.3975.

Abstract

The information required to generate a protein structure is contained in its amino acid sequence, but how three-dimensional information is mapped onto a linear sequence is still incompletely understood. Multiple structure alignments of similar protein structures have been used to investigate conserved sequence features but contradictory results have been obtained, due, in large part, to the absence of subjective criteria to be used in the construction of sequence profiles and in the quantitative comparison of alignment results. Here, we report a new procedure for multiple structure alignment and use it to construct structure-based sequence profiles for similar proteins. The definition of "similar" is based on the structural alignment procedure and on the protein structural distance (PSD) described in paper I of this series, which offers an objective measure for protein structure relationships. Our approach is tested in two well-studied groups of proteins; serine proteases and Ig-like proteins. It is demonstrated that the quality of a sequence profile generated by a multiple structure alignment is quite sensitive to the PSD used as a threshold for the inclusion of proteins in the alignment. Specifically, if the proteins included in the aligned set are too distant in structure from one another, there will be a dilution of information and patterns that are relevant to a subset of the proteins are likely to be lost. In order to understand better how the same three-dimensional information can be encoded in seemingly unrelated sequences, structure-based sequence profiles are constructed for subsets of proteins belonging to nine superfolds. We identify patterns of relatively conserved residues in each subset of proteins. It is demonstrated that the most conserved residues are generally located in the regions where tertiary interactions occur and that are relatively conserved in structure. Nevertheless, the conservation patterns are relatively weak in all cases studied, indicating that structure-determining factors that do not require a particular sequential arrangement of amino acids, such as secondary structure propensities and hydrophobic interactions, are important in encoding protein fold information. In general, we find that similar structures can fold without having a set of highly conserved residue clusters or a well-conserved sequence profile; indeed, in some cases there is no apparent conservation pattern common to structures with the same fold. Thus, when a group of proteins exhibits a common and well-defined sequence pattern, it is more likely that these sequences have a close evolutionary relationship rather than the similarities having arisen from the structural requirements of a given fold.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Amino Acids / chemistry
  • Computer Simulation
  • Conserved Sequence*
  • Databases, Factual
  • Immunoglobulins / chemistry
  • Models, Molecular
  • Models, Statistical
  • Molecular Sequence Data
  • Protein Folding
  • Protein Structure, Secondary
  • Sequence Alignment / methods*
  • Sequence Homology, Amino Acid
  • Serine Endopeptidases / chemistry
  • Software*

Substances

  • Amino Acids
  • Immunoglobulins
  • Serine Endopeptidases