Automated discovery of structural signatures of protein fold and function

J Mol Biol. 2001 Feb 23;306(3):591-605. doi: 10.1006/jmbi.2000.4414.

Abstract

A protein sequence or structure must satisfy certain constraints to adopt a particular fold. These constraints could be either a local signature, involving particular sequences or arrangements of secondary structure, or a global signature, involving features along the entire chain. To search systematically for protein fold signatures, we have explored the use of Inductive Logic Programming (ILP). ILP is a machine learning technique that derives rules from observations and encoded principles. The derived rules are readily interpreted in terms of concepts used by experts. For 20 populated folds in SCOP, 59 rules were found automatically. The accuracy of these rules, defined as the number of true positives plus true negatives divided by the total number of examples, is 74% (cross-validated value). Further analysis was carried out for 23 signatures covering 30% or more of the positive examples of a particular fold. The work showed that signatures of protein folds exist: about half of the rules discovered automatically coincide with the level of fold in the SCOP classification. Other signatures correspond to a homologous family and may be the consequence of a functional requirement. Examination of the rules shows that many correspond to established principles published in specific literature. However, in general, the list of signatures is not part of standard biological databases of protein patterns. We find that the length of the loops makes an important contribution to the signatures, suggesting that this is an important determinant of the identity of protein folds. With the expansion in the number of determined protein structures, stimulated by structural genomics initiatives, there will be an increased need for automated methods to extract principles of protein folding from coordinates.
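The two measures used in the abstract, rule accuracy and the share of positive examples a signature covers, can be made concrete with a minimal sketch. The function names and the counts below are hypothetical and chosen only for illustration; they are not taken from the paper, and the coverage formula assumes that "covering" means the fraction of a fold's positive examples matched by a rule.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(true positives + true negatives) / total number of examples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one example is required")
    return (tp + tn) / total


def coverage(tp: int, fn: int) -> float:
    """Fraction of positive examples matched by a rule (assumed definition)."""
    positives = tp + fn
    if positives == 0:
        raise ValueError("at least one positive example is required")
    return tp / positives


# Made-up counts for a single hypothetical rule, not data from the paper.
tp, tn, fp, fn = 6, 9, 1, 4

rule_accuracy = accuracy(tp, tn, fp, fn)
rule_coverage = coverage(tp, fn)

# The abstract analyses signatures covering 30% or more of the positives.
selected_for_analysis = rule_coverage >= 0.30

print(f"accuracy = {rule_accuracy:.2f}")
print(f"coverage = {rule_coverage:.2f}")
print(f"selected = {selected_for_analysis}")
```

With these illustrative counts the rule classifies 15 of 20 examples correctly and matches 6 of the 10 positive examples, so it would pass the 30% coverage threshold.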

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Automation / methods
  • Computational Biology / methods*
  • Cytokines / chemistry
  • Cytokines / metabolism
  • DNA-Binding Proteins*
  • Databases as Topic
  • Globins / chemistry
  • Globins / metabolism
  • Models, Molecular
  • Protein Conformation
  • Protein Folding*
  • Proteins / chemistry*
  • Proteins / classification
  • Proteins / metabolism*
  • Repressor Proteins / chemistry
  • Repressor Proteins / metabolism
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Software*
  • Structure-Activity Relationship
  • Viral Proteins
  • Viral Regulatory and Accessory Proteins

Substances

  • Cytokines
  • DNA-Binding Proteins
  • Proteins
  • Repressor Proteins
  • Viral Proteins
  • Viral Regulatory and Accessory Proteins
  • phage repressor proteins
  • Globins

Associated data

  • PDB/1A7A
  • PDB/1B8A
  • PDB/1BQQ
  • PDB/1DLW
  • PDB/1DLY
  • PDB/1PJC