Accurate Representation of Protein-Ligand Structural Diversity in the Protein Data Bank (PDB)

Int J Mol Sci. 2020 Mar 24;21(6):2243. doi: 10.3390/ijms21062243.

Abstract

The number of protein structures available in the Protein Data Bank (PDB) has increased considerably in recent years. Thanks to this growth in structures and complexes, numerous large-scale studies have been carried out in various research areas, e.g., protein-protein interactions, protein-DNA interactions, and drug discovery. While protein redundancy has so far been managed only with a simple protein sequence identity threshold, the similarity of protein-ligand complexes should also be considered from a structural perspective. Protein-ligand duplicates in the PDB are widely known, but they have never been quantitatively assessed, as they are quite complex to analyze and compare. Here, we present a dedicated clustering of protein-ligand structures to avoid the bias found in different studies. The methodology is based on binding site superposition combined with weighted Root Mean Square Deviation (RMSD) assessment and hierarchical clustering. Repeated structures of proteins of interest are highlighted, and only representative conformations are retained for an unbiased view of protein distribution. Three types of cases are described based on the number of distinct conformations identified for each complex. Defining these categories reduces the number of complexes 3.84-fold and offers more refined results than a protein sequence-based method. Widely distinct conformations were analyzed using normalized B-factors. Furthermore, a non-redundant dataset was generated for future molecular interaction analyses or virtual screening studies.
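
The weighted RMSD and hierarchical clustering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the weighting scheme, the linkage criterion, or the cutoff, so the per-atom weights, the average linkage, the 1.0 Å cutoff, the medoid rule for picking a representative, and all function names below are assumptions made for illustration. The coordinates are toy values assumed to be already superposed on the binding site, with atoms matched one-to-one.

```python
import math


def weighted_rmsd(coords_a, coords_b, weights):
    """Weighted RMSD between two pre-superposed, atom-matched coordinate sets."""
    if not (len(coords_a) == len(coords_b) == len(weights)):
        raise ValueError("coordinate sets and weights must have equal length")
    weighted_sq = sum(
        w * sum((p - q) ** 2 for p, q in zip(a, b))
        for a, b, w in zip(coords_a, coords_b, weights)
    )
    return math.sqrt(weighted_sq / sum(weights))


def pairwise_matrix(structures, weights):
    """Symmetric matrix of weighted RMSD values between all structures."""
    n = len(structures)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = weighted_rmsd(structures[i], structures[j], weights)
            dist[i][j] = dist[j][i] = d
    return dist


def cluster_average_linkage(dist, cutoff):
    """Agglomerative clustering: merge the two closest clusters until the
    smallest average inter-cluster distance exceeds the cutoff."""
    clusters = [[i] for i in range(len(dist))]

    def average(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min(
            (average(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def representatives(clusters, dist):
    """Pick the medoid of each cluster (smallest summed distance to members)."""
    return [min(c, key=lambda i: sum(dist[i][j] for j in c)) for c in clusters]


# Toy data (hypothetical): four copies of a three-atom binding site, in angstroms.
structures = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],  # near-duplicate of 0
    [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0), (0.0, 1.0, 3.0)],  # distinct conformation
    [(0.0, 0.0, 3.1), (1.0, 0.0, 3.0), (0.0, 1.0, 3.0)],  # near-duplicate of 2
]
weights = [2.0, 1.0, 1.0]  # assumption: the first atom counts double

dist = pairwise_matrix(structures, weights)
clusters = cluster_average_linkage(dist, cutoff=1.0)  # assumed cutoff
reps = representatives(clusters, dist)
print(clusters, reps)
```

In this toy case the two near-duplicate pairs fall into separate clusters and one structure per cluster is kept, which mirrors the idea of retaining only representative conformations. A real pipeline would first superpose the binding sites and would need a principled choice of weights and cutoff.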

Keywords: clustering; dataset; protein-ligand complexes; refinement; structural alignment.

MeSH terms

  • Binding Sites
  • Databases, Protein*
  • Humans
  • Ligands
  • Molecular Docking Simulation / methods*
  • Protein Binding
  • Sequence Analysis, Protein / methods*
  • Software*

Substances

  • Ligands