Exploiting Multiple Descriptor Sets in QSAR Studies

J Chem Inf Model. 2016 Mar 28;56(3):501-9. doi: 10.1021/acs.jcim.5b00663. Epub 2016 Mar 10.

Abstract

A quantitative structure-activity relationship (QSAR) is a model relating a specific biological response to the chemical structures of compounds. There are many descriptor sets available to characterize chemical structure, raising the question of how to choose among them or how to use all of them for training a QSAR model. Making efficient use of all sets of descriptors is particularly problematic when active compounds are rare among the assay response data. We consider various strategies to make use of the richness of multiple descriptor sets when assay data are poor in active compounds. Comparisons are made using data from four bioassays, each with five sets of molecular descriptors. The recommended method takes all available descriptors from all sets and uses an algorithm to partition them into groups called phalanxes. Distinct statistical models are trained, each based on only the descriptors in one phalanx, and the models are then averaged in an ensemble of models. By giving the descriptors a chance to contribute in different models, the recommended method uses more of the descriptors in model averaging. This results in better ranking of active compounds to identify a shortlist of drug candidates for development.
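
The partition, train, and average workflow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the random round-robin split in `partition_descriptors` stands in for the paper's phalanx-forming algorithm, the nearest-centroid scorer stands in for whatever base model the study used, and all function names and data values are hypothetical. The sketch assumes the training data contain at least one active and one inactive compound.

```python
import random
from statistics import mean


def partition_descriptors(n_descriptors, n_groups, seed=0):
    """Placeholder partition (NOT the paper's algorithm): shuffle the
    descriptor indices and deal them round-robin into n_groups groups."""
    idx = list(range(n_descriptors))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[g::n_groups]) for g in range(n_groups)]


def fit_centroid_model(X, y, cols):
    """Toy base model for one group: per-class centroids computed from
    only the descriptor columns listed in cols."""
    actives = [[row[c] for c in cols] for row, label in zip(X, y) if label == 1]
    inactives = [[row[c] for c in cols] for row, label in zip(X, y) if label == 0]
    c_act = [mean(col) for col in zip(*actives)]
    c_ina = [mean(col) for col in zip(*inactives)]
    return cols, c_act, c_ina


def score(model, row):
    """Score in [0, 1]; higher means closer to the active centroid."""
    cols, c_act, c_ina = model
    v = [row[c] for c in cols]
    d_act = sum((a - b) ** 2 for a, b in zip(v, c_act))
    d_ina = sum((a - b) ** 2 for a, b in zip(v, c_ina))
    total = d_act + d_ina
    return d_ina / total if total > 0 else 0.5


def ensemble_rank(X_train, y_train, X_new, n_groups=2):
    """Train one model per descriptor group, average the scores, and
    return compound indices ranked from most to least likely active."""
    groups = partition_descriptors(len(X_train[0]), n_groups)
    models = [fit_centroid_model(X_train, y_train, g) for g in groups]
    avg = [mean(score(m, row) for m in models) for row in X_new]
    order = sorted(range(len(X_new)), key=lambda i: avg[i], reverse=True)
    return order, avg


# Hypothetical data: four descriptors pooled from all sets, few actives.
X_train = [
    [0.9, 1.0, 1.1, 1.0],  # active
    [1.1, 1.0, 0.9, 1.0],  # active
    [0.1, 0.0, 0.0, 0.1],  # inactive
    [0.0, 0.1, 0.1, 0.0],  # inactive
    [0.0, 0.0, 0.1, 0.1],  # inactive
]
y_train = [1, 1, 0, 0, 0]
X_new = [
    [1.0, 1.0, 1.0, 1.0],  # resembles the actives
    [0.0, 0.0, 0.0, 0.0],  # resembles the inactives
    [0.5, 0.5, 0.5, 0.5],  # in between
]

order, avg = ensemble_rank(X_train, y_train, X_new, n_groups=2)
shortlist = order[:1]  # top-ranked compound(s) to follow up
```

Because each group's model sees only its own descriptors, every descriptor contributes to the averaged score through some model, which is the property the abstract credits for the improved ranking.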

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Biological Assay
  • Cell Line, Tumor
  • Humans
  • Models, Molecular
  • Quantitative Structure-Activity Relationship*