Stacked Generalization with Applicability Domain Outperforms Simple QSAR on in Vitro Toxicological Data

J Chem Inf Model. 2019 Apr 22;59(4):1486-1496. doi: 10.1021/acs.jcim.8b00553. Epub 2019 Feb 27.

Abstract

In silico tools that predict the bioactivity and toxicity of chemical substances offer a way to assess toxicity as early as possible. To enable the development of such tools, the ToxCast program has generated and made publicly available in vitro bioactivity data for thousands of compounds. The goal of the present study is to characterize the ToxCast data and explore how well suited they are to machine learning. To this end, a large-scale analysis of the entire database was performed to build models that predict the bioactivities measured in in vitro assays. Simple classical QSAR algorithms (ANN, SVM, LDA, random forest, and Bayesian) were first applied to the data, and the results suggested that these algorithms are not well suited to data sets with a high proportion of inactive compounds. The study then showed for the first time that an ensemble method named "stacked generalization" could improve model performance on this type of data: for 61% of 483 models, the stacked method led to higher performance. Moreover, combining this ensemble method with an applicability domain filter makes it possible to assess the reliability of the predictions for further compound prioritization. In particular, for 50% of the models, the ROC score was better when compounds outside the applicability domain were not considered.
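
The effect of an applicability domain filter on the ROC score can be illustrated with a minimal sketch. The abstract does not say how the applicability domain was defined, so the distance-based rule below (mean distance to the k nearest training compounds, with a cutoff taken from the training set itself) is an assumption chosen for illustration, not the authors' method. The function names, the parameters `k` and `quantile`, and the toy descriptors, labels, and scores are all hypothetical.

```python
import math

def roc_auc(labels, scores):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute a ROC AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_knn_distance(x, reference, k):
    """Mean Euclidean distance from x to its k nearest points in reference."""
    dists = sorted(math.dist(x, r) for r in reference)
    return sum(dists[:k]) / k

def in_domain_mask(train_X, test_X, k=3, quantile=0.95):
    """Flag test compounds whose mean kNN distance to the training set is within a cutoff.

    Assumed rule (not from the source): the cutoff is an empirical quantile of the
    leave-one-out mean kNN distances among the training compounds.
    """
    train_d = sorted(
        mean_knn_distance(x, train_X[:i] + train_X[i + 1:], k)
        for i, x in enumerate(train_X)
    )
    threshold = train_d[min(len(train_d) - 1, int(quantile * len(train_d)))]
    return [mean_knn_distance(x, train_X, k) <= threshold for x in test_X]

# Hypothetical 2-D descriptors for training compounds (two tight clusters).
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
           (1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1)]

# Hypothetical test compounds: four near the training data, two far from it.
test_X = [(0.15, 0.15), (1.05, 1.0), (0.25, 0.2), (0.95, 1.1), (5.0, 5.0), (6.0, -4.0)]
test_y = [0, 1, 0, 1, 1, 0]                      # made-up activity labels
test_scores = [0.2, 0.9, 0.3, 0.8, 0.1, 0.7]     # made-up model scores

mask = in_domain_mask(train_X, test_X)
auc_all = roc_auc(test_y, test_scores)
auc_in = roc_auc(
    [y for y, m in zip(test_y, mask) if m],
    [s for s, m in zip(test_scores, mask) if m],
)

print("in domain:", mask)
print("ROC AUC, all compounds:", round(auc_all, 3))
print("ROC AUC, in-domain only:", round(auc_in, 3))
```

In this toy case the two compounds far from the training data are also the ones the made-up scores rank wrongly, so dropping them raises the ROC AUC. Real data need not behave this way, which is consistent with the abstract reporting an improvement for only about half of the models.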

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Bayes Theorem
  • Computer Simulation*
  • Quantitative Structure-Activity Relationship*
  • Supervised Machine Learning
  • Toxicology*