The Effect of Resampling on Data-imbalanced Conditions for Prediction towards Nuclear Receptor Profiling Using Deep Learning

Mol Inform. 2020 Aug;39(8):e1900131. doi: 10.1002/minf.201900131. Epub 2020 Mar 31.

Abstract

In toxicity evaluation based on the nuclear receptor signalling pathway, in silico prediction tools are used to detect the early stages of long-term toxicity, to prioritize newly synthesized chemicals, and to achieve selectivity and sensitivity. Computational prediction models are promising tools for toxicity screening of chemical-protein interactions, since deep learning has improved prediction accuracy. However, a challenge remains: under data-imbalanced conditions, where the set of toxic compounds is much smaller than the set of nontoxic compounds, prediction accuracy is low for the toxic compounds, which are the ones that provide valid information about toxicity hazard. In this paper, we examined the effect of data imbalance in toxicity assessment data for the nuclear receptors AR (LBD), ER (LBD), AhR, and PPAR, and identified a severe imbalance between prediction performance on the toxic and nontoxic datasets. Because balanced selectivity and sensitivity are required for the assessment of toxicity hazards, we investigated data resampling methods to reduce this bias in binary classification for toxicity hazard profiling of nuclear receptors. With simple resampling methods, the experiments achieved a sensitivity of 0.714 and a specificity of 0.787, with an overall accuracy of 0.829 and a ROC-AUC of 0.822.
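
The abstract does not say which resampling methods were used, so the following is only an illustrative sketch of one common simple approach, random oversampling of the minority (toxic) class, together with the sensitivity and specificity calculation for a binary classifier. The function names, the label convention (1 = toxic, 0 = nontoxic) and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import random


def random_oversample(features, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Hypothetical helper: random oversampling is assumed here as one example of a
    "simple resampling method"; the paper's actual procedure may differ.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Nothing to do if the minority class is absent or already large enough.
    if not minority or len(minority) >= len(majority):
        return list(features), list(labels)
    # Draw minority indices with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = list(range(len(labels))) + extra
    rng.shuffle(indices)
    return [features[i] for i in indices], [labels[i] for i in indices]


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) from true and predicted binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data (assumption): 2 toxic and 8 nontoxic compounds, one feature each.
X = [[float(i)] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X, y)
```

In a sketch like this, resampling would normally be applied to the training split only, so that sensitivity and specificity are measured on data with the original class distribution.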

Keywords: Computational toxicity prediction; data imbalance; deep learning; nuclear receptor; resampling; toxicity hazard profiling.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Deep Learning*
  • Neural Networks, Computer
  • Receptors, Cytoplasmic and Nuclear / metabolism*

Substances

  • Receptors, Cytoplasmic and Nuclear