As a dangerous end point, respiratory toxicity can cause serious adverse health effects and even death. Meanwhile, it is a common and traditional issue in occupational and environmental protection. Pharmaceutical and chemical industries have a strong urge to develop precise and convenient computational tools to evaluate the respiratory toxicity of compounds as early as possible. Most of the reported theoretical models were developed based on the respiratory toxicity data sets with one single symptom, such as respiratory sensitization, and therefore these models may not afford reliable predictions for toxic compounds with other respiratory symptoms, such as pneumonia or rhinitis. Here, based on a diverse data set of mouse intraperitoneal respiratory toxicity characterized by multiple symptoms, a number of quantitative and qualitative predictions models with high reliability were developed by machine learning approaches. First, a four-tier dimension reduction strategy was employed to find an optimal set of 20 molecular descriptors for model building. Then, six machine learning approaches were used to develop the prediction models, including relevance vector machine (RVM), support vector machine (SVM), regularized random forest (RRF), extreme gradient boosting (XGBoost), naïve Bayes (NB), and linear discriminant analysis (LDA). Among all of the models, the SVM regression model shows the most accurate quantitative predictions for the test set (q2ext = 0.707), and the XGBoost classification model achieves the most accurate qualitative predictions for the test set (MCC of 0.644, AUC of 0.893, and global accuracy of 82.62%). The application domains were analyzed, and all of the tested compounds fall within the application domain coverage. We also examined the structural features of the compounds and important fragments with large prediction errors. In conclusion, the SVM regression model and the XGBoost classification model can be employed as accurate prediction tools for respiratory toxicity.
Keywords: dimension reduction; extreme gradient boosting; machine learning; quantitative structure−activity relationship; respiratory system toxicity.