Predicting chemical ecotoxicity by learning latent space chemical representations

Feng Gao; Wei Zhang; Andrea A Baccarelli; Yike Shen

doi:10.1016/j.envint.2022.107224

Environ Int. 2022 May:163:107224. doi: 10.1016/j.envint.2022.107224. Epub 2022 Apr 1.

Authors

Feng Gao¹, Wei Zhang², Andrea A Baccarelli¹, Yike Shen³

Affiliations

¹ Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY 10032, United States.
² Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI 48823, United States.
³ Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY 10032, United States.

Abstract

In silico prediction of chemical ecotoxicity (HC₅₀) is an important complement to in vivo and in vitro toxicological assessment of manufactured chemicals. Recent applications of machine learning models to predict chemical HC₅₀ have yielded variable prediction performance, which depends on effectively learning chemical representations from high-dimensional data. To improve HC₅₀ prediction performance, we developed an autoencoder model that learns latent space chemical embeddings. This approach achieved state-of-the-art HC₅₀ prediction performance, with R² of 0.668 ± 0.003 and mean absolute error (MAE) of 0.572 ± 0.001, and outperformed other dimension reduction methods, including principal component analysis (PCA) (R² = 0.601 ± 0.031 and MAE = 0.629 ± 0.005), kernel PCA (R² = 0.631 ± 0.008 and MAE = 0.625 ± 0.006), and uniform manifold approximation and projection (UMAP) (R² = 0.400 ± 0.008 and MAE = 0.801 ± 0.002). A simple linear layer trained on the chemical embeddings learned by the autoencoder performed better than random forest (R² = 0.663 ± 0.007 and MAE = 0.591 ± 0.008), fully connected neural network (R² = 0.614 ± 0.016 and MAE = 0.610 ± 0.008), least absolute shrinkage and selection operator (LASSO) (R² = 0.617 ± 0.037 and MAE = 0.619 ± 0.007), and ridge regression (R² = 0.638 ± 0.007 and MAE = 0.613 ± 0.005) models trained on the raw input features without representation learning. Our results highlight the usefulness of learning latent chemical representations, and our autoencoder model provides an alternative approach for robust HC₅₀ prediction.

Keywords: Autoencoder; Chemical ecotoxicity; Dimension reduction; Machine learning; Representation learning.
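
The pipeline summarized in the abstract (compress raw features with an autoencoder, then fit a simple linear layer on the latent embeddings and score it with R² and MAE) can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic data, the single tanh encoder layer, the latent size of 5, the learning rate, and the least-squares fit of the linear layer are all assumptions chosen to keep the example small and dependent only on NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the paper's dataset): 200 "chemicals" with 30
# raw features driven by 5 hidden factors, plus a target playing the role of HC50.
n, d, k = 200, 30, 5
factors = rng.normal(size=(n, k))
X = factors @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))
y = factors @ rng.normal(size=k) + 0.1 * rng.normal(size=n)

# Hold out the last 50 rows; standardize with training statistics only.
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr, X_te = (X_tr - mu) / sd, (X_te - mu) / sd


def train_autoencoder(X, k, lr=0.02, epochs=3000):
    """Fit a tanh encoder / linear decoder by full-batch gradient descent."""
    n, d = X.shape
    W1, b1 = rng.normal(scale=0.1, size=(d, k)), np.zeros(k)
    W2, b2 = rng.normal(scale=0.1, size=(k, d)), np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # latent embedding
        R = H @ W2 + b2                   # reconstruction of the input
        losses.append(np.mean((R - X) ** 2))
        G = 2.0 * (R - X) / (n * d)       # gradient of the MSE w.r.t. R
        dA = (G @ W2.T) * (1.0 - H ** 2)  # backpropagate through tanh
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ dA)
        b1 -= lr * dA.sum(axis=0)
    encode = lambda X_new: np.tanh(X_new @ W1 + b1)
    return encode, losses


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))


encode, losses = train_autoencoder(X_tr, k)
Z_tr, Z_te = encode(X_tr), encode(X_te)

# "Simple linear layer" on the embeddings, fit here by ordinary least squares
# with an appended column of ones acting as the bias term.
A_tr = np.column_stack([Z_tr, np.ones(len(Z_tr))])
A_te = np.column_stack([Z_te, np.ones(len(Z_te))])
coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
y_hat = A_te @ coef

print(f"reconstruction MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(f"test R2 = {r2_score(y_te, y_hat):.3f}, test MAE = {mae(y_te, y_hat):.3f}")
```

The autoencoder never sees the target during training, so the embeddings are learned from the features alone and the supervised step stays a plain linear fit. Numbers printed by this sketch come from synthetic data and are not comparable to the R² and MAE values reported in the abstract.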
