Multiclass Synthetic Accessibility Prediction

J Chem Inf Model. 2025 Jan 17. doi: 10.1021/acs.jcim.4c01663. Online ahead of print.

Abstract

Evaluating synthetic accessibility of in silico molecules is an integral component of the drug discovery process. While the application of machine learning models to predict whether small molecules are easy or hard to synthesize has gained attention recently, predetermined thresholds and data set imbalances present challenges for these binary classification approaches. In this study, we introduce a novel multiclass fold-ensembled classification approach to predict the minimum number of steps needed to synthesize a small molecule. By ensembling the base models trained on multiple stratified subsampled folds, this approach effectively mitigates the impact of class imbalance through probability aggregation or voting aggregation strategies. Additionally, we propose fuzzy evaluation metrics that account for practical tolerances in predictions, providing a more flexible and realistic assessment of model performance. Through experimentation on two reaction benchmark data sets, we demonstrate the effectiveness of our model in a multiclass synthetic accessibility prediction task and the superiority of our proposed method over six existing models in binary synthetic accessibility prediction tasks.
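The two aggregation strategies and the idea of a tolerance-based metric can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so everything here is an assumption made for illustration: the function names (aggregate_probabilities, aggregate_votes, fuzzy_accuracy) are hypothetical, probability aggregation is taken to be a plain average of per-class probabilities, voting is taken to be a majority vote with ties broken toward the lower class, and the fuzzy metric is taken to count a prediction as correct when it lies within a fixed number of steps of the true value. Training the base models on stratified subsampled folds is not shown.

```python
from collections import Counter


def aggregate_probabilities(fold_probs):
    """Average per-class probabilities over base models and return the argmax class.

    fold_probs: one probability vector per base model, all of equal length.
    Assumption: unweighted mean; the paper may weight models differently.
    """
    n_classes = len(fold_probs[0])
    mean = [sum(p[c] for p in fold_probs) / len(fold_probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c])


def aggregate_votes(fold_probs):
    """Let each base model vote for its own argmax class and return the majority.

    Assumption: ties are broken in favor of the lowest class index.
    """
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in fold_probs]
    counts = Counter(votes)
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best)


def fuzzy_accuracy(y_true, y_pred, tolerance=1):
    """Fraction of predictions within `tolerance` steps of the true step count.

    Assumption: a hypothetical tolerance-based accuracy, not the paper's metric.
    """
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Three base models, three classes. Two models weakly prefer class 0,
# while one strongly prefers class 1, so the two strategies disagree.
fold_probs = [
    [0.45, 0.40, 0.15],
    [0.45, 0.40, 0.15],
    [0.05, 0.90, 0.05],
]
prob_choice = aggregate_probabilities(fold_probs)  # class 1
vote_choice = aggregate_votes(fold_probs)          # class 0

# Step counts: exact accuracy versus accuracy with a one-step tolerance.
y_true = [1, 2, 3, 5]
y_pred = [1, 3, 5, 5]
exact = fuzzy_accuracy(y_true, y_pred, tolerance=0)  # 0.5
fuzzy = fuzzy_accuracy(y_true, y_pred, tolerance=1)  # 0.75
```

In this toy case, averaging lets one confident base model outweigh two uncertain ones, whereas voting follows the head count; which behavior is preferable depends on how well calibrated the base models are.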