Machine Learning for Reaction Performance Prediction in Allylic Substitution Enhanced by Automatic Extraction of a Substrate-Aware Descriptor

J Chem Inf Model. 2025 Jan 13;65(1):312-325. doi: 10.1021/acs.jcim.4c02120. Epub 2025 Jan 2.

Abstract

Despite remarkable advancements in the organic synthesis field facilitated by the use of machine learning (ML) techniques, the prediction of reaction outcomes, including yield estimation, catalyst optimization, and mechanism identification, continues to pose a significant challenge. This challenge arises primarily from the lack of appropriate descriptors capable of retaining crucial molecular information for accurate prediction while also ensuring computational efficiency. This study presents a successful application of ML for predicting the performance of Ir-catalyzed allylic substitution reactions. We introduce SubA, an innovative substrate-aware descriptor that is inspired by the fact that specific atoms or motifs in reactants drive the reaction outcomes. By employing graph matching algorithms for molecular backbone identification and incorporating atomic and molecular properties derived from density functional theory calculations, SubA extracts essential information at both the atomic level and the molecular level. Compared to four mainstream descriptors, SubA achieves reduced dimensionality and enhanced prediction accuracy with over 2% mean absolute error reduction in both random and scaffold splitting evaluations. It also demonstrates better generalization when confronted with previously unreported substrate combinations in extended experiments. Furthermore, an interpretable analysis of SubA shows that the predictor focuses on key molecular and atomic features, offering insights into reaction mechanisms.
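
The evaluation protocol named in the abstract (scaffold splitting, with mean absolute error as the metric) can be illustrated with a minimal sketch. This is not the authors' code. It assumes a scaffold label has already been computed for each reaction by some other tool, and the function names `scaffold_split`, `mean_absolute_error`, and `relative_mae_reduction` are hypothetical. The abstract does not say whether the reported 2% reduction is relative or absolute, so the relative form shown here is an assumption.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split sample indices so that no scaffold label occurs in both sets.

    `scaffolds` holds one precomputed label per sample (assumed input).
    """
    groups = defaultdict(list)
    for index, label in enumerate(scaffolds):
        groups[label].append(index)

    # Visit larger scaffold groups first; groups that no longer fit
    # within the training budget are assigned to the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    train_budget = len(scaffolds) * (1 - test_fraction)

    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= train_budget:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_mae_reduction(mae_baseline, mae_new):
    """Fraction by which a new descriptor lowers MAE versus a baseline (assumed definition)."""
    return (mae_baseline - mae_new) / mae_baseline
```

Unlike a random split, a split of this kind keeps every test-set scaffold out of the training data, which is why it is commonly used to probe generalization to unseen substrate backbones.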

MeSH terms

  • Algorithms
  • Allyl Compounds / chemistry
  • Catalysis
  • Machine Learning*

Substances

  • Allyl Compounds