Initial Sample Selection in Bayesian Optimization for Combinatorial Optimization of Chemical Compounds

ACS Omega. 2022 Dec 30;8(2):2001-2009. doi: 10.1021/acsomega.2c05145. eCollection 2023 Jan 17.

Abstract

An efficient search for optimal solutions in Bayesian optimization (BO) requires appropriate initial samples for building the Gaussian process regression model. For general experimental designs in which the explanatory variables x contain no compounds or molecular descriptors, selecting initial samples with a larger D-optimality yields samples with little correlation among the x variables, which leads to effective regression model building. However, in experimental designs that include compounds, the molecular descriptors calculated from chemical structures are always highly correlated, and compounds with similar structures form clusters in chemical space. Selecting the initial samples uniformly from each cluster is therefore desirable for obtaining initial samples with maximum information on the experimental conditions. Because D-optimality does not work well with highly correlated molecular descriptors and does not use cluster information in sample selection, we propose an initial sample selection method based on clustering and apply it to the optimization of coupling reaction conditions with BO. We confirm that the proposed method reaches the optimal solution with up to 5% fewer experiments than random sampling or sampling based on D-optimality. This study contributes to initial sample selection for BO, and we believe that the proposed method will improve the search performance of BO in various fields of science and technology, provided that the initial samples can be determined using cluster information appropriately formed from domain knowledge.
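
The contrast drawn above between D-optimality and cluster-based selection can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which clustering algorithm or descriptors were used, so the plain k-means routine, the round-robin draw from clusters, the function names (`d_optimality`, `kmeans_labels`, `select_by_cluster`), and the synthetic two-dimensional "descriptors" are all assumptions made for illustration. The only standard ingredient is the D-optimality criterion itself, taken here as the log-determinant of X^T X for the selected rows.

```python
import numpy as np


def d_optimality(X):
    """log det(X^T X) of a design matrix X (rows = samples); -inf if singular."""
    X = np.asarray(X, dtype=float)
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return float(logdet) if sign > 0 else float("-inf")


def kmeans_labels(X, k, n_iter=50):
    """Plain k-means (an assumed stand-in for the paper's clustering step).

    Uses deterministic farthest-point initialization; assumes k <= len(X).
    """
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        diff = X[:, None, :] - np.array(centers)[None, :, :]
        nearest = np.linalg.norm(diff, axis=2).min(axis=1)
        centers.append(X[int(np.argmax(nearest))])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Final assignment against the last updated centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def select_by_cluster(X, n_samples, k, seed=0):
    """Pick initial samples by cycling over clusters, one random member at a time."""
    labels = kmeans_labels(X, k)
    rng = np.random.default_rng(seed)
    # One shuffled pool of row indices per cluster.
    pools = [list(rng.permutation(np.flatnonzero(labels == j))) for j in range(k)]
    chosen = []
    while len(chosen) < n_samples and any(pools):
        for pool in pools:
            if pool and len(chosen) < n_samples:
                chosen.append(int(pool.pop()))
    return np.array(chosen, dtype=int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for molecular descriptors: three tight clusters in 2-D.
    blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    X = np.vstack([c + 0.1 * rng.standard_normal((5, 2)) for c in blob_centers])
    picked = select_by_cluster(X, n_samples=3, k=3)
    print("selected rows:", picked)
    print("log det(X^T X) of the selection:", d_optimality(X[picked]))
```

In this sketch the selected rows would serve as the initial experiments from which the Gaussian process regression model is first built. The round-robin draw is one simple way to spread samples evenly across clusters; other per-cluster rules, such as taking the member closest to each cluster center, would fit the same description.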