Machine learning interatomic potentials (MLIPs) promise quantum-level accuracy at classical force field speeds, but their performance hinges on the quality and diversity of training data. An efficient and fully automated approach is presented for sampling chemical reaction space without relying on human intuition, addressing a critical gap in MLIP development. The method combines the speed of tight-binding calculations with selective high-level refinement, generating diverse datasets that capture both equilibrium and reactive regions of potential energy surfaces. By employing single-ended growing string and nudged elastic band methods, the method systematically explores reaction pathways previously underrepresented in MLIP training sets, particularly those near transition states. The resulting datasets have rich structural and chemical diversity, which is essential for robust MLIP development. Open-source code is provided for the entire workflow, facilitating the integration of the approach into existing MLIP development pipelines.
Keywords: chemical reaction space; dataset generation; machine learning interatomic potential.
© 2025 The Author(s). Advanced Science published by Wiley‐VCH GmbH.