Imputation of Missing Data in Materials Science through Nearest Neighbors and Iterative Predictions

J Chem Theory Comput. 2024 Dec 26. doi: 10.1021/acs.jctc.4c01237. Online ahead of print.

Abstract

Missing data in tabular data sets is ubiquitous in statistical analysis, big data analysis, and machine learning studies. Many strategies have been proposed to impute missing data, but their reliability has not been stringently assessed in materials science. Here, we carried out a benchmark test of six imputation strategies (Mean, MissForest, HyperImpute, Gain, Sinkhorn, and a newly proposed MatImpute) on seven representative data sets in materials science. The imputation-induced errors (IIEs) were evaluated through the difference between imputed and original values using the root mean square error (RMSE), the Wasserstein distance (WD), and a newly introduced metric, data set correlation convergence (DCC), which measure the difference at three levels: individual data, column-wise distribution, and correlation stability of a data set. MatImpute outperformed the others, with the lowest RMSE and WD and the highest DCC. The IIE increases with the data missing ratio and follows the order missing at random < missing completely at random ≤ missing not at random, considering inherent correlations among missing data. A similar trend was observed for the increase of IIE with the central departure distance in units of the standard deviation, which is consistent with the increasing difficulty from interpolation to extrapolation. In further tests of IIE in regression and classification machine learning predictive models, MatImpute also preserved the highest data recovery fidelity. We released the code of MatImpute to facilitate the construction of high-quality data sets in materials science.
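
As an illustration of how an imputation-induced error might be measured, the following minimal Python sketch masks entries completely at random, fills them with the Mean baseline, and computes RMSE and a column-wise Wasserstein distance. It is not the MatImpute code. The function names, the synthetic Gaussian data, and the choice to average WD over columns are assumptions made for this example, and DCC is omitted because its definition is not given in the abstract. The WD helper relies on the fact that, for two one-dimensional samples of equal size, the first-order Wasserstein distance equals the mean absolute difference of the sorted values.

```python
import numpy as np


def mcar_mask(shape, ratio, rng):
    """Boolean mask that is True where a value is removed completely at random."""
    return rng.random(shape) < ratio


def mean_impute(X, mask):
    """Mean baseline: fill masked entries with the column mean of observed entries."""
    X_obs = np.where(mask, np.nan, X)
    col_means = np.nanmean(X_obs, axis=0)
    return np.where(mask, col_means, X)


def rmse(X_true, X_imp, mask):
    """Root mean square error over the imputed (masked) entries only."""
    diff = (X_true - X_imp)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def wasserstein_1d(a, b):
    """First-order Wasserstein distance between two equal-size 1D samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))


def columnwise_wd(X_true, X_imp):
    """Average WD between original and imputed columns (assumed aggregation)."""
    n_cols = X_true.shape[1]
    return float(np.mean([wasserstein_1d(X_true[:, j], X_imp[:, j])
                          for j in range(n_cols)]))


# Hypothetical demo on synthetic data; real benchmarks would use actual data sets.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
for ratio in (0.1, 0.3, 0.5):
    mask = mcar_mask(X.shape, ratio, rng)
    X_imp = mean_impute(X, mask)
    print(ratio, rmse(X, X_imp, mask), columnwise_wd(X, X_imp))
```

Restricting RMSE to the masked entries keeps the observed values, which are reproduced exactly, from diluting the error, while WD is computed on whole columns because it compares distributions rather than individual entries.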