Constructing calibration and validation sets is a key step in model development: representative samples must be selected into each subset so that the calibration model can be built and evaluated effectively and can predict reliably. In this study, a method is proposed for constructing the calibration and validation sets by selecting the samples that are most similar to the test samples, based on their spectral data. Both the Euclidean distance and the Mahalanobis distance were tried as measures of spectral similarity. Because the calibration samples are chosen specifically for the unknown test samples, the selection is better suited to practical applications and thereby improves measurement accuracy. In addition, the size of the calibration set was optimized to avoid the influence of unnecessary samples. Two near-infrared (NIR) spectroscopy data sets, Salvia miltiorrhiza (S. miltiorrhiza) and corn, were used to compare the performance of the proposed method with that of two typical sample-selection algorithms, Kennard-Stone (KS) and sample set partitioning based on joint x-y distances (SPXY). The experimental results indicated that the proposed method selected a more targeted set of samples for the unknown test samples and gave better predictive performance than the KS and SPXY methods.
Keywords: NIR spectroscopy; PLS regression; Salvia miltiorrhiza analysis; Sample selection; Spectra similarity.
Copyright © 2021 Elsevier B.V. All rights reserved.
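The similarity-based selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_similar`, the rule of scoring each candidate by its distance to the nearest test spectrum, and the use of a pseudo-inverse of the candidate-pool covariance for the Mahalanobis distance are all assumptions made for illustration. The abstract does not say how the calibration set size was optimized, so `n_select` is simply left as a parameter.

```python
import numpy as np


def select_similar(X_pool, X_test, n_select, metric="euclidean"):
    """Return indices of the n_select pool spectra closest to the test spectra.

    Hypothetical helper: each candidate is scored by its squared distance
    to the nearest test spectrum (an assumed aggregation rule).
    """
    X_pool = np.asarray(X_pool, dtype=float)
    X_test = np.asarray(X_test, dtype=float)

    # Pairwise differences, shape (n_pool, n_test, n_variables).
    diff = X_pool[:, None, :] - X_test[None, :, :]

    if metric == "euclidean":
        d2 = np.einsum("ijk,ijk->ij", diff, diff)
    elif metric == "mahalanobis":
        # Assumption: covariance estimated from the pool; the pseudo-inverse
        # tolerates the singular covariance typical of collinear spectra.
        cov = np.atleast_2d(np.cov(X_pool, rowvar=False))
        cov_inv = np.linalg.pinv(cov)
        d2 = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    else:
        raise ValueError(f"unknown metric: {metric!r}")

    # Score each candidate by its nearest test spectrum, then rank.
    score = d2.min(axis=1)
    order = np.argsort(score, kind="stable")
    return order[:n_select]


if __name__ == "__main__":
    # Toy two-variable "spectra", purely illustrative.
    pool = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [10.0, 10.0]]
    test = [[0.9, 1.1]]
    print(select_similar(pool, test, n_select=2))
```

In practice, NIR spectra have far more variables than samples, so a Mahalanobis distance would more commonly be computed on a few principal component scores than on raw spectra. The pseudo-inverse here only keeps the sketch short.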