HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: bigstrut

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY 4.0
arXiv:2403.08376v1 [cs.LG] 13 Mar 2024

Nonlinear Manifold Learning Determines Microgel Size from Raman Spectroscopy

Eleni D. Koronaki Faculté des Sciences, de la Technologie et de la Communication, Université de Luxembourg, Maison du Nombre, Avenue de la Fonte 6, L-4364 Esch-sur-Alzette, Luxembourg Luise F. Kaven Process Systems Engineering (AVT.SVT), RWTH Aachen University, Forckenbeckstr. 51, 52074 Aachen, Germany Johannes M. M. Faust Process Systems Engineering (AVT.SVT), RWTH Aachen University, Forckenbeckstr. 51, 52074 Aachen, Germany Ioannis G. Kevrekidis Department of Chemical and Biomolecular Engineering and Department of Applied Mathematics and Statistics, Whiting School of Engineering, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA. Alexander Mitsos JARA-CSD, 52056 Aachen, Germany Process Systems Engineering (AVT.SVT), RWTH Aachen University, Forckenbeckstr. 51, 52074 Aachen, Germany Institute of Energy and Climate Research, Energy Systems Engineering (IEK-10), Forschungszentrum Jülich GmbH, 52425 Jülich, Germany
(March 13, 2024)

Abstract

Polymer particle size constitutes a crucial characteristic of product quality in polymerization. Raman spectroscopy is an established and reliable process analytical technology for in-line concentration monitoring. Recent approaches and some theoretical considerations show a correlation between Raman signals and particle sizes but do not determine polymer size from Raman spectroscopic measurements accurately and reliably. With this in mind, we propose three alternative machine learning workflows to perform this task, all involving diffusion maps, a nonlinear manifold learning technique for dimensionality reduction: (i)𝑖(i)( italic_i ) directly from diffusion maps, (ii)𝑖𝑖(ii)( italic_i italic_i ) alternating diffusion maps, and (iii)𝑖𝑖𝑖(iii)( italic_i italic_i italic_i ) conformal autoencoder neural networks. We apply the workflows to a data set of Raman spectra with associated size measured via dynamic light scattering of 47 microgel (cross-linked polymer) samples in a diameter range of 208 nm to 483 nmrangetimes208nanometertimes483nanometer208\text{\,}\mathrm{nm}483\text{\,}\mathrm{nm}start_ARG start_ARG 208 end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG end_ARG to start_ARG start_ARG 483 end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG end_ARG. The conformal autoencoders substantially outperform state-of-the-art methods and results for the first time in a promising prediction of polymer size from Raman spectra.

Einführung

Process analytical methods are crucial for optimizing process performance and product properties, especially in polymerization. In-line spectroscopic methods are advantageous, and, in particular, near-infrared (NIR) and Raman spectroscopy are widely applied spectroscopic methods, see, e.g., the reviews. 1, 2, 3, 4 Evaluation methods for concentrations from (either on-line or off-line) spectroscopic data are established and comprise regression models, such as univariate peak integration based on the Beer-Lambert law, 5 multivariate partial least squares (PLS), or artificial neural networks (ANNs), 6 and physically supported strategies such as multivariate curve resolution-alternating least squares (MCR-ALS) 7 or indirect hard modeling (IHM). 8 Size is a crucial product feature in several processes, e.g., polymerization and crystallization. In contrast to concentrations, the size prediction from spectroscopic data remains a major challenge. Herein, we consider cross-linked polymer networks called microgels, which are not considered particles. 9 Thus, we use the term microgel size oder polymer size instead of the more common particle size. The fact that particles such as polymers influence spectroscopic measurements through light scattering is well established, 10 and experimental evidence of the correlation between Raman scattering and polymer size has been presented, 11 even for relatively large particles. 12 However, only a few approaches attempt to predict polymer sizes from Raman spectra. 11, 12, 13, 14, 15, 16 These approaches rely on relatively small sets of data points and focus on training data-driven models. As most literature studies do not provide performance metrices or alternatively raw data publication, a comprehensive comparison of the quantitative performance metrices of studies from the literature is not feasible. However, qualitatively a lack of prediction accuracy of these approaches is observed compared to established methods such as dynamic light scattering (DLS). An overview of the state-of-the-art work on polymer size prediction from Raman spectra is presented in Tab. 1. Most of these methods are based on the linear methods PLS or principal component analysis (PCA), which reduce the predictors to a smaller set of uncorrelated components and perform least squares regression on these components instead of on the original data. Each spectral measurement consists of many measured intensities, resulting in a large dimensionality of the input vector. However, the measured intensities are not independent and ideally depend on a small number of meaningful properties, e.g., concentration and size. When the available data live in a low-dimensional, yet nonlinear manifold, linear methods often fail to capture the majority of the variance of the data, even with an increased number of principal components. Hence, we propose using nonlinear manifold learning approaches to achieve significant dimensionality reduction and, more importantly, to identify latent variables possessing specific desired properties. Dimensionality reduction replaces extensive data sets with a handful of latent variables. For the reduction, we employ diffusion maps (DMAPs), 17, 18, 19 to derive a parsimonious reduced description of the Raman spectra. Then, as one alternative, we predict the polymer size directly from the latent variables computed with the DMAPs algorithm. Moreover, we take additional steps regarding the characteristics of the latent space by applying two alternative machine-learning methods to discover common latent variables between the spectra and the polymer sizes. In this sense, common describes a set of variables that has a one-to-one correlation to the desired observed quantity. The common variable between the spectra and the polymer size is used as a data-driven junction through which we can predict the polymer size given the spectrum of a new sample. We propose three alternative machine learning methods (presented schematically in Fig. 1): (i𝑖iitalic_i) direct prediction from DMAP coordinates; (ii𝑖𝑖iiitalic_i italic_i) an alternating diffusion maps (AltDMAPs) algorithm, initially introduced by Lederman et al., 20 and later in the context of multimodal data fusion 21 und identification of “jointly smooth” functions on input and output manifolds; 22 and (iii𝑖𝑖𝑖iiiitalic_i italic_i italic_i) an ensemble of concurrently trained neural networks (NNs), named Y-shaped conformal autoencoders, introduced by Evangelou et al. 23 in the context of parameter non-identifiability. Methods (ii𝑖𝑖iiitalic_i italic_i) and (iii𝑖𝑖𝑖iiiitalic_i italic_i italic_i) are particularly appropriate for measurements that depend on various combinations of factors, the effect of which is not readily quantifiable. Herein, these methods are employed to identify the changes in the spectra explicitly attributed to the polymer sizes since the samples are not pretreated. As several factors beyond the size influence, e.g., concentrations of monomer, inhibitor, surfactant, and other partaking species in the reaction, influence the spectra, the proposed methods act as a nonlinear filter that isolates (in a sense, disentangles) the spectral changes that are attributed to the differences in polymer size. The common starting point for the machine learning approaches proposed in this work addresses the reduction of the high dimensionality of spectra, here achieved with DMAPs. The DMAPs algorithm is based upon (mathematical) diffusion processes on the data and facilitates discovering meaningful low-dimensional intrinsic geometric descriptions of data sets, even when the data is high-dimensional, nonlinear, and corrupted by (relatively small) noise. The algorithm has been successfully used for dimensionality reduction in applications relevant to reaction engineering in 24, 25, among others. We propose its use as an effective dimensionality reduction of full spectra, consisting of similar-to\sim 11,000 wavenumbers. This reduction enables efficient interpolation and regression since much fewer (typically <<<10) variables are involved. Once the low-dimensional representation of the spectra is determined in an offline step, it is possible to translate between coordinates in the ambient (spectra) and the reduced space (DMAP coordinates). The mapping from high to low dimension is achieved with the Nyström extension, 26, 27 whereas for the inverse, a particular implementation of Geometric Harmonics, called DoubleDMAPs, is selected. 28 Accurate reconstruction of the data set from the selected DMAP coordinates indicates that the latter is an adequate low-dimensional parameterization. After the dimensionality reduction, three alternative machine learning workflows (i𝑖iitalic_i) to (iii𝑖𝑖𝑖iiiitalic_i italic_i italic_i) are compared. The first approach involves regression analysis to estimate the relationship between the latent variables (DMAP coordinates) and the polymer size (cf. Approach 1 denoted as “Directly from DMAPs” in Fig. 1). For the implementation of AltDMAPs, a common variable between spectra (in their reduced representation) and polymer size is found here as an offline step (cf. Step 2.a in Fig. 1). Then, the overall online computational workflow to predict the polymer size from a new spectrum starts by determining its low-dimensional DMAP coordinates with the Nyström extension. From that, the common variable, the AltDMAP coordinate, is predicted with an ANN or other regression methods, such as XGBOOST (cf. Step 2.a in Fig. 1). Finally, the corresponding size is predicted from the AltDMAP coordinates with an ANN or with DoubleDMAPs (cf. Step 2.b in Fig. 1). In approach (iii𝑖𝑖𝑖iiiitalic_i italic_i italic_i), we exploit recent advances in conformal autoencoder neural network techniques. The DMAP coordinates are used as inputs (and also outputs) to a Y-shaped autoencoder to disentangle the dependencies of the latent variables discovered by a traditional autoencoder architecture (cf. Step 3.a in Fig. 1) and define the desired output, i.e., the polymer size, as a function of a latent variable (cf. Step 3.b in Fig. 1). Ultimately, given a new set of DMAPs, this “designer” neural network (NN) predicts the corresponding polymer size (and reconstruct the DMAPs themselves). We compare the proposed workflows with state-of-the-art techniques and show that it is essential to not only find a generic parsimonious low-dimensional parameterization of the data (here achieved with DMAPs) but to find the appropriate one, possessing a component that the polymer size can be written as a function of. We demonstrate that although the number of pairs of microgel size and spectral measurements is moderate (47, i.e., at least as high as in previous works, Tab. 1), the workflow is a promising direction in predicting polymer size in-line from Raman spectra. The remainder of the article is structured as follows: First, the process of data collection is presented along with an overview of the state-of-the-art methods for polymer size prediction from spectra. Subsequently, the DMAPs and AltDMAPs methods are summarized, followed by a detailed description of the proposed workflow. The conformal autoencoder architecture is then presented, building on the successful dimensionality reduction achieved with DMAPs. Finally, results and conclusions are drawn from the proposed implementation.

Methods

The following sections include the description of the data set used in this contribution and the applied methods for size prediction: benchmark methods, and the proposed workflows, including the DMAPs approach, AltDMAPs workflow, and the conformal autoencoder.

Data

We employ data from our samples taken from continuous synthesis of microgels. 29 The considered microgels here are based on N-isopropylacrylamide and cross-linked via N,N’-methylenebis(acrylamide). The reactor and measurement setup for the continuous synthesis are explained in our previous work. 30 Using the continuous flow reactor, microgels of different sizes are synthesized by changing the reactor temperature and flow rates and the initiator and surfactant concentration. As microgels are known to be of monodisperse size, 31, 32, 33 they represent an excellent system to study polymer size predictions from Raman spectroscopy. In our previous works, 29 we conducted in-line Raman spectra at reaction temperature at 60 °C to 80 °Crangetimes60degreeCelsiustimes80degreeCelsius60\text{\,}\mathrm{\SIUnitSymbolCelsius}80\text{\,}\mathrm{\SIUnitSymbolCelsius}start_ARG start_ARG 60 end_ARG start_ARG times end_ARG start_ARG °C end_ARG end_ARG to start_ARG start_ARG 80 end_ARG start_ARG times end_ARG start_ARG °C end_ARG end_ARG and DLS measurements at 50 °Ctimes50degreeCelsius50\text{\,}\mathrm{\SIUnitSymbolCelsius}start_ARG 50 end_ARG start_ARG times end_ARG start_ARG °C end_ARG. In contrast, in the present work, we acquire additional off-line Raman measurements of the same samples but at 20 °Ctimes20degreeCelsius20\text{\,}\mathrm{\SIUnitSymbolCelsius}start_ARG 20 end_ARG start_ARG times end_ARG start_ARG °C end_ARG and restricted conditions. The restrictions include measuring the samples all within a small amount of time (over two experimentation days) and in glass vials filled to the exact same fluid level to ensure equal conditions for the acquisition of all spectra and to eliminate external influences on the measurements. Consequently, we also conduct further DLS measurements at 20 °Ctimes20degreeCelsius20\text{\,}\mathrm{\SIUnitSymbolCelsius}start_ARG 20 end_ARG start_ARG times end_ARG start_ARG °C end_ARG. The data consist of Raman spectra and DLS measurements of microgel samples. The samples are taken from the output of the continuous flow reactor, which runs at different experimental conditions for each sample. The samples are measured off-line without further treatment, e.g., filtration or dialysis. In total, we use the data from 47 samples at different microgel sizes in the range of 208 nm to 483 nmrangetimes208nanometertimes483nanometer208\text{\,}\mathrm{nm}483\text{\,}\mathrm{nm}start_ARG start_ARG 208 end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG end_ARG to start_ARG start_ARG 483 end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG end_ARG in diameter. The determined polydispersity index of these microgels ranges well below 0.368, indicating monodisperse size distribution. Microgels have a different size depending on a threshold in temperature: above approximately 32 °Ctimes32degreeCelsius32\text{\,}\mathrm{\SIUnitSymbolCelsius}start_ARG 32 end_ARG start_ARG times end_ARG start_ARG °C end_ARG, they occur in a collapsed state; below the threshold temperature, they occur in a swollen state. Hence, the microgel sizes at 20 °Ctimes20degreeCelsius20\text{\,}\mathrm{\SIUnitSymbolCelsius}start_ARG 20 end_ARG start_ARG times end_ARG start_ARG °C end_ARG are almost twice as big as at the reaction temperature. Each sample is measured three times via Raman spectroscopy. Raman spectra are taken with an acquisition time of 40 stimes40second40\text{\,}\mathrm{s}start_ARG 40 end_ARG start_ARG times end_ARG start_ARG roman_s end_ARG using an RXN2 Raman Analyzer (Kaiser Optical Systems, Ann Arbor, Michigan, USA) with cosmic ray correction. DLS measurements of the samples diluted in ultrapure water are conducted via the Zetasizer Ultra (Malvern Panalytical, Malvern, UK) at 20 °Ctimes20degreeCelsius20\text{\,}\mathrm{\SIUnitSymbolCelsius}start_ARG 20 end_ARG start_ARG times end_ARG start_ARG °C end_ARG with a scattering angle of 90 °times90degree90\text{\,}\mathrm{\SIUnitSymbolDegree}start_ARG 90 end_ARG start_ARG times end_ARG start_ARG ° end_ARG. Each DLS measurement is repeated four times, and the software ZS Xplorer analyzes the scattering intensities. The Raman spectra comprise the Raman intensity measured over a range of wavelengths. The global range is between 100 cm1 to 3425 cm1rangetimes100centimeter1times3425centimeter1100\text{\,}{\mathrm{cm}}^{-1}3425\text{\,}{\mathrm{cm}}^{-1}start_ARG start_ARG 100 end_ARG start_ARG times end_ARG start_ARG power start_ARG roman_cm end_ARG start_ARG - 1 end_ARG end_ARG end_ARG to start_ARG start_ARG 3425 end_ARG start_ARG times end_ARG start_ARG power start_ARG roman_cm end_ARG start_ARG - 1 end_ARG end_ARG end_ARG correlating to 11,084 wavelengths per spectrum. Different spectra pretreatment methods can be applied to the spectral data. We compare using raw spectra and spectra with a linear fit or rubber band baseline correction in combination with standardization in the form of either Standard Normal Variate (SNV) or Min-Max normalization. The experimental data set is published open access 34 and comprises raw and pretreated Raman and evaluated DLS data. We use the same data set for all subsequently described methods to predict microgel size from Raman spectra. Out of the 47 pairs of microgel size and Raman spectra, we take 40 for training and 7 for testing. The same split is applied to quantify the prediction performance of each considered method (state-of-the-art methods and our proposed workflow with nonlinear methods). We conduct the training for each prediction method with 10-fold cross-validation, which involves splitting the training set into 10 smaller sub-sets and using 9 for training and 1 for testing. By repeating this process, using a different collection of sub-sets for training and validation each time, it is possible to define the best possible model hyper-parameters without sacrificing a lot of data. The number of hyper-parameters varies depending on the prediction method. Hence, the set of hyper-parameters for the individual method is described for each method separately in the following sections. The prediction performance of each method is evaluated based on commonly applied metrics: coefficients of determination (R2), root mean squared error (RMSE), and mean absolute percentage error (MAPE) for training and testing. The prediction accuracy is reflected in the %percent\%%-error calculated as:

%error=100DHpredictedDHactualDHactual,\%\mathrm{-error}=100\cdot\frac{D_{\mathrm{H}}^{\mathrm{predicted}}-D_{\mathrm% {H}}^{\mathrm{actual}}}{D_{\mathrm{H}}^{\mathrm{actual}}},% - roman_error = 100 ⋅ divide start_ARG italic_D start_POSTSUBSCRIPT roman_H end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_predicted end_POSTSUPERSCRIPT - italic_D start_POSTSUBSCRIPT roman_H end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_actual end_POSTSUPERSCRIPT end_ARG start_ARG italic_D start_POSTSUBSCRIPT roman_H end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_actual end_POSTSUPERSCRIPT end_ARG , (1)

where DHsubscript𝐷HD_{\mathrm{H}}italic_D start_POSTSUBSCRIPT roman_H end_POSTSUBSCRIPT is the microgel’s hydrodynamic diameter. Based on the %percent\%%-error the MAPE is calculated as the sum of the %percent\%%-errors divided by the number of observations.

Benchmark Methods for Polymer Size Prediction from Raman Spectra

To benchmark our proposed method, we compare it against two state-of-the-art methods. These methods include the direct application of PLS to the spectral intensities and the application of PLS to fitted IHM parameters.

Partial Least Squares Regression of Spectral Intensities

As conducted in the literature, 11, 13, 14, 15 we apply a partial least squares (PLS) model regression directly to the spectral data as first introduced by Ito et al.. 13 We consider different spectral ranges for the calibration of our PLS models. These ranges include the global range and the so-called fingerprint (FP) region between 850 cm1 to 1800 cm1rangetimes850centimeter1times1800centimeter1850\text{\,}{\mathrm{cm}}^{-1}1800\text{\,}{\mathrm{cm}}^{-1}start_ARG start_ARG 850 end_ARG start_ARG times end_ARG start_ARG power start_ARG roman_cm end_ARG start_ARG - 1 end_ARG end_ARG end_ARG to start_ARG start_ARG 1800 end_ARG start_ARG times end_ARG start_ARG power start_ARG roman_cm end_ARG start_ARG - 1 end_ARG end_ARG end_ARG. Also, we apply pretreatment methods, combining two different types of baseline subtractions (linear fit and rubber band) and two normalization approaches (MinMax and SNV). Further, we normalize the data using the zscore function in Matlab. We analyze the results based on the metrics R2, RMSE, and MAPE for calibration and validation. Based on the MSE for cross-validation, we chose the number of components (latent variables) for the PLS regression.

Regression of Hard Model Parameters

We conduct the regression of fitted hard model parameters for predicting microgel sizes, as we proposed in previous works. 16 First, an IHM 8 is developed. For that model, the spectral range of the FP is considered. The range between 1552 cm1 to 1560 cm1rangetimes1552centimeter1times1560centimeter11552\text{\,}{\mathrm{cm}}^{-1}1560\text{\,}{\mathrm{cm}}^{-1}start_ARG start_ARG 1552 end_ARG start_ARG times end_ARG start_ARG power start_ARG roman_cm end_ARG start_ARG - 1 end_ARG end_ARG end_ARG to start_ARG start_ARG 1560 end_ARG start_ARG times end_ARG start_ARG power start_ARG roman_cm end_ARG start_ARG - 1 end_ARG end_ARG end_ARG is excluded as it is attributed to an atmospheric oxygen signal. Besides the range restrictions, no further pretreatment is usually applied as per our findings. 16 However, we compare the PLS performance of the IHM parameters with and without pretreatment for a comprehensive comparison. The applied pretreatment for the comparison is MinMax or SNV normalization and linear fit or rubber band baseline subtraction. The IHM prediction model is calibrated using calibration measurements from our previous work. 35 The model includes component hard models for the monomer, polymer, and water and a linear baseline. The hard model of each component consists of multiple characteristic peaks. The individual peaks are characterized by four parameters: position, intensity, shape (fraction of the Gaussian part), and half-width at half maximum (HWHM). The complete indirect hard model combines the component models with their distinctive peaks. The indirect hard model parameters are adjusted to suit the spectra of interest within the fitting process. The applied fitting mode constitutes a medium interaction method, where the weights of the components, the baseline, and the peak positions are varied. The weights of the components represent the magnitude of the individual component in the spectra. Each component (monomer, polymer, water) is accredited with one weight parameter during the model fitting process. The incorporated linear baseline is fitted with regard to its offset and slope. In this context, we restrict component shifts to avoid ambiguities due to overlapping spectral peak positions. The fitting mode follows our previous work. 16 The medium interaction fitting mode results in 49 modified parameter values (intercept and slope of the baseline, one weight for each of the three components, 23 monomer peak positions, 17 polymer peak positions, and four water peak positions) that serve as the input variables to the PLS regression model. In addition, we compare the medium fitting mode with the fitting mode with high interaction. For the high interaction, in addition to the changes in the medium interaction, all peak parameters can be varied within the fitting process. Thus, the high interaction method results in 181 modified parameter values (intercept and slope of the baseline, one weight for each of the three components, 23 monomer peaks, 17 polymer peaks, and four water peaks, where each peak is characterized by the four parameters described previously). The fitted IHM parameter values serve as input to the subsequent PLS regression. Again, we normalize the data using the zscore function in Matlab. We also analyze the results of the PLS regression based on the R2, RMSE, and MAPE values for the hybrid modeling approach, combining IHM and PLS. Based on the MSE for cross-validation, we choose the number of components for the PLS regression.

Diffusion Maps

The following paragraphs highlight the functionality of DMAPs for dimensionality reduction. In addition, we establish a workflow for predicting polymer sizes directly from DMAP coordinates.

Diffusion Maps for Dimensionality Reduction

The DMAPs framework has been shown to facilitate the discovery of meaningful low-dimensional intrinsic geometric descriptions of data sets .25, 28, 36 For a detailed description of the method, the interested reader is referred to the seminal papers 17, 37 and also, among others, 19, 24, 25, 28. Here, we use DMAPs to discover the dimensionality of the manifold that contains a collection of Raman spectra of microgel samples. Furthermore, DMAPs discover data-driven coordinates on (i.e., parameterizations of) the low-dimensional manifold where the data reside. These coordinates do not necessarily have a physical meaning, i.e., they do not have to correlate to any physical quantity. The coordinates of the manifold are a few of the leading eigenvectors, ϕisubscriptitalic-ϕ𝑖\phi_{i}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, of a scaled affinity matrix, which contains the Euclidean distances between all the pairs of available data points. Discovering which eigenvectors parameterize independent directions and do not span the same direction with different frequencies (harmonics) is essential. To distinguish independent eigenvectors, the local linear regression algorithm can be used as proposed by Dsilva et al. 38 Notably, the proposed approach includes reverse mapping from the DMAP coordinates to the variables in ambient space, which allows for the translation between the high and low-dimensional data description. To this end, Geometric Harmonics are proposed, introduced initially in 17, as a scheme for extending functions defined on data X, f(𝐗):𝐗𝐑:𝑓𝐗𝐗𝐑f(\textbf{X}):\textbf{X}\to\textbf{R}italic_f ( X ) : X → R, for xnew𝐗subscript𝑥𝑛𝑒𝑤𝐗x_{new}\notin\textbf{X}italic_x start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ∉ X. Here, the DoubleDMAPs 28 algorithm, a particular implementation of Geometric Harmonics, is selected. DoubleDMAPs are suitable for low-dimensional data that can be parameterized by just a few non-harmonic eigenvectors. Generally, Geometric Harmonics construct an input-output mapping between the ambient coordinates X and a function of interest f𝑓fitalic_f defined on X by operating directly on the non-harmonic DMAP coordinates.

Prediction directly from DMAPs

Establishing a parsimonious embedding of the spectra enables polymer size prediction directly from the DMAP coordinates. The direct prediction based on DMAP coordinates is achieved here with two different regression methods: (a) NN and (b) XGBOOST, using the DMAP coordinates as input and the polymer size as output. The schematic representation of the direct workflow is shown in Fig. 2. This direct approach is expected to perform well when the latent variables derived by DMAPs possess a one-to-one dependence on the desired observable, here the polymer size.

Alternating Diffusion Maps Workflow

The following sections introduce to the AltDMAPs algorithm and describe the prediction workflow based on AltDMAPs for identifying common variables between the spectra and the polymer size.

Alternating Diffusion Maps

AltDMAPs is a method, based on DMAPs, designed to parameterize the common variable between independent sets of observations, {si(1),si(2)}subscriptsuperscript𝑠1𝑖subscriptsuperscript𝑠2𝑖\{s^{(1)}_{i},s^{(2)}_{i}\}{ italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_s start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }. The method is introduced by Lederman et al., 20 and the interested reader is referred to this paper and also to 39 for a detailed presentation of the method. Here, only key points are presented for completeness. At the core of the methodology lies the established DMAPs algorithm, which starts with a data set of N𝑁Nitalic_N individual points (represented as d𝑑ditalic_d-dimensional real vectors, x1,,xNsubscript𝑥1subscript𝑥𝑁x_{1},...,x_{N}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT). A similarity measure between each pair of vectors (xi,xj)subscript𝑥𝑖subscript𝑥𝑗(x_{i},x_{j})( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ), is computed using the standard Euclidean distance, based on which an affinity matrix is constructed. A popular choice is the Gaussian kernel w(i,j)=exp[(xixjϵ)2]𝑤𝑖𝑗superscriptnormsubscript𝑥𝑖subscript𝑥𝑗italic-ϵ2w(i,j)=\exp\left[-\left(\frac{||x_{i}-x_{j}||}{\epsilon}\right)^{2}\right]italic_w ( italic_i , italic_j ) = roman_exp [ - ( divide start_ARG | | italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | | end_ARG start_ARG italic_ϵ end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] where ϵitalic-ϵ\epsilonitalic_ϵ is a hyper-parameter that quantifies the kernel’s bandwidth. To recover a parameterization insensitive to the sampling density, the normalization

𝐖~=𝐏1𝐖𝐏1~𝐖superscript𝐏1superscript𝐖𝐏1\widetilde{\textbf{W}}=\textbf{P}^{-1}\textbf{W}\textbf{P}^{-1}over~ start_ARG W end_ARG = P start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_W bold_P start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT

is performed, where Pii=j=1NWijsubscript𝑃𝑖𝑖superscriptsubscript𝑗1𝑁subscript𝑊𝑖𝑗P_{ii}=\sum_{j=1}^{N}W_{ij}italic_P start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_W start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT and Wijsubscript𝑊𝑖𝑗W_{ij}italic_W start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT the elements of the matrix W. A second normalization applied on 𝐖~~𝐖\widetilde{\textbf{W}}over~ start_ARG W end_ARG,

𝐊=𝐃1𝐖~𝐊superscript𝐃1~𝐖\textbf{K}=\textbf{D}^{-1}\widetilde{\textbf{W}}K = D start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over~ start_ARG W end_ARG (2)

gives a N×N𝑁𝑁N\times Nitalic_N × italic_N Markov matrix K; here D is a diagonal matrix, collecting the row sums of the matrix 𝐖~~𝐖\widetilde{\textbf{W}}over~ start_ARG W end_ARG eigenvectors ϕisubscriptitalic-ϕ𝑖\phi_{i}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Based to the DMAPs algorithm, the AltDMAPs algorithm defines two weighted kernel matrices, as defined in equation (2): one that is based on the measurements from s(1)superscript𝑠1s^{(1)}italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT and the other based on the measurements from s(2)superscript𝑠2s^{(2)}italic_s start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT, and constructs the alternating-diffusion operator as the product of the two normalized weight matrices:

𝐊alt=𝐊(1)𝐊(2).superscript𝐊𝑎𝑙𝑡superscript𝐊1superscript𝐊2\textbf{K}^{alt}=\textbf{K}^{(1)}\textbf{K}^{(2)}.K start_POSTSUPERSCRIPT italic_a italic_l italic_t end_POSTSUPERSCRIPT = K start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT K start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT .

It has been shown that the diffusion process defined by this operator is equivalent to one that would have been created from measurements of only the common variable 20. This finding can be further explained in terms of alternating propagation steps using K(1)superscript𝐾1K^{(1)}italic_K start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT followed by K(2)superscript𝐾2K^{(2)}italic_K start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT. Each propagation step can be considered a Markov chain that transitions to similar samples, as prescribed by the respective similarity matrix, first in the first sample set, followed by the second.

Prediction Workflow based on AltDMAPs

The proposed machine-learning workflow consists of two offline learning steps, summarized in Fig. 3, that take place once and create the tools for prediction, and three online application steps, shown schematically in Fig. 4, that are implemented in line with the actual process for fast predictions. The two offline steps are:

  • Offline Step 1: Dimensionality reduction of the original collection of spectra, consisting of similar-to\sim 11,000 wavenumbers via DMAPs. The goal is to reduce the number of variables to ideally <<< 10 and thus to re-state the high-dimensional data set in a low-dimensional coordinate system, parameterized by a small number of selected eigenvectors of the kernel matrix defined in equation (2). The eigenvectors can be selected with the help of the local linear regression algorithm. In this work the eigenvectors are selected based on how accurately the high-dimensional variables are reconstructed, as quantified by an L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT-norm of the difference between predicted and actual spectra.

  • Offline Step 2: Parameterizing the effect of polymer size on the spectra with the AltDMAPs algorithm. In this implementation of AltDMAPs, the spectra and the polymer size measurements are considered as the two independent sensor measurements (s(1)superscript𝑠1s^{(1)}italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT and s(2)superscript𝑠2s^{(2)}italic_s start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT), respectively. The outcome of this algorithm represents the common variable(s) between the two modalities, i.e., the polymer size.

Having established a reduced representation of the spectra and the common variable between the spectra and the size measurements, we proceed with the online prediction steps for a new spectrum. A schematic representation of the online workflow for size prediction from spectra is presented in Fig. 4. To this end, three steps are proposed, designed to yield very fast (practically instantaneous) predictions, without further tuning of the model parameters, in an automated manner:

  • Online Step 1: Compute the DMAP coordinates of a new spectrum using the Nyström extension.

  • Online Step 2: Predict the common AltDMAPs variable(s) from the DMAP coordinates of the new spectrum:

    AltDMAPi=fAltD(DMAPsi)subscriptAltDMAP𝑖subscript𝑓𝐴𝑙𝑡𝐷subscriptDMAPs𝑖\text{AltDMAP}_{i}=f_{AltD}(\text{DMAPs}_{i})AltDMAP start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_f start_POSTSUBSCRIPT italic_A italic_l italic_t italic_D end_POSTSUBSCRIPT ( DMAPs start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (3)

    i𝑖iitalic_i here is the i-th new spectrum. The function fAltDsubscript𝑓𝐴𝑙𝑡𝐷f_{AltD}italic_f start_POSTSUBSCRIPT italic_A italic_l italic_t italic_D end_POSTSUBSCRIPT can be approximated by various methods, e.g., a fully connected NN, DoubleDMAPS, or XGBOOST.

  • Online Step 3: The polymer size, DHpredictedsuperscriptsubscript𝐷HpredictedD_{\mathrm{H}}^{\mathrm{predicted}}italic_D start_POSTSUBSCRIPT roman_H end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_predicted end_POSTSUPERSCRIPT is predicted as a function of the AltDMAP coefficients as:

    DH,ipredicted=fsize(AltDMAPi)superscriptsubscript𝐷Hipredictedsubscript𝑓𝑠𝑖𝑧𝑒subscriptAltDMAP𝑖D_{\mathrm{H,i}}^{\mathrm{predicted}}=f_{size}(\text{AltDMAP}_{i})italic_D start_POSTSUBSCRIPT roman_H , roman_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_predicted end_POSTSUPERSCRIPT = italic_f start_POSTSUBSCRIPT italic_s italic_i italic_z italic_e end_POSTSUBSCRIPT ( AltDMAP start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (4)

    the function fsizesubscript𝑓𝑠𝑖𝑧𝑒f_{size}italic_f start_POSTSUBSCRIPT italic_s italic_i italic_z italic_e end_POSTSUBSCRIPT can be approximated by a feed-forward NN or XGBOOST trained with AltDMAPs as input and polymer size as output.

Conformal Autoencoders: Y-shaped Architectures

Another alternative workflow to predict polymer sizes from Raman spectra involves an ensemble of concurrently trained NNs called Y-shaped conformal autoencoders. We use these Y-shaped conformal autoencoders to predict polymer sizes based on DMAP coordinates. The schematic representation of the workflow, including the Y-shaped autoencoder, is presented in Fig. 4(a). The Y-shaped autoencoder scheme, initially proposed in 23, is summarized here as it was adjusted for the current application. More details on the implementation and training of this machine learning technology are presented in 23. At the core of the scheme lies a regular autoencoder, i.e., a NN where the inputs and outputs are the same, with the addition of an extra “sideways” NN component, as explained below. Overall, the Y-shaped scheme comprises three connected subnetworks (illustrated in Fig. 4(b)):

  • Encoder, NN1, which maps the DMAP coordinates, ϕisubscriptitalic-ϕ𝑖\phi_{i}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, to the autoencoder latent variables, νisubscript𝜈𝑖\nu_{i}italic_ν start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT:
    (ϕ1,ϕ2,ϕ3,ϕ4,ϕ5)(ν1,ν2,ν3,ν4,ν5)maps-tosubscriptitalic-ϕ1subscriptitalic-ϕ2subscriptitalic-ϕ3subscriptitalic-ϕ4subscriptitalic-ϕ5subscript𝜈1subscript𝜈2subscript𝜈3subscript𝜈4subscript𝜈5(\phi_{1},\phi_{2},\phi_{3},\phi_{4},\phi_{5})\mapsto(\nu_{1},\nu_{2},\nu_{3},% \nu_{4},\nu_{5})( italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT ) ↦ ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT )

  • Decoder, NN2, which can be thought of as the inverse transformation from the latent space of the autoencoder (νisubscript𝜈𝑖\nu_{i}italic_ν start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) back to the DMAP coordinates ϕisubscriptitalic-ϕ𝑖\phi_{i}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT:
    (ν1,ν2,ν3,ν4,ν5)(ϕ^1,ϕ^2,ϕ^3,ϕ^4,ϕ^5)maps-tosubscript𝜈1subscript𝜈2subscript𝜈3subscript𝜈4subscript𝜈5subscript^italic-ϕ1subscript^italic-ϕ2subscript^italic-ϕ3subscript^italic-ϕ4subscript^italic-ϕ5(\nu_{1},\nu_{2},\nu_{3},\nu_{4},\nu_{5})\mapsto(\hat{\phi}_{1},\hat{\phi}_{2}% ,\hat{\phi}_{3},\hat{\phi}_{4},\hat{\phi}_{5})( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT ) ↦ ( over^ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , over^ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , over^ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , over^ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT , over^ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT )

  • Polymer size estimator, NN3, which maps the right number of autoencoder latent variables, here one of them, ν1subscript𝜈1\nu_{1}italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, to the observed output data, here the polymer size:
    (ν1)DHmaps-tosubscript𝜈1subscript𝐷H(\nu_{1})\mapsto D_{\mathrm{H}}( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ↦ italic_D start_POSTSUBSCRIPT roman_H end_POSTSUBSCRIPT

The key feature is the loss function, consisting of several parts. Successful reconstruction of the input original parameters (the autoencoder part) constitutes the first part. Noteably, the reconstructed inputs, i.e., the output of NN2, are not required further down in our analysis. Nevertheless, the accurate reconstruction of the inputs is important for the prediction step, because it ensures that the bottleneck variables, of which one is used for prediction, form a low-dimensional embedding of the data. This implies that the bottleneck variables are indeed a parsimonious representation of the original high-dimensional spectra. In theory, this autoencoder part of the NN architecture could also be used for dimensionality reduction of the original high-dimensional data by appropriately pre-selecting the size of the bottleneck layer. However, in the presented case study, the high dimension of the original data set (each spectrum contains similar-to\sim 11,000 wavenumbers) and the relatively limited number of data-points conclude the preferred strategy to first reduce the dimensionality with DMAPs and subsequently implement the neural network with less variables. In the following step, comes the ability of NN3, whose input is the single latent variable, here ν1subscript𝜈1\nu_{1}italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is chosen, to reproduce the observed output, i.e., the polymer size; this defines the polymer size as a function of single input, ν1subscript𝜈1\nu_{1}italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. To concurrently train the different NNs, an additional component to the loss function becomes necessary, which results from further imposing an orthogonality constraint on the conformal autoencoder’s latent coordinates:

dνi,dνj=0,ijformulae-sequence𝑑subscript𝜈𝑖𝑑subscript𝜈𝑗0for-all𝑖𝑗\langle d\nu_{i},d\nu_{j}\rangle=0,\forall i\neq j⟨ italic_d italic_ν start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_d italic_ν start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ⟩ = 0 , ∀ italic_i ≠ italic_j

where dνi𝑑subscript𝜈𝑖d\nu_{i}italic_d italic_ν start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT indicates the vector of partial derivatives of the latent coordinate νisubscript𝜈𝑖\nu_{i}italic_ν start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with respect to of the input parameters ϕisubscriptitalic-ϕ𝑖\phi_{i}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and ,\langle\cdot,\cdot\rangle⟨ ⋅ , ⋅ ⟩ indicates the inner product. This constraint is imposed using the automatic differentiation capabilities of the relevant code libraries and aims to disentangle the combination of features that matters to the output from those combinations of features that do not affect it.

Results

The results comprise the analysis of the size prediction from the benchmark methods, and the developed workflow, including predictions directly from DMAP coordinates via a NN or XGBOOST algorithm, implementation of AltDMAPs and of the Y-shaped autoencoder. The Raman spectra and size predictions from DLS measurements of microgel samples are available for transparency 34. The codes implementing the different workflow steps can be found in the GitLab repository 40.

Partial Least Squares Regression of Spectral Intensities

The exhaustive evaluation of various combinations of pretreatment methods for Raman spectra as the basis for PLS regression is summarized in Tab. 2. Overall, the direct application of PLS regression to the spectral intensities results in poor prediction performances for any pretreatment method. The poor performance is indicated by the R2 values significantly lower than 1 for the training and testing. Even R2 values below zero are encountered. Note that R2 values below zero imply that the prediction would be more accurate using the mean value than the value predicted by the regression model. Comparing the performance of spectra in the FP and global spectral region yields that the prediction performance is independent of the spectral region used. For pretreatment involving MinMax normalization in combination with either linear fit or rubber band subtraction or solely linear fit or rubber band subtraction, the number of latent variables needed for the FP region is consistently higher than for the global region, although the global region comprises more prediction variables. In contrast, for spectra without pretreatment or with pretreatment involving SNV normalization, the number of latent variables needed for the FP region is lower than for the global region throughout. In summary, the predictions using raw spectra with no pretreatment in the global region perform best in training and testing considering all performance metrices. Also, the prediction based on raw spectra in the FP region with no pretreatment shows a relatively good performance indicated by the second-best accumulated R2 value (training and testing) of 1.520 times1.520absent1.520\text{\,}start_ARG 1.520 end_ARG start_ARG times end_ARG start_ARG end_ARG. Thus, the parity plots for these most promising configurations are shown in Fig. 6 in comparison. Here, the gray circles represent the training data, and the red circles represent the test data. Over-fitting is precluded sufficiently, as the discrepancy between actual and predicted size is in the same range for the training and test data set. Furthermore, comparing the PLS results based on the raw spectra in the FP region (Fig. 5(a)) and the global region (Fig. 5(b)) shows no significant improvement in the PLS performance caused by the spectral range. However, the limited size of the data set is probably a restrictive factor and with an extended data set the performance of the state-of-the-art methods might improve. Therefore, the findings regarding the application of the state-of-the-art methods to the employed data set can not be generalized.

Regression of Hard Model Parameters

In Tab. 3, we show the overview of different pretreatment methods in the indirect hard modeling step. The fitted parameter values from the IHM are subsequently used in the PLS regression. We considered a high and medium fitting mode corresponding to more or less degrees of freedom for fitting the IHM evaluation model to the experimental spectra. Similarly to the direct regression on spectral intensities, we find that the overall performance of the predictions is unsatisfying. Again, we determine R2 values significantly lower than 1 for the training and even below zero for testing in some cases. Overall, the medium fitting mode shows a poorer prediction performance than the high fitting mode. With respect to the latent variables, for pretreatment involving rubber band more latent variables are needed in high fitting mode than in medium fitting mode. In contrast to the expected outcome that the high fitting mode necessitates more latent variables, as we need to reduce a larger space of prediction variables, the remaining pretreatment methods not involving rubber band subtraction result in less latent variables for high than for medium fitting mode. Furthermore, no clear trend exists that one aspect of the pretreatment method performs better than the other. In Fig. 7, we present the results of the PLS regression based on IHM parameters from Raman spectra pretreated via SNV and rubber band baseline subtraction and high interaction during the fitting, as this configuration yields the relatively best performance according to Tab. 3. Also, we show the second best performing configuration, namely regression based on spectra pretreated with MinMax normalization and a linear fit subtraction and fitted with high interaction. Similarly to the results from the direct application of PLS, the distribution of gray circles (training data) and the red circles test data) here shows that over-fitting is suppressed. In addition, the comparison of the PLS results based on the IHM parameter values from SNV plus linear fit pretreated spectra (Fig. 6(b)) and spectra pretreated with MinMax normaliztion and a linear fit subtraction (Fig. 6(a)) shows no notable improvement for the spectra pretreated via SNV in the PLS performance. In conclusion, the resulting prediction performance does not indicate reliable prediction accuracy. The reduction of the spectral range to IHM parameters constitutes a physically meaningful approach, but is shown to be insufficient for a limited size of data sets. For IHM in combination with PSL, the performance might improve with an extended data set and the findings can hence not be generalized but apply solely to the employed data sets.

Prediction Directly from Diffusion Maps

Firstly, the DMAPs algorithm is implemented, and six DMAP coordinates, ϕ1subscriptitalic-ϕ1\phi_{1}italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, ϕ2subscriptitalic-ϕ2\phi_{2}italic_ϕ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, ϕ3subscriptitalic-ϕ3\phi_{3}italic_ϕ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, ϕ4subscriptitalic-ϕ4\phi_{4}italic_ϕ start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT, ϕ5subscriptitalic-ϕ5\phi_{5}italic_ϕ start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT, and ϕ6subscriptitalic-ϕ6\phi_{6}italic_ϕ start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT are selected to parsimoniously represent the spectra that live in a high-dimensional ambient space (cf. Fig. 2, Step 1). These are selected based on how accurately the original data can be reconstructed from those latent variables. Here the L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT norm of the difference between the predicted and actual spectra, for the test set, is 3.341063.34superscript1063.34\cdot 10^{-6}3.34 ⋅ 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT. Subsequently, the polymer size is directly predicted from DMAPs coordinates (cf. Fig. 2, Step 2). To this end, a feed-forward NN is trained, having as inputs the DMAP coordinates that correspond to the training data and, as output, the polymer size. The hyper-parameters of the network here and in subsequent sections are fine-tuned using the KerasTuner𝐾𝑒𝑟𝑎𝑠𝑇𝑢𝑛𝑒𝑟KerasTuneritalic_K italic_e italic_r italic_a italic_s italic_T italic_u italic_n italic_e italic_r RandomSearch𝑅𝑎𝑛𝑑𝑜𝑚𝑆𝑒𝑎𝑟𝑐RandomSearchitalic_R italic_a italic_n italic_d italic_o italic_m italic_S italic_e italic_a italic_r italic_c italic_h hyper-parameter optimization framework. The results of direct predictions from DMAP coordinates are presented in Fig. 8. It is possible to achieve regression, with MAPE=7.895%absentpercent7.895=7.895\%= 7.895 % and R2=0.538absent0.538=0.538= 0.538, for the test set (cf. Fig. 7(a)), and it is more challenging to identify the hyper-parameters that prevent over-fitting. Alternatively, the XGBOOST algorithm is used with randomized search on hyper-parameters of the XGBOOST estimator, using RandomizedSearchCV𝑅𝑎𝑛𝑑𝑜𝑚𝑖𝑧𝑒𝑑𝑆𝑒𝑎𝑟𝑐𝐶𝑉RandomizedSearchCVitalic_R italic_a italic_n italic_d italic_o italic_m italic_i italic_z italic_e italic_d italic_S italic_e italic_a italic_r italic_c italic_h italic_C italic_V from the scikit-learn library in Python. The parameters of the estimator are optimized, here and in the implementations presented in subsequent sections by cross-validated search over parameter settings. For efficiency, not all parameter values are tested, but rather a fixed number of parameter settings is sampled from the specified distributions. The performance of the method has similar accuracy to the neural networks (cf. Fig. 7(b), MAPE=6.348%absentpercent6.348=6.348\%= 6.348 % and R2=0.600absent0.600=0.600= 0.600).

Prediction using Alternating Diffusion Maps

The AltDMAPs algorithm is implemented to find common variables, where the collection of spectra is considered as the s1superscript𝑠1s^{1}italic_s start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT (Sensor 1) measurements and the corresponding polymer sizes as s2superscript𝑠2s^{2}italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (Sensor 2). Two significant common variables at AltDMAPs index 2 and 6 are found by implementing the local linear regression algorithm, indicated by a high value of the local linear residual in Fig. 8(a). The common variables parameterize the spectra variability attributed to the polymer size. It is worth noting that two clusters appear in the original data set of raw spectra concerning the spectral intensities (cf. Fig. 8(b)). The common variable AltDMAP 2 is one-to-one with the size in each cluster separately, as visually demonstrated, in Fig. 8(c). Overall, for both clusters, the size is not a function of a single AltDMAP, and the AltDMAP variable with index 6666 appears to be correlated to the polymer size, to some extent, since the local linear residual value (cf. Fig. 8(a)) is non-negligible. Identifying a reduced description of the spectra and common variables between the spectra and the polymer size enables predicting the polymer size corresponding to a new spectrum by following the three online steps in the workflow (cf. Fig. 4). As part of Online Step 1, the DMAP coordinates of a new set of spectra are determined by implementing the Nyström extension. The DMAP coordinates are accurately predicted by the Nyström extension, achieving a mean squared error, MSE=1.58×107absent1.58superscript107=1.58\times 10^{-7}= 1.58 × 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT. Subsequently, in Online Step 2, the common variables, AltDMAPs, are computed as a function of the respective DMAP coordinates of the test set computed in the previous step. Two alternatives are implemented: Double Diffusion Maps, with a prediction error of MSE=0.0248absent0.0248=0.0248= 0.0248, and the Extreme Gradient Boosting (XGBOOST) algorithm, which is slightly more accurate here, achieving MSE=0.0162absent0.0162=0.0162= 0.0162. In practice, including more AltDMAP coordinates for a new spectrum is necessary, here we include the first six AltDMAPs. The final step of the workflow involves predicting the polymer size given the common variables, AltDMAPs. The prediction results using a feed-forward NN and XGBOOST are shown in Fig. 10. Here, the input is the AltDMAP coordinates, and the output is the polymer size. Both methods yield similar results regarding MAPE, RMSE, and R2 for the test set. The performance of the overall online workflow is shown in Fig. 11. Specifically, four combinations of methods are implemented and reported: (i) the size is predicted by a NN from AltDMAPs predicted by Double DMAPs, leads to an R2=0.584absent0.584=0.584= 0.584 and MAPE=8.101%absentpercent8.101=8.101\%= 8.101 % for the test set; (ii) the size is predicted by a NN from AltDMAPs predicted by XGBOOST, leads to an R2=0.330absent0.330=0.330= 0.330 and MAPE=10.058%absentpercent10.058=10.058\%= 10.058 % for the test set; (iii) the size is predicted by the XGBOOST algorithm from AltDMAPs predicted by Double DMAPs, leads to an R2=0.345absent0.345=0.345= 0.345 and MAPE=10.772%absentpercent10.772=10.772\%= 10.772 % for the test set; (iv) the size is predicted by the XGBOOST algorithm from AltDMAPs predicted by XGBOOST, leads to an R2=0.525absent0.525=0.525= 0.525 and MAPE=8.481%absentpercent8.481=8.481\%= 8.481 % for the test set. As shown, the prediction of the polymer particle diameter from the predicted AltDMAPs is less accurate than the prediction of the actual AltDMAPs, which is attributed to the prediction error of the AltDMAPs from the DMAPs. Overall, case (i) is the best-performing one for the available data set for R2 and MAPE. Considering the %percent\%%-error in size prediction, the combination of methods of case (i) is more advantageous regarding the maximum absolute values of the %percent\%%-error and error distribution. Overall, the predictive accuracy directly from DMAPs is slightly better than when implementing the AltDMAP methodology, although they are characterized by the same order of magnitude. Given the size of the available data set and the difference between the performance metrics (R2, MAPE), it is unclear whether this small difference is meaningful enough to persist given a different or larger data set.

Y-shaped Autoencoder

The conformal autoencoder architecture is trained with the same training set as the previous methods: The DMAP coordinates (ϕ1,,ϕ6)subscriptitalic-ϕ1subscriptitalic-ϕ6(\phi_{1},...,\phi_{6})( italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ϕ start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT ) corresponding to the training set are the input to the encoder network NN1. These values are also the target values for the decoder network NN2. We set six latent variables in the bottleneck layer, and we require that the polymer size be defined as a function of ν1subscript𝜈1\nu_{1}italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, by concurrently training the neural network NN3. The performance of this method is superior to both the AltDMAPs workflow and the direct prediction from DMAP coordinates with a R2 value of 0.9510.9510.9510.951 and MAPE value of 2.93%percent2.932.93\%2.93 % for the test set. We believe that the reason for the enhanced performance lies in the fact that the Y-shaped architecture not only finds the latent variables of the data set, as do DMAPs, but with the second NN, specific properties are explicitly imposed on one of them: The polymer size must be a function of one latent variable orthogonal to all other latent variables. The latter property implies that only this one latent variable correlates to the particle diameter, which is a subtle but essential difference from the AltDMAPs approach (two common variables and even more required for accurate online prediction). This hypothesis merits further investigation, which lies beyond the scope of this work.

Comparison of Presented Methods

In Tab. 4, we compare the results from the state-of-the-art to the proposed methods. We cluster the methods in three categories: state-of-the-art, alternating diffusion maps, and prediction directly from DMAP coordinates. The results using the respective methods are described in detail in previous sections. Here, we present the best-forming configuration of pretreatment and spectral range for the direct PLS regression on Raman spectra and PLS regression on IHM parameters. For Raman spectra studied here, the Y-shaped autoencoder outperforms the other considered methods of this work indicated by the only R2 value considerably close to 1111. Also, the low RMSE values of the Y-shaped autoencoder, being approximately a third of the values of the compared methods, indicate improved performance. Lastly, the MAPE value of the autoencoder-based method ranges at 2.930 %times2.930percent2.930\text{\,}\mathrm{\char 37}start_ARG 2.930 end_ARG start_ARG times end_ARG start_ARG % end_ARG, which approximates the precision range of around 2 %times2percent2\text{\,}\mathrm{\char 37}start_ARG 2 end_ARG start_ARG times end_ARG start_ARG % end_ARG expected from DLS 41, the established size measurement device. Thus, even though the evaluation is based on a data set with limited size, for the first time, a purely data-driven evaluation method based on Raman spectra advances the accuracy capability of the established size determination device. The next best-performing methods in the current study include the prediction directly from DMAP coordinates via the XGBOOST algorithm, the PLS regression from IHM parameters, and the direct PLS regression. Each method reaches a similar prediction performance of R2 values of 0.600 times0.600absent0.600\text{\,}start_ARG 0.600 end_ARG start_ARG times end_ARG start_ARG end_ARG, 0.636 times0.636absent0.636\text{\,}start_ARG 0.636 end_ARG start_ARG times end_ARG start_ARG end_ARG, and 0.633 times0.633absent0.633\text{\,}start_ARG 0.633 end_ARG start_ARG times end_ARG start_ARG end_ARG, respectively. Also, their RMSE values are close to each other, ranging at 34.172 nmtimes34.172nanometer34.172\text{\,}\mathrm{nm}start_ARG 34.172 end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG, 34.840 nmtimes34.840nanometer34.840\text{\,}\mathrm{nm}start_ARG 34.840 end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG, and 32.700 nmtimes32.700nanometer32.700\text{\,}\mathrm{nm}start_ARG 32.700 end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG. Finally, the MAPE value of the XGBOOST involving method is with 6.348 %times6.348percent6.348\text{\,}\mathrm{\char 37}start_ARG 6.348 end_ARG start_ARG times end_ARG start_ARG % end_ARG slightly better than the PLS regression based on IHM parameters with 8.573 %times8.573percent8.573\text{\,}\mathrm{\char 37}start_ARG 8.573 end_ARG start_ARG times end_ARG start_ARG % end_ARG and the direct PLS regression with 7.953 %times7.953percent7.953\text{\,}\mathrm{\char 37}start_ARG 7.953 end_ARG start_ARG times end_ARG start_ARG % end_ARG. All methods involving AltDMAPs only achieve an inferior prediction accuracy with R2 values between 0.330  to 0.584 rangetimes0.330absenttimes0.584absent0.330\text{\,}0.584\text{\,}start_ARG start_ARG 0.330 end_ARG start_ARG times end_ARG start_ARG end_ARG end_ARG to start_ARG start_ARG 0.584 end_ARG start_ARG times end_ARG start_ARG end_ARG end_ARG. Within the AltDMAPs involving methods, there is no clear trend suggesting which combination of AltDMAPs prediction with size prediction enhances the performance. On that note, it is worth looking further into the collection of methods that rely on manifold learning (DMAPs, AltDMAPs, and Y-shaped autoencoder), specifically on the number of latent variables required (cf. the last column of Tab. 4) and discuss their unique characteristics. The first observation is that DMAPs are able to meaningfully embed the high dimensional data using six coordinates. Even though a parsimonious embedding of the data is achieved, which is key in making machine learning tasks computationally tractable, it is not a quantitatively accurate predictor. Here, the ability of the AltDMAPs to identify the common variables between the measured observation (the size) and the low dimensional description of the input (Raman spectra) by the DMAP coordinates is beneficial. Consequently, we further reduce the variables that are meaningful for the specific task of predicting polymer size from six to two latent variables, on the data. It becomes clear that “designer” latent spaces, i.e., ones with specific desired characteristics, become useful in disentangling the data features and creating latent variables that specifically map to the observable quantity of choice, here polymer size. This finding motivates the proposed conformal autoencoder architecture. The autoencoder network identifies a single latent variable, which not only maps to the size but is independent of other data features by design (imposed orthogonality condition in the loss function of the NN). This trait of the conformal autoencoder architecture enables our accurate prediction of polymer size here.

Conclusions and Future Work

This contribution considers the important open problem of predicting particle sizes online. We propose the prediction of monodisperse polymer sizes from Raman spectra via a data-driven approach. We apply the approach to our open-access data set of continuous microgel synthesis and demonstrate its capability. Further, we compare our approach against two state-of-the-art benchmark methods to underline the excellent prediction performance of our nonlinear approach. All proposed nonlinear approaches rely on dimensionality reduction via DMAP coordinates. We find that the machine learning workflow combining the data reduction capability of the DMAPs algorithm with the recent advances of Y-shaped autoencoder outperforms all alternatively considered workflows significantly. The remaining proposed methods with altering configurations of involved algorithms accomplish prediction performances comparable to state-of-the-art linear methods. In contrast, the Y-shaped autoencoder approach enables drastically better prediction accuracy, similar to the established size measurement methods such as DLS. Therefore, we derive an algorithmic workflow for prediction of particle sizes from in-line measurements that is competitive with off-line analytical methods. Predicting polymer sizes directly from in-line Raman spectra taken from untreated samples, as labor-intensive DLS processing is circumvented and online reaction monitoring for closed-loop control is enabled. Furthermore, with the proposed method the spectra are not manipulated by spectral pretreatment, thus, no expert knowledge is necessary. In contrast to established machine learning workflows, the proposed algorithm exhibits high efficiency, as the algorithmic filtering enables a proficient prediction performance based on a data set of limited size. This aspect makes the proposed workflow especially relevant, as requiring experimental data is laborious. In addition, the workflow typically relies on less than ten coordinates in the reduced component space. Here, only the first six DMAP coordinates enable us to use the entire wavelength spectrum without exclusions, which would otherwise require problem-specific intuition. Future works include the application of the proposed workflow to predict polymer concentrations and sizes simultaneously to highlight the application of our proposed readily available analysis tool. The simultaneous prediction allows a more comprehensive characterization via a single in-line process analytical tool. Furthermore, subsequent investigations include the open challenge of predicting size distributions from in-line Raman spectroscopy. In addition, future considerations involve extending the workflow to other systems beyond the presented application, which can involve crystallization processes, amongst others.

Author contributions

E.D.K.: developed, implemented, and applied the DM, ADM, and Y-shaped autoencoder frameworks, developed the heuristic pre-processing, wrote initial draft; L.F.K.: performed experimental design, preprocessed data, implemented and applied PLS and hybrid PLS+IHM method, wrote initial draft; J.M.M.F.: performed preliminary numerical experiments with DM, guided analysis, edited manuscript; Y.G.K.: guided the DM, ADM and Y-shaped autoencoder frameworks, edited manuscript; A.M.: conceived the idea, initiated project, supervised L.F.K. & J.M.M.F., wrote initial draft.

Acknowledgments

This work was performed as a part of project B4 of the CRC 985 “Functional Microgels and Microgel Systems” funded by Deutsche Forschungsgemeinschaft (DFG). EDK was funded by the Luxembourg National Research Fund (FNR), grant reference 16758846. For the purpose of open access, the authors have applied a Creative Commons Attribution 4.0 International (CC BY 4.0) license to any Author Accepted Manuscript version arising from this submission. The work of YGK was partially supported by the US Air Force Office of Scientific Research. The authors thank Jörn Viell for scientific discussions and feedback on the manuscript and Andrij Pich for useful discussions on future tasks and impact of this work.

Data Availability and Reproducibility Statement

An invention disclosure is submitted and a patent application is planed for the method. The code underlying this contribution is published in a GitLab repository 40. The underlying data set is published via RWTH publications 34, including Raman spectra in un-treated and pre-treated form and the according DLS size predictions. Data from the manuscript’s figures are tabulated in Supplementary Materials.

References

  • 1 Chew W, Sharratt P. Trends in process analytical technology Analytical Methods. 2010;2:1412.
  • 2 Beer T, Burggraeve A, Fonteyne M, Saerens L, Remon JP, Vervaet C. Near infrared and Raman spectroscopy for the in-process monitoring of pharmaceutical production processes International journal of pharmaceutics. 2011;417:32–47.
  • 3 Pomerantsev AL, Rodionova OY. Process analytical technology: a critical view of the chemometricians Journal of Chemometrics. 2012;26:299–310.
  • 4 Simon LL, Pataki H, Marosi G, et al. Assessment of Recent Process Analytical Technology (PAT) Trends: A Multiauthor Review Organic Process Research & Development. 2015;19:3–62.
  • 5 Beer A. Bestimmung der Absorption des rothen Lichts in farbigen Flüssigkeiten Annalen der Physik und Chemie. 1852;162:78–88.
  • 6 Marini F, Bucci R, Magrì AL, Magrì AD. Artificial neural networks in chemometrics: History, examples and perspectives Microchemical Journal. 2008;88:178–185.
  • 7 Garrido M, Rius FX, Larrechi MS. Multivariate curve resolution-alternating least squares (MCR-ALS) applied to spectroscopic data from monitoring chemical reactions processes Analytical and bioanalytical chemistry. 2008;390:2059–2066.
  • 8 Alsmeyer F, Koß H-J, Marquardt W. Indirect spectral hard modeling for the analysis of reactive and interacting mixtures Applied Spectroscopy. 2004;58:975–985.
  • 9 Keidel R, Ghavami A, Lugo DM, et al. Time-resolved structural evolution during the collapse of responsive hydrogels: The microgel-to-particle transition Science advances. 2018;4:eaao7086.
  • 10 Bohren Craig F., Huffman Donald R.. Absorption and Scattering of Light by Small Particles. Wiley 1998.
  • 11 Reis MM, Araújo PHH, Sayer C, Giudici R. Evidences of correlation between polymer particle size and Raman scattering Polymer. 2003;44:6123–6128.
  • 12 van den Brink M, Pepers M, van Herk AM. Raman spectroscopy of polymer latexes Journal of Raman Spectroscopy. 2002;33:264–272.
  • 13 Ito K, Kato T, Ona T. Non-destructive method for the quantification of the average particle diameter of latex as water-based emulsions by near-infrared Fourier transform Raman spectroscopy Journal of Raman Spectroscopy. 2002;33:466–470.
  • 14 Houben C, Nurumbetov G, Haddleton D, Lapkin AA. Feasibility of the Simultaneous Determination of Monomer Concentrations and Particle Size in Emulsion Polymerization Using in Situ Raman Spectroscopy Industrial & engineering chemistry research. 2015;54:12867–12876.
  • 15 Ambrogi PMN, Colmán MME, Giudici R. Miniemulsion Polymerization Monitoring Using Off-Line Raman Spectroscopy and In-Line NIR Spectroscopy Macromolecular Reaction Engineering. 2017;11:1600013.
  • 16 Meyer-Kirschner J, Mitsos A, Viell J. Polymer particle sizing from Raman spectra by regression of hard model parameters Journal of Raman Spectroscopy. 2018;49:1402–1411.
  • 17 Coifman RR, Lafon S. Diffusion maps Applied and Computational Harmonic Analysis. 2006;21:5–30.
  • 18 Nadler B, Lafon S, Coifman RR, Kevrekidis IG. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems Applied and Computational Harmonic Analysis. 2006;21:113–127.
  • 19 Coifman RR, Kevrekidis IG, Lafon S, Maggioni M, Nadler B. Diffusion maps, reduction coordinates, and low dimensional representation of stochastic systems Multiscale Modeling & Simulation. 2008;7:842–864.
  • 20 Lederman RR, Talmon R. Learning the geometry of common latent variables using alternating-diffusion Applied and Computational Harmonic Analysis. 2018;44:509–536.
  • 21 Katz O, Talmon R, Lo Y-L, Wu H-T. Alternating diffusion maps for multimodal data fusion Information Fusion. 2019;45:346-360.
  • 22 Dietrich F, Yair O, Mulayoff R, Talmon R, Kevrekidis IG. Spectral discovery of jointly smooth features for multimodal data SIAM Journal on Mathematics of Data Science. 2022;4:410–430.
  • 23 Evangelou N, Wichrowski NJ, Kevrekidis GA, et al. On the parameter combinations that matter and on those that do not: data-driven studies of parameter (non)identifiability PNAS Nexus. 2022;1:pgac154.
  • 24 Chiavazzo E, Gear CW, Dsilva CJ, Rabin N, Kevrekidis IG. Reduced models in chemical kinetics via nonlinear data-mining Processes. 2014;2:112–140.
  • 25 Koronaki E, Evangelou N, Psarellis YM, Boudouvis AG, Kevrekidis IG. From partial data to out-of-sample parameter and observation estimation with diffusion maps and geometric harmonics Computers & Chemical Engineering. 2023;178:108357.
  • 26 Nyström EJ. Über die praktische Auflösung von linearen Integralgleichungen mit Anwendungen auf Randwertaufgaben der Potentialtheorie. Akademische Buchhandlung 1929.
  • 27 Fowlkes C, Belongie S, Malik J. Efficient spatiotemporal grouping using the Nystrom method in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001;1:I–IIEEE 2001.
  • 28 Evangelou N, Dietrich F, Chiavazzo E, Lehmberg D, Meila M, Kevrekidis IG. Double Diffusion Maps and their Latent Harmonics for Scientific Computations in Latent Space arXiv preprint arXiv:2204.12536. 2022.
  • 29 Kaven LF, Schweidtmann AM, Keil J, Israel J, Wolter N, Mitsos A. Data-driven Product-Process Optimization of N-isopropylacrylamide Microgel Synthesis in Flow arXiv preprint arXiv:2308.16724. 2023.
  • 30 Kaven LF, Wolff HJM, Wille L, Wessling M, Mitsos A, Viell J. In-line Monitoring of Microgel Synthesis: Flow versus Batch Reactor Organic Process Research & Development. 2021;25:2039–2051.
  • 31 Kather M, Ritter F, Pich A. Surfactant-free synthesis of extremely small stimuli-responsive colloidal gels using a confined impinging jet reactor Chemical Engineering Journal. 2018;344:375–379.
  • 32 Wolff HJM, Kather M, Breisig H, Richtering W, Pich A, Wessling M. From Batch to Continuous Precipitation Polymerization of Thermoresponsive Microgels ACS applied materials & interfaces. 2018;10:24799–24806.
  • 33 Janssen FAL, Kather M, Ksiazkiewicz A, Pich A, Mitsos A. Synthesis of Poly(N-vinylcaprolactam)-Based Microgels by Precipitation Polymerization: Pseudo-Bulk Model for Particle Growth and Size Distribution ACS omega. 2019;4:13795–13807.
  • 34 Kaven L, Mitsos A. Dataset to: Nonlinear Manifold Learning Determines Microgel Size from Raman Spectroscopy 2023. doi:10.18154/RWTH-2023-05604.
  • 35 Kaven LF, Wolff HJ, Wille L, Wessling M, Mitsos A, Viell J. Dataset to: In-line Monitoring of Microgel Synthesis: Flow versus Batch Reactor 2021. doi:10.18154/RWTH-2021-09666.
  • 36 Psarellis YM, Lee S, Bhattacharjee T, Datta SS, Bello-Rivas JM, Kevrekidis IG. Data-driven discovery of chemotactic migration of bacteria via machine learning arXiv preprint arXiv:2208.11853. 2022.
  • 37 Coifman Ronald R, Lafon Stéphane. Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions Applied and Computational Harmonic Analysis. 2006;21:31–52.
  • 38 Dsilva CJ, Talmon R, Coifman RR, Kevrekidis IG. Parsimonious representation of nonlinear dynamical systems through manifold learning: A chemotaxis case study Applied and Computational Harmonic Analysis. 2018;44:759–773.
  • 39 Talmon R, Wu H-T. Latent common manifold learning with alternating diffusion: Analysis and applications Applied and Computational Harmonic Analysis. 2019;47:848-892.
  • 40 Koronaki E, Kaven LF, Faust JMM, Kevrekidis IG, Mitsos A. Code to: Nonlinear Manifold Learning Determines Microgel Size from Raman Spectroscopy 2023. https://gitlab.com/eleni.koronaki/mlforpolymersizeraman.git.
  • 41 Ltd. Malvern Instruments. Zetasizer Nano technical note MRK728-01 - The accuracy and precision expected from dynamic light scattering measurements 2006. https://kdsi.ru/upload/iblock/357/be4ecfa4fcce215f870e4acb8eb229d6.pdf.

Figures

Refer to caption
Figure 1: Schematic overview of the proposed machine learning approaches applied to low dimensional parameterization (provided by DMAPs) of measured Raman spectra: (i) in Approach 1 the polymer size is predicted directly from DMAPs, (ii) Approach 2 implements AltDMAPs to find the data-driven common variable that is one-to-one with the polymer size (2.a), and in (2.b) predicts the polymer size from this common variable, and (iii) Approach 3 implements a Y-shaped conformal autoencoder, that identifies a “designer” latent space (3.a) from which it is possible to predict the polymer size (3.b).
Refer to caption
Figure 2: Schematic representation of the prediction strategy directly from the DMAP coordinates that parsimoniously parameterize the spectra.
Refer to caption
(a)
Refer to caption
(b)
Figure 3: Schematic of the offline steps: (a) Step 1: Finding a reduced representation of the spectra with DMAPs; (b) Step 2: Finding the common variable between spectra and polymer size with AltDMAPs.
Refer to caption
Figure 4: Schematic of the online application of size prediction: Given a measured (new) spectrum, the reduced description via DMAPs is determined (Step 1). Then, the common AltDMAP variable is inferred with NNs, DoubleDMAPs, or XGBOOST networks (Step 2). Finally, the polymer size is predicted from the common variable, with NNs or XGBOOST (Step 3).
Refer to caption
(a)
Refer to caption
(b)
Figure 5: Schematic of the Y-shaped conformal autoencoder architecture: (a) Representation of the workflow; (b) Y-shaped autoencoder composition: NN1 is the encoder that maps the DMAP coordinates to the latent variables of the autoencoder; NN2 is the inverse transformation, from the latent space back to the DMAP coordinates; NN3 maps one of the latent variables to the output of interest: polymer size.
Refer to caption
(a) Fingerprint region
Refer to caption
(b) Global region
Figure 6: Parity plots of microgel size predictions via PLS regression of raw spectra in (a) the fingerprint region and (b) the global region. Gray circles represent the training data, and red circles indicate the test data.
Refer to caption
(a) MinMax, linear fit
Refer to caption
(b) SNV, rubber band
Figure 7: Parity plots of microgel size predictions via PLS regression of IHM parameters from spectra fitted via high interaction. Gray circles represent the training data, and red circles indicate the test data.
Refer to caption
(a)
Refer to caption
(b)
Figure 8: Parity plots of microgel size predictions from DMAPs (a) by a neural network and (b) using XGBOOST. Red points correspond to the test set, and gray points correspond to the training set values. The reported accuracy metrics (R2, RMSE, and MAPE) correspond to the test data.
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 9: (a) Local linear regression indicates two significant common variables exist; (b) Plot of the raw spectra included in the data set; two clusters appear with respect to the spectral intensities, colored in red and black; (c) The (normalized) polymer size is plotted over the one common variable (AltDMAP) for both clusters in (b) illustrating the dependence of size on the latent variables (one-to-one within each individual cluster). To be able to predict the size from any spectrum, irrespective of the the cluster it belongs to, more than one AltDMAP is required.
Refer to caption
(a)
Refer to caption
(b)
Figure 10: Parity plots of microgel size predictions from AltDMAPs, with (a) NN, and (b) XGBOOST algorithm. The gray points correspond to the training set data, and the red points correspond to the test data. The reported accuracy metrics correspond to the test data.
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Figure 11: Parity plots of microgel diameter predictions via (a) a NN from DoubleDMAPs-predicted AltDMAPs, (b) a NN from XGBOOST-predicted AltDMAPs, (c) XGBOOST from Double DMAPs-predicted AltDMAPs, and (d) XGBOOST from XGBOOST-predicted AltDMAPs. The gray points correspond to the training set data, and the red points correspond to the test data. The reported accuracy metrics (R2, RMSE, and MAPE) correspond to the test data.
Refer to caption
(a)
Refer to caption
(b)
Figure 12: Prediction from Y-shaped autoencoder architecture: (a) Actual versus predicted polymer size, and (b) %percent\%% error for each one of the data points in the test set. Red points correspond to the test set, and gray points correspond to the training set values. The reported accuracy metrics (R2, RMSE, and MAPE) correspond to the test data.

Tables

Table 1: Overview of state-of-the-art approaches and our proposal (last row) to predict polymer sizes from Raman spectra. The number of data points refers to the amount of samples measured via Raman spectroscopy.
Ref. Method Polymer system

Size
[nm]

# points

Spectrum
[ cm1timesabsentcentimeter1\text{\,}{\mathrm{cm}}^{-1}start_ARG end_ARG start_ARG times end_ARG start_ARG power start_ARG roman_cm end_ARG start_ARG - 1 end_ARG end_ARG]

\bigstrut
13 PLS

styrene, butadiene,
methyl methacrylate, acrylonitrile

80-200 47 100-4000 \bigstrut
12 Focal depth

styrene

42-210 6 3200-3800 \bigstrut
11 PLS

styrene, acrylic acid

55-150 23 400-4000 \bigstrut
14 PLS

styrene, butyl acrylate, acrylic acid, methyl acrylate

20-200 N.S.

150-400,
150-1800

\bigstrut
15 PLS

styrene

50-400 40

150-400,
0-4000

\bigstrut
16 IHM+PLS

styrene

23-60 21 1020-2000 \bigstrut
Nonlinear

N-isopropylacryl-

208-483 47 100-3425 \bigstrut
manifold

amide

learning
Table 2: Performance of PLS regression on Raman spectra with different pretreatment methods. The values for RMSE are given in  nmtimesabsentnanometer\text{\,}\mathrm{nm}start_ARG end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG and for MAPE in %. The intensity level of the gray shading visualizes the quality of the prediction model compared to the remaining models regarding each prediction performance metric: A lower gray shade corresponds to a better performance.

Pretreatment Spectral Training Testing # latent region R2 RMSE MAPE R2 RMSE MAPE variables Linear fit FP 0.9610.9610.9610.961 12.66412.66412.66412.664 2.8822.8822.8822.882 0.4280.4280.4280.428 40.86540.86540.86540.865 8.639 7 Linear fit Global 0.2840.2840.2840.284 54.44954.44954.44954.449 13.07913.07913.07913.079 0.5730.5730.5730.573 35.28835.28835.28835.288 8.419 2 MinMax, linear fit FP 0.8600.8600.8600.860 24.08724.08724.08724.087 5.1065.1065.1065.106 0.1060.1060.1060.106 51.08051.08051.08051.080 12.935 7 MinMax, linear fit Global 0.4160.4160.4160.416 49.15749.15749.15749.157 11.36911.36911.36911.369 0.6960.6960.6960.696 29.79529.79529.79529.795 7.389 3 MinMax, rubber band FP 0.8610.8610.8610.861 23.96023.96023.96023.960 5.3455.3455.3455.345 0.206-0.206-0.206- 0.206 59.32259.32259.32259.322 15.700 6 MinMax, rubber band Global 0.4250.4250.4250.425 48.79448.79448.79448.794 11.23211.23211.23211.232 0.6970.6970.6970.697 29.74029.74029.74029.740 7.496 3 Raw FP 0.7900.7900.7900.790 29.46329.46329.46329.463 6.7506.7506.7506.750 0.7300.7300.7300.730 28.03828.03828.03828.038 7.887 5 Raw Global 0.9420.9420.9420.942 15.47315.47315.47315.473 3.7993.7993.7993.799 0.6330.6330.6330.633 32.70132.70132.70132.701 7.953 8 Rubber band FP 0.9930.9930.9930.993 5.4455.4455.4455.445 1.3591.3591.3591.359 0.041-0.041-0.041- 0.041 55.10955.10955.10955.109 14.329 9 Rubber band Global 0.5090.5090.5090.509 45.07345.07345.07345.073 10.80010.80010.80010.800 0.6620.6620.6620.662 31.40131.40131.40131.401 7.614 3 SNV, linear fit FP 0.1810.1810.1810.181 58.21258.21258.21258.212 14.00414.00414.00414.004 0.6160.6160.6160.616 33.45533.45533.45533.455 8.286 1 SNV, linear fit Global 0.9370.9370.9370.937 16.18416.18416.18416.184 3.9443.9443.9443.944 0.3290.3290.3290.329 44.23044.23044.23044.230 9.777 6 SNV, rubber band FP 0.1750.1750.1750.175 58.43158.43158.43158.431 13.99213.99213.99213.992 0.6200.6200.6200.620 33.30033.30033.30033.300 8.336 1 SNV, rubber band Global 0.9370.9370.9370.937 16.17316.17316.17316.173 3.9423.9423.9423.942 0.3420.3420.3420.342 43.82143.82143.82143.821 9.708 6

Table 3: Performance of PLS regression on IHM parameters from Raman spectra with different pretreatment methods. The values for RMSE are given in  nmtimesabsentnanometer\text{\,}\mathrm{nm}start_ARG end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG and for MAPE in %. The intensity level of the gray shading visualizes the quality of the prediction model compared to the remaining models regarding each prediction performance metric: A lower gray shade corresponds to a better performance.

Pretreatment Fitting Training Testing # latent mode R2 RMSE MAPE R2 RMSE MAPE variables Linear fit Hoch 0.5770.5770.5770.577 41.83441.83441.83441.834 10.14210.14210.14210.142 0.4720.4720.4720.472 39.24539.24539.24539.245 9.0149.0149.0149.014 2 Linear fit Medium 0.3270.3270.3270.327 52.76852.76852.76852.768 12.14512.14512.14512.145 0.3160.3160.3160.316 44.66744.66744.66744.667 11.78511.78511.78511.785 2 Raw Hoch 0.8390.8390.8390.839 25.81725.81725.81725.817 6.4556.4556.4556.455 0.257-0.257-0.257- 0.257 60.56560.56560.56560.565 14.04014.04014.04014.040 6 Raw Medium 0.3270.3270.3270.327 52.78052.78052.78052.780 12.14512.14512.14512.145 0.3290.3290.3290.329 44.23744.23744.23744.237 11.63711.63711.63711.637 2 Rubber band Hoch 0.6460.6460.6460.646 38.25538.25538.25538.255 8.7398.7398.7398.739 0.3350.3350.3350.335 44.04344.04344.04344.043 10.58210.58210.58210.582 6 Rubber band Medium 0.5380.5380.5380.538 43.74543.74543.74543.745 9.9179.9179.9179.917 0.019-0.019-0.019- 0.019 54.51854.51854.51854.518 10.76510.76510.76510.765 8 MinMax, linear fit Hoch 0.7410.7410.7410.741 32.74632.74632.74632.746 7.2657.2657.2657.265 0.2780.2780.2780.278 45.90745.90745.90745.907 9.2439.2439.2439.243 5 MinMax, linear fit Medium 0.3590.3590.3590.359 51.49151.49151.49151.491 11.44911.44911.44911.449 0.3300.3300.3300.330 44.19844.19844.19844.198 11.98011.98011.98011.980 2 MinMax, rubber band Hoch 0.6800.6800.6800.680 36.39636.39636.39636.396 7.9147.9147.9147.914 0.3230.3230.3230.323 44.43344.43344.43344.433 11.07511.07511.07511.075 6 MinMax, rubber band Medium 0.5080.5080.5080.508 45.10845.10845.10845.108 9.9869.9869.9869.986 0.153-0.153-0.153- 0.153 57.98557.98557.98557.985 11.67511.67511.67511.675 8 SNV, linear fit Hoch 0.6810.6810.6810.681 36.31236.31236.31236.312 8.0098.0098.0098.009 0.2870.2870.2870.287 45.59945.59945.59945.599 8.6838.6838.6838.683 4 SNV, linear fit Medium 0.3660.3660.3660.366 51.22051.22051.22051.220 11.41311.41311.41311.413 0.2830.2830.2830.283 45.72345.72345.72345.723 12.38012.38012.38012.380 2 SNV, rubber band Hoch 0.5850.5850.5850.585 41.46541.46541.46541.465 9.5599.5599.5599.559 0.6110.6110.6110.611 33.69133.69133.69133.691 8.2588.2588.2588.258 4 SNV, rubber band Medium 0.5110.5110.5110.511 44.96744.96744.96744.967 10.01510.01510.01510.015 0.104-0.104-0.104- 0.104 56.74156.74156.74156.741 11.60711.60711.60711.607 8

Table 4: Testing performance of all considered prediction methods within this work. The values for RMSE are given in  nmtimesabsentnanometer\text{\,}\mathrm{nm}start_ARG end_ARG start_ARG times end_ARG start_ARG roman_nm end_ARG and for MAPE in %.
Cluster Method R2 RMSE MAPE # latent
variables
State-of-the-art
Best configuration based on PLS
regression directly to Raman spectra
0.633 32.701 7.953 8
Best configuration based on PLS
regression on IHM parameters
0.611 33.691 8.258 4
Alternating Diffusion Maps
Neural network based on
DoubleDMAPs-predicted AltDMAPs
0.584 68.162 8.101 2(+4)
Neural network based on
XGBOOST-predicted AltDMAPs
0.330 70.130 10.058 2 (+4)
XGBOOST based on
DoubleDMAPs-predicted AltDMAPs
0.345 43.717 10.772 2 (+4)
XGBOOST based on
XGBOOST-predicted AltDMAPs
0.525 37.222 8.481 2 (+4)
Prediction directly from DMAP coordinates Neural network 0.538 78.880 7.895 6
XGBOOST 0.600 34.172 6.348 6
Y-shaped autoencoder 0.951 11.991 2.930 1