Evaluating the performance of machine learning and variable selection methods to identify document paper using infrared spectral data

Spectrochim Acta A Mol Biomol Spectrosc. 2025 Feb 15:327:125299. doi: 10.1016/j.saa.2024.125299. Epub 2024 Oct 18.

Abstract

Infrared spectroscopy is a valuable tool for forensic examinations because it enables rapid, nondestructive analysis. Recent advances in machine learning have driven the development of chemometrics, including applications in questioned document examination. In this study, support vector machine (SVM), feedforward neural network (FNN), and random forest (RF) models were constructed from the infrared (IR) spectral data of document paper samples to identify the manufacturer of document paper products. For model training, IR spectral regions were selected according to the variable importance determined by the RF models. Restricting the IR spectral data to the 1500-800 cm⁻¹ range, which was chosen on the basis of these variable importance measures, improved model performance while reducing computational cost. The FNN and RF models trained on the second-derivative IR spectra in this range achieved F1-scores of 0.978 and 1.000, respectively. These findings confirm the potential of machine learning methods for extracting and examining forensic features of document paper, yielding robust models with low computational overhead.
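The workflow summarized above (second-derivative preprocessing, RF-based variable importance, restriction to the 1500-800 cm⁻¹ region, and F1 evaluation) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the spectra are synthetic, the three "manufacturers" and their band positions are hypothetical, and the Savitzky-Golay window, polynomial order, tree count, and train/test split are assumptions that the abstract does not specify.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for IR spectra (assumption: real data would be measured).
wavenumbers = np.arange(4000, 650, -2.0)   # cm^-1 axis, 2 cm^-1 spacing
peak_centres = [1050.0, 1160.0, 1430.0]    # hypothetical class-specific bands
n_per_class = 40


def gaussian(x, centre, width):
    """Gaussian band profile used to build the toy spectra."""
    return np.exp(-0.5 * ((x - centre) / width) ** 2)


X, y = [], []
for label, centre in enumerate(peak_centres):
    for _ in range(n_per_class):
        spectrum = (
            gaussian(wavenumbers, 3340.0, 150.0)         # band shared by all classes
            + 0.5 * gaussian(wavenumbers, centre, 15.0)  # class-specific band
            + rng.normal(0.0, 0.01, wavenumbers.size)    # measurement noise
        )
        X.append(spectrum)
        y.append(label)
X = np.asarray(X)
y = np.asarray(y)

# Second-derivative spectra via a Savitzky-Golay filter (assumed settings).
X_d2 = savgol_filter(X, window_length=15, polyorder=3, deriv=2, delta=2.0, axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X_d2, y, test_size=0.3, stratify=y, random_state=0
)

# Step 1: fit an RF on the full range and read off per-wavenumber importance.
rf_full = RandomForestClassifier(n_estimators=100, random_state=0)
rf_full.fit(X_train, y_train)
importance = rf_full.feature_importances_

# Step 2: keep only the 1500-800 cm^-1 region named in the abstract.
region = (wavenumbers >= 800) & (wavenumbers <= 1500)
share_in_region = importance[region].sum()  # fraction of total importance

# Step 3: retrain on the reduced variable set and score with macro F1.
rf_region = RandomForestClassifier(n_estimators=100, random_state=0)
rf_region.fit(X_train[:, region], y_train)
macro_f1 = f1_score(y_test, rf_region.predict(X_test[:, region]), average="macro")

print(f"variables kept: {region.sum()} of {wavenumbers.size}")
print(f"importance inside region: {share_in_region:.2f}")
print(f"macro F1 on held-out spectra: {macro_f1:.3f}")
```

In this toy setting the region mask cuts the number of input variables by roughly a factor of five, which is the kind of reduction that lowers training cost. The scores it prints describe the synthetic data only and say nothing about the values reported in the study.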

Keywords: Feature importance; Feed-forward neural network (FNN); Questioned document; Random forest (RF); Support vector machine (SVM).