Optimizing feature selection with gradient boosting machines in PLS regression for predicting moisture and protein in multi-country corn kernels via NIR spectroscopy

Food Chem. 2024 Oct 30:456:140062. doi: 10.1016/j.foodchem.2024.140062. Epub 2024 Jun 10.

Abstract

Differences in moisture and protein content affect both the nutritional value and the processing efficiency of corn kernels. Near-infrared (NIR) spectroscopy can be used to estimate kernel composition, but models trained on only a few environments may underestimate error rates and bias. We assembled corn samples from diverse international environments and used NIR spectroscopy with chemometrics and partial least squares regression (PLSR) to determine moisture and protein content. We assessed the potential of five feature selection methods to improve prediction accuracy by extracting sensitive wavelengths. Gradient boosting machines (GBMs), particularly CatBoost and LightGBM, effectively selected crucial wavelengths for moisture (1409, 1900, 1908, 1932, 1953, 2174 nm) and protein (887, 1212, 1705, 1891, 2097, 2456 nm). SHAP plots highlighted the wavelengths that contributed most to model predictions. These results illustrate the effectiveness of GBMs in feature engineering for agricultural and food sector applications, including the development of multi-country global calibration models for moisture and protein in corn kernels.

Keywords: Component prediction; Corn kernels; Feature selection; Gradient boosting machine (GBM); Near-infrared (NIR) spectroscopy; Partial least squares regression (PLSR); SHapley additive exPlanations (SHAP).

Publication types

  • Evaluation Study

MeSH terms

  • Least-Squares Analysis
  • Plant Proteins* / analysis
  • Plant Proteins* / chemistry
  • Seeds / chemistry
  • Spectroscopy, Near-Infrared* / methods
  • Water* / analysis
  • Water* / chemistry
  • Zea mays* / chemistry

Substances

  • Plant Proteins
  • Water