[Soil Cadmium Prediction and Health Risk Assessment of an Oasis on the Eastern Edge of the Tarim Basin Based on Feature Optimization and Machine Learning]

Jing-Yu Liu; Ruo-Yi Li; Yong-Chun Liang; Lei Liu; Fang Yin; Su Tang; Lin-Sen He; Yi Zhang

doi:10.13227/j.hjkx.202308010

Huan Jing Ke Xue. 2024 Aug 8;45(8):4802-4811. doi: 10.13227/j.hjkx.202308010.

[Article in Chinese]

Authors

Jing-Yu Liu¹,², Ruo-Yi Li³, Yong-Chun Liang¹, Lei Liu¹, Fang Yin⁴, Su Tang², Lin-Sen He⁴, Yi Zhang⁵

Affiliations

¹ School of Earth Science and Resources, Chang'an University, Xi'an 710054, China.
² Center of Urumqi Comprehensive Survey Natural Resources, China Geological Survey, Urumqi 830057, China.
³ China Aero Geophysical Survey and Remote Sensing Center for Natural Resources, Beijing 100083, China.
⁴ School of Land Engineering, Chang'an University, Xi'an 710054, China.
⁵ Xi'an Mineral Resources Survey Centre, China Geological Survey, Xi'an 710100, China.

PMID: 39168697

Abstract

Soil heavy metal pollution poses a serious threat to food security, human health, and soil ecosystems. Based on 644 soil samples collected from a typical oasis at the eastern margin of the Tarim Basin, five models, namely multiple linear regression (LR), back-propagation neural network (BP), random forest (RF), support vector machine (SVM), and radial basis function (RBF), were built to predict soil heavy metal content. The best prediction result was then used to analyze the spatial distribution of heavy metal contamination and the associated health risks. The results showed that: ① The average Cd content in the study area was 0.14 mg·kg⁻¹, which was 1.17 times the soil background value of Xinjiang, making Cd the primary contributor to soil heavy metal contamination in the area. The carcinogenic risk coefficients of Cd for both adults and children were less than 10⁻⁴, indicating no significant long-term health risk for humans in the area. ② When the estimation accuracies of the five inversion models were compared, the RF model had the highest validation-set R² (0.7637) and the smallest RMSE, MAE, and MBE. The RF predictions were therefore the most consistent with the measured soil Cd content, and the soil Cd distribution map predicted by the RF model agreed best with the interpolation map. ③ The RF model also outperformed the other four models in predicting the health risks associated with soil Cd for both adults and children. By comparison, the LR predictions on the validation set varied greatly, leading to unreliable results. Overall, RF was the best model for predicting soil Cd content and evaluating health risks in the study area, owing to its superior generalization capability and resistance to overfitting.
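The four accuracy measures used to rank the models can be illustrated with a minimal sketch. It assumes the common definitions: R² as one minus the ratio of residual to total sum of squares, and MBE as the mean of predicted minus measured values. The abstract does not state which conventions the authors used, so the function name `regression_metrics`, the sign convention, and the sample values below are illustrative assumptions, not material from the study.

```python
import math


def regression_metrics(measured, predicted):
    """Return R^2, RMSE, MAE and MBE for paired measured/predicted values.

    Assumed conventions (not confirmed by the abstract):
    R^2 = 1 - SS_res / SS_tot, and error = predicted - measured.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(measured)
    errors = [p - m for m, p in zip(measured, predicted)]
    mean_measured = sum(measured) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are equal")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mbe": sum(errors) / n,
    }


# Made-up illustrative Cd contents in mg/kg; these are NOT data from the study.
measured = [0.10, 0.12, 0.14, 0.16, 0.18]
predicted = [0.11, 0.12, 0.13, 0.17, 0.18]
metrics = regression_metrics(measured, predicted)
print(metrics)
```

Under these definitions a higher R² and lower RMSE and MAE indicate a closer match between predicted and measured values, while an MBE near zero indicates little systematic over- or under-prediction.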

Keywords: cadmium (Cd); content prediction; feature optimization; health risk assessment; machine learning.

Publication types

English Abstract

MeSH terms

Cadmium* / analysis
China
Ecosystem
Environmental Monitoring* / methods
Humans
Linear Models
Machine Learning*
Neural Networks, Computer
Risk Assessment
Soil / chemistry
Soil Pollutants* / analysis
Support Vector Machine

Substances

Cadmium
Soil Pollutants
Soil