Until recently, Northern China was one of the most SO2-polluted regions in the world. The lack of long-term, spatially resolved surface SO2 data hinders retrospective evaluation of the relevant environmental policies and of human health effects. This study aims to derive the spatiotemporal distribution of surface SO2 across Northern China during 2005-2019. Because "concept drift" causes substantial estimation bias in back-extrapolation, we propose a new approach, robust back-extrapolation via data augmentation (RBE-DA), to model long-term surface SO2. The results show that the population-weighted regional SO2 concentration ([SO2]pw) increased from 2005 to 2007 and decreased steadily afterwards. The [SO2]pw decreased by 80.4%, from 74.2 ± 28.4 μg/m3 in 2007 to 14.6 ± 4.8 μg/m3 in 2019. The predicted spatial distributions for each year show that SO2 pollution was severe (more than 20 μg/m3) in most areas of Northern China until 2017. Using model interpretation methods, we visually reveal the mechanism behind the estimation bias in back-extrapolation. Specifically, the training data are severely imbalanced with respect to the satellite-retrieved SO2 column densities (i.e., they contain few high-value samples), so the benchmark model is unable to extrapolate the effects of this important predictor. This study provides long-term surface SO2 data for post hoc policy evaluation and human exposure assessment in Northern China, and demonstrates that interpretable machine learning is critical for model diagnostics and refinement. Leveraging satellite retrievals, the RBE-DA approach can be applied worldwide to back-extrapolate various measures of air quality.
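The population-weighted regional concentration reported above can be illustrated with a minimal sketch. It assumes that [SO2]pw is the mean of gridded surface SO2 estimates weighted by the population of each grid cell; the function name and the sample values are hypothetical and do not come from the study.

```python
def population_weighted_mean(concentrations, populations):
    """Mean of grid-cell concentrations weighted by grid-cell population.

    concentrations: surface SO2 per grid cell (e.g., in ug/m3)
    populations:    number of residents per grid cell (same order)
    """
    if len(concentrations) != len(populations):
        raise ValueError("inputs must have the same length")
    total_population = sum(populations)
    if total_population <= 0:
        raise ValueError("total population must be positive")
    # Each cell contributes in proportion to the people living in it.
    weighted_sum = sum(c * p for c, p in zip(concentrations, populations))
    return weighted_sum / total_population


# Hypothetical three-cell region: the unpopulated cell does not affect the result.
example = population_weighted_mean([80.0, 40.0, 10.0], [3000, 1000, 0])
print(example)
```

Weighting by population shifts the regional figure toward the concentrations that most residents are actually exposed to, which is why it differs from a simple area average.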
Keywords: Back-extrapolation; Data augmentation; Imbalanced data; Machine learning; Northern China; SO2 pollution.
Copyright © 2022 Elsevier B.V. All rights reserved.