Wavelet-based Bi-dimensional Aggregation Network for SAR Image Change Detection

Jiangwei Xie, Feng Gao, Xiaowei Zhou, Junyu Dong

This work was supported in part by the National Science and Technology Major Project under Grant 2022ZD0117202, in part by the Natural Science Foundation of Qingdao under Grant 23-2-1-222-ZYYD-JCH, and in part by the Postdoctoral Fellowship Program of CPSF under Grant GZC20241614. (Corresponding author: Xiaowei Zhou.)

Jiangwei Xie, Feng Gao, Xiaowei Zhou, and Junyu Dong are with the School of Computer Science and Technology, Ocean University of China, Qingdao 266100, China.

Abstract

Synthetic aperture radar (SAR) image change detection is critical in remote sensing image analysis. Recently, the attention mechanism has been widely used in change detection tasks. However, existing attention mechanisms often employ down-sampling operations such as average pooling on the Key and Value components to enhance computational efficiency. These irreversible operations result in the loss of high-frequency components and other important information. To address this limitation, we develop a Wavelet-based Bi-dimensional Aggregation Network (WBANet) for SAR image change detection. We design a wavelet-based self-attention block that applies discrete wavelet transform and inverse discrete wavelet transform operations to the Key and Value components. Hence, the features are down-sampled without any loss of information, while local contextual awareness is enhanced through an expanded receptive field. Additionally, we incorporate a bi-dimensional aggregation module that boosts the non-linear representation capability by merging spatial and channel information via a broadcast mechanism. Experimental results on three SAR datasets demonstrate that our WBANet significantly outperforms contemporary state-of-the-art methods. Specifically, our WBANet achieves a percentage of correct classification (PCC) of 98.33%, 96.65%, and 96.62% on the respective datasets. Source codes are available at https://github.com/summitgao/WBANet.

Index Terms:
Change detection; Synthetic aperture radar; Wavelet transform; Bi-dimensional aggregation module.

I Introduction

Synthetic aperture radar (SAR) is adept at producing high-resolution images of the Earth’s surface, even under conditions of low visibility caused by adverse weather [1]. SAR sensors can penetrate cloud cover, making them especially valuable for Earth observation in cloudy or rainy areas. Consequently, SAR data has garnered significant interest from the research community, supporting a range of applications such as object detection [2], disaster assessment [3], change detection [4], and image classification [5]. Among these, change detection serves as a crucial tool for identifying changes in land cover, urban growth, and deforestation.

Recently, various convolutional neural network (CNN) based models for change detection have been developed, demonstrating significant advancements in performance. Hou et al. [6] introduced an end-to-end dual-branch architecture that merges a CNN with a generative adversarial network (GAN), enhancing the detection of fine-grained changes. Wang et al. [7] introduced distinctive patch convolution combined with random label propagation, achieving high accuracy in change detection at a reduced computational cost. Zhao et al. [8] utilized a multidomain fusion module that integrates spatial and frequency domain features into complementary feature representations. Zhu et al. [9] designed a feature comparison module that limits the number of feature channels in the fusion process, enabling better utilization of fine-grained information in the multiscale feature map for more accurate prediction.

The previously mentioned CNN-based methods have demonstrated impressive achievements. Furthermore, the Vision Transformer (ViT) [10] has showcased high performance in various computer vision tasks, leading to the adoption of attention mechanisms in change detection models. Zhang et al. [11] combined convolution and attention mechanisms to improve the performance of SAR image change detection. Although these pioneering efforts have achieved promising performance, designing an attention-based network for SAR change detection is still a non-trivial task, for the following reasons: 1) High-frequency information loss in self-attention computation. Traditional down-sampling of the Key and Value components in efficient attention mechanisms often results in the loss of high-frequency components such as texture details. 2) Limitation in non-linear feature transformation. Existing methods rely on an MLP-like structure for non-linear feature transformation. However, spatial and channel-wise attentions are rarely exploited simultaneously.

To address the above two limitations, we propose a Wavelet-based Bi-dimensional Aggregation Network, WBANet for short, which achieves down-sampling without information loss and fuses both spatial and channel information. Specifically, we design a Wavelet-based Self-attention Module (WSM), which uses the Discrete Wavelet Transform (DWT) and the Inverse Discrete Wavelet Transform (IDWT) to enable lossless and invertible down-sampling in the self-attention computation. In addition, we develop a Bi-dimensional Aggregation Module (BAM) to enhance the non-linear feature representation capabilities. This module efficiently captures both spatial and channel-wise feature dependencies.

Figure 1: An overview of the proposed Wavelet-based Bi-dimensional Aggregation Network (WBANet). The WBANet comprises a pre-classification module and a stack of wavelet-based bi-dimensional aggregation blocks. Each wavelet-based bi-dimensional aggregation block has two critical components: the Wavelet-based Self-attention Module and the Bi-dimensional Aggregation Module.

In summary, the contributions of this letter are as follows:

  • We propose the WSM, which integrates DWT and IDWT for down-sampling without information loss, thus preserving textures and other high-frequency details.

  • We develop the BAM, which captures both spatial and channel-wise feature dependencies effectively. This module merges information from two branches and enhances the non-linear feature representation capabilities.

  • Extensive experiments are conducted on three public SAR datasets, demonstrating the efficacy of our proposed WBANet. We have made our code publicly available to benefit other researchers.

II Methodology

The framework of the proposed WBANet is illustrated in Fig. 1. Two multitemporal SAR images ($I_{1}$ and $I_{2}$), captured at different times over the same geographic region, are fed into the network. The objective of the change detection task is to generate a change map that marks changed pixels as “1” and unchanged pixels as “0”. First, the pre-classification module uses a logarithmic ratio operator to compute a difference image for pseudo-label generation. Subsequently, the hierarchical fuzzy c-means algorithm [12], [13] is employed to classify pixels into changed, unchanged, and intermediate categories. Then, a stack of wavelet-based bi-dimensional aggregation blocks processes the data from the pre-classification module. Finally, the output features of the last block are passed through a fully connected layer to generate the change map.
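The letter names the logarithmic ratio operator without stating its formula. The following minimal sketch assumes the common form $|\log(I_{2}/I_{1})|$ computed pixel by pixel. The function name, the offset `eps`, and the toy intensities are illustrative assumptions, not taken from the authors' code.

```python
import math


def log_ratio_difference(img1, img2, eps=1.0):
    """Pixel-wise |log((I2 + eps) / (I1 + eps))| for two equally sized 2-D images.

    eps is a hypothetical offset (not specified in the letter) that keeps
    the logarithm finite when a pixel intensity is zero.
    """
    if len(img1) != len(img2) or any(len(a) != len(b) for a, b in zip(img1, img2)):
        raise ValueError("images must have the same shape")
    return [
        [abs(math.log((p2 + eps) / (p1 + eps))) for p1, p2 in zip(row1, row2)]
        for row1, row2 in zip(img1, img2)
    ]


# Toy intensities (made up for illustration): only the bottom-right pixel changes.
i1 = [[10.0, 10.0], [20.0, 20.0]]
i2 = [[10.0, 10.0], [20.0, 200.0]]
di = log_ratio_difference(i1, i2)
```

Unchanged pixels map to zero and the changed pixel to a large positive value, which is the kind of contrast a clustering step such as hierarchical fuzzy c-means can then partition.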

The wavelet-based bi-dimensional aggregation block comprises two components: the Wavelet-based Self-attention Module (WSM) and the Bi-dimensional Aggregation Module (BAM). We present the details of both modules in the following subsections.

II-A Wavelet-based Self-attention Module (WSM)

The proposed WSM employs the Discrete Wavelet Transform (DWT) and the Inverse Discrete Wavelet Transform (IDWT) to facilitate down-sampling in the attention mechanism. The wavelet transform enables feature extraction at both coarse and fine-grained scales, while also ensuring that the down-sampling is invertible. We select the Haar wavelet for the down-sampling operation for two reasons. First, its simple structure and high computational efficiency allow the down-sampling to be completed quickly. Second, SAR change images often exhibit considerable sharp high-frequency information, which the Haar wavelet is adept at capturing [14]. The structure of this module is shown in Fig. 2.

To efficiently process the input feature $X\in\mathbb{R}^{H\times W\times C}$, we first reduce its channel dimension to obtain $\widetilde{X}\in\mathbb{R}^{H\times W\times\frac{C}{4}}$ using a learnable transformation matrix $W_{d}\in\mathbb{R}^{C\times\frac{C}{4}}$. Following this channel reduction, we apply DWT with the Haar wavelet to down-sample $\widetilde{X}$ and decompose it into four distinct subbands.

The Haar wavelet is composed of the low-pass filter $f_{L}=\left(\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}\right)$ and the high-pass filter $f_{H}=\left(\frac{1}{\sqrt{2}},-\frac{1}{\sqrt{2}}\right)$. We first encode $\widetilde{X}$ into two subbands $X_{L}$ and $X_{H}$ along the rows. Subsequently, these subbands are processed using the same filters along the columns, resulting in four wavelet subbands: $X_{LL}$, $X_{LH}$, $X_{HL}$, and $X_{HH}$. Here, $X_{LL}\in\mathbb{R}^{\frac{H}{2}\times\frac{W}{2}\times\frac{C}{4}}$ encodes the low-frequency components and contains coarse-grained structural information, while $X_{LH},X_{HL},X_{HH}\in\mathbb{R}^{\frac{H}{2}\times\frac{W}{2}\times\frac{C}{4}}$ represent the high-frequency components and describe fine-grained textures.
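To make the invertibility claim concrete, the sketch below applies a one-level 2-D Haar DWT to a single-channel map stored as a list of lists and then reconstructs it. It is a plain-Python illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the subband naming assumes the first letter refers to the filter applied along the rows and the second to the filter applied along the columns.

```python
def haar_dwt2(x):
    """One-level 2-D Haar DWT of an H x W list of lists (H and W must be even).

    Applying f_L = (1/sqrt(2), 1/sqrt(2)) or f_H = (1/sqrt(2), -1/sqrt(2))
    along the rows and then along the columns gives an overall factor 1/2
    on every 2 x 2 block [[a, b], [c, d]].
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # low-pass rows, low-pass columns
            r_lh.append((a + b - c - d) / 2)  # low-pass rows, high-pass columns
            r_hl.append((a - b + c - d) / 2)  # high-pass rows, low-pass columns
            r_hh.append((a - b - c + d) / 2)  # high-pass rows, high-pass columns
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the 2H' x 2W' map from four H' x W' subbands."""
    h2, w2 = len(ll), len(ll[0])
    x = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for i in range(h2):
        for j in range(w2):
            s, v, u, t = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + v + u + t) / 2
            x[2 * i][2 * j + 1] = (s + v - u - t) / 2
            x[2 * i + 1][2 * j] = (s - v + u - t) / 2
            x[2 * i + 1][2 * j + 1] = (s - v - u + t) / 2
    return x


# Toy 4 x 4 feature map (made-up values).
x = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
ll, lh, hl, hh = haar_dwt2(x)
x_rec = haar_idwt2(ll, lh, hl, hh)
```

In this sketch each LL entry is twice the average of its 2 x 2 block, so average pooling keeps only that subband (up to scale), whereas the four subbands together allow exact reconstruction.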

We then concatenate these four subbands along the channel dimension to get $\hat{X}$:

$$\hat{X}=\textrm{Concat}(X_{LL},X_{LH},X_{HL},X_{HH}) \tag{1}$$
Figure 2: Comparison of the traditional self-attention block in the Pyramid Vision Transformer (PVT) and the proposed wavelet-based self-attention module. Linear denotes fully connected layers and $\otimes$ denotes matrix multiplication.

The concatenated output $\hat{X}$ is transformed into Key ($K^{w}$) and Value ($V^{w}$) matrices through the convolutional layer, while the Query ($Q$) is taken from the original input $X$. In this case, wavelet-based multi-head self-attention computes the interaction across these elements for each head as follows:

$$\begin{aligned}
\text{head}_{i} &= \textrm{Attention}\left(Q_{i},K_{i}^{w},V_{i}^{w}\right) \\
&= \textrm{Softmax}\left(\frac{Q_{i}K_{i}^{w}}{\sqrt{D_{h}}}\right)V_{i}^{w}
\end{aligned} \tag{2}$$

where $K_{i}^{w}$ denotes the down-sampled key, $V_{i}^{w}$ denotes the down-sampled value, and $D_{h}$ represents the dimension of each head.
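As a rough illustration of the per-head computation, the sketch below implements scaled dot-product attention in plain Python for a Query with N tokens and a Key/Value pair with fewer tokens M, as would result from wavelet down-sampling. The token layout (one row per token), the function names, and the toy values are assumptions made for illustration. Each score is the dot product of a query row with a key row, which in matrix form corresponds to multiplying the Query by the transposed Key.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Scaled dot-product attention for one head.

    q: N x D_h list of query tokens.
    k, v: M x D_h lists of key/value tokens (M < N after down-sampling).
    Returns an N x D_h list: one output token per query token.
    """
    d_h = len(q[0])
    out = []
    for q_row in q:
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_h)
            for k_row in k
        ]
        weights = softmax(scores)
        out.append(
            [sum(wt * v_row[c] for wt, v_row in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


# Toy case (made-up values): 4 query tokens attend to a single key/value token,
# i.e. the key/value sequence is 4x shorter than the query sequence.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
k = [[1.0, 1.0]]
v = [[2.0, 3.0]]
head = attention_head(q, k, v)
```

The output keeps one row per query token regardless of how many key/value tokens remain, which is why down-sampling only the Key and Value preserves the spatial resolution of the attention output.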

To enhance the output of the wavelet-based self-attention block, we apply the IDWT to $\hat{X}$ to produce $X^{r}$. The reconstructed $X^{r}$ mirrors the details of the original input image, providing excellent local contextualization and an expanded receptive field. The final output integrates the contributions of each attention head with this reconstructed map. This integration is essential for effectively capturing information across multiple scales.

Figure 3: Illustration of the Bi-dimensional Aggregation Module. It includes two branches: global channel attention and local spatial attention. ⓒ denotes concatenation along the channel axis, $\oplus$ denotes element-wise addition, and $\odot$ denotes element-wise multiplication.

The overall operation can be formulated as follows:

$$\textrm{WaveAttn}(X)=\textrm{Concat}\left(\textit{head}_{0},\cdots,\textit{head}_{N_{h}},X^{r}\right)W^{O}, \tag{3}$$

where $N_{h}$ represents the number of attention heads, and $W^{O}$ is the transformation matrix that combines all the heads and the reconstructed image into a single output tensor. The use of wavelet transform in the self-attention mechanism significantly enhances the ability to contextualize information over longer ranges with a reduced computational load compared to conventional self-attention modules. This approach ensures that both global coherence and local detail are preserved and emphasized in the model’s outputs.
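One simple way to quantify the reduced load is to count the entries of a single head's Query-Key score matrix. The sketch below assumes the Query keeps all $H\times W$ tokens while the Key and Value are halved along each spatial axis by a one-level DWT. The function name and the 8 x 8 patch size are hypothetical, and the count ignores the cost of the DWT, IDWT, and projection layers.

```python
def attention_score_entries(h, w, kv_stride=1):
    """Number of entries in one head's Query-Key score matrix.

    The Query keeps all h * w tokens; Key/Value are down-sampled by
    kv_stride along each spatial axis (kv_stride=2 for a one-level DWT).
    """
    if kv_stride < 1 or h % kv_stride or w % kv_stride:
        raise ValueError("h and w must be divisible by a positive kv_stride")
    n_query = h * w
    n_key = (h // kv_stride) * (w // kv_stride)
    return n_query * n_key


# Hypothetical 8 x 8 patch: full attention versus wavelet-down-sampled Key/Value.
full = attention_score_entries(8, 8)                 # 64 * 64 = 4096 entries
wave = attention_score_entries(8, 8, kv_stride=2)    # 64 * 16 = 1024 entries
```

Under these assumptions the score matrix shrinks by a factor of four, while the four subbands stacked along the channel axis still carry the full content of the channel-reduced feature.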

II-B Bi-dimensional Aggregation Module (BAM)

To enhance the non-linear representation capabilities and effectively capture both spatial and channel dependencies, we develop the Bi-dimensional Aggregation Module (BAM), as depicted in Fig. 3. This module includes two branches: the channel aggregation branch and the spatial aggregation branch.

Channel Aggregation: In this branch, average pooling is applied over the spatial dimensions of the input features $X\in\mathbb{R}^{H\times W\times C}$ to aggregate global representations. Subsequently, a fully connected (FC) layer coupled with a GELU activation function reduces the channel dimension from $C$ to $\frac{C}{r}$, producing an intermediate output $\hat{X}$. Here, the reduction ratio $r$ is set to 2. This is followed by an FC layer and a Sigmoid activation function to generate the output of the channel attention branch, $X^{C}\in\mathbb{R}^{1\times 1\times C}$.

Spatial Aggregation: First, a linear transformation, together with a GELU activation, transforms the channel dimension to $\frac{C}{r}$, while the spatial dimensions remain unchanged. The resulting intermediate output, $\widetilde{X}$, is then concatenated with $\hat{X}$, which is broadcast across the spatial dimensions, to form $\widetilde{X}^{\prime}\in\mathbb{R}^{H\times W\times\frac{2C}{r}}$. The process culminates similarly to the channel attention branch, resulting in the final output $X^{S}\in\mathbb{R}^{H\times W\times 1}$.

The outputs of both branches, $X^{C}$ and $X^{S}$, are merged through an element-wise summation with broadcasting, so that the final output retains the same dimensions as the original input $X$. This integration combines the refined channel and spatial information, enhancing the overall feature representation while maintaining focus on both global context and local details.
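The merging step can be sketched in plain Python as a broadcast sum of a per-channel vector and a per-pixel map. The function name and toy values are illustrative assumptions. How the fused map is subsequently applied to the features (Fig. 3 also mentions element-wise multiplication) is not detailed in the text and is therefore left out.

```python
def broadcast_sum(x_c, x_s):
    """Fuse a channel map and a spatial map by broadcast addition.

    x_c: list of C values, standing in for a 1 x 1 x C tensor.
    x_s: H x W list of lists, standing in for an H x W x 1 tensor.
    Returns an H x W x C nested list where out[i][j][c] = x_s[i][j] + x_c[c].
    """
    return [[[s + c for c in x_c] for s in row] for row in x_s]


# Toy values (made up), chosen to be exactly representable as floats.
x_c = [0.25, 0.5, 0.75]            # C = 3 channel weights
x_s = [[0.0, 0.5], [0.25, 1.0]]    # H = W = 2 spatial weights
fused = broadcast_sum(x_c, x_s)    # shape 2 x 2 x 3
```

Every output position thus receives both its own spatial weight and the shared channel weights, which is how a $1\times 1\times C$ tensor and an $H\times W\times 1$ tensor can produce an $H\times W\times C$ result.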

III Experimental Results and Analysis

Figure 4: Visualized results of different change detection methods on the three datasets: (a) Image captured at $t_{1}$. (b) Image captured at $t_{2}$. (c) Ground truth image. (d)-(i) Results by different methods.

III-A Datasets and Evaluation Metrics

To validate the effectiveness of the proposed WBANet, we conducted comprehensive experiments on three distinct SAR datasets: the Chao Lake, Sulzberger, and Yellow River datasets. Chao Lake dataset: This dataset includes images of Chao Lake in China, captured in May 2020 using the Sentinel-1 sensor. This period coincides with the highest recorded water levels in the lake’s history, providing a dynamic range of changes to detect. Sulzberger dataset: Captured by the European Space Agency’s Envisat satellite over five days in March 2011, this dataset documents the breakup of an ice shelf, offering a unique perspective on drastic natural events. Yellow River dataset: This dataset focuses on the Yellow River Estuary in China, with data collected from June 2008 to June 2009 using the Radarsat-2 SAR sensor. This dataset is particularly challenging due to the pronounced speckle noise.

The hierarchical fuzzy c-means algorithm classifies pixels into changed, unchanged, and intermediate categories. Pixels from the changed and unchanged groups are randomly selected as training data, while the pixels in the intermediate group serve as the test data. For a thorough assessment of our model, we employed five commonly used evaluation metrics: False Positives (FP), False Negatives (FN), Overall Error (OE), Percentage of Correct Classification (PCC), and the Kappa Coefficient (KC).
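The letter does not spell out the metric formulas. The sketch below uses the definitions that are standard for binary change maps, which is an assumption here: OE = FP + FN, PCC is the fraction of correctly labeled pixels, and KC is Cohen's kappa with the chance agreement computed from the predicted and actual class sizes. The function name and the pixel counts in the example are hypothetical.

```python
def change_detection_metrics(fp, fn, n_changed, n_unchanged):
    """Return (OE, PCC in %, KC in %) for a binary change map.

    fp: unchanged pixels wrongly marked as changed.
    fn: changed pixels wrongly marked as unchanged.
    n_changed, n_unchanged: class sizes in the ground truth.
    """
    n = n_changed + n_unchanged
    tp = n_changed - fn      # changed pixels detected as changed
    tn = n_unchanged - fp    # unchanged pixels detected as unchanged
    oe = fp + fn
    pcc = (tp + tn) / n
    # Chance agreement between the predicted and the actual class sizes.
    pre = ((tp + fp) * n_changed + (fn + tn) * n_unchanged) / (n * n)
    kc = (pcc - pre) / (1 - pre)
    return oe, 100 * pcc, 100 * kc


# Hypothetical counts: 1000 pixels, 100 of them truly changed.
oe, pcc, kc = change_detection_metrics(fp=10, fn=20, n_changed=100, n_unchanged=900)
```

In this toy case PCC is 97% while KC is noticeably lower, which reflects that kappa discounts the agreement expected by chance when the unchanged class dominates.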

TABLE I: Change detection results of different methods on three datasets. The best results are marked in bold.

Results on the Chao Lake dataset
Method             FP      FN      OE      PCC (%)   KC (%)
CWNN [15]          7528    2213    9741    93.39     65.01
SAFNet [16]        2231    1272    3503    97.62     85.55
DDNet [17]         1472    1559    3031    97.94     87.04
LANTNet [18]       1822    1023    2845    98.07     88.20
CAMixer [11]       1867    906     2773    98.12     88.56
Proposed WBANet    1092    1373    2465    98.33     89.38

Results on the Sulzberger dataset
Method             FP      FN      OE      PCC (%)   KC (%)
CWNN [15]          2987    387     3374    94.85     86.95
SAFNet [16]        1661    883     2544    96.12     89.80
DDNet [17]         1835    585     2420    96.31     90.39
LANTNet [18]       1761    635     2396    96.34     90.46
CAMixer [11]       1105    1207    2312    96.47     90.56
Proposed WBANet    1553    640     2193    96.65     91.23

Results on the Yellow River dataset
Method             FP      FN      OE      PCC (%)   KC (%)
CWNN [15]          3052    1034    4086    94.50     82.46
SAFNet [16]        2199    1467    3666    95.06     83.69
DDNet [17]         1251    2222    3473    95.32     83.76
LANTNet [18]       915     2343    3258    95.61     84.56
CAMixer [11]       991     2126    3117    95.80     85.35
Proposed WBANet    605     1905    2510    96.62     88.15

III-B Experimental Results and Discussion

We evaluated our WBANet against five state-of-the-art methods: CWNN [15], SAFNet [16], DDNet [17], LANTNet [18], and CAMixer [11], each implemented with the default parameters reported in the original studies. All experiments, except for CWNN, which was run in MATLAB, were conducted on the Google Colab platform with Python 3.10.12, PyTorch 2.1.0, and an NVIDIA Tesla T4 GPU with 15 GB of memory.

Quantitative results are detailed in Table I. For the Chao Lake dataset, our WBANet achieves the best results in all metrics except false negatives (FN), improving the Kappa Coefficient (KC) by 24.37, 3.83, 2.34, 1.18, and 0.82 percentage points over CWNN, SAFNet, DDNet, LANTNet, and CAMixer, respectively. On the Sulzberger dataset, WBANet outperforms the other methods in overall error (OE), PCC, and KC. Although CAMixer and CWNN achieve the lowest FP and FN counts, respectively, both yield a higher OE than our WBANet. Similarly, on the Yellow River dataset, WBANet leads in all metrics apart from FN. Notably, based on the KC values in Table I, it improves the KC by 5.69, 4.46, 4.39, 3.59, and 2.80 percentage points over CWNN, SAFNet, DDNet, LANTNet, and CAMixer, respectively. While CWNN records a lower FN count, it lags significantly behind in FP and OE.

Fig. 4 illustrates the visual comparison of change maps produced by different methods on three datasets. Compared to the baseline methods, such as CWNN and SAFNet, our WBANet generates change maps that are visually closer to the ground truth and contain less noise. For instance, in the Yellow River dataset, where speckle noise greatly impacts performance, it is challenging to produce accurate change maps. Here, the performance of CWNN and SAFNet is notably degraded, while DDNet, LANTNet, and CAMixer frequently misclassify changed pixels as unchanged.

Experimental results on three SAR datasets confirm that our proposed WBANet outperforms other state-of-the-art methods. These gains indicate that the WSM and the BAM contribute to attention feature extraction and non-linear representation modeling, respectively.

TABLE II: Ablation studies of the proposed WBANet.

PCC on different datasets (%)
Method             Chao Lake   Sulzberger   Yellow River
Basic Network      97.46       96.02        95.67
w/o WSM            98.13       96.33        95.92
w/o BAM            97.82       96.51        96.17
Proposed WBANet    98.33       96.65        96.62

III-C Ablation Study

To evaluate the effectiveness of the proposed Wavelet-based Self-attention Module and Bi-dimensional Aggregation Module, ablation experiments were performed on the three datasets. We designed three variants: (1) Basic Network, which is the WBANet without the WSM and BAM; (2) w/o WSM, which omits the Wavelet-based Self-attention Module; and (3) w/o BAM, which lacks the Bi-dimensional Aggregation Module. As shown in Table II, removing either module lowers the PCC on all three datasets, and both single-module variants outperform the Basic Network, indicating that the WSM and the BAM each contribute to change detection performance.

Figure 5: Visualization of the feature representations on the Yellow River dataset. (a) Features before the WSM. (b) Features after the WSM.

Additionally, we utilized the t-SNE [19] tool to visualize feature characteristics before and after applying the Wavelet-based Self-attention Module. As depicted in Fig. 5, the representations after the WSM display more distinct, well-defined clusters than those before it.

III-D Analysis of the Block Number

The number of Wavelet-based Bi-dimensional Aggregation Blocks, denoted as $N$, is a crucial parameter. We explored the relationship between $N$ and the Percentage of Correct Classification (PCC) by varying $N$ from 0 to 8. As illustrated in Fig. 6, PCC initially improves as blocks are added and peaks at no more than five blocks, depending on the dataset. Beyond this point, PCC begins to decline due to the increased model complexity. Consequently, we set $N$ separately for each dataset: $N=5$ for the Chao Lake dataset, $N=2$ for the Sulzberger dataset, and $N=4$ for the Yellow River dataset.

IV Conclusion

In this letter, we introduce a novel WBANet for the SAR image change detection task. The WBANet utilizes DWT and IDWT to achieve down-sampling without the loss of high-frequency details and other important information. Additionally, we developed the BAM to enhance non-linear representation capabilities by capturing spatial and channel dependencies and refining features. Extensive experiments on three SAR datasets have verified the effectiveness and rationality of our solution.

Figure 6: Relationship between the number of attention blocks and PCC values.

References

  • [1] J. Wang, F. Gao, J. Dong, S. Zhang, and Q. Du, “Change detection from synthetic aperture radar images via graph-based knowledge supplement network,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 1823–1836, 2022.
  • [2] I. Ševo and A. Avramović, “Convolutional neural network based automatic object detection on aerial images,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 5, pp. 740–744, 2016.
  • [3] A. Sarkar, T. Chowdhury, R. R. Murphy, A. Gangopadhyay, and M. Rahnemoonfar, “SAM-VQA: Supervised attention-based visual question answering model for post-disaster damage assessment on remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
  • [4] T. Yan, Z. Wan, P. Zhang, G. Cheng, and H. Lu, “TransY-Net: Learning fully transformer networks for change detection of remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–12, 2023.
  • [5] X. Qian, F. Liu, L. Jiao, X. Zhang, Y. Guo, X. Liu, and Y. Cui, “Ridgelet-Nets with speckle reduction regularization for SAR image scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 11, pp. 9290–9306, 2021.
  • [6] B. Hou, Q. Liu, H. Wang, and Y. Wang, “From W-Net to CDGAN: Bitemporal change detection via deep learning techniques,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 3, pp. 1790–1802, 2020.
  • [7] J. Wang, F. Gao, J. Dong, Q. Du, and H.-C. Li, “Change detection from synthetic aperture radar images via dual path denoising network,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 2667–2680, 2022.
  • [8] C. Zhao, L. Ma, L. Wang, T. Ohtsuki, P. T. Mathiopoulos, and Y. Wang, “SAR image change detection in spatial-frequency domain based on attention mechanism and gated linear unit,” IEEE Geoscience and Remote Sensing Letters, vol. 20, pp. 1–5, 2023.
  • [9] S. Zhu, Y. Song, Y. Zhang, and Y. Zhang, “ECFNet: A siamese network with fewer FPs and fewer FNs for change detection of remote sensing images,” IEEE Geoscience and Remote Sensing Letters, vol. 20, pp. 1–5, 2023.
  • [10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proceedings of International Conference on Learning Representations (ICLR), 2021, pp. 1–21.
  • [11] H. Zhang, Z. Lin, F. Gao, J. Dong, Q. Du, and H.-C. Li, “Convolution and attention mixer for synthetic aperture radar image change detection,” IEEE Geoscience and Remote Sensing Letters, vol. 20, pp. 1–5, 2023.
  • [12] F. Gao, J. Dong, B. Li, and Q. Xu, “Automatic change detection in synthetic aperture radar images based on PCANet,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 12, pp. 1792–1796, 2016.
  • [13] H.-C. Li, T. Celik, N. Longbotham, and W. J. Emery, “Gabor feature based unsupervised change detection of multitemporal SAR images based on two-level clustering,” IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 12, pp. 2458–2462, 2015.
  • [14] S. Mallat, A Wavelet Tour of Signal Processing. Elsevier, 1999.
  • [15] F. Gao, X. Wang, Y. Gao, J. Dong, and S. Wang, “Sea ice change detection in SAR images based on convolutional-wavelet neural networks,” IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 8, pp. 1240–1244, 2019.
  • [16] Y. Gao, F. Gao, J. Dong, Q. Du, and H.-C. Li, “Synthetic aperture radar image change detection via siamese adaptive fusion network,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 10748–10760, 2021.
  • [17] X. Qu, F. Gao, J. Dong, Q. Du, and H.-C. Li, “Change detection in synthetic aperture radar images using a dual-domain network,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.
  • [18] D. Meng, F. Gao, J. Dong, Q. Du, and H.-C. Li, “Synthetic aperture radar image change detection via layer attention-based noise-tolerant network,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.
  • [19] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.