1. Introduction
Automatic extraction of buildings from remote sensing imagery is of paramount importance in many application areas such as urban planning, population estimation, and disaster response [1]. Assigning a semantic building class label to each pixel in very high resolution (VHR) imagery of urban areas is a challenging task because of high intra-class and low inter-class variability [2,3]. In high resolution images, the building category contains man-made objects of many different sizes, and urban scenes contain substantial clutter, such as the shadows of tall buildings and rooftops that resemble roads. As a result, it is difficult to label buildings reliably and accurately.
We have witnessed a rapid, revolutionary change in computer vision research, mainly driven by convolutional neural networks (CNNs) [4] and the availability of large-scale training data [5]. Recently, several CNN-based semantic segmentation methods have been applied to building extraction from earth observation images [6,7,8]. Patch-based CNN methods [9,10,11,12,13] were initially adopted for prediction in dense urban areas. These methods label the center pixel by processing an image patch through a neural network; they tend to be computationally expensive and are usually used to detect large objects [14,15]. Since Long et al. [16] adapted the classification network into a fully convolutional network (FCN) for semantic segmentation, FCN and its extensions have gradually become the preferred solution in the field of semantic labeling [17,18,19,20]. Although FCN-based methods can produce dense pixel-wise output directly, the pixel-wise classification derived from the final score map is quite coarse because of the sequential sub-sampling operations in the FCN.
To address the problem of coarse predictions, recent studies [21,22,23,24,25,26] have further improved FCN-based methods for semantic labeling of remote sensing images. A growing body of literature [27,28,29,30,31] employs the encoder–decoder architecture with skip connections. UNet [32], a typical encoder–decoder model, reuses low-level information to refine the output and thus achieves better performance. To obtain accurate labeling of VHR images, an effective structure is needed to integrate the high-resolution, low-level features with the low-resolution, high-level features. The skip connection fuses features to compensate for the loss of spatial information caused by repeated local operations (e.g., pooling and strided convolution). Features passed via skip connections are multi-scale in nature due to the increasingly large receptive field sizes [33]. However, most existing approaches built on top of a contemporary classification network are good at aggregating global context, and while the reuse of information from early encoding layers contributes to localization in the decoding phase, it may introduce redundant information that results in over-segmentation [34] and unexpectedly ambiguous representations [35,36]. Specifically, the low-level features in the encoder are computed in the shallow layers of the network, whereas the high-level features in the decoder are computed in the deep layers. Since the latter have undergone more processing, there is a semantic gap between the features of the encoder and decoder. For example, a deep layer in the decoding stage may confidently discriminate between a gray pixel belonging to ‘asphalt roads’ or ‘rooftops’, because more global context is passed along the long path from the low layers to the high layers. In contrast, the signals from the symmetric early layer only reach a level of discrimination specific to the parent class ‘impervious surface’ and therefore express confidence in both subclasses. As a result, integrating these features directly through skip connections may decrease prediction accuracy. Recent research has shown that directly fusing semantically dissimilar features from the encoder and decoder subnetworks can degrade segmentation performance [37]. Thus, it is important to bridge the semantic gap between encoder and decoder features prior to fusion.
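To make the fusion step under discussion concrete, the following pure-Python sketch (toy 1-D “feature maps” as lists of floats; `upsample_nearest` and `skip_concat` are our illustrative names, not framework APIs) shows a plain skip connection, which concatenates encoder features with upsampled decoder features without any re-weighting:

```python
# Toy illustration of a plain skip connection. "Feature maps" are
# 1-D lists of floats; a feature tensor is a list of such channels.

def upsample_nearest(fmap, factor=2):
    """Nearest-neighbour upsampling of a 1-D feature map."""
    return [v for v in fmap for _ in range(factor)]

def skip_concat(encoder_feats, decoder_feats):
    """Upsample decoder features to the encoder resolution, then
    concatenate channel-wise (the plain, un-gated skip connection)."""
    up = [upsample_nearest(f) for f in decoder_feats]
    return encoder_feats + up  # channel-wise concatenation

# Two encoder channels at resolution 4, one decoder channel at 2.
enc = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
dec = [[10.0, 20.0]]
fused = skip_concat(enc, dec)  # three channels, all at resolution 4
```

It is exactly this unconditioned concatenation that mixes semantically dissimilar features from the two subnetworks.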
In recent years, several researchers have begun to apply attention mechanisms to CNNs. Initially, attention in CNNs was used to interpret the gradient of a class output score with respect to the input image [38]. Later, trainable self-attention was deployed for image captioning, image classification, object detection, and image segmentation [39,40,41,42]. A large body of literature exploring different gating architectures has emerged. For instance, Oktay et al. [43] proposed a self-attention gating module that can be utilized in FCN models for medical image segmentation. Zagoruyko et al. [44] improved the performance of a student CNN by transferring the attention maps from a teacher network. Different from the above works, which used grid-attention techniques to capture spatially salient regions, Hu et al. [45] proposed channel-wise attention to highlight important feature dimensions. Subsequent studies [46,47,48,49] have demonstrated the performance of the channel-wise attention mechanism in the semantic segmentation task. In remote sensing, some attempts [50,51,52] have been made to adopt attention mechanisms for the building extraction task. Yang et al. [52] used a spatial attention module whose weight map is generated by applying a sigmoid function to the deep features. Pan et al. [50] used a generative adversarial network with spatial [34] and channel [45] attention to extract buildings. Although these attention modules differ in detail, most of the implementations use self-attention to enhance the representation of single-layer features.
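As a concrete illustration of the channel-wise attention idea of Hu et al. [45], the following minimal pure-Python sketch (toy channels as lists of floats; `W1` and `W2` are placeholders for learned weight matrices) squeezes each channel to one descriptor, passes the descriptors through two fully connected layers, and rescales the channels with the resulting sigmoid gates:

```python
import math

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def se_channel_attention(feats, W1, W2):
    """Squeeze-and-excitation style channel attention (toy version).
    feats: list of channels, each a 1-D list of floats."""
    # Squeeze: global average pooling -> one descriptor per channel.
    z = [sum(c) / len(c) for c in feats]
    # Excitation: two fully connected layers (ReLU, then sigmoid)
    # model cross-channel dependencies and yield gates in (0, 1).
    h = [max(0.0, v) for v in matvec(W1, z)]
    gates = [1.0 / (1.0 + math.exp(-v)) for v in matvec(W2, h)]
    # Rescale: each channel is multiplied by its gate.
    return [[g * v for v in c] for g, c in zip(gates, feats)]

# Identity weights make the gates easy to check: descriptors are
# z = [0.0, 2.0], so the gates are sigmoid(0) = 0.5 and sigmoid(2).
I2 = [[1.0, 0.0], [0.0, 1.0]]
out = se_channel_attention([[0.0, 0.0], [2.0, 2.0]], I2, I2)
```

In the real mechanism the two layers form a bottleneck (channel reduction then expansion); the identity matrices here are only for readability.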
Since attention can model interdependency and adjust the response of a position or a channel in the input feature maps, we exploit it to alleviate the semantic difference between features from different depths in the skip connection. Similar to [39,46], we employ a joint attention module (RFA) in the deep neural network, but our focus is on bridging the gap between hierarchical representations. To this end, we propose an attention re-weighting process that can be integrated into the UNet model for the building extraction task in VHR images. The proposed attention module adaptively emphasizes meaningful features and suppresses insignificant features along both the channel and spatial dimensions, under the guidance of deep features. Benefitting from the global context information captured by joint attention, the semantic information of the high-resolution but low-level features in the encoder is gradually enriched in a task-oriented direction before fusion. The contributions of our work are summarized as follows:
(1) We implement a joint spatial and channel-wise attention mechanism to enhance the consistency of features across layers in the U-shaped FCN. Experimental results show that using attention jointly is effective in reducing semantic differences between features.
(2) We integrate the proposed attention module into the existing UNet model and propose an end-to-end method (RFA-UNet) for the building extraction task, which attains performance comparable to, and as stable as, other state-of-the-art models on three public datasets.
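Structurally, contribution (2) amounts to replacing each plain skip connection in UNet with an attention-gated one. The following pure-Python sketch shows the pattern; `gated_skip` and `toy_gate` are our illustrative names, and the stand-in gate is deliberately simplistic, not the actual RFA module:

```python
import math

def gated_skip(encoder_feats, decoder_feats, gate):
    """Gated fusion: the encoder features are re-weighted by an
    attention gate computed from both inputs before being
    concatenated channel-wise with the decoder features."""
    weighted = gate(encoder_feats, decoder_feats)
    return weighted + decoder_feats  # channel-wise concatenation

def toy_gate(enc, dec):
    """Stand-in for the RFA module: scale every encoder channel by
    the sigmoid of the mean decoder activation (illustrative only)."""
    m = sum(sum(c) for c in dec) / sum(len(c) for c in dec)
    w = 1.0 / (1.0 + math.exp(-m))
    return [[w * v for v in c] for c in enc]

# Gate applied at one decoding level. Since the decoder features
# average to 0, the gate is sigmoid(0) = 0.5, halving each
# encoder channel before concatenation.
enc = [[1.0, 2.0], [3.0, 4.0]]
dec = [[0.0, 0.0]]
fused = gated_skip(enc, dec, toy_gate)
```

The point of the pattern is that the encoder side is conditioned on the decoder side before fusion, rather than concatenated as-is.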
The remainder of this paper is organized as follows. Section 2 introduces the proposed method. The experimental results are presented in Section 3. The discussion of the method and experiments is given in Section 4. Section 5 concludes this paper.
4. Discussion
Applying the attention mechanism to the UNet segmentation model, we observe that our joint attention module improves the performance of the existing architecture for the task of building extraction in VHR images. The improvement is likely related to the inherent attributes of CNNs and the flaw of the plain skip connection in encoder–decoder architectures. Generally, CNNs increase the receptive field by stacking convolution layers, which means the receptive field of a given layer only covers a local region, especially in the shallow part of the network. The difference between deep and shallow layers in their use of context information therefore leads to a variation in classification capacity. On the other hand, the spatial information of low-level features is important for localizing the classified objects, but these low-level features also carry noisy information that results in categorical errors [68]. In this paper, we rethink the relationship between shallow layers and the corresponding deep layers in the skip connection at the feature level. To leverage the spatial information from shallow layers and the context information from deep layers, we employ an attention mechanism that highlights advantageous features and suppresses features that contribute less. The channel-wise attention part of the proposed module applies global average pooling to the concatenated features, which extracts the global categorical information of the two input features. Two subsequent fully connected layers play an important role in capturing feature dependencies along the channel dimension, ensuring cross-layer information exchange. Thus, the rescaled low-level output, activated by a sigmoid, is more dynamically consistent with the high-level features. Furthermore, the spatial attention part uses additive attention to refine the low-level features with the aid of the high-level features, which have larger receptive fields, introducing more elaborate context to improve the classification ability of the features.
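Our reading of this re-weighting process can be sketched in a few lines of pure Python (toy 1-D features; `W1` and `W2` stand in for the learned fully connected layers, and the function names are hypothetical, not taken from our implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rfa_reweight(low, high, W1, W2):
    """Toy sketch of the described re-weighting: channel gates are
    computed from the concatenation of low- and high-level features
    and applied to the low-level features; an additive spatial gate
    then refines them position-wise."""
    concat = low + high  # channel-wise concatenation
    # Channel part: global average pooling -> two FC layers -> gates.
    z = [sum(c) / len(c) for c in concat]
    h = [max(0.0, v) for v in matvec(W1, z)]
    gates = [sigmoid(v) for v in matvec(W2, h)]
    low_w = [[g * v for v in c] for g, c in zip(gates[:len(low)], low)]
    # Spatial part: additive attention -- sum low and high responses
    # at each position, squash to (0, 1), and rescale every channel.
    spatial = [sigmoid(sum(c[i] for c in low_w) + sum(c[i] for c in high))
               for i in range(len(low[0]))]
    return [[s * v for s, v in zip(spatial, c)] for c in low_w]

# One low-level and one high-level channel, spatial length 2.
I2 = [[1.0, 0.0], [0.0, 1.0]]
out = rfa_reweight([[1.0, 2.0]], [[0.0, 0.0]], I2, I2)
# Every output value is damped relative to its input, since both
# the channel and spatial gates lie in (0, 1).
```

Note that both gates depend on the high-level features, which is what lets the deep, context-rich layer guide the shallow one before fusion.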
Compared to other existing attention methods, flexibility is an advantage of our proposed attention module. The experimental results on three different datasets demonstrate that the RFA module can better handle the building extraction task across different sources of aerial images. Taking the channel and spatial dimensions into account successively allows a more robust interaction of context information between the feature layers in the segmentation model. Meanwhile, the residual mapping branch of RFA alleviates gradient vanishing during training. These are two reasons why the proposed RFA attention module outperforms the other single-attention methods in this study. With respect to DualAN, which also uses two kinds of attention in our comparison, our approach is quite different. In particular, DualAN applies attention mechanisms in parallel at the bottleneck of the network, focusing on employing self-attention to enhance the representation of deep features rather than on reducing the semantic discrepancy between features at different levels. Moreover, because of the high cost of the intermediate matrix multiplication in DualAN, its authors [62] place it only at the bottleneck features with low spatial resolution. The experimental results imply that this strategy is not effective enough for building extraction in aerial images. In contrast, our practice has shown that the proposed joint attention adds only a small cost in additional model parameters (see Table 5, about 0.4 million) and computation (about 1.53 MB), even when applied at every level of the network. This flexibility implies the possibility of embedding RFA in other architectures in the future.
There is abundant room for further studies. First, we have not validated the possible improvements the proposed RFA module might bring to other encoder–decoder models. At present, the reason we do not apply the RFA module to other models is that many factors need to be considered, such as the computational resource consumption of the models, the applicability of the models themselves to different datasets, and the hyperparameter settings of the models. The comparison of training time with other methods also suggests that further hyperparameter optimization of the proposed module is possible (see Table 5 and Table A1). Therefore, a more comprehensive comparison of these methods is needed in the future. Second, we have conducted the experiments on three public datasets of urban buildings (i.e., Mass. Buildings, Potsdam, and WHU). It would be promising to develop RFA-based models on multi-source data and for rural residential buildings. Finally, we focus only on the task of building extraction in this paper. Since the proposed RFA-UNet can easily be transformed into a multi-class semantic segmentation model, we plan to extend it with extra geometric constraints and to multiple classes.