License: CC BY-NC-ND 4.0
arXiv:2403.06122v1 [cs.CV] 10 Mar 2024

Style Blind Domain Generalized Semantic Segmentation
via Covariance Alignment and Semantic Consistence Contrastive Learning

Woo-Jin Ahn¹  Geun-Yeong Yang¹  Hyun-Duck Choi²  Myo-Taeg Lim¹*
¹Korea University  ²Chonnam National University
¹{wjahn,hggofficial,mlim}@korea.ac.kr   ²[email protected]
*Corresponding Authors
Abstract

Deep learning models for semantic segmentation often experience performance degradation when deployed to unseen target domains unidentified during the training phase. This is mainly due to variations in image texture (i.e. style) from different data sources. To tackle this challenge, existing domain generalized semantic segmentation (DGSS) methods attempt to remove style variations from the feature. However, these approaches struggle with the entanglement of style and content, which may lead to the unintentional removal of crucial content information, causing performance degradation. This study addresses this limitation by proposing BlindNet, a novel DGSS approach that blinds the style without external modules or datasets. The main idea behind our proposed approach is to alleviate the effect of style in the encoder whilst facilitating robust segmentation in the decoder. To achieve this, BlindNet comprises two key components: covariance alignment and semantic consistency contrastive learning. Specifically, the covariance alignment trains the encoder to uniformly recognize various styles and preserve the content information of the feature, rather than removing the style-sensitive factor. Meanwhile, semantic consistency contrastive learning enables the decoder to construct discriminative class embedding space and disentangles features that are vulnerable to misclassification. Through extensive experiments, our approach outperforms existing DGSS methods, exhibiting robustness and superior performance for semantic segmentation on unseen target domains. The code is available at https://github.com/root0yang/BlindNet.

1 Introduction

Figure 1: Comparison of semantic segmentation results between the baseline (DeepLabV3+ with ResNet50 backbone) and our BlindNet. Both models are trained on the source domain (GTAV [43]) and tested on the target domain (Cityscapes [8]).

Semantic segmentation, a technique that classifies each pixel in an image into predefined categories, has garnered significant attention due to its potential applications in various fields. In particular, it plays a crucial role in autonomous driving [18, 1] and robotic systems [35, 34]. With the advent of large datasets, deep neural networks have become a popular approach for semantic segmentation, achieving impressive results [54, 5, 44, 3]. However, a major bottleneck remains: the meticulous and labor-intensive process of dataset labeling, which is both time-consuming and costly [8, 46]. To address this challenge, synthetic datasets have emerged as a compelling alternative. Generated using three-dimensional (3D) rendering techniques, these datasets offer vast amounts of easily accessible data and eliminate the need for manual labeling [43, 45]. However, a challenge arises when models trained on synthetic datasets are deployed in real-world scenarios: a domain shift caused by style discrepancies (e.g. texture, illumination, and image quality) between synthetic and real-world data degrades the performance of the model, as shown in Fig. 1.

To address the domain shift problem, domain adaptation semantic segmentation (DASS) has been introduced [25, 17, 15, 51, 62, 22]. Specifically, DASS aims to bridge the gap between source and target domains by aligning their data distributions. However, a significant limitation of DASS is its dependency on the target domain during training. For DASS to function effectively, target domain samples must be available during the training phase.

Meanwhile, domain generalized semantic segmentation (DGSS) has been introduced as an alternative approach to tackle the domain shift problem. Unlike DASS, DGSS is trained only with the source domain, aiming to extract domain-invariant features. To achieve this, two main techniques have been employed: domain randomization (DR) and feature normalization (FN).

DR augments the training set by introducing variability, either by altering the image style [41, 60] or by modifying the feature representation [24, 53]. By exposing the model to a wider variety of styles via DR, the network is less likely to overfit to the specific styles present in the training data. Consequently, the robustness of the model is improved, making it more adept at generalizing to new, unseen domains. Nevertheless, a crucial limitation of DR is its significant dependence on auxiliary domains.

In contrast to DR, FN methods regularize the features to prevent the model from overfitting to the distinct styles or characteristics of the training data. This is achieved by removing domain-specific style information using feature statistics, such as instance normalization [38] or whitening transformation [7, 42, 39, 19]. While these approaches effectively remove style-related information, they also risk removing semantic content, because content and style information are entangled. Consequently, the model can fail to capture the essential patterns or features required for accurate segmentation prediction.

To address this problem, we propose BlindNet, a model that blinds the style within the encoder and improves the robustness of the decoder, without requiring auxiliary datasets or external modules. BlindNet consists of two components: covariance alignment for the encoder and semantic consistency contrastive learning for the decoder. The covariance alignment facilitates the generation of style-invariant features through the proposed covariance matching loss function (CML) and cross-covariance loss function (CCL). CML mitigates the effects of style variations, while CCL focuses on preserving content information, addressing the content information loss commonly observed in FN methods. To further improve the generalization ability, we develop semantic consistency contrastive learning, which consists of class-wise contrastive learning (CWCL) and semantic disentanglement contrastive learning (SDCL). CWCL constructs a discriminative class embedding space, while SDCL disentangles features of similar classes that often lead to prediction errors. Extensive experiments across various datasets demonstrate that the proposed BlindNet outperforms existing DGSS methods.

Our contributions are summarized as follows:

  • We propose a covariance alignment within the encoder, comprising CML and CCL. Specifically, the CML aims to mitigate the effects of style variations, while CCL ensures the preservation of content information, together facilitating the generation of style-agnostic features.

  • We propose semantic consistency contrastive learning within the decoder that comprises CWCL and SDCL, utilizing segmentation masks. Specifically, CWCL generates discriminative embeddings, whereas SDCL disentangles features of similar classes, enhancing the robustness of the model.

  • Through extensive experiments, we demonstrate the superiority of our approach in DGSS, without the need to alter the network architecture or rely on external datasets.

2 Related Works

Domain adaptation and generalization for semantic segmentation. Domain adaptation (DA) aims at minimizing the distribution discrepancy between different domains, enabling a model to generalize from a source to a target domain. For DASS, adversarial training and cross-domain self-training strategies are commonly used. Particularly, adversarial-based methods [15, 29] employ generative adversarial networks  [11] to close the feature distribution gap between source and target domains. Meanwhile, cross-domain self-training methods [62, 59, 37, 16] generate pseudo-labels for target domain data using pre-trained models, and employ them as training data, thereby expanding the training data and reducing distribution differences between the source and target domain.

Domain generalization (DG) methods  [28, 50, 61, 30, 26] aim to improve the generalization ability of the model without accessing the target domain during training. Since the difference in style of the image is the main cause of the disparity, most existing domain generalized semantic segmentation methods utilize the style information of the image for domain-invariant learning. Feature statistics (e.g. mean, variance, covariance, gram matrix, etc.) which are commonly used in style transfer [10, 57, 27, 21] are employed to capture the style information. Interestingly, existing DGSS methods can be separated into two parts: i) domain randomization to expand the distribution of style or ii) feature normalization to remove style.

DR involves randomizing either the image or its features through stylization to learn domain-invariant features from various styles. For example, Peng et al. [42] extended the source domain data by stylizing images in the style of unreal paintings. Similarly, Yue et al. [60] and Huang et al. [22] attempted to enhance generalization by synthesizing images with diverse styles in the image space. In another study, Lee et al. [24] adopted ImageNet data [9] as wild data and performed randomization via synthesis in the feature space. Meanwhile, Wu et al. [53] diversified the trainable feature space by mixing the statistics of the feature and its color-jittered feature with Ada-IN [21].

FN methods aim to remove domain-specific styles from features, extracting only domain-invariant content. For instance, Pan et al. [39] made the first attempt at DGSS, combining batch normalization (BN) [23] and instance normalization (IN) [48] in the network layers. While BN preserves the content information within discriminative features, IN focuses on removing domain-specific style information from features. Choi et al. [7] addressed a limitation of previous whitening transformations [31, 6], which can eliminate content information from the feature. Specifically, they proposed an instance-selective whitening approach designed to remove only the covariance components that are sensitive to domain shifts. Peng et al. [41] developed semantic-aware normalization, which operates class-wise, and semantic-aware whitening, which aligns channels based on the prediction through group whitening transformation [6]. Xu et al. [55] introduced a prior guided attention module and guided feature whitening to re-calibrate the feature and remove domain-specific style effects, respectively. Unlike the FN methods that directly remove the style component, our work explores a covariance alignment method that mitigates the effect of style while preserving the content information. Our method achieves DGSS without any additional modules or auxiliary datasets.

Contrastive Learning. Contrastive learning aims to learn representations by maximizing the similarity between positive pairs of samples while minimizing the similarity between negative pairs. In recent years, it has attracted significant attention for its effectiveness in learning discriminative representations across various tasks [4, 12, 14, 2]. Oord et al. [36] were the first to introduce the InfoNCE loss, a type of contrastive loss function designed for self-contrastive learning. Park et al. [40] introduced patch-level contrastive learning for image translation, using co-located patches as positive pairs and spatially distant patches as negatives to maintain image context. For the semantic segmentation task, Wang et al. [52] introduced class-wise contrastive learning to help the model learn the embedding space of each class. Specifically, they sampled the classes existing in the images and applied contrastive learning based on the class label. For DGSS, Lee et al. [24] adopted contrastive learning to incorporate ImageNet information into their model. Specifically, they treated the ImageNet data as wild data and applied contrastive learning by setting the wild-stylized feature and its closest wild content as positive samples. In another study, Yang et al. [56] developed multi-level contrastive learning, which designs instance prototypes and class prototypes for contrastive learning. Specifically, they sample each class's pixel features and apply contrastive learning with a transition-probability matrix. Unlike recent DGSS works that embed the original image, we define contrastive learning for the augmented image. Specifically, the proposed method builds a robust embedding space by preserving the semantic consistency of the feature representation across various domains.

Figure 2: Overview of the proposed BlindNet. The network processes a pair of images: the original image $x$ and its augmented counterpart $x_a$. It employs covariance alignment to treat encoder features and utilizes semantic consistency contrastive learning for the processing of decoder features.

3 Method

The goal of the proposed method is to train a segmentation model $\varphi$ on a given source domain $S$ and generalize well to the unseen target domain $T$. Precisely, the source domain $S=\{(x,y)\}$ contains a paired image $x\in\mathbb{R}^{H\times W\times 3}$ and segmentation label $y\in\mathbb{R}^{H\times W\times C}$, where $H$, $W$, and $C$ denote the height of the image, the width of the image, and the class number of the segmentation map, respectively. The model $\varphi$ takes an original $x$ and its augmented counterpart $x_a$, which have the same content but different styles, and uses the feature information to enhance its generalization ability.

As shown in Fig. 2, our method leverages the feature information through two main approaches: covariance alignment and semantic consistency contrastive learning. Specifically, the covariance matching ensures that features, having different styles but the same content, contain similar information. Additionally, the semantic consistency contrastive learning embeds the generalized features into discriminative representation based on the segmentation label.

3.1 Covariance Alignment

The domain shift in semantic segmentation results from changes in the visual characteristics of the image, known as style. This style information is typically detected in the shallow layers of networks [38], which are the encoders of the segmentation model. Based on this understanding, our method targets the encoder to handle style variations by employing the proposed covariance matching loss and cross-covariance loss function.

Covariance Matching Loss

To train the network to uniformly recognize various styles without removing content information, we introduce the CML. Specifically, the loss aims to minimize the difference between covariance matrices derived from different styles of image features. Given an image pair $(x,x_a)$, the features from the $i^{th}$ block of the encoder are represented as $F^{i}\in\mathbb{R}^{(H^{i}\times W^{i})\times C^{i}}$ and $F_a^{i}\in\mathbb{R}^{(H^{i}\times W^{i})\times C^{i}}$, respectively. Further, following the methodologies of [7, 41], we compute the covariance matrices using instance normalized features [48], which ensures consistent scaling across features. The feature maps are normalized and flattened into $\bar{F}$ and $\bar{F_a}\in\mathbb{R}^{(HW)\times C}$, which are given by:

$$\bar{F}=\frac{\left(F-\mu(F)\right)}{\sigma(F)},\qquad \bar{F_a}=\frac{\left(F_a-\mu(F_a)\right)}{\sigma(F_a)}, \tag{1}$$

where $\mu(\cdot)\in\mathbb{R}^{C}$ and $\sigma(\cdot)\in\mathbb{R}^{C}$ denote the mean and standard deviation of the features. Utilizing the normalized features, we evaluate the covariance matrices for the original and augmented image features as:

$$\Sigma_{x,x}^{i}=\bar{F^{i}}^{T}\cdot\bar{F^{i}},\qquad \Sigma_{x_a,x_a}^{i}=\bar{F_a^{i}}^{T}\cdot\bar{F_a^{i}}. \tag{2}$$

The CML is then formulated to align these covariance matrices, ensuring that the network maintains consistency in the presence of style variations. The CML is defined as:

$$\mathcal{L}_{CM}=\sum_{i=1}^{n_e}\left\|\Sigma_{x,x}^{i}-\Sigma_{x_a,x_a}^{i}\right\|_{2}, \tag{3}$$

where $n_e$ denotes the number of blocks of the encoder.
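As a rough illustration of Eqs. (1)-(3), the following NumPy sketch computes the covariance matching loss for a list of per-block features. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`instance_normalize`, `covariance`, `covariance_matching_loss`), the small `eps`, the division by the number of spatial positions, and the reading of the matrix norm in Eq. (3) as the Frobenius norm are all choices made here for illustration.

```python
import numpy as np

def instance_normalize(feat, eps=1e-5):
    """Eq. (1): standardize each channel of a flattened (HW, C) feature map.
    eps is an assumed stabilizer that is not part of the equation."""
    mu = feat.mean(axis=0, keepdims=True)
    sigma = feat.std(axis=0, keepdims=True)
    return (feat - mu) / (sigma + eps)

def covariance(feat_a, feat_b):
    """(C, C) product of two normalized (HW, C) features, as in Eq. (2).
    Assumption: divide by HW so that entries behave like correlations."""
    return feat_a.T @ feat_b / feat_a.shape[0]

def covariance_matching_loss(feats, feats_aug):
    """Eq. (3): sum over encoder blocks of the norm of the covariance difference.
    Assumption: the matrix norm is taken as the Frobenius norm."""
    loss = 0.0
    for f, fa in zip(feats, feats_aug):
        fn, fan = instance_normalize(f), instance_normalize(fa)
        loss += np.linalg.norm(covariance(fn, fn) - covariance(fan, fan))
    return loss

rng = np.random.default_rng(0)
f = rng.normal(size=(64, 8))        # one block: HW = 64 positions, C = 8 channels
f_style = 3.0 * f + 2.0             # same content, per-channel scale and shift
f_other = rng.normal(size=(64, 8))  # unrelated content

loss_style = covariance_matching_loss([f], [f_style])
loss_content = covariance_matching_loss([f], [f_other])
print(loss_style, loss_content)
```

In this toy example the first pair differs only by a per-channel scale and shift, which the normalization in Eq. (1) removes, so its loss is close to zero. The second pair has unrelated content and yields a clearly larger value.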

Cross-covariance Loss

While the CML effectively aligns the internal distributions of features within the same image, it does not fully account for the direct correlations across paired images. To complement this, we introduce CCL, which aims to encode the consistent content information of an image pair $(x,x_a)$ by utilizing the cross-covariance of the image pair. Given the normalized feature pair $(\bar{F^{i}},\bar{F_a^{i}})$, the cross-covariance of the feature pair can be expressed as:

$$\Sigma_{x,x_a}^{i}=\bar{F^{i}}^{T}\cdot\bar{F_a^{i}}. \tag{4}$$

The cross-covariance is expected to be an identity matrix, as the feature pair should contain identical information. Nonetheless, the proposed CCL drives only the diagonal components of the cross-covariance matrix toward one. This avoids the drawback of existing FN methods [7, 42], which remove content information. The CCL function is thus defined as:

$$\mathcal{L}_{CC}=\sum_{i=1}^{n_e}\left\|\text{diag}(\Sigma_{x,x_a}^{i})-\mathbbm{1}\right\|_{2}, \tag{5}$$

where $\text{diag}(\Sigma_{x,x_a}^{i})\in\mathbb{R}^{C}$ denotes the column vector comprising the diagonal elements of $\Sigma_{x,x_a}^{i}$ and $\mathbbm{1}\in\mathbb{R}^{C}$ denotes the all-ones vector.
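Eqs. (4) and (5) can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: the function names, the `eps` term, and the division by the number of spatial positions (which makes a diagonal target of one meaningful for standardized features) are assumptions introduced here.

```python
import numpy as np

def instance_normalize(feat, eps=1e-5):
    """Eq. (1): per-channel standardization of a flattened (HW, C) feature map."""
    mu = feat.mean(axis=0, keepdims=True)
    sigma = feat.std(axis=0, keepdims=True)
    return (feat - mu) / (sigma + eps)

def cross_covariance_loss(feats, feats_aug):
    """Eq. (5): push only the diagonal of the cross-covariance toward one."""
    loss = 0.0
    for f, fa in zip(feats, feats_aug):
        fn, fan = instance_normalize(f), instance_normalize(fa)
        # Eq. (4); the 1/HW scaling is an assumption made for this sketch.
        cross = fn.T @ fan / fn.shape[0]
        # Off-diagonal entries are deliberately left out of the loss.
        loss += np.linalg.norm(np.diag(cross) - 1.0)
    return loss

rng = np.random.default_rng(0)
f = rng.normal(size=(64, 8))        # HW = 64 positions, C = 8 channels
f_style = 0.5 * f - 1.0             # same content, different per-channel statistics
f_other = rng.normal(size=(64, 8))  # unrelated content

loss_same_content = cross_covariance_loss([f], [f_style])
loss_diff_content = cross_covariance_loss([f], [f_other])
print(loss_same_content, loss_diff_content)
```

When the two features carry the same content, each channel of one is perfectly correlated with the matching channel of the other, so the diagonal is close to one and the loss is near zero. Unrelated features give near-zero diagonal entries and a large loss.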

Figure 3: Illustration of semantic consistency contrastive learning: The mask $M$ represents the error mask derived from the augmented segmentation map. CWCL conducts contrastive learning by sampling per segmentation class and SDCL conducts contrastive learning based on the $M$. Both methods share a projection head $\pi$ for the semantic representation.

3.2 Semantic Consistence Contrastive Learning

While the encoder focuses on generating style-blinded features, the decoder aims to improve the robustness of the segmentation prediction against domain shifts. For the decoder, we employ a contrastive learning approach, which has demonstrated effectiveness in extracting discriminative features [4]. Specifically, we utilize the InfoNCE loss [36], which is formulated as:

$$\mathcal{L}_{IN}(a,p,n)=-\log\left(\frac{e^{(a\cdot p/\tau)}}{e^{(a\cdot p/\tau)}+\sum_{n}^{N^{-}}e^{(a\cdot n/\tau)}}\right) \tag{6}$$

where $a$, $p$, $n$, and $N^{-}$ denote the anchor, positive sample, negative sample, and negative sample set, respectively, and $\tau$ is a temperature parameter.
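For concreteness, Eq. (6) for a single anchor can be written as the short NumPy sketch below. The function name `info_nce`, the default temperature of 0.1, and the use of unit-norm embeddings in the example are assumptions made for illustration; the log-sum-exp shift is a standard numerical-stability step and does not change the value of the loss.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """Eq. (6) for one anchor.
    anchor, positive: (D,) vectors; negatives: (K, D) matrix; tau: assumed temperature."""
    pos_logit = anchor @ positive / tau
    neg_logits = negatives @ anchor / tau
    logits = np.concatenate(([pos_logit], neg_logits))
    shift = logits.max()  # subtract the max before exponentiating for stability
    log_denominator = shift + np.log(np.exp(logits - shift).sum())
    return log_denominator - pos_logit

anchor = np.array([1.0, 0.0])

# Positive aligned with the anchor, negatives pointing elsewhere: small loss.
loss_aligned = info_nce(anchor, np.array([1.0, 0.0]),
                        np.array([[0.0, 1.0], [-1.0, 0.0]]))

# A negative coincides with the anchor while the positive does not: large loss.
loss_confused = info_nce(anchor, np.array([0.0, 1.0]),
                         np.array([[1.0, 0.0], [-1.0, 0.0]]))
print(loss_aligned, loss_confused)
```

The loss is small when the anchor is more similar to its positive than to any negative, and grows when a negative is closer to the anchor than the positive is.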

To achieve consistent feature representation in DGSS across various styles, we introduce semantic consistency contrastive learning. Specifically, the anchor is derived from the augmented image $x_a$, while the positive sample is extracted from the corresponding pixel of the original image $x$. Our method consists of two main components based on the negative sample as shown in Fig. 3: class-wise contrastive learning and semantic disentanglement contrastive learning.

Class-wise Contrastive Learning

Our CWCL aims to build a discriminative embedding space for each segmentation class using different classes of the original image as negatives. Given an image pair $(x,x_a)$, the features from the $j^{th}$ block of the decoder at pixel position $(m,n)$ are denoted as $F_{(m,n)}^{j}$ and $F_{a,(m,n)}^{j}\in\mathbb{R}^{(1\times 1)\times C^{j}}$, where $C^{j}$ indicates the channel length of the feature. As mentioned above, we take $F_{a,(m,n)}^{j}$ as the anchor and $F_{(m,n)}^{j}$ as the positive sample, since they represent the same content at the corresponding spatial location. To obtain the negative samples from $F^{j}$, we leverage the resized segmentation class label $y^{j}\in\mathbb{R}^{(H^{j}\times W^{j})\times C}$ to collect the different class samples. The samples are passed through the projection head, denoted as $\pi$, resulting in the projected features $\tilde{F}$ and $\tilde{F}_a$. We define our CWCL as:

$$\mathcal{L}_{CWCL}=\sum_{j}^{n_d}\mathcal{L}_{IN}\left(\tilde{F}_{a,(m,n)}^{j},\tilde{F}_{(m,n)}^{j},\tilde{F}_{(p,q)}^{j}\right) \tag{7}$$
$$\textrm{where}\quad(p,q)\in\{(p,q)\in P \mid y_{(p,q)}^{j}\neq y_{(m,n)}^{j}\}$$

The set $P$ represents all pixel positions in feature $F^{j}$, with dimensions $H^{j}\times W^{j}$ corresponding to the height and width. Additionally, $n_d$ denotes the total number of blocks in the decoder.
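The negative selection in Eq. (7) can be sketched for a single decoder block as follows. This is a hedged illustration, not the released implementation: it assumes the label is available as an integer class map rather than a one-hot tensor, that the projected features are already unit-norm, that the loss is averaged over an explicitly supplied list of anchor positions, and that the temperature is 0.1. The names `class_wise_contrastive_loss` and `anchors` are hypothetical.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """Eq. (6) for one anchor, computed with a stable log-sum-exp."""
    logits = np.concatenate(([anchor @ positive], negatives @ anchor)) / tau
    shift = logits.max()
    return shift + np.log(np.exp(logits - shift).sum()) - logits[0]

def class_wise_contrastive_loss(proj, proj_aug, labels, anchors, tau=0.1):
    """proj, proj_aug: (H, W, D) projected features of x and x_a for one block.
    labels: (H, W) integer class map resized to the feature resolution.
    anchors: list of (m, n) positions used as anchors (hypothetical sampling)."""
    flat_proj = proj.reshape(-1, proj.shape[-1])
    flat_labels = labels.reshape(-1)
    losses = []
    for m, n in anchors:
        # Negatives: pixels of the original image whose class differs from (m, n).
        negatives = flat_proj[flat_labels != labels[m, n]]
        if len(negatives) == 0:
            continue
        # Anchor from the augmented image, positive from the same pixel of x.
        losses.append(info_nce(proj_aug[m, n], proj[m, n], negatives, tau))
    return float(np.mean(losses)) if losses else 0.0

labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1                                   # left half class 0, right half class 1
anchors = [(m, n) for m in range(4) for n in range(4)]

class_dirs = np.array([[1.0, 0.0], [0.0, 1.0]])
proj_separated = class_dirs[labels]                 # each class has its own direction
proj_entangled = np.tile([1.0, 0.0], (4, 4, 1))     # both classes share one direction

loss_separated = class_wise_contrastive_loss(proj_separated, proj_separated, labels, anchors)
loss_entangled = class_wise_contrastive_loss(proj_entangled, proj_entangled, labels, anchors)
print(loss_separated, loss_entangled)
```

With well-separated class embeddings the loss is close to zero. When both classes collapse onto the same embedding, every anchor sees eight negatives as similar as its positive, and the loss rises to log 9.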

Figure 4: t-SNE [49] visualization comparing scenarios with and without $\mathcal{L}_{SCCL}$. In (b), the application of SCCL results in a clear separation between the sidewalk (pink), the road (purple), and the building (gray).
| Backbone | Methods | External (Dataset, Module) | G→C | G→B | G→M | G→S | C→B | C→M | C→S | C→G |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 [13] | Baseline [3] | - - | 28.95 | 25.14 | 28.18 | 26.23 | 44.96 | 51.68 | 23.29 | 42.55 |
| | IBN-Net [38] | - - | 33.85 | 32.30 | 37.75 | 27.90 | 48.56 | 57.04 | 26.14 | 45.06 |
| | RobustNet [7] | - - | 37.31 | 35.20 | 40.33 | 28.30 | 50.73 | 58.64 | 26.20 | 45.00 |
| | SiamDoGe [53] | - - | 42.96 | 37.54 | 40.64 | 28.34 | 51.53 | 59.00 | 26.67 | 45.08 |
| | DIRL [55] | - | 41.04 | 39.15 | 41.60 | - | 51.80 | - | 26.50 | 46.52 |
| | WildNet [24] | - | 44.62 | 38.42 | 46.09 | 31.34 | 50.94 | 58.79 | 27.95 | 47.01 |
| | SANSAW [42] | - | 39.75 | 37.34 | 41.86 | 30.79 | 52.95 | 59.81 | 28.32 | 47.28 |
| | SPC [20] | - | 44.10 | 40.46 | 45.51 | - | - | - | - | - |
| | DPCL [56] | - | 44.74 | 40.59 | 46.33 | 30.81 | 50.97 | 58.59 | 25.85 | 46.00 |
| | Ours | - - | 45.72 | 41.32 | 47.08 | 31.39 | 51.84 | 60.18 | 28.51 | 47.97 |
| ShuffleNetV2 [32] | Baseline [3] | - - | 25.56 | 22.17 | 28.60 | 23.33 | 36.84 | 43.13 | 21.56 | 36.95 |
| | IBN-Net [38] | - - | 27.10 | 31.82 | 34.89 | 25.56 | 41.89 | 46.35 | 22.99 | 40.91 |
| | RobustNet [7] | - - | 30.98 | 32.06 | 35.31 | 24.31 | 41.94 | 46.97 | 22.82 | 40.17 |
| | SiamDoGe [53] | - - | 34.40 | 34.23 | 35.87 | 21.95 | 42.61 | 47.48 | 23.13 | 40.93 |
| | DIRL [55] | - | 31.88 | 32.57 | 36.12 | - | 42.55 | - | 23.74 | 41.23 |
| | DPCL [56] | - | 36.66 | 34.35 | 39.92 | 22.66 | 43.90 | 48.95 | 22.47 | 41.07 |
| | Ours | - - | 38.56 | 34.51 | 40.11 | 25.64 | 44.22 | 49.69 | 23.54 | 41.10 |

Table 1: Quantitative comparison of mIoU (%) between DGSS methods. External dataset denotes the necessity of an auxiliary dataset during training and External module denotes the requirement of an additional module during inference. G, C, B, M, and S denote GTAV, Cityscapes, BDD100K, Mapillary, and SYNTHIA, respectively; a column X→Y reports a model trained on X and tested on Y. The best and second-best results are bolded and underlined, respectively.

Semantic Disentanglement Contrastive Learning

Domain shifts can lead to the entanglement of similar classes, causing the model to misclassify, as illustrated in Fig. 4. To mitigate this issue, we introduce SDCL, which is designed to disentangle the features of $x_{a}$ that have been misclassified, pulling them closer to the correct class and pushing them away from the misclassified class. To further ensure a consistent feature space and capture the semantic meaning, we share the projection head $\pi$ used in the CWCL loss. Given the predicted segmentation map of the augmented image, represented as $\hat{y}_{a}=\varphi(x_{a})$, we resize it to $\hat{y}_{a,(m,n)}^{j}\in\mathbb{R}^{(H^{j}\times W^{j})\times C}$. Similarly, $y_{(m,n)}^{j}$ represents the ground truth segmentation map.
Using these segmentation maps, we set the anchor at positions where $\hat{y}_{a,(m,n)}^{j}\neq y_{(m,n)}^{j}$. Negative samples are selected from the augmented image features corresponding to the anchor's misclassified class. The samples go through the projection head $\pi$. Our SDCL loss is defined as follows:

$$\mathcal{L}_{SDCL}=\sum_{j}^{n_{d}}\mathcal{L}_{IN}\left(\tilde{F}_{a,(m,n)}^{j},\tilde{F}_{(m,n)}^{j},\tilde{F}_{a,(r,s)}^{j}\right) \tag{8}$$
$$\text{where}\quad (r,s)\in\{(r,s)\in P \mid y_{(r,s)}^{j}=\hat{y}_{(m,n)}^{j}\}$$
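The anchor and negative selection behind Eq. (8) can be illustrated with a short sketch. It only covers the index selection on toy label maps, the function name `sdcl_pairs` is hypothetical, and the resulting positions would still have to be used to gather projected features for $\mathcal{L}_{IN}$.

```python
def sdcl_pairs(pred_aug, gt):
    """Map each misclassified position to its candidate negative positions.

    pred_aug[m][n]: predicted class of the augmented image at (m, n)
    gt[m][n]:       ground-truth class at (m, n)
    """
    positions = [(r, s) for r in range(len(gt)) for s in range(len(gt[0]))]
    pairs = {}
    for (m, n) in positions:
        wrong_class = pred_aug[m][n]
        if wrong_class == gt[m][n]:
            continue  # correctly classified pixels are not anchors
        # Negatives: positions whose ground truth equals the wrongly predicted class.
        pairs[(m, n)] = [(r, s) for (r, s) in positions if gt[r][s] == wrong_class]
    return pairs

# Pixel (0, 1) is class 0 but was predicted as class 1,
# so the class-1 pixels in the bottom row become its negatives.
gt = [[0, 0], [1, 1]]
pred_aug = [[0, 1], [1, 1]]
print(sdcl_pairs(pred_aug, gt))
```

The positive for each anchor is the original-image feature at the same position $(m,n)$, so no extra selection is needed for it.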

Finally, combining the cross-entropy segmentation loss $\mathcal{L}_{CE}$ with the other loss components, the total loss is defined as:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{CE}+\omega_{1}\mathcal{L}_{CM}+\omega_{2}\mathcal{L}_{CC}+\omega_{3}\mathcal{L}_{CWCL}+\omega_{4}\mathcal{L}_{SDCL} \tag{9}$$

where $\omega_{1}$, $\omega_{2}$, $\omega_{3}$, and $\omega_{4}$ denote the weighting factors of the respective loss functions.
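As a quick worked instance of Eq. (9), the weighted sum can be written as a one-line function. The individual loss values below are made-up placeholders for illustration only; the weights are the ones reported in Section 4.1.

```python
def total_loss(l_ce, l_cm, l_cc, l_cwcl, l_sdcl, weights):
    """Eq. (9): cross-entropy plus the four weighted auxiliary losses."""
    w1, w2, w3, w4 = weights
    return l_ce + w1 * l_cm + w2 * l_cc + w3 * l_cwcl + w4 * l_sdcl

# Placeholder loss values (not measured results) with the weights from Section 4.1.
example = total_loss(1.0, 0.5, 0.5, 1.0, 1.0, weights=(0.2, 0.2, 0.3, 0.3))
print(round(example, 6))
```

With these placeholder inputs the auxiliary terms contribute 0.1 + 0.1 + 0.3 + 0.3 on top of the cross-entropy term.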

4 Experiment

In this section, we describe the implementation details, the experimental setup for comparison with existing DGSS methods, and the ablation study conducted to further validate the effectiveness of our approach.

4.1 Implementation Details

We adopt DeepLabV3+ [3] as the segmentation architecture and use ResNet-50 [13], ShuffleNetV2 [32], and MobileNetV2 [47] as the backbone networks. The model is trained for 40K iterations with a batch size of 8 using the SGD optimizer with a momentum of 0.9 and a weight decay of 5e-4. We employ a polynomial learning rate schedule with an initial rate of 1e-2 and a power of 0.9. To simulate domain shift, we generate the augmented image $x_{a}$ using a strong color jittering transformation similar to [7]. The weighting parameters in (9), $\omega_{1}$, $\omega_{2}$, $\omega_{3}$, and $\omega_{4}$, are set to 0.2, 0.2, 0.3, and 0.3, respectively.
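For reference, a polynomial learning rate schedule is commonly written as below. This is a sketch of the usual formula with the hyperparameters stated above; whether the authors' code uses exactly this per-iteration form is an assumption, and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(iteration, base_lr=1e-2, max_iter=40_000, power=0.9):
    """Common polynomial decay: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# The rate starts at the initial value and decays to zero at the final iteration.
for it in (0, 10_000, 20_000, 40_000):
    print(it, poly_lr(it))
```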

4.2 Datasets

We use two synthetic datasets (GTAV [43] and SYNTHIA [45]) and three real-world datasets (Cityscapes [8], BDD-100K [58], and Mapillary [33]) for the experiments. All datasets are evaluated on 19 object categories.

GTAV (G) [43] is a large-scale dataset generated from the Grand Theft Auto V (GTAV) game engine. It comprises 24,966 images with a resolution of 1914×1052, split into 12,403 for training, 6,382 for validation, and 6,181 for testing.

SYNTHIA (S) [45] is a virtual, photo-realistic urban scene dataset comprising 9,400 images with a resolution of 960×720. Among these, 2,820 images are designated for evaluation.

Cityscapes (C) [8] is a large-scale urban scene dataset captured in 50 cities, primarily in Germany. It contains 5,000 high-resolution images with a resolution of 2048×1024, divided into 2,975 images for training, 500 for validation, and 1,525 for testing.

BDD-100K (B) [58] is another real-world urban scene dataset that contains 10,000 more diverse urban driving scene images with a resolution of 1280×720. The validation split (1,000 images) is used for evaluation.

Mapillary (M) [33] contains 25,000 images with a minimum resolution of 1920×1080, collected from various locations worldwide. The validation split (2,000 images) is used for evaluation.

4.3 Comparison with DGSS methods

We compare our method with other state-of-the-art DGSS methods: Baseline (DeepLabV3+ [3] trained on the source domain), IBN-Net [38], RobustNet [7], SiamDoGe [53], DIRL [55], WildNet [24], SANSAW [42], SPC [20], and DPCL [56]. To evaluate the generalization ability of the model on arbitrary unseen domains, we conduct experiments in two scenarios: i) trained on GTAV, tested on Cityscapes, BDD-100K, and Mapillary, and ii) trained on Cityscapes, tested on BDD-100K, Mapillary, and SYNTHIA. The quantitative results are reported as mean intersection over union (mIoU). Additionally, we compare the methods using ResNet-50 [13], ShuffleNetV2 [32], and MobileNetV2 [47] backbones pre-trained on ImageNet [9].
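The mIoU metric used throughout the comparison can be sketched as follows. This is a simplified illustration on flat lists of class ids: the `ignore_index` value of 255 is an assumed convention, and benchmark evaluations typically accumulate intersections and unions over the whole test set rather than a single image.

```python
def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in the prediction or the ground truth."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore_index:
            continue  # unlabeled pixels do not count
        if p == g:
            intersection[g] += 1
            union[g] += 1
        else:
            union[g] += 1  # missed pixel of the true class
            union[p] += 1  # false positive of the predicted class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")

# Class 0: IoU 1/2, class 1: IoU 2/3, so the mean is 7/12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```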

| Methods | External Module | C | B | M | Mean |
|---|---|---|---|---|---|
| Baseline [3] | | 25.94 | 25.73 | 26.45 | 26.04 |
| IBN-Net [38] | | 30.14 | 27.66 | 27.07 | 28.29 |
| RobustNet [7] | | 30.86 | 30.05 | 30.67 | 30.52 |
| SiamDoGe [53] | | 34.15 | 34.50 | 32.34 | 33.67 |
| DIRL [55] | | 34.67 | 32.78 | 34.31 | 33.92 |
| DPCL [56] | | 37.57 | 35.45 | 40.30 | 37.77 |
| Ours | | 37.66 | 36.10 | 40.40 | 38.05 |

Table 2: Quantitative comparison of mIoU (%) using a MobileNetV2 [47] backbone trained on the GTAV (G) dataset.

Quantitative and Qualitative Results

Table 1 summarizes the quantitative results. With ResNet-50 as the backbone, our method outperforms all other methods when trained on GTAV. Compared with FN methods that remove domain-specific styles, these results indicate that our approach minimizes the loss of content information. Our method also generalizes well when trained on Cityscapes. We further evaluate our method with different backbones to show its wide applicability. When trained with ShuffleNetV2, our method achieves the best or second-best performance on the unseen target domains. Table 2 shows the results of our method trained on GTAV with MobileNetV2, where it achieves the highest mIoU on all three target domains.

For qualitative evaluation, we compare the visual results of existing DGSS methods and ours. As depicted in Fig. 5, our method produces more accurate overall predictions than the other approaches. Notably, it separates classes such as road and sidewalk more distinctly, yielding clearer segmentation boundaries. Please refer to the supplementary material for more qualitative results.

Figure 5: Qualitative comparison between DGSS methods trained on GTAV (G) and tested on the unseen target domain of Cityscapes (C) using DeepLabV3+ with a ResNet-50 backbone.
| Methods | External Module | Params (M) | GFLOPS | Time (ms) |
|---|---|---|---|---|
| Baseline [13] | | 45.08 | 277.77 | 10.01 |
| SANSAW [42] | | 25.63 | 421.86 | 68.96 |
| SPC [20] | | 45.22 | 286.09 | 12.24 |
| DIRL [55] | | 45.41 | 278.11 | 11.69 |
| DPCL [56] | | 56.46 | 1188.64 | 823.78 |
| Ours | | 45.08 | 277.78 | 10.03 |

Table 3: Computational cost comparison conducted using DeepLabV3+ with a ResNet-50 backbone on an NVIDIA Tesla V100 GPU with an image resolution of $2048\times 1024$. Inference time is averaged over 400 trials.

Computational cost analysis

To confirm that our approach does not incur additional computational overhead, we provide the number of parameters, GFLOPS, and average inference time of each method. As detailed in Table 3, our method operates comparably to baseline models by learning features intrinsically without adopting a separate module.
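Averaging inference time over repeated trials can be done with a small harness like the one below. This is a generic sketch, not the measurement script used for Table 3: the warm-up count is an assumption, `average_time_ms` is a hypothetical helper, and timing a GPU model additionally requires synchronizing the device before reading the clock.

```python
import time

def average_time_ms(fn, trials=400, warmup=10):
    """Average wall-clock time of fn() in milliseconds over `trials` calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the measurement
    start = time.perf_counter()
    for _ in range(trials):
        fn()
    return (time.perf_counter() - start) * 1000.0 / trials

# Stand-in workload; a real measurement would call the model's forward pass here.
print(average_time_ms(lambda: sum(range(1000)), trials=400))
```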

| $\mathcal{L}_{CM}$ | $\mathcal{L}_{CC}$ | $\mathcal{L}_{CWCL}$ | $\mathcal{L}_{SDCL}$ | C | B | M |
|---|---|---|---|---|---|---|
| | | | | 28.95 | 25.14 | 28.18 |
| | | | | 38.08 | 36.65 | 40.62 |
| | | | | 40.42 | 37.81 | 43.91 |
| | | | | 42.03 | 38.27 | 44.02 |
| | | | | 43.16 | 38.59 | 45.38 |
| | | | | 43.17 | 38.23 | 44.84 |
| | | | | 45.72 | 41.32 | 47.08 |

Table 4: Ablation study on proposed losses. The experiments were conducted using DeepLabV3+ with a ResNet-50 backbone, trained on the GTAV dataset. The losses are defined in the following equations: $\mathcal{L}_{CM}$: (3), $\mathcal{L}_{CC}$: (5), $\mathcal{L}_{CWCL}$: (7), $\mathcal{L}_{SDCL}$: (8).

4.4 Ablation Studies

In this subsection, we conduct a series of ablation studies to demonstrate the individual contribution and effectiveness of each component of our method. Specifically, we investigate the impact of the following components: $\mathcal{L}_{CM}$, $\mathcal{L}_{CC}$, $\mathcal{L}_{CWCL}$, and $\mathcal{L}_{SDCL}$. For the study, we use a scenario where the DeepLabV3+ model with a ResNet-50 backbone is trained on GTAV and tested on Cityscapes, BDD-100K, and Mapillary.

Table 4 presents the impact of the proposed losses on domain generalization performance. The baseline model, trained solely with the cross-entropy loss, performs poorly on the target domains because it overfits to the source domain. In contrast, adding any of the proposed losses markedly improves performance. More specifically, incorporating the covariance alignment losses ($\mathcal{L}_{CM}$, $\mathcal{L}_{CC}$) shows their efficacy in preserving essential content information by correlating features of paired images. The contribution of semantic consistency contrastive learning ($\mathcal{L}_{CWCL}$, $\mathcal{L}_{SDCL}$) is also evident, as it significantly aids in disentangling features of similar classes, thereby constructing a more robust embedding space.

Figure 6: t-SNE [49] visualization comparing the covariance with and without $\mathcal{L}_{CML}$.

Covariance Matching Loss. Fig. 6 presents t-SNE plots of covariances for original and augmented images, before and after the application of CML. The baseline network perceives original and augmented images differently from a style perspective. However, after applying CML, the distribution becomes more intermixed, indicating that our proposed CML effectively ensures similar recognition of different style images.

Calculation of CCL. Table 5(a) demonstrates that our proposed cross-covariance method, which drives the diagonal components to 1, yields superior performance. As mentioned before, removing the non-diagonal components, which contain content information, actually degrades performance.
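The idea of driving only the diagonal of a cross-covariance matrix toward 1 can be illustrated with a small sketch. It is not the paper's Eq. (5): the per-channel standardization, the absolute-difference penalty, and the function names are assumptions made for this illustration.

```python
import math

def standardize(channel, eps=1e-5):
    """Zero-mean, unit-variance version of one flattened channel."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    std = math.sqrt(var) + eps
    return [(v - mean) / std for v in channel]

def cross_covariance(feat, feat_aug):
    """feat, feat_aug: lists of C channels, each a flat list of H*W values."""
    z = [standardize(c) for c in feat]
    z_aug = [standardize(c) for c in feat_aug]
    n = len(feat[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z_aug] for zi in z]

def diagonal_loss(cross_cov):
    """Penalize only the diagonal entries for deviating from 1 (assumed L1 form)."""
    c = len(cross_cov)
    return sum(abs(1.0 - cross_cov[i][i]) for i in range(c)) / c

feat = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]]
# A per-channel affine change (a crude stand-in for a style change) keeps the diagonal near 1.
feat_aug = [[2.0 * v + 1.0 for v in feat[0]], [0.5 * v - 1.0 for v in feat[1]]]
print(diagonal_loss(cross_covariance(feat, feat_aug)))
```

Off-diagonal entries are left unconstrained here, in contrast to a whitening objective that would also push them to zero.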

Sampling number in CWCL. Table 5(b) and Table 5(c) show the impact of varying the number of classes sampled per image and the number of samples per class in CWCL, respectively. As shown in Table 5(b), the performance improves as more classes are sampled in CWCL. This suggests that contrasting a broader array of classes enhances the model's discriminative capability. Furthermore, Table 5(c) shows that a moderate number of negative samples per class (50) performs best, while both fewer (10) and more (100) samples reduce performance.
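The two sampling budgets studied here can be pictured with the following sketch. The defaults of 15 classes and 50 samples per class are the best settings in Table 5; uniform random sampling, the `ignore_index` value, and the function name are assumptions for illustration.

```python
import random

def sample_cwcl_positions(labels, max_classes=15, samples_per_class=50,
                          ignore_index=255, rng=random):
    """Pick up to max_classes classes, then up to samples_per_class positions of each."""
    by_class = {}
    for m, row in enumerate(labels):
        for n, lab in enumerate(row):
            if lab != ignore_index:
                by_class.setdefault(lab, []).append((m, n))
    classes = sorted(by_class)
    if len(classes) > max_classes:
        classes = rng.sample(classes, max_classes)
    return {c: rng.sample(by_class[c], min(samples_per_class, len(by_class[c])))
            for c in classes}

toy_labels = [[0, 0, 1], [1, 2, 255]]
picked = sample_cwcl_positions(toy_labels, max_classes=2, samples_per_class=1,
                               rng=random.Random(0))
print(picked)
```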

Projection Head for SDCL. We investigate the influence of different projection head configurations on SDCL. We experimented with three distinct approaches: an individual projection head, a copy of the weights of CWCL's projection head (stop gradient), and the projection head shared with CWCL. As demonstrated in Table 5(d), sharing the projection head yields the best results. The results indicate that SDCL not only relies on the semantic information from CWCL for effective disentanglement of similar classes but also enhances the embedding space learned by CWCL.

(a) Cross-covariance loss

| Method | C | B | M |
|---|---|---|---|
| Whitening | 38.68 | 36.91 | 42.12 |
| Diagonal | 40.42 | 37.81 | 43.91 |

(b) # of classes

| # | C | B | M |
|---|---|---|---|
| 10 | 45.57 | 38.88 | 46.37 |
| 15 | 45.72 | 41.32 | 47.08 |

(c) # of negative samples

| # | C | B | M |
|---|---|---|---|
| 10 | 44.76 | 38.21 | 46.46 |
| 50 | 45.72 | 41.32 | 47.08 |
| 100 | 44.44 | 39.14 | 46.29 |

(d) Projection head of SDCL

| MLP | C | B | M |
|---|---|---|---|
| Individual | 44.91 | 38.45 | 46.31 |
| Shared (SG) | 44.03 | 38.15 | 46.64 |
| Shared | 45.72 | 41.32 | 47.08 |

Table 5: Ablation studies. (a) Calculation of $\mathcal{L}_{CC}$. (b) Number of classes for $\mathcal{L}_{CWCL}$. (c) Number of negative samples for $\mathcal{L}_{CWCL}$. (d) Projection head of $\mathcal{L}_{SDCL}$. "SG" indicates stop gradient.

5 Conclusion

In this paper, we propose BlindNet, a novel DGSS approach based on covariance alignment and semantic consistency contrastive learning. By introducing covariance alignment, our method effectively addresses style variations, ensuring the extraction of features that are consistent across different styles. Furthermore, with the proposed semantic consistency contrastive learning, we not only facilitate the extraction of discriminative features but also enhance the generalization capabilities of the model in semantic segmentation predictions. Comprehensive experimental results validate the effectiveness of our approach, demonstrating its ability to generalize across multiple unseen target domains without requiring auxiliary domains or additional modules. Future work will focus on improving and stabilizing the covariance alignment method.

Acknowledgement. This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) (Grant No. NRF-2022R1F1A1073543), the MSIT (Ministry of Science and ICT), Korea, under the ICAN (ICT Challenge and Advanced Network of HRD) support program (RS-2022-00156385) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and the Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2024-00156287).

References

  • Bartoccioni et al. [2023] Florent Bartoccioni, Éloi Zablocki, Andrei Bursuc, Patrick Pérez, Matthieu Cord, and Karteek Alahari. Lara: Latents and rays for multi-camera bird’s-eye-view semantic segmentation. In Conference on Robot Learning, pages 1663–1672. PMLR, 2023.
  • Caron et al. [2020] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
  • Chen et al. [2017] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • Cheng et al. [2022] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
  • Cho et al. [2019] Wonwoong Cho, Sungha Choi, David Keetae Park, Inkyu Shin, and Jaegul Choo. Image-to-image translation via group-wise deep whitening-and-coloring transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10639–10647, 2019.
  • Choi et al. [2021] Sungha Choi, Sanghun Jung, Huiwon Yun, Joanne T Kim, Seungryong Kim, and Jaegul Choo. Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11580–11590, 2021.
  • Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • Gatys et al. [2016] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Grill et al. [2020] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
  • Hoffman et al. [2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning, pages 1989–1998. PMLR, 2018.
  • Hoyer et al. [2022] Lukas Hoyer, Dengxin Dai, and Luc Van Gool. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9924–9935, 2022.
  • Hoyer et al. [2023] Lukas Hoyer, Dengxin Dai, Qin Wang, Yuhua Chen, and Luc Van Gool. Improving semi-supervised and domain-adaptive semantic segmentation with self-supervised depth estimation. International Journal of Computer Vision, pages 1–27, 2023.
  • Hu et al. [2023] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
  • Huang et al. [2021] Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. Fsdr: Frequency space domain randomization for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6891–6902, 2021.
  • Huang et al. [2023] Wei Huang, Chang Chen, Yong Li, Jiacheng Li, Cheng Li, Fenglong Song, Youliang Yan, and Zhiwei Xiong. Style projected clustering for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3061–3071, 2023.
  • Huang and Belongie [2017] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pages 1501–1510, 2017.
  • Huang et al. [2018] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pages 172–189, 2018.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015.
  • Lee et al. [2022] Suhyeon Lee, Hongje Seong, Seongwon Lee, and Euntai Kim. Wildnet: Learning domain generalized semantic segmentation from the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9936–9946, 2022.
  • Li et al. [2020] Guangrui Li, Guoliang Kang, Wu Liu, Yunchao Wei, and Yi Yang. Content-consistent matching for domain adaptive semantic segmentation. In European conference on computer vision, pages 440–456. Springer, 2020.
  • Li et al. [2021] Lei Li, Ke Gao, Juan Cao, Ziyao Huang, Yepeng Weng, Xiaoyue Mi, Zhengze Yu, Xiaoya Li, and Boyang Xia. Progressive domain expansion network for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 224–233, 2021.
  • Li et al. [2017] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. Advances in neural information processing systems, 30, 2017.
  • Li et al. [2018] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European conference on computer vision (ECCV), pages 624–639, 2018.
  • Li et al. [2019] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6936–6945, 2019.
  • Li and Hoiem [2017] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
  • Luo [2017] Ping Luo. Learning deep architectures via generalized whitened neural networks. In International Conference on Machine Learning, pages 2238–2246. PMLR, 2017.
  • Ma et al. [2018] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
  • Neuhold et al. [2017] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE international conference on computer vision, pages 4990–4999, 2017.
  • Nilsson et al. [2021] David Nilsson, Aleksis Pirinen, Erik Gärtner, and Cristian Sminchisescu. Embodied visual active learning for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2373–2383, 2021.
  • Onozuka et al. [2021] Yuya Onozuka, Ryosuke Matsumi, and Motoki Shino. Autonomous mobile robot navigation independent of road boundary using driving recommendation map. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4501–4508. IEEE, 2021.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Pan et al. [2020] Fei Pan, Inkyu Shin, Francois Rameau, Seokju Lee, and In So Kweon. Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3764–3773, 2020.
  • Pan et al. [2018] Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In Proceedings of the European Conference on Computer Vision (ECCV), pages 464–479, 2018.
  • Pan et al. [2019] Xingang Pan, Xiaohang Zhan, Jianping Shi, Xiaoou Tang, and Ping Luo. Switchable whitening for deep representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1863–1871, 2019.
  • Park et al. [2020] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 319–345. Springer, 2020.
  • Peng et al. [2021] Duo Peng, Yinjie Lei, Lingqiao Liu, Pingping Zhang, and Jun Liu. Global and local texture randomization for synthetic-to-real semantic segmentation. IEEE Transactions on Image Processing, 30:6594–6608, 2021.
  • Peng et al. [2022] Duo Peng, Yinjie Lei, Munawar Hayat, Yulan Guo, and Wen Li. Semantic-aware domain generalized segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2594–2605, 2022.
  • Richter et al. [2016] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 102–118. Springer, 2016.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • Ros et al. [2016] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016.
  • Sakaridis et al. [2021] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10765–10775, 2021.
  • Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • Ulyanov et al. [2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • Volpi et al. [2018] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. Advances in neural information processing systems, 31, 2018.
  • Vu et al. [2019] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2517–2526, 2019.
  • Wang et al. [2021] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7303–7313, 2021.
  • Wu et al. [2022] Zhenyao Wu, Xinyi Wu, Xiaoping Zhang, Lili Ju, and Song Wang. Siamdoge: Domain generalizable semantic segmentation using siamese network. In European Conference on Computer Vision, pages 603–620. Springer, 2022.
  • Xie et al. [2021] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
  • Xu et al. [2022] Qi Xu, Liang Yao, Zhengkai Jiang, Guannan Jiang, Wenqing Chu, Wenhui Han, Wei Zhang, Chengjie Wang, and Ying Tai. Dirl: Domain-invariant representation learning for generalizable semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2884–2892, 2022.
  • Yang et al. [2023] Liwei Yang, Xiang Gu, and Jian Sun. Generalized semantic segmentation by self-supervised source domain projection and multi-level contrastive learning. arXiv preprint arXiv:2303.01906, 2023.
  • Yoo et al. [2019] Jaejun Yoo, Youngjung Uh, Sanghyuk Chun, Byeongkyu Kang, and Jung-Woo Ha. Photorealistic style transfer via wavelet transforms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9036–9045, 2019.
  • Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020.
  • Yu et al. [2021] Fei Yu, Mo Zhang, Hexin Dong, Sheng Hu, Bin Dong, and Li Zhang. Dast: Unsupervised domain adaptation in semantic segmentation based on discriminator attention and self-training. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10754–10762, 2021.
  • Yue et al. [2019] Xiangyu Yue, Yang Zhang, Sicheng Zhao, Alberto Sangiovanni-Vincentelli, Kurt Keutzer, and Boqing Gong. Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2100–2110, 2019.
  • Zhou et al. [2020] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. In International Conference on Learning Representations, 2020.
  • Zou et al. [2018] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pages 289–305, 2018.

Supplementary Material

Appendix A Implementation Details of BlindNet

As shown in Fig. 7, we apply our covariance alignment losses to the encoder features and the semantic consistency contrastive learning to the decoder features.

Figure 7: Implementation details of proposed loss functions in the DeepLabV3+ architecture.

Appendix B More Results

In this section, we present detailed quantitative comparison results (Section B.1) and additional qualitative results (Section B.2) of our study.

B.1 Quantitative Results

Table 6 reports a comparison of pixel accuracy and IoU for each semantic class between DGSS methods. Our model significantly outperforms others in overall pixel accuracy, indicating its robust performance. In IoU for each semantic class, our model particularly excels in roads, sidewalks, sky, people, riders, and cars, which are commonly present in photos. However, the table also indicates a degraded performance in classes such as traffic signs, traffic lights, and trains, which are less frequently encountered in the source domain (GTAV). Our future work will aim to address this issue and improve performance across all classes.
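As a reference for the two metrics reported in Table 6, the following is a minimal sketch of how pixel accuracy and per-class IoU are commonly computed from predicted and ground-truth label maps. It is not taken from the BlindNet code base: the function names, the toy labels, and the `ignore_index=255` convention (used by Cityscapes-style annotations) are assumptions made for illustration.

```python
from collections import Counter

def confusion_counts(pred, label, ignore_index=255):
    """Count (ground truth, prediction) pairs, skipping ignored pixels.

    ignore_index=255 is an assumed convention, not a value stated in the paper.
    """
    counts = Counter()
    for p, t in zip(pred, label):
        if t != ignore_index:
            counts[(t, p)] += 1
    return counts

def pixel_accuracy(counts):
    """Fraction of valid pixels whose predicted class equals the ground truth."""
    total = sum(counts.values())
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / total

def per_class_iou(counts, num_classes):
    """IoU_c = TP_c / (GT_c + Pred_c - TP_c); None if class c never appears."""
    ious = []
    for c in range(num_classes):
        tp = counts[(c, c)]
        gt = sum(n for (t, _), n in counts.items() if t == c)
        pr = sum(n for (_, p), n in counts.items() if p == c)
        union = gt + pr - tp
        ious.append(tp / union if union else None)
    return ious

# Toy example: flattened label maps with three classes and one ignored pixel.
label = [0, 0, 1, 1, 2, 2, 255]
pred = [0, 1, 1, 1, 2, 0, 1]

counts = confusion_counts(pred, label)
acc = pixel_accuracy(counts)
ious = per_class_iou(counts, num_classes=3)
valid = [v for v in ious if v is not None]
miou = sum(valid) / len(valid)
```

In this toy case four of the six valid pixels are correct, and the per-class IoUs are 1/3, 2/3, and 1/2. Pixel accuracy is dominated by frequent classes such as road and sky, whereas mIoU weights every class equally, which is why a model can lead in pixel accuracy while still trailing on rare classes.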

B.2 Qualitative Results

Figs. 8 (G→C), 9 (G→B), and 10 (G→M) present qualitative comparisons between our model and others, including baseline [3], RobustNet [7], WildNet [24], SiamDoGe [53], and SPC [20]. The results clearly illustrate our model’s consistent superiority, particularly in the segmentation of sidewalks, roads, buildings, terrain, and cars. The result demonstrates the robustness and effectiveness of our model in handling DGSS.

Appendix C More Ablation Studies

In this section, we conduct more ablation studies on our model. In Section C.1, we show a qualitative analysis of the proposed loss functions, and in Section C.2, we experiment on the weight of the proposed loss functions.

C.1 Qualitative Results

To validate the contribution of each loss, we incrementally added the loss functions ($\mathcal{L}_{CM}$, $\mathcal{L}_{CC}$, $\mathcal{L}_{CWCL}$, $\mathcal{L}_{SDCL}$) to the baseline model. Fig. 11 presents the qualitative results of this ablation study.

The introduction of CML ($\mathcal{L}_{CM}$) enhances the capture of details such as traffic lights, as illustrated in Fig. 11, row 1. Adding CCL ($\mathcal{L}_{CC}$) further strengthens content representation, leading to an improvement in overall accuracy. The CWCL ($\mathcal{L}_{CWCL}$) strengthens semantic understanding, allowing for better detection of smaller objects. However, this enhancement comes with a trade-off, as it introduces some degree of confusion among similar classes (e.g. sidewalk and road). The application of SDCL ($\mathcal{L}_{SDCL}$) effectively disentangles misclassified features, leading to clearer class distinctions.

C.2 Hyper-parameter

We varied the weighting parameters of each loss function in Eq. (9). Starting from the configuration that showed the best performance, we adjusted one loss weight at a time by ±0.1; the results are reported in Table 7. For the CML ($\mathcal{L}_{CM}$), a key component for style blindness, an overly strong weight significantly degrades network performance. Conversely, the CCL ($\mathcal{L}_{CC}$) and the CWCL ($\mathcal{L}_{CWCL}$) perform better with a slightly higher weight than with a lower one.
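The sensitivity protocol described above can be sketched as a one-at-a-time sweep around the reference weights. This is an illustrative sketch only: the dictionary keys and the helper names `one_at_a_time` and `total_loss` are hypothetical, and the form of the combined objective (task loss plus a weighted sum of the four proposed losses) is an assumption, since Eq. (9) is not reproduced in this supplement.

```python
# Reference weights (first row of Table 7): omega_1..omega_4.
BASE = {"CM": 0.2, "CC": 0.2, "CWCL": 0.3, "SDCL": 0.3}
STEP = 0.1

def one_at_a_time(base, step):
    """Return the reference config plus every config with one weight shifted by +/- step."""
    configs = [dict(base)]
    for name in base:
        for delta in (-step, step):
            cfg = dict(base)
            # round() removes floating-point noise such as 0.30000000000000004.
            cfg[name] = round(cfg[name] + delta, 10)
            configs.append(cfg)
    return configs

def total_loss(task_loss, losses, weights):
    """Assumed objective: task loss plus the weighted sum of the auxiliary losses."""
    return task_loss + sum(weights[k] * losses[k] for k in weights)

configs = one_at_a_time(BASE, STEP)

# Dummy loss values, used only to show how the weights enter the objective.
example = total_loss(1.0, {"CM": 0.5, "CC": 0.5, "CWCL": 0.5, "SDCL": 0.5}, BASE)
```

With four weights and two perturbations each, the sweep yields nine configurations, which matches the nine rows of Table 7. A one-at-a-time design keeps the number of training runs linear in the number of weights, at the cost of not probing interactions between weights.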

| $\omega_1$ ($\mathcal{L}_{CM}$) | $\omega_2$ ($\mathcal{L}_{CC}$) | $\omega_3$ ($\mathcal{L}_{CWCL}$) | $\omega_4$ ($\mathcal{L}_{SDCL}$) | C | B | M | S |
|---|---|---|---|---|---|---|---|
| 0.2 | 0.2 | 0.3 | 0.3 | 45.72 | 41.32 | 47.08 | 31.39 |
| 0.1 | 0.2 | 0.3 | 0.3 | 45.06 | 39.37 | 45.14 | 31.09 |
| 0.3 | 0.2 | 0.3 | 0.3 | 43.04 | 38.75 | 44.69 | 29.58 |
| 0.2 | 0.1 | 0.3 | 0.3 | 44.15 | 39.15 | 46.00 | 30.62 |
| 0.2 | 0.3 | 0.3 | 0.3 | 44.78 | 40.01 | 46.56 | 30.74 |
| 0.2 | 0.2 | 0.2 | 0.3 | 43.42 | 39.24 | 45.55 | 30.40 |
| 0.2 | 0.2 | 0.4 | 0.3 | 45.52 | 39.88 | 45.73 | 30.20 |
| 0.2 | 0.2 | 0.3 | 0.2 | 44.58 | 40.42 | 47.35 | 30.72 |
| 0.2 | 0.2 | 0.3 | 0.4 | 45.26 | 40.16 | 46.91 | 30.49 |

Table 7: Sensitivity to the weighting parameters of each loss function.

| Methods | Pixel Accuracy | mIoU | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic light | Traffic sign | Vegetation | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline [3] | 71.02 | 29.0 | 51.9 | 20.6 | 57.2 | 22.4 | 21.0 | 25.3 | 24.9 | 10.1 | 61.3 | 23.7 | 52.0 | 53.8 | 13.6 | 51.2 | 19.5 | 21.2 | 0.3 | 12.0 | 8.1 |
| RobustNet [7] | 77.18 | 37.3 | 58.9 | 27.7 | 63.2 | 22.8 | 23.1 | 26.4 | 30.6 | 20.7 | 85.1 | 39.2 | 69.8 | 62.4 | 15.9 | 76.7 | 23.2 | 22.3 | 3.9 | 18.4 | 18.6 |
| SiamDoGe [53] | 84.73 | 43.0 | 83.7 | 34.1 | 78.6 | 26.4 | 25.6 | 26.0 | 42.4 | 28.6 | 84.3 | 28.1 | 68.9 | 62.1 | 31.1 | 85.6 | 31.3 | 28.9 | 3.5 | 22.8 | 23.3 |
| WildNet [24] | 84.57 | 44.6 | 81.2 | 38.2 | 76.9 | 28.1 | 25.1 | 35.1 | 32.1 | 24.5 | 85.4 | 35.4 | 72.2 | 65.0 | 27.3 | 85.5 | 29.7 | 33.2 | 12.6 | 32.8 | 27.4 |
| SPC [20] | 86.65 | 44.1 | 86.9 | 37.8 | 81.2 | 28.9 | 26.9 | 36.9 | 35.1 | 25.2 | 83.7 | 36.2 | 78.5 | 63.9 | 30.4 | 84.1 | 24.8 | 28.1 | 12.1 | 19.3 | 17.9 |
| DPCL [56] | 82.22 | 44.7 | 75.6 | 32.8 | 73.2 | 26.1 | 23.5 | 34.1 | 42.3 | 28.2 | 85.2 | 38.5 | 81.2 | 63.8 | 25.0 | 76.6 | 31.7 | 33.9 | 5.7 | 27.6 | 45.0 |
| Ours | 87.91 | 45.7 | 88.3 | 44.1 | 82.4 | 30.9 | 26.8 | 35.4 | 33.4 | 20.3 | 85.0 | 34.2 | 78.5 | 66.0 | 33.7 | 86.8 | 33.0 | 41.1 | 1.4 | 25.3 | 22.1 |

Table 6: Quantitative results for pixel accuracy and each semantic class. The models are trained on GTAV and tested on Cityscapes using a ResNet50 backbone. The best and second best results are bolded and underlined, respectively.
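
As a quick consistency check on Table 6, the mIoU column can be recomputed from the per-class columns. The sketch below assumes mIoU is the unweighted mean over the 19 listed classes (the standard definition, not stated explicitly in this supplement) and uses the per-class values of the "Ours" row; the variable names are illustrative.

```python
# Per-class IoU (%) of the "Ours" row in Table 6, in column order.
CLASSES = [
    "Road", "Sidewalk", "Building", "Wall", "Fence", "Pole", "Traffic light",
    "Traffic sign", "Vegetation", "Terrain", "Sky", "Person", "Rider", "Car",
    "Truck", "Bus", "Train", "Motorcycle", "Bicycle",
]
OURS_IOU = [
    88.3, 44.1, 82.4, 30.9, 26.8, 35.4, 33.4, 20.3, 85.0, 34.2,
    78.5, 66.0, 33.7, 86.8, 33.0, 41.1, 1.4, 25.3, 22.1,
]

# Assumption: mIoU is the unweighted mean of the per-class IoUs.
miou = sum(OURS_IOU) / len(OURS_IOU)
print(f"mIoU = {miou:.2f}")
```

The mean comes out at 45.72, which agrees with the 45.7 reported in the mIoU column and with the 45.72 listed for Cityscapes (C) in the first row of Table 7.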
Figure 8: Qualitative comparison between DGSS methods trained on GTAV (G) and tested on the unseen target domain Cityscapes (C), using DeepLabV3+ with a ResNet50 backbone.
Figure 9: Qualitative comparison between DGSS methods trained on GTAV (G) and tested on the unseen target domain BDD100K (B), using DeepLabV3+ with a ResNet50 backbone.
Figure 10: Qualitative comparison between DGSS methods trained on GTAV (G) and tested on the unseen target domain Mapillary (M), using DeepLabV3+ with a ResNet50 backbone.
Figure 11: Qualitative comparison for ablation studies. The models are trained on GTAV (G) and tested on the unseen target domain Mapillary (M), using DeepLabV3+ with a ResNet50 backbone. (a) $\mathcal{L}_{CM}$. (b) $\mathcal{L}_{CM}+\mathcal{L}_{CC}$. (c) $\mathcal{L}_{CM}+\mathcal{L}_{CC}+\mathcal{L}_{CWCL}$. (d) $\mathcal{L}_{CM}+\mathcal{L}_{CC}+\mathcal{L}_{CWCL}+\mathcal{L}_{SDCL}$.