Style Blind Domain Generalized Semantic Segmentation
via Covariance Alignment and Semantic Consistence Contrastive Learning

Woo-Jin Ahn¹, Geun-Yeong Yang¹, Hyun-Duck Choi², Myo-Taeg Lim¹*

¹Korea University, ²Chonnam National University

¹{wjahn,hggofficial,mlim}@korea.ac.kr, ²[email protected]

*Corresponding Authors

Abstract

Deep learning models for semantic segmentation often experience performance degradation when deployed to unseen target domains unidentified during the training phase. This is mainly due to variations in image texture (i.e. style) from different data sources. To tackle this challenge, existing domain generalized semantic segmentation (DGSS) methods attempt to remove style variations from the feature. However, these approaches struggle with the entanglement of style and content, which may lead to the unintentional removal of crucial content information, causing performance degradation. This study addresses this limitation by proposing BlindNet, a novel DGSS approach that blinds the style without external modules or datasets. The main idea behind our proposed approach is to alleviate the effect of style in the encoder whilst facilitating robust segmentation in the decoder. To achieve this, BlindNet comprises two key components: covariance alignment and semantic consistency contrastive learning. Specifically, the covariance alignment trains the encoder to uniformly recognize various styles and preserve the content information of the feature, rather than removing the style-sensitive factor. Meanwhile, semantic consistency contrastive learning enables the decoder to construct discriminative class embedding space and disentangles features that are vulnerable to misclassification. Through extensive experiments, our approach outperforms existing DGSS methods, exhibiting robustness and superior performance for semantic segmentation on unseen target domains. The code is available at https://github.com/root0yang/BlindNet.

1 Introduction

Figure 1: Comparison of semantic segmentation results between the baseline (DeepLabV3+ with ResNet50 backbone) and our BlindNet. Both models are trained on the source domain (GTAV [43]) and tested on the target domain (Cityscapes [8]).

Semantic segmentation, which classifies each pixel in an image into predefined categories, has garnered significant attention due to its potential applications in various fields. In particular, it plays a crucial role in autonomous driving [18, 1] and robotic systems [35, 34]. With the advent of large datasets, deep neural networks have become the dominant approach for semantic segmentation, achieving impressive results [54, 5, 44, 3]. However, a major bottleneck remains: the meticulous and labor-intensive process of dataset labeling, which is both time-consuming and costly [8, 46]. To address this challenge, synthetic datasets have emerged as a compelling alternative. These datasets, generated using three-dimensional (3D) rendering techniques, offer vast amounts of easily accessible data and eliminate the need for manual labeling [43, 45]. However, a challenge arises when models trained on synthetic datasets are deployed in real-world scenarios: a domain shift problem occurs due to style discrepancies (e.g., texture, illumination, and image quality) between synthetic and real-world data, which degrades the performance of the model, as shown in Fig. 1.

To address the domain shift problem, domain adaptation semantic segmentation (DASS) has been introduced [25, 17, 15, 51, 62, 22]. Specifically, DASS aims to bridge the gap between source and target domains by aligning their data distributions. However, a significant limitation of DASS is its dependency on the target domain during training. For DASS to function effectively, target domain samples must be available during the training phase.

Meanwhile, domain generalized semantic segmentation (DGSS) has been introduced as an alternative approach to tackle the domain shift problem. Unlike DASS, DGSS is trained only with the source domain, aiming to extract domain-invariant features. To achieve this, two main techniques have been employed: domain randomization (DR) and feature normalization (FN).

DR augments the training set by introducing variability, either by altering the image style [41, 60] or by modifying the feature representation [24, 53]. By exposing the model to a wider variety of styles via DR, the network is less likely to overfit to the specific styles present in the training data. Consequently, the robustness of the model is improved, making it more adept at generalizing to new, unseen domains. Nevertheless, a crucial limitation of DR is its significant dependence on auxiliary domains.

In contrast to DR, FN methods regularize the features to prevent the model from overfitting to the distinct styles or characteristics of the training data. This is achieved by removing domain-specific style information using feature statistics, such as instance normalization [38] or whitening transformation [7, 42, 39, 19]. While these approaches effectively remove style-related information, they also risk removing semantic content because content and style information are entangled. Consequently, the model may fail to capture the essential patterns or features required for accurate segmentation prediction.

To address this problem, we propose BlindNet, a model that blinds the style within the encoder and improves the robustness of the decoder, without requiring auxiliary datasets or external modules. BlindNet consists of two components: covariance alignment for the encoder and semantic consistency contrastive learning for the decoder. The covariance alignment facilitates the generation of style-invariant features through the proposed covariance matching loss function (CML) and cross-covariance loss function (CCL). CML mitigates the effects of style variations, while CCL focuses on preserving content information, addressing the content information loss commonly observed in FN methods. To further improve the generalization ability, we develop semantic consistency contrastive learning, which consists of class-wise contrastive learning (CWCL) and semantic disentanglement contrastive learning (SDCL). CWCL constructs a discriminative class embedding space, while SDCL disentangles features of similar classes that often lead to prediction errors. Extensive experiments across various datasets demonstrate that the proposed BlindNet outperforms existing DGSS methods.

Our contributions are summarized as follows:

• We propose a covariance alignment within the encoder, comprising CML and CCL. Specifically, the CML aims to mitigate the effects of style variations, while CCL ensures the preservation of content information, together facilitating the generation of style-agnostic features.
• We propose semantic consistency contrastive learning within the decoder that comprises CWCL and SDCL, utilizing segmentation masks. Specifically, CWCL generates discriminative embeddings, whereas SDCL disentangles features of similar classes, enhancing the robustness of the model.
• Through extensive experiments, we demonstrate the superiority of our approach in DGSS, without the need to alter the network architecture or rely on external datasets.

2 Related Works

Domain adaptation and generalization for semantic segmentation. Domain adaptation (DA) aims at minimizing the distribution discrepancy between different domains, enabling a model to generalize from a source to a target domain. For DASS, adversarial training and cross-domain self-training strategies are commonly used. Particularly, adversarial-based methods [15, 29] employ generative adversarial networks [11] to close the feature distribution gap between source and target domains. Meanwhile, cross-domain self-training methods [62, 59, 37, 16] generate pseudo-labels for target domain data using pre-trained models, and employ them as training data, thereby expanding the training data and reducing distribution differences between the source and target domain.

Domain generalization (DG) methods [28, 50, 61, 30, 26] aim to improve the generalization ability of the model without accessing the target domain during training. Since differences in image style are the main cause of the domain disparity, most existing domain generalized semantic segmentation methods utilize the style information of the image for domain-invariant learning. Feature statistics (e.g., mean, variance, covariance, and Gram matrix), which are commonly used in style transfer [10, 57, 27, 21], are employed to capture the style information. Existing DGSS methods can be divided into two categories: i) domain randomization, which expands the distribution of styles, and ii) feature normalization, which removes style.

DR involves randomizing either the image or its features through stylization to learn domain-invariant features from various styles. For example, Peng et al. [41] extended the source domain data by stylizing images in the style of unreal paintings. Similarly, Yue et al. [60] and Huang et al. [22] attempted to enhance generalization by synthesizing images with diverse styles in the image space. In another study, Lee et al. [24] adopted ImageNet data [9] as wild data and performed randomization via synthesis in the feature space. Meanwhile, Wu et al. [53] diversified the trainable feature space by mixing the statistics of the feature and its color-jittered counterpart with AdaIN [21].

FN methods aim to remove domain-specific styles from features, extracting only domain-invariant content. For instance, Pan et al. [38] made the first attempt at DGSS by combining batch normalization (BN) [23] and instance normalization (IN) [48] in the network layers. While BN preserves the content information within discriminative features, IN focuses on removing domain-specific style information from features. Choi et al. [7] addressed a limitation of previous whitening transformations [31, 6], which can eliminate content information from the feature. Specifically, they proposed an instance-selective whitening approach designed to remove only the covariance components that are sensitive to domain shifts. Peng et al. [42] developed semantic-aware normalization, which operates class-wise, and semantic-aware whitening, which aligns channels based on the prediction through group whitening transformation [6]. Xu et al. [55] introduced the prior guided attention module and guided feature whitening to re-calibrate the feature and remove domain-specific style effects, respectively. Unlike the FN methods that directly remove the style component, our work explores a covariance alignment method that mitigates the effect of style while preserving the content information. Our method achieves DGSS without any additional modules or auxiliary datasets.

Contrastive Learning. Contrastive learning aims to learn representations by maximizing the similarity between positive pairs of samples while minimizing the similarity between negative pairs. In recent years, it has attracted significant attention for its effectiveness in learning discriminative representations across various tasks [4, 12, 14, 2]. Oord et al. [36] were the first to introduce the InfoNCE loss, a type of contrastive loss function designed for self-supervised contrastive learning. Park et al. [40] introduced patch-level contrastive learning for image translation, using co-located patches as positive pairs and spatially distant patches as negatives to maintain image context. For the semantic segmentation task, Wang et al. [52] introduced class-wise contrastive learning to aid the model in learning the embedding space of each class. Specifically, they sampled the classes existing in the images and applied contrastive learning based on the class label. For DGSS, Lee et al. [24] adopted contrastive learning to learn the ImageNet information in their model. Specifically, they set the ImageNet data as wild and applied the contrastive learning method by setting the wild-stylized feature and its closest wild content as positive samples. In another study, Yang et al. [56] developed multi-level contrastive learning, which designed instance prototypes and class prototypes for contrastive learning. Specifically, they sample each class’s pixel features and apply contrastive learning with a transition-probability matrix. Unlike recent DGSS works that embed the original image, we define contrastive learning for the augmented image. Specifically, the proposed method builds a robust embedding space by preserving the semantic consistency of the feature representation across various domains.

3 Method

The goal of the proposed method is to train a segmentation model $\varphi$ on a given source domain $S$ so that it generalizes well to the unseen target domain $T$ . The source domain $S=\{(x,y)\}$ contains paired images $x\in\mathbb{R}^{H\times W\times 3}$ and segmentation labels $y\in\mathbb{R}^{H\times W\times C}$ , where $H$ , $W$ , and $C$ denote the height of the image, the width of the image, and the number of classes in the segmentation map, respectively. The model $\varphi$ takes an original image $x$ and its augmented counterpart $x_{a}$ , which have the same content but different styles, and uses the feature information to enhance its generalization ability.

As shown in Fig. 2, our method leverages the feature information through two main approaches: covariance alignment and semantic consistency contrastive learning. Specifically, the covariance alignment ensures that features having different styles but the same content contain similar information. Additionally, the semantic consistency contrastive learning embeds the generalized features into a discriminative representation based on the segmentation label.

3.1 Covariance Alignment

The domain shift in semantic segmentation results from changes in the visual characteristics of the image, known as style. This style information is typically detected in the shallow layers of networks [38], which are the encoders of the segmentation model. Based on this understanding, our method targets the encoder to handle style variations by employing the proposed covariance matching loss and cross-covariance loss function.

Covariance Matching Loss

To train the network to uniformly recognize various styles without removing content information, we introduce the CML. Specifically, the loss aims to minimize the difference between covariance matrices derived from different styles of image features. Given an image pair $(x,x_{a})$ , the features from the $i^{th}$ block of the encoder are represented as $F^{i}\in\mathbb{R}^{(H^{i}\times W^{i})\times C^{i}}$ and $F_{a}^{i}\in\mathbb{R}^{(H^{i}\times W^{i})\times C^{i}}$ , respectively. Further, following the methodologies of [7, 42], we compute the covariance matrices using instance normalized features [48], which ensures consistent scaling across features. The feature maps are normalized and flattened into $\bar{F}$ and $\bar{F_{a}}\in\mathbb{R}^{(HW)\times C}$ , which are given by:

$$\bar{F}=\frac{F-\mu(F)}{\sigma(F)},\qquad\bar{F_{a}}=\frac{F_{a}-\mu(F_{a})}{\sigma(F_{a})}, \tag{1}$$

where $\mu(\cdot)\in\mathbb{R}^{C}$ and $\sigma(\cdot)\in\mathbb{R}^{C}$ denote the mean and standard deviation of the features. Utilizing the normalized features, we evaluate the covariance matrices for the original and augmented image features as:

$$\Sigma_{x,x}^{i}=\bar{F^{i}}^{T}\cdot\bar{F^{i}},\qquad\Sigma_{x_{a},x_{a}}^{i}=\bar{F_{a}^{i}}^{T}\cdot\bar{F_{a}^{i}}. \tag{2}$$

The CML is then formulated to align these covariance matrices, ensuring that the network maintains consistency in the presence of style variations. The CML is defined as:

$$\mathcal{L}_{CM}=\sum_{i=1}^{n_{e}}\|\Sigma_{x,x}^{i}-\Sigma_{x_{a},x_{a}}^{i}\|_{2}, \tag{3}$$

where $n_{e}$ denotes the number of blocks of the encoder.
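As a rough illustration of Eqs. (1)-(3), the following dependency-free Python sketch normalizes each channel, forms the covariance matrices, and sums their distance over encoder blocks. The function names, the division by $HW$, the small epsilon, and the reading of $\|\cdot\|_{2}$ as the Frobenius norm are assumptions made for this example; they are not taken from the released code.

```python
import math

def instance_normalize(feat, eps=1e-5):
    """Eq. (1): per-channel standardization. feat is a list of HW rows with C values each."""
    n, c = len(feat), len(feat[0])
    mean = [sum(row[k] for row in feat) / n for k in range(c)]
    # eps is an assumption that avoids division by zero for constant channels
    std = [math.sqrt(sum((row[k] - mean[k]) ** 2 for row in feat) / n) + eps
           for k in range(c)]
    return [[(row[k] - mean[k]) / std[k] for k in range(c)] for row in feat]

def covariance(f_bar, g_bar):
    """Eq. (2): F^T G as a C x C matrix (divided by HW, an assumption of this sketch)."""
    n, c = len(f_bar), len(f_bar[0])
    return [[sum(f_bar[t][a] * g_bar[t][b] for t in range(n)) / n
             for b in range(c)] for a in range(c)]

def frobenius_distance(m1, m2):
    """Frobenius norm of the difference of two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2)))

def covariance_matching_loss(feats, feats_aug):
    """Eq. (3): sum over encoder blocks of the distance between the two covariances."""
    loss = 0.0
    for f, fa in zip(feats, feats_aug):
        f_bar, fa_bar = instance_normalize(f), instance_normalize(fa)
        loss += frobenius_distance(covariance(f_bar, f_bar),
                                   covariance(fa_bar, fa_bar))
    return loss
```

Under these assumptions, a per-channel scale and shift of the features leaves the loss near zero, because instance normalization removes it, whereas mixing channels changes the covariance and is penalized.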

Cross-covariance Loss

While the CML effectively aligns the internal distributions of features within the same image, it does not fully account for the direct correlations across paired images. To complement this, we introduce CCL, which aims to encode the consistent content information of an image pair $(x,x_{a})$ by utilizing the cross-covariance of the image pair. Given the normalized feature pair $(\bar{F^{i}},\bar{F_{a}^{i}})$ , the cross-covariance of the feature pair can be expressed as:

$$\Sigma_{x,x_{a}}^{i}=\bar{F^{i}}^{T}\cdot\bar{F_{a}^{i}}. \tag{4}$$

Ideally, the cross-covariance would be an identity matrix, as the feature pair should contain identical information. Nonetheless, the proposed CCL drives only the diagonal components of the cross-covariance matrix toward one and leaves the off-diagonal components unconstrained. This avoids the drawback of the existing FN methods [7, 42], which remove content information. The CCL function is thus defined as:

$$\mathcal{L}_{CC}=\sum_{i=1}^{n_{e}}\|\text{diag}(\Sigma_{x,x_{a}}^{i})-\mathbbm{1}\|_{2}, \tag{5}$$

where $\text{diag}(\Sigma_{x,x_{a}}^{i})\in\mathbb{R}^{C}$ denotes the column vector comprising the diagonal elements of $\Sigma_{x,x_{a}}^{i}$ and $\mathbbm{1}\in\mathbb{R}^{C}$ denotes the all-ones vector.
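Eqs. (4)-(5) can be sketched in the same dependency-free style. Only the diagonal of the cross-covariance is computed, since the off-diagonal entries do not enter the loss. As before, the helper names, the division by $HW$ (which makes a perfectly matched channel give a diagonal value of one), and the epsilon are assumptions of this example.

```python
import math

def instance_normalize(feat, eps=1e-5):
    """Eq. (1): per-channel standardization. feat is a list of HW rows with C values each."""
    n, c = len(feat), len(feat[0])
    mean = [sum(row[k] for row in feat) / n for k in range(c)]
    std = [math.sqrt(sum((row[k] - mean[k]) ** 2 for row in feat) / n) + eps
           for k in range(c)]
    return [[(row[k] - mean[k]) / std[k] for k in range(c)] for row in feat]

def cross_covariance_diag(f_bar, fa_bar):
    """Diagonal of Eq. (4), divided by HW (normalization assumed in this sketch)."""
    n, c = len(f_bar), len(f_bar[0])
    return [sum(f_bar[t][k] * fa_bar[t][k] for t in range(n)) / n for k in range(c)]

def cross_covariance_loss(feats, feats_aug):
    """Eq. (5): distance between the cross-covariance diagonal and the all-ones vector."""
    loss = 0.0
    for f, fa in zip(feats, feats_aug):
        diag = cross_covariance_diag(instance_normalize(f), instance_normalize(fa))
        loss += math.sqrt(sum((v - 1.0) ** 2 for v in diag))
    return loss
```

With identical inputs the loss is close to zero, while a channel whose response is inverted in the augmented features contributes a diagonal value near minus one and is penalized.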

3.2 Semantic Consistence Contrastive Learning

While the encoder focuses on generating style-blinded features, the decoder aims to improve the robustness of the segmentation prediction against domain shifts. For the decoder, we employ a contrastive learning approach, which has demonstrated effectiveness in extracting discriminative features [4]. Specifically, we utilize the InfoNCE loss [36], which is formulated as:

$$\mathcal{L}_{IN}(a,p,n)=-\log\left(\frac{e^{(a\cdot p/\tau)}}{e^{(a\cdot p/\tau)}+\sum_{n\in N^{-}}e^{(a\cdot n/\tau)}}\right), \tag{6}$$

where $a$ , $p$ , $n$ , and $N^{-}$ denote the anchor, positive sample, negative sample, and negative sample set, respectively, and $\tau$ is the temperature parameter.
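For reference, Eq. (6) can be evaluated for a single anchor as in the minimal sketch below. The function name and the default temperature `tau=0.1` are assumptions of this example (the text does not state the value used), and the log-sum-exp shift is added only for numerical stability.

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss of Eq. (6) for one anchor; vectors are plain lists of floats.

    tau=0.1 is a placeholder value, not one reported in the text.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    pos_logit = dot(anchor, positive) / tau
    logits = [pos_logit] + [dot(anchor, n) / tau for n in negatives]
    m = max(logits)  # shift by the maximum so that exp() cannot overflow
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - pos_logit
```

The loss is small when the anchor is far more similar to the positive than to any negative, and it grows as negatives become more similar to the anchor.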

To achieve consistent feature representation in DGSS across various styles, we introduce semantic consistency contrastive learning. Specifically, the anchor is derived from the augmented image $x_{a}$ , while the positive sample is extracted from the corresponding pixel of the original image $x$ . Our method consists of two main components based on the negative sample as shown in Fig. 3: class-wise contrastive learning and semantic disentanglement contrastive learning.

Class-wise Contrastive Learning

Our CWCL aims to build a discriminative embedding space for each segmentation class using different classes of the original image as negatives. Given an image pair $(x,x_{a})$ , the features from the $j^{th}$ block of the decoder at pixel position $(m,n)$ are denoted as $F_{(m,n)}^{j}$ and $F_{a,(m,n)}^{j}\in\mathbb{R}^{(1\times 1)\times C^{j}}$ , where $C^{j}$ indicates the channel length of the feature. As mentioned above, we take $F_{a,(m,n)}^{j}$ as the anchor and $F_{(m,n)}^{j}$ as the positive sample, since they represent the same content at the corresponding spatial location. To obtain the negative samples from $F^{j}$ , we leverage the resized segmentation class label $y^{j}\in\mathbb{R}^{(H^{j}\times W^{j})\times C}$ to collect the different class samples. The samples are passed through the projection head, denoted as $\pi$ , resulting in the projected features $\tilde{F}$ and $\tilde{F}_{a}$ . We define our CWCL as:

$$\mathcal{L}_{CWCL}=\sum_{j=1}^{n_{d}}\mathcal{L}_{IN}\left(\tilde{F}_{a,(m,n)}^{j},\tilde{F}_{(m,n)}^{j},\tilde{F}_{(p,q)}^{j}\right),\quad\text{where}\quad(p,q)\in\{(p,q)\in P\mid y_{(p,q)}^{j}\neq y_{(m,n)}^{j}\} \tag{7}$$

The set $P$ represents all pixel positions in feature $F^{j}$ , with dimensions $H^{j}\times W^{j}$ corresponding to the height and width. Additionally, $n_{d}$ denotes the total number of blocks in the decoder.
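The sample selection in Eq. (7) can be sketched for a single anchor pixel of one decoder block as follows. Pixels are flattened to a single index, `proj` and `proj_aug` stand for the projected features of the original and augmented image, and `max_negatives` is a hypothetical cap introduced for this example. All names and the temperature are assumptions.

```python
import math
import random

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss of Eq. (6); tau is a placeholder value."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    pos_logit = dot(anchor, positive) / tau
    logits = [pos_logit] + [dot(anchor, n) / tau for n in negatives]
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - pos_logit

def cwcl_pixel_loss(proj, proj_aug, labels, idx, max_negatives=50, tau=0.1, rng=random):
    """CWCL term for the anchor at flattened pixel index idx.

    Anchor: augmented feature at idx. Positive: original feature at idx.
    Negatives: original features at pixels whose label differs from labels[idx].
    """
    candidates = [k for k, lab in enumerate(labels) if lab != labels[idx]]
    if not candidates:
        return 0.0  # no other class present, so nothing to contrast against
    chosen = rng.sample(candidates, min(max_negatives, len(candidates)))
    return info_nce(proj_aug[idx], proj[idx], [proj[k] for k in chosen], tau)
```

In this sketch the negatives always come from the original image and always belong to a different class than the anchor, so the loss is low when the augmented feature stays close to its original counterpart and away from other classes.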

Backbone	Methods	External	External	Trained on GTAV (G)				Trained on Cityscapes (C)			
		Dataset	Module	C	B	M	S	B	M	S	G
ResNet50 [13]	Baseline [3]	-	-	28.95	25.14	28.18	26.23	44.96	51.68	23.29	42.55
	IBN-Net [38]	-	-	33.85	32.30	37.75	27.90	48.56	57.04	26.14	45.06
	RobustNet [7]	-	-	37.31	35.20	40.33	28.30	50.73	58.64	26.20	45.00
	SiamDoGe [53]	-	-	42.96	37.54	40.64	28.34	51.53	59.00	26.67	45.08
	DIRL [55]	-	✓	41.04	39.15	41.60	-	51.80	-	26.50	46.52
	WildNet [24]	✓	-	44.62	38.42	46.09	31.34	50.94	58.79	27.95	47.01
	SANSAW [42]	-	✓	39.75	37.34	41.86	30.79	52.95	59.81	28.32	47.28
	SPC [20]	-	✓	44.10	40.46	45.51	-	-	-	-	-
	DPCL [56]	-	✓	44.74	40.59	46.33	30.81	50.97	58.59	25.85	46.00
	Ours	-	-	45.72	41.32	47.08	31.39	51.84	60.18	28.51	47.97
ShuffleNetV2 [32]	Baseline [3]	-	-	25.56	22.17	28.60	23.33	36.84	43.13	21.56	36.95
	IBN-Net [38]	-	-	27.10	31.82	34.89	25.56	41.89	46.35	22.99	40.91
	RobustNet [7]	-	-	30.98	32.06	35.31	24.31	41.94	46.97	22.82	40.17
	SiamDoGe [53]	-	-	34.40	34.23	35.87	21.95	42.61	47.48	23.13	40.93
	DIRL [55]	-	✓	31.88	32.57	36.12	-	42.55	-	23.74	41.23
	DPCL [56]	-	✓	36.66	34.35	39.92	22.66	43.90	48.95	22.47	41.07
	Ours	-	-	38.56	34.51	40.11	25.64	44.22	49.69	23.54	41.10

Table 1: Quantitative comparison of mIoU (%) between DGSS methods. External dataset denotes the necessity of an auxiliary dataset during training and External module denotes the requirement of an additional module during inference. G, C, B, M, and S denote GTAV, Cityscapes, BDD100K, Mapillary, and SYNTHIA, respectively. The best and second-best results are bolded and underlined, respectively.

Semantic Disentanglement Contrastive Learning

Domain shifts can lead to the entanglement of similar classes, causing the model to misclassify, as illustrated in Fig. 4. To mitigate this issue, we introduce the SDCL, which is designed to disentangle misclassified features of the augmented image $x_{a}$ by pulling them closer to the correct class and pushing them away from the misclassified class. To further ensure a consistent feature space and capture the semantic meaning, we share the projection head $\pi$ used in the CWCL loss. Given the predicted segmentation map of the augmented image, represented as $\hat{y}_{a}=\varphi(x_{a})$ , we resize it to $\hat{y}_{a}^{j}\in\mathbb{R}^{(H^{j}\times W^{j})\times C}$ . Similarly, $y^{j}$ represents the resized ground truth segmentation map. Using these segmentation maps, we set the anchors at positions $(m,n)$ where $\hat{y}_{a,(m,n)}^{j}\neq y_{(m,n)}^{j}$ . Negative samples are selected from the augmented image features whose ground-truth class matches the anchor’s misclassified (predicted) class. The samples go through the projection head $\pi$ . Our SDCL loss is defined as follows:

$$\mathcal{L}_{SDCL}=\sum_{j=1}^{n_{d}}\mathcal{L}_{IN}\left(\tilde{F}_{a,(m,n)}^{j},\tilde{F}_{(m,n)}^{j},\tilde{F}_{a,(r,s)}^{j}\right),\quad\text{where}\quad(r,s)\in\{(r,s)\in P\mid y_{(r,s)}^{j}=\hat{y}_{a,(m,n)}^{j}\} \tag{8}$$
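The anchor and negative selection of Eq. (8) can be sketched for one decoder block as below, with pixels flattened to a single index. The names are hypothetical, the temperature is a placeholder, and averaging over the misclassified anchors is an assumption of this example, since the equation does not specify how anchors are aggregated.

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss of Eq. (6); tau is a placeholder value."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    pos_logit = dot(anchor, positive) / tau
    logits = [pos_logit] + [dot(anchor, n) / tau for n in negatives]
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - pos_logit

def sdcl_loss(proj, proj_aug, labels, preds_aug, tau=0.1):
    """SDCL over one block, averaged over misclassified anchors (averaging is assumed).

    Anchors: augmented features at pixels where the prediction differs from the label.
    Positive: original feature at the same pixel.
    Negatives: augmented features at pixels labeled with the anchor's predicted class.
    """
    total, count = 0.0, 0
    for idx, (label, pred) in enumerate(zip(labels, preds_aug)):
        if pred == label:
            continue  # correctly classified pixels are not anchors
        negatives = [proj_aug[k] for k, lab in enumerate(labels) if lab == pred]
        if not negatives:
            continue  # the confused class does not appear in this image
        total += info_nce(proj_aug[idx], proj[idx], negatives, tau)
        count += 1
    return total / count if count else 0.0
```

Unlike the class-wise term, the negatives here are drawn from the augmented features and only from the class the anchor was confused with, which targets the specific pair of classes that became entangled.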

Finally, combining the cross-entropy segmentation loss $\mathcal{L}_{CE}$ with the other loss components, the total loss is defined as:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{CE}+\omega_{1}\mathcal{L}_{CM}+\omega_{2}\mathcal{L}_{CC}+\omega_{3}\mathcal{L}_{CWCL}+\omega_{4}\mathcal{L}_{SDCL}, \tag{9}$$

where $\omega_{1}$ , $\omega_{2}$ , $\omega_{3}$ , and $\omega_{4}$ denote the weighting factors of the respective loss functions.
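Eq. (9) is a plain weighted sum, which the short sketch below spells out. The function name is hypothetical, and the default weights are the values 0.2, 0.2, 0.3, and 0.3 reported in the implementation details.

```python
def total_loss(l_ce, l_cm, l_cc, l_cwcl, l_sdcl, weights=(0.2, 0.2, 0.3, 0.3)):
    """Eq. (9): cross-entropy plus the four weighted auxiliary losses."""
    w1, w2, w3, w4 = weights
    return l_ce + w1 * l_cm + w2 * l_cc + w3 * l_cwcl + w4 * l_sdcl
```

Setting a weight to zero removes the corresponding term, which mirrors how individual losses are switched off in an ablation.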

4 Experiment

In this section, we describe the implementation details, the experimental setup for comparison with existing DGSS methods, and the ablation study conducted to further validate the effectiveness of our approach.

4.1 Implementation Details

We adopt DeepLabV3+ [3] for the segmentation architecture and use ResNet-50 [13], ShuffleNetV2 [32], and MobileNetV2 [47] as the backbone networks. The model is trained for 40K iterations with a batch size of 8 using the SGD optimizer, which has a momentum of 0.9 and a weight decay of 5e-4. We employ a polynomial learning rate schedule with an initial rate of 1e-2 and a power of 0.9. To simulate domain shift, we generate the augmented image $x_{a}$ using a strong color jittering transformation similar to [7]. The weighting parameters of Eq. (9), $\omega_{1}$ , $\omega_{2}$ , $\omega_{3}$ , and $\omega_{4}$ , are set to 0.2, 0.2, 0.3, and 0.3, respectively.
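The polynomial learning rate schedule is commonly written as the base rate multiplied by $(1 - t/T)^{p}$. The sketch below assumes that form with the reported settings (initial rate 1e-2, power 0.9, 40K iterations); the function name is hypothetical and the exact variant used in training is not stated in the text.

```python
def poly_lr(iteration, base_lr=1e-2, max_iter=40_000, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power
```

Under this form the rate starts at the base value, decays slightly faster than linearly at first, and reaches zero at the final iteration.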

4.2 Datasets

We use two synthetic datasets (GTAV [43] and SYNTHIA [45]) and three real-world datasets (Cityscapes [8], BDD-100K [58], and Mapillary [33]) for the experiments. All segmentation labels are evaluated based on 19 object categories.

GTAV (G) [43] is a large-scale dataset generated from the Grand Theft Auto V (GTAV) game engine. It comprises 24,966 images, split into 12,403 for training, 6,382 for validation, and 6,181 for testing with a resolution of 1914 $\times$ 1052.

SYNTHIA (S) [45] is a virtual, photo-realistic urban scene dataset comprising 9,400 images with a resolution of 960 $\times$ 720. Among these, 2,820 images are designated for evaluation.

Cityscapes (C) [8] is a large-scale urban scene dataset captured from 50 cities, primarily in Germany. Particularly, it contains 5,000 high-resolution images with a resolution of 2048 $\times$ 1024. The dataset is divided into 2,975 images for training, 500 for validation, and 1,525 for testing.

BDD-100K (B) [58] is another real-world urban scene dataset that contains 10,000 diverse urban driving scene images with a resolution of 1280 $\times$ 720. Specifically, the validation split (1,000 images) is used for evaluation.

Mapillary (M) [33] contains 25,000 images with a minimum resolution of 1920 $\times$ 1080, collected from various locations worldwide. Specifically, the validation split of 2,000 images is used for evaluation.

4.3 Comparison with DGSS methods

We compare our method with other state-of-the-art DGSS methods: Baseline (DeepLabV3+ [3] trained on the source domain), IBN-Net [38], RobustNet [7], SiamDoGe [53], DIRL [55], WildNet [24], SANSAW [42], SPC [20], and DPCL [56]. To evaluate the generalization ability of the model on arbitrary unseen domains, we conduct the experiments in two scenarios: i) trained on GTAV and tested on Cityscapes, BDD-100K, Mapillary, and SYNTHIA, and ii) trained on Cityscapes and tested on BDD-100K, Mapillary, SYNTHIA, and GTAV. The quantitative results are reported as mean intersection over union (mIoU). Additionally, we compare the methods using ResNet-50 [13], ShuffleNetV2 [32], and MobileNetV2 [47] backbones pre-trained on ImageNet [9].
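For readers unfamiliar with the metric, mIoU can be computed from a confusion matrix as in the sketch below. The function name, the `ignore_index=255` convention, and the choice to skip classes that appear in neither the prediction nor the label are assumptions of this example, not details given in the text.

```python
def mean_iou(preds, labels, num_classes, ignore_index=255):
    """Mean intersection over union from flattened per-pixel class ids."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(preds, labels):
        if t == ignore_index:
            continue  # unlabeled pixels do not count (assumed convention)
        conf[t][p] += 1
    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(num_classes)) - tp
        fn = sum(conf[c]) - tp
        denom = tp + fp + fn
        if denom:  # skip classes absent from both prediction and label
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with predictions `[0, 0, 1, 1]` and labels `[0, 1, 1, 1]`, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, giving an mIoU of 7/12.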

Methods	External Module	Trained on GTAV (G)			
		C	B	M	Mean
Baseline [3]		25.94	25.73	26.45	26.04
IBN-Net [38]		30.14	27.66	27.07	28.29
RobustNet [7]		30.86	30.05	30.67	30.52
SiamDoGe [53]		34.15	34.50	32.34	33.67
DIRL [55]	✓	34.67	32.78	34.31	33.92
DPCL [56]	✓	37.57	35.45	40.30	37.77
Ours		37.66	36.10	40.40	38.05

Table 2: Quantitative comparison of mIoU (%) using MobileNetV2 [47] backbone trained on the GTAV dataset.

Quantitative and Qualitative Results

Table 1 summarizes the quantitative results. With ResNet-50 as the backbone, our method outperforms all other methods on every target domain when trained on GTAV. Compared with FN methods that remove domain-specific styles, this improvement suggests that our approach minimizes the loss of content information. Our method also generalizes well when trained on Cityscapes. We further evaluate our method with different backbones to show its wide applicability. When trained with ShuffleNetV2, our method achieves the best or second-best performance on every unseen target domain. Table 2 shows the results of our method trained on GTAV with MobileNetV2, where it again outperforms the compared methods.

For qualitative evaluation, we compare the visual results of existing DGSS methods and ours. As depicted in Fig. 5, our method produces more accurate predictions overall than the other approaches. Notably, it distinguishes classes such as road and sidewalk more clearly, yielding sharper segmentation boundaries. Please refer to the supplementary material for more qualitative results.

Methods	External Module	Params (M)	GFLOPS	Time (ms)
Baseline [13]		45.08	277.77	10.01
SANSAW [42]	✓	25.63	421.86	68.96
SPC [20]	✓	45.22	286.09	12.24
DIRL [55]	✓	45.41	278.11	11.69
DPCL [56]	✓	56.46	1188.64	823.78
Ours		45.08	277.78	10.03

Table 3: Computational cost comparison conducted using DeepLabV3+ with a ResNet-50 backbone on an NVIDIA Tesla V100 GPU with an image resolution of $2048\times 1024$ . Inference time is averaged over 400 trials.

Computational cost analysis

To confirm that our approach does not incur additional computational overhead, we provide the number of parameters, GFLOPS, and average inference time of each method. As detailed in Table 3, our method operates comparably to baseline models by learning features intrinsically without adopting a separate module.

$\mathcal{L}_{CM}$	$\mathcal{L}_{CC}$	$\mathcal{L}_{CWCL}$	$\mathcal{L}_{SDCL}$	C	B	M
				28.95	25.14	28.18
✓				38.08	36.65	40.62
✓	✓			40.42	37.81	43.91
		✓		42.03	38.27	44.02
		✓	✓	43.16	38.59	45.38
✓	✓	✓		43.17	38.23	44.84
✓	✓	✓	✓	45.72	41.32	47.08

Table 4: Ablation study on proposed losses. The experiments were conducted using DeepLabV3+ with ResNet-50 backbone, trained on the GTAV dataset. The losses are defined in the following equations: $\mathcal{L}_{CM}$ : (3), $\mathcal{L}_{CC}$ : (5), $\mathcal{L}_{CWCL}$ : (7), $\mathcal{L}_{SDCL}$ : (8).

4.4 Ablation Studies

In this subsection, we conduct a series of ablation studies to demonstrate the individual contribution and effectiveness of each component within our method. Specifically, we investigate the impact of the following components: $\mathcal{L}_{CM}$ , $\mathcal{L}_{CC}$ , $\mathcal{L}_{CWCL}$ , and $\mathcal{L}_{SDCL}$ . For the study, we use a scenario where the DeepLabV3+ model with a ResNet-50 backbone is trained on GTAV and tested on Cityscapes, BDD-100K, and Mapillary.

Table 4 presents the impact of the proposed losses on domain generalization performance. The baseline model, trained solely with the cross-entropy loss, performs poorly on the target domains because it overfits to the source domain. In contrast, adding any of the proposed losses leads to a marked improvement. The covariance alignment ( $\mathcal{L}_{CM}$ , $\mathcal{L}_{CC}$ ) preserves essential content information by correlating features of paired images. The semantic consistency contrastive learning ( $\mathcal{L}_{CWCL}$ , $\mathcal{L}_{SDCL}$ ) also contributes clearly, as it aids in disentangling features of similar classes and thereby constructs a more robust embedding space.

Covariance Matching Loss. Fig. 6 presents t-SNE plots of covariances for original and augmented images, before and after the application of CML. The baseline network perceives original and augmented images differently from a style perspective. However, after applying CML, the distribution becomes more intermixed, indicating that our proposed CML effectively ensures similar recognition of different style images.

Calcuation of CCL. Table 4(a) demonstrates that our proposed cross-covariance method, which converges the diagonal components to 1, yields superior performance. As mentioned before, removing non-diagonal components, which contain content information actually degrades performance.

Sampling number in CWCL. Table 4(b) and Table 4(c) show the impact of varying the number of classes sampled per image and the number of samples per class in CWCL, respectively. As shown in Table 4(b), the performance improves with an increase in the diversity of classes sampled in CWCL. This suggests that contrasting a broader array of classes enhances the model’s discriminative capability. Furthermore, Table 4(c) demonstrates that a balanced number of negative samples per class leads to optimal performance.

Projection Head for SDCL. We investigate how the projection head configuration affects SDCL, comparing three options: an individual projection head, a copy of the weights of CWCL’s projection head (stop gradient), and a projection head shared with CWCL. As shown in Table 5(d), sharing the projection head yields the best results. This indicates that SDCL not only relies on the semantic information from CWCL for effective disentanglement of similar classes but also enhances the embedding space learned by CWCL.

Cross-covariance loss
Method	C	B	M
Whitening	38.68	36.91	42.12
Diagonal	40.42	37.81	43.91

(a)

# of classes
#	C	B	M
10	45.57	38.88	46.37
15	45.72	41.32	47.08

(b)

# of negative samples
#	C	B	M
10	44.76	38.21	46.46
50	45.72	41.32	47.08
100	44.44	39.14	46.29

(c)

Projection Head of SDCL
MLP	C	B	M
Individual	44.91	38.45	46.31
Shared (SG)	44.03	38.15	46.64
Shared	45.72	41.32	47.08

(d)

Table 5: Ablation studies. (a) Calculation of $\mathcal{L}_{CC}$. (b) Number of classes for $\mathcal{L}_{CWCL}$. (c) Number of negative samples for $\mathcal{L}_{CWCL}$. (d) Projection head of $\mathcal{L}_{SDCL}$. “SG” indicates stop gradient.

5 Conclusion

In this paper, we propose BlindNet, a novel approach built on covariance alignment and semantic consistency contrastive learning. By introducing covariance alignment, our method effectively addresses style variations, ensuring the extraction of features that are consistent across different styles. Furthermore, with the proposed semantic consistency contrastive learning, we not only facilitate the extraction of discriminative features but also enhance the generalization capabilities of the model in semantic segmentation predictions. Comprehensive experimental results validate the effectiveness of our approach, demonstrating its ability to generalize across multiple unseen target domains without requiring auxiliary domains or additional modules. Future work will focus on improving and stabilizing the covariance alignment method.

Acknowledgement. This work was supported in part by the Basic Science Research Program through National Research Foundation of Korea (NRF) (Grants No. NRF-2022R1F1A1073543), the MSIT(Ministry of Science and ICT), Korea, under the ICAN(ICT Challenge and Advanced Network of HRD) support program(RS-2022-00156385) supervised by the IITP(Institute for Information & Communications Technology Planning & Evaluation), and Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT)(IITP-2024-00156287).

References

Bartoccioni et al. [2023] Florent Bartoccioni, Éloi Zablocki, Andrei Bursuc, Patrick Pérez, Matthieu Cord, and Karteek Alahari. Lara: Latents and rays for multi-camera bird’s-eye-view semantic segmentation. In Conference on Robot Learning, pages 1663–1672. PMLR, 2023.
Caron et al. [2020] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
Chen et al. [2017] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
Cheng et al. [2022] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
Cho et al. [2019] Wonwoong Cho, Sungha Choi, David Keetae Park, Inkyu Shin, and Jaegul Choo. Image-to-image translation via group-wise deep whitening-and-coloring transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10639–10647, 2019.
Choi et al. [2021] Sungha Choi, Sanghun Jung, Huiwon Yun, Joanne T Kim, Seungryong Kim, and Jaegul Choo. Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11580–11590, 2021.
Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
Gatys et al. [2016] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
Grill et al. [2020] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
Hoffman et al. [2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning, pages 1989–1998. PMLR, 2018.
Hoyer et al. [2022] Lukas Hoyer, Dengxin Dai, and Luc Van Gool. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9924–9935, 2022.
Hoyer et al. [2023] Lukas Hoyer, Dengxin Dai, Qin Wang, Yuhua Chen, and Luc Van Gool. Improving semi-supervised and domain-adaptive semantic segmentation with self-supervised depth estimation. International Journal of Computer Vision, pages 1–27, 2023.
Hu et al. [2023] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
Huang et al. [2021] Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. Fsdr: Frequency space domain randomization for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6891–6902, 2021.
Huang et al. [2023] Wei Huang, Chang Chen, Yong Li, Jiacheng Li, Cheng Li, Fenglong Song, Youliang Yan, and Zhiwei Xiong. Style projected clustering for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3061–3071, 2023.
Huang and Belongie [2017] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pages 1501–1510, 2017.
Huang et al. [2018] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pages 172–189, 2018.
Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015.
Lee et al. [2022] Suhyeon Lee, Hongje Seong, Seongwon Lee, and Euntai Kim. Wildnet: Learning domain generalized semantic segmentation from the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9936–9946, 2022.
Li et al. [2020] Guangrui Li, Guoliang Kang, Wu Liu, Yunchao Wei, and Yi Yang. Content-consistent matching for domain adaptive semantic segmentation. In European conference on computer vision, pages 440–456. Springer, 2020.
Li et al. [2021] Lei Li, Ke Gao, Juan Cao, Ziyao Huang, Yepeng Weng, Xiaoyue Mi, Zhengze Yu, Xiaoya Li, and Boyang Xia. Progressive domain expansion network for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 224–233, 2021.
Li et al. [2017] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. Advances in neural information processing systems, 30, 2017.
Li et al. [2018] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European conference on computer vision (ECCV), pages 624–639, 2018.
Li et al. [2019] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6936–6945, 2019.
Li and Hoiem [2017] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
Luo [2017] Ping Luo. Learning deep architectures via generalized whitened neural networks. In International Conference on Machine Learning, pages 2238–2246. PMLR, 2017.
Ma et al. [2018] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
Neuhold et al. [2017] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE international conference on computer vision, pages 4990–4999, 2017.
Nilsson et al. [2021] David Nilsson, Aleksis Pirinen, Erik Gärtner, and Cristian Sminchisescu. Embodied visual active learning for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2373–2383, 2021.
Onozuka et al. [2021] Yuya Onozuka, Ryosuke Matsumi, and Motoki Shino. Autonomous mobile robot navigation independent of road boundary using driving recommendation map. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4501–4508. IEEE, 2021.
Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Pan et al. [2020] Fei Pan, Inkyu Shin, Francois Rameau, Seokju Lee, and In So Kweon. Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3764–3773, 2020.
Pan et al. [2018] Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In Proceedings of the European Conference on Computer Vision (ECCV), pages 464–479, 2018.
Pan et al. [2019] Xingang Pan, Xiaohang Zhan, Jianping Shi, Xiaoou Tang, and Ping Luo. Switchable whitening for deep representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1863–1871, 2019.
Park et al. [2020] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 319–345. Springer, 2020.
Peng et al. [2021] Duo Peng, Yinjie Lei, Lingqiao Liu, Pingping Zhang, and Jun Liu. Global and local texture randomization for synthetic-to-real semantic segmentation. IEEE Transactions on Image Processing, 30:6594–6608, 2021.
Peng et al. [2022] Duo Peng, Yinjie Lei, Munawar Hayat, Yulan Guo, and Wen Li. Semantic-aware domain generalized segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2594–2605, 2022.
Richter et al. [2016] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 102–118. Springer, 2016.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
Ros et al. [2016] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016.
Sakaridis et al. [2021] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10765–10775, 2021.
Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
Ulyanov et al. [2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
Volpi et al. [2018] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. Advances in neural information processing systems, 31, 2018.
Vu et al. [2019] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2517–2526, 2019.
Wang et al. [2021] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7303–7313, 2021.
Wu et al. [2022] Zhenyao Wu, Xinyi Wu, Xiaoping Zhang, Lili Ju, and Song Wang. Siamdoge: Domain generalizable semantic segmentation using siamese network. In European Conference on Computer Vision, pages 603–620. Springer, 2022.
Xie et al. [2021] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
Xu et al. [2022] Qi Xu, Liang Yao, Zhengkai Jiang, Guannan Jiang, Wenqing Chu, Wenhui Han, Wei Zhang, Chengjie Wang, and Ying Tai. Dirl: Domain-invariant representation learning for generalizable semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2884–2892, 2022.
Yang et al. [2023] Liwei Yang, Xiang Gu, and Jian Sun. Generalized semantic segmentation by self-supervised source domain projection and multi-level contrastive learning. arXiv preprint arXiv:2303.01906, 2023.
Yoo et al. [2019] Jaejun Yoo, Youngjung Uh, Sanghyuk Chun, Byeongkyu Kang, and Jung-Woo Ha. Photorealistic style transfer via wavelet transforms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9036–9045, 2019.
Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020.
Yu et al. [2021] Fei Yu, Mo Zhang, Hexin Dong, Sheng Hu, Bin Dong, and Li Zhang. Dast: Unsupervised domain adaptation in semantic segmentation based on discriminator attention and self-training. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10754–10762, 2021.
Yue et al. [2019] Xiangyu Yue, Yang Zhang, Sicheng Zhao, Alberto Sangiovanni-Vincentelli, Kurt Keutzer, and Boqing Gong. Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2100–2110, 2019.
Zhou et al. [2020] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. In International Conference on Learning Representations, 2020.
Zou et al. [2018] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pages 289–305, 2018.

Supplementary Material

Appendix A Implementation Details of BlindNet

As shown in Fig. 7, we apply our covariance alignment losses to the encoder features and the semantic consistency contrastive learning to the decoder features.

Appendix B More Results

In this section, we show the detailed quantitative comparison results (Section B.1) and additional qualitative results (Section B.2) of our study.

B.1 Quantitative Results

Table 6 reports a comparison of pixel accuracy and IoU for each semantic class between DGSS methods. Our model significantly outperforms others in overall pixel accuracy, indicating its robust performance. In IoU for each semantic class, our model particularly excels in roads, sidewalks, sky, people, riders, and cars, which are commonly present in photos. However, the table also indicates a degraded performance in classes such as traffic signs, traffic lights, and trains, which are less frequently encountered in the source domain (GTAV). Our future work will aim to address this issue and improve performance across all classes.
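For reference, pixel accuracy and per-class IoU are commonly derived from a confusion matrix, as in the minimal sketch below. The function names and the `ignore_index` value of 255 are assumptions for illustration; this is not the evaluation script used to produce the reported numbers.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows index the ground-truth class, columns the predicted class."""
    mask = target != ignore_index
    idx = target[mask].astype(np.int64) * num_classes + pred[mask].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Fraction of valid pixels whose predicted class is correct."""
    return np.diag(cm).sum() / cm.sum()

def per_class_iou(cm):
    """IoU = TP / (TP + FP + FN) per class; NaN for classes that never appear."""
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, tp / union, np.nan)

# Toy example with 3 classes and one ignored pixel
target = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 0]])

cm = confusion_matrix(pred, target, num_classes=3)
acc = pixel_accuracy(cm)
iou = per_class_iou(cm)
miou = np.nanmean(iou)
```

Because mIoU averages over classes while pixel accuracy averages over pixels, a model can lead in pixel accuracy on frequent classes such as road and sky while still trailing on rare classes such as train.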

B.2 Qualitative Results

Figs. 8 (G $\xrightarrow{}$ C), 9 (G $\xrightarrow{}$ B), and 10 (G $\xrightarrow{}$ M) present qualitative comparisons between our model and others, including baseline [3], RobustNet [7], WildNet [24], SiamDoGe [53], and SPC [20]. The results clearly illustrate our model’s consistent superiority, particularly in the segmentation of sidewalks, roads, buildings, terrain, and cars. The result demonstrates the robustness and effectiveness of our model in handling DGSS.

Appendix C More Ablation Studies

In this section, we conduct more ablation studies on our model. In Section C.1, we show a qualitative analysis of the proposed loss functions, and in Section C.2, we experiment on the weight of the proposed loss functions.

C.1 Qualitative Results

For our qualitative ablation study, we incrementally added each loss function ( $\mathcal{L}_{CM}$ , $\mathcal{L}_{CC}$ , $\mathcal{L}_{CWCL}$ , $\mathcal{L}_{SDCL}$ ) to the baseline model to validate its contribution. The results are shown in Fig. 11. The introduction of CML ( $\mathcal{L}_{CM}$ ) enhances the capture of details such as traffic lights, as illustrated in Fig. 11, row 1. Adding CCL ( $\mathcal{L}_{CC}$ ) further strengthens content representation, leading to an improvement in overall accuracy. The CWCL ( $\mathcal{L}_{CWCL}$ ) strengthens semantic understanding, allowing for better detection of smaller objects. However, this enhancement comes with a trade-off, as it introduces some degree of confusion among similar classes (e.g., sidewalk and road). The application of SDCL ( $\mathcal{L}_{SDCL}$ ) effectively disentangles misclassified features, leading to clearer class distinctions.

C.2 Hyper-parameter

We varied the weighting parameter of each loss function in Eq. (9). Starting from the configuration that showed the best performance, we adjusted one loss weight at a time by ±0.1; the results are reported in Table 7. For the CML ( $\mathcal{L}_{CM}$ ), a key component for style blindness, an overly strong weight significantly degrades network performance. Conversely, for the CCL ( $\mathcal{L}_{CC}$ ) and the CWCL ( $\mathcal{L}_{CWCL}$ ), a slightly higher weight performs better than a slightly lower one.

$\omega_{1}$ ( $\mathcal{L}_{CM}$ )	$\omega_{2}$ ( $\mathcal{L}_{CC}$ )	$\omega_{3}$ ( $\mathcal{L}_{CWCL}$ )	$\omega_{4}$ ( $\mathcal{L}_{SDCL}$ )	C	B	M	S
0.2	0.2	0.3	0.3	45.72	41.32	47.08	31.39
0.1	0.2	0.3	0.3	45.06	39.37	45.14	31.09
0.3	0.2	0.3	0.3	43.04	38.75	44.69	29.58
0.2	0.1	0.3	0.3	44.15	39.15	46.00	30.62
0.2	0.3	0.3	0.3	44.78	40.01	46.56	30.74
0.2	0.2	0.2	0.3	43.42	39.24	45.55	30.40
0.2	0.2	0.4	0.3	45.52	39.88	45.73	30.20
0.2	0.2	0.3	0.2	44.58	40.42	47.35	30.72
0.2	0.2	0.3	0.4	45.26	40.16	46.91	30.49

Table 7: Sensitivity to weighting parameters of each loss function
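The weight configurations in this sensitivity study follow a one-at-a-time pattern around the best setting, which can be sketched as below. The helper names are hypothetical, and `total_loss` assumes that Eq. (9) adds the weighted auxiliary losses to the cross-entropy loss, a form that is not restated in this appendix.

```python
def one_at_a_time_sweep(base, step=0.1):
    """Return the base weights plus every variant with one weight moved by -step or +step."""
    configs = [tuple(base)]
    for i in range(len(base)):
        for delta in (-step, step):
            cfg = list(base)
            cfg[i] = round(cfg[i] + delta, 10)   # round away floating-point noise
            configs.append(tuple(cfg))
    return configs

def total_loss(ce, aux_losses, weights):
    """Cross-entropy plus weighted auxiliary losses (assumed form of Eq. (9))."""
    return ce + sum(w * l for w, l in zip(weights, aux_losses))

# Best-performing weights (omega_1 .. omega_4) from the first row of the table
base = (0.2, 0.2, 0.3, 0.3)
configs = one_at_a_time_sweep(base)
```

With four weights this yields nine configurations, in the same order as the table rows; each would correspond to one full training run.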

Methods	Pixel Accuracy	mIoU	Road	Sidewalk	Building	Wall	Fence	Pole	Traffic light	Traffic sign	Vegetation	Terrain	Sky	Person	Rider	Car	Truck	Bus	Train	Motorcycle	Bicycle
Baseline [3]	71.02	29.0	51.9	20.6	57.2	22.4	21.0	25.3	24.9	10.1	61.3	23.7	52.0	53.8	13.6	51.2	19.5	21.2	0.3	12.0	8.1
RobustNet [7]	77.18	37.3	58.9	27.7	63.2	22.8	23.1	26.4	30.6	20.7	85.1	39.2	69.8	62.4	15.9	76.7	23.2	22.3	3.9	18.4	18.6
SiamDoGe [53]	84.73	43.0	83.7	34.1	78.6	26.4	25.6	26.0	42.4	28.6	84.3	28.1	68.9	62.1	31.1	85.6	31.3	28.9	3.5	22.8	23.3
WildNet [24]	84.57	44.6	81.2	38.2	76.9	28.1	25.1	35.1	32.1	24.5	85.4	35.4	72.2	65.0	27.3	85.5	29.7	33.2	12.6	32.8	27.4
SPC [20]	86.65	44.1	86.9	37.8	81.2	28.9	26.9	36.9	35.1	25.2	83.7	36.2	78.5	63.9	30.4	84.1	24.8	28.1	12.1	19.3	17.9
DPCL [56]	82.22	44.7	75.6	32.8	73.2	26.1	23.5	34.1	42.3	28.2	85.2	38.5	81.2	63.8	25.0	76.6	31.7	33.9	5.7	27.6	45.0
Ours	87.91	45.7	88.3	44.1	82.4	30.9	26.8	35.4	33.4	20.3	85.0	34.2	78.5	66.0	33.7	86.8	33.0	41.1	1.4	25.3	22.1

Table 6: Quantitative results for pixel accuracy and each semantic class. The models are trained on GTAV and tested on Cityscapes using a ResNet50 backbone. The best and second best results are bolded and underlined, respectively.
