A Simple Recipe for Language-guided Domain Generalized Segmentation

Mohammad Fahes¹ Tuan-Hung Vu^1,2 Andrei Bursuc^1,2 Patrick Pérez³ Raoul de Charette¹
¹ Inria ² Valeo.ai ³ Kyutai

Abstract

Generalization to new domains not seen during training is one of the long-standing challenges in deploying neural networks in real-world applications. Existing generalization techniques either necessitate external images for augmentation, and/or aim at learning invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities, along with the potential of binding different modalities. For instance, the advent of vision-language models like CLIP has opened the doorway for vision models to exploit the textual modality. In this paper, we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: (i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning, (ii) language-driven local style augmentation, and (iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks. Code is accessible at https://github.com/astra-vision/FAMix.

https://astra-vision.github.io/FAMix

1 Introduction

A prominent challenge associated with deep neural networks is their constrained capacity to generalize when confronted with shifts in data distribution. This limitation is rooted in the assumption of data being independent and identically distributed, a presumption that frequently proves unrealistic in real-world scenarios. For instance, in safety-critical applications like autonomous driving, it is imperative for a segmentation model to exhibit resilient generalization capabilities when dealing with alterations in lighting, variations in weather conditions, and shifts in geographic location, among other considerations.

To address this challenge, domain adaptation (DA) [51, 37, 14, 52, 20, 36] has emerged; its core principle revolves around aligning the distributions of the source and target domains. However, DA hinges on having access to target data, which may not always be available. Even when accessible, this data might not encompass the full spectrum of distributions encountered in diverse real-world scenarios. Domain generalization (DG) [68, 67, 53, 34, 56, 35] overcomes this limitation by enhancing the robustness of models to arbitrary and previously unseen domains.

The training of segmentation networks is often backed by large-scale pretraining as initialization for the feature representation. Until now, to the best of our knowledge, domain generalization for semantic segmentation (DGSS) networks [40, 7, 41, 62, 55, 25, 57, 21, 31, 26] are pretrained with ImageNet [9]. The underlying concept is to transfer the representations from the upstream task of classification to the downstream task of segmentation.

Lately, contrastive language image pretraining (CLIP) [43, 24, 59, 60] has demonstrated that transferable visual representations could be learned from the sole supervision of loose natural language descriptions at very large scale. Subsequently, a plethora of applications have been proposed using CLIP [43], including zero-shot semantic segmentation [33, 64], image editing [29], transfer learning [44, 11], open-vocabulary object detection [17], few-shot learning [70, 69] etc. A recent line of research proposes fine-tuning techniques to preserve the robustness of CLIP under distribution shift [54, 28, 16, 49], but they are limited to classification.

In this paper, we aim at answering the following question: How to leverage CLIP pretraining for enhanced domain generalization for semantic segmentation? The motivation for rethinking DGSS with CLIP is twofold. On one hand, distribution robustness is a notable characteristic of CLIP [13]. On the other hand, the language modality offers an extra source of information compared to unimodal pretrained models.

A direct comparison of two segmentation models trained under identical conditions but with different pretraining, i.e. ImageNet vs. CLIP, shows that CLIP pretraining does not yield promising results. Indeed, Tab. 1 shows that fine-tuning a CLIP-initialized network performs worse than its ImageNet counterpart on out-of-distribution (OOD) data. This raises doubts about the suitability of CLIP pretraining for DGSS and indicates that it is more prone to overfitting the source distribution at the expense of degrading its original distributional robustness properties. Note that both models converge and achieve similar results on in-domain data. More details are provided in Appendix A.

Pretraining	C	B	M	S	AN	AS	AR	AF	Mean
ImageNet	29.04	32.17	34.26	29.87	4.36	22.38	28.34	26.76	25.90
CLIP	16.81	16.31	17.80	27.10	2.95	8.58	14.35	13.61	14.69

Table 1: Comparison of ImageNet and CLIP pretraining for out-of-distribution semantic segmentation. The network is DeepLabv3+ with ResNet-50 as backbone. The models are trained on GTAV and the performance (mIoU %) is reported on Cityscapes (C), BDD-100K (B), Mapillary (M), SYNTHIA (S), and ACDC Night (AN), Snow (AS), Rain (AR) and Fog (AF).

This paper shows that we can prevent such behavior with a simple recipe involving minimal fine-tuning, language-driven style augmentation, and mixing. Our approach is coined FAMix, for Freeze, Augment and Mix.

It was recently argued that fine-tuning might distort the pretrained representations and negatively affect OOD generalization [28]. To maintain the integrity of the representation, one extreme approach is to entirely freeze the backbone. However, this can undermine representation adaptability and lead to subpar OOD generalization. As a middle-ground strategy balancing adaptation and feature preservation, we suggest minimal fine-tuning of the backbone, where a substantial portion remains frozen, and only the final layers undergo fine-tuning.

For generalization, we show that rethinking MixStyle [67] leads to significant performance gains. As illustrated in Fig. 1, we mix the statistics of the original source features with augmented statistics mined using language. This helps explore styles beyond the source distribution at training time without using any additional images.

We summarize our contributions as follows:

• We propose a simple framework for DGSS based on minimal fine-tuning of the backbone and language-driven style augmentation. To the best of our knowledge, we are the first to study DGSS with CLIP pretraining.

• We propose language-driven class-wise local style augmentation. We mine class-specific local statistics using prompts that express random styles and names of patch-wise dominant classes. During training, randomization is performed through patch-wise style mixing of the source and mined styles.

• We conduct careful ablations to show the effectiveness of FAMix. Our framework outperforms state-of-the-art approaches in single and multi-source DGSS settings.

Figure 1: Mixing strategies. (Left) MixStyle [67] consists of a linear mixing between the feature statistics of samples from the source domain(s) S. (Right) We apply an augmentation $\mathcal{A}(\cdot)$ on the source domain statistics, then perform linear mixing between original and augmented statistics. Intuitively, this enlarges the support of the training distribution by leveraging statistics beyond the source domain(s), as well as discovering intermediate domains. $\mathcal{A}(\cdot)$ could be a language-driven or Gaussian noise augmentation, and we show that the former leads to better generalization results.

2 Related works

Domain generalization (DG). The goal of DG is to train, from a single or multiple source domains, models that perform well under arbitrary domain shifts. The DG literature spans a broad range of approaches, including adversarial learning [35, 61], meta-learning [4, 42], data augmentation [67, 65, 66] and domain-invariant representation learning [3, 1, 27, 7]. We refer the reader to [68, 53] for comprehensive surveys on DG.

Domain generalization with CLIP. CLIP [43] exhibits a remarkable distributional robustness [13]. Nevertheless, fine-tuning comes at the expense of sacrificing generalization. Kumar et al. [28] observe that full fine-tuning can distort the pretrained representation, and propose a two-stage strategy, consisting of training a linear probe with a frozen feature extractor, then fine-tuning both. Wortsman et al. [54] propose ensembling the weights of zero-shot and fine-tuned models. Goyal et al. [16] show that preserving the pretraining paradigm (i.e. contrastive learning) during the adaptation to the downstream task improves both in-domain (ID) and OOD performance without multi-step fine-tuning or weight ensembling. CLIPood [49] introduces a margin metric softmax training objective and a Beta moving average for optimization to handle both open-class and open-domain settings at test time. On the other hand, distributional robustness could be improved by training a small number of parameters on top of a frozen CLIP backbone in a teacher-student manner [23, 30]. Other works show that specialized prompt ensembling and/or image ensembling strategies [2, 15] coupled with label augmentation using the WordNet hierarchy improve robustness in classification.

Domain Generalized Semantic Segmentation. DGSS methods could be categorized into three main groups: normalization methods, domain randomization (DR) and invariant representation learning. Normalization methods aim at removing style contribution from the representation. For instance, IBN-Net [40] shows that Instance Normalization (IN) makes the representation invariant to variations in the scene appearance (e.g., change of colors, illumination, etc.), and that combining IN and batch normalization (BN) helps the synthetic-to-real generalization. SAN & SAW [41] proposes semantic-aware feature normalization and whitening, while RobustNet [7] proposes an instance selective whitening loss, where only feature covariances that are sensitive to photometric transformations are whitened. DR aims instead at diversifying the data during training. Some methods use additional data for DR. For example, WildNet [31] uses ImageNet [9] data for content and style extension learning, while TLDR [26] proposes learning texture from random style images. Other methods like SiamDoGe [55] perform DR solely by data augmentation, using a Siamese [6] structure. Finally in the invariant representation learning group, SPC-Net [21] builds a representation space based on style and semantic projection and clustering, and SHADE [62] regularizes the training with a style consistency loss and a retrospection consistency loss.

3 Method

FAMix proposes an effective recipe for DGSS through the blending of simple ingredients. It consists of two stages (see Fig. 2): (i) Local style mining from language (Sec. 3.2); (ii) Training of a segmentation network with minimal fine-tuning and local style mixing (Sec. 3.3). In Fig. 2 and in the following, CLIP-I1 denotes the stem layers and Layer1 of CLIP image encoder, CLIP-I2 the remaining layers excluding the attention pooling, and CLIP-T the text encoder.

We start with some preliminary background knowledge, introducing AdaIN and PIN which are essential to our work.

3.1 Preliminaries

Adaptive Instance Normalization (AdaIN). For a feature map $\mathbf{f}\in\mathbb{R}^{h\times w\times c}$ , AdaIN [22] shows that the channel-wise mean $\boldsymbol{\mu}\in\mathbb{R}^{c}$ and standard deviation $\boldsymbol{\sigma}\in\mathbb{R}^{c}$ capture information about the style of the input image, allowing style transfer between images. Hence, stylizing a source feature ${\mathbf{f}}_{\text{s}}$ with an arbitrary target style $(\mu({\mathbf{f}}_{\text{t}}),\sigma({\mathbf{f}}_{\text{t}}))$ reads:

	$\displaystyle\texttt{AdaIN}({\mathbf{f}}_{\text{s}},{\mathbf{f}}_{\text{t}})=\sigma({\mathbf{f}}_{\text{t}})\Big(\frac{{\mathbf{f}}_{\text{s}}-\mu({\mathbf{f}}_{\text{s}})}{\sigma({\mathbf{f}}_{\text{s}})}\Big)+\mu({\mathbf{f}}_{\text{t}}),$		(1)

with $\mu(\cdot)$ and $\sigma(\cdot)$ the mean and standard deviation of input feature; multiplications and additions being element-wise.
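To make Eq. (1) concrete, here is a minimal sketch in plain Python. It assumes a feature map is stored as a list of channels, each a flat list of spatial activations, and it takes the target statistics directly as arguments (the form in which AdaIN is used later in Algorithm 2). The names `channel_stats`, `adain` and the constant `EPS` are illustrative choices, not part of the original method description; an actual implementation would operate on tensors.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # assumed small constant to avoid division by zero


def channel_stats(feat):
    """Per-channel mean and standard deviation.

    `feat` is a list of channels, each a flat list of spatial activations.
    """
    mu = [fmean(ch) for ch in feat]
    sigma = [pstdev(ch, m) for ch, m in zip(feat, mu)]
    return mu, sigma


def adain(feat_s, mu_t, sigma_t):
    """Eq. (1): normalize each source channel, then rescale and shift it
    with the target statistics (element-wise per channel)."""
    mu_s, sigma_s = channel_stats(feat_s)
    return [
        [st * (v - ms) / (ss + EPS) + mt for v in ch]
        for ch, ms, ss, mt, st in zip(feat_s, mu_s, sigma_s, mu_t, sigma_t)
    ]


# Toy features with c = 2 channels and h * w = 4 positions each.
source = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 11.0, 11.5]]
target = [[5.0, 7.0, 9.0, 11.0], [-1.0, 0.0, 1.0, 2.0]]
stylized = adain(source, *channel_stats(target))
```

After the call, each channel of `stylized` keeps the spatial layout of `source` while its mean and standard deviation match those of `target` (up to the effect of `EPS`).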

Prompt-driven Instance Normalization (PIN). PIN was introduced for prompt-driven zero-shot domain adaptation in PØDA [11]. It replaces the target style $(\mu({\mathbf{f}}_{\text{t}}),\sigma({\mathbf{f}}_{\text{t}}))$ in AdaIN (1) with two optimizable variables $(\boldsymbol{\mu},\boldsymbol{\sigma})$ guided by a single prompt in natural language. The rationale is to leverage a frozen CLIP [43] to mine visual styles from the prompt representation in the shared space. Given a prompt $P$ and a feature map ${\mathbf{f}}_{\text{s}}$ , PIN reads as:

	$\displaystyle\texttt{PIN}_{(P)}({\mathbf{f}}_{\text{s}})=\boldsymbol{\sigma}\Big(\frac{{\mathbf{f}}_{\text{s}}-\mu({\mathbf{f}}_{\text{s}})}{\sigma({\mathbf{f}}_{\text{s}})}\Big)+\boldsymbol{\mu},$		(2)

where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are optimized using gradient descent, such that the cosine distance between the visual feature representation and the prompt representation is minimized.

Different from PØDA which mines styles globally with a predetermined prompt describing the target domain, we make use of PIN to mine class-specific styles using local patches of the features, leveraging random style prompts. Further, we show the effectiveness of incorporating the class name in the prompt for better style mining.

3.2 Local Style Mining

Our approach is to leverage PIN to mine class-specific style banks that are used for feature augmentation when training FAMix. Given a set of cropped images ${\mathcal{I}}_{\text{s}}$ , we encode them using CLIP-I1 to get a set of low-level features ${\mathcal{F}}_{\text{s}}$ . Each batch $b$ of features ${\mathbf{f}}_{\text{s}}\in{\mathcal{F}}_{\text{s}}$ is cropped into $m$ patches, resulting in $b\times m$ patches $\mathbf{f}_{p}$ , and associated ground-truth annotation $\mathbf{y}_{p}$ , of size $\nicefrac{{h}}{{\sqrt{m}}}\times\nicefrac{{w}}{{\sqrt{m}}}\times c$ .

We aim at populating $K$ style banks, $K$ being the total number of classes. For a feature patch $\mathbf{f}_{p}$ , we compute the dominant class from the corresponding label patch $\mathbf{y}_{p}$ , and get its name $t_{p}$ from the predefined classes in the training dataset. Given a set of prompts describing random styles $\mathcal{R}$ , the target prompt $P_{p}$ is formed by concatenating a randomly sampled style prompt $r$ from $\mathcal{R}$ and $t_{p}$ (e.g., retro futurism style building). We show in the experiments (Sec. 4.4) that our method is not very sensitive to the prompt design, yet our prompt construction works best.
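The patch-level bookkeeping described above can be sketched as follows: a label map is split into a $\sqrt{m}\times\sqrt{m}$ grid, the dominant class of each patch is computed, and a prompt is formed from a random style and the class name. This is only an illustration under stated assumptions: of the style prompts below, only "retro futurism style" appears in the text; the other prompts, the class-id subset, the `IGNORE` id and all function names are hypothetical.

```python
import math
import random
from collections import Counter

# Illustrative values: only "retro futurism style" comes from the text;
# the other style prompts and the class subset are placeholders.
STYLE_PROMPTS = ["retro futurism style", "watercolor style", "pencil sketch style"]
CLASS_NAMES = {0: "road", 2: "building", 10: "sky", 13: "car"}
IGNORE = 255  # assumed id for unlabeled pixels


def crop_patches(label_map, m):
    """Split a 2-D label map (list of rows) into a sqrt(m) x sqrt(m) grid."""
    g = math.isqrt(m)
    assert g * g == m, "m must be a perfect square"
    h, w = len(label_map), len(label_map[0])
    ph, pw = h // g, w // g
    return [
        [row[j * pw:(j + 1) * pw] for row in label_map[i * ph:(i + 1) * ph]]
        for i in range(g)
        for j in range(g)
    ]


def dominant_class(label_patch):
    """Most frequent labeled class id in the patch, or None if unlabeled."""
    counts = Counter(v for row in label_patch for v in row if v != IGNORE)
    return counts.most_common(1)[0][0] if counts else None


def build_prompt(class_id, rng=random):
    """Concatenate a randomly sampled style prompt with the class name."""
    return f"{rng.choice(STYLE_PROMPTS)} {CLASS_NAMES[class_id]}"


labels = [
    [10, 10, 2, 2],
    [10, 10, 2, 0],
    [0, 0, 13, 13],
    [0, 255, 13, 0],
]
classes = [dominant_class(p) for p in crop_patches(labels, m=4)]
prompts = [build_prompt(c) for c in classes]
```

With this toy label map, the four patches are dominated by sky, building, road and car, so a possible prompt for the second patch is "retro futurism style building".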

The idea is to mine proxy domains and explore intermediate ones in a class-aware manner (as detailed in Sec. 3.3), which makes our work fundamentally different from [11], that steers features towards a particular target style and corresponding domain, and better suited to generalization.

To handle the class imbalance problem, we simply select one feature patch $\mathbf{f}_{p}$ per class among the total $b\times m$ patches, as shown in Fig. 2. Consequently, we apply PIN (2) to optimize the local styles to match the representations of their corresponding prompts, and use the mined styles to populate the corresponding style banks. The complete procedure is outlined in Algorithm 1.

The resulting style banks $\{\mathcal{T}^{(1)},\cdots,\mathcal{T}^{(K)}\}$ are used for domain randomization during training.

Input: Set ${\mathcal{F}}_{\text{s}}$ of source features batches. Label set ${\mathcal{Y}}_{\text{s}}$ in ${\mathcal{D}}_{\text{s}}$ . Set of random prompts $\mathcal{R}$ and class names $\mathcal{C}$ .
Param: Number of patches $m$ . Number of classes $K$ .
Output: $K$ sets $\{\mathcal{T}^{(1)},\cdots,\mathcal{T}^{(K)}\}$ of class-wise augmented statistics.
1 $\{\mathcal{T}^{(1)},\cdots,\mathcal{T}^{(K)}\}\leftarrow\emptyset$
2 foreach $({\mathbf{f}}_{\text{s}}\in{\mathcal{F}}_{\text{s}},{\mathbf{y}}_{\text{s}}\in{\mathcal{Y}}_{\text{s}})$ do
3     $\{\mathbf{y}_{p}\}\leftarrow\texttt{crop-patch}({\mathbf{y}}_{\text{s}},m)$
4     $\{c_{p}\},\{P_{p}\},\{\mathbf{f}_{p}\}\leftarrow\emptyset$
5     foreach $\mathbf{y}_{p}\in\{\mathbf{y}_{p}\}$ do
6         $c_{p}\leftarrow\texttt{get-dominant-class}(\mathbf{y}_{p})$
7         if $c_{p}$ not in $\{c_{p}\}$ then
8             $\{c_{p}\}\leftarrow c_{p}$
9             $\{P_{p}\}\leftarrow\texttt{concat}(\texttt{sample}(\mathcal{R}),\texttt{get-name}(c_{p}))$
10            $\{\mathbf{f}_{p}\}\leftarrow\mathbf{f}_{p}$
11        end if
12    end foreach
13    $\boldsymbol{\mu}^{(c_{p})},\boldsymbol{\sigma}^{(c_{p})},\mathbf{f}_{p}^{\prime}\leftarrow\texttt{PIN}_{(P_{p})}(\mathbf{f}_{p})$
14    $\mathcal{T}^{(c_{p})}\leftarrow\mathcal{T}^{(c_{p})}\cup\{(\boldsymbol{\mu}^{(c_{p})},\boldsymbol{\sigma}^{(c_{p})})\}$
15 end foreach

Algorithm 1 Local Style Mining.
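The control flow of Algorithm 1 can be sketched in Python as below. The sketch covers only the per-class patch selection and the population of the style banks; the PIN optimization against a frozen CLIP is replaced by a hypothetical callable `pin(prompt, patch)` that returns a `(mu, sigma)` pair, and `dummy_pin` is a placeholder that performs no optimization at all. All names and the toy inputs are assumptions made for illustration.

```python
import random


def mine_batch(patches, classes, banks, pin, style_prompts, class_names, rng=random):
    """One outer iteration of the mining loop for a batch of patches.

    patches / classes: feature patches and their dominant class ids.
    banks: dict mapping class id -> list of mined (mu, sigma) styles.
    pin: hypothetical callable (prompt, patch) -> (mu, sigma), standing in
         for the PIN optimization against a frozen CLIP.
    """
    selected = {}  # class id -> (prompt, patch); first patch seen per class
    for patch, c in zip(patches, classes):
        if c is not None and c not in selected:
            prompt = f"{rng.choice(style_prompts)} {class_names[c]}"
            selected[c] = (prompt, patch)
    # Mine one style per selected class and store it in that class's bank.
    for c, (prompt, patch) in selected.items():
        banks.setdefault(c, []).append(pin(prompt, patch))
    return banks


def dummy_pin(prompt, patch):
    # Placeholder: returns fixed statistics instead of optimizing them.
    return ([0.0], [1.0])


banks = {}
for _ in range(3):  # three toy batches
    mine_batch(
        patches=["p0", "p1", "p2", "p3"],
        classes=[0, 13, 0, None],
        banks=banks,
        pin=dummy_pin,
        style_prompts=["retro futurism style"],
        class_names={0: "road", 13: "car"},
    )
```

Selecting a single patch per class in each batch means every class present in the batch contributes exactly one style per iteration, which is how the sketch reflects the class-balancing step described in the text.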

Input: Set ${\mathcal{F}}_{\text{s}}$ of source features batches. Label set ${\mathcal{Y}}_{\text{s}}$ in ${\mathcal{D}}_{\text{s}}$ . $K$ sets $\{\mathcal{T}^{(1)},\cdots,\mathcal{T}^{(K)}\}$ of class-wise augmented statistics.
Param: Number of patches $m$ .
1 foreach $({\mathbf{f}}_{\text{s}}\in{\mathcal{F}}_{\text{s}},{\mathbf{y}}_{\text{s}}\in{\mathcal{Y}}_{\text{s}})$ do
2     $\alpha\sim\texttt{Beta}(0.1,0.1)$
3     for $(i,j)\in[1,\sqrt{m}]\times[1,\sqrt{m}]$ do
4         $c_{p}^{(ij)}\leftarrow\texttt{get-dominant-class}({\mathbf{y}}_{\text{s}}^{(ij)})$
5         $\boldsymbol{\mu}^{(ij)},\boldsymbol{\sigma}^{(ij)}\leftarrow\texttt{sample}(\mathcal{T}^{(c_{p}^{(ij)})})$
6         $\mu_{\textit{mix}}\leftarrow(1-\alpha)\mu({\mathbf{f}}_{\text{s}}^{(ij)})+\alpha\boldsymbol{\mu}^{(ij)}$
7         $\sigma_{\textit{mix}}\leftarrow(1-\alpha)\sigma({\mathbf{f}}_{\text{s}}^{(ij)})+\alpha\boldsymbol{\sigma}^{(ij)}$
8         ${\mathbf{f}}_{\text{s}}^{(ij)}\leftarrow\texttt{AdaIN}({\mathbf{f}}_{\text{s}}^{(ij)},\mu_{\textit{mix}},\sigma_{\textit{mix}})$
9     end for
10    ${\mathbf{\tilde{y}}}_{\text{s}}\leftarrow\texttt{CLIP-I2}({\mathbf{f}}_{\text{s}})$
11    $\texttt{Loss}=\texttt{cross-entropy}({\mathbf{\tilde{y}}}_{\text{s}},{\mathbf{y}}_{\text{s}})$
12 end foreach

Algorithm 2 Training FAMix.

Method	arch.	C	B	M	S	AN	AS	AR	AF	Mean
RobustNet [7]	RN50	36.58	35.20	40.33	28.30	6.32	29.97	33.02	32.56	30.29
SAN & SAW [41]		39.75	37.34	41.86	30.79	-	-	-	-	-
Pin the memory [25]		41.00	34.60	37.40	27.08	3.84	5.51	5.89	7.27	20.32
SHADE [62]		44.65	39.28	43.34	28.41	8.18	30.38	35.44	36.87	33.32
SiamDoGe [55]		42.96	37.54	40.64	28.34	10.60	30.71	35.84	36.45	32.89
DPCL [57]		44.87	40.21	46.74	-	-	-	-	-	-
SPC-Net [21]		44.10	40.46	45.51	-	-	-	-	-	-
NP [12]		40.62	35.56	38.92	27.65	-	-	-	-	-
WildNet* [31]		44.62	38.42	46.09	31.34	8.27	30.29	36.32	35.39	33.84
TLDR* [26]		46.51	42.58	46.18	30.57	13.13	36.02	38.89	40.58	36.81
FAMix (ours)		48.15	45.61	52.11	34.23	14.96	37.09	38.66	40.25	38.88
SAN & SAW [41]	RN101	45.33	41.18	40.77	31.84	-	-	-	-	-
SHADE^† [62]		46.66	43.66	45.50	31.58	7.58	32.48	36.90	36.69	35.13
WildNet* [31]		45.79	41.73	47.08	32.51	-	-	-	-	-
TLDR* [26]		47.58	44.88	48.80	33.14	-	-	-	-	-
FAMix (ours)		49.47	46.40	51.97	36.72	19.89	41.38	40.91	42.15	41.11

Table 2: Single-source DGSS trained on GTAV. Performance (mIoU %) of FAMix compared to other DGSS methods trained on G and evaluated on C, B, M, S, A for ResNet-50 (‘RN50’) and ResNet-101 (‘RN101’) backbone architecture (‘arch.’). * indicates the use of extra data. † indicates the use of the full data for training. We emphasize best and second best results.

3.3 Training FAMix

Style randomization. During training, randomly cropped images ${\mathcal{I}}_{\text{s}}$ are encoded into ${\mathbf{f}}_{\text{s}}$ using CLIP-I1. Each batch of feature maps ${\mathbf{f}}_{\text{s}}$ is viewed as a grid of $m$ patches, without cropping them. For each patch ${\mathbf{f}}_{\text{s}}{\!}^{(ij)}$ within the grid, the dominant class $c_{p}{\!}^{(ij)}$ is queried using the corresponding ground truth patch ${\mathbf{y}}_{\text{s}}{\!}^{(ij)}$ , and a style is randomly sampled from the corresponding mined bank $\mathcal{T}{}^{(c_{p}{\!}^{(ij)})}$ . We then apply patch-wise convex combination (i.e., style mixing) of the original style of the patch and the mined style. Specifically, for an arbitrary patch ${\mathbf{f}}_{\text{s}}{\!}^{(ij)}$ , our local style mixing reads:

	$\displaystyle\mu_{\textit{mix}}\leftarrow(1-\alpha)\mu({\mathbf{f}}_{\text{s}}^{(ij)})+\alpha\boldsymbol{\mu}^{(ij)}$		(3)
	$\displaystyle\sigma_{\textit{mix}}\leftarrow(1-\alpha)\sigma({\mathbf{f}}_{\text{s}}^{(ij)})+\alpha\boldsymbol{\sigma}^{(ij)},$		(4)

with $(\boldsymbol{\mu}^{(ij)},\boldsymbol{\sigma}^{(ij)})\in\mathcal{T}^{(c_{p}^{(ij)})}$ and $\alpha\in[0,1]^{c}$ .

As shown in Fig. 1, our style mixing strategy differs from [67], which applies a linear interpolation between styles extracted from the images of a limited set of source domain(s) assumed to be available for training. Here, we view the mined styles as variations of multiple proxy target domains defined by the prompts. Training is conducted over all the paths in the feature space between the source and proxy domains without requiring any images during training other than those from the source.

Style transfer is applied through AdaIN (1). Only the standard cross-entropy loss between the ground truth ${\mathbf{y}}_{\text{s}}$ and the prediction ${\mathbf{\tilde{y}}}_{\text{s}}$ is applied for training the network. Algorithm 2 shows the training steps of FAMix.
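The mixing of Eqs. (3) and (4) followed by AdaIN can be sketched for a single patch as below. The sketch assumes a patch is a list of channels (flat lists of activations), that a mined style is a `(mu, sigma)` pair of per-channel lists, and that a single scalar $\alpha$ is drawn from Beta(0.1, 0.1) as in Algorithm 2; a per-channel $\alpha\in[0,1]^{c}$ would instead draw one value per channel. The function names and `EPS` are illustrative assumptions.

```python
import random
from statistics import fmean, pstdev

EPS = 1e-5  # assumed numerical-stability constant


def mix_patch(patch, mined_mu, mined_sigma, alpha):
    """Eqs. (3)-(4) followed by AdaIN on one patch.

    patch: list of channels, each a flat list of activations.
    alpha: scalar mixing weight in [0, 1].
    """
    out = []
    for ch, mu_b, sigma_b in zip(patch, mined_mu, mined_sigma):
        mu_s = fmean(ch)
        sigma_s = pstdev(ch, mu_s)
        # Convex combination of the source style and the mined style.
        mu_mix = (1 - alpha) * mu_s + alpha * mu_b
        sigma_mix = (1 - alpha) * sigma_s + alpha * sigma_b
        out.append([sigma_mix * (v - mu_s) / (sigma_s + EPS) + mu_mix for v in ch])
    return out


def sample_alpha(rng=random):
    """alpha ~ Beta(0.1, 0.1), as in Algorithm 2."""
    return rng.betavariate(0.1, 0.1)


patch = [[0.0, 1.0, 2.0, 3.0], [4.0, 6.0, 8.0, 10.0]]
bank = [([5.0, -2.0], [0.5, 3.0])]  # one mined (mu, sigma) for this class
mu_b, sigma_b = random.choice(bank)
mixed = mix_patch(patch, mu_b, sigma_b, sample_alpha())
```

Because Beta(0.1, 0.1) concentrates its mass near 0 and 1, most patches end up close to either the original source style or the mined style, with intermediate styles sampled less often. With `alpha = 0` the patch is returned essentially unchanged, and with `alpha = 1` it fully adopts the mined statistics.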

Minimal fine-tuning. During training, we fine-tune only the last few layers of the backbone. Subsequently, we examine various alternatives and show that the minimal extent of fine-tuning is the crucial factor in witnessing the effectiveness of our local style mixing strategy.

Previous works [67, 12, 40] suggest that shallow feature statistics capture style information while deeper features encode semantic content. Consequently, some DGSS methods focus on learning style-agnostic representations [7, 40, 41], but this can compromise the expressiveness of the representation and suppress content information. In contrast, our intuition is to retain these identified traits by introducing variability to the shallow features through augmentation and mixing. Simultaneously, we guide the network to learn invariant high-level representations by training the final layers of the backbone with a label-preserving assumption, using a standard cross-entropy loss.

4 Experiments

4.1 Experimental setup

Synthetic datasets. GTAV [45] and SYNTHIA [46] are used as synthetic datasets. GTAV consists of 24 966 images split into 12 403 images for training, 6 382 for validation and 6 181 for testing. SYNTHIA consists of 9 400 images: 6 580 for training and 2 820 for validation. GTAV and SYNTHIA are denoted by G and S, respectively.

Real datasets. Cityscapes [8], BDD-100K [58], and Mapillary [39] contain 2 975, 7 000, and 18 000 images for training and 500, 1 000, and 2 000 images for validation, respectively. ACDC [48] is a dataset of driving scenes in adverse conditions: night, snow, rain and fog with respectively 106, 100, 100 and 100 images in the validation sets. C, B, and M denote Cityscapes, BDD-100K and Mapillary, respectively; AN, AS, AR and AF denote night, snow, rain and fog subsets of ACDC, respectively.

Implementation details. Following previous works [7, 41, 25, 62, 55, 57, 21, 31, 26], we adopt DeepLabv3+ [5] as the segmentation model. ResNet-50 and ResNet-101 [19], initialized with CLIP pretrained weights, are used in our experiments as backbones. Specifically, we remove the attention pooling layer and add a randomly initialized decoder head. The output stride is $16$ . Single-source and multi-source models are trained respectively for $40K$ and $60K$ iterations with a batch size of $8$ . The training images are cropped to $768\times 768$ . Stochastic Gradient Descent (SGD) with a momentum of $0.9$ and weight decay of $10^{-4}$ is used as the optimizer. Polynomial decay with a power of $0.9$ is used, with an initial learning rate of $10^{-1}$ for the classifier and $10^{-2}$ for the backbone. We use color jittering and horizontal flip as data augmentation. Label smoothing regularization [50] is adopted. For style mining, Layer1 features are divided into $9$ patches. Each patch is resized to $56\times 56$ , corresponding to the dimensions of Layer1 features for an input image of size $224\times 224$ (i.e. the input dimension of CLIP). We use ImageNet templates (https://github.com/openai/CLIP/) for each prompt.
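As a small illustration of the learning-rate schedule, the sketch below implements the commonly used "poly" decay, $\text{lr}=\text{lr}_{0}\,(1-t/T)^{0.9}$, with the initial rates and the single-source iteration budget stated above. The exact functional form is an assumption (it is the usual convention for DeepLab-style training, not spelled out in the text), and the names `poly_lr`, `MAX_ITERS` and `BASE_LR` are illustrative.

```python
def poly_lr(base_lr, iteration, max_iters, power=0.9):
    """Polynomial ("poly") learning-rate decay (assumed standard form)."""
    return base_lr * (1.0 - iteration / max_iters) ** power


MAX_ITERS = 40_000  # single-source setting
BASE_LR = {"classifier": 1e-1, "backbone": 1e-2}

# Learning rates of both parameter groups at a few training iterations.
for it in (0, 20_000, 39_999):
    lrs = {name: poly_lr(lr, it, MAX_ITERS) for name, lr in BASE_LR.items()}
    print(it, lrs)
```

Both parameter groups follow the same decay curve, so the classifier learning rate stays ten times larger than the backbone learning rate throughout training.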

Evaluation metric. We evaluate our models on the validation sets of the unseen target domains with mean Intersection over Union (mIoU%) of the $19$ shared semantic classes. For each experiment, we report the average of three runs.
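For reference, the metric can be sketched as follows: per-class IoU is TP / (TP + FP + FN), and mIoU is the mean over classes. The sketch works on flat lists of class ids; the `ignore_index` of 255 and the choice to skip classes that never occur in either prediction or ground truth are assumptions, and in practice the counts are accumulated over the whole validation set before averaging.

```python
def mean_iou(preds, targets, num_classes, ignore_index=255):
    """mIoU (%) from flat lists of predicted and ground-truth class ids."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, t in zip(preds, targets):
        if t == ignore_index:
            continue  # unlabeled pixels do not count
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # Classes absent from both prediction and ground truth are skipped.
    ious = [
        tp[c] / (tp[c] + fp[c] + fn[c])
        for c in range(num_classes)
        if tp[c] + fp[c] + fn[c] > 0
    ]
    return 100.0 * sum(ious) / len(ious)


preds = [0, 0, 1, 1, 2, 2, 0, 1]
targets = [0, 0, 1, 2, 2, 2, 255, 0]
score = mean_iou(preds, targets, num_classes=3)
```

In this toy example the per-class IoUs are 2/3, 1/3 and 2/3, so `score` is 500/9, i.e. about 55.56.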

4.2 Comparison with DGSS methods

Single-source DGSS. We compare FAMix with state-of-the-art DGSS methods under the single-source setting.

Training on GTAV (G) as source, Tab. 2 reports models trained with either ResNet-50 or ResNet-101 backbones. The unseen target datasets are C, B, M, S, and the four subsets of A. Tab. 2 shows that our method outperforms the baselines on nearly all datasets for both backbones and achieves the best mean mIoU; the only exceptions are AR and AF with ResNet-50, where TLDR [26] is marginally better. We note that WildNet [31] and TLDR [26] use extra data, while SHADE [62] uses the full G dataset (24 966 images) for training with ResNet-101. Class-wise performances are reported in Appendix B.

Training on Cityscapes (C) as source, Tab. 3 reports performance with ResNet-50 backbone. The unseen target datasets are B, M, G, and S. The table shows that our method outperforms the baselines on average and is competitive with the state of the art on G and M.

Method	B	M	G	S	Mean
RobustNet [7]	50.73	58.64	45.00	26.20	45.14
Pin the memory [25]	46.78	55.10	-	-	-
SiamDoGe [55]	51.53	59.00	45.08	26.67	45.57
WildNet* [31]	50.94	58.79	47.01	27.95	46.17
DPCL [57]	52.29	-	46.00	26.60	-
FAMix (ours)	54.07	58.72	45.12	32.67	47.65

Table 3: Single-source DGSS trained on Cityscapes. Performance (mIoU %) of FAMix compared to other DGSS methods trained on C and evaluated on B, M, G and S for ResNet-50 backbone. * indicates the use of extra-data. We emphasize best and second best results.

Multi-source DGSS. We also show the effectiveness of FAMix in the multi-source setting, training on G+S and evaluating on C, B and M. The results reported in Tab. 4 for the ResNet-50 backbone outperform the state of the art on all three target datasets.

Method	C	B	M	Mean
RobustNet [7]	37.69	34.09	38.49	36.76
Pin the memory [25]	44.51	38.07	42.70	41.76
SHADE [62]	47.43	40.30	47.60	45.11
SPC-Net [21]	46.36	43.18	48.23	45.92
TLDR* [26]	48.83	42.58	47.80	46.40
FAMix (ours)	49.41	45.51	51.61	48.84

Table 4: Multi-source DGSS trained on GTAV + SYNTHIA. Performance (mIoU %) of FAMix compared to other DGSS methods trained on G+S and evaluated on C, B, M for ResNet-50 backbone. * indicates the use of extra-data. We emphasize best and second best results.

Qualitative results. We visually compare the segmentation results with Pin the memory [25], SHADE [62] and WildNet [31] in Fig. 3. FAMix clearly outperforms other DGSS methods on “stuff” (e.g., road and sky) and “things” (e.g., bicycle and bus) classes.

4.3 Decoder-Probing Fine-Tuning (DP-FT)

Kumar et al. [28] show that standard fine-tuning may distort the pretrained feature representation, leading to degraded OOD performance for classification. Consequently, they propose a two-step training strategy: (1) Training a linear probe (LP) on top of the frozen backbone features, (2) Fine-tuning (FT) both the linear probe and the backbone. Inspired by it, Saito et al. [47] apply the same strategy for object detection, which is referred to as Decoder-probing Fine-tuning (DP-FT). They observe that DP-FT improves over DP depending on the architecture. We hypothesize that the effect is also dependent on the pretraining paradigm and the downstream task. As observed in Tab. 1, CLIP might remarkably overfit the source domain when fine-tuned. In Tab. 5, we compare fine-tuning (FT), decoder-probing (DP) and DP-FT. DP brings improvements over FT since it completely preserves the pretrained representation. Yet, the major drawback of DP lies in its limited ability to adapt features to the downstream task, resulting in suboptimal results. Surprisingly, DP-FT largely falls behind DP, meaning that the learned features over-specialize to the source domain distribution even with a “decoder warm-up”.

The results advocate for the need of specific strategies to preserve CLIP robustness for semantic segmentation. This need emerges from the additional gap between pretraining (i.e. aligning object-level and language representations) and fine-tuning (i.e. supervised pixel classification).

Method	C	B	M	S	AN	AS	AR	AF	Mean
FT	16.81	16.31	17.80	27.10	2.95	8.58	14.35	13.61	14.69
DP	34.13	37.67	42.21	29.10	10.71	26.26	29.47	30.40	29.99
DP-FT	25.62	21.71	26.39	31.45	4.22	18.26	20.07	20.85	21.07
FAMix (ours)	48.15	45.61	52.11	34.23	14.96	37.09	38.66	40.25	38.88

Table 5: FAMix vs. DP-FT. Performance (mIoU%) of FAMix compared to Fine-tuning (FT), Decoder-probing (DP) and Decoder-probing Fine-tuning (DP-FT). We use here ResNet-50, trained on GTAV. We emphasize best and second best results.

Freeze	Augment	Mix	C	B	M	S	AN	AS	AR	AF	Mean
✗	✗	✗	16.81	16.31	17.80	27.10	2.95	8.58	14.35	13.61	14.69
✗	✓	✗	22.48	26.05	24.15	25.40	4.83	17.61	22.86	19.75	20.39
✗	✗	✓	20.07	21.24	22.91	26.52	1.28	14.99	22.09	20.51	18.70
✗	✓	✓	27.53	26.59	26.27	26.91	4.90	18.91	25.60	22.14	22.36
✓	✗	✗	37.83	38.88	44.24	31.93	12.41	29.59	31.56	33.05	32.44
✓	✓	✗	36.65	35.73	37.32	30.44	14.72	34.65	34.91	38.98	32.93
✓	✗	✓	43.43	43.79	48.19	33.70	11.32	35.55	36.15	38.19	36.29
✓	✓	✓	48.15	45.61	52.11	34.23	14.96	37.09	38.66	40.25	38.88

Table 6: Ablation of FAMix components. Performance (mIoU %) after removing one or more components of FAMix.

4.4 Ablation studies

We conduct all the ablations on a ResNet-50 backbone with GTAV (G) as source dataset.

Removing ingredients from the recipe. FAMix is based on minimal fine-tuning of the backbone (i.e., Freeze), style augmentation and mixing. We show in Tab. 6 that the best generalization results are only obtained when combining the three ingredients. Specifically, when the backbone is fully fine-tuned (i.e., Freeze ✗), performance is largely harmed. When minimal fine-tuning is performed (i.e., Freeze ✓), we argue that the augmentations are too strong to be applied without style mixing; the latter brings both effects of domain interpolation and use of the original statistics. Consequently, when style mixing is not applied (i.e. Freeze ✓, Augment ✓, Mix ✗), the use of mined styles brings almost no improvement in mean OOD mIoU compared to training without augmentation (i.e. Freeze ✓, Augment ✗, Mix ✗): 32.93 vs. 32.44. Note that for Freeze ✓, Augment ✓, Mix ✗, line 8 of Algorithm 2 becomes:

	$\displaystyle{\mathbf{f}}_{\text{s}}^{(ij)}\leftarrow\texttt{AdaIN}({\mathbf{f}}_{\text{s}}^{(ij)},\boldsymbol{\mu}^{(ij)},\boldsymbol{\sigma}^{(ij)})$		(5)

Our style mixing differs from MixStyle [67] in being applied (1) patch-wise and (2) between the original styles of the source data and augmented versions of them. Note that the case (Freeze ✓, Augment ✗, Mix ✓) could be seen as a variant of MixStyle, yet applied locally and class-wise. Our complete recipe proves significantly more effective, with a boost of $\approx+6$ mean mIoU w.r.t. the baseline of training without augmentation and mixing.
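The reported gain can be checked directly from the per-dataset numbers in Tab. 6. The short sketch below recomputes the two means, assuming that "the baseline of training without augmentation and mixing" refers to the row (Freeze ✓, Augment ✗, Mix ✗); the variable names are illustrative.

```python
from statistics import fmean

# Per-dataset mIoU (%) copied from Tab. 6: C, B, M, S, AN, AS, AR, AF.
freeze_only = [37.83, 38.88, 44.24, 31.93, 12.41, 29.59, 31.56, 33.05]
full_recipe = [48.15, 45.61, 52.11, 34.23, 14.96, 37.09, 38.66, 40.25]

mean_freeze = fmean(freeze_only)
mean_full = fmean(full_recipe)
gain = mean_full - mean_freeze
print(f"{mean_freeze:.2f} -> {mean_full:.2f} (gain {gain:+.2f})")
```

The recomputed means agree with the Mean column of the table (32.44 and 38.88), and their difference is about 6.4 mIoU points.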

Prompt construction. Tab. 7 reports results when ablating the prompt construction. In FAMix, the final prompt is derived by concatenating $<$ random style prompt $>$ and $<$ class name $>$ ; removing either of those leads to inferior results. Interestingly, replacing the style prompt by random characters – e.g. “ioscjspa” – does not significantly degrade the performance. In certain aspects, using random prompts still induces a randomization effect within the FAMix framework. However, meaningful prompts still consistently lead to the best results.

RCP	RSP	CN	C	B	M	S	AN	AS	AR	AF	Mean
		✓	45.99	43.71	50.48	34.75	15.22	35.09	34.92	38.17	37.29
✓			46.10	44.24	48.90	33.62	13.39	35.99	36.68	39.86	37.35
	✓		45.64	44.59	49.13	33.64	15.33	37.32	35.98	38.85	37.56
✓		✓	47.83	44.83	50.38	34.27	14.43	37.07	37.07	38.76	38.08
	✓	✓	48.15	45.61	52.11	34.23	14.96	37.09	38.66	40.25	38.88

Table 7: Ablation on the prompt construction. Performance (mIoU %) for different prompt constructions. RCP, RSP and CN refer to $<$ random character prompt $>$, $<$ random style prompt $>$ and $<$ class name $>$, respectively.

Number of style prompts. FAMix uses a set $\mathcal{R}$ of random style prompts which are concatenated with the class names; $\mathcal{R}$ is formed by querying ChatGPT (https://chat.openai.com/) using $<$ give me 20 prompts of 2 to 5 words describing random image styles $>$. The output prompts are provided in Appendix C. Fig. 4(a) shows that the size of $\mathcal{R}$ has a marginal impact on FAMix performance. Yet, the mIoU scores on C, B, M and AR are higher for $|\mathcal{R}|=20$ compared to $|\mathcal{R}|=1$ and almost equal for the other datasets.

The low sensitivity of the performance to the size of $\mathcal{R}$ could be explained by two factors. First, mining even from a single prompt results in different style variations as the optimization starts from different anchor points in the latent space, as argued in [11]. Second, mixing style between the source and the mined proxy domains is the crucial factor making the network explore intermediate domains during training. This does not contradict the effect of our prompt construction which leads to the best results (Tab. 7).

Local vs. global style mining. To highlight the effect of our class-wise local style mining, we perform an ablation replacing it with global style mining. Specifically, the same set of $<$ random style prompt $>$ is used, but concatenated with $<$ driving $>$ as a global description instead of the local class name. Intuitively, local style mining and mixing induces richer style variations and more contrast among patches. The results in Tab. 8 show the effectiveness of our local style mining and mixing strategy, which brings an improvement of about $3$ mIoU on G $\rightarrow$ C.

Style mining		C	B	M	S	AN	AS	AR	AF	Mean
global w/	“street view”	45.51	45.12	50.40	33.65	14.59	36.92	37.38	40.53	38.01
	“urban scene”	46.59	45.38	51.33	33.67	14.42	35.96	37.30	40.52	38.15
	“roadscape”	45.49	45.55	50.63	33.66	14.77	36.75	37.07	40.33	38.03
	“commute snapshot”	45.39	45.08	50.50	33.68	13.65	36.63	37.93	40.92	37.97
	“driving”	45.06	44.98	50.67	33.36	14.84	35.11	36.21	39.52	37.47
local		48.15	45.61	52.11	34.23	14.96	37.09	38.66	40.25	38.88

Table 8: Ablation on style mining. Global style mining consists of mining one style per feature map, using $<$ random style prompt $>$ $<$ global class $>$ as prompt.

What to mix? Let $\mathcal{S}=\bigcup_{k=1}^{K}\mathcal{S}^{(k)}$ and $\mathcal{T}=\bigcup_{k=1}^{K}\mathcal{T}^{(k)}$ be the sets of class-wise source and augmented features, respectively. In FAMix training, for an arbitrary patch ${\mathbf{f}}_{\text{s}}^{(ij)}$, style mixing is performed between the original source statistics and statistics sampled from the augmented set (i.e., $(\boldsymbol{\mu}^{(ij)},\boldsymbol{\sigma}^{(ij)})\in\mathcal{T}^{(c_{p}^{(ij)})}$, see (3) and (4)). In class-wise vanilla MixStyle, $(\boldsymbol{\mu}^{(ij)},\boldsymbol{\sigma}^{(ij)})\in\mathcal{S}^{(c_{p}^{(ij)})}$. In Tab. 9, we show that sampling $(\boldsymbol{\mu}^{(ij)},\boldsymbol{\sigma}^{(ij)})$ from $\mathcal{S}^{(c_{p}^{(ij)})}\cup\mathcal{T}^{(c_{p}^{(ij)})}$ does not lead to better generalization, despite sampling from a set with twice the cardinality. This supports our mixing strategy visualized in Fig. 1. Intuitively, sampling from $\mathcal{S}\cup\mathcal{T}$ could be viewed as applying either MixStyle or our mixing, each with probability $p=0.5$.
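The three sampling options compared in Tab. 9 can be sketched as follows. This is only an illustration with hypothetical names (`sample_style`, the toy banks); styles are replaced by tagged placeholders so that their origin can be counted. It also checks the intuition above empirically: when the source and augmented banks of a class have the same size, drawing uniformly from their union picks an augmented style about half of the time.

```python
import random


def sample_style(source_bank, augmented_bank, cls, mode, rng):
    """Draw one style for class `cls`.

    mode: "S" (class-wise MixStyle), "T" (augmented set only) or
    "S+T" (union of both sets).
    """
    if mode == "S":
        pool = source_bank[cls]
    elif mode == "T":
        pool = augmented_bank[cls]
    elif mode == "S+T":
        pool = source_bank[cls] + augmented_bank[cls]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return rng.choice(pool)


# Toy banks of equal size; each entry stands in for a (mu, sigma) pair
# and is tagged with its origin.
source_bank = {"road": [("S", i) for i in range(50)]}
augmented_bank = {"road": [("T", i) for i in range(50)]}

rng = random.Random(0)
draws = [
    sample_style(source_bank, augmented_bank, "road", "S+T", rng)
    for _ in range(10_000)
]
frac_augmented = sum(origin == "T" for origin, _ in draws) / len(draws)
print(f"fraction of augmented styles drawn from the union: {frac_augmented:.3f}")
```

If the two banks had different sizes, the split would follow their relative cardinalities instead of 0.5.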

Style mining	C	B	M	S	AN	AS	AR	AF	Mean
$\mathcal{S}$	43.43	43.79	48.19	33.70	11.32	35.55	36.15	38.19	36.29
$\mathcal{S}\cup\mathcal{T}$	44.76	45.59	50.78	34.05	13.67	36.92	37.18	38.13	37.64
$\mathcal{T}$ (ours)	48.15	45.61	52.11	34.23	14.96	37.09	38.66	40.25	38.88

Table 9: Ablation on the sets used for mixing. The styles $(\boldsymbol{\mu},\boldsymbol{\sigma})$ used in (3) and (4) are sampled either from $\mathcal{S}$ or $\mathcal{S}\cup\mathcal{T}$ or $\mathcal{T}$.

Minimal fine-tuning. We argue for minimal fine-tuning as a compromise between pretrained feature preservation and adaptation. Fig. 4(b) shows an increasing OOD generalization trend with more freezing. Interestingly, only fine-tuning the last layers of the last convolutional block (where the dilation is applied) achieves the best results. When training on Cityscapes (Tab. 3), we observed that freezing all the layers except Layer4 achieves the best results.

4.5 Does FAMix require language?

Inspired by the observation that target statistics deviate around the source ones in real cases [12], we conduct an experiment where we replace language-driven style mining by noise perturbation. The same procedure of FAMix is kept: (i) Features are divided into patches, perturbed with noise and then saved into a style bank based on the dominant class; (ii) During training, patch-wise style mixing of original and perturbed styles is performed.

Different from Fan et al. [12], who perturb the feature statistics using a normal distribution with pre-defined parameters, we experiment with perturbations of different noise magnitudes controlled by the signal-to-noise ratio (SNR). Considering the mean of a patch $\mu\in\mathbb{R}^{c}$ as a signal, the goal is to perturb it with some noise $n_{\mu}\in\mathbb{R}^{c}$. The $\texttt{SNR}_{\texttt{dB}}$ between $\|\mu\|$ and $\|n_{\mu}\|$ is defined as $\texttt{SNR}_{\texttt{dB}}=20\log_{10}\left(\|\mu\|/\|n_{\mu}\|\right)$. Given $\mu$, $\texttt{SNR}_{\texttt{dB}}$, and $n\sim\mathcal{N}(0,I)$, where $I\in\mathbb{R}^{c\times c}$ is the identity matrix, the noise is computed as $n_{\mu}=10^{-\texttt{SNR}_{\texttt{dB}}/20}\,\frac{\|\mu\|}{\|n\|}\,n$. We add $\mu+n_{\mu}$ to the style bank corresponding to the dominant class in the patch. The same applies to $\sigma\in\mathbb{R}^{c}$. The results of training for different noise levels are in Tab. 10. Using language as the source of randomization outperforms every noise level in mean mIoU. The baseline corresponds to the case where neither augmentation nor mixing is performed (see Tab. 6, Freeze ✓, Augment ✗, Mix ✗). SNR $=\infty$ could be seen as a variant of MixStyle, applied class-wise to patches (see Tab. 6, Freeze ✓, Augment ✗, Mix ✓). Vanilla MixStyle gets inferior results.
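The noise construction above can be checked numerically. The following is a minimal pure-Python sketch, with hypothetical helper names (`l2_norm`, `snr_noise`, `perturb`) and a toy vector in place of a real patch mean. It scales a standard normal draw so that the resulting noise has exactly the requested SNR in dB with respect to the statistic.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def snr_noise(stat, snr_db, rng):
    """Noise vector whose norm is set by the target SNR (dB) w.r.t. `stat`."""
    n = [rng.gauss(0.0, 1.0) for _ in stat]  # n ~ N(0, I)
    scale = 10.0 ** (-snr_db / 20.0) * l2_norm(stat) / l2_norm(n)
    return [scale * x for x in n]


def perturb(stat, snr_db, rng):
    """Perturbed statistic (e.g. mu + n_mu) to be stored in the style bank."""
    noise = snr_noise(stat, snr_db, rng)
    return [s + e for s, e in zip(stat, noise)]


rng = random.Random(0)
mu = [rng.gauss(0.0, 1.0) for _ in range(256)]  # toy patch mean, c = 256
n_mu = snr_noise(mu, snr_db=15.0, rng=rng)

# Plugging n_mu back into the definition recovers the requested SNR.
measured_snr_db = 20.0 * math.log10(l2_norm(mu) / l2_norm(n_mu))
print(f"measured SNR: {measured_snr_db:.6f} dB")
```

A larger SNR value yields a smaller perturbation, and the noise vanishes as the SNR tends to infinity, which is why that row reduces to mixing source statistics only. The sketch applies the same routine to any statistic; how perturbed standard deviations that become negative would be handled is not specified here and is left out.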

Besides lower OOD performance, one more disadvantage of noise augmentation compared to our language-driven augmentation is the need to select a value for the SNR, whose optimal value might vary depending on the target domain encountered at test time.

SNR	C	B	M	S	AN	AS	AR	AF	Mean
Baseline	37.83	38.88	44.24	31.93	12.41	29.59	31.56	33.05	32.44
5	28.78	29.24	30.32	21.67	12.60	24.00	25.95	25.87	24.80
10	40.09	39.50	43.45	29.09	13.36	33.47	33.11	36.17	33.53
15	45.02	44.16	48.63	32.96	14.55	36.09	35.99	40.96	37.30
20	45.52	44.29	49.26	33.45	12.40	35.96	36.52	38.60	37.00
25	44.82	44.26	48.54	33.30	11.38	34.51	35.46	37.61	36.24
30	43.07	43.80	48.31	33.47	12.33	35.05	35.58	38.10	36.21
$\infty$	43.43	43.79	48.19	33.70	11.32	35.55	36.15	38.19	36.29
MixStyle [67]	40.97	42.04	48.36	33.15	13.14	31.26	34.94	38.12	35.25
Prompts	48.15	45.61	52.11	34.23	14.96	37.09	38.66	40.25	38.88

Table 10: Noise vs prompt-driven augmentation. The prompt-driven augmentation in FAMix is replaced by random noise with different levels defined by SNR. We also include vanilla MixStyle. The prompt-driven strategy is superior.

5 Conclusion

We presented FAMix, a simple recipe for domain generalized semantic segmentation with CLIP pretraining. We proposed to locally mix the styles of source features with their augmented counterparts obtained using language prompts. Combined with minimal fine-tuning, FAMix significantly outperforms the state-of-the-art approaches. Extensive experiments showcase the effectiveness of our framework. We hope that FAMix will serve as a strong baseline in future works, exploring the potential of leveraging large-scale vision-language models for perception tasks.

Acknowledgment. This work was partially funded by French project SIGHT (ANR-20-CE23-0016) and was supported by ELSA - European Lighthouse on Secure and Safe AI funded by the European Union under grant agreement No. 101070617. It was performed using HPC resources from GENCI–IDRIS (Grant AD011014477).

Appendix

We provide details about Tab. 1 experiments in Appendix A. Further, we report class-wise performance for FAMix in Appendix B, and detail the prompts used in our experiments in Appendix C. Finally, we discuss the limitations and perspectives in Appendix D.

We refer to the supplementary video for further demonstration of FAMix qualitative performance: https://youtu.be/vyjtvx2El9Q.

Appendix A CLIP vs. ImageNet initialization

In Tab. 1 of the main paper, we introduce a comparison of ImageNet and CLIP pretraining for out-of-distribution semantic segmentation. We clarify here its implementation.

To produce Tab. 1, we employ the public code (https://github.com/VainF/DeepLabV3Plus-Pytorch) and fine-tune with SGD, using a learning rate of $10^{-1}$ for the segmenter and $10^{-2}$ for the backbone. We note that we freeze the stem layers and Layer1 for both backbones, i.e., ImageNet and CLIP initialized ResNet-50, after observing that full fine-tuning leads to subpar in-domain results for CLIP. Crucially, in this setting, both ImageNet and CLIP initialized networks converge and achieve the same in-domain performance (on the GTA5 validation set). Hence, we argue that the poor OOD performance of CLIP initialization in Tab. 1 may originate from the distortion of the robust CLIP representation towards the source domain distribution, as advocated in [28]. To alleviate such distortion, we freeze most of the backbone layers in FAMix.

However, we highlight that different hyper-parameter choices could boost the performance to some extent. For example, Rao et al. [44] observed that fine-tuning CLIP for semantic segmentation with the default configuration in MMSegmentation (https://github.com/open-mmlab/mmsegmentation) leads to performance 15.6% mIoU lower than that of its ImageNet pre-trained counterpart on ADE20K [63]. Consequently, they propose using AdamW [38] for optimization.

In Tab. 11, we reproduce the experiment of Tab. 1 but using AdamW as the optimizer. $O_{1}$ refers to the optimization configuration adopted in our paper, i.e., SGD optimizer with a learning rate of $10^{-1}$ for the segmenter and $10^{-2}$ for the backbone. $O_{2}$ and $O_{3}$ both refer to the use of AdamW with a learning rate of $10^{-4}$ for the segmenter and $10^{-5}$ for the backbone. In $O_{2}$ the whole backbone is fine-tuned, while in $O_{3}$ the stem layers and Layer1 are frozen, as in $O_{1}$. Results show that using AdamW improves the mean performance across out-of-distribution (OOD) domains for both CLIP and ImageNet initialized networks, but still largely lags behind FAMix (mean mIoU=38.88%) and even our variant (Freeze ✓, Augment ✗, Mix ✗) (mean mIoU=32.44%) in Tab. 6.
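For reference, the three optimization configurations can be summarized as plain data. This is only a sketch: the `OptimConfig` class, the `is_trainable` helper and the block prefixes `stem` and `layer1` are hypothetical names chosen for illustration, while the optimizers, learning rates and frozen blocks are the ones stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimConfig:
    optimizer: str
    lr_segmenter: float
    lr_backbone: float
    frozen_blocks: tuple  # backbone blocks kept frozen during fine-tuning


CONFIGS = {
    "O1": OptimConfig("SGD", 1e-1, 1e-2, ("stem", "layer1")),
    "O2": OptimConfig("AdamW", 1e-4, 1e-5, ()),
    "O3": OptimConfig("AdamW", 1e-4, 1e-5, ("stem", "layer1")),
}


def is_trainable(param_name, cfg):
    """True if a backbone parameter is updated under `cfg`.

    Assumes parameter names are prefixed by their block name,
    e.g. "layer1.0.conv1.weight" (a hypothetical naming scheme).
    """
    return not any(param_name.startswith(block + ".") for block in cfg.frozen_blocks)


for name, cfg in CONFIGS.items():
    frozen = ", ".join(cfg.frozen_blocks) or "none"
    print(f"{name}: {cfg.optimizer}, lr segmenter={cfg.lr_segmenter}, "
          f"lr backbone={cfg.lr_backbone}, frozen={frozen}")
```

Keeping the frozen blocks in the configuration makes it explicit that $O_{2}$ and $O_{3}$ differ only in which backbone blocks are updated.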

These results hint that using AdamW with relatively low learning rates might reduce the feature distortion of CLIP. Motivated by this observation, one could question the necessity of the minimal fine-tuning part of FAMix, and whether similar results could be achieved only by augmenting, mixing, and fine-tuning with AdamW and low learning rate. We call this variant AMix (Augment and Mix) and show the results in Tab. 12, which support the necessity of our full recipe.

Optim.	Pretraining	C	B	M	S	AN	AS	AR	AF	Mean
$O_{1}$	ImageNet	29.04	32.17	34.26	29.87	4.36	22.38	28.34	26.76	25.90
$O_{1}$	CLIP	16.81	16.31	17.80	27.10	2.95	8.58	14.35	13.61	14.69
$O_{2}$	ImageNet	28.00	36.82	37.00	30.60	3.56	24.14	29.51	26.23	26.98
$O_{2}$	CLIP	31.73	25.89	30.68	33.32	2.56	19.17	21.42	17.58	22.79
$O_{3}$	ImageNet	28.74	36.91	37.86	30.32	4.46	22.48	28.49	25.38	26.83
$O_{3}$	CLIP	26.81	23.11	29.82	32.38	4.20	18.50	22.59	20.31	22.22

Table 11: Effect of optimization configurations on OOD performance. Performance (mIoU %) of CLIP vs. ImageNet initialized networks for different optimization configurations.

Method	C	B	M	S	AN	AS	AR	AF	Mean
AMix	40.50	38.69	36.05	33.61	4.03	23.03	30.01	26.89	29.10
FAMix	48.15	45.61	52.11	34.23	14.96	37.09	38.66	40.25	38.88

Table 12: AMix with AdamW optimizer vs. FAMix. Performance (mIoU %) of FAMix in our default configuration compared to a variant with no minimal fine-tuning, replacing SGD with AdamW optimizer.

Appendix B Class-wise performance

We report class-wise IoUs in Tab. 13 and Tab. 14. The standard deviations of the mIoU (%) over three runs are also reported.

Target eval.	road	sidewalk	building	wall	fence	pole	traffic light	traffic sign	vegetation	terrain	sky	person	rider	car	truck	bus	train	motorcycle	bicycle	mIoU%
C	88.97	39.48	83.12	28.52	29.06	38.64	42.67	36.26	86.48	24.70	78.26	69.08	23.88	85.35	29.63	38.82	9.28	38.02	44.68	48.15 $\pm$ 0.38
B	87.83	40.33	78.44	15.88	35.20	38.13	40.63	29.49	77.52	31.19	90.80	60.28	23.23	82.99	26.73	34.53	0.00	43.55	29.92	45.61 $\pm$ 0.84
M	86.65	41.65	78.67	26.91	30.88	45.91	46.50	61.48	81.84	38.79	94.09	68.65	33.59	84.52	40.90	42.40	10.15	41.20	35.24	52.11 $\pm$ 0.17
S	60.55	49.61	82.63	7.80	5.42	29.23	15.71	15.26	68.18	0.00	90.83	61.59	12.09	61.29	0.00	35.23	0.00	32.75	22.19	34.23 $\pm$ 0.53
AN	47.44	7.01	38.73	8.42	3.59	23.04	18.42	5.75	19.33	5.82	5.65	26.61	10.39	50.46	4.10	0.00	0.79	8.42	0.28	14.96 $\pm$ 0.09
AS	66.93	10.15	62.17	33.95	22.10	35.26	51.20	35.57	73.12	20.72	77.55	52.02	0.63	71.62	21.20	1.14	12.36	47.32	9.71	37.09 $\pm$ 0.83
AR	73.41	21.42	77.58	19.41	16.96	33.63	44.90	38.53	80.96	29.31	94.98	56.89	17.04	76.15	16.15	7.07	5.11	23.74	1.24	38.66 $\pm$ 1.12
AF	77.61	31.99	76.42	28.84	10.30	31.42	52.92	29.99	68.09	24.55	92.03	54.52	34.91	68.21	26.07	11.02	1.44	7.71	36.78	40.25 $\pm$ 0.71

Table 13: ResNet-50 class-wise performance. We report the performance of FAMix (IoU %) trained on GTAV with ResNet-50 as backbone.

Target eval.	road	sidewalk	building	wall	fence	pole	traffic light	traffic sign	vegetation	terrain	sky	person	rider	car	truck	bus	train	motorcycle	bicycle	mIoU%
C	90.25	44.78	84.54	31.71	31.48	44.17	45.45	35.78	87.17	35.58	84.30	69.62	20.48	86.87	31.11	44.38	7.73	34.06	30.46	49.47 $\pm$ 0.36
B	86.79	41.47	79.53	16.67	41.27	39.41	42.31	33.35	78.64	36.86	91.03	60.32	23.73	81.51	31.13	25.99	0.00	45.78	25.73	46.40 $\pm$ 0.50
M	78.71	38.89	81.85	26.56	40.22	47.32	49.27	62.19	82.68	41.54	95.63	67.60	25.87	85.50	41.62	35.87	12.55	43.69	29.92	51.97 $\pm$ 1.30
S	63.72	56.80	85.05	9.30	21.62	33.26	16.44	18.96	69.42	0.00	92.10	63.52	10.95	64.86	0.00	29.84	0.00	39.67	22.22	36.72 $\pm$ 0.71
AN	65.87	23.23	37.83	13.72	4.60	30.34	16.49	7.48	27.37	7.83	17.61	35.16	18.48	53.71	5.67	0.00	0.84	10.25	1.44	19.89 $\pm$ 1.22
AS	75.42	31.90	72.15	36.75	27.85	38.89	49.63	33.09	72.45	22.98	84.21	56.75	1.91	75.84	34.61	4.44	4.17	48.91	14.26	41.38 $\pm$ 0.34
AR	57.58	26.76	79.76	19.79	21.06	37.70	46.34	37.66	83.85	36.96	94.80	55.40	31.61	79.53	14.84	14.01	6.97	29.36	3.34	40.91 $\pm$ 1.28
AF	77.96	41.73	77.99	34.73	6.85	36.80	49.49	34.51	72.00	32.60	91.52	46.28	27.28	70.77	31.17	19.08	4.87	10.75	34.55	42.15 $\pm$ 1.87

Table 14: ResNet-101 class-wise performance. We report the performance of FAMix (IoU %) trained on GTAV with ResNet-101 as backbone.

Appendix C Prompts used for style mining

The $<$ random style prompt $>$ used for training FAMix:
$\mathcal{R}_{1}$ = $<$ random style prompt $>$ = { Ethereal Mist, Cyberpunk Cityscape, Rustic Charm, Galactic Fantasy, Pastel Dreams, Dystopian Noir, Whimsical Wonderland, Urban Grit, Enchanted Forest, Retro Futurism, Monochrome Elegance, Vibrant Graffiti, Haunting Shadows, Steampunk Adventures, Watercolor Serenity, Industrial Chic, Cosmic Voyage, Pop Art Popularity, Abstract Symphony, Magical Realism, Abstract Geometric Patterns, Vintage Film Grain, Neon Cityscape Vibes, Surreal Watercolor Dreams, Minimalist Nature Scenes, Cyberpunk Urban Chaos, Impressionist Sunset Hues, Pop Art Explosion, Fantasy Forest Adventures, Pixelated Digital Chaos, Monochromatic Street Photography, Vibrant Graffiti Expressions, Steampunk Industrial Charm, Ethereal Cloudscapes, Retro Futurism Flare, Dark and Moody Landscapes, Pastel Dreamworlds, Galactic Space Odyssey, Abstract Brush Strokes, Noir Cinematic Moments, Whimsical Fairy Tale Realms, Modernist Architectural Wonders, Macro Botanical Elegance, Dystopian Sci-Fi Realities, High Contrast Street Art, Impressionist City Reflections, Pixel Art Nostalgia, Dynamic Action Sequences, Soft Focus Pastels, Abstract 3D Renderings, Mystical Moonlit Landscapes, Urban Decay Aesthetics, Holographic Futuristic Visions, Vintage Polaroid Snapshots, Digital Glitch Anomalies, Japanese Zen Gardens, Psychedelic Kaleidoscopes, Cosmic Abstract Portraits, Subtle Earthy Textures, Hyperrealistic Wildlife Portraits, Cybernetic Neon Lights, Warped Reality Illusions, Whimsical Watercolor Animals, Industrial Grunge Textures, Tropical Paradise Escapes, Dynamic Street Performances, Abstract Architectural Wonders, Comic Book Panel Vibes, Soft Glow Sunsets, 8-Bit Pixel Adventures, Galactic Nebula Explosions, Doodle Sketchbook Pages, High-Tech Futuristic Landscapes, Cinematic Noir Shadows, Vibrant Desert Landscapes, Abstract Collage Chaos, Nature in Infrared, Surreal Dream Sequences, Abstract Light Painting, Whimsical Fantasy Creatures, Cybernetic Augmented Reality, 
Impressionist Rainy Days, Vintage Aged Photographs, Neon Anime Cityscapes, Pastel Sunset Palette, Surreal Floating Islands, Abstract Mosaic Patterns, Retro Sci-Fi Spaceships, Futuristic Cyber Landscapes, Steampunk Clockwork Contraptions, Monochromatic Urban Decay, Glitch Art Distortions, Magical Forest Enchantments, Digital Oil Painting, Pop Surrealist Dreams, Dynamic Graffiti Murals, Vintage Pin-up Glamour, Abstract Kinetic Sculptures, Neon Jungle Adventures, Minimalist Futuristic Interfaces }

The $<$ random character prompt $>$ used in Tab. 7 experiments (i.e., ‘RCP’) are:
$\mathcal{R}_{2}$ = $<$ random character prompt $>$ = { ioscjspa, cjosae, wqvsecpas, csavwggw, csanoiaj, zfaspf, atpwqkmfc, mdmfejh, casjicjai, cnoacpoaj, noiasvnai, kcsakofnaoi, cjncioasn, wkqgmdc, jqblhyu, pqwfkgr, mzxanqnw, wnzsalml, sdqlhkjr, odfeqfit }

Both $\mathcal{R}_{1}$ and $\mathcal{R}_{2}$ are concatenated with the word style.
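As an illustration, the prompt variants ablated in Tab. 7 could be assembled as in the sketch below. The exact template is an assumption: here the style (or random-character) string is followed by the word "style" and then by the class name. The function `build_prompt` and the three-element subsets are hypothetical and only serve the example.

```python
import random


def build_prompt(class_name=None, style=None):
    """Assemble a mining prompt from an optional style and an optional class name."""
    parts = []
    if style is not None:
        parts.append(f"{style} style")
    if class_name is not None:
        parts.append(class_name)
    return " ".join(parts)


STYLES = ["Ethereal Mist", "Retro Futurism", "Urban Grit"]  # subset of R_1
RANDOM_CHARS = ["ioscjspa", "cjosae", "wqvsecpas"]          # subset of R_2

rng = random.Random(0)
rsp_cn = build_prompt("road", rng.choice(STYLES))        # RSP + CN
rcp_cn = build_prompt("road", rng.choice(RANDOM_CHARS))  # RCP + CN
cn_only = build_prompt("road")                           # CN only
rsp_only = build_prompt(style="Urban Grit")              # RSP only

for prompt in (rsp_cn, rcp_cn, cn_only, rsp_only):
    print(prompt)
```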

Appendix D Limitations and perspectives

D.1 Limitations

Failure conditions.

We show in Fig. 5 failure cases in rare conditions, which include extreme illumination or darkness, low visibility due to rain drops on the windshield, and other adverse conditions (e.g., snowy road). While FAMix improves over the baseline, the results remain unsatisfactory for safety-critical applications, as it fails to segment critical objects in the scenes (e.g., car, road, sidewalk, person, etc.). We leave generalization to the above-mentioned conditions for future research. One possible direction could be to design specific methods for specific corner conditions (e.g., [18, 32]), although we highlight that this is orthogonal to generalization.

Stylization.

At the heart of FAMix lies the assumption that unseen target distributions could be covered by augmenting the mean and standard deviation of the low-level features. While the correlation between “style” and these parameters has been shown in previous research [40, 67], we believe that the hypothesis stating that the domain shift could be described only by these parameters over-simplifies generalization. Moreover, FAMix does not handle or provide an estimation of uncertainty, which is crucial for both classes in and outside the label set for the application at hand.

D.2 Perspectives

Vision transformers (ViTs) [10] have recently emerged as an alternative to CNNs. We leave a ViT implementation of FAMix for future work. Applying prompt-driven instance normalization (PIN) [11] to ViTs appears non-trivial as the relation between statistics of low-level feature maps and style is established only for CNNs so far, and could raise some technical challenges. Exploring this direction might first involve a study of the correlation between style and statistics of patches. If such correlation is demonstrated, a naive way to apply FAMix could be by applying PIN with tied parameters across the patches.

While some modern architectures are inherently more robust than older ones, the problem of DGSS with ResNets (e.g. ResNet-50 and ResNet-101) is still not solved. As long as the gap exists between in-domain and out-of-distribution performances, we believe that this setting remains interesting, and that a general understanding of domain generalization could emerge from the algorithms proposed to address it.

References

Ahuja et al. [2021] Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization. In NeurIPS, 2021.
Allingham et al. [2023] James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, and Balaji Lakshminarayanan. A simple zero-shot prompt weighting technique to improve prompt ensembling in text-image models. In ICML, 2023.
Arjovsky et al. [2019] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
Balaji et al. [2018] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. Metareg: Towards domain generalization using meta-regularization. In NeurIPS, 2018.
Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
Chen and He [2021] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2021.
Choi et al. [2021] Sungha Choi, Sanghun Jung, Huiwon Yun, Joanne T Kim, Seungryong Kim, and Jaegul Choo. Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In CVPR, 2021.
Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
Fahes et al. [2023] Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, and Raoul de Charette. Poda: Prompt-driven zero-shot domain adaptation. In ICCV, 2023.
Fan et al. [2023] Qi Fan, Mattia Segu, Yu-Wing Tai, Fisher Yu, Chi-Keung Tang, Bernt Schiele, and Dengxin Dai. Towards robust object detection invariant to real-world domain shifts. In ICLR, 2023.
Fang et al. [2022] Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, and Ludwig Schmidt. Data determines distributional robustness in contrastive language image pre-training (clip). In ICML, 2022.
Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
Ge et al. [2023] Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, and Jiaping Zhao. Improving zero-shot generalization and robustness of multi-modal models. In CVPR, 2023.
Goyal et al. [2023] Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. In CVPR, 2023.
Gu et al. [2022] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
Halder et al. [2019] Shirsendu Sukanta Halder, Jean-François Lalonde, and Raoul de Charette. Physics-based rendering for improving robustness to rain. In ICCV, 2019.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Hoffman et al. [2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
Huang et al. [2023] Wei Huang, Chang Chen, Yong Li, Jiacheng Li, Cheng Li, Fenglong Song, Youliang Yan, and Zhiwei Xiong. Style projected clustering for domain generalized semantic segmentation. In CVPR, 2023.
Huang and Belongie [2017] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
Jain et al. [2023] Nishant Jain, Harkirat Behl, Yogesh Singh Rawat, and Vibhav Vineet. Efficiently robustify pre-trained models. In ICCV, 2023.
Jia et al. [2021] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
Kim et al. [2022] Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, and Kwanghoon Sohn. Pin the memory: Learning to generalize semantic segmentation. In CVPR, 2022.
Kim et al. [2023] Sunghwan Kim, Dae-hwan Kim, and Hoseong Kim. Texture learning domain randomization for domain generalized segmentation. In ICCV, 2023.
Krueger et al. [2021] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (rex). In ICML, 2021.
Kumar et al. [2022] Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In ICLR, 2022.
Kwon and Ye [2022] Gihyun Kwon and Jong Chul Ye. Clipstyler: Image style transfer with a single text condition. In CVPR, 2022.
Laroudie et al. [2023] Clement Laroudie, Andrei Bursuc, Mai Lan Ha, and Gianni Franchi. Improving clip robustness with knowledge distillation and self-training. arXiv preprint arXiv:2309.10361, 2023.
Lee et al. [2022] Suhyeon Lee, Hongje Seong, Seongwon Lee, and Euntai Kim. Wildnet: Learning domain generalized semantic segmentation from the wild. In CVPR, 2022.
Lengyel et al. [2021] Attila Lengyel, Sourav Garg, Michael Milford, and Jan C van Gemert. Zero-shot day-night domain adaptation with a physics prior. In ICCV, 2021.
Li et al. [2022] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In ICLR, 2022.
Li et al. [2018a] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In CVPR, 2018a.
Li et al. [2018b] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In ECCV, 2018b.
Li et al. [2019] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, 2019.
Long et al. [2018] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In NeurIPS, 2018.
Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Neuhold et al. [2017] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
Pan et al. [2018] Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In ECCV, 2018.
Peng et al. [2022] Duo Peng, Yinjie Lei, Munawar Hayat, Yulan Guo, and Wen Li. Semantic-aware domain generalized segmentation. In CVPR, 2022.
Qiao et al. [2020] Fengchun Qiao, Long Zhao, and Xi Peng. Learning to learn single domain generalization. In CVPR, 2020.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
Rao et al. [2022] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In CVPR, 2022.
Richter et al. [2016] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
Ros et al. [2016] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
Saito et al. [2023] Kuniaki Saito, Donghyun Kim, Piotr Teterwak, Rogerio Feris, and Kate Saenko. Mind the backbone: Minimizing backbone distortion for robust object detection. arXiv preprint arXiv:2303.14744, 2023.
Sakaridis et al. [2021] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In ICCV, 2021.
Shu et al. [2023] Yang Shu, Xingzhuo Guo, Jialong Wu, Ximei Wang, Jianmin Wang, and Mingsheng Long. Clipood: Generalizing clip to out-of-distributions. In ICML, 2023.
Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
Tzeng et al. [2017] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
Vu et al. [2019] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, 2019.
Wang et al. [2022] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip Yu. Generalizing to unseen domains: A survey on domain generalization. T-KDE, 2022.
Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In CVPR, 2022.
Wu et al. [2022] Zhenyao Wu, Xinyi Wu, Xiaoping Zhang, Lili Ju, and Song Wang. Siamdoge: Domain generalizable semantic segmentation using siamese network. In ECCV, 2022.
Xu et al. [2021] Qinwei Xu, Ruipeng Zhang, Ya Zhang, Yanfeng Wang, and Qi Tian. A fourier-based framework for domain generalization. In CVPR, 2021.
Yang et al. [2023] Liwei Yang, Xiang Gu, and Jian Sun. Generalized semantic segmentation by self-supervised source domain projection and multi-level contrastive learning. In AAAI, 2023.
Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In CVPR, 2020.
Zhai et al. [2022] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In CVPR, 2022.
Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
Zhao et al. [2020] Shanshan Zhao, Mingming Gong, Tongliang Liu, Huan Fu, and Dacheng Tao. Domain generalization via entropy regularization. In NeurIPS, 2020.
Zhao et al. [2022] Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, and Gim Hee Lee. Style-hallucinated dual consistency learning for domain generalized semantic segmentation. In ECCV, 2022.
Zhou et al. [2019] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019.
Zhou et al. [2022a] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In ECCV, 2022a.
Zhou et al. [2020a] Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, and Tao Xiang. Deep domain-adversarial image generation for domain generalisation. In AAAI, 2020a.
Zhou et al. [2020b] Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, and Tao Xiang. Learning to generate novel domains for domain generalization. In ECCV, 2020b.
Zhou et al. [2021] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. In ICLR, 2021.
Zhou et al. [2022b] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. TPAMI, 2022b.
Zhou et al. [2022c] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, 2022c.
Zhou et al. [2022d] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 2022d.

[Figure legend, semantic classes: Road, Sidewalk, Building, Wall, Fence, Pole, Traffic light, Traffic sign, Vegetation, Terrain, Sky, Person, Rider, Car, Truck, Bus, Train, Motorbike, Bicycle, n/a]