A Simple Recipe for Language-guided Domain Generalized Segmentation

Mohammad Fahes1  Tuan-Hung Vu1,2  Andrei Bursuc1,2  Patrick Pérez3  Raoul de Charette1
1 Inria    2 Valeo.ai    3 Kyutai
Abstract

Generalization to new domains not seen during training is one of the long-standing challenges in deploying neural networks in real-world applications. Existing generalization techniques either necessitate external images for augmentation, and/or aim at learning invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities, along with the potential of binding different modalities. For instance, the advent of vision-language models like CLIP has opened the doorway for vision models to exploit the textual modality. In this paper, we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: (i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning, (ii) language-driven local style augmentation, and (iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks. Code is accessible at https://github.com/astra-vision/FAMix.

1 Introduction

A prominent challenge associated with deep neural networks is their constrained capacity to generalize when confronted with shifts in data distribution. This limitation is rooted in the assumption of data being independent and identically distributed, a presumption that frequently proves unrealistic in real-world scenarios. For instance, in safety-critical applications like autonomous driving, it is imperative for a segmentation model to exhibit resilient generalization capabilities when dealing with alterations in lighting, variations in weather conditions, and shifts in geographic location, among other considerations.

To address this challenge, domain adaptation (DA) [51, 37, 14, 52, 20, 36] has emerged; its core principle revolves around aligning the distributions of both the source and target domains. However, DA hinges on having access to target data, which may not always be available. Even when accessible, this data might not encompass the full spectrum of distributions encountered in diverse real-world scenarios. Domain generalization (DG) [68, 67, 53, 34, 56, 35] overcomes this limitation by enhancing the robustness of models to arbitrary and previously unseen domains.

The training of segmentation networks is often backed by large-scale pretraining as initialization for the feature representation. Until now, to the best of our knowledge, domain generalization for semantic segmentation (DGSS) networks [40, 7, 41, 62, 55, 25, 57, 21, 31, 26] are pretrained with ImageNet [9]. The underlying concept is to transfer the representations from the upstream task of classification to the downstream task of segmentation.

Lately, contrastive language image pretraining (CLIP) [43, 24, 59, 60] has demonstrated that transferable visual representations could be learned from the sole supervision of loose natural language descriptions at very large scale. Subsequently, a plethora of applications have been proposed using CLIP [43], including zero-shot semantic segmentation [33, 64], image editing [29], transfer learning [44, 11], open-vocabulary object detection [17], few-shot learning [70, 69] etc. A recent line of research proposes fine-tuning techniques to preserve the robustness of CLIP under distribution shift [54, 28, 16, 49], but they are limited to classification.

In this paper, we aim at answering the following question: How to leverage CLIP pretraining for enhanced domain generalization for semantic segmentation? The motivation for rethinking DGSS with CLIP is twofold. On one hand, distribution robustness is a notable characteristic of CLIP [13]. On the other hand, the language modality offers an extra source of information compared to unimodal pretrained models.

A direct comparison of two segmentation models trained under identical conditions but with different pretraining, i.e. ImageNet vs. CLIP, shows that CLIP pretraining does not yield promising results. Indeed, Tab. 1 shows that fine-tuning a CLIP-initialized network performs worse than its ImageNet counterpart on out-of-distribution (OOD) data. This raises doubts about the suitability of CLIP pretraining for DGSS and indicates that it is more prone to overfitting the source distribution at the expense of degrading its original distributional robustness properties. Note that both models converge and achieve similar results on in-domain data. More details are provided in Appendix A.

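| Pretraining | C | B | M | S | AN | AS | AR | AF | Mean |
|---|---|---|---|---|---|---|---|---|---|
| ImageNet | 29.04 | 32.17 | 34.26 | 29.87 | 4.36 | 22.38 | 28.34 | 26.76 | 25.90 |
| CLIP | 16.81 | 16.31 | 17.80 | 27.10 | 2.95 | 8.58 | 14.35 | 13.61 | 14.69 |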
Table 1: Comparison of ImageNet and CLIP pretraining for out-of-distribution semantic segmentation. The network is DeepLabv3+ with ResNet-50 as backbone. The models are trained on GTAV and the performance (mIoU %) is reported on Cityscapes (C), BDD-100K (B), Mapillary (M), SYNTHIA (S), and ACDC Night (AN), Snow (AS), Rain (AR) and Fog (AF).

This paper shows that we can prevent such behavior with a simple recipe involving minimal fine-tuning, language-driven style augmentation, and mixing. Our approach is coined FAMix, for Freeze, Augment and Mix.

It was recently argued that fine-tuning might distort the pretrained representations and negatively affect OOD generalization [28]. To maintain the integrity of the representation, one extreme approach is to entirely freeze the backbone. However, this can undermine representation adaptability and lead to subpar OOD generalization. As a middle-ground strategy balancing adaptation and feature preservation, we suggest minimal fine-tuning of the backbone, where a substantial portion remains frozen, and only the final layers undergo fine-tuning.

For generalization, we show that rethinking MixStyle [67] leads to significant performance gains. As illustrated in Fig. 1, we mix the statistics of the original source features with augmented statistics mined using language. This helps explore styles beyond the source distribution at training time without using any additional images.

We summarize our contributions as follows:

  • We propose a simple framework for DGSS based on minimal fine-tuning of the backbone and language-driven style augmentation. To the best of our knowledge, we are the first to study DGSS with CLIP pretraining.

  • We propose language-driven class-wise local style augmentation. We mine class-specific local statistics using prompts that express random styles and names of patch-wise dominant classes. During training, randomization is performed through patch-wise style mixing of the source and mined styles.

  • We conduct careful ablations to show the effectiveness of FAMix. Our framework outperforms state-of-the-art approaches in single and multi-source DGSS settings.

Figure 1: Mixing strategies. (Left) MixStyle [67] consists of a linear mixing between the feature statistics of the source domain(s) S samples. (Right) We apply an augmentation $\mathcal{A}(\cdot)$ on the source domain statistics, then perform linear mixing between original and augmented statistics. Intuitively, this enlarges the support of the training distribution by leveraging statistics beyond the source domain(s), as well as discovering intermediate domains. $\mathcal{A}(\cdot)$ could be a language-driven or Gaussian noise augmentation, and we show that the former leads to better generalization results.
Figure 2: Overall process of FAMix. FAMix consists of two steps. (Left) Local style mining consists of dividing the low-level feature activations into patches, which are used for style mining using Prompt-driven Instance Normalization (PIN) [11]. Specifically, for each patch, the dominant class is queried from the ground truth, and the mined style is added to the corresponding class-specific style bank. (Right) Training the segmentation network is performed with minimal fine-tuning of the backbone. At each iteration, the low-level feature activations are viewed as grids of patches. For each patch, the dominant class is queried using the ground truth, then a style is sampled from the corresponding style bank. Style randomization is performed by normalizing each patch in the grid by its own statistics, and transferring the new style, which is a mix of the original style and the sampled one. The network is trained using only a cross-entropy loss.

2 Related works

Domain generalization (DG). The goal of DG is to train, from a single or multiple source domains, models that perform well under arbitrary domain shifts. The DG literature spans a broad range of approaches, including adversarial learning [35, 61], meta-learning [4, 42], data augmentation [67, 65, 66] and domain-invariant representation learning [3, 1, 27, 7]. We refer the reader to [68, 53] for comprehensive surveys on DG.

Domain generalization with CLIP.  CLIP [43] exhibits remarkable distributional robustness [13]. Nevertheless, fine-tuning comes at the expense of sacrificing generalization. Kumar et al. [28] observe that full fine-tuning can distort the pretrained representation, and propose a two-stage strategy, consisting of training a linear probe with a frozen feature extractor, then fine-tuning both. Wortsman et al. [54] propose ensembling the weights of zero-shot and fine-tuned models. Goyal et al. [16] show that preserving the pretraining paradigm (i.e. contrastive learning) during the adaptation to the downstream task improves both in-domain (ID) and OOD performance without multi-step fine-tuning or weight ensembling. CLIPood [49] introduces a margin metric softmax training objective and Beta moving average for optimization to handle both open-class and open-domain at test time. On the other hand, distributional robustness could be improved by training a small number of parameters on top of a frozen CLIP backbone in a teacher-student manner [23, 30]. Other works show that specialized prompt ensembling and/or image ensembling strategies [2, 15] coupled with label augmentation using the WordNet hierarchy improve robustness in classification.

Domain Generalized Semantic Segmentation.  DGSS methods could be categorized into three main groups: normalization methods, domain randomization (DR) and invariant representation learning. Normalization methods aim at removing style contribution from the representation. For instance, IBN-Net [40] shows that Instance Normalization (IN) makes the representation invariant to variations in the scene appearance (e.g., change of colors, illumination, etc.), and that combining IN and batch normalization (BN) helps the synthetic-to-real generalization. SAN & SAW [41] proposes semantic-aware feature normalization and whitening, while RobustNet [7] proposes an instance selective whitening loss, where only feature covariances that are sensitive to photometric transformations are whitened. DR aims instead at diversifying the data during training. Some methods use additional data for DR. For example, WildNet [31] uses ImageNet [9] data for content and style extension learning, while TLDR [26] proposes learning texture from random style images. Other methods like SiamDoGe [55] perform DR solely by data augmentation, using a Siamese [6] structure. Finally in the invariant representation learning group, SPC-Net [21] builds a representation space based on style and semantic projection and clustering, and SHADE [62] regularizes the training with a style consistency loss and a retrospection consistency loss.

3 Method

FAMix proposes an effective recipe for DGSS through the blending of simple ingredients. It consists of two stages (see Fig. 2): (i) Local style mining from language (Sec. 3.2); (ii) Training of a segmentation network with minimal fine-tuning and local style mixing (Sec. 3.3). In Fig. 2 and in the following, CLIP-I1 denotes the stem layers and Layer1 of the CLIP image encoder, CLIP-I2 the remaining layers excluding the attention pooling, and CLIP-T the text encoder.

We start with some preliminary background knowledge, introducing AdaIN and PIN which are essential to our work.

3.1 Preliminaries

Adaptive Instance Normalization (AdaIN).  For a feature map $\mathbf{f} \in \mathbb{R}^{h \times w \times c}$, AdaIN [22] shows that the channel-wise mean $\boldsymbol{\mu} \in \mathbb{R}^{c}$ and standard deviation $\boldsymbol{\sigma} \in \mathbb{R}^{c}$ capture information about the style of the input image, allowing style transfer between images. Hence, stylizing a source feature $\mathbf{f}_{\text{s}}$ with an arbitrary target style $(\mu(\mathbf{f}_{\text{t}}), \sigma(\mathbf{f}_{\text{t}}))$ reads:

$$\texttt{AdaIN}(\mathbf{f}_{\text{s}}, \mathbf{f}_{\text{t}}) = \sigma(\mathbf{f}_{\text{t}}) \Big( \frac{\mathbf{f}_{\text{s}} - \mu(\mathbf{f}_{\text{s}})}{\sigma(\mathbf{f}_{\text{s}})} \Big) + \mu(\mathbf{f}_{\text{t}}), \tag{1}$$

with $\mu(\cdot)$ and $\sigma(\cdot)$ the mean and standard deviation of the input feature; multiplications and additions are element-wise.
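As an illustration of Eq. (1), the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a channel-first layout in which each channel is a flat list of its $h \times w$ activations, and it adds a small constant `EPS` to the variance for numerical stability; both choices are assumptions made for this example only.

```python
import math

EPS = 1e-5  # stability constant; an assumption of this sketch, not from the paper


def channel_stats(feat):
    """Return per-channel (means, standard deviations).

    `feat` is a list of channels; each channel is a flat list of activations.
    """
    mus, sigmas = [], []
    for channel in feat:
        mu = sum(channel) / len(channel)
        var = sum((v - mu) ** 2 for v in channel) / len(channel)
        mus.append(mu)
        sigmas.append(math.sqrt(var + EPS))
    return mus, sigmas


def adain(feat_s, mu_t, sigma_t):
    """Eq. (1): normalize the source feature, then apply the target statistics."""
    mu_s, sigma_s = channel_stats(feat_s)
    return [
        [sigma_t[c] * (v - mu_s[c]) / sigma_s[c] + mu_t[c] for v in channel]
        for c, channel in enumerate(feat_s)
    ]


# Toy example: 2 channels with 4 spatial positions each.
f_s = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
f_t = [[0.0, 0.0, 2.0, 2.0], [-1.0, 1.0, -1.0, 1.0]]
mu_t, sigma_t = channel_stats(f_t)
stylized = adain(f_s, mu_t, sigma_t)
mu_out, sigma_out = channel_stats(stylized)
```

After the transfer, the per-channel statistics of `stylized` match those of the target feature (up to the effect of `EPS`), while the spatial arrangement of the source activations is preserved.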

Prompt-driven Instance Normalization (PIN).  PIN was introduced for prompt-driven zero-shot domain adaptation in PØDA [11]. It replaces the target style $(\mu(\mathbf{f}_{\text{t}}), \sigma(\mathbf{f}_{\text{t}}))$ in AdaIN (1) with two optimizable variables $(\boldsymbol{\mu}, \boldsymbol{\sigma})$ guided by a single prompt in natural language. The rationale is to leverage a frozen CLIP [43] to mine visual styles from the prompt representation in the shared space. Given a prompt $P$ and a feature map $\mathbf{f}_{\text{s}}$, PIN reads as:

$$\texttt{PIN}_{(P)}(\mathbf{f}_{\text{s}}) = \boldsymbol{\sigma} \Big( \frac{\mathbf{f}_{\text{s}} - \mu(\mathbf{f}_{\text{s}})}{\sigma(\mathbf{f}_{\text{s}})} \Big) + \boldsymbol{\mu}, \tag{2}$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are optimized using gradient descent, such that the cosine distance between the visual feature representation and the prompt representation is minimized.
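The optimization described above can be written as an explicit objective. The notation below is introduced here for illustration and is an assumption: $E_{\text{img}}$ stands for the frozen CLIP image-encoder layers that map the stylized feature to the shared embedding space, $E_{\text{txt}}$ for the frozen CLIP text encoder, and the cosine distance is taken to be one minus the cosine similarity.

```latex
(\boldsymbol{\mu}^{*}, \boldsymbol{\sigma}^{*})
  = \arg\min_{\boldsymbol{\mu},\, \boldsymbol{\sigma}}
    \left( 1 -
      \frac{\big\langle E_{\text{img}}\big(\texttt{PIN}_{(P)}(\mathbf{f}_{\text{s}})\big),\; E_{\text{txt}}(P) \big\rangle}
           {\big\| E_{\text{img}}\big(\texttt{PIN}_{(P)}(\mathbf{f}_{\text{s}})\big) \big\|_{2} \;
            \big\| E_{\text{txt}}(P) \big\|_{2}}
    \right)
```

Only $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ receive gradients in this formulation; the encoders stay frozen, so the mined pair can be read as a style that moves the feature towards the prompt in the shared space.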

Different from PØDA which mines styles globally with a predetermined prompt describing the target domain, we make use of PIN to mine class-specific styles using local patches of the features, leveraging random style prompts. Further, we show the effectiveness of incorporating the class name in the prompt for better style mining.

3.2 Local Style Mining

Our approach is to leverage PIN to mine class-specific style banks that are used for feature augmentation when training FAMix. Given a set of cropped images $\mathcal{I}_{\text{s}}$, we encode them using CLIP-I1 to get a set of low-level features $\mathcal{F}_{\text{s}}$. Each batch of $b$ features $\mathbf{f}_{\text{s}} \in \mathcal{F}_{\text{s}}$ is cropped into $m$ patches per feature map, resulting in $b \times m$ patches $\mathbf{f}_{p}$, and associated ground-truth annotation $\mathbf{y}_{p}$, of size $h/\sqrt{m} \times w/\sqrt{m} \times c$.
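The patch cropping step can be sketched as follows. This is an illustrative helper, not the authors' code: the function name `crop_patches` is hypothetical, and the sketch assumes that $m$ is a perfect square and that both spatial sides are divisible by $\sqrt{m}$. It operates on a single 2D map (for example one label map, or one channel of a feature map).

```python
import math


def crop_patches(grid, m):
    """Split a 2D grid (list of rows) into m equal patches, in row-major order.

    Assumes m is a perfect square and both sides are divisible by sqrt(m).
    """
    side = math.isqrt(m)
    if side * side != m:
        raise ValueError("m must be a perfect square")
    h, w = len(grid), len(grid[0])
    if h % side or w % side:
        raise ValueError("grid sides must be divisible by sqrt(m)")
    ph, pw = h // side, w // side  # patch height and width
    patches = []
    for i in range(side):
        for j in range(side):
            rows = grid[i * ph:(i + 1) * ph]
            patches.append([row[j * pw:(j + 1) * pw] for row in rows])
    return patches


# Toy 4x4 label map split into m = 4 patches of size 2x2.
labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
patches = crop_patches(labels, m=4)
```

Applying the same split to features and labels keeps each feature patch $\mathbf{f}_{p}$ aligned with its annotation patch $\mathbf{y}_{p}$.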

We aim at populating $K$ style banks, $K$ being the total number of classes. For a feature patch $\mathbf{f}_{p}$, we compute the dominant class from the corresponding label patch $\mathbf{y}_{p}$, and get its name $t_{p}$ from the predefined classes in the training dataset. Given a set of prompts describing random styles $\mathcal{R}$, the target prompt $P_{p}$ is formed by concatenating a randomly sampled style prompt $r$ from $\mathcal{R}$ and $t_{p}$ (e.g., retro futurism style building). We show in the experiments (Sec. 4.4) that our method is not very sensitive to the prompt design, yet our prompt construction works best.
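A small sketch of this prompt construction is given below. It is only illustrative: the helper names are hypothetical, the class-name mapping is a toy subset, and the second style prompt ("watercolor style") is made up for the example; only "retro futurism style" comes from the text above. Ties between equally frequent classes are broken by first occurrence, which is an assumption.

```python
import random
from collections import Counter


def get_dominant_class(label_patch):
    """Most frequent class id in a 2D label patch (ties: first seen wins)."""
    counts = Counter(c for row in label_patch for c in row)
    return counts.most_common(1)[0][0]


def build_prompt(label_patch, class_names, style_prompts, rng):
    """Concatenate a randomly sampled style prompt with the dominant class name."""
    class_id = get_dominant_class(label_patch)
    style = rng.choice(style_prompts)
    return class_id, f"{style} {class_names[class_id]}"


class_names = {0: "road", 1: "building", 2: "car"}  # toy subset of classes
style_prompts = ["retro futurism style", "watercolor style"]  # second is hypothetical
label_patch = [
    [1, 1, 0],
    [1, 2, 1],
    [0, 1, 1],
]
class_id, prompt = build_prompt(label_patch, class_names, style_prompts, random.Random(0))
```

Here the dominant class is `building` (six of nine pixels), so the prompt is one of the style prompts followed by the word "building".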

The idea is to mine proxy domains and explore intermediate ones in a class-aware manner (as detailed in Sec. 3.3). This makes our work fundamentally different from [11], which steers features towards a particular target style and corresponding domain, and better suited to generalization.

To handle the class imbalance problem, we simply select one feature patch $\mathbf{f}_{p}$ per class among the total $b \times m$ patches, as shown in Fig. 2. We then apply PIN (2) to optimize the local styles to match the representations of their corresponding prompts, and use the mined styles to populate the corresponding style banks. The complete procedure is outlined in Algorithm 1.

The resulting style banks $\{\mathcal{T}^{(1)}, \cdots, \mathcal{T}^{(K)}\}$ are used for domain randomization during training.
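The per-batch selection of one patch per class and the filling of the style banks can be sketched as below. This is a simplified illustration under stated assumptions: each batch is given as a list of (feature patch, label patch) pairs, feature patches are flat lists of scalars, and `placeholder_mine_style` is a hypothetical stand-in for the PIN optimization that simply returns the patch mean and a constant. It does not query CLIP.

```python
from collections import Counter, defaultdict


def dominant_class(label_patch):
    """Most frequent class id in a 2D label patch."""
    counts = Counter(c for row in label_patch for c in row)
    return counts.most_common(1)[0][0]


def populate_style_banks(batches, mine_style):
    """Keep the first patch seen for each dominant class in a batch, mine one
    style for it, and append that style to the bank of the class."""
    banks = defaultdict(list)
    for batch in batches:
        selected = {}  # class id -> feature patch (at most one per class per batch)
        for feature_patch, label_patch in batch:
            class_id = dominant_class(label_patch)
            selected.setdefault(class_id, feature_patch)
        for class_id, feature_patch in selected.items():
            banks[class_id].append(mine_style(feature_patch, class_id))
    return banks


def placeholder_mine_style(feature_patch, class_id):
    """Hypothetical stand-in for PIN: returns (mean of the patch, 1.0)."""
    mu = sum(feature_patch) / len(feature_patch)
    return (mu, 1.0)


batch_1 = [
    ([1.0, 3.0], [[0, 0]]),  # class 0, selected
    ([5.0, 7.0], [[0, 0]]),  # class 0 again, skipped
    ([2.0, 2.0], [[1, 1]]),  # class 1, selected
]
batch_2 = [([4.0, 6.0], [[1, 1]])]
banks = populate_style_banks([batch_1, batch_2], placeholder_mine_style)
```

Because at most one style is mined per class and per batch, frequent classes such as road cannot dominate the banks simply by covering more patches.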

Algorithm 1: Local Style Mining.
Input: Set $\mathcal{F}_{\text{s}}$ of source feature batches.
       Label set $\mathcal{Y}_{\text{s}}$ in $\mathcal{D}_{\text{s}}$.
       Set of random prompts $\mathcal{R}$ and class names $\mathcal{C}$.
Param: Number of patches $m$.
       Number of classes $K$.
Output: $K$ sets $\{\mathcal{T}^{(1)}, \cdots, \mathcal{T}^{(K)}\}$ of class-wise augmented statistics.

$\{\mathcal{T}^{(1)}, \cdots, \mathcal{T}^{(K)}\} \leftarrow \emptyset$
foreach $(\mathbf{f}_{\text{s}} \in \mathcal{F}_{\text{s}},\ \mathbf{y}_{\text{s}} \in \mathcal{Y}_{\text{s}})$ do
    $\{\mathbf{y}_{p}\} \leftarrow \text{crop-patch}(\mathbf{y}_{\text{s}}, m)$
    $\{c_{p}\}, \{P_{p}\}, \{f_{p}\} \leftarrow \emptyset$
    foreach $\mathbf{y}_{p} \in \{\mathbf{y}_{p}\}$ do
        $c_{p} \leftarrow \text{get-dominant-class}(\mathbf{y}_{p})$
        if $c_{p}$ not in $\{c_{p}\}$ then
            $\{c_{p}\} \leftarrow c_{p}$
            $\{P_{p}\} \leftarrow \text{concat}(\text{sample}(\mathcal{R}), \text{get-name}(c_{p}))$
            $\{f_{p}\} \leftarrow f_{p}$
        end if
    end foreach
    $\boldsymbol{\mu}^{(c_{p})}, \boldsymbol{\sigma}^{(c_{p})}, \mathbf{f}_{p}^{\prime} \leftarrow \texttt{PIN}_{(P_{p})}(\mathbf{f}_{p})$
    $\mathcal{T}^{(c_{p})} \leftarrow \mathcal{T}^{(c_{p})} \cup \{(\boldsymbol{\mu}^{(c_{p})}, \boldsymbol{\sigma}^{(c_{p})})\}$
end foreach

Algorithm 2: Training FAMix.
Input: Set $\mathcal{F}_{\text{s}}$ of source feature batches.
       Label set $\mathcal{Y}_{\text{s}}$ in $\mathcal{D}_{\text{s}}$.
       $K$ sets $\{\mathcal{T}^{(1)}, \cdots, \mathcal{T}^{(K)}\}$ of class-wise augmented statistics.
Param: Number of patches $m$.

foreach $(\mathbf{f}_{\text{s}} \in \mathcal{F}_{\text{s}},\ \mathbf{y}_{\text{s}} \in \mathcal{Y}_{\text{s}})$ do
    $\alpha \sim \text{Beta}(0.1, 0.1)$
    for $(i, j) \in [1, \sqrt{m}] \times [1, \sqrt{m}]$ do
        $c_{p}^{(ij)} \leftarrow \text{get-dominant-class}(\mathbf{y}_{\text{s}}^{(ij)})$
        $\boldsymbol{\mu}^{(ij)}, \boldsymbol{\sigma}^{(ij)} \leftarrow \text{sample}(\mathcal{T}^{(c_{p}^{(ij)})})$
        $\mu_{\textit{mix}} \leftarrow (1 - \alpha) \cdot \mu(\mathbf{f}_{\text{s}}^{(ij)}) + \alpha \cdot \boldsymbol{\mu}^{(ij)}$
        $\sigma_{\textit{mix}} \leftarrow (1 - \alpha) \cdot \sigma(\mathbf{f}_{\text{s}}^{(ij)}) + \alpha \cdot \boldsymbol{\sigma}^{(ij)}$
        $\mathbf{f}_{\text{s}}^{(ij)} \leftarrow \texttt{AdaIN}(\mathbf{f}_{\text{s}}^{(ij)}, \mu_{\textit{mix}}, \sigma_{\textit{mix}})$
    end for
    $\tilde{\mathbf{y}}_{\text{s}} \leftarrow \texttt{CLIP-I2}(\mathbf{f}_{\text{s}})$
    $\text{Loss} = \text{cross-entropy}(\tilde{\mathbf{y}}_{\text{s}}, \mathbf{y}_{\text{s}})$
end foreach
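
| Method | arch. | C | B | M | S | AN | AS | AR | AF | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| RobustNet [7] | RN50 | 36.58 | 35.20 | 40.33 | 28.30 | 6.32 | 29.97 | 33.02 | 32.56 | 30.29 |
| SAN & SAW [41] | RN50 | 39.75 | 37.34 | 41.86 | 30.79 | - | - | - | - | - |
| Pin the memory [25] | RN50 | 41.00 | 34.60 | 37.40 | 27.08 | 3.84 | 5.51 | 5.89 | 7.27 | 20.32 |
| SHADE [62] | RN50 | 44.65 | 39.28 | 43.34 | 28.41 | 8.18 | 30.38 | 35.44 | 36.87 | 33.32 |
| SiamDoGe [55] | RN50 | 42.96 | 37.54 | 40.64 | 28.34 | 10.60 | 30.71 | 35.84 | 36.45 | 32.89 |
| DPCL [57] | RN50 | 44.87 | 40.21 | 46.74 | - | - | - | - | - | - |
| SPC-Net [21] | RN50 | 44.10 | 40.46 | 45.51 | - | - | - | - | - | - |
| NP [12] | RN50 | 40.62 | 35.56 | 38.92 | 27.65 | - | - | - | - | - |
| WildNet* [31] | RN50 | 44.62 | 38.42 | 46.09 | 31.34 | 8.27 | 30.29 | 36.32 | 35.39 | 33.84 |
| TLDR* [26] | RN50 | 46.51 | 42.58 | 46.18 | 30.57 | 13.13 | 36.02 | 38.89 | 40.58 | 36.81 |
| FAMix (ours) | RN50 | 48.15 | 45.61 | 52.11 | 34.23 | 14.96 | 37.09 | 38.66 | 40.25 | 38.88 |
| SAN & SAW [41] | RN101 | 45.33 | 41.18 | 40.77 | 31.84 | - | - | - | - | - |
| SHADE [62] | RN101 | 46.66 | 43.66 | 45.50 | 31.58 | 7.58 | 32.48 | 36.90 | 36.69 | 35.13 |
| WildNet* [31] | RN101 | 45.79 | 41.73 | 47.08 | 32.51 | - | - | - | - | - |
| TLDR* [26] | RN101 | 47.58 | 44.88 | 48.80 | 33.14 | - | - | - | - | - |
| FAMix (ours) | RN101 | 49.47 | 46.40 | 51.97 | 36.72 | 19.89 | 41.38 | 40.91 | 42.15 | 41.11 |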
Table 2: Single-source DGSS trained on GTAV. Performance (mIoU %) of FAMix compared to other DGSS methods trained on GTAV (G) and evaluated on C, B, M, S, A for ResNet-50 (‘RN50’) and ResNet-101 (‘RN101’) backbone architectures (‘arch.’). * indicates the use of extra data. † indicates the use of the full data for training. We emphasize best and second best results.

3.3 Training FAMix

Style randomization.  During training, randomly cropped images $\mathcal{I}_{\text{s}}$ are encoded into $\mathbf{f}_{\text{s}}$ using CLIP-I1. Each batch of feature maps $\mathbf{f}_{\text{s}}$ is viewed as a grid of $m$ patches, without cropping them. For each patch $\mathbf{f}_{\text{s}}^{(ij)}$ within the grid, the dominant class $c_{p}^{(ij)}$ is queried using the corresponding ground truth patch $\mathbf{y}_{\text{s}}^{(ij)}$, and a style is randomly sampled from the corresponding mined bank $\mathcal{T}^{(c_{p}^{(ij)})}$. We then apply patch-wise convex combination (i.e., style mixing) of the original style of the patch and the mined style. Specifically, for an arbitrary patch $\mathbf{f}_{\text{s}}^{(ij)}$, our local style mixing reads:

$$\mu_{\textit{mix}} \leftarrow (1 - \alpha)\, \mu(\mathbf{f}_{\text{s}}^{(ij)}) + \alpha\, \boldsymbol{\mu}^{(ij)} \tag{3}$$
$$\sigma_{\textit{mix}} \leftarrow (1 - \alpha)\, \sigma(\mathbf{f}_{\text{s}}^{(ij)}) + \alpha\, \boldsymbol{\sigma}^{(ij)}, \tag{4}$$

with $(\boldsymbol{\mu}^{(ij)}, \boldsymbol{\sigma}^{(ij)}) \in \mathcal{T}^{(c_{p}^{(ij)})}$ and $\alpha \in [0, 1]^{c}$.

As shown in Fig. 1, our style mixing strategy differs from [67] which applies a linear interpolation between styles extracted from the images of a limited set of source domain(s) assumed to be available for training. Here, we view the mined styles as variations of multiple proxy target domains defined by the prompts. Training is conducted over all the paths in the feature space between the source and proxy domains without requiring any additional image during training other than the one from source.

Style transfer is applied through AdaIN (1). Only the standard cross-entropy loss between the ground truth $\mathbf{y}_{\text{s}}$ and the prediction $\tilde{\mathbf{y}}_{\text{s}}$ is applied for training the network. Algorithm 2 shows the training steps of FAMix.
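To make the update concrete, the following is a minimal sketch of the mixing in (3) and (4) followed by AdaIN, written on plain Python lists rather than tensors. It is not the released implementation: the helper names (`channel_stats`, `mix_and_stylize`), the `[channel][pixel]` patch layout, the stabilizing `eps`, and the uniform sampling of $\alpha$ are all assumptions made for illustration.

```python
import math
import random


def channel_stats(patch, eps=1e-5):
    """Per-channel mean and std of a patch stored as [channel][pixel] lists."""
    means, stds = [], []
    for values in patch:
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        stds.append(math.sqrt(var + eps))  # eps is an assumed stabilizer
    return means, stds


def mix_and_stylize(patch, mined_mu, mined_sigma, alpha):
    """Mix source and mined statistics (Eqs. 3-4), then apply AdaIN."""
    mu_s, sigma_s = channel_stats(patch)
    out = []
    for c, values in enumerate(patch):
        mu_mix = (1 - alpha[c]) * mu_s[c] + alpha[c] * mined_mu[c]
        sigma_mix = (1 - alpha[c]) * sigma_s[c] + alpha[c] * mined_sigma[c]
        # AdaIN: normalize with the source statistics, rescale with the mixed ones.
        out.append([sigma_mix * (v - mu_s[c]) / sigma_s[c] + mu_mix for v in values])
    return out


# Toy patch with 2 channels and 4 spatial positions.
patch = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
mined_mu, mined_sigma = [10.0, -1.0], [2.0, 0.5]  # one style drawn from the bank
rng = random.Random(0)
alpha = [rng.random() for _ in patch]  # assumed: alpha drawn uniformly per channel
stylized = mix_and_stylize(patch, mined_mu, mined_sigma, alpha)
```

With `alpha` set to zero for every channel the patch is returned unchanged, and with `alpha` set to one the per-channel mean of the output equals the mined mean, which matches the two endpoints of the convex combination.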

Minimal fine-tuning.  During training, we fine-tune only the last few layers of the backbone. In our ablations (Sec. 4.4), we examine various alternatives and show that keeping fine-tuning minimal is the crucial factor for our local style mixing strategy to be effective.

Previous works [67, 12, 40] suggest that shallow feature statistics capture style information while deeper features encode semantic content. Consequently, some DGSS methods focus on learning style-agnostic representations [7, 40, 41], but this can compromise the expressiveness of the representation and suppress content information. In contrast, our intuition is to retain these identified traits by introducing variability to the shallow features through augmentation and mixing. Simultaneously, we guide the network to learn invariant high-level representations by training the final layers of the backbone with a label-preserving assumption, using a standard cross-entropy loss.

4 Experiments

4.1 Experimental setup

Synthetic datasets.  GTAV [45] and SYNTHIA [46] are used as synthetic datasets. GTAV consists of 24 966 images split into 12 403 images for training, 6 382 for validation and 6 181 for testing. SYNTHIA consists of 9 400 images: 6 580 for training and 2 820 for validation. GTAV and SYNTHIA are denoted by G and S, respectively.

Real datasets.  Cityscapes [8], BDD-100K [58], and Mapillary [39] contain 2 975, 7 000, and 18 000 images for training and 500, 1 000, and 2 000 images for validation, respectively. ACDC [48] is a dataset of driving scenes in adverse conditions: night, snow, rain and fog with respectively 106, 100, 100 and 100 images in the validation sets. C, B, and M denote Cityscapes, BDD-100K and Mapillary, respectively; AN, AS, AR and AF denote night, snow, rain and fog subsets of ACDC, respectively.

Implementation details.  Following previous works [7, 41, 25, 62, 55, 57, 21, 31, 26], we adopt DeepLabv3+ [5] as segmentation model. ResNet-50 and ResNet-101 [19], initialized with CLIP pretrained weights, are used in our experiments as backbones. Specifically, we remove the attention pooling layer and add a randomly initialized decoder head. The output stride is 16. Single-source and multi-source models are trained respectively for 40K and 60K iterations with a batch size of 8. The training images are cropped to 768 × 768. Stochastic Gradient Descent (SGD) with a momentum of 0.9 and weight decay of $10^{-4}$ is used as optimizer. Polynomial decay with a power of 0.9 is used, with an initial learning rate of $10^{-1}$ for the classifier and $10^{-2}$ for the backbone. We use color jittering and horizontal flip as data augmentation. Label smoothing regularization [50] is adopted. For style mining, Layer1 features are divided into 9 patches. Each patch is resized to 56 × 56, corresponding to the dimensions of Layer1 features for an input image of size 224 × 224 (i.e., the input dimension of CLIP). We use ImageNet templates (https://github.com/openai/CLIP/) for each prompt.
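As an illustration of the learning-rate schedule, here is a small sketch of polynomial decay with power 0.9. The exact formula is not spelled out in the text; the form below, `base_lr * (1 - iteration / max_iters) ** power`, is the common "poly" schedule and is assumed here, as is the function name `poly_lr`.

```python
def poly_lr(base_lr, iteration, max_iters, power=0.9):
    """Assumed 'poly' schedule: base_lr * (1 - iteration / max_iters) ** power."""
    return base_lr * (1.0 - iteration / max_iters) ** power


MAX_ITERS = 40_000  # single-source setting; the multi-source setting uses 60K
for it in (0, 20_000, 39_999):
    lr_classifier = poly_lr(1e-1, it, MAX_ITERS)
    lr_backbone = poly_lr(1e-2, it, MAX_ITERS)
    print(it, lr_classifier, lr_backbone)
```

Under this assumption the two parameter groups keep a constant 10:1 ratio throughout training, since both are scaled by the same decay factor.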

Evaluation metric.  We evaluate our models on the validation sets of the unseen target domains with mean Intersection over Union (mIoU %) of the 19 shared semantic classes. For each experiment, we report the average of three runs.
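For reference, mIoU can be computed from a pixel-level confusion matrix as sketched below. This is a generic illustration with a toy 3-class matrix, not the evaluation script used for the paper; the function name `mean_iou` and the choice to skip classes that appear in neither ground truth nor prediction are assumptions.

```python
def mean_iou(confusion):
    """mIoU (%) from a square matrix confusion[gt][pred] of pixel counts."""
    num_classes = len(confusion)
    ious = []
    for k in range(num_classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[g][k] for g in range(num_classes)) - tp
        denom = tp + fp + fn
        if denom > 0:  # assumed: classes absent from GT and prediction are skipped
            ious.append(tp / denom)
    return 100.0 * sum(ious) / len(ious)


# Toy confusion matrix for 3 classes (rows: ground truth, columns: prediction).
confusion = [[8, 2, 0],
             [1, 6, 3],
             [0, 1, 9]]
score = mean_iou(confusion)
```

In the paper's setting the matrix would be 19 × 19 and accumulated over the whole validation set before the per-class IoUs are averaged.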

4.2 Comparison with DGSS methods

Single-source DGSS. We compare FAMix with state-of-the-art DGSS methods under the single-source setting.

Training on GTAV (G) as source, Tab. 2 reports models trained with either ResNet-50 or ResNet-101 backbones. The unseen target datasets are C, B, M, S, and the four subsets of A. Tab. 2 shows that our method significantly outperforms all the baselines on all the datasets for both backbones. We note that WildNet [31] and TLDR [26] use extra-data, while SHADE [62] uses the full G dataset (24,966 images) for training with ResNet-101. Class-wise performances are reported in Appendix B.

Training on Cityscapes (C) as source, Tab. 3 reports performance with ResNet-50 backbone. The unseen target datasets are B, M, G, and S. The table shows that our method outperforms the baselines on average, and is competitive with the state of the art on G and M.

Method B M G S Mean
RobustNet [7] 50.73 58.64 45.00 26.20 45.14
Pin the memory [25] 46.78 55.10 - - -
SiamDoGe [55] 51.53 59.00 45.08 26.67 45.57
WildNet* [31] 50.94 58.79 47.01 27.95 46.17
DPCL [57] 52.29 - 46.00 26.60 -
FAMix (ours) 54.07 58.72 45.12 32.67 47.65
Table 3: Single-source DGSS trained on Cityscapes. Performance (mIoU %) of FAMix compared to other DGSS methods trained on C and evaluated on B, M, G and S for ResNet-50 backbone. * indicates the use of extra-data. We emphasize best and second best results.

Multi-source DGSS.  We also show the effectiveness of FAMix in the multi-source setting, training on G+S and evaluating on C, B and M. The results reported in Tab. 4 for ResNet-50 backbone outperform the state of the art on all three target datasets.

      Method       C       B       M       Mean
      RobustNet [7]       37.69       34.09       38.49       36.76
      Pin the memory [25]       44.51       38.07       42.70       41.76
      SHADE [62]       47.43       40.30       47.60       45.11
      SPC-Net [21]       46.36       43.18       48.23       45.92
      TLDR* [26]       48.83       42.58       47.80       46.40
      FAMix (ours)       49.41       45.51       51.61       48.84
Table 4: Multi-source DGSS trained on GTAV + SYNTHIA. Performance (mIoU %) of FAMix compared to other DGSS methods trained on G+S and evaluated on C, B, M for ResNet-50 backbone. * indicates the use of extra-data. We emphasize best and second best results.

Qualitative results.  We visually compare the segmentation results with Pin the memory [25], SHADE [62] and WildNet [31] in Fig. 3. FAMix clearly outperforms other DGSS methods on “stuff” (e.g., road and sky) and “things” (e.g., bicycle and bus) classes.

[Figure 3 image grid. Columns: Image, GT, Pin the memory [25], SHADE [62], WildNet [31], FAMix. Rows: C, B, M, S.]
Class legend: Road, Sidewalk, Building, Wall, Fence, Pole, Traffic light, Traffic sign, Vegetation, Terrain, Sky, Person, Rider, Car, Truck, Bus, Train, Motorbike, Bicycle, n/a.
Figure 3: Qualitative results. Columns 1-2: Image and ground truth (GT), Columns 3-4-5: DGSS methods results, Column 6: Our results. The models are trained on GTAV with ResNet-50 backbone.

4.3 Decoder-Probing Fine-Tuning (DP-FT)

Kumar et al. [28] show that standard fine-tuning may distort the pretrained feature representation, leading to degraded OOD performance for classification. Consequently, they propose a two-step training strategy: (1) training a linear probe (LP) on top of the frozen backbone features, (2) fine-tuning (FT) both the linear probe and the backbone. Inspired by it, Saito et al. [47] apply the same strategy for object detection, which is referred to as Decoder-probing Fine-tuning (DP-FT). They observe that DP-FT improves over DP depending on the architecture. We hypothesize that the effect also depends on the pretraining paradigm and the downstream task. As observed in Tab. 1, CLIP might remarkably overfit the source domain when fine-tuned. In Tab. 5, we compare fine-tuning (FT), decoder-probing (DP) and DP-FT. DP brings improvements over FT since it completely preserves the pretrained representation. Yet, the major drawback of DP is its limited ability to adapt features to the downstream task, resulting in suboptimal results. Surprisingly, DP-FT largely falls behind DP (21.07 vs. 29.99 mean mIoU), meaning that the learned features over-specialize to the source domain distribution even with a “decoder warm-up”.

The results call for specific strategies to preserve CLIP robustness for semantic segmentation. This need emerges from the additional gap between pretraining (i.e., aligning object-level and language representations) and fine-tuning (i.e., supervised pixel classification).
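The three regimes compared in Tab. 5 differ only in which parts of the network are updated, and when. The sketch below encodes that difference; it is illustrative only. The function name `trainable_parts` and the `warmup_iters` boundary between the probing and fine-tuning phases of DP-FT are hypothetical, since the split is not stated in the text.

```python
def trainable_parts(strategy, iteration, warmup_iters):
    """Return the parts of the network updated at a given iteration."""
    if strategy == "FT":
        return {"backbone", "decoder"}
    if strategy == "DP":
        return {"decoder"}  # backbone stays frozen for the whole run
    if strategy == "DP-FT":
        # Decoder-only warm-up first, then joint fine-tuning.
        if iteration < warmup_iters:
            return {"decoder"}
        return {"backbone", "decoder"}
    raise ValueError(f"unknown strategy: {strategy}")


WARMUP_ITERS = 10_000  # hypothetical value, for illustration only
for strategy in ("FT", "DP", "DP-FT"):
    print(strategy, sorted(trainable_parts(strategy, 0, WARMUP_ITERS)),
          sorted(trainable_parts(strategy, 20_000, WARMUP_ITERS)))
```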

Method C B M S AN AS AR AF Mean
FT 16.81 16.31 17.80 27.10 2.95 8.58 14.35 13.61 14.69
DP 34.13 37.67 42.21 29.10 10.71 26.26 29.47 30.40 29.99
DP-FT 25.62 21.71 26.39 31.45 4.22 18.26 20.07 20.85 21.07
FAMix (ours) 48.15 45.61 52.11 34.23 14.96 37.09 38.66 40.25 38.88
Table 5: FAMix vs. DP-FT. Performance (mIoU%) of FAMix compared to Fine-tuning (FT), Decoder-probing (DP) and Decoder-probing Fine-tuning (DP-FT). We use here ResNet-50, trained on GTAV. We emphasize best and second best results.
Freeze Augment Mix C B M S AN AS AR AF Mean
16.81 16.31 17.80 27.10 2.95 8.58 14.35 13.61 14.69
22.48 26.05 24.15 25.40 4.83 17.61 22.86 19.75 20.39
20.07 21.24 22.91 26.52 1.28 14.99 22.09 20.51 18.70
27.53 26.59 26.27 26.91 4.90 18.91 25.60 22.14 22.36
37.83 38.88 44.24 31.93 12.41 29.59 31.56 33.05 32.44
36.65 35.73 37.32 30.44 14.72 34.65 34.91 38.98 32.93
43.43 43.79 48.19 33.70 11.32 35.55 36.15 38.19 36.29
48.15 45.61 52.11 34.23 14.96 37.09 38.66 40.25 38.88
Table 6: Ablation of FAMix components. Performance (mIoU %) after removing one or more components of FAMix.

4.4 Ablation studies

We conduct all the ablations on a ResNet-50 backbone with GTAV (G) as source dataset.

Removing ingredients from the recipe.  FAMix is based on minimal fine-tuning of the backbone (i.e., Freeze), style augmentation and mixing. We show in Tab. 6 that the best generalization results are only obtained when combining the three ingredients. Specifically, when the backbone is fine-tuned (i.e., Freeze ✗), performance is largely harmed. When minimal fine-tuning is performed (i.e., Freeze ✓), we argue that the augmentations are too strong to be applied without style mixing; the latter brings both domain interpolation and the use of the original statistics. Consequently, when style mixing is not applied (i.e., Freeze ✓, Augment ✓, Mix ✗), the use of mined styles brings mostly no improvement on OOD segmentation compared to training without augmentation (i.e., Freeze ✓, Augment ✗, Mix ✗). Note that for Freeze ✓, Augment ✓, Mix ✗, line 8 of Algorithm 2 becomes:

$$\mathbf{f}_{\text{s}}^{(ij)} \leftarrow \texttt{AdaIN}(\mathbf{f}_{\text{s}}^{(ij)}, \boldsymbol{\mu}^{(ij)}, \boldsymbol{\sigma}^{(ij)}) \tag{5}$$
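As a consistency check on the notation (not a statement taken from the paper), the ablated update (5) is the special case of the mixing in (3) and (4) in which $\alpha$ is fixed to one for every channel, while $\alpha = 0$ leaves the patch unchanged. This assumes the usual AdaIN form referred to as (1), i.e., normalizing with the statistics of the input and rescaling with the given ones:

```latex
\begin{aligned}
\alpha = \mathbf{1}:&\quad
\mu_{\textit{mix}} = \boldsymbol{\mu}^{(ij)},\;
\sigma_{\textit{mix}} = \boldsymbol{\sigma}^{(ij)}
\;\Rightarrow\;
\texttt{AdaIN}(\mathbf{f}_{\text{s}}^{(ij)}, \mu_{\textit{mix}}, \sigma_{\textit{mix}})
= \texttt{AdaIN}(\mathbf{f}_{\text{s}}^{(ij)}, \boldsymbol{\mu}^{(ij)}, \boldsymbol{\sigma}^{(ij)}) \\
\alpha = \mathbf{0}:&\quad
\mu_{\textit{mix}} = \mu(\mathbf{f}_{\text{s}}^{(ij)}),\;
\sigma_{\textit{mix}} = \sigma(\mathbf{f}_{\text{s}}^{(ij)})
\;\Rightarrow\;
\texttt{AdaIN}(\mathbf{f}_{\text{s}}^{(ij)}, \mu_{\textit{mix}}, \sigma_{\textit{mix}})
= \mathbf{f}_{\text{s}}^{(ij)}
\end{aligned}
```

Intermediate values of $\alpha$ therefore place the patch style between the source style and the mined style.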

Our style mixing differs from MixStyle [67] in being applied (1) patch-wise and (2) between the original styles of the source data and augmented versions of them. Note that the case (Freeze ✓, Augment ✗, Mix ✓) could be seen as a variant of MixStyle, yet applied locally and class-wise. Our complete recipe proves significantly more effective, with a boost of ≈ +6 mean mIoU w.r.t. the baseline of training without augmentation and mixing.

Prompt construction.  Tab. 7 reports results when ablating the prompt construction. In FAMix, the final prompt is derived by concatenating <random style prompt> and <class name>; removing either of those leads to inferior results. Interestingly, replacing the style prompt by random characters (e.g., “ioscjspa”) does not significantly degrade the performance. In certain aspects, using random prompts still induces a randomization effect within the FAMix framework. However, meaningful prompts still consistently lead to the best results.
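The prompt variants ablated in Tab. 7 can be sketched as follows. This is an illustrative reconstruction, not the released code: the two style prompts are hypothetical placeholders (the actual prompts are listed in Appendix C), only two entries of the CLIP ImageNet template list are shown, and the function name `build_prompts` is an assumption.

```python
import random

# Two entries from the CLIP ImageNet template list; the full list is longer.
TEMPLATES = ["a photo of a {}.", "a bad photo of a {}."]

# Hypothetical style prompts, for illustration only (see Appendix C for the real ones).
STYLE_PROMPTS = ["retro futurism style", "watercolor painting style"]
CLASS_NAMES = ["road", "sidewalk", "car"]


def build_prompts(class_name, rng, use_style=True, use_class=True):
    """Concatenate <random style prompt> and <class name>, then fill each template."""
    parts = []
    if use_style:
        parts.append(rng.choice(STYLE_PROMPTS))
    if use_class:
        parts.append(class_name)
    text = " ".join(parts)
    return [template.format(text) for template in TEMPLATES]


rng = random.Random(0)
for name in CLASS_NAMES:
    print(build_prompts(name, rng))
```

Switching `use_style` or `use_class` off corresponds to the ablated rows in which one of the two prompt components is removed.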

RCP RSP CN C B M S AN AS AR AF Mean
45.99 43.71 50.48 34.75 15.22 35.09 34.92 38.17 37.29
46.10 44.24 48.90 33.62 13.39 35.99 36.68 39.86 37.35
45.64 44.59 49.13 33.64 15.33 37.32 35.98 38.85 37.56
47.83 44.83 50.38 34.27 14.43 37.07 37.07 38.76 38.08
48.15 45.61 52.11 34.23 14.96 37.09 38.66 40.25 38.88
Table 7: Ablation on the prompt construction. Performance (mIoU %) for different prompt constructions. RCP, RSP and CN refer to <random character prompt>, <random style prompt> and <class name>, respectively.

Number of style prompts.  FAMix uses a set $\mathcal{R}$ of random style prompts which are concatenated with the class names; $\mathcal{R}$ is formed by querying ChatGPT (https://chat.openai.com/) using <give me 20 prompts of 2 to 5 words describing random image styles>. The output prompts are provided in Appendix C. Fig. 4(a) shows that the size of $\mathcal{R}$ has a marginal impact on FAMix performance. Yet, the mIoU scores on C, B, M and AR are higher for $|\mathcal{R}| = 20$ compared to $|\mathcal{R}| = 1$ and almost equal for the other datasets.

The low sensitivity of the performance to the size of $\mathcal{R}$ could be explained by two factors. First, mining even from a single prompt results in different style variations as the optimization starts from different anchor points in the latent space, as argued in [11]. Second, mixing style between the source and the mined proxy domains is the crucial factor making the network explore intermediate domains during training. This does not contradict the effect of our prompt construction which leads to the best results (Tab. 7).

[Figure 4 plots. Panel (a): Number of prompts. Panel (b): Effect of layer freezing.]
Figure 4: Ablation of prompt set and freezing strategy. 4(a) Performance (mIoU %) on test datasets w.r.t. the number of random style prompts in $\mathcal{R}$. 4(b) Effect of freezing layers, reporting on the x-axis the last frozen layer. For example, ‘L3’ means freezing L1, L2 and L3. ‘L4’ ’ indicates that Layer4 is partially frozen.

Local vs. global style mining.  To highlight the effect of our class-wise local style mining, we perform an ablation replacing it with global style mining. Specifically, the same set of <random style prompt> is used, but concatenated with <driving> as a global description instead of the local class name. Intuitively, local style mining and mixing induces richer style variations and more contrast among patches. The results in Tab. 8 show the effectiveness of our local style mining and mixing strategy, bringing about 3 mIoU improvement on G → C.

Style mining C B M S AN AS AR AF Mean
global w/ “street view” 45.51 45.12 50.40 33.65 14.59 36.92 37.38 40.53 38.01
global w/ “urban scene” 46.59 45.38 51.33 33.67 14.42 35.96 37.30 40.52 38.15
global w/ “roadscape” 45.49 45.55 50.63 33.66 14.77 36.75 37.07 40.33 38.03
global w/ “commute snapshot” 45.39 45.08 50.50 33.68 13.65 36.63 37.93 40.92 37.97
global w/ “driving” 45.06 44.98 50.67 33.36 14.84 35.11 36.21 39.52 37.47
local 48.15 45.61 52.11 34.23 14.96 37.09 38.66 40.25 38.88
Table 8: Ablation on style mining. Global style mining consists of mining one style per feature map, using <random style prompt> + <global class> as prompt.

What to mix?  Let $\mathcal{S} = \bigcup_{k=1}^{K} \mathcal{S}^{(k)}$ and $\mathcal{T} = \bigcup_{k=1}^{K} \mathcal{T}^{(k)}$ be the sets of class-wise source and augmented features, respectively. In FAMix training, for an arbitrary patch $\mathbf{f}_{\text{s}}^{(ij)}$, style mixing is performed between the original source statistics and statistics sampled from the augmented set (i.e., $(\boldsymbol{\mu}^{(ij)}, \boldsymbol{\sigma}^{(ij)}) \in \mathcal{T}^{(c_p^{(ij)})}$, see (3) and (4)). In class-wise vanilla MixStyle, $(\boldsymbol{\mu}^{(ij)}, \boldsymbol{\sigma}^{(ij)}) \in \mathcal{S}^{(c_p^{(ij)})}$. In Tab. 9, we show that sampling $(\boldsymbol{\mu}^{(ij)}, \boldsymbol{\sigma}^{(ij)})$ from $\mathcal{S}^{(c_p^{(ij)})} \cup \mathcal{T}^{(c_p^{(ij)})}$ does not lead to better generalization, despite sampling from a set with twice the cardinality. This supports our mixing strategy visualized in Fig. 1. Intuitively, sampling from $\mathcal{S} \cup \mathcal{T}$ could be viewed as applying either MixStyle or our mixing with a probability $p = 0.5$.
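The three sampling choices compared in Tab. 9 can be illustrated with a short sketch. The banks below are hypothetical placeholders tagged by origin, and `sample_style` is an assumed helper name; the point is only that, when both banks have the same size, a uniform draw from their union picks a mined style about half of the time.

```python
import random


def sample_style(mode, source_bank, mined_bank, rng):
    """Draw one style for a patch of a given class.

    mode: "S" (class-wise MixStyle), "T" (FAMix) or "S_union_T".
    """
    if mode == "S":
        pool = source_bank
    elif mode == "T":
        pool = mined_bank
    elif mode == "S_union_T":
        pool = source_bank + mined_bank
    else:
        raise ValueError(f"unknown mode: {mode}")
    return rng.choice(pool)


# Hypothetical banks for one class; each entry is tagged with its origin.
source_bank = [("S", i) for i in range(50)]
mined_bank = [("T", i) for i in range(50)]

rng = random.Random(0)
draws = [sample_style("S_union_T", source_bank, mined_bank, rng) for _ in range(10_000)]
frac_mined = sum(tag == "T" for tag, _ in draws) / len(draws)
print(frac_mined)  # close to 0.5 when both banks have the same size
```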

Style set C B M S AN AS AR AF Mean
$\mathcal{S}$ 43.43 43.79 48.19 33.70 11.32 35.55 36.15 38.19 36.29
$\mathcal{S} \cup \mathcal{T}$ 44.76 45.59 50.78 34.05 13.67 36.92 37.18 38.13 37.64
$\mathcal{T}$ (ours) 48.15 45.61 52.11 34.23 14.96 37.09 38.66 40.25 38.88
Table 9: Ablation on the sets used for mixing. The styles $(\boldsymbol{\mu}, \boldsymbol{\sigma})$ used in (3) and (4) are sampled either from $\mathcal{S}$ or $\mathcal{S} \cup \mathcal{T}$ or $\mathcal{T}$.

Minimal fine-tuning.  We argue for minimal fine-tuning as a compromise between pretrained feature preservation and adaptation.  Fig. 4(b) shows an increasing OOD generalization trend with more freezing. Interestingly, only fine-tuning the last layers of the last convolutional block (where the dilation is applied) achieves the best results. When training on Cityscapes (Tab. 3), we observed that freezing all the layers except Layer4 achieves the best results.

4.5 Does FAMix require language?

Inspired by the observation that target statistics deviate around the source ones in real cases [12], we conduct an experiment where we replace language-driven style mining by noise perturbation. The same procedure of FAMix is kept: (i) Features are divided into patches, perturbed with noise and then saved into a style bank based on the dominant class; (ii) During training, patch-wise style mixing of original and perturbed styles is performed.

Different from Fan et al. [12], who perform a perturbation on the feature statistics using a normal distribution with pre-defined parameters, we experiment with perturbations of different noise magnitudes, controlled by the signal-to-noise ratio (SNR). Considering the mean of a patch $\mu \in \mathbb{R}^{c}$ as a signal, the goal is to perturb it with some noise $n_{\mu} \in \mathbb{R}^{c}$. The $\texttt{SNR}_{\texttt{dB}}$ between $\|\mu\|$ and $\|n_{\mu}\|$ is defined as $\texttt{SNR}_{\texttt{dB}} = 20 \log_{10}\left(\|\mu\| / \|n_{\mu}\|\right)$. Given $\mu$, $\texttt{SNR}_{\texttt{dB}}$, and $n \sim \mathcal{N}(0, I)$, where $I \in \mathbb{R}^{c \times c}$ is the identity matrix, the noise is computed as $n_{\mu} = 10^{-\texttt{SNR}_{\texttt{dB}}/20}\, \frac{\|\mu\|}{\|n\|}\, n$. We add $\mu + n_{\mu}$ to the style bank corresponding to the dominant class in the patch. The same applies to $\sigma \in \mathbb{R}^{c}$. The results of training for different noise levels are in Tab. 10. Using language as source of randomization outperforms every noise level in mean mIoU. The baseline corresponds to the case where neither augmentation nor mixing is performed (see Tab. 6, Freeze ✓, Augment ✗, Mix ✗). SNR = ∞ could be seen as a variant of MixStyle, applied class-wise to patches (see Tab. 6, Freeze ✓, Augment ✗, Mix ✓). The vanilla MixStyle gets inferior results.
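The noise scaling above can be checked numerically with a short sketch. The helper names (`norm`, `perturb`) and the toy statistics are assumptions made for illustration; the code only verifies that the rescaled noise reaches the requested SNR in dB.

```python
import math
import random


def norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def perturb(stat, snr_db, rng):
    """Return (stat + noise, noise), with the noise rescaled to the target SNR in dB."""
    n = [rng.gauss(0.0, 1.0) for _ in stat]  # n ~ N(0, I)
    scale = 10 ** (-snr_db / 20) * norm(stat) / norm(n)
    noise = [scale * v for v in n]
    return [s + e for s, e in zip(stat, noise)], noise


rng = random.Random(0)
mu = [0.5, -1.2, 2.0, 0.3]  # toy per-channel mean of one patch
perturbed, noise = perturb(mu, snr_db=15, rng=rng)
recovered_snr = 20 * math.log10(norm(mu) / norm(noise))
print(recovered_snr)  # matches the requested 15 dB up to floating-point error
```

Because the noise norm is fixed to $10^{-\texttt{SNR}_{\texttt{dB}}/20}\|\mu\|$ by construction, a larger SNR always yields a smaller perturbation, and an infinite SNR leaves the statistics untouched.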

Besides lower OOD performance, one more disadvantage of noise augmentation compared to our language-driven augmentation is the need to select a value for the SNR, for which the optimal value might vary depending on the target domain encountered at the test time.

SNR C B M S AN AS AR AF Mean
Baseline 37.83 38.88 44.24 31.93 12.41 29.59 31.56 33.05 32.44
5 28.78 29.24 30.32 21.67 12.60 24.00 25.95 25.87 24.80
10 40.09 39.50 43.45 29.09 13.36 33.47 33.11 36.17 33.53
15 45.02 44.16 48.63 32.96 14.55 36.09 35.99 40.96 37.30
20 45.52 44.29 49.26 33.45 12.40 35.96 36.52 38.60 37.00
25 44.82 44.26 48.54 33.30 11.38 34.51 35.46 37.61 36.24
30 43.07 43.80 48.31 33.47 12.33 35.05 35.58 38.10 36.21
∞ 43.43 43.79 48.19 33.70 11.32 35.55 36.15 38.19 36.29
MixStyle [67] 40.97 42.04 48.36 33.15 13.14 31.26 34.94 38.12 35.25
Prompts 48.15 45.61 52.11 34.23 14.96 37.09 38.66 40.25 38.88
Table 10: Noise vs prompt-driven augmentation. The prompt-driven augmentation in FAMix is replaced by random noise with different levels defined by SNR. We also include vanilla MixStyle. The prompt-driven strategy is superior.

5 Conclusion

We presented FAMix, a simple recipe for domain generalized semantic segmentation with CLIP pretraining. We proposed to locally mix the styles of source features with their augmented counterparts obtained using language prompts. Combined with minimal fine-tuning, FAMix significantly outperforms the state-of-the-art approaches. Extensive experiments showcase the effectiveness of our framework. We hope that FAMix will serve as a strong baseline in future works, exploring the potential of leveraging large-scale vision-language models for perception tasks.

Acknowledgment. This work was partially funded by French project SIGHT (ANR-20-CE23-0016) and was supported by ELSA - European Lighthouse on Secure and Safe AI funded by the European Union under grant agreement No. 101070617. It was performed using HPC resources from GENCI–IDRIS (Grant AD011014477).

Appendix

We provide details about Tab. 1 experiments in Appendix A. Further, we report class-wise performance for FAMix in Appendix B, and detail the prompts used in our experiments in Appendix C. Finally, we discuss the limitations and perspectives in Appendix D.

We refer to the supplementary video for further demonstration of FAMix qualitative performance: https://youtu.be/vyjtvx2El9Q.

Appendix A CLIP vs. ImageNet initialization

In Tab. 1 of the main paper, we introduce a comparison of ImageNet and CLIP pretraining for out-of-distribution semantic segmentation. We clarify here its implementation.

To produce Tab. 1, we use the public code (https://github.com/VainF/DeepLabV3Plus-Pytorch) and fine-tune with SGD using a learning rate of $10^{-1}$ for the segmenter and $10^{-2}$ for the backbone. We freeze the stem layers and Layer1 for both backbones, i.e., ImageNet and CLIP initialized ResNet-50, after observing that full fine-tuning leads to subpar in-domain results for CLIP. Crucially, in this setting, both ImageNet and CLIP initialized networks converge and achieve the same in-domain performance (on the GTAV validation set). Hence, we argue that the poor OOD performance of CLIP initialization in Tab. 1 may originate from the distortion of the robust CLIP representation towards the source domain distribution, as advocated in [28]. To alleviate such distortion, we freeze most of the backbone layers in FAMix.

However, we highlight that different hyper-parameter choices could boost the performance to some extent. For example, Rao et al. [44] observed that fine-tuning CLIP for semantic segmentation with the default configuration in MMSegmentation (https://github.com/open-mmlab/mmsegmentation) leads to a 15.6% lower mIoU than its ImageNet pre-trained counterpart on ADE20K [63]. Consequently, they propose using AdamW [38] for optimization.

In Tab. 11, we repeat the experiment of Tab. 1 using AdamW as optimizer. $O_1$ refers to the optimization configuration adopted in our paper, i.e., SGD optimizer with a learning rate of $10^{-1}$ for the segmenter and $10^{-2}$ for the backbone. $O_2$ and $O_3$ both refer to the use of AdamW with a learning rate of $10^{-4}$ for the segmenter and $10^{-5}$ for the backbone. In $O_2$ the whole backbone is fine-tuned, while in $O_3$ the stem layers and Layer1 are frozen, similar to $O_1$. Results show that using AdamW improves the mean performance across out-of-distribution (OOD) domains for both CLIP and ImageNet initialized networks, but still largely lags behind FAMix (mean mIoU = 38.88%) and even our variant (Freeze ✓, Augment ✗, Mix ✗) (mean mIoU = 32.44%) in Tab. 6.
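For quick reference, the three optimization configurations can be summarized as a small lookup table. The dictionary layout, the key names, and the labels `"stem"` and `"layer1"` are illustrative choices rather than identifiers from the released code; the values are those stated above.

```python
# Summary of the optimization configurations compared in Tab. 11.
OPTIM_CONFIGS = {
    "O1": {"optimizer": "SGD", "lr_segmenter": 1e-1, "lr_backbone": 1e-2,
           "frozen": ("stem", "layer1")},
    "O2": {"optimizer": "AdamW", "lr_segmenter": 1e-4, "lr_backbone": 1e-5,
           "frozen": ()},
    "O3": {"optimizer": "AdamW", "lr_segmenter": 1e-4, "lr_backbone": 1e-5,
           "frozen": ("stem", "layer1")},
}


def backbone_is_fully_trainable(name):
    """True when no backbone block is frozen in the given configuration."""
    return len(OPTIM_CONFIGS[name]["frozen"]) == 0


for name, cfg in OPTIM_CONFIGS.items():
    print(name, cfg["optimizer"], cfg["lr_segmenter"], cfg["lr_backbone"], cfg["frozen"])
```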

These results hint that using AdamW with relatively low learning rates might reduce the feature distortion of CLIP. Motivated by this observation, one could question the necessity of the minimal fine-tuning part of FAMix, and whether similar results could be achieved only by augmenting, mixing, and fine-tuning with AdamW and low learning rate. We call this variant AMix (Augment and Mix) and show the results in Tab. 12, which support the necessity of our full recipe.

Optim. Pretraining C B M S AN AS AR AF Mean
$O_1$ ImageNet 29.04 32.17 34.26 29.87 4.36 22.38 28.34 26.76 25.90
$O_1$ CLIP 16.81 16.31 17.80 27.10 2.95 8.58 14.35 13.61 14.69
$O_2$ ImageNet 28.00 36.82 37.00 30.60 3.56 24.14 29.51 26.23 26.98
$O_2$ CLIP 31.73 25.89 30.68 33.32 2.56 19.17 21.42 17.58 22.79
$O_3$ ImageNet 28.74 36.91 37.86 30.32 4.46 22.48 28.49 25.38 26.83
$O_3$ CLIP 26.81 23.11 29.82 32.38 4.20 18.50 22.59 20.31 22.22
Table 11: Effect of optimization configurations on OOD performance. Performance (mIoU %) of CLIP vs. ImageNet initialized networks for different optimization configurations.
Method C B M S AN AS AR AF Mean
AMix 40.50 38.69 36.05 33.61 4.03 23.03 30.01 26.89 29.10
FAMix 48.15 45.61 52.11 34.23 14.96 37.09 38.66 40.25 38.88
Table 12: AMix with AdamW optimizer vs. FAMix. Performance (mIoU %) of FAMix in our default configuration compared to a variant with no minimal fine-tuning, replacing SGD with AdamW optimizer.

Appendix B Class-wise performance

We report class-wise IoUs in Tab. 13 and Tab. 14. The standard deviations of the mIoU (%) over three runs are also reported.

| Target eval. | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle | mIoU% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C | 88.97 | 39.48 | 83.12 | 28.52 | 29.06 | 38.64 | 42.67 | 36.26 | 86.48 | 24.70 | 78.26 | 69.08 | 23.88 | 85.35 | 29.63 | 38.82 | 9.28 | 38.02 | 44.68 | 48.15±0.38 |
| B | 87.83 | 40.33 | 78.44 | 15.88 | 35.20 | 38.13 | 40.63 | 29.49 | 77.52 | 31.19 | 90.80 | 60.28 | 23.23 | 82.99 | 26.73 | 34.53 | 0.00 | 43.55 | 29.92 | 45.61±0.84 |
| M | 86.65 | 41.65 | 78.67 | 26.91 | 30.88 | 45.91 | 46.50 | 61.48 | 81.84 | 38.79 | 94.09 | 68.65 | 33.59 | 84.52 | 40.90 | 42.40 | 10.15 | 41.20 | 35.24 | 52.11±0.17 |
| S | 60.55 | 49.61 | 82.63 | 7.80 | 5.42 | 29.23 | 15.71 | 15.26 | 68.18 | 0.00 | 90.83 | 61.59 | 12.09 | 61.29 | 0.00 | 35.23 | 0.00 | 32.75 | 22.19 | 34.23±0.53 |
| AN | 47.44 | 7.01 | 38.73 | 8.42 | 3.59 | 23.04 | 18.42 | 5.75 | 19.33 | 5.82 | 5.65 | 26.61 | 10.39 | 50.46 | 4.10 | 0.00 | 0.79 | 8.42 | 0.28 | 14.96±0.09 |
| AS | 66.93 | 10.15 | 62.17 | 33.95 | 22.10 | 35.26 | 51.20 | 35.57 | 73.12 | 20.72 | 77.55 | 52.02 | 0.63 | 71.62 | 21.20 | 1.14 | 12.36 | 47.32 | 9.71 | 37.09±0.83 |
| AR | 73.41 | 21.42 | 77.58 | 19.41 | 16.96 | 33.63 | 44.90 | 38.53 | 80.96 | 29.31 | 94.98 | 56.89 | 17.04 | 76.15 | 16.15 | 7.07 | 5.11 | 23.74 | 1.24 | 38.66±1.12 |
| AF | 77.61 | 31.99 | 76.42 | 28.84 | 10.30 | 31.42 | 52.92 | 29.99 | 68.09 | 24.55 | 92.03 | 54.52 | 34.91 | 68.21 | 26.07 | 11.02 | 1.44 | 7.71 | 36.78 | 40.25±0.71 |

Table 13: ResNet-50 class-wise performance. We report the performance of FAMix (IoU %) trained on GTAV with ResNet-50 as backbone.

| Target eval. | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle | mIoU% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C | 90.25 | 44.78 | 84.54 | 31.71 | 31.48 | 44.17 | 45.45 | 35.78 | 87.17 | 35.58 | 84.30 | 69.62 | 20.48 | 86.87 | 31.11 | 44.38 | 7.73 | 34.06 | 30.46 | 49.47±0.36 |
| B | 86.79 | 41.47 | 79.53 | 16.67 | 41.27 | 39.41 | 42.31 | 33.35 | 78.64 | 36.86 | 91.03 | 60.32 | 23.73 | 81.51 | 31.13 | 25.99 | 0.00 | 45.78 | 25.73 | 46.40±0.50 |
| M | 78.71 | 38.89 | 81.85 | 26.56 | 40.22 | 47.32 | 49.27 | 62.19 | 82.68 | 41.54 | 95.63 | 67.60 | 25.87 | 85.50 | 41.62 | 35.87 | 12.55 | 43.69 | 29.92 | 51.97±1.30 |
| S | 63.72 | 56.80 | 85.05 | 9.30 | 21.62 | 33.26 | 16.44 | 18.96 | 69.42 | 0.00 | 92.10 | 63.52 | 10.95 | 64.86 | 0.00 | 29.84 | 0.00 | 39.67 | 22.22 | 36.72±0.71 |
| AN | 65.87 | 23.23 | 37.83 | 13.72 | 4.60 | 30.34 | 16.49 | 7.48 | 27.37 | 7.83 | 17.61 | 35.16 | 18.48 | 53.71 | 5.67 | 0.00 | 0.84 | 10.25 | 1.44 | 19.89±1.22 |
| AS | 75.42 | 31.90 | 72.15 | 36.75 | 27.85 | 38.89 | 49.63 | 33.09 | 72.45 | 22.98 | 84.21 | 56.75 | 1.91 | 75.84 | 34.61 | 4.44 | 4.17 | 48.91 | 14.26 | 41.38±0.34 |
| AR | 57.58 | 26.76 | 79.76 | 19.79 | 21.06 | 37.70 | 46.34 | 37.66 | 83.85 | 36.96 | 94.80 | 55.40 | 31.61 | 79.53 | 14.84 | 14.01 | 6.97 | 29.36 | 3.34 | 40.91±1.28 |
| AF | 77.96 | 41.73 | 77.99 | 34.73 | 6.85 | 36.80 | 49.49 | 34.51 | 72.00 | 32.60 | 91.52 | 46.28 | 27.28 | 70.77 | 31.17 | 19.08 | 4.87 | 10.75 | 34.55 | 42.15±1.87 |

Table 14: ResNet-101 class-wise performance. We report the performance of FAMix (IoU %) trained on GTAV with ResNet-101 as backbone.
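As a sanity check on how the last column relates to the class-wise columns, the sketch below averages the 19 class-wise IoUs of row C in Tab. 13. The helper names are illustrative, and the use of the sample standard deviation over runs is an assumption, since the text does not say which estimator is reported.

```python
from statistics import mean, stdev

# Class-wise IoUs (%) copied from row C of Tab. 13, in the table's column order.
IOU_ROW_C_R50 = [
    88.97, 39.48, 83.12, 28.52, 29.06, 38.64, 42.67, 36.26, 86.48, 24.70,
    78.26, 69.08, 23.88, 85.35, 29.63, 38.82, 9.28, 38.02, 44.68,
]


def mean_iou(class_ious):
    """mIoU is the unweighted average of the per-class IoUs."""
    return mean(class_ious)


def summarize_runs(run_mious):
    """Mean and sample standard deviation of the mIoU over several runs
    (assumption: the sample estimator; the text does not specify it)."""
    return mean(run_mious), stdev(run_mious)


miou_c = mean_iou(IOU_ROW_C_R50)
print(f"{miou_c:.2f}")  # matches the 48.15 reported in the mIoU% column
```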

Appendix C Prompts used for style mining

The <random style prompt> set used for training FAMix:
$\mathcal{R}_1$ = <random style prompt> = { Ethereal Mist, Cyberpunk Cityscape, Rustic Charm, Galactic Fantasy, Pastel Dreams, Dystopian Noir, Whimsical Wonderland, Urban Grit, Enchanted Forest, Retro Futurism, Monochrome Elegance, Vibrant Graffiti, Haunting Shadows, Steampunk Adventures, Watercolor Serenity, Industrial Chic, Cosmic Voyage, Pop Art Popularity, Abstract Symphony, Magical Realism, Abstract Geometric Patterns, Vintage Film Grain, Neon Cityscape Vibes, Surreal Watercolor Dreams, Minimalist Nature Scenes, Cyberpunk Urban Chaos, Impressionist Sunset Hues, Pop Art Explosion, Fantasy Forest Adventures, Pixelated Digital Chaos, Monochromatic Street Photography, Vibrant Graffiti Expressions, Steampunk Industrial Charm, Ethereal Cloudscapes, Retro Futurism Flare, Dark and Moody Landscapes, Pastel Dreamworlds, Galactic Space Odyssey, Abstract Brush Strokes, Noir Cinematic Moments, Whimsical Fairy Tale Realms, Modernist Architectural Wonders, Macro Botanical Elegance, Dystopian Sci-Fi Realities, High Contrast Street Art, Impressionist City Reflections, Pixel Art Nostalgia, Dynamic Action Sequences, Soft Focus Pastels, Abstract 3D Renderings, Mystical Moonlit Landscapes, Urban Decay Aesthetics, Holographic Futuristic Visions, Vintage Polaroid Snapshots, Digital Glitch Anomalies, Japanese Zen Gardens, Psychedelic Kaleidoscopes, Cosmic Abstract Portraits, Subtle Earthy Textures, Hyperrealistic Wildlife Portraits, Cybernetic Neon Lights, Warped Reality Illusions, Whimsical Watercolor Animals, Industrial Grunge Textures, Tropical Paradise Escapes, Dynamic Street Performances, Abstract Architectural Wonders, Comic Book Panel Vibes, Soft Glow Sunsets, 8-Bit Pixel Adventures, Galactic Nebula Explosions, Doodle Sketchbook Pages, High-Tech Futuristic Landscapes, Cinematic Noir Shadows, Vibrant Desert Landscapes, Abstract Collage Chaos, Nature in Infrared, Surreal Dream Sequences, Abstract Light Painting, Whimsical Fantasy Creatures, Cybernetic Augmented Reality, Impressionist Rainy Days, Vintage Aged Photographs, Neon Anime Cityscapes, Pastel Sunset Palette, Surreal Floating Islands, Abstract Mosaic Patterns, Retro Sci-Fi Spaceships, Futuristic Cyber Landscapes, Steampunk Clockwork Contraptions, Monochromatic Urban Decay, Glitch Art Distortions, Magical Forest Enchantments, Digital Oil Painting, Pop Surrealist Dreams, Dynamic Graffiti Murals, Vintage Pin-up Glamour, Abstract Kinetic Sculptures, Neon Jungle Adventures, Minimalist Futuristic Interfaces }

The <random character prompt> set used in the Tab. 7 experiments (i.e., ‘RCP’):
$\mathcal{R}_2$ = <random character prompt> = { ioscjspa, cjosae, wqvsecpas, csavwggw, csanoiaj, zfaspf, atpwqkmfc, mdmfejh, casjicjai, cnoacpoaj, noiasvnai, kcsakofnaoi, cjncioasn, wkqgmdc, jqblhyu, pqwfkgr, mzxanqnw, wnzsalml, sdqlhkjr, odfeqfit }

Both $\mathcal{R}_1$ and $\mathcal{R}_2$ are concatenated with the word "style".
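A minimal sketch of how such a prompt could be drawn is given below. It assumes that the word "style" is appended as a suffix after a single space, which the text does not state explicitly, and it lists only the first three entries of each set for brevity. The function name `sample_prompt` is illustrative.

```python
import random

# First entries of the two prompt sets listed above (truncated for brevity).
STYLE_PROMPTS = ["Ethereal Mist", "Cyberpunk Cityscape", "Rustic Charm"]
CHARACTER_PROMPTS = ["ioscjspa", "cjosae", "wqvsecpas"]


def sample_prompt(pool, rng):
    """Draw one entry uniformly at random and concatenate the word 'style'.
    Assumption: 'style' is used as a suffix, e.g. 'Ethereal Mist style'."""
    return f"{rng.choice(pool)} style"


rng = random.Random(0)  # seeded only to make the sketch reproducible
style_prompt = sample_prompt(STYLE_PROMPTS, rng)
character_prompt = sample_prompt(CHARACTER_PROMPTS, rng)
```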

Appendix D Limitations and perspectives

D.1 Limitations

Failure conditions.

We show in Fig. 5 failure cases in rare conditions, which include extreme illumination or darkness, low visibility due to rain drops on the windshield, and other adverse conditions (e.g., snowy road). While FAMix improves over the baseline, the results remain unsatisfactory for safety-critical applications, as it fails to segment critical objects in the scenes (e.g., car, road, sidewalk, person, etc.). We leave generalization to the above-mentioned conditions for future research. One possible direction could be to design specific methods for specific corner conditions (e.g., [18, 32]), although we highlight that this is orthogonal to generalization.

[Figure 5 image grid. Columns: Image, GT, Baseline, FAMix. Color legend: Road, Sidewalk, Building, Wall, Fence, Pole, Traffic light, Traffic sign, Vegetation, Terrain, Sky, Person, Rider, Car, Truck, Bus, Train, Motorbike, Bicycle, n/a.]
Figure 5: Examples of failure cases. Columns 1-2: Image and Ground Truth (GT), Column 3: Baseline (Freeze ✗, Augment ✗, Mix ✗), Column 4: FAMix results. The models are trained on GTAV with ResNet-50 backbone.

Stylization.

At the heart of FAMix lies the assumption that unseen target distributions could be covered by augmenting the mean and standard deviation of the low-level features. While the correlation between “style” and these parameters has been shown in previous research [40, 67], we believe that the hypothesis that domain shift can be described by these parameters alone over-simplifies generalization. Moreover, FAMix does not handle or provide an estimation of uncertainty, which is crucial for classes both inside and outside the label set of the application at hand.
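To make the assumption concrete, the sketch below re-styles a single feature channel by replacing its mean and standard deviation, in the spirit of adaptive instance normalization (Huang and Belongie, 2017). It is an illustration only and not the authors' implementation: the function name, the flat list standing in for a feature channel, and the target statistics are all hypothetical, and how the new statistics are obtained is outside the scope of this sketch.

```python
from statistics import fmean, pstdev


def restyle_channel(values, new_mean, new_std, eps=1e-5):
    """Normalize one feature channel with its own statistics, then
    re-scale and shift it with the target ("style") statistics."""
    mu = fmean(values)       # channel-wise mean of the source features
    sigma = pstdev(values)   # channel-wise standard deviation
    return [new_std * (v - mu) / (sigma + eps) + new_mean for v in values]


# Hypothetical activations of one low-level feature channel.
channel = [1.0, 2.0, 3.0, 4.0]
restyled = restyle_channel(channel, new_mean=10.0, new_std=2.0)
```

Only the first- and second-order statistics of the channel change; the relative ordering of the activations, and hence the spatial content, is preserved. This is precisely why a shift that cannot be expressed through these two statistics falls outside what such an augmentation can cover.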

D.2 Perspectives

Vision transformers (ViTs) [10] have recently emerged as an alternative to CNNs. We leave a ViT implementation of FAMix for future work. Applying prompt-driven instance normalization (PIN) [11] to ViTs appears non-trivial as the relation between statistics of low-level feature maps and style is established only for CNNs so far, and could raise some technical challenges. Exploring this direction might first involve a study of the correlation between style and statistics of patches. If such correlation is demonstrated, a naive way to apply FAMix could be by applying PIN with tied parameters across the patches.

While some modern architectures are inherently more robust than older ones, the problem of DGSS with ResNets (e.g., ResNet-50 and ResNet-101) is still not solved. As long as a gap exists between in-domain and out-of-distribution performance, we believe that this setting remains interesting, and that a general understanding of domain generalization could emerge from the algorithms proposed to address it.

References

  • Ahuja et al. [2021] Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization. In NeurIPS, 2021.
  • Allingham et al. [2023] James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, and Balaji Lakshminarayanan. A simple zero-shot prompt weighting technique to improve prompt ensembling in text-image models. In ICML, 2023.
  • Arjovsky et al. [2019] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Balaji et al. [2018] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. Metareg: Towards domain generalization using meta-regularization. In NeurIPS, 2018.
  • Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • Chen and He [2021] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2021.
  • Choi et al. [2021] Sungha Choi, Sanghun Jung, Huiwon Yun, Joanne T Kim, Seungryong Kim, and Jaegul Choo. Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In CVPR, 2021.
  • Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • Fahes et al. [2023] Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, and Raoul de Charette. Poda: Prompt-driven zero-shot domain adaptation. In ICCV, 2023.
  • Fan et al. [2023] Qi Fan, Mattia Segu, Yu-Wing Tai, Fisher Yu, Chi-Keung Tang, Bernt Schiele, and Dengxin Dai. Towards robust object detection invariant to real-world domain shifts. In ICLR, 2023.
  • Fang et al. [2022] Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, and Ludwig Schmidt. Data determines distributional robustness in contrastive language image pre-training (clip). In ICML, 2022.
  • Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
  • Ge et al. [2023] Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, and Jiaping Zhao. Improving zero-shot generalization and robustness of multi-modal models. In CVPR, 2023.
  • Goyal et al. [2023] Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. In CVPR, 2023.
  • Gu et al. [2022] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
  • Halder et al. [2019] Shirsendu Sukanta Halder, Jean-François Lalonde, and Raoul de Charette. Physics-based rendering for improving robustness to rain. In ICCV, 2019.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Hoffman et al. [2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
  • Huang et al. [2023] Wei Huang, Chang Chen, Yong Li, Jiacheng Li, Cheng Li, Fenglong Song, Youliang Yan, and Zhiwei Xiong. Style projected clustering for domain generalized semantic segmentation. In CVPR, 2023.
  • Huang and Belongie [2017] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
  • Jain et al. [2023] Nishant Jain, Harkirat Behl, Yogesh Singh Rawat, and Vibhav Vineet. Efficiently robustify pre-trained models. In ICCV, 2023.
  • Jia et al. [2021] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
  • Kim et al. [2022] Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, and Kwanghoon Sohn. Pin the memory: Learning to generalize semantic segmentation. In CVPR, 2022.
  • Kim et al. [2023] Sunghwan Kim, Dae-hwan Kim, and Hoseong Kim. Texture learning domain randomization for domain generalized segmentation. In ICCV, 2023.
  • Krueger et al. [2021] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (rex). In ICML, 2021.
  • Kumar et al. [2022] Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In ICLR, 2022.
  • Kwon and Ye [2022] Gihyun Kwon and Jong Chul Ye. Clipstyler: Image style transfer with a single text condition. In CVPR, 2022.
  • Laroudie et al. [2023] Clement Laroudie, Andrei Bursuc, Mai Lan Ha, and Gianni Franchi. Improving clip robustness with knowledge distillation and self-training. arXiv preprint arXiv:2309.10361, 2023.
  • Lee et al. [2022] Suhyeon Lee, Hongje Seong, Seongwon Lee, and Euntai Kim. Wildnet: Learning domain generalized semantic segmentation from the wild. In CVPR, 2022.
  • Lengyel et al. [2021] Attila Lengyel, Sourav Garg, Michael Milford, and Jan C van Gemert. Zero-shot day-night domain adaptation with a physics prior. In ICCV, 2021.
  • Li et al. [2022] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In ICLR, 2022.
  • Li et al. [2018a] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In CVPR, 2018a.
  • Li et al. [2018b] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In ECCV, 2018b.
  • Li et al. [2019] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, 2019.
  • Long et al. [2018] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In NeurIPS, 2018.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
  • Neuhold et al. [2017] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
  • Pan et al. [2018] Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In ECCV, 2018.
  • Peng et al. [2022] Duo Peng, Yinjie Lei, Munawar Hayat, Yulan Guo, and Wen Li. Semantic-aware domain generalized segmentation. In CVPR, 2022.
  • Qiao et al. [2020] Fengchun Qiao, Long Zhao, and Xi Peng. Learning to learn single domain generalization. In CVPR, 2020.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Rao et al. [2022] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In CVPR, 2022.
  • Richter et al. [2016] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
  • Ros et al. [2016] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
  • Saito et al. [2023] Kuniaki Saito, Donghyun Kim, Piotr Teterwak, Rogerio Feris, and Kate Saenko. Mind the backbone: Minimizing backbone distortion for robust object detection. arXiv preprint arXiv:2303.14744, 2023.
  • Sakaridis et al. [2021] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In ICCV, 2021.
  • Shu et al. [2023] Yang Shu, Xingzhuo Guo, Jialong Wu, Ximei Wang, Jianmin Wang, and Mingsheng Long. Clipood: Generalizing clip to out-of-distributions. In ICML, 2023.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • Tzeng et al. [2017] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
  • Vu et al. [2019] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, 2019.
  • Wang et al. [2022] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip Yu. Generalizing to unseen domains: A survey on domain generalization. T-KDE, 2022.
  • Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In CVPR, 2022.
  • Wu et al. [2022] Zhenyao Wu, Xinyi Wu, Xiaoping Zhang, Lili Ju, and Song Wang. Siamdoge: Domain generalizable semantic segmentation using siamese network. In ECCV, 2022.
  • Xu et al. [2021] Qinwei Xu, Ruipeng Zhang, Ya Zhang, Yanfeng Wang, and Qi Tian. A fourier-based framework for domain generalization. In CVPR, 2021.
  • Yang et al. [2023] Liwei Yang, Xiang Gu, and Jian Sun. Generalized semantic segmentation by self-supervised source domain projection and multi-level contrastive learning. In AAAI, 2023.
  • Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In CVPR, 2020.
  • Zhai et al. [2022] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In CVPR, 2022.
  • Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
  • Zhao et al. [2020] Shanshan Zhao, Mingming Gong, Tongliang Liu, Huan Fu, and Dacheng Tao. Domain generalization via entropy regularization. In NeurIPS, 2020.
  • Zhao et al. [2022] Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, and Gim Hee Lee. Style-hallucinated dual consistency learning for domain generalized semantic segmentation. In ECCV, 2022.
  • Zhou et al. [2019] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019.
  • Zhou et al. [2022a] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In ECCV, 2022a.
  • Zhou et al. [2020a] Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, and Tao Xiang. Deep domain-adversarial image generation for domain generalisation. In AAAI, 2020a.
  • Zhou et al. [2020b] Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, and Tao Xiang. Learning to generate novel domains for domain generalization. In ECCV, 2020b.
  • Zhou et al. [2021] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. In ICLR, 2021.
  • Zhou et al. [2022b] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. TPAMI, 2022b.
  • Zhou et al. [2022c] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, 2022c.
  • Zhou et al. [2022d] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 2022d.