WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

J Ma, Y Niu, S Huang, G Han, SF Chang - arXiv preprint arXiv:2405.18405, 2024 - arxiv.org
Language has been useful for extending vision encoders to data from diverse distributions without empirical discovery in the training domains. However, because image descriptions are mostly coarse-grained and ignore visual details, the resulting embeddings are still ineffective at overcoming the complexity of domains at inference time. We present a self-supervision framework, WIDIn (Wording Images for Domain-Invariant representation), to disentangle discriminative visual representations by leveraging only data from a single domain and without any test-time prior. Specifically, for each image, we first estimate a language embedding with fine-grained alignment, which is then used to adaptively identify and remove the domain-specific counterpart from the raw visual embedding. WIDIn can be applied both to pretrained vision-language models like CLIP and to separately trained uni-modal models like MoCo and BERT. Experimental studies on three domain generalization datasets demonstrate the effectiveness of our approach.
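
A minimal sketch of the disentanglement step the abstract describes, under stated assumptions: the abstract does not give the exact formulation, so here the domain-specific counterpart is assumed to be the residual of the visual embedding outside the direction of its fine-grained language embedding, and the hypothetical parameter `alpha` controls how much of it is removed.

```python
import torch
import torch.nn.functional as F

def remove_domain_specific(image_emb: torch.Tensor,
                           language_emb: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """Illustrative sketch, not the paper's exact method.

    image_emb: raw visual embedding for an image.
    language_emb: fine-grained language embedding estimated for the
        same image (the paper's "wording" of the image).
    alpha: hypothetical strength of the removal (assumption).
    """
    t = F.normalize(language_emb, dim=-1)
    # Component of the visual embedding explained by the language
    # embedding (its semantic, language-aligned content).
    proj = (image_emb * t).sum(dim=-1, keepdim=True) * t
    # Assumption: treat the orthogonal residual as the adaptively
    # identified domain-specific counterpart.
    domain_specific = image_emb - proj
    # Subtract it to obtain a more domain-invariant representation.
    return F.normalize(image_emb - alpha * domain_specific, dim=-1)
```

With alpha = 1 this keeps only the language-aligned component; smaller values remove the residual only partially, which is one plausible way to read "adaptively identify and then remove".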