MXM-CLR: A Unified Framework for Contrastive Learning of Multifold Cross-Modal Representations

Wang, Ye; Jiang, Bowei; Zou, Changqing; Ma, Rui

arXiv:2303.10839 (cs)

[Submitted on 20 Mar 2023 (v1), last revised 21 Mar 2023 (this version, v2)]

Abstract: Multifold observations are common across data modalities: for example, a 3D shape can be represented by multi-view images, and an image can be described by different captions. Existing cross-modal contrastive representation learning (XM-CLR) methods such as CLIP are not fully suitable for multifold data because they consider only one positive pair and treat all other pairs as negatives when computing the contrastive loss. In this paper, we propose MXM-CLR, a unified framework for contrastive learning of multifold cross-modal representations. MXM-CLR explicitly models and learns the relationships between multifold observations of instances from different modalities for more comprehensive representation learning. The key to MXM-CLR is a novel multifold-aware hybrid loss that considers multiple positive observations when computing the hard and soft relationships for the cross-modal data pairs. We conduct quantitative and qualitative comparisons with state-of-the-art (SOTA) baselines on cross-modal retrieval tasks using the Text2Shape and Flickr30K datasets. We also perform extensive evaluations of the adaptability and generalizability of MXM-CLR, as well as ablation studies on the loss design and the effect of batch size. The results show that MXM-CLR learns better representations for multifold data than the baselines. The code is available at this https URL.
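To make the abstract's contrast between a single positive pair and multiple positive observations concrete, here is a minimal sketch in plain Python. It is not the paper's multifold-aware hybrid loss: it illustrates only the general idea of averaging the log-likelihood over every observation that shares an instance, and it leaves out the soft relationships entirely. The function names, the `row_ids`/`col_ids` instance labels, the temperature value, and the toy similarity matrix are all assumptions made for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def single_positive_loss(sims, temperature=0.07):
    """One-positive contrastive loss: only the pair (i, i) is positive,
    every other column in row i is treated as a negative."""
    total = 0.0
    for i, row in enumerate(sims):
        logp = log_softmax([s / temperature for s in row])
        total += -logp[i]
    return total / len(sims)


def multi_positive_loss(sims, row_ids, col_ids, temperature=0.07):
    """Multi-positive variant (illustrative assumption, not the paper's loss):
    every column whose instance id matches the row's instance id is positive,
    and the log-likelihood is averaged over those positives.
    Assumes each row has at least one matching column."""
    total = 0.0
    for i, row in enumerate(sims):
        logp = log_softmax([s / temperature for s in row])
        positives = [j for j, cid in enumerate(col_ids) if cid == row_ids[i]]
        total += -sum(logp[j] for j in positives) / len(positives)
    return total / len(sims)


# Toy data (made up): two instances, each with two views (rows)
# and two captions (columns); entries are cross-modal similarities.
sims = [
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.1],
    [0.1, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.8, 0.9],
]
instance_ids = [0, 0, 1, 1]

one_pos = single_positive_loss(sims)
multi_pos = multi_positive_loss(sims, instance_ids, instance_ids)
```

When every observation has a unique instance id, the multi-positive variant reduces to the one-positive loss, so the difference between the two appears only when observations are grouped into instances.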

Comments:	16 pages, 14 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2303.10839 [cs.CV]
	(or arXiv:2303.10839v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.10839

Submission history

From: Ye Wang
[v1] Mon, 20 Mar 2023 02:51:53 UTC (9,685 KB)
[v2] Tue, 21 Mar 2023 02:37:37 UTC (9,685 KB)
