Large-scale Bilingual Language-Image Contrastive Learning

Ko, Byungsoo; Gu, Geonmo

Computer Science > Computer Vision and Pattern Recognition

arXiv:2203.14463 (cs)

[Submitted on 28 Mar 2022 (v1), last revised 15 Apr 2022 (this version, v2)]

Title:Large-scale Bilingual Language-Image Contrastive Learning

Authors:Byungsoo Ko, Geonmo Gu

View PDF

Abstract:This paper is a technical report to share our experience and findings building a Korean and English bilingual multimodal model. While many of the multimodal datasets focus on English and multilingual multimodal research uses machine-translated texts, employing such machine-translated texts is limited to describing unique expressions, cultural information, and proper noun in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with such training schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) training multimodal model without cross-lingual relation can learn the relation via visual semantics; 3) our bilingual KELIP can capture cultural differences of visual semantics for the same meaning of words; 4) a large-scale multimodal model can be used for multimodal feature analogy. We hope that this work will provide helpful experience and findings for future research. We provide an open-source pre-trained KELIP.

Comments:	Accepted by ICLRW2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2203.14463 [cs.CV]
	(or arXiv:2203.14463v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2203.14463

Submission history

From: ByungSoo Ko [view email]
[v1] Mon, 28 Mar 2022 03:02:03 UTC (12,001 KB)
[v2] Fri, 15 Apr 2022 02:37:55 UTC (12,001 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Large-scale Bilingual Language-Image Contrastive Learning

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Large-scale Bilingual Language-Image Contrastive Learning

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators