CoVR: Learning Composed Video Retrieval from Web Video Captions

Ventura, Lucas; Yang, Antoine; Schmid, Cordelia; Varol, Gül

doi:10.1609/aaai.v38i6.28334


arXiv:2308.14746 (cs)

[Submitted on 28 Aug 2023 (v1), last revised 30 May 2024 (this version, v3)]



Abstract: Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at this https URL.
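The triplet-mining idea in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes, purely for illustration, that two captions count as "similar" when they have the same length and differ in exactly one word, and it replaces the large language model with a fixed text template. The names `mine_caption_pairs`, `template_modification`, and `build_triplets` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Find pairs of videos whose captions differ in exactly one word.

    captions: dict mapping video id -> caption string.
    Returns a list of (id_a, id_b, word_a, word_b) tuples.
    ASSUMPTION: "similar caption" means a one-word difference at the
    same position; the abstract does not define the criterion.
    """
    buckets = defaultdict(list)
    for vid, cap in captions.items():
        words = cap.lower().split()
        for i, w in enumerate(words):
            # Key = the caption with position i masked out.
            key = (len(words), i, tuple(words[:i] + words[i + 1:]))
            buckets[key].append((vid, w))

    pairs = []
    for members in buckets.values():
        for (va, wa), (vb, wb) in combinations(members, 2):
            if wa != wb:  # skip videos with identical captions
                pairs.append((va, vb, wa, wb))
    return pairs


def template_modification(word_a, word_b):
    """Hypothetical stand-in for the LLM that writes the modification text."""
    return f"replace {word_a} with {word_b}"


def build_triplets(captions):
    """Return (query_video, modification_text, target_video) triplets."""
    return [
        (va, template_modification(wa, wb), vb)
        for va, vb, wa, wb in mine_caption_pairs(captions)
    ]


if __name__ == "__main__":
    demo = {
        "v1": "a dog running on grass",
        "v2": "a cat running on grass",
        "v3": "a bird flying in the sky",
    }
    for triplet in build_triplets(demo):
        print(triplet)
```

Bucketing on the masked caption avoids comparing every caption against every other one, which matters at the scale of millions of video-caption pairs. A real pipeline would also need filtering of noisy pairs and a proper language model for the modification text.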

Comments:	AAAI 2024. Updated the results on CIRR with the correct evaluation. Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2308.14746 [cs.CV]
	(or arXiv:2308.14746v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.14746
Related DOI:	https://doi.org/10.1609/aaai.v38i6.28334

Submission history

From: Lucas Ventura
[v1] Mon, 28 Aug 2023 17:55:33 UTC (10,236 KB)
[v2] Tue, 21 May 2024 14:44:08 UTC (11,101 KB)
[v3] Thu, 30 May 2024 11:52:33 UTC (11,102 KB)
