Prompting Visual-Language Models for Efficient Video Understanding

Ju, Chen; Han, Tengda; Zheng, Kunhao; Zhang, Ya; Xie, Weidi

arXiv:2112.04478 (cs)

[Submitted on 8 Dec 2021 (v1), last revised 15 Jul 2022 (this version, v2)]

Abstract: Image-based visual-language (I-VL) pre-training has shown great success in learning joint visual-textual representations from large-scale web data, revealing a remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline to efficiently adapt the pre-trained I-VL model and exploit its powerful ability for resource-hungry video understanding tasks with minimal training. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 10 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve competitive or state-of-the-art performance compared with existing methods, despite optimising significantly fewer parameters.
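The data flow described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mean-pooling stand-in for the frozen text encoder, the projection-free self-attention stand-in for the lightweight temporal Transformer, and the toy embeddings are all hypothetical. Only the forward pass is shown; in the setting the abstract describes, the prompt vectors would be optimised while the pre-trained encoders stay fixed.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def mean_pool(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def build_prompt(prefix, class_embedding, suffix):
    # Sequence fed to the text encoder:
    # [learnable prefix vectors, class-name embedding, learnable suffix vectors]
    return prefix + [class_embedding] + suffix


def frozen_text_encoder(token_vectors):
    # Assumption: mean pooling stands in for the frozen pre-trained text encoder.
    return normalise(mean_pool(token_vectors))


def temporal_self_attention(frames):
    # Assumption: one projection-free self-attention layer with a residual
    # connection stands in for the lightweight temporal Transformer.
    dim = len(frames[0])
    mixed = []
    for query in frames:
        scores = [dot(query, key) / math.sqrt(dim) for key in frames]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attended = [
            sum(w / total * key[i] for w, key in zip(weights, frames))
            for i in range(dim)
        ]
        mixed.append([q + a for q, a in zip(query, attended)])
    return mixed


def video_embedding(frames):
    # Frame-wise features -> temporal mixing -> one normalised video vector.
    return normalise(mean_pool(temporal_self_attention(frames)))


def classify(frames, class_embeddings, prefix, suffix):
    # Score the video against every prompted class description (cosine similarity).
    video = video_embedding(frames)
    scores = {
        name: dot(video, frozen_text_encoder(build_prompt(prefix, emb, suffix)))
        for name, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy usage with made-up 4-dimensional features.
random.seed(0)
dim, num_prompt = 4, 2
prefix = [[random.gauss(0, 0.01) for _ in range(dim)] for _ in range(num_prompt)]
suffix = [[random.gauss(0, 0.01) for _ in range(dim)] for _ in range(num_prompt)]
class_embeddings = {
    "archery": [1.0, 0.0, 0.0, 0.0],
    "bowling": [0.0, 1.0, 0.0, 0.0],
}
frames = [[1.0, 0.1, 0.0, 0.0], [0.9, 0.0, 0.1, 0.0], [0.8, 0.2, 0.0, 0.0]]
label, scores = classify(frames, class_embeddings, prefix, suffix)
print(label, scores)
```

In this sketch the only quantities that would be trained are `prefix` and `suffix` (plus the temporal layer's weights in a realistic version), which is consistent with the abstract's claim of optimising significantly fewer parameters than full fine-tuning.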

Comments:	ECCV 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2112.04478 [cs.CV]
	(or arXiv:2112.04478v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2112.04478

Submission history

From: Chen Ju
[v1] Wed, 8 Dec 2021 18:58:16 UTC (291 KB)
[v2] Fri, 15 Jul 2022 08:31:45 UTC (321 KB)
