SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

Lin, Yuanze; Wei, Chen; Wang, Huiyu; Yuille, Alan; Xie, Cihang

arXiv:2211.11446 (cs)

[Submitted on 21 Nov 2022 (v1), last revised 30 Nov 2022 (this version, v3)]

Abstract: Video-language pre-training is crucial for learning powerful multi-modal representations. However, it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. The foundational component of SMAUG is the masked autoencoder. Unlike prior works, which mask only textual inputs, our masking strategy covers both the visual and textual modalities, providing better cross-modal alignment and further reducing pre-training costs. On top of that, we introduce a space-time token sparsification module, which leverages context information to select only "important" spatial regions and temporal frames for pre-training. Together, these designs allow our method to achieve competitive performance on text-to-video retrieval and video question answering while cutting pre-training costs by 1.9X or more. For example, SMAUG needs only about 50 NVIDIA A6000 GPU hours of pre-training to attain competitive performance on these two video-language tasks across six popular benchmarks.
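The two ideas named in the abstract, masking both modalities and then keeping only "important" visual tokens, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract gives no mask ratios, token shapes, or scoring rule, so everything below is an assumption made for illustration: the helper names `random_mask` and `topk_select`, the ratios 0.75 and 0.25, uniform random masking that simply drops tokens, and an importance score defined as the dot product between each visual token and the mean text token.

```python
import numpy as np

def random_mask(tokens, mask_ratio, rng):
    """Drop a random fraction of tokens; return the kept tokens and their indices."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return tokens[keep_idx], keep_idx

def topk_select(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    k = min(k, tokens.shape[0])
    idx = np.sort(np.argsort(scores)[-k:])
    return tokens[idx], idx

rng = np.random.default_rng(0)
frames, patches, dim = 4, 16, 8          # assumed toy sizes
video_tokens = rng.normal(size=(frames * patches, dim))
text_tokens = rng.normal(size=(12, dim))

# Step 1: mask both modalities (ratios are assumptions, not from the paper).
vis_kept, vis_idx = random_mask(video_tokens, mask_ratio=0.75, rng=rng)
txt_kept, txt_idx = random_mask(text_tokens, mask_ratio=0.25, rng=rng)

# Step 2: sparsify the remaining visual tokens using context.
# Hypothetical importance score: similarity to the mean of the kept text tokens.
context = txt_kept.mean(axis=0)
scores = vis_kept @ context
sparse_tokens, sparse_idx = topk_select(vis_kept, scores, k=8)

# Map the selected tokens back to (frame, patch) positions in the video.
frame_ids, patch_ids = np.divmod(vis_idx[sparse_idx], patches)
```

In this toy setup the encoder would see 8 visual tokens instead of 64, which is the kind of saving that shorter token sequences bring; the actual selection mechanism and savings in SMAUG are described in the paper itself.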

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2211.11446 [cs.CV]
	(or arXiv:2211.11446v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.11446

Submission history

From: Yuanze Lin
[v1] Mon, 21 Nov 2022 13:34:34 UTC (8,791 KB)
[v2] Tue, 22 Nov 2022 17:27:37 UTC (8,791 KB)
[v3] Wed, 30 Nov 2022 04:09:56 UTC (8,804 KB)
