Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding

Shimomoto, Erica K.; Marrese-Taylor, Edison; Takamura, Hiroya; Kobayashi, Ichiro; Nakayama, Hideki; Miyao, Yusuke

arXiv:2209.13359 (cs)

[Submitted on 26 Sep 2022 (v1), last revised 25 May 2023 (this version, v2)]

Abstract: This paper explores the task of Temporal Video Grounding (TVG), where, given an untrimmed video and a natural language sentence query, the goal is to recognize and determine the temporal boundaries of action instances in the video described by the query. Recent works tackled this task by improving query inputs with large pre-trained language models (PLMs) at the cost of more expensive training. However, the effects of this integration are unclear, as these works also propose improvements in the visual inputs. Therefore, this paper studies the effects of PLMs in TVG and assesses the applicability of parameter-efficient training with NLP adapters. We couple popular PLMs with a selection of existing approaches and test different adapters to reduce the impact of the additional parameters. Our results on three challenging datasets show that, without changing the visual inputs, TVG models greatly benefited from PLM integration and fine-tuning, stressing the importance of sentence query representation in this task. Furthermore, NLP adapters were an effective alternative to full fine-tuning, even though they were not tailored to our task, allowing PLM integration in larger TVG models and delivering results comparable to SOTA models. Finally, our results shed light on which adapters work best in different scenarios.

Comments:	Accepted for Findings of ACL2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2209.13359 [cs.CV]
	(or arXiv:2209.13359v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2209.13359

Submission history

From: Erica Kido Shimomoto
[v1] Mon, 26 Sep 2022 08:11:19 UTC (1,110 KB)
[v2] Thu, 25 May 2023 08:50:45 UTC (10,583 KB)
