Compositional Temporal Visual Grounding of Natural Language Event Descriptions

Stroud, Jonathan C.; McCaffrey, Ryan; Mihalcea, Rada; Deng, Jia; Russakovsky, Olga

arXiv:1912.02256 (cs)

[Submitted on 4 Dec 2019]

Abstract: Temporal grounding entails establishing a correspondence between natural language event descriptions and their visual depictions. Compositional modeling becomes central: we first ground atomic descriptions (e.g., "girl eating an apple," "batter hitting the ball") to short video segments, and then establish the temporal relationships between the segments. This compositional structure enables models to recognize a wider variety of events not seen during training by recognizing their atomic sub-events. Explicit temporal modeling accounts for the wide variety of temporal relationships that can be expressed in language: e.g., in the description "girl stands up from the table after eating an apple," the visual ordering of the events is the reverse of the order in which they are mentioned, with "eating an apple" occurring first, followed by "standing up from the table." We leverage these observations to develop a unified deep architecture, CTG-Net, to perform temporal grounding of natural language event descriptions to videos. We demonstrate that our system outperforms prior state-of-the-art methods on the DiDeMo, Tempo-TL, and Tempo-HL temporal grounding datasets.
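The two-stage idea in the abstract (ground atomic sub-events to segments, then enforce a temporal relation between them) can be illustrated with a toy sketch. This is not CTG-Net: the function name `ground_composite`, the per-segment score lists, and the exhaustive search are all assumptions made for illustration, and a learned model would produce the scores rather than take them as hand-written inputs.

```python
from itertools import product


def ground_composite(scores_a, scores_b, relation):
    """Toy compositional grounding (hypothetical helper, not the paper's method).

    scores_a[i] / scores_b[j]: assumed match score of sub-event A / B
    against video segment i / j (segments are in temporal order).
    relation: "before" requires A's segment to precede B's (i < j);
              "after" requires A's segment to follow B's (i > j).
    Returns the (i, j) pair with the highest summed score that
    satisfies the relation.
    """
    if relation not in ("before", "after"):
        raise ValueError(f"unsupported relation: {relation!r}")

    best = None  # (total_score, i, j)
    for i, j in product(range(len(scores_a)), range(len(scores_b))):
        ordered = i < j if relation == "before" else i > j
        if not ordered:
            continue
        total = scores_a[i] + scores_b[j]
        if best is None or total > best[0]:
            best = (total, i, j)

    if best is None:
        raise ValueError("no segment pair satisfies the relation")
    return best[1], best[2]


# "girl stands up from the table after eating an apple"
# A = "stands up from the table", B = "eating an apple", relation = "after".
# Scores below are made-up values over four consecutive video segments.
stand_up = [0.1, 0.2, 0.1, 0.9]
eat_apple = [0.2, 0.8, 0.3, 0.1]
print(ground_composite(stand_up, eat_apple, "after"))
```

With these made-up scores, the "after" relation places "stands up" in a later segment than "eating an apple," even though the sentence mentions standing up first, which is the reversal the abstract describes.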

Comments:	Project page: this http URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1912.02256 [cs.CV]
	(or arXiv:1912.02256v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1912.02256

Submission history

From: Jonathan Stroud
[v1] Wed, 4 Dec 2019 21:36:16 UTC (8,839 KB)
