Learning to Segment Referred Objects from Narrated Egocentric Videos

Shen, Yuhan; Wang, Huiyu; Yang, Xitong; Feiszli, Matt; Elhamifar, Ehsan; Torresani, Lorenzo; Mavroudi, Effrosyni

Yuhan Shen, Huiyu Wang, Xitong Yang, Matt Feiszli, Ehsan Elhamifar, Lorenzo Torresani, Effrosyni Mavroudi; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14510-14520

Abstract

Egocentric videos provide a first-person perspective of the wearer's activities involving simultaneous interactions with multiple objects. In this work we propose the task of weakly-supervised Narration-based Video Object Segmentation (NVOS). Given an egocentric video clip and a narration of the wearer's activities our aim is to segment object instances mentioned in the narration without using any spatial annotations during training. Existing weakly-supervised video object grounding methods typically yield bounding boxes for referred objects. In contrast we propose ROSA a weakly-supervised pixel-level grounding framework learning alignments between referred objects and segmentation mask proposals. Our model harnesses vision-language models pre-trained on image-text pairs to embed region masks and object phrases. During training we combine (a) a video-narration contrastive loss that implicitly supervises the alignment between regions and phrases and (b) a region-phrase contrastive loss based on inferred latent alignments. To address the lack of annotated NVOS datasets in egocentric videos we create a new evaluation benchmark VISOR-NVOS leveraging existing annotations of segmentation masks from VISOR alongside 14.6k newly-collected object-based video clip narrations. Our approach achieves state-of-the-art zero-shot pixel-level grounding performance compared to strong baselines under similar supervision. Additionally we demonstrate generalization capabilities for zero-shot video object grounding on YouCook2 a third-person instructional video dataset.

Related Material

[pdf] [supp]

[bibtex]

@InProceedings{Shen_2024_CVPR, author = {Shen, Yuhan and Wang, Huiyu and Yang, Xitong and Feiszli, Matt and Elhamifar, Ehsan and Torresani, Lorenzo and Mavroudi, Effrosyni}, title = {Learning to Segment Referred Objects from Narrated Egocentric Videos}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {14510-14520} }