-
EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation
Authors:
Baoqi Pei,
Guo Chen,
Jilan Xu,
Yuping He,
Yicheng Liu,
Kanghua Pan,
Yifei Huang,
Yali Wang,
Tong Lu,
Limin Wang,
Yu Qiao
Abstract:
In this report, we present our solutions to the EgoVis Challenges at CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building upon a video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions. In the Ego4D challenge, we tackle five tasks: Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation. We also participate in the EPIC-Kitchens challenge, entering the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness in different egocentric video analysis scenarios, demonstrating the powerful representation ability of EgoVideo as an egocentric foundation model. Our codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.
Submitted 30 June, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
Authors:
Yifei Huang,
Guo Chen,
Jilan Xu,
Mingfang Zhang,
Lijin Yang,
Baoqi Pei,
Hongjie Zhang,
Lu Dong,
Yali Wang,
Limin Wang,
Yu Qiao
Abstract:
Being able to map the activities of others into one's own point of view is a fundamental human skill, present even from a very early age. Taking a step toward understanding this human ability, we introduce EgoExoLearn, a large-scale dataset that emulates the human demonstration-following process, in which individuals record egocentric videos as they execute tasks guided by demonstration videos. Focusing on the potential applications in daily assistance and professional support, EgoExoLearn contains egocentric and demonstration video data spanning 120 hours captured in daily life scenarios and specialized laboratories. Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints. To this end, we present benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment, along with detailed analysis. We expect that EgoExoLearn can serve as an important resource for bridging actions across views, thus paving the way for creating AI agents capable of seamlessly learning by observing humans in the real world. Code and data can be found at: https://github.com/OpenGVLab/EgoExoLearn
Submitted 5 June, 2024; v1 submitted 24 March, 2024;
originally announced March 2024.
-
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Authors:
Yi Wang,
Kunchang Li,
Xinhao Li,
Jiashuo Yu,
Yinan He,
Guo Chen,
Baoqi Pei,
Rongkun Zheng,
Jilan Xu,
Zun Wang,
Yansong Shi,
Tianxiang Jiang,
Songze Li,
Hongjie Zhang,
Yifei Huang,
Yu Qiao,
Yali Wang,
Limin Wang
Abstract:
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies the different self- or weakly-supervised learning frameworks of masked video token reconstruction, cross-modal contrastive learning, and next token prediction. Different training stages guide our model to capture different levels of structure and semantic information through different pretext tasks. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. We scale both data and model size for InternVideo2. Through extensive experiments, we validate our designs and demonstrate state-of-the-art performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason and comprehend long temporal contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo2/.
Submitted 22 March, 2024;
originally announced March 2024.
-
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
Authors:
Guo Chen,
Yifei Huang,
Jilan Xu,
Baoqi Pei,
Zhe Chen,
Zhiqi Li,
Jiahao Wang,
Kunchang Li,
Tong Lu,
Limin Wang
Abstract:
Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNNs, 3D CNNs, and Transformers. The newly proposed state space model architecture, e.g., Mamba, shows promising traits for extending its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work we conduct a comprehensive set of studies, probing the different roles Mamba can play in modeling videos while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, derive a Video Mamba Suite composed of 14 models/modules, and evaluate them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work can provide valuable data points and insights for future research on video understanding. Code is public: https://github.com/OpenGVLab/video-mamba-suite.
Submitted 14 March, 2024;
originally announced March 2024.
-
Robust Blind Source Separation by Soft Decision-Directed Non-Unitary Joint Diagonalization
Authors:
Wenjuan Liu,
Dazheng Feng,
Bingnan Pei,
Mengdao Xing,
Xinhong Meng,
Qianru Wei
Abstract:
Approximate joint diagonalization of a set of matrices provides a powerful framework for numerous statistical signal processing applications. For non-unitary joint diagonalization (NUJD) based on the least-squares (LS) criterion, outliers, also referred to as anomalous or discordant observations, have a negative influence on performance, since squaring the residuals magnifies their effects. To solve this problem, we propose a novel cost function that incorporates the soft decision-directed scheme into the least-squares algorithm, and we develop an efficient algorithm. The influence of the outliers is mitigated by applying decision-directed weights that are associated with the residual error at each iterative step. Specifically, the mixing matrix is estimated by a modified stationary point method, in which the updating direction is determined based on the linear approximation to the gradient function. Simulation results demonstrate that the proposed algorithm outperforms conventional non-unitary diagonalization algorithms in terms of both convergence performance and robustness to outliers.
Submitted 28 June, 2021;
originally announced July 2021.