
Showing 101–114 of 114 results for author: Shou, M Z

  1. arXiv:2203.07303  [pdf, other]

    cs.CV

    All in One: Exploring Unified Video-Language Pre-training

    Authors: Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: Mainstream Video-Language Pre-training models (e.g., ActBERT, ClipBERT, VIOLET) consist of three parts, a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance via utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters with lower efficiency in downstream tasks. In this work, we for the first time introduce…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: 18 pages. 11 figures. Code: https://github.com/showlab/all-in-one

  2. arXiv:2203.04203  [pdf, other]

    cs.CV

    AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant

    Authors: Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou

    Abstract: A long-standing goal of intelligent assistants such as AR glasses/robots has been to assist users in affordance-centric real-world scenarios, such as "how can I run the microwave for 1 minute?". However, there is still no clear task definition and suitable benchmarks. In this paper, we define a new task called Affordance-centric Question-driven Task Completion, where the AI assistant should learn…

    Submitted 20 July, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

    Comments: Accepted by ECCV 2022. Equal contribution: Benita Wong, Joya Chen, You Wu; Corresponding author: Mike Zheng Shou

  3. arXiv:2112.14976   

    cs.CV cs.AI

    Contrastive Learning of Semantic and Visual Representations for Text Tracking

    Authors: Zhuang Li, Weijia Wu, Mike Zheng Shou, Jiahong Li, Size Li, Zhongyuan Wang, Hong Zhou

    Abstract: Semantic representation is of great benefit to the video text tracking (VTT) task, which requires simultaneously classifying, detecting, and tracking texts in the video. Most existing approaches tackle this task by appearance similarity in continuous frames, while ignoring the abundant semantic features. In this paper, we explore robustly tracking video text with contrastive learning of semantic and…

    Submitted 19 August, 2022; v1 submitted 30 December, 2021; originally announced December 2021.

    Comments: Merged with arXiv article 2207.08417. We will withdraw the two papers and create a new one

  4. arXiv:2112.01194  [pdf, other]

    cs.CV cs.MM

    Video-Text Pre-training with Learned Regions

    Authors: Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang

    Abstract: Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate at frame-level directly and thus overlook the spatio-temporal structure of objects in video, which yet h…

    Submitted 6 December, 2021; v1 submitted 2 December, 2021; originally announced December 2021.

  5. arXiv:2112.00656  [pdf, other]

    cs.CV cs.CL

    Object-aware Video-language Pre-training for Retrieval

    Authors: Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: Recently, by introducing large-scale datasets and strong transformer networks, video-language pre-training has shown great success, especially for retrieval. Yet, existing video-language transformer models do not explicitly perform fine-grained semantic alignment. In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object represent…

    Submitted 18 May, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

    Comments: CVPR2022; Code: https://github.com/FingerRec/OA-Transformer

  6. arXiv:2111.15050  [pdf, other]

    cs.CV

    AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant

    Authors: Stan Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, Mike Zheng Shou

    Abstract: It is still a pipe dream that personal AI assistants on the phone and AR glasses can assist our daily life in addressing our questions like "how to adjust the date for this watch?" and "how to set its heating duration? (while pointing at an oven)". The queries used in conventional tasks (i.e. Video Question Answering, Video Retrieval, Moment Localization) are often factoid and based on pure te…

    Submitted 10 October, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

    Comments: 20 pages, 12 figures

  7. arXiv:2111.14448  [pdf, other]

    cs.CV cs.MM eess.AS

    AVA-AVD: Audio-Visual Speaker Diarization in the Wild

    Authors: Eric Zhongcong Xu, Zeyang Song, Satoshi Tsutsui, Chao Feng, Mang Ye, Mike Zheng Shou

    Abstract: Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets are mainly focused on indoor environments like meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challengi…

    Submitted 16 July, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

    Comments: ACMMM 2022

  8. arXiv:2111.12527  [pdf, other]

    cs.CV

    MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

    Authors: David Junhao Zhang, Kunchang Li, Yali Wang, Yunpeng Chen, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou

    Abstract: Recently, MLP-Like networks have been revived for image recognition. However, whether it is possible to build a generic MLP-Like architecture for the video domain has not been explored, due to complex spatial-temporal modeling with a large computation burden. To fill this gap, we present an efficient self-attention free backbone, namely MorphMLP, which flexibly leverages the concise Fully-Connected (FC)…

    Submitted 23 August, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: ECCV2022

  9. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  10. arXiv:2109.06085  [pdf, other]

    cs.CV cs.CL

    On Pursuit of Designing Multi-modal Transformer for Video Grounding

    Authors: Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou

    Abstract: Video grounding aims to localize the temporal segment corresponding to a sentence query from an untrimmed video. Almost all existing video grounding methods fall into two frameworks: 1) Top-down model: It predefines a set of segment candidates and then conducts segment classification and regression. 2) Bottom-up model: It directly predicts frame-wise probabilities of the referential segment bounda…

    Submitted 11 April, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

    Comments: Accepted by Conference on Empirical Methods in Natural Language Processing (EMNLP 2021, Oral)

  11. Deep Motion Prior for Weakly-Supervised Temporal Action Localization

    Authors: Meng Cao, Can Zhang, Long Chen, Mike Zheng Shou, Yuexian Zou

    Abstract: Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels. Currently, most state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline: producing snippet-level predictions first and then aggregating to the video-level prediction. However, we argue that existing methods have overlooked two important drawbacks:…

    Submitted 29 July, 2022; v1 submitted 12 August, 2021; originally announced August 2021.

    Comments: Accepted by IEEE Transactions on Image Processing (TIP)

  12. arXiv:2107.06592  [pdf, other]

    eess.AS cs.SD eess.IV

    Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

    Authors: Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li

    Abstract: Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that ma…

    Submitted 25 July, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

    Comments: ACM Multimedia 2021

  13. arXiv:2101.10511  [pdf, other]

    cs.CV

    Generic Event Boundary Detection: A Benchmark for Event Segmentation

    Authors: Mike Zheng Shou, Stan Weixian Lei, Weiyao Wang, Deepti Ghadiyaram, Matt Feiszli

    Abstract: This paper presents a novel task together with a new benchmark for detecting generic, taxonomy-free event boundaries that segment a whole video into chunks. Conventional work in temporal video segmentation and action detection focuses on localizing pre-defined action categories and thus does not scale to generic videos. Cognitive science has known since the last century that humans consistently segmen…

    Submitted 19 August, 2021; v1 submitted 25 January, 2021; originally announced January 2021.

    Comments: ICCV 2021

  14. arXiv:2006.07976  [pdf, other]

    cs.CV cs.LG eess.IV

    Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

    Authors: Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, Hongsheng Li

    Abstract: Localizing persons and recognizing their actions from videos is a challenging task towards high-level video understanding. Recent advances have been achieved by modeling direct pairwise relations between entities. In this paper, we take one step further and not only model direct relations between pairs but also take into account indirect higher-order relations established upon multiple elements. We p…

    Submitted 20 April, 2021; v1 submitted 14 June, 2020; originally announced June 2020.

    Comments: Accepted in CVPR 2021