Showing 1–42 of 42 results for author: Bertasius, G

Searching in archive cs.

  1. arXiv:2405.19209  [pdf, other]

    cs.CV cs.AI cs.CL

    VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

    Authors: Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal

    Abstract: Video-language understanding tasks have focused on short video clips, often struggling with long-form video understanding tasks. Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) to perform long video QA, transforming videos into densely sampled frame captions, and asking LLMs to respond to text queries over captio…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: 20 pages, first three authors contributed equally; Project page: https://videotree2024.github.io/

  2. arXiv:2403.19638  [pdf, other]

    cs.CV cs.SD eess.AS

    Siamese Vision Transformers are Scalable Audio-visual Learners

    Authors: Yan-Bo Lin, Gedas Bertasius

    Abstract: Traditional audio-visual methods rely on independent audio and visual backbones, which is costly and not scalable. In this work, we investigate using an audio-visual siamese network (AVSiam) for efficient and scalable audio-visual pretraining. Our framework uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memo…

    Submitted 28 March, 2024; originally announced March 2024.

  3. arXiv:2403.13910  [pdf, other]

    cs.RO cs.GR cs.LG

    Augmented Reality Demonstrations for Scalable Robot Imitation Learning

    Authors: Yue Yang, Bryce Ikeda, Gedas Bertasius, Daniel Szafir

    Abstract: Robot Imitation Learning (IL) is a widely used method for training robots to perform manipulation tasks that involve mimicking human demonstrations to acquire skills. However, its practicality has been limited due to its requirement that users be trained in operating real robot arms to provide demonstrations. This paper presents an innovative solution: an Augmented Reality (AR)-assisted framework…

    Submitted 20 March, 2024; originally announced March 2024.

  4. arXiv:2403.08755  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    DAM: Dynamic Adapter Merging for Continual Video QA Learning

    Authors: Feng Cheng, Ziyang Wang, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius

    Abstract: We present a parameter-efficient method for continual video question-answering (VidQA) learning. Our method, named DAM, uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, (iii) handle inputs from unknown datasets during inference, and (iv) enable knowledge sharing across similar dataset domains. Give…

    Submitted 22 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: The first two authors contribute equally

  5. arXiv:2402.13250  [pdf, other]

    cs.CV

    Video ReCap: Recursive Captioning of Hour-Long Videos

    Authors: Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

    Abstract: Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process v…

    Submitted 16 May, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted by CVPR 2024

  6. arXiv:2401.10529  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

    Authors: Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less inve…

    Submitted 24 January, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

    Comments: 27 pages, 23 figures

  7. arXiv:2312.17235  [pdf, other]

    cs.CV

    A Simple LLM Framework for Long-Range Video Question-Answering

    Authors: Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

    Abstract: We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3…

    Submitted 26 February, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

  8. arXiv:2312.06729  [pdf, other]

    cs.CV

    RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

    Authors: Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius

    Abstract: Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real life videos, such as those on YouTube and AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages…

    Submitted 13 July, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: The code is released at https://github.com/Tanveer81/RGNet

  9. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  10. arXiv:2309.10091  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Unified Coarse-to-Fine Alignment for Video-Text Retrieval

    Authors: Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal

    Abstract: The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video according to the text query is often challenging as it requires the ability to reason about both high-level (scene) and low-level (object) visual clues and how they relate to the text query. To this end, we propose a Unifi…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  11. arXiv:2301.08237  [pdf, other]

    cs.CV

    LoCoNet: Long-Short Context Network for Active Speaker Detection

    Authors: Xizi Wang, Feng Cheng, Gedas Bertasius, David Crandall

    Abstract: Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. ASD reasons from audio and visual information from two contexts: long-term intra-speaker context and short-term inter-speaker context. Long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same sc…

    Submitted 29 March, 2024; v1 submitted 19 January, 2023; originally announced January 2023.

    Comments: accepted by CVPR 2024

  12. arXiv:2212.14427  [pdf, other]

    cs.CV

    Efficient Movie Scene Detection using State-Space Transformers

    Authors: Md Mohaiminul Islam, Mahmudul Hasan, Kishan Shamsundar Athrey, Tony Braskich, Gedas Bertasius

    Abstract: The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Spa…

    Submitted 21 June, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

    Comments: Accepted by CVPR 2023. Code: https://github.com/md-mohaiminul/TranS4mer

  13. arXiv:2212.07983  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS

    Vision Transformers are Parameter-Efficient Audio-Visual Learners

    Authors: Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius

    Abstract: Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of their original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visu…

    Submitted 5 April, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: CVPR 2023. Project page: https://genjib.github.io/project_page/LAVISH/

  14. arXiv:2212.05051  [pdf, other]

    cs.CV

    VindLU: A Recipe for Effective Video-and-Language Pretraining

    Authors: Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

    Abstract: The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, making the reproducibility, analysis and comparisons of these frameworks difficult. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough e…

    Submitted 5 April, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: CVPR 2023. Project page: https://klauscc.github.io/vindlu.html

  15. arXiv:2210.11006  [pdf, other]

    cs.CV

    SimpleClick: Interactive Image Segmentation with Simple Vision Transformers

    Authors: Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer

    Abstract: Click-based interactive image segmentation aims at extracting objects with limited user clicks. A hierarchical backbone is the de-facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to be a foundation model that can be finetuned for downstream…

    Submitted 11 March, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: Tech report. Update 03/11/2023: Add results on a tiny model and append supplementary materials

  16. arXiv:2208.11553  [pdf, other]

    cs.CV

    MuMUR: Multilingual Multimodal Universal Retrieval

    Authors: Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal

    Abstract: Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which is a huge manual effort. In this paper, we propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval. We first use state-of-th…

    Submitted 19 September, 2023; v1 submitted 24 August, 2022; originally announced August 2022.

    Comments: This is an extension of the previous MKTVR paper (for which you can find a reference here: https://dl.acm.org/doi/abs/10.1007/978-3-031-28244-7_42 or in a previous version on arXiv). This version was published in the Information Retrieval Journal

  17. arXiv:2207.11814  [pdf, other]

    cs.CV

    Object State Change Classification in Egocentric Videos using the Divided Space-Time Attention Mechanism

    Authors: Md Mohaiminul Islam, Gedas Bertasius

    Abstract: This report describes our submission called "TarHeels" for the Ego4D: Object State Change Classification Challenge. We use a transformer-based video recognition model and leverage the Divided Space-Time Attention mechanism for classifying object state change in egocentric videos. Our submission achieves the second-best performance in the challenge. Furthermore, we perform an ablation study to show…

    Submitted 4 January, 2023; v1 submitted 24 July, 2022; originally announced July 2022.

    Comments: 2nd place winner, Ego4D challenge, CVPR 2022

  18. arXiv:2205.05739  [pdf, other]

    cs.CV cs.AI cs.CL cs.HC cs.MA

    Learning to Retrieve Videos by Asking Questions

    Authors: Avinash Madasu, Junier Oliva, Gedas Bertasius

    Abstract: The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be sub-optimal if the initial query has ambiguities, which would lead to many falsely retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval u…

    Submitted 16 July, 2022; v1 submitted 11 May, 2022; originally announced May 2022.

    Journal ref: ACM Multimedia 2022

  19. arXiv:2204.02874  [pdf, other]

    cs.CV cs.AI cs.CL cs.SD eess.AS

    ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound

    Authors: Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius

    Abstract: We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short video retrieval (e.g., 5-15 seconds in duration), our approach aims to retrieve minute-long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost associated with processing hundreds of densely extracted frames…

    Submitted 2 August, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: ECCV 2022 Oral. Project page: https://yanbo.ml/project_page/eclipse/

  20. arXiv:2204.01692  [pdf, other]

    cs.CV

    Long Movie Clip Classification with State-Space Video Models

    Authors: Md Mohaiminul Islam, Gedas Bertasius

    Abstract: Most modern video recognition models are designed to operate on short video clips (e.g., 5-10s in length). Thus, it is challenging to apply such models to long movie understanding tasks, which typically require sophisticated long-range temporal reasoning. The recently introduced video transformers partially address this issue by using long-range temporal self-attention. However, due to the quadrat…

    Submitted 4 January, 2023; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: Accepted by ECCV 2022

  21. arXiv:2204.01680  [pdf, other]

    cs.CV

    TALLFormer: Temporal Action Localization with a Long-memory Transformer

    Authors: Feng Cheng, Gedas Bertasius

    Abstract: Most modern approaches in temporal action localization divide this problem into two parts: (i) short-term feature extraction and (ii) long-range temporal boundary localization. Due to the high GPU memory cost caused by processing long untrimmed videos, many methods sacrifice the representational power of the short-term feature extractor by either freezing the backbone or using a small spatial vide…

    Submitted 26 July, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: Accepted by ECCV 2022

  22. arXiv:2201.10990  [pdf, other]

    cs.CV

    Learning To Recognize Procedural Activities with Distant Supervision

    Authors: Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

    Abstract: In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal d…

    Submitted 16 June, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

    Comments: CVPR 2022. Code will be released here https://github.com/facebookresearch/video-distant-supervision

  23. arXiv:2106.09212  [pdf, other]

    cs.CV cs.AI

    Long-Short Temporal Contrastive Learning of Video Transformers

    Authors: Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani

    Abstract: Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only da…

    Submitted 31 March, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted in CVPR 2022

  24. arXiv:2102.05095  [pdf, other]

    cs.CV

    Is Space-Time Attention All You Need for Video Understanding?

    Authors: Gedas Bertasius, Heng Wang, Lorenzo Torresani

    Abstract: We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attentio…

    Submitted 9 June, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

    Comments: Accepted to ICML 2021

  25. arXiv:2101.12059  [pdf, other]

    cs.CV cs.CL

    VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

    Authors: Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani

    Abstract: We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language s…

    Submitted 29 January, 2021; v1 submitted 28 January, 2021; originally announced January 2021.

    Comments: Work in progress

  26. arXiv:2007.07306  [pdf, other]

    cs.CV

    COBE: Contextualized Object Embeddings from Narrated Instructional Video

    Authors: Gedas Bertasius, Lorenzo Torresani

    Abstract: Many objects in the real world undergo dramatic variations in visual appearance. For example, a tomato may be red or green, sliced or chopped, fresh or fried, liquid or solid. Training a single detector to accurately recognize tomatoes in all these different states is challenging. On the other hand, contextual cues (e.g., the presence of a knife, a cutting board, a strainer or a pan) are often str…

    Submitted 29 October, 2020; v1 submitted 14 July, 2020; originally announced July 2020.

    Comments: NeurIPS 2020

  27. arXiv:1912.04573  [pdf, other]

    cs.CV

    Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

    Authors: Gedas Bertasius, Lorenzo Torresani

    Abstract: We introduce a method for simultaneously classifying, segmenting and tracking object instances in a video sequence. Our method, named MaskProp, adapts the popular Mask R-CNN to video by adding a mask propagation branch that propagates frame-level object instance masks from each video frame to all the other frames in a video clip. This allows our system to predict clip-level instance tracks with re…

    Submitted 9 July, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

    Comments: CVPR 2020 Best Paper Nominee

  28. arXiv:1906.04016  [pdf, other]

    cs.CV

    Learning Temporal Pose Estimation from Sparsely-Labeled Videos

    Authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

    Abstract: Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pa…

    Submitted 11 December, 2019; v1 submitted 6 June, 2019; originally announced June 2019.

    Comments: Accepted to NeurIPS 2019

  29. arXiv:1904.05410  [pdf, other]

    cs.CV

    Attentive Action and Context Factorization

    Authors: Yang Wang, Vinh Tran, Gedas Bertasius, Lorenzo Torresani, Minh Hoai

    Abstract: We propose a method for human action recognition, one that can localize the spatiotemporal regions that 'define' the actions. This is a challenging task due to the subtlety of human actions in video and the co-occurrence of contextual elements. To address this challenge, we utilize conjugate samples of human actions, which are video clips that are contextually similar to human action samples but d…

    Submitted 10 April, 2019; originally announced April 2019.

    Comments: 10 pages, 6 figures

  30. arXiv:1812.04172  [pdf, other]

    cs.CV

    Learning Discriminative Motion Features Through Detection

    Authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

    Abstract: Despite huge success in the image domain, modern detection models such as Faster R-CNN have not been used nearly as much for video analysis. This is arguably due to the fact that detection models are designed to operate on single frames and as a result do not have a mechanism for learning motion representations directly from video. We propose a learning procedure that allows detection models such…

    Submitted 10 December, 2018; originally announced December 2018.

  31. arXiv:1803.05549  [pdf, other]

    cs.CV

    Object Detection in Video with Spatiotemporal Sampling Networks

    Authors: Gedas Bertasius, Lorenzo Torresani, Jianbo Shi

    Abstract: We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos. Our STSN performs object detection in a video frame by learning to spatially sample features from the adjacent frames. This naturally renders the approach robust to occlusion or motion blur in individual frames. Our framework does not require additional supervision, as…

    Submitted 24 July, 2018; v1 submitted 14 March, 2018; originally announced March 2018.

  32. arXiv:1803.01413  [pdf, other]

    cs.CV

    Egocentric Basketball Motion Planning from a Single First-Person Image

    Authors: Gedas Bertasius, Aaron Chan, Jianbo Shi

    Abstract: We present a model that uses a single first-person image to generate an egocentric basketball motion sequence in the form of a 12D camera configuration trajectory, which encodes a player's 3D location and 3D head orientation throughout the sequence. To do this, we first introduce a future convolutional neural network (CNN) that predicts an initial sequence of 12D camera configurations, aiming to c…

    Submitted 4 March, 2018; originally announced March 2018.

    Comments: CVPR 2018

  33. arXiv:1709.01630  [pdf, other]

    cs.CV

    Using Cross-Model EgoSupervision to Learn Cooperative Basketball Intention

    Authors: Gedas Bertasius, Jianbo Shi

    Abstract: We present a first-person method for cooperative basketball intention prediction: we predict with whom the camera wearer will cooperate in the near future from unlabeled first-person images. This is a challenging task that requires inferring the camera wearer's visual attention, and decoding the social cues of other players. Our key observation is that a first-person view provides strong cues to i…

    Submitted 5 September, 2017; originally announced September 2017.

  34. arXiv:1611.05365  [pdf, other]

    cs.CV

    Am I a Baller? Basketball Performance Assessment from First-Person Videos

    Authors: Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

    Abstract: This paper presents a method to assess a basketball player's performance from his/her first-person video. A key challenge lies in the fact that the evaluation metric is highly subjective and specific to a particular evaluator. We leverage the first-person camera to address this challenge. The spatiotemporal visual semantics provided by a first-person view allow us to reason about the camera weare…

    Submitted 2 August, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

  35. arXiv:1611.05335  [pdf, other]

    cs.CV

    Unsupervised Learning of Important Objects from First-Person Videos

    Authors: Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

    Abstract: A first-person camera, placed at a person's head, captures which objects are important to the camera wearer. Most prior methods for this task learn to detect such important objects from the manually labeled first-person data in a supervised fashion. However, important objects are strongly related to the camera wearer's internal state such as his intentions and attention, and thus, only the person…

    Submitted 2 August, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

  36. arXiv:1605.07686  [pdf, other]

    cs.CV

    Local Perturb-and-MAP for Structured Prediction

    Authors: Gedas Bertasius, Qiang Liu, Lorenzo Torresani, Jianbo Shi

    Abstract: Conditional random fields (CRFs) provide a powerful tool for structured prediction, but cast significant challenges in both the learning and inference steps. Approximation techniques are widely used in both steps, which should be considered jointly to guarantee good performance (a.k.a. "inferning"). Perturb-and-MAP models provide a promising alternative to CRFs, but require global combinatorial op…

    Submitted 13 October, 2016; v1 submitted 24 May, 2016; originally announced May 2016.

  37. arXiv:1605.07681  [pdf, other]

    cs.CV

    Convolutional Random Walk Networks for Semantic Image Segmentation

    Authors: Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, Jianbo Shi

    Abstract: Most current semantic segmentation methods rely on fully convolutional networks (FCNs). However, their use of large receptive fields and many pooling layers causes low spatial resolution inside the deep layers. This leads to predictions with poor localization around the boundaries. Prior work has attempted to address this issue by post-processing predictions with CRFs or MRFs. But such models often…

    Submitted 8 May, 2017; v1 submitted 24 May, 2016; originally announced May 2016.

  38. arXiv:1603.04908  [pdf, other]

    cs.CV

    First Person Action-Object Detection with EgoNet

    Authors: Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

    Abstract: Unlike traditional third-person cameras mounted on robots, a first-person camera captures a person's visual sensorimotor object interactions from up close. In this paper, we study the tight interplay between our momentary visual attention and motor action with objects from a first-person camera. We propose a concept of action-objects: the objects that capture a person's conscious visual (watching…

    Submitted 10 June, 2017; v1 submitted 15 March, 2016; originally announced March 2016.

  39. arXiv:1511.02682  [pdf, other]

    cs.CV

    Exploiting Egocentric Object Prior for 3D Saliency Detection

    Authors: Gedas Bertasius, Hyun Soo Park, Jianbo Shi

    Abstract: On a minute-to-minute basis people undergo numerous fluid interactions with objects that barely register on a conscious level. Recent neuroscientific research demonstrates that humans have a fixed size prior for salient objects. This suggests that a salient object in 3D undergoes a consistent transformation such that people's visual system perceives it with an approximately fixed size. This findin…

    Submitted 9 November, 2015; originally announced November 2015.

  40. arXiv:1511.02674  [pdf, other]

    cs.CV

    Semantic Segmentation with Boundary Neural Fields

    Authors: Gedas Bertasius, Jianbo Shi, Lorenzo Torresani

    Abstract: The state-of-the-art in semantic segmentation is currently represented by fully convolutional networks (FCNs). However, FCNs use large receptive fields and many pooling layers, both of which cause blurring and low spatial resolution in the deep layers. As a result FCNs tend to produce segmentations that are poorly localized around object boundaries. Prior work has attempted to address this issue i…

    Submitted 24 May, 2016; v1 submitted 9 November, 2015; originally announced November 2015.

  41. arXiv:1504.06201  [pdf, other]

    cs.CV

    High-for-Low and Low-for-High: Efficient Boundary Detection from Deep Object Features and its Applications to High-Level Vision

    Authors: Gedas Bertasius, Jianbo Shi, Lorenzo Torresani

    Abstract: Most of the current boundary detection systems rely exclusively on low-level features, such as color and texture. However, perception studies suggest that humans employ object-level reasoning when judging if a particular pixel is a boundary. Inspired by this observation, in this work we show how to predict boundaries by exploiting object-level features from a pretrained object-classification netwo…

    Submitted 21 September, 2015; v1 submitted 23 April, 2015; originally announced April 2015.

  42. arXiv:1412.1123  [pdf, other]

    cs.CV

    DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection

    Authors: Gedas Bertasius, Jianbo Shi, Lorenzo Torresani

    Abstract: Contour detection has been a fundamental component in many image segmentation and object detection systems. Most previous work utilizes low-level features such as texture or saliency to detect contours and then uses them as cues for a higher-level task such as object detection. However, we claim that recognizing objects and predicting contours are two mutually related tasks. Contrary to traditional…

    Submitted 23 April, 2015; v1 submitted 2 December, 2014; originally announced December 2014.

    Comments: Accepted to CVPR 2015