Showing 1–18 of 18 results for author: Sener, F

  1. arXiv:2408.09919

    cs.CV

    Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment

    Authors: Zhanzhong Pang, Fadime Sener, Shrinivas Ramasubramanian, Angela Yao

    Abstract: Procedural activity videos often exhibit a long-tailed action distribution due to varying action frequencies and durations. However, state-of-the-art temporal action segmentation methods overlook the long tail and fail to recognize tail actions. Existing long-tail methods make class-independent assumptions and struggle to identify tail classes when applied to temporal segmentation frameworks. This…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

  2. arXiv:2403.19811

    cs.CV

    X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

    Authors: Anna Kukleva, Fadime Sener, Edoardo Remelli, Bugra Tekin, Eric Sauser, Bernt Schiele, Shugao Ma

    Abstract: Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos has been largely unexplored. To address this gap, we propose a simple yet effective cross-modal adaptation framework, which we call X-MIC. Using a video adapter, o…

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  3. arXiv:2403.17827

    cs.CV cs.AI cs.GR cs.LG

    DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions

    Authors: Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, Bugra Tekin

    Abstract: Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. We propose DiffH2O, a novel method to synthesize realistic, one or two-handed object interactions f…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Project Page: https://diffh2o.github.io/

  4. arXiv:2403.09805

    cs.CV cs.LG

    On the Utility of 3D Hand Poses for Action Recognition

    Authors: Md Salman Shamil, Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao

    Abstract: 3D hand pose is an underexplored modality for action recognition. Poses are compact yet informative and can greatly benefit applications with limited compute budgets. However, poses alone offer an incomplete understanding of actions, as they cannot fully capture objects and environments with which humans interact. We propose HandFormer, a novel multimodal transformer, to efficiently model hand-obj…

    Submitted 14 August, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

    Comments: ECCV 2024; https://s-shamil.github.io/HandFormer/

  5. arXiv:2308.11488

    cs.CV

    Opening the Vocabulary of Egocentric Actions

    Authors: Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao

    Abstract: Human actions in egocentric videos are often hand-object interactions composed from a verb (performed by the hand) applied to an object. Despite their extensive scaling up, egocentric datasets still face two limitations - sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects obs…

    Submitted 12 December, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: NeurIPS 2023 camera ready; https://dibschat.github.io/openvocab-egoAR/

  6. arXiv:2307.16453

    cs.AI cs.LO

    Every Mistake Counts in Assembly

    Authors: Guodong Ding, Fadime Sener, Shugao Ma, Angela Yao

    Abstract: One promising use case of AI assistants is to help with complex procedures like cooking, home repair, and assembly tasks. Can we teach the assistant to interject after the user makes a mistake? This paper targets the problem of identifying ordering mistakes in assembly procedures. We propose a system that can detect ordering mistakes by utilizing a learned knowledge base. Our framework constructs…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: 10 pages, 5 figures

  7. arXiv:2304.12301

    cs.CV

    AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation

    Authors: Takehiko Ohkawa, Kun He, Fadime Sener, Tomas Hodan, Luan Tran, Cem Keskin

    Abstract: We present AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. The dataset includes synchronized egocentric and exocentric images sampled from the recent Assembly101 dataset, in which participants assemble and disassemble take-apart toys. To obtain high-quality 3D hand pos…

    Submitted 24 April, 2023; originally announced April 2023.

    Comments: CVPR 2023. Project page: https://assemblyhands.github.io/

  8. arXiv:2210.10352

    cs.CV

    Temporal Action Segmentation: An Analysis of Modern Techniques

    Authors: Guodong Ding, Fadime Sener, Angela Yao

    Abstract: Temporal action segmentation (TAS) in videos aims at densely identifying video frames in minutes-long videos with multiple action classes. As a long-range video understanding task, researchers have developed an extended collection of methods and examined their performance using various benchmarks. Despite the rapid growth of TAS techniques in recent years, no systematic survey has been conducted i…

    Submitted 21 October, 2023; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: 19 pages, 9 figures, 8 tables, TPAMI 2023

  9. arXiv:2203.14712

    cs.CV

    Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities

    Authors: Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, Angela Yao

    Abstract: Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants work without fixed instructions, and the sequences feature rich and natural variations in action ordering, mistakes, and corrections. Assembly101 is the first multi-view action dataset, with simultaneous static (8) and egocentric (4) recordings.…

    Submitted 1 May, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: CVPR 2022, https://assembly-101.github.io/

  10. arXiv:2106.03162

    cs.CV

    Transformed ROIs for Capturing Visual Transformations in Videos

    Authors: Abhinav Rai, Fadime Sener, Angela Yao

    Abstract: Modeling the visual changes that an action brings to a scene is critical for video understanding. Currently, CNNs process one local neighbourhood at a time, thus contextual relationships over longer ranges, while still learnable, are indirect. We present TROI, a plug-and-play module for CNNs to reason between mid-level feature representations that are otherwise separated in space and time. The mod…

    Submitted 5 November, 2022; v1 submitted 6 June, 2021; originally announced June 2021.

    Comments: CVIU 2022 - Computer Vision and Image Understanding

  11. arXiv:2106.03158

    cs.CV

    Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions

    Authors: Fadime Sener, Rishabh Saraf, Angela Yao

    Abstract: Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent…

    Submitted 5 November, 2022; v1 submitted 6 June, 2021; originally announced June 2021.

    Comments: TPAMI 2022. arXiv admin note: text overlap with arXiv:1812.02501

  12. arXiv:2106.03152

    cs.CV

    Technical Report: Temporal Aggregate Representations

    Authors: Fadime Sener, Dibyadip Chatterjee, Angela Yao

    Abstract: This technical report extends our work presented in [9] with more experiments. In [9], we tackle long-term video understanding, which requires reasoning from current and past or future observations and raises several fundamental questions. How should temporal or sequential relationships be modelled? What temporal extent of information and context needs to be processed? At what temporal scale shoul…

    Submitted 15 June, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

  13. arXiv:2006.00830

    cs.CV

    Temporal Aggregate Representations for Long-Range Video Understanding

    Authors: Fadime Sener, Dipika Singhania, Angela Yao

    Abstract: Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as ma…

    Submitted 30 July, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

    Comments: ECCV 2020, European Conference on Computer Vision

  14. arXiv:1904.04189

    cs.CV

    Unsupervised learning of action classes with continuous temporal embedding

    Authors: Anna Kukleva, Hilde Kuehne, Fadime Sener, Juergen Gall

    Abstract: The task of temporally detecting and segmenting actions in untrimmed videos has seen increased attention recently. One problem in this context arises from the need to define and label action boundaries to create annotations for training, which is very time- and cost-intensive. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. To…

    Submitted 8 April, 2019; originally announced April 2019.

    Comments: CVPR 2019

  15. arXiv:1812.03570

    cs.CV

    Learning Style Compatibility for Furniture

    Authors: Divyansh Aggarwal, Elchin Valiyev, Fadime Sener, Angela Yao

    Abstract: When judging style, a key question that often arises is whether or not a pair of objects are compatible with each other. In this paper we investigate how Siamese networks can be used efficiently for assessing the style compatibility between images of furniture items. We show that the middle layers of pretrained CNNs can capture essential information about furniture style, which allows for efficien…

    Submitted 9 December, 2018; originally announced December 2018.

    Comments: German Conference on Pattern Recognition (GCPR)

  16. arXiv:1812.02501

    cs.CV cs.LG

    Zero-Shot Anticipation for Instructional Activities

    Authors: Fadime Sener, Angela Yao

    Abstract: How can we teach a robot to predict what will happen next for an activity it has never seen before? We address this problem of zero-shot anticipation by presenting a hierarchical model that generalizes instructional knowledge from large-scale text-corpora and transfers the knowledge to the visual domain. Given a portion of an instructional video, our model predicts coherent and plausible actions m…

    Submitted 20 October, 2019; v1 submitted 6 December, 2018; originally announced December 2018.

    Comments: ICCV 2019

  17. arXiv:1803.09490

    cs.CV

    Unsupervised Learning and Segmentation of Complex Activities from Video

    Authors: Fadime Sener, Angela Yao

    Abstract: This paper presents a new method for unsupervised segmentation of complex activities from video into multiple steps, or sub-activities, without any textual input. We propose an iterative discriminative-generative approach which alternates between discriminatively learning the appearance of sub-activities from the videos' visual features to sub-activity labels and generatively modelling the tempora…

    Submitted 26 March, 2018; originally announced March 2018.

    Comments: CVPR 2018 Accepted Manuscript

  18. arXiv:1704.03057

    cs.CV

    DRAW: Deep networks for Recognizing styles of Artists Who illustrate children's books

    Authors: Samet Hicsonmez, Nermin Samet, Fadime Sener, Pinar Duygulu

    Abstract: This paper is motivated by a young boy's capability to recognize an illustrator's style in a totally different context. In the book "We are All Born Free" [1], composed of selected rights from the Universal Declaration of Human Rights interpreted by different illustrators, the boy was surprised to see a picture similar to the ones in the "Winnie the Witch" series drawn by Korky Paul (Figure 1).…

    Submitted 10 April, 2017; originally announced April 2017.

    Comments: ACM ICMR 2017