Showing 1–12 of 12 results for author: Ayyubi, H

  1. arXiv:2405.11145

    cs.CV cs.AI cs.MM

    Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

    Authors: Junzhang Liu, Zhecan Wang, Hammad Ayyubi, Haoxuan You, Chris Thomas, Rui Sun, Shih-Fu Chang, Kai-Wei Chang

    Abstract: Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context. Training models on such data fosters biased learning and hallucinations as models te…

    Submitted 25 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

  2. arXiv:2403.18600

    cs.CV cs.AI cs.RO

    RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

    Authors: Ali Zare, Yulei Niu, Hammad Ayyubi, Shih-fu Chang

    Abstract: Procedure Planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite the rapid progress in this task, there remain several critical challenges to be solved: (1) Adaptive procedures: Prior works hold an unrealistic assumption that the number of action steps is known and fixed, leading to non-generalizable mod…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: 23 pages, 6 figures, 12 tables

  3. arXiv:2312.02188

    cs.CV cs.AI cs.CL cs.MM

    Video Summarization: Towards Entity-Aware Captions

    Authors: Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang, Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang

    Abstract: Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. As such, we propose the task of summarizing news video directly to entity-aware captions. We also release a large-scale…

    Submitted 1 December, 2023; originally announced December 2023.

  4. arXiv:2305.17540

    cs.CV cs.CL

    Learning from Children: Improving Image-Caption Pretraining via Curriculum

    Authors: Hammad A. Ayyubi, Rahul Lokesh, Alireza Zareian, Bo Wu, Shih-Fu Chang

    Abstract: Image-caption pretraining has been quite successfully used for downstream vision tasks like zero-shot image classification and object detection. However, image-caption pretraining is still a hard problem -- it requires multiple concepts (nouns) from captions to be aligned to several objects in images. To tackle this problem, we go to the roots -- the best learner, children. We take inspiration fro…

    Submitted 30 May, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: ACL Findings 2023

  5. arXiv:2305.14985

    cs.CV cs.CL

    IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

    Authors: Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

    Abstract: The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 13 pages, 5 figures

  6. arXiv:2210.12444

    cs.CV cs.AI cs.CL cs.MM

    Weakly-Supervised Temporal Article Grounding

    Authors: Long Chen, Yulei Niu, Brian Chen, Xudong Lin, Guangxing Han, Christopher Thomas, Hammad Ayyubi, Heng Ji, Shih-Fu Chang

    Abstract: Given a long untrimmed video and natural language queries, video grounding (VG) aims to temporally localize the semantically-aligned video segments. Almost all existing VG work holds two simple but unrealistic assumptions: 1) All query sentences can be grounded in the corresponding video. 2) All query sentences for the same video are always at the same semantic scale. Unfortunately, both assumptio…

    Submitted 23 February, 2023; v1 submitted 22 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022, https://github.com/zjuchenlong/WSAG

  7. arXiv:2206.07207

    cs.CV cs.CL

    Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities

    Authors: Hammad A. Ayyubi, Christopher Thomas, Lovish Chum, Rahul Lokesh, Long Chen, Yulei Niu, Xudong Lin, Xuande Feng, Jaywon Koo, Sounak Ray, Shih-Fu Chang

    Abstract: Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across textual and visual (video) domains are identical (via grounding) and thus, on the same semantic level. However, grounding fails to capture the intric…

    Submitted 19 December, 2023; v1 submitted 14 June, 2022; originally announced June 2022.

    Comments: AAAI 2024

  8. arXiv:2004.02032

    cs.AI cs.CL cs.CV

    Generating Rationales in Visual Question Answering

    Authors: Hammad A. Ayyubi, Md. Mehrab Tanjim, Julian J. McAuley, Garrison W. Cottrell

    Abstract: Despite recent advances in Visual Question Answering (VQA), it remains a challenge to determine how much success can be attributed to sound reasoning and comprehension ability. We seek to investigate this question by proposing a new task of rationale generation. Essentially, we task a VQA model with generating rationales for the answers it predicts. We use data from the Visual Commonsense Reasoning…

    Submitted 4 April, 2020; originally announced April 2020.

  9. arXiv:2003.03695

    cs.LG cs.NE stat.ML

    Progressive Growing of Neural ODEs

    Authors: Hammad A. Ayyubi, Yi Yao, Ajay Divakaran

    Abstract: Neural Ordinary Differential Equations (NODEs) have proven to be a powerful modeling tool for approximating (interpolation) and forecasting (extrapolation) irregularly sampled time series data. However, their performance degrades substantially when applied to real-world data, especially long-term data with complex behaviors (e.g., long-term trend across years, mid-term seasonality across months, a…

    Submitted 7 March, 2020; originally announced March 2020.

    Journal ref: ICLR Workshop on Neural Networks and Differential Equations, 2020

  10. arXiv:1910.11124

    cs.CV cs.AI cs.CL cs.LG

    Enforcing Reasoning in Visual Commonsense Reasoning

    Authors: Hammad A. Ayyubi, Md. Mehrab Tanjim, David J. Kriegman

    Abstract: The task of Visual Commonsense Reasoning is extremely challenging in the sense that the model has to not only be able to answer a question given an image, but also be able to learn to reason. The baselines introduced in this task are quite limiting because two networks are trained for predicting answers and rationales separately. Question and image are used as input to train answer prediction netwo…

    Submitted 27 December, 2019; v1 submitted 20 October, 2019; originally announced October 2019.

  11. arXiv:1910.09638

    cs.LG cs.CV

    GANspection

    Authors: Hammad A. Ayyubi

    Abstract: Generative Adversarial Networks (GANs) have been used extensively and quite successfully for unsupervised learning. As GANs don't approximate an explicit probability distribution, it's an interesting study to inspect the latent space representations learned by GANs. The current work seeks to push the boundaries of such inspection methods to further understand in more detail the manifold being lear…

    Submitted 21 October, 2019; originally announced October 2019.

  12. arXiv:1903.07563

    cs.CV

    Human Activity Recognition for Edge Devices

    Authors: Manjot Bilkhu, Hammababdullah Ayyubi

    Abstract: Video activity recognition has recently gained a lot of momentum with the release of the massive Kinetics (400 and 600) datasets. Architectures such as I3D and C3D networks have shown state-of-the-art performance for activity recognition. The one major pitfall with these state-of-the-art networks is that they require a lot of compute. In this paper we explore how we can achieve comparable results to thes…

    Submitted 18 March, 2019; originally announced March 2019.

    Comments: 7 pages, 8 figures