
Showing 1–10 of 10 results for author: Yacoob, Y

Searching in archive cs.
  1. arXiv:2408.15998  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

    Authors: Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu

    Abstract: The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vis…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: Github: https://github.com/NVlabs/Eagle, HuggingFace: https://huggingface.co/NVEagle

  2. arXiv:2406.02951  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

    Authors: Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj

    Abstract: With the rapid growth in deepfake video content, we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cu…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted to CVPR 2024

  3. arXiv:2311.10774  [pdf, other]

    cs.CL cs.AI

    MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

    Authors: Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, Dong Yu

    Abstract: With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instr…

    Submitted 15 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted to NAACL 2024

  4. arXiv:2310.14566  [pdf, other]

    cs.CV cs.CL

    HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    Authors: Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou

    Abstract: We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129…

    Submitted 25 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted to CVPR 2024

  5. arXiv:2306.14565  [pdf, other]

    cs.CV cs.AI cs.CE cs.CL cs.MM

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Authors: Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, Lijuan Wang

    Abstract: Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visua…

    Submitted 19 March, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

    Comments: 40 pages, 32 figures, ICLR 2024

  6. arXiv:2302.07919  [pdf, other]

    cs.CV cs.AI cs.MM

    COVID-VTS: Fact Extraction and Verification on Short Video Platforms

    Authors: Fuxiao Liu, Yaser Yacoob, Abhinav Shrivastava

    Abstract: We introduce a new benchmark, COVID-VTS, for fact-checking multi-modal information involving short-duration videos with COVID-19-focused information from both the real world and machine generation. We propose TwtrDetective, an effective model incorporating cross-media consistency checking to detect token-level malicious tampering in different modalities and generate explanations. Due to the scar…

    Submitted 15 February, 2023; originally announced February 2023.

    Comments: 11 pages, 5 figures, accepted to EACL 2023

  7. arXiv:2302.07848  [pdf, other]

    cs.CV

    One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2

    Authors: Trevine Oorloff, Yaser Yacoob

    Abstract: While recent research has progressively overcome the low-resolution constraint of one-shot face video re-enactment with the help of StyleGAN's high-fidelity portrait generation, these approaches rely on at least one of the following: explicit 2D/3D priors, optical-flow-based warping as motion descriptors, off-the-shelf encoders, etc., which constrain their performance (e.g., inconsistent predictio…

    Submitted 15 February, 2023; originally announced February 2023.

    Comments: The project page is located at https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io/

  8. arXiv:2203.14512  [pdf, other]

    cs.CV

    Expressive Talking Head Video Encoding in StyleGAN2 Latent-Space

    Authors: Trevine Oorloff, Yaser Yacoob

    Abstract: While the recent advances in research on video reenactment have yielded promising results, the approaches fall short in capturing the fine, detailed, and expressive facial features (e.g., lip-pressing, mouth puckering, mouth gaping, and wrinkles) which are crucial in generating realistic animated face videos. To this end, we propose an end-to-end expressive face video encoding approach that facili…

    Submitted 14 February, 2023; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: The project page is located at https://trevineoorloff.github.io/ExpressiveFaceVideoEncoding.io/

  9. arXiv:1709.01993  [pdf, other]

    cs.CV

    Label Denoising Adversarial Network (LDAN) for Inverse Lighting of Face Images

    Authors: Hao Zhou, Jin Sun, Yaser Yacoob, David W. Jacobs

    Abstract: Lighting estimation from face images is an important task and has applications in many areas such as image editing, intrinsic image decomposition, and image forgery detection. We propose to train a deep Convolutional Neural Network (CNN) to regress lighting parameters from a single face image. Lacking massive ground truth lighting labels for face images in the wild, we use an existing method to es…

    Submitted 6 September, 2017; originally announced September 2017.

  10. arXiv:1512.06075  [pdf, other]

    cs.CV

    Modeling Colors of Single Attribute Variations with Application to Food Appearance

    Authors: Yaser Yacoob

    Abstract: This paper considers the intra-image color-space of an object or a scene when these are subject to a dominant single-source of variation. The source of variation can be intrinsic or extrinsic (i.e., imaging conditions) to the object. We observe that the quantized colors for such objects typically lie on a planar subspace of RGB, and in some cases linear or polynomial curves on this plane are effec…

    Submitted 18 December, 2015; originally announced December 2015.

    Comments: 9 pages. The paper does not reference recent food-classification papers; it is intended for a wider scope.