Showing 1–8 of 8 results for author: Bakr, E M

Searching in archive cs.
  1. arXiv:2406.06211  [pdf, other]

    cs.CV

    iMotion-LLM: Motion Prediction Instruction Tuning

    Authors: Abdulwahab Felemban, Eslam Mohamed Bakr, Xiaoqian Shen, Jian Ding, Abduallah Mohamed, Mohamed Elhoseiny

    Abstract: We introduce iMotion-LLM: a Multimodal Large Language Model (LLM) with trajectory prediction, tailored to guide interactive multi-agent scenarios. Different from conventional motion prediction approaches, iMotion-LLM capitalizes on textual instructions as key inputs for generating contextually relevant trajectories. By enriching the real-world driving scenarios in the Waymo Open Dataset with tex…

    Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  2. arXiv:2405.18937  [pdf, other]

    cs.CV cs.CL

    Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding

    Authors: Junjie Fei, Mahmoud Ahmed, Jian Ding, Eslam Mohamed Bakr, Mohamed Elhoseiny

    Abstract: While 3D MLLMs have achieved significant progress, they are restricted to object and scene understanding and struggle to understand 3D spatial structures at the part level. In this paper, we introduce Kestrel, representing a novel approach that empowers 3D MLLMs with part-aware understanding, enabling better interpretation and segmentation grounding of 3D objects at the part level. Despite its sig…

    Submitted 29 May, 2024; originally announced May 2024.

  3. arXiv:2311.14542  [pdf, other]

    cs.CV

    ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model

    Authors: Eslam Mohamed Bakr, Liangbing Zhao, Vincent Tao Hu, Matthieu Cord, Patrick Perez, Mohamed Elhoseiny

    Abstract: Diffusion-based generative models excel in perceptually impressive synthesis but face challenges in interpretability. This paper introduces ToddlerDiffusion, an interpretable 2D diffusion image-synthesis framework inspired by the human generation system. Unlike traditional diffusion models with opaque denoising steps, our approach decomposes the generation process into simpler, interpretable stage…

    Submitted 24 November, 2023; originally announced November 2023.

  4. arXiv:2310.06214  [pdf, other]

    cs.CV

    CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

    Authors: Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny

    Abstract: 3D visual grounding is the ability to localize objects in 3D scenes conditioned on utterances. Most existing methods devote the referring head to localizing the referred object directly, causing failure in complex scenarios. In addition, this approach does not illustrate how and why the network reaches the final decision. In this paper, we address this question: Can we design an interpretable 3D visual groundin…

    Submitted 20 April, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  5. arXiv:2304.05390  [pdf, other]

    cs.CV cs.AI cs.LG

    HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models

    Authors: Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, Mohamed Elhoseiny

    Abstract: In recent years, Text-to-Image (T2I) models have been extensively studied, especially with the emergence of diffusion models that achieve state-of-the-art results on T2I synthesis tasks. However, existing benchmarks heavily rely on subjective human evaluation, limiting their ability to holistically assess the model's capabilities. Furthermore, there is a significant gap between efforts in developi…

    Submitted 23 November, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

    Comments: ICCV 2023

  6. arXiv:2304.04874  [pdf, other]

    cs.CV cs.AI cs.LG

    ImageCaptioner$^2$: Image Captioner for Image Captioning Bias Amplification Assessment

    Authors: Eslam Mohamed Bakr, Pengzhan Sun, Li Erran Li, Mohamed Elhoseiny

    Abstract: Most pre-trained learning systems are known to suffer from bias, which typically emerges from the data, the model, or both. Measuring and quantifying bias and its sources is a challenging task and has been extensively studied in image captioning. Despite the significant effort in this direction, we observed that existing metrics lack consistency in the inclusion of the visual signal. In this paper…

    Submitted 5 June, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

  7. arXiv:2211.14241  [pdf, other]

    cs.CV

    Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding

    Authors: Eslam Mohamed Bakr, Yasmeen Alsaedy, Mohamed Elhoseiny

    Abstract: The 3D visual grounding task has been explored with visual and language streams comprehending referential language to identify target objects in 3D scenes. However, most existing methods devote the visual stream to capturing the 3D visual clues using off-the-shelf point cloud encoders. The main question we address in this paper is "can we consolidate the 3D visual stream by 2D clues synthesized f…

    Submitted 25 November, 2022; originally announced November 2022.

    Journal ref: NeurIPS 2022

  8. arXiv:2211.07521  [pdf, other]

    cs.CV

    PKCAM: Previous Knowledge Channel Attention Module

    Authors: Eslam Mohamed Bakr, Ahmad El Sallab, Mohsen A. Rashwan

    Abstract: Recently, attention mechanisms have been explored with ConvNets, both across the spatial and channel dimensions. However, to our knowledge, all the existing methods devote the attention modules to capturing local interactions at a single scale. In this paper, we propose a Previous Knowledge Channel Attention Module (PKCAM) that captures channel-wise relations across different layers to model the gl…

    Submitted 25 November, 2022; v1 submitted 14 November, 2022; originally announced November 2022.