Showing 1–5 of 5 results for author: Kittenplon, Y

  1. arXiv:2406.08255

    cs.CL

    M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

    Authors: Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari

    Abstract: Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: NAACL 2024, dataset at https://github.com/amazon-science/m3t-multi-modal-translation-bench

  2. arXiv:2402.05472

    cs.CV

    Question Aware Vision Transformer for Multimodal Reasoning

    Authors: Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman

    Abstract: Vision-Language (VL) models have gained significant research focus, enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder, a Large Language Model (LLM), and a projection module that aligns visual features with the LLM's representation space. Despite their success, a critical limitation persists: the vision encoding process remains decoupled f…

    Submitted 8 February, 2024; originally announced February 2024.

  3. arXiv:2301.07389

    cs.CV cs.LG

    Towards Models that Can See and Read

    Authors: Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, Ron Litman

    Abstract: Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth anal…

    Submitted 21 March, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

  4. arXiv:2202.05508

    cs.CV cs.CL cs.LG

    Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer

    Authors: Yair Kittenplon, Inbal Lavi, Sharon Fogel, Yarin Bar, R. Manmatha, Pietro Perona

    Abstract: Text spotting end-to-end methods have recently gained attention in the literature due to the benefits of jointly optimizing the text detection and recognition components. Existing methods usually have a distinct separation between the detection and recognition branches, requiring exact annotations for the two tasks. We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting…

    Submitted 14 February, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

  5. arXiv:2011.10147

    cs.CV cs.LG

    FlowStep3D: Model Unrolling for Self-Supervised Scene Flow Estimation

    Authors: Yair Kittenplon, Yonina C. Eldar, Dan Raviv

    Abstract: Estimating the 3D motion of points in a scene, known as scene flow, is a core problem in computer vision. Traditional learning-based methods designed to learn end-to-end 3D flow often suffer from poor generalization. Here we present a recurrent architecture that learns a single step of an unrolled iterative alignment procedure for refining scene flow predictions. Inspired by classical algorithms,…

    Submitted 4 April, 2021; v1 submitted 19 November, 2020; originally announced November 2020.

    Comments: CVPR 2021