Showing 1–23 of 23 results for author: Litman, R

  1. arXiv:2407.12594

    cs.CV

    VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

    Authors: Ofir Abramovich, Niv Nayman, Sharon Fogel, Inbal Lavi, Ron Litman, Shahar Tsiper, Royee Tichauer, Srikar Appalaraju, Shai Mazor, R. Manmatha

    Abstract: In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component can either be extracted explicitly with the use of external OCR models in OCR-based approaches, or alternatively, the vision model can be endowed with reading capabilities in OCR-free approaches.…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: 32 pages, 18 figures

  2. arXiv:2406.08255

    cs.CL

    M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

    Authors: Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari

    Abstract: Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: NAACL 2024, dataset at https://github.com/amazon-science/m3t-multi-modal-translation-bench

  3. arXiv:2402.05472

    cs.CV

    Question Aware Vision Transformer for Multimodal Reasoning

    Authors: Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman

    Abstract: Vision-Language (VL) models have gained significant research focus, enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder, a Large Language Model (LLM), and a projection module that aligns visual features with the LLM's representation space. Despite their success, a critical limitation persists: the vision encoding process remains decoupled f…

    Submitted 8 February, 2024; originally announced February 2024.

  4. arXiv:2401.03411

    cs.CL cs.CV

    GRAM: Global Reasoning for Multi-Page VQA

    Authors: Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman

    Abstract: The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting, without requiring computation…

    Submitted 18 March, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

  5. arXiv:2301.07464

    cs.CV cs.LG

    CLIPTER: Looking at the Bigger Picture in Scene Text Recognition

    Authors: Aviad Aberdam, David Bensaïd, Alona Golts, Roy Ganz, Oren Nuriel, Royee Tichauer, Shai Mazor, Ron Litman

    Abstract: Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text. However, current scene text recognizers are unaware of the bigger picture as they operate on cropped text images. In this study, we harness the representative capabilities of modern vision-language models, such as CLIP, to provide scene-level information to…

    Submitted 23 July, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

    Comments: Accepted for publication by ICCV 2023

  6. arXiv:2301.07389

    cs.CV cs.LG

    Towards Models that Can See and Read

    Authors: Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, Ron Litman

    Abstract: Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth anal…

    Submitted 21 March, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

  7. arXiv:2209.06717

    cs.CV

    Out-of-Vocabulary Challenge Report

    Authors: Sergi Garcia-Bordils, Andrés Mafla, Ali Furkan Biten, Oren Nuriel, Aviad Aberdam, Shai Mazor, Ron Litman, Dimosthenis Karatzas

    Abstract: This paper presents final results of the Out-Of-Vocabulary 2022 (OOV) challenge. The OOV contest introduces an important aspect that is not commonly studied by Optical Character Recognition (OCR) models, namely, the recognition of unseen scene text instances at training time. The competition compiles a collection of public scene text datasets comprising of 326,385 images with 4,864,405 scene text…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: To be appeared in Text In Everything Workshop in ECCV 2022

  8. arXiv:2205.03873

    cs.CV cs.AI cs.LG

    Multimodal Semi-Supervised Learning for Text Recognition

    Authors: Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman

    Abstract: Until recently, the number of public real-world text images was insufficient for training scene text recognizers. Therefore, most modern training methods rely on synthetic data and operate in a fully supervised manner. Nevertheless, the amount of public real-world text images has increased significantly lately, including a great deal of unlabeled data. Leveraging these resources requires semi-supe…

    Submitted 8 May, 2022; originally announced May 2022.

    Comments: Code will be published upon publication

  9. arXiv:2112.12494

    cs.CV

    LaTr: Layout-Aware Transformer for Scene-Text VQA

    Authors: Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha

    Abstract: We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to reason over different modalities. Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout information. Accounting for this, we propose a single…

    Submitted 24 December, 2021; v1 submitted 23 December, 2021; originally announced December 2021.

  10. arXiv:2105.03906

    cs.CV

    TextAdaIN: Paying Attention to Shortcut Learning in Text Recognizers

    Authors: Oren Nuriel, Sharon Fogel, Ron Litman

    Abstract: Leveraging the characteristics of convolutional layers, neural networks are extremely effective for pattern recognition tasks. However in some cases, their decisions are based on unintended information leading to high performance on standard benchmarks but also to a lack of generalization to challenging testing conditions and unintuitive failures. Recent work has termed this "shortcut learning" an…

    Submitted 24 July, 2022; v1 submitted 9 May, 2021; originally announced May 2021.

    Comments: 12 pages, 8 figures, Accepted to ECCV 2022

  11. arXiv:2012.12643

    cs.CV cs.LG

    On Calibration of Scene-Text Recognition Models

    Authors: Ron Slossberg, Oron Anschel, Amir Markovitz, Ron Litman, Aviad Aberdam, Shahar Tsiper, Shai Mazor, Jon Wu, R. Manmatha

    Abstract: In this work, we study the problem of word-level confidence calibration for scene-text recognition (STR). Although the topic of confidence calibration has been an active research area for the last several decades, the case of structured and sequence prediction calibration has been scarcely explored. We analyze several recent STR methods and show that they are consistently overconfident. We then fo…

    Submitted 23 December, 2020; originally announced December 2020.

  12. arXiv:2012.10873

    cs.CV

    Sequence-to-Sequence Contrastive Learning for Text Recognition

    Authors: Aviad Aberdam, Ron Litman, Shahar Tsiper, Oron Anschel, Ron Slossberg, Shai Mazor, R. Manmatha, Pietro Perona

    Abstract: We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition. To account for the sequence-to-sequence structure, each feature map is divided into different instances over which the contrastive loss is computed. This operation enables us to contrast in a sub-word level, where from each image we extract several positive p…

    Submitted 20 December, 2020; originally announced December 2020.

  13. arXiv:2008.02231

    cs.CV

    Can You Read Me Now? Content Aware Rectification using Angle Supervision

    Authors: Amir Markovitz, Inbal Lavi, Or Perel, Shai Mazor, Roee Litman

    Abstract: The ubiquity of smartphone cameras has led to more and more documents being captured by cameras rather than scanned. Unlike flatbed scanners, photographed documents are often folded and crumpled, resulting in large local variance in text structure. The problem of document rectification is fundamental to the Optical Character Recognition (OCR) process on documents, and its ability to overcome geome…

    Submitted 5 August, 2020; originally announced August 2020.

    Comments: Presented in ECCV 2020

  14. arXiv:2003.11288

    cs.CV

    SCATTER: Selective Context Attentional Scene Text Recognizer

    Authors: Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, R. Manmatha

    Abstract: Scene Text Recognition (STR), the task of recognizing text against complex image backgrounds, is an active area of research. Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes. In this paper, we introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER). SCATTER utilizes a stacked block architecture with i…

    Submitted 25 March, 2020; originally announced March 2020.

    Comments: In CVPR 2020

  15. arXiv:2003.10557

    cs.CV cs.CL cs.LG

    ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation

    Authors: Sharon Fogel, Hadar Averbuch-Elor, Sarel Cohen, Shai Mazor, Roee Litman

    Abstract: Optical character recognition (OCR) systems performance have improved significantly in the deep learning era. This is especially true for handwritten text recognition (HTR), where each author has a unique style, unlike printed text, where the variation is smaller by design. That said, deep learning based HTR is limited, as in every other task, by the number of training examples. Gathering data is…

    Submitted 23 March, 2020; originally announced March 2020.

    Comments: in CVPR 2020

  16. arXiv:2003.08763

    cs.CV cs.IR cs.LG stat.ML

    Shape retrieval of non-rigid 3d human models

    Authors: David Pickup, Xianfang Sun, Paul L Rosin, Ralph R Martin, Z Cheng, Zhouhui Lian, Masaki Aono, A Ben Hamza, A Bronstein, M Bronstein, S Bu, Umberto Castellani, S Cheng, Valeria Garro, Andrea Giachetti, Afzal Godil, Luca Isaia, J Han, Henry Johan, L Lai, Bo Li, C Li, Haisheng Li, Roee Litman, X Liu, et al. (6 additional authors not shown)

    Abstract: 3D models of humans are commonly used within computer graphics and vision, and so the ability to distinguish between body shapes is an important shape retrieval problem. We extend our recent paper which provided a benchmark for testing non-rigid 3D shape retrieval algorithms on 3D human models. This benchmark provided a far stricter challenge than previous shape benchmarks. We have added 145 new m…

    Submitted 1 March, 2020; originally announced March 2020.

    Comments: International Journal of Computer Vision, 2016

  17. arXiv:1802.07045

    cs.CV

    Latent RANSAC

    Authors: Simon Korman, Roee Litman

    Abstract: We present a method that can evaluate a RANSAC hypothesis in constant time, i.e. independent of the size of the data. A key observation here is that correct hypotheses are tightly clustered together in the latent parameter domain. In a manner similar to the generalized Hough transform we seek to find this cluster, only that we need as few as two votes for a successful detection. Rapidly locating s…

    Submitted 3 June, 2018; v1 submitted 20 February, 2018; originally announced February 2018.

    Comments: presented in CVPR 2018

  18. arXiv:1701.00669

    cs.CV

    Product Manifold Filter: Non-Rigid Shape Correspondence via Kernel Density Estimation in the Product Space

    Authors: Matthias Vestner, Roee Litman, Emanuele Rodolà, Alex Bronstein, Daniel Cremers

    Abstract: Many algorithms for the computation of correspondences between deformable shapes rely on some variant of nearest neighbor matching in a descriptor space. Such are, for example, various point-wise correspondence recovery algorithms used as a post-processing stage in the functional correspondence framework. Such frequently used techniques implicitly make restrictive assumptions (e.g., near-isometry)…

    Submitted 7 April, 2017; v1 submitted 3 January, 2017; originally announced January 2017.

    Comments: To appear at CVPR 2017

  19. arXiv:1609.05715

    cs.DS

    SpectroMeter: Amortized Sublinear Spectral Approximation of Distance on Graphs

    Authors: Roee Litman, Alex Bronstein

    Abstract: We present a method to approximate pairwise distance on a graph, having an amortized sub-linear complexity in its size. The proposed method follows the so called heat method due to Crane et al. The only additional input are the values of the eigenfunctions of the graph Laplacian at a subset of the vertices. Using these values we estimate a random walk from the source points, and normalize the resu…

    Submitted 15 September, 2016; originally announced September 2016.

  20. arXiv:1607.03425

    cs.CV

    Bayesian Inference of Bijective Non-Rigid Shape Correspondence

    Authors: Matthias Vestner, Roee Litman, Alex Bronstein, Emanuele Rodolà, Daniel Cremers

    Abstract: Many algorithms for the computation of correspondences between deformable shapes rely on some variant of nearest neighbor matching in a descriptor space. Such are, for example, various point-wise correspondence recovery algorithms used as a postprocessing stage in the functional correspondence framework. In this paper, we show that such frequently used techniques in practice suffer from lack of ac…

    Submitted 12 July, 2016; originally announced July 2016.

  21. Probably Approximately Symmetric: Fast rigid Symmetry Detection with Global Guarantees

    Authors: Simon Korman, Roee Litman, Shai Avidan, Alex Bronstein

    Abstract: We present a fast algorithm for global rigid symmetry detection with approximation guarantees. The algorithm is guaranteed to find the best approximate symmetry of a given shape, to within a user-specified threshold, with very high probability. Our method uses a carefully designed sampling of the transformation space, where each transformation is efficiently evaluated using a sub-linear algorithm.…

    Submitted 6 October, 2014; v1 submitted 26 March, 2014; originally announced March 2014.

    MSC Class: 65D18; 68U05 ACM Class: I.3.5

    Journal ref: Comput. Graph. Forum 34, 1 (February 2015), 2-13

  22. arXiv:1102.4258

    cs.CV

    SHREC 2011: robust feature detection and description benchmark

    Authors: E. Boyer, A. M. Bronstein, M. M. Bronstein, B. Bustos, T. Darom, R. Horaud, I. Hotz, Y. Keller, J. Keustermans, A. Kovnatsky, R. Litman, J. Reininghaus, I. Sipiran, D. Smeets, P. Suetens, D. Vandermeulen, A. Zaharescu, V. Zobel

    Abstract: Feature-based approaches have recently become very popular in computer vision and image analysis applications, and are becoming a promising direction in shape retrieval. SHREC'11 robust feature detection and description benchmark simulates the feature detection and description stages of feature-based shape retrieval algorithms. The benchmark tests the performance of shape feature detectors and des…

    Submitted 21 February, 2011; originally announced February 2011.

    Comments: This is a full version of the SHREC'11 report published in 3DOR

  23. Diffusion-geometric maximally stable component detection in deformable shapes

    Authors: Roee Litman, Alex M. Bronstein, Michael M. Bronstein

    Abstract: Maximally stable component detection is a very popular method for feature analysis in images, mainly due to its low computation cost and high repeatability. With the recent advance of feature-based methods in geometric shape analysis, there is significant interest in finding analogous approaches in the 3D world. In this paper, we formulate a diffusion-geometric framework for stable component detec…

    Submitted 17 December, 2010; originally announced December 2010.

    ACM Class: I.4.7; I.4.8