Showing 1–50 of 104 results for author: Jawahar, C V

Searching in archive cs.
  1. arXiv:2406.17437

    cs.CV

    Advancing Question Answering on Handwritten Documents: A State-of-the-Art Recognition-Based Model for HW-SQuAD

    Authors: Aniket Pal, Ajoy Mondal, C. V. Jawahar

    Abstract: Question-answering handwritten documents is a challenging task with numerous real-world applications. This paper proposes a novel recognition-based approach that improves upon the previous state-of-the-art on the HW-SQuAD and BenthamQA datasets. Our model incorporates transformer-based document retrieval and ensemble methods at the model level, achieving an Exact Match score of 82.02% and 69% in H…

    Submitted 15 July, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: 16 pages

  2. arXiv:2404.08561

    cs.CV cs.AI cs.RO

    IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic

    Authors: Chirag Parikh, Rohit Saluja, C. V. Jawahar, Ravi Kiran Sarvadevabhatla

    Abstract: Intelligent vehicle systems require a deep understanding of the interplay between road conditions, surrounding entities, and the ego vehicle's driving behavior for safe and efficient navigation. This is particularly critical in developing countries where traffic situations are often dense and unstructured with heterogeneous road occupants. Existing datasets, predominantly geared towards structured…

    Submitted 23 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

    Comments: Accepted at ICRA 2024; Project page: https://idd-x.github.io/

  3. IndicSTR12: A Dataset for Indic Scene Text Recognition

    Authors: Harsh Lunia, Ajoy Mondal, C V Jawahar

    Abstract: The importance of Scene Text Recognition (STR) in today's increasingly digital world cannot be overstated. Given the significance of STR, data intensive deep learning approaches that auto-learn feature mappings have primarily driven the development of STR solutions. Several benchmark datasets and substantial work on deep learning models are available for Latin languages to meet this need. On more…

    Submitted 12 March, 2024; originally announced March 2024.

    Journal ref: ICDAR 2023 Workshops. Lecture Notes in Computer Science, vol 14193. Springer, Cham (2023)

  4. arXiv:2403.01087

    cs.MM cs.CV cs.SD eess.AS

    Towards Accurate Lip-to-Speech Synthesis in-the-Wild

    Authors: Sindhu Hegde, Rudrabha Mukhopadhyay, C. V. Jawahar, Vinay Namboodiri

    Abstract: In this paper, we introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements. The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone, resulting in unsatisfactory outcomes. To overcome this issue, we propose i…

    Submitted 1 March, 2024; originally announced March 2024.

    Comments: 8 pages of content, 1 page of references and 4 figures

    Journal ref: In Proceedings of the 31st ACM International Conference on Multimedia, 2023

  5. arXiv:2402.15832

    cs.CV cs.AI

    Multiple Instance Learning for Glioma Diagnosis using Hematoxylin and Eosin Whole Slide Images: An Indian Cohort Study

    Authors: Ekansh Chauhan, Amit Sharma, Megha S Uppin, C. V. Jawahar, P. K. Vinod

    Abstract: The effective management of brain tumors relies on precise typing, subtyping, and grading. This study advances patient care with findings from rigorous multiple instance learning experimentations across various feature extractors and aggregators in brain tumor histopathology. It establishes new performance benchmarks in glioma subtype classification across multiple datasets, including a novel data…

    Submitted 8 March, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

  6. arXiv:2311.18572

    cs.CV

    Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation

    Authors: Avijit Dasgupta, C. V. Jawahar, Karteek Alahari

    Abstract: Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts in source and target domains remain source-dependent as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source a…

    Submitted 30 November, 2023; originally announced November 2023.

    Comments: Extended version of our ICVGIP paper

  7. arXiv:2311.18259

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  8. arXiv:2311.03550

    cs.CV cs.AI

    United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

    Authors: Siddhant Bansal, Chetan Arora, C. V. Jawahar

    Abstract: Given multiple videos of the same task, procedure learning addresses identifying the key-steps and determining their order to perform the task. For this purpose, existing approaches use the signal generated from a pair of videos. This makes key-steps discovery challenging as the algorithms lack inter-videos perspective. Instead, we propose an unsupervised Graph-based Procedure Learning (GPL) frame…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: 13 pages, 6 figures, Accepted in Winter Conference on Applications of Computer Vision (WACV), 2024

  9. arXiv:2309.14715

    cs.CV cs.HC cs.LG

    Explaining Deep Face Algorithms through Visualization: A Survey

    Authors: Thrupthi Ann John, Vineeth N Balasubramanian, C. V. Jawahar

    Abstract: Although current deep models for face tasks surpass human performance on some benchmarks, we do not understand how they work. Thus, we cannot predict how it will react to novel inputs, resulting in catastrophic failures and unwanted biases in the algorithms. Explainable AI helps bridge the gap, but currently, there are very few visualization algorithms designed for faces. This work undertakes a fi…

    Submitted 26 September, 2023; originally announced September 2023.

    ACM Class: I.2.10; I.4.10; I.5.1

    Journal ref: IEEE Transactions in Biometrics, Behaviour and Identity Science (IEEE T-BIOM) 2023

  10. arXiv:2309.01380

    cs.CV

    Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

    Authors: Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

    Abstract: Researchers have extensively studied the field of vision and language, discovering that both visual and textual content is crucial for understanding scenes effectively. Particularly, comprehending text in videos holds great significance, requiring both scene text understanding and temporal reasoning. This paper focuses on exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, whic…

    Submitted 11 September, 2023; v1 submitted 4 September, 2023; originally announced September 2023.

  11. arXiv:2308.12199

    cs.CV

    Towards Real-Time Analysis of Broadcast Badminton Videos

    Authors: Nitin Nilesh, Tushar Sharma, Anurag Ghosh, C. V. Jawahar

    Abstract: Analysis of player movements is a crucial subset of sports analysis. Existing player movement analysis methods use recorded videos after the match is over. In this work, we propose an end-to-end framework for player movement analysis for badminton matches on live broadcast match videos. We only use the visual inputs from the match and, unlike other approaches which use multi-modal sensor data, our…

    Submitted 23 August, 2023; originally announced August 2023.

  12. arXiv:2307.03948

    cs.CV

    Reading Between the Lanes: Text VideoQA on the Road

    Authors: George Tom, Minesh Mathew, Sergi Garcia, Dimosthenis Karatzas, C. V. Jawahar

    Abstract: Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, while textual cues typically appear for a short time span, and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and te…

    Submitted 8 July, 2023; originally announced July 2023.

  13. arXiv:2303.02641

    cs.CV cs.AI

    CueCAn: Cue Driven Contextual Attention For Identifying Missing Traffic Signs on Unconstrained Roads

    Authors: Varun Gupta, Anbumani Subramanian, C. V. Jawahar, Rohit Saluja

    Abstract: Unconstrained Asian roads often involve poor infrastructure, affecting overall road safety. Missing traffic signs are a regular part of such roads. Missing or non-existing object detection has been studied for locating missing curbs and estimating reasonable regions for pedestrians on road scene images. Such methods involve analyzing task-specific single object cues. In this paper, we present the…

    Submitted 5 March, 2023; originally announced March 2023.

    Comments: International Conference on Robotics and Automation (ICRA'23)

  14. A Fine-Grained Vehicle Detection (FGVD) Dataset for Unconstrained Roads

    Authors: Prafful Kumar Khoba, Chirag Parikh, Rohit Saluja, Ravi Kiran Sarvadevabhatla, C. V. Jawahar

    Abstract: The previous fine-grained datasets mainly focus on classification and are often captured in a controlled setup, with the camera focusing on the objects. We introduce the first Fine-Grained Vehicle Detection (FGVD) dataset in the wild, captured from a moving camera mounted on a car. It contains 5502 scene images with 210 unique fine-grained labels of multiple vehicle types organized in a three-leve…

    Submitted 30 December, 2022; originally announced December 2022.

  15. arXiv:2212.08834

    cs.CV

    Towards Robust Handwritten Text Recognition with On-the-fly User Participation

    Authors: Ajoy Mondal, Rohit Saluja, C. V. Jawahar

    Abstract: Long-term OCR services aim to provide high-quality output to their users at competitive costs. It is essential to upgrade the models because of the complex data loaded by the users. The service providers encourage the users who provide data where the OCR model fails by rewarding them based on data complexity, readability, and available budget. Hitherto, the OCR works include preparing the models o…

    Submitted 17 December, 2022; originally announced December 2022.

  16. arXiv:2212.07776

    cs.CV

    Enhancing Indic Handwritten Text Recognition Using Global Semantic Information

    Authors: Ajoy Mondal, C. V. Jawahar

    Abstract: Handwritten Text Recognition (HTR) is more interesting and challenging than printed text due to uneven variations in the handwriting style of the writers, content, and time. HTR becomes more challenging for the Indic languages because of (i) multiple characters combined to form conjuncts which increase the number of characters of respective languages, and (ii) near to 100 unique basic Unicode char…

    Submitted 15 December, 2022; originally announced December 2022.

  17. arXiv:2212.00999

    cs.IR

    Information Retrieval from the Digitized Books

    Authors: Riya Gupta, C. V. Jawahar

    Abstract: Extracting the relevant information out of a large number of documents is a challenging and tedious task. The quality of results generated by the traditionally available full-text search engine and text-based image retrieval systems is not optimal. Information retrieval (IR) tasks become more challenging with the nontraditional language scripts, as in the case of Indic scripts. The authors have de…

    Submitted 2 December, 2022; originally announced December 2022.

    Comments: 6 pages including references, 5 figures, and 1 table. For project page see https://cvit.iiit.ac.in/research/projects/cvit-projects/retrieval-from-large-document-image-collections

  18. arXiv:2211.05588

    cs.CV

    Watching the News: Towards VideoQA Models that can Read

    Authors: Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

    Abstract: Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel V…

    Submitted 7 December, 2023; v1 submitted 10 November, 2022; originally announced November 2022.

  19. arXiv:2210.16644

    cs.CV

    Unsupervised Audio-Visual Lecture Segmentation

    Authors: Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi

    Abstract: Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing online lectures are lacking. Our first contribution is to facilitate research in the educational domain, by introducing AVLectu…

    Submitted 29 October, 2022; originally announced October 2022.

    Comments: 17 pages, 14 figures, 14 tables, Accepted to WACV 2023. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/avlectures

  20. arXiv:2210.16579

    cs.CV

    INR-V: A Continuous Representation Space for Video-based Generative Tasks

    Authors: Bipasha Sen, Aditya Agarwal, Vinay P Namboodiri, C. V. Jawahar

    Abstract: Generating videos is a complex task that is accomplished by generating a set of temporally coherent images frame-by-frame. This limits the expressivity of videos to only image-based operations on the individual video frames needing network designs to obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous spac…

    Submitted 2 April, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

    Comments: Published in Transactions on Machine Learning Research (10/2022); https://openreview.net/forum?id=aIoEkwc2oB

  21. arXiv:2210.12878

    cs.CV

    IDD-3D: Indian Driving Dataset for 3D Unstructured Road Scenes

    Authors: Shubham Dokania, A. H. Abdul Hafez, Anbumani Subramanian, Manmohan Chandraker, C. V. Jawahar

    Abstract: Autonomous driving and assistance systems rely on annotated data from traffic and road scenarios to model and learn the various object relations in complex real-world scenarios. Preparation and training of deploy-able deep learning architectures require the models to be suited to different traffic scenarios and adapt to different situations. Currently, existing datasets, while large-scale, lack su…

    Submitted 23 October, 2022; originally announced October 2022.

    Comments: 10 pages, 8 figures, 5 tables, Accepted in Winter Conference on Applications of Computer Vision (WACV 2023)

  22. arXiv:2210.10828

    cs.CV

    Grounded Video Situation Recognition

    Authors: Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi

    Abstract: Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) is framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambigua…

    Submitted 19 October, 2022; originally announced October 2022.

    Comments: Accepted to NeurIPS 2022. Project Page: https://zeeshank95.github.io/grvidsitu

  23. arXiv:2210.03692

    cs.CV

    Compressing Video Calls using Synthetic Talking Heads

    Authors: Madhav Agarwal, Anchit Gupta, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C V Jawahar

    Abstract: We leverage the modern advancements in talking head generation to propose an end-to-end system for talking head video compression. Our algorithm transmits pivot frames intermittently while the rest of the talking head video is generated by animating them. We use a state-of-the-art face reenactment network to detect key points in the non-pivot frames and transmit them to the receiver. A dense flow…

    Submitted 7 October, 2022; originally announced October 2022.

    Comments: British Machine Vision Conference (BMVC), 2022

  24. arXiv:2210.02755

    cs.CV

    Audio-Visual Face Reenactment

    Authors: Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar

    Abstract: This work proposes a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated using learnable keypoints. We improve the quality of lip sync using audio as an additional input, helping the network to attend to the mouth region. We use additional priors using…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: Winter Conference on Applications of Computer Vision (WACV), 2023

  25. arXiv:2209.00642

    cs.CV cs.CL cs.SD eess.AS

    Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

    Authors: Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, C. V. Jawahar

    Abstract: In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works, our method (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task pres…

    Submitted 1 September, 2022; originally announced September 2022.

    Comments: Accepted in ACM-MM 2022, 9 pages, 2 pages supplementary, 7 Figures

  26. arXiv:2208.09796

    cs.CV cs.CY

    Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale

    Authors: Aditya Agarwal, Bipasha Sen, Rudrabha Mukhopadhyay, Vinay Namboodiri, C. V Jawahar

    Abstract: Many people with some form of hearing loss consider lipreading as their primary mode of day-to-day communication. However, finding resources to learn or improve one's lipreading skills can be challenging. This is further exacerbated in the COVID19 pandemic due to restrictions on direct interactions with peers and speech therapists. Today, online MOOCs platforms like Coursera and Udemy have become…

    Submitted 4 October, 2022; v1 submitted 20 August, 2022; originally announced August 2022.

    Comments: Accepted at WACV 2023

  27. arXiv:2208.09788

    cs.CV

    FaceOff: A Video-to-Video Face Swapping System

    Authors: Aditya Agarwal, Bipasha Sen, Rudrabha Mukhopadhyay, Vinay Namboodiri, C. V. Jawahar

    Abstract: Doubles play an indispensable role in the movie industry. They take the place of the actors in dangerous stunt scenes or scenes where the same actor plays multiple characters. The double's face is later replaced with the actor's face and expressions manually using expensive CGI technology, costing millions of dollars and taking months to complete. An automated, inexpensive, and fast way can be to…

    Submitted 21 October, 2022; v1 submitted 20 August, 2022; originally announced August 2022.

    Comments: Accepted at WACV 2023

  28. Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors

    Authors: Sindhu B Hegde, Rudrabha Mukhopadhyay, Vinay P Namboodiri, C. V. Jawahar

    Abstract: In this paper, we explore an interesting question of what can be obtained from an $8\times8$ pixel video sequence. Surprisingly, it turns out to be quite a lot. We show that when we process this $8\times8$ video with the right set of audio and image priors, we can obtain a full-length, $256\times256$ video. We achieve this $32\times$ scaling of an extremely low-resolution input using our novel aud…

    Submitted 17 August, 2022; originally announced August 2022.

    Comments: Accepted in ACM-MM 2022, 10 pages, 6 pages supplementary, 18 Figures

  29. arXiv:2208.07943

    cs.CV

    TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments

    Authors: Shubham Dokania, Anbumani Subramanian, Manmohan Chandraker, C. V. Jawahar

    Abstract: High-quality structured data with rich annotations are critical components in intelligent vehicle systems dealing with road scenes. However, data curation and annotation require intensive investments and yield low-diversity scenarios. The recently growing interest in synthetic data raises questions about the scope of improvement in such systems and the amount of manual work still required to produ…

    Submitted 16 August, 2022; originally announced August 2022.

    Comments: 18 pages, 5 figures, Accepted in European Conference on Computer Vision (ECCV 2022)

  30. arXiv:2207.10883

    cs.CV cs.AI

    My View is the Best View: Procedure Learning from Egocentric Videos

    Authors: Siddhant Bansal, Chetan Arora, C. V. Jawahar

    Abstract: Procedure learning involves identifying the key-steps and determining their logical order to perform a task. Existing approaches commonly use third-person videos for learning the procedure, making the manipulated object small in appearance and often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras pro…

    Submitted 22 July, 2022; originally announced July 2022.

    Comments: 25 pages, 6 figures, Accepted in European Conference on Computer Vision (ECCV) 2022

  31. arXiv:2204.08364

    cs.CV

    Detecting, Tracking and Counting Motorcycle Rider Traffic Violations on Unconstrained Roads

    Authors: Aman Goyal, Dev Agarwal, Anbumani Subramanian, C. V. Jawahar, Ravi Kiran Sarvadevabhatla, Rohit Saluja

    Abstract: In many Asian countries with unconstrained road traffic conditions, driving violations such as not wearing helmets and triple-riding are a significant source of fatalities involving motorcycles. Identifying and penalizing such riders is vital in curbing road accidents and improving citizens' safety. With this motivation, we propose an approach for detecting, tracking, and counting motorcycle ridin…

    Submitted 18 April, 2022; originally announced April 2022.

    Comments: 10 pages, 9 figures, Accepted at The 5th Workshop and Prize Challenge: Bridging the Gap between Computational Photography and Visual Recognition (UG2+) in conjunction with IEEE CVPR 2022

  32. arXiv:2201.08574

    cs.CV cs.AI cs.MM

    Classroom Slide Narration System

    Authors: Jobin K. V., Ajoy Mondal, C. V. Jawahar

    Abstract: Slide presentations are an effective and efficient tool used by the teaching community for classroom communication. However, this teaching model can be challenging for blind and visually impaired (VI) students. The VI student required personal human assistance for understand the presented slide. This shortcoming motivates us to design a Classroom Slide Narration System (CSNS) that generates audio…

    Submitted 21 January, 2022; originally announced January 2022.

    Journal ref: CVIP 2021

  33. arXiv:2201.06569

    cs.CV

    Automatic Quantification and Visualization of Street Trees

    Authors: Arpit Bahety, Rohit Saluja, Ravi Kiran Sarvadevabhatla, Anbumani Subramanian, C. V. Jawahar

    Abstract: Assessing the number of street trees is essential for evaluating urban greenery and can help municipalities employ solutions to identify tree-starved streets. It can also help identify roads with different levels of deforestation and afforestation over time. Yet, there has been little work in the area of street trees quantification. This work first explains a data collection setup carefully design…

    Submitted 17 January, 2022; originally announced January 2022.

    Comments: Accepted at ICVGIP 2021

  34. Towards Boosting the Accuracy of Non-Latin Scene Text Recognition

    Authors: Sanjana Gunna, Rohit Saluja, C. V. Jawahar

    Abstract: Scene-text recognition is remarkably better in Latin languages than the non-Latin languages due to several factors like multiple fonts, simplistic vocabulary statistics, updated data generation tools, and writing systems. This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages. We compare various features like the size (width and height) of…

    Submitted 10 January, 2022; originally announced January 2022.

    Comments: 12 pages, 6 figures

    Journal ref: ICDAR 2021: Document Analysis and Recognition, ICDAR 2021 Workshops, pp 282-293

  35. Transfer Learning for Scene Text Recognition in Indian Languages

    Authors: Sanjana Gunna, Rohit Saluja, C. V. Jawahar

    Abstract: Scene text recognition in low-resource Indian languages is challenging because of complexities like multiple scripts, fonts, text size, and orientations. In this work, we investigate the power of transfer learning for all the layers of deep scene text recognition networks from English to two common Indian languages. We perform experiments on the conventional CRNN model and STAR-Net to ensure gener…

    Submitted 10 January, 2022; originally announced January 2022.

    Comments: 16 pages, 5 figures

    Journal ref: ICDAR 2021: Document Analysis and Recognition, ICDAR 2021 Workshops, pp 182-197

  36. arXiv:2111.07129

    cs.CV cs.AI

    Visual Understanding of Complex Table Structures from Document Images

    Authors: Sachin Raja, Ajoy Mondal, C V Jawahar

    Abstract: Table structure recognition is necessary for a comprehensive understanding of documents. Tables in unstructured business documents are tough to parse due to the high diversity of layouts, varying alignments of contents, and the presence of empty cells. The problem is particularly difficult because of challenges in identifying individual cells using visual or linguistic contexts or both. Accurate d…

    Submitted 13 November, 2021; originally announced November 2021.

  37. arXiv:2111.05547

    cs.CV cs.LG

    ICDAR 2021 Competition on Document VisualQuestion Answering

    Authors: Rubèn Tito, Minesh Mathew, C. V. Jawahar, Ernest Valveny, Dimosthenis Karatzas

    Abstract: In this report we present results of the ICDAR 2021 edition of the Document Visual Question Challenges. This edition complements the previous tasks on Single Document VQA and Document Collection VQA with a newly introduced on Infographics VQA. Infographics VQA is based on a new dataset of more than 5,000 infographics images and 30,000 question-answer pairs. The winner methods have scored 0.6120 AN…

    Submitted 10 November, 2021; originally announced November 2021.

  38. arXiv:2111.01740

    cs.CV cs.CL

    Personalized One-Shot Lipreading for an ALS Patient

    Authors: Bipasha Sen, Aditya Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar

    Abstract: Lipreading or visually recognizing speech from the mouth movements of a speaker is a challenging and mentally taxing task. Unfortunately, multiple medical conditions force people to depend on this skill in their day-to-day lives for essential communication. Patients suffering from Amyotrophic Lateral Sclerosis (ALS) often lose muscle control, consequently their ability to generate speech and commu…

    Submitted 2 November, 2021; originally announced November 2021.

    Journal ref: BMVC 2021

  39. arXiv:2110.12205

    cs.CV

    Multi-Domain Incremental Learning for Semantic Segmentation

    Authors: Prachi Garg, Rohit Saluja, Vineeth N Balasubramanian, Chetan Arora, Anbumani Subramanian, C. V. Jawahar

    Abstract: Recent efforts in multi-domain learning for semantic segmentation attempt to learn multiple geographical datasets in a universal, joint model. A simple fine-tuning experiment performed sequentially on three popular road scene segmentation datasets demonstrates that existing segmentation frameworks fail at incrementally learning on a series of visually disparate geographical domains. When learning…

    Submitted 23 October, 2021; originally announced October 2021.

    Comments: 11 pages, 5 figures, Accepted in WACV 2022

  40. Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor

    Authors: Anchit Gupta, Faizan Farooq Khan, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C. V. Jawahar

    Abstract: This paper proposes a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities. Our editor provides an easy-to-use interface to apply modern lip-syncing algorithms interactively. Apart from lip-syncing, the editor also uses audio and facial re-enactment to generate expressive talking faces. The manual control improves the overall experie…

    Submitted 16 October, 2021; originally announced October 2021.

    Comments: 9 pages, 7 figures, accepted in ICVGIP 2021

  41. arXiv:2110.07058

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  42. arXiv:2109.05226  [pdf, other

    cs.CV

    Evaluating Computer Vision Techniques for Urban Mobility on Large-Scale, Unconstrained Roads

    Authors: Harish Rithish, Raghava Modhugu, Ranjith Reddy, Rohit Saluja, C. V. Jawahar

    Abstract: Conventional approaches for addressing road safety rely on manual interventions or immobile CCTV infrastructure. Such methods are expensive in enforcing compliance with traffic rules and do not scale to large road networks. This paper proposes a simple mobile imaging setup to address several common problems in road safety at scale. We use recent computer vision techniques to identify possible irregu…

    Submitted 11 September, 2021; originally announced September 2021.

    Comments: 8 pages, 8 figures

  43. arXiv:2108.02996  [pdf, other

    cs.CV

    Efficient and Generic Interactive Segmentation Framework to Correct Mispredictions during Clinical Evaluation of Medical Images

    Authors: Bhavani Sambaturu, Ashutosh Gupta, C. V. Jawahar, Chetan Arora

    Abstract: Semantic segmentation of medical images is an essential first step in computer-aided diagnosis systems for many applications. However, given many disparate imaging modalities and inherent variations in the patient data, it is difficult to consistently achieve high accuracy using modern deep neural networks (DNNs). This has led researchers to propose interactive image segmentation techniques where…

    Submitted 6 August, 2021; originally announced August 2021.

    Comments: 12 pages, 8 figures, accepted to MICCAI 2021

    MSC Class: 49-06 (Primary); 49-11 (Secondary)

    ACM Class: I.4.6; I.5.1

  44. arXiv:2107.09622  [pdf, other

    cs.CL

    More Parameters? No Thanks!

    Authors: Zeeshan Khan, Kartheek Akella, Vinay P. Namboodiri, C V Jawahar

    Abstract: This work studies the long-standing problems of model capacity and negative interference in multilingual neural machine translation (MNMT). We use network pruning techniques and observe that pruning 50-70% of the parameters from a trained MNMT model results in only a 0.29-1.98 drop in the BLEU score, suggesting that there exist large redundancies even in MNMT models. These observations motivate us t…

    Submitted 20 July, 2021; originally announced July 2021.

  45. arXiv:2106.12790  [pdf, other

    cs.CV

    Towards Automatic Speech to Sign Language Generation

    Authors: Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay Namboodiri, C V Jawahar

    Abstract: We aim to solve the highly challenging task of generating continuous sign language videos solely from speech segments for the first time. Recent efforts in this space have focused on generating such videos from human-annotated text transcripts without considering other modalities. However, replacing speech with sign language proves to be a practical solution while communicating with people sufferi…

    Submitted 24 June, 2021; originally announced June 2021.

    Comments: 5 pages (including references), 5 figures, Accepted in Interspeech 2021

  46. arXiv:2105.01386  [pdf, other

    cs.CV cs.LG

    Canonical Saliency Maps: Decoding Deep Face Models

    Authors: Thrupthi Ann John, Vineeth N Balasubramanian, C V Jawahar

    Abstract: As Deep Neural Network models for face processing tasks approach human-like performance, their deployment in critical applications such as law enforcement and access control has seen an upswing, where any failure may have far-reaching consequences. We need methods to build trust in deployed systems by making their working as transparent as possible. Existing visualization algorithms are designed f…

    Submitted 16 August, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: Under review. Added three new experiments, cleaned up some figures and equations

    ACM Class: I.4

  47. arXiv:2104.12756  [pdf, other

    cs.CV cs.CL

    InfographicVQA

    Authors: Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, C. V Jawahar

    Abstract: Infographics are documents designed to effectively communicate information using a combination of textual, graphical and visual elements. In this work, we explore the automatic understanding of infographic images by using a Visual Question Answering technique. To this end, we present InfographicVQA, a new dataset that comprises a diverse collection of infographics along with natural language question…

    Submitted 22 August, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

  48. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction

    Authors: Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, C. V. Jawahar

    Abstract: Scanned receipt OCR and key information extraction (SROIE) represent the processes of recognizing text from scanned receipts, extracting key texts from them, and saving the extracted texts to structured documents. SROIE plays a critical role in many document analysis applications and holds great commercial potential, but very little research and few advances have been published in this area.…

    Submitted 18 March, 2021; originally announced March 2021.

  49. arXiv:2012.13751  [pdf, other

    cs.CV cs.LG

    Few Shot Learning With No Labels

    Authors: Aditya Bharti, N. B. Vineeth, C. V. Jawahar

    Abstract: Few-shot learners aim to recognize new categories given only a small number of training samples. The core challenge is to avoid overfitting to the limited data while ensuring good generalization to novel classes. Existing literature makes use of vast amounts of annotated data by simply shifting the label requirement from novel classes to base classes. Since data annotation is time-consuming and co…

    Submitted 26 December, 2020; originally announced December 2020.

  50. arXiv:2012.10852  [pdf, other

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Visual Speech Enhancement Without A Real Visual Stream

    Authors: Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C. V. Jawahar

    Abstract: In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But these methods cannot be used for several ap…

    Submitted 20 December, 2020; originally announced December 2020.

    Comments: 10 pages, 4 figures, Accepted in WACV 2021