Showing 1–6 of 6 results for author: Mitra, C

Searching in archive cs.
  1. arXiv:2406.15334

    cs.CV cs.AI cs.CL cs.LG

    Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

    Authors: Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig

    Abstract: The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, wh…

    Submitted 21 June, 2024; originally announced June 2024.

  2. arXiv:2403.00212

    cs.CL cs.CV cs.LG cs.SD eess.AS

    Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART

    Authors: Aniket Tathe, Anand Kamble, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra

    Abstract: This research addresses the challenge of training an ASR model for personalized voices with minimal data. Utilizing just 14 minutes of custom audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to create a custom Common Voice 16.0 corpus. Subsequently, a Cross-lingual Self-supervised Representations (XLSR) Wav2Vec2 model is fine-tuned on this dataset. The developed web-bas…

    Submitted 29 February, 2024; originally announced March 2024.

  3. arXiv:2401.06183

    eess.AS cs.AI cs.CL cs.LG

    End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2

    Authors: Aniket Tathe, Anand Kamble, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra

    Abstract: Speech has long been a barrier to effective communication and connection, persisting as a challenge in our increasingly interconnected world. This research paper introduces a transformative solution to this persistent obstacle: an end-to-end speech conversion framework tailored for Hindi-to-English translation, culminating in the synthesis of English audio. By integrating cutting-edge technologies…

    Submitted 10 January, 2024; originally announced January 2024.

  4. arXiv:2311.17076

    cs.CV cs.AI cs.CL cs.LG

    Compositional Chain-of-Thought Prompting for Large Multimodal Models

    Authors: Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig

    Abstract: The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. O…

    Submitted 31 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

  5. arXiv:2311.14836

    cs.SD cs.CL eess.AS

    Custom Data Augmentation for low resource ASR using Bark and Retrieval-Based Voice Conversion

    Authors: Anand Kamble, Aniket Tathe, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra

    Abstract: This paper proposes two innovative methodologies to construct customized Common Voice datasets for low-resource languages like Hindi. The first methodology leverages Bark, a transformer-based text-to-audio model developed by Suno, and incorporates Meta's enCodec and a pre-trained HuBert model to enhance Bark's performance. The second methodology employs Retrieval-Based Voice Conversion (RVC) and u…

    Submitted 9 January, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

  6. arXiv:2311.06694

    cs.CL cs.AI cs.CV cs.RO

    Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding

    Authors: Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason

    Abstract: When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an object's appearance can vary with camera position. As such, we present the Multi-view Approach to Grounding in Context (MAGiC), which selects an object referent…

    Submitted 6 April, 2024; v1 submitted 11 November, 2023; originally announced November 2023.

    Journal ref: North American Chapter of the Association for Computational Linguistics (NAACL), 2024