Showing 1–11 of 11 results for author: Manohar, V

Searching in archive cs.
  1. arXiv:2406.02560

    eess.AS cs.AI cs.CL cs.LG

    Less Peaky and More Accurate CTC Forced Alignment by Label Priors

    Authors: Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur

    Abstract: Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior of CTC and improving its suitability for forced alignment generation, by leve…

    Submitted 18 July, 2024; v1 submitted 22 April, 2024; originally announced June 2024.

    Comments: Accepted by ICASSP 2024. Github repo: https://github.com/huangruizhe/audio/tree/aligner_label_priors

  2. arXiv:2306.15687

    eess.AS cs.CL cs.LG cs.SD

    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu

    Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative…

    Submitted 19 October, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

    Comments: Accepted to NeurIPS 2023

  3. arXiv:2211.13282

    cs.SD cs.AI eess.AS

    Voice-preserving Zero-shot Multiple Accent Conversion

    Authors: Mumin Jin, Prashant Serai, Jilong Wu, Andros Tjandra, Vimal Manohar, Qing He

    Abstract: Most people who have tried to learn a foreign language would have experienced difficulties understanding or speaking with a native speaker's accent. For native speakers, understanding or speaking a new accent is likewise a difficult task. An accent conversion system that changes a speaker's accent but preserves that speaker's voice identity, such as timbre and pitch, has the potential for a range…

    Submitted 14 October, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: Accepted to IEEE ICASSP 2023

  4. arXiv:2210.16045

    cs.SD cs.CL eess.AS

    Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

    Authors: Jason Fong, Yun Wang, Prabhav Agrawal, Vimal Manohar, Jilong Wu, Thilo Köhler, Qing He

    Abstract: Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is the usage of finetuning to optimise performance: this requires further model…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  5. arXiv:2107.04154

    eess.AS cs.LG

    On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models

    Authors: Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, Yangyang Shi, Nayan Singhal, Julian Chan, Fuchun Peng, Yatharth Saraf, Mike Seltzer

    Abstract: Hybrid automatic speech recognition (ASR) models are typically sequentially trained with CTC or LF-MMI criteria. However, they have vastly different legacies and are usually implemented in different frameworks. In this paper, by decoupling the concepts of modeling units and label topologies and building proper numerator/denominator graphs accordingly, we establish a generalized framework for hybri…

    Submitted 26 September, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

    Comments: accepted by ASRU 2021

  6. arXiv:2106.07759

    eess.AS cs.CL

    Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

    Authors: Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised speech recognition (ASR). The proposed approach uses a teacher model which is updated as the exponential moving average (EMA) of the student model parameters. We demonstrate that it is critical for EMA to be accumulated with full-precision floating point. The Ka…

    Submitted 27 October, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Updated with camera ready version

  7. arXiv:2005.07850

    eess.AS cs.CL cs.SD

    Large scale weakly and semi-supervised learning for low-resource video ASR

    Authors: Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on th…

    Submitted 6 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

  8. arXiv:2004.09249

    cs.SD cs.CL eess.AS

    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

    Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous C…

    Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

  9. arXiv:1802.08731

    cs.CL

    Automatic Speech Recognition and Topic Identification for Almost-Zero-Resource Languages

    Authors: Matthew Wiesner, Chunxi Liu, Lucas Ondel, Craig Harman, Vimal Manohar, Jan Trmal, Zhongqiang Huang, Najim Dehak, Sanjeev Khudanpur

    Abstract: Automatic speech recognition (ASR) systems often need to be developed for extremely low-resource languages to serve end-uses such as audio content categorization and search. While universal phone recognition is natural to consider when no transcribed speech is available to train an ASR system in a language, adapting universal phone models using very small amounts (minutes rather than hours) of tra…

    Submitted 18 June, 2018; v1 submitted 23 February, 2018; originally announced February 2018.

    Comments: Accepted for publication at Interspeech 2018

  10. arXiv:1706.03747

    cs.CL

    Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework

    Authors: Xiaohui Zhang, Vimal Manohar, Daniel Povey, Sanjeev Khudanpur

    Abstract: Speech recognition systems for irregularly-spelled languages like English normally require hand-written pronunciations. In this paper, we describe a system for automatically obtaining pronunciations of words for which pronunciations are not available, but for which transcribed data exists. Our method integrates information from the letter sequence and from the acoustic evidence. The novel aspect o…

    Submitted 12 June, 2017; originally announced June 2017.

  11. arXiv:1706.00321

    cs.CL

    Using of heterogeneous corpora for training of an ASR system

    Authors: Jan Trmal, Gaurav Kumar, Vimal Manohar, Sanjeev Khudanpur, Matt Post, Paul McNamee

    Abstract: The paper summarizes the development of the LVCSR system built as a part of the Pashto speech-translation system at the SCALE (Summer Camp for Applied Language Exploration) 2015 workshop on "Speech-to-text-translation for low-resource languages". The Pashto language was chosen as a good "proxy" low-resource language, exhibiting multiple phenomena which make the speech-recognition and speech-to…

    Submitted 1 June, 2017; originally announced June 2017.