Showing 1–20 of 20 results for author: Audhkhasi, K

Searching in archive cs.
  1. arXiv:2308.07486

    cs.LG cs.CL cs.SD eess.AS

    O-1: Self-training with Oracle and 1-best Hypothesis

    Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi

    Abstract: We introduce O-1, a new self-training objective to reduce training bias and unify training and evaluation metrics for speech recognition. O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR), that boosts the oracle hypothesis and can accommodate both supervised and unsupervised data. We demonstrate the effectiveness of our approach in terms of recognition on publicly available SpeechStew…

    Submitted 14 August, 2023; originally announced August 2023.

  2. arXiv:2306.08133

    eess.AS cs.CL

    Large-scale Language Model Rescoring on Long-form Data

    Authors: Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno, Michael Riley

    Abstract: In this work, we study the impact of Large-scale Language Models (LLM) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8% relative reduction in Word Error Rate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets and a reduction of up to 30% relative on Salient Term Error Rate (STER)…

    Submitted 5 September, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: 5 pages, accepted in ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  3. arXiv:2303.05958

    cs.CL cs.SD eess.AS stat.ML

    Robust Knowledge Distillation from RNN-T Models With Noisy Training Labels Using Full-Sum Loss

    Authors: Mohammad Zeineldeen, Kartik Audhkhasi, Murali Karthick Baskar, Bhuvana Ramabhadran

    Abstract: This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models. In hard distillation, a teacher model transcribes large amounts of unlabelled speech to train a student model. Soft distillation is another popular KD method that distills the output logits of the teacher model. Due to the nature of RNN-T alignments, applying soft dis…

    Submitted 10 March, 2023; originally announced March 2023.

    Comments: Accepted at ICASSP 2023

  4. arXiv:2210.17049

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Modular Hybrid Autoregressive Transducer

    Authors: Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny Huang, Ehsan Variani, Yinghui Huang, Pedro J. Moreno

    Abstract: Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a…

    Submitted 16 February, 2023; v1 submitted 30 October, 2022; originally announced October 2022.

    Comments: 8 pages, 1 figure, in SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar

  5. arXiv:2209.06096

    cs.CL cs.SD eess.AS

    Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition

    Authors: Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno

    Abstract: Attention layers are an integral part of modern end-to-end automatic speech recognition systems, for instance as part of the Transformer or Conformer architecture. Attention is typically multi-headed, where each head has an independent set of learned parameters and operates on the same input feature sequence. The output of multi-headed attention is a fusion of the outputs from the individual heads…

    Submitted 13 September, 2022; originally announced September 2022.

    Comments: Accepted for publication in Interspeech 2022

  6. arXiv:2010.04284

    cs.CL cs.LG cs.SD eess.AS

    Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems

    Authors: Yinghui Huang, Hong-Kwang Kuo, Samuel Thomas, Zvi Kons, Kartik Audhkhasi, Brian Kingsbury, Ron Hoory, Michael Picheny

    Abstract: Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a…

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: 5 pages, published in ICASSP 2020

    ACM Class: I.2.7

  7. arXiv:2009.14386

    cs.CL cs.LG cs.SD eess.AS

    End-to-End Spoken Language Understanding Without Full Transcripts

    Authors: Hong-Kwang J. Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, Luis Lastras

    Abstract: An essential component of spoken language understanding (SLU) is slot filling: representing the meaning of a spoken utterance using semantic entity labels. In this paper, we develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities and investigate if these E2E SLU models can be trained solely on semantic entity annotations without word-f…

    Submitted 29 September, 2020; originally announced September 2020.

    Comments: 5 pages, to be published in Interspeech 2020

    ACM Class: I.2.7

  8. arXiv:2006.09199

    cs.CV cs.CL cs.MM cs.SD eess.AS

    AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass

    Abstract: Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the nee…

    Submitted 29 June, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: A version of this work has been accepted to Interspeech 2021

  9. arXiv:2001.07263

    eess.AS cs.CL

    Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

    Authors: Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

    Abstract: It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a cross-u…

    Submitted 19 October, 2020; v1 submitted 20 January, 2020; originally announced January 2020.

    Comments: 5 pages, 2 figures

    MSC Class: 68T10 ACM Class: I.2.7

  10. arXiv:1908.03455

    cs.CL cs.SD eess.AS

    Challenging the Boundaries of Speech Recognition: The MALACH Corpus

    Authors: Michael Picheny, Zoltán Tüske, Brian Kingsbury, Kartik Audhkhasi, Xiaodong Cui, George Saon

    Abstract: There has been huge progress in speech recognition over the last several years. Tasks once thought extremely difficult, such as SWITCHBOARD, now approach levels of human performance. The MALACH corpus (LDC catalog LDC2012S05), a 375-Hour subset of a large archive of Holocaust testimonies collected by the Survivors of the Shoah Visual History Foundation, presents significant challenges to the speec…

    Submitted 9 August, 2019; originally announced August 2019.

    Comments: Accepted for publication at INTERSPEECH 2019

  11. arXiv:1904.08311

    cs.CL cs.AI

    Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation

    Authors: Gakuto Kurata, Kartik Audhkhasi

    Abstract: Conventional automatic speech recognition (ASR) systems trained from frame-level alignments can easily leverage posterior fusion to improve ASR accuracy and build a better single model with knowledge distillation. End-to-end ASR systems trained using the Connectionist Temporal Classification (CTC) loss do not require frame-level alignment and hence simplify model training. However, sparse and arbi…

    Submitted 2 July, 2019; v1 submitted 17 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019

  12. arXiv:1903.12306

    cs.CL

    Acoustically Grounded Word Embeddings for Improved Acoustics-to-Word Speech Recognition

    Authors: Shane Settle, Kartik Audhkhasi, Karen Livescu, Michael Picheny

    Abstract: Direct acoustics-to-word (A2W) systems for end-to-end automatic speech recognition are simpler to train, and more efficient to decode with, than sub-word systems. However, A2W systems can have difficulties at training time when data is limited, and at decoding time when recognizing words outside the training vocabulary. To address these shortcomings, we investigate the use of recently proposed aco…

    Submitted 28 March, 2019; originally announced March 2019.

    Comments: To appear at ICASSP 2019

  13. arXiv:1802.02656

    cs.CL cs.SD eess.AS

    Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

    Authors: Xuesong Yang, Kartik Audhkhasi, Andrew Rosenberg, Samuel Thomas, Bhuvana Ramabhadran, Mark Hasegawa-Johnson

    Abstract: The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios. Differences in speaker accents are a significant source of such mismatch. The traditional approach to deal with multiple accents involves pooling data from several accents during training and building a single model in multi-task fashion, where tasks correspond to i…

    Submitted 7 February, 2018; originally announced February 2018.

    Comments: Accepted in The 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2018)

  14. arXiv:1712.03133

    cs.CL cs.AI cs.NE stat.ML

    Building competitive direct acoustics-to-word models for English conversational speech recognition

    Authors: Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Michael Picheny

    Abstract: Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making train…

    Submitted 8 December, 2017; originally announced December 2017.

    Comments: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

  15. arXiv:1703.07754

    cs.CL cs.NE stat.ML

    Direct Acoustics-to-Word Models for English Conversational Speech Recognition

    Authors: Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, David Nahamoo

    Abstract: Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words with…

    Submitted 22 March, 2017; originally announced March 2017.

    Comments: Submitted to Interspeech-2017

  16. arXiv:1703.02136

    cs.CL

    English Conversational Telephone Speech Recognition by Humans and Machines

    Authors: George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall

    Abstract: One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to b…

    Submitted 6 March, 2017; originally announced March 2017.

  17. arXiv:1701.04313

    cs.CL cs.IR cs.LG cs.NE

    End-to-End ASR-free Keyword Search from Speech

    Authors: Kartik Audhkhasi, Andrew Rosenberg, Abhinav Sethy, Bhuvana Ramabhadran, Brian Kingsbury

    Abstract: End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive due to the lack of dependence on alignments between input acoustic and output grapheme or HMM state sequence during training. This paper explores the design of an ASR-free end-to-en…

    Submitted 13 January, 2017; originally announced January 2017.

    Comments: Published in the IEEE 2017 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2017), scheduled for 5-9 March 2017 in New Orleans, Louisiana, USA

  18. arXiv:1612.01928

    cs.CL cs.CV cs.LG cs.SD stat.ML

    Invariant Representations for Noisy Speech Recognition

    Authors: Dmitriy Serdyuk, Kartik Audhkhasi, Philémon Brakel, Bhuvana Ramabhadran, Samuel Thomas, Yoshua Bengio

    Abstract: Modern automatic speech recognition (ASR) systems need to be robust under acoustic variability arising from environmental, speaker, channel, and recording conditions. Ensuring such robustness to variability is a challenge in modern day neural network-based ASR systems, especially when all types of variability are not seen during training. We attempt to address this problem by encouraging the neura…

    Submitted 27 November, 2016; originally announced December 2016.

    Comments: 5 pages, 1 figure, 1 table, NIPS workshop on end-to-end speech recognition

  19. arXiv:1412.7063

    cs.CL cs.LG cs.NE

    Diverse Embedding Neural Network Language Models

    Authors: Kartik Audhkhasi, Abhinav Sethy, Bhuvana Ramabhadran

    Abstract: We propose Diverse Embedding Neural Network (DENN), a novel architecture for language models (LMs). A DENNLM projects the input word history vector onto multiple diverse low-dimensional sub-spaces instead of a single higher-dimensional sub-space as in conventional feed-forward neural network LMs. We encourage these sub-spaces to be diverse during network training through an augmented loss function…

    Submitted 15 April, 2015; v1 submitted 22 December, 2014; originally announced December 2014.

    Comments: Under review as workshop contribution at ICLR 2015

  20. arXiv:1312.7463

    stat.ML cs.CV cs.LG

    Generalized Ambiguity Decomposition for Understanding Ensemble Diversity

    Authors: Kartik Audhkhasi, Abhinav Sethy, Bhuvana Ramabhadran, Shrikanth S. Narayanan

    Abstract: Diversity or complementarity of experts in ensemble pattern recognition and information processing systems is widely-observed by researchers to be crucial for achieving performance improvement upon fusion. Understanding this link between ensemble diversity and fusion performance is thus an important research question. However, prior works have theoretically characterized ensemble diversity and hav…

    Submitted 28 December, 2013; originally announced December 2013.

    Comments: 32 pages, 10 figures

    ACM Class: I.5