Zum Hauptinhalt springen

Showing 1–50 of 79 results for author: Sainath, T N

Searching in archive cs. Search in all archives.
.
  1. arXiv:2406.02921  [pdf, other

    cs.CL cs.AI cs.LG cs.NE eess.AS

    Text Injection for Neural Contextual Biasing

    Authors: Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran

    Abstract: Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and it… ▽ More

    Submitted 11 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure

    Journal ref: Interspeech 2024, Kos Island, Greece

  2. arXiv:2402.17184  [pdf, other

    cs.CL cs.SD eess.AS

    Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models

    Authors: Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno

    Abstract: The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to improve as they are scaled to larger sizes, with some now reaching billions of parameters. Widespread deployment and adoption of these models, however, requires computationally efficient strategies for decoding. In the present work, we study one such strategy: applying multiple frame reduction layers in the enc… ▽ More

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Accepted to 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

  3. arXiv:2402.12862  [pdf, other

    cs.CL

    Handling Ambiguity in Emotion: From Out-of-Domain Detection to Distribution Estimation

    Authors: Wen Wu, Bo Li, Chao Zhang, Chung-Cheng Chiu, Qiujia Li, Junwen Bai, Tara N. Sainath, Philip C. Woodland

    Abstract: The subjective perception of emotion leads to inconsistent labels from human annotators. Typically, utterances lacking majority-agreed labels are excluded when training an emotion classifier, which cause problems when encountering ambiguous emotional expressions during testing. This paper investigates three methods to handle ambiguous emotion. First, we show that incorporating utterances without m… ▽ More

    Submitted 20 February, 2024; originally announced February 2024.

  4. arXiv:2401.12789  [pdf, other

    cs.CL cs.SD eess.AS

    Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

    Authors: W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath

    Abstract: In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average… ▽ More

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: ICASSP 2024

  5. arXiv:2401.08992  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR

    Authors: Junwen Bai, Bo Li, Qiujia Li, Tara N. Sainath, Trevor Strohman

    Abstract: The end-to-end ASR model is often desired in the streaming multilingual scenario since it is easier to deploy and can benefit from pre-trained speech models such as powerful foundation models. Meanwhile, the heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation, leading to asynchronous peak performance for different languages during training, e… ▽ More

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  6. arXiv:2312.11123  [pdf, other

    cs.SD cs.AI cs.LG eess.AS

    Improved Long-Form Speech Recognition by Jointly Modeling the Primary and Non-primary Speakers

    Authors: Guru Prakash Arumugam, Shuo-yiin Chang, Tara N. Sainath, Rohit Prabhavalkar, Quan Wang, Shaan Bijwadia

    Abstract: ASR models often suffer from a long-form deletion problem where the model predicts sequential blanks instead of words when transcribing a lengthy audio (in the order of minutes or hours). From the perspective of a user or downstream system consuming the ASR results, this behavior can be perceived as the model "being stuck", and potentially make the product hard to use. One of the culprits for long… ▽ More

    Submitted 18 December, 2023; originally announced December 2023.

    Comments: 8 pages, ASRU 2023

  7. arXiv:2312.08553  [pdf, other

    eess.AS cs.SD

    USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

    Authors: Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal

    Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios… ▽ More

    Submitted 16 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024. Preprint

  8. arXiv:2308.07395  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

    Authors: Shaan Bijwadia, Shuo-yiin Chang, Weiran Wang, Zhong Meng, Hao Zhang, Tara N. Sainath

    Abstract: Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as… ▽ More

    Submitted 14 August, 2023; originally announced August 2023.

  9. arXiv:2308.06125  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Improving Joint Speech-Text Representations Without Alignment

    Authors: Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

    Abstract: The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these metho… ▽ More

    Submitted 11 August, 2023; originally announced August 2023.

    Journal ref: INTERSPEECH 2023

  10. arXiv:2306.01015  [pdf, other

    cs.CL cs.NE cs.SD eess.AS

    How to Estimate Model Transferability of Pre-Trained Speech Models?

    Authors: Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath

    Abstract: In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks. We leverage upon two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates using the extracted representations. Our framework efficiently computes transferability… ▽ More

    Submitted 5 February, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech. Code is available at: https://github.com/virginiakm1988/LogME-CTC. Fixed a typo

  11. arXiv:2305.18419  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR

    Authors: W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-yiin Chang, Tara N. Sainath

    Abstract: We propose a method of segmenting long-form speech by separating semantically complete sentences within the utterance. This prevents the ASR decoder from needlessly processing faraway context while also preventing it from missing relevant context within the current sentence. Semantically complete sentence boundaries are typically demarcated by punctuation in written text; but unfortunately, spoken… ▽ More

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023. First 3 authors contributed equally

  12. arXiv:2305.15663  [pdf, other

    cs.CL cs.SD eess.AS

    Mixture-of-Expert Conformer for Streaming Multilingual ASR

    Authors: Ke Hu, Bo Li, Tara N. Sainath, Yu Zhang, Francoise Beaufays

    Abstract: End-to-end models with large capacity have significantly improved multilingual automatic speech recognition, but their computation cost poses challenges for on-device applications. We propose a streaming truly multilingual Conformer incorporating mixture-of-expert (MoE) layers that learn to only activate a subset of parameters in training and inference. The MoE layer consists of a softmax gate whi… ▽ More

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  13. arXiv:2305.13408  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Modular Domain Adaptation for Conformer-Based Streaming ASR

    Authors: Qiujia Li, Bo Li, Dongseong Hwang, Tara N. Sainath, Pedro M. Mengibar

    Abstract: Speech data from different domains has distinct acoustic and linguistic characteristics. It is common to train a single multidomain model such as a Conformer transducer for speech recognition on a mixture of data from all domains. However, changing data in one domain or adding a new domain would require the multidomain model to be retrained. To this end, we propose a framework called modular domai… ▽ More

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  14. arXiv:2304.00173  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Lego-Features: Exporting modular encoder features for streaming and deliberation ASR

    Authors: Rami Botros, Rohit Prabhavalkar, Johan Schalkwyk, Ciprian Chelba, Tara N. Sainath, Françoise Beaufays

    Abstract: In end-to-end (E2E) speech recognition models, a representational tight-coupling inevitably emerges between the encoder and the decoder. We build upon recent work that has begun to explore building encoders with modular encoded representations, such that encoders and decoders from different models can be stitched together in a zero-shot manner without further fine-tuning. While previous research o… ▽ More

    Submitted 31 March, 2023; originally announced April 2023.

  15. arXiv:2304.00171  [pdf, other

    cs.CL cs.SD eess.AS

    Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR

    Authors: Rami Botros, Anmol Gulati, Tara N. Sainath, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu

    Abstract: Conformer models maintain a large number of internal states, the vast majority of which are associated with self-attention layers. With limited memory bandwidth, reading these from memory at each inference step can slow down inference. In this paper, we design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs. We explore various ideas to impr… ▽ More

    Submitted 31 March, 2023; originally announced April 2023.

  16. arXiv:2303.15293  [pdf, ps, other

    eess.AS cs.CL cs.LG cs.SD

    A Deliberation-based Joint Acoustic and Text Decoder

    Authors: Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu

    Abstract: We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) has shown promising results through the use of text data during model training and the recently introduced deliberation architecture has reduced recognition errors by leveraging first-pass dec… ▽ More

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: Interspeech 2021

  17. arXiv:2303.08343  [pdf, ps, other

    eess.AS cs.AI cs.LG cs.SD

    Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models

    Authors: Steven M. Hernandez, Ding Zhao, Shaojin Ding, Antoine Bruguier, Rohit Prabhavalkar, Tara N. Sainath, Yanzhang He, Ian McGraw

    Abstract: Continued improvements in machine learning techniques offer exciting new opportunities through the use of larger models and larger training datasets. However, there is a growing need to offer these new capabilities on-board low-powered devices such as smartphones, wearables and other embedded environments where only low memory is available. Towards this, we consider methods to reduce the model siz… ▽ More

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: Accepted to IEEE ICASSP 2023

  18. arXiv:2303.03329  [pdf, other

    eess.AS cs.CL cs.SD

    End-to-End Speech Recognition: A Survey

    Authors: Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe

    Abstract: In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning. In the wake of this transition, a number of all-neural ASR architectures were introduced. These so-called end-to-end (E2E) models provide highly integrated, completely neural AS… ▽ More

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  19. arXiv:2302.11186  [pdf, other

    eess.AS cs.CL cs.SD

    UML: A Universal Monolingual Output Layer for Multilingual ASR

    Authors: Chao Zhang, Bo Li, Tara N. Sainath, Trevor Strohman, Shuo-yiin Chang

    Abstract: Word-piece models (WPMs) are commonly used subword units in state-of-the-art end-to-end automatic speech recognition (ASR) systems. For multilingual ASR, due to the differences in written scripts across languages, multilingual WPMs bring the challenges of having overly large output layers and scaling to more languages. In this work, we propose a universal monolingual output layer (UML) to address… ▽ More

    Submitted 22 February, 2023; originally announced February 2023.

    Comments: Published as a conference paper at ICASSP 2023

  20. arXiv:2302.08917  [pdf, other

    cs.CL cs.LG

    Massively Multilingual Shallow Fusion with Large Language Models

    Authors: Ke Hu, Tara N. Sainath, Bo Li, Nan Du, Yanping Huang, Andrew M. Dai, Yu Zhang, Rodrigo Cabrera, Zhifeng Chen, Trevor Strohman

    Abstract: While large language models (LLM) have made impressive progress in natural language processing, it remains unclear how to utilize them in improving automatic speech recognition (ASR). In this work, we propose to train a single multilingual language model (LM) for shallow fusion in multiple languages. We push the limits of the multilingual LM to cover up to 84 languages by scaling up using a mixtur… ▽ More

    Submitted 17 February, 2023; originally announced February 2023.

    Comments: Accepted to IEEE ICASSP 2023

  21. arXiv:2302.08583  [pdf, other

    eess.AS cs.AI cs.CL cs.LG cs.SD

    JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

    Authors: Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang, Bo Li, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM during E2E training which improves rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM estimates a cross-entropy loss on unpaired text. The E2E model is trained to minimize a weighted sum of E2… ▽ More

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: 5 pages, 3 figures, in ICASSP 2023

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes island, Greece

  22. arXiv:2302.01496  [pdf, ps, other

    cs.CL cs.LG cs.SD eess.AS

    Efficient Domain Adaptation for Speech Foundation Models

    Authors: Bo Li, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang, Wei Han, Trevor Strohman, Francoise Beaufays

    Abstract: Foundation models (FMs), that are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have brought large interest in the research community. Benefiting from the diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we presen… ▽ More

    Submitted 2 February, 2023; originally announced February 2023.

  23. arXiv:2301.07851  [pdf, other

    cs.SD cs.AI cs.LG cs.NE eess.AS

    From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

    Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman

    Abstract: In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can \textbf{re-purpose} well-trained English automatic speech recognition (ASR) models to recognize the other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time… ▽ More

    Submitted 18 January, 2023; originally announced January 2023.

    Comments: Submitted to ICASSP 2023. The project was initiated in May 2022 during a research internship at Google Research

  24. arXiv:2211.15432  [pdf, other

    cs.CL

    E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

    Authors: W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

    Abstract: We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated wi… ▽ More

    Submitted 5 March, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  25. arXiv:2211.02712  [pdf, other

    cs.LG cs.SD eess.AS

    Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion

    Authors: Zhouyuan Huo, Khe Chai Sim, Bo Li, Dongseong Hwang, Tara N. Sainath, Trevor Strohman

    Abstract: Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning separate foundation models for many downstream tasks are expensive since the foundation model is usually very big. Parameter-efficient fine-tuning methods (e.g. adapter, sparse update methods) offer an alte… ▽ More

    Submitted 4 November, 2022; originally announced November 2022.

  26. arXiv:2211.01263  [pdf, other

    cs.SD cs.LG eess.AS quant-ph

    A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition

    Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues often encountered in training large-scare acoustic models in low-resource scenarios. We project acoustic features based on classical-to-quantum feature encoding. Different from existing quantum convolution techniques, we utilize QKL with features in the quantum space to design kernel-based classifiers… ▽ More

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  27. arXiv:2210.07353  [pdf, other

    cs.CL cs.SD eess.AS

    JOIST: A Joint Speech and Text Streaming Model For ASR

    Authors: Tara N. Sainath, Rohit Prabhavalkar, Ankur Bapna, Yu Zhang, Zhouyuan Huo, Zhehuai Chen, Bo Li, Weiran Wang, Trevor Strohman

    Abstract: We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E) model with both speech-text paired inputs, and text-only unpaired inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, which are also novelties comp… ▽ More

    Submitted 13 October, 2022; originally announced October 2022.

  28. arXiv:2210.05785  [pdf, other

    cs.CL cs.SD eess.AS

    Scaling Up Deliberation for Multilingual ASR

    Authors: Ke Hu, Bo Li, Tara N. Sainath

    Abstract: Multilingual end-to-end automatic speech recognition models are attractive due to its simplicity in training and deployment. Recent work on large-scale training of such models has shown promising results compared to monolingual models. However, the work often focuses on multilingual models themselves in a single-pass setup. In this work, we investigate second-pass deliberation for multilingual spe… ▽ More

    Submitted 11 October, 2022; originally announced October 2022.

  29. arXiv:2208.13916  [pdf, other

    eess.AS cs.CL cs.SD

    A Language Agnostic Multilingual Streaming On-Device ASR System

    Authors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

    Abstract: On-device end-to-end (E2E) models have shown improvements over a conventional model on English Voice Search tasks in both quality and latency. E2E models have also shown promising results for multilingual automatic speech recognition (ASR). In this paper, we extend our previous capacity solution to streaming applications and present a streaming multilingual E2E ASR system that runs fully on device… ▽ More

    Submitted 29 August, 2022; originally announced August 2022.

    Comments: Accepted in Interspeech 2022

  30. arXiv:2208.13322  [pdf, other

    cs.CL cs.SD eess.AS

    Streaming Intended Query Detection using E2E Modeling for Continued Conversation

    Authors: Shuo-yiin Chang, Guru Prakash, Zelin Wu, Qiao Liang, Tara N. Sainath, Bo Li, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

    Abstract: In voice-enabled applications, a predetermined hotword isusually used to activate a device in order to attend to the query.However, speaking queries followed by a hotword each timeintroduces a cognitive burden in continued conversations. Toavoid repeating a hotword, we propose a streaming end-to-end(E2E) intended query detector that identifies the utterancesdirected towards the device and filters… ▽ More

    Submitted 28 August, 2022; originally announced August 2022.

    Comments: 5 pages, Interspeech 2022

  31. arXiv:2208.13321  [pdf, other

    cs.CL cs.SD eess.AS

    Turn-Taking Prediction for Natural Conversational Speech

    Authors: Shuo-yiin Chang, Bo Li, Tara N. Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He

    Abstract: While a streaming voice assistant system has been used in many applications, this system typically focuses on unnatural, one-shot interactions assuming input from a single voice query without hesitation or disfluency. However, a common conversational utterance often involves multiple queries with turn-taking, in addition to disfluencies. These disfluencies include pausing to think, hesitations, wo… ▽ More

    Submitted 28 August, 2022; originally announced August 2022.

    Comments: 5 pages, Interspeech 2022

  32. arXiv:2208.13191  [pdf, other

    cs.SD cs.AI eess.AS

    Towards Disentangled Speech Representations

    Authors: Cal Peyser, Ronny Huang Andrew Rosenberg Tara N. Sainath, Michael Picheny, Kyunghyun Cho

    Abstract: The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized "disentanglement", where a representation contains only parts of the speech signal relevant to transcription while discarding irrelevant information. In this paper, we construct a representation learning task based on joint… ▽ More

    Submitted 28 August, 2022; originally announced August 2022.

  33. arXiv:2206.14716  [pdf, other

    cs.CL cs.SD eess.AS

    Improving Deliberation by Text-Only and Semi-Supervised Training

    Authors: Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

    Abstract: Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose incorporating text-only and semi-supervised training into an attention-based deliberation model. By incorporating text-only data in training a bidirectional encoder representation from transformer (BERT) for the deli… ▽ More

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted by Interspeech 2022

  34. arXiv:2205.10643  [pdf, other

    cs.CL cs.SD eess.AS

    Self-Supervised Speech Representation Learning: A Review

    Authors: Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

    Abstract: Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a… ▽ More

    Submitted 27 October, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

  35. arXiv:2204.10749  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

    Authors: W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

    Abstract: Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for… ▽ More

    Submitted 15 June, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

    Comments: Interspeech 2022

  36. arXiv:2204.07556  [pdf, other

    cs.CL cs.LG eess.AS

    Streaming Align-Refine for Non-autoregressive Deliberation

    Authors: Weiran Wang, Ke Hu, Tara N. Sainath

    Abstract: We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model. Our algorithm facilitates a simple greedy decoding procedure, and at the same time is capable of producing the decoding result at each frame with limited right context, thus enjoying both high efficiency and low latency. These advantages are achieved by convertin… ▽ More

    Submitted 15 April, 2022; originally announced April 2022.

    Comments: In submission to INTERSPEECH 2022

  37. arXiv:2204.07553  [pdf, other

    cs.CL cs.SD eess.AS

    Improving Rare Word Recognition with LM-aware MWER Training

    Authors: Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

    Abstract: Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use… ▽ More

    Submitted 27 June, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: To appear in INTERSPEECH 2022

  38. arXiv:2204.06164  [pdf, other

    eess.AS cs.LG cs.SD

    A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

    Authors: Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

    Abstract: In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios. Moreover, the model can significantly reduce model size and power consumption without loss of quality. Namely, with the dynamic cascaded encoder model, we explore three techniques to maximally boost the performance of each model size: 1) Use separa… ▽ More

    Submitted 24 June, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

    Comments: Accepted by INTERSPEECH 2022

  39. arXiv:2203.05008  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

    Authors: W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

    Abstract: Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather… ▽ More

    Submitted 15 June, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022

  40. arXiv:2201.10240  [pdf, other

    eess.AS cs.AI cs.CL cs.SD

    Improving the fusion of acoustic and text representations in RNN-T

    Authors: Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-yiin Chang

    Abstract: The recurrent neural network transducer (RNN-T) has recently become the mainstream end-to-end approach for streaming automatic speech recognition (ASR). To estimate the output distributions over subword units, RNN-T uses a fully connected layer as the joint network to fuse the acoustic representations extracted using the acoustic encoder with the text representations obtained using the prediction… ▽ More

    Submitted 25 January, 2022; originally announced January 2022.

    Comments: Paper to appear at ICASSP 2022

  41. arXiv:2111.08137  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Joint Unsupervised and Supervised Training for Multilingual ASR

    Authors: Junwen Bai, Bo Li, Yu Zhang, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, Tara N. Sainath

    Abstract: Self-supervised training has shown promising gains in pretraining models and facilitating the downstream finetuning for speech recognition, like multilingual ASR. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised finetuning resumes in the second stage. In this paper, we propose an end-to-end (E2E) Jo… ▽ More

    Submitted 15 November, 2021; originally announced November 2021.

  42. arXiv:2109.13226  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

    Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang , et al. (1 additional authors not shown)

    Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled da… ▽ More

    Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

    Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

  43. arXiv:2109.07513  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Tied & Reduced RNN-T Decoder

    Authors: Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He

    Abstract: Previous works on the Recurrent Neural Network-Transducer (RNN-T) models have shown that, under some conditions, it is possible to simplify its prediction network with little or no loss in recognition accuracy (arXiv:2003.07705 [eess.AS], [2], arXiv:2012.06749 [cs.CL]). This is done by limiting the context size of previous labels and/or using a simpler architecture for its layers instead of LSTMs.… ▽ More

    Submitted 15 September, 2021; originally announced September 2021.

    Journal ref: Proc. Interspeech 2021, 4563-4567

  44. arXiv:2104.14830  [pdf, other

    cs.CL cs.SD eess.AS

    Scaling End-to-End Models for Large-Scale Multilingual ASR

    Authors: Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai

    Abstract: Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity.… ▽ More

    Submitted 11 September, 2021; v1 submitted 30 April, 2021; originally announced April 2021.

    Comments: ASRU 2021

  45. arXiv:2104.04552  [pdf, other

    cs.CL cs.SD eess.AS

    Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

    Authors: W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

    Abstract: We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table… ▽ More

    Submitted 6 June, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: Presented as conference paper at Interspeech 2021

  46. arXiv:2103.06716  [pdf, other

    eess.AS cs.CL cs.LG

    Learning Word-Level Confidence For Subword End-to-End ASR

    Authors: David Qiu, Qiujia Li, Yanzhang He, Yu Zhang, Bo Li, Liangliang Cao, Rohit Prabhavalkar, Deepti Bhatia, Wei Li, Ke Hu, Tara N. Sainath, Ian McGraw

    Abstract: We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confi… ▽ More

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: To appear in ICASSP 2021

  47. arXiv:2101.11577  [pdf, other

    cs.CL

    Transformer Based Deliberation for Two-Pass Speech Recognition

    Authors: Ke Hu, Ruoming Pang, Tara N. Sainath, Trevor Strohman

    Abstract: Interactive speech recognition systems must generate words quickly while also producing accurate results. Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate. Previous work has established that a deliberation network can be an effective second-pass model. The model attends… ▽ More

    Submitted 27 January, 2021; originally announced January 2021.

  48. arXiv:2012.06749  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging

    Authors: Rohit Prabhavalkar, Yanzhang He, David Rybach, Sean Campbell, Arun Narayanan, Trevor Strohman, Tara N. Sainath

    Abstract: End-to-end models that condition the output label sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since unique label histories correspond to distinct models states, such models are decoded using an approximate beam-search process which produces a tree of hypotheses. In this work, we study the influen… ▽ More

    Submitted 12 December, 2020; originally announced December 2020.

  49. arXiv:2011.10798  [pdf, other

    eess.AS cs.SD

    A Better and Faster End-to-End Model for Streaming ASR

    Authors: Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han, Qiao Liang, Yu Zhang, Trevor Strohman, Yonghui Wu

    Abstract: End-to-end (E2E) models have shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this i… ▽ More

    Submitted 11 February, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

    Comments: Accepted in ICASSP 2021

  50. arXiv:2010.14606  [pdf, other

    eess.AS cs.CL cs.SD

    Cascaded encoders for unifying streaming and non-streaming ASR

    Authors: Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to either operate in streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoder… ▽ More

    Submitted 27 October, 2020; originally announced October 2020.