Showing 1–14 of 14 results for author: Haghani, P

  1. arXiv:2406.06664  [pdf, other]

    eess.AS cs.LG cs.SD

    ASTRA: Aligning Speech and Text Representations for Asr without Sampling

    Authors: Neeraj Gaur, Rohan Agrawal, Gary Wang, Parisa Haghani, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: This paper introduces ASTRA, a novel method for improving Automatic Speech Recognition (ASR) through text injection. Unlike prevailing techniques, ASTRA eliminates the need for sampling to match sequence lengths between speech and text modalities. Instead, it leverages the inherent alignments learned within CTC/RNNT models. This approach offers the following two advantages, namely, avoiding potenti…

    Submitted 13 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: To be published in Interspeech 2024

  2. arXiv:2310.13015  [pdf, other]

    cs.CL cs.AI eess.AS

    Audio-AdapterFusion: A Task-ID-free Approach for Efficient and Non-Destructive Multi-task Speech Recognition

    Authors: Hillary Ngai, Rohan Agrawal, Neeraj Gaur, Ronny Huang, Parisa Haghani, Pedro Moreno Mengibar

    Abstract: Adapters are an efficient, composable alternative to full fine-tuning of pre-trained models and help scale the deployment of large ASR models to many tasks. In practice, a task ID is commonly prepended to the input during inference to route to single-task adapters for the specified task. However, one major limitation of this approach is that the task ID may not be known during inference, rendering…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Proceedings

  3. arXiv:2308.07393  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Using Text Injection to Improve Recognition of Personal Identifiers in Speech

    Authors: Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran

    Abstract: Accurate recognition of specific categories, such as persons' names, dates or other identifiers is critical in many Automatic Speech Recognition (ASR) applications. As these categories represent personal information, ethical use of this data including collection, transcription, training and evaluation demands special care. One way of ensuring the security and privacy of individuals is to redact or…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted to Interspeech 2023

    MSC Class: 68T10; ACM Class: I.2.7

  4. arXiv:2308.03917  [pdf, other]

    cs.CL cs.SD eess.AS

    Universal Automatic Phonetic Transcription into the International Phonetic Alphabet

    Authors: Chihiro Taguchi, Yusuke Sakai, Parisa Haghani, David Chiang

    Abstract: This paper presents a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA). Transcription of spoken languages into IPA is an essential yet time-consuming process in language documentation, and even partially automating this process has the potential to drastically speed up the documentation of endangered languages. Like the previous best spe…

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: 5 pages, 7 tables

  5. arXiv:2303.01037  [pdf, other]

    cs.CL cs.SD eess.AS

    Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

    Authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, et al. (2 additional authors not shown)

    Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quant…

    Submitted 24 September, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: 20 pages, 7 figures, 8 tables

  6. arXiv:2210.16481  [pdf, other]

    eess.AS cs.CL cs.SD

    Accelerating RNN-T Training and Inference Using CTC guidance

    Authors: Yongqiang Wang, Zhehuai Chen, Chengjian Zheng, Yu Zhang, Wei Han, Parisa Haghani

    Abstract: We propose a novel method to accelerate training and inference process of recurrent neural network transducer (RNN-T) based on the guidance from a co-trained connectionist temporal classification (CTC) model. We made a key assumption that if an encoder embedding frame is classified as a blank frame by the CTC model, it is likely that this frame will be aligned to blank for all the partial alignmen…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: submitted to ICASSP 2023

  7. arXiv:2209.06058  [pdf, other]

    eess.AS cs.CL

    Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

    Authors: Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-yiin Chang, Parisa Haghani

    Abstract: Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascade…

    Submitted 13 September, 2022; originally announced September 2022.

  8. arXiv:2208.13916  [pdf, other]

    eess.AS cs.CL cs.SD

    A Language Agnostic Multilingual Streaming On-Device ASR System

    Authors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

    Abstract: On-device end-to-end (E2E) models have shown improvements over a conventional model on English Voice Search tasks in both quality and latency. E2E models have also shown promising results for multilingual automatic speech recognition (ASR). In this paper, we extend our previous capacity solution to streaming applications and present a streaming multilingual E2E ASR system that runs fully on device…

    Submitted 29 August, 2022; originally announced August 2022.

    Comments: Accepted in Interspeech 2022

  9. arXiv:2110.03841  [pdf, ps, other]

    eess.AS cs.CL

    Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition

    Authors: Zhiyun Lu, Yanwei Pan, Thibault Doutre, Parisa Haghani, Liangliang Cao, Rohit Prabhavalkar, Chao Zhang, Trevor Strohman

    Abstract: End-to-end models have achieved state-of-the-art results on several automatic speech recognition tasks. However, they perform poorly when evaluated on long-form data, e.g., minutes long conversational telephony audio. One reason the model fails on long-form speech is that it has only seen short utterances during training. In this paper we study the effect of training utterance length on the word e…

    Submitted 1 April, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: submitted to INTERSPEECH 2022

  10. arXiv:2104.14830  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling End-to-End Models for Large-Scale Multilingual ASR

    Authors: Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai

    Abstract: Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity.…

    Submitted 11 September, 2021; v1 submitted 30 April, 2021; originally announced April 2021.

    Comments: ASRU 2021

  11. arXiv:1902.08295  [pdf, other]

    cs.LG stat.ML

    Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

    Authors: Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, et al. (66 additional authors not shown)

    Abstract: Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly w…

    Submitted 21 February, 2019; originally announced February 2019.

  12. arXiv:1809.09190  [pdf, other]

    eess.AS cs.CL cs.SD

    From Audio to Semantics: Approaches to end-to-end spoken language understanding

    Authors: Parisa Haghani, Arun Narayanan, Michiel Bacchiani, Galen Chuang, Neeraj Gaur, Pedro Moreno, Rohit Prabhavalkar, Zhongdi Qu, Austin Waters

    Abstract: Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to a transcript, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of domains, intents, and arguments. These modules are typically optimized independently. In this paper, we formulate audio to sem…

    Submitted 24 September, 2018; originally announced September 2018.

  13. arXiv:1808.05312  [pdf, other]

    cs.CL eess.AS

    Toward domain-invariant speech recognition via large scale training

    Authors: Arun Narayanan, Ananya Misra, Khe Chai Sim, Golan Pundak, Anshuman Tripathi, Mohamed Elfeky, Parisa Haghani, Trevor Strohman, Michiel Bacchiani

    Abstract: Current state-of-the-art automatic speech recognition systems are trained to work in specific 'domains', defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. This work explores the idea of building a single domain-invariant model for varied use-cases by combining larg…

    Submitted 15 August, 2018; originally announced August 2018.

  14. Efficient and Robust Secure Aggregation for Sensor Networks

    Authors: P. Haghani, P. Papadimitratos, M. Poturalski, K. Aberer, J. -P. Hubaux

    Abstract: Wireless Sensor Networks (WSNs) rely on in-network aggregation for efficiency, however, this comes at a price: A single adversary can severely influence the outcome by contributing an arbitrary partial aggregate value. Secure in-network aggregation can detect such manipulation. But as long as such faults persist, no aggregation result can be obtained. In contrast, the collection of individual se…

    Submitted 19 August, 2008; originally announced August 2008.

    Journal ref: IEEE ICNP Workshop on Secure Network Protocols (NPSec), pages 1–6, Beijing, China, October 2007