-
Investigating End-to-End ASR Architectures for Long Form Audio Transcription
Authors:
Nithin Rao Koluguri,
Samuel Kriman,
Georgy Zelenfroind,
Somshubra Majumdar,
Dima Rekesh,
Vahid Noroozi,
Jagadeesh Balam,
Boris Ginsburg
Abstract:
This paper presents an overview and evaluation of several end-to-end ASR models on long-form audio. We study three categories of Automatic Speech Recognition (ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation and (3) convolutional models with attention. We selected one ASR model from each category and evaluated Word Error Rate, maximum audio length and real-time factor for each model on a variety of long audio benchmarks: Earnings-21 and 22, CORAAL, and TED-LIUM3. The model from the category of self-attention with local attention and global token has the best accuracy compared to other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT on long-form audio.
Submitted 20 September, 2023; v1 submitted 18 September, 2023;
originally announced September 2023.
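The two scalar metrics named in the abstract above, Word Error Rate and real-time factor, can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function names are hypothetical, text normalization is reduced to whitespace splitting, and the real-time factor is taken here as processing time divided by audio duration (some works report the inverse).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Assumed convention: values below 1.0 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

For example, dropping one word from a six-word reference gives a WER of 1/6, and transcribing one hour of audio in 36 seconds gives a real-time factor of 0.01 under the convention assumed here.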
-
Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
Authors:
Dima Rekesh,
Nithin Rao Koluguri,
Samuel Kriman,
Somshubra Majumdar,
Vahid Noroozi,
He Huang,
Oleksii Hrinchuk,
Krishna Puvvada,
Ankur Kumar,
Jagadeesh Balam,
Boris Ginsburg
Abstract:
Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. With the objective of enhancing the Conformer architecture for efficient training and inference, we carefully redesigned Conformer with a novel downsampling schema. The proposed model, named Fast Conformer (FC), is 2.8x faster than the original Conformer, supports scaling to a billion parameters without any changes to the core architecture, and also achieves state-of-the-art accuracy on Automatic Speech Recognition benchmarks. To enable transcription of long-form speech up to 11 hours, we replaced global attention with limited context attention post-training, while also improving accuracy through fine-tuning with the addition of a global token. Fast Conformer, when combined with a Transformer decoder, also outperforms the original Conformer in accuracy and in speed for Speech Translation and Spoken Language Understanding.
Submitted 30 September, 2023; v1 submitted 8 May, 2023;
originally announced May 2023.
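The idea of limited context attention with a global token, mentioned in the abstract above, can be illustrated with an attention mask. The sketch below is an assumption-laden toy, not the Fast Conformer implementation: the abstract does not give the window size or the exact masking scheme, so the symmetric window, the placement of the global token at index 0, and the function names are all hypothetical.

```python
def local_attention_mask(num_frames: int, window: int) -> list[list[bool]]:
    """Boolean mask of shape (num_frames + 1) x (num_frames + 1).

    Index 0 is the (assumed) global token; indices 1..num_frames are frames.
    mask[q][k] is True when query position q may attend to key position k.
    """
    size = num_frames + 1
    mask = [[False] * size for _ in range(size)]
    # Each frame attends only to frames within `window` steps on either side.
    for i in range(num_frames):
        lo = max(0, i - window)
        hi = min(num_frames - 1, i + window)
        for j in range(lo, hi + 1):
            mask[i + 1][j + 1] = True
    # The global token attends to everything, and everything attends to it.
    for k in range(size):
        mask[0][k] = True
        mask[k][0] = True
    return mask


def allowed_pairs(mask: list[list[bool]]) -> int:
    """Number of query/key pairs that the mask permits."""
    return sum(sum(row) for row in mask)
```

With a fixed window, the number of permitted pairs grows roughly linearly with the number of frames, whereas full global attention grows quadratically. That difference is what makes very long inputs tractable in this kind of scheme.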
-
Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models
Authors:
Travis M. Bartley,
Fei Jia,
Krishna C. Puvvada,
Samuel Kriman,
Boris Ginsburg
Abstract:
In this paper, we extend previous self-supervised approaches for language identification by experimenting with a Conformer-based architecture in a multilingual pre-training paradigm. We find that pre-trained speech models optimally encode language-discriminatory information in lower layers. Further, we demonstrate that the embeddings obtained from these layers are significantly robust for classifying unseen languages and different acoustic environments without additional training. After fine-tuning a pre-trained Conformer model on the VoxLingua107 dataset, we achieve results similar to current state-of-the-art systems for language identification. Moreover, our model accomplishes this with 5x fewer parameters. We open-source the model through the NVIDIA NeMo toolkit.
Submitted 13 March, 2023; v1 submitted 9 November, 2022;
originally announced November 2022.
-
QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
Authors:
Samuel Kriman,
Stanislav Beliaev,
Boris Ginsburg,
Jocelyn Huang,
Oleksii Kuchaiev,
Vitaly Lavrukhin,
Ryan Leary,
Jason Li,
Yang Zhang
Abstract:
We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this model can be effectively fine-tuned on new datasets.
Submitted 22 October, 2019;
originally announced October 2019.
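The parameter savings behind the 1D time-channel separable convolutions described in the abstract above can be shown with simple arithmetic. The sketch assumes the usual depthwise-plus-pointwise factorization and ignores bias terms; the kernel width and channel count in the usage note are illustrative values, not figures taken from the abstract.

```python
def standard_conv1d_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a regular 1D convolution: every output channel has a
    kernel-wide filter over every input channel."""
    return kernel * c_in * c_out


def separable_conv1d_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a time-channel separable convolution.

    Depthwise step: one kernel-wide filter per input channel (acts over time).
    Pointwise step: a 1x1 convolution mixing channels at each time step.
    """
    depthwise = kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise
```

For a kernel of width 33 with 256 input and 256 output channels, the regular layer needs 2,162,688 weights while the separable one needs 73,984, roughly a 29x reduction for that layer.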
-
NeMo: a toolkit for building AI applications using Neural Modules
Authors:
Oleksii Kuchaiev,
Jason Li,
Huyen Nguyen,
Oleksii Hrinchuk,
Ryan Leary,
Boris Ginsburg,
Samuel Kriman,
Stanislav Beliaev,
Vitaly Lavrukhin,
Jack Cook,
Patrice Castonguay,
Mariya Popova,
Jocelyn Huang,
Jonathan M. Cohen
Abstract:
NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. NeMo makes it easy to combine and re-use these building blocks while providing a level of semantic correctness checking via its neural type system. The toolkit comes with extendable collections of pre-built modules for automatic speech recognition and natural language processing. Furthermore, NeMo provides built-in support for distributed training and mixed precision on the latest NVIDIA GPUs. NeMo is open-source: https://github.com/NVIDIA/NeMo
Submitted 13 September, 2019;
originally announced September 2019.
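The notion of modules with typed inputs and outputs, checked when they are composed, can be illustrated with a toy example. This is deliberately not the NeMo API: the classes `PortType` and `ToyModule`, the function `compose`, and the type names are all hypothetical and only mimic the idea of catching mismatched connections before any data flows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortType:
    """Hypothetical semantic type: what a tensor means and how its axes are laid out."""
    kind: str
    axes: tuple


@dataclass(frozen=True)
class ToyModule:
    """Hypothetical module that declares one input type and one output type."""
    name: str
    input_type: PortType
    output_type: PortType


def compose(*modules: ToyModule) -> list[str]:
    """Check that each module's output type matches the next module's input type.

    Returns the module names in order, or raises TypeError on the first mismatch.
    """
    for upstream, downstream in zip(modules, modules[1:]):
        if upstream.output_type != downstream.input_type:
            raise TypeError(
                f"{upstream.name} produces {upstream.output_type}, "
                f"but {downstream.name} expects {downstream.input_type}"
            )
    return [m.name for m in modules]


AUDIO = PortType("audio_signal", ("batch", "time"))
FEATURES = PortType("spectrogram", ("batch", "channel", "time"))
LOGITS = PortType("log_probs", ("batch", "time", "vocab"))

preprocessor = ToyModule("preprocessor", AUDIO, FEATURES)
encoder = ToyModule("encoder", FEATURES, LOGITS)
```

Calling `compose(preprocessor, encoder)` succeeds, while `compose(encoder, preprocessor)` raises a `TypeError` because the encoder's output type does not match the preprocessor's input type.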