Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, Ryan Langman, Jagadeesh Balam, Boris Ginsburg

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

Abstract

Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis of building ASR systems with discrete codes. We investigate different methods for codec training such as quantization schemes and time-domain vs spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms Encodec at a similar bit-rate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143-language ML-SUPERB benchmark despite being smaller in size and pretrained on significantly less data.

keywords:
discrete speech representation, automatic speech recognition, audio codecs, noise robustness

1 Introduction

A tremendous amount of progress has been achieved in the area of speech and audio technologies in recent years, in large part due to advances in deep learning and the availability of large-scale datasets [1, 2, 3]. In particular, transformer-based models led to significant improvements in speech-related tasks such as automatic speech recognition (ASR) [4, 5] and joint speech-text modeling [6, 7].

Typically, the input speech signal of an ASR model is represented using a mel-spectrogram, resulting in a continuous representation of the speech signal. Learnable alternatives have been explored for different applications [8, 9, 10, 11]. However, using mel-spectrograms is still a prevalent choice for ASR systems due to their effectiveness [12, 13]. Recently, the use of discrete speech representations has garnered attention for their efficacy in training transformer-based models for various speech-related tasks [7, 14, 15, 16] and compatibility with language-modeling architectures [6].

Discrete speech representations are typically categorized as either acoustic or semantic. The former capture the acoustic properties of the speech signal, such as pitch, tone, and rhythm. The latter capture the semantic properties of the speech signal, like the meaning and context conveyed by the speech, including words, phrases, and their associations. Semantic codes are typically obtained by clustering the speech representation at the output of a pre-trained encoder [17, 18, 19], or using a codec model [20]. Acoustic codes are typically obtained by compressing and quantizing the speech signal, e.g., using an audio codec, and aim to reconstruct the original signal from a compressed representation. Several neural audio codecs (NACs) have been proposed recently [21, 22, 23, 24, 25]. Typically, such codecs have an encoder-quantizer-decoder architecture, where the encoder compresses the input speech signal into a latent representation, the quantizer approximates it using a discrete representation, and the decoder reconstructs the original signal from the discrete representation. Acoustic codes are particularly relevant for multi-task foundational models, which aim to simultaneously understand the content in the input signal and generate high-quality output signals. While acoustic tokens have been explored in the context of speech and audio synthesis [6, 26] and processing [7], their use in ASR systems has been relatively underexplored [14].

To address the above gap, we perform a comprehensive analysis of building ASR systems with discrete codes. Firstly, we train and evaluate codecs operating in either the time or the spectral domain with different quantizers. Secondly, we explore approaches for improving ASR performance and training efficiency, and evaluate approaches for improving noise robustness. Based on our findings, we present a pipeline for noise-robust ASR training with discrete representations generated using a neural audio codec. Thirdly, to demonstrate the generalizability of the proposed NAC+ASR pipeline, we further experiment with the ML-SUPERB dataset [27] consisting of 143 languages. The presented results give us a better understanding of the various components of the NAC+ASR pipeline.

We demonstrate that the proposed pipeline, built on the above findings, is very competitive, outperforming the prevalent Encodec-based [22] systems in comparable settings. Our system also achieves a CER of 21% on the hard ML-SUPERB 1h test set, beating previous state-of-the-art (SOTA) results. The trained NAC (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/audio_codec_16khz_small) and ASR models along with accompanying code will be released in the open-source NVIDIA NeMo toolkit (https://github.com/NVIDIA/NeMo).

2 Speech recognition with audio codecs

[Block diagram: time-domain input → optional filterbank → encoder → quantizer (quantize → codes → dequantize) → decoder → time-domain output]
Figure 1: Architecture of the considered neural audio codecs.

In this section, we discuss the various components of the proposed ASR pipeline that operates on discrete speech representations. The block scheme of the complete pipeline is depicted in Figure 2.

2.1 Audio codecs

Audio codecs capture details of the audio signal using discrete codes at a low bitrate, and are used for speech representation in various tasks, efficient data transmission, and general data compression. Here we consider two types of NACs, operating either on the time-domain signal or on a spectral representation. Figure 1 depicts the general architecture of the considered codecs.

2.1.1 Quantization schemes

Residual vector quantization (RVQ) is the common approach used for NAC, e.g., in SoundStream [21], Encodec [22], and DAC [23]. RVQ uses a series of codebooks of size $D_{\text{cb}}$, with the current codebook quantizing the residual from the previous quantization step [21]. For each time step, RVQ produces $N_{\text{cb}}$ codes, corresponding to the number of codebooks. In this paper, RVQ is configured using $D_{\text{enc}}=128$, $D_{\text{cb}}=1024$ and $N_{\text{cb}}=8$.
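The residual cascade described above can be illustrated with a minimal sketch. It is not the codec implementation used in this paper: the codebooks are random stand-ins for learnt ones, the sizes are toy values rather than $D_{\text{enc}}=128$, $D_{\text{cb}}=1024$, $N_{\text{cb}}=8$, and the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are hypothetical.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(codebooks, latent):
    """Quantize one latent vector; each stage codes the residual of the previous one."""
    residual = list(latent)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Dequantize by summing the selected entry of every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy configuration with random (assumed) codebooks instead of trained ones.
rng = random.Random(0)
D_ENC, D_CB, N_CB = 4, 16, 3
codebooks = [[[rng.gauss(0.0, 1.0 / (stage + 1)) for _ in range(D_ENC)]
              for _ in range(D_CB)] for stage in range(N_CB)]
latent = [rng.gauss(0.0, 1.0) for _ in range(D_ENC)]
codes = rvq_encode(codebooks, latent)   # N_CB integer codes for this time step
recon = rvq_decode(codebooks, codes)    # approximation of the latent vector
```

Each time step therefore yields one integer per codebook, which is the representation consumed by the ASR pipeline.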

Finite scalar quantization (FSQ) [28] typically uses a smaller latent dimension $D_{\text{enc}}$ than RVQ. Each element of the latent vector is quantized independently into one of a fixed number of levels, e.g., to $\{-1, 0, 1\}$ when using three levels. As opposed to RVQ, FSQ results in a flat codebook, without a recursive relationship between individual codes. In this paper, FSQ is configured using $D_{\text{enc}}=32$ and $N_{\text{cb}}=8$. For convenience, each $D_{\text{enc}}/N_{\text{cb}}$-dimensional subset of the embedding is seen as a separate group quantized with $[8,5,5,5]$ levels, resulting in $D_{\text{cb}}=8\cdot 5\cdot 5\cdot 5=1000$ [28].
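As a rough illustration of how a group with $[8,5,5,5]$ levels maps to one of 1000 codes, the sketch below bounds each latent element with a tanh, rounds it to an integer level, and combines the levels of a group into a single mixed-radix index. The tanh bounding and the index layout are assumptions made for illustration and may differ from the implementation in [28] or in this paper; the function names are hypothetical.

```python
import math

LEVELS = [8, 5, 5, 5]  # levels per dimension within one group

def fsq_levels(group):
    """Quantize each element of a group independently to an integer level in [0, L-1]."""
    digits = []
    for z, n_levels in zip(group, LEVELS):
        bounded = (math.tanh(z) + 1.0) / 2.0            # squash to [0, 1]
        digits.append(min(n_levels - 1, int(round(bounded * (n_levels - 1)))))
    return digits

def levels_to_index(digits):
    """Combine the per-dimension levels into one code index (mixed-radix number)."""
    index = 0
    for d, n_levels in zip(digits, LEVELS):
        index = index * n_levels + d
    return index

def fsq_encode(latent):
    """Split a latent vector into groups of len(LEVELS) and return one code per group."""
    size = len(LEVELS)
    return [levels_to_index(fsq_levels(latent[i:i + size]))
            for i in range(0, len(latent), size)]

codebook_size = math.prod(LEVELS)       # 8 * 5 * 5 * 5
codes = fsq_encode([0.3 * (i % 7 - 3) for i in range(32)])  # 32-dim toy latent
```

Because every element is rounded on its own, the codes of different groups carry no residual relationship to each other, in contrast to RVQ.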

2.1.2 Time-domain NAC

Time-domain NAC (TD-NAC) follows the architecture used in previous works [21, 22, 24, 23, 25]. The encoder consists of a series of convolutional layers with downsampling applied directly on the time-domain signal at sample rate $f_s$, resulting in a total downsampling factor $f_{\text{down}}$. For each time step, the encoder generates a latent representation of the input signal of dimension $D_{\text{enc}}$ at rate $f_{\text{enc}}=f_s/f_{\text{down}}$, which is quantized to obtain discrete codes. For reconstruction, discrete codes are dequantized into a latent representation, and a convolutional decoder is used to obtain a time-domain output signal. Our encoder and decoder configuration follows [22]. The encoder consists of 1D convolutions followed by residual convolution blocks with downsampling, with LSTM layers for sequence modeling and a final 1D convolution. The decoder uses a reverse layer ordering with transposed convolutions [22].

2.1.3 Spectral NAC

As opposed to the time-domain NAC, a spectral NAC [29] applies the encoder on a spectral representation of the input signal obtained using a filterbank, as depicted in Figure 1. We use an 80-dimensional mel-spectrogram obtained from a mel-filterbank and refer to the model as Mel-NAC.

With RVQ we encode the mel-spectrogram with a single residual network consisting of six HiFi-GAN V1 [30] residual blocks with a hidden dimension of 256 and 1024 residual channels. With FSQ we divide the mel-spectrogram into 8 groups each containing 10 mel-bands. Each group is encoded using separate residual encoders with hidden dimension of 128 and 256 residual channels. The decoder is the HiFi-GAN V1 generator with 1024 initial channels.

Figure 2: The ASR with discrete codes pipeline.

2.2 Speech recognition pipeline

2.2.1 Embedding layer and codebook initialization

The initial stage of the pipeline involves the mapping of codes to embeddings, which are subsequently forwarded to the ASR encoder for model training. Here we employ a standard neural embedding layer which maps the output of each codebook to a fixed-dimensional embedding of size $D_{\text{emb}}$. The parameters of this neural embedding model are iteratively optimized during the end-to-end ASR system training. We can either initialize the weights of the embedding model randomly or use the learnt codebooks from the trained NAC model to provide a better starting point. We refer to the latter approach as codebook initialization of the embedding layer in the rest of the paper.
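The two initialization options can be sketched as follows. This is an illustration only: it assumes that the codebook vectors have the same dimension as the embedding (as is the case for the RVQ setup, where $D_{\text{enc}}=D_{\text{emb}}=128$), the random range is an arbitrary choice, and the function names and the toy codebook are hypothetical.

```python
import random

def random_embedding_table(num_codes, dim, rng):
    """Random initialization: small uniform values (the range is an arbitrary choice)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_codes)]

def codebook_embedding_table(codebook, dim):
    """Codebook initialization: start from a copy of the codec's learnt code vectors."""
    if any(len(vec) != dim for vec in codebook):
        raise ValueError("codebook vectors must match the embedding dimension")
    return [list(vec) for vec in codebook]

rng = random.Random(0)
toy_codebook = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]   # stand-in for one NAC codebook
table_random = random_embedding_table(len(toy_codebook), 2, rng)
table_from_codec = codebook_embedding_table(toy_codebook, 2)
```

In both cases the table is only a starting point; its entries continue to be updated during ASR training.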

2.2.2 Code aggregation strategies

As discussed in Section 2.1, most NACs employ multiple codebooks to obtain a reliable compressed discrete representation of the input signal. Consequently, each time step carries multiple codes, one per codebook. These codes must be aggregated across codebooks at each time step to enable their integration into standard ASR encoder-decoder architectures. This aggregation can be performed with two distinct schemes, as illustrated in Figure 2: stacking and averaging. In the stacking (stack) aggregation approach, embeddings from different codebooks are stacked atop one another, yielding an embedding size of $N_{\text{cb}}\times D_{\text{emb}}$. Conversely, the averaging (avg) aggregation approach computes the average of the embeddings from different codebooks at each time step, resulting in an embedding size of $D_{\text{emb}}$. In this paper, the default code aggregation strategy is averaging, unless otherwise specified.
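A minimal sketch of the two aggregation schemes for a single time step is given below. The function names and the toy embedding values are illustrative and not taken from the released code.

```python
def lookup(tables, codes):
    """Fetch one D_emb-dimensional embedding per codebook for a single time step."""
    return [tables[k][code] for k, code in enumerate(codes)]

def aggregate_stack(embeddings):
    """Stacking: concatenate the N_cb embeddings, giving N_cb * D_emb values."""
    return [x for emb in embeddings for x in emb]

def aggregate_avg(embeddings):
    """Averaging: element-wise mean of the N_cb embeddings, giving D_emb values."""
    n = len(embeddings)
    return [sum(vals) / n for vals in zip(*embeddings)]

# Toy example with N_cb = 2 codebooks and D_emb = 3.
tables = [
    [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]],   # embedding table of codebook 0
    [[9.0, 9.0, 9.0], [3.0, 4.0, 5.0]],   # embedding table of codebook 1
]
embeddings = lookup(tables, codes=[0, 1])
stacked = aggregate_stack(embeddings)     # length N_cb * D_emb = 6
averaged = aggregate_avg(embeddings)      # length D_emb = 3
```

Stacking keeps the contribution of every codebook separate at the cost of a wider input, while averaging keeps the input width independent of $N_{\text{cb}}$.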

2.2.3 Spectrogram augmentation

The technique of spectrogram augmentation (SpecAug) serves as a method for augmenting audio data, as introduced in [31]. This methodology transforms the augmentation task for audio signals into one resembling image augmentation by operating on the audio spectrogram. Though in this work we are training the ASR systems on discrete codes, we evaluate the impact of SpecAug on the ASR pipeline.

2.2.4 Noisy embedding training

Advancements in large language model (LLM) research have shown that the model fine-tuning process can be improved by the simple augmentation technique of adding noise to the embedding vectors during training [32]. We evaluate the efficacy of this method by adding scaled uniform noise (parameterized by $\alpha$ as introduced in [32]) to the output of the embedding layer (Section 2.2.1) during the training phase.
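The sketch below follows the noise scaling described in [32], where uniform noise in $[-1, 1]$ is multiplied by $\alpha/\sqrt{L\,d}$ for a sequence of $L$ frames with embedding dimension $d$. Whether the pipeline in this paper applies exactly this normalization is an assumption, and the function name is hypothetical.

```python
import math
import random

def add_embedding_noise(embeddings, alpha, rng, training=True):
    """Add scaled uniform noise to a [L x d] list of embedding frames during training."""
    if not training:
        return embeddings                      # no noise at inference time
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)   # scaling as described in [32]
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in frame]
            for frame in embeddings]

rng = random.Random(0)
frames = [[0.0] * 4 for _ in range(5)]         # toy sequence: L = 5, d = 4
noisy = add_embedding_noise(frames, alpha=5.0, rng=rng)
```

Because the scale shrinks with $\sqrt{L\,d}$, longer utterances and wider embeddings receive smaller per-element perturbations for the same $\alpha$.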

3 Experimental setup

Table 1: Configurations of the considered NACs.

| Codec | Quantizer | Parameters / $10^{6}$ | $D_{\text{enc}}$ | $f_{\text{enc}}$ / Hz |
|---|---|---|---|---|
| TD-NAC | RVQ | 13.8 | 128 | 80 |
| TD-NAC | FSQ | 13.1 | 32 | 80 |
| Mel-NAC | RVQ | 105 | 128 | 62.5 |
| Mel-NAC | FSQ | 104 | 32 | 62.5 |

3.1 NAC model training

Both TD-NAC and Mel-NAC are trained on the Libri-Light dataset [33] with a sample rate of 16 kHz. TD-NAC models use an encoder with downsampling rates of $\{2,4,5,5\}$, resulting in $f_{\text{enc}}=80$ Hz. Both RVQ and FSQ quantizers use $N_{\text{cb}}=8$ codebooks with $D_{\text{cb}}\approx 2^{10}$, resulting in a bitrate of 6.4 kbps. The TD-NAC decoder upsamples in the reverse order of $\{5,5,4,2\}$ to obtain the reconstructed audio signal. The model is trained on examples with one second of audio. Mel-NAC models use a mel-filterbank with a frame length of 1024 samples and a frame shift of 256 samples, resulting in $f_{\text{enc}}=62.5$ Hz. Using the same quantizer setup as for TD-NAC, this results in a bitrate of 5 kbps. The Mel-NAC decoder upsamples at rates of $\{8,4,4,2\}$ to obtain the reconstructed audio signal. The model is trained on examples with 0.512 seconds of audio. All NAC models are trained end-to-end using a time-domain loss, a discriminative loss, and a frequency-domain loss, similar to [22], with equal weights for the frequency-domain and discriminative losses and a weight of $0.1$ for the time-domain loss. Model sizes for each quantizer are provided in Table 1. The models are trained on eight NVIDIA V100 GPUs for 130k steps with the AdamW optimizer and a learning rate of $10^{-4}$. A StepLR scheduler with a step size of 1 and gamma of 0.999996 is employed for learning rate decay.
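The frame rates and bitrates quoted above follow from simple arithmetic, sketched below. The bitrate helper assumes $\lceil\log_2 D_{\text{cb}}\rceil$ bits per code, and the learning-rate helper assumes that the StepLR scheduler is stepped once per training step; both are assumptions, and the function names are hypothetical.

```python
import math

def frame_rate_hz(sample_rate_hz, strides):
    """Encoder frame rate: sample rate divided by the total downsampling factor."""
    return sample_rate_hz / math.prod(strides)

def bitrate_kbps(frame_rate, n_codebooks, codebook_size):
    """Bitrate assuming ceil(log2(D_cb)) bits per code (10 bits for 1000 or 1024)."""
    bits_per_code = math.ceil(math.log2(codebook_size))
    return frame_rate * n_codebooks * bits_per_code / 1000.0

def step_lr(initial_lr, gamma, num_steps):
    """Learning rate after num_steps multiplicative decays by gamma."""
    return initial_lr * gamma ** num_steps

td_rate = frame_rate_hz(16000, [2, 4, 5, 5])   # TD-NAC encoder strides
mel_rate = frame_rate_hz(16000, [256])         # Mel-NAC frame shift of 256 samples
td_kbps = bitrate_kbps(td_rate, 8, 1024)
mel_kbps = bitrate_kbps(mel_rate, 8, 1024)
final_lr = step_lr(1e-4, 0.999996, 130_000)    # under the per-step assumption
```

With these values the TD-NAC runs at 80 frames per second and 6.4 kbps, and the Mel-NAC at 62.5 frames per second and 5 kbps. The product of the Mel-NAC decoder rates, $8\cdot 4\cdot 4\cdot 2=256$, matches the frame shift.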

3.2 ASR model training

The ASR models presented in the paper adopt the FastConformer Transducer large architecture [34] with 114 M parameters. The encoder consists of 17 layers, with a model dimension of 512. We used 256 channels in the sub-sampling module and a kernel size of 9 in the convolution module. A single-layer RNN-T with a hidden dimension of 640 is used for the decoder. We maintain the embedding layer output dimension $D_{\text{emb}}$ (Section 2.2.1) at 128 and set $\alpha$ (Section 2.2.4) to 5 across all experiments to ensure equitable comparison. The ASR models are trained on the LibriSpeech corpus [35], encompassing 960 hours of English speech data. Evaluation of ASR model performance is conducted using the standard 'clean' and 'other' sets of the dev and test partitions from the LibriSpeech dataset. We use a SentencePiece Byte Pair Encoding (BPE) [36] tokenizer with a vocabulary size of 1024, trained on the text data from the LibriSpeech training set. All ASR models have been trained for 100k updates on two nodes with eight NVIDIA A100 80GB GPUs using a batch size of 32 on each GPU. We use AdamW with a peak learning rate of $2\cdot 10^{-3}$, 15k warmup steps with cosine annealing, a minimum learning rate of $10^{-6}$ and a weight decay of $10^{-3}$.
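For orientation, a schedule with these hyperparameters might look like the sketch below. It assumes a linear warmup to the peak learning rate followed by cosine decay to the minimum at the final update; the exact scheduler shape used for training is not specified here, so this is an illustration rather than the actual configuration.

```python
import math

PEAK_LR, MIN_LR = 2e-3, 1e-6
WARMUP_STEPS, TOTAL_STEPS = 15_000, 100_000

def learning_rate(step):
    """Linear warmup (assumed) followed by cosine annealing down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(s) for s in (0, 7_500, 15_000, 57_500, 100_000)]
```

Under these assumptions the learning rate peaks at step 15k and reaches the midpoint between the peak and the minimum at step 57.5k.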

3.3 Experiments and ablations

The experiments are designed to study and understand four major components of the pipeline: (i) the role of the NAC type, i.e., TD-NAC vs Mel-NAC, (ii) the role of quantizers in NAC, i.e., RVQ vs FSQ, (iii) the effect of code aggregation strategies, and (iv) performance improvements of codec ASR systems with pipeline optimizations. We also set up strong baselines in the form of traditional mel-spectrogram features as well as the widely used Encodec audio codec [22]. All other components, such as $D_{\text{emb}}$, ASR model size, ASR training data, and tokenizer, have been kept constant to facilitate an unbiased study of the role played by the four components highlighted above. The word error rate (WER) metric is used to evaluate the performance of the ASR models.
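For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference words. The sketch below is a plain dynamic-programming version for illustration; it is not the scoring tool used for the reported numbers, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def wer(references, hypotheses):
    """Corpus-level word error rate in percent."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words

example = wer(["the cat sat on the mat"], ["the cat sit on mat"])
```

The same routine yields a character error rate when it is applied to character sequences instead of word sequences.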

4 Results and discussion

Table 2: ASR improvement on LibriSpeech eval sets contributed by the various components of the presented ASR pipeline. All values are WER / % (lower is better); changes relative to the previous row are given in parentheses.

| Codec | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| TD-NAC-RVQ | 17.58 | 38.77 | 17.18 | 41.55 |
| + codebook initialization | 3.87 (-13.71) | 12.17 (-26.6) | 3.84 (-13.34) | 12.28 (-29.27) |
| + spectrogram augmentation | 2.21 (-1.66) | 5.83 (-6.34) | 2.36 (-1.48) | 5.84 (-6.44) |
| + noisy embedding training | 2.19 (-0.02) | 5.72 (-0.11) | 2.4 (+0.04) | 5.76 (-0.08) |

Table 3: ASR performance on LibriSpeech evaluation sets for the considered pipeline configurations. The last four columns report WER / % (lower is better).

| Input feature | Quantizer | $f_s$ / kHz | Bitrate / kbps | Code aggregation | $N_{\text{cb}}$ | $D_{\text{cb}}$ | $D_{\text{enc}}$ | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mel-spectrogram | - | - | - | - | - | - | - | 2.12 | 4.88 | 2.27 | 5.03 |
| EnCodec | RVQ | 24 | 24 | avg | 32 | 1024 | 128 | 2.16 | 5.68 | 2.3 | 5.47 |
| EnCodec | RVQ | 24 | 12 | avg | 16 | 1024 | 128 | 2.26 | 5.77 | 2.45 | 5.8 |
| EnCodec | RVQ | 24 | 6 | avg | 8 | 1024 | 128 | 2.23 | 6.02 | 2.35 | 5.96 |
| EnCodec | RVQ | 24 | 3 | avg | 4 | 1024 | 128 | 2.44 | 7.13 | 2.6 | 7.13 |
| TD-NAC | RVQ | 16 | 6.4 | stack | 8 | 1024 | 128 | 3.12 | 10.17 | 3.38 | 10.17 |
| TD-NAC | RVQ | 16 | 6.4 | avg | 8 | 1024 | 128 | 2.19 | 5.72 | 2.40 | 5.76 |
| TD-NAC | FSQ | 16 | 6.4 | stack | 8 | 1000 | 32 | 2.18 | 6.08 | 2.42 | 5.92 |
| Mel-NAC | RVQ | 16 | 5 | avg | 8 | 1024 | 128 | 2.23 | 5.92 | 2.40 | 5.80 |
| Mel-NAC | FSQ | 16 | 5 | stack | 8 | 1000 | 32 | 2.33 | 6.18 | 2.45 | 6.09 |

4.1 Codebook initialization, spectrogram augmentation and noisy embedding training

To investigate these components' effects, we first train a TD-NAC model with RVQ following the specifications outlined in Section 3.1. Using features from this audio codec as input, we establish our baseline ASR pipeline with the parameters detailed in Section 3.2, yielding the baseline performance noted in the first row of Table 2. Subsequently, we adapt the ASR pipeline to initialize the embedding layer with codebooks learned from the trained NAC (Section 2.2.1), keeping all other pipeline components unchanged. With this setup, we train another ASR system and report its performance in the second row of Table 2. Likewise, we progressively integrate spectrogram augmentation and noisy embedding training into the pipeline. Notably, codebook initialization of the embedding layer significantly enhances the ASR model's performance, with more than 10% absolute WER improvement across all the evaluation sets. Spectrogram augmentation improves the model's noise robustness, as reflected by more than 6% absolute WER improvement on the noisy 'other' sets. Noisy embedding training yields a further small gain on the 'other' sets (0.11% and 0.08% absolute), with a negligible change on the 'clean' sets. Consequently, for all subsequent experiments, we incorporate all three components (codebook initialization, spectrogram augmentation, and noisy embedding training) into the training pipeline.

4.2 Code aggregation strategy

To assess the influence of the code aggregation strategy on the ASR+NAC model pipeline, we build upon the baseline setting motivated in Section 4.1: TD-NAC model with RVQ, FastConformer-RNNT ASR model with the embedding layer initialized with the learnt codebooks, SpecAug, and noisy embedding training. Two models are trained: one using stacking for aggregating code embeddings and the other using averaging (refer to Section 2.2.2). The performance of these models is reported in rows 6 and 7 of Table 3. Notably, the averaging strategy yields significantly better WER than stacking. It is worth noting that the embedding dimension $D_{\text{emb}}$ (as discussed in Section 2.2.1) remained fixed at 128 for both runs, and the results might change with an increase in the embedding dimension. However, to ensure a fair comparison and assess the optimal configuration within the described setup, the embedding dimension was kept constant.

Despite the noted performance, stack remains the preferred aggregation scheme for all our NAC-FSQ systems. This choice is informed by the realization that different FSQ codebooks quantize distinct segments of the encoder output, whereas the RVQ codebooks encode residuals of the same vector.
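One way to read this argument is sketched below. Assuming the embedding tables are still equal to the codec codebooks (i.e., directly after codebook initialization and before ASR training updates them), the average of the RVQ embeddings is the dequantized latent vector scaled by $1/N_{\text{cb}}$, so averaging preserves what the codec decoder would receive. The codebooks here are random stand-ins and the function names are hypothetical.

```python
import random

def rvq_dequantize(codebooks, codes):
    """Sum of the selected code vectors, i.e., the dequantized latent vector."""
    dim = len(codebooks[0][0])
    return [sum(cb[c][d] for cb, c in zip(codebooks, codes)) for d in range(dim)]

def average_embeddings(tables, codes):
    """Averaging aggregation over per-codebook embedding tables."""
    dim = len(tables[0][0])
    return [sum(t[c][d] for t, c in zip(tables, codes)) / len(codes)
            for d in range(dim)]

rng = random.Random(0)
N_CB, D_CB, DIM = 4, 8, 3                       # toy sizes
codebooks = [[[rng.gauss(0.0, 1.0) for _ in range(DIM)]
              for _ in range(D_CB)] for _ in range(N_CB)]
codes = [rng.randrange(D_CB) for _ in range(N_CB)]

latent_hat = rvq_dequantize(codebooks, codes)
averaged = average_embeddings(codebooks, codes)  # tables == codebooks at initialization
max_gap = max(abs(a * N_CB - l) for a, l in zip(averaged, latent_hat))
```

No such relationship holds for FSQ, where each codebook describes a different slice of the latent vector and averaging would mix unrelated dimensions.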

4.3 Neural audio codec type

We proceed to examine and compare TD-NAC with Mel-NAC, assessing their influence on downstream ASR tasks. Owing to the distinct down-sampling structures and rates outlined in Sections 2.1 and 3.1, the compared TD-NAC operates at a bit-rate of 6.4 kbps, whereas Mel-NAC operates at 5 kbps. The remainder of the ASR pipeline remains constant, incorporating insights from Section 4.2, and we compare both RVQ and FSQ versions of the codecs. The results of these ablations are presented in the last four rows of Table 3. Notably, TD-NAC matches or slightly outperforms Mel-NAC on all considered ASR eval sets for both quantizers. This finding is intriguing, given that Mel-NAC outperforms TD-NAC for TTS tasks [29]. Hence, the selection of the NAC should consider the downstream task.

Furthermore, we observe that the presented TD-NAC with RVQ and only 8 codebooks outperforms Encodec with 4 and 16 codebooks on all evaluation sets, and Encodec with 8 codebooks on all sets except test-clean (2.40% vs 2.35% WER), while keeping all other parameters such as codebook size and ASR system parameter count constant. The performance of the TD-NAC system with a bit-rate of only 6.4 kbps closely matches that of Encodec with 24 kbps (using all 32 codebooks). We have open-sourced the weights (audio_codec_16khz_small) and code (https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb) for this codec model so that the community can use and build upon it.

4.4 Quantization schemes

Finally, we study the effect of quantization schemes on downstream ASR performance. Analysis of the last four rows of Table 3 reveals that FSQ detrimentally affects ASR performance, particularly on the noisy ’other’ sets. We hypothesize this happens because of the fixed finite level encoding scheme utilized by FSQ, which poses challenges in modeling noisy data.

5 Multilingual extension

To demonstrate the generalization ability of the presented NAC+ASR pipeline, we performed a study using additional languages and broader corpora. To this end, we participated in the ASR track of the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge [37], which uses the ML-SUPERB [27] dataset comprising 143 languages.

5.1 Model and data description

Our pipeline uses the TD-NAC model with RVQ, as detailed in Section 3.1, which obtained the best codec-based performance in the experiments summarized in Table 3. The NAC model was not retrained and we use the same setup as in Section 3.1. For ASR we use the FastConformer-RNNT model described in Section 3.2 along with the avg code aggregation strategy, codebook initialization of the embedding layer, SpecAug and noisy embedding training, based on Section 4. As per the challenge requirements, the ASR model is trained on the LibriSpeech-clean-100 subset (100 hrs) along with the ML-SUPERB 1h set (222 hrs), which contains data from 143 languages. The combined data has 6280 unique characters.

5.2 Results

Table 4: CER on the ML-SUPERB 1h test set.

| System | Challenge baseline | Our system |
|---|---|---|
| CER / % | 72.6 | 21.0 |

We compare the performance of our NAC+ASR pipeline with the baseline system [37] on the ML-SUPERB 1h test set which consists of 45 hours of unseen speech. Table 4 presents the Character Error Rate (CER) metric for both systems. It can be observed that our system with 21% CER significantly outperforms the challenge baseline. Moreover, our system surpasses the SOTA performance achieved by the XLSR-128 model, which reported a CER of 22% [27], despite being smaller in size and pretrained on significantly less data. This competitive CER underscores the effectiveness of the proposed NAC+ASR pipeline not only in monolingual scenarios (cf. Table 3) but also in multilingual settings encompassing over 100 languages.

6 Conclusion

In this work, we presented a speech recognition pipeline working on discrete codes from an audio codec and performed a study of different components of the system. We trained neural audio codecs with different quantizers and found that a time-domain codec with RVQ resulted in the best performance on the considered data. We investigated ASR pipeline optimizations and found that optimal code aggregation and codebook initialization resulted in large performance improvements. Furthermore, we found that SpecAug and noisy embedding training in our pipeline led to improved performance in clean conditions and superior robustness in noisy conditions. Our best result outperforms an EnCodec-based model at a comparable bit-rate. Finally, we studied the performance on a large multilingual dataset. The proposed model beats the SOTA performance of strong self-supervised models like XLSR-128 on the 143-language ML-SUPERB benchmark despite being smaller and trained on significantly less data. All the trained NAC and ASR models along with accompanying code will be released in the NeMo toolkit [38].

References

  • [1] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
  • [2] W. Chan et al., “Speechstew: Simply mix all available speech recognition data to train one large neural network,” arXiv preprint arXiv:2104.02133, 2021.
  • [3] T. J. Park et al., “A review of speaker diarization: Recent advances with deep learning,” Computer Speech & Language, vol. 72, 2022.
  • [4] Q. Zhang et al., “Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020.
  • [5] A. Gulati et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020.
  • [6] C. Wang et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
  • [7] X. Wang et al., “SpeechX: Neural codec language model as a versatile speech transformer,” arXiv preprint arXiv:2308.06873, 2023.
  • [8] T. N. Sainath et al., “Multichannel signal processing with deep neural networks for automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 965–979, 2017.
  • [9] Y. Luo and N. Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2018, pp. 696–700.
  • [10] M. Won et al., “Data-driven harmonic filters for audio representation learning,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020.
  • [11] N. Zeghidour et al., “LEAF: A learnable frontend for audio classification,” in Proc. Int. Conf. Learning Representations (ICLR), 2021.
  • [12] G. Synnaeve et al., “End-to-end ASR: from supervised to semi-supervised learning with modern architectures,” in Proc. ICML Workshop on Self-supervision in Audio and Speech, 2020.
  • [13] R. Prabhavalkar et al., “End-to-end speech recognition: A survey,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [14] K. C. Puvvada et al., “Discrete audio representation as an alternative to mel-spectrograms for speaker and speech recognition,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2024.
  • [15] X. Chang et al., “Exploration of efficient end-to-end asr using discretized input from self-supervised learning,” arXiv preprint arXiv:2305.18108, 2023.
  • [16] X. Chang, B. Yan et al., “Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study,” arXiv preprint arXiv:2309.15800, 2023.
  • [17] W.-N. Hsu et al., “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  • [18] Y.-A. Chung et al., “w2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 244–250.
  • [19] S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  • [20] Z. Huang, C. Meng, and T. Ko, “Repcodec: A speech representation codec for speech tokenization,” arXiv preprint arXiv:2309.00169, 2023.
  • [21] N. Zeghidour et al., “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
  • [22] A. Défossez et al., “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.
  • [23] R. Kumar et al., “High-fidelity audio compression with improved RVQGAN,” in Proc. Conf. on Neural Information Process. Systems (NeurIPS), 2023.
  • [24] Y.-C. Wu et al., “AudioDec: An open-source streaming high-fidelity neural audio codec,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  • [25] X. Zhang et al., “Speechtokenizer: Unified speech tokenizer for speech large language models,” arXiv preprint arXiv:2308.16692, 2023.
  • [26] Z. Borsos et al., “SoundStorm: Efficient parallel audio generation,” arXiv preprint arXiv:2305.09636, 2023.
  • [27] J. Shi, D. Berrebbi, W. Chen, H. L. Chung et al., “Ml-superb: Multilingual speech universal performance benchmark,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2023, 2023, pp. 884–888.
  • [28] F. Mentzer et al., “Finite scalar quantization: VQ-VAE made simple,” in Proc. International Conference on Learning Representations (ICLR), 2024.
  • [29] R. Langman et al., “Spectral Codecs: Spectrogram-based audio codecs for high quality speech synthesis,” arXiv preprint arXiv:2406.05298, 2024.
  • [30] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. Conf. on Neural Information Process. Systems (NeurIPS), 2020.
  • [31] D. S. Park et al., “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019.
  • [32] N. Jain et al., “NEFTune: Noisy embeddings improve instruction finetuning,” in Proc. International Conference on Learning Representations (ICLR), 2023.
  • [33] J. Kahn et al., “Libri-Light: A benchmark for asr with limited or no supervision,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020.
  • [34] D. Rekesh et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
  • [35] V. Panayotov et al., “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2015, pp. 5206–5210.
  • [36] T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proc. Conf. on Empirical Methods in Natural Language Processing: System Demonstrations, 2018.
  • [37] X. Chang et al., “Interspeech 2024 speech processing using discrete speech unit challenge,” https://www.wavlab.org/activities/2024/Interspeech2024-Discrete-Speech-Unit-Challenge, [Online].
  • [38] NVIDIA, “NeMo: a toolkit for conversational AI,” https://github.com/NVIDIA/NeMo, [Online; accessed May, 2024].