Darshan Prabhu*1, Abhishek Gupta*1, Omkar Nitsure1, Preethi Jyothi1, Sriram Ganapathy2

Improving Self-supervised Pre-training using Accent-Specific Codebooks

Abstract

Speech accents present a serious challenge to the performance of state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems. Even with self-supervised learning and pre-training of ASR models, accent invariance is seldom achieved. In this work, we propose an accent-aware adaptation technique for self-supervised learning that introduces a trainable set of accent-specific codebooks to the self-supervised architecture. These learnable codebooks enable the model to capture accent-specific information during pre-training, which is further refined during ASR fine-tuning. On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches on both seen and unseen English accents, with up to 9% relative reduction in word error rate (WER).

keywords:
Automatic Speech Recognition, Accent Codebooks, Self-Supervised Pretraining.
* These authors contributed equally to this work.

1 Introduction

Self-supervised learning (SSL) has been established as a powerful technique to learn representations for speech and language processing [layerwise_analysis]. With pretrained SSL models as a starting point, even small amounts of labeled data are sufficient to achieve commercially acceptable results in various downstream speech tasks [baevski2020wav2vec, zhao2022improving]. This has led to the wide adoption of the current de facto training regime of pretraining a model with an SSL objective, followed by fine-tuning on downstream tasks such as automatic speech recognition (ASR) and spoken language understanding [yang21c_interspeech, tsai-etal-2022-superb]. A key limitation of SSL-based models is that performance on the downstream task suffers if there is a domain shift from the pretraining data [hsu2021robust]. In this work, we focus on speech accents as our domain of interest. The problem statement can be outlined as follows: how can we make SSL-trained models more robust to varying speech accents, both seen and unseen during training?

While prior work has extensively explored improving accented ASR in the fine-tuning stage [asr_residual, jain2018improved, winata2020learning, accentgrapheme, layer_accnt, e2emultitask], limited work has gone into improving the robustness of SSL pretraining to varying speech accents. Prior work on accent adaptation of SSL models includes introducing an additional classifier into SSL pretraining [deng21b_interspeech], employing accent-specific adapters [Bhatia2023], and normalizing pseudo-targets to a single accent [Poncelet2023UnsupervisedAA].

In this work, we adapt the technique introduced in Prabhu et al. [Prabhu2023AccentedSR] for SSL pretraining. We introduce accent information during self-supervised pre-training via a set of accent-specific codebooks. These codebooks are integrated into the model using a cross-attention module. For each accent seen in the training data, we introduce a learnable codebook consisting of vectors that capture accent-specific information during the pre-training process. While [Prabhu2023AccentedSR] used this technique exclusively during ASR fine-tuning, we show that the benefits increase when it is used during the SSL stage. To evaluate the effectiveness of our method, we run ASR experiments on the multi-accented Mozilla Common Voice corpus and report significant word error rate (WER) reductions compared to multiple variants of HuBERT and other accent adaptation techniques, including an accent-agnostic approach (Domain Adversarial Training (DAT) [bobw]) and an accent-aware approach (MultiTask Learning (MTL) [asr_clf]).

While many self-supervised models exist for speech-related tasks [baevski2020wav2vec, chen2022wavlm, chung2021w2v, discretebert, decoar2, vqwav2vec, hubert], in this work we use HuBERT, a state-of-the-art architecture that uses a masked language modeling (MLM) objective and aims to reconstruct a sequence of discrete pseudo-targets derived via a $k$-means clustering step. The clustering step in HuBERT is used to generate pseudo-targets for pretraining. Prior work has aimed to improve the quality of these generated pseudo-targets either through teacher-forcing [hubert_teacher] or by designing task-specific pseudo-targets [ssldeepcluster]. Other efforts have focused on improving the HuBERT architecture itself [fan2023ctcbert, jointssldecoder]. To the best of our knowledge, adapting HuBERT pretraining to be more robust across accents has not been attempted before.

In summary, our main contributions are as follows:

  • We introduce codebook-based accent adaptation for self-supervised pre-training that uses a collection of accent-specific codebooks and cross-attention to improve SSL representations in the presence of accents (code is available at https://github.com/csalt-research/accented-codebooks-asr/tree/accented-pretraining). Our model outperforms baselines and all other previous approaches and achieves a significant improvement in performance with up to 9% relative WER reduction on the MCV dataset.

  • In a zero-shot setting on the L2-Arctic dataset, we show that our codebook-based pretrained model significantly outperforms baselines, thus demonstrating the generalization capability of our proposed system.

2 Methodology

Figure 1: Overview of the two-stage training pipeline used in our proposed approach. In stage 1, we train an encoder-only model with the HuBERT-style pre-training objective. Our architecture incorporates accent-specific codebooks and additional cross-attention layers. The model trained in this stage is used as the encoder in stage 2, where we train an encoder-decoder model with the joint CTC-attention objective for ASR fine-tuning.

Figure 1 illustrates the main workflow of our proposed technique. We adopt the standard two-stage training pipeline used in self-supervised architectures:

  1. A self-supervised pretraining stage that involves training an encoder using a self-supervised learning objective [mohamed2022self].

  2. A supervised ASR fine-tuning stage that uses the pretrained encoder from the previous stage and further fine-tunes it using an ASR-specific supervised loss.

In our proposed framework, both these stages benefit from accent-specific codebooks, as elaborated in Sections 2.1 and 2.2, respectively.

2.1 Self-supervised Pre-training with Codebooks

We use the HuBERT self-supervised architecture [hubert] that consists of three modules: a convolution-based waveform encoder (denoted by $\textsc{Conv}$), a Transformer-based encoder (denoted by $\textsc{Enc}$) and a projection ($\textsc{FFN}_{\text{tok}}$) module. $\textsc{Conv}$ takes raw speech $X=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{L}\mid\mathbf{x}_{i}\in\mathbb{R}\}$ as its input, and maps it via a stack of convolutions to $F=\textsc{Conv}(X)=\{\mathbf{f}_{1},\mathbf{f}_{2},\ldots,\mathbf{f}_{T}\mid\mathbf{f}_{i}\in\mathbb{R}^{d}\}$. $F$ is subsequently masked to generate $\tilde{F}$, where a fixed $P\%$ of its frames are randomly masked. This masked representation $\tilde{F}$ further passes through $\textsc{Enc}$ to generate a contextualized representation $H=\textsc{Enc}(\tilde{F})=\{\mathbf{h}_{1},\mathbf{h}_{2},\ldots,\mathbf{h}_{T}\mid\mathbf{h}_{i}\in\mathbb{R}^{d}\}$. In parallel, an Acoustic Unit Discovery (AUD) module acts directly on $X$ as its input and generates pseudo-target labels $Z=\{\mathbf{z}_{1},\mathbf{z}_{2},\ldots,\mathbf{z}_{T}\mid\mathbf{z}_{i}\in[1,V]\}$ via an offline clustering step that uses a single $k$-means clustering or an ensemble of them. The size of the label set $V$ is determined by the number of clusters used in AUD. A simple projection layer $\textsc{FFN}_{\text{tok}}$ is used to convert $H$ from $\mathbb{R}^{d}$ to the target vocabulary $\mathbb{R}^{V}$, and the HuBERT model is trained end-to-end to predict $Z$ using a weighted cross-entropy loss.
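
For concreteness, the weighted cross-entropy objective can be written out as below. This is a sketch that assumes the weighting follows the original HuBERT formulation [hubert], in which masked and unmasked frames contribute separately. The masked index set $\mathcal{M}$, the weight $\alpha$ and the name $\mathcal{L}_{\text{ssl}}$ are notation introduced here for illustration and do not appear in the text above.

```latex
\begin{align*}
p(v \mid \tilde{F}, t) &= \mathrm{softmax}\big(\textsc{FFN}_{\text{tok}}(\mathbf{h}_{t})\big)_{v}, \qquad v \in [1, V] \\
\mathcal{L}_{m} &= -\sum_{t \in \mathcal{M}} \log p(\mathbf{z}_{t} \mid \tilde{F}, t), \qquad
\mathcal{L}_{u} = -\sum_{t \notin \mathcal{M}} \log p(\mathbf{z}_{t} \mid \tilde{F}, t) \\
\mathcal{L}_{\text{ssl}} &= \alpha\, \mathcal{L}_{m} + (1 - \alpha)\, \mathcal{L}_{u}
\end{align*}
```

With $\alpha = 1$ the loss is computed on masked frames only, which forces the encoder to infer the pseudo-targets of hidden frames from their context.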

Table 1: Comparison of the performance (WER %) of our architecture (Codebook Attention applied at all layers with 50 entries in each learnable codebook) with baseline and other architectures on the Mcv-Accent-100 dataset. Numbers in bold denote the best across baselines, and green-shaded cells denote the best WER across all experiments. Ties are broken using overall WER. The aggregate result for the proposed model is a statistically significant improvement over the best baseline ($p<0.001$, MAPSSWE test [mapsswe]).
\begin{tabular}{ll*{17}{c}}
\hline
 & & \multicolumn{3}{c}{Aggregated} & \multicolumn{5}{c}{Seen Accents} & \multicolumn{9}{c}{Unseen Accents} \\
Method & Size & All & Seen & Unseen & AUS & CAN & UK & SCT & US & AFR & HKG & IND & IRL & MAL & NWZ & PHL & SGP & WLS \\
\hline
Conformer [conformer] & 43M & 18.9 & 14.0 & 23.7 & 13.8 & 15.0 & 15.7 & 13.4 & 13.3 & 21.5 & 27.2 & 29.4 & 21.4 & 32.2 & 19.9 & 26.1 & 34.7 & 17.9 \\
HuBERT [hubert] & 104M & 13.1 & 9.1 & 17.1 & 9.0 & 10.3 & 9.4 & 7.8 & 8.7 & 15.6 & 20.0 & 18.7 & 16.3 & 23.5 & 14.0 & 20.3 & 24.5 & 9.8 \\
+ Pretrain ckpt & 104M & 9.7 & 6.3 & 13.1 & 5.3 & 7.7 & 6.2 & 4.7 & 6.1 & 12.4 & 15.5 & 13.8 & 12.5 & 18.8 & 10.0 & 15.2 & 20.0 & 7.8 \\
+ Frozen layers & 74M & 9.3 & 6.0 & 12.5 & 4.9 & 7.8 & 5.5 & 4.7 & 5.9 & 11.6 & 15.8 & 12.2 & 11.8 & 18.0 & 9.6 & 15.0 & 19.8 & 7.3 \\
MTL [asr_clf] & 74M & 9.4 & 6.0 & 12.8 & 5.0 & 7.6 & 5.6 & 4.7 & 5.9 & 11.9 & 15.5 & 13.4 & 12.2 & 17.6 & 10.0 & 15.4 & 19.1 & 8.3 \\
DAT [bobw] & 74M & 9.3 & 6.0 & 12.5 & 5.1 & 7.5 & 5.9 & \cellcolor{green!20} 4.4 & 5.9 & 11.6 & 15.2 & 12.3 & 12.3 & 17.1 & 9.6 & 15.4 & 19.4 & 7.9 \\
Proposed & 76M & \cellcolor{green!20} 8.9 & \cellcolor{green!20} 5.9 & \cellcolor{green!20} 11.9 & \cellcolor{green!20} 3.7 & \cellcolor{green!20} 7.5 & \cellcolor{green!20} 5.5 & 4.7 & \cellcolor{green!20} 5.9 & \cellcolor{green!20} 10.8 & \cellcolor{green!20} 14.7 & \cellcolor{green!20} 11.6 & \cellcolor{green!20} 11.4 & \cellcolor{green!20} 16.7 & \cellcolor{green!20} 9.2 & \cellcolor{green!20} 14.9 & \cellcolor{green!20} 18.3 & \cellcolor{green!20} 7.3 \\
\hline
\end{tabular}

Our accent-based modifications are restricted to the Transformer-based encoder $\textsc{Enc}$. The $\textsc{Enc}$ module consists of a stack of $N$ identical encoder layers. Each layer attends to the output of the previous layer and contextualizes it using self-attention blocks and projection layers. The self-attention block is responsible for introducing global context, while the projection layer refines the point-wise information. In addition, we introduce a cross-attention block that uses an attention mechanism to incorporate information from accent-specific codebooks into the representations. We first discuss how the codebooks are generated, and then explain how they are used in a single encoder layer.

Codebook Setup. Let $E$ be the number of accents seen during training. We define a set of accent-specific codebooks $C=\{C^{1},C^{2},\ldots,C^{E}\mid C^{i}\in\mathbb{R}^{M\times d}\}$. Each $C^{i}$ comprises $M$ learnable codebook entries. That is, each codebook $C^{i}$ is:

\begin{equation*}
C^{i}=\{\mathbf{c}^{i}_{1},\ldots,\mathbf{c}^{i}_{M}\mid\mathbf{c}^{i}_{j}\in\mathbb{R}^{d}\}=\texttt{Embedding}([1,2,\ldots,M])
\end{equation*}

where $\texttt{Embedding}$ is a standard embedding layer. For a training speech input $X$, whose underlying accent ID (indexing the $E$ seen accents) is $a\in\{1,\ldots,E\}$, we deterministically select the codebook $C^{a}$. The codebook entries in $C^{a}$ are subsequently integrated with all the encoder layers and across all attention heads. We note here that the codebooks are accent-specific on account of a single codebook being deterministically chosen per utterance during training.
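
The codebook setup and the deterministic per-utterance selection can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function names `init_codebooks` and `select_codebook`, the normal initialization and its scale are assumptions made here, while the sizes (five seen accents, 50 entries, 768 dimensions) are taken from the experimental setup in this paper.

```python
import numpy as np

def init_codebooks(num_accents, entries, dim, seed=0):
    """Return an array of shape (E, M, d): one M x d codebook per seen accent."""
    rng = np.random.default_rng(seed)
    # Assumption: small random normal initialization, as for a typical embedding table.
    return rng.normal(scale=0.02, size=(num_accents, entries, dim))

def select_codebook(codebooks, accent_id):
    """Deterministically pick C^a for a 1-indexed accent ID a in {1, ..., E}."""
    num_accents = codebooks.shape[0]
    if not 1 <= accent_id <= num_accents:
        raise ValueError(f"accent_id must be in [1, {num_accents}], got {accent_id}")
    return codebooks[accent_id - 1]

# E = 5 seen accents, M = 50 entries per codebook, d = 768.
codebooks = init_codebooks(num_accents=5, entries=50, dim=768)
c_a = select_codebook(codebooks, accent_id=3)
print(c_a.shape)  # (50, 768)
```

In a trainable model the array would be a parameter updated by backpropagation; only the codebook of the utterance's accent receives gradients for that utterance.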

Integrating Codebooks using Cross-attention. Each Transformer-based encoder layer of the HuBERT architecture consists of a self-attention block that introduces global context, followed by a position-wise feed-forward block that refines point-wise information. These two blocks are separated by layer normalization and coupled via residual connections. In our method, we introduce a cross-attention module that is positioned between the self-attention and feed-forward blocks and integrates information from the codebooks into the audio representations generated after self-attention. More precisely, for the $i^{\text{th}}$ encoder layer, let $\mathbf{A}$ be the input to the cross-attention block and $\hat{\mathbf{A}}$ be the output. Then, the attention probabilities across the $M$ codebook entries in $C^{a}$ for the $j^{\text{th}}$ position $\mathbf{A}_{j}$ are computed as:

\begin{equation*}
\{\beta^{j}_{1},\beta^{j}_{2},\ldots,\beta^{j}_{M}\}=\mathrm{softmax}\left(\frac{(\mathbf{A}_{j}W^{i}_{Q})(C^{a}W^{i}_{K})^{T}}{\sqrt{d}}\right)
\end{equation*}

where $W^{i}_{Q}, W^{i}_{K}\in\mathbb{R}^{d\times d}$ are learned projection matrices of the cross-attention block of the $i^{\text{th}}$ encoder layer and $\beta^{j}_{k}\in[0,1]$ is the attention weight assigned by the $j^{\text{th}}$ representation to the $k^{\text{th}}$ entry in codebook $C^{a}$. Finally, the $j^{\text{th}}$ frame in the output $\hat{\mathbf{A}}$ is a weighted average of the entries in codebook $C^{a}$:

\begin{equation*}
\hat{\mathbf{A}}_{j}=\sum_{k=1}^{M}\beta_{k}^{j}\cdot(C^{a}_{k}W^{i}_{V})
\end{equation*}

where $\hat{\mathbf{A}}_{j}, C^{a}_{k}\in\mathbb{R}^{d}$ and $W^{i}_{V}\in\mathbb{R}^{d\times d}$. The output of the cross-attention block is subjected to layer normalization, residual connections are added, and the result is fed as input to the feed-forward block.
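
The two equations above can be sketched for a single attention head as follows. This is an illustrative NumPy version under stated assumptions: it uses one head (the paper integrates the codebooks across all heads), a layer normalization without learnable gain and bias, and one possible ordering of the residual connection and normalization. The function names and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified: no learnable gain or bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def codebook_cross_attention(A, C_a, W_q, W_k, W_v):
    """A: (T, d) frames after self-attention; C_a: (M, d) selected codebook."""
    d = A.shape[-1]
    scores = (A @ W_q) @ (C_a @ W_k).T / np.sqrt(d)  # (T, M) scaled dot products
    beta = softmax(scores, axis=-1)                   # attention over the M entries
    A_hat = beta @ (C_a @ W_v)                        # (T, d) weighted average of projected entries
    return A_hat, beta

rng = np.random.default_rng(0)
T, M, d = 4, 50, 16  # toy sizes; the paper uses M = 50 and d = 768
A = rng.normal(size=(T, d))
C_a = rng.normal(size=(M, d))
W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

A_hat, beta = codebook_cross_attention(A, C_a, W_q, W_k, W_v)
out = layer_norm(A + A_hat)  # assumption: residual added before normalization
print(A_hat.shape, beta.shape, out.shape)
```

Because the keys and values come only from the $M$ codebook entries, the cost of this block grows with $T \times M$ rather than $T \times T$, so it adds little overhead next to self-attention.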

2.2 Supervised ASR Fine-tuning

For the second stage of ASR fine-tuning, we adopt the state-of-the-art hybrid CTC-attention end-to-end ASR framework [jointctc] that consists of three modules: an encoder ($\textsc{Enc}_{f}$), a decoder ($\textsc{Dec}$) and a Connectionist Temporal Classification (CTC) [ctc] module. We replace the encoder $\textsc{Enc}_{f}$ with the $\textsc{Conv}$ and $\textsc{Enc}$ modules from Section 2.1, pretrained in conjunction with accent codebooks. ASR fine-tuning makes use of labeled speech instances $(X,Y)$ where $X$ is a raw speech sequence $X=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{L}\mid\mathbf{x}_{i}\in\mathbb{R}\}$ and $Y$ is a token sequence $\{y_{1},\ldots,y_{M}\}$. Both the encoder parameters and the accent-specific codebooks in $\textsc{Enc}_{f}$, initialized from the pretrained model, are further fine-tuned using a supervised ASR objective. The $\textsc{Dec}$ module uses a cross-entropy loss ($\mathcal{L}_{\text{att}}$) to autoregressively predict a token $y_{t}$ given previous tokens $\{y_{1},\ldots,y_{t-1}\}$ and the encoder outputs. The CTC module, on the other hand, imposes a CTC loss ($\mathcal{L}_{\text{ctc}}$) directly on the encoder outputs and predicts a frame-aligned sequence of tokens by marginalizing over all possible alignments of $Y$ to the encoder outputs. The final loss is the weighted sum of both losses:

\begin{equation}
\mathcal{L}_{\text{asr}}=\eta\times\mathcal{L}_{\text{ctc}}+(1-\eta)\times\mathcal{L}_{\text{att}} \tag{1}
\end{equation}

where $\eta$ is a hyperparameter that balances the two objectives.
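
As a small worked example of Eq. (1), the sketch below combines two scalar loss values. The function name and the loss values are made up for illustration; the default weight of 0.3 matches the CTC weight reported in the fine-tuning setup.

```python
def joint_ctc_attention_loss(loss_ctc, loss_att, eta=0.3):
    """Weighted sum of the CTC and attention losses, as in Eq. (1)."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return eta * loss_ctc + (1.0 - eta) * loss_att

# Made-up loss values, for illustration only.
loss = joint_ctc_attention_loss(loss_ctc=2.0, loss_att=1.0, eta=0.3)
print(round(loss, 2))  # 1.3
```

Setting the weight to 0 recovers a pure attention-based objective and setting it to 1 recovers a pure CTC objective.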

2.3 Inference using Codebooks

During inference, we do not assume access to an accent label for a test utterance. Instead, we employ the joint beam search proposed by Prabhu et al. [Prabhu2023AccentedSR], which performs beam search jointly over all the seen accents. Scores for each seen accent are computed using the corresponding accent-specific codebook, and the beam holds the best expansions across all seen accents. More details of the joint beam search algorithm are in [Prabhu2023AccentedSR].
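
One step of a joint beam search of this kind could look roughly like the sketch below. It is a simplified illustration based only on the description above, not the algorithm of [Prabhu2023AccentedSR]: the function `joint_beam_step`, the scoring callback `score_fn`, and the choice to keep the highest score when several accents propose the same expansion are all assumptions.

```python
import heapq
import math

def joint_beam_step(beam, accents, score_fn, beam_size):
    """Expand every hypothesis under every seen accent and keep the best overall.

    beam: list of (tokens, log_prob) pairs, where tokens is a tuple.
    score_fn(tokens, accent) -> dict mapping each next token to a log-probability;
    a hypothetical stand-in for a decoder pass that uses that accent's codebook.
    """
    best = {}
    for tokens, log_prob in beam:
        for accent in accents:
            for token, token_lp in score_fn(tokens, accent).items():
                candidate = tokens + (token,)
                total = log_prob + token_lp
                # Assumption: keep the highest score when accents agree on an expansion.
                if total > best.get(candidate, -math.inf):
                    best[candidate] = total
    return heapq.nlargest(beam_size, best.items(), key=lambda item: item[1])

def toy_score_fn(tokens, accent):
    # Toy next-token distributions for two accents, independent of the prefix.
    table = {
        "US": {"a": math.log(0.6), "b": math.log(0.4)},
        "UK": {"a": math.log(0.2), "b": math.log(0.8)},
    }
    return table[accent]

beam = joint_beam_step([((), 0.0)], ["US", "UK"], toy_score_fn, beam_size=2)
print(beam)
```

In this toy run the best hypothesis is `("b",)`, scored with the "UK" distribution, followed by `("a",)`, scored with the "US" distribution, so the beam mixes expansions from different accents.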

3 Experimental Setup

We conduct all our pre-training and fine-tuning experiments using the Fairseq [fairseq] and ESPnet [espnet] toolkits, respectively, on NVIDIA RTX A6000 GPUs.

Dataset details. We use the Mozilla Common Voice [mcv] accented English Mcv-Accent benchmarking dataset (https://tinyurl.com/accent-dataset) in all our experiments. This dataset consists of five seen accents (AUS, CAN, SCT, UK, US) and nine unseen accents (AFR, HKG, IND, IRL, MAL, NWZ, PHL, SGP, WLS). The dataset includes the Mcv-Accent-100 and Mcv-Accent-600 training splits containing 100 hours and 620 hours of audio data, respectively, along with 17 hours of validation and test splits. For pretraining, we only use the larger Mcv-Accent-600 training split; for ASR, we report results on both training splits.

Pre-training setup. In all our experiments, we use the HuBERT-Base architecture [hubert]. This architecture consists of a 7-layer convolutional feature extractor and a 12-layer Transformer encoder with 12 attention heads ($d=768$). The model is trained for two iterations, each consisting of 200k steps. In both iterations, we use 500 hidden units obtained from the output of the $6^{\text{th}}$ encoder layer via $k$-means clustering as the pseudo-targets. In each iteration, the model is trained using the Adam optimizer [adam] with a learning rate of 5e-5, 32k warmup steps, a batch size of 87.5 seconds and gradient accumulation over four steps.
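
To make the batch size concrete, the arithmetic below converts it into raw audio samples and into audio seen per optimizer update. This is a sketch under two assumptions that are not stated in the text: a 16 kHz sample rate, and the batch size being a per-device quantity.

```python
SAMPLE_RATE_HZ = 16_000   # assumption: not stated in the text
BATCH_SECONDS = 87.5      # from the pre-training setup
GRAD_ACCUM_STEPS = 4      # from the pre-training setup

# Raw waveform samples in one batch.
samples_per_batch = int(BATCH_SECONDS * SAMPLE_RATE_HZ)
# Seconds of audio contributing to one optimizer update (per device).
seconds_per_update = BATCH_SECONDS * GRAD_ACCUM_STEPS

print(samples_per_batch)   # 1400000
print(seconds_per_update)  # 350.0
```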

Fine-tuning setup. We use the joint CTC-attention based encoder-decoder architecture [jointctc], but replace the standard encoder with the model from the self-supervised pre-training step (Section 2.1). For experiments on the Mcv-Accent-100 training split, we add 3-way speed perturbation prior to training. Throughout all our experiments, we train the model for 50 epochs with a batch size of 128, a dropout rate of 0.1, a learning rate of 1.0 and 25k warmup steps. The loss is calculated with a CTC weight of 0.3 and label smoothing of 0.1.
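
A learning rate of 1.0 paired with a warmup period is typical of the inverse-square-root warmup schedule popularized by the original Transformer work, where the nominal rate acts as a scale factor. The text does not name the scheduler, so the sketch below is an assumption rather than the authors' configuration; the function name `noam_lr` and the use of `model_dim=768` are also illustrative choices.

```python
def noam_lr(step, base_lr=1.0, warmup_steps=25_000, model_dim=768):
    """Inverse-square-root schedule with linear warmup.

    The rate rises linearly for warmup_steps updates, then decays as step ** -0.5.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * model_dim ** -0.5 * scale

peak = noam_lr(25_000)  # the schedule peaks at the end of warmup
for step in (1_000, 25_000, 100_000):
    print(step, f"{noam_lr(step):.2e}")
```

Under these assumptions the effective learning rate never approaches 1.0; it peaks at the end of warmup at a value several orders of magnitude smaller.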

4 Experimental Results and Analysis

Table 1 compares the performance (WER %) of our proposed system against four alternative approaches: (1) a Conformer baseline [conformer]; (2) the Conformer with its encoder replaced by a HuBERT model [hubert] pretrained with the SSL objective on Mcv-Accent-600; (3) the HuBERT baseline jointly trained with an accent classifier (MTL) [asr_clf]; and (4) Domain Adversarial Training (DAT) of the HuBERT baseline with an accent classifier on the $12^{\text{th}}$ encoder layer.

Table 2: Comparison of performance (WER %) when the codebooks of particular seen accents are withheld during inference. The first row uses all codebooks.
\begin{tabular}{l*{14}{c}}
\hline
Accent used & \multicolumn{5}{c}{Seen Accents} & \multicolumn{9}{c}{Unseen Accents} \\
 & AUS & CAN & UK & SCT & US & AFR & HKG & IND & IRL & MAL & NWZ & PHL & SGP & WLS \\
\hline
Codebook Attention & \cellcolor{gray!10} 3.7 & \cellcolor{gray!10} 7.5 & \cellcolor{gray!10} 5.5 & \cellcolor{gray!10} 4.7 & \cellcolor{gray!10} 5.9 & \cellcolor{gray!10} 10.8 & \cellcolor{gray!10} 14.7 & \cellcolor{gray!10} 11.6 & \cellcolor{gray!10} 11.4 & \cellcolor{gray!10} 16.7 & \cellcolor{gray!10} 9.2 & \cellcolor{gray!10} 14.9 & \cellcolor{gray!10} 18.3 & \cellcolor{gray!10} 7.3 \\
Australia & \cellcolor{red!20} 4.9 & 7.4 & 5.6 & 4.9 & 6.0 & 10.9 & 14.9 & 11.4 & 11.3 & 17.2 & \cellcolor{red!20} 9.9 & 14.8 & 18.2 & 7.6 \\
Canada & 3.7 & 7.6 & 5.5 & 4.8 & 6.1 & 10.7 & 14.8 & 11.6 & 11.5 & 16.6 & 9.1 & 14.8 & 18.4 & 7.0 \\
England & 4.1 & 7.4 & \cellcolor{red!20} 5.7 & 4.7 & 6.1 & \cellcolor{red!20} 11.0 & 14.6 & 11.7 & 11.5 & 16.6 & 9.3 & \cellcolor{red!20} 15.0 & 18.6 & \cellcolor{red!20} 7.7 \\
Scotland & 3.6 & 7.4 & 5.5 & \cellcolor{red!20} 5.2 & 5.8 & 10.7 & 14.3 & 11.3 & 11.5 & 16.5 & 9.0 & 14.7 & 18.6 & 7.1 \\
US & 3.8 & 7.7 & 5.6 & 4.4 & 6.1 & 10.8 & 14.8 & 11.9 & 11.5 & 17.2 & 9.2 & 14.9 & 18.5 & 7.3 \\
US + Canada & 3.6 & \cellcolor{red!20} 8.0 & 5.6 & 4.8 & \cellcolor{red!20} 6.4 & 10.7 & \cellcolor{red!20} 15.4 & \cellcolor{red!20} 12.3 & \cellcolor{red!20} 11.9 & \cellcolor{red!20} 17.3 & 9.1 & 14.9 & \cellcolor{red!20} 18.9 & 7.0 \\
\hline
\end{tabular}

Similar to previous studies [baevski2020wav2vec, hubert], we find that replacing Conformer's encoder with a pre-trained HuBERT model leads to significant improvement in performance; we note here that the HuBERT-based encoder is pretrained from scratch on Mcv-Accent-600. Furthermore, we observe that, during self-supervised pre-training, using an existing checkpoint (i.e., the HuBERT-Base Librispeech checkpoint from Fairseq [fairseq]) leads to additional performance gains. Given the limited amount of supervised fine-tuning data in Mcv-Accent-100, to combat overfitting, we freeze the feature extractor and the first 3 layers of the HuBERT encoder; this yields further benefits. (In all subsequent experiments, we will use this best setup of selective ASR fine-tuning.) We also compare our system against the MTL and DAT approaches. Overall, our proposed approach performs significantly better on nearly all accents, particularly the unseen accents.
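
The relative improvements implied by the aggregated columns of Table 1 can be computed as below. The helper name is hypothetical; the WER pairs are the best baseline and the proposed system on Mcv-Accent-100, read directly from Table 1.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent; positive means the system is better."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# (best baseline, proposed) WER pairs from the aggregated columns of Table 1.
table1 = {"All": (9.3, 8.9), "Seen": (6.0, 5.9), "Unseen": (12.5, 11.9)}
reductions = {
    split: relative_wer_reduction(base, ours) for split, (base, ours) in table1.items()
}
for split, value in reductions.items():
    print(f"{split}: {value:.1f}% relative reduction")
```

On these aggregated numbers the gain on unseen accents is larger than on seen accents, in line with the observation above.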

Table 3: Comparing zero-shot performance (WER %) of our architecture with other approaches on the L2-Arctic dataset. $\dagger$ indicates statistical significance ($p<0.001$ using the MAPSSWE test) w.r.t. the HuBERT baseline.
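\begin{tabular}{lccccccc}
\hline
Method & All & \multicolumn{6}{c}{Accents} \\
 & & ARA & HIN & KOR & MAN & SPA & VIA \\
\hline
HuBERT Encoder [hubert] & 22.6 & 20.2 & 17.8 & 17.3 & 25.8 & 20.4 & 33.7 \\
MTL [asr_clf] & 23.0 & 21.0 & 18.1 & 17.6 & 26.4 & 20.9 & 34.1 \\
DAT [bobw] & 22.9 & 20.7 & 18.2 & 17.4 & 26.2 & 20.9 & 34.1 \\
Codebook Attention & 21.7$^{\dagger}$ & 19.9 & 16.5 & 16.4 & 24.8 & 19.8 & 32.7 \\
\hline
\end{tabular}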

Importance of Codebooks. Table 2 shows the WERs of our best codebook-based system when the codebook of a seen accent is withheld during inference with joint beam search; the first row shows the WERs when all codebooks are present. The goal is to check whether the absence of a seen accent's codebook during decoding degrades performance. On removing the Australian codebook, the WER on Australian test samples worsens from 3.7% to 4.9%. Interestingly, the WER of the unseen New Zealand accent suffers the most when the Australian codebook is withheld ($9.2\rightarrow 9.9$). This suggests that relatedness between accents is being captured in the codebooks. Also, on removing both the US and Canada codebooks (two closely related accents), we see the largest degradation in WER on US and Canada test samples.
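
The effect of withholding each seen accent's own codebook can be read off Table 2 with a few lines of arithmetic. The values below are copied from Table 2 (first row versus the diagonal of the single-accent rows); the variable names are illustrative.

```python
# WER (%) on each seen accent from Table 2: all codebooks present vs. the
# accent's own codebook withheld during joint beam search.
all_present = {"AUS": 3.7, "CAN": 7.5, "UK": 5.5, "SCT": 4.7, "US": 5.9}
own_withheld = {"AUS": 4.9, "CAN": 7.6, "UK": 5.7, "SCT": 5.2, "US": 6.1}

# Absolute WER increase per accent, rounded to the precision of the table.
degradation = {
    accent: round(own_withheld[accent] - all_present[accent], 1)
    for accent in all_present
}
worst = max(degradation, key=degradation.get)
print(degradation)
print(worst)
```

Every seen accent degrades when its own codebook is removed, with the Australian accent affected the most (an absolute increase of 1.2) and the Canadian accent the least (0.1).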

Zero-shot ASR Evaluation. In Table 3, we compare the WERs of our proposed system with the baselines when evaluated in a zero-shot setting on out-of-domain accented samples from the L2-Arctic [l2arctic] dataset. The dataset consists of six non-native English accents, namely Arabic (ARA), Hindi (HIN), Korean (KOR), Mandarin (MAN), Spanish (SPA), and Vietnamese (VIA). Our system significantly outperforms the baselines (at $p < 0.001$) across all accents on the L2-Arctic dataset. This attests to the robustness of the speech representations learned by our system.
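The significance marker in Table 3 relies on the MAPSSWE (matched-pairs sentence-segment word error) test. The sketch below is only a rough illustration of the matched-pairs idea, under a simplifying assumption: each utterance is treated as a single segment, whereas the full test derives segments from regions where both systems agree with the reference. The function name `matched_pairs_test` and the inputs `errors_a` and `errors_b` (per-segment word-error counts of two systems on the same segments) are hypothetical.

```python
import math


def matched_pairs_test(errors_a, errors_b):
    """Two-sided matched-pairs z-test on per-segment word-error counts.

    Returns (z, p), where z > 0 means system A makes more errors than system B.
    Assumes one segment per utterance (a simplification of MAPSSWE).
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same segments")
    n = len(errors_a)
    if n < 2:
        raise ValueError("need at least two segments")

    # Per-segment difference in the number of word errors.
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased variance

    if var == 0.0:
        # All differences are identical: no evidence if they are all zero,
        # otherwise the difference is perfectly consistent.
        if mean == 0.0:
            return 0.0, 1.0
        return math.copysign(math.inf, mean), 0.0

    z = mean / math.sqrt(var / n)
    # Two-sided p-value under a standard normal approximation.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

With a large number of segments the normal approximation is reasonable; a result would be marked significant at the level used in Table 3 when `p < 0.001`.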

Ablation Analysis. In Table 4, we investigate various settings related to the codebooks. We apply codebooks to different encoder layers and obtain the best performance when applying codebooks on the $6^{\text{th}}$ layer for seen accents, and on all 12 layers for unseen accents. Among the codebook sizes, $50$ codebook entries performed the best overall; 500 codebook entries improved the seen accents further but hurt the unseen accents, thus indicating overfitting. Lastly, we examined different training protocols for the codebooks. Randomly-initialized (and not learnable) codebooks ($C_{\text{random}}$) do nearly as well as codebooks trained during both SSL and ASR fine-tuning (denoted by $C_{\text{both}}$). We also evaluate codebooks that are learned during pretraining and frozen during ASR fine-tuning ($C_{\text{frozen}}$). This performs nearly as well as $C_{\text{both}}$ on seen accents but slightly underperforms on unseen accents. This suggests that the codebooks are meaningfully learned during the pretraining stage.

Table 4: Comparison of the performance (WER \%) of different variants of our architecture. $\textsc{C}_{L\in(i,\ldots,j)}(P=k)$: Codebook attention applied at all layers from $i$ to $j$ with $k$ entries per accent codebook.
\begin{tabular}{llccc}
\hline
\multicolumn{2}{c}{Setup} & Overall & Seen & Unseen \\
Layers & \# codebooks & & & \\
\hline
\multicolumn{5}{l}{Varying codebook influence} \\
$L=\{6\}$ & $C=50$ & 9.01 & 5.89 & 12.12 \\
$L=\{1,\ldots,6\}$ & $C=50$ & 9.03 & 5.91 & 12.13 \\
$L=\{1,\ldots,12\}$ & $C=50$ & 8.95 & 5.94 & 11.91 \\
\hline
\multicolumn{5}{l}{Varying codebook size} \\
$L=\{1,\ldots,12\}$ & $C=50$ & 8.95 & 5.94 & 11.91 \\
$L=\{1,\ldots,12\}$ & $C=200$ & 9.01 & 6.00 & 12.10 \\
$L=\{1,\ldots,12\}$ & $C=500$ & 9.03 & 5.87 & 12.18 \\
\hline
\multicolumn{5}{l}{Varying codebook nature} \\
$L=\{6\}$ & $C_{\text{both}}=50$ & 9.01 & 5.89 & 12.12 \\
$L=\{6\}$ & $C_{\text{frozen}}=50$ & 9.07 & 5.91 & 12.23 \\
$L=\{6\}$ & $C_{\text{random}}=50$ & 8.98 & 5.93 & 12.02 \\
\hline
\end{tabular}

ASR Fine-tuning with Larger Dataset. We fine-tune the pretrained HuBERT baseline, MTL, DAT, and our best system on Mcv-Accent-600. As seen in Table 5, our system performs comparably to the other baselines on the seen accents but significantly improves on unseen accents.

Table 5: Comparison of the WERs (\%) of our approach with other baselines on the Mcv-Accent-600 dataset.
\begin{tabular}{lccc}
\hline
Method & Overall & Seen & Unseen \\
\hline
HuBERT \cite{hubert} & 6.68 & 3.87 & 9.49 \\
MTL \cite{asr_clf} & 6.57 & 3.76 & 9.37 \\
DAT \cite{bobw} & 6.57 & 3.83 & 9.30 \\
Codebook Attention & 6.43 & 3.80 & 9.19 \\
\hline
\end{tabular}
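To put the Table 5 numbers in relative terms, the following small sketch computes the relative WER reduction of Codebook Attention over the HuBERT baseline. The helper name `relative_wer_reduction` is hypothetical; the WER values are copied from Table 5, and the relative figures are derived here rather than reported in the paper.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction (in %) of a system over a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# (baseline WER, system WER) in %, copied from Table 5:
# HuBERT baseline vs. Codebook Attention.
table5 = {
    "Overall": (6.68, 6.43),
    "Seen": (3.87, 3.80),
    "Unseen": (9.49, 9.19),
}

reductions = {
    split: relative_wer_reduction(base, ours)
    for split, (base, ours) in table5.items()
}

for split, value in reductions.items():
    print(f"{split}: {value:.2f}% relative WER reduction")
```

This works out to roughly 3.7% overall, 1.8% on seen accents, and 3.2% on unseen accents, consistent with the gain being concentrated on the unseen accents.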

5 Conclusion

In this work, we propose an accent-aware ASR adaptation technique where accent-specific codebooks are incorporated within the Transformer layers of a HuBERT model via cross-attention. This integration happens right from the SSL-based pretraining stage. The pretrained codebooks and encoder layers are further refined using supervised ASR fine-tuning. Compared to existing accent adaptation techniques, we observe that this yields significant WER reductions on English utterances in both seen and unseen accents in the Mozilla Common Voice (MCV) corpus. The accent-aware models trained on MCV also generalize well to out-of-domain accented English samples (from a different corpus, L2-Arctic) when evaluated in a zero-shot setting. In future work, we aim to use self-training with unlabeled (i.e., untranscribed) data that has accent labels to further refine the accent codebooks.

6 Acknowledgements

We acknowledge the financial support from a SERB Core Research Grant, Department of Science and Technology, Government of India on accented speech processing.

7 References

\printbibliography