Darshan Prabhu*¹, Abhishek Gupta*¹, Omkar Nitsure¹, Preethi Jyothi¹, Sriram Ganapathy²

Improving Self-supervised Pre-training using Accent-Specific Codebooks

Abstract

Speech accents present a serious challenge to the performance of state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems. Even with self-supervised learning and pre-training of ASR models, accent invariance is seldom achieved. In this work, we propose an accent-aware adaptation technique for self-supervised learning that introduces a trainable set of accent-specific codebooks to the self-supervised architecture. These learnable codebooks enable the model to capture accent-specific information during pre-training, which is further refined during ASR fine-tuning. On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches on both seen and unseen English accents, with up to $9\%$ relative reduction in word error rate (WER).

keywords:

Automatic Speech Recognition, Accent Codebooks, Self-Supervised Pretraining.

* These authors contributed equally to this work.

1 Introduction

Self-supervised learning (SSL) has been established as a powerful technique to learn representations for speech and language processing [layerwise_analysis]. With pretrained SSL models as a starting point, even small amounts of labeled data are sufficient to achieve commercially acceptable results in various downstream speech tasks [baevski2020wav2vec, zhao2022improving]. This has led to the wide adoption of the current de facto training regime of pretraining a model with an SSL objective, followed by fine-tuning on downstream tasks such as automatic speech recognition (ASR) and spoken language understanding [yang21c_interspeech, tsai-etal-2022-superb]. A key limitation of SSL-based models is that performance on the downstream task suffers if there is a domain shift from the pretraining data [hsu2021robust]. In this work, we focus on speech accents as our domain of interest. The problem can be stated as follows: how can we make SSL-trained models more robust to varying speech accents, both seen and unseen during training?

While prior work has extensively explored improving accented ASR in the fine-tuning stage [asr_residual, jain2018improved, winata2020learning, accentgrapheme, layer_accnt, e2emultitask], limited work has gone into improving the robustness of SSL pretraining to varying speech accents. Prior work on accent adaptation of SSL models includes introducing an additional classifier into SSL pretraining [deng21b_interspeech], employing accent-specific adapters [Bhatia2023], and normalizing pseudo-targets to a single accent [Poncelet2023UnsupervisedAA].

In this work, we adapt the technique introduced in Prabhu et al. [Prabhu2023AccentedSR] for SSL pretraining. We introduce accent information during self-supervised pre-training via a set of accent-specific codebooks. These codebooks are integrated into the model using a cross-attention module. For each accent seen in the training data, we introduce a learnable codebook consisting of vectors that capture accent-specific information during the pre-training process. While [Prabhu2023AccentedSR] used this technique exclusively during ASR finetuning, we show that the benefits increase when used during the SSL stage. To evaluate the effectiveness of our method, we show ASR experiments on the multi-accented Mozilla Common Voice corpus and show significant word error rate (WER) reductions compared to multiple variants of HuBERT and other accent adaptation techniques including an accent-agnostic approach (Domain Adversarial Training (DAT) [bobw]) and an accent-aware approach (MultiTask Learning (MTL) [asr_clf]).

While many self-supervised models exist for speech-related tasks [baevski2020wav2vec, chen2022wavlm, chung2021w2v, discretebert, decoar2, vqwav2vec, hubert], in this work we use HuBERT, a state-of-the-art architecture that uses a masked language modeling (MLM) objective and aims to reconstruct a sequence of discrete pseudo-targets derived via a $k$-means clustering step. Prior work has aimed to improve the quality of these generated pseudo-targets either through teacher-forcing [hubert_teacher] or by designing task-specific pseudo-targets [ssldeepcluster]. Other efforts have focused on improving the HuBERT architecture itself [fan2023ctcbert, jointssldecoder]. To the best of our knowledge, adapting HuBERT pretraining to be more robust across accents has not been attempted before.

In summary, our main contributions are as follows:

• We introduce codebook-based accent adaptation for self-supervised pre-training, which uses a collection of accent-specific codebooks and cross-attention to improve SSL representations in the presence of accents. (Code is available at https://github.com/csalt-research/accented-codebooks-asr/tree/accented-pretraining.) Our model outperforms baselines and all previous approaches, with up to 9% relative WER reduction on the MCV dataset.

• In a zero-shot setting on the L2-Arctic dataset, we show that our codebook-based pretrained model significantly outperforms baselines, thus demonstrating the generalization capability of our proposed system.

2 Methodology

Figure 1: Overview of the two-stage training pipeline used in our proposed approach. In stage 1, we train an encoder-only model with the HuBERT-style pre-training objective. Our architecture incorporates accent-specific codebooks and additional cross-attention layers. The model trained in this stage is used as the encoder in stage 2, where we train an encoder-decoder model with a joint CTC-attention objective for ASR fine-tuning.

Figure 1 illustrates the main workflow of our proposed technique. We adopt the standard two-stage training pipeline used in self-supervised architectures:

1. A self-supervised pretraining stage that involves training an encoder using a self-supervised learning objective [mohamed2022self].

2. A supervised ASR fine-tuning stage that uses the pretrained encoder from the previous stage and further fine-tunes it using an ASR-specific supervised loss.

In our proposed framework, both these stages benefit from accent-specific codebooks, as elaborated in Sections 2.1 and 2.2, respectively.

2.1 Self-supervised Pre-training with Codebooks

We use the HuBERT self-supervised architecture [hubert], which consists of three modules: a convolution-based waveform encoder (denoted by Conv), a Transformer-based encoder (denoted by Enc) and a projection module ($\textsc{FFN}_{\text{tok}}$). Conv takes raw speech $X=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{L}\mid\mathbf{x}_{i}\in\mathbb{R}\}$ as its input and maps it via a stack of convolutions to $F=\textsc{Conv}(X)=\{\mathbf{f}_{1},\mathbf{f}_{2},\ldots,\mathbf{f}_{T}\mid\mathbf{f}_{i}\in\mathbb{R}^{d}\}$. $F$ is subsequently masked to generate $\tilde{F}$, where a fixed $P\%$ of its frames are randomly masked. This masked representation $\tilde{F}$ then passes through Enc to generate a contextualized representation $H=\textsc{Enc}(\tilde{F})=\{\mathbf{h}_{1},\mathbf{h}_{2},\ldots,\mathbf{h}_{T}\mid\mathbf{h}_{i}\in\mathbb{R}^{d}\}$. In parallel, an Acoustic Unit Discovery (AUD) module acts directly on $X$ and generates pseudo-target labels $Z=\{\mathbf{z}_{1},\mathbf{z}_{2},\ldots,\mathbf{z}_{T}\mid\mathbf{z}_{i}\in[1,V]\}$ via an offline clustering step that uses a single $k$-means clustering or an ensemble of them. The size of the label set $V$ is determined by the number of clusters used in AUD. A simple projection layer $\textsc{FFN}_{\text{tok}}$ is used to convert $H$ from $\mathbb{R}^{d}$ to the target vocabulary $\mathbb{R}^{V}$, and the HuBERT model is trained end-to-end to predict $Z$ using a weighted cross-entropy loss.
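The weighted cross-entropy over masked and unmasked frames can be sketched as follows. This is a minimal pure-Python illustration and not the Fairseq implementation: the function names, the weight `alpha`, the per-group averaging, and the 0-based labels are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of V scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def mean(values):
    """Average of a list, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def masked_prediction_loss(logits, targets, mask, alpha=1.0):
    """Weighted cross-entropy between frame predictions and pseudo-targets.

    logits  : T lists of V scores (FFN_tok applied to H)
    targets : T pseudo-labels z_t, here 0-based in [0, V)
    mask    : T booleans, True where the frame of F was masked
    alpha   : assumed weight on masked frames; (1 - alpha) on unmasked ones
    """
    masked, unmasked = [], []
    for frame_logits, z, is_masked in zip(logits, targets, mask):
        nll = -log_softmax(frame_logits)[z]  # negative log-likelihood of z_t
        (masked if is_masked else unmasked).append(nll)
    return alpha * mean(masked) + (1.0 - alpha) * mean(unmasked)
```

With `alpha=1.0` only the masked frames contribute, which forces the encoder to infer the pseudo-targets of masked frames from their context.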

Table 1: Comparison of the performance (WER %) of our architecture (Codebook Attention applied at all layers with 50 entries in each learnable codebook) with baseline and other architectures on the Mcv-Accent-100 dataset. Numbers in bold denote the best across baselines, and green highlighting denotes the best WER across all experiments. Ties are broken using overall WER. The aggregate result for the proposed model achieves statistically significant improvements compared to the best baseline (at $p<0.001$, MAPSSWE test [mapsswe]).

		Aggregated			Seen Accents					Unseen Accents
Method	Size	All	Seen	Unseen	AUS	CAN	UK	SCT	US	AFR	HKG	IND	IRL	MAL	NWZ	PHL	SGP	WLS
Conformer [conformer]	43M	18.9	14.0	23.7	13.8	15.0	15.7	13.4	13.3	21.5	27.2	29.4	21.4	32.2	19.9	26.1	34.7	17.9
HuBERT [hubert]	104M	13.1	9.1	17.1	9.0	10.3	9.4	7.8	8.7	15.6	20.0	18.7	16.3	23.5	14.0	20.3	24.5	9.8
+ Pretrain ckpt	104M	9.7	6.3	13.1	5.3	7.7	6.2	4.7	6.1	12.4	15.5	13.8	12.5	18.8	10.0	15.2	20.0	7.8
+ Frozen layers	74M	9.3	6.0	12.5	4.9	7.8	5.5	4.7	5.9	11.6	15.8	12.2	11.8	18.0	9.6	15.0	19.8	7.3
MTL [asr_clf]	74M	9.4	6.0	12.8	5.0	7.6	5.6	4.7	5.9	11.9	15.5	13.4	12.2	17.6	10.0	15.4	19.1	8.3
DAT [bobw]	74M	9.3	6.0	12.5	5.1	7.5	5.9	\cellcolor{green!20}4.4	5.9	11.6	15.2	12.3	12.3	17.1	9.6	15.4	19.4	7.9
Proposed	76M	\cellcolor{green!20}8.9	\cellcolor{green!20}5.9	\cellcolor{green!20}11.9	\cellcolor{green!20}3.7	\cellcolor{green!20}7.5	\cellcolor{green!20}5.5	4.7	\cellcolor{green!20}5.9	\cellcolor{green!20}10.8	\cellcolor{green!20}14.7	\cellcolor{green!20}11.6	\cellcolor{green!20}11.4	\cellcolor{green!20}16.7	\cellcolor{green!20}9.2	\cellcolor{green!20}14.9	\cellcolor{green!20}18.3	\cellcolor{green!20}7.3

Our accent-based modifications are restricted to the Transformer-based encoder Enc. The Enc module consists of a stack of $N$ identical encoder layers. Each layer attends to the output of the previous layer and contextualizes it using self-attention blocks and projection layers. The self-attention block is responsible for introducing global context, while the projection layer refines the point-wise information. In addition, we introduce a cross-attention block that uses an attention mechanism to incorporate information from accent-specific codebooks into the representations. We first discuss how the codebooks are set up and then explain how they are used in a single encoder layer.

Codebook Setup. Let $E$ be the number of accents seen during training. We define a set of accent-specific codebooks $C=\{C^{1},C^{2},\ldots,C^{E}\mid C^{i}\in\mathbb{R}^{M\times d}\}$. Each $C^{i}$ comprises $M$ learnable codebook entries. That is, each codebook $C^{i}$ is:

\[
C^{i}=\{\mathbf{c}^{i}_{1},\ldots,\mathbf{c}^{i}_{M}\mid\mathbf{c}^{i}_{j}\in\mathbb{R}^{d}\}=\texttt{Embedding}([1,2,\ldots,M])
\]

where Embedding is a standard embedding layer. For a training speech input $X$ , whose underlying accent ID (indexing the $E$ seen accents) is $a\in\{1,\ldots,E\}$ , we deterministically select the codebook $C^{a}$ . The codebook entries in $C^{a}$ are subsequently integrated with all the encoder layers and across all attention heads. We note here that the codebooks are accent-specific on account of a single codebook being deterministically chosen per utterance during training.
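The codebook setup and the deterministic per-utterance selection can be illustrated with a small sketch. It uses plain Python lists in place of a trainable embedding layer, and the helper names `make_codebooks` and `select_codebook`, the Gaussian initialization, and the toy dimension are assumptions made for the example.

```python
import random

def make_codebooks(num_accents, entries, dim, seed=0):
    """Create E codebooks, each with M entries of dimension d.

    Stands in for E embedding tables; in a real model the entries
    would be trainable parameters rather than fixed random values.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(entries)]
        for _ in range(num_accents)
    ]

def select_codebook(codebooks, accent_id):
    """Pick codebook C^a for an utterance with accent ID a in {1, ..., E}."""
    if not 1 <= accent_id <= len(codebooks):
        raise ValueError("accent_id must index one of the seen accents")
    return codebooks[accent_id - 1]

# Five seen accents and 50 entries per codebook as in the experiments;
# dim=8 is a toy value chosen to keep the example small.
codebooks = make_codebooks(num_accents=5, entries=50, dim=8)
c_a = select_codebook(codebooks, accent_id=2)
```

Because the lookup depends only on the accent ID, every frame of an utterance attends to the same $M$ entries during training.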

Integrating Codebooks using Cross-attention. Each Transformer-based encoder layer of the HuBERT architecture consists of a self-attention block that introduces global context, followed by a position-wise feed-forward block that refines point-wise information. These two blocks are separated by layer normalization and coupled via residual connections. In our method, we introduce a cross-attention module that is positioned between the self-attention and feed-forward blocks and integrates information from the codebooks into the audio representations generated after self-attention. More precisely, for the $i^{\text{th}}$ encoder layer, let $\mathbf{A}$ be the input to the cross-attention block and $\hat{\mathbf{A}}$ be the output. Then, the attention probabilities across the $M$ codebook entries in $C^{a}$ for the $j^{\text{th}}$ position $\mathbf{A}_{j}$ are computed as:

\[
\{\beta^{j}_{1},\beta^{j}_{2},\ldots,\beta^{j}_{M}\}=\mathrm{softmax}\left(\frac{(\mathbf{A}_{j}W^{i}_{Q})(C^{a}W^{i}_{K})^{T}}{\sqrt{d}}\right)
\]

where $W^{i}_{Q}$, $W^{i}_{K}\in\mathbb{R}^{d\times d}$ are learned projection matrices of the cross-attention block of the $i^{\text{th}}$ encoder layer and $\beta^{j}_{k}\in[0,1]$ is the attention weight assigned by the $j^{\text{th}}$ representation to the $k^{\text{th}}$ entry in codebook $C^{a}$. Finally, the $j^{\text{th}}$ frame in the output $\hat{\mathbf{A}}$ is a weighted average of the projected entries in codebook $C^{a}$:

\[
\hat{\mathbf{A}}_{j}=\sum_{k=1}^{M}\beta_{k}^{j}\cdot(C^{a}_{k}W^{i}_{V})
\]

where $\hat{\mathbf{A}}_{j}, C^{a}_{k}\in\mathbb{R}^{d}$ and $W^{i}_{V}\in\mathbb{R}^{d\times d}$. The output of the cross-attention block is subjected to layer normalization, a residual connection is added, and the result is fed as input to the feed-forward block.
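A single-head version of this cross-attention can be written out directly from the two equations. The sketch below is a pure-Python illustration under stated assumptions: the function names are hypothetical, one attention head is used, and the layer normalization and residual connection that follow the block are omitted.

```python
import math

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def codebook_cross_attention(A, C, W_q, W_k, W_v):
    """Attend from each frame A_j to the M entries of one accent codebook C.

    Returns (A_hat, betas): A_hat[j] is the beta-weighted sum of the
    value-projected codebook entries, betas[j] the M attention weights.
    """
    d = len(W_q[0])
    keys = [matvec(c, W_k) for c in C]      # C^a W_K
    values = [matvec(c, W_v) for c in C]    # C^a W_V
    A_hat, betas = [], []
    for a_j in A:
        q = matvec(a_j, W_q)                # A_j W_Q
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        beta = softmax(scores)
        A_hat.append([sum(b * v[t] for b, v in zip(beta, values))
                      for t in range(len(values[0]))])
        betas.append(beta)
    return A_hat, betas
```

Note that the keys and values depend only on the codebook, so they can be computed once per utterance and reused for all $T$ frames.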

2.2 Supervised ASR Fine-tuning

For the second stage of ASR fine-tuning, we adopt the state-of-the-art hybrid CTC-attention end-to-end ASR framework [jointctc] that consists of three modules: an encoder ($\textsc{Enc}_{f}$), a decoder (Dec) and a Connectionist Temporal Classification (CTC) [ctc] module. We replace the encoder $\textsc{Enc}_{f}$ with the pretrained Conv and Enc modules from Section 2.1, pretrained in conjunction with accent codebooks. ASR fine-tuning makes use of labeled speech instances $(X,Y)$ where $X$ is a raw speech sequence $X=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{L}\mid\mathbf{x}_{i}\in\mathbb{R}\}$ and $Y$ is a token sequence $\{y_{1},\ldots,y_{M}\}$. Both the encoder parameters and the accent-specific codebooks in $\textsc{Enc}_{f}$, initialized from the pretrained model, are further fine-tuned using a supervised ASR objective. The Dec module uses a cross-entropy loss ($\mathcal{L}_{\text{att}}$) to autoregressively predict a token $y_{t}$ given previous tokens $\{y_{1},\ldots,y_{t-1}\}$ and the encoder outputs. The CTC module, on the other hand, imposes a CTC loss ($\mathcal{L}_{\text{ctc}}$) directly on the encoder outputs and predicts a frame-aligned sequence of tokens by marginalizing over all possible alignments of $Y$ to the encoder outputs. The final loss is the weighted sum of both losses:

\begin{equation}
\mathcal{L}_{\text{asr}}=\eta\times\mathcal{L}_{\text{ctc}}+(1-\eta)\times\mathcal{L}_{\text{att}}
\end{equation}

where $\eta$ is a hyperparameter that balances the objectives.
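As a worked instance of Equation (1), assume that the CTC weight of 0.3 reported in Section 3 corresponds to $\eta$, and take purely illustrative (not measured) loss values $\mathcal{L}_{\text{ctc}}=2.0$ and $\mathcal{L}_{\text{att}}=1.0$:

\[
\mathcal{L}_{\text{asr}}=0.3\times 2.0+(1-0.3)\times 1.0=0.6+0.7=1.3
\]

Under this setting the attention loss carries 70% of the weight, so the CTC term acts mainly as an auxiliary alignment constraint on the encoder outputs.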

2.3 Inference using Codebooks

During inference, we do not assume access to an accent label for a test utterance. We employ the joint beam search proposed by Prabhu et al. [Prabhu2023AccentedSR]. This technique involves performing beam search jointly over all the seen accents. Scores for the seen accents are computed using each underlying accent-specific codebook. The beam holds the best expansions across all seen accents. More details of the joint beam search algorithm are in [Prabhu2023AccentedSR].
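The idea of keeping the best expansions across all seen accents can be sketched with a toy decoder. This is a simplified reading of the description above and not the algorithm of [Prabhu2023AccentedSR]: `score_fn` is a hypothetical stand-in for decoding with a given accent's codebook, and merging identical prefixes by their best score is an assumption.

```python
import math

def joint_beam_search(score_fn, accents, beam_width, max_len, eos="</s>"):
    """Toy beam search that expands every hypothesis under every seen accent.

    score_fn(prefix, accent) -> {token: log_prob} is a hypothetical scorer
    standing in for the model conditioned on that accent's codebook.
    Returns a list of (token_tuple, cumulative_log_prob), best first.
    """
    beam = [((), 0.0)]
    for _ in range(max_len):
        candidates = {}
        for prefix, score in beam:
            if prefix and prefix[-1] == eos:
                # Finished hypotheses are carried over unchanged.
                candidates[prefix] = max(candidates.get(prefix, -math.inf), score)
                continue
            for accent in accents:
                for token, logp in score_fn(prefix, accent).items():
                    new_prefix = prefix + (token,)
                    new_score = score + logp
                    # Assumption: if several accents yield the same prefix,
                    # keep only its best score.
                    if new_score > candidates.get(new_prefix, -math.inf):
                        candidates[new_prefix] = new_score
        # The beam holds the best expansions across all accents.
        beam = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
        if all(p and p[-1] == eos for p, _ in beam):
            break
    return beam
```

Since each step scores every hypothesis once per seen accent, the per-step cost in this sketch grows linearly with the number of codebooks.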

3 Experimental Setup

We conduct all our pre-training and fine-tuning experiments using the Fairseq [fairseq] and ESPnet [espnet] toolkits, respectively, on NVIDIA RTX A6000 GPUs.

Dataset details. We use the Mozilla Common Voice [mcv] accented English benchmarking dataset Mcv-Accent (https://tinyurl.com/accent-dataset) in all our experiments. This dataset consists of five seen accents (AUS, CAN, SCT, UK, US) and nine unseen accents (AFR, HKG, IND, IRL, MAL, NWZ, PHL, SGP, WLS). The dataset includes Mcv-Accent-100 and Mcv-Accent-600 training splits containing $100$ hours and $620$ hours of audio data respectively, along with $17$ hours of validation and test splits. For pretraining, we only use the larger Mcv-Accent-600 train split, and for ASR, we report results on both training splits.

Pre-training setup. In all our experiments, we use the HuBERT-Base architecture [hubert]. This architecture consists of a $7$-layer convolution feature extractor and a $12$-layer Transformer encoder with $12$ attention heads ($d=768$). The model is trained for two iterations, each consisting of $200$k steps. In both iterations, we use $500$ hidden units obtained from the output of the $6^{\text{th}}$ encoder layer via $k$-means clustering as the pseudo-targets. In each iteration, the model is trained using an Adam optimizer [adam] with a learning rate of $5\times 10^{-5}$, $32$k warmup steps, a batch size of $87.5$ seconds and gradient accumulation over four steps.
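Once the $k$-means centroids are available, turning frame-level features into pseudo-targets amounts to a nearest-centroid lookup. The sketch below illustrates only that assignment step with toy two-dimensional data; the function name and the use of squared Euclidean distance are assumptions, and fitting the centroids is not shown.

```python
def assign_pseudo_targets(features, centroids):
    """Map each frame feature to the index of its nearest centroid.

    features  : T feature vectors (e.g. outputs of an intermediate layer)
    centroids : V cluster centres from an offline k-means fit
    Returns T integer labels in [0, V).
    """
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in features
    ]

# Toy example with V=2 centroids and T=3 frames.
labels = assign_pseudo_targets(
    [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2]],
    [[0.0, 0.0], [1.0, 1.0]],
)
```

In the setup described above there would be 500 centroids, so each frame receives one of 500 discrete labels.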

Fine-tuning setup. We use the joint CTC-attention based encoder-decoder architecture [jointctc], but replace the standard encoder with the model from the self-supervised pre-training step (Section 2.1). For experiments on the Mcv-Accent-100 training split, we add 3-way speed perturbation prior to training. Throughout all our experiments, we train the model for $50$ epochs with a batch size of $128$, a dropout rate of $0.1$, a learning rate of $1.0$ and $25$k warmup steps. The loss is calculated with a CTC weight of $0.3$ and label smoothing of $0.1$.

4 Experimental Results and Analysis

Table 1 compares the performance (WER %) of our proposed system against four alternative approaches: (1) a Conformer baseline [conformer]; (2) a model in which the Conformer encoder is replaced with a HuBERT model [hubert] pretrained with the SSL objective on Mcv-Accent-600; (3) the HuBERT baseline jointly trained with an accent classifier (MTL) [asr_clf]; and (4) Domain Adversarial Training (DAT) [bobw] of the HuBERT baseline with an accent classifier on the $12^{\text{th}}$ encoder layer.

Table 2: WER (%) of our best codebook-based system when the codebook of a particular seen accent is withheld during inference. The first row (Codebook Attention) uses all codebooks.

	Seen Accents					Unseen Accents
Codebook withheld	AUS	CAN	UK	SCT	US	AFR	HKG	IND	IRL	MAL	NWZ	PHL	SGP	WLS
Codebook Attention	\cellcolor{gray!10}3.7	\cellcolor{gray!10}7.5	\cellcolor{gray!10}5.5	\cellcolor{gray!10}4.7	\cellcolor{gray!10}5.9	\cellcolor{gray!10}10.8	\cellcolor{gray!10}14.7	\cellcolor{gray!10}11.6	\cellcolor{gray!10}11.4	\cellcolor{gray!10}16.7	\cellcolor{gray!10}9.2	\cellcolor{gray!10}14.9	\cellcolor{gray!10}18.3	\cellcolor{gray!10}7.3
Australia	\cellcolor{red!20}4.9	7.4	5.6	4.9	6.0	10.9	14.9	11.4	11.3	17.2	\cellcolor{red!20}9.9	14.8	18.2	7.6
Canada	3.7	7.6	5.5	4.8	6.1	10.7	14.8	11.6	11.5	16.6	9.1	14.8	18.4	7.0
England	4.1	7.4	\cellcolor{red!20}5.7	4.7	6.1	\cellcolor{red!20}11.0	14.6	11.7	11.5	16.6	9.3	\cellcolor{red!20}15.0	18.6	\cellcolor{red!20}7.7
Scotland	3.6	7.4	5.5	\cellcolor{red!20}5.2	5.8	10.7	14.3	11.3	11.5	16.5	9.0	14.7	18.6	7.1
US	3.8	7.7	5.6	4.4	6.1	10.8	14.8	11.9	11.5	17.2	9.2	14.9	18.5	7.3
US + Canada	3.6	\cellcolor{red!20}8.0	5.6	4.8	\cellcolor{red!20}6.4	10.7	\cellcolor{red!20}15.4	\cellcolor{red!20}12.3	\cellcolor{red!20}11.9	\cellcolor{red!20}17.3	9.1	14.9	\cellcolor{red!20}18.9	7.0

Similar to previous studies [baevski2020wav2vec, hubert], we find that replacing Conformer’s encoder with a pre-trained HuBERT model leads to significant improvement in performance; we note here that the HuBERT-based encoder is pretrained from scratch on Mcv-Accent-600. Furthermore, we observe that initializing self-supervised pre-training from an existing checkpoint (i.e., the HuBERT-Base Librispeech checkpoint from Fairseq [fairseq]) leads to additional performance gains. Given the limited amount of supervised fine-tuning data in Mcv-Accent-100, to combat overfitting, we freeze the feature extractor and the first $3$ layers of the HuBERT encoder; this yields further benefits. (In all subsequent experiments, we use this best setup of selective ASR fine-tuning.) We also compare our system against the MTL and DAT approaches. Overall, our proposed approach performs significantly better on nearly all accents, particularly the unseen accents.
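Relative WER reductions between any two rows of Table 1 follow from a one-line formula. The helper below is a small illustrative sketch (the function name is an assumption); the WER values are read from Table 1, and the choice of which rows to compare is ours.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent: 100 * (baseline - system) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# "All" column: best baseline (9.3) vs. Proposed (8.9).
print(round(relative_wer_reduction(9.3, 8.9), 1))    # -> 4.3
# "Unseen" column: "+ Pretrain ckpt" (13.1) vs. Proposed (11.9).
print(round(relative_wer_reduction(13.1, 11.9), 1))  # -> 9.2
```

The size of the relative gain therefore depends strongly on which baseline row is taken as the reference.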

Table 3: Comparing zero-shot performance (WER %) of our architecture with other approaches on the L2-Arctic dataset. $\dagger$ indicates statistical significance (at $p<0.001$ using the MAPSSWE test) w.r.t. the HuBERT baseline.

		Accents
Method	All	ARA	HIN	KOR	MAN	SPA	VIA
HuBERT Encoder [hubert]	22.6	20.2	17.8	17.3	25.8	20.4	33.7
MTL [asr_clf]	23.0	21.0	18.1	17.6	26.4	20.9	34.1
DAT [bobw]	22.9	20.7	18.2	17.4	26.2	20.9	34.1
Codebook Attention	21.7 $\dagger$	19.9	16.5	16.4	24.8	19.8	32.7

Importance of Codebooks. Table 2 shows the WERs of our best codebook-based system when the codebook of a seen accent is withheld during inference with the joint beam search; this checks whether the absence of that codebook during decoding degrades performance. The first row shows the WERs when all codebooks are present. On removing the Australian codebook, we find that the WER on Australian test samples worsens from $3.7\%$ to $4.9\%$. Interestingly, among the unseen accents, the WER of the New Zealand accent suffers the most when the Australian codebook is withheld ($9.2\rightarrow 9.9$). This suggests that relatedness in accents is being captured in the codebooks. Also, on removing both the US and Canada accent codebooks (which are closely related), we see the largest degradation in WER on US and Canada test samples.

Zero-shot ASR Evaluation. In Table 3, we compare WERs of our proposed system with baselines when evaluated in a zero-shot setting on out-of-domain accented samples from the L2-Arctic [l2arctic] dataset. The dataset consists of six non-native English accents namely: Arabic (ARA), Hindi (HIN), Korean (KOR), Mandarin (MAN), Spanish (SPA), and Vietnamese (VIA). Our system significantly outperforms the baselines (at $p<0.001$ ) across all accents on the L2-Arctic dataset. This attests to the robustness of the learned speech representations from our system.

Ablation Analysis. In Table 4, we investigate various settings related to the codebooks. We apply codebooks to different encoder layers and obtain the best performance when applying codebooks on the $6^{\text{th}}$ layer for seen accents, and on all 12 layers for unseen accents. With varying codebook sizes, $50$ codebook entries performed the best overall and on unseen accents; $500$ codebook entries improved the seen accents slightly ($5.94\rightarrow 5.87$) but hurt the unseen accents ($11.91\rightarrow 12.18$), thus indicating overfitting. Lastly, we examined different training protocols for the codebooks. Randomly-initialized (and not learnable) codebooks ($C_{\text{random}}$) perform comparably to codebooks trained during both SSL and ASR fine-tuning (denoted by $C_{\text{both}}$). We also evaluate codebooks that are learned during pretraining and frozen during ASR fine-tuning ($C_{\text{frozen}}$). This performs nearly as well as $C_{\text{both}}$ on seen accents but slightly underperforms on unseen accents. This suggests that the codebooks are meaningfully learned during the pretraining stage.

Table 4: Comparison of the performance (WER %) of different variants of our architecture. $\textsc{C}_{L\in(i,\ldots,j)}(P=k)$: Codebook attention applied at all layers from $i$ to $j$ with $k$ entries per accent codebook.

Layers	# entries per codebook	Overall	Seen	Unseen
Varying codebook influence
$L=\{6\}$	$C=50$	9.01	5.89	12.12
$L=\{1,\ldots,6\}$	$C=50$	9.03	5.91	12.13
$L=\{1,\ldots,12\}$	$C=50$	8.95	5.94	11.91
Varying codebook size
$L=\{1,\ldots,12\}$	$C=50$	8.95	5.94	11.91
$L=\{1,\ldots,12\}$	$C=200$	9.01	6.00	12.10
$L=\{1,\ldots,12\}$	$C=500$	9.03	5.87	12.18
Varying codebook nature
$L=\{6\}$	$C_{\text{both}}=50$	9.01	5.89	12.12
$L=\{6\}$	$C_{\text{frozen}}=50$	9.07	5.91	12.23
$L=\{6\}$	$C_{\text{random}}=50$	8.98	5.93	12.02

ASR Fine-tuning with Larger Dataset. We fine-tune the pretrained HuBERT baseline, MTL, DAT, and our best system on Mcv-Accent-600. As seen in Table 5, our system performs comparably to the other baselines on the seen accents but significantly improves on unseen accents.

Table 5: Comparison of the WER (%) of our approach with other baselines on the Mcv-Accent-600 dataset.

Method	Overall	Seen	Unseen
HuBERT [hubert]	6.68	3.87	9.49
MTL [asr_clf]	6.57	3.76	9.37
DAT [bobw]	6.57	3.83	9.30
Codebook Attention	6.43	3.80	9.19

5 Conclusion

In this work, we propose an accent-aware ASR adaptation technique where accent-specific codebooks are incorporated within the Transformer layers of a HuBERT model via cross-attention. This integration happens right from the SSL-based pretraining stage. The pretrained codebooks and encoder layers are further finetuned using supervised ASR fine-tuning. Compared to existing accent adaptation techniques, we observe that this yields significant WER reductions on English utterances in both seen and unseen accents in the Mozilla Common Voice (MCV) corpus. The accent-aware models trained on MCV also generalize well to out-of-domain accented English samples (from a different corpus, L2Arctic) when evaluated in a zero-shot setting. In future work, we aim to use self-training with unlabeled data (with accent labels) to further refine the accent codebooks.

6 Acknowledgements

We acknowledge the financial support from a SERB Core Research Grant, Department of Science and Technology, Government of India on accented speech processing.

7 References

\printbibliography