Leave No Knowledge Behind During Knowledge Distillation:
Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data

Abstract

Recent advances in automatic speech recognition (ASR) often rely on large speech foundation models for generating high-quality transcriptions. However, these models can be impractical due to limited computing resources. The situation is even more severe in terms of more realistic or difficult scenarios, such as code-switching ASR (CS-ASR). To address this, we present a framework for developing more efficient models for CS-ASR through knowledge distillation using realistic speech-only data. Our proposed method, Leave No Knowledge Behind During Knowledge Distillation (K2D), leverages both the teacher model’s knowledge and additional insights from a small auxiliary model. We evaluate our approach on two in-domain and two out-domain datasets, demonstrating that K2D is effective. By conducting K2D on the unlabeled realistic data, we have successfully obtained a 2-time smaller model with 5-time faster generation speed while outperforming the baseline methods and the teacher model on all the testing sets. We have made our model publicly available on Hugging Face.

Index Terms—  automatic speech recognition, knowledge distillation, code-switching

1 Introduction

ASR has long posed significant challenges within the speech community. Recent breakthroughs have leveraged techniques such as self-supervised learning (SSL) [1, 2, 3], self-training [4, 5, 6], and large-scale or weakly-supervised learning [7, 8, 9] to develop high-quality ASR systems. Despite these advancements, achieving high-quality transcriptions often requires the use of large speech foundation models [7, 8, 9]. This can be impractical for many applications, especially when computational resources are limited. The challenge is further amplified in realistic scenarios, where speech is more diverse and difficult to transcribe. A particularly challenging example of this diversity is code-switching ASR (CS-ASR), where speakers frequently alternate between languages within and between utterances.

To address this issue, we aim to develop smaller and faster models for CS-ASR using realistic data that we have collected. Our dataset comprises academic course videos covering a range of subjects, including but not limited to engineering, science, and liberal arts. The data comprises approximately 60,000 hours of raw speech, mostly Mandarin-English code-switched. Despite the enormous volume of audio data, we lack transcriptions for training, which presents a significant challenge in developing effective ASR models. To address this, we propose using pseudo-labeling as a solution to leverage the unlabeled data. With the pseudo labels, we can conduct knowledge distillation along with the teacher model [10]. However, pseudo-labels often contain errors and hallucinations [4], which can degrade model performance if not properly managed.

This work proposes a novel method called Leave No Knowledge Behind During Knowledge Distillation (K2D). Our proposed method enhances the traditional knowledge distillation process by integrating insights from both a large teacher model and a smaller auxiliary model. We filter data by referencing the small model’s transcriptions on the realistic data. By evaluating on two in-domain and two out-of-domain (OOD) testing sets, we provide evidence that applying K2D can lead to a better student model than the original knowledge distillation. The distilled models even surpass the teacher model on all the testing sets while being two times smaller and five times faster, showing the strong generalizability and efficiency of applying K2D (Fig. 1). To our best knowledge, we are the first to explore knowledge distillation for CS-ASR with unlabeled realistic data, and we propose an improved framework by incorporating a small auxiliary model.

Refer to caption
Fig. 1: Our proposed framework K2D achieves significant performance improvement over the teacher model (Whisper Large-v2) on both in-domain and OOD testing sets.

2 Background

2.1 Code-Switching ASR

Code-switching ASR (CS-ASR) is a challenging sub-problem in automatic speech recognition. Unlike monolingual utterances, a code-switched utterance may contain transitions between languages at one or multiple positions. The problem is particularly pronounced in regions with more than one official language. Conventional methods address this issue by incorporating language identification (LID) [11, 12, 13, 14, 15], using monolingual data [16, 17, 18, 19, 20], or leveraging data augmentation techniques [21, 22]. Recent approaches focus on utilizing large language models (LLMs) to generate CS data or investigating large speech foundation models CS-ASR with different techniques [23, 24, 25]. Our work differs from those mentioned above by optimally utilizing large-scale unlabeled CS data collected from real-world scenarios.

2.2 Knowledge Distillation

Knowledge distillation (KD) [26] has been a common and effective method for model compression. In KD, a student model learns to mimic the behavior of a stronger teacher model to achieve similar performance. With the advent of powerful large foundation models, KD has been employed across multiple fields to obtain smaller and more efficient models [27, 28, 29, 30], further strengthening its effectiveness. Typically, the targets of KD can be categorized into two types: the teacher’s output and hidden representations. Distilling the teacher’s hidden representation is common when the task is not specific [27, 28, 29]. However, these methods are not tailored for CS-ASR and do not leverage large-scale unbalanced data effectively. In this work, we address these limitations by focusing specifically on CS-ASR in real-world scenarios and employing the learning objectives of Distil-Whisper [10], which utilize the teacher model’s outputs during knowledge distillation. Our approach optimally uses large-scale unlabeled CS data, enhancing the efficiency and performance of the distilled models for CS-ASR.

2.3 Large-Scale Pseudo-Labels for Distilled Models

Inspired by the success of Whisper [7], Distil-Whisper [10] aims to provide a more efficient model through knowledge distillation using large-scale pseudo-labels. Distil-Whisper utilizes 21,170 hours of audio data from 9 publicly available datasets for KD. However, the datasets are monolingual and exclusively in English, which motivates us to develop our own distilled model for more effective generations in a code-switching context. Unlike Distil-Whisper, which utilizes data in labeled monolingual corpora for training, the data we collected is code-switched and has no true labels. Therefore, we cannot filter data based on the word error rate (WER) between the real and the pseudo-labeled transcriptions, as suggested in Distil-Whisper. To address the issue, we aim to provide a simple and effective method based on a small auxiliary model to perform cross-model validation for data filtering without any labeled data.

2.4 Semi-Supervised Learning and Self-Training

Semi-supervised learning and self-training focus on developing techniques to exploit the unlabeled data. In semi-supervised learning, the data can be separated into labeled and unlabeled ones. Utilizing the labeled data for supervised learning is trivial. Thus, most efforts are dedicated to leveraging the unlabeled data, which is often large-scale and noisy. One of the most common solutions is to use the supervised-learned model to generate pseudo-labels and then use the pseudo-labels for self-training [31]. Due to the model’s imperfections, directly leveraging all pseudo-labels may lead to slight improvement. To address the issue, previous works may use additional LM to improve pseudo-labels quality [4] or conduct data filtering [32, 33, 34, 35]. Moreover, recent works propose techniques like model ensembling, iterative pseudo-labeling, or continuous pseudo-labeling for self-training [4, 6, 36, 37]. These methods are more effective but often involve complicated algorithms during training, resulting in even higher computation resources.

3 Method: K2D

Refer to caption
Fig. 2: Overview of the K2D Framework. (a) Realistic Pseudo-Labeling: The teacher model generates transcriptions with timestamps from long-form audio. (b) Data Pre-Filtering: Chunked audio is validated by the small auxiliary model, filtering out inaccurate labels. (c) Knowledge Distillation: Validated pseudo-labels are used to train the student model, enhancing accuracy and efficiency.

We propose a simple and effective framework called K2D for developing an efficient code-switching ASR system via knowledge distillation using speech-only realistic data. Our K2D framework comprises three stages: realistic pseudo-labeling, data pre-filtering, and knowledge distillation. First, we perform realistic pseudo-labeling (§ 3.1) on the realistic long-form data and segment each into small chunks. Next, we conduct data pre-filtering (§ 3.2) to validate the chunkwise pseudo-labels based on the additional knowledge from the auxiliary small model. Finally, we use the validated data for knowledge distillation (§ 3.3). The illustration of the proposed framework is presented in Fig. 2.

3.1 Realistic Pseudo-Labeling

Given a long-form realistic speech X𝑋Xitalic_X and the teacher model MteachersubscriptMteacher{\rm M}_{\rm teacher}roman_M start_POSTSUBSCRIPT roman_teacher end_POSTSUBSCRIPT, we first generate the corresponding transcriptions with timestamps Y=Mteacher(X)𝑌subscriptMteacher𝑋Y={\rm M}_{\rm teacher}(X)italic_Y = roman_M start_POSTSUBSCRIPT roman_teacher end_POSTSUBSCRIPT ( italic_X ). Specifically, we use Whisper [7] as the teacher model, which produces timestamps along with the transcriptions when performing sequential generation on the long-form audio. With an audio-transcription pair (X,Y)𝑋𝑌(X,Y)( italic_X , italic_Y ), we then generate M𝑀Mitalic_M segments based on the timestamps in Y𝑌Yitalic_Y. The segments can be formulated as [(X1,Y1,d1),(X2,Y2,d2),,(XM,YM,dM)]subscriptsuperscript𝑋1subscriptsuperscript𝑌1subscriptsuperscript𝑑1subscriptsuperscript𝑋2subscriptsuperscript𝑌2subscriptsuperscript𝑑2subscriptsuperscript𝑋𝑀subscriptsuperscript𝑌𝑀subscriptsuperscript𝑑𝑀[(X^{\prime}_{1},Y^{\prime}_{1},d^{\prime}_{1}),(X^{\prime}_{2},Y^{\prime}_{2}% ,d^{\prime}_{2}),\ldots,(X^{\prime}_{M},Y^{\prime}_{M},d^{\prime}_{M})][ ( italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_Y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_d start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , ( italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_Y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_d start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) , … , ( italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT , italic_Y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT , italic_d start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ) ], where each segment tuple (Xi,Yi,di),i[1,M]subscriptsuperscript𝑋𝑖subscriptsuperscript𝑌𝑖subscriptsuperscript𝑑𝑖for-all𝑖1𝑀(X^{\prime}_{i},Y^{\prime}_{i},d^{\prime}_{i}),\forall i\in[1,M]( italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_d start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , ∀ italic_i ∈ [ 1 , italic_M ] represents the audio, transcription, and the duration of the i𝑖iitalic_i-th segment.

Next, we concatenate the adjacent segments into chunks with a maximum duration of 30 seconds based on [d1,,dM]subscriptsuperscript𝑑1subscriptsuperscript𝑑𝑀[d^{\prime}_{1},\ldots,d^{\prime}_{M}][ italic_d start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_d start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ]. We iterate through each segment in order, add the segment to the current chunk, and update the chunk’s duration, provided that doing so does not make the current chunk exceed 30 seconds. If adding a segment increases the duration of the chunk to more than 30 seconds, we finalize the current chunk and start a new chunk with the current segment. This process continues until all segments have been processed, ensuring that each chunk has a duration \leq 30 seconds. We describe the chunked audio-transcription pairs as [(X^1,Y^1),,(X^N,Y^N)],NMsubscript^𝑋1subscript^𝑌1subscript^𝑋𝑁subscript^𝑌𝑁𝑁𝑀[(\hat{X}_{1},\hat{Y}_{1}),\ldots,(\hat{X}_{N},\hat{Y}_{N})],N\leq M[ ( over^ start_ARG italic_X end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , … , ( over^ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) ] , italic_N ≤ italic_M.

Finally, let f𝑓fitalic_f be a function that generates the set of chunked audio-transcription pairs given X𝑋Xitalic_X; we can construct the realistic pseudo-labeled dataset based on the original dataset 𝒳𝒳\mathcal{X}caligraphic_X:

𝒟RPL=X𝒳f(X),f(X)={(X^i,Y^i)i[1,N]}.formulae-sequencesubscript𝒟RPLsubscriptfor-all𝑋𝒳𝑓𝑋𝑓𝑋conditional-setsubscript^𝑋𝑖subscript^𝑌𝑖𝑖1𝑁\mathcal{D}_{\rm RPL}=\bigcup_{\forall X\in\mathcal{X}}f(X),\quad f(X)=\{(\hat% {X}_{i},\hat{Y}_{i})\mid i\in[1,N]\}.caligraphic_D start_POSTSUBSCRIPT roman_RPL end_POSTSUBSCRIPT = ⋃ start_POSTSUBSCRIPT ∀ italic_X ∈ caligraphic_X end_POSTSUBSCRIPT italic_f ( italic_X ) , italic_f ( italic_X ) = { ( over^ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∣ italic_i ∈ [ 1 , italic_N ] } .

We keep all the timestamp tokens in the transcription Y^^𝑌\hat{Y}over^ start_ARG italic_Y end_ARG to facilitate segmentation behavior learning during knowledge distillation.

3.2 Data Pre-Filtering

Due to the imperfection of the teacher model, the pseudo-labeled audio-transcription pairs (X^,Y^)𝒟RPL^𝑋^𝑌subscript𝒟RPL(\hat{X},\hat{Y})\in\mathcal{D}_{\rm RPL}( over^ start_ARG italic_X end_ARG , over^ start_ARG italic_Y end_ARG ) ∈ caligraphic_D start_POSTSUBSCRIPT roman_RPL end_POSTSUBSCRIPT may inevitably contain different levels of errors or hallucinations. As mentioned in the previous works [4, 10], one of the most common types of hallucination is looping or repetitive generation. The hallucination can be detected trivially by counting the n-gram according to [4]. Specifically, we may consider a pseudo-label Y^^𝑌\hat{Y}over^ start_ARG italic_Y end_ARG as hallucinated if it contains a n𝑛nitalic_n-gram pattern for over c𝑐citalic_c times. We refer this detection as trivial oder n𝑛nitalic_n-gram method, denoted as follows:

hcn(Y^)={1,if Y^ contains a n-gram pattern over c times,0,otherwise.subscriptsuperscript𝑛𝑐^𝑌cases1if ^𝑌 contains a n-gram pattern over c times0otherwiseh^{n}_{c}(\hat{Y})=\begin{cases}1,\quad&\text{if }\hat{Y}\text{ contains a $n$% -gram pattern over $c$ times},\\ 0,\quad&\text{otherwise}.\end{cases}italic_h start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ( over^ start_ARG italic_Y end_ARG ) = { start_ROW start_CELL 1 , end_CELL start_CELL if over^ start_ARG italic_Y end_ARG contains a italic_n -gram pattern over italic_c times , end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL otherwise . end_CELL end_ROW

By filtering out the pseudo-labels with the trivial hallucinations, we can form a new dataset:

𝒟trivial={(X^,Y^)𝒟RPLhcn(Y^)=0}.subscript𝒟trivialconditional-set^𝑋^𝑌subscript𝒟RPLsubscriptsuperscript𝑛𝑐^𝑌0\displaystyle\mathcal{D}_{\rm trivial}=\{(\hat{X},\hat{Y})\in\mathcal{D}_{\rm RPL% }\mid h^{n}_{c}(\hat{Y})=0\}.caligraphic_D start_POSTSUBSCRIPT roman_trivial end_POSTSUBSCRIPT = { ( over^ start_ARG italic_X end_ARG , over^ start_ARG italic_Y end_ARG ) ∈ caligraphic_D start_POSTSUBSCRIPT roman_RPL end_POSTSUBSCRIPT ∣ italic_h start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ( over^ start_ARG italic_Y end_ARG ) = 0 } . (1)

While being suitable for detecting the looping errors, the n𝑛nitalic_n-gram method might fall short for the other types of hallucinations.

In this work, we aim to provide a simple and effective method to validate the pseudo-labels by introducing a small auxiliary model MvalidatorsubscriptMvalidator{\rm M}_{\rm validator}roman_M start_POSTSUBSCRIPT roman_validator end_POSTSUBSCRIPT for validation. Specifically, we perform data filtering by validating across the outputs from the teacher model MteachersubscriptMteacher{\rm M}_{\rm teacher}roman_M start_POSTSUBSCRIPT roman_teacher end_POSTSUBSCRIPT and the small auxiliary model MvalidatorsubscriptMvalidator{\rm M}_{\rm validator}roman_M start_POSTSUBSCRIPT roman_validator end_POSTSUBSCRIPT. Given a pseudo-labeled audio-transcription pair (X^,Y^)^𝑋^𝑌(\hat{X},\hat{Y})( over^ start_ARG italic_X end_ARG , over^ start_ARG italic_Y end_ARG ), we first generate the validation-oriented transcription V^=Mvalidator(X^)^𝑉subscriptMvalidator^𝑋\hat{V}={\rm M}_{\rm validator}(\hat{X})over^ start_ARG italic_V end_ARG = roman_M start_POSTSUBSCRIPT roman_validator end_POSTSUBSCRIPT ( over^ start_ARG italic_X end_ARG ) with the small model, and then perform cross-model validation between Y^^𝑌\hat{Y}over^ start_ARG italic_Y end_ARG and V^^𝑉\hat{V}over^ start_ARG italic_V end_ARG. The cross-model validation is accomplished by calculating the distance metric δ𝛿\deltaitalic_δ, accompanied by an additional threshold α[0.0,1.0)𝛼0.01.0\alpha\in[0.0,1.0)italic_α ∈ [ 0.0 , 1.0 ) for the final determination. We formulate the filtered dataset after cross-model validation as follows:

𝒟validated={(X^,Y^)𝒟RPLδ(Y^,V^)α}.subscript𝒟validatedconditional-set^𝑋^𝑌subscript𝒟RPL𝛿^𝑌^𝑉𝛼\displaystyle\mathcal{D}_{\rm validated}=\{(\hat{X},\hat{Y})\in\mathcal{D}_{% \rm RPL}\mid\delta(\hat{Y},\hat{V})\leq\alpha\}.caligraphic_D start_POSTSUBSCRIPT roman_validated end_POSTSUBSCRIPT = { ( over^ start_ARG italic_X end_ARG , over^ start_ARG italic_Y end_ARG ) ∈ caligraphic_D start_POSTSUBSCRIPT roman_RPL end_POSTSUBSCRIPT ∣ italic_δ ( over^ start_ARG italic_Y end_ARG , over^ start_ARG italic_V end_ARG ) ≤ italic_α } . (2)

As presented, the distance metric δ𝛿\deltaitalic_δ is essential during filtering. We introduce the three different kinds of distance metrics we experiment with. First, we introduce the two direct measurements. We define the distance metric by directly calculating the mixed error rate (MER) between the teacher’s transcription Y^^𝑌\hat{Y}over^ start_ARG italic_Y end_ARG and the validator’s transcription V^^𝑉\hat{V}over^ start_ARG italic_V end_ARG, or the phoneme error rate (PER) between the phonemicized transcriptions Y^psubscript^𝑌p\hat{Y}_{\rm p}over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT roman_p end_POSTSUBSCRIPT and V^psubscript^𝑉p\hat{V}_{\rm p}over^ start_ARG italic_V end_ARG start_POSTSUBSCRIPT roman_p end_POSTSUBSCRIPT:

δMERsubscript𝛿MER\displaystyle\delta_{{\rm MER}}italic_δ start_POSTSUBSCRIPT roman_MER end_POSTSUBSCRIPT =MER(Y^,V^),absentMER^𝑌^𝑉\displaystyle={\rm MER}(\hat{Y},\hat{V}),= roman_MER ( over^ start_ARG italic_Y end_ARG , over^ start_ARG italic_V end_ARG ) , (3)
δPERsubscript𝛿PER\displaystyle\delta_{{\rm PER}}italic_δ start_POSTSUBSCRIPT roman_PER end_POSTSUBSCRIPT =PER(Y^p,V^p),absentPERsubscript^𝑌psubscript^𝑉p\displaystyle={\rm PER}(\hat{Y}_{\rm p},\hat{V}_{\rm p}),= roman_PER ( over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT roman_p end_POSTSUBSCRIPT , over^ start_ARG italic_V end_ARG start_POSTSUBSCRIPT roman_p end_POSTSUBSCRIPT ) , (4)

Last but not least, we introduce the composite metric, which explicitly considers the trivial hallucinations of Y^^𝑌\hat{Y}over^ start_ARG italic_Y end_ARG and V^^𝑉\hat{V}over^ start_ARG italic_V end_ARG:

δcomp=max(hcn(Y^),min(1hcn(V^),δPER(Y^p,V^p))).subscript𝛿compsubscriptsuperscript𝑛𝑐^𝑌1subscriptsuperscript𝑛𝑐^𝑉subscript𝛿PERsubscript^𝑌psubscript^𝑉p\displaystyle\delta_{\rm comp}=\max(h^{n}_{c}(\hat{Y}),\min(1-h^{n}_{c}(\hat{V% }),\delta_{\rm PER}(\hat{Y}_{\rm p},\hat{V}_{\rm p}))).italic_δ start_POSTSUBSCRIPT roman_comp end_POSTSUBSCRIPT = roman_max ( italic_h start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ( over^ start_ARG italic_Y end_ARG ) , roman_min ( 1 - italic_h start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ( over^ start_ARG italic_V end_ARG ) , italic_δ start_POSTSUBSCRIPT roman_PER end_POSTSUBSCRIPT ( over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT roman_p end_POSTSUBSCRIPT , over^ start_ARG italic_V end_ARG start_POSTSUBSCRIPT roman_p end_POSTSUBSCRIPT ) ) ) . (5)

When Y^^𝑌\hat{Y}over^ start_ARG italic_Y end_ARG is trivially hallucinated, we should directly discard the pseudo-labels; on the other hand, we should keep the pseudo-labels when V^^𝑉\hat{V}over^ start_ARG italic_V end_ARG is trivially hallucinated.

3.3 Knowledge Distillation

In K2D, we conduct knowledge distillation on the cross-validated realistic dataset 𝒟validatedsubscript𝒟validated\mathcal{D}_{\rm validated}caligraphic_D start_POSTSUBSCRIPT roman_validated end_POSTSUBSCRIPT (Eq. 2). Otherwise, the distillation process mainly follows Distil-Whisper [10]. First, we construct the student model MstudentsubscriptMstudent{\rm M}_{\rm student}roman_M start_POSTSUBSCRIPT roman_student end_POSTSUBSCRIPT. The student model is an encoder-decoder model, in which we initialized parameters from the teacher model MteachersubscriptMteacher{\rm M}_{\rm teacher}roman_M start_POSTSUBSCRIPT roman_teacher end_POSTSUBSCRIPT. Given a pseudo-labeled audio-transcription pair (X^,Y^)𝒟validated^𝑋^𝑌subscript𝒟validated(\hat{X},\hat{Y})\in\mathcal{D}_{\rm validated}( over^ start_ARG italic_X end_ARG , over^ start_ARG italic_Y end_ARG ) ∈ caligraphic_D start_POSTSUBSCRIPT roman_validated end_POSTSUBSCRIPT, we can generate student’s prediction S^=[s^1,s^2,,s^k]^𝑆subscript^𝑠1subscript^𝑠2subscript^𝑠𝑘\hat{S}=[\hat{s}_{1},\hat{s}_{2},\ldots,\hat{s}_{k}]over^ start_ARG italic_S end_ARG = [ over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] based on MstudentsubscriptMstudent{\rm M}_{\rm student}roman_M start_POSTSUBSCRIPT roman_student end_POSTSUBSCRIPT and teacher forcing with Y^^𝑌\hat{Y}over^ start_ARG italic_Y end_ARG; and calculate the cross-entropy loss as follows:

CE=i=1klogp(s^i=y^iy^<i,X^),subscriptCEsuperscriptsubscript𝑖1𝑘𝑝subscript^𝑠𝑖conditionalsubscript^𝑦𝑖subscript^𝑦absent𝑖^𝑋\mathcal{L}_{\rm CE}=-\sum_{i=1}^{k}\log p(\hat{s}_{i}=\hat{y}_{i}\mid\hat{y}_% {<i},\hat{X}),caligraphic_L start_POSTSUBSCRIPT roman_CE end_POSTSUBSCRIPT = - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT roman_log italic_p ( over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT < italic_i end_POSTSUBSCRIPT , over^ start_ARG italic_X end_ARG ) ,

where p𝑝pitalic_p indicates the probability function. Moreover, we generate the prediction distribution Q^=Mteacher(X^,Y^)^𝑄subscriptMteacher^𝑋^𝑌\hat{Q}={\rm M}_{\rm teacher}(\hat{X},\hat{Y})over^ start_ARG italic_Q end_ARG = roman_M start_POSTSUBSCRIPT roman_teacher end_POSTSUBSCRIPT ( over^ start_ARG italic_X end_ARG , over^ start_ARG italic_Y end_ARG ). The distribution Q^=q^1:k^𝑄subscript^𝑞:1𝑘\hat{Q}=\hat{q}_{1:k}over^ start_ARG italic_Q end_ARG = over^ start_ARG italic_q end_ARG start_POSTSUBSCRIPT 1 : italic_k end_POSTSUBSCRIPT then served as the soft-label during knowledge distillation. We use the soft-label Q^^𝑄\hat{Q}over^ start_ARG italic_Q end_ARG from the teacher model and the probability distribution P^=p(S^)=p^1:k^𝑃𝑝^𝑆subscript^𝑝:1𝑘\hat{P}=p(\hat{S})=\hat{p}_{1:k}over^ start_ARG italic_P end_ARG = italic_p ( over^ start_ARG italic_S end_ARG ) = over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT 1 : italic_k end_POSTSUBSCRIPT to calculate the Kullback-Leibler (KL) divergence loss between Q^^𝑄\hat{Q}over^ start_ARG italic_Q end_ARG and P^^𝑃\hat{P}over^ start_ARG italic_P end_ARG:

KL=i=1kKL(q^i,p^i).subscriptKLsuperscriptsubscript𝑖1𝑘KLsubscript^𝑞𝑖subscript^𝑝𝑖\mathcal{L}_{\rm KL}=\sum_{i=1}^{k}{\rm KL}(\hat{q}_{i},\hat{p}_{i}).caligraphic_L start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT roman_KL ( over^ start_ARG italic_q end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) .

Finally, we combine the cross-entropy loss and the KL divergence loss. The overall objective is defined as follows:

KD=βCE+γKL,subscriptKD𝛽subscriptCE𝛾subscriptKL\displaystyle\mathcal{L}_{\rm KD}=\beta\mathcal{L}_{\rm CE}+\gamma\mathcal{L}_% {\rm KL},caligraphic_L start_POSTSUBSCRIPT roman_KD end_POSTSUBSCRIPT = italic_β caligraphic_L start_POSTSUBSCRIPT roman_CE end_POSTSUBSCRIPT + italic_γ caligraphic_L start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT , (6)

where β𝛽\betaitalic_β and γ𝛾\gammaitalic_γ are the weighted coefficients.

Table 1: The duration distribution over the top six subjects with the longest durations. Eng: Engineering; LA: Liberal Arts; Sci: Science; EECS: Electrical Engineering and Computer Science; Mgmt: Management; BA: Bio-resource and Agriculture.

Subject Eng LA Sci EECS Mgmt BA Others Duration (hr) 9540 7900 6222 4796 3374 3236 25365

4 Experiments

4.1 Data

As mentioned in Section 1, we use self-collected data for training. The data is collected from the course recordings of National Taiwan University, resulting in about 60,000 hours of audio. The courses can be categorized into different subjects, and we present the time distribution of the top six subjects with the most extended durations in Table 1. As for the testing sets, we introduce them as follows:

COOLTEST represents our own in-domain and in-house testing set. The testing set contains 5 hours of speech from different subjects. We self-annotated the testing data by revising the transcriptions generated by Whisper Large-v2.

NTUML2021 is a speech dataset from National Taiwan University’s 2021 "Machine Learning" course. Since the dataset is derived from lecture videos, we use the dataset as a publicly available in-domain Mandarin-English CS-ASR corpus. We use the 9-hour testing split to perform the evaluation.

CommonVoice is a publicly available, multilingual dataset of voice recordings collected by Mozilla for ASR [38]. It includes contributions from volunteers worldwide, providing a diverse range of accents, languages, and demographic variations. The diversity of the corpora makes it suitable for serving as the out-of-domain evaluation set. We evaluate our method on the 16161616-th version of CommonVoice with the zh-TW split flag, indicating the speech collected from Taiwan.

ASCEND [39] is a publicly available, spontaneous, Mandarin-English CS dataset collected from conversational recordings. Furthermore, it is collected in Hong Kong, which differs from the chosen CommonVoice testing split mentioned earlier. The dataset, which is code-switched and realistic, is suitable for serving as out-of-domain evaluation of our method.

Table 2: The evaluation results of K2D on the two in-domain and two out-of-domain (OOD) testing sets. Our method demonstrates clear performance improvements and strong generalizability compared with the teacher model and the baseline methods. MERR indicates the mix error reduction rate, calculated relative to the teacher model. The speed-up is calculated based on the RTF averaged over five runs.

    Method Speed- up Data usage In-domain MER% (MERR) \downarrow OOD MER% (MERR) \downarrow COOLTEST NTUML2021 CV16 (zhTW) ASCEND Teacher Model    Whisper Large-v2 ×1.0absent1.0\times 1.0× 1.0 - 13.96 (-0.0%) 7.35 (-0.0%) 9.07 (-0.0%) 25.69 (-0.0%) Baseline (KD)    Full Data ×5.0absent5.0\times 5.0× 5.0 100% 12.98 (-7.0%) 6.29 (-14.4%) 7.80 (-14.0%) 22.07 (-14.1%)    Trivial Method (Eq. 1) ×4.9absent4.9\times 4.9× 4.9 97% 12.56 (-10.0%) 6.28 (-14.5%) 8.05 (-11.2%) 19.78 (-23.0%) Our Method (K2D)    Direct MERMER{\rm MER}roman_MER (Eq. 3) ×5.1absent5.1\times 5.1× 5.1 55% 11.96 (-14.3%) 6.24 (-15.1%) 7.54 (-16.9%) 19.40 (-24.5%)    Direct PERPER{\rm PER}roman_PER (Eq. 4) ×5.0absent5.0\times 5.0× 5.0 61% 11.54 (-17.3%) 6.17 (-16.0%) 7.33 (-19.2%) 18.82 (-26.7%)    Composite δcompsubscript𝛿comp\delta_{\rm comp}italic_δ start_POSTSUBSCRIPT roman_comp end_POSTSUBSCRIPT (Eq. 5) ×5.1absent5.1\times 5.1× 5.1 74% 11.44 (-18.1%) 6.09 (-17.1%) 7.62 (-16.0%) 17.86 (-30.5%)

4.2 Models and Training Details

In K2D, we have three kinds of models: the teacher model MteachersubscriptMteacher{\rm M}_{\rm teacher}roman_M start_POSTSUBSCRIPT roman_teacher end_POSTSUBSCRIPT, the small validation model MvalidatorsubscriptMvalidator{\rm M}_{\rm validator}roman_M start_POSTSUBSCRIPT roman_validator end_POSTSUBSCRIPT, and the student model MstudentsubscriptMstudent{\rm M}_{\rm student}roman_M start_POSTSUBSCRIPT roman_student end_POSTSUBSCRIPT. Practically, we use Whisper Large-v2 as MteachersubscriptMteacher{\rm M}_{\rm teacher}roman_M start_POSTSUBSCRIPT roman_teacher end_POSTSUBSCRIPT, and Whisper Base as MvalidatorsubscriptMvalidator{\rm M}_{\rm validator}roman_M start_POSTSUBSCRIPT roman_validator end_POSTSUBSCRIPT. The validator model is over 20 times smaller than the teacher model, facilitating the fast validation-oriented transcription generation on large-scale realistic data. On the other hand, the student model comprises 32 encoder layers and two decoder layers and is two times smaller than the teacher model. The parameter initialization follows Distil-Whisper, where the encoder is identical to the teacher model’s; and the 2-layer decoder is initialized from the first and the last layer of the teacher’s decoder. We perform knowledge distillation on 4 NVIDIA H100. We use batch size 256 and update for 120,000 steps, which takes about 42 hours of training. The encoder is frozen during training. We set the threshold α=0.4𝛼0.4\alpha=0.4italic_α = 0.4 in Eq. 2 for cross-model validation; while the weighted coefficients β𝛽\betaitalic_β and γ𝛾\gammaitalic_γ of KDsubscriptKD\mathcal{L}_{\rm KD}caligraphic_L start_POSTSUBSCRIPT roman_KD end_POSTSUBSCRIPT in Eq. 6 are set to 0.80.80.80.8 and 1.01.01.01.0. We use g2p [40] for English and pinyin for Mandarin phonemicization.

4.3 Evaluation Metrics

We use the mixed error rate (MER) to evaluate the quality of code-switching ASR. The metric gathers the two common evaluation metrics in the two languages: the character error rate (CER) in Mandarin and the word error rate (WER) in English. We perform the long-form evaluation if not specified to simulate the realistic scenario. Moreover, we use the real-time factor (RTF) under the same computing resource to measure the speed-up brought by the student model.

5 Results

Table 3: The detailed performance analysis of the two languages on the in- and out-of-domain code-switching datasets. Man. stands for the CER of the Mandarin part, while Eng. is the WER of the English part in the transcriptions.
Method COOLTEST ASCEND
Man. Eng. Man. Eng.
Whisper-Large-v2 13.23 20.64 24.68 47.84
KD–Full Data 11.93 22.41 20.87 36.31
KD–Trivial Method 11.74 21.77 17.61 33.88
K2D–Direct MERMER{\rm MER}roman_MER 11.31 17.93 18.56 33.12
K2D–Direct PERPER{\rm PER}roman_PER 10.77 18.57 16.62 33.35
K2D–Composite δcompsubscript𝛿comp\delta_{\rm comp}italic_δ start_POSTSUBSCRIPT roman_comp end_POSTSUBSCRIPT 10.61 18.89 15.14 35.05

5.1 Main Results

We present the MER of each method on all the in-domain and out-of-domain testing sets in Table 2. As the table indicates, K2D provides clear performance improvements on all the testing sets. Our method with the composite distance metric δcompsubscript𝛿comp\delta_{\rm comp}italic_δ start_POSTSUBSCRIPT roman_comp end_POSTSUBSCRIPT (Eq. 5) produces improvements for over 17% reduction rate on both of the in-domain testing sets. Furthermore, it demonstrates strong generalizability to different domains, especially to ASCEND, a publicly available code-switching realistic dataset. On the other hand, using the direct distance metrics, MERMER{\rm MER}roman_MER and PERPER{\rm PER}roman_PER, give performance improvements as well. Note that the data usage of the two direct metrics is around 60%percent6060\%60 %, which is much less than the baseline methods. The result further strengthens the effectiveness of our method. In K2D, we can perform simple and efficient data pre-filtering through cross-model validation, leveraging the knowledge from both the large and small models. Last but not least, our method achieves a five times faster generation speed than the teacher model.

5.2 Performance Analysis on Each Language

Next, we discuss the performance improvements our method brings to the two languages separately. We show the detailed evaluations of the CER for the Mandarin parts and the WER for the English parts on the code-switching datasets, COOLTEST and ASCEND, in Table 3. We find out that using the direct MER as the distance metric produces the most improvements over English on both testing sets. In comparison, the composite metric yields the best performance in Mandarin. Applying K2D can improve ASR performance for both languages across both the in-domain and out-of-domain testing sets.

Table 4: We present the detailed MER on the COOLTEST set and the repetitive hallucination counts detected using the n𝑛nitalic_n-gram method. Our results indicate that using the composite metric gives the fewest repetitive hallucinations.
Method Rep. Counts Detailed MER (%)
Del. Ins. Sub.
Whisper-Large-v2 110 4.53 4.07 5.35
KD–Full Data 101 3.09 4.57 5.32
KD–Trivial Method 47 2.75 4.53 5.29
K2D–Direct MERMER{\rm MER}roman_MER 40 3.20 3.69 5.07
K2D–Direct PERPER{\rm PER}roman_PER 45 2.80 3.74 5.00
K2D–Composite δcompsubscript𝛿comp\delta_{\rm comp}italic_δ start_POSTSUBSCRIPT roman_comp end_POSTSUBSCRIPT 22 2.55 3.83 5.07

5.3 Investigation on Repetitive Hallucinations

As previously mentioned, repetitive hallucinations can be effectively detected using the n𝑛nitalic_n-gram method. In Table 4, we present the evaluation results and the counts of hallucinations identified by this method. We observe a significant reduction in repetitive hallucinations when data filtering is applied. Even with direct methods, the number of repetitive hallucinations drops substantially compared to using the entire dataset. The composite metric yields the fewest hallucinations, likely due to its combination of the n𝑛nitalic_n-gram and distance metrics, which leverages the strengths of both approaches during filtering. The deletion rate is also lowest with the composite metric, probably due to the reduced number of repetitive hallucinations.

5.4 Validation Model Analysis

As we have mentioned in Section 4.2 We use Whisper-Base as the validator to provide additional knowledge during cross-model validation. In this analysis, we wish to communicate that utilizing a smaller model for validation may provide efficiency, effectiveness, and robustness when the data is large-scale and realistic. First, we show that our method is efficient by comparing the time cost between using different variants of Whisper models as the MvalidatorsubscriptMvalidator{\rm M}_{\rm validator}roman_M start_POSTSUBSCRIPT roman_validator end_POSTSUBSCRIPT. As shown in the table, Whisper-Base takes only 9 hours on 4 NVIDIA H100 GPUs, while Whisper-Medium takes over 30 times longer.

Next, we show that our proposed filtering method is effective in filtering out pseudo-labels with high error rates. We first define that a pseudo-label has a high error rate if its MER>0.4MER0.4\text{MER}>0.4MER > 0.4. We can then calculate the filter-out rate of the high-MER pseudo-labels for each cross-model validation method, which is composed of a MvalidatorsubscriptMvalidator{\rm M}_{\rm validator}roman_M start_POSTSUBSCRIPT roman_validator end_POSTSUBSCRIPT and a distance metric δ𝛿\deltaitalic_δ. The measurement of the filter-out rate is denoted as the recall in Table 5. The Max Recall indicates the maximum filter-out rate of a valid α𝛼\alphaitalic_α that yields over 50% of the data remaining in each cross-model validation method. Our results show that even with Whisper-Base, the max recall is similar to using the larger ones. This illustrates that we can achieve effective filtering on high-MER pseudo-labels under a reasonable filtering rate.

Finally, we highlight that using Whisper-Base as the validation model is robust over the selection of α𝛼\alphaitalic_α. By calculating the average recall (Avg. Recall) across different α[0.1,0.2,,0.9]𝛼0.10.20.9\alpha\in[0.1,0.2,\ldots,0.9]italic_α ∈ [ 0.1 , 0.2 , … , 0.9 ], we find out that using Whisper-Base yields the highest average recall. This shows that the cross-model validation with Whisper-Base is less sensitive to the selection of α𝛼\alphaitalic_α, enhancing the robustness of our method.

Table 5: The performance comparison of using the different Whisper models as the validation model. We approximate the time cost of Whisper-Small and -Medium by referencing the progress after one-hour generation. The recall can be considered as the filter-out rate of the high-MER pseudo-labels on COOLTEST set.

Validation Model Zeit Cost \downarrow Max Recall \uparrow Avg. Recall \uparrow δMERsubscript𝛿MER\delta_{\rm MER}italic_δ start_POSTSUBSCRIPT roman_MER end_POSTSUBSCRIPT δPERsubscript𝛿PER\delta_{\rm PER}italic_δ start_POSTSUBSCRIPT roman_PER end_POSTSUBSCRIPT δMERsubscript𝛿MER\delta_{\rm MER}italic_δ start_POSTSUBSCRIPT roman_MER end_POSTSUBSCRIPT δPERsubscript𝛿PER\delta_{\rm PER}italic_δ start_POSTSUBSCRIPT roman_PER end_POSTSUBSCRIPT Whisper-Medium 284 hr 0.97 0.97 0.63 0.62 Whisper-Small 37 hr 0.91 0.85 0.60 0.54 Whisper-Base (Ours) 9 hr 0.91 0.86 0.69 0.66

6 Conclusion

This work presents a novel framework, K2D, for conducting practical and effective knowledge distillation with realistic data for code-switching ASR. Given the realistic data is unlabeled, noisy, and large-scale, performing data filtering might be essential. We introduce a novel method for efficiently filtering data based on the knowledge of a small auxiliary model. The process is based on the cross-model validation between the teacher model and the small model’s transcriptions. Our results indicate the effectiveness of K2D by surpassing the original teacher model with over 17% and 30% performance improvements on the code-switching in-domain and out-of-domain testing sets, respectively. Furthermore, our method performs better than the baseline methods, which utilize the full dataset for training or conduct trivial n𝑛nitalic_n-gram-based data filtering. Last but not least, we conduct an analysis of our method, showing that the proposed filtering technique is efficient, effective, and robust. To our knowledge, we are the first ones to explore and enhance knowledge distillation in such a realistic scenario. We foresee that our method can facilitate more practical and effective knowledge distillation for ASR.

7 Acknowledgment

This project has been powered by the computing resources from NVIDIA Taipei-1. Furthermore, we would like to express our gratitude to Hsuan-He Chang, Yi Chen, Yi-Hua Chen, Haw-Yang Foo, Wen-Yen Huang, Heng Hsu, Bo-Cheng Ke, Pei-Jhen Lan, Fang-Yu Liu, Ka-Sin Thoe, Yan-Tong Lim, Pin-Yu Lu, and Yu-Chi Peng for their efforts in data annotation for our in-house testing set of this project.

References

  • [1] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, 2020.
  • [2] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
  • [3] C.-C. Chiu, J. Qin, Y. Zhang, J. Yu, and Y. Wu, “Self-supervised learning with random-projection quantizer for speech recognition,” in International Conference on Machine Learning, 2022.
  • [4] J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
  • [5] Q. Xu, T. Likhomanenko, J. Kahn, A. Hannun, G. Synnaeve, and R. Collobert, “Iterative Pseudo-Labeling for Speech Recognition,” in Proc. Interspeech 2020, 2020.
  • [6] T. Likhomanenko, Q. Xu, J. Kahn, G. Synnaeve, and R. Collobert, “slimIPL: Language-Model-Free Iterative Pseudo-Labeling,” in Proc. Interspeech 2021, 2021.
  • [7] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning, 2023.
  • [8] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, et al., “Google usm: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023.
  • [9] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.-A. Duquenne, H. Elsahar, H. Gong, K. Heffernan, J. Hoffman, et al., “Seamlessm4t-massively multilingual & multimodal machine translation,” arXiv preprint arXiv:2308.11596, 2023.
  • [10] S. Gandhi, P. von Platen, and A. M. Rush, “Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling,” arXiv preprint arXiv:2311.00430, 2023.
  • [11] K. Li, J. Li, G. Ye, R. Zhao, and Y. Gong, “Towards code-switching asr for end-to-end ctc models,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
  • [12] Z. Zeng, Y. Khassanov, V. T. Pham, H. Xu, E. S. Chng, and H. Li, “On the End-to-End Solution to Mandarin-English Code-Switching Speech Recognition,” in Proc. Interspeech 2019, 2019.
  • [13] S. Zhang, J. Yi, Z. Tian, J. Tao, and Y. Bai, “Rnn-transducer with language bias for end-to-end mandarin-english code-switching speech recognition,” in 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2021.
  • [14] L.-H. Tseng, Y.-K. Fu, H.-J. Chang, and H.-y. Lee, “Mandarin-english code-switching speech recognition with self-supervised speech representation models,” arXiv preprint arXiv:2110.03504, 2021.
  • [15] H. Liu, L. P. Garcia, X. Zhang, A. W. Khong, and S. Khudanpur, “Enhancing code-switching speech recognition with interactive language biases,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
  • [16] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, “Investigating end-to-end speech recognition for mandarin-english code-switching,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
  • [17] Y. Lu, M. Huang, H. Li, J. Guo, and Y. Qian, “Bi-Encoder Transformer Network for Mandarin-English Code-Switching Speech Recognition Using Mixture of Experts,” in Proc. Interspeech 2020, 2020.
  • [18] X. Zhou, E. Yılmaz, Y. Long, Y. Li, and H. Li, “Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition,” in Proc. Interspeech 2020, 2020.
  • [19] S. Dalmia, Y. Liu, S. Ronanki, and K. Kirchhoff, “Transformer-transducers for code-switched speech recognition,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
  • [20] S.-P. Chuang, H.-J. Chang, S.-F. Huang, and H.-y. Lee, “Non-autoregressive mandarin-english code-switching speech recognition,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
  • [21] C.-T. Chang, S.-P. Chuang, and H.-Y. Lee, “Code-Switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation,” in Proc. Interspeech 2019, 2019.
  • [22] Y. Long, Y. Li, Q. Zhang, S. Wei, H. Ye, and J. Yang, “Acoustic data augmentation for mandarin-english code-switching speech recognition,” Applied Acoustics, 2020.
  • [23] H. Yu, Y. Hu, Y. Qian, M. Jin, L. Liu, S. Liu, Y. Shi, Y. Qian, E. Lin, and M. Zeng, “Code-switching text generation and injection in mandarin-english asr,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  • [24] K.-P. Huang, C.-K. Yang, Y.-K. Fu, E. Dunbar, and H.-Y. Lee, “Zero resource code-switched speech benchmark using speech utterance pairs for multiple spoken languages,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
  • [25] C.-K. Yang, K.-P. Huang, K.-H. Lu, C.-Y. Kuan, C.-Y. Hsiao, and H.-y. Lee, “Investigating zero-shot generalizability on mandarin-english code-switched asr and speech-to-text translation of recent foundation models with self-supervision and weak supervision,” arXiv preprint arXiv:2401.00273, 2023.
  • [26] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [27] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
  • [28] H.-J. Chang, S.-w. Yang, and H.-y. Lee, “Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit bert,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
  • [29] Z. Fang, J. Wang, X. Hu, L. Wang, Y. Yang, and Z. Liu, “Compressing visual-linguistic model via knowledge distillation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  • [30] T. Wang, W. Zhou, Y. Zeng, and X. Zhang, “EfficientVLM: Fast and accurate vision-language models via knowledge distillation and modal-adaptive pruning,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023.
  • [31] H. Scudder, “Probability of error of some adaptive pattern-recognition machines,” IEEE Transactions on Information Theory, 1965.
  • [32] D. Charlet, “Confidence-measure-driven unsupervised incremental adaptation for hmm-based speech recognition,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001.
  • [33] F. Wessel and H. Ney, “Unsupervised training of acoustic models for large vocabulary continuous speech recognition,” IEEE Transactions on Speech and Audio Processing, 2005.
  • [34] K. Veselý, M. Hannemann, and L. Burget, “Semi-supervised training of deep neural networks,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013.
  • [35] K. Veselý, L. Burget, and J. Černocký, “Semi-Supervised DNN Training with Word Selection for ASR,” in Proc. Interspeech 2017, 2017.
  • [36] Y. Higuchi, N. Moritz, J. Le Roux, and T. Hori, “Momentum pseudo-labeling: Semi-supervised asr with continuously improving pseudo-labels,” IEEE Journal of Selected Topics in Signal Processing, 2022.
  • [37] D. Berrebbi, R. Collobert, S. Bengio, N. Jaitly, and T. Likhomanenko, “Continuous pseudo-labeling from the start,” ICLR, 2023.
  • [38] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020.
  • [39] H. Lovenia, S. Cahyawijaya, G. Winata, P. Xu, Y. Xu, Z. Liu, R. Frieske, T. Yu, W. Dai, E. J. Barezi, Q. Chen, X. Ma, B. Shi, and P. Fung, “ASCEND: A spontaneous Chinese-English dataset for code-switching in multi-turn conversation,” in Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022.
  • [40] K. Park and J. Kim, “g2pe,” https://github.com/Kyubyong/g2p, 2019.