Romanization Encoding for Multilingual ASR

Abstract

We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method’s strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.

**footnotetext: Equal contribution

Index Terms—  Romanization, Text Encoding, RNN Transducer, Multilingual ASR, Code-switching Speech Recognition

1 Introduction

Multilingual Automatic Speech Recognition (ASR) systems are designed to recognize and transcribe speech in multiple languages. Code-switching (CS) is a special case of this, dealing with speech that switches between two or more languages within a single utterance or conversation. While emerging cutting-edge web-scale large speech models such as [1, 2, 3] demonstrate magnificent performance on multilingual ASR, they still fall short in CS scenarios [4], often due to a lack of natural CS data for training. This scarcity hinders the ability of both general large speech models and specialized CS ASR systems to effectively learn and integrate acoustic and linguistic information [5].

Part of the challenge of multilingual and CS ASR arises from text representations of languages from different language families. Languages like those in the Indo-European family usually use a Latin-based alphabet with relatively smaller character sets. These can be efficiently represented using methods like byte-pair encoding (BPE), which breaks down words into smaller pieces or sub-words. Research has shown that using sub-words can lead to better performance in language processing tasks [6, 7]. However, languages such as Mandarin, Korean, and Japanese have a much larger set of unique characters, making sub-word representation less practical. While there are methods to break these characters into smaller units (like love in Mandarin’s {CJK*}UTF8gbsn爱情 \rightarrow ài qíng with Pinyin and Korean’s {CJK*}UTF8mj사랑 \rightarrow ㅅㅏ ㄹㅏㅇ with Jamo) and group these characters into sub-units (i.e. ài qàng \rightarrow àiqàng with segmentation), using these phonetic and semantic representations may not always yield the best results [8, 9, 10]. Despite the effectiveness of character-based approaches for individual languages, their integration into multilingual models is challenging. For instance, [11] documents the use of 8k characters for Mandarin, 4k for Japanese, and 2k for Korean, alongside a standardized set of 512 sub-words per language for other languages. This approach yields 11k unique tokens for these three languages alone, leading to a significantly large and potentially imbalanced vocabulary that inflates the model’s output dimension.

Effectively encoding languages with unique scripts is crucial for multilingual and CS ASR models. Many non-Latin languages can be transcribed into the Latin alphabet through romanization. For instance, pinyin, the primary romanization system for Standard Chinese, facilitates a mapping where a single character can be represented by different pinyins with tones representing the pronunciation. Typically, around 1,000 distinct pinyins with tones can represent about 5,000 Chinese characters. While romanization doesn’t provide a strict one-to-one match, it effectively reduces the vocabulary size and allows the encoder to focus on learning acoustic modeling.

We propose separating acoustic and language modeling in multilingual and CS ASR models, using romanization to reduce vocabulary size and speed up training and inference, aiming to improve model performance and adaptability. This approach enhances system flexibility and allows for the use of advanced decoders like Large Language Models (LLMs) for efficient conversion. With this approach, we can utilize synthetic text data for easy fine-tuning to address the shortage of CS audio data.

In this paper, we make the following contributions:

  • Romanization is investigated to be served as encoding method in multilingual and CS ASR tasks. We apply our encoding method with a balanced concatenated tokenizer to FastConformer-RNNT with a Roman2Char decoder without introducing additional modules such as Language Modeling (LM).

  • Experiments on Mandarin-English CS data show that our model significantly reduces vocabulary size and the dimensions of the output layer, supports larger training batches, lowers memory consumption, and achieves promising outcomes. We release the checkpoints and implementations in NeMo1.

  • Our ablation studies on Mandarin-Korean and Mandarin-Japaneses multilingual data demonstrate the effectiveness and generalizability of the proposed method.

11footnotetext: https://github.com/wd929/NeMo/tree/code-switch

2 Related work

Beyond the method described in [11], OpenAI’s Whisper [1] system uses Byte-level Byte Pair Encoding (BBPE) [12] for text tokenization, proving effective across various applications. However, it faces challenges with languages that have unique scripts or significantly differ from the Indo-European family, like Hebrew, Chinese, and Korean, primarily due to BBPE’s limitations in handling distinct scripts or linguistic structures. Research noted in [13] indicates that BBPE can lead to higher deletion rates in bilingual End-to-End (E2E) ASR systems due to invalid byte sequences, and these BBPE-based bilingual systems underperform compared to their monolingual counterparts. Google USM [2]’s approach with word-piece models (WPMs) also struggles with script diversity, resulting in large output layers and difficulties in scaling. Conversely, for complex-script languages like Chinese, substituting characters with Pinyin for text encoding in Natural Language Processing (NLP) and ASR tasks typically offers greater efficiency and robustness compared to processing each character individually as demonstrated in [14, 15, 16, 17].

Romanization Encoding has been studied in both NLP and speech processing fields. The uroman tool, introduced by [18], converts texts to Latin-scripts, aiming for phonetic representation. This work has been applied in multilingual pretrained language models [19] to enhance the low-resourced languages. Uroman is also utilized for pretraining the speech processing system in [3] as additional forced alignment to tokenize texts. Uroman’s unidirectional nature poses a challenge for ASR tasks that require original script output and an additional deromanization step.

Various languages have multiple romanization methods. While uroman is universal, our focus is on the most popular Romanization systems for each language studied: Pinyin for Chinese, Revised Romanization for Korean, and Hepburn Romanization for Japanese. This approach aims to preserve maximum phonetic and linguistic information and avoid unnecessary transformations, such as uroman converting digital numbers in various scripts to Western Arabic numerals. In addition, as phonological distinctions might be lost during the romanization process making deromanization more difficult [20]. Thus it is more feasible to unify the romanization and deromanization procedure in an end-to-end fashion.

Researchers have explored specialized model architectures [21, 22] for code-switching tasks to better capture language-specific information. This includes integrating Language Identification (LID) [23, 24, 25] enhancement strategies and leveraging pre-trained models alongside LM based Beam Search [17] during inference to boost performance. Despite these advancements, the evaluation of these models on monolingual test sets often goes unexamined, and the volume of training data available is typically constrained. This situation is largely attributed to the scarcity of CS data. To enhance the data and domain scope for CS ASR training, methods such as transfer learning from monolingual ASR to initialize encoders with both monolingual and code-switched datasets have been implemented in [23]. Additionally, efforts including synthetic text data generation have been explored to further augment the training resources [26, 27, 28]. To our knowledge, only one publicly accessible checkpoint2 for Mandarin-English CS exists and it was solely trained on CS dataset SEAME [29]. 22footnotetext: https://huggingface.co/espnet/vectominist_seame_asr_conformer_bpe5626

Refer to caption
Fig. 1: The proposed approach builds upon the baseline Fast-Conformer RNNT model, which incorporates a Concatenated Tokenizer and is outlined within a dashed rectangle. Instead of using direct Char input/output for Mandarin and BPE for English, our approach applies romanization encoding, feeding Pinyin (for Mandarin) and BPE (for English) into the Fast-Conformer RNNT. The Roman to Char Decoder then maps these inputs back to Char and BPE, respectively. The model is trained end-to-end (E2E), combining text-to-text loss with RNNT loss for optimization.

3 Method

3.1 Model structure

Our model structure is illustrated in Figure 1, where we take a ZH-EN CS as an example. In this work, we employ the Fast Conformer model [30] combined with Recurrent Neural Network Transducer (RNNT) technology [31] as our baseline. The Fast Conformer is an optimized version of the original Conformer model. It features a new downsampling schema that significantly reduces computational requirements by approximately 2.9 times and enhances inference speed, while maintaining or even improving performance across various Speech and NLP tasks. RNNT excels in capturing sequence knowledge and is popular in both monolingual [32] and CS ASR [25] tasks. However, despite RNNT’s strengths, it tends to struggle with script-heavy languages that have large vocabularies, as it can severely limit batch sizes and slow down training [33].

The baseline, highlighted by a dashed rectangle inputs ZH characters and EN words. In contrast, our method replaces characters with Romanized text (Pinyin) before processing through a concatenated tokenizer for text representations (detailed in Section 3.2). Meanwhile the audio representation is learned by FastConformer encoders. We introduce another decoder that transcribes the romanized text to characters (described in Section 3.3).

3.2 Roman and BPE concatenated tokenizer

Following the concatenated tokenizer approach introduced by [34], we leverage pre-trained monolingual tokenizers to construct a combined tokenizer. This setup ensures separate label spaces for each language, enabling good generalization capabilities. For instance, English tokens are assigned indices ranging from [0,1023], while Mandarin tokens are allocated indices starting from 1024 up to 1024 + vocab size. The aggregated approach, while achieving similar performance levels to the non-aggregated method, offers the added advantage of facilitating LID. It does so by providing pseudo language information during training, thereby enabling the model to learn LID representations internally.

We prepare 1,024 English BPE sub-word units using the LibriSpeech (LS) dataset [35] and roughly 5,000 Chinese characters from the AISHELL-2 (AS2) dataset [36]. These Chinese characters were then romanized into Latin characters referred to here as ‘Roman’ encoding for all languages using the PyPinyin3 toolkit. As seen in Table 1, this romanization process reduced the Chinese vocabulary size from 5,178 to 1,239, enabling us to create a balanced concatenated tokenizer for Mandarin and English. For Korean and Japanese, similar processing methods were employed using the kroman4 and pykakasi5 toolkit, respectively, to achieve comparable reductions in vocabulary size and to facilitate a unified approach to tokenizer construction across multiple languages.

33footnotetext: https://github.com/mozillazg/python-pinyin44footnotetext: https://github.com/victorteokw/kroman55footnotetext: https://github.com/miurahr/pykakasi
{CJK*}

UTF8mj

Table 1: Examples of romanization for Mandarin, Korean, Japanese and Mandarin-English. For Latin-based languages such as English, we bypass romanization and directly employ Byte Pair Encoding (BPE), setting the vocabulary count at 1,024.
language Char vocab Roman vocab
(a) Mandarin 差 不 多 5178 cha4 bu4 duo1 1239
(b) Korean 안 녕 하 세 요 1202 an nyeong ha se yo 1202
(c) Japanese か な 漢 字 3329 ka na kan ji 1059
(d) Mandarin-English 差 不 多 ten minutes 6202 cha4 bu4 duo1 ten minutes 2263

3.3 Roman to Character Decoder

To translate the romanized units Roman back to original characters, we introduce an additional module in our system called the Roman to Character (R2C) decoder. This Transformer-based model is designed to learn the multi-to-multi mappings between romanized text and original characters. For English, a language that already uses the Latin alphabet, the inputs and outputs remain unchanged as shown in (d) of in Table 1.

Importantly, despite the E2E training approach, the R2C decoder functions independently of the RNNT encoder outputs, focusing solely on learning the sequence-to-sequence mapping. Only its loss is merged with the RNNT loss, allowing for the possibility of separate training with text data or integration with more advanced pre-trained translation models, such as LLMs. This flexibility also enables the module’s extension to other languages with complex scripts, like Korean and Japanese. During training, accurately labeled Roman sequences serve as inputs for the module. To streamline the process, the decoding stage exclusively uses the greedy search hypothesis from the RNNT decoder as input, simplifying the overall pipeline.

4 Experiments

4.1 Data

Code-switching data SEAME is a publicly available dataset designed for Mandarin-English speech recognition, containing Mandarin, English, and natural intra-sentential code-switching data from interviews and conversations. The dataset specifics, including the duration of each data type, are outlined in Table 2. “ZH” refers to the monolingual Mandarin segments, and “EN” to the monolingual English parts. The dataset includes approximately 60 hours of natural CS data, a quantity considered limited for training robust models. Notably, most of SEAME speakers are from Singapore and Malaysia, presenting accents different from Mainland China. For the purposes of model selection, 10% of the training set samples are randomly chosen to form a validation set.

Monolingual data AISHELL-2 is an extensive 1000-hour open-source Mandarin speech corpus, the speakers of which mainly are from Mainland China. LibriSpeech comprises 960 hours of English speech from native speakers. These two monolingual datasets are used in our experiments to enhance the performance of code-switching ASR systems.

Evaluation data For evaluation, we stick to the data division of SEAME established by [37], which includes a test set for Mandarin speech named test_man and another tailored to Southeast Asian accented English, labeled test_sge. The specific durations of these test sets are also detailed in Table 2. Additionally, to assess performance on monolingual data, test sets from AISHELL-2 (as2_test) and LibriSpeech (ls_clean) are utilized in our analysis.

Table 2: Duration composition of Mandarin (ZH), English (EN), and code-switching (CS) utterances in SEAME corpus. The duration of Mandarin dev sets (as2_test) and English (ls_clean) are also included.
train val test_man test_sge as2_test ls_clean
duration(h) 85.4 9.8 7.5 3.9 4.0 5.4
ZH (%) 16.6 16.3 13.3 5.1 100 0
EN (%) 15.8 16.3 6.6 41.0 0 100
CS (%) 67.4 67.3 80.0 53.8 0 0

4.2 Experiment Setup

To evaluate monolingual test sets, Character Error Rate (CER) is applied to Mandarin, while Word Error Rate (WER) is used for English. For CS test sets, we employ Mixed Error Rate (MER), which incorporates word-level measurements for English and character-level assessments for Mandarin.

In all of our experiments, we use the Adam [38] optimizer combined with a Cosine Annealing learning rate scheduler, including a warm-up phase of 10,000 steps. The learning rate is set to peak at 1.5e-3 and then decrease to a minimum of 1e-6. In addition, we incorporate SpecAug [39] during training process to enhance model robustness and performance. Model averaging is employed, and during evaluation, greedy search is utilized without the assistance of any external LM or re-scoring techniques. The detailed training recipe will be open-sourced in NeMo.

4.3 Results

Performance of training solely on SEAME dataset is detailed in Table 3, showing that romanization encoding yields improved results for both the test_man and test_sge sets. In Table 4, we integrate monolingual datasets utilizing their rich acoustic and linguistic content to boost multilingual and CS ASR. The proposed Roman-based method outperforms the Char-based model in both of the test sets of SEAME, with 10.77% and 9.43% MER reductions respectively. Further analysis by dividing the CS test sets into Mandarin and English segments underscores the advantages of romanization encoding for both languages.

Moreover, to thoroughly assess the model’s proficiency in handling both CS and monolingual scenarios, results for monolingual test sets are presented, indicating an improvement in performance on the monolingual Mandarin test set, albeit with a slight decline on the monolingual English test set. To balance monolingual and code-switching (CS) data within a fixed 2085-hour training data, we upscale CS data to 285 hours and reduce both AISHELL-2 (AS2) and LibriSpeech (LS) data to 900 hours each. This adjustment result in significant MER reductions of 13.72% and 15.03% in CS test sets, demonstrating improvements over baseline models. Performance variations in monolingual sets were observed, largely due to the differing accents and speaking styles of speakers such as speakers from Singapore and Malaysia versus those from Mainland China [29]. Nevertheless, the model effectively retains its capability to process monolingual information, often delivering equal or superior performance.

Table 3: MERs (%) on the SEAME dataset reveal that the proposed romanization approach surpasses the character-based baseline by reducing vocabulary size, leading to a more balanced tokenizer and improved performance.
Encoding vocab test_man test_sge
Char+BPE 6202 22.26 32.30
Roman+BPE 2263 21.99 31.45
Table 4: Adding monolingual AISHELL2 (AS2) and LibriSpeech (LS) data during training. Results are evaluated with CS testsets including test_man and test_sge, and monolingual testsets including the testset of AS (as2_test) and test_clean of LS (ls_clean). To balance the different acoustic and language information, we attempt upsampling CS data but keep the total number of training data fixed.
Encoding Dataset (hours) SEAME AS2 LS
ZH EN SEAME AS2 LS test_man test_sge as2_test ls_clean
MER CER WER MER CER WER CER WER
Char BPE 85 1000 1000 17.64 16.87 28.08 25.35 26.33 29.44 7.74 2.60
Roman BPE 85 1000 1000 15.74 15.10 25.28 22.96 21.97 27.41 7.05 3.26
Roman BPE 285 900 900 15.22 14.79 24.31 21.54 20.74 25.70 7.48 2.75

4.4 Ablation Study

The proposed romanization encoding approach is designed to be easily adaptable to various languages. This section demonstrates the effectiveness of our method with Korean and Japanese, which face challenges such as limited publicly available corpora, insufficient for training large ASR models, and lack of CS data. Through training a multilingual model on constraint datasets (less than 50 hours), we demonstrate our approach’s capability and efficiency in data-limited situations and its potential extension to other languages.

Mandarin-Korean 66footnotetext: https://github.com/goodatlas/zeroth A Mandarin-Korean bilingual ASR model was trained using the entire 50-hour Zeroth Korean6 dataset and 50 hours data randomly selected from the AISHELL-1 dataset [40]. Evaluations on monolingual Mandarin (test_as1) and Korean (test_zeroth) test sets utilized CERs for performance measurement. Results in Table 5 indicate that the Roman-based bilingual ASR model maintains performance on the Korean test set while achieving better results on the AISHELL-1 test set compared to a character-based model.

Table 5: CER (%) for a Mandarin-Korean Bilingual system trained on 50h+50h of data shows the proposed method reduces Mandarin vocabulary size, boosts its performance, and maintains Korean results.
Encoding vocab test_as1 test_zeroth
Char 6380 12.87 1.40
Roman 2441 12.60 1.40

Mandarin-Japanese We also experiment on Mandarin-Japanese Bilingual ASR, drawing training data randomly from 50 hours of the AISHELL-1 dataset and 50 hours from the Japanese ReazonSpeech [41] dataset. By using romanization encoding, the concatenated vocabulary size is reduced from 8,507 to 2,298. As we can see in Table  6, when evaluated on AISHELL1 (test_as1) and ReazonSpeech (test_reazon) test sets, Roman-based model can perform better than Character encoding one in a large margin, which further indicates effeteness and scalability of our proposed Romanization encoding for other script-heavy languages.

Evaluations of R2C module Our proposed method includes additional R2C module to transcribe Roman to characters. The total training parameters are at par with the baseline system since the size of RNNT outputs is decreased. Although the end-to-end inference speed for the proposed system can not be faster than the character encoding models but the training batch sizes can be set larger, which is essential for the Multilingual ASR model training. For instance, the reduction in the RNNT concatenated vocabulary size in the Mandarin-Korean Bilingual ASR model from 6,380 to 2,441, primarily due to Mandarin (as Korean mapping is nearly one-to-one), allowed for at least a 2X larger training batch size and more than 20% quicker RNNT inference compared to models using character encoding. We believe that this work could benefit not only the RNNT-based model but also the popular auto-regressive speech large foundational models.

Table 6: CERs (%) for Mandarin-Japanese Bilingual ASR models indicates Roman encoding reduces vocabulary size by 73% and significantly enhances performance for both Mandarin and Japanese.
Encoding vocab test_as1 test_reazon
Char 8507 19.75 36.00
Roman 2298 11.30 29.31

5 Conclusion

In this study, we introduce romanization encoding as a strategy to enhance multilingual ASR systems, particularly for languages with complex scripts. Our experiments with Mandarin-English CS ASR illustrate that employing a balanced tokenizer by romanized characters can lead to significant performance gains, with improvements of 13.71% and 15.03% on SEAME CS test sets. Additionally, we have extended the application of Roman-based tokenizers to Mandarin-Korean and Mandarin-Japanese multilingual ASR systems, yielding promising results in terms of both faster training speeds and improved performance. Looking ahead, we plan to refine our approach by applying BPE or similar encoding methods to romanized text, aiming for a more compact and efficient vocabulary. Enhancing the R2C decoder with advanced models like LLMs could significantly boost overall accuracy. The system’s flexibility enables the use of synthetic text data to improve R2C decoder meanwhile leveraging the pre-trained audio encoder to mitigate the limited availability of code-switching audio data.

References

  • [1] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” in ICML, 2023, pp. 28492–28518.
  • [2] Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al., “Google usm: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023.
  • [3] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al., “Scaling speech technology to 1,000+ languages,” arXiv preprint arXiv:2305.13516, 2023.
  • [4] Puyuan Peng, Brian Yan, Shinji Watanabe, and David Harwath, “Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization,” in Proc. Interspeech, 2023, pp. 396–400.
  • [5] Xianghu Yue, Grandee Lee, Emre Yılmaz, Fang Deng, and Haizhou Li, “End-to-end code-switching asr for low-resourced language pairs,” in ASRU, 2019, pp. 972–979.
  • [6] Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, and Patrick Nguyen, “On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition,” in Proc. Interspeech, 2019, pp. 3800–3804.
  • [7] Taku Kudo, “Subword regularization: Improving neural network translation models with multiple subword candidates,” in ACL, 2018, pp. 66–75.
  • [8] Jisung Wang, Jihwan Kim, Sangki Kim, and Yeha Lee, “Exploring Lexicon-Free Modeling Units for End-to-End Korean and Korean-English Code-Switching Speech Recognition,” in Proc. Interspeech, 2020, pp. 1072–1075.
  • [9] Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu, “A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on mandarin chinese,” in International Conference on Neural Information Processing, 2018, pp. 210–220.
  • [10] Xiaoya Li, Yuxian Meng, Xiaofei Sun, Qinghong Han, Arianna Yuan, and Jiwei Li, “Is word segmentation necessary for deep learning of chinese representations?,” in ACL, 2019, pp. 3242–3252.
  • [11] Andros Tjandra, Nayan Singhal, David Zhang, Ozlem Kalinli, Abdelrahman Mohamed, Duc Le, and Michael L Seltzer, “Massively multilingual asr on 70 languages: Tokenization, architecture, and generalization capabilities,” in ICASSP, 2023, pp. 1–5.
  • [12] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al., “Language models are unsupervised multitask learners,” .
  • [13] Liuhui Deng, Roger Hsiao, and Arnab Ghoshal, “Bilingual end-to-end asr with byte-level subwords,” in ICASSP, 2022, pp. 6417–6421.
  • [14] Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, and Maosong Sun, “Sub-Character Tokenization for Chinese Pretrained Language Models,” Transactions of the Association for Computational Linguistics, pp. 469–487, 2023.
  • [15] Zhao Yang, Dianwen Ng, Xiao Fu, Liping Han, Wei Xi, Rui Wang, Rui Jiang, and Jizhong Zhao, “On the effectiveness of pinyin-character dual-decoding for end-to-end mandarin chinese asr,” arXiv preprint arXiv:2201.10792, 2022.
  • [16] Jiahong Yuan, Xingyu Cai, Dongji Gao, Renjie Zheng, Liang Huang, and Kenneth Church, “Decoupling recognition and transcription in mandarin asr,” in ASRU, 2021, pp. 1019–1025.
  • [17] Shun-Po Chuang, Heng-Jui Chang, Sung-Feng Huang, and Hung-yi Lee, “Non-autoregressive mandarin-english code-switching speech recognition,” in ASRU, 2021, pp. 465–472.
  • [18] Ulf Hermjakob, Jonathan May, and Kevin Knight, “Out-of-the-box universal Romanization tool uroman,” in Proceedings of ACL 2018, System Demonstrations, Fei Liu and Thamar Solorio, Eds., Melbourne, Australia, July 2018, pp. 13–18, Association for Computational Linguistics.
  • [19] Sukannya Purkayastha, Sebastian Ruder, Jonas Pfeiffer, Iryna Gurevych, and Ivan Vulić, “Romanization-based large-scale adaptation of multilingual language models,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
  • [20] Rashed Rubby Riyadh and Grzegorz Kondrak, “Joint approach to deromanization of code-mixed texts,” in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan, June 2019, pp. 26–34, Association for Computational Linguistics.
  • [21] Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li, and Haizhou Li, “Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition,” in Proc. Interspeech, 2020, pp. 1042–1046.
  • [22] Yizhou Lu, Mingkun Huang, Hao Li, Jiaqi Guo, and Yanmin Qian, “Bi-encoder transformer network for mandarin-english code-switching speech recognition using mixture of experts.,” in Proc. Interspeech, 2020, pp. 4766–4770.
  • [23] Changhao Shan, Chao Weng, Guangsen Wang, Dan Su, Min Luo, Dong Yu, and Lei Xie, “Investigating end-to-end speech recognition for mandarin-english code-switching,” in ICASSP, 2019, pp. 6056–6060.
  • [24] Hexin Liu, Haihua Xu, Leibny Paola Garcia, Andy WH Khong, Yi He, and Sanjeev Khudanpur, “Reducing language confusion for code-switching speech recognition with token-level language diarization,” in ICASSP, 2023, pp. 1–5.
  • [25] Siddharth Dalmia, Yuzong Liu, Srikanth Ronanki, and Katrin Kirchhoff, “Transformer-transducers for code-switched speech recognition,” in ICASSP, 2021, pp. 5859–5863.
  • [26] Shun-Po Chuang, Tzu-Wei Sung, and Hung-yi Lee, “Training code-switching language model with monolingual data,” in ICASSP, 2020, pp. 7949–7953.
  • [27] Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung, “Code-switched language models using neural based synthetic data from parallel sentences,” in Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 2019, pp. 271–280.
  • [28] Chia-Yu Li and Ngoc Thang Vu, “Improving Code-Switching Language Modeling with Artificially Generated Texts Using Cycle-Consistent Adversarial Networks,” in Proc. Interspeech, 2020, pp. 1057–1061.
  • [29] Dau-Cheng Lyu, Tien-Ping Tan, Eng Siong Chng, and Haizhou Li, “SEAME: a Mandarin-English code-switching speech corpus in south-east asia,” in Proc. Interspeech, 2010, pp. 1986–1989.
  • [30] Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, and Boris Ginsburg, “Fast conformer with linearly scalable attention for efficient speech recognition,” in ASRU, 2023, pp. 1–8.
  • [31] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
  • [32] Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, and Boris Ginsburg, “Efficient sequence transduction by jointly predicting tokens and durations,” in ICML, 2023, pp. 38462–38484.
  • [33] Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, and Daniel Povey, “Pruned RNN-T for fast, memory-efficient ASR training,” in Proc. Interspeech, 2022, pp. 2068–2072.
  • [34] Kunal Dhawan, KDimating Rekesh, and Boris Ginsburg, “Unified model for code-switching speech recognition and language identification based on concatenated tokenizer,” in Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-witching, 2023, pp. 74–82.
  • [35] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in ICASSP, 2015, pp. 5206–5210.
  • [36] Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu, “Aishell-2: Transforming mandarin asr research into industrial scale,” arXiv preprint arXiv:1808.10583, 2018.
  • [37] Zhiping Zeng, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Eng Siong Chng, and Haizhou Li, “On the End-to-End Solution to Mandarin-English Code-Switching Speech Recognition,” in Proc. Interspeech, 2019, pp. 2165–2169.
  • [38] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [39] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617.
  • [40] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in COCOSDA, 2017, pp. 1–5.
  • [41] Yue Yin1 Daijiro Mori1 Seiji Fujimoto, “Reazonspeech: A free and massive corpus for japanese asr,” .