Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training

Abstract

Mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data. Three approaches to low-resourced ASR are phoneme-based supervised pre-training, subword-based supervised pre-training, and self-supervised pre-training over multilingual data. The Iu Mien language is the main ethnic language of the Yao ethnic group in China and is low-resourced in the sense that annotated speech is very limited. With less than 10 hours of transcribed Iu Mien speech, this paper investigates and compares the three approaches for Iu Mien speech recognition. Our experiments are based on three recently released backbone models pre-trained over the 10 languages from the CommonVoice dataset (CV-Lang10), which correspond to the three approaches for low-resourced ASR. It is found that phoneme supervision achieves better results than subword supervision and self-supervision, thereby providing higher data-efficiency. In particular, the Whistle models, i.e., those obtained by weakly-supervised phoneme-based multilingual pre-training, obtain the most competitive results.

Index Terms: speech recognition, Iu Mien language, low-resourced

1 Introduction

The Yao ethnic group is one of the ethnic minorities in China, with a population of over 3 million people, mainly distributed in Guangxi, Guangdong, Hunan, Guizhou, Yunnan and other regions. It is the most widely distributed ethnic minority in southern China, and also exists in Southeast Asian countries such as Vietnam, Thailand, and Cambodia. Yao is the language of the Yao ethnic group, but it lacks a completely unified standardization and consists of many dialects. Among these, Iu Mien is a major dialect and serves as the basis for the current official standard Yao language. Iu Mien belongs to the Yao branch of the Iu Mien-Yao sub-family within the Sino-Tibetan language family. Its written form is a phonetic syllabic system using basic Latin letters.

With the development of deep learning based artificial intelligence, the capabilities of automatic speech recognition (ASR) technology have also developed rapidly. However, current speech recognition technology often requires hundreds to thousands of hours of training data to obtain good performance. Due to the long-term scattered residence of the Yao population and the lack of an authoritative central dialect, the speech and language resources of the Iu Mien language are very scarce. In our practice, it takes non-trivial effort to collect and transcribe even less than 10 hours of Iu Mien speech. Developing Iu Mien speech recognition systems is therefore very challenging, yet it is important both for reducing the digital divide and for preserving cultural heritage.

The paradigm of pre-training (PT) followed by fine-tuning (FT), called the PTFT paradigm, has emerged in recent years as an effective way to solve the problem of limited training data for low-resource languages for ASR. In pre-training, training data for a number of languages are merged to train a multilingual model. The pre-trained model can then serve as a backbone, which can be further fine-tuned for crosslingual speech recognition. This PTFT paradigm helps the backbone to learn common knowledge for speech recognition from multiple, different languages during pre-training. So when fine-tuning the backbone model on the target language, decent ASR performance could be obtained with only a small amount of target language training data.

Currently, there are three main multilingual pre-training methods: self-supervised pre-training, subword-based supervised pre-training, and phoneme-based supervised pre-training. Self-supervised pre-training uses unlabeled audio data to train models to learn common representations of multilingual audio. Subword-based pre-training combines multilingual annotated audio data by creating a shared token set for multiple languages. Phoneme-based supervised pre-training typically uses the International Phonetic Alphabet (IPA) as a common pronunciation annotation system, enabling effective multilingual supervised pre-training. The IPA is designed to provide a unified system of symbols to represent the basic sounds of different languages. Each IPA symbol corresponds to a specific phoneme, ensuring a one-to-one relationship between the symbol and the sound. With the IPA, it is possible to achieve consistent phonetic transcription across all languages [1]. Comparing the three pre-training methods, the phoneme-based approach not only allows for efficient model training but also maximizes the sharing of pronunciation features across different languages. However, obtaining high-quality IPA pronunciation annotations for many low-resourced languages is challenging.

Our work is based on a recent study on weakly supervised phoneme pre-training, Whistle [2]. In [2], the effects of different pre-training methods on multilingual and crosslingual speech recognition are analyzed and compared, and a weakly phonetic supervision method (Whistle) is proposed. It is shown in [2] that phoneme-based supervised pre-training models are more data-efficient, i.e., they perform better in crosslingual speech recognition with limited training data. Additionally, the related pre-trained models are publicly released, and we utilize them to construct an Iu Mien speech recognition model in this work.

To the best of our knowledge, there have been no studies on using the pre-training and fine-tuning paradigm to train Iu Mien speech recognition models. The aim of this paper is to explore the potential of Whistle modeling on Iu Mien, a Sino-Tibetan low-resource language. Specifically, this paper uses the Whistle model as a pre-trained backbone and fine-tunes it with a small amount of Iu Mien training data to construct an Iu Mien speech recognition model. We compare the fine-tuning results on Iu Mien with those obtained using backbone models from other pre-training methods. Our main contributions are as follows.

• This paper introduces the basic characteristics of the Iu Mien language, including its writing and pronunciation features, and explores how to train an Iu Mien speech recognition model using a small amount of data.

• Experiments demonstrate that the Whistle method is more data-efficient than other methods, resulting in better Iu Mien speech recognition models with a limited amount of Iu Mien speech data.

Table 1: Examples of Iu Mien language word spellings.

Iu Mien word	syllable	initial consonant	vowel (initial)	vowel (main)	vowel (final)	tone
aanwatv	aan	-	-	aa	n	-
	watv	w	-	a	t	v
baengh	baengh	b	-	ae	ng	h
nqaang	nqaang	nq	-	aa	ng	-
guinh	guinh	g	ui	-	n	h

2 Related work

Grapheme-based pre-training methods use graphemes as modeling units to construct a shared vocabulary for multilingual data. There are currently three basic grapheme modeling units: characters [3], subwords [4], and words [5]. To retain some semantic information while avoiding the out-of-vocabulary (OOV) issue, the most commonly used method is to model with subwords. For example, Whisper [6] uses a BPE (byte-pair encoding) text tokenizer and weak graphemic supervision on 680,000 hours of web data, and can recognize more than 97 languages. However, graphemes are tied to the writing system of a language, and it may be difficult to train grapheme-based models to learn speech recognition capabilities that are common across languages. Self-supervised training is another major multilingual pre-training method. For example, models such as XLS-R [7] use a large amount of unlabeled multilingual audio data for training, and their training methods are similar to wav2vec 2.0 [8] or BERT [9]. On top of the pre-trained model, a linear layer is added to map the output to the target token list for speech recognition. The CTC approach [10] can then be used to fine-tune the model.

Supervised pre-training based on phoneme annotations differs from supervised pre-training based on graphemes in that it can reduce the influence of the writing systems of different languages during model training. On the other hand, compared with self-supervised training, clear phoneme training targets can also help the model quickly learn general speech recognition knowledge. However, obtaining accurate pronunciation annotations for each language often requires a lot of expert knowledge. Recently, [2] proposed a new phoneme-based supervised pre-training method called Whistle. During pre-training, Whistle uses IPA phoneme annotations generated by a G2P model, which contain a small number of errors, as the targets (weak supervision) to train a multilingual speech recognition model. Whistle used data from ten languages (English, French, Italian, Spanish, Russian, Dutch, Turkish, Kyrgyz, Swedish, Tatar) from CommonVoice 11 [11] as training data. The Conformer model was used as the encoder, and the CTC algorithm was used for supervised pre-training. Three pre-trained models of different sizes were trained: Whistle-small (90M), Whistle-medium (220M), and Whistle-large (543M). [2] conducted crosslingual fine-tuning experiments on Polish and Indonesian to demonstrate the effectiveness of this method. However, no Sino-Tibetan language was included among the pre-training languages in [2], and the IPA annotations of the pre-training data did not contain tone information.

3 Iu Mien language spelling scheme

The writing system currently used by the Yao people is a newly created writing system, which was unified by China and the United States in 1984. It spells the Iu Mien dialect and is therefore also called the Iu Mien Unified Script (IMUS). The script is a phonetic syllabic script based on the Latin alphabet, and syllables are composed from an inventory of 30 initials, 128 finals, and 8 tones. As shown in Table 1, IMUS is a phonetic script that uses subwords to represent different sounds. The structure of an IMUS word is similar to that of Chinese pinyin, consisting of one or more syllables. Unlike Chinese pinyin, however, Iu Mien uses only the 26 letters of the basic Latin alphabet and does not use additional symbols to distinguish tones or vowels. IMUS generally uses a Latin letter at the end of the word to represent the tone. There is no corresponding letter for the mid-level tone: if a word ends without a tone-indicating letter, the word has a mid-level tone. Other tones are represented by five letters: 'h', 'v', 'z', 'x', and 'c'.
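The tone-marking convention described above can be sketched in a few lines of Python. This is only an illustration under two assumptions that go beyond the text: that the rule applies per syllable (as the segmentation in Table 1 suggests), and that a trailing letter from the tone set is always a tone marker. The function name `split_tone` is hypothetical.

```python
# Tone letters listed in the text; a syllable without one carries the mid-level tone.
TONE_LETTERS = {"h", "v", "z", "x", "c"}


def split_tone(syllable: str) -> tuple[str, str]:
    """Split a syllable into (body, tone letter); the tone is "-" when unmarked."""
    # Assumption: a trailing letter from TONE_LETTERS is always a tone marker.
    if len(syllable) > 1 and syllable[-1] in TONE_LETTERS:
        return syllable[:-1], syllable[-1]
    return syllable, "-"


# Syllables taken from Table 1.
for syl in ["aan", "watv", "baengh", "nqaang", "guinh"]:
    print(syl, split_tone(syl))
```

On the syllables of Table 1 this reproduces the tone column: "aan" and "nqaang" are unmarked, "watv" carries 'v', and "baengh" and "guinh" carry 'h'.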

Figure 1: Illustration of the fine-tuning procedures with phonetic supervision pre-training.

4 Method

This paper uses a weakly-supervised phoneme-based multilingual pre-trained model and then fine-tunes it to improve the performance of Iu Mien speech recognition. As shown in Figure 1, we use an end-to-end speech recognition model based on the CTC approach. The acoustic encoder of the Iu Mien speech recognition model is initialized from the pre-trained Whistle-small model. Fine-tuning is then performed separately with subword modeling and with phoneme modeling.

We add a linear layer on top of the acoustic encoder, which maps the feature vectors output by the encoder to the target prediction space. For subword-based fine-tuning, the parameters of the linear layer are randomly initialized. For phoneme-based fine-tuning, in order to share as many pre-trained parameters as possible, we follow Whistle and initialize the linear layer from the linear layer of the pre-trained backbone. The k-th row vector of the linear layer weight matrix on top of the encoder during pre-training is regarded as the embedding vector of the k-th phoneme in the pre-training phoneme list. For a target-language phoneme that already appears in the pre-training phoneme list, we use the embedding vector of that phoneme obtained in pre-training to initialize the corresponding row of the linear layer weight matrix for fine-tuning.
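The row-copying initialization described above can be sketched as follows. This is a minimal illustration with plain Python lists, not the CAT or Whistle implementation. The function name `init_output_layer`, the toy phoneme lists, and the small uniform random initialization for unseen phonemes are assumptions; the text does not specify how rows for unseen phonemes are initialized.

```python
import random


def init_output_layer(pretrained_phonemes, pretrained_weights, target_phonemes, dim, seed=0):
    """Build the fine-tuning weight matrix, one row per target phoneme.

    Rows for phonemes seen in pre-training are copied from the pre-trained
    matrix; all other rows are randomly initialized (an assumption).
    """
    rng = random.Random(seed)
    row_of = {p: k for k, p in enumerate(pretrained_phonemes)}
    weights, copied = [], []
    for p in target_phonemes:
        if p in row_of:
            weights.append(list(pretrained_weights[row_of[p]]))
            copied.append(p)
        else:
            weights.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return weights, copied


# Toy example: three pre-training phonemes with 2-dimensional embeddings.
pre_phonemes = ["a", "n", "t"]
pre_weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
target = ["n", "a", "x1"]  # "x1" stands for a symbol unseen in pre-training
W, copied = init_output_layer(pre_phonemes, pre_weights, target, dim=2)
print(copied)  # ['n', 'a']
print(W[0])    # [0.3, 0.4]
```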

5 Experimental setup

5.1 Dataset

MightLJSpeech is a publicly available dataset of the Iu Mien language. The dataset is 9.7 hours long and contains 9,761 audio recordings with corresponding Iu Mien text annotations. The audio sampling rate is 22,050 Hz.

In the experiment, all audio files were first resampled to 16 kHz, and the dataset was then divided into a training set, a development (validation) set, and a test set in a ratio of 8:1:1. To make the results statistically more reliable, we ran each experiment three independent times and report the average over these runs as the final result. The data for each run were divided in a cross-validation manner. Specifically, we split the original dataset into 10 parts. In each run, one part was used as the development set, one part as the test set, and the remaining eight parts as the training set. The development and test parts selected for one run were not reused for development or testing in the other runs.
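The splitting protocol above can be sketched as follows. This is an illustrative reconstruction rather than the script used in the experiments: the function name `make_runs`, the shuffling seed, and the choice of parts 2r and 2r+1 as the development and test parts of run r are all assumptions.

```python
import random


def make_runs(utterance_ids, n_parts=10, n_runs=3, seed=0):
    """Split the data into n_parts and build n_runs train/dev/test splits (8:1:1)."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    parts = [ids[i::n_parts] for i in range(n_parts)]
    runs = []
    for r in range(n_runs):
        # Assumption: run r holds out parts 2r (dev) and 2r+1 (test),
        # so held-out parts are never reused across runs.
        dev_idx, test_idx = 2 * r, 2 * r + 1
        train = [u for i, part in enumerate(parts)
                 if i not in (dev_idx, test_idx) for u in part]
        runs.append({"train": train, "dev": parts[dev_idx], "test": parts[test_idx]})
    return runs


runs = make_runs(range(9761))  # 9,761 utterances, as in the dataset description
for run in runs:
    print(len(run["train"]), len(run["dev"]), len(run["test"]))
```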

5.2 Pronunciation lexicon construction

The file that records words and their corresponding IPA pronunciation annotations is called a pronunciation lexicon. For speech recognition models that use phoneme modeling, the pronunciation lexicon is needed to build the phoneme vocabulary. In phoneme-based speech recognition, decoding also typically requires a lexicon and a language model. In the experiment, the training set is used to generate the word vocabulary. Based on the IMUS-IPA spelling correspondence table provided in the Iu Mien Language Wikipedia, the IPA pronunciation annotation of each word is generated according to the longest-match principle. The tone of each word is represented by a number in the pronunciation lexicon.
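The longest-match principle can be illustrated with a small greedy matcher. The sketch below is not the lexicon-building script used in this work. The table `TOY_TABLE` is a tiny placeholder: only the entries for 'hn' and 'n' follow an example given in this paper (with the voiceless nasal written as n plus a combining ring below), and the remaining entries, as well as the input strings, are made up for illustration. Tone letters are left out for brevity.

```python
# Placeholder spelling-to-IPA table; NOT the real correspondence table.
TOY_TABLE = {
    "hn": "n\u0325",  # voiceless n (n + combining ring below)
    "n": "n",
    "aa": "a\u02d0",  # placeholder entry
    "a": "a",         # placeholder entry
    "ng": "\u014b",   # placeholder entry
}


def longest_match(spelling, table):
    """Greedily map a spelling to IPA symbols, preferring the longest table entry."""
    max_len = max(len(key) for key in table)
    i, phonemes = 0, []
    while i < len(spelling):
        for n in range(min(max_len, len(spelling) - i), 0, -1):
            unit = spelling[i:i + n]
            if unit in table:
                phonemes.append(table[unit])
                i += n
                break
        else:
            raise KeyError(f"no table entry covers {spelling[i:]!r}")
    return phonemes


print(longest_match("hnaang", TOY_TABLE))  # 'hn' is matched before 'n' is tried
print(longest_match("naan", TOY_TABLE))
```

Trying longer units first matters because a shorter unit can be the tail of a longer one: in this toy table, 'hn' must be consumed as a whole, since 'h' on its own has no entry.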

5.3 Model training

In our experiment, the CAT toolkit [12] was used to train CTC-based speech recognition models. The same model structure as in Whistle was used to facilitate the transfer of pre-trained model parameters and fine-tuning on the Iu Mien training set. The acoustic encoder is a Conformer network with 14 encoder blocks, where each self-attention layer contains 4 self-attention heads, each with 36-dimensional hidden states. In the Whistle-based experiments, 80-dimensional fbank features extracted from the audio (resampled to 16 kHz) were used as input.

For the Iu Mien speech recognition experiments based on subword modeling, we use the BPE algorithm implemented in the SentencePiece [13] tool to construct the tokenizer and set the vocabulary size to 500. For the experiments based on phoneme modeling, the phoneme vocabulary size is 54, and it reduces to 44 after removing diacritics other than tones.

In the experiment, early stopping was used as the training scheduling strategy. When the loss of the model on the development set did not decrease for a certain number of consecutive evaluations, training was stopped. Meanwhile, to make a fair comparison, we set a minimum number of training iterations so that all models are sufficiently trained before they are compared.
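The stopping rule above, early stopping combined with a minimum number of iterations, can be sketched as follows. The function name `stopping_step`, the patience value, the minimum step count, and the loss values are all hypothetical; the text only states that "a certain number" of non-improving evaluations triggers the stop.

```python
def stopping_step(dev_losses, patience, min_steps):
    """Return the 1-based evaluation step at which training would stop."""
    best = float("inf")
    bad = 0  # consecutive evaluations without improvement
    for step, loss in enumerate(dev_losses, start=1):
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
        # Stop only when patience is exhausted AND the minimum is reached.
        if bad >= patience and step >= min_steps:
            return step
    return len(dev_losses)


losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.72, 0.71, 0.73]  # made-up dev losses
print(stopping_step(losses, patience=2, min_steps=0))  # 4
print(stopping_step(losses, patience=2, min_steps=6))  # 7
```

In the second call the minimum keeps training alive through a temporary plateau, after which the loss improves again; this is the situation the minimum iteration count is meant to guard against when comparing models.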

5.4 Model decoding

Beam search with a beam size of 32 was used for decoding. To improve decoding with a language model, we used the KenLM tool [14] to train a word-level 4-gram language model on the text of the Iu Mien training set. In the experiments based on subword modeling, we use WFST (weighted finite state transducer) based decoding [15, 3] to combine the acoustic model and the language model. In the experiments based on phoneme modeling, we use WFST-based decoding to combine the acoustic model, the pronunciation lexicon, and the language model to obtain the final decoding results.

6 Experimental results

6.1 Results for subword-based FT

Table 2 shows the results for subword-based FT, starting from different pre-trained models. O1 denotes the subword-based monolingual baseline, i.e., trained from scratch. M1 denotes the result from fine-tuning of Whistle-small. M3 denotes the result from fine-tuning of a Wav2Vec2-base model, trained from scratch using self-supervised pre-training with the same training data as the Whistle-small model, which we call the Wav2Vec2-cv10 model. Wav2Vec2-cv10 was trained using the fairseq toolkit, following the wav2vec 2.0 base pre-training configuration provided with the toolkit. M4 denotes the result from fine-tuning of a subword pretraining model, called Mul10-subword, which uses the same pretraining data and encoder architecture as Whistle-small, but is modeled using subwords with a vocabulary size of 5000. To show the statistical significance of the experimental results, we conducted three independent cross-validation tests for each experiment and took the average as the final experimental result.

Table 2: WERs for subword-based fine-tuning. (averaged over three independent cross-validation runs)

id	Model	Test WER (%) w/o LM	Test WER (%) with LM
O1	Mono.subword	9.71	6.87
M1	Whistle-small + subword FT	3.30	2.95
M3	Wav2Vec2-cv10 + subword FT	3.76	3.06
M4	Mul10-subword + subword FT	4.33	3.46
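For reference, the word error rate reported in these tables is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The following is a generic sketch of that computation, not the scoring script used in the experiments. The example strings are arbitrary sequences of words from Table 1, not real Iu Mien sentences.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("aanwatv baengh nqaang guinh", "aanwatv baengh guinh"))  # 25.0
```

The phoneme error rate (PER) is computed in the same way over phoneme sequences instead of word sequences.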

6.2 Results for phoneme-based FT

The Whistle model, during phoneme-based supervised pre-training, removes diacritics from the phoneme annotations and splits diphthongs to maximize the sharing of pronunciation knowledge among different languages. Notably, the Iu Mien script is a phonographic writing system closely related to pronunciation. So when training the phoneme-based speech recognition model for the Iu Mien language, diacritics are retained and diphthongs are not split when using IPA symbols to annotate word pronunciation. In this manner, we can describe the pronunciation differences between different words as accurately as possible. For example, in Iu Mien, the pronunciation of the subword 'hn' is /n̥/, and the pronunciation of the subword 'n' is /n/. If the diacritics were removed, the pronunciation of these two subwords could not be well distinguished.
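The effect of removing diacritics can be checked with a short sketch based on Unicode decomposition. It assumes that the voiceless nasal in the example above is encoded as the letter n followed by a combining ring below (U+0325), and that "diacritic" is taken to mean a Unicode nonspacing mark. It is not the normalization procedure used by Whistle.

```python
import unicodedata


def strip_diacritics(phoneme: str) -> str:
    """Remove combining marks (Unicode category Mn) from an IPA symbol."""
    decomposed = unicodedata.normalize("NFD", phoneme)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


voiceless_n = "n\u0325"  # n with combining ring below
plain_n = "n"

print(strip_diacritics(voiceless_n) == plain_n)  # True
# Two distinct phonemes collapse into a single one after stripping.
print(len({voiceless_n, plain_n}), len({strip_diacritics(p) for p in (voiceless_n, plain_n)}))
```

After stripping, the two symbols become identical, so the corresponding subwords would share one modeling unit. Digits used for tones are not combining marks and are left untouched by this function.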

Table 3 shows the results for phoneme-based FT, starting from different pre-trained models. O2 denotes the phoneme-based monolingual baseline, i.e., trained from scratch. M2 denotes the result from fine-tuning of Whistle-small. M5 denotes the result from fine-tuning of the Wav2Vec2-cv10 model. Again, we conducted three independent cross-validation tests for each experiment and took the average as the final experimental result. The results in Table 2 and Table 3 use the same three splits, so they are directly comparable.

6.3 Analysis

By comparing O1 and M1, as well as O2 and M2, the results show that fine-tuning the pre-trained Whistle-small model on the Iu Mien language can significantly improve the performance, compared to training the model with only Iu Mien language data without using a pre-trained model.
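To make the size of these improvements concrete, the relative error reductions can be computed directly from the numbers in Table 2 and Table 3. The snippet below only restates the reported results; the helper name `relative_reduction` is ours.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# (baseline, fine-tuned) pairs copied from Table 2 and Table 3.
pairs = {
    "O1 -> M1, WER w/o LM": (9.71, 3.30),
    "O1 -> M1, WER with LM": (6.87, 2.95),
    "O2 -> M2, PER": (4.22, 2.41),
    "O2 -> M2, WER": (4.69, 2.71),
}
for name, (base, tuned) in pairs.items():
    print(f"{name}: {relative_reduction(base, tuned):.1f}%")
# O1 -> M1, WER w/o LM: 66.0%
# O1 -> M1, WER with LM: 57.1%
# O2 -> M2, PER: 42.9%
# O2 -> M2, WER: 42.2%
```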

When conducting subword based fine-tuning, the Whistle-small model (M1) performs the best, compared to both self-supervised pre-training (M3) and subword based pre-training (M4). Presumably, this is due to the clear goal of phoneme based pre-training, which directly learns to discriminate different sounds in the input audio.

In the phoneme-based fine-tuning experiments, using the Whistle model as the backbone (M2) gave better results than using the Wav2Vec2-cv10 model as the backbone (M5). Additionally, phoneme-based FT of Whistle-small (M2) outperforms subword-based FT of Whistle-small (M1), which suggests that phoneme-based FT helps the phoneme-based backbone to work better and that matching the modeling units in pre-training and fine-tuning is beneficial.

Table 3: PERs and WERs for phoneme-based fine-tuning. (averaged over three independent cross-validation runs)

id	Model	Test PER (%)	Test WER (%)
O2	Mono.phoneme	4.22	4.69
M2	Whistle-small + phoneme FT	2.41	2.71
M5	Wav2Vec2-cv10 + phoneme FT	2.53	2.76

7 Conclusion and future work

This paper investigates and compares the effectiveness of different pre-training methods for fine-tuning Iu Mien speech recognition models using a small amount of Iu Mien data. By fine-tuning the weakly-supervised phoneme-based multilingual pre-trained model, Whistle, the speech recognition performance for the Iu Mien language can be effectively improved, achieving better results than other pre-training methods. Thanks to the pronunciation shared among different languages, phoneme-based pre-training with non-Sino-Tibetan languages can also successfully improve the performance on the Sino-Tibetan Iu Mien language. Hopefully, this could reduce the data requirement for low-resourced speech recognition for a broader range of languages.

There are some interesting directions for future work. The Whistle model used in this paper was pre-trained using only ten non-tonal languages, without considering how to integrate tonal information from different languages into the multilingual pre-training. The Iu Mien language, however, has eight tones. There have been some efforts towards addressing this problem [16, 17]. Potentially, multilingual pre-training with tone modeling incorporated would benefit low-resourced speech recognition for tonal languages.

References

[1] V. Fromkin, R. Rodman, and N. Hyams, An introduction to language: Eighth edition. Thomson Wadsworth, 2007.
[2] S. Yusuyin, T. Ma, H. Huang, W. Zhao, and Z. Ou, “Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision,” arXiv preprint arXiv:2406.02166, 2024.
[3] H. Xiang and Z. Ou, “CRF-based single-stage acoustic modeling with CTC topology,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.
[4] H. Zheng, W. Peng, Z. Ou, and J. Zhang, “Advancing CTC-CRF based end-to-end speech recognition with wordpieces and conformers,” arXiv preprint arXiv:2107.03007, 2021.
[5] H. Soltau, H. Liao, and H. Sak, “Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition,” in Proc. INTERSPEECH, 2017.
[6] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning (ICML), 2023.
[7] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. M. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in Proc. INTERSPEECH, 2021.
[8] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2019.
[10] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning (ICML), 2006.
[11] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC), 2020.
[12] K. An, H. Xiang, and Z. Ou, “CAT: A CTC-CRF based ASR toolkit bridging the hybrid and the end-to-end approaches towards data efficiency and low latency,” in Proc. INTERSPEECH, 2020.
[13] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), 2018.
[14] K. Heafield, “Kenlm: Faster and smaller language model queries,” in Proceedings of the sixth workshop on statistical machine translation (WMT), 2011.
[15] M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers,” Springer Handbook of Speech Processing, pp. 559–584, 2008.
[16] J. Li and M. Hasegawa-Johnson, “Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?” in Proc. INTERSPEECH, 2020.
[17] ——, “Autosegmental neural nets 2.0: An extensive study of training synchronous and asynchronous phones and tones for under-resourced tonal languages,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1918–1926, 2022.