LearnerVoice: A Dataset of Non-Native English Learners’ Spontaneous Speech

Haechan Kim (1,2), Junho Myung (2), Seoyoung Kim (2), Sungpah Lee (1), Dongyeop Kang (3), Juho Kim (1,2)

Abstract

Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners’ spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner’s Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of errors from the vanilla model on LearnerVoice are attributable to L2S features, with 48.1% of them being reduced in the fine-tuned model.

Keywords: speech recognition, non-native spontaneous speech, English as a second/foreign language

1 Introduction

Spontaneous speech is often characterized by disfluencies such as filler words, self-repairs, word repetitions, and false starts, which are not commonly encountered in read speech [1, 2]. Second language (L2) learners exhibit these disfluencies more frequently in spontaneous speech, along with ungrammatical utterances not typically found in native speakers’ speech. These ungrammatical expressions and disfluencies, which we define as L2S (L2 learner’s Spontaneous speech) features, increase the complexity of the Automatic Speech Recognition (ASR) task [3].

The precise transcription of L2S features is crucial to the automatic assessment of speaking tests for L2 learners [4, 5]. One commonly used scheme for evaluating L2 speakers’ speaking abilities is the Complexity, Accuracy, and Fluency (CAF) triad [6]. Ungrammatical utterances bear on accuracy, while disfluencies are pivotal for evaluating fluency [7]. Despite the importance of transcribing L2S features accurately, recent ASR systems show a higher error rate when transcribing disfluencies [3] and often automatically correct grammatical errors [8]. One of the main reasons for these challenges is the lack of publicly available datasets or benchmarks that comprehensively cover L2S features in spontaneous L2 speech [3].

To this end, we construct and publicly release LearnerVoice (available at https://prep.ringleplus.com/research), a spontaneous English speech dataset collected from L2 learners. LearnerVoice consists of 50.04 hours of fully transcribed audio (229,671 tokens) spoken by 58 L2 English learners whose first language is Korean. The audio was collected from Ringle (https://www.ringleplus.com/), an online learning platform that hosts one-to-one video tutoring sessions between L2 learners and native English tutors. To capture L2S features accurately, we recruited annotators who can fully understand the learners’ accented speech and trained them, using case examples, on what L2S features are and how they should be transcribed. Our analysis shows that our dataset includes significantly more L2S features than Switchboard [9] and Librispeech [10], two popular native speech datasets.

To further investigate the importance of L2S features for ASR performance, we also derive a taxonomy of 9 types of errors made by ASR models on L2 learners’ speech, based on previous work [11, 12, 13, 14, 15, 16, 17, 18, 19]. We then asked annotators to tag the error types in transcriptions of a subset of LearnerVoice produced by vanilla whisper-small.en, a state-of-the-art ASR model [20]. The results suggest that 54.2% of the errors fall under the L2S-related types (Filler word, Self-repair, and Ungrammatical Expression).

Fine-tuning the whisper-small.en model with LearnerVoice yields a WER of 10.26%, a 44.2% relative decrease compared to vanilla whisper-small.en. Additionally, we used error tagging to measure how the frequency of each error type changed between the vanilla and fine-tuned models. Errors associated with L2S features fell by 48.1%, while errors related to non-L2S features decreased by only 19.2%. We believe our work will serve as a cornerstone for future research that makes ASR more inclusive by supporting diverse L2 learners with different L2S feature distributions.

The contributions of this paper are as follows:

• Public release of LearnerVoice, a spontaneous English speech dataset containing 50.04 hours of audio and corresponding transcription collected from L2 learners.

• Identification of L2S features, consisting of ungrammatical expressions and disfluencies that frequently appear in L2 learners’ spontaneous speech. Experimental results show the influence of the L2S features on ASR errors for L2 learners’ spontaneous speech.

2 LearnerVoice Dataset

2.1 Dataset Overview

Audio was collected through Ringle, a platform based in Korea that matches non-native English learners with native tutors for one-to-one tutoring. Lessons are conducted via video call for either 20 or 40 minutes on various topics such as daily life, business/economics, current affairs/politics, and culture/sports. The audio from 239 lessons contains spontaneous speech from L2 learners. Users were informed and provided consent that lesson data could be released as a public dataset.

LearnerVoice consists of a total of 50.04 hours of audio obtained from 239 lessons, with a sampling rate of 32,000 Hz. The lessons comprise 168 (70.3%) 20-minute lessons and 71 (29.7%) 40-minute lessons. The transcription contains a total of 229,671 tokens based on whitespace tokenization. Compared to existing spontaneous speech datasets from native speakers, the number of tokens per hour is lower (our Switchboard subset has 228,107 tokens in 22.92 hours). We attribute this to the slower speaking rate of L2 learners.
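The tokens-per-hour comparison can be checked directly from the figures reported above. The following is a minimal sketch; the function name is ours, and the result only approximates speaking rate, since the audio durations may include pauses.

```python
def tokens_per_hour(num_tokens: int, hours: float) -> float:
    """Average number of whitespace-delimited tokens per hour of audio."""
    return num_tokens / hours


# Token counts and durations are taken from the dataset description above.
learnervoice_rate = tokens_per_hour(229_671, 50.04)
switchboard_rate = tokens_per_hour(228_107, 22.92)

print(f"LearnerVoice:       {learnervoice_rate:.0f} tokens/hour")
print(f"Switchboard subset: {switchboard_rate:.0f} tokens/hour")
print(f"Ratio:              {learnervoice_rate / switchboard_rate:.2f}")
```

Under these figures, LearnerVoice has a little under half as many tokens per hour as the Switchboard subset.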

2.2 Dataset Construction

The voices of learners and tutors are recorded as separate channels. To prepare the audio in a trainable format, we segmented it into short units using the Voice Activity Detection model released by pyannote [21]. The segmented audio was then provided to human annotators for transcription.

Through an online recruitment posting, we recruited annotators who had resided in the United States for over a year or had TOEFL scores exceeding 100. Because recognizing the learners’ accent is an important qualification for annotators, we selected native Korean speakers for the task. The most important consideration during the construction of our dataset was ensuring that the L2S features present in the audio were accurately reflected in the transcription. We therefore trained annotators on what L2S features are and how they should be transcribed by providing paired audio and transcription examples. Figure 1 shows an example of our transcription in which L2S features are accurately reflected.

Figure 1: Examples of L2S features from the transcriptions of LearnerVoice, where filler words (FW), self-repairs (SR), and ungrammatical expressions (UE) are well represented. Word repetitions, self-repairs, fragments, and false starts are all treated as types of self-repair and labeled SR.

2.3 L2 Learners Distribution

The dataset comprises speech recordings from 58 L2 learners whose first language is Korean: 38 female and 20 male learners, with ages ranging from their 20s to their 40s. Since many did not have official English-speaking test scores, we used the CAF engine developed by Ringle to report the distribution of the learners’ English-speaking proficiency. The CAF engine measures the Complexity, Accuracy, and Fluency of speech based on previous research [6, 7] and assigns each speaker a level on the IELTS (https://ielts.org/) speaking band scale (ranging from 1 to 9). When tested on 98 lessons with ground-truth bands, the CAF engine showed a root mean square error of 0.66 on the 9-band scale between predicted and ground-truth bands. Table 1 presents the CAF engine’s predictions for the 58 L2 learners. The band predictions were derived from the most recent 30 lessons of each learner, some of which may not be included in the dataset. The learners are distributed across IELTS bands 4 to 9, with an average score of 5.95. This value is similar to the average IELTS speaking band score of 5.9 for Koreans in 2022 [22].

Band	4	5	6	7	8	9
Complexity	14%	36%	21%	17%	9%	3%
Accuracy	21%	16%	24%	24%	9%	7%
Fluency	2%	31%	31%	28%	5%	3%

Table 1: Percentage of L2 learners in each IELTS band, as inferred by the CAF engine. Percentages are rounded to the nearest whole number. Learners are mostly distributed across bands 4, 5, 6, and 7.

2.4 L2S Features Distribution

We define L2S features as distinct speech characteristics of L2 learners that prominently appear in spontaneous speech. We compare the occurrences of L2S features in LearnerVoice with those in representative speech datasets from native speakers. We selected Switchboard [9] and Librispeech [10] as they exemplify native spontaneous and native read speech, respectively. To ensure a fair comparison, we randomly selected subsets of Switchboard and Librispeech containing 228,107 and 229,026 tokens, respectively.

2.4.1 Quantifying L2S Features in the Dataset

We quantify L2S features as filler words per token, self-repairs per token, and grammatical errors per c-unit. Tokens are obtained by whitespace tokenization. A c-unit (communication unit) is defined as “an independent clause with its modifiers,” and often serves as the basic unit for measuring grammatical errors [23]. The quantification proceeds as follows: (1) Filler words were counted using hard-coded detection. (2) Self-repairs were identified after removing filler words from the original sentences; they were defined as repetitions of lemma n-grams, where n is less than 5. (3) To find grammatical errors, sentences with filler words and self-repairs removed were input into CoEdIT-large [24] along with the prompt “Fix grammatical errors in this sentence.” ERRANT [25] was then employed to count the number of grammatical errors in the input sentences.
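The first two steps can be sketched as follows. This is a minimal illustration, not the exact procedure used for LearnerVoice: the filler-word list is hypothetical, lowercased surface tokens stand in for lemmas, and only immediately adjacent repetitions are counted. The grammatical-error step with CoEdIT-large and ERRANT is not covered.

```python
import re

# Hypothetical filler list; the actual hard-coded list is not given in the paper.
FILLER_WORDS = {"um", "uh", "er", "ah", "hmm"}
MAX_N = 4  # n-grams with n < 5


def tokenize(text):
    """Whitespace tokenization, lowercased, with edge punctuation stripped.

    Assumption: lowercased tokens stand in for lemmas.
    """
    tokens = (re.sub(r"^\W+|\W+$", "", t.lower()) for t in text.split())
    return [t for t in tokens if t]


def count_fillers(tokens):
    """Count tokens that appear in the filler-word list."""
    return sum(1 for t in tokens if t in FILLER_WORDS)


def remove_fillers(tokens):
    """Drop filler words so that repetitions across them become adjacent."""
    return [t for t in tokens if t not in FILLER_WORDS]


def count_self_repairs(tokens):
    """Count immediately repeated n-grams (1 <= n <= MAX_N), longest match first."""
    count, i = 0, 0
    while i < len(tokens):
        for n in range(MAX_N, 0, -1):
            end = i + 2 * n
            if end <= len(tokens) and tokens[i:i + n] == tokens[i + n:end]:
                count += 1
                i += n  # skip the first copy; the second copy may repeat again
                break
        else:
            i += 1
    return count


def l2s_rates(text):
    """Return filler words per token and self-repairs per token for one text."""
    tokens = tokenize(text)
    total = len(tokens) or 1
    return {
        "filler_per_token": count_fillers(tokens) / total,
        "self_repair_per_token": count_self_repairs(remove_fillers(tokens)) / total,
    }


print(l2s_rates("Um, I I went to to the uh the store yesterday."))
```

Removing fillers before looking for repetitions matters: in the example, “the uh the” only becomes a detectable repetition once “uh” is gone.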

2.4.2 Comparison with Other Datasets

Figure 2 shows the distributions of L2S features for each dataset. The Mann-Whitney U test demonstrates that for all L2S features, LearnerVoice exhibits significant differences compared to both Switchboard and Librispeech.

The frequency of filler words is significantly higher in LearnerVoice (N=239, $.136\pm.063$) than in Switchboard (N=573, $.077\pm.077$; $p<.001$) and in Librispeech (N=240, $.000\pm.000$; $p<.001$). The frequency of self-repairs is also significantly higher in LearnerVoice ($.127\pm.037$) than in Switchboard ($.118\pm.049$; $p<.01$) and in Librispeech ($.040\pm.012$; $p<.001$). Furthermore, the ratio of grammatical errors is higher in LearnerVoice ($.477\pm.248$) than in Switchboard ($.200\pm.210$; $p<.001$) and Librispeech ($.280\pm.091$; $p<.001$). These results suggest that our dataset contains more L2S features than the other datasets.
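For readers who want to run the same kind of comparison on their own per-recording rates, a two-sided Mann-Whitney U test can be sketched in pure Python as below. This is an illustrative implementation using the normal approximation with tie correction and no continuity correction; the sample values are made up, and a library routine such as `scipy.stats.mannwhitneyu` may be preferable in practice.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda p: p[0])
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


# Made-up per-lesson filler rates, for illustration only.
group_a = [0.20, 0.30, 0.40]
group_b = [0.10, 0.05, 0.15]
u_stat, p = mann_whitney_u(group_a, group_b)
print(u_stat, round(p, 4))
```

The normal approximation is reasonable for sample sizes like those reported above (N in the hundreds) but is coarse for very small samples such as the toy data here.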

3 Fine-tuning with LearnerVoice

To evaluate whether LearnerVoice can enhance ASR performance for L2 spontaneous speech compared to existing datasets, we fine-tuned an ASR model using both LearnerVoice and a comparable subset of a native spontaneous speech dataset, Switchboard.

3.1 Dataset Used for Fine-Tuning

To make the comparison fair, we matched the datasets used for fine-tuning by the number of tokens in their transcriptions. LearnerVoice was divided into a training set of 185,463 tokens (41.12 hours) and a validation set of 20,705 tokens (4.50 hours). For Switchboard, a training set of 185,475 tokens (18.46 hours) and a validation set of 18,562 tokens (2.00 hours) were used. Each fine-tuned model was evaluated on the LearnerVoice test set of 23,503 tokens (4.42 hours) and the Switchboard test set of 24,070 tokens (2.45 hours).

3.2 Experiment Setting

The baseline ASR model was whisper-small.en, released by OpenAI [20]. The model has 244 million parameters and supports English-only inference. We selected it as the baseline because multilingual models often misrecognized Korean-accented English as other languages (e.g., Korean, Japanese). We used the AdamW optimizer with a starting learning rate of 1e-5 and the first 500 training steps as warm-up steps. To expedite training, gradients were accumulated over every 2 steps.
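The warm-up setting can be illustrated with a small schedule function. This is a sketch under assumptions: we treat 1e-5 as the value reached at the end of warm-up, assume the warm-up is linear, and hold the rate constant afterwards, since the shape of the schedule after warm-up is not specified here.

```python
PEAK_LR = 1e-5       # learning rate stated above
WARMUP_STEPS = 500   # warm-up length stated above


def learning_rate(step: int) -> float:
    """Linear warm-up to PEAK_LR over WARMUP_STEPS optimizer steps.

    Assumption: the rate stays constant after warm-up; any decay used in the
    actual experiments is not modeled.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR


for step in (0, 100, 250, 500, 1000):
    print(step, learning_rate(step))
```

With gradient accumulation over 2 steps, one optimizer step (and hence one schedule step) corresponds to two forward and backward passes.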

3.3 Experiment Result

Table 2 displays the WERs for the LearnerVoice test set resulting from the vanilla whisper-large-v3 and vanilla whisper-small.en models, as well as the whisper-small.en model fine-tuned with LearnerVoice and Switchboard. The model chosen for testing was based on the validation loss measured during training. For LearnerVoice, a model trained for 2.04 epochs was selected, while for Switchboard, a model trained for 1.81 epochs was chosen.

Model	Train	Test	WER (%)
large-v3	-	LearnerVoice	19.18
small.en	-	LearnerVoice	18.38
small.en	Switchboard	LearnerVoice	15.03
small.en	LearnerVoice	LearnerVoice	10.26
small.en	-	Switchboard	20.18
small.en	LearnerVoice	Switchboard	19.01

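Table 2: WER (%) of vanilla and fine-tuned Whisper models on the LearnerVoice and Switchboard test sets. A dash in the Train column denotes a vanilla model without fine-tuning.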

The vanilla whisper-large-v3 and whisper-small.en models obtained WERs of 19.18% and 18.38%, respectively. The inferior performance of whisper-large-v3 may be attributable to its multilingual nature, which can cause accented speech from L2 speakers to be recognized as a different language. After fine-tuning whisper-small.en with the same amount of data from LearnerVoice and Switchboard, the WERs on the LearnerVoice test set were 10.26% and 15.03%, respectively. The model fine-tuned with LearnerVoice thus achieved a 44.2% relative decrease in WER compared to the best-performing model before fine-tuning. Furthermore, on the Switchboard test set, the WERs for vanilla whisper-small.en and the model fine-tuned with LearnerVoice were 20.18% and 19.01%, respectively. This suggests that the model trained with LearnerVoice shows a slight improvement in ASR performance even on native spontaneous speech.
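For reference, WER and the relative decrease reported above can be computed as in the following sketch. It uses a standard word-level edit distance over whitespace tokens; any text normalization applied before scoring in our experiments is not modeled, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative decrease of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# An ASR output that drops a filler word and a repetition (made-up example).
print(word_error_rate("um i i went to the store", "i went to the store"))

# WERs from Table 2: vanilla vs. LearnerVoice fine-tuned whisper-small.en.
print(f"{relative_reduction(18.38, 10.26):.1%}")
```

The first example shows why L2S features matter for WER: a transcript that silently cleans up a filler word and a repetition is penalized with two deletions even though the remaining words are correct.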

4 ASR Error Tagging on L2 Speech

To understand the importance of L2S features in ASR for L2 learners, we analyzed the causes of ASR errors on L2 learners’ speech using (1) vanilla whisper-small.en model and (2) whisper-small.en model after fine-tuning with LearnerVoice.

4.1 ASR Error Taxonomy

4.2 Methods

To see whether fine-tuning with LearnerVoice affects the error types related to L2S features, we compared the error type distribution before and after fine-tuning. We sampled 16 of the 239 lessons that comprise LearnerVoice (6.7% of the dataset). We then tagged the errors in the transcripts produced by (1) the vanilla whisper-small.en model and (2) the whisper-small.en model fine-tuned with LearnerVoice, using the ASR Error Taxonomy in Table 3, to analyze the causes of ASR errors.

For annotation, we recruited annotators with high proficiency in English and familiarity with the typical Korean English accents. Annotators underwent a 30-minute on-boarding session explaining the task and error definitions, followed by another 30 minutes of working with the authors on example lessons. To ensure the robustness of the annotation, two annotators independently worked on each lesson and inter-rater reliability was calculated.
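Cohen’s kappa for two annotators can be computed as in the sketch below. It assumes that each error has been assigned exactly one tag by each annotator and that the two tag lists are aligned item by item; the tags shown are made-up examples using the FW, SR, and UE abbreviations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical tags.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's tag frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[tag] * count_b[tag] for tag in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1 - expected)


# Made-up tags for six errors, for illustration only.
annotator_1 = ["FW", "FW", "SR", "UE", "SR", "FW"]
annotator_2 = ["FW", "SR", "SR", "UE", "SR", "FW"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```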

4.3 Results

Inter-rater reliability (Cohen’s kappa) for the collected tags was 0.77. For vanilla whisper-small.en (Figure 3), a total of 4,139 errors were observed, 54.2% of which are associated with L2S features. Filler word, Self-repair, and Ungrammatical Expression account for 32.3%, 17.1%, and 4.8% of all errors, respectively.

Figure 4 presents a comparison between the results of the vanilla model and the fine-tuned model. Examining the change ratio across error types, we observe the highest error reduction rates for Pronunciation/Accent, Self-repair, Dictionary Error, Ungrammatical Expression, and Filler word, in that order. The decrease in Dictionary Error seems to be associated with errors in proper nouns linked to platform names. Excluding this, the results indicate a 48.1% decrease in errors stemming from L2S features after fine-tuning, whereas errors originating from non-L2S features decreased by 19.2%. This suggests that LearnerVoice effectively captures L2S features, which helps ASR handle L2 spontaneous speech. Furthermore, the notably high error reduction rate for Pronunciation/Accent suggests that the fine-tuned model effectively accommodates non-native accents.

5 Conclusion

We present LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners’ spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain significantly more L2S (L2 learner’s Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (filler words, word repetitions, self-repairs, and false starts), than native speech datasets. By fine-tuning the whisper-small.en model with LearnerVoice, we reduce the Word Error Rate (WER) from 18.38% to 10.26% relative to the vanilla model. Additionally, our ASR error tagging analysis shows that more than half of the vanilla model’s errors (54.2%) stem from L2S features, underscoring the importance of addressing these features in ASR systems. We anticipate that LearnerVoice and the insights regarding the impact of L2S features on ASR for L2 learners’ spontaneous speech will serve as a cornerstone for future research.

References

[1] R. Dufour, V. Jousse, Y. Estève, F. Béchet, and G. Linarès, “Spontaneous speech characterization and detection in large audio database,” SPECOM, St. Petersburg, 2009.
[2] P. Boula de Mareüil, B. Habert, F. Bénard, M. Adda-Decker, C. Barras, G. Adda, and P. Paroubek, “A quantitative study of disfluencies in French broadcast interviews,” in Proc. Disfluency in Spontaneous Speech (DiSS 2005), 2005, pp. 27–32.
[3] Y. Qiao, W. Zhou, E. Kerz, and R. Schlüter, “The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech,” in Proc. Interspeech 2021, 2021, pp. 4453–4457.
[4] K.-n. Hassanali, S.-Y. Yoon, and L. Chen, “Automatic scoring of non-native children’s spoken language proficiency.” in SLaTE, 2015, pp. 13–18.
[5] R. Gretter, M. Matassoni, K. Allgaier, S. Tchistiakova, and D. Falavigna, “Automatic assessment of spoken language proficiency of non-native children,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7435–7439.
[6] A. Housen, F. Kuiken, and I. Vedder, “Complexity, accuracy and fluency,” Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA, vol. 32, pp. 1–20, 2012.
[7] X. Yan, H. R. Kim, and J. Y. Kim, “Complexity, accuracy and fluency (CAF) features of speaking performances on Aptis across different levels on the Common European Framework of Reference for Languages (CEFR),” ARAGs Research Reports, AR-G/2018, vol. 1, 2018.
[8] K. Knill, M. Gales, K. Kyriakopoulos, A. Malinin, A. Ragni, Y. Wang, and A. Caines, “Impact of ASR Performance on Free Speaking Language Assessment,” in Proc. Interspeech 2018, 2018, pp. 1641–1645.
[9] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. IEEE Computer Society, 1992, pp. 517–520.
[10] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[11] S. Goldwater, D. Jurafsky, and C. D. Manning, “Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates,” Speech Communication, vol. 52, no. 3, pp. 181–200, 2010.
[12] G. Zhu, Y. Yan, J.-P. Caceres, and Z. Duan, “Transcription free filler word detection with neural semi-crfs,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[13] A. Zafar, B. Mamlin, S. Perkins, A. M. Belsito, J. Overhage, and C. J. McDonald, “A simple error classification system for understanding sources of error in automatic speech recognition and human transcription,” International Journal of Medical Informatics, vol. 73, no. 9, pp. 719–730, 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1386505604001200
[14] J. Spilker, M. Klarner, and G. Görz, “Processing self-corrections in a speech-to-speech system,” in Verbmobil: Foundations of speech-to-speech translation. Springer, 2000, pp. 131–140.
[15] M.-S. Kim, “An error analysis of a learner corpus of written and spoken English produced by Korean university students,” Unpublished doctoral dissertation, Korea University, South Korea, 2009.
[16] H. Strik, J. van Doremalen, J. van de Loo, and C. Cucchiarini, “Improving ASR processing of ungrammatical utterances through grammatical error modeling,” in Proc. Speech and Language Technology in Education (SLaTE 2011), 2011, pp. 109–112.
[17] R. Frieske and B. E. Shi, “Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models,” arXiv preprint arXiv:2401.01572, 2024.
[18] A. Rodrigues, R. Santos, J. Abreu, P. Beça, P. Almeida, and S. Fernandes, “Analyzing the performance of ASR systems: The effects of noise, distance to the device, age and gender,” in Proceedings of the XX International Conference on Human Computer Interaction, 2019, pp. 1–8.
[19] I. Vasilescu, D. Yahia, N. Snoeren, M. Adda-Decker, and L. Lamel, “Cross-lingual study of ASR errors: on the role of the context in human perception of near-homophones,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[20] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
[21] H. Bredin and A. Laurent, “End-to-end speaker segmentation for overlap-aware resegmentation,” in Proc. Interspeech 2021, Brno, Czech Republic, August 2021.
[22] IELTS, “IELTS test statistics,” last accessed 2024-03-11. [Online]. Available: https://ielts.org/researchers/our-research/test-statistics
[23] S. L. Eisenberg and L.-Y. Guo, “Percent grammatical responses as a general outcome measure: Initial validity,” Language, Speech, and Hearing Services in Schools, vol. 49, no. 1, pp. 98–107, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:36645664
[24] V. Raheja, D. Kumar, R. Koo, and D. Kang, “Coedit: Text editing by task-specific instruction tuning,” 2023.
[25] C. Bryant, M. Felice, and T. Briscoe, “Automatic annotation and evaluation of error types for grammatical error correction,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), R. Barzilay and M.-Y. Kan, Eds. Vancouver, Canada: Association for Computational Linguistics, Jul. 2017, pp. 793–805. [Online]. Available: https://aclanthology.org/P17-1074
[26] V. I. Levenshtein et al., “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady, vol. 10, no. 8. Soviet Union, 1966, pp. 707–710.