Article

Phoneme Recognition in Korean Singing Voices Using Self-Supervised English Speech Representations

Artificial Intelligence Lab, Department of Computer Science and Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8532; https://doi.org/10.3390/app14188532
Submission received: 19 August 2024 / Revised: 19 September 2024 / Accepted: 19 September 2024 / Published: 22 September 2024

Abstract

In general, it is difficult to obtain a large, labeled dataset for deep learning-based phoneme recognition in singing voices. Singing voices also pose inherent challenges compared to speech because of their distinct variations in pitch, duration, and intensity. This paper proposes a detouring method to overcome this data scarcity and applies it to the recognition of Korean phonemes in singing voices. The method starts by pre-training HuBERT, a self-supervised speech representation model, on a large-scale English corpus. The model is then adapted to the Korean speech domain with a relatively small-scale Korean corpus, in which Korean phonemes are interpreted as similar English ones. Finally, the speech-adapted model is trained again with a tiny-scale Korean singing voice corpus for speech–singing adaptation. In this final adaptation, melodic supervision, which utilizes pitch information, is chosen to improve performance. For evaluation, multi-level error rates based on the Word Error Rate (WER) were measured. HuBERT-based transfer learning for adaptation improved the phoneme-level error rate on Korean speech by as much as 31.19%, and melodic supervision improved the rate on singing voices by a further 0.55%. The significant improvement in speech recognition underscores the considerable potential of a model equipped with general human voice representations captured from the English corpus to improve phoneme recognition when less target speech data is available. Moreover, the musical variation in singing voices is beneficial for phoneme recognition in singing voices. The proposed method could be applied to phoneme recognition in other languages with limited speech and singing voice corpora.

1. Introduction

Automatic speech recognition (ASR) is attracting a great deal of attention from the research community [1,2,3] for its diverse real-world applications. Although automatic lyric transcription (ALT) [4] has languished in comparison, it is still applied to lyric alignment [5], query by singing [6], audio indexing [7], and music subtitling [8]. While both involve extracting linguistic content from a human voice stream, a singing voice exhibits significantly greater variations in pitch, duration, and intensity than speech. Thus, although well-designed ASR models [4,9] could be employed as a substitute for ALT, doing so impairs ALT performance.
Notably, most ALT research is primarily focused on English corpora [4], although there is some research on Korean [9] and other non-English language corpora [10] as well. One reason for this may be the lack of temporal alignment annotations for songs in those languages. Even when using the CTC (Connectionist Temporal Classification) loss [11], which allows model training in the absence of alignment information, some form of coarse alignment annotation is still necessary during data preprocessing, when songs are split into shorter segments.
To overcome the data challenge and the development lag in non-English ALT, it is natural to consider transferring priors from the more advanced English research ecosystem. To ensure highly general cross-lingual priors rather than language-specific knowledge, self-supervised learning (SSL) is a natural choice. Without annotated labels, SSL can better capture the general representations embedded in a language. Labels in the field of speech can vary, ranging from words to phonemes, so learning that relies solely on one specific type of label, such as words, could limit a model’s generalization performance. SSL circumvents this limitation by reducing the dependence on specific labels, affording SSL models such as HuBERT [12] exceptional generalization capabilities. In short, SSL can handle a wide array of speech data types without being constrained by a single label type. This versatility enables SSL models to better adapt to complex and evolving tasks in speech processing. Consequently, their generalization performance and applicability are potentially superior to those of supervised learning with specific labels [13]. This advantage broadens the scope of SSL and permits more adaptable and efficient solutions for speech processing.
In theory, a pre-trained ASR model trained on an English corpus could be utilized to perform ALT on Korean singing voices. However, abruptly adapting the priors of HuBERT-based SSL to the target Korean singing domain with a tiny dataset is generally inadequate, as shown empirically in the experiment section. Additionally, the objective of ALT is to generate a word stream directly as the model output. Alternatively, the model can initially output a phoneme stream, which can then be converted into a word stream through a post-processing step. This approach is less dependent on a specific language (English) and enables the efficient utilization of priors for the domain shift (to Korean). Here, phoneme recognition in singing voices serves as the initial phase of ALT. This paper proposes a method for Korean phoneme recognition in singing voices, which implies that an additional post-processing step is necessary to output a word stream.
Figure 1 shows a flow diagram of the proposed method for building a Korean phoneme recognition model from Korean songs. In the first stage, the SSL model was trained to capture general speech representations on a large-scale English corpus of 960 h. The pre-trained SSL model was then adapted with a relatively small-scale Korean speech corpus of 180 h; in this stage, Korean phonemes were interpreted as similar English ones. Finally, the Korean speech model was adapted to the Korean song domain using a tiny singing voice corpus of 2.5 h. In this final adaptation, melodic supervision was chosen, which utilizes pitch information, particularly the onsets of musical notes, to improve the performance. This was accomplished by employing a multi-task learning approach [14] that includes pitch detection.
The two-stage adaptation of the SSL model improved Korean phoneme recognition at multiple levels for both speech and singing voices. In short, this study explored the usefulness of SSL for phoneme recognition in languages with insufficient data and validated the effectiveness of melodic supervision. For evaluation, multi-level error rates based on the Word Error Rate (WER) were measured to demonstrate the validity of our method. Transfer learning improved the phoneme-level error rate of Korean speech by as much as 31.19%, and melodic supervision improved the phoneme-level error rate in singing voices by a further 0.55%.
Therefore, the contributions of this study can be summarized as follows:
  • An SSL-based detouring method to overcome insufficient speech and song datasets was explored for the recognition of Korean phonemes in speech and singing voices. The method is quite general and applies to phoneme recognition in the speech and singing voices of any other language for which labelled data are insufficient.
  • In the proposed model, a HuBERT-type SSL model pre-trained on a large-scale English corpus proved useful for capturing general representations of linguistic phonemes, which were adapted to Korean phonemes in the following stage with a relatively small-scale Korean dataset.
  • In the speech–singing voice adaptation, melodic supervision through multi-task learning proved useful for better phoneme recognition.
The ultimate goal of our work is the ALT of traditional Korean folk songs, which have been passed down orally without text or musical score. Thus, the method in this paper may help preserve the traditional folk songs of diverse countries.
This paper is organized as follows. Section 2 reviews the related prior works, while Section 3 presents a comprehensive discussion of the methodology. Section 4 provides the details of our datasets, elucidates the evaluation metrics, and presents the experimental findings together with the discussion. Finally, Section 5 concludes the paper, while exploring potential directions for future research.

2. Related Work

This section reviews the related works on the proposed method, including self-supervised speech representation learning and automatic lyric transcription. This review outlines the challenges we aim to address, and describes how our research idea emerged from these works.

2.1. Self-Supervised Speech Representation Learning

A lack of annotated data is a challenge in all aspects of deep learning, and ASR and ALT are no exceptions. In recent years, self-supervised learning (SSL), which alleviates the need for annotated data, has grown in popularity. Reference [13] argues that speech or audio processing using SSL can be categorized into three main approaches: generative methods [15,16] that generate or reconstruct input data using a limited perspective, contrastive methods [17,18,19] that focus on learning representations by distinguishing a target sample from other distractor samples using an anchor representation, and predictive methods [12,20,21] that utilize a learned target, and draw inspiration from the success of BERT-like models.
The modality of the input data for ASR and ALT is human voice audio, which contains more complex information than text. The same sentence can be spoken or sung by different speakers with various prosodic structures, such as intonation, rhythm, and timbre. It is difficult for generative methods to reconstruct a view of the original input data that preserves these details. On the other hand, a key part of contrastive methods is the strategy for defining positive and negative samples, and model performance can be easily affected by this strategy. Because ASR and ALT samples can differ in multiple aspects, as above, defining positive and negative samples along all of these aspects becomes yet another time-consuming annotation task. Both generative and contrastive methods tend to rely excessively on the original input data itself, which ties them to specific aspects embedded in the language. This is why we selected a predictive method, specifically HuBERT, whose targets are hidden units learned through a separate K-means step rather than a shallow view of the original input data, in order to encourage decoupling from a specific language.
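As a rough illustration of how such hidden-unit targets can be obtained (HuBERT's first training iteration clusters MFCC features; later iterations re-cluster intermediate model representations), the sketch below uses librosa and scikit-learn. The file paths and the cluster count are illustrative assumptions, not the settings used in this work.

```python
# Sketch: frame-level "hidden unit" targets from K-means over MFCC features,
# in the spirit of HuBERT's first-iteration target generation.
# The audio paths and the cluster count (k=100) are illustrative assumptions.
import librosa
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def mfcc_frames(path, sr=16000, n_mfcc=13):
    """Load audio and return its MFCC frames with shape (T, n_mfcc)."""
    wav, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc).T

# Fit the clustering model h on frames pooled from (placeholder) training utterances.
train_paths = ["utt_0001.wav", "utt_0002.wav"]
features = np.concatenate([mfcc_frames(p) for p in train_paths], axis=0)
kmeans = MiniBatchKMeans(n_clusters=100, random_state=0).fit(features)

# For a new utterance X, the cluster assignments Z = h(X) serve as the
# pseudo-labels ("hidden units") predicted during masked prediction.
Z = kmeans.predict(mfcc_frames("utt_0003.wav"))   # integer unit IDs, one per frame
```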

2.2. Automatic Lyric Transcription

In the context of singing, the linguistic elements are commonly known as lyrics. The process of automatically extracting these lyrics from a singing voice is referred to as Automatic Lyric Transcription [5]. There are two deep learning-based approaches applicable to ALT. The first involves creating and utilizing large-scale, finely annotated singing voice datasets for ALT model training, such as DALI [22], Sing! 300 × 30 × 2 [4], and the DSing corpus [23]; such datasets can also be augmented or synthesized from existing speech datasets to enhance their musical relevance [24,25]. The second approach is speech–singing adaptation [26], in which relatively well-designed ASR models are repurposed for ALT [27], in light of the similarity between ALT and ASR. Alternatively, well-trained ASR models can be fine-tuned to incorporate the necessary priors from large-scale speech datasets [28]. Some works have aimed to adapt SSL models for ASR to the singing domain; for example, [28] proposed a transfer learning-based ALT method by adapting wav2vec 2.0 [19], an SSL ASR model. Additionally, unlike approaches that naively regard ALT as ASR, other speech–singing adaptation methods, such as [14,29], utilize musical information to facilitate ALT.
However, both SSL for audio processing and speech–singing adaptation prioritize English data. For ALT in Korean, annotated singing voice datasets comparable to DALI are rare, and even the open-source speech datasets are not as comprehensive as their English counterparts. This is why approaches that require a huge amount of Korean speech or singing data are difficult to adopt for Korean ALT. We therefore use English–Korean adaptation to leverage the strong phonetic priors obtained from large-scale English datasets, and then employ musical information in the speech–singing adaptation.
To further decouple from a specific language (English) and enable the efficient utilization of priors, our objective is to design a system that, given audio of the vocal portion of a Korean song, recognizes phonemes (speech sounds), the basic elements of language, rather than words. This system can serve as an initial step of ALT; the lyrics, consisting of words, can then be transcribed from the predicted phonemes by a separate language model in post-processing.

3. Method

We used a limited amount of Korean song data and leveraged speech representations learned through self-supervised representation learning on large-scale English speech data, rather than relying on large-scale Korean song data. Because phonemes and melody are highly correlated, we also utilized melodic information to account for the difference between speech and singing. Adaptation is performed twice, sequentially: across the English–Korean language domain, and then across the speech–singing domain.
Figure 2 provides a detailed representation of Figure 1. In step 0, self-supervised speech representation learning from English speech data is adopted. A two-step adaptation process then follows: in step 1, the learned English speech representation undergoes linguistic adaptation to Korean phonemes, and in step 2, the model is adapted to Korean singing voices with melodic supervision.

3.1. Self-Supervised English Speech Representation Learning

We employed the self-supervised speech representation learning model HuBERT [12], which leverages a pre-trained clustering model to provide target labels for masked prediction. The clustering model, denoted as $h$, takes a speech utterance $X = [x_1, \ldots, x_T]$ comprising $T$ frames as input, and generates $T$ hidden units as output, represented as $h(X) = Z = [z_1, \ldots, z_T]$.
Let $M \subset [T]$ denote the set of indices to be masked for a length-$T$ sequence $X$, and let $\tilde{X}(X, M)$ represent a corrupted version of $X$ where $x_t$ is replaced with a mask embedding $\tilde{x}$ if $t \in M$. A masked prediction model $f$ takes $\tilde{X}$ as input, and predicts a distribution over the target indices at each time step, denoted as $p_f(\cdot \mid \tilde{X}, t)$. Define the cross-entropy losses computed over masked and unmasked time steps as $L_m$ and $L_u$, respectively. $L_m$ is expressed as follows:
$$L_m(f; X, M, Z) = \sum_{t \in M} \log p_f(z_t \mid \tilde{X}, t)$$
$L_u$ follows the same structure, but it sums over $t \notin M$. The overall loss is calculated as a weighted combination of these two terms: $L = \alpha L_m + (1 - \alpha) L_u$. In the extreme case where $\alpha = 0$, the loss is computed solely over the unmasked time steps. The models offered by HuBERT are pre-trained using 960 h of large-scale English speech data.
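The following is a minimal PyTorch sketch of the combined objective $L = \alpha L_m + (1 - \alpha) L_u$; the tensor shapes, the way the logits are produced, and the toy mask layout are assumptions for illustration rather than the Fairseq implementation.

```python
# Sketch of the HuBERT-style masked-prediction objective L = a*L_m + (1-a)*L_u.
# The logits over cluster units and the mask layout are illustrative assumptions.
import torch
import torch.nn.functional as F

def hubert_loss(logits, targets, mask, alpha=1.0):
    """
    logits:  (T, K) frame-wise scores for p_f(. | X_tilde, t) over K hidden units
    targets: (T,)   cluster assignments z_t from the clustering model h
    mask:    (T,)   boolean, True where the frame is masked (t in M)
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-frame -log p_f(z_t | ...)
    L_m = ce[mask].sum()     # cross-entropy over masked frames
    L_u = ce[~mask].sum()    # cross-entropy over unmasked frames
    return alpha * L_m + (1.0 - alpha) * L_u

# Toy usage: 50 frames, 100 hidden units, frames 10-17 masked.
T, K = 50, 100
logits = torch.randn(T, K)
targets = torch.randint(0, K, (T,))
mask = torch.zeros(T, dtype=torch.bool)
mask[10:18] = True
loss = hubert_loss(logits, targets, mask, alpha=1.0)  # alpha = 1: masked frames only
```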

3.2. English–Korean Adaptation

Our approach uses a self-supervised pre-trained model on an English corpus, which provides generalized prior knowledge encompassing sounds that humans can produce, regardless of the language that they speak. For the phonological systems of English and Korean, ref. [30] provides a comparative analysis that thoroughly examines the structures and processes for Korean–English interpretation training. Notably, English–Korean adaptation closely resembles the work of English–Korean interpretation. The key distinction is that we are not interpreting meaning, but are instead interpreting sounds, specifically mapping Korean to English phonological systems.
The challenge for the ‘interpreter’ lies in the differences between the English and Korean phonological systems, particularly the phonemes present in Korean but lacking in English. English phonemic vowels are more finely differentiated, as shown in Figure 3. There are vowels present in Korean that are absent in English, such as /ɯ/ (ㅡ), /o/ (ㅗ), and /ɤ/ (ㅓ), just as there are vowels present in English but not in Korean, including /i/, /ʊ/, /æ/, /ej/, /ow/, /ɔ/, /ə/, /ʌ/, and /a/. Rather than impeding our research, this provides valuable insights into how to approach /ɯ/, /o/, and /ɤ/.
The Korean language also lacks voiced obstruents, but exhibits aspiration and glottal constriction contrasts, as shown in Table 1. There are consonants present in Korean but not in English, such as /pʰ/ (ㅍ), /p’/ (ㅃ), /tʰ/ (ㅌ), /t’/ (ㄸ), /kʰ/ (ㅋ), /k’/ (ㄲ), /s’/ (ㅆ), /t∫ʰ/ (ㅊ), /t∫’/ (ㅉ), and /r/ (ㄹ). Likewise, English has consonants that are absent in Korean, including /b/, /d/, /g/, /f/, /v/, /θ/, /ð/, /z/, /∫/, /ʒ/, /dʒ/, /l/, and /ɹ/. Hence, for phonemes present in Korean but not in English, we can assume a similar-sounding English phoneme from among the available English phonemes.
Given these considerations, we took a HuBERT encoder pre-trained on a vast amount of English speech data and fine-tuned it with a small amount of Korean speech data, about 180 h, thereby enhancing its representational capabilities for Korean phonemes.
The HuBERT encoder $e$ takes a Korean speech utterance $X = [x_1, \ldots, x_{T_s}]$ comprising $T_s$ frames as input, and encodes it into $T_e$ feature frames $Z$ as output, represented as $e(X) = Z = [z_1, \ldots, z_t, \ldots, z_{T_e}]$. The phoneme predictor (a fully connected layer) $n$ takes the feature frames $Z$ as input, and predicts a distribution over all phoneme classes on each frame as $p_n(\cdot \mid Z, t)$. The CTC loss, taking all frame-wise distributions $P = [p_1, \ldots, p_{T_e}]$ and a phoneme sequence target $Y_{phoneme}$, is computed as $L_{phoneme}$, as follows:
$$L_{phoneme} = \mathrm{CTC}(P, Y_{phoneme})$$
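A minimal PyTorch sketch of this phoneme predictor and the CTC objective is given below. The 768-dimensional encoder output and the 34-class inventory follow Section 4; the batching, the blank index, and the dummy tensors are illustrative assumptions.

```python
# Sketch: phoneme predictor n (one fully connected layer) on HuBERT features Z,
# trained with L_phoneme = CTC(P, Y_phoneme). Shapes and blank index are assumed.
import torch
import torch.nn as nn

ENC_DIM, N_PHONEMES = 768, 34            # 32 phonemes + space + CTC blank (Section 4.1.2)
phoneme_predictor = nn.Linear(ENC_DIM, N_PHONEMES)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def phoneme_loss(encoder_out, targets, feat_lens, target_lens):
    """
    encoder_out: (B, T_e, ENC_DIM) feature frames Z from the HuBERT encoder e
    targets:     (B, L) padded phoneme-index targets Y_phoneme
    """
    logits = phoneme_predictor(encoder_out)                  # (B, T_e, N_PHONEMES)
    log_probs = logits.log_softmax(-1).transpose(0, 1)       # CTC expects (T_e, B, C)
    return ctc(log_probs, targets, feat_lens, target_lens)

# Toy usage: batch of 2 utterances with 120 encoded frames each.
enc = torch.randn(2, 120, ENC_DIM)
tgt = torch.randint(1, N_PHONEMES, (2, 40))                  # indices 1..33 (0 = blank)
loss = phoneme_loss(enc, tgt, torch.tensor([120, 120]), torch.tensor([40, 35]))
```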

3.3. Speech–Singing Adaptation

Through the English–Korean adaptation, we obtained a model with some degree of discriminative ability for Korean phonemes. What remained was to adapt this model’s capabilities to singing voices with around 2.5 h of tiny-scale Korean singing voice data. The most significant difference between speech and singing lies in the melodic dynamics, particularly variations in pitch and intensity.
There are thus two main approaches to speech–singing adaptation: one that does not leverage musical information, and one that does. Numerous previous studies treated the singing voice as speech and ignored musical information [1,4].
However, ref. [14] improved lyrics alignment performance by adding melodic supervision, that is, by utilizing musical information through multi-task learning with a pitch detection task. Melodic information is beneficial for transcription, since knowing the onset (where a note begins), the offset (where it ends), and the duration is crucial. As shown in Figure 4, onset, offset, and duration are closely related to the melody in a singing voice [32].
We, therefore, introduced the approach from [14], and as depicted in Figure 2, we performed our speech–singing adaptation using the English–Korean adapted model. This model tackled the phoneme recognition task by the phoneme predictor (a fully connected layer) $n$ with the CTC loss (phoneme supervision) $L_{phoneme}$, while also including a pitch detection task using a melody predictor $m$, where $m$ takes the feature frames $Z$ encoded by the HuBERT encoder $e$ as input and predicts a distribution over the pitch classes on each frame as $p_m(\cdot \mid Z, t)$. Defining the note (pitch) sequence target as $Y_{melody} = [y_1, \ldots, y_t, \ldots, y_{T_e}]$, the frame-level cross-entropy (CE) loss (melodic supervision) $L_{melody}$ is computed as follows:
$$L_{melody} = \frac{1}{T_e} \sum_{t < T_e} \mathrm{CE}(p_t, y_t)$$
To enhance speech–singing adaptation, the final loss is computed as follows:
$$L = L_{phoneme} + \lambda L_{melody}$$
in which $\lambda$ controls the relative weight of the melodic supervision. In our tests, $\lambda = 1.5$ was the best setting.
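A minimal sketch of this joint objective is shown below, with the melody predictor as a single fully connected layer (768 → 27 pitch classes, see Section 4) and λ = 1.5 as reported; the shapes, the blank index, and the batching are illustrative assumptions.

```python
# Sketch of the joint objective L = L_phoneme + lambda * L_melody used for
# speech-singing adaptation. Shapes, blank index, and batching are assumptions.
import torch
import torch.nn as nn

ENC_DIM, N_PHONEMES, N_PITCHES = 768, 34, 27
phoneme_predictor = nn.Linear(ENC_DIM, N_PHONEMES)   # phoneme supervision head n
melody_predictor = nn.Linear(ENC_DIM, N_PITCHES)     # melodic supervision head m
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
ce = nn.CrossEntropyLoss()

def joint_loss(enc_out, phon_tgt, feat_lens, phon_lens, pitch_tgt, lam=1.5):
    """
    enc_out:   (B, T_e, ENC_DIM) HuBERT encoder features Z
    phon_tgt:  (B, L)   phoneme sequence targets Y_phoneme (CTC)
    pitch_tgt: (B, T_e) frame-level note targets Y_melody (frame-wise CE)
    """
    phon_logp = phoneme_predictor(enc_out).log_softmax(-1).transpose(0, 1)
    L_phoneme = ctc(phon_logp, phon_tgt, feat_lens, phon_lens)

    pitch_logits = melody_predictor(enc_out)                         # (B, T_e, N_PITCHES)
    L_melody = ce(pitch_logits.reshape(-1, N_PITCHES), pitch_tgt.reshape(-1))

    return L_phoneme + lam * L_melody
```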

4. Experiments

This section describes in detail the experiments and datasets used for our model’s development and evaluation. It begins with data preparation and the metrics for performance evaluation. In the data preparation, the datasets are described precisely, and the preprocessing steps for the training data are explained, focusing on converting grapheme annotations to phoneme sequences and unifying phoneme notations. The section then thoroughly describes the experimental setup, including the implementation details, dataset division, training strategies, and hyperparameters, along with the experimental results and discussion.

4.1. Data Preparation and Performance Evaluation

4.1.1. Datasets

The HuBERT encoder that we used to obtain feature representations was pre-trained by the LibriSpeech corpus [33], which is a corpus of 960 h of 16 kHz read English speech, derived from read audiobooks from the LibriVox project.
The dataset used for English–Korean adaptation was sourced from the National Institute of Korean Language and comprises the Reading Aloud Korean Short Stories by Seoul People 2.0 (NIKLSEOUL) corpus [34]. This corpus is composed of audio recordings and transcriptions of 930 sentences from 19 books, read by 77 individuals from Seoul, both male and female, ranging in age from 20 to 60. The audio data are organized at the sentence level, for a total of 87,035 recordings of around 180 h in duration. Each audio file varies in length from 3 to 10 s, with an average duration of approximately 5 s, and all files share a common sampling rate of 16 kHz.
For speech–singing adaptation, we utilized the Children’s Song Dataset (CSD) corpus [35]. This corpus includes 50 songs sung by children in both Korean and English. These songs were professionally recorded by a female singer. Each song was recorded in two different keys, with variations from 3 to 5 semitones. This dataset is complemented by MIDI transcription files that include precise information about the onset and offset times of various elements in the songs. These MIDI files have been manually adjusted to accurately align with the corresponding syllable-level and phoneme-level lyrics annotations for training the melody predictor (Figure 2). The Korean part of this corpus consists of 100 recordings in total, collectively amounting to approximately 2.5 h of audio content. The duration of each individual audio file ranges from 17 to 190 s, with an average duration of approximately 90 s. All audio files in this corpus share a common sampling rate of 44.1 kHz.

4.1.2. Preprocessing of the NIKLSEOUL and CSD Corpora

The model’s target is a phoneme sequence, not a word or grapheme sequence. In the NIKLSEOUL corpus, however, annotations are written in graphemes rather than in phonemes reflecting the actual pronunciation. We therefore employed a grapheme-to-phoneme (G2P) tool called g2pK [36] to convert the annotations into phoneme sequences consistent with standard Korean pronunciation. g2pK annotates phonemes strictly according to the Revised Romanization (RR) of Korean system [37]. In contrast, the annotations in the CSD corpus are already in phonemes, but are not exactly consistent with RR-style notation: the CSD corpus treats diphthongs (except ui /i/ and oe /oe/) as a combination of two vowel phonemes, whereas g2pK treats each diphthong as a single independent vowel. We therefore manually unified the annotations into the CSD-style notation and standardized the sampling rate to 16 kHz. Following the CSD-style notation, we have 32 phoneme classes, to which we added a space symbol to distinguish syllables and a blank symbol for the CTC loss, giving 34 classes for phoneme recognition. The pitch range of the CSD corpus is F3–F5, i.e., 26 classes; we added a symbol for silence or unknown pitches, giving 27 classes for melodic supervision.
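The sketch below illustrates the two preprocessing operations described above: grapheme-to-phoneme conversion with g2pK (following the usage shown in its repository [36]) and resampling to 16 kHz. The example sentence and file paths are placeholders, and the subsequent mapping of the pronounced Hangul to the 32 CSD-style phoneme symbols is a project-specific table that is not reproduced here.

```python
# Sketch of the preprocessing: grapheme -> surface pronunciation with g2pK,
# then resampling CSD audio (44.1 kHz) to the common 16 kHz rate.
# The sentence and file paths are placeholders; mapping the pronounced Hangul
# to the 32 CSD-style phoneme symbols (plus space) is done by a separate table.
from g2pk import G2p          # pip install g2pk
import librosa
import soundfile as sf

g2p = G2p()
pronounced = g2p("학교에 갔다")        # returns the surface pronunciation in Hangul
print(pronounced)

wav, _ = librosa.load("csd_song_clip.wav", sr=16000)   # resample on load
sf.write("csd_song_clip_16k.wav", wav, 16000)
```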

4.1.3. Evaluation Metrics

We employed the word error rate (WER), a common metric in ASR systems, to evaluate our model’s performance, adapting it to measure multi-level (phoneme, syllable, consonant, vowel) error rates by calculating the error rate between the predicted sequence $\hat{Y}$ and the ground truth label $Y$. The edit distance $ED$ is the Levenshtein distance:
$$ED = S + D + I$$
in which $S$ represents the number of substitutions, $D$ the number of deletions, and $I$ the number of insertions. The error rate is then computed as
$$ER(Y, \hat{Y}) = \frac{ED(Y, \hat{Y})}{\mathrm{len}(Y)}$$
where lower $ER$ values indicate better model performance. Note that the space symbol is included in the calculation at all levels, as this symbol is used to distinguish syllables.
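A minimal sketch of this multi-level error rate, with a plain Levenshtein distance implementation, is shown below; the toy symbol sequences are illustrative.

```python
# Sketch of ER(Y, Y_hat) = ED(Y, Y_hat) / len(Y), where ED is the Levenshtein
# distance (substitutions + deletions + insertions). The sequences may contain
# phoneme, syllable, consonant, or vowel symbols; the space symbol is kept.
def levenshtein(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def error_rate(ref, hyp):
    return levenshtein(ref, hyp) / len(ref)

# Toy usage at the phoneme level (symbols are illustrative).
ref = ["n", "a", " ", "m", "u"]
hyp = ["n", "a", " ", "b", "u"]
print(f"phoneme-level ER: {error_rate(ref, hyp):.2%}")   # one substitution out of five
```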
On the other hand, to demonstrate the validity of the speech–singing adaptation in Step 2, the precision, recall, and F1-score of the Correct Onset (COn) metric proposed in [38] were used to measure pitch detection. The onset tolerance was set to 50 ms.
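The sketch below computes COn-style precision, recall, and F1 with a 50 ms onset tolerance using a simple greedy matching; it is an illustrative approximation of the procedure defined in [38], and the onset lists are placeholders.

```python
# Sketch of Correct Onset (COn) scoring: a predicted onset counts as correct if
# it falls within a 50 ms tolerance of an as-yet-unmatched reference onset.
# Greedy matching is used here as an approximation of the procedure in [38].
def con_scores(ref_onsets, est_onsets, tol=0.05):
    """ref_onsets, est_onsets: onset times in seconds, sorted ascending."""
    matched, tp = set(), 0
    for est in est_onsets:
        for i, ref in enumerate(ref_onsets):
            if i not in matched and abs(est - ref) <= tol:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(est_onsets) if est_onsets else 0.0
    recall = tp / len(ref_onsets) if ref_onsets else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage with placeholder onset times (seconds).
p, r, f = con_scores([0.50, 1.20, 2.00], [0.52, 1.30, 2.01])
print(f"COn precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```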

4.2. Experimental Details

4.2.1. HuBERT-Based SSL and Adaptation Models

We implemented our method and performed the experiments using the Fairseq toolkit [39], a PyTorch toolkit for sequence modeling. The toolkit already implements the original HuBERT and provides pre-trained model weights at multiple scales (HuBERT Base, HuBERT Large, HuBERT Extra Large). Considering our experimental facilities and the ease of transfer to larger models, the HuBERT Base weights were employed in our experiments. The encoder embedding dimension is 768, so the phoneme predictor is a fully connected layer with an input size of 768 and an output size of 34, and the melody predictor is a fully connected layer with 768 inputs and 27 outputs.
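The sketch below shows the resulting model layout: a pre-trained HuBERT Base encoder producing 768-dimensional frames, with the two fully connected heads on top. Loading the encoder through torchaudio's HUBERT_BASE bundle is a stand-in for the Fairseq checkpoints actually used, and the dummy waveform is illustrative.

```python
# Sketch of the model layout: HuBERT Base encoder (768-dim output) plus the
# phoneme predictor (768 -> 34) and melody predictor (768 -> 27).
# torchaudio's HUBERT_BASE bundle stands in for the Fairseq checkpoint.
import torch
import torch.nn as nn
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
encoder = bundle.get_model()                  # pre-trained on 960 h of LibriSpeech

phoneme_predictor = nn.Linear(768, 34)
melody_predictor = nn.Linear(768, 27)

waveform = torch.randn(1, bundle.sample_rate * 3)   # 3 s of dummy 16 kHz audio
with torch.no_grad():
    features, _ = encoder(waveform)           # (1, T_e, 768) feature frames Z
phoneme_logits = phoneme_predictor(features)  # (1, T_e, 34)
pitch_logits = melody_predictor(features)     # (1, T_e, 27)
```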

4.2.2. Dataset Division

For English–Korean adaptation, we used 13 books of the NIKLSEOUL corpus for training, 4 books for validation, and 2 books for testing. For speech–singing adaptation, we used 35 songs of the CSD corpus for training, 10 songs for validation, and 5 songs for testing. The split ratios for both sets of training, validation, and testing were approximately 7:2:1.
Figure 5, Figure 6 and Figure 7 show the phoneme and pitch distribution on these training, validation, and testing datasets, respectively. The distributions of the datasets are similar to one another, which implies that the datasets were adequately divided into training, validation, and testing.

4.2.3. Training Setup

Two fine-tuning strategies were used. Linear probing, as referred to below, freezes the pre-trained HuBERT encoder for the entire training time and updates only the additional predictor, in order to make the most of the priors. Full fine-tuning, as referred to below, updates the additional predictor with the encoder weights fixed for a few early epochs, and then updates the encoder weights together with the predictor in the following epochs, in order to make the best use of the current data. We essentially followed the default training settings provided by the Fairseq toolkit, version 0.12.2. All models were trained with the Adam optimizer [40]. Hyperparameters not mentioned below are consistent with the toolkit defaults. We aimed to maximize the use of the priors and, given the scale of our datasets, set the learning rate to 0.00002 for all phoneme recognition training processes. The pre-trained encoder was fixed for the first 10,000 updates using tri-stage learning rate scheduling [41] for phoneme recognition training, except for the training processes of the first two rows of the results in Table 2. The learning rate for linear probing in the second row of the COn results in Table 3 was 0.000001. The GPU used for computation was an RTX 3090 Ti.
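The sketch below illustrates the two strategies by freezing and unfreezing encoder parameters, with the Adam optimizer and learning rate 0.00002 from this section; the stand-in modules, the training-loop hook, and the step counter are illustrative assumptions.

```python
# Sketch of the two fine-tuning strategies: linear probing keeps the encoder
# frozen throughout, while full fine-tuning freezes it only for the first
# 10,000 updates and then unfreezes it. Stand-in modules are used here.
import torch
import torch.nn as nn

encoder = nn.Linear(39, 768)                           # stand-in for the HuBERT encoder
heads = [nn.Linear(768, 34), nn.Linear(768, 27)]       # phoneme and melody predictors

def set_encoder_frozen(frozen: bool):
    for p in encoder.parameters():
        p.requires_grad = not frozen

params = list(encoder.parameters()) + [p for h in heads for p in h.parameters()]
optimizer = torch.optim.Adam(params, lr=2e-5)          # lr = 0.00002 (Section 4.2.3)

# Linear probing: the encoder stays frozen for the whole run.
set_encoder_frozen(True)

# Full fine-tuning: unfreeze the encoder after the warm-up updates.
FREEZE_UPDATES = 10_000
def on_update(step):
    if step == FREEZE_UPDATES:
        set_encoder_frozen(False)
```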

4.3. Experimental Results

This section illustrates the validity of each step of the method with several experiments. The results of the English–Korean adaptation and the speech–singing adaptation are then presented, demonstrating the effectiveness of self-supervised pre-training and the benefits of joint pitch detection supervision.

4.3.1. English–Korean Adaptation

We used the HuBERT encoder pre-trained in a self-supervised manner on an English speech corpus, Librispeech [33]. We then attached a phoneme predictor to the pre-trained HuBERT encoder and fine-tuned it on the Korean speech corpus, NIKLSEOUL, using the CTC loss for adaptation.
Was the English speech representation prior helpful? As Table 2 shows, even though no Korean data were included in pre-training, fixing the HuBERT encoder, which was pre-trained on 960 h of English speech data from Librispeech, and fine-tuning only the attached phoneme predictor on the 180 h NIKLSEOUL dataset resulted in a phoneme-level error rate of approximately 21.28%. Additionally fine-tuning the encoder, rather than keeping it fixed, yielded an error rate of 7.44% after adapting to Korean. Compared to the baseline trained directly from scratch, this reflects a performance improvement of 17.55%. In this case, the self-supervised method exhibited an ability to generalize beyond the language system of its input data, since its targets were not labels of a specific language, even though the input data came from a single linguistic system. The margin between training from scratch and fine-tuning the pre-trained priors was 31.19% in phoneme-level error rate.

4.3.2. Speech–Singing Adaptation

After fine-tuning the model on the NIKLSEOUL dataset of 180 h, we used the CSD dataset, which included only 2.5 h of singing voice data, to adapt the model from the speech domain to the singing domain by joint pitch detection supervision. This adapted model is referred to as Model C in our results shown in Table 3. For comparison, we also have two other models: Model A and Model B in Table 3. Model B was trained only under phoneme supervision from the English–Korean adapted model, while Model A was trained only under phoneme supervision from a non-adapted model, without Korean speech prior. We evaluated the performance of phoneme recognition for these models on multi-level error rates using test datasets from both the CSD corpus and NIKLSEOUL corpus. Additionally, we assessed the performance of pitch detection. We used the linear probe method to evaluate the pitch detection performance of Model B, which was trained without pitch detection supervision. Table 3 shows the results.
Was the melodic supervision helpful? Considering the results for Model C, pitch detection supervision improved the F1-score of COn and the phoneme recognition performance at every level. In more detail, it slightly reduced the precision score but achieved a better recall score, meaning it tended to catch more correct onset times. On the other hand, Model B already had a fair degree of pitch detection performance, even without pitch detection supervision. This again shows that the phoneme recognition task is highly related to the pitch detection task, although only a modest additional benefit is obtained by combining phoneme recognition with melodic supervision.
We observe that performance in the singing domain is better for Model A, which lacks the English–Korean adaptation, than for Model B, which includes it. In the speech domain, however, Model A performs much worse than Model B. A desirable model is expected to perform well not only on singing voices, but also on speech. Considering the significant performance gap in the speech domain, the superiority of Model A in the singing domain should be regarded as a type of overfitting.

5. Conclusions and Further Research

In the field of ASR and ALT, there exists a significant disparity in the availability of data and the pace of development between English and many other languages. This discrepancy poses a considerable challenge for advancing ASR and ALT capabilities in languages where development is lagging. A potential solution lies in leveraging the wealth of research and resources available in the English language ecosystem by transferring learned priors to other languages. However, to ensure that these priors are broadly applicable and not constrained by language-specific characteristics, it is crucial to adopt approaches that capture general linguistic patterns. Self-supervised learning (SSL) has emerged as a promising strategy in this regard.
This study demonstrates the effectiveness of self-supervised methods, such as HuBERT, for cross-lingual adaptation in phoneme recognition. By leveraging an English corpus for pre-training, we successfully adapted the model to recognize Korean phonemes. This showcases the impressive generalization capabilities of self-supervised models, enabling language priors to be utilized, even across different languages and domains. Additionally, we validated the effectiveness of melodic supervision for phoneme recognition of singing voices. The transfer learning improved the phoneme-level error rate of Korean speech by as much as 31.19%. Also, the melodic supervision improved the phoneme-level error rate in singing voices by 0.55%.
There are several directions for further research. In general, melodic variations should not significantly affect phoneme recognition or ALT. Building on this observation, feature representation alignment [42] suggests that aligning the feature distributions of two domains could help mitigate irrelevant domain shifts. The concept of alignment could thus serve as a promising research direction for comparison with the results in this paper.
Furthermore, it is evident that melodic information can significantly aid in this alignment, despite the notable differences between traditional Korean folk songs and Western songs. Due to the scarcity of available Korean singing voice datasets, however, we were unable to test our configurations on datasets other than CSD. Furthermore, despite the considerable pronunciation differences between Korean and English discussed in Section 3.2, and the distant linguistic relationship between the two languages, we achieved the results outlined above. We anticipate that similar adaptations, following the three steps described, could be applied to other languages; the ease or difficulty of this process will largely depend on the pronunciation similarities and linguistic relationships between the target language and English, so an additional step of analyzing these similarities and relationships may be necessary. It should also be emphasized that there might be substantial potential for further exploration of diverse domain adaptation methods in audio processing. Strictly speaking, our findings only underscore the potential of SSL for tackling complex and dynamic speech-processing tasks across languages and domains. Future research could focus on further refining these techniques and exploring their applicability in various real-world scenarios, including the development of a general model incorporating human voice representations to enhance cross-lingual phoneme recognition with limited target language data.
The primary aim of our research is to preserve and document traditional Korean folk songs, which have been passed down orally without written text or musical notation. Thus, the phonemes recognized in this paper should go through a word recognition stage to obtain the proper lyrics to be preserved. Another direction of our work involves aligning lyrics with traditional Korean folk songs to support music education. For such applications, the approach in this paper would be helpful where only small-scale datasets are available.

Author Contributions

Conceptualization, W.W. and J.L.; methodology, W.W. and J.L.; software, W.W.; validation, W.W.; resource, J.L.; data curation, J.L.; writing—original draft preparation, W.W.; writing—review and editing, W.W. and J.L.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Research Foundation of Korea (NRF) under the Development of AI-Based Analysis/Synthesis Techniques for Korean Traditional Music Project (Funding Number: RS-2024-00340948).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

LibriSpeech ASR corpus, https://www.openslr.org/12 (accessed on 15 August 2024); CSD Dataset, https://zenodo.org/records/4916302 (accessed on 15 August 2024). The NIKLSEOUL (NIKL Corpus of Korean Short Stories Read in Seoul Dialect) data were obtained from the National Institute of Korean Language and are available at https://corpus.korean.go.kr/?lang=en (accessed on 15 August 2024) with the permission of the National Institute of Korean Language. Other datasets used to support the findings of this study can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, J. Recent advances in end-to-end automatic speech recognition. APSIPA Trans. Signal Inf. Process. 2022, 11, e8. [Google Scholar] [CrossRef]
  2. Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
  3. Malik, M.; Malik, M.K.; Mehmood, K.; Makhdoom, I. Automatic speech recognition: A survey. Multimed. Tools Appl. 2021, 80, 9411–9457. [Google Scholar] [CrossRef]
  4. Demirel, E.; Ahlbäck, S.; Dixon, S. Automatic lyrics transcription using dilated convolutional neural networks with self-attention. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–8. [Google Scholar]
  5. Gupta, C.; Yılmaz, E.; Li, H. Automatic lyrics alignment and transcription in polyphonic music: Does background music help? In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–9 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 496–500. [Google Scholar]
  6. Hosoya, T.; Suzuki, M.; Ito, A.; Makino, S.; Smith, L.A.; Bainbridge, D.; Witten, I.H. Lyrics recognition from a singing voice based on finite state automaton for music information retrieval. In Proceedings of the 6th International Society for Music Information Retrieval Conference (ISMIR), London, UK, 11–15 September 2005; pp. 532–535. [Google Scholar]
  7. Fujihara, H.; Goto, M.; Ogata, J. Hyperlinking lyrics: A method for creating hyperlinks between phrases in song lyrics. In Proceedings of the 9th International Society for Music Information Retrieval Conference (ISMIR), Philadelphia, PA, USA, 14–18 September 2008; pp. 281–286. [Google Scholar]
  8. Dzhambazov, G. Knowledge-Based Probabilistic Modeling for Tracking Lyrics in Music Audio Signals. Ph.D. Thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2017. [Google Scholar]
  9. Yong, S.; Su, L.; Nam, J. A phoneme-informed neural network model for note-level singing transcription. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  10. Gao, X.; Gupta, C.; Li, H. Automatic lyrics transcription of polyphonic music with lyrics-chord multi-task learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2280–2294. [Google Scholar] [CrossRef]
  11. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
  12. Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  13. Mohamed, A.; Lee, H.-Y.; Borgholt, L.; Havtorn, J.D.; Edin, J.; Igel, C.; Kirchhoff, K.; Li, S.-W.; Livescu, K.; Maaløe, L.; et al. Self-supervised speech representation learning: A review. IEEE J. Sel. Top. Signal Process. 2022, 16, 1179–1210. [Google Scholar] [CrossRef]
  14. Huang, J.; Benetos, E.; Ewert, S. Improving lyrics alignment through joint pitch detection. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 451–455. [Google Scholar]
  15. Chung, Y.-A.; Wu, C.-C.; Shen, C.-H.; Lee, H.-Y.; Lee, L.-S. Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv 2016, arXiv:1603.00982. [Google Scholar]
  16. Liu, A.H.; Chung, Y.-A.; Glass, J. Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv 2020, arXiv:2011.00406. [Google Scholar]
  17. van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  18. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
  19. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  20. Baevski, A.; Schneider, S.; Auli, M. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv 2019, arXiv:1910.05453. [Google Scholar]
  21. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  22. Meseguer-Brocal, G.; Cohen-Hadria, A.; Peeters, G. Dali: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. arXiv 2019, arXiv:1906.10606. [Google Scholar]
  23. Dabike, G.R.; Barker, J. Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. Interspeech. 2019, pp. 579–583. Available online: https://www.isca-archive.org/interspeech_2019/dabike19_interspeech.html (accessed on 21 September 2024).
  24. Zhang, C.; Yu, J.; Chang, L.; Tan, X.; Chen, J.; Qin, T.; Zhang, K. Pdaugment: Data augmentation by pitch and duration adjustments for automatic lyrics transcription. arXiv 2021, arXiv:2109.07940. [Google Scholar]
  25. Anna, M.K. Training phoneme models for singing with “songified” speech data. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 26–30 October 2015; Volume 30, p. 50. [Google Scholar]
  26. Mesaros, A.; Virtanen, T. Adaptation of a speech recognizer for singing voice. In Proceedings of the 2009 17th European Signal Processing Conference, Scotland, UK, 24–28 August 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1779–1783. [Google Scholar]
  27. Mesaros, A. Singing voice identification and lyrics transcription for music information retrieval invited paper. In Proceedings of the 2013 7th Conference on Speech Technology and Human-Computer Dialogue (SpeD), Cluj-Napoca, Romania, 16–19 October 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1–10. [Google Scholar]
  28. Ou, L.; Gu, X.; Wang, Y. Transfer learning of wav2vec 2.0 for automatic lyric transcription. arXiv 2022, arXiv:2207.09747. [Google Scholar]
  29. Deng, T.; Nakamura, E.; Yoshii, K. End-to-end lyrics transcription informed by pitch and onset estimation. In Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR, Bengaluru, India, 4–8 December 2022. [Google Scholar]
  30. Cho, J.; Park, H.-K. A comparative analysis of Korean-English phonological structures and processes for pronunciation pedagogy in interpretation training. Meta 2006, 51, 229–246. [Google Scholar] [CrossRef]
  31. O’Grady, W.; Archibald, J.; Aronoff, M.; Rees-Miller, J. Contemporary Linguistics: An Introduction; St. Martin’s: Bedford, UK, 2017. [Google Scholar]
  32. Nichols, E.; Morris, D.; Basu, S.; Raphael, C. Relationships between lyrics and melody in popular music. In Proceedings of the ISMIR 2009-11th International Society for Music Information Retrieval Conference, Utrecht, The Netherlands, 9–13 August 2009; pp. 471–476. [Google Scholar]
  33. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 5206–5210. [Google Scholar]
  34. National Institute of Korean Language. Nikl the Corpus of Reading Aloud Korean Short Stories by Seoul People (v.2.0). 2021. Available online: https://kli.korean.go.kr/corpus/main/requestMain.do (accessed on 21 September 2024).
  35. Choi, S.; Kim, W.; Park, S.; Yong, S.; Nam, J. Children’s song dataset for singing voice research. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada, 11–16 October 2020. [Google Scholar]
  36. Park, K. g2pK. 2019. Available online: https://github.com/Kyubyong/g2pk (accessed on 21 September 2024).
  37. National Institute of Korean Language. ‘Revised Romanization of Korean’. Available online: https://www.korean.go.kr/front_eng/roman/roman_01.do (accessed on 21 September 2024).
  38. Molina, E.; Barbancho-Perez, A.M.; Tardon-Garcia, L.J.; Barbancho-Perez, I. Evaluation framework for automatic singing transcription. In Proceedings of the 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 27–31 October 2014; pp. 567–572. [Google Scholar]
  39. Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the NAACL-HLT 2019: Demonstrations, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
  40. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  41. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779. [Google Scholar]
  42. Singhal, P.; Walambe, R.; Ramanna, S.; Kotecha, K. Domain adaptation: Challenges, methods, datasets, and applications. IEEE Access 2023, 11, 6973–7020. [Google Scholar] [CrossRef]
Figure 1. Overview of the SSL-based low-resource language ALT.
Figure 2. The proposed approach.
Figure 3. Vowel Charts of Korean (left) Phonemic Vowels and English (right) Phonemic Vowels [30,31].
Figure 4. Sample annotation. The different colors are used to distinguish the boundaries of syllables or pitches.
Figure 5. Phoneme distribution of the NIKLSEOUL corpus.
Figure 6. Phoneme distribution of the CSD corpus.
Figure 7. Pitch distribution of the CSD corpus.
Table 1. Korean and English phonemic consonants [30,31].

| | Bilabial | Labiodental | Dental | Alveolar | Palatoalveolar | Palatal | Velar | Labiovelar | Glottal |
| Common | p m | | | t s n | t∫ | j | k ŋ | w | h |
| English-only | b | f v | θ ð | d z l ɹ | ∫ ʒ dʒ | | g | | ʔ |
| Korean-only | pʰ p’ | | | tʰ t’ s’ ɾ | t∫ʰ t∫’ | | kʰ k’ | | |
Table 2. Comparison of English–Korean adaptation results. The downward arrow signifies that lower values are preferable.

| Prior | Fine-Tuning Strategy | Phoneme ER (%) ↓ | Syllable ER (%) ↓ | Consonant ER (%) ↓ | Vowel ER (%) ↓ |
| None | Full | 38.63 | 73.44 | 32.60 | 29.86 |
| En | Linear Probe | 21.28 | 49.43 | 17.99 | 18.48 |
| En | Full | 7.44 | 19.66 | 5.77 | 6.34 |
Table 3. Comparison of speech–singing adaptation results. The upward arrow signifies that higher values are preferable, whereas the downward arrow signifies that lower values are preferable. Error rates are reported on the CSD and NIKLSEOUL test sets; COn is measured on the CSD corpus.

| Model | CSD Phoneme ER (%) ↓ | CSD Syllable ER (%) ↓ | CSD Consonant ER (%) ↓ | CSD Vowel ER (%) ↓ | NIKLSEOUL Phoneme ER (%) ↓ | NIKLSEOUL Syllable ER (%) ↓ | NIKLSEOUL Consonant ER (%) ↓ | NIKLSEOUL Vowel ER (%) ↓ | COn Precision (%) ↑ | COn Recall (%) ↑ | COn F1-Score (%) ↑ |
| A 1 | 17.64 | 40.93 | 15.71 | 15.13 | 34.74 | 74.96 | 26.94 | 31.39 | \ | \ | \ |
| B 2 | 17.76 | 41.08 | 16.36 | 13.69 | 14.30 | 35.80 | 12.09 | 11.17 | 31.60 | 81.28 | 44.93 |
| C 3 | 17.09 | 39.48 | 15.32 | 13.50 | 15.68 | 38.62 | 13.30 | 12.67 | 31.48 | 82.89 | 45.13 |
1 Prior: En, Supervision: Phoneme; 2 Prior: En + Ko, Supervision: Phoneme; 3 Prior: En + Ko, Supervision: Phoneme + Melody.

Share and Cite

MDPI and ACS Style

Wu, W.; Lee, J. Phoneme Recognition in Korean Singing Voices Using Self-Supervised English Speech Representations. Appl. Sci. 2024, 14, 8532. https://doi.org/10.3390/app14188532

