TTSDS - Text-to-Speech Distribution Score

Abstract

Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.

Index Terms— speech synthesis evaluation, synthetic data, distribution analysis

1 Introduction

There has been a recent surge in synthetic speech generation quality enabled by systems using language modeling architectures generating discrete units [1] which are then used to reconstruct the waveform [2]. However, many of the most recent systems have been released to the community without accompanying papers and/or evaluation. This is understandable since the field is moving at a quick pace, and evaluation of synthetic speech is hard. The remaining systems are evaluated mostly with Mean Opinion Score (MOS) and its factors such as naturalness, speaker similarity, and sometimes intelligibility. Some systems also rely on MOS predicted with trained neural networks [3, 4]. Unfortunately, MOS is becoming less useful as real and synthetic speech get closer in quality, and cannot be compared across studies and over time [5].

Many factors, such as prosody, intelligibility, naturalness, and speaker similarity, contribute to perceived overall speech quality. Some works focus on the prosody factor by comparing pitch or duration of real to synthetic speech [6, 7, 8], while others include metrics on Word Error Rate (WER) to account for intelligibility or more general algorithmic measures such as Mel-Cepstral Distortion (MCD) [3, 6]. To the best of our knowledge, there is only one published effort which evaluated state-of-the-art TTS systems based on LLM architectures [9]. It uses crowdsourced A/B preference tests to produce an Elo rating for each system. As these ratings are the first comparative study of the next generation of TTS systems, they are a great resource. However, this type of evaluation comes with a set of challenges. Since the evaluation is centralized, it is up to the organizers to add newly released systems. Furthermore, human rating of the speech could change as time progresses, which has been shown to be the case for MOS scores [5]. Finally, adding more complex domains such as long-form speech synthesis, or testing the contribution of individual factors, requires running the evaluation from scratch.

Some solutions for objective evaluation of generative systems have emerged in other domains. In image generation, Fréchet Inception Distance (FID) [10] has become the de-facto standard, but while there have been attempts to apply these methods to TTS [11], they have not taken hold, perhaps due to the large number of required samples [12]. In NLP, the advances in the capabilities of Large Language Models (LLMs) have led to a number of published benchmarks, most ranking the model on a variety of tasks, such as the GLUE [13], SuperGLUE [14] and CoQA [15] benchmarks. For speech processing, the SUPERB benchmark also uses this task approach [16].

Unlike math problems, or automatic speech recognition (ASR), speech synthesis does not have one correct solution, but many possible ones, which makes evaluation difficult. However, we can use the concept of factors for evaluation. Just as we can test an LLM on mathematical ability and reading comprehension as parts of an overall measure of reasoning ability, we can test a TTS system on factors such as prosody or intelligibility, and we can combine them to reflect the overall performance of the system. This approach also gives us more information on why one system might be preferred over another. For example, for character voices in a video game, a system with higher scores in prosody might be preferred, while for a language-learning app intelligibility would be the most important factor.

In this paper, we devise a benchmark with the intelligibility, prosody, speaker, environment and general factors. We include intelligibility and prosody as they are important measures of synthetic speech quality [17]. We include the speaker category to evaluate how closely TTS systems can model realistic speakers [18]. We also include environment as a factor due to the prevalence of artifacts present in speech synthesis [19], and the difficulty some TTS systems have in generating speech with realistic recording conditions [8]. The general factor is similar to previous measures of speech distributions as represented by latent representations such as Fréchet DeepSpeech Distance [11]. Finally, we compute the overall TTSDS score as an average of all factors.

We compute the score for each factor by comparing the distributions of both high-dimensional features (e.g., embeddings) and scalar features (e.g., pitch) extracted from the speech. This comparison allows us to measure the deviation of synthetic speech from real speech without assuming predefined distributions, thereby avoiding common pitfalls associated with objective speech evaluation measures. For instance, intelligibility is often quantified using Word Error Rate (WER) [20], where lower values are typically preferred. However, if the target domain naturally exhibits high WER (e.g., children’s speech), a TTS system should also reflect this characteristic, producing higher WER accordingly. Therefore, we compare the utterance-level distribution of our features (such as WER) rather than their mean.
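To make the distinction between a mean and an utterance-level distribution concrete, the sketch below computes WER per utterance with a plain word-level edit distance and keeps the full list of values. The function names and the toy transcripts are illustrative assumptions and are not taken from the benchmark code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j]: edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def utterance_wers(pairs):
    """Return one WER value per (reference, hypothesis) pair."""
    return [word_error_rate(ref, hyp) for ref, hyp in pairs]


# Toy data (hypothetical): reference transcripts and ASR outputs.
pairs = [
    ("the cat sat on the mat", "the cat sat on the mat"),
    ("please call stella", "please call stellar now"),
]
wers = utterance_wers(pairs)
```

The resulting list, one value per utterance, is what would be compared against the corresponding list from a real speech dataset, instead of comparing two averages.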

We evaluate our benchmark by comparing to MOS scores and A/B test results obtained for 35 TTS systems from 2008, 2022 and 2024. An average of our factor scores shows correlation coefficients ranging from 0.60 to 0.83 for each time period, while the performance of state-of-the-art MOS prediction is less consistent, ranging from 0.05 to 0.85. Additionally, we observe a shift in priorities of human evaluators over time, with environment being more important for earlier systems, and prosody being more important for later ones. We make our benchmark suite and leaderboard openly available at https://ttsdsbenchmark.com.

2 Methodology

The first step of developing any measure is to define the concept being measured. Since there is no objective way of directly measuring the naturalness or quality of synthetic speech, we define our measure as “the distance between the distribution of real and synthetic speech”. The next step is to define relevant factors [21]. For speech we define the following five major factors.

A General factor which measures the distribution of the speech without any assumptions. We use self-supervised learning (SSL) representations of the speech.

An Environment factor which measures noise or distortion present in the speech. We use correlates for signal-to-noise ratio (SNR) and reverberation.

An Intelligibility factor which measures how easy the content of the speech is to recognize. We use WER obtained using the transcripts provided to the TTS systems and ASR systems.

A Prosody factor which measures how realistic the prosody of the speech is. We use a pitch extractor, an SSL representation of the prosody and a proxy for duration and speaking rate.

A Speaker factor which measures how close the speakers are to real ones. We use representations obtained from speaker verification systems.

Factor	Feature	Model or algorithm
Environment	Noise/Artifacts	VoiceFixer [22] + PESQ [23]
Environment	Noise/Artifacts	WADA SNR [24]
Speaker	Speaker Embedding	d-vector [25]
Speaker	Speaker Embedding	WeSpeaker [26]
Prosody	Segmental Length	Hubert [27] token length
Prosody	Pitch	WORLD [28]
Prosody	SSL Representations	MPM [29]
Intelligibility	ASR WER	wav2vec 2.0 [30]
Intelligibility	ASR WER	Whisper [31]
General	SSL Representations	Hubert [27]
General	SSL Representations	wav2vec 2.0 [30]

Table 1: Features used in the benchmark and their respective factors. The overall TTSDS score is computed as an average of individual factor scores.

Since none of these factors can be measured directly, we rely on several features which correlate to the factors. The extensive body of work in speech processing and representation learning provides these features, including representations from self-supervised models, statistical features, and algorithmic features. For each feature derived from synthetic data, we define its score as how close it is to the same feature derived from real speech.

2.1 Features extracted from Speech

Here we identify the specific features (i.e., data points derived from speech) that represent each factor. Table 1 summarizes the models and algorithms that we use to extract these features. For each of the aforementioned factors, we use two to three features to achieve a robust benchmark despite the low number of 80 to 100 samples per system. For measuring the General factor score, we use frame-level self-supervised representations extracted from the middle layers of the Hubert base [27] and wav2vec 2.0 base [30] models. For the Environment distribution, we use two one-dimensional correlates of noise present in the signal: we use VoiceFixer [22] to remove noise from the speech, and then measure PESQ [23] between the enhanced sample and the original one; we also use WADA SNR [24] to estimate the SNR of each sample. For Intelligibility, we calculate WER using the reference transcripts and automatic transcripts generated using wav2vec 2.0 [30] fine-tuned on 960 hours of LibriSpeech [32] and Whisper (small) [31]. Prosody is quantified using frame-level representations from a self-supervised prosody model [29] and frame-level pitch features [28]. Additionally, we get a proxy for the segmental durations by using Hubert tokens (with 100 clusters) and extracting their lengths (i.e., how many of the same token occur in a row). Finally, for the Speaker factor, we use d-vectors [25] and the more recent WeSpeaker [26] representations.
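The segmental duration proxy described above amounts to run-length encoding a sequence of discrete token IDs. A minimal sketch, assuming the frame-level cluster IDs have already been extracted (the token values below are hypothetical, as is the function name):

```python
from itertools import groupby


def token_run_lengths(tokens):
    """Length of each run of identical consecutive tokens."""
    return [sum(1 for _ in run) for _, run in groupby(tokens)]


# Hypothetical frame-level cluster IDs for a short utterance.
tokens = [17, 17, 17, 4, 4, 92, 92, 92, 92, 4]
lengths = token_run_lengths(tokens)  # [3, 2, 4, 1]
```

Pooling these run lengths over all utterances of a dataset gives a one-dimensional sample that can be compared between synthetic and real speech.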

2.2 Speech Distributions

The distribution of a feature can be computed on any audio dataset, whether it consists of synthetic speech, real speech, or noise. We let $D$ represent an audio dataset, and $X$ be the feature derived from the dataset. We can sample observed values $x_{i}$ from the empirical distribution $\hat{P}(X|D)$ as:

$$x_{i}\sim\hat{P}(X|D)$$

where $x_{i}$ can be a scalar for one-dimensional features or a vector for multi-dimensional features.

2.3 Computing Distances Between Distributions

To compare the distributions of features derived from different datasets, we use the 2-Wasserstein distance $W_{2}$, a member of the family of optimal-transport distances often referred to as the Earth Mover’s Distance. $W_{2}$ measures the amount of “work” needed to transform one probability distribution into another, formulated as an optimal transport problem [33]. This distance measure is widely used in computer vision as the Fréchet Inception Distance (FID) [10] and in audio processing as the Fréchet Audio Distance [34]. Here, we formulate how to compute $W_{2}$ given the empirical distributions of a feature $X$ computed on datasets $D_{1}$ and $D_{2}$ for both the one-dimensional and multi-dimensional case. We denote the corresponding empirical probability distributions $\hat{P}(X|D_{*})$ as $\hat{P}_{*}$.

One-Dimensional Case

In the one-dimensional case, $W_{2}$ can be computed as a function of the ordered samples [35]:

$$W_{2}(\hat{P}_{1},\hat{P}_{2})=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-y_{i}\right)^{2}} \tag{1}$$

where $\{x_{1},\ldots,x_{n}\}$ denote sorted samples of $\hat{P}(X|D_{1})$, and $\{y_{1},\ldots,y_{n}\}$ denote sorted samples of $\hat{P}(X|D_{2})$.
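The one-dimensional case translates directly into a few lines of code. The sketch below assumes both samples have the same size $n$, as in Equation 1; the function name is illustrative.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-sized 1-D samples (Eq. 1)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("both samples must be non-empty and of equal size")
    # Sorting pairs the i-th smallest value of one sample with the
    # i-th smallest value of the other.
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    mean_sq = sum((x - y) ** 2 for x, y in zip(xs_sorted, ys_sorted)) / len(xs)
    return math.sqrt(mean_sq)


# Shifting every value by 1 yields a distance of exactly 1.
d = wasserstein2_1d([1.0, 3.0, 2.0], [2.0, 4.0, 3.0])
```

If the two samples differ in size, one option (an assumption, not something the text specifies) is to subsample the larger one or to compare quantiles on a common grid.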

Multi-Dimensional Case

We can compute the $W_{2}$ distance between two normally distributed multi-dimensional $\hat{P}_{1}$ and $\hat{P}_{2}$ using their mean vectors $\mu_{1}$ and $\mu_{2}$ and their covariance matrices $\Sigma_{1}$ and $\Sigma_{2}$ [10]:

$$W_{2}(\hat{P}_{1},\hat{P}_{2})=\sqrt{\lVert\mu_{1}-\mu_{2}\rVert^{2}+D_{B}(\Sigma_{1},\Sigma_{2})} \tag{2}$$

where $D_{B}$ is the unnormalized Bures metric defined as

$$D_{B}(\Sigma_{1},\Sigma_{2})=\operatorname{trace}\left(\Sigma_{1}+\Sigma_{2}-2\left(\Sigma_{2}^{1/2}\Sigma_{1}\Sigma_{2}^{1/2}\right)^{1/2}\right)$$

We can use this approach since latent representations of neural networks are presumably normally distributed [10].
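As a sanity check on the multi-dimensional formula, two special cases can be written out. They follow from the trace form of the Bures term with the sum of both covariance matrices inside the trace, and are added here for illustration; they are not part of the original derivation.

```latex
% Case 1: identical covariances. The Bures term vanishes and only the
% distance between the means remains.
\Sigma_{1}=\Sigma_{2}=\Sigma
\;\Rightarrow\;
D_{B}=\operatorname{trace}\left(2\Sigma-2\left(\Sigma^{1/2}\Sigma\,\Sigma^{1/2}\right)^{1/2}\right)=0,
\qquad
W_{2}=\lVert\mu_{1}-\mu_{2}\rVert

% Case 2: commuting covariances (e.g. both diagonal). The Bures term
% becomes a squared Frobenius norm between the matrix square roots.
\Sigma_{1}\Sigma_{2}=\Sigma_{2}\Sigma_{1}
\;\Rightarrow\;
D_{B}=\operatorname{trace}\left(\left(\Sigma_{1}^{1/2}-\Sigma_{2}^{1/2}\right)^{2}\right)
=\lVert\Sigma_{1}^{1/2}-\Sigma_{2}^{1/2}\rVert_{F}^{2}

% For diagonal covariances with per-dimension standard deviations
% \sigma_{1,k} and \sigma_{2,k} this reduces to
D_{B}=\sum_{k}\left(\sigma_{1,k}-\sigma_{2,k}\right)^{2}
```

The diagonal case shows that, per dimension, the multi-dimensional distance penalizes differences in both mean and spread, which is the same behavior as the one-dimensional distance applied to two Gaussians.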

2.4 Overall Score

To evaluate how close a synthetic speech dataset $D_{\text{syn}}$ is to real speech given a particular feature $X$, we compute its Wasserstein distance $W_{2}$ to each dataset in a set of real reference datasets $\mathfrak{D}_{\text{real}}$ and in a set of distractor (noise) datasets $\mathfrak{D}_{\text{noise}}$. We find the smallest $W_{2}$ among the real and noise datasets respectively as

$$\begin{split}W_{\text{real}}&=\min_{D_{\text{real}}\in\mathfrak{D}_{\text{real}}}W_{2}(\hat{P}_{\text{syn}},\hat{P}_{\text{real}})\\ W_{\text{noise}}&=\min_{D_{\text{noise}}\in\mathfrak{D}_{\text{noise}}}W_{2}(\hat{P}_{\text{syn}},\hat{P}_{\text{noise}})\end{split}$$

We define the overall score (ranging from 0 to 100) as

$$S=100\times\frac{W_{\text{noise}}}{W_{\text{real}}+W_{\text{noise}}}$$

Any score can be intuitively interpreted: if $S$ is greater than 50, then $D_{\text{syn}}$ is more similar to the closest real speech dataset than to the closest noise dataset for a particular feature. An example of this can be seen in Figure 1, in which we show the difference (for an SSL representation feature) between the best-performing and the worst-performing system in the TTS Arena dataset. The higher score of system (a) corresponds to a larger overlap with the reference dataset and a smaller overlap with the noise dataset than system (b).
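The scoring step can be sketched as follows, under the interpretation stated above that a score above 50 means the synthetic data is closer to real speech than to noise. The function name and the distance values are hypothetical; the distances would come from the $W_{2}$ computations of Section 2.3.

```python
def feature_score(w2_to_real, w2_to_noise):
    """Score in [0, 100] from W2 distances to real and noise datasets.

    Takes the closest real and the closest noise dataset and normalizes
    so that values above 50 mean "closer to real speech than to noise".
    """
    w_real = min(w2_to_real)
    w_noise = min(w2_to_noise)
    if w_real + w_noise == 0:
        raise ValueError("distances to real and noise are both zero")
    return 100.0 * w_noise / (w_real + w_noise)


# Hypothetical distances of one synthetic dataset to three real
# reference datasets and two noise datasets.
score = feature_score([0.8, 0.5, 1.2], [4.5, 6.0])  # 90.0
```

A dataset that is equally far from its closest real and closest noise reference scores exactly 50, and a dataset identical to one of the real references scores 100.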

Fig. 1: Distribution of the best (left) and worst (right) TTS Arena system with respect to Hubert representations. $S$ denotes the score.

3 Experiment Setup

To validate our benchmark, we compare our factors and TTSDS scores to subjective measures across three different time periods. The Blizzard 2008 challenge [36] compared 22 TTS systems across several tasks using MOS. We choose the “Voice A” audio-book task with 15 hours of data. Later, the “Back to the Future” (BTTF) study [5] compared unit selection, hybrid and statistical parametric HMM-based systems from the Blizzard 2013 challenge [37] with deep learning systems inspired by the Blizzard 2021 challenge [38] based on FastPitch [39] and Tacotron [40]. The latest systems, which are most commonly based on discrete speech representations generatively modeled by LLM-like systems [3], are represented by the TTS Arena leaderboard [9], which is a crowdsourced effort to evaluate these systems. Only systems released in 2023 and 2024 are featured in this evaluation. Since the data is not publicly released, we reproduce datasets for as many of the systems as possible. We had to exclude MetaVoice and GPT-SoVITS due to reference audio length requirements, and MeloTTS due to only female voices being available. This leaves us with 9 of the 12 publicly available systems. As conditioning for the TTS systems, we use speaker reference waveforms from the LibriTTS [41] test set, coupled with unrelated transcripts from the same set to make sure the data could not have been encountered during training.

Our reference speech datasets are LibriTTS [41], LibriTTS-R [42], LJSpeech [43], VCTK [44], and the training sets for the Blizzard challenges [5, 36]. We sample 100 utterances at random from each dataset (if available, from their test sets). For distractor/noise datasets, we use the ESC dataset of background noises [45], as well as the following generated noise – random uniform, random normal, all zeros and all ones.

We compare our benchmark with two MOS prediction methods. The first method is WVMOS [46], which uses a wav2vec 2.0 model [30] fine-tuned to predict MOS scores. Its system-level correlation coefficients range from 0.68 to 0.96 for different corruptions of speech and their corresponding MOS scores [46]. The second method is UTMOS [47], which is an ensemble MOS prediction system that won several categories in the 2022 VoiceMOS challenge [48], and has since been used for evaluation of several leading TTS systems [4, 49].

For all system $\times$ feature combinations, we compute the score as described in Section 2.4. We average all features for each factor, which gives us the corresponding factor score. Averaging all factor scores in turn gives the TTSDS score.
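The two averaging steps can be sketched as below. The per-feature values for the Prosody factor are hypothetical; the factor scores passed to the final average are the StyleTTS 2 row of Table 2. Function and key names are illustrative.

```python
from statistics import mean


def factor_scores(feature_scores):
    """Average the feature scores within each factor."""
    return {factor: mean(scores.values())
            for factor, scores in feature_scores.items()}


def ttsds_score(factors):
    """Unweighted average over all factor scores."""
    return mean(factors.values())


# Hypothetical per-feature scores for a single factor.
prosody_only = factor_scores(
    {"Prosody": {"pitch": 90.0, "ssl_prosody": 88.0, "token_length": 92.0}}
)

# Factor scores of StyleTTS 2 as reported in Table 2.
styletts2 = {
    "General": 93.7,
    "Environment": 84.7,
    "Intelligibility": 91.6,
    "Prosody": 89.8,
    "Speaker": 71.5,
}
overall = round(ttsds_score(styletts2), 1)  # 86.3, matching Table 2
```

Because the average is unweighted, a use case that cares mostly about one factor (for example intelligibility) can simply read off that factor score instead of the overall one.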

4 Results

Given the scores of all systems and the subjective measures reported for the given datasets (MOS for Blizzard’08 and BTTF; Elo Rating for TTS Arena), we report their Spearman rank correlation coefficients in Figure 2. We observe that both our factors and the baseline MOS prediction systems vary strongly between the different corpora, but the TTSDS score correlates consistently well with subjective evaluation results.
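For reference, a Spearman rank correlation can be computed as the Pearson correlation of the rank vectors, with tied values sharing their average rank. The sketch below is a plain-Python illustration, not the evaluation code used for Figure 2; the inputs are the TTSDS and Elo columns of Table 2.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# TTSDS scores and Elo ratings of the nine TTS Arena systems (Table 2).
ttsds = [86.3, 85.6, 86.4, 83.9, 84.0, 83.9, 86.2, 81.4, 83.7]
elo = [1237, 1232, 1158, 1149, 1140, 1126, 1120, 1114, 1029]
rho = spearman(ttsds, elo)
```

Since Table 2 reports scores rounded to one decimal, which introduces a tie, the value obtained this way need not match the figure exactly.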

The baseline MOS prediction methods achieve mixed results for the Blizzard’08 and BTTF data. UTMOS and WVMOS each achieve a high correlation on one of the two datasets while only yielding low correlation on the other. We hypothesize that UTMOS might have included unit-selection systems in its training data, but has not encountered enough variants of the FastPitch/Tacotron systems present in BTTF. The inverse seems to be the case for WVMOS. For TTS Arena, neither system performs well. In summary, these MOS prediction systems sometimes achieve high correlation with ground-truth MOS, but do not seem to generalize.

4.1 TTSDS Benchmark

We now discuss the individual scores of our benchmark. How the correlation of each of these scores with subjective evaluation develops over time can be seen in Figure 3.

The General score shows some correlation with human evaluation results, but the correlation is generally low. The General score only slightly outperforms MOS prediction for TTS Arena, and shows the lowest correlation of all factors for the Blizzard’08 systems. For unit selection voices, this might be explained by the fact that they consist of real speech samples. However, speaker verification representations should suffer from the same problem, and they do not seem to be affected as much.

System	UTMOS	WVMOS	Gen	Env	Int	Pro	Spk	TTSDS	Elo Rating
StyleTTS 2 [7]	4.36	4.48	93.7	84.7	91.6	89.8	71.5	86.3	1237
XTTSv2 [4]	3.89	4.36	94.3	79.3	91.4	90.5	72.6	85.6	1232
OpenVoice [50]	4.10	4.57	91.7	88.0	91.6	91.8	68.8	86.4	1158
WhisperSpeech [51]	3.78	3.89	90.0	83.9	92.2	80.7	72.4	83.9	1149
Parler TTS [52]	3.97	4.16	94.7	80.8	87.5	83.0	74.1	84.0	1140
Vokan TTS [53]	3.80	4.22	88.6	85.1	91.6	85.3	69.1	83.9	1126
OpenVoice v2 [50]	4.29	4.75	90.7	91.2	91.6	88.6	68.7	86.2	1120
VoiceCraft 2 [54]	4.21	3.71	87.0	78.0	91.6	84.4	66.0	81.4	1114
Pheme [6]	3.92	4.26	94.0	81.9	91.5	85.1	66.1	83.7	1029

Table 2: Ranking, factor scores, TTSDS score and MOS predictions for the TTS Arena systems.

The Environment score has a low correlation with subjective measures for both Blizzard’08 and TTS Arena, but it is interestingly the most important factor for BTTF. Because BTTF consists of both deep learning systems from 2021 and systems from 2013, this factor might pick up on artifacts which are only present in the latter. Meanwhile, for the Blizzard’08 systems, these artifacts might be similar enough between systems that listeners did not prioritize them in evaluation, while for modern systems in TTS Arena, hardly any noise or artifacts are present.

The Intelligibility score shows a high correlation for Blizzard’08, but it is the only one of our scores showing a negative correlation for BTTF. Again, this could be due to the difference between unit selection and neural voices, with the former perhaps having more realistic intelligibility, but worse naturalness as perceived by humans.

The Prosody score correlates consistently across datasets, which might be in part due to the diversity of the underlying features (i.e., pitch, SSL prosody representations and segmental durations). Its correlation also increases over time, with the TTS Arena systems showing the highest correlation with our prosody score. This supports the intuition that good prosody has always been a factor in subjective evaluation. The increase in correlation might indicate that human evaluators focus more on the prosody of the speech as other factors such as intelligibility or noise conditions have vastly improved with modern systems.

The Speaker score also shows high correlation for Blizzard’08 and TTS Arena, but fails for BTTF. We believe this is due to older unit-selection systems included in BTTF producing a natural speaker embedding for concatenated parts of real speech. The effect is pronounced: for BTTF, we only achieve a significant TTSDS score correlation when the Speaker factor is excluded.

The TTSDS score achieves higher correlation than any single factor for all datasets included in our study, despite the low number of 80 to 100 samples per dataset. One of the baseline MOS prediction networks still performed better for the early Blizzard’08 systems, but both MOS prediction networks were significantly outperformed by our benchmark for the more modern systems. However, individual factors often show lower correlations with MOS than the baseline systems, highlighting the need for combining several factors. We hypothesize that this might be the reason measures similar to the Fréchet Inception Distance [10] for computer vision have not become popular for speech evaluation: with the low number of samples typically used for TTS evaluation, and the many factors contributing to what “good” speech synthesis is, a single distance measure might not be enough to show correlation with human evaluation.

Table 2 shows our benchmark compared to MOS prediction and the subjective human evaluation rating from TTS Arena. While UTMOS correctly predicts the best system, the other scores by the MOS prediction systems show little to no correlation with Elo ratings; our prosody factor, speaker factor and overall TTSDS scores correlate well with the Elo ratings. However, OpenVoice v2 [50] is scored highly by TTSDS but achieved low scores in TTS Arena – this might be due to differences in configuration, as the details for generating the speech used in TTS Arena are not public. To evaluate whether our benchmark could be used for system selection, we perform a Wilcoxon signed-rank test (Figure 4). We observe that while the worst-performing systems can generally be distinguished from the highest-performing ones, there is no statistically significant difference between the better-performing systems. Finding significant differences between TTS systems has been difficult, even with previous subjective evaluation methods [5, 36]. However, using more speech samples and features for future iterations of TTSDS could mitigate this.

5 Conclusion

In this work, we proposed a benchmark assessing prosody, speaker identity, intelligibility, environment, and general distribution of synthetic speech. Evaluating 35 TTS systems from 2008 to 2024, our benchmark showed strong correlation with human evaluations (0.60 to 0.83). This highlights the robustness and adaptability of our approach to evolving evaluation criteria. Individual factors alone showed limited correlation, but their combination significantly outperformed traditional MOS prediction systems, especially for modern TTS systems. Our results underscore the importance of intelligibility and prosody, and the need for TTS systems to replicate realistic recording conditions and speaker characteristics. We revealed limitations in existing MOS prediction systems, emphasizing the need for a nuanced approach to TTS evaluation. High correlation with human evaluations suggests our benchmark provides a reliable and comprehensive framework for assessing synthetic speech quality.

References

[1] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv:2210.13438, 2022.
[2] D. Lyth and S. King, “Natural language guidance of high-fidelity text-to-speech with synthetic annotations,” arXiv:2402.01912, 2024.
[3] S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, “VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers,” arXiv:2406.05370, 2024.
[4] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, et al., “XTTS: a massively multilingual zero-shot text-to-speech model,” arXiv:2406.04904, 2024.
[5] S. Le Maguer, S. King, and N. Harte, “Back to the future: Extending the Blizzard Challenge 2013,” in INTERSPEECH, 2022.
[6] P. Budzianowski, T. Sereda, T. Cichy, and I. Vulić, “Pheme: Efficient and conversational speech generation,” arXiv:2401.02839, 2024.
[7] Y. A. Li, C. Han, V. Raghavan, G. Mischler, and N. Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” NeurIPS, 2024.
[8] C. Minixhofer, O. Klejch, and P. Bell, “Evaluating and reducing the distance between synthetic and real speech distributions,” in INTERSPEECH, 2023.
[9] mrfakename, V. Srivastav, C. Fourrier, L. Pouget, Y. Lacombe, main, and S. Gandhi, “Text to speech arena,” https://huggingface.co/spaces/TTS-AGI/TTS-Arena, 2024.
[10] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” NeurIPS, 2017.
[11] A. Gritsenko, T. Salimans, R. v. d. Berg, J. Snoek, and N. Kalchbrenner, “A spectral energy distance for parallel speech synthesis,” NeurIPS, 2020.
[12] L.-W. Chen, S. Watanabe, and A. Rudnicky, “A vector quantized approach for text to speech synthesis on real-world spontaneous speech,” in AAAI, 2023.
[13] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in EMNLP, 2018.
[14] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “SuperGLUE: A stickier benchmark for general-purpose language understanding systems,” NeurIPS, 2019.
[15] S. Reddy, D. Chen, and C. D. Manning, “CoQA: A conversational question answering challenge,” TACL, 2019.
[16] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, et al., “SUPERB: Speech processing universal performance benchmark,” arXiv:2105.01051, 2021.
[17] N. Campbell, Evaluation of Speech Synthesis, pp. 29–64, Springer, 2007.
[18] Y. Fan, Y. Qian, F. K. Soong, and L. He, “Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis,” in ICASSP, 2015.
[19] P. Wagner, J. Beskow, S. Betz, J. Edlund, J. Gustafson, G. Eje Henter, S. Le Maguer, Z. Malisz, É. Székely, C. Tånnander, et al., “Speech synthesis evaluation - state-of-the-art assessment and suggestion for a novel research program,” in SSW, 2019.
[20] E. Cooper, W.-C. Huang, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “A review on subjective and objective evaluation of synthetic speech,” Acoustical Science and Technology, 2024.
[21] M. Viswanathan and M. Viswanathan, “Measuring speech quality for text-to-speech systems: development and assessment of a modified mean opinion score (MOS) scale,” CSL, 2005.
[22] H. Liu, Q. Kong, Q. Tian, Y. Zhao, D. Wang, C. Huang, and Y. Wang, “VoiceFixer: Toward general speech restoration with neural vocoder,” arXiv:2109.13731, 2021.
[23] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in ICASSP, 2001.
[24] C. Kim and R. M. Stern, “Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis,” INTERSPEECH, 2008.
[25] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in ICASSP, 2018.
[26] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “WeSpeaker: A research and production oriented speaker embedding learning toolkit,” in ICASSP, 2023.
[27] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” TASLP, 2021.
[28] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, 2016.
[29] S. Wallbridge and C. Minixhofer, “Masked prosody model,” https://huggingface.co/cdminix/masked_prosody_model, 2023.
[30] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” NeurIPS, 2020.
[31] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in ICASSP, 2015.
[33] L. N. Vaserstein, “Markov processes over denumerable products of spaces, describing large systems of automata,” Problemy Peredachi Informatsii, 1969.
[34] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” INTERSPEECH, 2019.
[35] S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde, “Sliced-Wasserstein autoencoder: An embarrassingly simple generative model,” arXiv:1804.01947, 2018.
[36] S. King, R. A. Clark, C. Mayo, and V. Karaiskos, “The Blizzard Challenge 2008,” in The Blizzard Challenge Workshop, 2008.
[37] S. King and V. Karaiskos, “The Blizzard Challenge 2013,” in The Blizzard Challenge Workshop, 2013.
[38] Z. Ling, X. Zhou, and S. King, “The Blizzard Challenge 2021,” in The Blizzard Challenge Workshop, 2021.
[39] A. Łańcucki, “FastPitch: Parallel text-to-speech with pitch prediction,” in ICASSP, 2021.
[40] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” INTERSPEECH, 2017.
[41] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” INTERSPEECH, 2019.
[42] Y. Koizumi, H. Zen, S. Karita, Y. Ding, K. Yatabe, N. Morioka, M. Bacchiani, Y. Zhang, W. Han, and A. Bapna, “LibriTTS-R: A restored multi-speaker text-to-speech corpus,” INTERSPEECH, 2023.
[43] K. Ito and L. Johnson, “The LJ speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
[44] J. Yamagishi, “English multi-speaker corpus for CSTR voice cloning toolkit,” http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html, 2013.
[45] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in ACM Multimedia, 2015.
[46] P. Andreev, A. Alanov, O. Ivanov, and D. Vetrov, “HiFi++: a unified framework for neural vocoding, bandwidth extension and speech enhancement,” arXiv:2203.13086, 2022.
[47] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” INTERSPEECH, 2022.
[48] W.-C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “The VoiceMOS Challenge 2022,” INTERSPEECH, 2022.
[49] Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang, et al., “NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,” arXiv:2403.03100, 2024.
[50] Z. Qin, W. Zhao, X. Yu, and X. Sun, “OpenVoice: Versatile instant voice cloning,” arXiv:2312.01479, 2023.
[51] J. P. Cłapa, “WhisperSpeech,” https://github.com/collabora/WhisperSpeech, 2024.
[52] Y. Lacombe, V. Srivastav, and S. Gandhi, “Parler TTS,” https://github.com/huggingface/parler-tts, 2024.
[53] ButterCream, “Vokan,” https://huggingface.co/ShoukanLabs/Vokan, 2024.
[54] P. Peng, P.-Y. Huang, D. Li, A. Mohamed, and D. Harwath, “VoiceCraft: Zero-shot speech editing and text-to-speech in the wild,” arXiv:2403.16973, 2024.