TTSDS - Text-to-Speech Distribution Score

Abstract

Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.

Index Terms—  speech synthesis evaluation, synthetic data, distribution analysis

1 Introduction

There has been a recent surge in synthetic speech generation quality enabled by systems using language modeling architectures generating discrete units [1] which are then used to reconstruct the waveform [2]. However, many of the most recent systems have been released to the community without accompanying papers and/or evaluation. This is understandable since the field is moving at a quick pace, and evaluation of synthetic speech is hard. The remaining systems are evaluated mostly with Mean Opinion Score (MOS) and its factors such as naturalness, speaker similarity, and sometimes intelligibility. Some systems also rely on MOS predicted with trained neural networks [3, 4]. Unfortunately, MOS is becoming less useful as real and synthetic speech get closer in quality, and cannot be compared across studies and over time [5].

Many factors, such as prosody, intelligibility, naturalness, and speaker similarity, contribute to perceived overall speech quality. Some works focus on the prosody factor by comparing pitch or duration of real to synthetic speech [6, 7, 8], while others include metrics on Word Error Rate (WER) to account for intelligibility or more general algorithmic measures such as Mel-Cepstral Distortion (MCD) [3, 6]. To the best of our knowledge, there is only one published effort which evaluated state-of-the-art TTS systems based on LLM architectures [9]. It uses crowdsourced A/B preference tests to produce an Elo rating for each system. As these ratings are the first comparative study of the next generation of TTS systems, they are a great resource. However, this type of evaluation comes with a set of challenges. Since the evaluation is centralized, it is up to the organizers to add newly released systems. Furthermore, human rating of the speech could change as time progresses, which has been shown to be the case for MOS scores [5]. Finally, adding more complex domains such as long-form speech synthesis, or testing the contribution of individual factors, requires running the evaluation from scratch.

Some solutions for objective evaluation of generative systems have emerged in other domains. In image generation, Fréchet Inception Distance (FID) [10] has become the de-facto standard, but while there have been attempts to apply these methods to TTS [11], they have not taken hold, perhaps due to the large number of required samples [12]. In NLP, the advances in the capabilities of Large Language Models (LLMs) have led to a number of published benchmarks, most ranking the model on a variety of tasks, such as the GLUE [13], SuperGLUE [14] and CoQA [15] benchmarks. For speech processing, the SUPERB benchmark also uses this task approach [16].

Unlike math problems or automatic speech recognition (ASR), speech synthesis does not have one correct solution, but many possible ones, which makes evaluation difficult. However, we can use the concept of factors for evaluation. Just as we can test an LLM on mathematical ability and reading comprehension as parts of an overall measure of reasoning ability, we can test a TTS system on factors such as prosody or intelligibility, and we can combine them to reflect the overall performance of the system. This approach also gives us more information on why one system might be preferred over another. For example, for character voices in a video game, a system with higher scores in prosody might be preferred, while for a language-learning app intelligibility would be the most important factor.

In this paper, we devise a benchmark with the intelligibility, prosody, speaker, environment and general factors. We include intelligibility and prosody as they are important measures of synthetic speech quality [17]. We include the speaker factor to evaluate how closely TTS systems can model realistic speakers [18]. We also include environment as a factor due to the prevalence of artifacts present in speech synthesis [19], and the difficulty some TTS systems have in generating speech with realistic recording conditions [8]. The general factor is similar to previous measures of speech distributions as represented by latent representations such as Fréchet DeepSpeech Distance [11]. Finally, we compute the overall TTSDS score as an average of all factors.

We compute the score for each factor by comparing the distributions of both high-dimensional features (e.g., embeddings) and scalar features (e.g., pitch) extracted from the speech. This comparison allows us to measure the deviation of synthetic speech from real speech without assuming predefined distributions, thereby avoiding common pitfalls associated with objective speech evaluation measures. For instance, intelligibility is often quantified using Word Error Rate (WER) [20], where lower values are typically preferred. However, if the target domain naturally exhibits high WER (e.g., children’s speech), a TTS system should also reflect this characteristic, producing higher WER accordingly. Therefore, we compare the utterance-level distribution of our features (such as WER) rather than their mean.
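As a minimal sketch of what an utterance-level feature looks like, the following Python snippet computes WER per utterance with a word-level edit distance, and then contrasts two hypothetical sets of WER values that share a mean but not a distribution. The function name and the toy values are illustrative assumptions, not part of the benchmark implementation.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j]: edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical per-utterance WER values for two datasets (not from the paper).
consistent = [0.10, 0.10, 0.10, 0.10]
bimodal = [0.00, 0.00, 0.20, 0.20]

# Both datasets have the same mean WER, yet their sorted values differ,
# which only a distribution-level comparison can pick up.
print(mean(consistent), mean(bimodal))
print(sorted(consistent) == sorted(bimodal))
```

A mean-based comparison would treat these two datasets as identical, whereas a distance between the two empirical distributions would not.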

We evaluate our benchmark by comparing to MOS scores and A/B test results obtained for 35 TTS systems from 2008, 2022 and 2024. An average of our factor scores shows correlation coefficients ranging from 0.60 to 0.83 for each time period, while the performance of state-of-the-art MOS prediction is less consistent, ranging from 0.05 to 0.85. Additionally, we observe a shift in priorities of human evaluators over time, with environment being more important for earlier systems, and prosody being more important for later ones. We make our benchmark suite and leaderboard openly available at https://ttsdsbenchmark.com.

2 Methodology

The first step of developing any measure is to define the concept being measured. Since there is no objective way of directly measuring the naturalness or quality of synthetic speech, we define our measure as “the distance between the distribution of real and synthetic speech”. The next step is to define relevant factors [21]. For speech we define the following five major factors.

A General factor which measures the distribution of the speech without any assumptions. We use self-supervised learning (SSL) representations of the speech.

An Environment factor which measures noise or distortion present in the speech. We use correlates for signal-to-noise ratio (SNR) and reverberation.

An Intelligibility factor which measures how easy the content of the speech is to recognize. We use WER obtained using the transcripts provided to the TTS systems and ASR systems.

A Prosody factor which measures how realistic the prosody of the speech is. We use a pitch extractor, an SSL representation of the prosody and a proxy for duration and speaking rate.

A Speaker factor which measures how close the speakers are to real ones. We use representations obtained from speaker verification systems.

| Factor | Feature type | Feature |
|---|---|---|
| Environment | Noise/Artifacts | VoiceFixer [22] + PESQ [23] |
| Environment | Noise/Artifacts | WADA SNR [24] |
| Speaker | Speaker Embedding | d-vector [25] |
| Speaker | Speaker Embedding | WeSpeaker [26] |
| Prosody | Segmental Length | Hubert [27] token length |
| Prosody | Pitch | WORLD [28] |
| Prosody | SSL Representations | MPM [29] |
| Intelligibility | ASR WER | wav2vec 2.0 [30] |
| Intelligibility | ASR WER | Whisper [31] |
| General | SSL Representations | Hubert [27] |
| General | SSL Representations | wav2vec 2.0 [30] |

Table 1: Features used in the benchmark and their respective factors. The overall TTSDS score is computed as an average of individual factor scores.

Since none of these factors can be measured directly, we rely on several features which correlate to the factors. The extensive body of work in speech processing and representation learning provides these features, including representations from self-supervised models, statistical features, and algorithmic features. For each feature derived from synthetic data, we define its score as how close it is to the same feature derived from real speech.

2.1 Features extracted from Speech

Here we identify the specific features (i.e., data points derived from speech) that represent each factor. Table 1 summarizes the models and algorithms that we use to extract these features. For each of the aforementioned factors, we use two to three features to achieve a robust benchmark despite the low number of 80-100 samples per system. For measuring the General factor score, we use frame-level self-supervised representations extracted from the middle layers of the Hubert base [27] and wav2vec 2.0 base [30] models. For the Environment factor, we use two one-dimensional correlates of noise present in the signal: we use VoiceFixer [22] to remove noise from the speech and then measure PESQ [23] between the enhanced sample and the original one, and we use WADA SNR [24] to estimate the SNR of each sample. For Intelligibility, we calculate WER using the reference transcripts and automatic transcripts generated using wav2vec 2.0 [30] fine-tuned on 960 hours of LibriSpeech [32] and Whisper (small) [31]. Prosody is quantified using frame-level representations from a self-supervised prosody model [29] and frame-level pitch features [28]. Additionally, we get a proxy for the segmental durations by using Hubert tokens (with 100 clusters) and extracting their lengths (i.e., how many times the same token occurs in a row). Finally, for the Speaker factor, we use d-vectors [25] and the more recent WeSpeaker [26] representations.
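The segmental-length proxy described above amounts to run-length encoding a sequence of frame-level token IDs. A minimal Python sketch follows, assuming the tokens are already available as a list of integer cluster IDs; the function name and the example sequence are hypothetical.

```python
from itertools import groupby


def token_run_lengths(tokens):
    """Return the length of each run of identical consecutive tokens."""
    return [sum(1 for _ in run) for _, run in groupby(tokens)]


# Hypothetical frame-level cluster IDs for a short utterance.
frame_tokens = [17, 17, 17, 4, 4, 92, 92, 92, 92, 4]
print(token_run_lengths(frame_tokens))  # [3, 2, 4, 1]
```

The resulting run lengths form a one-dimensional sample per dataset, so they can be compared in the same way as any other scalar feature.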

2.2 Speech Distributions

The distribution of a feature can be computed on any audio dataset, whether it consists of synthetic speech, real speech, or noise. We let $D$ represent an audio dataset, and $X$ be the feature derived from the dataset. We can sample observed values $x_i$ from the empirical distribution $\hat{P}(X|D)$ as:

$$x_i \sim \hat{P}(X|D)$$

where $x_i$ can be a scalar for one-dimensional features or a vector for multi-dimensional features.

2.3 Computing Distances Between Distributions

To compare the distributions of features derived from different datasets, we use the 2-Wasserstein distance $W_2$, an optimal-transport distance closely related to the Earth Mover’s Distance (the 1-Wasserstein distance). $W_2$ measures the amount of “work” needed to transform one probability distribution into another, formulated as an optimal transport problem [33]. This distance underlies the Fréchet Inception Distance (FID) [10], which is widely used in computer vision, and the Fréchet Audio Distance [34] in audio processing. Here, we formulate how to compute $W_2$ given the empirical distributions of a feature $X$ computed on datasets $D_1$ and $D_2$, for both the one-dimensional and the multi-dimensional case. We denote the corresponding empirical probability distributions $\hat{P}(X|D_*)$ as $\hat{P}_*$.

One-Dimensional Case

In the one-dimensional case, $W_2$ can be computed as a function of the ordered samples [35]:

$$W_2(\hat{P}_1, \hat{P}_2) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_i - y_i\right)^2} \tag{1}$$

where $\{x_1, \ldots, x_n\}$ denote sorted samples of $\hat{P}(X|D_1)$, and $\{y_1, \ldots, y_n\}$ denote sorted samples of $\hat{P}(X|D_2)$.
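Equation (1) can be sketched in a few lines of Python. This is an illustrative implementation that assumes both samples have the same size $n$, as the formula does; the function name is hypothetical.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equally sized 1-D samples (Eq. 1)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    # Sorting pairs the i-th smallest value of one sample with the
    # i-th smallest value of the other.
    squared = [(x - y) ** 2 for x, y in zip(sorted(xs), sorted(ys))]
    return math.sqrt(sum(squared) / len(squared))


# The order of the samples does not matter, only their values.
print(wasserstein2_1d([3.0, 1.0, 2.0], [2.0, 3.0, 1.0]))  # 0.0
# Shifting every value by 1 moves the distribution by exactly 1.
print(wasserstein2_1d([0.0, 0.0], [1.0, 1.0]))  # 1.0
```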

Multi-Dimensional Case

We can compute the $W_2$ distance for two normally distributed multi-dimensional $\hat{P}_1$ and $\hat{P}_2$ using their mean vectors $\mu_1$ and $\mu_2$ and their covariance matrices $\Sigma_1$ and $\Sigma_2$ [10]:

$$W_2(\hat{P}_1, \hat{P}_2) = \sqrt{\lVert \mu_1 - \mu_2 \rVert^2 + D_B(\Sigma_1, \Sigma_2)} \tag{2}$$

where $D_B$ is the unnormalized Bures metric defined as

$$D_B(\Sigma_1, \Sigma_2) = \operatorname{trace}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2}\right)^{1/2}\right)$$

We can use this approach since latent representations of neural networks are presumably normally distributed [10].
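The Gaussian form of $W_2$ in Equation (2) can be sketched with NumPy as follows. This is a minimal illustration under the normality assumption stated above, not the benchmark's own code. The function names are hypothetical, the Bures term uses $\Sigma_1 + \Sigma_2$ inside the trace, and the matrix square roots are computed through eigendecompositions of symmetric matrices, which is one of several possible choices.

```python
import numpy as np


def _psd_sqrt(matrix):
    """Symmetric square root of a positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def bures(cov1, cov2):
    """Unnormalized Bures term: tr(S1 + S2 - 2 (S2^1/2 S1 S2^1/2)^1/2)."""
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    root2 = _psd_sqrt(cov2)
    inner = root2 @ cov1 @ root2
    inner = (inner + inner.T) / 2  # re-symmetrize against rounding errors
    # The trace of a PSD matrix square root is the sum of the square roots
    # of its eigenvalues.
    cross = np.sqrt(np.clip(np.linalg.eigvalsh(inner), 0.0, None)).sum()
    return float(np.trace(cov1) + np.trace(cov2) - 2.0 * cross)


def gaussian_w2(mu1, cov1, mu2, cov2):
    """2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2) (Eq. 2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    squared = float(np.sum((mu1 - mu2) ** 2)) + bures(cov1, cov2)
    return float(np.sqrt(max(squared, 0.0)))


def w2_from_features(x, y):
    """Fit a Gaussian to each (n_samples, n_dims) array, n_dims >= 2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return gaussian_w2(x.mean(axis=0), np.cov(x, rowvar=False),
                       y.mean(axis=0), np.cov(y, rowvar=False))


identity = np.eye(2)
# With equal covariances only the mean shift remains: ||(3, 4)|| = 5.
print(gaussian_w2([0, 0], identity, [3, 4], identity))
```

For high-dimensional embeddings and only around 100 samples, the estimated covariance matrices are rank-deficient, which is why the eigenvalues are clipped at zero before taking square roots.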

2.4 Overall Score

To evaluate how close a synthetic speech dataset $D_{\text{syn}}$ is to real speech given a particular feature $X$, we compute its Wasserstein distance $W_2$ to each of a set of real reference datasets $\mathfrak{D}_{\text{real}}$ and a set of distractor (noise) datasets $\mathfrak{D}_{\text{noise}}$. We find the smallest $W_2$ among the real and noise datasets respectively as

$$\begin{split} W_{\text{real}} &= \min_{D_{\text{real}} \in \mathfrak{D}_{\text{real}}} W_2(\hat{P}_{\text{syn}}, \hat{P}_{\text{real}}) \\ W_{\text{noise}} &= \min_{D_{\text{noise}} \in \mathfrak{D}_{\text{noise}}} W_2(\hat{P}_{\text{syn}}, \hat{P}_{\text{noise}}) \end{split}$$

We define the overall score (ranging from 0 to 100) as

$$S = 100 \times \frac{W_{\text{noise}}}{W_{\text{real}} + W_{\text{noise}}}$$

so that a smaller distance to real speech and a larger distance to noise both increase the score.

Any score can be intuitively interpreted: if $S$ is greater than 50, then $D_{\text{syn}}$ is more similar to the closest real speech dataset than to the closest noise dataset for a particular feature. An example of this can be seen in Figure 1, in which we show the difference (for an SSL representation feature) between the best-performing and the worst-performing system in the TTS Arena dataset. The higher score of system (a) corresponds with a larger overlap with the reference dataset and smaller overlap with the noise dataset than system (b).
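The score for a single feature can be sketched as follows. The snippet assumes the distances to every real and noise dataset have already been computed, and it places the noise distance in the numerator so that a score above 50 means the synthetic data is closer to real speech than to noise, matching the interpretation given in the text. The function name and the example distances are hypothetical.

```python
def feature_score(dists_to_real, dists_to_noise):
    """Score in [0, 100]; higher means closer to real speech than to noise."""
    w_real = min(dists_to_real)    # closest real reference dataset
    w_noise = min(dists_to_noise)  # closest distractor (noise) dataset
    if w_real + w_noise == 0:
        raise ValueError("both distances are zero, the score is undefined")
    return 100.0 * w_noise / (w_real + w_noise)


# Hypothetical distances: closest real dataset at 1.0, closest noise at 3.0.
print(feature_score([1.0, 3.0], [3.0, 9.0]))  # 75.0
# Equal distances to real speech and noise give the midpoint.
print(feature_score([2.0], [2.0]))  # 50.0
```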

Fig. 1: Distribution of the best (left) and worst (right) TTS Arena system with respect to Hubert representations. $S$ denotes the score.

3 Experiment Setup

Fig. 2: Spearman correlation between the subjective measure, benchmark systems and our benchmark. Panels: (a) Blizzard’08, (b) BTTF, (c) TTS Arena.

To validate our benchmark, we compare our factors and TTSDS scores to subjective measures across three different time periods. The Blizzard 2008 challenge [36] compared 22 TTS systems across several tasks using MOS. We choose the “Voice A” audio-book task with 15 hours of data. Later, the “Back to the Future” (BTTF) [5] compared unit selection, hybrid and statistical parametric HMM-based systems from the Blizzard 2013 challenge [37] with deep learning systems inspired by the Blizzard 2021 challenge [38] based on FastPitch [39] and Tacotron [40]. The latest systems, which are most commonly based on discrete speech representations generatively modeled by LLM-like systems [3], are represented by the TTS Arena leaderboard [9], which is a crowdsourced effort to evaluate these systems. Only systems released in 2023 and 2024 are featured in this evaluation. Since the data is not publicly released, we reproduce datasets for as many of the systems as possible. (We had to exclude MetaVoice and GPT-SoVITS due to reference audio length requirements, and MeloTTS due to only female voices being available.) This leaves us with 9 out of the 12 publicly available systems. As conditioning for the TTS systems, we use speaker reference waveforms from the LibriTTS [41] test set, coupled with unrelated transcripts from the same set to make sure the data could not have been encountered during training.

Our reference speech datasets are LibriTTS [41], LibriTTS-R [42], LJSpeech [43], VCTK [44], and the training sets for the Blizzard challenges [5, 36]. We sample 100 utterances at random from each dataset (if available, from their test sets). For distractor/noise datasets, we use the ESC dataset of background noises [45], as well as the following generated noise – random uniform, random normal, all zeros and all ones.
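The four generated distractor signals listed above can be sketched with NumPy. The number of clips, the clip length, the amplitude range of the uniform noise and the variance of the normal noise are assumptions made for illustration; the text does not specify them.

```python
import numpy as np


def make_noise_clips(n_clips=100, n_samples=16000, seed=0):
    """Generate the four synthetic distractor sets as (n_clips, n_samples) arrays.

    The default of 16000 samples corresponds to one second at an assumed
    16 kHz sample rate.
    """
    rng = np.random.default_rng(seed)
    shape = (n_clips, n_samples)
    return {
        "uniform": rng.uniform(-1.0, 1.0, size=shape),  # assumed range [-1, 1)
        "normal": rng.normal(0.0, 1.0, size=shape),     # assumed unit variance
        "zeros": np.zeros(shape),
        "ones": np.ones(shape),
    }


clips = make_noise_clips(n_clips=2, n_samples=8)
for name, audio in clips.items():
    print(name, audio.shape)
```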

We compare our benchmark with two MOS prediction methods. The first method is WVMOS [46], which uses a wav2vec 2.0 model [30] fine-tuned to predict MOS scores. Its system-level correlation coefficients range from 0.68 to 0.96 for different corruptions of speech and their corresponding MOS scores [46]. The second method is UTMOS [47], which is an ensemble MOS prediction system that won several categories in the 2022 VoiceMOS challenge [48], and has since been used for evaluation of several leading TTS systems [4, 49].

For all system × feature combinations, we compute the score as described in Section 2.4. We average all feature scores for each factor, which gives us the corresponding factor score. Averaging all factor scores in turn gives the TTSDS score.
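The two-level averaging described above can be sketched as follows. The feature names and feature-level scores are hypothetical placeholders, since only factor-level scores are reported in this paper.

```python
from statistics import mean


def aggregate_scores(feature_scores):
    """Average features within each factor, then average the factors."""
    factor_scores = {
        factor: mean(list(scores.values()))
        for factor, scores in feature_scores.items()
    }
    return factor_scores, mean(list(factor_scores.values()))


# Hypothetical feature-level scores for one system (not from the paper).
example = {
    "general": {"hubert": 92.0, "wav2vec2": 94.0},
    "environment": {"voicefixer_pesq": 80.0, "wada_snr": 90.0},
    "intelligibility": {"wav2vec2_wer": 90.0, "whisper_wer": 92.0},
    "prosody": {"mpm": 88.0, "pitch": 90.0, "token_length": 92.0},
    "speaker": {"dvector": 70.0, "wespeaker": 74.0},
}
factors, ttsds = aggregate_scores(example)
print(factors)
print(ttsds)
```

Because factors are averaged after features, a factor with three features (Prosody) carries the same weight in the final score as a factor with two. As a consistency check on reported values, averaging the five factor scores listed for StyleTTS 2 in Table 2 (93.7, 84.7, 91.6, 89.8, 71.5) gives 86.26, in line with the reported TTSDS score of 86.3.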

4 Results

Given the scores of all systems and the subjective measures reported for the given datasets (MOS for Blizzard’08 and BTTF; Elo Rating for TTS Arena), we report their Spearman rank correlation coefficients in Figure 2. We observe that both our factors and the baseline MOS prediction systems vary strongly between the different corpora, but the TTSDS score correlates consistently well with subjective evaluation results.
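For reference, the Spearman rank correlation is the Pearson correlation of the rank-transformed values. The following dependency-free Python sketch illustrates the computation; the helper names are hypothetical, ties receive their average rank, and the example uses the rounded TTSDS scores and Elo ratings from Table 2, so its result may differ slightly from a value computed on unrounded scores.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Rounded TTSDS scores and Elo ratings of the nine TTS Arena systems (Table 2).
ttsds = [86.3, 85.6, 86.4, 83.9, 84.0, 83.9, 86.2, 81.4, 83.7]
elo = [1237, 1232, 1158, 1149, 1140, 1126, 1120, 1114, 1029]
print(spearman(ttsds, elo))
```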

The baseline MOS prediction methods achieve mixed results for the Blizzard’08 and BTTF data. UTMOS and WVMOS each achieve a high correlation on one of the two datasets while only yielding low correlation on the other. We hypothesize that UTMOS might have included unit-selection systems in its training data, but might not have encountered enough variants of the FastPitch/Tacotron systems present in BTTF. The inverse seems to be the case for WVMOS. For TTS Arena, neither system performs well. In summary, these MOS prediction systems sometimes achieve high correlation with ground-truth MOS, but do not seem to generalize.

4.1 TTSDS Benchmark

We now discuss the individual scores of our benchmark. The development of these scores’ correlations with subjective evaluation can be seen in Figure 3.

The General score shows some correlation with human evaluation results, but the correlation is generally low. The General score only slightly outperforms MOS prediction for TTS Arena, and shows the lowest correlation of all factors for the Blizzard’08 systems. For unit selection voices, this might be explained by the fact that they consist of real speech samples. However, speaker verification representations should suffer from the same problem, and they do not seem to be affected as much.

Fig. 3: Development of factor score correlation coefficients over time from early speech synthesis (Blizzard’08) to the latest systems (TTS Arena).
| System | UTMOS | WVMOS | Gen | Env | Int | Pro | Spk | TTSDS | Elo Rating |
|---|---|---|---|---|---|---|---|---|---|
| StyleTTS 2 [7] | 4.36 | 4.48 | 93.7 | 84.7 | 91.6 | 89.8 | 71.5 | 86.3 | 1237 |
| XTTSv2 [4] | 3.89 | 4.36 | 94.3 | 79.3 | 91.4 | 90.5 | 72.6 | 85.6 | 1232 |
| OpenVoice [50] | 4.10 | 4.57 | 91.7 | 88.0 | 91.6 | 91.8 | 68.8 | 86.4 | 1158 |
| WhisperSpeech [51] | 3.78 | 3.89 | 90.0 | 83.9 | 92.2 | 80.7 | 72.4 | 83.9 | 1149 |
| Parler TTS [52] | 3.97 | 4.16 | 94.7 | 80.8 | 87.5 | 83.0 | 74.1 | 84.0 | 1140 |
| Vokan TTS [53] | 3.80 | 4.22 | 88.6 | 85.1 | 91.6 | 85.3 | 69.1 | 83.9 | 1126 |
| OpenVoice v2 [50] | 4.29 | 4.75 | 90.7 | 91.2 | 91.6 | 88.6 | 68.7 | 86.2 | 1120 |
| VoiceCraft 2 [54] | 4.21 | 3.71 | 87.0 | 78.0 | 91.6 | 84.4 | 66.0 | 81.4 | 1114 |
| Pheme [6] | 3.92 | 4.26 | 94.0 | 81.9 | 91.5 | 85.1 | 66.1 | 83.7 | 1029 |

Table 2: Ranking, factor scores, TTSDS score and MOS predictions for the TTS Arena systems.

The Environment score has a low correlation with subjective measures for both Blizzard’08 and TTS Arena, but it is interestingly the most important factor for BTTF. Since BTTF consists of both deep learning systems from 2021 and systems from 2013, this factor might pick up on artifacts which are only present in the latter. Meanwhile, for the Blizzard’08 systems, these artifacts might be similar enough between systems that listeners did not prioritize them in evaluation, while for modern systems in TTS Arena, hardly any noise or artifacts are present.

The Intelligibility score shows a high correlation for Blizzard’08, but it is the only one of our scores showing a negative correlation for BTTF. Again, this could be due to the difference between unit selection and neural voices, with the former perhaps having more realistic intelligibility, but worse naturalness as perceived by humans.

The Prosody score is consistent between datasets, which might be in part due to the diversity of the underlying features (i.e. pitch, SSL prosody representations and segmental durations). It also increases over time, with the TTS Arena system scores showing the highest correlation with our prosody score. This confirms the intuition that good prosody has always been a factor in subjective evaluation. The increase in prosody score might indicate that human evaluators focus more on the prosody of the speech as other factors such as the intelligibility or noise conditions have vastly improved with modern systems.

The Speaker score also shows high correlation for Blizzard’08 and TTS Arena, but fails for BTTF. We believe this is due to older unit-selection systems included in BTTF producing a natural speaker embedding for concatenated parts of real speech. This effect is pronounced: we only achieve a significant TTSDS score correlation for BTTF when the Speaker factor is excluded.

Fig. 4: Results of Wilcoxon signed-rank tests between systems’ extracted features. ■ indicates a significant difference between a pair of systems.

The TTSDS score achieves higher correlation than any single factor for all datasets included in our study, despite the low number of 80-100 samples per dataset. One of the baseline MOS prediction networks still performed better for the early Blizzard’08 systems, but both MOS prediction networks were significantly outperformed by our benchmark for the more modern systems. However, individual factors often show lower correlations with MOS than the baseline systems, highlighting the need for combining several factors. We hypothesize that this might be the reason measures similar to the Fréchet Inception Distance [10] for computer vision have not become popular for speech evaluation: with the low number of samples typically used for TTS evaluation, and the many factors contributing to what “good” speech synthesis is, a single distance measure might not be enough to show correlation with human evaluation.

Table 2 shows our benchmark compared to MOS prediction and the subjective human evaluation rating from TTS Arena. While UTMOS correctly predicts the best system, the other scores by the MOS prediction systems show little to no correlation with Elo ratings; our prosody factor, speaker factor and overall TTSDS scores correlate well with the Elo ratings. However, OpenVoice v2 [50] is scored highly by TTSDS but achieved low scores in TTS Arena – this might be due to differences in configuration, as the details for generating the speech used in TTS Arena are not public. To evaluate whether our benchmark could be used for system selection, we perform a Wilcoxon signed-rank test (Figure 4). We observe that while the worst-performing systems can generally be distinguished from the highest-performing ones, there is no statistically significant difference between the better-performing systems. Finding significant differences between TTS systems has been difficult, even with previous subjective evaluation methods [5, 36]. However, using more speech samples and features for future iterations of TTSDS could mitigate this.
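A Wilcoxon signed-rank test on paired values from two systems can be sketched without external dependencies as shown below. Which quantities are paired is not specified here, so the pairing is an assumption, as are the function names. The p-value uses the plain normal approximation without tie or continuity corrections, so it is only a rough guide for small samples.

```python
import math


def _average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def wilcoxon_signed_rank(xs, ys):
    """Return (statistic, two-sided p-value) via the normal approximation."""
    # Pairs with a zero difference carry no sign information and are dropped.
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = _average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (statistic - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return statistic, p_value


# Hypothetical paired scores of two systems on ten shared items.
system_a = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
system_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(wilcoxon_signed_rank(system_a, system_b))
```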

5 Conclusion

In this work, we proposed a benchmark assessing prosody, speaker identity, intelligibility, environment, and general distribution of synthetic speech. Evaluating 35 TTS systems from 2008 to 2024, our benchmark showed strong correlation with human evaluations (0.60 to 0.83). This highlights the robustness and adaptability of our approach to evolving evaluation criteria. Individual factors alone showed limited correlation, but their combination significantly outperformed traditional MOS prediction systems, especially for modern TTS systems. Our results underscore the importance of intelligibility and prosody, and the need for TTS systems to replicate realistic recording conditions and speaker characteristics. We revealed limitations in existing MOS prediction systems, emphasizing the need for a nuanced approach to TTS evaluation. High correlation with human evaluations suggests our benchmark provides a reliable and comprehensive framework for assessing synthetic speech quality.

References

  • [1] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv:2210.13438, 2022.
  • [2] D. Lyth and S. King, “Natural language guidance of high-fidelity text-to-speech with synthetic annotations,” arXiv:2402.01912, 2024.
  • [3] S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, “VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers,” arXiv:2406.05370, 2024.
  • [4] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, et al., “XTTS: a massively multilingual zero-shot text-to-speech model,” arXiv:2401.02839, 2024.
  • [5] S. Le Maguer, S. King, and N. Harte, “Back to the future: Extending the blizzard challenge 2013,” in INTERSPEECH, 2022.
  • [6] P. Budzianowski, T. Sereda, T. Cichy, and I. Vulić, “Pheme: Efficient and conversational speech generation,” arXiv:2401.02839, 2024.
  • [7] Y. A. Li, C. Han, V. Raghavan, G. Mischler, and N. Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” NeurIPS, 2024.
  • [8] C. Minixhofer, O. Klejch, and P. Bell, “Evaluating and reducing the distance between synthetic and real speech distributions,” in INTERSPEECH, 2023.
  • [9] mrfakename, V. Srivastav, C. Fourrier, L. Pouget, Y. Lacombe, main, and S. Gandhi, “Text to speech arena,” https://huggingface.co/spaces/TTS-AGI/TTS-Arena, 2024.
  • [10] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” NeurIPS, 2017.
  • [11] A. Gritsenko, T. Salimans, R. v. d. Berg, J. Snoek, and N. Kalchbrenner, “A spectral energy distance for parallel speech synthesis,” NeurIPS, 2020.
  • [12] L.-W. Chen, S. Watanabe, and A. Rudnicky, “A vector quantized approach for text to speech synthesis on real-world spontaneous speech,” in AAAI, 2023.
  • [13] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in EMNLP, 2018.
  • [14] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “SuperGLUE: A stickier benchmark for general-purpose language understanding systems,” NeurIPS, 2019.
  • [15] S. Reddy, D. Chen, and C. D. Manning, “CoQA: A conversational question answering challenge,” TACL, 2019.
  • [16] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, et al., “SUPERB: Speech processing universal performance benchmark,” arXiv:2105.01051, 2021.
  • [17] N. Campbell, Evaluation of Speech Synthesis, pp. 29–64, Springer, 2007.
  • [18] Y. Fan, Y. Qian, F. K. Soong, and L. He, “Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis,” in ICASSP, 2015.
  • [19] P. Wagner, J. Beskow, S. Betz, J. Edlund, J. Gustafson, G. E. Henter, S. Le Maguer, Z. Malisz, É. Székely, C. Tånnander, et al., “Speech synthesis evaluation - state-of-the-art assessment and suggestion for a novel research program,” in SSW, 2019.
  • [20] E. Cooper, W.-C. Huang, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “A review on subjective and objective evaluation of synthetic speech,” Acoustical Science and Technology, 2024.
  • [21] M. Viswanathan and M. Viswanathan, “Measuring speech quality for text-to-speech systems: development and assessment of a modified mean opinion score (MOS) scale,” CSL, 2005.
  • [22] H. Liu, Q. Kong, Q. Tian, Y. Zhao, D. Wang, C. Huang, and Y. Wang, “VoiceFixer: Toward general speech restoration with neural vocoder,” arXiv:2109.13731, 2021.
  • [23] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in ICASSP, 2001.
  • [24] C. Kim and R. M. Stern, “Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis,” in INTERSPEECH, 2008.
  • [25] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in ICASSP, 2018.
  • [26] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “WeSpeaker: A research and production oriented speaker embedding learning toolkit,” in ICASSP, 2023.
  • [27] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” TASLP, 2021.
  • [28] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, 2016.
  • [29] S. Wallbridge and C. Minixhofer, “Masked prosody model,” https://huggingface.co/cdminix/masked_prosody_model, 2023.
  • [30] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” NeurIPS, 2020.
  • [31] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
  • [32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in ICASSP, 2015.
  • [33] L. N. Vaserstein, “Markov processes over denumerable products of spaces, describing large systems of automata,” Problemy Peredachi Informatsii, 1969.
  • [34] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” INTERSPEECH, 2019.
  • [35] S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde, “Sliced-Wasserstein autoencoder: An embarrassingly simple generative model,” arXiv:1804.01947, 2018.
  • [36] S. King, R. A. Clark, C. Mayo, and V. Karaiskos, “The Blizzard Challenge 2008,” in The Blizzard Challenge Workshop, 2008.
  • [37] S. King and V. Karaiskos, “The Blizzard Challenge 2013,” in The Blizzard Challenge Workshop, 2013.
  • [38] Z. Ling, X. Zhou, and S. King, “The Blizzard Challenge 2021,” in The Blizzard Challenge Workshop, 2021.
  • [39] A. Lańcucki, “FastPitch: Parallel text-to-speech with pitch prediction,” in ICASSP, 2021.
  • [40] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” INTERSPEECH, 2017.
  • [41] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” INTERSPEECH, 2019.
  • [42] Y. Koizumi, H. Zen, S. Karita, Y. Ding, K. Yatabe, N. Morioka, M. Bacchiani, Y. Zhang, W. Han, and A. Bapna, “LibriTTS-R: A restored multi-speaker text-to-speech corpus,” INTERSPEECH, 2023.
  • [43] K. Ito and L. Johnson, “The LJ speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  • [44] J. Yamagishi, “English multi-speaker corpus for CSTR voice cloning toolkit,” http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html, 2013.
  • [45] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in ACM Multimedia, 2015.
  • [46] P. Andreev, A. Alanov, O. Ivanov, and D. Vetrov, “HiFi++: a unified framework for neural vocoding, bandwidth extension and speech enhancement,” arXiv:2203.13086, 2022.
  • [47] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” INTERSPEECH, 2022.
  • [48] W.-C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “The VoiceMOS Challenge 2022,” INTERSPEECH, 2022.
  • [49] Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang, et al., “NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,” arXiv:2403.03100, 2024.
  • [50] Z. Qin, W. Zhao, X. Yu, and X. Sun, “OpenVoice: Versatile instant voice cloning,” arXiv:2312.01479, 2023.
  • [51] J. P. Cłapa, “WhisperSpeech,” https://github.com/collabora/WhisperSpeech, 2024.
  • [52] Y. Lacombe, V. Srivastav, and S. Gandhi, “Parler TTS,” https://github.com/huggingface/parler-tts, 2024.
  • [53] ButterCream, “Vokan,” https://huggingface.co/ShoukanLabs/Vokan, 2024.
  • [54] P. Peng, P.-Y. Huang, D. Li, A. Mohamed, and D. Harwath, “VoiceCraft: Zero-shot speech editing and text-to-speech in the wild,” arXiv:2403.16973, 2024.