-
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Authors:
Mateusz Łajszczak,
Guillermo Cámbara,
Yang Li,
Fatih Beyhan,
Arent van Korlaar,
Fan Yang,
Arnaud Joly,
Álvaro Martín-Cortinas,
Ammar Abbas,
Adam Michalski,
Alexis Moinet,
Sri Karlapati,
Ewa Muszyńska,
Haohan Guo,
Bartosz Putrycz,
Soledad López Gambino,
Kayeon Yoo,
Elena Sokolova,
Thomas Drugman
Abstract:
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
Submitted 15 February, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
Controllable Emphasis with zero data for text-to-speech
Authors:
Arnaud Joly,
Marco Nicolis,
Ekaterina Peterova,
Alessandro Lombardi,
Ammar Abbas,
Arent van Korlaar,
Aman Hussain,
Parul Sharma,
Alexis Moinet,
Mateusz Lajszczak,
Penny Karanasou,
Antonio Bonafonte,
Thomas Drugman,
Elena Sokolova
Abstract:
We present a scalable method to produce high-quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by $7.3\%$ and testers' correct identification of the emphasized word in a sentence by $40\%$ on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.
Submitted 13 July, 2023;
originally announced July 2023.
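As a rough illustration of the duration-based emphasis described in the abstract above, the sketch below stretches the predicted phoneme durations of a single word. It is not the authors' implementation: the function name `emphasize_durations`, the phoneme-to-word alignment format, and the scaling factor of 1.5 are all assumptions chosen for the example.

```python
from typing import List, Sequence


def emphasize_durations(
    durations: Sequence[float],
    word_ids: Sequence[int],
    emphasized_word: int,
    factor: float = 1.5,  # assumed value; the abstract does not give one
) -> List[float]:
    """Scale the predicted duration of every phoneme in one word.

    durations[i] is the predicted duration (e.g. in frames) of phoneme i,
    and word_ids[i] is the index of the word that phoneme i belongs to.
    """
    if len(durations) != len(word_ids):
        raise ValueError("durations and word_ids must have the same length")
    return [
        d * factor if w == emphasized_word else d
        for d, w in zip(durations, word_ids)
    ]


# Hypothetical duration-model output for a three-word sentence.
durations = [4.0, 6.0, 5.0, 8.0, 3.0]
word_ids = [0, 0, 1, 1, 2]

# Emphasize the second word (index 1); other phonemes are left unchanged.
stretched = emphasize_durations(durations, word_ids, emphasized_word=1)
print(stretched)
```

In a real system the modified durations would be passed on to the acoustic model in place of the original predictions; how large the factor should be is a tuning question the abstract does not settle.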
-
Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody
Authors:
Peter Makarov,
Ammar Abbas,
Mateusz Łajszczak,
Arnaud Joly,
Sri Karlapati,
Alexis Moinet,
Thomas Drugman,
Penny Karanasou
Abstract:
Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on multi-speaker data all improve prosody. More interestingly, they result in synergies. Long context disambiguates prosody, improves coherence, and plays to the strengths of Transformers. Fine-tuning word-level features from a powerful language model, such as BERT, appears to profit from more training data, readily available in a multi-speaker setting. We look into objective metrics on pausing and pacing and perform thorough subjective evaluations for speech naturalness. Our main system, which incorporates all the extensions, achieves consistently strong results, including statistically significant improvements in speech naturalness over all its competitors.
Submitted 29 June, 2022;
originally announced June 2022.
-
Distribution augmentation for low-resource expressive text-to-speech
Authors:
Mateusz Lajszczak,
Animesh Prasad,
Arent van Korlaar,
Bajibabu Bollepalli,
Antonio Bonafonte,
Arnaud Joly,
Marco Nicolis,
Alexis Moinet,
Thomas Drugman,
Trevor Wood,
Elena Sokolova
Abstract:
This paper presents a novel data augmentation technique for text-to-speech (TTS) that makes it possible to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take additional measures to ensure that synthesized speech does not contain artifacts caused by combining inconsistent audio samples. The perceptual evaluations show that our method improves speech quality over a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves the robustness of attention-based TTS models.
Submitted 19 February, 2022; v1 submitted 13 February, 2022;
originally announced February 2022.
-
Multi-Scale Spectrogram Modelling for Neural Text-to-Speech
Authors:
Ammar Abbas,
Bajibabu Bollepalli,
Alexis Moinet,
Arnaud Joly,
Penny Karanasou,
Peter Makarov,
Simon Slangens,
Sri Karlapati,
Thomas Drugman
Abstract:
We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale mel-spectrograms capturing fine-grained prosody.
We present details for two specific versions of MSS, called Word-level MSS and Sentence-level MSS, where the scales in our system are motivated by linguistic units. Word-level MSS models word-, phoneme-, and frame-level spectrograms, while Sentence-level MSS additionally models sentence-level spectrograms.
Subjective evaluations show that Word-level MSS performs statistically significantly better compared to the baseline on two voices.
Submitted 29 June, 2021;
originally announced June 2021.
-
A learned conditional prior for the VAE acoustic space of a TTS system
Authors:
Penny Karanasou,
Sri Karlapati,
Alexis Moinet,
Arnaud Joly,
Ammar Abbas,
Simon Slangen,
Jaime Lorenzo Trueba,
Thomas Drugman
Abstract:
Many factors influence speech, yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system. By doing so, we aim to sample with more prosodic variability, while gaining controllability over the latent space's structure.
By using as prior the posterior distribution of a secondary VAE, which we condition on a speaker vector, we can sample from the primary VAE taking explicitly the conditioning into account and resulting in samples from a specific region of the latent space for each condition (i.e. speaker). A formal preference test demonstrates significant preference of the proposed approach over standard Conditional VAE. We also provide visualisations of the latent space where well-separated condition-specific clusters appear, as well as ablation studies to better understand the behaviour of the system.
Submitted 14 June, 2021;
originally announced June 2021.
-
Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech
Authors:
Sri Karlapati,
Ammar Abbas,
Zack Hodari,
Alexis Moinet,
Arnaud Joly,
Penny Karanasou,
Thomas Drugman
Abstract:
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of $13.2\%$ in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
Submitted 4 November, 2020;
originally announced November 2020.
-
CAMP: a Two-Stage Approach to Modelling Prosody in Context
Authors:
Zack Hodari,
Alexis Moinet,
Sri Karlapati,
Jaime Lorenzo-Trueba,
Thomas Merritt,
Arnaud Joly,
Ammar Abbas,
Penny Karanasou,
Thomas Drugman
Abstract:
Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our Context-Aware Model of Prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly-trained duration model improves prosody significantly.
Submitted 12 February, 2021; v1 submitted 2 November, 2020;
originally announced November 2020.
-
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
Authors:
Sri Karlapati,
Alexis Moinet,
Arnaud Joly,
Viacheslav Klimkov,
Daniel Sáez-Trigueros,
Thomas Drugman
Abstract:
Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through various subjective evaluations, where we show a relative improvement of $47\%$ in the quality of prosody transfer and $14\%$ in preserving the target speaker identity, while still maintaining the same naturalness.
Submitted 30 April, 2020;
originally announced April 2020.