Zum Hauptinhalt springen

Showing 1–11 of 11 results for author: Stanton, D

Searching in archive cs. Search in all archives.
.
  1. arXiv:2212.03232  [pdf, other

    cs.LG cs.AI stat.ML

    Learning the joint distribution of two sequences using little or no paired data

    Authors: Soroosh Mariooryad, Matt Shannon, Siyuan Ma, Tom Bagby, David Kao, Daisy Stanton, Eric Battenberg, RJ Skerry-Ryan

    Abstract: We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data setup, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL en… ▽ More

    Submitted 6 December, 2022; originally announced December 2022.

  2. arXiv:2111.05095  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Speaker Generation

    Authors: Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao

    Abstract: This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to… ▽ More

    Submitted 7 November, 2021; originally announced November 2021.

    Comments: 12 pages, 3 figures, 4 tables, appendix with 2 tables

    ACM Class: I.2.7; G.3

  3. arXiv:2010.08029  [pdf, other

    cs.LG stat.ML

    Non-saturating GAN training as divergence minimization

    Authors: Matt Shannon, Ben Poole, Soroosh Mariooryad, Tom Bagby, Eric Battenberg, David Kao, Daisy Stanton, RJ Skerry-Ryan

    Abstract: Non-saturating generative adversarial network (GAN) training is widely used and has continued to obtain groundbreaking results. However so far this approach has lacked strong theoretical justification, in contrast to alternatives such as f-GANs and Wasserstein GANs which are motivated in terms of approximate divergence minimization. In this paper we show that non-saturating GAN training does in fa… ▽ More

    Submitted 15 October, 2020; originally announced October 2020.

  4. arXiv:1910.10288  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

    Authors: Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, Tom Bagby

    Abstract: Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attentio… ▽ More

    Submitted 22 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP 2020

  5. arXiv:1910.01709  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Semi-Supervised Generative Modeling for Controllable Speech Synthesis

    Authors: Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby

    Abstract: We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised TTS models. We demonstrate that our model… ▽ More

    Submitted 3 October, 2019; originally announced October 2019.

  6. arXiv:1906.03402  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

    Authors: Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby

    Abstract: Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of an… ▽ More

    Submitted 25 October, 2019; v1 submitted 8 June, 2019; originally announced June 2019.

    Comments: Submitted to ICLR 2020

  7. arXiv:1808.01410  [pdf, other

    cs.CL cs.LG cs.SD eess.AS stat.ML

    Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

    Authors: Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

    Abstract: Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicted Global Style Token (TP-GST) architecture, which treats GST combina… ▽ More

    Submitted 3 August, 2018; originally announced August 2018.

    MSC Class: eess.AS

  8. arXiv:1803.09047  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

    Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

    Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synth… ▽ More

    Submitted 23 March, 2018; originally announced March 2018.

  9. arXiv:1803.09017  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

    Authors: Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

    Abstract: In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to contr… ▽ More

    Submitted 23 March, 2018; originally announced March 2018.

  10. arXiv:1711.00520  [pdf, other

    cs.CL cs.SD

    Uncovering Latent Style Factors for Expressive Speech Synthesis

    Authors: Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous

    Abstract: Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We sho… ▽ More

    Submitted 1 November, 2017; originally announced November 2017.

    Comments: Submitted to NIPS ML4Audio workshop and ICASSP

  11. arXiv:1703.10135  [pdf, other

    cs.CL cs.LG cs.SD

    Tacotron: Towards End-to-End Speech Synthesis

    Authors: Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous

    Abstract: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Give… ▽ More

    Submitted 6 April, 2017; v1 submitted 29 March, 2017; originally announced March 2017.

    Comments: Submitted to Interspeech 2017. v2 changed paper title to be consistent with our conference submission (no content change other than typo fixes)