Showing 1–19 of 19 results for author: Skerry-Ryan, R

  1. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1110 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 8 August, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  2. arXiv:2305.15255  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM

    Authors: Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, Michelle Tadmor Ramanovich

    Abstract: We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key…

    Submitted 30 May, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: ICLR 2024 camera-ready

  3. arXiv:2212.03232  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning the joint distribution of two sequences using little or no paired data

    Authors: Soroosh Mariooryad, Matt Shannon, Siyuan Ma, Tom Bagby, David Kao, Daisy Stanton, Eric Battenberg, RJ Skerry-Ryan

    Abstract: We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data setup, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL en…

    Submitted 6 December, 2022; originally announced December 2022.

  4. arXiv:2111.05095  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Speaker Generation

    Authors: Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao

    Abstract: This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to…

    Submitted 7 November, 2021; originally announced November 2021.

    Comments: 12 pages, 3 figures, 4 tables, appendix with 2 tables

    ACM Class: I.2.7; G.3

  5. arXiv:2103.14574  [pdf, other]

    cs.SD eess.AS

    Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

    Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, RJ Skerry-Ryan, Yonghui Wu

    Abstract: This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model which does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping, this model can learn token-frame alignments as well as token durations automatica…

    Submitted 29 August, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH 2021

  6. arXiv:2011.03568  [pdf, other]

    cs.CL cs.SD eess.AS

    Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

    Authors: Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma

    Abstract: We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within…

    Submitted 5 February, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: 6 pages including supplement, 3 figures. accepted to ICASSP 2021

  7. arXiv:2010.08029  [pdf, other]

    cs.LG stat.ML

    Non-saturating GAN training as divergence minimization

    Authors: Matt Shannon, Ben Poole, Soroosh Mariooryad, Tom Bagby, Eric Battenberg, David Kao, Daisy Stanton, RJ Skerry-Ryan

    Abstract: Non-saturating generative adversarial network (GAN) training is widely used and has continued to obtain groundbreaking results. However so far this approach has lacked strong theoretical justification, in contrast to alternatives such as f-GANs and Wasserstein GANs which are motivated in terms of approximate divergence minimization. In this paper we show that non-saturating GAN training does in fa…

    Submitted 15 October, 2020; originally announced October 2020.

  8. arXiv:1910.10288  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

    Authors: Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, Tom Bagby

    Abstract: Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attentio…

    Submitted 22 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP 2020

  9. arXiv:1910.01709  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Semi-Supervised Generative Modeling for Controllable Speech Synthesis

    Authors: Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby

    Abstract: We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised TTS models. We demonstrate that our model…

    Submitted 3 October, 2019; originally announced October 2019.

  10. arXiv:1907.04448  [pdf, other]

    cs.CL cs.SD eess.AS

    Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

    Authors: Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related…

    Submitted 24 July, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  11. arXiv:1906.03402  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

    Authors: Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby

    Abstract: Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of an…

    Submitted 25 October, 2019; v1 submitted 8 June, 2019; originally announced June 2019.

    Comments: Submitted to ICLR 2020

  12. arXiv:1906.02246  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS eess.SP

    Complex Evolution Recurrent Neural Networks (ceRNNs)

    Authors: Izhak Shafran, Tom Bagby, R. J. Skerry-Ryan

    Abstract: Unitary Evolution Recurrent Neural Networks (uRNNs) have three attractive properties: (a) the unitary property, (b) the complex-valued nature, and (c) their efficient linear operators. The literature so far does not address -- how critical is the unitary property of the model? Furthermore, uRNNs have not been evaluated on large tasks. To study these shortcomings, we propose the complex evolution R…

    Submitted 5 June, 2019; originally announced June 2019.

    Journal ref: Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5854-5858, 2018

  13. arXiv:1808.10128  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

    Authors: Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

    Abstract: Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require a sizable set of high-quality <text, audio> pairs for training, which are expensive to collect. In this paper, we propose a semi-supervised training framework to improve the data efficiency of Tacotron. The idea is to allow Tacotron to utilize textual and acoustic knowledge contain…

    Submitted 30 August, 2018; originally announced August 2018.

  14. arXiv:1808.01410  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS stat.ML

    Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

    Authors: Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

    Abstract: Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicted Global Style Token (TP-GST) architecture, which treats GST combina…

    Submitted 3 August, 2018; originally announced August 2018.

    MSC Class: eess.AS

  15. arXiv:1803.09047  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

    Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

    Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synth…

    Submitted 23 March, 2018; originally announced March 2018.

  16. arXiv:1803.09017  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

    Authors: Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

    Abstract: In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to contr…

    Submitted 23 March, 2018; originally announced March 2018.

  17. arXiv:1712.05884  [pdf, other]

    cs.CL

    Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

    Authors: Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

    Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion s…

    Submitted 15 February, 2018; v1 submitted 15 December, 2017; originally announced December 2017.

    Comments: Accepted to ICASSP 2018

  18. arXiv:1711.00520  [pdf, other]

    cs.CL cs.SD

    Uncovering Latent Style Factors for Expressive Speech Synthesis

    Authors: Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous

    Abstract: Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We sho…

    Submitted 1 November, 2017; originally announced November 2017.

    Comments: Submitted to NIPS ML4Audio workshop and ICASSP

  19. arXiv:1703.10135  [pdf, other]

    cs.CL cs.LG cs.SD

    Tacotron: Towards End-to-End Speech Synthesis

    Authors: Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous

    Abstract: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Give…

    Submitted 6 April, 2017; v1 submitted 29 March, 2017; originally announced March 2017.

    Comments: Submitted to Interspeech 2017. v2 changed paper title to be consistent with our conference submission (no content change other than typo fixes)