Showing 1–15 of 15 results for author: Kastner, K

  1. arXiv:2408.10463  [pdf, other]

    cs.SD cs.LG eess.AS

    Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

    Authors: Hyun Jin Park, Dhruuv Agarwal, Neng Chen, Rentao Sun, Kurt Partridge, Justin Chen, Harry Zhang, Pai Zhu, Jacob Bartel, Kyle Kastner, Gary Wang, Andrew Rosenberg, Quan Wang

    Abstract: The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations. Utilizing large amounts of text-to-speech (TTS) synthesized data can reduce the cost and time associated with KWS development. However, TTS data may contain artifacts not present in real speech, which the KWS model can exploit (overfit), leading to degraded ac…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: to be published in a Workshop at Interspeech 2024, Synthetic Data's Transformative Role in Foundational Speech Models

  2. arXiv:2407.18879  [pdf, other]

    cs.SD cs.LG eess.AS

    Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model

    Authors: Hyun Jin Park, Dhruuv Agarwal, Neng Chen, Rentao Sun, Kurt Partridge, Justin Chen, Harry Zhang, Pai Zhu, Jacob Bartel, Kyle Kastner, Gary Wang, Andrew Rosenberg, Quan Wang

    Abstract: This paper explores the use of TTS synthesized training data for the KWS (keyword spotting) task while minimizing development cost and time. Keyword spotting models require a huge amount of training data to be accurate, and obtaining such training data can be costly. In the current state of the art, TTS models can generate large amounts of natural-sounding data, which can help reduce cost and time f…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: to be published in a Workshop at Interspeech 2024, Synthetic Data's Transformative Role in Foundational Speech Models

  3. arXiv:2402.18932  [pdf, other]

    eess.AS cs.SD

    Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

    Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Fadi Biadsy, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

    Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data…

    Submitted 16 July, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: To appear in ICASSP 2024. Demo page: https://google.github.io/tacotron/publications/extending_tts/

  4. arXiv:2401.04235  [pdf, other]

    cs.CL cs.SD eess.AS

    High-precision Voice Search Query Correction via Retrievable Speech-text Embedings

    Authors: Christopher Li, Gary Wang, Kyle Kastner, Heng Su, Allen Chen, Andrew Rosenberg, Zhehuai Chen, Zelin Wu, Leonid Velikovich, Pat Rondon, Diamantino Caseiro, Petar Aleksic

    Abstract: Automatic speech recognition (ASR) systems can suffer from poor recall for various reasons, such as noisy audio, lack of sufficient training data, etc. Previous work has shown that recall can be improved by retrieving rewrite candidates from a large database of likely, contextually-relevant alternatives to the hypothesis text using nearest-neighbors search over embeddings of the ASR hypothesis t…

    Submitted 8 January, 2024; originally announced January 2024.

  5. arXiv:2304.14514  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Understanding Shared Speech-Text Representations

    Authors: Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

    Abstract: Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First we examine the limits of speech-fr…

    Submitted 27 April, 2023; originally announced April 2023.

    Comments: Accepted at ICASSP 2023, camera ready

  6. arXiv:2206.15276  [pdf, other]

    cs.SD cs.LG eess.AS

    R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

    Authors: Kyle Kastner, Aaron Courville

    Abstract: This paper introduces R-MelNet, a two-part autoregressive architecture with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder for neural text-to-speech synthesis. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, this model produces low-resolution mel-spectral features which are interpolated and used by a Wave…

    Submitted 30 June, 2022; originally announced June 2022.

  7. arXiv:2112.09312  [pdf, other]

    cs.SD cs.LG eess.AS

    MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

    Authors: Yusong Wu, Ethan Manilow, Yi Deng, Rigel Swavely, Kyle Kastner, Tim Cooijmans, Aaron Courville, Cheng-Zhi Anna Huang, Jesse Engel

    Abstract: Musical expression requires control of both what notes are played, and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments…

    Submitted 17 March, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: Accepted by International Conference on Learning Representations (ICLR) 2022

  8. arXiv:1811.10097  [pdf, other]

    cs.LG cs.AI cs.RO stat.ML

    Planning in Dynamic Environments with Conditional Autoregressive Models

    Authors: Johanna Hansen, Kyle Kastner, Aaron Courville, Gregory Dudek

    Abstract: We demonstrate the use of conditional autoregressive generative models (van den Oord et al., 2016a) over a discrete latent space (van den Oord et al., 2017b) for forward planning with MCTS. In order to test this method, we introduce a new environment featuring varying difficulty levels, along with moving goals and obstacles. The combination of high-quality frame generation and classical planning a…

    Submitted 25 November, 2018; originally announced November 2018.

    Comments: 6 pages, 1 figure, in Proceedings of the Prediction and Generative Modeling in Reinforcement Learning Workshop at the International Conference on Machine Learning (ICML) in 2018

  9. arXiv:1811.07426  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Harmonic Recomposition using Conditional Autoregressive Modeling

    Authors: Kyle Kastner, Rithesh Kumar, Tim Cooijmans, Aaron Courville

    Abstract: We demonstrate a conditional autoregressive pipeline for efficient music recomposition, based on methods presented in van den Oord et al. (2017). Recomposition (Casal & Casey, 2010) focuses on reworking existing musical pieces, adhering to structure at a high level while also re-imagining other aspects of the work. This can involve reuse of pre-existing themes or parts of the original piece, while…

    Submitted 18 November, 2018; originally announced November 2018.

    Comments: 3 pages, 2 figures. In Proceedings of The Joint Workshop on Machine Learning for Music, ICML 2018

  10. arXiv:1811.07240  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS stat.ML

    Representation Mixing for TTS Synthesis

    Authors: Kyle Kastner, João Felipe Santos, Yoshua Bengio, Aaron Courville

    Abstract: Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information…

    Submitted 24 November, 2018; v1 submitted 17 November, 2018; originally announced November 2018.

    Comments: 5 pages, 3 figures

  11. arXiv:1811.05013  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Blindfold Baselines for Embodied QA

    Authors: Ankesh Anand, Eugene Belilovsky, Kyle Kastner, Hugo Larochelle, Aaron Courville

    Abstract: We explore blindfold (question-only) baselines for Embodied Question Answering. The EmbodiedQA task requires an agent to answer a question by intelligently navigating in a simulated environment, gathering necessary visual information only through first-person vision before finally answering. Consequently, a blindfold baseline which ignores the environment and visual information is a degenerate sol…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: NIPS 2018 Visually-Grounded Interaction and Language (ViGilL) Workshop

  12. Learning Distributed Representations from Reviews for Collaborative Filtering

    Authors: Amjad Almahairi, Kyle Kastner, Kyunghyun Cho, Aaron Courville

    Abstract: Recent work has shown that collaborative filter-based recommender systems can be improved by incorporating side information, such as natural language reviews, as a way of regularizing the derived product representations. Motivated by the success of this approach, we introduce two different models of reviews and study their effect on collaborative filtering performance. While the previous state-of-…

    Submitted 18 June, 2018; originally announced June 2018.

    Comments: Published in RecSys 2015 conference

  13. arXiv:1511.07053  [pdf, other]

    cs.CV cs.LG

    ReSeg: A Recurrent Neural Network-based Model for Semantic Segmentation

    Authors: Francesco Visin, Marco Ciccone, Adriana Romero, Kyle Kastner, Kyunghyun Cho, Yoshua Bengio, Matteo Matteucci, Aaron Courville

    Abstract: We propose a structured prediction architecture, which exploits the local generic features extracted by Convolutional Neural Networks and the capacity of Recurrent Neural Networks (RNN) to retrieve distant dependencies. The proposed architecture, called ReSeg, is based on the recently introduced ReNet model for image classification. We modify and extend it to perform the more challenging task of s…

    Submitted 24 May, 2016; v1 submitted 22 November, 2015; originally announced November 2015.

    Comments: In CVPR Deep Vision Workshop, 2016

  14. arXiv:1506.02216  [pdf, other]

    cs.LG

    A Recurrent Latent Variable Model for Sequential Data

    Authors: Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, Yoshua Bengio

    Abstract: In this paper, we explore the inclusion of latent random variables into the dynamic hidden state of a recurrent neural network (RNN) by combining elements of the variational autoencoder. We argue that through the use of high-level latent random variables, the variational RNN (VRNN) can model the kind of variability observed in highly structured sequential data such as natural speech. We empirical…

    Submitted 6 April, 2016; v1 submitted 7 June, 2015; originally announced June 2015.

  15. arXiv:1505.00393  [pdf, other]

    cs.CV

    ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks

    Authors: Francesco Visin, Kyle Kastner, Kyunghyun Cho, Matteo Matteucci, Aaron Courville, Yoshua Bengio

    Abstract: In this paper, we propose a deep neural network architecture for object recognition based on recurrent neural networks. The proposed network, called ReNet, replaces the ubiquitous convolution+pooling layer of the deep convolutional neural network with four recurrent neural networks that sweep horizontally and vertically in both directions across the image. We evaluate the proposed ReNet on three w…

    Submitted 23 July, 2015; v1 submitted 3 May, 2015; originally announced May 2015.