
Showing 1–4 of 4 results for author: Shih, K J

  1. arXiv:2303.07578

    cs.SD cs.LG eess.AS

    VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation

    Authors: Rohan Badlani, Akshit Arora, Subhankar Ghosh, Rafael Valle, Kevin J. Shih, João Felipe Santos, Boris Ginsburg, Bryan Catanzaro

    Abstract: We introduce VANI, a very lightweight multi-lingual accent controllable speech synthesis system. Our model builds upon disentanglement strategies proposed in RADMMM and supports explicit control of accent, language, speaker and fine-grained $F_0$ and energy features for speech synthesis. We utilize the Indic languages dataset, released for LIMMITS 2023 as part of ICASSP Signal Processing Grand Cha…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Presentation accepted at ICASSP 2023

  2. arXiv:2301.10335

    cs.SD cs.LG eess.AS

    Multilingual Multiaccented Multispeaker TTS with RADTTS

    Authors: Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro

    Abstract: We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfe…

    Submitted 24 January, 2023; originally announced January 2023.

    Comments: 5 pages, submitted to ICASSP 2023

  3. arXiv:2203.01786

    cs.SD cs.LG eess.AS

    Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows

    Authors: Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro

    Abstract: Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet have the same fine-grained adjustability of pitch-conditioned deterministic models such as FastPitch and FastSpeech2. Pitch information is not only low-dimensional, but also discontinuous, making it particularly difficult to model in a generative setting. Our work explores several techniques for ha…

    Submitted 27 June, 2022; v1 submitted 3 March, 2022; originally announced March 2022.

    Comments: 22 pages, 11 figures, 3 tables

  4. arXiv:2108.10447

    cs.SD cs.CL cs.LG eess.AS

    One TTS Alignment To Rule Them All

    Authors: Rohan Badlani, Adrian Łancucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro

    Abstract: Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durati…

    Submitted 23 August, 2021; originally announced August 2021.