Showing 1–6 of 6 results for author: Shirahata, Y

Searching in archive cs.
  1. arXiv:2406.12194

    eess.AS cs.SD

    Universal Score-based Speech Enhancement with High Content Preservation

    Authors: Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu

    Abstract: We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model that decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we intr…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 5 pages, 5 figures, accepted at Interspeech 2024

  2. arXiv:2406.07969

    eess.AS cs.CL cs.LG cs.SD

    LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

    Authors: Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana

    Abstract: We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  3. arXiv:2309.08140

    eess.AS cs.CL cs.LG cs.SD

    PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

    Authors: Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana

    Abstract: We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of spe…

    Submitted 27 December, 2023; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  4. arXiv:2210.15975

    eess.AS cs.LG cs.SD eess.SP

    Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

    Authors: Masaya Kawamura, Yuma Shirahata, Ryuichi Yamamoto, Kentaro Tachibana

    Abstract: We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band gene…

    Submitted 21 February, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted to ICASSP 2023

  5. arXiv:2210.15964

    eess.AS cs.LG cs.SD

    Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis

    Authors: Yuma Shirahata, Ryuichi Yamamoto, Eunwoo Song, Ryo Terashima, Jae-Min Kim, Kentaro Tachibana

    Abstract: Several fully end-to-end text-to-speech (TTS) models have been proposed that have shown better performance compared to cascade models (i.e., training acoustic and vocoder models separately). However, they often generate unstable pitch contour with audible artifacts when the dataset contains emotional attributes, i.e., large diversity of pronunciation and prosody. To address this problem, we propos…

    Submitted 21 February, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: ICASSP 2023

  6. arXiv:2204.10020

    eess.AS cs.LG cs.SD eess.SP

    Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation

    Authors: Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata, Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana

    Abstract: Data augmentation via voice conversion (VC) has been successfully applied to low-resource expressive text-to-speech (TTS) when only neutral data for the target speaker are available. Although the quality of VC is crucial for this approach, it is challenging to learn a stable VC model because the amount of data is limited in low-resource scenarios, and highly expressive speech has large acoustic va…

    Submitted 5 July, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022