Skip to main content

Showing 1–16 of 16 results for author: Tachibana, K

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.03963  [pdf, other

    cs.CL cs.AI

    LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

    Authors: LLM-jp, :, Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano , et al. (57 additional authors not shown)

    Abstract: This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  2. arXiv:2406.07969  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

    Authors: Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana

    Abstract: We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  3. arXiv:2406.07280  [pdf, ps, other

    cs.SD eess.AS

    Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

    Authors: Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning noisy-to-clean VC process. However, the naturalness of the converted speech is limited when the noise of the source speech is unseen during the training. To this end, our proposed… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, accepted for INTERSPEECH 2024, audio samples: http://y-saito.sakura.ne.jp/sython/Corpus/SRC4VC/IS2024_CDT_supplementary/demo_cdt.html

  4. arXiv:2406.07254  [pdf, ps, other

    cs.SD eess.AS

    SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark

    Authors: Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We present SRC4VC, a new corpus containing 11 hours of speech recorded on smartphones by 100 Japanese speakers. Although high-quality multi-speaker corpora can advance voice conversion (VC) technologies, they are not always suitable for testing VC when low-quality speech recording is given as the input. To this end, we first asked 100 crowdworkers to record their voice samples using smartphones. T… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted for INTERSPEECH 2024, corpus project page: https://y-saito.sakura.ne.jp/sython/Corpus/SRC4VC/index.html

  5. arXiv:2309.08140  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

    Authors: Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana

    Abstract: We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of spe… ▽ More

    Submitted 27 December, 2023; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  6. arXiv:2305.13724  [pdf, ps, other

    cs.SD cs.CL cs.LG eess.AS

    ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings

    Authors: Yuki Saito, Shinnosuke Takamichi, Eiji Iimori, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We propose ChatGPT-EDSS, an empathetic dialogue speech synthesis (EDSS) method using ChatGPT for extracting dialogue context. ChatGPT is a chatbot that can deeply understand the content and purpose of an input prompt and appropriately respond to the user's request. We focus on ChatGPT's reading comprehension and introduce it to EDSS, a task of synthesizing speech that can empathize with the interl… ▽ More

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages, accepted for INTERSPEECH 2023

  7. arXiv:2305.13713  [pdf, ps, other

    cs.SD cs.CL cs.LG eess.AS

    CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center

    Authors: Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We present CALLS, a Japanese speech corpus that considers phone calls in a customer center as a new domain of empathetic spoken dialogue. The existing STUDIES corpus covers only empathetic dialogue between a teacher and student in a school. To extend the application range of empathetic dialogue speech synthesis (EDSS), we designed our corpus to include the same female speaker as the STUDIES teache… ▽ More

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages, accepted for INTERSPEECH2023

  8. arXiv:2210.15975  [pdf, other

    eess.AS cs.LG cs.SD eess.SP

    Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

    Authors: Masaya Kawamura, Yuma Shirahata, Ryuichi Yamamoto, Kentaro Tachibana

    Abstract: We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band gene… ▽ More

    Submitted 21 February, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted to ICASSP 2023

  9. arXiv:2210.15964  [pdf, other

    eess.AS cs.LG cs.SD

    Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis

    Authors: Yuma Shirahata, Ryuichi Yamamoto, Eunwoo Song, Ryo Terashima, Jae-Min Kim, Kentaro Tachibana

    Abstract: Several fully end-to-end text-to-speech (TTS) models have been proposed that have shown better performance compared to cascade models (i.e., training acoustic and vocoder models separately). However, they often generate unstable pitch contour with audible artifacts when the dataset contains emotional attributes, i.e., large diversity of pronunciation and prosody. To address this problem, we propos… ▽ More

    Submitted 21 February, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: ICASSP 2023

  10. arXiv:2210.15887  [pdf, other

    eess.AS cs.LG cs.SD eess.SP

    Nonparallel High-Quality Audio Super Resolution with Domain Adaptation and Resampling CycleGANs

    Authors: Reo Yoneyama, Ryuichi Yamamoto, Kentaro Tachibana

    Abstract: Neural audio super-resolution models are typically trained on low- and high-resolution audio signal pairs. Although these methods achieve highly accurate super-resolution if the acoustic characteristics of the input data are similar to those of the training data, challenges remain: the models suffer from quality degradation for out-of-domain data, and paired data are required for training. To addr… ▽ More

    Submitted 27 February, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: Acceptted to ICASSP 2023

  11. arXiv:2206.08039  [pdf, ps, other

    cs.SD cs.CL cs.LG eess.AS

    Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History

    Authors: Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that considers both the linguistic and prosodic contexts of dialogue history. Empathy is the active attempt by humans to get inside the interlocutor in dialogue, and empathetic DSS is a technology to implement this act in spoken dialogue systems. Our model is conditioned by the history of linguistic and prosody features for… ▽ More

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: 5 pages, 3 figures, Accepted for INTERSPEECH2022

  12. arXiv:2204.10020  [pdf, other

    eess.AS cs.LG cs.SD eess.SP

    Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation

    Authors: Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata, Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana

    Abstract: Data augmentation via voice conversion (VC) has been successfully applied to low-resource expressive text-to-speech (TTS) when only neutral data for the target speaker are available. Although the quality of VC is crucial for this approach, it is challenging to learn a stable VC model because the amount of data is limited in low-resource scenarios, and highly expressive speech has large acoustic va… ▽ More

    Submitted 5 July, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  13. arXiv:2203.15683  [pdf, other

    cs.SD eess.AS

    DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning

    Authors: Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto

    Abstract: Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed environment, incurring a high cost for data collection. To solve this problem, existing noise-robust TTS methods are intended to use noisy speech corpora as training data. However, they only address either time-invariant or time-variant noises. We propose a degradation-robust TTS method, which can be trai… ▽ More

    Submitted 29 June, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted to INTERSPEECH 2022

  14. arXiv:2203.14757  [pdf, ps, other

    cs.SD cs.AI cs.CL cs.HC cs.LG

    STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent

    Authors: Yuki Saito, Yuto Nishimura, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We present STUDIES, a new speech corpus for developing a voice agent that can speak in a friendly manner. Humans naturally control their speech prosody to empathize with each other. By incorporating this "empathetic dialogue" behavior into a spoken dialogue system, we can develop a voice agent that can respond to a user more naturally. We designed the STUDIES corpus to include a speaker who speaks… ▽ More

    Submitted 16 June, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, Accepted for INTERSPEECH2022, project page: http://sython.org/Corpus/STUDIES

  15. arXiv:2104.12395  [pdf, other

    eess.AS cs.CL cs.LG

    Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis

    Authors: Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana

    Abstract: We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model, a.k.a BERT, and explicit features extracted from BiLSTM with linguistic features. In conventional BiLSTM based methods, word representations and/or sentence representations are used as independent components. The proposed method takes account of both representations… ▽ More

    Submitted 26 April, 2021; originally announced April 2021.

    Comments: Submitted to INTERSPEECH 2021

  16. arXiv:1809.01890  [pdf, other

    cs.CV cs.GR cs.LG stat.ML

    Full-body High-resolution Anime Generation with Progressive Structure-conditional Generative Adversarial Networks

    Authors: Koichi Hamada, Kentaro Tachibana, Tianqi Li, Hiroto Honda, Yusuke Uchida

    Abstract: We propose Progressive Structure-conditional Generative Adversarial Networks (PSGAN), a new framework that can generate full-body and high-resolution character images based on structural information. Recent progress in generative adversarial networks with progressive training has made it possible to generate high-resolution images. However, existing approaches have limitations in achieving both hi… ▽ More

    Submitted 6 September, 2018; originally announced September 2018.

    Comments: Accepted to ECCV 2018 Workshop: Computer Vision for Fashion, Art and Design. Project page is at https://dena.com/intl/anime-generation