Search | arXiv e-print repository

arXiv:2406.07803 [pdf, other]

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee

Abstract: Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity with high-quality expressive speech.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2401.08095 [pdf, other]

DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation

Authors: Hyung-Seok Oh, Sang-Hoon Lee, Deok-Hyeon Cho, Seong-Whan Lee

Abstract: Emotional voice conversion involves modifying the pitch, spectral envelope, and other acoustic characteristics of speech to match a desired emotional state while maintaining the speaker's identity. Recent advances in EVC involve simultaneously modeling pitch and duration by exploiting the potential of sequence-to-sequence models. In this study, we focus on parallel speech generation to increase the reliability and efficiency of conversion. We introduce a duration-flexible EVC (DurFlex-EVC) that integrates a style autoencoder and a unit aligner. The previous variable-duration parallel generation model required text-to-speech alignment. We consider self-supervised model representation and discrete speech units to be the core of our parallel generation. The style autoencoder promotes content style disentanglement by separating the source style of the input features and applying them with the target style. The unit aligner encodes unit-level features by modeling emotional context. Furthermore, we enhance the style of the features with a hierarchical stylize encoder and generate high-quality Mel-spectrograms with a diffusion-based generator. The effectiveness of the approach has been validated through subjective and objective evaluations and has been demonstrated to be superior to baseline models.

Submitted 8 August, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

Comments: 14 pages, 11 figures, 12 tables

arXiv:2007.01524 [pdf, other]

Domain Adaptation without Source Data

Authors: Youngeun Kim, Donghyeon Cho, Kyeongtak Han, Priyadarshini Panda, Sungeun Hong

Abstract: Domain adaptation assumes that samples from source and target domains are freely accessible during a training phase. However, such an assumption is rarely plausible in the real world and possibly causes data-privacy issues, especially when the label of the source domain can be a sensitive attribute as an identifier. To avoid accessing source data that may contain sensitive information, we introduce Source data-Free Domain Adaptation (SFDA). Our key idea is to leverage a pre-trained model from the source domain and progressively update the target model in a self-learning manner. We observe that target samples with lower self-entropy measured by the pre-trained source model are more likely to be classified correctly. From this, we select the reliable samples with the self-entropy criterion and define these as class prototypes. We then assign pseudo labels for every target sample based on the similarity score with class prototypes. Furthermore, to reduce the uncertainty from the pseudo labeling process, we propose set-to-set distance-based filtering which does not require any tunable hyperparameters. Finally, we train the target model with the filtered pseudo labels with regularization from the pre-trained source model. Surprisingly, without direct usage of labeled source samples, our PrDA outperforms conventional domain adaptation methods on benchmark datasets. Our code is publicly available at https://github.com/youngryan1993/SFDA-SourceFreeDA

Submitted 30 August, 2021; v1 submitted 3 July, 2020; originally announced July 2020.

Comments: 13 pages

arXiv:1912.00374 [pdf]

Task Scheduling of Multiple Agile Satellites with Transition Time and Stereo Imaging Constraints

Authors: Junhong Kim, Doo-Hyun Cho, Jaemyung Ahn, Han-Lim Choi

Abstract: This paper proposes a framework for scheduling the observation and download tasks of multiple agile satellites with practical considerations such as attitude transition time, onboard data capacity, and stereoscopic image acquisition. A mixed integer linear programming (MILP) formulation for optimal scheduling that can address these practical considerations is introduced. A heuristic algorithm to obtain a near-optimal solution of the formulated MILP based on the time windows pruning procedure is proposed. A comprehensive case study demonstrating the validity of the proposed formulation and heuristic is presented.

Submitted 1 December, 2019; originally announced December 2019.

arXiv:1906.07851 [pdf, other]

Key Instance Selection for Unsupervised Video Object Segmentation

Authors: Donghyeon Cho, Sungeun Hong, Sungil Kang, Jiwon Kim

Abstract: This paper proposes key instance selection based on video saliency covering objectness and dynamics for unsupervised video object segmentation (UVOS). Our method takes frames sequentially and extracts object proposals with corresponding masks for each frame. We link objects according to their similarity until the M-th frame and then assign them unique IDs (i.e., instances). The similarity measure takes into account multiple properties such as ReID descriptor, expected trajectory, and semantic co-segmentation result. After the M-th frame, we select K IDs based on video saliency and frequency of appearance; then only these key IDs are tracked through the remaining frames. Thanks to these technical contributions, our results are ranked third on the leaderboard of the UVOS DAVIS challenge.

Submitted 26 July, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

Comments: Ranked 3rd in 'Unsupervised DAVIS Challenge' (CVPR 2019)

Showing 1–5 of 5 results for author: Cho, D