Showing 1–50 of 50 results for author: Sisman, B

  1. arXiv:2408.17432  [pdf, other]

    eess.AS cs.LG

    SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection

    Authors: Ismail Rasim Ulgen, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman

    Abstract: Synthesizing the voices of unseen speakers is a persisting challenge in multi-speaker text-to-speech (TTS). Most multi-speaker TTS models rely on modeling speaker characteristics through speaker conditioning during training. Modeling unseen speaker attributes through this approach has necessitated an increase in model complexity, which makes it challenging to reproduce results and improve upon the…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: Submitted to IEEE Signal Processing Letters

  2. arXiv:2408.06827  [pdf, other]

    eess.AS cs.LG

    PRESENT: Zero-Shot Text-to-Prosody Control

    Authors: Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

    Abstract: Current strategies for achieving fine-grained prosody control in speech synthesis entail extracting additional style embeddings or adopting more complex architectures. To enable zero-shot application of pretrained text-to-speech (TTS) models, we present PRESENT (PRosody Editing without Style Embeddings or New Training), which exploits explicit prosody prediction in FastSpeech2-based models by modi…

    Submitted 13 August, 2024; originally announced August 2024.

  3. arXiv:2407.17571  [pdf, other]

    cs.CV

    Diffusion Models for Multi-Task Generative Modeling

    Authors: Changyou Chen, Han Ding, Bunyamin Sisman, Yi Xu, Ouye Xie, Benjamin Z. Yao, Son Dinh Tran, Belinda Zeng

    Abstract: Diffusion-based generative modeling has been achieving state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to single-generation modeling. Can we generalize diffusion models with the ability of multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model by constructing a unifi…

    Submitted 24 July, 2024; originally announced July 2024.

    Comments: Published as a conference paper at ICLR 2024

  4. arXiv:2407.04291  [pdf, other]

    eess.AS cs.LG

    We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings

    Authors: Ismail Rasim Ulgen, Carlos Busso, John H. L. Hansen, Berrak Sisman

    Abstract: In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial to synthesizing natural speech. Although speaker embeddings have been widely used in personalized speech synthesis as conditioning inputs, they are designed to lose variation to optimize speaker recognition accuracy. Thus, they are suboptimal for speech synthesis in terms of modeling the rich va…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Submitted to IEEE Signal Processing Letters

  5. arXiv:2406.03637  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Style Mixture of Experts for Expressive Text-To-Speech Synthesis

    Authors: Ahad Jawaid, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman

    Abstract: Recent advances in style transfer text-to-speech (TTS) have improved the expressiveness of synthesized speech. Despite these advancements, encoding stylistic information from diverse and unseen reference speech remains challenging. This paper introduces StyleMoE, an approach that divides the embedding space, modeled by the style encoder, into tractable subsets handled by style experts. The propose…

    Submitted 5 June, 2024; originally announced June 2024.

  6. arXiv:2406.01018  [pdf, other]

    eess.AS cs.LG cs.SD

    Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training

    Authors: Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

    Abstract: With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that needs to be taken into consideration while building inclusive speech synthesizers. Inclusive speech technology aims to erase any biases towards specific groups, such as people with a certain accent. We note that state-of-the-art Text-to-Speech (T…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Under review

  7. arXiv:2405.11413  [pdf, other]

    eess.AS cs.LG

    Exploring speech style spaces with language models: Emotional TTS without emotion labels

    Authors: Shreeram Suresh Chandra, Zongyang Du, Berrak Sisman

    Abstract: Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents a tough challenge due to the subjective nature of emotions. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or t…

    Submitted 18 May, 2024; originally announced May 2024.

    Comments: Accepted at Speaker Odyssey 2024

  8. arXiv:2405.01730  [pdf, other]

    eess.AS cs.SD

    Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

    Authors: Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman

    Abstract: Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders.…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: Accepted by Speaker Odyssey 2024

  9. arXiv:2403.14083  [pdf, other]

    cs.SD cs.LG eess.AS

    emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition

    Authors: Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Bjorn W. Schuller, Carlos Busso

    Abstract: Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a pot…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Submitted to IEEE Transactions on Affective Computing on February 19, 2024. arXiv admin note: text overlap with arXiv:2305.14402

  10. arXiv:2402.09030  [pdf, other]

    cs.RO

    Awareness in robotics: An early perspective from the viewpoint of the EIC Pathfinder Challenge "Awareness Inside"

    Authors: Cosimo Della Santina, Carlos Hernandez Corbato, Burak Sisman, Luis A. Leiva, Ioannis Arapakis, Michalis Vakalellis, Jean Vanderdonckt, Luis Fernando D'Haro, Guido Manzi, Cristina Becchio, Aïda Elamrani, Mohsen Alirezaei, Ginevra Castellano, Dimos V. Dimarogonas, Arabinda Ghosh, Sofie Haesaert, Sadegh Soudjani, Sybert Stroeve, Paul Verschure, Davide Bacciu, Ophelia Deroy, Bahador Bahrami, Claudio Gallicchio, Sabine Hauert, Ricardo Sanz, et al. (6 additional authors not shown)

    Abstract: Consciousness has historically been a heavily debated topic in engineering, science, and philosophy. By contrast, awareness has attracted less interest from scholars in the past. However, things are changing as more and more researchers are getting interested in answering questions concerning what awareness is and how it can be artificially generated. The landscape is rapidly evolvi…

    Submitted 14 February, 2024; originally announced February 2024.

  11. Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

    Authors: Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, Berrak Sisman

    Abstract: Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and stat…

    Submitted 19 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  12. arXiv:2306.17005  [pdf, other]

    eess.AS cs.CL cs.SD

    High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units

    Authors: Junchen Lu, Berrak Sisman, Mingyang Zhang, Haizhou Li

    Abstract: The goal of Automatic Voice Over (AVO) is to generate speech in sync with a silent video given its text script. Recent AVO frameworks built upon text-to-speech synthesis (TTS) have shown impressive results. However, the current AVO learning objective of acoustic feature reconstruction brings in indirect supervision for inter-modal alignment learning, thus limiting the synchronization performance a…

    Submitted 29 June, 2023; originally announced June 2023.

    Comments: Accepted to INTERSPEECH 2023

  13. arXiv:2306.00794  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    SlothSpeech: Denial-of-service Attack Against Speech Recognition Models

    Authors: Mirazul Haque, Rutvij Shah, Simin Chen, Berrak Şişman, Cong Liu, Wei Yang

    Abstract: Deep Learning (DL) models are now widely used for different speech-related tasks, including automatic speech recognition (ASR). As ASR is being used in different real-time scenarios, it is important that the ASR model remains efficient against minor perturbations to the input. Hence, evaluating the efficiency robustness of ASR models is an urgent need. We show that popular ASR m…

    Submitted 1 June, 2023; originally announced June 2023.

  14. arXiv:2305.14402  [pdf, other]

    cs.SD cs.LG eess.AS

    Enhancing Speech Emotion Recognition Through Differentiable Architecture Search

    Authors: Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Björn Schuller

    Abstract: Speech Emotion Recognition (SER) is a critical enabler of emotion-aware communication in human-computer interactions. Recent advancements in Deep Learning (DL) have substantially enhanced the performance of SER models through increased model complexity. However, designing optimal DL architectures requires prior experience and experimental evaluations. Encouragingly, Neural Architecture Search (NAS…

    Submitted 18 January, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages, 4 figures

  15. arXiv:2305.07216  [pdf, other]

    cs.LG cs.MM cs.SD eess.AS

    Versatile audio-visual learning for emotion recognition

    Authors: Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos Busso

    Abstract: Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be implemented interchangeably for either predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility in a multimodal emotion recognition system is d…

    Submitted 30 July, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: 18 pages, 4 figures, 3 tables (published at IEEE Transactions on Affective Computing)

  16. arXiv:2211.07283  [pdf, other]

    eess.AS cs.SD

    SNIPER Training: Single-Shot Sparse Training for Text-to-Speech

    Authors: Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

    Abstract: Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models can improve on dense models via pruning and extra retraining, or converge faster than dense models with some performance loss. Thus, we propose training TTS models using decaying sparsity, i.e. a high initial sparsity to acc…

    Submitted 1 June, 2024; v1 submitted 14 November, 2022; originally announced November 2022.

  17. arXiv:2211.03316  [pdf, other]

    eess.AS cs.LG cs.SD

    Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

    Authors: Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

    Abstract: Accent plays a significant role in speech communication, influencing one's capability to understand as well as conveying a person's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder. It has the ability to synthesize a selected speaker's voice, which is converted to any desired target accent. Ou…

    Submitted 3 June, 2024; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: Preprint submitted to a conference, under review

  18. arXiv:2210.13756  [pdf, other]

    eess.AS cs.SD

    Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

    Authors: Kun Zhou, Berrak Sisman, Carlos Busso, Bin Ma, Haizhou Li

    Abstract: Emotional voice conversion (EVC) traditionally targets the transformation of spoken utterances from one emotional state to another, with previous research mainly focusing on discrete emotion categories. This paper departs from the norm by introducing a novel perspective: a nuanced rendering of mixed emotions and enhancing control over emotional expression. To achieve this, we propose a novel EVC f…

    Submitted 17 September, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

  19. EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models

    Authors: Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman

    Abstract: Neural models are known to be over-parameterized, and recent work has shown that sparse text-to-speech (TTS) models can outperform dense models. Although a plethora of sparse methods has been proposed for other domains, such methods have rarely been applied in TTS. In this work, we seek to answer the question: what are the characteristics of selected sparse techniques on the performance and model…

    Submitted 22 September, 2022; originally announced September 2022.

    Journal ref: Interspeech 2022, 823-827 (2022)

  20. arXiv:2209.10804  [pdf, other]

    cs.SD cs.CL eess.AS

    Controllable Accented Text-to-Speech Synthesis

    Authors: Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li

    Abstract: Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). Accented TTS synthesis is challenging as L2 differs from L1 in terms of both phonetic rendering and prosody pattern. Furthermore, there is no easy solution to the control of the accent intensity in an utterance. In this work, we propose a neural TTS architecture,…

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: To be submitted for possible journal publication

  21. arXiv:2208.05890  [pdf, other]

    cs.CL cs.AI

    Speech Synthesis with Mixed Emotions

    Authors: Kun Zhou, Berrak Sisman, Rajib Rana, B. W. Schuller, Haizhou Li

    Abstract: Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions.…

    Submitted 28 December, 2022; v1 submitted 11 August, 2022; originally announced August 2022.

    Comments: Accepted to IEEE Transactions on Affective Computing

  22. arXiv:2206.07229  [pdf, other]

    cs.SD cs.LG eess.AS

    Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning

    Authors: Rui Liu, Berrak Sisman, Björn Schuller, Guanglai Gao, Haizhou Li

    Abstract: Emotion classification of speech and assessment of the emotion strength are required in applications such as emotional text-to-speech and voice conversion. The emotion attribute ranking function based on Support Vector Machine (SVM) was proposed to predict emotion strength for an emotional speech corpus. However, the trained ranking function doesn't generalize to new domains, which limits the scope o…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: To appear in INTERSPEECH 2022. 5 pages, 4 figures. Substantial text overlap with arXiv:2110.03156

  23. arXiv:2201.03967  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Emotion Intensity and its Control for Emotional Voice Conversion

    Authors: Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller, Haizhou Li

    Abstract: Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories, overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity…

    Submitted 18 July, 2022; v1 submitted 9 January, 2022; originally announced January 2022.

    Comments: Accepted by IEEE Transactions on Affective Computing

  24. arXiv:2110.14509  [pdf, other]

    cs.LG cs.DB

    Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation

    Authors: Di Jin, Bunyamin Sisman, Hao Wei, Xin Luna Dong, Danai Koutra

    Abstract: Multi-source entity linkage focuses on integrating knowledge from multiple sources by linking the records that represent the same real-world entity. This is critical in high-impact applications such as data cleaning and user stitching. The state-of-the-art entity linkage pipelines mainly depend on supervised learning that requires abundant amounts of training data. However, collecting well-labeled…

    Submitted 27 October, 2021; originally announced October 2021.

  25. arXiv:2110.10326  [pdf, other]

    eess.AS cs.SD

    Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion

    Authors: Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li

    Abstract: Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Due to the hierarchical structure of speech emotion, it is challenging to disentangle the emotional style for different speakers. Inspired by the recent success of speaker disentanglement with variational autoencoder (VAE), we propose an any-to-any expressive…

    Submitted 21 July, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: Accepted by Interspeech 2022

  26. arXiv:2110.06434  [pdf, other]

    eess.AS cs.LG cs.SD

    DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding

    Authors: Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li

    Abstract: Conventional vocoders are commonly used as analysis tools to provide interpretable features for downstream tasks such as speech synthesis and voice conversion. They are built under certain assumptions about the signals, following signal processing principles, and are therefore not easily generalizable to different audio, for example, from speech to singing. In this paper, we propose a deep neural analyzer,…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: Accepted to ASRU 2021

  27. arXiv:2110.03342  [pdf, other]

    eess.AS cs.CL

    VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

    Authors: Junchen Lu, Berrak Sisman, Rui Liu, Mingyang Zhang, Haizhou Li

    Abstract: In this paper, we formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO). Unlike traditional speech synthesis, AVO seeks to generate not only human-sounding speech, but also perfect lip-speech synchronization. A natural solution to AVO is to condition the speech rendering on the temporal progression of lip sequence in the video.…

    Submitted 2 March, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: To appear at ICASSP 2022

  28. arXiv:2110.03156  [pdf, other]

    cs.SD cs.AI eess.AS

    StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis

    Authors: Rui Liu, Berrak Sisman, Haizhou Li

    Abstract: Recently, emotional speech synthesis has achieved remarkable performance. The emotion strength of synthesized speech can be controlled flexibly using a strength descriptor, which is obtained by an emotion attribute ranking function. However, a ranking function trained on specific data generalizes poorly, which limits its applicability for more realistic cases. In this paper, we propose a deep…

    Submitted 7 October, 2021; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022. 5 pages, 3 figures, 1 table. Our codes are available at: https://github.com/ttslr/StrengthNet

  29. arXiv:2107.03748  [pdf, other]

    eess.AS cs.SD

    Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer

    Authors: Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li

    Abstract: Traditional voice conversion (VC) has focused on speaker identity conversion for speech with a neutral expression. We note that emotional expression plays an essential role in daily communication, and the emotional style of speech can be speaker-dependent. In this paper, we study the technique to jointly convert the speaker identity and speaker-dependent emotional style, that is called express…

    Submitted 19 October, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

    Comments: Accepted to ASRU 2021

  30. arXiv:2106.00793  [pdf, other]

    cs.CL

    CoRI: Collective Relation Integration with Data Augmentation for Open Information Extraction

    Authors: Zhengbao Jiang, Jialong Han, Bunyamin Sisman, Xin Luna Dong

    Abstract: Integrating extracted knowledge from the Web into knowledge graphs (KGs) can facilitate tasks like question answering. We study relation integration that aims to align free-text relations in subject-relation-object extractions to relations in a target KG. To address the challenge that free-text relations are ambiguous, previous methods exploit neighbor entities and relations for additional context.…

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: ACL 2021

  31. arXiv:2105.14762  [pdf]

    cs.CL

    Emotional Voice Conversion: Theory, Databases and ESD

    Authors: Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

    Abstract: In this paper, we first provide a review of the state-of-the-art emotional voice conversion research, and the existing emotional speech databases. We then motivate the development of a novel emotional speech database (ESD) that addresses the increasing research need. With this paper, the ESD database is now made available to the research community. The ESD database consists of 350 parallel utteran…

    Submitted 9 January, 2022; v1 submitted 31 May, 2021; originally announced May 2021.

    Comments: Speech Communication

  32. arXiv:2104.01408  [pdf, other]

    cs.CL

    Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability

    Authors: Rui Liu, Berrak Sisman, Haizhou Li

    Abstract: Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. However, the generated voice is often not perceptually identifiable by its intended emotion category. To address this problem, we propose a new interactive training paradigm for ETTS, denoted as i-ETTS, which seeks to directly improve the emotion discriminability by interacting with a speech emotion recognition (SER)…

    Submitted 13 June, 2021; v1 submitted 3 April, 2021; originally announced April 2021.

    Comments: 5 pages, 4 figures, in Proceedings of INTERSPEECH 2021 conference, Speech Samples: https://ttslr.github.io/i-ETTS

  33. arXiv:2103.16809  [pdf, other]

    cs.CL

    Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

    Authors: Kun Zhou, Berrak Sisman, Haizhou Li

    Abstract: Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity. In this paper, we propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. We note that the proposed EVC framework leverages text-to-speech (TTS) as they share a common…

    Submitted 9 June, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

    Comments: Accepted by Interspeech 2021

  34. arXiv:2011.02314  [pdf, other]

    cs.SD cs.CL eess.AS

    VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

    Authors: Kun Zhou, Berrak Sisman, Haizhou Li

    Abstract: Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GA…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted by IEEE SLT 2021. arXiv admin note: text overlap with arXiv:2005.07025

  35. arXiv:2010.14794  [pdf, other]

    cs.SD cs.CL eess.AS

    Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

    Authors: Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

    Abstract: Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propo…

    Submitted 10 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

    Comments: Accepted by ICASSP 2021

  36. arXiv:2010.12423  [pdf, other]

    cs.LG cs.SD eess.AS

    GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis

    Authors: Rui Liu, Berrak Sisman, Haizhou Li

    Abstract: Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways. Transformer-based TTS is one such successful implementation. While Transformer TTS models the speech frame sequence well with a self-attention mechanism, it does not associate input text with output utterances from a syntactic point of view at sentence level. We propose a nov…

    Submitted 26 March, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: To appear at ICASSP'2021 (Accepted). (Speech samples: https://ttslr.github.io/GraphSpeech/)

  37. arXiv:2009.07203  [pdf, other]

    cs.DB cs.LG

    CorDEL: A Contrastive Deep Learning Approach for Entity Linkage

    Authors: Zhengyang Wang, Bunyamin Sisman, Hao Wei, Xin Luna Dong, Shuiwang Ji

    Abstract: Entity linkage (EL) is a critical problem in data cleaning and integration. In the past several decades, EL has typically been done by rule-based systems or traditional machine learning models with hand-curated features, both of which heavily depend on manual human inputs. With the ever-increasing growth of new data, deep learning (DL) based approaches have been proposed to alleviate the high cost…

    Submitted 2 December, 2020; v1 submitted 15 September, 2020; originally announced September 2020.

    Comments: Accepted by the 20th IEEE International Conference on Data Mining (ICDM 2020)

  38. arXiv:2008.05284  [pdf, other]

    eess.AS cs.CL cs.SD

    Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

    Authors: Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

    Abstract: Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks. We propose a multi-task learning s…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: To appear in IEEE Signal Processing Letters (SPL)

  39. arXiv:2008.04562  [pdf, other]

    eess.AS cs.CL cs.SD

    Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN

    Authors: Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li

    Abstract: Cross-lingual voice conversion aims to change the source speaker's voice to sound like that of the target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is hence more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear…

    Submitted 3 November, 2020; v1 submitted 11 August, 2020; originally announced August 2020.

    Comments: Accepted to APSIPA ASC 2020

  40. arXiv:2008.03992  [pdf, other]

    eess.AS cs.CL cs.SD

    VAW-GAN for Singing Voice Conversion with Non-parallel Training Data

    Authors: Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li

    Abstract: Singing voice conversion aims to convert a singer's voice from source to target without changing the singing content. Parallel training data is typically required to train a singing voice conversion system, which is, however, not practical in real-life applications. Recent encoder-decoder structures, such as variational autoencoding Wasserstein generative adversarial network (VAW-GAN), provide an…

    Submitted 3 November, 2020; v1 submitted 10 August, 2020; originally announced August 2020.

    Comments: Accepted to APSIPA ASC 2020

  41. arXiv:2008.03648  [pdf, other]

    eess.AS cs.SD

    An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning

    Authors: Berrak Sisman, Junichi Yamagishi, Simon King, Haizhou Li

    Abstract: Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With the recent advances in theory…

    Submitted 16 November, 2020; v1 submitted 9 August, 2020; originally announced August 2020.

    Comments: accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

  42. arXiv:2008.01490  [pdf, other]

    cs.SD eess.AS

    Expressive TTS Training with Frame and Style Reconstruction Loss

    Authors: Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li

    Abstract: We propose a novel training strategy for Tacotron-based text-to-speech (TTS) system to improve the expressiveness of speech. One of the key challenges in prosody modeling is the lack of reference that makes explicit modeling difficult. The proposed technique doesn't require prosody annotations from training data. It doesn't attempt to model prosody explicitly either, but rather encodes the associa…

    Submitted 12 April, 2021; v1 submitted 4 August, 2020; originally announced August 2020.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing

  43. arXiv:2005.07025  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

    Authors: Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li

    Abstract: Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. The prior studies on emotional voice conversion are mostly carried out under the assumption that emotion is speaker-dependent. We consider that there is a common code between speakers for emotional expression in a spoken language, therefore, a spe…

    Submitted 13 October, 2020; v1 submitted 13 May, 2020; originally announced May 2020.

    Comments: Accepted by Interspeech 2020

  44. arXiv:2002.00417  [pdf, other]

    eess.AS cs.CL cs.SD

    WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss

    Authors: Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

    Abstract: Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input. Such frameworks typically consist of a feature prediction network that maps character sequences to frequency-domain acoustic features, followed by a waveform reconstruction algorithm or a neural vocoder that generates the time-domain waveform from acoustic features. As the loss function is usually calculated on…

    Submitted 6 April, 2020; v1 submitted 2 February, 2020; originally announced February 2020.

    Comments: To appear at Odyssey 2020, Tokyo, Japan

  45. arXiv:2002.00198  [pdf, other]

    eess.AS cs.CL cs.SD

    Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

    Authors: Kun Zhou, Berrak Sisman, Haizhou Li

    Abstract: Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of fundamental frequency (F0) with a simple linear transform. As F0…

    Submitted 24 October, 2020; v1 submitted 1 February, 2020; originally announced February 2020.

    Comments: accepted by Speaker Odyssey 2020 in Tokyo, Japan

  46. AutoBlock: A Hands-off Blocking Framework for Entity Matching

    Authors: Wei Zhang, Hao Wei, Bunyamin Sisman, Xin Luna Dong, Christos Faloutsos, David Page

    Abstract: Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be matched. However, most of the traditional blocking methods are learning-free and key-based, and their successes are largely built on laborious human e…

    Submitted 6 December, 2019; originally announced December 2019.

    Comments: In The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20), February 3-7, 2020, Houston, TX, USA. ACM, Anchorage, Alaska, USA, 9 pages

  47. arXiv:1911.02839  [pdf, other]

    cs.CL cs.SD eess.AS

    Teacher-Student Training for Robust Tacotron-based TTS

    Authors: Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, Haizhou Li

    Abstract: While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways, the exposure bias problem in the autoregressive models remains an issue to be resolved. The exposure bias problem arises from the mismatch between the training and inference process, that results in unpredictable performance for out-of-domain test data at run-time. To overcome this, we propos…

    Submitted 11 February, 2020; v1 submitted 7 November, 2019; originally announced November 2019.

    Comments: To appear at ICASSP2020, Barcelona, Spain

  48. arXiv:1907.09657  [pdf, other]

    cs.DB

    Efficient Knowledge Graph Accuracy Evaluation

    Authors: Junyang Gao, Xian Li, Yifan Ethan Xu, Bunyamin Sisman, Xin Luna Dong, Jun Yang

    Abstract: Estimation of the accuracy of a large-scale knowledge graph (KG) often requires humans to annotate samples from the graph. How to obtain statistically meaningful estimates for accuracy evaluation while keeping human annotation costs low is a problem critical to the development cycle of a KG and its practical applications. Surprisingly, this challenging problem has largely been ignored in prior res…

    Submitted 22 July, 2019; originally announced July 2019.

    Comments: in VLDB 2019

  49. arXiv:1905.11449  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019

    Authors: Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, Satoshi Nakamura

    Abstract: We describe our submitted system for the ZeroSpeech Challenge 2019. The current challenge theme addresses the difficulty of constructing a speech synthesizer without any text or phonetic labels and requires a system that can (1) discover subword units in an unsupervised way, and (2) synthesize the speech with a target speaker's voice. Moreover, the system should also balance the discrimination sco…

    Submitted 29 May, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: Submitted to Interspeech 2019

  50. arXiv:1807.08447  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    LinkNBed: Multi-Graph Representation Learning with Entity Linkage

    Authors: Rakshit Trivedi, Bunyamin Sisman, Jun Ma, Christos Faloutsos, Hongyuan Zha, Xin Luna Dong

    Abstract: Knowledge graphs have emerged as an important model for studying complex multi-relational data. This has given rise to the construction of numerous large scale but incomplete knowledge graphs encoding information extracted from various resources. An effective and scalable approach to jointly learn over multiple graphs and eventually construct a unified graph is a crucial next step for the success…

    Submitted 23 July, 2018; originally announced July 2018.

    Comments: ACL 2018