Zum Hauptinhalt springen

Showing 1–22 of 22 results for author: Ellinas, N

Searching in archive cs. Search in all archives.
.
  1. arXiv:2404.01805  [pdf, other

    cs.LG

    Improved Text Emotion Prediction Using Combined Valence and Arousal Ordinal Classification

    Authors: Michael Mitsios, Georgios Vamvoukakis, Georgia Maniati, Nikolaos Ellinas, Georgios Dimitriou, Konstantinos Markopoulos, Panos Kakoulidis, Alexandra Vioni, Myrsini Christidou, Junkwang Oh, Gunu Jho, Inchul Hwang, Georgios Vardaxoglou, Aimilios Chalamandaris, Pirros Tsiakoulis, Spyros Raptis

    Abstract: Emotion detection in textual data has received growing interest in recent years, as it is pivotal for developing empathetic human-computer interaction systems. This paper introduces a method for categorizing emotions from text, which acknowledges and differentiates between the diversified similarities and distinctions of various emotions. Initially, we establish a baseline by training a transforme… ▽ More

    Submitted 2 April, 2024; originally announced April 2024.

  2. arXiv:2402.01520  [pdf, ps, other

    cs.SD cs.LG eess.AS

    Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations

    Authors: Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Myrsini Christidou, Alexandra Vioni, Georgia Maniati, Junkwang Oh, Gunu Jho, Inchul Hwang, Pirros Tsiakoulis, Aimilios Chalamandaris

    Abstract: In this paper, we propose a singing voice synthesis model, Karaoker-SSL, that is trained only on text and speech data as a typical multi-speaker acoustic model. It is a low-resource pipeline that does not utilize any singing data end-to-end, since its vocoder is also trained on speech data. Karaoker-SSL is conditioned by self-supervised speech representations in an unsupervised manner. We preproce… ▽ More

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted to IEEE ICASSP SASB 2024

  3. arXiv:2211.16307  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Controllable speech synthesis by learning discrete phoneme-level prosodic representations

    Authors: Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni, June Sig Sung, Aimilios Chalamandaris, Pirros Tsiakoulis, Paris Mastorocostas

    Abstract: In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autore… ▽ More

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: Final published version available at: Speech Communication. arXiv admin note: substantial text overlap with arXiv:2111.10168

  4. arXiv:2211.01327  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

    Authors: Konstantinos Klapsas, Karolos Nikitaras, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures at the task of predicting phoneme level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics t… ▽ More

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  5. arXiv:2211.00523  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

    Authors: Karolos Nikitaras, Konstantinos Klapsas, Nikolaos Ellinas, Georgia Maniati, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate in the correspond… ▽ More

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  6. arXiv:2211.00375  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Generating Multilingual Gender-Ambiguous Text-to-Speech Voices

    Authors: Konstantinos Markopoulos, Georgia Maniati, Georgios Vamvoukakis, Nikolaos Ellinas, Georgios Vardaxoglou, Panos Kakoulidis, Junkwang Oh, Gunu Jho, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis, Spyros Raptis

    Abstract: The gender of any voice user interface is a key element of its perceived identity. Recently, there has been increasing interest in interfaces where the gender is ambiguous rather than clearly identifying as female or male. This work addresses the task of generating novel gender-ambiguous TTS voices in a multi-speaker, multilingual setting. This is accomplished by efficiently sampling from a latent… ▽ More

    Submitted 11 June, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Accepted to INTERSPEECH 2023

  7. arXiv:2211.00342  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

    Authors: Alexandra Vioni, Georgia Maniati, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with re… ▽ More

    Submitted 7 May, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Proceedings of ICASSP 2023

  8. arXiv:2210.17264   

    cs.SD cs.CL cs.LG eess.AS

    Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

    Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Georgia Maniati, Panos Kakoulidis, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model used is based on a non-attentive Tacotron architecture, where the decoder has been replaced with a normalizing flow network conditioned on the speaker identity, allowing both TTS and voice conversion (VC)… ▽ More

    Submitted 27 February, 2024; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: Fundamental changes to the model described and experimental procedure

  9. arXiv:2204.05070  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Fine-grained Noise Control for Multispeaker Speech Synthesis

    Authors: Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations.Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre from any residual factors, such as recording conditions and background noise.This paper pr… ▽ More

    Submitted 27 October, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  10. Karaoker: Alignment-free singing voice synthesis with speech training data

    Authors: Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

    Abstract: Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthes… ▽ More

    Submitted 31 August, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  11. arXiv:2204.03421  [pdf, ps, other

    cs.SD cs.LG eess.AS

    Self-supervised learning for robust voice cloning

    Authors: Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are… ▽ More

    Submitted 2 November, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  12. arXiv:2204.03040  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

    Authors: Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a publ… ▽ More

    Submitted 24 August, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  13. arXiv:2111.10177  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis

    Authors: Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering… ▽ More

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of ICASSP 2021

  14. arXiv:2111.10173  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Word-Level Style Control for Expressive, Non-attentive Speech Synthesis

    Authors: Konstantinos Klapsas, Nikolaos Ellinas, June Sig Sung, Hyoungmin Park, Spyros Raptis

    Abstract: This paper presents an expressive speech synthesis architecture for modeling and controlling the speaking style at a word level. It attempts to learn word-level stylistic and prosodic representations of the speech data, with the aid of two encoders. The first one models style by finding a combination of style tokens for each word given the acoustic features, and the second outputs a word-level seq… ▽ More

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of SPECOM 2021

  15. arXiv:2111.10168  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

    Authors: Myrsini Christidou, Alexandra Vioni, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control ra… ▽ More

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of SPECOM 2021

  16. arXiv:2111.09146  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

    Authors: Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, Georgia Maniati, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris

    Abstract: In this paper, a text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice. It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data and which provides prosody control at the phoneme level. Dataset augmentation and additional prosody manipulation based on traditional DSP algorithms are also investigated. The neural TTS model is fine-… ▽ More

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of 11th ISCA Speech Synthesis Workshop (SSW 11)

  17. arXiv:2111.09075  [pdf, ps, other

    cs.SD cs.CL cs.LG eess.AS

    Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

    Authors: Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has been recently proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologi… ▽ More

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of INTERSPEECH 2021

  18. arXiv:2111.09052  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

    Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis

    Abstract: This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by usin… ▽ More

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of INTERSPEECH 2020

  19. Unsupervised low-rank representations for speech emotion recognition

    Authors: Georgios Paraskevopoulos, Efthymios Tzinis, Nikolaos Ellinas, Theodoros Giannakopoulos, Alexandros Potamianos

    Abstract: We examine the use of linear and non-linear dimensionality reduction algorithms for extracting low-rank feature representations for speech emotion recognition. Two feature sets are used, one based on low-level descriptors and their aggregations (IS10) and one modeling recurrence dynamics of speech (RQA), as well as their fusion. We report speech emotion recognition (SER) results for learned repres… ▽ More

    Submitted 14 April, 2021; originally announced April 2021.

    Comments: Published at Interspeech 2019 https://www.isca-speech.org/archive/Interspeech_2019/abstracts/2769.html

  20. arXiv:1804.06659  [pdf, other

    cs.CL

    NTUA-SLP at SemEval-2018 Task 3: Tracking Ironic Tweets using Ensembles of Word and Character Level Attentive RNNs

    Authors: Christos Baziotis, Nikos Athanasiou, Pinelopi Papalampidi, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, Alexandros Potamianos

    Abstract: In this paper we present two deep-learning systems that competed at SemEval-2018 Task 3 "Irony detection in English tweets". We design and ensemble two independent models, based on recurrent neural networks (Bi-LSTM), which operate at the word and character level, in order to capture both the semantic and syntactic information in tweets. Our models are augmented with a self-attention mechanism, in… ▽ More

    Submitted 18 April, 2018; originally announced April 2018.

    Comments: SemEval-2018, Task 3 "Irony detection in English tweets"

  21. arXiv:1804.06658  [pdf, other

    cs.CL

    NTUA-SLP at SemEval-2018 Task 1: Predicting Affective Content in Tweets with Deep Attentive RNNs and Transfer Learning

    Authors: Christos Baziotis, Nikos Athanasiou, Alexandra Chronopoulou, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, Shrikanth Narayanan, Alexandros Potamianos

    Abstract: In this paper we present deep-learning models that submitted to the SemEval-2018 Task~1 competition: "Affect in Tweets". We participated in all subtasks for English tweets. We propose a Bi-LSTM architecture equipped with a multi-layer self attention mechanism. The attention mechanism improves the model performance and allows us to identify salient words in tweets, as well as gain insight into the… ▽ More

    Submitted 18 April, 2018; originally announced April 2018.

    Comments: Semeval 2018, Task 1 "Affect in Tweets"

  22. arXiv:1804.06657  [pdf, other

    cs.CL

    NTUA-SLP at SemEval-2018 Task 2: Predicting Emojis using RNNs with Context-aware Attention

    Authors: Christos Baziotis, Nikos Athanasiou, Georgios Paraskevopoulos, Nikolaos Ellinas, Athanasia Kolovou, Alexandros Potamianos

    Abstract: In this paper we present a deep-learning model that competed at SemEval-2018 Task 2 "Multilingual Emoji Prediction". We participated in subtask A, in which we are called to predict the most likely associated emoji in English tweets. The proposed architecture relies on a Long Short-Term Memory network, augmented with an attention mechanism, that conditions the weight of each word, on a "context vec… ▽ More

    Submitted 18 April, 2018; originally announced April 2018.

    Comments: SemEval-2018, Task 2 "Multilingual Emoji Prediction"