Showing 1–14 of 14 results for author: Klejch, O

  1. arXiv:2407.12707  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    TTSDS -- Text-to-Speech Distribution Score

    Authors: Christoph Minixhofer, Ondřej Klejch, Peter Bell

    Abstract: Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how w…

    Submitted 22 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: Under review for SLT 2024

  2. arXiv:2309.15674  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Speech collage: code-switched audio generation by collaging monolingual corpora

    Authors: Amir Hussein, Dorsa Zeinali, Ondřej Klejch, Matthew Wiesner, Brian Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We…

    Submitted 27 September, 2023; originally announced September 2023.

  3. arXiv:2306.02153  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    Acoustic Word Embeddings for Untranscribed Target Languages with Continued Pretraining and Learned Pooling

    Authors: Ramon Sanabria, Ondrej Klejch, Hao Tang, Sharon Goldwater

    Abstract: Acoustic word embeddings are typically created by training a pooling function using pairs of word-like units. For unsupervised systems, these are mined using k-nearest neighbor (KNN) search, which is slow. Recently, mean-pooled representations from a pre-trained self-supervised English model were suggested as a promising alternative, but their performance on target languages was not fully competit…

    Submitted 3 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023

  4. arXiv:2305.16065  [pdf, other]

    eess.AS cs.CL cs.SD

    ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition

    Authors: Yuanchao Li, Zeyu Zhao, Ondrej Klejch, Peter Bell, Catherine Lai

    Abstract: In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing the ASR performance on emotion corpo…

    Submitted 28 May, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  5. arXiv:2303.18110  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR

    Authors: Ramon Sanabria, Nikolay Bogoychev, Nina Markl, Andrea Carmantini, Ondrej Klejch, Peter Bell

    Abstract: English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Although the great many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported based on test datasets which fail to represent the diversity of English…

    Submitted 31 March, 2023; originally announced March 2023.

    Comments: Accepted to IEEE ICASSP 2023

  6. arXiv:2211.16049  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Evaluating and reducing the distance between synthetic and real speech distributions

    Authors: Christoph Minixhofer, Ondřej Klejch, Peter Bell

    Abstract: While modern Text-to-Speech (TTS) systems can produce natural-sounding speech, they remain unable to reproduce the full diversity found in natural speech data. We consider the distribution of all possible real speech samples that could be generated by these speakers alongside the distribution of all synthetic samples that could be generated for the same set of speakers, using a particular TTS syst…

    Submitted 25 May, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: To be presented at INTERSPEECH 2023

  7. arXiv:2211.01458  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Zero-Shot Code-Switched Speech Recognition

    Authors: Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe

    Abstract: In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, th…

    Submitted 9 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: 5 pages

  8. arXiv:2111.06799  [pdf, other]

    cs.CL eess.AS

    Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR

    Authors: Ondrej Klejch, Electra Wallington, Peter Bell

    Abstract: We present a method for cross-lingual training an ASR system using absolutely no transcribed training data from the target language, and with no phonetic knowledge of the language in question. Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language. We apply this decipherment to phone sequences generated by…

    Submitted 6 June, 2022; v1 submitted 12 November, 2021; originally announced November 2021.

    Comments: Submitted to Interspeech 2022

  9. arXiv:2008.06580  [pdf, other]

    eess.AS cs.CL cs.SD

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

    Authors: Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, Pawel Swietojanski

    Abstract: We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data au…

    Submitted 28 February, 2021; v1 submitted 14 August, 2020; originally announced August 2020.

    Comments: Total of 31 pages, 27 figures. Associated repository: https://github.com/pswietojanski/ojsp_adaptation_review_2020

    Journal ref: IEEE Open Journal of Signal Processing, vol. 2, pp. 33-66, 2021

  10. arXiv:1910.10605  [pdf, ps, other]

    cs.CL cs.LG eess.AS

    Speaker Adaptive Training using Model Agnostic Meta-Learning

    Authors: Ondřej Klejch, Joachim Fainberg, Peter Bell, Steve Renals

    Abstract: Speaker adaptive training (SAT) of neural network acoustic models learns models in a way that makes them more suitable for adaptation to test conditions. Conventionally, model-based speaker adaptive training is performed by having a set of speaker dependent parameters that are jointly optimised with speaker independent parameters in order to remove speaker variation. However, this does not scale w…

    Submitted 23 October, 2019; originally announced October 2019.

    Comments: Accepted to IEEE ASRU 2019

  11. arXiv:1909.13759  [pdf, other]

    eess.AS cs.CL cs.SD

    Acoustic Model Adaptation from Raw Waveforms with SincNet

    Authors: Joachim Fainberg, Ondřej Klejch, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Raw waveform acoustic modelling has recently gained interest due to neural networks' ability to learn feature extraction, and the potential for finding better representations for a given scenario than hand-crafted features. SincNet has been proposed to reduce the number of parameters required in raw-waveform modelling, by restricting the filter functions, rather than having to learn every tap of e…

    Submitted 30 September, 2019; originally announced September 2019.

    Comments: Accepted to IEEE ASRU 2019

  12. arXiv:1906.11521  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models

    Authors: Ondrej Klejch, Joachim Fainberg, Peter Bell, Steve Renals

    Abstract: Acoustic model adaptation to unseen test recordings aims to reduce the mismatch between training and testing conditions. Most adaptation schemes for neural network models require the use of an initial one-best transcription for the test data, generated by an unadapted model, in order to estimate the adaptation transform. It has been found that adaptation methods using discriminative objective func…

    Submitted 27 June, 2019; originally announced June 2019.

  13. arXiv:1905.13150  [pdf, other]

    cs.CL cs.SD eess.AS

    Lattice-based lightly-supervised acoustic model training

    Authors: Joachim Fainberg, Ondřej Klejch, Steve Renals, Peter Bell

    Abstract: In the broadcast domain there is an abundance of related text data and partial transcriptions, such as closed captions and subtitles. This text data can be used for lightly supervised training, in which text matching the audio is selected using an existing speech recognition model. Current approaches to light supervision typically filter the data based on matching error rates between the transcrip…

    Submitted 13 July, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: Proc. INTERSPEECH 2019

  14. arXiv:1901.01342  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

    Authors: Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru

    Abstract: Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made com…

    Submitted 24 May, 2019; v1 submitted 4 January, 2019; originally announced January 2019.