
Showing 1–33 of 33 results for author: Bell, P

Searching in archive eess.

  1. arXiv:2407.12707  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    TTSDS -- Text-to-Speech Distribution Score

    Authors: Christoph Minixhofer, Ondřej Klejch, Peter Bell

    Abstract: Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how w…

    Submitted 22 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: Under review for SLT 2024

  2. arXiv:2406.08353  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

    Authors: Yuanchao Li, Peter Bell, Catherine Lai

    Abstract: Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SE…

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2406.00898  [pdf, other]

    cs.SD cs.CL eess.AS

    Phonetic Error Analysis of Raw Waveform Acoustic Models with Parametric and Non-Parametric CNNs

    Authors: Erfan Loweimi, Andrea Carmantini, Peter Bell, Steve Renals, Zoran Cvetkovic

    Abstract: In this paper, we analyse the error patterns of the raw waveform acoustic models in TIMIT's phone recognition task. Our analysis goes beyond the conventional phone error rate (PER) metric. We categorise the phones into three groups: {affricate, diphthong, fricative, nasal, plosive, semi-vowel, vowel, silence}, {consonant, vowel+, silence}, and {voiced, unvoiced, silence}, and compute the PER for e…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: 5 pages, 6 figures, 3 tables

  4. arXiv:2405.20064  [pdf, other]

    eess.AS cs.SD

    1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem

    Authors: Mingjie Chen, Hezhao Zhang, Yuanchao Li, Jiachen Luo, Wen Wu, Ziyang Ma, Peter Bell, Catherine Lai, Joshua Reiss, Lin Wang, Philip C. Woodland, Xie Chen, Huy Phan, Thomas Hain

    Abstract: Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for t…

    Submitted 30 May, 2024; originally announced May 2024.

  5. arXiv:2405.19796  [pdf, other]

    cs.SD cs.AI eess.AS

    Explainable Attribute-Based Speaker Verification

    Authors: Xiaoliang Wu, Chau Luu, Peter Bell, Ajitha Rajan

    Abstract: This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age…

    Submitted 30 May, 2024; originally announced May 2024.

  6. arXiv:2405.16677  [pdf, other]

    eess.AS cs.CL cs.SD

    Crossmodal ASR Error Correction with Discrete Speech Units

    Authors: Yuanchao Li, Pinzhen Chen, Peter Bell, Catherine Lai

    Abstract: ASR remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue: the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with…

    Submitted 26 May, 2024; originally announced May 2024.

  7. arXiv:2305.18011  [pdf, other]

    cs.CL cs.SD eess.AS

    Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition

    Authors: Xiaoliang Wu, Peter Bell, Ajitha Rajan

    Abstract: Explainable AI (XAI) techniques have been widely used to help explain and understand the output of deep learning models in fields such as image classification and Natural Language Processing. Interest in using XAI techniques to explain deep learning-based automatic speech recognition (ASR) is emerging, but there is not enough evidence on whether these explanations can be trusted. To address this,…

    Submitted 29 May, 2023; originally announced May 2023.

  8. arXiv:2305.16076  [pdf, other]

    eess.AS cs.SD

    Transfer Learning for Personality Perception via Speech Emotion Recognition

    Authors: Yuanchao Li, Peter Bell, Catherine Lai

    Abstract: Holistic perception of affective attributes is an important human perceptual ability. However, this ability is far from being realized in current affective computing, as not all of the attributes are well studied and their interrelationships are poorly understood. In this work, we investigate the relationship between two affective attributes: personality and emotion, from a transfer learning persp…

    Submitted 28 May, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  9. arXiv:2305.16065  [pdf, other]

    eess.AS cs.CL cs.SD

    ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition

    Authors: Yuanchao Li, Zeyu Zhao, Ondrej Klejch, Peter Bell, Catherine Lai

    Abstract: In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing the ASR performance on emotion corpo…

    Submitted 28 May, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  10. arXiv:2305.13583  [pdf, other]

    cs.CL cs.MM eess.AS eess.IV

    Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition

    Authors: Yaoting Wang, Yuanchao Li, Paul Pu Liang, Louis-Philippe Morency, Peter Bell, Catherine Lai

    Abstract: Fusing multiple modalities has proven effective for multimodal information processing. However, the incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. In this study, we first analyze how the salient affective information in one modality can be affected by the other, and demonstrate that inter-modal incongruity exists latently in crossmodal att…

    Submitted 12 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: *First two authors contributed equally

  11. arXiv:2303.18110  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR

    Authors: Ramon Sanabria, Nikolay Bogoychev, Nina Markl, Andrea Carmantini, Ondrej Klejch, Peter Bell

    Abstract: English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the great many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported based on test datasets which fail to represent the diversity of English…

    Submitted 31 March, 2023; originally announced March 2023.

    Comments: Accepted to IEEE ICASSP 2023

  12. arXiv:2302.14062  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    Explanations for Automatic Speech Recognition

    Authors: Xiaoliang Wu, Peter Bell, Ajitha Rajan

    Abstract: We address quality assessment for neural network based ASR by providing explanations that help increase our understanding of the system and ultimately help build trust in the system. Compared to simple classification labels, explaining transcriptions is more challenging, as judging their correctness is not straightforward and transcriptions, as variable-length sequences, are not handled by existing i…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted by Speech Track, ICASSP 2023

  13. arXiv:2211.16049  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Evaluating and reducing the distance between synthetic and real speech distributions

    Authors: Christoph Minixhofer, Ondřej Klejch, Peter Bell

    Abstract: While modern Text-to-Speech (TTS) systems can produce natural-sounding speech, they remain unable to reproduce the full diversity found in natural speech data. We consider the distribution of all possible real speech samples that could be generated by these speakers alongside the distribution of all synthetic samples that could be generated for the same set of speakers, using a particular TTS syst…

    Submitted 25 May, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: To be presented at INTERSPEECH 2023

  14. arXiv:2211.05163  [pdf, other]

    cs.MM cs.SD eess.AS

    Multimodal Dyadic Impression Recognition via Listener Adaptive Cross-Domain Fusion

    Authors: Yuanchao Li, Peter Bell, Catherine Lai

    Abstract: As a sub-branch of affective computing, impression recognition, e.g., perception of speaker characteristics such as warmth or competence, is potentially a critical part of both human-human conversations and spoken dialogue systems. Most research has studied impressions only from the behaviors expressed by the speaker or the response from the listener, yet ignored their latent connection. In this p…

    Submitted 16 February, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023. arXiv admin note: substantial text overlap with arXiv:2203.13932

  15. arXiv:2210.02595  [pdf, other]

    eess.AS cs.CL cs.SD

    Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora

    Authors: Yuanchao Li, Yumnah Mohamied, Peter Bell, Catherine Lai

    Abstract: Self-supervised speech models have grown fast during the past few years and have proven feasible for use in various downstream tasks. Some recent work has started to look at the characteristics of these models, yet many concerns have not been fully addressed. In this work, we conduct a study on emotional corpora to explore a popular self-supervised model -- wav2vec 2.0. Via a set of quantitative a…

    Submitted 12 December, 2022; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  16. arXiv:2111.06799  [pdf, other]

    cs.CL eess.AS

    Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR

    Authors: Ondrej Klejch, Electra Wallington, Peter Bell

    Abstract: We present a method for cross-lingual training of an ASR system using absolutely no transcribed training data from the target language, and with no phonetic knowledge of the language in question. Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language. We apply this decipherment to phone sequences generated by…

    Submitted 6 June, 2022; v1 submitted 12 November, 2021; originally announced November 2021.

    Comments: Submitted to Interspeech 2022

  17. arXiv:2110.15684  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    Fusing ASR Outputs in Joint Training for Speech Emotion Recognition

    Authors: Yuanchao Li, Peter Bell, Catherine Lai

    Abstract: Alongside acoustic information, linguistic features based on speech transcripts have been proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR)…

    Submitted 17 March, 2022; v1 submitted 29 October, 2021; originally announced October 2021.

    Comments: Accepted for ICASSP 2022

  18. arXiv:2102.04697  [pdf, other]

    eess.AS cs.AI cs.SD

    Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers

    Authors: Shucong Zhang, Cong-Thanh Do, Rama Doddipatla, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Although the lower layers of a deep neural network learn features which are transferable across datasets, these layers are not transferable within the same dataset. That is, in general, freezing the trained feature extractor (the lower layers) and retraining the classifier (the upper layers) on the same dataset leads to worse performance. In this paper, for the first time, we show that the frozen…

    Submitted 9 February, 2021; originally announced February 2021.

    Comments: Accepted by ICASSP 2021

  19. arXiv:2011.04906  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers

    Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note the range of the learned context increases from the lower to upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a q…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: arXiv admin note: substantial text overlap with arXiv:2005.13895

  20. arXiv:2011.04004  [pdf, other]

    cs.CL cs.SD eess.AS

    Stochastic Attention Head Removal: A simple and effective method for improving Transformer Based ASR Models

    Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Recently, Transformer based models have shown competitive automatic speech recognition (ASR) performance. One key factor in the success of these models is the multi-head attention mechanism. However, for trained models, we have previously observed that many attention matrices are close to diagonal, indicating the redundancy of the corresponding attention heads. We have also found that some archite…

    Submitted 6 April, 2021; v1 submitted 8 November, 2020; originally announced November 2020.

  21. arXiv:2010.14269  [pdf, other]

    cs.SD cs.LG eess.AS

    Leveraging speaker attribute information using multi task learning for speaker verification and diarization

    Authors: Chau Luu, Peter Bell, Steve Renals

    Abstract: Deep speaker embeddings have become the leading method for encoding speaker identity in speaker recognition tasks. The embedding space should ideally capture the variations between all possible speakers, encoding the multiple acoustic aspects that make up a speaker's identity, whilst being robust to non-speaker acoustic variation. Deep speaker embeddings are normally trained discriminatively, pred…

    Submitted 23 April, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Submitted to Interspeech 2021

  22. arXiv:2009.03807  [pdf, other]

    cs.CV eess.IV

    Understanding Compositional Structures in Art Historical Images using Pose and Gaze Priors

    Authors: Prathmesh Madhu, Tilman Marquart, Ronak Kosti, Peter Bell, Andreas Maier, Vincent Christlein

    Abstract: Image composition as a tool for the analysis of artworks is of extreme significance for art historians. These compositions are useful in analyzing the interactions in an image to study artists and their artworks. Max Imdahl in his work called Ikonik, along with other prominent art historians of the 20th century, underlined the aesthetic and semantic importance of the structural composition of an imag…

    Submitted 8 September, 2020; originally announced September 2020.

    Comments: To be Published in ECCV 2020 Workshops (VISART V)

  23. arXiv:2008.06580  [pdf, other]

    eess.AS cs.CL cs.SD

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

    Authors: Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, Pawel Swietojanski

    Abstract: We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data au…

    Submitted 28 February, 2021; v1 submitted 14 August, 2020; originally announced August 2020.

    Comments: Total of 31 pages, 27 figures. Associated repository: https://github.com/pswietojanski/ojsp_adaptation_review_2020

    Journal ref: IEEE Open Journal of Signal Processing, vol. 2, pp. 33-66, 2021

  24. arXiv:2005.13895  [pdf, other]

    eess.AS cs.CL cs.SD

    When Can Self-Attention Be Replaced by Feed Forward Layers?

    Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability to capture temporal relationships without being limited by the distance between two related events. However, we note that the range of the learned context prog…

    Submitted 28 May, 2020; originally announced May 2020.

  25. arXiv:2002.00453  [pdf, other]

    cs.SD cs.LG eess.AS

    DropClass and DropAdapt: Dropping classes for deep speaker representation learning

    Authors: Chau Luu, Peter Bell, Steve Renals

    Abstract: Many recent works on deep speaker embeddings train their feature extraction networks on large classification tasks, distinguishing between all speakers in a training set. Empirically, this has been shown to produce speaker-discriminative embeddings, even for unseen speakers. However, it is not clear that this is the optimal means of training embeddings that generalize well. This work proposes two…

    Submitted 2 February, 2020; originally announced February 2020.

    Comments: Submitted to Speaker Odyssey 2020

  26. arXiv:1910.14443  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Multi-scale Octave Convolutions for Robust Speech Recognition

    Authors: Joanna Rownicka, Peter Bell, Steve Renals

    Abstract: We propose a multi-scale octave convolution layer to learn robust speech representations efficiently. Octave convolutions were introduced by Chen et al. [1] in the computer vision field to reduce the spatial redundancy of the feature maps by decomposing the output of a convolutional layer into feature maps at two different spatial resolutions, one octave apart. This approach improved the efficiency…

    Submitted 31 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  27. Channel adversarial training for speaker verification and diarization

    Authors: Chau Luu, Peter Bell, Steve Renals

    Abstract: Previous work has encouraged domain-invariance in deep speaker embedding by adversarially classifying the dataset or labelled environment to which the generated features belong. We propose a training strategy which aims to produce features that are invariant at the granularity of the recording or channel, a finer grained objective than dataset- or environment-invariance. By training an adversary t…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: Submitted to IEEE ICASSP 2020

  28. arXiv:1910.10605  [pdf, ps, other]

    cs.CL cs.LG eess.AS

    Speaker Adaptive Training using Model Agnostic Meta-Learning

    Authors: Ondřej Klejch, Joachim Fainberg, Peter Bell, Steve Renals

    Abstract: Speaker adaptive training (SAT) of neural network acoustic models learns models in a way that makes them more suitable for adaptation to test conditions. Conventionally, model-based speaker adaptive training is performed by having a set of speaker dependent parameters that are jointly optimised with speaker independent parameters in order to remove speaker variation. However, this does not scale w…

    Submitted 23 October, 2019; originally announced October 2019.

    Comments: Accepted to IEEE ASRU 2019

  29. arXiv:1910.02168  [pdf, other]

    eess.AS

    Cross lingual transfer learning for zero-resource domain adaptation

    Authors: Alberto Abad, Peter Bell, Andrea Carmantini, Steve Renals

    Abstract: We propose a method for zero-resource domain adaptation of DNN acoustic models, for use in low-resource situations where the only in-language training data available may be poorly matched to the intended target domain. Our method uses a multi-lingual model in which several DNN layers are shared between languages. This architecture enables domain adaptation transforms learned for one well-resourced…

    Submitted 29 October, 2019; v1 submitted 4 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020. Main updates with respect to previous versions: same network config in all experiments, added Babel/Material LR target language experiments, added comparison with alternative/similar methods of cross-lingual adaptation

  30. arXiv:1909.13759  [pdf, other]

    eess.AS cs.CL cs.SD

    Acoustic Model Adaptation from Raw Waveforms with SincNet

    Authors: Joachim Fainberg, Ondřej Klejch, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Raw waveform acoustic modelling has recently gained interest due to neural networks' ability to learn feature extraction, and the potential for finding better representations for a given scenario than hand-crafted features. SincNet has been proposed to reduce the number of parameters required in raw-waveform modelling, by restricting the filter functions, rather than having to learn every tap of e…

    Submitted 30 September, 2019; originally announced September 2019.

    Comments: Accepted to IEEE ASRU 2019

  31. arXiv:1909.13537  [pdf, other]

    cs.CL cs.SD eess.AS

    Embeddings for DNN speaker adaptive training

    Authors: Joanna Rownicka, Peter Bell, Steve Renals

    Abstract: In this work, we investigate the use of embeddings for speaker-adaptive training of DNNs (DNN-SAT) focusing on a small amount of adaptation data per speaker. DNN-SAT can be viewed as learning a mapping from each embedding to transformation parameters that are applied to the shared parameters of the DNN. We investigate different approaches to applying these transformations, and find that with a goo…

    Submitted 30 September, 2019; originally announced September 2019.

    Comments: Accepted at ASRU 2019

  32. arXiv:1906.11521  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models

    Authors: Ondrej Klejch, Joachim Fainberg, Peter Bell, Steve Renals

    Abstract: Acoustic model adaptation to unseen test recordings aims to reduce the mismatch between training and testing conditions. Most adaptation schemes for neural network models require the use of an initial one-best transcription for the test data, generated by an unadapted model, in order to estimate the adaptation transform. It has been found that adaptation methods using discriminative objective func…

    Submitted 27 June, 2019; originally announced June 2019.

  33. arXiv:1905.13150  [pdf, other]

    cs.CL cs.SD eess.AS

    Lattice-based lightly-supervised acoustic model training

    Authors: Joachim Fainberg, Ondřej Klejch, Steve Renals, Peter Bell

    Abstract: In the broadcast domain there is an abundance of related text data and partial transcriptions, such as closed captions and subtitles. This text data can be used for lightly supervised training, in which text matching the audio is selected using an existing speech recognition model. Current approaches to light supervision typically filter the data based on matching error rates between the transcrip…

    Submitted 13 July, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: Proc. INTERSPEECH 2019