Search | arXiv e-print repository

Extending RNN-T-based speech recognition systems with emotion and language classification

Authors: Zvi Kons, Hagai Aronowitz, Edmilson Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon

Abstract: Speech transcription, emotion recognition, and language identification are usually considered to be three different tasks. Each one requires a different model with a different architecture and training process. We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification a… ▽ More Speech transcription, emotion recognition, and language identification are usually considered to be three different tasks. Each one requires a different model with a different architecture and training process. We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification as well as for speech recognition. Our work extends the STT system for emotion classification through minimal changes, and shows successful results on the IEMOCAP and MELD datasets. In addition, we demonstrate that by adding a lightweight component to the RNN-T module, it can also be used for language identification. In our evaluations, this new classifier demonstrates state-of-the-art accuracy for the NIST-LRE-07 dataset. △ Less

Submitted 28 July, 2022; originally announced July 2022.

Comments: Accepted for publication in Interspeech 2022

arXiv:2203.00613 [pdf]

Towards a Common Speech Analysis Engine

Authors: Hagai Aronowitz, Itai Gat, Edmilson Morais, Weizhong Zhu, Ron Hoory

Abstract: Recent innovations in self-supervised representation learning have led to remarkable advances in natural language processing. That said, in the speech processing domain, self-supervised representation learning-based systems are not yet considered state-of-the-art. We propose leveraging recent advances in self-supervised-based speech processing to create a common speech analysis engine. Such an eng… ▽ More Recent innovations in self-supervised representation learning have led to remarkable advances in natural language processing. That said, in the speech processing domain, self-supervised representation learning-based systems are not yet considered state-of-the-art. We propose leveraging recent advances in self-supervised-based speech processing to create a common speech analysis engine. Such an engine should be able to handle multiple speech processing tasks, using a single architecture, to obtain state-of-the-art accuracy. The engine must also enable support for new tasks with small training datasets. Beyond that, a common engine should be capable of supporting distributed training with client in-house private data. We present the architecture for a common speech analysis engine based on the HuBERT self-supervised speech representation. Based on experiments, we report our results for language identification and emotion recognition on the standard evaluations NIST-LRE 07 and IEMOCAP. Our results surpass the state-of-the-art performance reported so far on these tasks. We also analyzed our engine on the emotion recognition task using reduced amounts of training data and show how to achieve improved results. △ Less

Submitted 1 March, 2022; originally announced March 2022.

Comments: ICASSP 2022

arXiv:2202.03896 [pdf]

Speech Emotion Recognition using Self-Supervised Features

Authors: Edmilson Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus Damasceno, Hagai Aronowitz

Abstract: Self-supervised pre-trained features have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to- End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration o… ▽ More Self-supervised pre-trained features have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to- End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features and back-end classification networks. The proposed monomodal speechonly based system not only achieves SOTA results, but also brings light to the possibility of powerful and well finetuned self-supervised acoustic features that reach results similar to the results achieved by SOTA multimodal systems using both Speech and Text modalities. △ Less

Submitted 6 February, 2022; originally announced February 2022.

Comments: 5 pages, 4 figures, 2 tables, ICASSP 2022

arXiv:2202.01252 [pdf, other]

Speaker Normalization for Self-supervised Speech Emotion Recognition

Authors: Itai Gat, Hagai Aronowitz, Weizhong Zhu, Edmilson Morais, Ron Hoory

Abstract: Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion r… ▽ More Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation. We demonstrate the efficacy of our method on both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset. △ Less

Submitted 6 November, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

Comments: ICASSP 22

arXiv:2104.05752 [pdf, other]

Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs

Authors: Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang Kuo, Samuel Thomas, Edmilson Morais

Abstract: A major focus of recent research in spoken language understanding (SLU) has been on the end-to-end approach where a single model can predict intents directly from speech inputs without intermediate transcripts. However, this approach presents some challenges. First, since speech can be considered as personally identifiable information, in some cases only automatic speech recognition (ASR) transcri… ▽ More A major focus of recent research in spoken language understanding (SLU) has been on the end-to-end approach where a single model can predict intents directly from speech inputs without intermediate transcripts. However, this approach presents some challenges. First, since speech can be considered as personally identifiable information, in some cases only automatic speech recognition (ASR) transcripts are accessible. Second, intent-labeled speech data is scarce. To address the first challenge, we propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both. We demonstrate strong performance for either modality separately, and when both speech and ASR transcripts are available, through system combination, we achieve better results than using a single input modality. To address the second challenge, we leverage a semantically robust pre-trained BERT model and adopt a cross-modal system that co-trains text embeddings and acoustic embeddings in a shared latent space. We further enhance this system by utilizing an acoustic module pre-trained on LibriSpeech and domain-adapting the text module on our target datasets. Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance on Snips SLU and Fluent Speech Commands datasets. △ Less

Submitted 14 June, 2021; v1 submitted 7 April, 2021; originally announced April 2021.

Comments: Accepted to Interspeech 2021

arXiv:2011.08238 [pdf]

End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features

Authors: Edmilson Morais, Hong-Kwang J. Kuo, Samuel Thomas, Zoltan Tuske, Brian Kingsbury

Abstract: Transformer networks and self-supervised pre-training have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SLU transformer network based architecture which allows the use of self-supervised p… ▽ More Transformer networks and self-supervised pre-training have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SLU transformer network based architecture which allows the use of self-supervised pre-trained acoustic features, pre-trained model initialization and multi-task training. Several SLU experiments for predicting intent and entity labels/values using the ATIS dataset are performed. These experiments investigate the interaction of pre-trained model initialization and multi-task training with either traditional filterbank or self-supervised pre-trained acoustic features. Results show not only that self-supervised pre-trained acoustic features outperform filterbank features in almost all the experiments, but also that when these features are used in combination with multi-task training, they almost eliminate the necessity of pre-trained model initialization. △ Less

Submitted 16 November, 2020; originally announced November 2020.

Comments: 5 pages, 3 tables and 1 figure

arXiv:1907.06381 [pdf, other]

A Survey on Zero Knowledge Range Proofs and Applications

Authors: Eduardo Morais, Tommy Koens, Cees van Wijk, Aleksei Koren

Abstract: In last years, there has been an increasing effort to leverage Distributed Ledger Technology (DLT), including blockchain. One of the main topics of interest, given its importance, is the research and development of privacy mechanisms, as for example is the case of Zero Knowledge Proofs (ZKP). ZKP is a cryptographic technique that can be used to hide information that is put into the ledger, while s… ▽ More In last years, there has been an increasing effort to leverage Distributed Ledger Technology (DLT), including blockchain. One of the main topics of interest, given its importance, is the research and development of privacy mechanisms, as for example is the case of Zero Knowledge Proofs (ZKP). ZKP is a cryptographic technique that can be used to hide information that is put into the ledger, while still allowing to perform validation of this data. In this work we describe different strategies to construct Zero Knowledge Range Proofs (ZKRP), as for example the scheme proposed by Boudot in 2001; the one proposed in 2008 by Camenisch et al, and Bulletproofs, proposed in 2017. We also compare these strategies and discuss possible use cases. Since Bulletproofs is the most efficient construction, we will give a detailed description of its algorithms and optimizations. Bulletproofs is not only more efficient than previous schemes, but also avoids the trusted setup, which is a requirement that is not desirable in the context of Distributed Ledger Technology (DLT) and blockchain. In case of cryptocurrencies, if the setup phase is compromised, it would be possible to generate money out of thin air. Interestingly, Bulletproofs can also be used to construct generic Zero Knowledge Proofs (ZKP), in the sense that it can be used to prove generic statements, and thus it is not only restricted to ZKRP, but it can be used for any kind of Proof of Knowledge (PoK). Hence Bulletproofs leads to a more powerful tool to provide privacy for DLT. Here we describe in detail the algorithms involved in Bulletproofs protocol for ZKRP. Also, we present our implementation, which was open sourced. △ Less

Submitted 15 July, 2019; originally announced July 2019.

Showing 1–7 of 7 results for author: Morais, E