-
Personalized Speech Enhancement Without a Separate Speaker Embedding Model
Authors:
Tanel Pärnamaa,
Ando Saabas
Abstract:
Personalized speech enhancement (PSE) models can improve the audio quality of teleconferencing systems by adapting to the characteristics of a speaker's voice. However, most existing methods require a separate speaker embedding model to extract a vector representation of the speaker from enrollment audio, which adds complexity to the training and deployment process. We propose to use the internal…
▽ More
Personalized speech enhancement (PSE) models can improve the audio quality of teleconferencing systems by adapting to the characteristics of a speaker's voice. However, most existing methods require a separate speaker embedding model to extract a vector representation of the speaker from enrollment audio, which adds complexity to the training and deployment process. We propose to use the internal representation of the PSE model itself as the speaker embedding, thereby avoiding the need for a separate model. We show that our approach performs equally well or better than the standard method of using a pre-trained speaker embedding model on noise suppression and echo cancellation tasks. Moreover, our approach surpasses the ICASSP 2023 Deep Noise Suppression Challenge winner by 0.15 in Mean Opinion Score.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
The ICASSP 2024 Audio Deep Packet Loss Concealment Challenge
Authors:
Lorenz Diener,
Solomiya Branets,
Ando Saabas,
Ross Cutler
Abstract:
Audio packet loss concealment is the hiding of gaps in VoIP audio streams caused by network packet loss. With the ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge, we build on the success of the previous Audio PLC Challenge held at INTERSPEECH 2022. We evaluate models on an overall harder dataset, and use the new ITU-T P.804 evaluation procedure to more closely evaluate the performan…
▽ More
Audio packet loss concealment is the hiding of gaps in VoIP audio streams caused by network packet loss. With the ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge, we build on the success of the previous Audio PLC Challenge held at INTERSPEECH 2022. We evaluate models on an overall harder dataset, and use the new ITU-T P.804 evaluation procedure to more closely evaluate the performance of systems specifically on the PLC task. We evaluate a total of 9 systems, 8 of which satisfy the strict real-time performance requirements of the challenge, using both P.804 and Word Accuracy evaluations.
△ Less
Submitted 26 February, 2024;
originally announced February 2024.
-
ICASSP 2024 Speech Signal Improvement Challenge
Authors:
Nicolae Catalin Ristea,
Ando Saabas,
Ross Cutler,
Babak Naderi,
Sebastian Braun,
Solomiya Branets
Abstract:
The ICASSP 2024 Speech Signal Improvement Grand Challenge is intended to stimulate research in the area of improving the speech signal quality in communication systems. This marks our second challenge, building upon the success from the previous ICASSP 2023 Grand Challenge. We enhance the competition by introducing a dataset synthesizer, enabling all participating teams to start at a higher baseli…
▽ More
The ICASSP 2024 Speech Signal Improvement Grand Challenge is intended to stimulate research in the area of improving the speech signal quality in communication systems. This marks our second challenge, building upon the success from the previous ICASSP 2023 Grand Challenge. We enhance the competition by introducing a dataset synthesizer, enabling all participating teams to start at a higher baseline, an objective metric for our extended P.804 tests, transcripts for the 2023 test set, and we add Word Accuracy (WAcc) as a metric. We evaluate a total of 13 systems in the real-time track and 11 systems in the non-real-time track using both subjective P.804 and objective Word Accuracy metrics.
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
ICASSP 2023 Acoustic Echo Cancellation Challenge
Authors:
Ross Cutler,
Ando Saabas,
Tanel Parnamaa,
Marju Purin,
Evgenii Indenbom,
Nicolae-Catalin Ristea,
Jegor Gužvin,
Hannes Gamper,
Sebastian Braun,
Robert Aichner
Abstract:
The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and is still a top issue in audio communication. This is the fourth AEC challenge and it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic + buffering latency to 20ms, as…
▽ More
The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and is still a top issue in audio communication. This is the fourth AEC challenge and it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic + buffering latency to 20ms, as well as including a full-band version of AECMOS. We open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 10,000 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source an online subjective test framework and provide an objective metric for researchers to quickly test their results. The winners of this challenge were selected based on the average mean opinion score (MOS) achieved across all scenarios and the word accuracy (WAcc) rate.
△ Less
Submitted 21 September, 2023;
originally announced September 2023.
-
DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation
Authors:
Evgenii Indenbom,
Nicolae-Catalin Ristea,
Ando Saabas,
Tanel Parnamaa,
Jegor Guzvin,
Ross Cutler
Abstract:
Acoustic echo cancellation (AEC), noise suppression (NS) and dereverberation (DR) are an integral part of modern full-duplex communication systems. As the demand for teleconferencing systems increases, addressing these tasks is required for an effective and efficient online meeting experience. Most prior research proposes solutions for these tasks separately, combining them with digital signal pro…
▽ More
Acoustic echo cancellation (AEC), noise suppression (NS) and dereverberation (DR) are an integral part of modern full-duplex communication systems. As the demand for teleconferencing systems increases, addressing these tasks is required for an effective and efficient online meeting experience. Most prior research proposes solutions for these tasks separately, combining them with digital signal processing (DSP) based components, resulting in complex pipelines that are often impractical to deploy in real-world applications. This paper proposes a real-time cross-attention deep model, named DeepVQE, based on residual convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to simultaneously address AEC, NS, and DR. We conduct several ablation studies to analyze the contributions of different components of our model to the overall performance. DeepVQE achieves state-of-the-art performance on non-personalized tracks from the ICASSP 2023 Acoustic Echo Cancellation Challenge and ICASSP 2023 Deep Noise Suppression Challenge test sets, showing that a single model can handle multiple tasks with excellent performance. Moreover, the model runs in real-time and has been successfully tested for the Microsoft Teams platform.
△ Less
Submitted 5 June, 2023;
originally announced June 2023.
-
PLCMOS -- a data-driven non-intrusive metric for the evaluation of packet loss concealment algorithms
Authors:
Lorenz Diener,
Marju Purin,
Sten Sootla,
Ando Saabas,
Robert Aichner,
Ross Cutler
Abstract:
Speech quality assessment is a problem for every researcher working on models that produce or process speech. Human subjective ratings, the gold standard in speech quality assessment, are expensive and time-consuming to acquire in a quantity that is sufficient to get reliable data, while automated objective metrics show a low correlation with gold standard ratings. This paper presents PLCMOS, a no…
▽ More
Speech quality assessment is a problem for every researcher working on models that produce or process speech. Human subjective ratings, the gold standard in speech quality assessment, are expensive and time-consuming to acquire in a quantity that is sufficient to get reliable data, while automated objective metrics show a low correlation with gold standard ratings. This paper presents PLCMOS, a non-intrusive data-driven tool for generating a robust, accurate estimate of the mean opinion score a human rater would assign an audio file that has been processed by being transmitted over a degraded packet-switched network with missing packets being healed by a packet loss concealment algorithm. Our new model shows a model-wise Pearson's correlation of ~0.97 and rank correlation of ~0.95 with human ratings, substantially above all other available intrusive and non-intrusive metrics. The model is released as an ONNX model for other researchers to use when building PLC systems.
△ Less
Submitted 24 May, 2023;
originally announced May 2023.
-
Deep model with built-in cross-attention alignment for acoustic echo cancellation
Authors:
Evgenii Indenbom,
Nicolae-Cătălin Ristea,
Ando Saabas,
Tanel Pärnamaa,
Jegor Gužvin
Abstract:
With recent research advances, deep learning models have become an attractive choice for acoustic echo cancellation (AEC) in real-time teleconferencing applications. Since acoustic echo is one of the major sources of poor audio quality, a wide variety of deep models have been proposed. However, an important but often omitted requirement for good echo cancellation quality is the synchronization of…
▽ More
With recent research advances, deep learning models have become an attractive choice for acoustic echo cancellation (AEC) in real-time teleconferencing applications. Since acoustic echo is one of the major sources of poor audio quality, a wide variety of deep models have been proposed. However, an important but often omitted requirement for good echo cancellation quality is the synchronization of the microphone and far end signals. Typically implemented using classical algorithms based on cross-correlation, the alignment module is a separate functional block with known design limitations. In our work we propose a deep learning architecture with built-in self-attention based alignment, which is able to handle unaligned inputs, improving echo cancellation performance while simplifying the communication pipeline. Moreover, we show that our approach achieves significant improvements for difficult delay estimation cases on real recordings from AEC Challenge data set.
△ Less
Submitted 14 March, 2023; v1 submitted 24 August, 2022;
originally announced August 2022.
-
INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
Authors:
Lorenz Diener,
Sten Sootla,
Solomiya Branets,
Ando Saabas,
Robert Aichner,
Ross Cutler
Abstract:
Audio Packet Loss Concealment (PLC) is the hiding of gaps in audio streams caused by data transmission failures in packet switched networks. This is a common problem, and of increasing importance as end-to-end VoIP telephony and teleconference systems become the default and ever more widely used form of communication in business as well as in personal usage. This paper presents the INTERSPEECH 202…
▽ More
Audio Packet Loss Concealment (PLC) is the hiding of gaps in audio streams caused by data transmission failures in packet switched networks. This is a common problem, and of increasing importance as end-to-end VoIP telephony and teleconference systems become the default and ever more widely used form of communication in business as well as in personal usage. This paper presents the INTERSPEECH 2022 Audio Deep Packet Loss Concealment challenge. We first give an overview of the PLC problem, and introduce some classical approaches to PLC as well as recent work. We then present the open source dataset released as part of this challenge as well as the evaluation methods and metrics used to determine the winner. We also briefly introduce PLCMOS, a novel data-driven metric that can be used to quickly evaluate the performance PLC systems. Finally, we present the results of the INTERSPEECH 2022 Audio Deep PLC Challenge, and provide a summary of important takeaways.
△ Less
Submitted 11 April, 2022;
originally announced April 2022.
-
ICASSP 2022 Acoustic Echo Cancellation Challenge
Authors:
Ross Cutler,
Ando Saabas,
Tanel Parnamaa,
Marju Purin,
Hannes Gamper,
Sebastian Braun,
Karsten Sørensen,
Robert Aichner
Abstract:
The ICASSP 2022 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and still a top issue in audio communication. This is the third AEC challenge and it is enhanced by including mobile scenarios, adding speech recognition rate in the challenge goal metrics, and making the default sample rate 48…
▽ More
The ICASSP 2022 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and still a top issue in audio communication. This is the third AEC challenge and it is enhanced by including mobile scenarios, adding speech recognition rate in the challenge goal metrics, and making the default sample rate 48 kHz. In this challenge, we open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 10,000 real audio devices and human speakers in real environments, as well as a synthetic dataset. We also open source an online subjective test framework and provide an online objective metric service for researchers to quickly test their results. The winners of this challenge are selected based on the average Mean Opinion Score achieved across all different single talk and double talk scenarios, and the speech recognition word acceptance rate.
△ Less
Submitted 26 February, 2022;
originally announced February 2022.
-
AECMOS: A speech quality assessment metric for echo impairment
Authors:
Marju Purin,
Sten Sootla,
Mateja Sponza,
Ando Saabas,
Ross Cutler
Abstract:
Traditionally, the quality of acoustic echo cancellers is evaluated using intrusive speech quality assessment measures such as ERLE \cite{g168} and PESQ \cite{p862}, or by carrying out subjective laboratory tests. Unfortunately, the former are not well correlated with human subjective measures, while the latter are time and resource consuming to carry out. We provide a new tool for speech quality…
▽ More
Traditionally, the quality of acoustic echo cancellers is evaluated using intrusive speech quality assessment measures such as ERLE \cite{g168} and PESQ \cite{p862}, or by carrying out subjective laboratory tests. Unfortunately, the former are not well correlated with human subjective measures, while the latter are time and resource consuming to carry out. We provide a new tool for speech quality assessment for echo impairment which can be used to evaluate the performance of acoustic echo cancellers. More precisely, we develop a neural network model to evaluate call quality degradations in two separate categories: echo and degradations from other sources. We show that our model is accurate as measured by correlation with human subjective quality ratings. Our tool can be used effectively to stack rank echo cancellation models. AECMOS is being made publicly available as an Azure service.
△ Less
Submitted 27 January, 2022; v1 submitted 6 October, 2021;
originally announced October 2021.
-
Crowdsourcing approach for subjective evaluation of echo impairment
Authors:
Ross Cutler,
Babak Naderi,
Markus Loide,
Sten Sootla,
Ando Saabas
Abstract:
The quality of acoustic echo cancellers (AECs) in real-time communication systems is typically evaluated using objective metrics like ERLE and PESQ, and less commonly with lab-based subjective tests like ITU-T Rec. P.831. We will show that these objective measures are not well correlated to subjective measures. We then introduce an open-source crowdsourcing approach for subjective evaluation of ec…
▽ More
The quality of acoustic echo cancellers (AECs) in real-time communication systems is typically evaluated using objective metrics like ERLE and PESQ, and less commonly with lab-based subjective tests like ITU-T Rec. P.831. We will show that these objective measures are not well correlated to subjective measures. We then introduce an open-source crowdsourcing approach for subjective evaluation of echo impairment which can be used to evaluate the performance of AECs. We provide a study that shows this tool is accurate and highly reproducible. This new tool has been recently used in the ICASSP 2021 AEC Challenge which made the challenge possible to do quickly and cost effectively.
△ Less
Submitted 27 February, 2022; v1 submitted 25 October, 2020;
originally announced October 2020.
-
ICASSP 2021 Acoustic Echo Cancellation Challenge: Datasets, Testing Framework, and Results
Authors:
Kusha Sridhar,
Ross Cutler,
Ando Saabas,
Tanel Parnamaa,
Markus Loide,
Hannes Gamper,
Sebastian Braun,
Robert Aichner,
Sriram Srinivasan
Abstract:
The ICASSP 2021 Acoustic Echo Cancellation Challenge is intended to stimulate research in the area of acoustic echo cancellation (AEC), which is an important part of speech enhancement and still a top issue in audio communication and conferencing systems. Many recent AEC studies report good performance on synthetic datasets where the train and test samples come from the same underlying distributio…
▽ More
The ICASSP 2021 Acoustic Echo Cancellation Challenge is intended to stimulate research in the area of acoustic echo cancellation (AEC), which is an important part of speech enhancement and still a top issue in audio communication and conferencing systems. Many recent AEC studies report good performance on synthetic datasets where the train and test samples come from the same underlying distribution. However, the AEC performance often degrades significantly on real recordings. Also, most of the conventional objective metrics such as echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ) do not correlate well with subjective speech quality tests in the presence of background noise and reverberation found in realistic environments. In this challenge, we open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 2,500 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source two large test sets, and we open source an online subjective test framework for researchers to quickly test their results. The winners of this challenge will be selected based on the average Mean Opinion Score (MOS) achieved across all different single talk and double talk scenarios.
△ Less
Submitted 30 October, 2020; v1 submitted 10 September, 2020;
originally announced September 2020.