Search | arXiv e-print repository

doi 10.1109/WASPAA58266.2023.10248081

Complete and separate: Conditional separation with missing target source attribute completion

Authors: Dimitrios Bralios, Efthymios Tzinis, Paris Smaragdis

Abstract: Recent approaches in source separation leverage semantic information about their input mixtures and constituent sources that when used in conditional separation models can achieve impressive performance. Most approaches along these lines have focused on simple descriptions, which are not always useful for varying types of input mixtures. In this work, we present an approach in which a model, given an input mixture and partial semantic information about a target source, is trained to extract additional semantic data. We then leverage this pre-trained model to improve the separation performance of an uncoupled multi-conditional separation network. Our experiments demonstrate that the separation performance of this multi-conditional model is significantly improved, approaching the performance of an oracle model with complete semantic information. Furthermore, our approach achieves performance levels that are comparable to those of the best performing specialized single conditional models, thus providing an easier to use alternative.

Submitted 26 July, 2023; originally announced July 2023.

Comments: Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023

Journal ref: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

arXiv:2307.03533

doi 10.21437/CHiME.2023-2

The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement

Authors: Simon Leglaive, Léonie Borne, Efthymios Tzinis, Mostafa Sadeghi, Matthieu Fraticelli, Scott Wisdom, Manuel Pariente, Daniel Pressnitzer, John R. Hershey

Abstract: Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the test domain significantly differs from the synthetic training domain. This paper introduces the unsupervised domain adaptation for conversational speech enhancement (UDASE) task of the 7th CHiME challenge. This task aims to leverage real-world noisy speech recordings from the target domain for unsupervised domain adaptation of speech enhancement models. The target domain corresponds to the multi-speaker reverberant conversational speech recordings of the CHiME-5 dataset, for which the ground-truth clean speech reference is unavailable. Given a CHiME-5 recording, the task is to estimate the clean, potentially multi-speaker, reverberant speech, removing the additive background noise. We discuss the motivation for the CHiME-7 UDASE task and describe the data, the task, and the baseline system.

Submitted 2 October, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

Journal ref: The 7th International Workshop on Speech Processing in Everyday Environments (CHiME), Dublin, Ireland, 2023

arXiv:2211.11917

doi 10.1109/ICASSP49357.2023.10096897

Latent Iterative Refinement for Modular Source Separation

Authors: Dimitrios Bralios, Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux

Abstract: Traditional source separation approaches train deep neural network models end-to-end with all the data available at once by minimizing the empirical risk on the whole training set. On the inference side, after training the model, the user fetches a static computation graph and runs the full model on some specified observed mixture signal to get the estimated source signals. Additionally, many of those models consist of several basic processing blocks which are applied sequentially. We argue that we can significantly increase resource efficiency during both training and inference stages by reformulating a model's training and inference procedures as iterative mappings of latent signal representations. First, we can apply the same processing block more than once on its output to refine the input signal and consequently improve parameter efficiency. During training, we can follow a block-wise procedure which enables a reduction on memory requirements. Thus, one can train a very complicated network structure using significantly less computation compared to end-to-end training. During inference, we can dynamically adjust how many processing blocks and iterations of a specific block an input signal needs using a gating module.

Submitted 15 October, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2211.05927

Optimal Condition Training for Target Source Separation

Authors: Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux

Abstract: Recent research has shown remarkable performance in leveraging multiple extraneous conditional and non-mutually exclusive semantic concepts for sound source separation, allowing the flexibility to extract a given target source based on multiple different queries. In this work, we propose a new optimal condition training (OCT) method for single-channel target source separation, based on greedy parameter updates using the highest performing condition among equivalent conditions associated with a given target source. Our experiments show that the complementary information carried by the diverse semantic concepts significantly helps to disentangle and isolate sources of interest much more efficiently compared to single-conditioned models. Moreover, we propose a variation of OCT with condition refinement, in which an initial conditional vector is adapted to the given mixture and transformed to a more amenable representation for target source extraction. We showcase the effectiveness of OCT on diverse source separation experiments where it improves upon permutation invariant models with oracle assignment and obtains state-of-the-art performance in the more challenging task of text-based source separation, outperforming even dedicated text-only conditioned models.

Submitted 10 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023
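
Illustrative note on the OCT abstract above: the greedy selection step ("the highest performing condition among equivalent conditions") might look roughly like the sketch below. This is not the authors' implementation. The toy separator, the condition names, and the use of mean squared error as the loss are all assumptions made for illustration; a real system would backpropagate only through the selected loss.

```python
from typing import Callable, Dict, Sequence, Tuple


def mse(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, standing in for a separation loss (assumption)."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def select_optimal_condition(
    separate: Callable[[Sequence[float], str], Sequence[float]],
    mixture: Sequence[float],
    target: Sequence[float],
    conditions: Sequence[str],
) -> Tuple[str, float, Dict[str, float]]:
    """Evaluate every equivalent condition and keep the best one.

    In a greedy update, only the loss of the returned condition would
    drive the parameter step for this training example.
    """
    losses = {c: mse(separate(mixture, c), target) for c in conditions}
    best = min(losses, key=losses.get)
    return best, losses[best], losses


def toy_separate(mixture: Sequence[float], condition: str) -> Sequence[float]:
    """Hypothetical separator: each condition just applies a fixed gain."""
    gains = {"loudest": 0.5, "female": 0.9, "english": 0.7}
    return [gains[condition] * x for x in mixture]


mixture = [1.0, -2.0, 3.0]
target = [0.5, -1.0, 1.5]
best, best_loss, all_losses = select_optimal_condition(
    toy_separate, mixture, target, ["loudest", "female", "english"]
)
print(best, best_loss)
```

In this toy setup all three conditions describe the same target source, and the one whose estimate is closest to the target is chosen.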

arXiv:2207.10141

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

Authors: Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey

Abstract: We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify several limitations of previous work on audio-visual on-screen sound separation, including the coarse resolution of spatio-temporal attention, poor convergence of the audio separation model, limited variety in training and evaluation data, and failure to account for the trade off between preservation of on-screen sounds and suppression of off-screen sounds. We provide solutions to all of these issues. Our proposed cross-modal and self-attention network architectures capture audio-visual dependencies at a finer resolution over time, and we also propose efficient separable variants that are capable of scaling to longer videos without sacrificing much performance. We also find that pre-training the separation model only on audio greatly improves results. For training and evaluation, we collected new human annotations of onscreen sounds from a large database of in-the-wild videos (YFCC100M). This new dataset is more diverse and challenging. Finally, we propose a calibration procedure that allows exact tuning of on-screen reconstruction versus off-screen suppression, which greatly simplifies comparing performance between models with different operating points. Overall, our experimental results show marked improvements in on-screen separation performance under much more general conditions than previous methods with minimal additional computational complexity.

Submitted 20 July, 2022; originally announced July 2022.

Comments: ECCV 2022

arXiv:2205.07390

doi 10.1109/LSP.2022.3229643

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

Authors: Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis

Abstract: In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small amount of the labels is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach obtains similar performance compared to several distillation-based continual learning methods when employed on self-supervised representation learning methods.

Submitted 13 December, 2022; v1 submitted 15 May, 2022; originally announced May 2022.

Comments: Accepted to IEEE Signal Processing Letters

arXiv:2204.03594

doi 10.21437/Interspeech.2022-46

Heterogeneous Target Speech Separation

Authors: Efthymios Tzinis, Gordon Wichern, Aswin Subramanian, Paris Smaragdis, Jonathan Le Roux

Abstract: We introduce a new paradigm for single-channel target source separation where the sources of interest can be distinguished using non-mutually exclusive concepts (e.g., loudness, gender, language, spatial location, etc). Our proposed heterogeneous separation framework can seamlessly leverage datasets with large distribution shifts and learn cross-domain representations under a variety of concepts used as conditioning. Our experiments show that training separation models with heterogeneous conditions facilitates the generalization to new concepts with unseen out-of-domain data while also performing substantially higher than single-domain specialist models. Notably, such training leads to more robust learning of new harder source separation discriminative concepts and can yield improvements over permutation invariant training with oracle source selection. We analyze the intrinsic behavior of source separation training with heterogeneous metadata and propose ways to alleviate emerging problems with challenging separation conditions. We release the collection of preparation recipes for all datasets used to further promote research towards this challenging task.

Submitted 7 April, 2022; originally announced April 2022.

Comments: Submitted to Interspeech 2022

Journal ref: Interspeech 2022

arXiv:2202.08862

doi 10.1109/JSTSP.2022.3200911

RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing

Authors: Efthymios Tzinis, Yossi Adi, Vamsi Krishna Ithapu, Buye Xu, Paris Smaragdis, Anurag Kumar

Abstract: We present RemixIT, a simple yet effective self-supervised method for training speech enhancement without the need of a single isolated in-domain speech nor a noise waveform. Our approach overcomes limitations of previous methods which make them dependent on clean in-domain target signals and thus, sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous self-training scheme in which a pre-trained teacher model on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures. Then, by permuting the estimated clean and noise signals and remixing them together, we generate a new set of bootstrapped mixtures and corresponding pseudo-targets which are used to train the student network. Vice-versa, the teacher periodically refines its estimates using the updated parameters of the latest student models. Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of our method over prior approaches but also showcase that RemixIT can be combined with any separation model as well as be applied towards any semi-supervised and unsupervised domain adaptation task. Our analysis, paired with empirical evidence, sheds light on the inside functioning of our self-training scheme wherein the student model keeps obtaining better performance while observing severely degraded pseudo-targets.

Submitted 3 August, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

Comments: To appear in IEEE Journal of Selected Topics in Signal Processing

Journal ref: J-STSP-SLSAP-00040-2022

arXiv:2110.10103

doi 10.1109/ICASSP43922.2022.9747463

Continual self-training with bootstrapped remixing for speech enhancement

Authors: Efthymios Tzinis, Yossi Adi, Vamsi K. Ithapu, Buye Xu, Anurag Kumar

Abstract: We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The proposed method is based on a continuously self-training scheme that overcomes limitations from previous studies including assumptions for the in-domain noise distribution and having access to clean target signals. Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and is used to infer estimated target signals for a batch of in-domain mixtures. Next, we bootstrap the mixing process by generating artificial mixtures using permuted estimated clean and noise signals. Finally, the student model is trained using the permuted estimated sources as targets while we periodically update teacher's weights using the latest student model. Our experiments show that RemixIT outperforms several previous state-of-the-art self-supervised methods under multiple speech enhancement tasks. Additionally, RemixIT provides a seamless alternative for semi-supervised and unsupervised domain adaptation for speech enhancement tasks, while being general enough to be applied to any separation task and paired with any separation model.

Submitted 29 January, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: To appear in Proc. ICASSP 2022, May 22-27, 2022, Singapore

Journal ref: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
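
Illustrative note on the RemixIT abstract above: the bootstrapped remixing step (teacher estimates, permuted noise, new mixtures) could be sketched as follows. This is a minimal sketch under stated assumptions, not the published code. The `toy_teacher` function is a hypothetical stand-in for a pre-trained model, and signals are plain Python lists rather than tensors.

```python
import random
from typing import Callable, List, Tuple

Signal = List[float]


def bootstrapped_remix(
    teacher: Callable[[Signal], Tuple[Signal, Signal]],
    mixtures: List[Signal],
    rng: random.Random,
) -> Tuple[List[Signal], List[Signal], List[Signal]]:
    """Build student training pairs from teacher estimates.

    Returns (remixed mixtures, pseudo-target speech, pseudo-target noise).
    """
    # 1. The teacher infers a speech and a noise estimate per mixture.
    estimates = [teacher(m) for m in mixtures]
    speech = [s for s, _ in estimates]
    noise = [n for _, n in estimates]

    # 2. Permute the noise estimates across the batch.
    order = list(range(len(noise)))
    rng.shuffle(order)
    permuted_noise = [noise[i] for i in order]

    # 3. Remix: each new mixture is a speech estimate plus a permuted noise
    #    estimate; the student is trained to recover both components.
    remixed = [
        [s + n for s, n in zip(sp, nz)]
        for sp, nz in zip(speech, permuted_noise)
    ]
    return remixed, speech, permuted_noise


def toy_teacher(mixture: Signal) -> Tuple[Signal, Signal]:
    """Hypothetical teacher: splits every sample 75% speech, 25% noise."""
    return [0.75 * x for x in mixture], [0.25 * x for x in mixture]


batch = [[1.0, 2.0], [4.0, -4.0], [8.0, 0.0]]
remixed, pseudo_speech, pseudo_noise = bootstrapped_remix(
    toy_teacher, batch, random.Random(0)
)
```

The periodic teacher update mentioned in the abstract would sit outside this function, for example by copying the latest student weights into the teacher every few epochs.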

arXiv:2106.09669

Improving On-Screen Sound Separation for Open-Domain Videos with Audio-Visual Self-Attention

Authors: Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey

Abstract: We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous work on audio-visual on-screen sound separation, including the simplicity and coarse resolution of spatio-temporal attention, and poor convergence of the audio separation model. Our proposed model addresses these issues using cross-modal and self-attention modules that capture audio-visual dependencies at a finer resolution over time, and by unsupervised pre-training of audio separation model. These improvements allow the model to generalize to a much wider set of unseen videos. We also show a robust way to further improve the generalization capability of our models by calibrating the probabilities of our audio-visual on-screen classifier, using only a small amount of in-domain videos labeled for their on-screen presence. For evaluation and semi-supervised training, we collected human annotations of on-screen audio from a large database of in-the-wild videos (YFCC100m). Our results show marked improvements in on-screen separation performance, in more general conditions than previous methods.

Submitted 14 October, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

arXiv:2105.07596

Sound Event Detection with Adaptive Frequency Selection

Authors: Zhepei Wang, Jonah Casebeer, Adam Clemmitt, Efthymios Tzinis, Paris Smaragdis

Abstract: In this work, we present HIDACT, a novel network architecture for adaptive computation for efficiently recognizing acoustic events. We evaluate the model on a sound event detection task where we train it to adaptively process frequency bands. The model learns to adapt to the input without requesting all frequency sub-bands provided. It can make confident predictions within fewer processing steps, hence reducing the amount of computation. Experimental results show that HIDACT has comparable performance to baseline models with more parameters and higher computational complexity. Furthermore, the model can adjust the amount of computation based on the data and computational budget.

Submitted 29 July, 2021; v1 submitted 16 May, 2021; originally announced May 2021.

Comments: Accepted by IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2021

arXiv:2105.04727

doi 10.1109/WASPAA52581.2021.9632783

Separate but Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data

Authors: Efthymios Tzinis, Jonah Casebeer, Zhepei Wang, Paris Smaragdis

Abstract: We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients. We simulate a real-world scenario where each client only has access to a few noisy recordings from a limited and disjoint number of speakers (hence non-IID). Each client trains their model in isolation using mixture invariant training while periodically providing updates to a central server. Our experiments show that our approach achieves competitive enhancement performance compared to IID training on a single device and that we can further facilitate the convergence speed and the overall performance using transfer learning on the server-side. Moreover, we show that we can effectively combine updates from clients trained locally with supervised and unsupervised losses. We also release a new dataset LibriFSD50K and its creation recipe in order to facilitate FL research for source separation problems.

Submitted 26 September, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

Comments: Accepted to WASPAA 21

Journal ref: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
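
Illustrative note on the FEDENHANCE abstract above: the abstract says clients periodically send updates to a central server but does not state the aggregation rule. The sketch below assumes a plain data-size-weighted average of client parameters (FedAvg-style), which is a common choice and may differ from what the paper does. The parameter name `"w"` and the client sizes are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def federated_average(
    client_weights: List[Weights], client_sizes: List[int]
) -> Weights:
    """Average client parameters, weighting each client by its data size.

    Assumption: every client holds the same parameter names and shapes.
    """
    total = sum(client_sizes)
    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes))
            / total
            for i in range(length)
        ]
    return averaged


# Two hypothetical clients; the second holds three times as many recordings.
clients = [{"w": [0.0, 2.0]}, {"w": [4.0, 6.0]}]
sizes = [1, 3]
server_weights = federated_average(clients, sizes)
print(server_weights)
```

After aggregation, the server would broadcast `server_weights` back to the clients for the next round of local training.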

arXiv:2104.07072

doi 10.21437/Interspeech.2019-2769

Unsupervised low-rank representations for speech emotion recognition

Authors: Georgios Paraskevopoulos, Efthymios Tzinis, Nikolaos Ellinas, Theodoros Giannakopoulos, Alexandros Potamianos

Abstract: We examine the use of linear and non-linear dimensionality reduction algorithms for extracting low-rank feature representations for speech emotion recognition. Two feature sets are used, one based on low-level descriptors and their aggregations (IS10) and one modeling recurrence dynamics of speech (RQA), as well as their fusion. We report speech emotion recognition (SER) results for learned representations on two databases using different classification methods. Classification with low-dimensional representations yields performance improvement in a variety of settings. This indicates that dimensionality reduction is an effective way to combat the curse of dimensionality for SER. Visualization of features in two dimensions provides insight into discriminatory abilities of reduced feature sets.

Submitted 14 April, 2021; originally announced April 2021.

Comments: Published at Interspeech 2019 https://www.isca-speech.org/archive/Interspeech_2019/abstracts/2769.html

arXiv:2103.02644

doi 10.1007/s11265-021-01683-x

Compute and memory efficient universal sound source separation

Authors: Efthymios Tzinis, Zhepei Wang, Xilin Jiang, Paris Smaragdis

Abstract: Recent progress in audio source separation led by deep learning has enabled many neural network models to provide robust solutions to this fundamental estimation problem. In this study, we provide a family of efficient neural network architectures for general purpose audio source separation while focusing on multiple computational aspects that hinder the application of neural networks in real-world scenarios. The backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF) as well as their aggregation which is performed through simple one-dimensional convolutions. This mechanism enables our models to obtain high fidelity signal separation in a wide variety of settings where variable number of sources are present and with limited computational resources (e.g. floating point operations, memory footprint, number of parameters and latency). Our experiments show that SuDoRM-RF models perform comparably and even surpass several state-of-the-art benchmarks with significantly higher computational resource requirements. The causal variation of SuDoRM-RF is able to obtain competitive performance in real-time speech separation of around 10dB scale-invariant signal-to-distortion ratio improvement (SI-SDRi) while remaining up to 20 times faster than real-time on a laptop device.

Submitted 14 July, 2021; v1 submitted 3 March, 2021; originally announced March 2021.

Comments: Accepted to Journal of Signal Processing Systems https://www.springer.com/journal/11265. arXiv admin note: substantial text overlap with arXiv:2007.06833
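
Illustrative note on the metric named in the abstract above: SI-SDRi is the scale-invariant signal-to-distortion ratio of the estimate minus that of the unprocessed mixture, both measured against the clean target. The sketch below uses the standard SI-SDR definition (projection of the estimate onto the target); the example signals and the small `eps` stabilizer are assumptions for illustration.

```python
import math
from typing import Sequence


def si_sdr(
    estimate: Sequence[float], target: Sequence[float], eps: float = 1e-12
) -> float:
    """Scale-invariant SDR in dB between an estimate and a target signal."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Optimal scaling of the target that best explains the estimate.
    alpha = dot / (target_energy + eps)
    projection = [alpha * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    residual_power = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_power + eps) / (residual_power + eps))


def si_sdr_improvement(
    estimate: Sequence[float],
    mixture: Sequence[float],
    target: Sequence[float],
) -> float:
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


target = [1.0, 0.0, -1.0, 0.0]
mixture = [1.5, 0.5, -0.5, 0.5]    # target plus a constant offset
estimate = [1.1, 0.1, -0.9, 0.1]   # same offset, five times smaller
improvement = si_sdr_improvement(estimate, mixture, target)
print(improvement)
```

Because the target is rescaled before the residual is computed, multiplying the estimate by any non-zero constant leaves the score unchanged.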

arXiv:2011.01143

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Authors: Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey

Abstract: Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of mixtures (MoMs) into individual sources, where noisy labels for mixtures are provided by an unsupervised audio-visual coincidence model. Using the noisy labels, along with attention between video and audio features, AudioScope learns to identify audio-visual similarity and to suppress off-screen sounds. We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.

Submitted 29 May, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: ICLR 2021, 27 pages
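
Illustrative note on the MixIT objective mentioned in the AudioScope abstract above: a model separates a mixture of two mixtures into several sources, and the loss is taken over the best assignment of each estimated source to one of the two original mixtures. The sketch below is a brute-force version for short signals. Mean squared error is used here as a stand-in loss, which is an assumption, and the example signals are made up.

```python
from itertools import product
from typing import List, Sequence, Tuple


def mse(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, standing in for the training loss (assumption)."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def mixit_loss(
    mix1: Sequence[float],
    mix2: Sequence[float],
    est_sources: List[Sequence[float]],
) -> Tuple[float, Tuple[int, ...]]:
    """Search all assignments of estimated sources to the two mixtures.

    Each source goes to exactly one mixture (0 or 1); the assignment whose
    remixed sums best match the original mixtures defines the loss.
    """
    best_loss = float("inf")
    best_assignment: Tuple[int, ...] = ()
    for assignment in product((0, 1), repeat=len(est_sources)):
        remix = [[0.0] * len(mix1), [0.0] * len(mix2)]
        for source, which in zip(est_sources, assignment):
            for t, value in enumerate(source):
                remix[which][t] += value
        loss = mse(remix[0], mix1) + mse(remix[1], mix2)
        if loss < best_loss:
            best_loss, best_assignment = loss, assignment
    return best_loss, best_assignment


mix1 = [1.0, 2.0]
mix2 = [3.0, 4.0]
# Three hypothetical model outputs for the mixture of mixtures mix1 + mix2.
sources = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
loss, assignment = mixit_loss(mix1, mix2, sources)
print(loss, assignment)
```

The search is exponential in the number of output sources, which is acceptable for the handful of outputs typical of such models.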

arXiv:2010.13228

doi 10.1109/ICASSP39728.2021.9414071

Unified Gradient Reweighting for Model Biasing with Applications to Source Separation

Authors: Efthymios Tzinis, Dimitrios Bralios, Paris Smaragdis

Abstract: Recent deep learning approaches have shown great improvement in audio source separation tasks. However, the vast majority of such work is focused on improving average separation performance, often neglecting to examine or control the distribution of the results. In this paper, we propose a simple, unified gradient reweighting scheme, with a lightweight modification to bias the learning process of a model and steer it towards a certain distribution of results. More specifically, we reweight the gradient updates of each batch, using a user-specified probability distribution. We apply this method to various source separation tasks, in order to shift the operating point of the models towards different objectives. We demonstrate different parameterizations of our unified reweighting scheme can be used towards addressing several real-world problems, such as unreliable separation estimates. Our framework enables the user to control a robustness trade-off between worst and average performance. Moreover, we experimentally show that our unified reweighting scheme can also be used in order to shift the focus of the model towards being more accurate for user-specified sound classes or even towards easier examples in order to enable faster convergence.

Submitted 25 October, 2020; originally announced October 2020.

Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
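
Illustrative note on the gradient reweighting abstract above: one way to reweight a batch with a probability distribution is to weight each example's loss before summing. The abstract does not say which distribution is used, so the softmax over per-example losses below, controlled by a hypothetical parameter `alpha`, is purely an assumption. In a real training loop the weights would be treated as constants when taking gradients.

```python
import math
from typing import List, Sequence


def softmax_weights(losses: Sequence[float], alpha: float) -> List[float]:
    """Probability distribution over batch examples.

    alpha = 0 gives uniform weights (ordinary averaging); alpha > 0 puts
    more mass on high-loss examples; alpha < 0 favors easy examples.
    """
    scaled = [alpha * loss for loss in losses]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_loss(losses: Sequence[float], alpha: float) -> float:
    """Batch loss with each example scaled by its assigned probability."""
    weights = softmax_weights(losses, alpha)
    return sum(w * loss for w, loss in zip(weights, losses))


batch_losses = [1.0, 2.0, 3.0]
average_focus = reweighted_loss(batch_losses, alpha=0.0)
worst_case_focus = reweighted_loss(batch_losses, alpha=2.0)
easy_focus = reweighted_loss(batch_losses, alpha=-2.0)
print(average_focus, worst_case_focus, easy_focus)
```

Varying `alpha` moves the objective between average performance, worst-case performance, and a focus on easier examples, which mirrors the trade-offs listed in the abstract.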

arXiv:2007.06833 [pdf, other]

doi 10.1109/MLSP49062.2020.9231900

Sudo rm -rf: Efficient Networks for Universal Audio Source Separation

Authors: Efthymios Tzinis, Zhepei Wang, Paris Smaragdis

Abstract: In this paper, we present an efficient neural network for end-to-end general purpose audio source separation. Specifically, the backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRMRF) as well as their aggregation which is performed through simple one-dimensional convolutions. In this way, we are able to obtain high quality audio source separation with a limited number of floating point operations, memory requirements, number of parameters and latency. Our experiments on both speech and environmental sound separation datasets show that SuDoRMRF performs comparably to, and even surpasses, various state-of-the-art approaches with significantly higher computational resource requirements.

Submitted 14 July, 2020; originally announced July 2020.

Comments: accepted to MLSP 2020

Journal ref: Published in 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP)

arXiv:2006.12701 [pdf, other]

Unsupervised Sound Separation Using Mixture Invariant Training

Authors: Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, John R. Hershey

Abstract: In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the acoustic conditions and distribution of sources. The acoustic properties can be challenging to accurately simulate, and the distribution of sound types may be hard to replicate. In this paper, we propose a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures. In MixIT, training examples are constructed by mixing together existing mixtures, and the model separates them into a variable number of latent sources, such that the separated sources can be remixed to approximate the original mixtures. We show that MixIT can achieve competitive performance compared to supervised methods on speech separation. Using MixIT in a semi-supervised learning setting enables unsupervised domain adaptation and learning from large amounts of real world data without ground-truth source waveforms. In particular, we significantly improve reverberant speech separation performance by incorporating reverberant mixtures, train a speech enhancement system from noisy mixtures, and improve universal sound separation by incorporating a large amount of in-the-wild data.

Submitted 23 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: Accepted for spotlight presentation at NeurIPS 2020

arXiv:2005.04132 [pdf, other]

Asteroid: the PyTorch-based audio source separation toolkit for researchers

Authors: Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge, Emmanuel Vincent

Abstract: This paper describes Asteroid, the PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Asteroid and its most important features. By showing experimental results obtained with Asteroid's recipes, we show that our implementations are at least on par with most results reported in reference papers. The toolkit is publicly available at https://github.com/mpariente/asteroid.

Submitted 8 May, 2020; originally announced May 2020.

Comments: Submitted to Interspeech 2020

arXiv:1911.07951 [pdf, other]

doi 10.1109/ICASSP40776.2020.9053921

Improving Universal Sound Separation Using Sound Classification

Authors: Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, Daniel P. W. Ellis

Abstract: Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification. Most audio source separation approaches focus only on separating sources belonging to a restricted domain of source classes, such as speech and music. However, recent work has demonstrated the possibility of "universal sound separation", which aims to separate acoustic sources from an open domain, regardless of their class. In this paper, we utilize the semantic information learned by sound classifier networks trained on a vast amount of diverse sounds to improve universal sound separation. In particular, we show that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information. This approach is especially useful in an iterative setup, where source estimates from an initial separation stage and their corresponding classifier-derived embeddings are fed to a second separation network. By performing a thorough hyperparameter search consisting of over a thousand experiments, we find that classifier embeddings from clean sources provide nearly one dB of SNR gain, and our best iterative models achieve a significant fraction of this oracle performance, establishing a new state-of-the-art for universal sound separation.

Submitted 18 November, 2019; originally announced November 2019.

Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:1911.00102 [pdf, other]

End-to-end Non-Negative Autoencoders for Sound Source Separation

Authors: Shrikant Venkataramani, Efthymios Tzinis, Paris Smaragdis

Abstract: Discriminative models for source separation have recently been shown to produce impressive results. However, when operating on sources outside of the training set, these models cannot perform as well and are cumbersome to update. Classical methods like Non-negative Matrix Factorization (NMF) provide modular approaches to source separation that can be easily updated to adapt to new mixture scenarios. In this paper, we generalize NMF to develop end-to-end non-negative auto-encoders and demonstrate how they can be used for source separation. Our experiments indicate that these models deliver comparable separation performance to discriminative approaches, while retaining the modularity of NMF and the modeling flexibility of neural networks.

Submitted 31 October, 2019; originally announced November 2019.

arXiv:1910.09804 [pdf, other]

doi 10.1109/ICASSP40776.2020.9054172

Two-Step Sound Source Separation: Training on Learned Latent Targets

Authors: Efthymios Tzinis, Shrikant Venkataramani, Zhepei Wang, Cem Subakan, Paris Smaragdis

Abstract: In this paper, we propose a two-step training procedure for source separation via a deep neural network. In the first step we learn a transform (and its inverse) to a latent space where masking-based separation performance using oracles is optimal. For the second step, we train a separation module that operates on the previously learned space. In order to do so, we also make use of a scale-invariant signal to distortion ratio (SI-SDR) loss function that works in the latent space, and we prove that it lower-bounds the SI-SDR in the time domain. We run various sound separation experiments that show how this approach can obtain better performance as compared to systems that learn the transform and the separation module jointly. The proposed methodology is general enough to be applicable to a large class of neural network end-to-end separation systems.

Submitted 23 October, 2019; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: Submitted to ICASSP 2020

Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:1906.00654 [pdf, other]

Continual Learning of New Sound Classes using Generative Replay

Authors: Zhepei Wang, Cem Subakan, Efthymios Tzinis, Paris Smaragdis, Laurent Charlin

Abstract: Continual learning consists in incrementally training a model on a sequence of datasets and testing on the union of all datasets. In this paper, we examine continual learning for the problem of sound classification, in which we wish to refine already trained models to learn new sound classes. In practice one does not want to maintain all past training data and retrain from scratch, but naively updating a model with new data(sets) results in a degradation of already learned tasks, which is referred to as "catastrophic forgetting." We develop a generative replay procedure for generating training audio spectrogram data, in place of keeping older training datasets. We show that by incrementally refining a classifier with generative replay, a generator that is 4% of the size of all previous training data matches the performance of refining the classifier keeping 20% of all previous training data. We thus conclude that we can extend a trained sound classifier to learn new classes without having to keep previously used datasets.

Submitted 3 June, 2019; originally announced June 2019.

arXiv:1905.00151 [pdf, other]

A Style Transfer Approach to Source Separation

Authors: Shrikant Venkataramani, Efthymios Tzinis, Paris Smaragdis

Abstract: Training neural networks for source separation involves presenting a mixture recording at the input of the network and updating network parameters in order to produce an output that resembles the clean source. Consequently, supervised source separation depends on the availability of paired mixture-clean training examples. In this paper, we interpret source separation as a style transfer problem. We present a variational auto-encoder network that exploits the commonality across the domain of mixtures and the domain of clean sounds and learns a shared latent representation across the two domains. Using these cycle-consistent variational auto-encoders, we learn a mapping from the mixture domain to the domain of clean sounds and perform source separation without explicitly supervising with paired training examples.

Submitted 9 May, 2019; v1 submitted 30 April, 2019; originally announced May 2019.

arXiv:1902.01482 [pdf, other]

Bootstrapped Coordinate Search for Multidimensional Scaling

Authors: Efthymios Tzinis

Abstract: In this work, a unified framework for gradient-free Multidimensional Scaling (MDS) based on Coordinate Search (CS) is proposed. This family of algorithms is an instance of General Pattern Search (GPS) methods, which avoid the explicit computation of derivatives and instead evaluate the objective function while searching along coordinate steps of the embedding space. The backbone element of the CSMDS framework is the probability matrix that specifies how likely each coordinate is to be evaluated. We propose a Bootstrapped instance of CSMDS (BS CSMDS) which increases the probability of the direction that most decreases the objective function while reducing the probability of all the other coordinates. BS CSMDS avoids unnecessary function evaluations and yields a significant speedup over other CSMDS alternatives while obtaining the same error rate. Experiments on both synthetic and real data reveal that BS CSMDS performs consistently better than other CSMDS alternatives under various experimental setups.

Submitted 4 February, 2019; originally announced February 2019.

arXiv:1811.04133 [pdf, other]

doi 10.21437/Interspeech.2018-1377

Integrating Recurrence Dynamics for Speech Emotion Recognition

Authors: Efthymios Tzinis, Georgios Paraskevopoulos, Christos Baziotis, Alexandros Potamianos

Abstract: We investigate the performance of features that can capture nonlinear recurrence dynamics embedded in the speech signal for the task of Speech Emotion Recognition (SER). Reconstruction of the phase space of each speech frame and the computation of its respective Recurrence Plot (RP) reveals complex structures which can be measured by performing Recurrence Quantification Analysis (RQA). These measures are aggregated by using statistical functionals over segment and utterance periods. We report SER results for the proposed feature set on three databases using different classification methods. When fusing the proposed features with traditional feature sets, we show an improvement in unweighted accuracy of up to 5.7% and 10.7% on Speaker-Dependent (SD) and Speaker-Independent (SI) SER tasks, respectively, over the baseline. Following a segment-based approach we demonstrate state-of-the-art performance on IEMOCAP using a Bidirectional Recurrent Neural Network.

Submitted 9 November, 2018; originally announced November 2018.

Journal ref: Proc. Interspeech 2018, pp. 927-931

arXiv:1811.01531 [pdf, other]

doi 10.1109/ICASSP.2019.8683201

Unsupervised Deep Clustering for Source Separation: Direct Learning from Mixtures using Spatial Information

Authors: Efthymios Tzinis, Shrikant Venkataramani, Paris Smaragdis

Abstract: We present a monophonic source separation system that is trained by only observing mixtures with no ground truth separation information. We use a deep clustering approach which trains on multi-channel mixtures and learns to project spectrogram bins to source clusters that correlate with various spatial features. We show that using such a training process we can obtain separation performance that is as good as making use of ground truth separation information. Once trained, this system is capable of performing sound separation on monophonic inputs, despite having learned how to do so using multi-channel recordings.

Submitted 9 November, 2018; v1 submitted 5 November, 2018; originally announced November 2018.

Comments: Submitted to ICASSP 2019 (v1: November 5th 2018)

Journal ref: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:1806.00416 [pdf, other]

Pattern Search Multidimensional Scaling

Authors: Georgios Paraskevopoulos, Efthymios Tzinis, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Alexandros Potamianos

Abstract: We present a novel view of nonlinear manifold learning using derivative-free optimization techniques. Specifically, we propose an extension of the classical multi-dimensional scaling (MDS) method, where instead of performing gradient descent, we sample and evaluate possible "moves" in a sphere of fixed radius for each point in the embedded space. A fixed-point convergence guarantee can be shown by formulating the proposed algorithm as an instance of the General Pattern Search (GPS) framework. Evaluation on both clean and noisy synthetic datasets shows that pattern search MDS can accurately infer the intrinsic geometry of manifolds embedded in high-dimensional spaces. Additionally, experiments on real data, even under noisy conditions, demonstrate that the proposed pattern search MDS yields state-of-the-art results.

Submitted 30 October, 2019; v1 submitted 1 June, 2018; originally announced June 2018.

Comments: 36 pages, Under review for JMLR

Showing 1–28 of 28 results for author: Tzinis, E