Showing 1–38 of 38 results for author: Wichern, G

  1. arXiv:2406.04212

    eess.AS cs.SD

    Sound Event Bounding Boxes

    Authors: Janek Ebbers, Francois G. Germain, Gordon Wichern, Jonathan Le Roux

    Abstract: Sound event detection is the task of recognizing sounds and determining their extent (onset/offset times) within an audio clip. Existing systems commonly predict sound presence confidence in short time frames. Then, thresholding produces binary frame-level presence decisions, with the extent of individual events determined by merging consecutive positive frames. In this paper, we show that frame-l…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted for publication at Interspeech 2024

  2. arXiv:2404.02252

    cs.SD eess.AS

    SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers

    Authors: Junghyun Koo, Gordon Wichern, Francois G. Germain, Sameer Khurana, Jonathan Le Roux

    Abstract: We introduce Self-Monitored Inference-Time INtervention (SMITIN), an approach for controlling an autoregressive generative music transformer using classifier probes. These simple logistic regression probes are trained on the output of each attention head in the transformer using a small dataset of audio examples both exhibiting and missing a specific musical trait (e.g., the presence/absence of dr…

    Submitted 2 April, 2024; originally announced April 2024.

  3. arXiv:2402.17907

    eess.AS cs.SD

    NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization

    Authors: Yoshiki Masuyama, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: Head-related transfer functions (HRTFs) are important for immersive audio, and their spatial interpolation has been studied to upsample finite measurements. Recently, neural fields (NFs) which map from sound source direction to HRTF have gained attention. Existing NF-based methods focused on estimating the magnitude of the HRTF from a given sound source direction, and the magnitude is converted to…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  4. arXiv:2312.07513

    eess.AS cs.SD

    NeuroHeed+: Improving Neuro-steered Speaker Extraction with Joint Auditory Attention Detection

    Authors: Zexu Pan, Gordon Wichern, Francois G. Germain, Sameer Khurana, Jonathan Le Roux

    Abstract: Neuro-steered speaker extraction aims to extract the listener's brain-attended speech signal from a multi-talker speech signal, in which the attention is derived from the cortical activity. This activity is usually recorded using electroencephalography (EEG) devices. Though promising, current methods often have a high speaker confusion error, where the interfering speaker is extracted instead of t…

    Submitted 12 December, 2023; originally announced December 2023.

  5. arXiv:2310.19644

    eess.AS cs.MM

    Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

    Authors: Zexu Pan, Gordon Wichern, Yoshiki Masuyama, Francois G. Germain, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers. Building upon the achievements of the state-of-the-art (SOTA) time-frequency speaker separation model TF-GridNet, we propose AV-GridNet, a visual-grounded variant that incorporates the face recording of a target speaker a…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted by ASRU 2023

  6. arXiv:2310.10604

    eess.AS cs.SD

    Generation or Replication: Auscultating Audio Latent Diffusion Models

    Authors: Dimitrios Bralios, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: The introduction of audio latent diffusion models possessing the ability to generate realistic sound clips on demand from a text description has the potential to revolutionize how we work with audio. In this work, we make an initial attempt at understanding the inner workings of audio latent diffusion models by investigating how their audio outputs compare with the training data, similar to how a…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  7. arXiv:2309.17352

    cs.SD eess.AS

    Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

    Authors: Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, Shinji Watanabe

    Abstract: Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this w…

    Submitted 9 January, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ICASSP 2024 camera-ready paper. Winner of the DCASE 2023 Challenge Task 6A: Automated Audio Captioning (AAC)

  8. arXiv:2308.06981

    eess.AS cs.SD

    The Sound Demixing Challenge 2023 – Cinematic Demixing Track

    Authors: Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

    Abstract: This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most succes…

    Submitted 18 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted for Transactions of the International Society for Music Information Retrieval

  9. arXiv:2304.02160

    cs.SD cs.LG eess.AS

    Pac-HuBERT: Self-Supervised Music Source Separation via Primitive Auditory Clustering and Hidden-Unit BERT

    Authors: Ke Chen, Gordon Wichern, François G. Germain, Jonathan Le Roux

    Abstract: In spite of the progress in music source separation research, the small amount of publicly-available clean source data remains a constant limiting factor for performance. Thus, recent advances in self-supervised learning present a largely-unexplored opportunity for improving separation models by leveraging unlabelled music data. In this paper, we propose a self-supervised learning framework for mu…

    Submitted 4 April, 2023; originally announced April 2023.

    Comments: 5 pages, 2 figures, 3 tables

  10. arXiv:2303.03849

    eess.AS cs.SD

    TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

    Authors: Christoph Boeddeker, Aswin Shanmugam Subramanian, Gordon Wichern, Reinhold Haeb-Umbach, Jonathan Le Roux

    Abstract: Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that pro…

    Submitted 1 January, 2024; v1 submitted 7 March, 2023; originally announced March 2023.

    Comments: Submitted to IEEE/ACM TASLP

  11. arXiv:2212.07327

    eess.AS cs.SD

    Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

    Authors: Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Zhong-Qiu Wang, Jonathan Le Roux

    Abstract: Emulating the human ability to solve the cocktail party problem, i.e., focus on a source of interest in a complex acoustic scene, is a long standing goal of audio source separation research. Much of this research investigates separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. In this paper, we focus on the cocktail fork problem,…

    Submitted 14 December, 2022; originally announced December 2022.

    Comments: Submitted to IEEE TASLP (In review), 13 pages, 6 figures

  12. arXiv:2212.05008

    eess.AS cs.SD

    Hyperbolic Audio Source Separation

    Authors: Darius Petermann, Gordon Wichern, Aswin Subramanian, Jonathan Le Roux

    Abstract: We introduce a framework for audio source separation using embeddings on a hyperbolic manifold that compactly represent the hierarchical relationship between sound sources and time-frequency features. Inspired by recent successes modeling hierarchical relationships in text and images with hyperbolic embeddings, our algorithm obtains a hyperbolic embedding for each time-frequency bin of a mixture s…

    Submitted 9 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023, Demo page: https://darius522.github.io/hyperbolic-audio-sep/

  13. Latent Iterative Refinement for Modular Source Separation

    Authors: Dimitrios Bralios, Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux

    Abstract: Traditional source separation approaches train deep neural network models end-to-end with all the data available at once by minimizing the empirical risk on the whole training set. On the inference side, after training the model, the user fetches a static computation graph and runs the full model on some specified observed mixture signal to get the estimated source signals. Additionally, many of t…

    Submitted 15 October, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  14. arXiv:2211.08303

    eess.AS cs.AI cs.LG cs.SD stat.ML

    Reverberation as Supervision for Speech Separation

    Authors: Rohith Aralikatti, Christoph Boeddeker, Gordon Wichern, Aswin Shanmugam Subramanian, Jonathan Le Roux

    Abstract: This paper proposes reverberation as supervision (RAS), a novel unsupervised loss function for single-channel reverberant speech separation. Prior methods for unsupervised separation required the synthesis of mixtures of mixtures or assumed the existence of a teacher model, making them difficult to consider as potential methods explaining the emergence of separation abilities in an animal's audito…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: 5 pages, 2 figures, 4 tables. Submitted to ICASSP 2023

  15. arXiv:2211.07768

    cs.LG eess.SY math.OC

    Meta-Learning of Neural State-Space Models Using Data From Similar Systems

    Authors: Ankush Chakrabarty, Gordon Wichern, Christopher R. Laughman

    Abstract: Deep neural state-space models (SSMs) provide a powerful tool for modeling dynamical systems solely using operational data. Typically, neural SSMs are trained using data collected from the actual system under consideration, despite the likely existence of operational data from similar systems which have previously been deployed in the field. In this paper, we propose the use of model-agnostic meta…

    Submitted 14 November, 2022; originally announced November 2022.

    Comments: Submitted for conference publication

  16. arXiv:2211.05927

    cs.SD cs.LG eess.AS

    Optimal Condition Training for Target Source Separation

    Authors: Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux

    Abstract: Recent research has shown remarkable performance in leveraging multiple extraneous conditional and non-mutually exclusive semantic concepts for sound source separation, allowing the flexibility to extract a given target source based on multiple different queries. In this work, we propose a new optimal condition training (OCT) method for single-channel target source separation, based on greedy para…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  17. arXiv:2211.02527

    eess.AS cs.SD

    Cold Diffusion for Speech Enhancement

    Authors: Hao Yen, François G. Germain, Gordon Wichern, Jonathan Le Roux

    Abstract: Diffusion models have recently shown promising results for difficult enhancement tasks such as the conditional and unconditional restoration of natural images and audio signals. In this work, we explore the possibility of leveraging a recently proposed advanced iterative diffusion model, namely cold diffusion, to recover clean speech signals from noisy signals. The unique mathematical properties o…

    Submitted 23 May, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: 5 pages, 1 figure, 1 table, 3 algorithms. To appear in ICASSP 2023. With corrected references

  18. arXiv:2211.01299

    eess.AS cs.CL cs.SD

    Late Audio-Visual Fusion for In-The-Wild Speaker Diarization

    Authors: Zexu Pan, Gordon Wichern, François G. Germain, Aswin Subramanian, Jonathan Le Roux

    Abstract: Speaker diarization is well studied for constrained audios but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system…

    Submitted 27 September, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

  19. arXiv:2204.09911

    cs.SD eess.AS

    STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency

    Authors: Zhong-Qiu Wang, Gordon Wichern, Shinji Watanabe, Jonathan Le Roux

    Abstract: Deep learning based speech enhancement in the short-time Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A larger window can lead to higher frequency resolution and potentially better enhancement. This however incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) is also performed using the…

    Submitted 5 December, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

    Comments: in IEEE/ACM Transactions on Audio, Speech, and Language Processing

  20. Heterogeneous Target Speech Separation

    Authors: Efthymios Tzinis, Gordon Wichern, Aswin Subramanian, Paris Smaragdis, Jonathan Le Roux

    Abstract: We introduce a new paradigm for single-channel target source separation where the sources of interest can be distinguished using non-mutually exclusive concepts (e.g., loudness, gender, language, spatial location, etc). Our proposed heterogeneous separation framework can seamlessly leverage datasets with large distribution shifts and learn cross-domain representations under a variety of concepts u…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022

    Journal ref: Interspeech 2022

  21. arXiv:2203.04197

    eess.AS cs.AI cs.LG cs.SD

    Locate This, Not That: Class-Conditioned Sound Event DOA Estimation

    Authors: Olga Slizovskaia, Gordon Wichern, Zhong-Qiu Wang, Jonathan Le Roux

    Abstract: Existing systems for sound event localization and detection (SELD) typically operate by estimating a source location for all classes at every time instant. In this paper, we propose an alternative class-conditioned SELD model for situations where we may not be interested in localizing all classes all of the time. This class-conditioned SELD model takes as input the spatial and spectral features fr…

    Submitted 8 March, 2022; originally announced March 2022.

    Comments: Accepted for publication at ICASSP 2022

  22. arXiv:2110.09958

    eess.AS cs.SD

    The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

    Authors: Darius Petermann, Gordon Wichern, Zhong-Qiu Wang, Jonathan Le Roux

    Abstract: The cocktail party problem aims at isolating any source of interest within a complex acoustic scene, and has long inspired audio source separation research. Recent efforts have mainly focused on separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. However, separating an audio mixture (e.g., movie soundtrack) into the three broad ca…

    Submitted 23 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: Accepted to ICASSP 2022. For resources and examples, see https://cocktail-fork.github.io

  23. arXiv:2110.00570

    cs.SD eess.AS

    Leveraging Low-Distortion Target Estimates for Improved Speech Enhancement

    Authors: Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux

    Abstract: A promising approach for multi-microphone speech separation involves two deep neural networks (DNN), where the predicted target speech from the first DNN is used to compute signal statistics for time-invariant minimum variance distortionless response (MVDR) beamforming, and the MVDR result is then used as extra features for the second DNN to predict target speech. Previous studies suggested that t…

    Submitted 1 October, 2021; originally announced October 2021.

    Comments: in submission

  24. arXiv:2108.07376

    cs.SD eess.AS

    Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation

    Authors: Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux

    Abstract: A promising approach for speech dereverberation is based on supervised learning, where a deep neural network (DNN) is trained to predict the direct sound from noisy-reverberant speech. This data-driven approach is based on leveraging prior knowledge of clean speech patterns and seldom explicitly exploits the linear-filter structure in reverberation, i.e., that reverberation results from a linear c…

    Submitted 10 November, 2021; v1 submitted 16 August, 2021; originally announced August 2021.

    Comments: in IEEE/ACM Transactions on Audio, Speech, and Language Processing

  25. arXiv:2108.07194

    cs.SD eess.AS

    Convolutive Prediction for Reverberant Speech Separation

    Authors: Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux

    Abstract: We investigate the effectiveness of convolutive prediction, a novel formulation of linear prediction for speech dereverberation, for speaker separation in reverberant conditions. The key idea is to first use a deep neural network (DNN) to estimate the direct-path signal of each speaker, and then identify delayed and decayed copies of the estimated direct-path signal. Such copies are likely due to…

    Submitted 16 August, 2021; originally announced August 2021.

    Comments: in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021

  26. On The Compensation Between Magnitude and Phase in Speech Separation

    Authors: Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux

    Abstract: Deep neural network (DNN) based end-to-end optimization in the complex time-frequency (T-F) domain or time domain has shown considerable potential in monaural speech separation. Many recent studies optimize loss functions defined solely in the time or complex domain, without including a loss on magnitude. Although such loss functions typically produce better scores if the evaluation metrics are ob…

    Submitted 27 September, 2021; v1 submitted 11 August, 2021; originally announced August 2021.

    Comments: in IEEE Signal Processing Letters

  27. arXiv:2106.15502

    cs.LG cs.AI math.OC

    Attentive Neural Processes and Batch Bayesian Optimization for Scalable Calibration of Physics-Informed Digital Twins

    Authors: Ankush Chakrabarty, Gordon Wichern, Christopher Laughman

    Abstract: Physics-informed dynamical system models form critical components of digital twins of the built environment. These digital twins enable the design of energy-efficient infrastructure, but must be properly calibrated to accurately reflect system behavior for downstream prediction and analysis. Dynamical system models of modern buildings are typically described by a large number of parameters and inc…

    Submitted 29 June, 2021; originally announced June 2021.

    Comments: 12 pages, accepted to ICML 2021 Workshop on Tackling Climate Change with Machine Learning

  28. arXiv:2010.11904

    cs.SD cs.LG eess.AS

    Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision

    Authors: Yun-Ning Hung, Gordon Wichern, Jonathan Le Roux

    Abstract: Most music source separation systems require large collections of isolated sources for training, which can be difficult to obtain. In this work, we use musical scores, which are comparatively easy to obtain, as a weak label for training a source separation system. In contrast with previous score-informed separation approaches, our system does not require isolated sources, and score is used only as…

    Submitted 22 October, 2020; originally announced October 2020.

  29. arXiv:2007.14469

    eess.AS cs.LG cs.SD stat.ML

    AutoClip: Adaptive Gradient Clipping for Source Separation Networks

    Authors: Prem Seetharaman, Gordon Wichern, Bryan Pardo, Jonathan Le Roux

    Abstract: Clipping the gradient is a known approach to improving gradient descent, but requires hand selection of a clipping threshold hyperparameter. We present AutoClip, a simple method for automatically and adaptively choosing a gradient clipping threshold, based on the history of gradient norms observed during training. Experimental results show that applying AutoClip results in improved generalization…

    Submitted 25 July, 2020; originally announced July 2020.

    Comments: Accepted at 2020 IEEE International Workshop on Machine Learning for Signal Processing, Sept. 21–24, 2020, Espoo, Finland

  30. arXiv:1911.02182

    cs.SD cs.LG eess.AS stat.ML

    Finding Strength in Weakness: Learning to Separate Sounds with Weak Supervision

    Authors: Fatemeh Pishdadian, Gordon Wichern, Jonathan Le Roux

    Abstract: While there has been much recent progress using deep learning techniques to separate speech and music audio signals, these systems typically require large collections of isolated sources during the training process. When extending audio source separation algorithms to more general domains such as environmental monitoring, it may not be possible to obtain isolated signals for training. Here, we pro…

    Submitted 28 August, 2020; v1 submitted 5 November, 2019; originally announced November 2019.

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing vol 28 (2020) 2386-2399

  31. arXiv:1910.11133

    cs.SD cs.LG eess.AS stat.ML

    Bootstrapping deep music separation from primitive auditory grouping principles

    Authors: Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo

    Abstract: Separating an audio scene such as a cocktail party into constituent, meaningful components is a core task in computer audition. Deep networks are the state-of-the-art approach. They are trained on synthetic mixtures of audio made from isolated sound source recordings so that ground truth for the separation is known. However, the vast majority of available audio is not isolated. The brain uses prim…

    Submitted 23 October, 2019; originally announced October 2019.

  32. arXiv:1910.10279

    cs.SD eess.AS

    WHAMR!: Noisy and Reverberant Single-Channel Speech Separation

    Authors: Matthew Maciejewski, Gordon Wichern, Emmett McQuinn, Jonathan Le Roux

    Abstract: While significant advances have been made with respect to the separation of overlapping speech signals, studies have been largely constrained to mixtures of clean, near anechoic speech, not representative of many real-world scenarios. Although the WHAM! dataset introduced noise to the ubiquitous wsj0-2mix dataset, it did not include reverberation, which is generally present in indoor recordings ou…

    Submitted 14 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted for publication at ICASSP 2020

  33. arXiv:1909.08494

    cs.SD cs.LG eess.AS

    Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity

    Authors: Ethan Manilow, Gordon Wichern, Prem Seetharaman, Jonathan Le Roux

    Abstract: Music source separation performance has greatly improved in recent years with the advent of approaches based on deep learning. Such methods typically require large amounts of labelled training data, which in the case of music consist of mixtures and corresponding instrument stems. However, stems are unavailable for most commercial music, and only limited datasets have so far been released to the p…

    Submitted 18 September, 2019; originally announced September 2019.

    Comments: Accepted for publication at WASPAA 2019

  34. arXiv:1907.01160

    cs.SD cs.CL cs.LG eess.AS stat.ML

    WHAM!: Extending Speech Separation to Noisy Environments

    Authors: Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, Jonathan Le Roux

    Abstract: Recent progress in separating the speech signals from multiple overlapping speakers using a single audio channel has brought us closer to solving the cocktail party problem. However, most studies in this area use a constrained problem setup, comparing performance when speakers overlap almost completely, at artificially low sampling rates, and with no external background noise. In this paper, we st…

    Submitted 2 July, 2019; originally announced July 2019.

    Comments: Accepted for publication at Interspeech 2019

  35. arXiv:1811.03076

    cs.SD cs.LG eess.AS stat.ML

    Class-conditional embeddings for music source separation

    Authors: Prem Seetharaman, Gordon Wichern, Shrikant Venkataramani, Jonathan Le Roux

    Abstract: Isolating individual instruments in a musical mixture has a myriad of potential applications, and seems imminently achievable given the levels of performance reached by recent deep learning methods. While most musical source separation techniques learn an independent model for each instrument, we propose using a common embedding space for the time-frequency bins of all instruments in a mixture ins…

    Submitted 7 November, 2018; originally announced November 2018.

    Comments: 5 pages

  36. arXiv:1811.02130

    cs.SD cs.AI cs.LG eess.AS stat.ML

    Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures

    Authors: Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo

    Abstract: Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. Source separation systems based on deep learning are currently the most successful approaches for solving the underdetermined separation problem, where there are more sources than channels. Traditionally, such systems are trained on sound mixtures…

    Submitted 5 November, 2018; originally announced November 2018.

    Comments: 5 pages, 2 figures

  37. arXiv:1810.01395

    cs.SD cs.CL cs.LG eess.AS stat.ML

    Phasebook and Friends: Leveraging Discrete Representations for Source Separation

    Authors: Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, John R. Hershey

    Abstract: Deep learning based speech enhancement and source separation systems have recently reached unprecedented levels of quality, to the point that performance is reaching a new ceiling. Most systems rely on estimating the magnitude of a target source by estimating a real-valued mask to be applied to a time-frequency representation of the mixture signal. A limiting factor in such approaches is a lack of…

    Submitted 7 March, 2019; v1 submitted 2 October, 2018; originally announced October 2018.

  38. arXiv:1806.08409

    cs.CL cs.CV cs.SD eess.AS

    End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

    Authors: Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh

    Abstract: Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog dat…

    Submitted 29 June, 2018; v1 submitted 21 June, 2018; originally announced June 2018.

    Comments: A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC7