Showing 1–31 of 31 results for author: Mesgarani, N

  1. arXiv:2408.11849

    cs.CL cs.AI eess.AS

    Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

    Authors: Yinghao Aaron Li, Xilin Jiang, Jordan Darefsky, Ge Zhu, Nima Mesgarani

    Abstract: The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to enable end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily due to the extensive dataset and computational resou…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: CoLM 2024

  2. arXiv:2407.20535

    cs.NE cs.SD eess.AS

    DeepSpeech models show Human-like Performance and Processing of Cochlear Implant Inputs

    Authors: Cynthia R. Steinhardt, Menoua Keshishian, Nima Mesgarani, Kim Stachenfeld

    Abstract: Cochlear implants (CIs) are arguably the most successful neural implants, having restored hearing to over one million people worldwide. While CI research has focused on modeling the cochlear activations in response to low-level acoustic features, we hypothesize that the success of these implants is due in large part to the role of the upstream network in extracting useful features from a degraded si…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: NEURIPS preprint

  3. arXiv:2407.09732

    eess.AS cs.LG cs.SD

    Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis

    Authors: Xilin Jiang, Yinghao Aaron Li, Adrian Nicolas Florea, Cong Han, Nima Mesgarani

    Abstract: It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing Mamba with transformers in terms of both performance and efficiency in multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compar…

    Submitted 12 July, 2024; originally announced July 2024.

  4. arXiv:2405.11831

    eess.AS cs.LG

    SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model

    Authors: Siavash Shams, Sukru Samet Dindar, Xilin Jiang, Nima Mesgarani

    Abstract: Transformers have revolutionized deep learning across various tasks, including audio representation learning, due to their powerful modeling capabilities. However, they often suffer from quadratic complexity in both GPU memory usage and computational inference time, affecting their efficiency. Recently, state space models (SSMs) like Mamba have emerged as a promising alternative, offering a more e…

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: Code at https://github.com/SiavashShams/ssamba

  5. arXiv:2403.18257

    eess.AS cs.SD

    Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation

    Authors: Xilin Jiang, Cong Han, Nima Mesgarani

    Abstract: Transformers have been the most successful architecture for various speech modeling tasks, including speech separation. However, the self-attention mechanism in transformers with quadratic complexity is inefficient in computation and memory. Recent models incorporate new layers and modules along with transformers for better performance but also introduce extra model complexity. In this work, we re…

    Submitted 30 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

    Comments: work in progress

  6. arXiv:2402.03710

    eess.AS cs.CL cs.SD

    Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

    Authors: Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani

    Abstract: In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Edit" (LCE), a novel multimodal sound mixture editor that modifies each sound source in a mixture based on user-provided text instructions. LCE distinguishes itself with a user-friendly chat interface and its unique ability to…

    Submitted 6 February, 2024; originally announced February 2024.

    Comments: preprint

  7. arXiv:2309.15938

    eess.AS cs.LG cs.SD

    Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation

    Authors: Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani

    Abstract: In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode 'what' and 'where' of spatial audios. MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audios, thereby enhancing both event classification and sound localization in downstream tasks. At its core, we propose a multi-level data augmentation pipeline that augment…

    Submitted 27 September, 2023; originally announced September 2023.

  8. arXiv:2309.09493

    eess.AS cs.AI cs.SD

    HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

    Authors: Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani

    Abstract: Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In th…

    Submitted 18 September, 2023; originally announced September 2023.

  9. arXiv:2307.09435

    eess.AS cs.AI cs.SD

    SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs

    Authors: Yinghao Aaron Li, Cong Han, Nima Mesgarani

    Abstract: In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement. These applications typically involve mapping text or speech inputs to pre-trained SLM representations, from which target speech is decoded. This paper introduc…

    Submitted 18 July, 2023; originally announced July 2023.

    Comments: WASPAA 2023

  10. arXiv:2306.07691

    eess.AS cs.AI cs.CL cs.LG cs.SD

    StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

    Authors: Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

    Abstract: In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, a…

    Submitted 19 November, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023

  11. DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes

    Authors: Xilin Jiang, Yinghao Aaron Li, Nima Mesgarani

    Abstract: Lifelong audio feature extraction involves learning new sound classes incrementally, which is essential for adapting to new data distributions over time. However, optimizing the model only on new data can lead to catastrophic forgetting of previously learned tasks, which undermines the model's ability to perform well over the long term. This paper introduces a new approach to continual audio repre…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: INTERSPEECH 2023

    Journal ref: Proc. INTERSPEECH 2023, pp.2818--2822

  12. arXiv:2303.07458

    eess.AS cs.SD

    Online Binaural Speech Separation of Moving Speakers With a Wavesplit Network

    Authors: Cong Han, Nima Mesgarani

    Abstract: Binaural speech separation in real-world scenarios often involves moving speakers. Most current speech separation methods use utterance-level permutation invariant training (u-PIT) for training. At inference time, however, the order of outputs can be inconsistent over time, particularly in long-form speech separation. This situation, which is referred to as the speaker swap problem, is even more prob…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: To appear in ICASSP 2023

  13. arXiv:2302.05756

    eess.AS cs.SD eess.SP

    Improved Decoding of Attentional Selection in Multi-Talker Environments with Self-Supervised Learned Speech Representation

    Authors: Cong Han, Vishal Choudhari, Yinghao Aaron Li, Nima Mesgarani

    Abstract: Auditory attention decoding (AAD) is a technique used to identify and amplify the talker that a listener is focused on in a noisy environment. This is done by comparing the listener's brainwaves to a representation of all the sound sources to find the closest match. The representation is typically the waveform or spectrogram of the sounds. The effectiveness of these representations for AAD is unce…

    Submitted 11 February, 2023; originally announced February 2023.

  14. arXiv:2301.08810

    cs.CL cs.SD eess.AS

    Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

    Authors: Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani

    Abstract: Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sub-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we pro…

    Submitted 20 January, 2023; originally announced January 2023.

  15. arXiv:2212.14227

    eess.AS cs.SD

    StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

    Authors: Yinghao Aaron Li, Cong Han, Nima Mesgarani

    Abstract: One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity and speech content, a task that still remains challenging. Here, we propose a novel approach to learning disentangled speech representation by transfer learning f…

    Submitted 29 December, 2022; originally announced December 2022.

    Comments: SLT 2022

  16. arXiv:2205.15439

    eess.AS cs.CL cs.LG cs.SD

    StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

    Authors: Yinghao Aaron Li, Cong Han, Nima Mesgarani

    Abstract: Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have problems finding the best monotonic alignmen…

    Submitted 19 November, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

  17. arXiv:2107.10394

    cs.SD cs.LG eess.AS

    StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

    Authors: Yinghao Aaron Li, Ali Zare, Nima Mesgarani

    Abstract: We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained only with 20 English speakers, it generalizes to a variety of voice conversion tasks, suc…

    Submitted 22 July, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

    Comments: INTERSPEECH 2021

  18. arXiv:2102.04056

    cs.SD eess.AS

    Speaker and Direction Inferred Dual-channel Speech Separation

    Authors: Chenxing Li, Jiaming Xu, Nima Mesgarani, Bo Xu

    Abstract: Most speech separation methods, trying to separate all channel sources simultaneously, are still far from having enough generalization capabilities for real scenarios where the number of input sounds is usually uncertain and even dynamic. In this work, we employ ideas from auditory attention with two ears and propose a speaker and direction inferred speech separation network (dubbed SDNet) to so…

    Submitted 8 February, 2021; originally announced February 2021.

    Comments: Accepted by ICASSP 2021

  19. arXiv:2012.09727

    eess.AS cs.SD eess.SP

    Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording

    Authors: Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen

    Abstract: Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years. Recent research includes extracting target speech by using the target speaker's voice snippet and jointly separating all participating speakers by using a pool of additional speaker signals, which is known as speech separation using speaker inventory (SSUSI). However, all th…

    Submitted 18 December, 2020; v1 submitted 17 December, 2020; originally announced December 2020.

  20. arXiv:2012.07291

    eess.AS cs.AI cs.LG cs.SD

    Group Communication with Context Codec for Lightweight Source Separation

    Authors: Yi Luo, Cong Han, Nima Mesgarani

    Abstract: Despite the recent progress on neural network architectures for speech separation, the balance between the model size, model complexity and model performance is still an important and challenging problem for the deployment of such models to low-resource platforms. In this paper, we propose two simple modules, group communication and context codec, that can be easily applied to a wide range of arch…

    Submitted 16 May, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

  21. arXiv:2011.08401

    eess.AS cs.SD

    Implicit Filter-and-sum Network for Multi-channel Speech Separation

    Authors: Yi Luo, Nima Mesgarani

    Abstract: Various neural network architectures have been proposed in recent years for the task of multi-channel speech separation. Among them, the filter-and-sum network (FaSNet) performs end-to-end time-domain filter-and-sum beamforming and has been shown to be effective in both ad-hoc and fixed microphone array geometries. In this paper, we investigate multiple ways to improve the performance of FaSNet. From the pro…

    Submitted 16 November, 2020; originally announced November 2020.

  22. arXiv:2011.08400

    eess.AS cs.SD

    Rethinking the Separation Layers in Speech Separation Networks

    Authors: Yi Luo, Zhuo Chen, Cong Han, Chenda Li, Tianyan Zhou, Nima Mesgarani

    Abstract: Modules in all existing speech separation networks can be categorized into single-input-multi-output (SIMO) modules and single-input-single-output (SISO) modules. SIMO modules generate more outputs than inputs, and SISO modules keep the numbers of inputs and outputs the same. While the majority of separation models only contain SIMO architectures, it has also been shown that certain two-stage separat…

    Submitted 16 November, 2020; originally announced November 2020.

  23. arXiv:2011.08397

    eess.AS cs.SD

    Ultra-Lightweight Speech Separation via Group Communication

    Authors: Yi Luo, Cong Han, Nima Mesgarani

    Abstract: Model size and complexity remain the biggest challenges in the deployment of speech enhancement and separation systems on low-resource devices such as earphones and hearing aids. Although methods such as compression, distillation and quantization can be applied to large models, they often come with a cost on the model performance. In this paper, we provide a simple model design paradigm that expli…

    Submitted 20 November, 2020; v1 submitted 16 November, 2020; originally announced November 2020.

  24. arXiv:2011.07338

    eess.AS

    Distortion-controlled Training for End-to-end Reverberant Speech Separation with Auxiliary Autoencoding Loss

    Authors: Yi Luo, Cong Han, Nima Mesgarani

    Abstract: The performance of speech enhancement and separation systems in anechoic environments has been significantly advanced with the recent progress in end-to-end neural network architectures. However, the performance of such systems in reverberant environments is yet to be explored. A core problem in reverberant speech separation concerns the training and evaluation metrics. Standard time-domain metric…

    Submitted 14 November, 2020; originally announced November 2020.

    Comments: SLT 2021

  25. arXiv:2003.12326

    eess.AS cs.LG cs.SD

    Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss

    Authors: Yi Luo, Nima Mesgarani

    Abstract: Many recent source separation systems are designed to separate a fixed number of sources out of a mixture. In the cases where the source activation patterns are unknown, such systems have to either adjust the number of outputs or identify invalid outputs from the valid ones. Iterative separation methods have gained much attention in the community as they can flexibly decide the number of outputs,…

    Submitted 18 August, 2020; v1 submitted 27 March, 2020; originally announced March 2020.

    Comments: Interspeech 2020

  26. arXiv:2002.06637

    eess.AS cs.SD

    Real-time binaural speech separation with preserved spatial cues

    Authors: Cong Han, Yi Luo, Nima Mesgarani

    Abstract: Deep learning speech separation algorithms have achieved great success in improving the quality and intelligibility of separated speech from mixed audio. Most previous methods focused on generating a single-channel output for each of the target speakers, hence discarding the spatial cues needed for the localization of sound sources in space. However, preserving the spatial information is important…

    Submitted 16 February, 2020; originally announced February 2020.

    Comments: To appear in ICASSP 2020

  27. arXiv:1910.14104

    eess.AS cs.LG cs.SD

    End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation

    Authors: Yi Luo, Zhuo Chen, Nima Mesgarani, Takuya Yoshioka

    Abstract: An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of the microphones with the same locations, while the latter requires the system to be able to process inputs with varying dimensions. Conventional optimization-based…

    Submitted 27 March, 2020; v1 submitted 30 October, 2019; originally announced October 2019.

    Comments: ICASSP 2020

  28. arXiv:1909.13387

    eess.AS cs.LG cs.SD eess.SP

    FaSNet: Low-latency Adaptive Beamforming for Multi-microphone Audio Processing

    Authors: Yi Luo, Enea Ceolini, Cong Han, Shih-Chii Liu, Nima Mesgarani

    Abstract: Beamforming has been extensively investigated for multi-channel audio processing tasks. Recently, learning-based beamforming methods, sometimes called neural beamformers, have achieved significant improvements in both signal quality (e.g. signal-to-noise ratio (SNR)) and speech recognition (e.g. word error rate (WER)). Such systems are generally non-causal and require a large context for…

    Submitted 30 September, 2019; v1 submitted 29 September, 2019; originally announced September 2019.

    Comments: Accepted to ASRU 2019

  29. arXiv:1809.07454

    cs.SD cs.LG eess.AS

    Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

    Authors: Yi Luo, Nima Mesgarani

    Abstract: Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and m…

    Submitted 15 May, 2019; v1 submitted 19 September, 2018; originally announced September 2018.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. This version is the authors' version and may vary from the final publication in details

  30. arXiv:1711.00541

    cs.SD cs.LG cs.MM cs.NE eess.AS

    TasNet: time-domain audio separation network for real-time, single-channel speech separation

    Authors: Yi Luo, Nima Mesgarani

    Abstract: Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging, particularly in real-time, short-latency applications. Most methods attempt to construct a mask for each source in a time-frequency representation of the mixture signal, which is not necessarily…

    Submitted 17 April, 2018; v1 submitted 1 November, 2017; originally announced November 2017.

    Comments: Camera ready version for ICASSP 2018, Calgary, Canada

  31. arXiv:1710.09798

    cs.CV eess.AS eess.IV

    Lip2AudSpec: Speech reconstruction from silent lip movements video

    Authors: Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani

    Abstract: In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use the auditory spectrogram as the spectral representation of speech, together with its corresponding sound generation method, resulting in more natural-sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram w…

    Submitted 26 October, 2017; originally announced October 2017.