Showing 1–25 of 25 results for author: Pardo, B

Searching in archive cs.
  1. arXiv:2407.05471

    eess.AS cs.SD

    Fine-Grained and Interpretable Neural Speech Editing

    Authors: Max Morrison, Cameron Churchwell, Nathan Pruyne, Bryan Pardo

    Abstract: Fine-grained editing of speech attributes, such as prosody (i.e., the pitch, loudness, and phoneme durations), pronunciation, speaker identity, and formants, is useful for fine-tuning and fixing imperfections in human and AI-generated speech recordings for creation of podcasts, film dialogue, and video game dialogue. Existing speech synthesis systems use representatio…

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: Interspeech 2024

  2. arXiv:2402.17735

    eess.AS cs.SD

    High-Fidelity Neural Phonetic Posteriorgrams

    Authors: Cameron Churchwell, Max Morrison, Bryan Pardo

    Abstract: A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent con…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024 Workshop on Explainable Machine Learning for Speech and Audio

  3. arXiv:2401.14542

    cs.SD cs.AI eess.AS

    Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model

    Authors: Julia Barnett, Hugo Flores Garcia, Bryan Pardo

    Abstract: Every artist has a creative process that draws inspiration from previous artists and their works. Today, "inspiration" has been automated by generative music models. The black box nature of these models obscures the identity of the works that influence their creative output. As a result, users may inadvertently appropriate, misuse, or copy existing artists' works. We establish a replicable methodo…

    Submitted 25 January, 2024; originally announced January 2024.

    Comments: 14 pages + references. Under conference review

  4. arXiv:2310.08464

    eess.AS cs.SD

    Crowdsourced and Automatic Speech Prominence Estimation

    Authors: Max Morrison, Pranav Pawar, Nathan Pruyne, Jennifer Cole, Bryan Pardo

    Abstract: The prominence of a spoken word is the degree to which an average native listener perceives the word as salient or emphasized relative to its context. Speech prominence estimation is the process of assigning a numeric value to the prominence of each word in an utterance. These prominence labels are useful for linguistic analysis, as well as training automated systems to perform emphasis-controlled…

    Submitted 22 December, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: Published as a conference paper at ICASSP 2024

  5. arXiv:2307.04686

    cs.SD cs.AI eess.AS

    VampNet: Music Generation via Masked Acoustic Token Modeling

    Authors: Hugo Flores Garcia, Prem Seetharaman, Rithesh Kumar, Bryan Pardo

    Abstract: We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation. We use a variable masking schedule during training which allows us to sample coherent music from the model by applying a variety of masking approaches (called prompts) during inference. VampNet is non-autoregressive, leveraging a bidirectional transformer architecture that at…

    Submitted 12 July, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

  6. arXiv:2301.12258

    eess.AS cs.SD

    Cross-domain Neural Pitch and Periodicity Estimation

    Authors: Max Morrison, Caedon Hsieh, Nathan Pruyne, Bryan Pardo

    Abstract: Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of widely-used neural pitch and periodicity estimators to achieve sta…

    Submitted 11 August, 2024; v1 submitted 28 January, 2023; originally announced January 2023.

  7. arXiv:2208.12387

    cs.SD cs.LG eess.AS

    Music Separation Enhancement with Generative Modeling

    Authors: Noah Schaffer, Boaz Cogan, Ethan Manilow, Max Morrison, Prem Seetharaman, Bryan Pardo

    Abstract: Despite phenomenal progress in recent years, state-of-the-art music separation systems produce source estimates with significant perceptual shortcomings, such as adding extraneous noise or removing harmonics. We propose a post-processing model (the Make it Sound Good (MSG) post-processor) to enhance the output of music source separation systems. We apply our post-processing model to state-of-the-a…

    Submitted 25 August, 2022; originally announced August 2022.

    Comments: Accepted to ISMIR 2022

  8. arXiv:2203.15140

    cs.SD eess.AS

    Improving Source Separation by Explicitly Modeling Dependencies Between Sources

    Authors: Ethan Manilow, Curtis Hawthorne, Cheng-Zhi Anna Huang, Bryan Pardo, Jesse Engel

    Abstract: We propose a new method for training a supervised source separation system that aims to learn the interdependent relationships between all combinations of sources in a mixture. Rather than independently estimating each source from a mix, we reframe the source separation problem as an Orderless Neural Autoregressive Density Estimator (NADE), and estimate each source from both the mix and a random s…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: To appear at ICASSP 2022

  9. arXiv:2203.04444

    cs.HC cs.LG

    Reproducible Subjective Evaluation

    Authors: Max Morrison, Brian Tang, Gefei Tan, Bryan Pardo

    Abstract: Human perceptual studies are the gold standard for the evaluation of many research tasks in machine learning, linguistics, and psychology. However, these studies require significant time and cost to perform. As a result, many researchers use objective measures that can correlate poorly with human evaluation. When subjective evaluations are performed, they are often not reported with sufficient det…

    Submitted 8 March, 2022; originally announced March 2022.

    Comments: Submitted to ICLR 2022 Workshop on Setting up ML Evaluation Standards to Accelerate Progress

  10. arXiv:2110.13323

    cs.SD cs.LG eess.AS

    Deep Learning Tools for Audacity: Helping Researchers Expand the Artist's Toolkit

    Authors: Hugo Flores Garcia, Aldo Aguilar, Ethan Manilow, Dmitry Vedenko, Bryan Pardo

    Abstract: We present a software framework that integrates neural networks into the popular open-source audio editing software, Audacity, with a minimal amount of developer effort. In this paper, we showcase some example use cases for both end-users and neural network developers. We hope that this work fosters a new level of interactivity between deep learning practitioners and end-users.

    Submitted 28 October, 2021; v1 submitted 25 October, 2021; originally announced October 2021.

  11. arXiv:2110.13071

    cs.SD cs.LG eess.AS

    Unsupervised Source Separation By Steering Pretrained Music Models

    Authors: Ethan Manilow, Patrick O'Reilly, Prem Seetharaman, Bryan Pardo

    Abstract: We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation, without any retraining. An audio generation model is conditioned on an input mixture, producing a latent encoding of the audio used to generate audio. This generated audio is fed to a pretrained music tagger that creates source labels. The cross-entropy loss be…

    Submitted 25 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  12. arXiv:2110.02360

    eess.AS cs.SD

    Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet

    Authors: Max Morrison, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo Caceres, Bryan Pardo

    Abstract: Modifying the pitch and timing of an audio signal are fundamental audio editing operations with applications in speech manipulation, audio-visual synchronization, and singing voice editing and synthesis. Thus far, methods for pitch-shifting and time-stretching that use digital signal processing (DSP) have been favored over deep learning approaches due to their speed and relatively higher quality.…

    Submitted 5 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  13. arXiv:2107.07029

    cs.SD cs.LG eess.AS

    Leveraging Hierarchical Structures for Few-Shot Musical Instrument Recognition

    Authors: Hugo Flores Garcia, Aldo Aguilar, Ethan Manilow, Bryan Pardo

    Abstract: Deep learning work on musical instrument recognition has generally focused on instrument classes for which we have abundant data. In this work, we exploit hierarchical relationships between instruments in a few-shot learning setup to enable classification of a wider set of musical instruments, given a few examples at inference. We apply a hierarchical loss function to the training of prototypical…

    Submitted 29 July, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

  14. arXiv:2102.08328

    eess.AS cs.LG cs.SD

    Context-Aware Prosody Correction for Text-Based Speech Editing

    Authors: Max Morrison, Lucas Rencker, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo Caceres, Bryan Pardo

    Abstract: Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new context-aware method for more natural sounding text-bas…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: To appear in proceedings of ICASSP 2021

  15. arXiv:2010.12650

    cs.SD cs.LG

    A Study of Transfer Learning in Music Source Separation

    Authors: Andreas Bugler, Bryan Pardo, Prem Seetharaman

    Abstract: Supervised deep learning methods for performing audio source separation can be very effective in domains where there is a large amount of training data. While some music domains have enough data suitable for training a separation system, such as rock and pop genres, many musical domains do not, such as classical music, choral music, and non-Western music traditions. It is well known that transferr…

    Submitted 23 October, 2020; originally announced October 2020.

    Comments: 4 pages + 1 reference page. 3 figures. Submitted to ICASSP

    ACM Class: I.5.4

  16. arXiv:2009.13729

    cs.SD cs.LG eess.AS

    Bespoke Neural Networks for Score-Informed Source Separation

    Authors: Ethan Manilow, Bryan Pardo

    Abstract: In this paper, we introduce a simple method that can separate arbitrary musical instruments from an audio mixture. Given an unaligned MIDI transcription for a target instrument from an input mixture, we synthesize new mixtures from the MIDI transcription that sound similar to the mixture to be separated. This lets us create a labeled training set to train a network on the specific bespoke task. Wh…

    Submitted 28 September, 2020; originally announced September 2020.

    Comments: ISMIR 2020 - Late Breaking Demo

  17. arXiv:2007.14469

    eess.AS cs.LG cs.SD stat.ML

    AutoClip: Adaptive Gradient Clipping for Source Separation Networks

    Authors: Prem Seetharaman, Gordon Wichern, Bryan Pardo, Jonathan Le Roux

    Abstract: Clipping the gradient is a known approach to improving gradient descent, but requires hand selection of a clipping threshold hyperparameter. We present AutoClip, a simple method for automatically and adaptively choosing a gradient clipping threshold, based on the history of gradient norms observed during training. Experimental results show that applying AutoClip results in improved generalization…

    Submitted 25 July, 2020; originally announced July 2020.

    Comments: Accepted at 2020 IEEE International Workshop on Machine Learning for Signal Processing, Sept. 21–24, 2020, Espoo, Finland

  18. arXiv:2006.13331

    cs.SD cs.LG eess.AS stat.ML

    Incorporating Music Knowledge in Continual Dataset Augmentation for Music Generation

    Authors: Alisa Liu, Alexander Fang, Gaëtan Hadjeres, Prem Seetharaman, Bryan Pardo

    Abstract: Deep learning has rapidly become the state-of-the-art approach for music generation. However, training a deep model typically requires a large training set, which is often not available for specific musical styles. In this paper, we present augmentative generation (Aug-Gen), a method of dataset augmentation for any music generation system trained on a resource-constrained domain. The key intuition…

    Submitted 20 July, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: 2 pages, 2 figures, Machine Learning for Media Discovery (ML4MD) Workshop at ICML 2020

  19. arXiv:2006.13329

    cs.SD cs.LG eess.AS stat.ML

    Bach or Mock? A Grading Function for Chorales in the Style of J.S. Bach

    Authors: Alexander Fang, Alisa Liu, Prem Seetharaman, Bryan Pardo

    Abstract: Deep generative systems that learn probabilistic models from a corpus of existing music do not explicitly encode knowledge of a musical style, compared to traditional rule-based systems. Thus, it can be difficult to determine whether deep models generate stylistically correct output without expert evaluation, but this is expensive and time-consuming. Therefore, there is a need for automatic, inter…

    Submitted 17 July, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: 2 pages, 3 figures, Machine Learning for Media Discovery (ML4MD) Workshop at ICML 2020

  20. arXiv:1911.02073

    cs.SD cs.LG eess.AS

    OtoMechanic: Auditory Automobile Diagnostics via Query-by-Example

    Authors: Max Morrison, Bryan Pardo

    Abstract: Early detection and repair of failing components in automobiles reduces the risk of vehicle failure in life-threatening situations. Many automobile components in need of repair produce characteristic sounds. For example, loose drive belts emit a high-pitched squeaking sound, and bad starter motors have a characteristic whirring or clicking noise. Often drivers can tell that the sound of their car…

    Submitted 5 November, 2019; originally announced November 2019.

    Comments: Submitted to Workshop on Detection and Classification of Acoustic Scenes and Events 2019 (DCASE2019)

  21. arXiv:1910.12626

    eess.AS cs.LG cs.SD stat.ML

    Model selection for deep audio source separation via clustering analysis

    Authors: Alisa Liu, Prem Seetharaman, Bryan Pardo

    Abstract: Audio source separation is the process of separating a mixture (e.g. a pop band recording) into isolated sounds from individual sources (e.g. just the lead vocals). Deep learning models are the state-of-the-art in source separation, given that the mixture to be separated is similar to the mixtures the deep model was trained on. This requires the end user to know enough about each model's training…

    Submitted 26 July, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

  22. arXiv:1910.12621

    eess.AS cs.LG cs.SD

    Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments

    Authors: Ethan Manilow, Prem Seetharaman, Bryan Pardo

    Abstract: We present a single deep learning architecture that can both separate an audio recording of a musical mixture into constituent single-instrument recordings and transcribe these instruments into a human-readable format at the same time, learning a shared musical representation for both tasks. This novel architecture, which we call Cerberus, builds on the Chimera network for source separation by add…

    Submitted 12 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP 2020

  23. arXiv:1910.11133

    cs.SD cs.LG eess.AS stat.ML

    Bootstrapping deep music separation from primitive auditory grouping principles

    Authors: Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo

    Abstract: Separating an audio scene such as a cocktail party into constituent, meaningful components is a core task in computer audition. Deep networks are the state-of-the-art approach. They are trained on synthetic mixtures of audio made from isolated sound source recordings so that ground truth for the separation is known. However, the vast majority of available audio is not isolated. The brain uses prim…

    Submitted 23 October, 2019; originally announced October 2019.

  24. arXiv:1811.02130

    cs.SD cs.AI cs.LG eess.AS stat.ML

    Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures

    Authors: Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo

    Abstract: Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. Source separation systems based on deep learning are currently the most successful approaches for solving the underdetermined separation problem, where there are more sources than channels. Traditionally, such systems are trained on sound mixtures…

    Submitted 5 November, 2018; originally announced November 2018.

    Comments: 5 pages, 2 figures

  25. arXiv:1804.08300

    cs.SD eess.AS

    An Overview of Lead and Accompaniment Separation in Music

    Authors: Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, Derry FitzGerald, Bryan Pardo

    Abstract: Popular music is often composed of an accompaniment and a lead component, the latter typically consisting of vocals. Filtering such mixtures to extract one or both components has many applications, such as automatic karaoke and remixing. This particular case of source separation yields very specific challenges and opportunities, including the particular complexity of musical structures, but also r…

    Submitted 23 April, 2018; originally announced April 2018.