-
Shennong: a Python toolbox for audio speech features extraction
Authors:
Mathieu Bernard,
Maxime Poli,
Julien Karadayi,
Emmanuel Dupoux
Abstract:
We introduce Shennong, a Python toolbox and command-line utility for speech feature extraction. It implements a wide range of well-established, state-of-the-art algorithms, including spectro-temporal filters such as Mel-Frequency Cepstral Filterbanks or Predictive Linear Filters, pre-trained neural networks, pitch estimators, speaker normalization methods and post-processing algorithms. Shennong is an open source, easy-to-use, reliable and extensible framework. The use of Python makes integration with other speech modeling and machine learning tools easy. It aims to replace or complement several heterogeneous software packages, such as Kaldi or Praat. After describing the Shennong software architecture, its core components and implemented algorithms, this paper illustrates its use on three applications: a comparison of speech feature performance on a phone discrimination task, an analysis of a Vocal Tract Length Normalization model as a function of the speech duration used for training, and a comparison of pitch estimation algorithms under various noise conditions.
Submitted 10 December, 2021;
originally announced December 2021.
-
The Zero Resource Speech Challenge 2021: Spoken language modelling
Authors:
Ewan Dunbar,
Mathieu Bernard,
Nicolas Hamilakis,
Tu Anh Nguyen,
Maureen de Seyssel,
Patricia Rozé,
Morgane Rivière,
Eugene Kharitonov,
Emmanuel Dupoux
Abstract:
We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text. We provide a pipeline baseline system consisting of an encoder based on contrastive predictive coding (CPC), a quantizer ($k$-means) and a standard language model (BERT or LSTM). The metrics evaluate the learned representations at the acoustic (ABX discrimination), lexical (spot-the-word), syntactic (acceptability judgment) and semantic (similarity judgment) levels. We present an overview of the eight submitted systems from four groups and discuss the main results.
Submitted 9 August, 2021; v1 submitted 29 April, 2021;
originally announced April 2021.
-
Asymptotic Shape of Quantum Markov Semigroups for Compact Uniform Trees
Authors:
Margarita Belova,
Matthew Bernard
Abstract:
We give locally finite Markov trees in $L^p$-compact, separable Hilbert, supersymmetric process: $[0,\infty)\!\times\!\mathbb{R}^{\lvert\mathcal{A}^{\otimes m}\rvert}/\mathcal{A}^{\otimes m}$ on quantum ${\rm U}(\lvert\mathcal{A}^{\otimes m}\rvert)$ semigroups. In full automorphism group ${\rm Aut}({\rm\bf T})$ of modular subgroup, asymptotic-ergodicity is entropy-worthy $\mathbb{R}$ shape for uniform partition.
Submitted 1 December, 2020;
originally announced December 2020.
-
The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units
Authors:
Ewan Dunbar,
Julien Karadayi,
Mathieu Bernard,
Xuan-Nga Cao,
Robin Algayres,
Lucas Ondel,
Laurent Besacier,
Sakriani Sakti,
Emmanuel Dupoux
Abstract:
We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.
Submitted 12 October, 2020;
originally announced October 2020.
-
The Zero Resource Speech Challenge 2019: TTS without T
Authors:
Ewan Dunbar,
Robin Algayres,
Julien Karadayi,
Mathieu Bernard,
Juan Benjumea,
Xuan-Nga Cao,
Lucie Miskic,
Charlotte Dugrain,
Lucas Ondel,
Alan W. Black,
Laurent Besacier,
Sakriani Sakti,
Emmanuel Dupoux
Abstract:
We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery dataset) and align them to the voice recordings in a way that works best for the purpose of synthesizing novel utterances from novel speakers, similar to the target speaker's voice. We describe the metrics used for evaluation, a baseline system consisting of unsupervised subword unit discovery plus a standard TTS system, and a topline TTS using gold phoneme transcriptions. We present an overview of the 19 submitted systems from 10 teams and discuss the main results.
Submitted 7 July, 2019; v1 submitted 25 April, 2019;
originally announced April 2019.
-
IntPhys: A Framework and Benchmark for Visual Intuitive Physics Reasoning
Authors:
Ronan Riochet,
Mario Ynocente Castro,
Mathieu Bernard,
Adam Lerer,
Rob Fergus,
Véronique Izard,
Emmanuel Dupoux
Abstract:
In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation benchmark which diagnoses how much a given system understands about physics by testing whether it can tell apart well matched videos of possible versus impossible events constructed with a game engine. The test requires systems to compute a physical plausibility score over an entire video. It is free of bias and can test a range of basic physical reasoning concepts. We then describe two Deep Neural Networks systems aimed at learning intuitive physics in an unsupervised way, using only physically possible videos. The systems are trained with a future semantic mask prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights in the potentials and limitations of next frame prediction architectures.
Submitted 11 February, 2020; v1 submitted 20 March, 2018;
originally announced March 2018.
-
The Zero Resource Speech Challenge 2017
Authors:
Ewan Dunbar,
Xuan Nga Cao,
Juan Benjumea,
Julien Karadayi,
Mathieu Bernard,
Laurent Besacier,
Xavier Anguera,
Emmanuel Dupoux
Abstract:
We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the follow-up to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed.
Submitted 12 December, 2017;
originally announced December 2017.