-
Online Continual Learning of End-to-End Speech Recognition Models
Authors:
Muqiao Yang,
Ian Lane,
Shinji Watanabe
Abstract:
Continual Learning, also known as Lifelong Learning, aims to continually learn from new data as it becomes available. While prior research on continual learning in automatic speech recognition has focused on the adaptation of models across multiple different speech recognition tasks, in this paper we propose an experimental setting for \textit{online continual learning} for automatic speech recognition of a single task. Specifically focusing on the case where additional training data for the same task becomes available incrementally over time, we demonstrate the effectiveness of performing incremental model updates to end-to-end speech recognition models with an online Gradient Episodic Memory (GEM) method. Moreover, we show that with online continual learning and a selective sampling strategy, we can maintain an accuracy that is similar to retraining a model from scratch while requiring significantly lower computation costs. We have also verified our method with self-supervised learning (SSL) features.
Submitted 11 July, 2022;
originally announced July 2022.
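Illustrative note on the entry above: the abstract names Gradient Episodic Memory (GEM), which constrains each update so that it does not increase the loss on stored memory examples. The sketch below shows only the single-constraint special case, where a conflicting gradient is projected onto the half-space defined by one memory gradient. It is a simplification under stated assumptions, not the authors' online GEM procedure; the function names and the flat-list gradient representation are hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_gradient(grad: List[float], mem_grad: List[float]) -> List[float]:
    """Single-constraint gradient projection (assumed simplification of GEM).

    If the new-data gradient `grad` conflicts with the memory gradient
    `mem_grad` (negative inner product), remove the conflicting component
    so the result has a zero inner product with `mem_grad`.
    Otherwise return `grad` unchanged.
    """
    inner = dot(grad, mem_grad)
    if inner >= 0.0:
        # No conflict: the update does not increase the memory loss
        # to first order, so it is kept as is.
        return list(grad)
    scale = inner / dot(mem_grad, mem_grad)
    return [g - scale * m for g, m in zip(grad, mem_grad)]


# Hypothetical example values, chosen only to show both branches.
conflicting = project_gradient([1.0, -1.0], [0.0, 1.0])
compatible = project_gradient([1.0, 1.0], [0.0, 1.0])
```

With several memory constraints, GEM as originally formulated solves a small quadratic program instead of this closed-form projection.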
-
Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding
Authors:
Yifan Peng,
Siddharth Dalmia,
Ian Lane,
Shinji Watanabe
Abstract:
Conformer has proven to be effective in many speech processing tasks. It combines the benefits of extracting local dependencies using convolutions and global dependencies using self-attention. Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer, with parallel branches for modeling various ranged dependencies in end-to-end speech processing. In each encoder layer, one branch employs self-attention or its variant to capture long-range dependencies, while the other branch utilizes an MLP module with convolutional gating (cgMLP) to extract local relationships. We conduct experiments on several speech recognition and spoken language understanding benchmarks. Results show that our model outperforms both Transformer and cgMLP. It also matches with or outperforms state-of-the-art results achieved by Conformer. Furthermore, we show various strategies to reduce computation thanks to the two-branch architecture, including the ability to have variable inference complexity in a single trained model. The weights learned for merging branches indicate how local and global dependencies are utilized in different layers, which benefits model designing.
Submitted 6 July, 2022;
originally announced July 2022.
-
Identifying Actions for Sound Event Classification
Authors:
Benjamin Elizalde,
Radu Revutchi,
Samarjit Das,
Bhiksha Raj,
Ian Lane,
Laurie M. Heller
Abstract:
In Psychology, actions are paramount for humans to identify sound events. In Machine Learning (ML), action recognition achieves high accuracy; however, it has not been asked whether identifying actions can benefit Sound Event Classification (SEC), as opposed to mapping the audio directly to a sound event. Therefore, we propose a new Psychology-inspired approach for SEC that includes identification of actions via human listeners. To achieve this goal, we used crowdsourcing to have listeners identify 20 actions that in isolation or in combination may have produced any of the 50 sound events in the well-studied dataset ESC-50. The resulting annotations for each audio recording relate actions to a database of sound events for the first time. The annotations were used to create semantic representations called Action Vectors (AVs). We evaluated SEC by comparing the AVs with two types of audio features -- log-mel spectrograms and state-of-the-art audio embeddings. Because audio features and AVs capture different abstractions of the acoustic content, we combined them and achieved one of the highest reported accuracies (88%).
Submitted 5 August, 2021; v1 submitted 26 April, 2021;
originally announced April 2021.
-
Learning Question-Guided Video Representation for Multi-Turn Video Question Answering
Authors:
Guan-Lin Chao,
Abhinav Rastogi,
Semih Yavuz,
Dilek Hakkani-Tür,
Jindong Chen,
Ian Lane
Abstract:
Understanding and conversing about dynamic scenes is one of the key capabilities of AI agents that navigate the environment and convey useful information to humans. Video question answering is a specific scenario of such AI-human interaction where an agent generates a natural language response to a question regarding the video of a dynamic scene. Incorporating features from multiple modalities, which often provide supplementary information, is one of the challenging aspects of video question answering. Furthermore, a question often concerns only a small segment of the video, hence encoding the entire video sequence using a recurrent neural network is not computationally efficient. Our proposed question-guided video representation module efficiently generates the token-level video summary guided by each word in the question. The learned representations are then fused with the question to generate the answer. Through empirical evaluation on the Audio Visual Scene-aware Dialog (AVSD) dataset, our proposed models in single-turn and multi-turn question answering achieve state-of-the-art performance on several automatic natural language generation evaluation metrics.
Submitted 30 July, 2019;
originally announced July 2019.
-
BERT-DST: Scalable End-to-End Dialogue State Tracking with Bidirectional Encoder Representations from Transformer
Authors:
Guan-Lin Chao,
Ian Lane
Abstract:
An important yet rarely tackled problem in dialogue state tracking (DST) is scalability for dynamic ontology (e.g., movie, restaurant) and unseen slot values. We focus on a specific condition, where the ontology is unknown to the state tracker, but the target slot value (except for none and dontcare), possibly unseen during training, can be found as a word segment in the dialogue context. Prior approaches often rely on candidate generation from n-gram enumeration or slot tagger outputs, which can be inefficient or suffer from error propagation. We propose BERT-DST, an end-to-end dialogue state tracker which directly extracts slot values from the dialogue context. We use BERT as the dialogue context encoder, whose contextualized language representations are suitable for scalable DST to identify slot values from their semantic context. Furthermore, we employ encoder parameter sharing across all slots with two advantages: (1) The number of parameters does not grow linearly with the ontology. (2) Language representation knowledge can be transferred among slots. Empirical evaluation shows BERT-DST with cross-slot parameter sharing outperforms prior work on the benchmark scalable DST datasets Sim-M and Sim-R, and achieves competitive performance on the standard DSTC2 and WOZ 2.0 datasets.
Submitted 5 July, 2019;
originally announced July 2019.
-
Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments
Authors:
Guan-Lin Chao,
William Chan,
Ian Lane
Abstract:
Speech recognition in cocktail-party environments remains a significant challenge for state-of-the-art speech recognition systems, as it is extremely difficult to extract an acoustic signal of an individual speaker from a background of overlapping speech with similar frequency and temporal characteristics. We propose the use of speaker-targeted acoustic and audio-visual models for this task. We complement the acoustic features in a hybrid DNN-HMM model with information about the target speaker's identity as well as visual features from the mouth region of the target speaker. Experimentation was performed using simulated cocktail-party data generated from the GRID audio-visual corpus by overlapping two speakers' speech on a single acoustic channel. Our audio-only baseline achieved a WER of 26.3%. The audio-visual model improved the WER to 4.4%. Introducing speaker identity information had an even more pronounced effect, improving the WER to 3.6%. Combining both approaches, however, did not significantly improve performance further. Our work demonstrates that speaker-targeted models can significantly improve speech recognition in cocktail-party environments.
Submitted 13 June, 2019;
originally announced June 2019.
-
Assignment of excited-state bond lengths using branching-ratio measurements: The B$^2Σ^+$ state of BaH molecules
Authors:
K. Moore,
I. C. Lane,
R. L. McNally,
T. Zelevinsky
Abstract:
Vibrational branching ratios in the B$^2Σ^+$ -- X$^2Σ^+$ and A$^2Π$ -- X$^2Σ^+$ optical-cycling transitions of BaH molecules are investigated using measurements and {\it ab initio} calculations. The experimental values are determined using fluorescence and absorption detection. The observed branching ratios have a very sensitive dependence on the difference in the equilibrium bond length between the excited and ground state, $Δr_e$: a 1 pm (.5\%) displacement can have a 25\% effect on the branching ratios but only a 1\% effect on the lifetime. The measurements are combined with theoretical calculations to reveal a preference for a particular set of published spectroscopic values for the B$^2Σ^+$ state ($Δr_e^{B-X}$ = +5.733 pm), while a larger bond-length difference ($Δr_e^{B-X} = 6.3-6.7$ pm) would match the branching-ratio data even better. By contrast, the observed branching ratio for the A$^2Π_{3/2}$ -- X$^2Σ^+$ transition is in excellent agreement with both the {\it ab initio} result and the spectroscopically measured bond lengths. This shows that care must be taken when estimating branching ratios for molecular laser cooling candidates, as small errors in bond-length measurements can have outsize effects on the suitability for laser cooling. Additionally, our calculations agree more closely with experimental values of the B$^2Σ^+$ state lifetime and spin-rotation constant, and revise the predicted lifetime of the H$^2Δ$ state to 9.5 $μ$s.
Submitted 9 August, 2019; v1 submitted 15 April, 2019;
originally announced April 2019.
-
Speaker Diarization With Lexical Information
Authors:
Tae Jin Park,
Kyu Han,
Ian Lane,
Panayiotis Georgiou
Abstract:
This work presents a novel approach to leverage lexical information for speaker diarization. We introduce a speaker diarization system that can directly integrate lexical as well as acoustic information into a speaker clustering process. Thus, we propose an adjacency matrix integration technique to integrate word-level speaker turn probabilities with speaker embeddings in a comprehensive way. Our proposed method works without any reference transcript. Words and word boundary information are provided by an ASR system. We show that our proposed method improves a baseline speaker diarization system solely based on speaker embeddings, achieving a meaningful improvement on the CALLHOME American English Speech dataset.
Submitted 28 November, 2018; v1 submitted 26 November, 2018;
originally announced November 2018.
-
Understanding and Improving Recurrent Networks for Human Activity Recognition by Continuous Attention
Authors:
Ming Zeng,
Haoxiang Gao,
Tong Yu,
Ole J. Mengshoel,
Helge Langseth,
Ian Lane,
Xiaobing Liu
Abstract:
Deep neural networks, including recurrent networks, have been successfully applied to human activity recognition. Unfortunately, the final representation learned by recurrent networks might encode some noise (irrelevant signal components, unimportant sensor modalities, etc.). Besides, it is difficult to interpret the recurrent networks to gain insight into the models' behavior. To address these issues, we propose two attention models for human activity recognition: temporal attention and sensor attention. These two mechanisms adaptively focus on important signals and sensor modalities. To further improve the understandability and mean F1 score, we add continuity constraints, considering that continuous sensor signals are more robust than discrete ones. We evaluate the approaches on three datasets and obtain state-of-the-art results. Furthermore, qualitative analysis shows that the attention learned by the models agrees well with human intuition.
Submitted 7 October, 2018;
originally announced October 2018.
-
Adversarial Learning of Task-Oriented Neural Dialog Models
Authors:
Bing Liu,
Ian Lane
Abstract:
In this work, we propose an adversarial learning method for reward estimation in reinforcement learning (RL) based task-oriented dialog models. Most of the current RL-based task-oriented dialog systems require access to a reward signal from either user feedback or user ratings. Such user ratings, however, may not always be consistent or available in practice. Furthermore, online dialog policy learning with RL typically requires a large number of queries to users, and thus suffers from a sample efficiency problem. To address these challenges, we propose an adversarial learning method to learn dialog rewards directly from dialog samples. Such rewards are further used to optimize the dialog policy with policy-gradient-based RL. In the evaluation in a restaurant search domain, we show that the proposed adversarial dialog learning method achieves an advanced dialog success rate compared to strong baseline methods. We further discuss the covariate shift problem in online adversarial dialog learning and show how we can address it with partial access to user feedback.
Submitted 29 May, 2018;
originally announced May 2018.
-
Quantitative theoretical analysis of lifetimes and decay rates relevant in laser cooling BaH
Authors:
Keith Moore,
Ian C Lane
Abstract:
Tiny radiative losses below the 0.1% level can prove ruinous to the effective laser cooling of a molecule. In this paper the laser cooling of a hydride is studied with rovibronic detail using ab initio quantum chemistry in order to document the decays to all possible electronic states (not just the vibrational branching within a single electronic transition) and to identify the most populated final quantum states. The effect of spin-orbit and associated couplings on the properties of the lowest excited states of BaH are analysed in detail. The lifetimes of the A$^2Π_{1/2}$, H$^2Δ_{3/2}$ and E$^2Π_{1/2}$ states are calculated (136 ns, 5.8 μs and 46 ns respectively) for the first time, while the theoretical value for B$^2Σ^+_{1/2}$ is in good agreement with experiments. Using a simple rate model the numbers of absorption-emission cycles possible for both one- and two-colour cooling on the competing electronic transitions are determined, and it is clearly demonstrated that the A$^2Π$ - X$^2Σ^+$ transition is superior to B$^2Σ^+$ - X$^2Σ^+$, where multiple tiny decay channels degrade its efficiency. Further possible improvements to the cooling method are proposed.
Submitted 15 March, 2018; v1 submitted 13 March, 2018;
originally announced March 2018.
-
Semi-Supervised Convolutional Neural Networks for Human Activity Recognition
Authors:
Ming Zeng,
Tong Yu,
Xiao Wang,
Le T. Nguyen,
Ole J. Mengshoel,
Ian Lane
Abstract:
Labeled data used for training activity recognition classifiers are usually limited in terms of size and diversity. Thus, the learned model may not generalize well when used in real-world use cases. Semi-supervised learning augments labeled examples with unlabeled examples, often resulting in improved performance. However, the semi-supervised methods studied in the activity recognition literature assume that feature engineering is already done. In this paper, we lift this assumption and present two semi-supervised methods based on convolutional neural networks (CNNs) to learn discriminative hidden features. Our semi-supervised CNNs learn from both labeled and unlabeled data while also performing feature learning on raw sensor data. In experiments on three real-world datasets, we show that our CNNs outperform supervised methods and traditional semi-supervised learning methods by up to 18% in mean F1-score (Fm).
Submitted 22 January, 2018;
originally announced January 2018.
-
The CAPIO 2017 Conversational Speech Recognition System
Authors:
Kyu J. Han,
Akshay Chandrashekaran,
Jungsuk Kim,
Ian Lane
Abstract:
In this paper we show how we have achieved state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation set. We explore densely connected LSTMs, inspired by the densely connected convolutional networks recently introduced for image classification tasks. We also propose an acoustic model adaptation scheme that simply averages the parameters of a seed neural network acoustic model and its adapted version. This method was applied with the CallHome training corpus and improved individual system performance by 6.1% (relative) on average against the CallHome portion of the evaluation set, with no performance loss on the Switchboard portion. With RNN-LM rescoring and lattice combination on the 5 systems trained across three different phone sets, our 2017 speech recognition system has obtained 5.0% and 9.1% on Switchboard and CallHome, respectively, both of which are the best word error rates reported thus far. According to IBM in their latest work to compare human and machine transcriptions, our reported Switchboard word error rate can be considered to surpass the human parity (5.1%) of transcribing conversational telephone speech.
Submitted 9 April, 2018; v1 submitted 29 December, 2017;
originally announced January 2018.
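Illustrative note on the entry above: the abstract describes an adaptation scheme that averages the parameters of a seed acoustic model and its adapted version. The sketch below shows element-wise averaging over two parameter sets stored as plain dictionaries. The dictionary layout, the function name, and the `weight` argument are assumptions made for illustration; the abstract only states that the parameters are simply averaged, which corresponds to the default `weight=0.5`.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def average_parameters(seed: Params, adapted: Params, weight: float = 0.5) -> Params:
    """Interpolate two parameter sets element-wise.

    `weight` is the share given to the adapted model (hypothetical knob);
    0.5 gives the plain average described in the abstract.
    """
    if seed.keys() != adapted.keys():
        raise ValueError("models must have the same parameter names")
    averaged: Params = {}
    for name, seed_values in seed.items():
        adapted_values = adapted[name]
        if len(seed_values) != len(adapted_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        averaged[name] = [
            (1.0 - weight) * s + weight * a
            for s, a in zip(seed_values, adapted_values)
        ]
    return averaged


# Hypothetical toy parameters, not taken from the paper.
seed_model = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
adapted_model = {"layer.weight": [3.0, 5.0], "layer.bias": [2.0]}
merged = average_parameters(seed_model, adapted_model)
```

Averaging keeps the merged model between the seed and adapted models in parameter space, which is one plausible reading of why the abstract reports no loss on the Switchboard portion.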
-
Multi-Domain Adversarial Learning for Slot Filling in Spoken Language Understanding
Authors:
Bing Liu,
Ian Lane
Abstract:
The goal of this paper is to learn cross-domain representations for the slot filling task in spoken language understanding (SLU). Most of the recently published SLU models are domain-specific ones that work on individual task domains. Annotating data for each individual task domain is both financially costly and non-scalable. In this work, we propose an adversarial training method for learning common features and representations that can be shared across multiple domains. A model that produces such shared representations can be combined with models trained on individual domain SLU data to reduce the number of training samples required for developing a new domain. In our experiments using data sets from multiple domains, we show that adversarial training helps in learning better domain-general SLU models, leading to improved slot filling F1 scores. We further show that applying adversarial learning to the domain-general model also helps in achieving higher slot filling performance when the model is jointly optimized with domain-specific models.
Submitted 30 November, 2017;
originally announced November 2017.
-
Customized Nonlinear Bandits for Online Response Selection in Neural Conversation Models
Authors:
Bing Liu,
Tong Yu,
Ian Lane,
Ole J. Mengshoel
Abstract:
Dialog response selection is an important step towards natural response generation in conversational agents. Existing work on neural conversational models mainly focuses on offline supervised learning using a large set of context-response pairs. In this paper, we focus on online learning of response selection in retrieval-based dialog systems. We propose a contextual multi-armed bandit model with a nonlinear reward function that uses distributed representation of text for online response selection. A bidirectional LSTM is used to produce the distributed representations of dialog context and responses, which serve as the input to a contextual bandit. In learning the bandit, we propose a customized Thompson sampling method that is applied to a polynomial feature space in approximating the reward. Experimental results on the Ubuntu Dialogue Corpus demonstrate significant performance gains of the proposed method over conventional linear contextual bandits. Moreover, we report encouraging response selection performance of the proposed neural bandit model using the Recall@k metric for a small set of online training samples.
Submitted 22 November, 2017;
originally announced November 2017.
-
Iterative Policy Learning in End-to-End Trainable Task-Oriented Neural Dialog Models
Authors:
Bing Liu,
Ian Lane
Abstract:
In this paper, we present a deep reinforcement learning (RL) framework for iterative dialog policy optimization in end-to-end task-oriented dialog systems. Popular approaches to learning dialog policy with RL include letting a dialog agent learn against a user simulator. Building a reliable user simulator, however, is not trivial, and is often as difficult as building a good dialog agent. We address this challenge by jointly optimizing the dialog agent and the user simulator with deep RL by simulating dialogs between the two agents. We first bootstrap a basic dialog agent and a basic user simulator by learning directly from dialog corpora with supervised training. We then improve them further by letting the two agents conduct task-oriented dialogs and iteratively optimizing their policies with deep RL. Both the dialog agent and the user simulator are designed with neural network models that can be trained end-to-end. Our experimental results show that the proposed method leads to promising improvements in task success rate and total task reward compared to supervised training and single-agent RL training baseline models.
Submitted 18 September, 2017;
originally announced September 2017.
-
An End-to-End Trainable Neural Network Model with Belief Tracking for Task-Oriented Dialog
Authors:
Bing Liu,
Ian Lane
Abstract:
We present a novel end-to-end trainable neural network model for task-oriented dialog systems. The model is able to track dialog state, issue API calls to a knowledge base (KB), and incorporate structured KB query results into system responses to successfully complete task-oriented dialogs. The proposed model produces well-structured system responses by jointly learning belief tracking and KB result processing conditioned on the dialog history. We evaluate the model in a restaurant search domain using a dataset that is converted from the second Dialog State Tracking Challenge (DSTC2) corpus. Experimental results show that the proposed model can robustly track dialog state given the dialog history. Moreover, our model demonstrates promising results in producing appropriate system responses, outperforming prior end-to-end trainable neural network models using per-response accuracy evaluation metrics.
Submitted 20 August, 2017;
originally announced August 2017.
-
Dialog Context Language Modeling with Recurrent Neural Networks
Authors:
Bing Liu,
Ian Lane
Abstract:
In this work, we propose contextual language models that incorporate dialog-level discourse information into language modeling. Previous work on contextual language models treats preceding utterances as a sequence of inputs, without considering dialog interactions. We design recurrent neural network (RNN) based contextual language models that specially track the interactions between speakers in a dialog. Experimental results on the Switchboard Dialog Act Corpus show that the proposed model outperforms a conventional single-turn RNN language model by 3.3% on perplexity. The proposed models also demonstrate advantageous performance over other competitive contextual language models.
Submitted 15 January, 2017;
originally announced January 2017.
-
An Approach for Self-Training Audio Event Detectors Using Web Data
Authors:
Benjamin Elizalde,
Ankit Shah,
Siddharth Dalmia,
Min Hun Lee,
Rohan Badlani,
Anurag Kumar,
Bhiksha Raj,
Ian Lane
Abstract:
Audio Event Detection (AED) aims to recognize sounds within audio and video recordings. AED employs machine learning algorithms commonly trained and tested on annotated datasets. However, available datasets are limited in number of samples and hence it is difficult to model acoustic diversity. Therefore, we propose combining labeled audio from a dataset and unlabeled audio from the web to improve the sound models. The audio event detectors are trained on the labeled audio and run on the unlabeled audio downloaded from YouTube. Whenever the detectors recognized any of the known sounds with high confidence, the unlabeled audio was used to re-train the detectors. The performance of the re-trained detectors is compared to that of the original detectors using the annotated test set. Results showed an improvement in AED performance and uncovered challenges of using web audio from videos.
Submitted 27 June, 2017; v1 submitted 20 September, 2016;
originally announced September 2016.
-
Joint Online Spoken Language Understanding and Language Modeling with Recurrent Neural Networks
Authors:
Bing Liu,
Ian Lane
Abstract:
Speaker intent detection and semantic slot filling are two critical tasks in spoken language understanding (SLU) for dialogue systems. In this paper, we describe a recurrent neural network (RNN) model that jointly performs intent detection, slot filling, and language modeling. The neural network model keeps updating the intent estimate as each word in the transcribed utterance arrives and uses it as a contextual feature in the joint model. The language model and the online SLU model are evaluated on the ATIS benchmark data set. On the language modeling task, our joint model achieves an 11.8% relative reduction in perplexity compared to the independently trained language model. On SLU tasks, our joint model outperforms the independently trained task model by 22.3% on intent detection error rate, with a slight degradation in slot filling F1 score. The joint model also shows advantageous performance in realistic ASR settings with noisy speech input.
Submitted 6 September, 2016;
originally announced September 2016.
-
Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling
Authors:
Bing Liu,
Ian Lane
Abstract:
Attention-based encoder-decoder neural network models have recently shown promising results in machine translation and speech recognition. In this work, we propose an attention-based neural network model for joint intent detection and slot filling, both of which are critical steps for many speech understanding and dialog systems. Unlike in machine translation and speech recognition, alignment is explicit in slot filling. We explore different strategies for incorporating this alignment information into the encoder-decoder framework. Learning from the attention mechanism in the encoder-decoder model, we further propose introducing attention to the alignment-based RNN models. Such attention provides additional information for intent classification and slot label prediction. Our independent task models achieve state-of-the-art intent detection error rate and slot filling F1 score on the benchmark ATIS task. Our joint training model further obtains a 0.56% absolute (23.8% relative) error reduction on intent detection and a 0.23% absolute gain on slot filling over the independent task models.
Submitted 6 September, 2016;
originally announced September 2016.
-
Experiments on the DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording
Authors:
Benjamin Elizalde,
Anurag Kumar,
Ankit Shah,
Rohan Badlani,
Emmanuel Vincent,
Bhiksha Raj,
Ian Lane
Abstract:
In this paper we present our work on Task 1, Acoustic Scene Classification, and Task 3, Sound Event Detection in Real Life Recordings. Our experiments cover low-level and high-level features, classifier optimization, and other heuristics specific to each task. Our performance on both tasks improved on the DCASE baseline: for Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6%, and for Task 3 we achieved a Segment-Based Error Rate of 0.76 compared to the baseline of 0.91.
Submitted 25 August, 2016; v1 submitted 22 July, 2016;
originally announced July 2016.
-
AudioPairBank: Towards A Large-Scale Tag-Pair-Based Audio Content Analysis
Authors:
Sebastian Sager,
Benjamin Elizalde,
Damian Borth,
Christian Schulze,
Bhiksha Raj,
Ian Lane
Abstract:
Recently, sound recognition has been used to identify sounds, such as car and river. However, sounds have nuances that may be better described by adjective-noun pairs such as slow car, and verb-noun pairs such as flying insects, which are underexplored. Therefore, in this work we investigate the relation between audio content and both adjective-noun pairs and verb-noun pairs. Due to the lack of datasets with these kinds of annotations, we collected and processed the AudioPairBank corpus consisting of a combined total of 1,123 pairs and over 33,000 audio files. One contribution is the previously unavailable documentation of the challenges and implications of collecting audio recordings with these types of labels. A second contribution is to show the degree of correlation between the audio content and the labels through sound recognition experiments, which yielded results of 70% accuracy, hence also providing a performance benchmark. The results and study in this paper encourage further exploration of the nuances in audio and are meant to complement similar research performed on images and text in multimedia analysis.
Submitted 8 January, 2018; v1 submitted 13 July, 2016;
originally announced July 2016.
-
City-Identification of Flickr Videos Using Semantic Acoustic Features
Authors:
Benjamin Elizalde,
Guan-Lin Chao,
Ming Zeng,
Ian Lane
Abstract:
City-identification of videos aims to determine the likelihood of a video belonging to a set of cities. In this paper, we present an approach using only audio; we do not use any additional modality such as images, user-tags or geo-tags. In this manner, we show to what extent the city-location of videos correlates with their acoustic information. Success in this task suggests improvements can be made to complement the other modalities. In particular, we present a method to compute and use semantic acoustic features to perform city-identification, and the features show semantic evidence of the identification. The semantic evidence is given by a taxonomy of urban sounds and expresses the potential presence of these sounds in the city soundtracks. We used the MediaEval Placing Task set, which contains Flickr videos labeled by city. In addition, we used the UrbanSound8K set containing audio clips labeled by sound type. Our method improved on the state-of-the-art performance and provides a novel semantic approach to this task.
Submitted 12 July, 2016;
originally announced July 2016.
-
Environmental Noise Embeddings for Robust Speech Recognition
Authors:
Suyoun Kim,
Bhiksha Raj,
Ian Lane
Abstract:
We propose a novel deep neural network architecture for speech recognition that explicitly employs knowledge of the background environmental noise within a deep neural network acoustic model. A deep neural network is used to predict the acoustic environment in which the system is being used. The discriminative embedding generated at the bottleneck layer of this network is then concatenated with traditional acoustic features as input to a deep neural network acoustic model. Through a series of experiments on the Resource Management, CHiME-3, and Aurora4 tasks, we show that the proposed approach significantly improves speech recognition accuracy in noisy and highly reverberant environments, outperforming multi-condition training, noise-aware training, the i-vector framework, and multi-task learning on both in-domain noise and unseen noise.
Submitted 29 September, 2016; v1 submitted 11 January, 2016;
originally announced January 2016.
-
Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition
Authors:
Suyoun Kim,
Ian Lane
Abstract:
Integration of multiple microphone data is one of the key ways to achieve robust speech recognition in noisy environments or when the speaker is located at some distance from the input device. Signal processing techniques such as beamforming are widely used to extract a speech signal of interest from background noise. These techniques, however, are highly dependent on prior spatial information about the microphones and the environment in which the system is being used. In this work, we present a neural attention network that directly combines multi-channel audio to generate phonetic states without requiring any prior knowledge of the microphone layout or any explicit signal preprocessing for speech enhancement. We embed an attention mechanism within a Recurrent Neural Network (RNN) based acoustic model to automatically tune its attention to a more reliable input source. Unlike traditional multi-channel preprocessing, our system can be optimized towards the desired output in one step. Although attention-based models have recently achieved impressive results on sequence-to-sequence learning, no attention mechanisms have previously been applied to learn potentially asynchronous and non-stationary multiple inputs. We evaluate our neural attention model on the CHiME-3 challenge task, and show that the model achieves comparable performance to beamforming using a purely data-driven method.
Submitted 7 January, 2016; v1 submitted 19 November, 2015;
originally announced November 2015.
-
Towards a spectroscopically accurate set of potentials for heavy hydride laser cooling candidates: effective core potential calculations of BaH
Authors:
Keith Moore,
Brendan M. McLaughlin,
Ian C. Lane
Abstract:
BaH (and its isotopomers) is an attractive molecular candidate for laser cooling to ultracold temperatures and a potential precursor for the production of ultracold gases of hydrogen and deuterium. The theoretical challenge is to simulate the laser cooling cycle as reliably as possible, and this paper addresses the generation of a highly accurate ab initio $^{2}Σ^+$ potential for such studies. The performance of various basis sets within the multi-reference configuration-interaction (MRCI) approximation with the Davidson correction (MRCI+Q) is tested and taken to the complete basis set limit. It is shown that the molecular constants calculated using a 46-electron Effective Core-Potential (ECP) and the augmented polarized core-valence quintuplet basis set (aug-pCV5Z-PP), but including only three active electrons in the MRCI calculation, are in close agreement with the available experimental values. The predicted dissociation energy D$_e$ for the X$^2Σ^+$ state (extrapolated to the complete basis set (CBS) limit) is 16895.12 cm$^{-1}$ (2.094 eV), which agrees to within 0.1$\%$ with a revised experimental value of $<$16910.6 cm$^{-1}$, while the calculated r$_e$ is within 0.03 pm of the experimental result.
Submitted 28 March, 2016; v1 submitted 22 September, 2015;
originally announced September 2015.
-
Transferring Knowledge from a RNN to a DNN
Authors:
William Chan,
Nan Rosemary Ke,
Ian Lane
Abstract:
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a WER of 3.93 on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.
Submitted 7 April, 2015;
originally announced April 2015.
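The distillation objective and the relative improvement quoted in the abstract above can be illustrated with a minimal sketch. It assumes the teacher (RNN) and the student (small DNN) each emit a posterior over acoustic states for a single frame, and that the loss is KL(teacher || student). The logits, the three-state setup, and all function names are hypothetical and not taken from the paper; only the two WER figures (4.54 and 3.93) come from the abstract.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def relative_improvement(baseline, new):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - new) / baseline


# Hypothetical per-frame logits over three acoustic states (illustrative only).
teacher_posterior = softmax([2.0, 0.5, -1.0])   # soft alignment from the RNN
student_posterior = softmax([1.5, 0.7, -0.5])   # output of the small DNN

# Distillation loss for this frame: the student is trained to reduce it.
loss = kl_divergence(teacher_posterior, student_posterior)

# WER figures from the abstract: baseline 4.54, distilled student 3.93.
rel = relative_improvement(4.54, 3.93)
print(f"frame KL loss = {loss:.4f}, relative WER improvement = {rel:.1%}")
```

In practice the loss would be averaged over all frames and minimized by gradient descent on the student's parameters; the sketch only evaluates it for one frame.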
-
Deep Recurrent Neural Networks for Acoustic Modelling
Authors:
William Chan,
Ian Lane
Abstract:
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution the TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a WER of 3.47 on the Wall Street Journal (WSJ) eval92 task, a relative improvement of more than 8% over the baseline DNN models.
Submitted 7 April, 2015;
originally announced April 2015.
-
Ultracold, radiative charge transfer in hybrid Yb ion - Rb atom traps
Authors:
B. M. McLaughlin,
H. D. L. Lamb,
I. C. Lane,
J. F. McCann
Abstract:
Ultracold hybrid ion-atom traps offer the possibility of microscopic manipulation of quantum coherences in the gas using the ion as a probe. However, inelastic processes, particularly charge transfer, can be a significant source of ion loss, and this loss has been measured experimentally for the Yb$^{+}$ ion immersed in a Rb vapour. We use first-principles quantum chemistry codes to obtain the potential energy curves and dipole moments for the lowest-lying energy states of this complex. Cross sections and rate coefficients are presented for the total radiative decay processes. Comparing the semi-classical Langevin approximation with the quantum approach, we find it provides a very good estimate of the background at higher energies. The results demonstrate that radiative decay mechanisms are important over the energy and temperature region considered. In fact, the Langevin process of ion-atom collisions dominates cold ion-atom collisions. For spin-dependent processes \cite{kohl13}, the anisotropic magnetic dipole-dipole interaction and the second-order spin-orbit coupling can play important roles, inducing coupling between the spin and the orbital motion. They measured the spin-relaxing collision rate to be approximately 5 orders of magnitude higher than the charge-exchange collision rate \cite{kohl13}. Regarding the measured radiative charge transfer collision rate, we find that our calculation is in very good agreement with experiment and with previous calculations. Nonetheless, we find no broad resonance features that might underlie a strong isotope effect. In conclusion, we find, in agreement with previous theory, that the isotope anomaly observed in experiment remains an open question.
Submitted 25 April, 2014;
originally announced April 2014.
-
Ultracold hydrogen and deuterium production via Doppler-cooled Feshbach molecules
Authors:
Ian Lane
Abstract:
A counterintuitive scheme to produce ultracold hydrogen via fragmentation of laser-cooled diatomic hydrides is presented where the final atomic H temperature is inversely proportional to the mass of the molecular parent. In addition, the critical density for formation of a Bose-Einstein Condensate (BEC) at a fixed temperature is reduced, relative to directly cooled hydrogen atoms, by a factor equal to the ratio of the hydrogen mass to the parent mass raised to the power 3/2. The narrow Feshbach resonances between a singlet S atom and hydrogen are well suited to the tiny center-of-mass energy release necessary during fragmentation. With the support of ab initio quantum chemistry, it is demonstrated that BaH is an ideal diatomic precursor that can be laser cooled to a Doppler temperature of ~37 microkelvin with just two rovibronic transitions, the simplest molecular cooling scheme identified to date. Preparation of a hydrogen atom gas below the critical BEC temperature Tc is feasible with present cooling technology, with optical pulse control of the condensation process.
Submitted 27 November, 2013;
originally announced November 2013.
-
Structure and interactions of ultracold Yb ions and Rb atoms
Authors:
H. D. L. Lamb,
J. F. McCann,
B. M. McLaughlin,
J. Goold,
N. Wells,
I. Lane
Abstract:
In order to study ultracold charge-transfer processes in hybrid atom-ion traps, we have mapped out the potential energy curves and molecular parameters for several low-lying states of the Rb, Yb$^+$ system. We employ both a multi-reference configuration interaction (MRCI) and a full configuration interaction (FCI) approach. Turning points, crossing points, potential minima and spectroscopic molecular constants are obtained for the lowest five molecular states. Long-range parameters, including the dispersion coefficients, are estimated from our {\it ab initio} data. The separated-atom ionization potentials and atomic polarizability of the ytterbium atom ($α_d=128.4$ atomic units) are in good agreement with experiment and previous calculations. We present some dynamical calculations of (adiabatic) scattering lengths for the two lowest (Yb,Rb$^+$) channels. However, we find that the pseudopotential approximation is rather limited in validity, and only applies at nK temperatures. The adiabatic scattering lengths for both the triplet and singlet channels are large and negative in the FCI approximation.
Submitted 23 July, 2012; v1 submitted 6 July, 2011;
originally announced July 2011.
-
Doppler cooling of gallium atoms: 2. Simulation in complex multilevel systems
Authors:
L Rutherford,
I C Lane,
J F McCann
Abstract:
This paper derives a general procedure for the numerical solution of the Lindblad equations that govern the coherences arising from multicoloured light interacting with a multilevel system. A systematic approach to finding the conservative and dissipative terms is derived and applied to the laser cooling of gallium. An improved numerical method is developed to solve the time-dependent master equation and results are presented for transient cooling processes. The method is significantly more robust, efficient and accurate than the standard method and can be applied to a broad range of atomic and molecular systems. Radiation pressure forces and the formation of dynamic dark-states are studied in the gallium isotope 66Ga.
Submitted 3 June, 2010;
originally announced June 2010.
-
Measurement of the 1s-2s energy interval in muonium
Authors:
V. Meyer,
S. N. Bagayev,
P. E. G. Baird,
P. Bakule,
M. G. Boshier,
A. Breitrueck,
S. L. Cornish,
S. Dychkov,
G. H. Eaton,
A. Grossmann,
D. Huebl,
V. W. Hughes,
K. Jungmann,
I. C. Lane,
Y. W. Liu,
D. Lucas,
Y. Matyugin,
J. Merkel,
G. zu Putlitz,
I. Reinhard,
P. G. H. Sandars,
R. Santra,
P. Schmidt,
C. A. Scott,
W. T. Toner
, et al. (4 additional authors not shown)
Abstract:
The 1s-2s interval has been measured in the muonium ({$μ^+e^-$}) atom by Doppler-free two-photon laser spectroscopy. The frequency separation of the states was determined to be 2 455 528 941.0(9.8) MHz, in good agreement with quantum electrodynamics. The muon-electron mass ratio can be extracted and is found to be 206.768 38(17). The result may be interpreted as a measurement of the muon-electron charge ratio as $-1 - 1.1(2.1)\cdot 10^{-9}$.
Submitted 12 July, 1999;
originally announced July 1999.