Showing 1–14 of 14 results for author: Sigtia, S

  1. A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

    Authors: Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi

    Abstract: Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from th…

    Submitted 26 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: arXiv admin note: text overlap with arXiv:2312.03632

  2. arXiv:2312.03632  [pdf, other]

    cs.SD cs.LG eess.AS

    Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

    Authors: Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi

    Abstract: Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this…

    Submitted 6 December, 2023; originally announced December 2023.

  3. arXiv:2204.02455  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Voice Trigger Detection with Metric Learning

    Authors: Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

    Abstract: Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented…

    Submitted 13 September, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted at InterSpeech 2022

  4. arXiv:2105.06598  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation

    Authors: Vineet Garg, Wonil Chang, Siddharth Sigtia, Saurabh Adya, Pramod Simha, Pranay Dighe, Chandra Dhir

    Abstract: We present a unified and hardware efficient architecture for two stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two stage VTD systems of voice assistants can get falsely activated to audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post trigger audio context. Traditional FTM systems rely on automatic…

    Submitted 13 May, 2021; originally announced May 2021.

  5. arXiv:2010.15446  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Progressive Voice Trigger Detection: Accuracy vs Latency

    Authors: Siddharth Sigtia, John Bridle, Hywel Richards, Pascal Clark, Erik Marchi, Vineet Garg

    Abstract: We present an architecture for voice trigger detection for virtual assistants. The main idea in this work is to exploit information in words that immediately follow the trigger phrase. We first demonstrate that by including more audio context after a detected trigger phrase, we can indeed get a more accurate decision. However, waiting to listen to more audio each time incurs a latency increase. Pr…

    Submitted 2 March, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

    Comments: Camera Ready Version: ICASSP 2021

  6. arXiv:2008.02323  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

    Authors: Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir

    Abstract: We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first-pass. Our baseline is an acoustic model (AM), with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention…

    Submitted 5 August, 2020; originally announced August 2020.

    Comments: INTERSPEECH, 2020

  7. arXiv:2001.10816  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Multi-task Learning for Speaker Verification and Voice Trigger Detection

    Authors: Siddharth Sigtia, Erik Marchi, Sachin Kajarekar, Devang Naik, John Bridle

    Abstract: Automatic speech transcription and speaker recognition are usually treated as separate tasks even though they are interdependent. In this study, we investigate training a single network to perform both tasks jointly. We train the network in a supervised multi-task learning setup, where the speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classific…

    Submitted 26 January, 2020; originally announced January 2020.

    Journal ref: International Conference on Acoustics, Speech and Signal Processing (ICASSP), Spain, 2020, pp. 6844-6848

  8. arXiv:2001.09519  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Multi-task Learning for Voice Trigger Detection

    Authors: Siddharth Sigtia, Pascal Clark, Rob Haynes, Hywel Richards, John Bridle

    Abstract: We describe the design of a voice trigger detection system for smart speakers. In this study, we address two major challenges. The first is that the detectors are deployed in complex acoustic environments with external noise and loud playback by the device itself. Secondly, collecting training examples for a specific keyword or trigger phrase is challenging resulting in a scarcity of trigger phras…

    Submitted 20 April, 2020; v1 submitted 26 January, 2020; originally announced January 2020.

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 7449-7453

  9. Automatic Environmental Sound Recognition: Performance versus Computational Cost

    Authors: Siddharth Sigtia, Adam M. Stark, Sacha Krstulovic, Mark D. Plumbley

    Abstract: In the context of the Internet of Things (IoT), sound sensing applications are required to run on embedded platforms where notions of product pricing and form factor impose hard constraints on the available computing power. Whereas Automatic Environmental Sound Recognition (AESR) algorithms are most often developed with limited consideration for computational cost, this article seeks which AESR al…

    Submitted 15 July, 2016; originally announced July 2016.

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing 24(11): 2096-2107, Nov 2016

  10. Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging

    Authors: Yong Xu, Qiang Huang, Wenwu Wang, Peter Foster, Siddharth Sigtia, Philip J. B. Jackson, Mark D. Plumbley

    Abstract: Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the interested acoustic scene. In this paper we make contributions to audio tagging in two parts, respectively, acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classific…

    Submitted 29 November, 2016; v1 submitted 13 July, 2016; originally announced July 2016.

    Comments: 10 pages, dcase 2016 challenge

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing 25(6):1230-1241, Jun 2017

  11. arXiv:1604.04153  [pdf, other]

    cs.NE

    Learning to Generate Genotypes with Neural Networks

    Authors: Alexander W. Churchill, Siddharth Sigtia, Chrisantha Fernando

    Abstract: Neural networks and evolutionary computation have a rich intertwined history. They most commonly appear together when an evolutionary algorithm optimises the parameters and topology of a neural network for reinforcement learning problems, or when a neural network is applied as a surrogate fitness function to aid the evolutionary optimisation of expensive fitness functions. In this paper we take a…

    Submitted 14 April, 2016; originally announced April 2016.

  12. arXiv:1508.01774  [pdf, other]

    stat.ML cs.LG cs.SD

    An End-to-End Neural Network for Polyphonic Piano Music Transcription

    Authors: Siddharth Sigtia, Emmanouil Benetos, Simon Dixon

    Abstract: We present a supervised neural network model for polyphonic piano music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models t…

    Submitted 11 February, 2016; v1 submitted 7 August, 2015; originally announced August 2015.

  13. arXiv:1411.1623  [pdf, ps, other]

    cs.LG

    A Hybrid Recurrent Neural Network For Music Transcription

    Authors: Siddharth Sigtia, Emmanouil Benetos, Nicolas Boulanger-Lewandowski, Tillman Weyde, Artur S. d'Avila Garcez, Simon Dixon

    Abstract: We investigate the problem of incorporating higher-level symbolic score-like information into Automatic Music Transcription (AMT) systems to improve their performance. We use recurrent neural networks (RNNs) and their variants as music language models (MLMs) and present a generative architecture for combining these models with predictions from a frame level acoustic classifier. We also compare dif…

    Submitted 6 November, 2014; originally announced November 2014.

  14. arXiv:1404.1614  [pdf, other]

    cs.NE cs.LG

    A Denoising Autoencoder that Guides Stochastic Search

    Authors: Alexander W. Churchill, Siddharth Sigtia, Chrisantha Fernando

    Abstract: An algorithm is described that adaptively learns a non-linear mutation distribution. It works by training a denoising autoencoder (DA) online at each generation of a genetic algorithm to reconstruct a slowly decaying memory of the best genotypes so far. A compressed hidden layer forces the autoencoder to learn hidden features in the training set that can be used to accelerate search on novel probl…

    Submitted 6 April, 2014; originally announced April 2014.

    Comments: Submitted to Parallel Problem Solving from Nature 2014