Showing 1–26 of 26 results for author: Hori, T

  1. arXiv:2303.03329  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Speech Recognition: A Survey

    Authors: Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe

    Abstract: In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning. In the wake of this transition, a number of all-neural ASR architectures were introduced. These so-called end-to-end (E2E) models provide highly integrated, completely neural AS…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  2. arXiv:2211.01438  [pdf, other]

    eess.AS cs.CL cs.SD

    Variable Attention Masking for Configurable Transformer Transducer Speech Recognition

    Authors: Pawel Swietojanski, Stefan Braun, Dogan Can, Thiago Fraga da Silva, Arnab Ghoshal, Takaaki Hori, Roger Hsiao, Henry Mason, Erik McDermott, Honza Silovsky, Ruchir Travadi, Xiaodan Zhuang

    Abstract: This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, where the same attention mask is applied at every frame, with chunked masking, where the attention mask for each frame is determined by chunk boundaries,…

    Submitted 18 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: To appear in ICASSP 2023

    Journal ref: International Conference on Acoustics, Speech, and Signal Processing, 2023

  3. arXiv:2203.00232  [pdf, other]

    cs.SD cs.CL eess.AS

    Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR

    Authors: Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

    Abstract: Graph-based temporal classification (GTC), a generalized form of the connectionist temporal classification loss, was recently proposed to improve automatic speech recognition (ASR) systems using graph-based supervision. For example, GTC was first used to encode an N-best list of pseudo-label sequences into a graph for semi-supervised learning. In this paper, we propose an extension of GTC to model…

    Submitted 1 March, 2022; originally announced March 2022.

    Comments: To appear in ICASSP2022

  4. arXiv:2111.01272  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Sequence Transduction with Graph-based Supervision

    Authors: Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

    Abstract: The recurrent neural network transducer (RNN-T) objective plays a major role in building today's best automatic speech recognition (ASR) systems for production. Similarly to the connectionist temporal classification (CTC) objective, the RNN-T loss uses specific rules that define how a set of alignments is generated to form a lattice for the full-sum training. However, it is yet largely unknown if…

    Submitted 31 March, 2022; v1 submitted 1 November, 2021; originally announced November 2021.

    Comments: Accepted for publication at IEEE ICASSP 2022

  5. arXiv:2110.04948  [pdf, other]

    eess.AS cs.SD

    Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy

    Authors: Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

    Abstract: Pseudo-labeling (PL), a semi-supervised learning (SSL) method where a seed model performs self-training using pseudo-labels generated from untranscribed speech, has been shown to enhance the performance of end-to-end automatic speech recognition (ASR). Our prior work proposed momentum pseudo-labeling (MPL), which performs PL-based SSL via an interaction between online and offline models, inspired…

    Submitted 10 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP2022

  6. arXiv:2107.01269  [pdf, other]

    eess.AS cs.LG cs.SD

    Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition

    Authors: Niko Moritz, Takaaki Hori, Jonathan Le Roux

    Abstract: Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it was spoken. In this work, we present the dual causal/non-causal self-attention (D…

    Submitted 2 July, 2021; originally announced July 2021.

    Comments: Accepted to Interspeech 2021

  7. arXiv:2106.08922  [pdf, other]

    eess.AS cs.LG cs.SD

    Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition

    Authors: Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

    Abstract: Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL can be further improved by iteratively updating pseudo-labels as the model evolves, most of the previous approaches involve inefficient retraining of the model or intricate control of the label updat…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  8. arXiv:2104.02858  [pdf, other]

    eess.AS cs.LG cs.SD

    Capturing Multi-Resolution Context by Dilated Self-Attention

    Authors: Niko Moritz, Takaaki Hori, Jonathan Le Roux

    Abstract: Self-attention has become an important and widely used neural network component that helped to establish new state-of-the-art results for various applications, such as machine translation and automatic speech recognition (ASR). However, the computational complexity of self-attention grows quadratically with the input sequence length. This can be particularly problematic for applications such as AS…

    Submitted 6 April, 2021; originally announced April 2021.

    Comments: In Proc. ICASSP 2021

  9. arXiv:2012.13006  [pdf, other]

    eess.AS cs.SD

    The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

    Authors: Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Chenda Li, Jing Shi, Aswin Shanmugam Subramanian, Wangyou Zhang

    Abstract: This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text…

    Submitted 23 December, 2020; originally announced December 2020.

  10. arXiv:2011.13439  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training

    Authors: Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux

    Abstract: The performance of automatic speech recognition (ASR) systems typically degrades significantly when the training and test data domains are mismatched. In this paper, we show that self-training (ST) combined with an uncertainty-based pseudo-label filtering approach can be effectively used for domain adaptation. We propose DUST, a dropout-based uncertainty-driven self-training technique which uses a…

    Submitted 16 February, 2021; v1 submitted 26 November, 2020; originally announced November 2020.

    Comments: ICASSP 2021

  11. arXiv:2010.15653  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Semi-Supervised Speech Recognition via Graph-based Temporal Classification

    Authors: Niko Moritz, Takaaki Hori, Jonathan Le Roux

    Abstract: Semi-supervised learning has demonstrated promising results in automatic speech recognition (ASR) by self-training using a seed ASR model with pseudo-labels generated for unlabeled data. The effectiveness of this approach largely relies on the pseudo-label accuracy, for which typically only the 1-best ASR hypothesis is used. However, alternative ASR hypotheses of an N-best list can provide more ac…

    Submitted 16 February, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

    Comments: ICASSP 2021

  12. arXiv:2002.06165  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR

    Authors: Leda Sarı, Niko Moritz, Takaaki Hori, Jonathan Le Roux

    Abstract: We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: To appear in Proc. ICASSP 2020

  13. arXiv:2001.02674  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS stat.ML

    Streaming automatic speech recognition with the transformer model

    Authors: Niko Moritz, Takaaki Hori, Jonathan Le Roux

    Abstract: Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) compared to recurrent neural network (RNN) based system architectures. Despite its…

    Submitted 30 June, 2020; v1 submitted 8 January, 2020; originally announced January 2020.

  14. A Comparative Study on Transformer vs RNN in Speech Applications

    Authors: Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

    Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We underto…

    Submitted 28 September, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted at ASRU 2019

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop 2019

  15. arXiv:1906.08041  [pdf, other]

    eess.AS cs.CL cs.SD

    Multi-Stream End-to-End Speech Recognition

    Authors: Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Shinji Watanabe, Takaaki Hori, Hynek Hermansky

    Abstract: Attention-based methods and Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR…

    Submitted 18 October, 2019; v1 submitted 17 June, 2019; originally announced June 2019.

    Comments: submitted to IEEE TASLP (In review). arXiv admin note: substantial text overlap with arXiv:1811.04897, arXiv:1811.04903

  16. arXiv:1905.01152  [pdf, ps, other]

    eess.AS cs.CL cs.IR cs.LG cs.SD

    Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text

    Authors: Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki Hori, Lukáš Burget, Jan Černocký

    Abstract: Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge in interest for unsupervised and semi-supervised training in such models. This work builds upon recent results showing notable improvements in semi-supervised training using cycle-consistency and related techniques. Such techniqu…

    Submitted 20 August, 2019; v1 submitted 30 April, 2019; originally announced May 2019.

    Comments: INTERSPEECH 2019

  17. arXiv:1811.04903  [pdf, other]

    cs.CL cs.SD eess.AS

    Stream attention-based multi-array end-to-end speech recognition

    Authors: Xiaofei Wang, Ruizhi Li, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

    Abstract: Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in the far-field robustness. Taking advantage of all the information that each array shares and contributes is crucial in this task. Motivated by the advances of joint Connectionist Temporal Classification (CTC)/attention mechanism in the End-to-End (E2E) ASR, a stream attention-based multi-array framewo…

    Submitted 18 February, 2019; v1 submitted 12 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP 2019

  18. arXiv:1811.04568  [pdf, ps, other]

    cs.SD cs.CL eess.AS stat.ML

    Vectorization of hypotheses and speech for faster beam search in encoder decoder-based speech recognition

    Authors: Hiroshi Seki, Takaaki Hori, Shinji Watanabe

    Abstract: Attention-based encoder decoder network uses a left-to-right beam search algorithm in the inference step. The current beam search expands hypotheses and traverses the expanded hypotheses at the next time step. This traversal is implemented using a for-loop program in general, and it leads to speed down of the recognition process. In this paper, we propose a parallelism technique for beam search, w…

    Submitted 12 November, 2018; originally announced November 2018.

  19. arXiv:1811.03451  [pdf, other]

    eess.AS cs.CL cs.LG

    Analysis of Multilingual Sequence-to-Sequence speech recognition systems

    Authors: Martin Karafiát, Murali Karthick Baskar, Shinji Watanabe, Takaaki Hori, Matthew Wiesner, Jan "Honza" Černocký

    Abstract: This paper investigates the applications of various multilingual approaches developed in conventional hidden Markov model (HMM) systems to sequence-to-sequence (seq2seq) automatic speech recognition (ASR). On a set composed of Babel data, we first show the effectiveness of multi-lingual training with stacked bottle-neck (SBN) features. Then we explore various architectures and training strategies…

    Submitted 7 November, 2018; originally announced November 2018.

    Comments: arXiv admin note: text overlap with arXiv:1810.03459

  20. arXiv:1811.02770  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Promising Accurate Prefix Boosting for sequence-to-sequence ASR

    Authors: Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Martin Karafiát, Takaaki Hori, Jan Honza Černocký

    Abstract: In this paper, we present promising accurate prefix boosting (PAPB), a discriminative training technique for attention based sequence-to-sequence (seq2seq) ASR. PAPB is devised to unify the training and testing scheme in an effective manner. The training procedure involves maximizing the score of each partial correct sequence obtained during beam search compared to other hypotheses. The training o…

    Submitted 7 November, 2018; originally announced November 2018.

  21. arXiv:1811.02735  [pdf, other]

    eess.AS cs.CL cs.SD

    CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments

    Authors: Nelson Yalta, Shinji Watanabe, Takaaki Hori, Kazuhiro Nakadai, Tetsuya Ogata

    Abstract: Casual conversations involving multiple speakers and noises from surrounding devices are common in everyday environments, which degrades the performances of automatic speech recognition systems. These challenging characteristics of environments are the target of the CHiME-5 challenge. By employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system, this st…

    Submitted 20 June, 2019; v1 submitted 6 November, 2018; originally announced November 2018.

    Comments: 5 pages, 1 figure, EUSIPCO 2019

  22. arXiv:1811.02162  [pdf, other]

    eess.AS cs.SD

    Language model integration based on memory control for sequence to sequence speech recognition

    Authors: Jaejin Cho, Shinji Watanabe, Takaaki Hori, Murali Karthick Baskar, Hirofumi Inaguma, Jesus Villalba, Najim Dehak

    Abstract: In this paper, we explore several new schemes to train a seq2seq model to integrate a pre-trained LM. Our proposed fusion methods focus on the memory cell state and the hidden state in the seq2seq decoder long short-term memory (LSTM), and the memory cell state is updated by the LM unlike the prior studies. This means the memory retained by the main seq2seq would be adjusted by the external LM. Th…

    Submitted 5 November, 2018; originally announced November 2018.

    Comments: 4 pages, 1 figure, 5 tables, submitted to ICASSP 2019

  23. arXiv:1811.01690  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Cycle-consistency training for end-to-end speech recognition

    Authors: Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, Jonathan Le Roux

    Abstract: This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have been recently proposed as…

    Submitted 22 May, 2019; v1 submitted 2 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP'19

  24. arXiv:1810.03459  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling

    Authors: Jaejin Cho, Murali Karthick Baskar, Ruizhi Li, Matthew Wiesner, Sri Harish Mallidi, Nelson Yalta, Martin Karafiat, Shinji Watanabe, Takaaki Hori

    Abstract: Sequence-to-sequence (seq2seq) approach for low-resource ASR is a relatively new direction in speech research. The approach benefits by performing model training without using lexicon and alignments. However, this poses a new problem of requiring more data compared to conventional DNN-HMM systems. In this work, we attempt to use data from 10 BABEL languages to build a multi-lingual seq2seq model a…

    Submitted 4 October, 2018; originally announced October 2018.

  25. arXiv:1806.08409  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

    Authors: Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh

    Abstract: Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog dat…

    Submitted 29 June, 2018; v1 submitted 21 June, 2018; originally announced June 2018.

    Comments: A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC7

  26. arXiv:1805.05826  [pdf, other]

    cs.SD cs.CL eess.AS stat.ML

    A Purely End-to-end System for Multi-speaker Speech Recognition

    Authors: Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

    Abstract: Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence fr…

    Submitted 15 May, 2018; originally announced May 2018.

    Comments: ACL 2018