Showing 1–45 of 45 results for author: Kawahara, T

  1. arXiv:2408.16180

    eess.AS cs.CL cs.SD

    Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction

    Authors: Yuka Ko, Sheng Li, Chao-Han Huck Yang, Tatsuya Kawahara

    Abstract: With the strong representational power of large language models (LLMs), generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and phonetic refinements to address ASR errors. This work explores how LLM-based GER can enhance and expand the capabilities of Japanese language processing, presenting the first GER benchmark for Japanese ASR with 0.9-2.6k text u…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: submitted to SLT2024

  2. arXiv:2408.02271

    cs.CL

    StyEmp: Stylizing Empathetic Response Generation via Multi-Grained Prefix Encoder and Personality Reinforcement

    Authors: Yahui Fu, Chenhui Chu, Tatsuya Kawahara

    Abstract: Recent approaches for empathetic response generation mainly focus on emotional resonance and user understanding, without considering the system's personality. Consistent personality is evident in real human expression and is important for creating trustworthy systems. To address this problem, we propose StyEmp, which aims to stylize the empathetic response generation with a consistent personality.…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Accepted by the 25th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2024)

  3. arXiv:2403.06487

    cs.CL cs.SD eess.AS

    Multilingual Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The re…

    Submitted 14 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for presentation at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) and represents the author's version of the work

  4. arXiv:2402.18275

    cs.SD cs.CL eess.AS

    Exploration of Adapter for Noise Robust Automatic Speech Recognition

    Authors: Hao Shi, Tatsuya Kawahara

    Abstract: Adapting an automatic speech recognition (ASR) system to unseen noise environments is crucial. Integrating adapters into neural networks has emerged as a potent technique for transfer learning. This study thoroughly investigates adapter-based ASR adaptation in noisy environments. We conducted experiments using the CHiME-4 dataset. The results show that inserting the adapter in the shallow layer y…

    Submitted 4 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

  5. arXiv:2402.14863

    cs.CL

    Evaluation of a semi-autonomous attentive listening system with takeover prompting

    Authors: Haruki Kawai, Divesh Lala, Koji Inoue, Keiko Ochi, Tatsuya Kawahara

    Abstract: The handling of communication breakdowns and loss of engagement is an important aspect of spoken dialogue systems, particularly for chatting systems such as attentive listening, where the user is mostly speaking. We presume that a human is best equipped to handle this task and rescue the flow of conversation. To this end, we propose a semi-autonomous system, where a remote operator can take contro…

    Submitted 20 February, 2024; originally announced February 2024.

  6. arXiv:2402.12770

    cs.CL

    Acknowledgment of Emotional States: Generating Validating Responses for Empathetic Dialogue

    Authors: Zi Haur Pang, Yahui Fu, Divesh Lala, Keiko Ochi, Koji Inoue, Tatsuya Kawahara

    Abstract: In the realm of human-AI dialogue, the facilitation of empathetic responses is important. Validation is one of the key communication techniques in psychology, which entails recognizing, understanding, and acknowledging others' emotional states, thoughts, and actions. This study introduces the first framework designed to engender empathetic dialogue with validating responses. Our approach incorpora…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024)

  7. arXiv:2401.13249

    eess.AS cs.MM

    MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction

    Authors: Wangjin Zhou, Zhengdong Yang, Chenhui Chu, Sheng Li, Raj Dabre, Yi Zhao, Tatsuya Kawahara

    Abstract: Automatic Mean Opinion Score (MOS) prediction is employed to evaluate the quality of synthetic speech. This study extends the application of predicted MOS to the task of Fake Audio Detection (FAD), as we expect that MOS can be used to assess how close synthesized speech is to the natural human voice. We propose MOS-FAD, where MOS can be leveraged at two key points in FAD: training data selection a…

    Submitted 24 January, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted in ICASSP2024

  8. arXiv:2401.05871

    cs.CL

    Enhancing Personality Recognition in Dialogue by Data Augmentation and Heterogeneous Conversational Graph Networks

    Authors: Yahui Fu, Haiyue Song, Tianyu Zhao, Tatsuya Kawahara

    Abstract: Personality recognition is useful for enhancing robots' ability to tailor user-adaptive responses, thus fostering rich human-robot interactions. One of the challenges in this task is a limited number of speakers in existing dialogue corpora, which hampers the development of robust, speaker-independent personality recognition models. Additionally, accurately modeling both the interdependencies amon…

    Submitted 8 March, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024)

  9. arXiv:2401.04868

    cs.CL cs.HC cs.SD eess.AS

    Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input contex…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and represents the author's version of the work

  10. arXiv:2401.04867

    cs.CL cs.AI cs.HC

    An Analysis of User Behaviors for Objectively Evaluating Spoken Dialogue Systems

    Authors: Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze

    Abstract: Establishing evaluation schemes for spoken dialogue systems is important, but it can also be challenging. While subjective evaluations are commonly used in user experiments, objective evaluations are necessary for research comparison and reproducibility. To address this issue, we propose a framework for indirectly but objectively evaluating systems based on users' behaviors. In this paper, to this…

    Submitted 23 January, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and represents the author's version of the work

  11. arXiv:2309.09223

    cs.SD eess.AS

    Zero- and Few-shot Sound Event Localization and Detection

    Authors: Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

    Abstract: Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few…

    Submitted 17 January, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2024

  12. arXiv:2308.11020

    cs.CL cs.HC cs.RO

    Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors

    Authors: Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze

    Abstract: This paper tackles the challenging task of evaluating socially situated conversational robots and presents a novel objective evaluation approach that relies on multimodal user behaviors. In this study, our main focus is on assessing the human-likeness of the robot as the primary evaluation metric. While previous research often relied on subjective evaluations from users, our approach aims to evalu…

    Submitted 25 September, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted by 25th ACM International Conference on Multimodal Interaction (ICMI '23), Late-Breaking Results

  13. arXiv:2308.00085

    cs.CL cs.AI

    Reasoning before Responding: Integrating Commonsense-based Causality Explanation for Empathetic Response Generation

    Authors: Yahui Fu, Koji Inoue, Chenhui Chu, Tatsuya Kawahara

    Abstract: Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user's experiences and feelings. However, these approaches mainly focus on understanding the causalities of context from the user's perspective, ignoring the system's perspective. In this paper, we propose a commonsense-based causality expl…

    Submitted 5 September, 2023; v1 submitted 27 July, 2023; originally announced August 2023.

    Comments: Accepted by the 24th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2023)

  14. arXiv:2305.10734

    cs.SD cs.CL eess.AS

    Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

    Authors: Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

    Abstract: Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider for a combined use of generative and predictive decoders. The predictive decoder allows us…

    Submitted 28 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  15. arXiv:2303.14593

    cs.SD eess.AS

    Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

    Authors: Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang, Tatsuya Kawahara

    Abstract: Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information,…

    Submitted 25 March, 2023; originally announced March 2023.

  16. arXiv:2303.00146

    cs.HC cs.RO cs.SD eess.AS

    I Know Your Feelings Before You Do: Predicting Future Affective Reactions in Human-Computer Dialogue

    Authors: Yuanchao Li, Koji Inoue, Leimin Tian, Changzeng Fu, Carlos Ishi, Hiroshi Ishiguro, Tatsuya Kawahara, Catherine Lai

    Abstract: Current Spoken Dialogue Systems (SDSs) often serve as passive listeners that respond only after receiving user speech. To achieve human-like dialogue, we propose a novel future prediction architecture that allows an SDS to anticipate future affective reactions based on its current behaviors before the user speaks. In this work, we investigate two scenarios: speech and laughter. In speech, we propo…

    Submitted 17 March, 2023; v1 submitted 28 February, 2023; originally announced March 2023.

    Comments: Accepted to CHI2023 Late-Breaking Work

  17. arXiv:2211.08526

    cs.RO

    Alzheimer's Dementia Detection through Spontaneous Dialogue with Proactive Robotic Listeners

    Authors: Yuanchao Li, Catherine Lai, Divesh Lala, Koji Inoue, Tatsuya Kawahara

    Abstract: As the aging of society continues to accelerate, Alzheimer's Disease (AD) has received more and more attention from not only medical but also other fields, such as computer science, over the past decade. Since speech is considered one of the effective ways to diagnose cognitive decline, AD detection from speech has emerged as a hot topic. Nevertheless, such approaches fail to tackle several key is…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: Accepted for HRI2022 Late-Breaking Report

  18. arXiv:2209.04062

    cs.CL cs.SD eess.AS

    Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC) -based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they lose CTC's non-autoregressive nature because of the need for beam search, which slo…

    Submitted 8 September, 2022; originally announced September 2022.

    Comments: Accepted in Interspeech2022

  19. arXiv:2209.02030

    cs.CL cs.SD eess.AS

    Distilling the Knowledge of BERT for CTC-based ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC) -based models are attractive because of their fast inference in automatic speech recognition (ASR). Language model (LM) integration approaches such as shallow fusion and rescoring can improve the recognition accuracy of CTC-based ASR by taking advantage of the knowledge in text corpora. However, they significantly slow down the inference of CTC. In this…

    Submitted 5 September, 2022; originally announced September 2022.

  20. arXiv:2207.03169

    eess.AS cs.CL cs.SD

    End-to-end Speech-to-Punctuated-Text Recognition

    Authors: Jumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto

    Abstract: Conventional automatic speech recognition systems do not produce punctuation marks which are important for the readability of the speech recognition results. They are also needed for subsequent natural language processing tasks such as machine translation. There have been a lot of works on punctuation prediction models that insert punctuation marks into speech recognition results as post-processin…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: Accepted to INTERSPEECH2022

  21. arXiv:2110.01857

    cs.CL eess.AS

    ASR Rescoring and Confidence Estimation with ELECTRA

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: In automatic speech recognition (ASR) rescoring, the hypothesis with the fewest errors should be selected from the n-best list using a language model (LM). However, LMs are usually trained to maximize the likelihood of correct word sequences, not to detect ASR errors. We propose an ASR rescoring method for directly detecting errors with ELECTRA, which is originally a pre-training method for NLP ta…

    Submitted 5 October, 2021; originally announced October 2021.

    Comments: Accepted in ASRU2021

  22. arXiv:2109.04411

    eess.AS cs.CL cs.SD

    Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

    Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models. End-to-end speech translation models have several advantages over traditional cascade systems such as inference latency reduction. However, conventional AR decoding methods are not fast enough because each token is generated incrementally. NAR models, however, can accelera…

    Submitted 9 September, 2021; originally announced September 2021.

  23. arXiv:2107.07509

    eess.AS cs.CL cs.SD

    VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

    Authors: Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-…

    Submitted 15 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  24. arXiv:2107.00635

    eess.AS cs.CL cs.SD

    StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR

    Authors: Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: While attention-based encoder-decoder (AED) models have been successfully extended to the online variants for streaming automatic speech recognition (ASR), such as monotonic chunkwise attention (MoChA), the models still have a large label emission latency because of the unconstrained end-to-end training objective. Previous works tackled this problem by leveraging alignment information to control t…

    Submitted 15 July, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  25. arXiv:2106.02325

    cs.CL cs.HC

    ERICA: An Empathetic Android Companion for Covid-19 Quarantine

    Authors: Etsuko Ishii, Genta Indra Winata, Samuel Cahyawijaya, Divesh Lala, Tatsuya Kawahara, Pascale Fung

    Abstract: Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a control simulation experiment to assess the effects…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: Accepted in SIGDIAL 2021

  26. arXiv:2105.00403

    cs.CL cs.AI cs.RO

    Intelligent Conversational Android ERICA Applied to Attentive Listening and Job Interview

    Authors: Tatsuya Kawahara, Koji Inoue, Divesh Lala

    Abstract: Following the success of spoken dialogue systems (SDS) in smartphone assistants and smart speakers, a number of communicative robots are developed and commercialized. Compared with the conventional SDSs designed as a human-machine interface, interaction with robots is expected to be in a closer manner to talking to a human because of the anthropomorphism and physical presence. The goal or task of…

    Submitted 2 May, 2021; originally announced May 2021.

    Comments: 7 pages, 5 figures, 1 table

  27. arXiv:2104.06457

    cs.CL cs.SD eess.AS

    Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation

    Authors: Hirofumi Inaguma, Tatsuya Kawahara, Shinji Watanabe

    Abstract: A conventional approach to improving the performance of end-to-end speech translation (E2E-ST) models is to leverage the source transcription via pre-training and joint training with automatic speech recognition (ASR) and neural machine translation (NMT) tasks. However, since the input modalities are different, it is difficult to leverage source language text successfully. In this work, we focus o…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted at NAACL-HLT 2021 (short paper)

  28. arXiv:2103.00422

    eess.AS cs.CL cs.LG cs.SD

    Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

    Authors: Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the…

    Submitted 22 August, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

  29. arXiv:2010.13047

    cs.CL cs.SD eess.AS

    Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

    Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: Fast inference speed is an important goal towards real-world deployment of speech translation (ST) systems. End-to-end (E2E) models based on the encoder-decoder architecture are more suitable for this goal than traditional cascaded systems, but their effectiveness regarding decoding speed has not been explored so far. Inspired by recent progress in non-autoregressive (NAR) methods in text-based tr…

    Submitted 18 February, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted at IEEE ICASSP 2021

  30. arXiv:2009.07117

    cs.CL

    Multi-Referenced Training for Dialogue Response Generation

    Authors: Tianyu Zhao, Tatsuya Kawahara

    Abstract: In open-domain dialogue response generation, a dialogue context can be continued with diverse responses, and the dialogue models should capture such one-to-many relations. In this work, we first analyze the training objective of dialogue models from the view of Kullback-Leibler divergence (KLD) and show that the gap between the real world probability distribution and the single-referenced data's p…

    Submitted 18 October, 2020; v1 submitted 15 September, 2020; originally announced September 2020.

  31. arXiv:2008.03822

    cs.CL eess.AS

    Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Attention-based sequence-to-sequence (seq2seq) models have achieved promising results in automatic speech recognition (ASR). However, as these models decode in a left-to-right way, they do not have access to context on the right. We leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. In our proposed method, BERT generat…

    Submitted 9 August, 2020; originally announced August 2020.

    Comments: Accepted in INTERSPEECH2020

  32. arXiv:2005.09394

    eess.AS cs.CL cs.LG cs.SD

    Enhancing Monotonic Multihead Attention for Streaming ASR

    Authors: Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

    Abstract: We investigate a monotonic multihead attention (MMA) by extending hard monotonic attention to Transformer-based automatic speech recognition (ASR) for online streaming applications. For streaming inference, all monotonic attention (MA) heads should learn proper alignments because the next token is not generated until all heads detect the corresponding token boundaries. However, we found not all MA…

    Submitted 30 September, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  33. arXiv:2005.09256

    eess.AS cs.CL

    Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition

    Authors: Kohei Matsuura, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: It is important to transcribe and archive speech data of endangered languages for preserving heritages of verbal culture and automatic speech recognition (ASR) is a powerful tool to facilitate this process. However, since endangered languages do not generally have large corpora with many speakers, the performance of ASR models trained on them are considerably poor in general. Nevertheless, we are…

    Submitted 31 July, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted for Interspeech 2020

  34. arXiv:2005.04712

    cs.CL cs.LG

    CTC-synchronous Training for Monotonic Attention Model

    Authors: Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

    Abstract: Monotonic chunkwise attention (MoChA) has been studied for the online streaming automatic speech recognition (ASR) based on a sequence-to-sequence framework. In contrast to connectionist temporal classification (CTC), backward probabilities cannot be leveraged in the alignment marginalization process during training due to left-to-right dependency in the decoder. This results in the error propagat…

    Submitted 6 August, 2020; v1 submitted 10 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  35. arXiv:2004.11419

    cs.SD cs.CL eess.AS

    End-to-end speech-to-dialog-act recognition

    Authors: Viet-Trung Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: Spoken language understanding, which extracts intents and/or semantic concepts in utterances, is conventionally formulated as a post-processing of automatic speech recognition. It is usually trained with oracle transcripts, but needs to deal with errors by ASR. Moreover, there are acoustic features which are related with intents but not represented with the transcripts. In this paper, we present a…

    Submitted 28 July, 2020; v1 submitted 23 April, 2020; originally announced April 2020.

  36. arXiv:2004.04908

    cs.CL

    Designing Precise and Robust Dialogue Response Evaluators

    Authors: Tianyu Zhao, Divesh Lala, Tatsuya Kawahara

    Abstract: Automatic dialogue response evaluator has been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental…

    Submitted 24 April, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

    Comments: Accepted at ACL 2020

  37. arXiv:2002.06675

    cs.CL cs.SD eess.AS

    Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language

    Authors: Kohei Matsuura, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Ainu is an unwritten language that has been spoken by Ainu people who are one of the ethnic groups in Japan. It is recognized as critically endangered by UNESCO and archiving and documentation of its language heritage is of paramount importance. Although a considerable amount of voice recordings of Ainu folklore has been produced and accumulated to save their culture, only a quite limited parts of…

    Submitted 16 May, 2020; v1 submitted 16 February, 2020; originally announced February 2020.

    Comments: Accepted in LREC 2020

  38. arXiv:1910.00254

    cs.CL eess.AS

    Multilingual End-to-End Speech Translation

    Authors: Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: In this paper, we propose a simple yet effective framework for multilingual end-to-end speech translation (ST), in which speech utterances in source languages are directly translated to the desired target languages with a universal sequence-to-sequence architecture. While multilingual models have shown to be useful for automatic speech recognition (ASR) and machine translation (MT), this is the fi…

    Submitted 31 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

    Comments: Accepted to ASRU 2019

  39. arXiv:1909.09993

    cs.CL

    Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR

    Authors: Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Acoustic-to-word (A2W) end-to-end automatic speech recognition (ASR) systems have attracted attention because of an extremely simplified architecture and fast decoding. To alleviate data sparseness issues due to infrequent words, the combination with an acoustic-to-character (A2C) model is investigated. Moreover, the A2C model can be used to recover out-of-vocabulary (OOV) words that are not cover…

    Submitted 22 September, 2019; originally announced September 2019.

    Comments: SLT2018

  40. arXiv:1907.05599

    eess.AS cs.CL cs.LG cs.SD

    Effective Incorporation of Speaker Information in Utterance Encoding in Dialog

    Authors: Tianyu Zhao, Tatsuya Kawahara

    Abstract: In dialog studies, we often encode a dialog using a hierarchical encoder where each utterance is converted into an utterance vector, and then a sequence of utterance vectors is converted into a dialog vector. Since knowing who produced which utterance is essential to understanding a dialog, conventional methods tried integrating speaker labels into utterance vectors. We found the method problemati…

    Submitted 12 July, 2019; originally announced July 2019.

    Comments: 8+1 pages, 3 figures, and 5 tables. Rejected by SIGDIAL 2019

  41. arXiv:1905.13438

    cs.CL cs.AI

    Content Word-based Sentence Decoding and Evaluating for Open-domain Neural Response Generation

    Authors: Tianyu Zhao, Shinsuke Mori, Tatsuya Kawahara

    Abstract: Various encoder-decoder models have been applied to response generation in open-domain dialogs, but a majority of conventional models directly learn a mapping from lexical input to lexical output without explicitly modeling intermediate representations. Utilizing language hierarchy and modeling intermediate information have been shown to benefit many language understanding and generation tasks. Mo…

    Submitted 26 June, 2019; v1 submitted 31 May, 2019; originally announced May 2019.

    Comments: 13 pages, 2 figures, 8 tables (rejected by ACL 2019)

  42. arXiv:1903.09341

    cs.SD cs.LG eess.AS stat.ML

    Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition

    Authors: Kazuki Shimada, Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: This paper describes multichannel speech enhancement for improving automatic speech recognition (ASR) in noisy environments. Recently, the minimum variance distortionless response (MVDR) beamforming has widely been used because it works well if the steering vector of speech and the spatial covariance matrix (SCM) of noise are given. To estimating such spatial information, conventional studies take…

    Submitted 31 March, 2019; v1 submitted 21 March, 2019; originally announced March 2019.

  43. arXiv:1811.02134

    cs.CL

    Transfer learning of language-independent end-to-end ASR with language model fusion

    Authors: Hirofumi Inaguma, Jaejin Cho, Murali Karthick Baskar, Tatsuya Kawahara, Shinji Watanabe

    Abstract: This work explores better adaptation methods to low-resource languages using an external language model (LM) under the framework of transfer learning. We first build a language-independent ASR system in a unified sequence-to-sequence (S2S) architecture with a shared vocabulary among all languages. During adaptation, we perform LM fusion transfer, where an external LM is integrated into the decoder…

    Submitted 7 May, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

    Comments: Accepted at ICASSP2019

  44. arXiv:1710.11439

    cs.SD cs.LG eess.AS stat.ML

    Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization

    Authors: Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: This paper presents a statistical method of single-channel speech enhancement that uses a variational autoencoder (VAE) as a prior distribution on clean speech. A standard approach to speech enhancement is to train a deep neural network (DNN) to take noisy speech as input and output clean speech. Although this supervised approach requires a very large amount of pair data for training, it is not ro…

    Submitted 19 March, 2018; v1 submitted 31 October, 2017; originally announced October 2017.

    Comments: 5 pages, 3 figures, version that Eqs. (9), (19), and (20) in v2 (submitted to ICASSP 2018) are corrected. Samples available here: http://sap.ist.i.kyoto-u.ac.jp/members/yoshiaki/demo/vae-nmf/

  45. arXiv:1709.10257

    cs.HC cs.RO

    Detection of social signals for recognizing engagement in human-robot interaction

    Authors: Divesh Lala, Koji Inoue, Pierrick Milhorat, Tatsuya Kawahara

    Abstract: Detection of engagement during a conversation is an important function of human-robot interaction. The level of user engagement can influence the dialogue strategy of the robot. Our motivation in this work is to detect several behaviors which will be used as social signal inputs for a real-time engagement recognition model. These behaviors are nodding, laughter, verbal backchannels and eye gaze. W…

    Submitted 29 September, 2017; originally announced September 2017.

    Comments: AAAI Fall Symposium on Natural Communication for Human-Robot Collaboration, 2017