Showing 1–17 of 17 results for author: Mimura, M

  1. arXiv:2408.00205

    cs.CL eess.AS

    Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

    Abstract: This paper introduces a novel approach called sentence-wise speech summarization (Sen-SSum), which generates text summaries from a spoken document in a sentence-by-sentence manner. Sen-SSum combines the real-time processing of automatic speech recognition (ASR) with the conciseness of speech summarization. To explore this approach, we present two datasets for Sen-SSum: Mega-SSum and CSJ-SSum. Usin…

    Submitted 31 July, 2024; originally announced August 2024.

    Comments: Accepted to Interspeech2024. Dataset: https://huggingface.co/datasets/komats/mega-ssum

  2. arXiv:2407.01857

    eess.AS cs.SD eess.SP

    SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

    Authors: Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix

    Abstract: Real-time target speaker extraction (TSE) is intended to extract the desired speaker's voice from the observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging as the computational complexity must be reduced to provide real-time operation. This work introduces to Conv-TasNet-based TSE a new architecture based on state space modeling (SSM) that has been…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted to Interspeech 2024

  3. arXiv:2303.14593

    cs.SD eess.AS

    Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

    Authors: Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang, Tatsuya Kawahara

    Abstract: Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information,…

    Submitted 25 March, 2023; originally announced March 2023.

  4. arXiv:2211.13110

    cs.LG cs.CR

    Compiler Provenance Recovery for Multi-CPU Architectures Using a Centrifuge Mechanism

    Authors: Yuhei Otsubo, Akira Otsuka, Mamoru Mimura

    Abstract: Bit-stream recognition (BSR) has many applications, such as forensic investigations, detection of copyright infringement, and malware analysis. We propose the first BSR that takes a bare input bit-stream and outputs a class label without any preprocessing. To achieve our goal, we propose a centrifuge mechanism, where the upstream layers (sub-net) capture global features and tell the downstream lay…

    Submitted 23 November, 2022; v1 submitted 21 November, 2022; originally announced November 2022.

    Comments: 8 pages, 4 figures, 5 tables

  5. arXiv:2209.04062

    cs.CL cs.SD eess.AS

    Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC)-based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they lose CTC's non-autoregressive nature because of the need for beam search, which slo…

    Submitted 8 September, 2022; originally announced September 2022.

    Comments: Accepted in Interspeech2022

  6. arXiv:2209.02030

    cs.CL cs.SD eess.AS

    Distilling the Knowledge of BERT for CTC-based ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC)-based models are attractive because of their fast inference in automatic speech recognition (ASR). Language model (LM) integration approaches such as shallow fusion and rescoring can improve the recognition accuracy of CTC-based ASR by taking advantage of the knowledge in text corpora. However, they significantly slow down the inference of CTC. In this…

    Submitted 5 September, 2022; originally announced September 2022.

  7. arXiv:2110.01857

    cs.CL eess.AS

    ASR Rescoring and Confidence Estimation with ELECTRA

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: In automatic speech recognition (ASR) rescoring, the hypothesis with the fewest errors should be selected from the n-best list using a language model (LM). However, LMs are usually trained to maximize the likelihood of correct word sequences, not to detect ASR errors. We propose an ASR rescoring method for directly detecting errors with ELECTRA, which is originally a pre-training method for NLP ta…

    Submitted 5 October, 2021; originally announced October 2021.

    Comments: Accepted in ASRU2021

  8. On the spectrum and linear programming bound for hypergraphs

    Authors: Sebastian M. Cioabă, Jack H. Koolen, Masato Mimura, Hiroshi Nozaki, Takayuki Okuda

    Abstract: The spectrum of a graph is closely related to many graph parameters. In particular, the spectral gap of a regular graph, which is the difference between its valency and second eigenvalue, is widely seen as an algebraic measure of connectivity and plays a key role in the theory of expander graphs. In this paper, we extend previous work done for graphs and bipartite graphs and present a linear programmi…

    Submitted 5 April, 2022; v1 submitted 7 September, 2020; originally announced September 2020.

    Comments: references updated, fixed some typos, added explanation describing the differences between the graphs and hypergraphs results, 27 pages, 3 tables, European Journal of Combinatorics, accepted for publication

    Journal ref: European Journal of Combinatorics, 104 (2022), 103535

  9. arXiv:2008.03822

    cs.CL eess.AS

    Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Attention-based sequence-to-sequence (seq2seq) models have achieved promising results in automatic speech recognition (ASR). However, as these models decode in a left-to-right way, they do not have access to context on the right. We leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. In our proposed method, BERT generat…

    Submitted 9 August, 2020; originally announced August 2020.

    Comments: Accepted in INTERSPEECH2020

  10. arXiv:2005.09394

    eess.AS cs.CL cs.LG cs.SD

    Enhancing Monotonic Multihead Attention for Streaming ASR

    Authors: Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

    Abstract: We investigate a monotonic multihead attention (MMA) by extending hard monotonic attention to Transformer-based automatic speech recognition (ASR) for online streaming applications. For streaming inference, all monotonic attention (MA) heads should learn proper alignments because the next token is not generated until all heads detect the corresponding token boundaries. However, we found not all MA…

    Submitted 30 September, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  11. arXiv:2005.09256

    eess.AS cs.CL

    Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition

    Authors: Kohei Matsuura, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: It is important to transcribe and archive speech data of endangered languages for preserving heritages of verbal culture, and automatic speech recognition (ASR) is a powerful tool to facilitate this process. However, since endangered languages do not generally have large corpora with many speakers, the performance of ASR models trained on them is considerably poor in general. Nevertheless, we are…

    Submitted 31 July, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted for Interspeech 2020

  12. arXiv:2005.04712

    cs.CL cs.LG

    CTC-synchronous Training for Monotonic Attention Model

    Authors: Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

    Abstract: Monotonic chunkwise attention (MoChA) has been studied for the online streaming automatic speech recognition (ASR) based on a sequence-to-sequence framework. In contrast to connectionist temporal classification (CTC), backward probabilities cannot be leveraged in the alignment marginalization process during training due to left-to-right dependency in the decoder. This results in the error propagat…

    Submitted 6 August, 2020; v1 submitted 10 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  13. arXiv:2002.06675

    cs.CL cs.SD eess.AS

    Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language

    Authors: Kohei Matsuura, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Ainu is an unwritten language that has been spoken by the Ainu people, who are one of the ethnic groups in Japan. It is recognized as critically endangered by UNESCO, and archiving and documentation of its language heritage is of paramount importance. Although a considerable amount of voice recordings of Ainu folklore has been produced and accumulated to save their culture, only quite limited parts of…

    Submitted 16 May, 2020; v1 submitted 16 February, 2020; originally announced February 2020.

    Comments: Accepted in LREC 2020

  14. arXiv:1909.09993

    cs.CL

    Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR

    Authors: Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Acoustic-to-word (A2W) end-to-end automatic speech recognition (ASR) systems have attracted attention because of an extremely simplified architecture and fast decoding. To alleviate data sparseness issues due to infrequent words, the combination with an acoustic-to-character (A2C) model is investigated. Moreover, the A2C model can be used to recover out-of-vocabulary (OOV) words that are not cover…

    Submitted 22 September, 2019; originally announced September 2019.

    Comments: SLT2018

  15. arXiv:1903.09341

    cs.SD cs.LG eess.AS stat.ML

    Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition

    Authors: Kazuki Shimada, Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: This paper describes multichannel speech enhancement for improving automatic speech recognition (ASR) in noisy environments. Recently, the minimum variance distortionless response (MVDR) beamforming has been widely used because it works well if the steering vector of speech and the spatial covariance matrix (SCM) of noise are given. To estimate such spatial information, conventional studies take…

    Submitted 31 March, 2019; v1 submitted 21 March, 2019; originally announced March 2019.

  16. arXiv:1806.05328

    cs.CR

    o-glasses: Visualizing x86 Code from Binary Using a 1d-CNN

    Authors: Yuhei Otsubo, Akira Otsuka, Mamoru Mimura, Takeshi Sakaki, Atsuhiro Goto

    Abstract: Malicious document files used in targeted attacks often contain a small program called shellcode. It is often hard to prepare a runnable environment for dynamic analysis of these document files because they exploit specific vulnerabilities. In these cases, it is necessary to identify the position of the shellcode in each document file to analyze it. If the exploit code uses executable scripts such…

    Submitted 13 June, 2018; originally announced June 2018.

    Comments: 21 pages, 15 figures

    MSC Class: 94A99

  17. arXiv:1710.11439

    cs.SD cs.LG eess.AS stat.ML

    Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization

    Authors: Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: This paper presents a statistical method of single-channel speech enhancement that uses a variational autoencoder (VAE) as a prior distribution on clean speech. A standard approach to speech enhancement is to train a deep neural network (DNN) to take noisy speech as input and output clean speech. Although this supervised approach requires a very large amount of pair data for training, it is not ro…

    Submitted 19 March, 2018; v1 submitted 31 October, 2017; originally announced October 2017.

    Comments: 5 pages, 3 figures, version that Eqs. (9), (19), and (20) in v2 (submitted to ICASSP 2018) are corrected. Samples available here: http://sap.ist.i.kyoto-u.ac.jp/members/yoshiaki/demo/vae-nmf/