Showing 1–23 of 23 results for author: Kashino, K

Searching in archive cs.
  1. arXiv:2404.17107  [pdf, other]

    eess.AS cs.SD

    Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: To reduce the need for skilled clinicians in heart sound interpretation, recent studies on automating cardiac auscultation have explored deep learning approaches. However, despite the demands for large data for deep learning, the size of the heart sound datasets is limited, and no pre-trained model is available. On the contrary, many pre-trained models for general audio tasks are available as gene…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: 4 pages, 1 figure, and 4 tables. Accepted by IEEE EMBC 2024

    MSC Class: 68T07

  2. arXiv:2404.06095  [pdf, other]

    eess.AS cs.SD

    Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL, which learns by predicting representations of masked input signals that serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encoura…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: 15 pages, 6 figures, 15 tables. Accepted by TASLP

    MSC Class: 68T07

  3. arXiv:2309.06720  [pdf, other]

    cs.CV

    Deep Attentive Time Warping

    Authors: Shinnosuke Matsuo, Xiaomeng Wu, Gantugs Atarsaikhan, Akisato Kimura, Kunio Kashino, Brian Kenji Iwana, Seiichi Uchida

    Abstract: Similarity measures for time series are important problems for time series classification. To handle the nonlinear time distortions, Dynamic Time Warping (DTW) has been widely used. However, DTW is not learnable and suffers from a trade-off between robustness against time distortion and discriminative power. In this paper, we propose a neural network model for task-adaptive time warping. Specifica…

    Submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted at Pattern Recognition

  4. arXiv:2308.11923  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

    Authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

    Abstract: We proposed Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC solves the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the difference in content. We also propose a cross-attentio…

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: Accepted to DCASE2023 Workshop

  5. arXiv:2305.14079  [pdf, other]

    eess.AS cs.SD

    Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Self-supervised learning general-purpose audio representations have demonstrated high performance in a variety of tasks. Although they can be optimized for application by fine-tuning, even higher performance can be expected if they can be specialized to pre-train for an application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specifi…

    Submitted 3 August, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023; 5+2 pages, 2 figures, 6+6 tables, Code: https://github.com/nttcslab/m2d/tree/master/speech

    MSC Class: 68T07

  6. arXiv:2210.14648  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Masked Autoencoders is a simple yet powerful self-supervised learning method. However, it learns representations indirectly by reconstructing masked input patches. Several methods learn representations directly by predicting representations of masked patches; however, we think using all patches to encode training signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M…

    Submitted 2 March, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: 6 pages, 3 figures, and 6 tables. To appear at ICASSP2023

    MSC Class: 68T07

  7. arXiv:2209.06406  [pdf, other]

    cs.CV

    Reflectance-Oriented Probabilistic Equalization for Image Enhancement

    Authors: Xiaomeng Wu, Yongqing Sun, Akisato Kimura, Kunio Kashino

    Abstract: Despite recent advances in image enhancement, it remains difficult for existing approaches to adaptively improve the brightness and contrast for both low-light and normal-light images. To solve this problem, we propose a novel 2D histogram equalization approach. It assumes intensity occurrence and co-occurrence to be dependent on each other and derives the distribution of intensity occurrence (1D…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: Published in ICASSP 2021. For GitHub code, see https://github.com/nttcslab/rope

  8. arXiv:2209.06405  [pdf, other]

    cs.CV

    Reflectance-Guided, Contrast-Accumulated Histogram Equalization

    Authors: Xiaomeng Wu, Takahito Kawanishi, Kunio Kashino

    Abstract: Existing image enhancement methods fall short of expectations because with them it is difficult to improve global and local image contrast simultaneously. To address this problem, we propose a histogram equalization-based method that adapts to the data-dependent requirements of brightness enhancement and improves the visibility of details without losing the global contrast. This method incorporate…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: Published in ICASSP 2020. For GitHub code, see https://github.com/nttcslab/rg-cache

  9. arXiv:2207.11964  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    ConceptBeam: Concept Driven Target Speech Extraction

    Authors: Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino

    Abstract: We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches have been exploiting properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we…

    Submitted 25 July, 2022; originally announced July 2022.

    Comments: Accepted to ACM Multimedia 2022

  10. arXiv:2207.09732  [pdf, other]

    eess.AS cs.CL cs.IR cs.LG cs.SD

    Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval

    Authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

    Abstract: The amount of audio data available on public websites is growing rapidly, and an efficient mechanism for accessing the desired data is necessary. We propose a content-based audio retrieval method that can retrieve a target audio that is similar to but slightly different from the query audio by introducing auxiliary textual information which describes the difference between the query and target aud…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted to Interspeech 2022

  11. arXiv:2205.08138  [pdf, ps, other]

    eess.AS cs.SD

    Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Many application studies rely on audio DNN models pre-trained on a large-scale dataset as essential feature extractors, and they extract features from the last layers. In this study, we focus on our finding that the middle layer features of existing supervised pre-trained models are more effective than the late layer features for some tasks. We propose a simple approach to compose features effecti…

    Submitted 17 May, 2022; originally announced May 2022.

    Comments: 5 pages, 4 figures and 4 tables. Accepted by EUSIPCO 2022

    MSC Class: 68T07

  12. arXiv:2204.12260  [pdf, other]

    eess.AS cs.SD

    Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example, typical audio contrastive learning uses temporal relationships among input sounds to create training signals, whereas some methods use a difference among input views…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: 22 pages, 8 figures. Under the review process

    MSC Class: 68T07

    Journal ref: HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition) PMLR 166 (2022) 1-24

  13. BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations.…

    Submitted 16 June, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: 15 pages, 6 figures, and 15 tables. Under the review process

    MSC Class: 68T07

    Journal ref: IEEE/ACM Trans. Audio, Speech, Language Process. 31 (2023) 137-151

  14. arXiv:2103.15074  [pdf, other]

    cs.CV

    Attention to Warp: Deep Metric Learning for Multivariate Time Series

    Authors: Shinnosuke Matsuo, Xiaomeng Wu, Gantugs Atarsaikhan, Akisato Kimura, Kunio Kashino, Brian Kenji Iwana, Seiichi Uchida

    Abstract: Deep time series metric learning is challenging due to the difficult trade-off between temporal invariance to nonlinear distortion and discriminative power in identifying non-matching sequences. This paper proposes a novel neural network-based approach for robust yet discriminative time series classification and verification. This approach adapts a parameterized attention model to time warping for…

    Submitted 21 June, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

    Comments: Accepted at ICDAR2021

  15. arXiv:2103.06695  [pdf, other]

    eess.AS cs.LG cs.SD

    BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representation from a single audio segment without expecting relationships between different time segments of audio samples. To implement this principle…

    Submitted 20 April, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

    Comments: IJCNN 2021, 8 pages, 4 figures

    MSC Class: 68T07

  16. arXiv:2009.11436  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning

    Authors: Daiki Takeuchi, Yuma Koizumi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which of the individual elements most fully contributed to its performance has not ye…

    Submitted 23 September, 2020; originally announced September 2020.

    Comments: Accepted to DCASE2020 Workshop

  17. arXiv:2007.00225  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation

    Authors: Yuma Koizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: This technical report describes the system participating to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning. Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy. We simultaneously solve the main caption generation and sub i…

    Submitted 1 July, 2020; originally announced July 2020.

    Comments: Technical Report of DCASE2020 Challenge Task 6

  18. arXiv:1805.10603  [pdf, other]

    cs.CV

    Generative Adversarial Image Synthesis with Decision Tree Latent Controller

    Authors: Takuhiro Kaneko, Kaoru Hiramatsu, Kunio Kashino

    Abstract: This paper proposes the decision tree latent controller generative adversarial network (DTLC-GAN), an extension of a GAN that can learn hierarchically interpretable representations without relying on detailed supervision. To impose a hierarchical inclusion structure on latent variables, we incorporate a new architecture called the DTLC into the generator input. The DTLC has a multiple-layer tree s…

    Submitted 27 May, 2018; originally announced May 2018.

    Comments: CVPR 2018. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/dtlc-gan/

  19. arXiv:1805.07137  [pdf, ps, other]

    stat.ML cs.LG

    Knowledge Discovery from Layered Neural Networks based on Non-negative Task Decomposition

    Authors: Chihiro Watanabe, Kaoru Hiramatsu, Kunio Kashino

    Abstract: Interpretability has become an important issue in the machine learning field, along with the success of layered neural networks in various practical tasks. Since a trained layered neural network consists of a complex nonlinear relationship between large number of parameters, we failed to understand how they could achieve input-output mappings with a given data set. In this paper, we propose the no…

    Submitted 20 May, 2018; v1 submitted 18 May, 2018; originally announced May 2018.

  20. arXiv:1804.04778  [pdf, ps, other]

    stat.ML cs.LG

    Understanding Community Structure in Layered Neural Networks

    Authors: Chihiro Watanabe, Kaoru Hiramatsu, Kunio Kashino

    Abstract: A layered neural network is now one of the most common choices for the prediction of high-dimensional practical data sets, where the relationship between input and output data is complex and cannot be represented well by simple conventional models. Its effectiveness is shown in various tasks, however, the lack of interpretability of the trained result by a layered neural network has limited its ap…

    Submitted 12 April, 2018; originally announced April 2018.

  21. arXiv:1703.00168  [pdf, ps, other]

    stat.ML cs.LG

    Modular Representation of Layered Neural Networks

    Authors: Chihiro Watanabe, Kaoru Hiramatsu, Kunio Kashino

    Abstract: Layered neural networks have greatly improved the performance of various applications including image processing, speech recognition, natural language processing, and bioinformatics. However, it is still difficult to discover or interpret knowledge from the inference provided by a layered neural network, since its internal representation has many nonlinear and complex parameters embedded in hierar…

    Submitted 4 October, 2017; v1 submitted 1 March, 2017; originally announced March 2017.

  22. arXiv:1004.0085  [pdf, other]

    cs.CV cs.MM cs.NE stat.ML

    A stochastic model of human visual attention with a dynamic Bayesian network

    Authors: Akisato Kimura, Derek Pang, Tatsuto Takeuchi, Kouji Miyazato, Junji Yamato, Kunio Kashino

    Abstract: Recent studies in the field of human vision science suggest that the human responses to the stimuli on a visual display are non-deterministic. People may attend to different locations on the same visual input at the same time. Based on this knowledge, we propose a new stochastic model of visual attention by introducing a dynamic Bayesian network to predict the likelihood of where humans typically…

    Submitted 1 April, 2010; originally announced April 2010.

    Comments: 24 pages, single-column, 13 figures excluding portraits, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.

    MSC Class: 68U10

    ACM Class: I.4.8; I.4.10; I.5.1; I.6.8; I.2.10; I.4.4; I.2.9; I.3.1

  23. A quick search method for audio signals based on a piecewise linear representation of feature trajectories

    Authors: Akisato Kimura, Kunio Kashino, Takayuki Kurozumi, Hiroshi Murase

    Abstract: This paper presents a new method for a quick similarity-based search through long unlabeled audio streams to detect and locate audio clips provided by users. The method involves feature-dimension reduction based on a piecewise linear representation of a sequential feature trajectory extracted from a long audio stream. Two techniques enable us to obtain a piecewise linear representation: the dyna…

    Submitted 22 October, 2007; originally announced October 2007.

    Comments: 20 pages, to appear in IEEE Transactions on Audio, Speech and Language Processing

    Journal ref: IEEE Transactions on Audio, Speech and Language Processing, Vol.16, No.2, pp.396-407, February 2008.