Search | arXiv e-print repository

Predicting Generalization of AI Colonoscopy Models to Unseen Data

Authors: Joel Shor, Carson McNeil, Yotam Intrator, Joseph R Ledsam, Hiro-o Yamano, Daisuke Tsurumaru, Hiroki Kayama, Atsushi Hamabe, Koji Ando, Mitsuhiko Ota, Haruei Ogino, Hiroshi Nakase, Kaho Kobayashi, Masaaki Miyo, Eiji Oki, Ichiro Takemasa, Ehud Rivlin, Roman Goldenberg

Abstract: $\textbf{Background}$: Generalizability of AI colonoscopy algorithms is important for wider adoption in clinical practice. However, current techniques for evaluating performance on unseen data require expensive and time-intensive labels. $\textbf{Methods}$: We use a "Masked Siamese Network" (MSN) to identify novel phenomena in unseen data and predict polyp detector performance. MSN is trained to predict masked out regions of polyp images, without any labels. We test MSN's ability to be trained on data only from Israel and detect unseen techniques, narrow-band imaging (NBI) and chromoendoscopy (CE), on colonoscopes from Japan (354 videos, 128 hours). We also test MSN's ability to predict performance of Computer Aided Detection (CADe) of polyps on colonoscopies from both countries, even though MSN is not trained on data from Japan. $\textbf{Results}$: MSN correctly identifies NBI and CE as less similar to Israel whitelight than Japan whitelight (bootstrapped z-test, |z| > 496, p < 10^-8 for both) using the label-free Frechet distance. MSN detects NBI with 99% accuracy, predicts CE better than our heuristic (90% vs 79% accuracy) despite being trained only on whitelight, and is the only method that is robust to noisy labels. MSN predicts CADe polyp detector performance on in-domain Israel and out-of-domain Japan colonoscopies (r=0.79, 0.37 respectively). With few examples of Japan detector performance to train on, MSN prediction of Japan performance improves (r=0.56). $\textbf{Conclusion}$: Our technique can identify distribution shifts in clinical data and can predict CADe detector performance on unseen data, without labels.
Our self-supervised approach can aid in detecting when data in practice is different from training, such as between hospitals or when data has meaningfully shifted from training. MSN has potential for application to medical image domains beyond colonoscopy.

Submitted 22 March, 2024; v1 submitted 14 March, 2024; originally announced March 2024.
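
The abstract above compares datasets with a label-free Frechet distance. The listing does not say how the authors compute it, so the following is only a minimal sketch under stated assumptions: each dataset is a list of embedding vectors, a Gaussian with diagonal covariance is fitted to each set, and the squared Frechet distance between the two Gaussians is returned. The function names `mean_and_var` and `frechet_distance_diag` are hypothetical, and the diagonal-covariance simplification is an assumption made to keep the example dependency-free.

```python
import math


def mean_and_var(vectors):
    """Per-dimension mean and (population) variance of a list of vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    variances = [
        sum((v[d] - means[d]) ** 2 for v in vectors) / n for d in range(dim)
    ]
    return means, variances


def frechet_distance_diag(set_a, set_b):
    """Squared Frechet distance between diagonal Gaussians fitted to two sets.

    Assumption: covariances are diagonal, so the trace term
    Tr(S1 + S2 - 2 (S1 S2)^(1/2)) reduces to sum((sigma1 - sigma2)^2).
    """
    mu_a, var_a = mean_and_var(set_a)
    mu_b, var_b = mean_and_var(set_b)
    mean_term = sum((x - y) ** 2 for x, y in zip(mu_a, mu_b))
    cov_term = sum(
        (math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(var_a, var_b)
    )
    return mean_term + cov_term


# Toy 2-D "embeddings": the second set is the first shifted by 3 along axis 0.
reference = [[0.0, 0.0], [2.0, 0.0]]
shifted = [[3.0, 0.0], [5.0, 0.0]]
print(frechet_distance_diag(reference, shifted))
```

No labels are needed for this comparison, only the embeddings themselves. A full-covariance version would need a matrix square root, which this sketch deliberately avoids.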

arXiv:2312.09529 [pdf, other]

Can Physician Judgment Enhance Model Trustworthiness? A Case Study on Predicting Pathological Lymph Nodes in Rectal Cancer

Authors: Kazuma Kobayashi, Yasuyuki Takamizawa, Mototaka Miyake, Sono Ito, Lin Gu, Tatsuya Nakatsuka, Yu Akagi, Tatsuya Harada, Yukihide Kanemitsu, Ryuji Hamamoto

Abstract: Explainability is key to enhancing artificial intelligence's trustworthiness in medicine. However, several issues remain concerning the actual benefit of explainable models for clinical decision-making. Firstly, there is a lack of consensus on an evaluation framework for quantitatively assessing the practical benefits that effective explainability should provide to practitioners. Secondly, physician-centered evaluations of explainability are limited. Thirdly, the utility of built-in attention mechanisms in transformer-based models as an explainability technique is unclear. We hypothesize that superior attention maps should align with the information that physicians focus on, potentially reducing prediction uncertainty and increasing model reliability. We employed a multimodal transformer to predict lymph node metastasis in rectal cancer using clinical data and magnetic resonance imaging, exploring how well attention maps, visualized through a state-of-the-art technique, can achieve agreement with physician understanding. We estimated the model's uncertainty using meta-level information like prediction probability variance and quantified agreement. Our assessment of whether this agreement reduces uncertainty found no significant effect. In conclusion, this case study did not confirm the anticipated benefit of attention maps in enhancing model reliability.
Superficial explanations could do more harm than good by misleading physicians into relying on uncertain predictions, suggesting that the current state of attention mechanisms in explainability should not be overestimated. Identifying explainability mechanisms truly beneficial for clinical decision-making remains essential.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2309.09627 [pdf, other]

Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders

Authors: Lester Phillip Violeta, Wen-Chin Huang, Ding Ma, Ryuichi Yamamoto, Kazuhiro Kobayashi, Tomoki Toda

Abstract: We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech in the same latent space, while still being able to extract accurate linguistic information, creating a unified representation to reduce the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework for reducing the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that compared to the conventional framework using mel-spectrogram input and output features, using the proposed framework enables the model to synthesize more intelligible and natural-sounding speech, as shown by a significant 16% improvement in character error rate and 0.83 improvement in naturalness score.

Submitted 20 January, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: Accepted to ICASSP 2024. Demo page: lesterphillip.github.io/icassp2024_el_sie
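
The abstract above reports intelligibility as character error rate (CER). The listing does not describe the evaluation pipeline, so this is only a sketch of the commonly used definition: the Levenshtein edit distance between a reference transcript and a recognized transcript, divided by the reference length. The function names and the toy strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (insertions, deletions, substitutions all cost 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of four reference characters.
print(character_error_rate("abcd", "abxd"))
```

In practice the hypothesis would come from a speech recognizer run on the converted speech; that step is outside the scope of this sketch.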

arXiv:2309.07598 [pdf, other]

AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion

Authors: Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

Abstract: Non-autoregressive (non-AR) sequence-to-sequence (seq2seq) models for voice conversion (VC) are attractive for their ability to effectively model the temporal structure while enjoying boosted intelligibility and fast inference thanks to non-AR modeling. However, the dependency of current non-AR seq2seq VC models on ground truth durations extracted from an external AR model greatly limits their generalization ability to smaller training datasets. In this paper, we first demonstrate the above-mentioned problem by varying the training data size. Then, we present AAS-VC, a non-AR seq2seq VC model based on automatic alignment search (AAS), which removes the dependency on external durations and serves as a proper inductive bias to provide the required generalization ability for small datasets. Experimental results show that AAS-VC can generalize better to a training dataset of only 5 minutes. We also conducted ablation studies to justify several model design choices. The audio samples and implementation are available online.

Submitted 15 September, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024. Demo: https://unilight.github.io/Publication-Demos/publications/aas-vc/index.html. Code: https://github.com/unilight/seq2seq-vc

arXiv:2309.03331 [pdf, other]

Expert Uncertainty and Severity Aware Chest X-Ray Classification by Multi-Relationship Graph Learning

Authors: Mengliang Zhang, Xinyue Hu, Lin Gu, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M. Summers, Yingying Zhu

Abstract: Patients undergoing chest X-rays (CXR) often endure multiple lung diseases. When evaluating a patient's condition, due to the complex pathologies, subtle texture changes of different lung lesions in images, and patient condition differences, radiologists may remain uncertain even when they have experienced long-term clinical training and professional guidance, which introduces considerable noise when extracting disease labels from CXR reports. In this paper, we re-extract disease labels from CXR reports to make them more realistic by considering disease severity and uncertainty in classification. Our contributions are as follows: 1. We re-extracted the disease labels with severity and uncertainty by a rule-based approach with keywords discussed with clinical experts. 2. To further improve the explainability of chest X-ray diagnosis, we designed a multi-relationship graph learning method with an expert uncertainty-aware loss function. 3. Our multi-relationship graph learning method can also interpret the disease classification results. Our experimental results show that models considering disease severity and uncertainty outperform previous state-of-the-art methods.

Submitted 6 September, 2023; originally announced September 2023.

arXiv:2210.10314 [pdf, other]

Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion

Authors: Ding Ma, Lester Phillip Violeta, Kazuhiro Kobayashi, Tomoki Toda

Abstract: Sequence-to-sequence (seq2seq) voice conversion (VC) models have greater potential in converting electrolaryngeal (EL) speech to normal speech (EL2SP) compared to conventional VC models. However, EL2SP based on seq2seq VC requires a sufficiently large amount of parallel data for the model training and it suffers from significant performance degradation when the amount of training data is insufficient. To address this issue, we suggest a novel, two-stage strategy to optimize the performance on EL2SP based on seq2seq VC when a small amount of the parallel dataset is available. In contrast to utilizing high-quality data augmentations in previous studies, we first combine a large amount of imperfect synthetic parallel data of EL and normal speech, with the original dataset into VC training. Then, a second stage training is conducted with the original parallel dataset only. The results show that the proposed method progressively improves the performance of EL2SP based on seq2seq VC.

Submitted 19 October, 2022; originally announced October 2022.

Comments: Accepted to SLT 2022

arXiv:2210.09055 [pdf, other]

Data-driven multi-scale modeling and robust optimization of composite structure with uncertainty quantification

Authors: Kazuma Kobayashi, Shoaib Usman, Carlos Castano, Dinesh Kumar, Syed Alam

Abstract: Accurately modeling materials' properties at lower length scales (micro-level) while translating the effects to the component and/or system level (macro-level) can significantly reduce the amount of experimentation required to develop new technologies. Robustness analysis of fuel and structural performance for harsh environments (such as power uprated reactor systems or aerospace applications) using machine learning-based multi-scale modeling and robust optimization under uncertainties is required. The fiber and matrix material characteristics are potential sources of uncertainty at the microscale. The stacking sequence (angles of stacking and thickness of layers) of composite layers causes meso-scale uncertainties. It is also possible for macro-scale uncertainties to arise from system properties, like the load or the initial conditions. This chapter demonstrates advanced data-driven methods and outlines the specific capability that must be developed/added for the multi-scale modeling of advanced composite materials. This chapter proposes a multi-scale modeling method for composite structures based on a finite element method (FEM) simulation driven by surrogate models/emulators based on microstructurally informed meso-scale materials models to study the impact of operational parameters/uncertainties using machine learning approaches. To ensure optimal composite materials, composite properties are optimized with respect to initial materials volume fraction using data-driven numerical algorithms.

Submitted 4 November, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

Journal ref: Handbook of Smart Energy Systems, 2022

arXiv:2209.12146 [pdf]

doi 10.1007/978-3-030-72322-4_149-1

Machine Learning and Artificial Intelligence-Driven Multi-Scale Modeling for High Burnup Accident-Tolerant Fuels for Light Water-Based SMR Applications

Authors: Md. Shamim Hassan, Abid Hossain Khan, Richa Verma, Dinesh Kumar, Kazuma Kobayashi, Shoaib Usman, Syed Alam

Abstract: The concept of the small modular reactor has changed the outlook for tackling future energy crises. This new reactor technology is very promising considering its lower investment requirements, modularity, design simplicity, and enhanced safety features. The application of artificial intelligence-driven multi-scale modeling (neutronics, thermal hydraulics, fuel performance, etc.) incorporating Digital Twin and associated uncertainties in the research of small modular reactors is a recent concept. In this work, a comprehensive study is conducted on the multiscale modeling of accident-tolerant fuels. The application of these fuels in the light water-based small modular reactors is explored. This chapter also focuses on the application of machine learning and artificial intelligence in the design optimization, control, and monitoring of small modular reactors. Finally, a brief assessment of the research gap on the application of artificial intelligence to the development of high burnup composite accident-tolerant fuels is provided. Necessary actions to fill these gaps are also discussed.

Submitted 25 September, 2022; originally announced September 2022.

Journal ref: Handbook of Smart Energy Systems, 2022

arXiv:2106.01415 [pdf, other]

A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion

Authors: Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda

Abstract: We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC). The poor quality of dysarthric speech can be greatly improved by statistical VC, but as the normal speech utterances of a dysarthria patient are nearly impossible to collect, previous work failed to recover the individuality of the patient. In light of this, we suggest a novel, two-stage approach for DVC, which is highly flexible in that no normal speech of the patient is required. First, a powerful parallel sequence-to-sequence model converts the input dysarthric speech into a normal speech of a reference speaker as an intermediate product, and a nonparallel, frame-wise VC model realized with a variational autoencoder then converts the speaker identity of the reference speech back to that of the patient while assumed to be capable of preserving the enhanced quality. We investigate several design options. Experimental evaluation results demonstrate the potential of our approach to improving the quality of the dysarthric speech while maintaining the speaker identity.

Submitted 2 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021. 5 pages, 3 figures, 1 table

arXiv:2104.06793 [pdf, other]

Non-autoregressive sequence-to-sequence voice conversion

Authors: Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

Abstract: This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models. Inspired by the great success of NAR-S2S models such as FastSpeech in text-to-speech (TTS), we extend the FastSpeech2 model for the VC problem. We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local and global context information from the input sequence. Furthermore, we extend variance predictors to variance converters to explicitly convert the source speaker's prosody components such as pitch and energy into the target speaker. The experimental evaluation with the Japanese speaker dataset, which consists of male and female speakers of 1,000 utterances, demonstrates that the proposed model enables us to perform more stable, faster, and better conversion than autoregressive S2S (AR-S2S) models such as Tacotron2 and Transformer.

Submitted 14 April, 2021; originally announced April 2021.

Comments: Accepted to ICASSP2021. Demo HP: https://kan-bayashi.github.io/NonARSeq2SeqVC/

arXiv:2103.02858 [pdf, ps, other]

crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder

Authors: Kazuhiro Kobayashi, Wen-Chin Huang, Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Tomoki Toda

Abstract: In this paper, we present an open-source software for developing a nonparallel voice conversion (VC) system named crank. Although we have released an open-source VC software based on the Gaussian mixture model named sprocket in the last VC Challenge, it is not straightforward to apply any speech corpus because it is necessary to prepare parallel utterances of source and target speakers to model a statistical conversion function. To address this issue, in this study, we developed a new open-source VC software that enables users to model the conversion function by using only a nonparallel speech corpus. For implementing the VC software, we used a vector-quantized variational autoencoder (VQVAE). To rapidly examine the effectiveness of recent technologies developed in this research field, crank also supports several representative works for autoencoder-based VC methods such as the use of hierarchical architectures, cyclic architectures, generative adversarial networks, speaker adversarial training, and neural vocoders. Moreover, it is possible to automatically estimate objective measures such as mel-cepstrum distortion and pseudo mean opinion score based on MOSNet. In this paper, we describe representative functions developed in crank and make brief comparisons by objective evaluations.

Submitted 4 March, 2021; originally announced March 2021.

Comments: Accepted to ICASSP 2021
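
The crank abstract above mentions mel-cepstrum distortion (MCD) as an objective measure. The listing does not show how crank computes it, so the following is a sketch of the widely used textbook definition rather than crank's own code: per frame, MCD in dB is (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), conventionally skipping the 0th (energy) coefficient, then averaged over frames. The sketch assumes the two sequences are already time-aligned and of equal length; the names `mcd_frame` and `mel_cepstral_distortion` are hypothetical.

```python
import math


def mcd_frame(ref_coeffs, conv_coeffs):
    """MCD in dB for one frame, excluding the 0th (energy) coefficient."""
    squared_diff = sum(
        (r - c) ** 2 for r, c in zip(ref_coeffs[1:], conv_coeffs[1:])
    )
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_diff)


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Average per-frame MCD over two time-aligned mel-cepstrum sequences.

    Assumption: frames are already aligned (e.g. by dynamic time warping),
    so both sequences have the same number of frames.
    """
    if len(ref_frames) != len(conv_frames) or not ref_frames:
        raise ValueError("sequences must be non-empty and equally long")
    total = sum(mcd_frame(r, c) for r, c in zip(ref_frames, conv_frames))
    return total / len(ref_frames)


# Two toy frames with three coefficients each (index 0 is ignored).
reference = [[1.0, 0.5, 0.2], [1.0, 0.4, 0.1]]
converted = [[0.8, 0.5, 0.2], [1.1, 0.4, 1.1]]
print(mel_cepstral_distortion(reference, converted))
```

Lower values indicate converted speech whose spectral envelope is closer to the target.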

arXiv:2011.10196 [pdf, other]

Deep unfolding-based output feedback control design for linear systems with input saturation

Authors: Koki Kobayashi, Masaki Ogura, Taisuke Kobayashi, Kenji Sugimoto

Abstract: In this paper, we propose a deep unfolding-based framework for the output feedback control of systems with input saturation. Although saturation commonly arises in several practical control systems, there is still a scarcity of effective design methodologies that can directly deal with the severe non-linearity of the saturation operator. In this paper, we aim to design an anti-windup controller for enlarging the region of stability of the closed-loop system by learning from the numerical simulations of the closed-loop system. The data-driven framework we propose in this paper is based on a deep-learning technique called Neural Ordinary Differential Equations. Within our framework, we first obtain a candidate controller by using the deep-learning technique, which is then tested by the existing theoretical results already established in the literature, thereby avoiding the computational challenge in the conventional design methodologies as well as theoretically guaranteeing the performance of the system. Our numerical simulation shows that the proposed framework can significantly outperform a conventional design methodology based on linear matrix inequalities.

Submitted 27 January, 2021; v1 submitted 19 November, 2020; originally announced November 2020.

Comments: 7 pages, 5 figures

arXiv:2011.06224 [pdf, other]

Decomposing Normal and Abnormal Features of Medical Images for Content-based Image Retrieval

Authors: Kazuma Kobayashi, Ryuichiro Hataya, Yusuke Kurose, Tatsuya Harada, Ryuji Hamamoto

Abstract: Medical images can be decomposed into normal and abnormal features, a property referred to as compositionality. Based on this idea, we propose an encoder-decoder network to decompose a medical image into two discrete latent codes: a normal anatomy code and an abnormal anatomy code. Using these latent codes, we demonstrate a similarity retrieval by focusing on either normal or abnormal features of medical images.

Submitted 12 November, 2020; originally announced November 2020.

Comments: Machine Learning for Health (ML4H) at NeurIPS 2020 - Extended Abstract

arXiv:2010.04446 [pdf, other]

The NU Voice Conversion System for the Voice Conversion Challenge 2020: On the Effectiveness of Sequence-to-sequence Models and Autoregressive Neural Vocoders

Authors: Wen-Chin Huang, Patrick Lumban Tobing, Yi-Chiao Wu, Kazuhiro Kobayashi, Tomoki Toda

Abstract: In this paper, we present the voice conversion (VC) systems developed at Nagoya University (NU) for the Voice Conversion Challenge 2020 (VCC2020). We aim to determine the effectiveness of two recent significant technologies in VC: sequence-to-sequence (seq2seq) models and autoregressive (AR) neural vocoders. Two respective systems were developed for the two tasks in the challenge: for task 1, we adopted the Voice Transformer Network, a Transformer-based seq2seq VC model, and extended it with synthetic parallel data to tackle nonparallel data; for task 2, we used the frame-based cyclic variational autoencoder (CycleVAE) to model the spectral features of a speech waveform and the AR WaveNet vocoder with additional fine-tuning. By comparing with the baseline systems, we confirmed that the seq2seq modeling can improve the conversion similarity and that the use of AR vocoders can improve the naturalness of the converted speech.

Submitted 9 October, 2020; originally announced October 2020.

Comments: Accepted to the ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

arXiv:2007.05663 [pdf, other]

doi 10.1109/TASLP.2021.3061245

Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

Authors: Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

Abstract: In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However, the pure-data-driven nature and the lack of prior knowledge of audio signals degrade the pitch controllability of WN. For instance, it is difficult for WN to precisely generate the periodic components of audio signals when the given auxiliary fundamental frequency ($F_{0}$) features are outside the $F_{0}$ range observed in the training data. To address this problem, QPNet with two novel designs is proposed. First, the PDCNN component is applied to dynamically change the network architecture of WN according to the given auxiliary $F_{0}$ features. Second, a cascaded network structure is utilized to simultaneously model the long- and short-term dependencies of quasi-periodic signals such as speech. The performances of single-tone sinusoid and speech generations are evaluated. The experimental results show the effectiveness of the PDCNNs for unseen auxiliary $F_{0}$ features and the effectiveness of the cascaded structure for speech generation.

Submitted 27 March, 2021; v1 submitted 10 July, 2020; originally announced July 2020.

Comments: 15 pages, 12 figures, 11 tables

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1134-1148, 2021

arXiv:2005.12573 [pdf, other]

Learning Global and Local Features of Normal Brain Anatomy for Unsupervised Abnormality Detection

Authors: Kazuma Kobayashi, Ryuichiro Hataya, Yusuke Kurose, Amina Bolatkan, Mototaka Miyake, Hirokazu Watanabe, Masamichi Takahashi, Jun Itami, Tatsuya Harada, Ryuji Hamamoto

Abstract: In real-world clinical practice, overlooking unanticipated findings can result in serious consequences. However, supervised learning, which is the foundation for the current success of deep learning, only encourages models to identify abnormalities that are defined in datasets in advance. Therefore, abnormality detection must be implemented in medical images that are not limited to a specific disease category. In this study, we demonstrate an unsupervised learning framework for pixel-wise abnormality detection in brain magnetic resonance imaging captured from a patient population with metastatic brain tumor. Our concept is as follows: If an image reconstruction network can faithfully reproduce the global features of normal anatomy, then the abnormal lesions in unseen images can be identified based on the local difference from those reconstructed as normal by a discriminative network. Both networks are trained on a dataset comprising only normal images without labels. In addition, we devise a metric to evaluate the anatomical fidelity of the reconstructed images and confirm that the overall detection performance is improved when the image reconstruction network achieves a higher score. For evaluation, clinically significant abnormalities are comprehensively segmented. The results show that the area under the receiver operating characteristics curve values for metastatic brain tumors, extracranial metastatic tumors, postoperative cavities, and structural changes are 0.78, 0.61, 0.91, and 0.60, respectively.

Submitted 8 May, 2021; v1 submitted 26 May, 2020; originally announced May 2020.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2003.11750 [pdf]

doi 10.1109/ACCESS.2020.2984007

Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression

Authors: Yi-Chiao Wu, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Hayashi, Tomoki Toda

Abstract: In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually lead to significant speech quality degradation, making WN generate some very noisy speech segments called collapsed speech. To tackle the problem, we take conventional-vocoder-generated speech as the reference speech to derive a linear predictive coding distribution constraint (LPCDC) to avoid the collapsed speech problem. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the LPCDC is only applied to the problematic segments to limit the loss of quality to short periods. Objective and subjective evaluations are conducted, and the experimental results confirm the effectiveness of the proposed method, which further improves the speech quality of our previous non-parallel VC system submitted to Voice Conversion Challenge 2018.

Submitted 6 April, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

Comments: 13 pages, 13 figures, 1 table, accepted to publish in IEEE Access

arXiv:1907.11898 [pdf, other]

Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

Authors: Wen-Chin Huang, Yi-Chiao Wu, Kazuhiro Kobayashi, Yu-Huai Peng, Hsin-Te Hwang, Patrick Lumban Tobing, Yu Tsao, Hsin-Min Wang, Tomoki Toda

Abstract: We present a modification to the spectrum differential based direct waveform modification for voice conversion (DIFFVC) so that it can be directly applied as a waveform generation module to voice conversion models. The recently proposed DIFFVC avoids the use of a vocoder while preserving rich spectral details, and is hence capable of generating high-quality converted voice. To apply the DIFFVC framework, a model that can estimate the spectral differential from the F0 transformed input speech needs to be trained beforehand. This requirement imposes several constraints, including a limitation on the estimation model to parallel training and the need for extra training on each conversion pair, which make DIFFVC inflexible. Based on the above motivations, we propose a new DIFFVC framework based on an F0 transformation in the residual domain. By performing inverse filtering on the input signal followed by synthesis filtering on the F0 transformed residual signal using the converted spectral features directly, the spectral conversion model does not need to be retrained or capable of predicting the spectral differential. We describe several details that need to be taken care of under this modification, and by applying our proposed method to a non-parallel, variational autoencoder (VAE)-based spectral conversion model, we demonstrate that this framework can be generalized to any spectral conversion model, and experimental evaluations show that it can outperform a baseline framework whose waveform generation process is carried out by a vocoder.

Submitted 27 July, 2019; originally announced July 2019.

Comments: 6 pages, 4 figures, 1 table; accepted to the 10th ISCA speech synthesis workshop (SSW10)

arXiv:1907.10185 [pdf, ps, other]

Non-Parallel Voice Conversion with Cyclic Variational Autoencoder

Authors: Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

Abstract: In this paper, we present a novel technique for non-parallel voice conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian prior, is used to encode a set of input features. In a VAE-based VC, the encoded latent features are fed into a decoder, along with speaker-coding features, to generate estimated spectra with either the original speaker identity (reconstructed) or another speaker identity (converted). Due to the non-parallel modeling condition, the converted spectra cannot be directly optimized, which heavily degrades the performance of a VAE-based VC. In this work, to overcome this problem, we propose to use a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized. The cyclic flow can be continued by using the cyclic reconstructed features as input for the next cycle. The experimental results demonstrate the effectiveness of the proposed CycleVAE-based VC, which yields higher accuracy of converted spectra, generates latent features with higher correlation degree, and significantly improves the quality and conversion accuracy of the converted speech.

Submitted 23 July, 2019; originally announced July 2019.

Comments: Accepted to INTERSPEECH 2019

arXiv:1907.08940 [pdf]

Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder

Authors: Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

Abstract: In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dilated convolution and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a huge network size to achieve acceptable speech quality. Such limitations usually lead to performance degradation in the voice conversion task. To overcome this problem, the QPNet vocoder is applied, which includes a pitch-dependent dilated convolution component to enhance the pitch controllability and attain a more compact network than the WN vocoder. In the proposed method, input spectral features are first converted using a framewise deep neural network, and then the QPNet vocoder generates converted speech conditioned on the linearly converted prosodic and transformed spectral features. The experimental results confirm that the QPNet vocoder achieves significantly better performance than the same-size WN vocoder while maintaining comparable speech quality to the double-size WN vocoder. Index Terms: WaveNet, vocoder, voice conversion, pitch-dependent dilated convolution, pitch controllability

Submitted 22 March, 2020; v1 submitted 21 July, 2019; originally announced July 2019.

Comments: 6 pages, 7 figures, Proc. SSW10, 2019

arXiv:1907.00797 [pdf]

Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation

Authors: Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

Abstract: In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of WaveNet (WN) vocoder. The effectiveness of the WN vocoder to generate high-fidelity speech samples from given acoustic features has been proved recently. However, because of the fixed dilated convolution and generic network architecture, the WN vocoder hardly generates speech with given F0 values which are outside the range observed in training data. Consequently, the WN vocoder lacks the pitch controllability which is one of the essential capabilities of conventional vocoders. To address this limitation, we propose the PDCNN component which has the time-variant adaptive dilation size related to the given F0 values and a cascade network structure of the QPNet vocoder to generate quasi-periodic signals such as speech. Both objective and subjective tests are conducted, and the experimental results demonstrate the better pitch controllability of the QPNet vocoder compared to the same and double sized WN vocoders while attaining comparable speech qualities. Index Terms: WaveNet, vocoder, quasi-periodic signal, pitch-dependent dilated convolution, pitch controllability

Submitted 22 March, 2020; v1 submitted 1 July, 2019; originally announced July 2019.

Comments: 5 pages, 4 figures, Proc. Interspeech, 2019

arXiv:1905.00615 [pdf, other]

Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion

Authors: Wen-Chin Huang, Yi-Chiao Wu, Chen-Chou Lo, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, Hsin-Min Wang

Abstract: In this work, we investigate the effectiveness of two techniques for improving variational autoencoder (VAE) based voice conversion (VC). First, we reconsider the relationship between vocoder features extracted using the high quality vocoders adopted in conventional VC systems, and hypothesize that the spectral features are in fact F0 dependent. Such a hypothesis implies that during the conversion phase, the latent codes and the converted features in VAE based VC are in fact source F0 dependent. To this end, we propose to utilize the F0 as an additional input of the decoder. The model can learn to disentangle the latent code from the F0 and thus generates converted-F0-dependent converted features. Second, to better capture temporal dependencies of the spectral features and the F0 pattern, we replace the frame wise conversion structure in the original VAE based VC framework with a fully convolutional network structure. Our experiments demonstrate that the degree of disentanglement as well as the naturalness of the converted speech are indeed improved.

Submitted 8 July, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

Comments: 5 pages, 6 figures, 3 tables; Accepted to Interspeech 2019

arXiv:1811.11078 [pdf, other]

doi 10.23919/EUSIPCO.2019.8902651

Refined WaveNet Vocoder for Variational Autoencoder Based Voice Conversion

Authors: Wen-Chin Huang, Yi-Chiao Wu, Hsin-Te Hwang, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, Hsin-Min Wang

Abstract: This paper presents a refinement framework of WaveNet vocoders for variational autoencoder (VAE) based voice conversion (VC), which reduces the quality distortion caused by the mismatch between the training data and testing data. Conventional WaveNet vocoders are trained with natural acoustic features but conditioned on the converted features in the conversion stage for VC, and such a mismatch often causes significant quality and similarity degradation. In this work, we take advantage of the particular structure of VAEs to refine WaveNet vocoders with the self-reconstructed features generated by VAE, which are of similar characteristics with the converted features while having the same temporal structure with the target natural features. We analyze these features and show that the self-reconstructed features are similar to the converted features. Objective and subjective experimental results demonstrate the effectiveness of our proposed framework.

Submitted 8 July, 2019; v1 submitted 27 November, 2018; originally announced November 2018.

Comments: 5 pages, 7 figures, 1 table. Accepted to EUSIPCO 2019

arXiv:1810.09137 [pdf, other]

doi 10.1109/TASLP.2018.2842156

DNN-based Source Enhancement to Increase Objective Sound Quality Assessment Score

Authors: Yuma Koizumi, Kenta Niwa, Yusuke Hioka, Kazunori Kobayashi, Yoichi Haneda

Abstract: We propose a training method for deep neural network (DNN)-based source enhancement to increase objective sound quality assessment (OSQA) scores such as the perceptual evaluation of speech quality (PESQ). In many conventional studies, DNNs have been used as a mapping function to estimate time-frequency masks and trained to minimize an analytically tractable objective function such as the mean squared error (MSE). Since OSQA scores have been used widely for sound-quality evaluation, constructing DNNs to increase OSQA scores would be better than using the minimum-MSE to create high-quality output signals. However, since most OSQA scores are not analytically tractable, i.e., they are black boxes, the gradient of the objective function cannot be calculated by simply applying back-propagation. To calculate the gradient of the OSQA-based objective function, we formulated a DNN optimization scheme on the basis of black-box optimization, which is used for training a computer that plays a game. For a black-box-optimization scheme, we adopt the policy gradient method for calculating the gradient on the basis of a sampling algorithm. To simulate output signals using the sampling algorithm, DNNs are used to estimate the probability density function of the output signals that maximize OSQA scores. The OSQA scores are calculated from the simulated output signals, and the DNNs are trained to increase the probability of generating the simulated output signals that achieve high OSQA scores. Through several experiments, we found that OSQA scores significantly increased by applying the proposed method, even though the MSE was not minimized.

Submitted 22 October, 2018; originally announced October 2018.

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.26, Issue.10, 2018

arXiv:1804.11055 [pdf]

Collapsed speech segment detection and suppression for WaveNet vocoder

Authors: Yi-Chiao Wu, Kazuhiro Kobayashi, Tomoki Hayashi, Patrick Lumban Tobing, Tomoki Toda

Abstract: In this paper, we propose a technique to alleviate the quality degradation caused by collapsed speech segments sometimes generated by the WaveNet vocoder. The effectiveness of the WaveNet vocoder for generating natural speech from acoustic features has been proved in recent works. However, it sometimes generates very noisy speech with collapsed speech segments when only a limited amount of training data is available or significant acoustic mismatches exist between the training and testing data. Such a limitation on the corpus and limited ability of the model can easily occur in some speech generation applications, such as voice conversion and speech enhancement. To address this problem, we propose a technique to automatically detect collapsed speech segments. Moreover, to refine the detected segments, we also propose a waveform generation technique for WaveNet using a linear predictive coding constraint. Verification and subjective tests are conducted to investigate the effectiveness of the proposed techniques. The verification results indicate that the detection technique can detect most collapsed segments. The subjective evaluations of voice conversion demonstrate that the generation technique significantly improves the speech quality while maintaining the same speaker similarity.

Submitted 9 August, 2018; v1 submitted 30 April, 2018; originally announced April 2018.

Comments: 5 pages, 6 figures. Proc. Interspeech, 2018
