Search | arXiv e-print repository

arXiv:2405.01591 [pdf, other]

Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model

Authors: Seonhee Cho, Choonghan Kim, Jiho Lee, Chetan Chilkunda, Sujin Choi, Joo Heung Yoon

Abstract: Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises questions about the feasibility of these models when encountering with the inevitable variations and errors inherent in real-world medical data. In this paper, we introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions. MID-M achieves a comparable or superior performance to task-specific fine-tuned LMMs and other general-domain ones, without the extensive domain-specific training or pre-training on multimodal data, with significantly fewer parameters. This highlights the potential of leveraging general-domain LLMs for domain-specific tasks and offers a sustainable and cost-effective alternative to traditional LMM developments. Moreover, the robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.

Submitted 29 April, 2024; originally announced May 2024.

Comments: Under review

arXiv:2401.15938 [pdf, other]

Motion-induced error reduction for high-speed dynamic digital fringe projection system

Authors: Sanghoon Jeon, Hyo-Geon Lee, Jae-Sung Lee, Bo-Min Kang, Byung-Wook Jeon, Jun Young Yoon, Jae-Sang Hyun

Abstract: In phase-shifting profilometry (PSP), any motion during the acquisition of fringe patterns can introduce errors because it assumes both the object and measurement system are stationary. Therefore, we propose a method to pixel-wise reduce the errors when the measurement system is in motion due to a motorized linear stage. The proposed method introduces motion-induced error reduction algorithm, which leverages the motor's encoder and pinhole model of the camera and projector. 3D shape measurement is possible with only three fringe patterns by applying geometric constraints of the digital fringe projection system. We address the mismatch problem due to the motion-induced camera pixel disparities and reduce phase-shift errors. These processes are easy to implement and require low computational cost. Experimental results demonstrate that the presented method effectively reduces the errors even in non-uniform motion.

Submitted 29 January, 2024; originally announced January 2024.

Comments: 9 pages, 7 figures
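
As background for the entry above, which recovers shape from only three fringe patterns, the following is a minimal sketch of textbook three-step phase retrieval. It assumes equally spaced phase shifts of -2π/3, 0, and +2π/3 and a stationary scene; it does not implement the paper's motion-induced error reduction, and the function names are illustrative rather than taken from the paper.

```python
import math


def simulate_intensities(phase, offset=0.5, modulation=0.4):
    """Ideal fringe intensities I_k = offset + modulation * cos(phase + shift_k).

    The three shifts (-2*pi/3, 0, +2*pi/3) are an assumption of this sketch.
    """
    shifts = (-2.0 * math.pi / 3.0, 0.0, 2.0 * math.pi / 3.0)
    return tuple(offset + modulation * math.cos(phase + s) for s in shifts)


def three_step_phase(i1, i2, i3):
    """Wrapped phase in (-pi, pi] from three phase-shifted intensities.

    With the shifts above, I1 - I3 is proportional to sin(phase) and
    2*I2 - I1 - I3 is proportional to cos(phase), so atan2 recovers the phase.
    """
    return math.atan2(math.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)


if __name__ == "__main__":
    true_phase = 0.7
    recovered = three_step_phase(*simulate_intensities(true_phase))
    print(f"true phase: {true_phase:.4f}, recovered: {recovered:.4f}")
```

If the object or the measurement system moves between the three captures, the effective shifts are no longer the assumed ones and the recovered phase is biased; that is the error source the abstract describes.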

arXiv:2401.13921 [pdf, other]

Intelli-Z: Toward Intelligible Zero-Shot TTS

Authors: Sunghee Jung, Won Jang, Jaesam Yoon, Bongwan Kim

Abstract: Although numerous recent studies have suggested new frameworks for zero-shot TTS using large-scale, real-world data, studies that focus on the intelligibility of zero-shot TTS are relatively scarce. Zero-shot TTS demands additional efforts to ensure clear pronunciation and speech quality due to its inherent requirement of replacing a core parameter (speaker embedding or acoustic prompt) with a new one at the inference stage. In this study, we propose a zero-shot TTS model focused on intelligibility, which we refer to as Intelli-Z. Intelli-Z learns speaker embeddings by using multi-speaker TTS as its teacher and is trained with a cycle-consistency loss to include mismatched text-speech pairs for training. Additionally, it selectively aggregates speaker embeddings along the temporal dimension to minimize the interference of the text content of reference speech at the inference stage. We substantiate the effectiveness of the proposed methods with an ablation study. The Mean Opinion Score (MOS) increases by 9% for unseen speakers when the first two methods are applied, and it further improves by 16% when selective temporal aggregation is applied.

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2310.08598 [pdf, other]

Domain Generalization for Medical Image Analysis: A Survey

Authors: Jee Seok Yoon, Kwanseok Oh, Yooseung Shin, Maciej A. Mazurowski, Heung-Il Suk

Abstract: Medical image analysis (MedIA) has become an essential tool in medicine and healthcare, aiding in disease diagnosis, prognosis, and treatment planning, and recent successes in deep learning (DL) have made significant contributions to its advances. However, deploying DL models for MedIA in real-world situations remains challenging due to their failure to generalize across the distributional gap between training and testing samples - a problem known as domain shift. Researchers have dedicated their efforts to developing various DL methods to adapt and perform robustly on unknown and out-of-distribution data distributions. This paper comprehensively reviews domain generalization studies specifically tailored for MedIA. We provide a holistic view of how domain generalization techniques interact within the broader MedIA system, going beyond methodologies to consider the operational implications on the entire MedIA workflow. Specifically, we categorize domain generalization methods into data-level, feature-level, model-level, and analysis-level methods. We show how those methods can be used in various stages of the MedIA workflow with DL equipped from data acquisition to model prediction and analysis. Furthermore, we critically analyze the strengths and weaknesses of various methods, unveiling future research opportunities.

Submitted 15 February, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

arXiv:2308.07947 [pdf]

Targeted Multispectral Filter Array Design for Endoscopic Cancer Detection in the Gastrointestinal Tract

Authors: Michaela Taylor-Williams, Ran Tao, Travis W Sawyer, Dale J Waterhouse, Jonghee Yoon, Sarah E Bohndiek

Abstract: Colour differences between healthy and diseased tissue in the gastrointestinal tract are detected visually by clinicians during white light endoscopy (WLE); however, the earliest signs of disease are often just a slightly different shade of pink compared to healthy tissue. Here, we propose to target alternative colours for imaging to improve contrast using custom multispectral filter arrays (MSFAs) that could be deployed in an endoscopic chip-on-tip configuration. Using an open-source toolbox, Opti-MSFA, we examined the optimal design of MSFAs for early cancer detection in the gastrointestinal tract. The toolbox was first extended to use additional classification models (k-Nearest Neighbour, Support Vector Machine, and Spectral Angle Mapper). Using input spectral data from published clinical trials examining the oesophagus and colon, we optimised the design of MSFAs with 3 to 9 different bands. We examined the variation of the spectral and spatial classification accuracy as a function of number of bands. The MSFA designs have high classification accuracies, suggesting that future implementation in endoscopy hardware could potentially enable improved early detection of disease in the gastrointestinal tract during routine screening and surveillance. Optimal MSFA configurations can achieve similar classification accuracies as the full spectral data in an implementation that could be realised in far simpler hardware. The reduced number of spectral bands could enable future deployment of multispectral imaging in an endoscopic chip-on-tip configuration.

Submitted 15 August, 2023; originally announced August 2023.

Comments: 29 pages

arXiv:2306.10058 [pdf, other]

EM-Network: Oracle Guided Self-distillation for Sequence Learning

Authors: Ji Won Yoon, Sunghwan Ahn, Hyeonseung Lee, Minchan Kim, Seok Min Kim, Nam Soo Kim

Abstract: We introduce EM-Network, a novel self-distillation approach that effectively leverages target information for supervised sequence-to-sequence (seq2seq) learning. In contrast to conventional methods, it is trained with oracle guidance, which is derived from the target sequence. Since the oracle guidance compactly represents the target-side context that can assist the sequence model in solving the task, the EM-Network achieves a better prediction compared to using only the source input. To allow the sequence model to inherit the promising capability of the EM-Network, we propose a new self-distillation strategy, where the original sequence model can benefit from the knowledge of the EM-Network in a one-stage manner. We conduct comprehensive experiments on two types of seq2seq models: connectionist temporal classification (CTC) for speech recognition and attention-based encoder-decoder (AED) for machine translation. Experimental results demonstrate that the EM-Network significantly advances the current state-of-the-art approaches, improving over the best prior work on speech recognition and establishing state-of-the-art performance on WMT'14 and IWSLT'14.

Submitted 14 June, 2023; originally announced June 2023.

Comments: ICML 2023

arXiv:2306.08463 [pdf, other]

MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization

Authors: Ji Won Yoon, Seok Min Kim, Nam Soo Kim

Abstract: Self-supervised learning (SSL) has shown significant progress in speech processing tasks. However, despite the intrinsic randomness in the Transformer structure, such as dropout variants and layer-drop, improving the model-level consistency remains under-explored in the speech SSL literature. To address this, we propose a new pre-training method that uses consistency regularization to improve Data2vec 2.0, the recent state-of-the-art (SOTA) SSL model. Specifically, the proposed method involves sampling two different student sub-models within the Data2vec 2.0 framework, enabling two output variants derived from a single input without additional parameters. Subsequently, we regularize the outputs from the student sub-models to be consistent and require them to predict the representation of the teacher model. Our experimental results demonstrate that the proposed approach improves the SSL model's robustness and generalization ability, resulting in SOTA results on the SUPERB benchmark.

Submitted 14 June, 2023; originally announced June 2023.

Comments: INTERSPEECH 2023

arXiv:2211.15075 [pdf, other]

Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition

Authors: Ji Won Yoon, Beom Jun Woo, Sunghwan Ahn, Hyeonseung Lee, Nam Soo Kim

Abstract: Recently, the advance in deep learning has brought a considerable improvement in the end-to-end speech recognition field, simplifying the traditional pipeline while producing promising results. Among the end-to-end models, the connectionist temporal classification (CTC)-based model has attracted research interest due to its non-autoregressive nature. However, such CTC models require a heavy computational cost to achieve outstanding performance. To mitigate the computational burden, we propose a simple yet effective knowledge distillation (KD) for the CTC framework, namely Inter-KD, that additionally transfers the teacher's knowledge to the intermediate CTC layers of the student network. From the experimental results on the LibriSpeech, we verify that the Inter-KD shows better achievements compared to the conventional KD methods. Without using any language model (LM) and data augmentation, Inter-KD improves the word error rate (WER) performance from 8.85 % to 6.30 % on the test-clean.

Submitted 28 November, 2022; originally announced November 2022.

Comments: Accepted by 2022 SLT Workshop
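
To put the WER figures in the Inter-KD entry above on a relative scale, the short sketch below computes the relative error reduction from the two numbers quoted in the abstract (8.85 % and 6.30 %). The helper name is illustrative, and the calculation is plain arithmetic rather than anything taken from the paper.

```python
def relative_wer_reduction(baseline_wer, improved_wer):
    """Fraction of the baseline word error rate that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Figures quoted in the abstract for LibriSpeech test-clean.
    reduction = relative_wer_reduction(8.85, 6.30)
    print(f"absolute reduction: {8.85 - 6.30:.2f} points")
    print(f"relative reduction: {reduction:.1%}")
```

The absolute gain is 2.55 WER points, which corresponds to a relative reduction of roughly 29 %.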

arXiv:2210.05524 [pdf, other]

A Learning-Based Estimation and Control Framework for Contact-Intensive Tight-Tolerance Tasks

Authors: Bukun Son, Hyelim Choi, Jaemin Yoon, Dongjun Lee

Abstract: We present a two-stage framework that integrates a learning-based estimator and a controller, designed to address contact-intensive tasks. The estimator leverages a Bayesian particle filter with a mixture density network (MDN) structure, effectively handling multi-modal issues arising from contact information. The controller combines a self-supervised and reinforcement learning (RL) approach, strategically dividing the low-level admittance controller's parameters into labelable and non-labelable categories, which are then trained accordingly. To further enhance accuracy and generalization performance, a transformer model is incorporated into the self-supervised learning component. The proposed framework is evaluated on the bolting task using an accurate real-time simulator and successfully transferred to an experimental environment. More visualization results are available on our project website: https://sites.google.com/view/2stagecitt

Submitted 1 August, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

arXiv:2208.12332 [pdf, other]

2nd Place Solutions for UG2+ Challenge 2022 -- D$^{3}$Net for Mitigating Atmospheric Turbulence from Images

Authors: Sunder Ali Khowaja, Ik Hyun Lee, Jiseok Yoon

Abstract: This technical report briefly introduces to the D$^{3}$Net proposed by our team "TUK-IKLAB" for Atmospheric Turbulence Mitigation in $UG2^{+}$ Challenge at CVPR 2022. In the light of test and validation results on textual images to improve text recognition performance and hot-air balloon images for image enhancement, we can say that the proposed method achieves state-of-the-art performance. Furthermore, we also provide a visual comparison with publicly available denoising, deblurring, and frame averaging methods with respect to the proposed work. The proposed method ranked 2nd on the final leader-board of the aforementioned challenge in the testing phase, respectively.

Submitted 25 August, 2022; originally announced August 2022.

Comments: 4 pages, 4 figures

arXiv:2207.13223 [pdf, other]

XADLiME: eXplainable Alzheimer's Disease Likelihood Map Estimation via Clinically-guided Prototype Learning

Authors: Ahmad Wisnu Mulyadi, Wonsik Jung, Kwanseok Oh, Jee Seok Yoon, Heung-Il Suk

Abstract: Diagnosing Alzheimer's disease (AD) involves a deliberate diagnostic process owing to its innate traits of irreversibility with subtle and gradual progression. These characteristics make AD biomarker identification from structural brain imaging (e.g., structural MRI) scans quite challenging. Furthermore, there is a high possibility of getting entangled with normal aging. We propose a novel deep-learning approach through eXplainable AD Likelihood Map Estimation (XADLiME) for AD progression modeling over 3D sMRIs using clinically-guided prototype learning. Specifically, we establish a set of topologically-aware prototypes onto the clusters of latent clinical features, uncovering an AD spectrum manifold. We then measure the similarities between latent clinical features and well-established prototypes, estimating a "pseudo" likelihood map. By considering this pseudo map as an enriched reference, we employ an estimating network to estimate the AD likelihood map over a 3D sMRI scan. Additionally, we promote the explainability of such a likelihood map by revealing a comprehensible overview from two perspectives: clinical and morphological. During the inference, this estimated likelihood map served as a substitute over unseen sMRI scans for effectively conducting the downstream task while providing thorough explainable states.

Submitted 26 July, 2022; originally announced July 2022.

arXiv:2206.09074 [pdf, other]

Weakly Supervised Classification of Vital Sign Alerts as Real or Artifact

Authors: Arnab Dey, Mononito Goswami, Joo Heung Yoon, Gilles Clermont, Michael Pinsky, Marilyn Hravnak, Artur Dubrawski

Abstract: A significant proportion of clinical physiologic monitoring alarms are false. This often leads to alarm fatigue in clinical personnel, inevitably compromising patient safety. To combat this issue, researchers have attempted to build Machine Learning (ML) models capable of accurately adjudicating Vital Sign (VS) alerts raised at the bedside of hemodynamically monitored patients as real or artifact. Previous studies have utilized supervised ML techniques that require substantial amounts of hand-labeled data. However, manually harvesting such data can be costly, time-consuming, and mundane, and is a key factor limiting the widespread adoption of ML in healthcare (HC). Instead, we explore the use of multiple, individually imperfect heuristics to automatically assign probabilistic labels to unlabeled training data using weak supervision. Our weakly supervised models perform competitively with traditional supervised techniques and require less involvement from domain experts, demonstrating their use as efficient and practical alternatives to supervised learning in HC applications of ML.

Submitted 17 June, 2022; originally announced June 2022.

Comments: Accepted at American Medical Informatics Association (AMIA) Annual Symposium 2022. 10 pages, 4 figures and 2 tables
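
The entry above describes combining several imperfect heuristics into probabilistic labels. The sketch below shows the simplest possible version of that idea: an unweighted vote over heuristics that may abstain. This is an assumption made for illustration only; the abstract does not say how the votes are combined, and the names `ABSTAIN` and `probabilistic_label` are hypothetical.

```python
ABSTAIN = None  # a heuristic returns this when it has no opinion
REAL, ARTIFACT = 0, 1


def probabilistic_label(votes, prior=0.5):
    """Estimate P(alert is an artifact) from heuristic votes.

    Each vote is REAL (0), ARTIFACT (1), or ABSTAIN (None). The estimate is
    the fraction of non-abstaining votes for ARTIFACT; if every heuristic
    abstains, the prior is returned. Unweighted voting is an assumption of
    this sketch, not the method used in the paper.
    """
    cast = [v for v in votes if v is not ABSTAIN]
    if not cast:
        return prior
    return sum(cast) / len(cast)


if __name__ == "__main__":
    # Three heuristics vote, one abstains.
    print(probabilistic_label([ARTIFACT, ARTIFACT, REAL, ABSTAIN]))
```

A downstream classifier can then be trained on these soft labels instead of hand-labeled ones, which is the general pattern the abstract refers to as weak supervision.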

arXiv:2204.06328 [pdf, other]

HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition

Authors: Ji Won Yoon, Beom Jun Woo, Nam Soo Kim

Abstract: Pre-training with self-supervised models, such as Hidden-unit BERT (HuBERT) and wav2vec 2.0, has brought significant improvements in automatic speech recognition (ASR). However, these models usually require an expensive computational cost to achieve outstanding performance, slowing down the inference speed. To improve the model efficiency, we introduce an early exit scheme for ASR, namely HuBERT-EE, that allows the model to stop the inference dynamically. In HuBERT-EE, multiple early exit branches are added at the intermediate layers. When the intermediate prediction of the early exit branch is confident, the model stops the inference, and the corresponding result can be returned early. We investigate the proper early exiting criterion and fine-tuning strategy to effectively perform early exiting. Experimental results on the LibriSpeech show that HuBERT-EE can accelerate the inference of the HuBERT while simultaneously balancing the trade-off between the performance and the latency.

Submitted 19 June, 2024; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: Accepted by INTERSPEECH 2024
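
The early-exit idea in the HuBERT-EE entry above can be sketched as a loop that stops at the first branch whose prediction looks confident. The abstract does not state which confidence criterion the authors settled on, so this sketch assumes a threshold on the mean normalized entropy of the per-frame output distributions; the threshold value and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_normalized_entropy(frame_logits):
    """Average per-frame entropy, scaled to [0, 1] by log(vocabulary size)."""
    total = 0.0
    for logits in frame_logits:
        probs = softmax(logits)
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        total += entropy / math.log(len(logits))
    return total / len(frame_logits)


def choose_exit(branch_outputs, threshold=0.2):
    """Index of the first branch whose entropy is below the threshold.

    branch_outputs holds one list of per-frame logits per exit branch,
    ordered from shallow to deep. If no branch is confident, the last
    (deepest) branch is used. A real system would compute branches lazily
    and skip the remaining layers after exiting; they are precomputed here
    only to keep the sketch short.
    """
    for index, frame_logits in enumerate(branch_outputs):
        if mean_normalized_entropy(frame_logits) < threshold:
            return index
    return len(branch_outputs) - 1


if __name__ == "__main__":
    uncertain = [[0.0, 0.0, 0.0]]   # uniform distribution, entropy 1.0
    confident = [[10.0, 0.0, 0.0]]  # sharply peaked distribution
    print(choose_exit([uncertain, confident, confident]))
```

Lowering the threshold makes the model exit later and more conservatively, which is one way to express the performance and latency trade-off the abstract mentions.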

arXiv:2111.03664 [pdf, other]

doi 10.1109/TASLP.2023.3297955

Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models

Authors: Ji Won Yoon, Hyung Yong Kim, Hyeonseung Lee, Sunghwan Ahn, Nam Soo Kim

Abstract: Knowledge distillation (KD), best known as an effective method for model compression, aims at transferring the knowledge of a bigger network (teacher) to a much smaller network (student). Conventional KD methods usually employ the teacher model trained in a supervised manner, where output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher, that leverages both the source inputs and the output labels as the teacher model's input. Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with more optimal guidance. One potential risk for the proposed approach is a trivial solution that the model's output directly copies the target input. Based on a many-to-one mapping property of the CTC algorithm, we present a training strategy that can effectively prevent the trivial solution and thus enables utilizing both source and target inputs for model training. Extensive experiments are conducted on two sequence learning tasks: speech recognition and scene text recognition. From the experimental results, we empirically show that the proposed model improves the students across these tasks while achieving a considerable speed-up in the teacher model's training time.

Submitted 11 August, 2023; v1 submitted 5 November, 2021; originally announced November 2021.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

arXiv:2106.07889 [pdf, other]

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

Authors: Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, Juntae Kim

Abstract: Most neural vocoders employ band-limited mel-spectrograms to generate waveforms. If full-band spectral features are used as the input, the vocoder can be provided with as much acoustic information as possible. However, in some models employing full-band mel-spectrograms, an over-smoothing problem occurs as part of which non-sharp spectrograms are generated. To address this problem, we propose UnivNet, a neural vocoder that synthesizes high-fidelity waveforms in real time. Inspired by works in the field of voice activity detection, we added a multi-resolution spectrogram discriminator that employs multiple linear spectrogram magnitudes computed using various parameter sets. Using full-band mel-spectrograms as input, we expect to generate high-resolution signals by adding a discriminator that employs spectrograms of multiple resolutions as the input. In an evaluation on a dataset containing information on hundreds of speakers, UnivNet obtained the best objective and subjective results among competing models for both seen and unseen speakers. These results, including the best subjective score for text-to-speech, demonstrate the potential for fast adaptation to new speakers without a need for training from scratch.

Submitted 15 June, 2021; originally announced June 2021.

Comments: Accepted to INTERSPEECH 2021

arXiv:2105.00240 [pdf, other]

Simultaneous super-resolution and motion artifact removal in diffusion-weighted MRI using unsupervised deep learning

Authors: Hyungjin Chung, Jaehyun Kim, Jeong Hee Yoon, Jeong Min Lee, Jong Chul Ye

Abstract: Diffusion-weighted MRI is nowadays performed routinely due to its prognostic ability, yet the quality of the scans are often unsatisfactory which can subsequently hamper the clinical utility. To overcome the limitations, here we propose a fully unsupervised quality enhancement scheme, which boosts the resolution and removes the motion artifact simultaneously. This process is done by first training the network using optimal transport driven cycleGAN with stochastic degradation block which learns to remove aliasing artifacts and enhance the resolution, then using the trained network in the test stage by utilizing bootstrap subsampling and aggregation for motion artifact suppression. We further show that we can control the trade-off between the amount of artifact correction and resolution by controlling the bootstrap subsampling ratio at the inference stage. To the best of our knowledge, the proposed method is the first to tackle super-resolution and motion artifact correction simultaneously in the context of MRI using unsupervised learning. We demonstrate the efficiency of our method by applying it to both quantitative evaluation using simulation study, and to in vivo diffusion-weighted MR scans, which shows that our method is superior to the current state-of-the-art methods. The proposed method is flexible in that it can be applied to various quality enhancement schemes in other types of MR scans, and also directly to the quality enhancement of apparent diffusion coefficient maps.

Submitted 1 May, 2021; originally announced May 2021.

arXiv:2011.09631 [pdf, other]

Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains

Authors: Won Jang, Dan Lim, Jaesam Yoon

Abstract: We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains. To preserve sound quality when the MelGAN-based structure is trained with a dataset of hundreds of speakers, we added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms. This enables the model to generate realistic waveforms of multi-speakers, by alleviating the over-smoothing problem in the high frequency band of the large footprint model. Our structure generates signals close to ground-truth data without reducing the inference speed, by discriminating the waveform and spectrogram during training. The model achieved the best mean opinion score (MOS) in most scenarios using ground-truth mel-spectrogram as an input. Especially, it showed superior performance in unseen domains with regard of speaker, emotion, and language. Moreover, in a multi-speaker text-to-speech scenario using mel-spectrogram generated by a transformer model, it synthesized high-fidelity speech of 4.22 MOS. These results, achieved without external domain information, highlight the potential of the proposed model as a universal vocoder.

Submitted 3 March, 2021; v1 submitted 18 November, 2020; originally announced November 2020.

arXiv:2008.00671 [pdf, other]

doi 10.1109/TASLP.2021.3071662

TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition

Authors: Ji Won Yoon, Hyeonseung Lee, Hyung Yong Kim, Won Ik Cho, Nam Soo Kim

Abstract: In recent years, there has been a great deal of research in developing end-to-end speech recognition models, which simplify the traditional pipeline and achieve promising results. Despite their remarkable performance improvements, end-to-end models typically incur a high computational cost to perform well. To reduce this computational burden, knowledge distillation (KD), which is a popular model compression method, has been used to transfer knowledge from a deep and complex model (teacher) to a shallower and simpler model (student). Previous KD approaches have commonly designed the architecture of the student model by reducing the width per layer or the number of layers of the teacher model. This structural reduction scheme might limit the flexibility of model selection since the student model structure should be similar to that of the given teacher. To cope with this limitation, we propose a new KD method for end-to-end speech recognition, namely TutorNet, that can transfer knowledge across different types of neural networks at the hidden representation-level as well as the output-level. For concrete realizations, we first apply representation-level knowledge distillation (RKD) during the initialization step, and then apply the softmax-level knowledge distillation (SKD) combined with the original task learning. When the student is trained with RKD, we make use of frame weighting that points out the frames to which the teacher model pays more attention.
Through a number of experiments on the LibriSpeech dataset, it is verified that the proposed method not only distills the knowledge between networks with different topologies but also significantly contributes to improving the word error rate (WER) performance of the distilled student. Interestingly, TutorNet allows the student model to surpass its teacher's performance in some particular cases.

Submitted 16 September, 2021; v1 submitted 3 August, 2020; originally announced August 2020.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

arXiv:2005.08213 [pdf, other]

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

Authors: Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim

Abstract: Speech is one of the most effective means of communication and is full of information that helps the transmission of the utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probability has frequently been discarded in understanding the natural language. Thus, some recent spoken language understanding (SLU) modules have utilized end-to-end structures that preserve the uncertainty information. This further reduces the propagation of speech recognition error and guarantees computational efficiency. We claim that in this process, the speech comprehension can benefit from the inference of massive pre-trained language models (LMs). We transfer the knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies. We demonstrate the validity of our proposal through the performance on Fluent Speech Command, an English SLU benchmark. Thereby, we experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module, in which the abstracted speech is expected to meet the semantic representation.

Submitted 8 August, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

Comments: Interspeech 2020 Camera-ready

arXiv:2005.07799 [pdf, other]

JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment

Authors: Dan Lim, Won Jang, Gyeonghwan O, Heayoung Park, Bongwan Kim, Jaesam Yoon

Abstract: We propose Jointly trained Duration Informed Transformer (JDI-T), a feed-forward Transformer with a duration predictor jointly trained without explicit alignments in order to generate an acoustic feature sequence from an input text. In this work, inspired by the recent success of duration informed networks such as FastSpeech and DurIAN, we further simplify their sequential, two-stage training pipeline to a single-stage training. Specifically, we extract the phoneme duration from the autoregressive Transformer on the fly during the joint training instead of pretraining the autoregressive model and using it as a phoneme duration extractor. To the best of our knowledge, it is the first implementation to jointly train the feed-forward Transformer without relying on a pre-trained phoneme duration extractor in a single training pipeline. We evaluate the effectiveness of the proposed model on the publicly available Korean Single speaker Speech (KSS) dataset compared to the baseline text-to-speech (TTS) models trained by ESPnet-TTS.

Submitted 4 October, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

Comments: Accepted for publication in Interspeech 2020

arXiv:1911.04824 [pdf, other]

How Low Can You Go? Reducing Frequency and Time Resolution in Current CNN Architectures for Music Auto-tagging

Authors: Andres Ferraro, Dmitry Bogdanov, Xavier Serra, Jay Ho Jeon, Jason Yoon

Abstract: Automatic tagging of music is an important research topic in Music Information Retrieval, and audio analysis algorithms proposed for this task have achieved improvements with advances in deep learning. In particular, many state-of-the-art systems use Convolutional Neural Networks and operate on mel-spectrogram representations of the audio. In this paper, we compare commonly used mel-spectrogram representations and evaluate model performances that can be achieved by reducing the input size in terms of both fewer frequency bands and larger frame rates. We use the MagnaTagaTune dataset for comprehensive performance comparisons and then compare selected configurations on the larger Million Song Dataset. The results of this study can serve researchers and practitioners in their trade-off decision between accuracy of the models, data storage size and training and inference times.

Submitted 28 June, 2020; v1 submitted 12 November, 2019; originally announced November 2019.

Comments: The 28th European Signal Processing Conference (EUSIPCO)

arXiv:1909.13692 [pdf]

Nonlinear Dipole Inversion (NDI) enables Quantitative Susceptibility Mapping (QSM) without parameter tuning

Authors: Daniel Polak, Itthi Chatnuntawech, Jaeyeon Yoon, Siddharth Srinivasan Iyer, Jongho Lee, Peter Bachert, Elfar Adalsteinsson, Kawin Setsompop, Berkin Bilgic

Abstract: We propose Nonlinear Dipole Inversion (NDI) for high-quality Quantitative Susceptibility Mapping (QSM) without regularization tuning, while matching the image quality of state-of-the-art reconstruction techniques. In addition to avoiding over-smoothing that these techniques often suffer from, we also obviate the need for parameter selection. NDI is flexible enough to allow for reconstruction from an arbitrary number of head orientations, and outperforms COSMOS even when using as few as 1-direction data. This is made possible by a nonlinear forward-model that uses the magnitude as an effective prior, for which we derived a simple gradient descent update rule. We synergistically combine this physics-model with a Variational Network (VN) to leverage the power of deep learning in the VaNDI algorithm. This technique adopts the simple gradient descent rule from NDI and learns the network parameters during training, hence requires no additional parameter tuning. Further, we evaluate NDI at 7T using highly accelerated Wave-CAIPI acquisitions at 0.5 mm isotropic resolution and demonstrate high-quality QSM from as few as 2-direction data.

Submitted 30 September, 2019; originally announced September 2019.

arXiv:1909.09263 [pdf, other]

Propagated Perturbation of Adversarial Attack for well-known CNNs: Empirical Study and its Explanation

Authors: Jihyeun Yoon, Kyungyul Kim, Jongseong Jang

Abstract: Deep Neural Network based classifiers are known to be vulnerable to perturbations of inputs constructed by an adversarial attack to force misclassification. Most studies have focused on how to generate such noise with gradient-based attack methods or how to defend models against adversarial attacks. The use of a denoiser model is a well-known solution for reducing the adversarial noise, although classification performance has not significantly improved. In this study, we aim to analyze the propagation of adversarial attacks from an explainable AI (XAI) point of view. Specifically, we examine the trend of adversarial perturbations through the CNN architectures. To analyze the propagated perturbation, we measured normalized Euclidean distance and cosine distance in each CNN layer between the feature map of the perturbed image passed through the denoiser and that of the non-perturbed original image. We used five well-known CNN based classifiers and three gradient-based adversarial attacks. From the experimental results, we observed that in most cases, Euclidean distance explosively increases in the final fully connected layer while cosine distance fluctuated and disappeared at the last layer. This means that the use of a denoiser can decrease the amount of noise. However, it failed to prevent accuracy degradation.

Submitted 23 September, 2019; v1 submitted 19 September, 2019; originally announced September 2019.

Journal ref: ICCV 2019 Workshop on Interpreting and Explaining Visual Artificial Intelligence Models

arXiv:1909.07716 [pdf]

Exploring linearity of deep neural network trained QSM: QSMnet+

Authors: Woojin Jung, Jaeyeon Yoon, Joon Yul Choi, Jae Myung Kim, Yoonho Nam, Eung Yeop Kim, Jongho Lee

Abstract: Recently, deep neural network-powered quantitative susceptibility mapping (QSM), QSMnet, successfully performed ill-conditioned dipole inversion in QSM and generated high-quality susceptibility maps. In this paper, the network, which was trained by healthy volunteer data, is evaluated for hemorrhagic lesions that have substantially higher susceptibility than healthy tissues in order to test linearity of QSMnet for susceptibility. The results show that QSMnet underestimates susceptibility in hemorrhagic lesions, revealing degraded linearity of the network for the untrained susceptibility range. To overcome this limitation, a data augmentation method is proposed to generalize the network for a wider range of susceptibility. The newly trained network, which is referred to as QSMnet+, is assessed in computer-simulated lesions with an extended susceptibility range (-1.4 ppm to +1.4 ppm) and also in twelve hemorrhagic patients. The simulation results demonstrate improved linearity of QSMnet+ over QSMnet (root mean square error of QSMnet+: 0.04 ppm vs. QSMnet: 0.36 ppm). When applied to patient data, QSMnet+ maps show less noticeable artifacts than those of conventional QSM maps. Moreover, the susceptibility values of QSMnet+ in hemorrhagic lesions are better matched to those of the conventional QSM method than those of QSMnet when analyzed using linear regression (QSMnet+: slope = 1.05, intercept = -0.03, R2 = 0.93; QSMnet: slope = 0.68, intercept = 0.06, R2 = 0.86), consolidating improved linearity in QSMnet+.
This study demonstrates the importance of the trained data range in deep neural network-powered parametric mapping and suggests the data augmentation approach for generalization of the network. The new network is applicable to a wide range of susceptibility quantification.

Submitted 14 October, 2019; v1 submitted 17 September, 2019; originally announced September 2019.

Comments: 22 pages

arXiv:1906.05797 [pdf, other]

The Replica Dataset: A Digital Replica of Indoor Spaces

Authors: Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra , et al. (5 additional authors not shown)

Abstract: We introduce Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale. Each scene consists of a dense mesh, high-resolution high-dynamic-range (HDR) textures, per-primitive semantic class and instance information, and planar mirror and glass reflectors. The goal of Replica is to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world - for instance, egocentric computer vision, semantic segmentation in 2D and 3D, geometric inference, and the development of embodied agents (virtual robots) performing navigation, instruction following, and question answering. Due to the high level of realism of the renderings from Replica, there is hope that ML systems trained on Replica may transfer directly to real world image and video data. Together with the data, we are releasing a minimal C++ SDK as a starting point for working with the Replica dataset. In addition, Replica is 'Habitat-compatible', i.e. can be natively used with AI Habitat for training and testing embodied agents.

Submitted 13 June, 2019; originally announced June 2019.

arXiv:1904.02644 [pdf, other]

doi 10.1103/PhysRevX.9.041050

Characterising optical fibre transmission matrices using metasurface reflector stacks for lensless imaging without distal access

Authors: George S. D. Gordon, Milana Gataric, Alberto Gil C. P. Ramos, Ralf Mouthaan, Calum Williams, Jonghee Yoon, Timothy D. Wilkinson, Sarah E. Bohndiek

Abstract: The ability to form images through hair-thin optical fibres promises to open up new applications from biomedical imaging to industrial inspection. Unfortunately, deployment has been limited because small changes in mechanical deformation (e.g. bending) and temperature can completely scramble optical information, distorting images. Since such changes are dynamic, correcting them requires measurement of the fibre transmission matrix (TM) in situ immediately before imaging. TM calibration typically requires access to both the proximal and distal facets of the fibre simultaneously, which is not feasible during most realistic usage scenarios without compromising the thin form factor with bulky distal optics. Here, we introduce a new approach to determine the TM of multi-mode fibre (MMF) or multi-core fibre (MCF) in a reflection-mode configuration without access to the distal facet. A thin stack of structured metasurface reflectors is used at the distal facet to introduce wavelength-dependent, spatially heterogeneous reflectance profiles. We derive a first-order fibre model that compensates for these wavelength-dependent changes in the TM and show that, consequently, the reflected data at 3 wavelengths can be used to unambiguously reconstruct the full TM by an iterative optimisation algorithm. We then present a method for sample illumination and imaging following TM reconstruction. Unlike previous approaches, our method does not require the TM to be unitary, making it applicable to physically realistic fibre systems.
We demonstrate TM reconstruction and imaging first using simulated non-unitary fibres and noisy reflection matrices, then using much larger experimentally-measured TMs of a densely-packed MCF, and finally on an experimentally-measured multi-wavelength set of TMs recorded from an MMF. Our findings pave the way for online transmission matrix calibration in situ in hair-thin optical fibres.

Submitted 5 April, 2019; v1 submitted 4 April, 2019; originally announced April 2019.

Comments: Main text: 38 pages, 9 Figures, Appendices: 26 pages, 6 Figures. Corrected author affiliation

Journal ref: Phys. Rev. X 9, 041050 (2019)

arXiv:1810.04325 [pdf, other]

Analysis of Maximal Topologies Achieving Optimal DoF and DoF $\frac{1}{n}$ in Topological Interference Management

Authors: Jong-Yoon Yoon, Jong-Seon No

Abstract: Topological interference management (TIM) can obtain degrees of freedom (DoF) gains with no channel state information at the transmitters (CSIT) except topological information of the network in the interference channel. It was shown that TIM achieves the optimal symmetric DoF when internal conflict does not exist among messages. However, it is difficult to determine whether a specific topology can achieve the optimal DoF without scrutinizing internal conflict, which requires considerable work. Also, it is hard to design a specific optimal topology directly from the conventional condition for the optimal DoF. With these problems in mind, we propose a method to derive a maximal topology directly in TIM, named alliance construction, in the K-user interference channel. That is, it is proved that a topology is maximal if and only if it is derived from alliance construction. We translate a topology design by alliance construction in the message graph into a topology matrix and propose conditions for a maximal topology matrix (MTM). Moreover, we propose a generalized alliance construction that derives a topology achieving DoF 1/n for n>=3 by generalizing sub-alliances. A topology matrix can also be used to analyze maximality of a topology with DoF 1/n.

Submitted 9 October, 2018; originally announced October 2018.

arXiv:1808.02401 [pdf, other]

Building Encoder and Decoder with Deep Neural Networks: On the Way to Reality

Authors: Minhoe Kim, Woonsup Lee, Jungmin Yoon, Ohyun Jo

Abstract: Deep learning has been a groundbreaking technology in various fields as well as in communications systems. In spite of the notable advancements of deep neural network (DNN) based technologies in recent years, the high computational complexity has been a major obstacle to applying DNNs in practical communications systems which require real-time operation. In this sense, challenges regarding practical implementation must be addressed before the proliferation of DNN-based intelligent communications becomes a reality. To the best of the authors' knowledge, for the first time, this article presents an efficient learning architecture and design strategies including link level verification through digital circuit implementations using hardware description language (HDL) to mitigate this challenge and to assess the feasibility and potential of DNNs for communications systems. In particular, a DNN is applied for an encoder and a decoder to enable flexible adaptation with respect to the system environments without needing any domain specific information. Extensive investigations and interdisciplinary design considerations including the DNN-based autoencoder structure, learning framework, and low-complexity digital circuit implementations for real-time operation are taken into account by the authors, which supports the use of DNN-based communications in practice.

Submitted 7 August, 2018; originally announced August 2018.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:1803.05627 [pdf]

doi 10.1016/j.neuroimage.2018.06.030

Quantitative Susceptibility Mapping using Deep Neural Network: QSMnet

Authors: Jaeyeon Yoon, Enhao Gong, Itthi Chatnuntawech, Berkin Bilgic, Jingu Lee, Woojin Jung, Jingyu Ko, Hosan Jung, Kawin Setsompop, Greg Zaharchuk, Eung Yeop Kim, John Pauly, Jongho Lee

Abstract: Deep neural networks have demonstrated promising potential for the field of medical image reconstruction. In this work, an MRI reconstruction algorithm, which is referred to as quantitative susceptibility mapping (QSM), has been developed using a deep neural network in order to perform dipole deconvolution, which restores the magnetic susceptibility source from an MRI field map. Previous approaches to QSM require multiple orientation data (e.g. Calculation of Susceptibility through Multiple Orientation Sampling or COSMOS) or regularization terms (e.g. Truncated K-space Division or TKD; Morphology Enabled Dipole Inversion or MEDI) to solve the ill-conditioned deconvolution problem. Unfortunately, they either require long multiple orientation scans or suffer from artifacts. To overcome these shortcomings, a deep neural network, QSMnet, is constructed to generate a high quality susceptibility map from single orientation data. The network has a modified U-net structure and is trained using gold-standard COSMOS QSM maps. 25 datasets from 5 subjects (5 orientations each) were applied for patch-wise training after doubling the data using augmentation. Two additional datasets of 5 orientation data were used for validation and test (one dataset each). The QSMnet maps of the test dataset were compared with those from TKD and MEDI for image quality and consistency in multiple head orientations.
Quantitative and qualitative image quality comparisons demonstrate that the QSMnet results have superior image quality to those of TKD or MEDI and have comparable image quality to those of COSMOS. Additionally, QSMnet maps reveal substantially better consistency across the multiple orientations than those from TKD or MEDI. As a preliminary application, the network was tested for two patients. The QSMnet maps showed similar lesion contrasts to those from MEDI, demonstrating potential for future applications.

Submitted 15 June, 2018; v1 submitted 15 March, 2018; originally announced March 2018.

Comments: This work is accepted in neuroimage on 8 June, 2018 and soon will be published. The pubmed link is https://www.ncbi.nlm.nih.gov/pubmed/29894829

Showing 1–29 of 29 results for author: Yoon, J