Search | arXiv e-print repository

BPMP-Tracker: A Versatile Aerial Target Tracker Using Bernstein Polynomial Motion Primitives

Authors: Yunwoo Lee, Jungwon Park, Boseong Jeon, Seungwoo Jung, H. Jin Kim

Abstract: This letter presents a versatile trajectory planning pipeline for aerial tracking. The proposed tracker is capable of handling various chasing settings such as complex unstructured environments, crowded dynamic obstacles and multiple-target following. Among the entire pipeline, we focus on developing a predictor for future target motion and a chasing trajectory planner. For rapid computation, we employ the sample-check-select strategy: modules sample a set of candidate movements, check multiple constraints, and then select the best trajectory. Also, we leverage the properties of Bernstein polynomials for quick calculations. The prediction module predicts the trajectories of the targets, which do not overlap with static and dynamic obstacles. Then the trajectory planner outputs a trajectory, ensuring various conditions such as occlusion and collision avoidance, the visibility of all targets within a camera image and dynamical limits. We fully test the proposed tracker in simulations and hardware experiments under challenging scenarios, including dual-target following, environments with dozens of dynamic obstacles and complex indoor and outdoor spaces.

Submitted 8 August, 2024; originally announced August 2024.

Comments: 8 pages, 9 figures
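
The sample-check-select strategy named in the abstract above can be illustrated with a minimal one-dimensional sketch. It is not the authors' planner. It assumes that candidates are Bernstein (Bezier) control-point lists on t in [0, 1], that the "check" step is a single position bound, and that the "select" step minimizes a hypothetical cost (distance of the end point to a goal). The check relies on the convex-hull property of Bernstein polynomials: the curve never leaves the range spanned by its control points, so comparing control points against the bounds is a sufficient, conservative test. The function names and the cost are illustrative assumptions.

```python
from math import comb


def bernstein_eval(control_points, t):
    """Evaluate a Bernstein-basis polynomial at t in [0, 1]."""
    n = len(control_points) - 1
    return sum(
        comb(n, i) * (1 - t) ** (n - i) * t ** i * c
        for i, c in enumerate(control_points)
    )


def within_bounds(control_points, lower, upper):
    """Conservative check via the convex-hull property.

    If every control point lies in [lower, upper], the whole curve does too.
    A candidate that fails this test may still be feasible; it is simply
    rejected without evaluating the curve.
    """
    return lower <= min(control_points) and max(control_points) <= upper


def select_best(candidates, lower, upper, goal):
    """Check every sampled candidate, then select the cheapest feasible one."""
    feasible = [c for c in candidates if within_bounds(c, lower, upper)]
    if not feasible:
        return None
    # Hypothetical cost: the curve ends at its last control point,
    # so use that point's distance to the goal.
    return min(feasible, key=lambda c: abs(c[-1] - goal))


# "Sample": three hand-written candidates stand in for a real sampler.
candidates = [
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 2.0, 5.0, 2.9],  # control point 5.0 violates the upper bound
    [0.0, 0.5, 1.0, 2.0],
]
best = select_best(candidates, lower=-1.0, upper=4.0, goal=3.0)
midpoint = bernstein_eval(best, 0.5)
```

Because the check only touches the control points, its cost per candidate is linear in the polynomial degree, which is one reason such a representation suits sampling many candidates.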

arXiv:2406.16994 [pdf, other]

Quantum Multi-Agent Reinforcement Learning for Cooperative Mobile Access in Space-Air-Ground Integrated Networks

Authors: Gyu Seon Kim, Yeryeong Cho, Jaehyun Chung, Soohyun Park, Soyi Jung, Zhu Han, Joongheon Kim

Abstract: Achieving global space-air-ground integrated network (SAGIN) access only with CubeSats presents significant challenges such as the access sustainability limitations in specific regions (e.g., polar regions) and the energy efficiency limitations in CubeSats. To tackle these problems, high-altitude long-endurance unmanned aerial vehicles (HALE-UAVs) can complement these CubeSat shortcomings by cooperatively providing global access sustainability and energy efficiency. However, as the number of CubeSats and HALE-UAVs increases, the scheduling dimension of each ground station (GS) increases. As a result, each GS can fall into the curse of dimensionality, and this challenge becomes one major hurdle for efficient global access. Therefore, this paper provides a quantum multi-agent reinforcement learning (QMARL)-based method for scheduling between GSs and CubeSats/HALE-UAVs in order to improve global access availability and energy efficiency. The main reason why the QMARL-based scheduler can be beneficial is that the algorithm facilitates a logarithmic-scale reduction in scheduling action dimensions, which is one critical feature as the number of CubeSats and HALE-UAVs expands. Additionally, individual GSs have different traffic demands depending on their locations and characteristics, so it is essential to provide differentiated access services. The superiority of the proposed scheduler is validated through data-intensive experiments in realistic CubeSat/HALE-UAV settings.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 17 pages, 22 figures

arXiv:2406.16322 [pdf, other]

doi 10.1016/j.compbiomed.2024.108746

Lesion-Aware Cross-Phase Attention Network for Renal Tumor Subtype Classification on Multi-Phase CT Scans

Authors: Kwang-Hyun Uhm, Seung-Won Jung, Sung-Hoo Hong, Sung-Jea Ko

Abstract: Multi-phase computed tomography (CT) has been widely used for the preoperative diagnosis of kidney cancer due to its non-invasive nature and ability to characterize renal lesions. However, since enhancement patterns of renal lesions across CT phases are different even for the same lesion type, the visual assessment by radiologists suffers from inter-observer variability in clinical practice. Although deep learning-based approaches have been recently explored for differential diagnosis of kidney cancer, they do not explicitly model the relationships between CT phases in the network design, limiting the diagnostic performance. In this paper, we propose a novel lesion-aware cross-phase attention network (LACPANet) that can effectively capture temporal dependencies of renal lesions across CT phases to accurately classify the lesions into five major pathological subtypes from time-series multi-phase CT images. We introduce a 3D inter-phase lesion-aware attention mechanism to learn effective 3D lesion features that are used to estimate attention weights describing the inter-phase relations of the enhancement patterns. We also present a multi-scale attention scheme to capture and aggregate temporal patterns of lesion features at different spatial scales for further improvement. Extensive experiments on multi-phase CT scans of kidney cancer patients from the collected dataset demonstrate that our LACPANet outperforms state-of-the-art approaches in diagnostic accuracy.

Submitted 24 June, 2024; originally announced June 2024.

Comments: This article has been accepted for publication in Computers in Biology and Medicine

Journal ref: Computers in Biology and Medicine, 108746, 2024

arXiv:2404.03991 [pdf, other]

Towards Efficient and Accurate CT Segmentation via Edge-Preserving Probabilistic Downsampling

Authors: Shahzad Ali, Yu Rim Lee, Soo Young Park, Won Young Tak, Soon Ki Jung

Abstract: Downsampling images and labels, often necessitated by limited resources or to expedite network training, leads to the loss of small objects and thin boundaries. This undermines the segmentation network's capacity to interpret images accurately and predict detailed labels, resulting in diminished performance compared to processing at original resolutions. This situation exemplifies the trade-off between efficiency and accuracy, with higher downsampling factors further impairing segmentation outcomes. Preserving information during downsampling is especially critical for medical image segmentation tasks. To tackle this challenge, we introduce a novel method named Edge-preserving Probabilistic Downsampling (EPD). It utilizes class uncertainty within a local window to produce soft labels, with the window size dictating the downsampling factor. This enables a network to produce quality predictions at low resolutions. Beyond preserving edge details more effectively than conventional nearest-neighbor downsampling, employing a similar algorithm for images, it surpasses bilinear interpolation in image downsampling, enhancing overall performance. Our method significantly improved Intersection over Union (IoU) by 2.85%, 8.65%, and 11.89% when downsampling data to 1/2, 1/4, and 1/8, respectively, compared to conventional interpolation methods.

Submitted 5 April, 2024; originally announced April 2024.

Comments: 5 pages (4 figures, 1 table); This work has been submitted to the IEEE Signal Processing Letters. Copyright may be transferred without notice, after which this version may no longer be accessible
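
The idea of producing soft labels from a local window, as described in the abstract above, can be illustrated with a minimal sketch. It is not the paper's EPD algorithm. It assumes non-overlapping square windows, an unweighted per-class fraction as the soft label, and a label map whose sides are divisible by the window size. The function name `soft_label_downsample` and the toy label map are hypothetical.

```python
from collections import Counter


def soft_label_downsample(labels, window, num_classes):
    """Downsample a 2D integer label map by `window` in each dimension.

    Each output cell holds the fraction of pixels of every class inside the
    corresponding window, so a thin structure that nearest-neighbor sampling
    would drop still contributes a non-zero probability.
    """
    height, width = len(labels), len(labels[0])
    if height % window or width % window:
        raise ValueError("label map size must be divisible by the window size")
    total = window * window
    out = []
    for top in range(0, height, window):
        row = []
        for left in range(0, width, window):
            counts = Counter(
                labels[y][x]
                for y in range(top, top + window)
                for x in range(left, left + window)
            )
            row.append([counts.get(c, 0) / total for c in range(num_classes)])
        out.append(row)
    return out


# Toy 4x4 label map with three classes; class 2 occupies a single pixel.
label_map = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 2],
]
soft = soft_label_downsample(label_map, window=2, num_classes=3)
```

In this toy case the single class-2 pixel survives as a probability of 0.25 in the bottom-right output cell, whereas nearest-neighbor sampling of the window's top-left pixel would label that cell as class 0.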

arXiv:2403.14154 [pdf, other]

LR-FHSS Transceiver for Direct-to-Satellite IoT Communications: Design, Implementation, and Verification

Authors: Sooyeob Jung, Seongah Jeong, Jinkyu Kang, Gyeongrae Im, Sangjae Lee, Mi-Kyung Oh, Joon Gyu Ryu, Joonhyuk Kang

Abstract: This paper proposes a long range-frequency hopping spread spectrum (LR-FHSS) transceiver design for the Direct-to-Satellite Internet of Things (DtS-IoT) communication system. The DtS-IoT system has recently attracted attention as a promising nonterrestrial network (NTN) solution to provide high-traffic and low-latency data transfer services to IoT devices in global coverage. In particular, this study provides guidelines for the overall DtS-IoT system architecture and design details that conform to the Long Range Wide-Area Network (LoRaWAN). Furthermore, we also detail various DtS-IoT use cases. Considering the multiple low-Earth orbit (LEO) satellites, we developed the LR-FHSS transceiver to improve system efficiency, which is the first attempt in real satellite communication systems using LR-FHSS. Moreover, as an extension of our previous work with perfect synchronization, we applied a robust synchronization scheme against the Doppler effect and co-channel interference (CCI) caused by LEO satellite channel environments, including signal detection for the simultaneous reception of numerous frequency hopping signals and an enhanced soft-output-Viterbi-algorithm (SOVA) for the header and payload receptions. Lastly, we present proof-of-concept implementation and testbeds using an application-specific integrated circuit (ASIC) chipset and a field-programmable gate array (FPGA) that verify the performance of the proposed LR-FHSS transceiver design of DtS-IoT communication systems. The laboratory test results reveal that the proposed LR-FHSS-based framework with the robust synchronization technique can provide wide coverage, seamless connectivity, and high throughput communication links for the realization of future sixth-generation (6G) networks.

Submitted 21 March, 2024; originally announced March 2024.

Comments: 17 pages, 23 figures

arXiv:2403.05093 [pdf, other]

Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile

Authors: Seokjun Lee, Seung-Won Jung, Hyunseok Seo

Abstract: Currently, image generation and synthesis have remarkably progressed with generative models. Despite photo-realistic results, intrinsic discrepancies are still observed in the frequency domain. The spectral discrepancy appears not only in generative adversarial networks but also in diffusion models. In this study, we propose a framework to effectively mitigate the disparity in the frequency domain of the generated images to improve the generative performance of both GAN and diffusion models. This is realized by spectrum translation for the refinement of image generation (STIG) based on contrastive learning. We adopt theoretical logic of frequency components in various generative networks. The key idea here is to refine the spectrum of the generated image via the concept of image-to-image translation and contrastive learning in terms of digital signal processing. We evaluate our framework across eight fake image datasets and various cutting-edge models to demonstrate the effectiveness of STIG. Our framework outperforms other cutting-edge methods, showing significant decreases in FID and log frequency distance of spectrum. We further emphasize that STIG improves image quality by decreasing the spectral anomaly. Additionally, validation results show that a frequency-based deepfake detector is confused more often when fake spectra are manipulated by STIG.

Submitted 8 March, 2024; originally announced March 2024.

Comments: Accepted to AAAI 2024

arXiv:2401.13921 [pdf, other]

Intelli-Z: Toward Intelligible Zero-Shot TTS

Authors: Sunghee Jung, Won Jang, Jaesam Yoon, Bongwan Kim

Abstract: Although numerous recent studies have suggested new frameworks for zero-shot TTS using large-scale, real-world data, studies that focus on the intelligibility of zero-shot TTS are relatively scarce. Zero-shot TTS demands additional efforts to ensure clear pronunciation and speech quality due to its inherent requirement of replacing a core parameter (speaker embedding or acoustic prompt) with a new one at the inference stage. In this study, we propose a zero-shot TTS model focused on intelligibility, which we refer to as Intelli-Z. Intelli-Z learns speaker embeddings by using multi-speaker TTS as its teacher and is trained with a cycle-consistency loss to include mismatched text-speech pairs for training. Additionally, it selectively aggregates speaker embeddings along the temporal dimension to minimize the interference of the text content of reference speech at the inference stage. We substantiate the effectiveness of the proposed methods with an ablation study. The Mean Opinion Score (MOS) increases by 9% for unseen speakers when the first two methods are applied, and it further improves by 16% when selective temporal aggregation is applied.

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2401.13146 [pdf, other]

Locality enhanced dynamic biasing and sampling strategies for contextual ASR

Authors: Md Asif Jalal, Pablo Peso Parada, George Pavlidis, Vasileios Moschopoulos, Karthikeyan Saravanan, Chrysovalantis-Giorgos Kontoulis, Jisi Zhang, Anastasios Drosou, Gil Ho Lee, Jungin Lee, Seokyeong Jung

Abstract: Automatic Speech Recognition (ASR) still faces challenges when recognizing time-variant rare phrases. Contextual biasing (CB) modules bias the ASR model towards such contextually relevant phrases. During training, a list of biasing phrases is selected from a large pool of phrases following a sampling strategy. In this work we firstly analyse different sampling strategies to provide insights into the training of CB for ASR with correlation plots between the bias embeddings among various training stages. Secondly, we introduce a neighbourhood attention (NA) that localizes self attention (SA) to the nearest neighbouring frames to further refine the CB output. The results show that this proposed approach provides on average a 25.84% relative WER improvement on LibriSpeech sets and rare-word evaluation compared to the baseline.

Submitted 23 January, 2024; originally announced January 2024.

Comments: Accepted for IEEE ASRU 2023

arXiv:2401.12085 [pdf, other]

Consistency Based Unsupervised Self-training For ASR Personalisation

Authors: Jisi Zhang, Vandana Rajan, Haaris Mehmood, David Tuckey, Pablo Peso Parada, Md Asif Jalal, Karthikeyan Saravanan, Gil Ho Lee, Jungin Lee, Seokyeong Jung

Abstract: On-device Automatic Speech Recognition (ASR) models trained on speech data of a large population might underperform for individuals unseen during training. This is due to a domain shift between user data and the original training data, which differ in the user's speaking characteristics and environmental acoustic conditions. ASR personalisation is a solution that aims to exploit user data to improve model robustness. The majority of ASR personalisation methods assume labelled user data for supervision. Personalisation without any labelled data is challenging due to limited data size and poor quality of recorded audio samples. This work addresses unsupervised personalisation by developing a novel consistency based training method via pseudo-labelling. Our method achieves a relative Word Error Rate Reduction (WERR) of 17.3% on unlabelled training data and 8.1% on held-out data compared to a pre-trained model, and outperforms the current state-of-the-art methods.

Submitted 22 January, 2024; originally announced January 2024.

Comments: Accepted for IEEE ASRU 2023

arXiv:2312.05548 [pdf, other]

doi 10.1109/JBHI.2022.3219123

A Unified Multi-Phase CT Synthesis and Classification Framework for Kidney Cancer Diagnosis with Incomplete Data

Authors: Kwang-Hyun Uhm, Seung-Won Jung, Moon Hyung Choi, Sung-Hoo Hong, Sung-Jea Ko

Abstract: Multi-phase CT is widely adopted for the diagnosis of kidney cancer due to the complementary information among phases. However, the complete set of multi-phase CT is often not available in practical clinical applications. In recent years, there have been some studies to generate the missing modality image from the available data. Nevertheless, the generated images are not guaranteed to be effective for the diagnosis task. In this paper, we propose a unified framework for kidney cancer diagnosis with incomplete multi-phase CT, which simultaneously recovers missing CT images and classifies cancer subtypes using the completed set of images. The advantage of our framework is that it encourages a synthesis model to explicitly learn to generate missing CT phases that are helpful for classifying cancer subtypes. We further incorporate lesion segmentation network into our framework to exploit lesion-level features for effective cancer classification in the whole CT volumes. The proposed framework is based on fully 3D convolutional neural networks to jointly optimize both synthesis and classification of 3D CT volumes. Extensive experiments on both in-house and external datasets demonstrate the effectiveness of our framework for the diagnosis with incomplete data compared with state-of-the-art baselines. In particular, cancer subtype classification using the completed CT data by our method achieves higher performance than the classification using the given incomplete data.

Submitted 9 December, 2023; originally announced December 2023.

Comments: This article has been accepted for publication in IEEE Journal of Biomedical and Health Informatics

Journal ref: JBHI, 2022

arXiv:2312.05528 [pdf, other]

Exploring 3D U-Net Training Configurations and Post-Processing Strategies for the MICCAI 2023 Kidney and Tumor Segmentation Challenge

Authors: Kwang-Hyun Uhm, Hyunjun Cho, Zhixin Xu, Seohoon Lim, Seung-Won Jung, Sung-Hoo Hong, Sung-Jea Ko

Abstract: In 2023, it is estimated that 81,800 kidney cancer cases will be newly diagnosed, and 14,890 people will die from this cancer in the United States. Preoperative dynamic contrast-enhanced abdominal computed tomography (CT) is often used for detecting lesions. However, there exists inter-observer variability due to subtle differences in the imaging features of kidney and kidney tumors. In this paper, we explore various 3D U-Net training configurations and effective post-processing strategies for accurate segmentation of kidneys, cysts, and kidney tumors in CT images. We validated our model on the dataset of the 2023 Kidney and Kidney Tumor Segmentation (KiTS23) challenge. Our method took second place in the final ranking of the KiTS23 challenge on unseen test data with an average Dice score of 0.820 and an average Surface Dice of 0.712.

Submitted 9 December, 2023; originally announced December 2023.

Comments: MICCAI 2023, KITS 2023 challenge 2nd place

arXiv:2312.01638 [pdf, other]

J-Net: Improved U-Net for Terahertz Image Super-Resolution

Authors: Woon-Ha Yeo, Seung-Hwan Jung, Seung Jae Oh, Inhee Maeng, Eui Su Lee, Han-Cheol Ryu

Abstract: Terahertz (THz) waves are electromagnetic waves in the 0.1 to 10 THz frequency range, and THz imaging is utilized in a range of applications, including security inspections, biomedical fields, and the non-destructive examination of materials. However, THz images have low resolution due to the long wavelength of THz waves. Therefore, improving the resolution of THz images is one of the current hot research topics. We propose a novel network architecture called J-Net, which is an improved version of U-Net, to solve the THz image super-resolution problem. It employs simple baseline blocks which can extract low-resolution (LR) image features and learn the mapping of LR images to high-resolution (HR) images efficiently. All training was conducted using the DIV2K+Flickr2K dataset, and we employed the peak signal-to-noise ratio (PSNR) for quantitative comparison. In our comparisons with other THz image super-resolution methods, J-Net achieved a PSNR of 32.52 dB, surpassing other techniques by more than 1 dB. J-Net also demonstrates superior performance on real THz images compared to other methods. Experiments show that the proposed J-Net achieves better PSNR and visual improvement compared with other THz image super-resolution methods.

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2311.15683 [pdf]

doi 10.1038/s41528-024-00315-1

Ultrasensitive Textile Strain Sensors Redefine Wearable Silent Speech Interfaces with High Machine Learning Efficiency

Authors: Chenyu Tang, Muzi Xu, Wentian Yi, Zibo Zhang, Edoardo Occhipinti, Chaoqun Dong, Dafydd Ravenscroft, Sung-Min Jung, Sanghyo Lee, Shuo Gao, Jong Min Kim, Luigi G. Occhipinti

Abstract: Our research presents a wearable Silent Speech Interface (SSI) technology that excels in device comfort, time-energy efficiency, and speech decoding accuracy for real-world use. We developed a biocompatible, durable textile choker with an embedded graphene-based strain sensor, capable of accurately detecting subtle throat movements. This sensor, surpassing other strain sensors in sensitivity by 420%, simplifies signal processing compared to traditional voice recognition methods. Our system uses a computationally efficient neural network, specifically a one-dimensional convolutional neural network with residual structures, to decode speech signals. This network is energy and time-efficient, reducing computational load by 90% while achieving 95.25% accuracy for a 20-word lexicon and swiftly adapting to new users and words with minimal samples. This innovation demonstrates a practical, sensitive, and precise wearable SSI suitable for daily communication applications.

Submitted 7 December, 2023; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: 5 figures in the article; 11 figures and 4 tables in supplementary information

Journal ref: npj Flexible Electronics (2024)

arXiv:2307.13343 [pdf, other]

On-Device Speaker Anonymization of Acoustic Embeddings for ASR based on Flexible Location Gradient Reversal Layer

Authors: Md Asif Jalal, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, Mete Ozay, Myoungji Han, Jung In Lee, Seokyeong Jung

Abstract: Smart devices serviced by large-scale AI models necessitate user data transfer to the cloud for inference. For speech applications, this means transferring private user information, e.g., speaker identity. Our paper proposes a privacy-enhancing framework that targets speaker identity anonymization while preserving speech recognition accuracy for our downstream task, Automatic Speech Recognition (ASR). The proposed framework attaches flexible gradient reversal based speaker adversarial layers to target layers within an ASR model, where speaker adversarial training anonymizes acoustic embeddings generated by the targeted layers to remove speaker identity. We propose on-device deployment by execution of initial layers of the ASR model, and transmitting anonymized embeddings to the cloud, where the rest of the model is executed while preserving privacy. Experimental results show that our method efficiently reduces speaker recognition relative accuracy by 33%, and improves ASR performance by achieving 6.2% relative Word Error Rate (WER) reduction.

Submitted 25 July, 2023; originally announced July 2023.

Comments: Proceedings of INTERSPEECH 2023

arXiv:2306.09382 [pdf, ps, other]

Sound Demixing Challenge 2023 Music Demixing Track Technical Report: TFC-TDF-UNet v3

Authors: Minseok Kim, Jun Hyung Lee, Soonyoung Jung

Abstract: In this report, we present our award-winning solutions for the Music Demixing Track of Sound Demixing Challenge 2023. First, we propose TFC-TDF-UNet v3, a time-efficient music source separation model that achieves state-of-the-art results on the MUSDB benchmark. We then give full details regarding our solutions for each Leaderboard, including a loss masking approach for noise-robust training. Code for reproducing model training and final submissions is available at github.com/kuielab/sdx23.

Submitted 21 July, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 5 pages, 4 tables

arXiv:2306.04137 [pdf, other]

Multi-Agent Reinforcement Learning for Cooperative Air Transportation Services in City-Wide Autonomous Urban Air Mobility

Authors: Chanyoung Park, Gyu Seon Kim, Soohyun Park, Soyi Jung, Joongheon Kim

Abstract: The development of urban-air-mobility (UAM) is rapidly progressing with spurs, and the demand for efficient transportation management systems is a rising need due to the multifaceted environmental uncertainties. Thus, this paper proposes a novel air transportation service management algorithm based on multi-agent deep reinforcement learning (MADRL) to address the challenges of multi-UAM cooperation. Specifically, the proposed algorithm in this paper is based on communication network (CommNet) method utilizing centralized training and distributed execution (CTDE) in multiple UAMs for providing efficient air transportation services to passengers collaboratively. Furthermore, this paper adopts actual vertiport maps and UAM specifications for constructing realistic air transportation networks. By evaluating the performance of the proposed algorithm in data-intensive simulations, the results show that the proposed algorithm outperforms existing approaches in terms of air transportation service quality. Furthermore, there are no inferior UAMs by utilizing parameter sharing in CommNet and a centralized critic network in CTDE. Therefore, it can be confirmed that the research results in this paper can provide a promising solution for autonomous air transportation management systems in city-wide urban areas.

Submitted 7 June, 2023; originally announced June 2023.

Comments: 15 pages, 14 figures

arXiv:2305.13779 [pdf, other]

Transceiver Design and Performance Analysis for LR-FHSS-based Direct-to-Satellite IoT

Authors: Sooyeob Jung, Seongah Jeong, Jinkyu Kang, Joon Gyu Ryu, Joonhyuk Kang

Abstract: This paper presents a novel transceiver design aimed at enabling Direct-to-Satellite Internet of Things (DtS-IoT) systems based on long range-frequency hopping spread spectrum (LR-FHSS). Our focus lies in developing an accurate transmission method through the analysis of the frame structure and key parameters outlined in Long Range Wide-Area Network (LoRaWAN) [1]. To address the Doppler effect in DtS-IoT networks and simultaneously receive numerous frequency hopping signals, a robust signal detector for the receiver is proposed. We verify the performance of the proposed LR-FHSS transceiver design through simulations conducted in a realistic satellite channel environment, assessing metrics such as miss detection probability and packet error probability.

Submitted 25 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: 5 pages, 6 figures

Report number: CL2023-1147

arXiv:2304.10839 [pdf, other]

doi 10.1109/TMI.2024.3405024

Cross-domain Denoising for Low-dose Multi-frame Spiral Computed Tomography

Authors: Yucheng Lu, Zhixin Xu, Moon Hyung Choi, Jimin Kim, Seung-Won Jung

Abstract: Computed tomography (CT) has been used worldwide as a non-invasive test to assist in diagnosis. However, the ionizing nature of X-ray exposure raises concerns about potential health risks such as cancer. The desire for lower radiation doses has driven researchers to improve reconstruction quality. Although previous studies on low-dose computed tomography (LDCT) denoising have demonstrated the effectiveness of learning-based methods, most were developed on the simulated data. However, the real-world scenario differs significantly from the simulation domain, especially when using the multi-slice spiral scanner geometry. This paper proposes a two-stage method for the commercially available multi-slice spiral CT scanners that better exploits the complete reconstruction pipeline for LDCT denoising across different domains. Our approach makes good use of the high redundancy of multi-slice projections and the volumetric reconstructions while leveraging the over-smoothing problem in conventional cascaded frameworks caused by aggressive denoising. The dedicated design also provides a more explicit interpretation of the data flow. Extensive experiments on various datasets showed that the proposed method could remove up to 70% of noise without compromised spatial resolution, and subjective evaluations by two experienced radiologists further supported its superior performance against state-of-the-art methods in clinical practice.

Submitted 28 June, 2024; v1 submitted 21 April, 2023; originally announced April 2023.

Journal ref: IEEE Transactions on Medical Imaging (2024)

arXiv:2304.05920 [pdf, other]

Learning to exploit z-Spatial Diversity for Coherent Nonlinear Optical Fiber Communication

Authors: Sebastian Jung, Tim Uhlemann, Alexander Span, Maximilian Bauhofer, Stephan ten Brink

Abstract: Higher-order solitons inherently possess a spatial periodicity along the propagation axis. The pulse expands and compresses in both the frequency and time domains. This property is exploited for a bandwidth-limited receiver by sampling the optical signal at two different distances. Numerical simulations show that when pure solitons are transmitted and the second (i.e., further propagated) signal is also processed, a significant gain in terms of required receiver bandwidth is obtained. Since all pulses propagating in a nonlinear optical fiber exhibit solitonic behavior given sufficient input power and propagation distance, the above concept can also be applied to spectrally efficient Nyquist pulse shaping and higher symbol rates. Transmitter and receiver are trainable structures as part of an autoencoder, aiming to learn a suitable predistortion and post-equalization using both signals to increase the spectral efficiency.

Submitted 12 April, 2023; originally announced April 2023.

arXiv:2302.14273 [pdf, other]

QP Chaser: Polynomial Trajectory Generation for Autonomous Aerial Tracking

Authors: Yunwoo Lee, Jungwon Park, Seungwoo Jung, Boseong Jeon, Dahyun Oh, H. Jin Kim

Abstract: Maintaining the visibility of the targets is one of the major objectives of aerial tracking applications. This paper proposes QP Chaser, a trajectory planning pipeline that can enhance the visibility of single- and dual-target in both static and dynamic environments. As the name suggests, the proposed planner generates a target-visible trajectory via quadratic programming problems. First, the predictor forecasts the reachable sets of moving objects with a sample-and-check strategy considering obstacles. Subsequently, the trajectory planner reinforces the visibility of targets with consideration of 1) path topology and 2) reachable sets of targets and obstacles. We define a target-visible region (TVR) with topology analysis of not only static obstacles but also dynamic obstacles, and it reflects reachable sets of moving targets and obstacles to maintain the whole body of the target within the camera image robustly and ceaselessly. The online performance of the proposed planner is validated in multiple scenarios, including high-fidelity simulations and real-world experiments.

Submitted 27 February, 2023; originally announced February 2023.

Comments: 15 pages, 13 figures

arXiv:2301.03815 [pdf, other]

Marine IoT Systems with Space-Air-Sea Integrated Networks: Hybrid LEO and UAV Edge Computing

Authors: Sooyeob Jung, Seongah Jeong, Jinkyu Kang, Joonhyuk Kang

Abstract: Marine Internet of Things (IoT) systems have grown substantially with the development of non-terrestrial networks (NTN) via aerial and space vehicles in the upcoming sixth-generation (6G), thereby assisting environment protection, military reconnaissance, and sea transportation. Due to unpredictable climate changes and the extreme channel conditions of maritime networks, however, it is challenging to efficiently and reliably collect and compute a huge amount of maritime data. In this paper, we propose a hybrid low-Earth orbit (LEO) and unmanned aerial vehicle (UAV) edge computing method in space-air-sea integrated networks for marine IoT systems. Specifically, two types of edge servers mounted on UAVs and LEO satellites are endowed with computational capabilities for the real-time utilization of a sizable amount of data collected from ocean IoT sensors. Our system aims at minimizing the total energy consumption of the battery-constrained UAV by jointly optimizing the bit allocation of communication and computation along with the UAV path planning under latency, energy budget and operational constraints. For availability and practicality, the proposed methods were developed for three different cases according to the accessibility of the LEO satellite, "Always On," "Always Off" and "Intermediate Disconnected", by leveraging successive convex approximation (SCA) strategies. Via numerical results, we verify that significant energy savings can be accrued for all cases of LEO accessibility by means of joint optimization of bit allocation and UAV path planning compared to partial optimization schemes that design for only the bit allocation or trajectory of the UAV.

Submitted 10 January, 2023; originally announced January 2023.

Comments: 12 pages, 8 figures, 3 tables, submission in IEEE IoT Journal

Report number: IoT-27450-2022

arXiv:2301.00124 [pdf, other]

Situation-Aware Deep Reinforcement Learning for Autonomous Nonlinear Mobility Control in Cyber-Physical Loitering Munition Systems

Authors: Hyunsoo Lee, Soohyun Park, Won Joon Yun, Soyi Jung, Joongheon Kim

Abstract: With the rapid development of drone technologies, drones are widely used in many applications including military domains. In this paper, a novel situation-aware DRL-based autonomous nonlinear drone mobility control algorithm is proposed for cyber-physical loitering munition applications. On the battlefield, the design of a DRL-based autonomous control algorithm is not straightforward because real-world data gathering is generally not available. Therefore, the approach in this paper is to construct a cyber-physical virtual environment with the Unity environment. Based on the virtual cyber-physical battlefield scenarios, a DRL-based automated nonlinear drone mobility control algorithm can be designed, evaluated, and visualized. Moreover, many obstacles exist, which are harmful to linear trajectory control in real-world battlefield scenarios. Thus, our proposed autonomous nonlinear drone mobility control algorithm utilizes situation-aware components that are implemented with a Raycast function in Unity virtual scenarios. Based on the gathered situation-aware information, the drone can autonomously and nonlinearly adjust its trajectory during flight. Therefore, this approach is obviously beneficial for avoiding obstacles in obstacle-deployed battlefields. Our visualization-based performance evaluation shows that the proposed algorithm is superior to the other linear mobility control algorithms.

Submitted 31 December, 2022; originally announced January 2023.

arXiv:2211.03502 [pdf, other]

Neural Architectural Nonlinear Pre-Processing for mmWave Radar-based Human Gesture Perception

Authors: Hankyul Baek, Yoo Jeong Ha, Minjae Yoo, Soyi Jung, Joongheon Kim

Abstract: In modern on-driving computing environments, many sensors are used for context-aware applications. This paper utilizes two deep learning models, U-Net and EfficientNet, which consist of a convolutional neural network (CNN), to detect hand gestures and remove noise in the Range Doppler Map image that was measured through a millimeter-wave (mmWave) radar. To improve the performance of classification, accurate pre-processing algorithms are essential. Therefore, a novel pre-processing approach to denoise images before entering the first deep learning model stage increases the accuracy of classification. Thus, this paper proposes a deep neural network based high-performance nonlinear pre-processing method.

Submitted 7 November, 2022; originally announced November 2022.

Comments: 4 pages, 7 figures

arXiv:2208.07639 [pdf, other]

RAWtoBit: A Fully End-to-end Camera ISP Network

Authors: Wooseok Jeong, Seung-Won Jung

Abstract: Image compression is an essential and last processing unit in the camera image signal processing (ISP) pipeline. While many studies have been made to replace the conventional ISP pipeline with a single end-to-end optimized deep learning model, image compression is barely considered as a part of the model. In this paper, we investigate the designing of a fully end-to-end optimized camera ISP incorporating image compression. To this end, we propose RAWtoBit network (RBN) that can effectively perform both tasks simultaneously. RBN is further improved with a novel knowledge distillation scheme by introducing two teacher networks specialized in each task. Extensive experiments demonstrate that our proposed method significantly outperforms alternative approaches in terms of rate-distortion trade-off.

Submitted 16 August, 2022; originally announced August 2022.

Comments: Accepted at ECCV2022

arXiv:2207.02515 [pdf, other]

doi 10.1007/978-3-031-06381-7_17

Lightweight Encoder-Decoder Architecture for Foot Ulcer Segmentation

Authors: Shahzad Ali, Arif Mahmood, Soon Ki Jung

Abstract: Continuous monitoring of foot ulcer healing is needed to ensure the efficacy of a given treatment and to avoid any possibility of deterioration. Foot ulcer segmentation is an essential step in wound diagnosis. We developed a model that is similar in spirit to the well-established encoder-decoder and residual convolution neural networks. Our model includes a residual connection along with a channel and spatial attention integrated within each convolution block. A simple patch-based approach for model training, test time augmentations, and majority voting on the obtained predictions resulted in superior performance. Our model did not leverage any readily available backbone architecture, pre-training on a similar external dataset, or any of the transfer learning techniques. The total number of network parameters being around 5 million made it a significantly lightweight model as compared with the available state-of-the-art models used for the foot ulcer segmentation task. Our experiments presented results at the patch-level and image-level. Applied on publicly available Foot Ulcer Segmentation (FUSeg) Challenge dataset from MICCAI 2021, our model achieved state-of-the-art image-level performance of 88.22% in terms of Dice similarity score and ranked second in the official challenge leaderboard. We also showed an extremely simple solution that could be compared against the more advanced architectures.

Submitted 6 July, 2022; originally announced July 2022.

Comments: Published version of this article is available at https://link.springer.com/chapter/10.1007/978-3-031-06381-7_17

Journal ref: Frontiers of Computer Vision. IW-FCV 2022. Communications in Computer and Information Science, vol 1578. Springer, Cham (2022)

arXiv:2204.00491 [pdf, other]

FrequencyLowCut Pooling -- Plug & Play against Catastrophic Overfitting

Authors: Julia Grabinski, Steffen Jung, Janis Keuper, Margret Keuper

Abstract: Over the last years, Convolutional Neural Networks (CNNs) have been the dominating neural architecture in a wide range of computer vision tasks. From an image and signal processing point of view, this success might be a bit surprising as the inherent spatial pyramid design of most CNNs is apparently violating basic signal processing laws, i.e. the Sampling Theorem, in their down-sampling operations. However, since poor sampling appeared not to affect model accuracy, this issue has been broadly neglected until model robustness started to receive more attention. Recent work [17] in the context of adversarial attacks and distribution shifts showed, after all, that there is a strong correlation between the vulnerability of CNNs and aliasing artifacts induced by poor down-sampling operations. This paper builds on these findings and introduces an aliasing-free down-sampling operation which can easily be plugged into any CNN architecture: FrequencyLowCut pooling. Our experiments show that, in combination with simple and fast FGSM adversarial training, our hyper-parameter free operator significantly improves model robustness and avoids catastrophic overfitting.

Submitted 20 September, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

Comments: accepted at ECCV 2022

arXiv:2203.16852 [pdf, other]

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

Authors: Dan Lim, Sunghee Jung, Eesung Kim

Abstract: In neural text-to-speech (TTS), two-stage systems or cascades of separately learned models have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text to a mel-spectrogram and then HiFi-GAN generates a raw waveform from a mel-spectrogram, where they are called an acoustic feature generator and a neural vocoder respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model is a jointly trained FastSpeech2 and HiFi-GAN with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning. Furthermore, we remove dependency on an external speech-text alignment tool by adopting an alignment learning objective in our joint training framework. Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS on subjective evaluation (MOS) and some objective evaluations.

Submitted 1 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2202.10456 [pdf, other]

Feasibility Study of Multi-Site Split Learning for Privacy-Preserving Medical Systems under Data Imbalance Constraints in COVID-19, X-Ray, and Cholesterol Dataset

Authors: Yoo Jeong Ha, Gusang Lee, Minjae Yoo, Soyi Jung, Seehwan Yoo, Joongheon Kim

Abstract: It seems as though progressively more people are in the race to upload content, data, and information online; and hospitals haven't neglected this trend either. Hospitals are now at the forefront for multi-site medical data sharing to provide groundbreaking advancements in the way health records are shared and patients are diagnosed. Sharing of medical data is essential in modern medical research. Yet, as with all data sharing technology, the challenge is to balance improved treatment with protecting patients' personal information. This paper provides a novel split learning algorithm, termed "multi-site split learning", which enables a secure transfer of medical data between multiple hospitals without fear of exposing personal data contained in patient records. It also explores the effects of varying the number of end-systems and the ratio of data-imbalance on the deep learning performance. A guideline for the optimal configuration of split learning that ensures privacy of patient data whilst achieving performance is empirically given. We argue the benefits of our multi-site split learning algorithm, especially regarding the privacy preserving factor, using CT scans of COVID-19 patients, X-ray bone scans, and cholesterol level medical data.

Submitted 20 February, 2022; originally announced February 2022.

arXiv:2201.05843 [pdf, other]

doi 10.1109/TII.2022.3143175

Cooperative Multi-Agent Deep Reinforcement Learning for Reliable Surveillance via Autonomous Multi-UAV Control

Authors: Won Joon Yun, Soohyun Park, Joongheon Kim, MyungJae Shin, Soyi Jung, David A. Mohaisen, Jae-Hyun Kim

Abstract: CCTV-based surveillance using unmanned aerial vehicles (UAVs) is considered a key technology for security in smart city environments. This paper creates a case where the UAVs with CCTV-cameras fly over the city area for flexible and reliable surveillance services. UAVs should be deployed to cover a large area while minimizing overlapping and shadow areas for a reliable surveillance system. However, the operation of UAVs is subject to high uncertainty, necessitating autonomous recovery systems. This work develops a multi-agent deep reinforcement learning-based management scheme for reliable industry surveillance in smart city applications. The core idea this paper employs is autonomously replenishing the UAV's deficient network requirements with communications. Via intensive simulations, our proposed algorithm outperforms the state-of-the-art algorithms in terms of surveillance coverage, user support capability, and computational costs.

Submitted 15 January, 2022; originally announced January 2022.

Comments: 10 pages, 6 figures, Accepted for publication in IEEE Transactions on Industrial Informatics (TII)

arXiv:2111.13321 [pdf, other]

Learning source-aware representations of music in a discrete latent space

Authors: Jinsung Kim, Yeong-Seok Jeong, Woosung Choi, Jaehwa Chung, Soonyoung Jung

Abstract: In recent years, neural network based methods have been proposed as a method that can generate representations from music, but they are not human readable and hardly analyzable or editable by a human. To address this issue, we propose a novel method to learn source-aware latent representations of music through Vector-Quantized Variational Auto-Encoder (VQ-VAE). We train our VQ-VAE to encode an input mixture into a tensor of integers in a discrete latent space, and design them to have a decomposed structure which allows humans to manipulate the latent vector in a source-aware manner. This paper also shows that we can generate basslines by estimating latent vectors in a discrete space.

Submitted 26 November, 2021; originally announced November 2021.

Comments: MDX Workshop @ ISMIR 2021, 7 pages, 2 figures

arXiv:2111.12516 [pdf, other]

LightSAFT: Lightweight Latent Source Aware Frequency Transform for Source Separation

Authors: Yeong-Seok Jeong, Jinsung Kim, Woosung Choi, Jaehwa Chung, Soonyoung Jung

Abstract: Conditioned source separations have attracted significant attention because of their flexibility, applicability and extensionality. Their performance was usually inferior to the existing approaches, such as the single source separation model. However, a recently proposed method called LaSAFT-Net has shown that conditioned models can show comparable performance against existing single-source separation models. This paper presents LightSAFT-Net, a lightweight version of LaSAFT-Net. As a baseline, it provided a sufficient SDR performance for comparison during the Music Demixing Challenge at ISMIR 2021. This paper also enhances the existing LightSAFT-Net by replacing the LightSAFT blocks in the encoder with TFC-TDF blocks. Our enhanced LightSAFT-Net outperforms the previous one with fewer parameters.

Submitted 26 January, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: MDX Workshop @ ISMIR 2021, 6 pages, 1 figure

arXiv:2111.12203 [pdf, other]

KUIELab-MDX-Net: A Two-Stream Neural Network for Music Demixing

Authors: Minseok Kim, Woosung Choi, Jaehwa Chung, Daewon Lee, Soonyoung Jung

Abstract: Recently, many methods based on deep learning have been proposed for music source separation. Some state-of-the-art methods have shown that stacking many layers with many skip connections improves the SDR performance. Although such a deep and complex architecture shows outstanding performance, it usually requires numerous computing resources and time for training and evaluation. This paper proposes a two-stream neural network for music demixing, called KUIELab-MDX-Net, which shows a good balance of performance and required resources. The proposed model has a time-frequency branch and a time-domain branch, where each branch separates stems, respectively. It blends results from two streams to generate the final estimation. KUIELab-MDX-Net took second place on leaderboard A and third place on leaderboard B in the Music Demixing Challenge at ISMIR 2021. This paper also summarizes experimental results on another benchmark, MUSDB18. Our source code is available online.

Submitted 23 November, 2021; originally announced November 2021.

Comments: MDX Workshop @ ISMIR 2021, 7 pages, 3 figures

arXiv:2110.08796 [pdf, other]

Stable Marriage Matching for Traffic-Aware Space-Air-Ground Integrated Networks: A Gale-Shapley Algorithmic Approach

Authors: Hyunsoo Lee, Haemin Lee, Soyi Jung, Joongheon Kim

Abstract: In keeping with the rapid development of communication technology, a new communication structure is required in a next-generation communication system. In particular, research using High Altitude Platform (HAP) or Unmanned Aerial Vehicle (UAV) in existing terrestrial networks is active. In this paper, we propose matching HAP and UAV using the Gale-Shapley algorithm in a relay communication situation. The numerical simulation results demonstrate that applying the Gale-Shapley algorithm shows superior performance compared to random matching.

Submitted 17 October, 2021; originally announced October 2021.

arXiv:2108.10147 [pdf, other]

Spatio-Temporal Split Learning for Privacy-Preserving Medical Platforms: Case Studies with COVID-19 CT, X-Ray, and Cholesterol Data

Authors: Yoo Jeong Ha, Minjae Yoo, Gusang Lee, Soyi Jung, Sae Won Choi, Joongheon Kim, Seehwan Yoo

Abstract: Machine learning requires a large volume of sample data, especially when it is used in high-accuracy medical applications. However, patient records are among the most sensitive private information and are not usually shared among institutes. This paper presents spatio-temporal split learning, a distributed deep neural network framework, which is a turning point in allowing collaboration among privacy-sensitive organizations. Our spatio-temporal split learning presents how distributed machine learning can be efficiently conducted with minimal privacy concerns. The proposed split learning consists of a number of clients and a centralized server. Each client has only one hidden layer, which acts as the privacy-preserving layer, and the centralized server comprises the other hidden layers and the output layer. Since the centralized server does not need to access the training data and trains the deep neural network with parameters received from the privacy-preserving layer, privacy of original data is guaranteed. We have coined the term, spatio-temporal split learning, as multiple clients are spatially distributed to cover diverse datasets from different participants, and we can temporally split the learning process, detaching the privacy preserving layer from the rest of the learning process to minimize privacy breaches. This paper shows how we can analyze the medical data whilst ensuring privacy using our proposed multi-site spatio-temporal split learning algorithm on Coronavirus Disease-19 (COVID-19) chest Computed Tomography (CT) scans, MUsculoskeletal RAdiographs (MURA) X-ray images, and cholesterol levels.

Submitted 20 August, 2021; originally announced August 2021.

arXiv:2108.00626 [pdf, ps, other]

Quantum Scheduling for Millimeter-Wave Observation Satellite Constellation

Authors: Joongheon Kim, Yunseok Kwak, Soyi Jung, Jae-Hyun Kim

Abstract: In beyond 5G and 6G network scenarios, the use of satellites has been actively discussed for extending target monitoring areas, even for extreme circumstances, where the monitoring functionalities can be realized due to the usage of millimeter-wave wireless links. This paper designs an efficient scheduling algorithm which minimizes overlapping monitoring areas among observation satellite constellations. In order to achieve this objective, a quantum optimization based algorithm is used because the overlapping can be mathematically modelled via a max-weight independent set (MWIS) problem, which is one of the well-known NP-hard problems.

Submitted 2 August, 2021; originally announced August 2021.

arXiv:2107.11790 [pdf, other]

Distributed and Autonomous Aerial Data Collection in Smart City Surveillance Applications

Authors: Haemin Lee, Soyi Jung, Joongheon Kim

Abstract: The massive growth of Smart City and Internet of Things applications enables safety and security. The data produced by surveillance cameras on aerial devices such as unmanned aerial vehicles (UAVs) need to be transferred to ground stations for secure data analysis. When the scale of the network is relatively large compared to the wireless communication coverage of a device, it is not always possible to transmit the data to the ground stations, thus distributed and autonomous algorithms are essentially desired. Based on these needs, we propose a novel algorithm for collecting surveillance data under the consideration of mobility and flexibility of UAV networks. Due to the battery limitation in UAVs, we selectively collect data from the UAVs by setting rules under the consideration of distance and similarity. As a consequence, the UAV devices have to compete for a chance to get data processing. For this purpose, this paper designs a Myerson auction-based deep learning algorithm to raise the UAV's revenue compared to a traditional second-price auction while preserving truthfulness. Based on simulation results, we verify that our proposed algorithm achieves desired performance improvements.

Submitted 25 July, 2021; originally announced July 2021.

arXiv:2106.14844 [pdf, other]

doi 10.1109/TIP.2022.3155948

Progressive Joint Low-light Enhancement and Noise Removal for Raw Images

Authors: Yucheng Lu, Seung-Won Jung

Abstract: Low-light imaging on mobile devices is typically challenging due to insufficient incident light coming through the relatively small aperture, resulting in a low signal-to-noise ratio. Most of the previous works on low-light image processing focus either only on a single task such as illumination adjustment, color enhancement, or noise removal; or on a joint illumination adjustment and denoising task that heavily relies on short-long exposure image pairs collected from specific camera models, and thus these approaches are less practical and generalizable in real-world settings where camera-specific joint enhancement and restoration is required. To tackle this problem, in this paper, we propose a low-light image processing framework that performs joint illumination adjustment, color enhancement, and denoising. Considering the difficulty in model-specific data collection and the ultra-high definition of the captured images, we design two branches: a coefficient estimation branch as well as a joint enhancement and denoising branch. The coefficient estimation branch works in a low-resolution space and predicts the coefficients for enhancement via bilateral learning, whereas the joint enhancement and denoising branch works in a full-resolution space and progressively performs joint enhancement and denoising. In contrast to existing methods, our framework does not need to recollect massive data when being adapted to another camera model, which significantly reduces the efforts required to fine-tune our approach for practical usage.
Through extensive experiments, we demonstrate its great potential in real-world low-light imaging applications when compared with current state-of-the-art methods.

Submitted 2 September, 2022; v1 submitted 28 June, 2021; originally announced June 2021.

arXiv:2104.13553 [pdf, other]

AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries

Authors: Woosung Choi, Minseok Kim, Marco A. Martínez Ramírez, Jaehwa Chung, Soonyoung Jung

Abstract: This paper proposes a neural network that performs audio transformations to user-specified sources (e.g., vocals) of a given audio track according to a given description while preserving other sources not mentioned in the description. Audio Manipulation on a Specific Source (AMSS) is challenging because a sound object (i.e., a waveform sample or frequency bin) is 'transparent'; it usually carries information from multiple sources, in contrast to a pixel in an image. To address this challenging problem, we propose AMSS-Net, which extracts latent sources and selectively manipulates them while preserving irrelevant sources. We also propose an evaluation benchmark for several AMSS tasks, and we show that AMSS-Net outperforms baselines on several AMSS tasks via objective metrics and empirical verification.

Submitted 27 April, 2021; originally announced April 2021.

Comments: 10 pages, 8 figures, 3 tables, under review at ACMMM 21

arXiv:2010.11631 [pdf, other]

LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation

Authors: Woosung Choi, Minseok Kim, Jaehwa Chung, Soonyoung Jung

Abstract: Recent deep-learning approaches have shown that Frequency Transformation (FT) blocks can significantly improve spectrogram-based single-source separation models by capturing frequency patterns. The goal of this paper is to extend the FT block to fit the multi-source task. We propose the Latent Source Attentive Frequency Transformation (LaSAFT) block to capture source-dependent frequency patterns. We also propose the Gated Point-wise Convolutional Modulation (GPoCM), an extension of Feature-wise Linear Modulation (FiLM), to modulate internal features. By employing these two novel methods, we extend the Conditioned-U-Net (CUNet) for multi-source separation, and the experimental results indicate that our LaSAFT and GPoCM can improve the CUNet's performance, achieving state-of-the-art SDR performance on several MUSDB18 source separation tasks.

Submitted 14 April, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

Comments: 5 pages, 3 figures, 2 tables. accepted to ICASSP 2021

arXiv:2009.05210 [pdf]

A 6.3-Nanowatt-per-Channel 96-Channel Neural Spike Processor for a Movement-Intention-Decoding Brain-Computer-Interface Implant

Authors: Zhewei Jiang, Jiangyi Li, Pavan K. Chundi, Sung Justin Kim, Minhao Yang, Joonseong Kang, Seungchul Jung, Sang Joon Kim, Mingoo Seok

Abstract: This paper presents microwatt end-to-end neural signal processing hardware for deployment-stage real-time upper-limb movement intent decoding. This module features intercellular spike detection, sorting, and decoding operations for a 96-channel prosthetic implant. We design the algorithms for those operations to achieve minimal computation complexity while matching or advancing the accuracy of state-of-the-art Brain-Computer-Interface sorting and movement decoding. Based on those algorithms, we devise the architecture of the neural signal processing hardware with a focus on hardware reuse and event-driven operation. The design achieves among the highest levels of integration, reducing wireless data rate by more than four orders of magnitude. The chip prototype in 180-nm high-VTH achieves the lowest power dissipation of 0.61 uW for 96 channels, 21X lower than the prior art at a comparable or better accuracy, even with integration of kinematic state estimation computation.

Submitted 10 September, 2020; originally announced September 2020.

arXiv:2008.06208 [pdf]

Adaptable Multi-Domain Language Model for Transformer ASR

Authors: Taewoo Lee, Min-Joong Lee, Tae Gyoon Kang, Seokyeoung Jung, Minseok Kwon, Yeona Hong, Jungin Lee, Kyoung-Gu Woo, Ho-Gyeong Kim, Jiseung Jeong, Jihyun Lee, Hosik Lee, Young Sang Choi

Abstract: We propose an adapter-based multi-domain Transformer-based language model (LM) for Transformer ASR. The model consists of a large common LM and small adapters. The model can perform multi-domain adaptation with only the small adapters and their related layers. The proposed model can reuse the fully fine-tuned LM, which is fine-tuned using all layers of an original model. The proposed LM can be expanded to new domains by adding about 2% of parameters for the first domain and 13% of parameters from the second domain onward. The proposed model is also effective in reducing the model maintenance cost because it is possible to omit the costly and time-consuming common LM pre-training process. Using the proposed adapter-based approach, we observed that a general LM with an adapter can outperform a dedicated music domain LM in terms of word error rate (WER).

Submitted 10 February, 2021; v1 submitted 14 August, 2020; originally announced August 2020.

Comments: This paper is accepted for presentation at IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP), 2021

arXiv:2006.06942 [pdf]

Domain-adversarial training of multi-speaker TTS

Authors: Sunghee Jung, Hoirin Kim

Abstract: Multi-speaker TTS has to learn both linguistic embedding and text embedding to generate speech of desired linguistic content in a desired voice. However, it is unclear which characteristics of speech result from the speaker and which from the linguistic content. In this paper, text embedding is forced to unlearn speaker-dependent characteristics using a gradient reversal layer to an auxiliary speaker classifier that we introduce. We train the speaker classifier using angular margin softmax loss. In subjective evaluation, it is shown that the adversarial training of text embedding for unilingual multi-speaker TTS results in 39.9% improvement on similarity MOS and 40.1% improvement on naturalness MOS.

Submitted 12 June, 2020; originally announced June 2020.

arXiv:2006.06940 [pdf]

Neural voice cloning with a few low-quality samples

Authors: Sunghee Jung, Hoirin Kim

Abstract: In this paper, we explore the possibility of speech synthesis from low-quality found data using only a limited number of samples of the target speaker. We try to extract only the speaker embedding from found data of the target speaker, unlike previous works which try to train the entire text-to-speech system on found data. Also, the two speaker mimicking approaches, adaptation-based and speaker-encoder-based, are applied on the newly released LibriTTS dataset and the previously released VCTK corpus to examine the impact of speaker variety on clarity and target-speaker-similarity.

Submitted 12 June, 2020; originally announced June 2020.

arXiv:2006.06937 [pdf]

Non-parallel voice conversion based on source-to-target direct mapping

Authors: Sunghee Jung, Youngjoo Suh, Yeunju Choi, Hoirin Kim

Abstract: Recent works utilizing phonetic posteriorgrams (PPGs) for non-parallel voice conversion have significantly increased the usability of voice conversion since the source and target DBs are no longer required to have matching contents. In this approach, the PPGs are used as the linguistic bridge between source and target speaker features. However, this PPG-based non-parallel voice conversion has the limitation that it needs two cascading networks at conversion time, making it less suitable for real-time applications and vulnerable to source speaker intelligibility at the conversion stage. To address this limitation, we propose a new non-parallel voice conversion technique that employs a single neural network for direct source-to-target voice parameter mapping. With this single network structure, the proposed approach can reduce both conversion time and number of network parameters, which can be especially important factors in embedded or real-time environments. Additionally, it improves the quality of voice conversion by skipping the phone recognizer at the conversion stage. It can effectively prevent the possible loss of phonetic information that the PPG-based indirect method suffers. Experiments show that our approach reduces the number of network parameters and conversion time by 41.9% and 44.5%, respectively, with improved voice similarity over the original PPG-based method.

Submitted 12 June, 2020; originally announced June 2020.

Comments: Submitted to Interspeech 2019

arXiv:2005.10456 [pdf]

Pitchtron: Towards audiobook generation from ordinary people's voices

Authors: Sunghee Jung, Hoirin Kim

Abstract: In this paper, we explore prosody transfer for audiobook generation under rather realistic condition where training DB is plain audio mostly from multiple ordinary people and reference audio given during inference is from professional and richer in prosody than training DB. To be specific, we explore transferring Korean dialects and emotive speech even though training set is mostly composed of standard and neutral Korean. We found that under this setting, original global style token method generates undesirable glitches in pitch, energy and pause length. To deal with this issue, we propose two models, hard and soft pitchtron and release the toolkit and corpus that we have developed. Hard pitchtron uses pitch as input to the decoder while soft pitchtron uses pitch as input to the prosody encoder. We verify the effectiveness of proposed models with objective and subjective tests. AXY score over GST is 2.01 and 1.14 for hard pitchtron and soft pitchtron respectively.

Submitted 21 May, 2020; originally announced May 2020.

arXiv:2003.11982 [pdf, ps, other]

doi 10.21437/Interspeech.2020-1064

In defence of metric learning for speaker recognition

Authors: Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, Icksang Han

Abstract: The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of most popular loss functions for speaker recognition on the VoxCeleb dataset. We demonstrate that the vanilla triplet loss shows competitive performance compared to classification-based losses, and those trained with our proposed metric learning objective outperform state-of-the-art methods.

Submitted 24 April, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

Comments: The code can be found at https://github.com/clovaai/voxceleb_trainer

arXiv:2001.00577 [pdf, other]

Attention based on-device streaming speech recognition with large speech corpus

Authors: Kwangyoun Kim, Kyungmin Lee, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim

Abstract: In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained a word recognition rate of around 90% for the general domain mainly by using joint training of connectionist temporal classifier (CTC) and cross entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training and data augmentation methods. In addition, we compressed our models by more than 3.4 times using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization to bring down the final model size to lower than 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models, and we could achieve a relative 36% improvement on average in word error rate (WER) for target domains including the general domain.

Submitted 1 January, 2020; originally announced January 2020.

Comments: Accepted and presented at the ASRU 2019 conference

arXiv:1912.02591 [pdf, other]

Investigating U-Nets with various Intermediate Blocks for Spectrogram-based Singing Voice Separation

Authors: Woosung Choi, Minseok Kim, Jaehwa Chung, Daewon Lee, Soonyoung Jung

Abstract: Singing Voice Separation (SVS) tries to separate singing voice from a given mixed musical signal. Recently, many U-Net-based models have been proposed for the SVS task, but there were no existing works that evaluate and compare various types of intermediate blocks that can be used in the U-Net architecture. In this paper, we introduce a variety of intermediate spectrogram transformation blocks. We implement U-nets based on these blocks and train them on complex-valued spectrograms to consider both magnitude and phase. These networks are then compared on the SDR metric. When using a particular block composed of convolutional and fully-connected layers, it achieves state-of-the-art SDR on the MUSDB singing voice separation task by a large margin of 0.9 dB. Our code and models are available online.

Submitted 8 October, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

Comments: 8 pages 4 tables 6 figures, accepted to ISMIR 2020

arXiv:1905.11172 [pdf, other]

GRDN:Grouped Residual Dense Network for Real Image Denoising and GAN-based Real-world Noise Modeling

Authors: Dong-Wook Kim, Jae Ryun Chung, Seung-Won Jung

Abstract: Recent research on image denoising has progressed with the development of deep learning architectures, especially convolutional neural networks. However, real-world image denoising is still very challenging because it is not possible to obtain ideal pairs of ground-truth images and real-world noisy images. Owing to the recent release of benchmark datasets, the interest of the image denoising community is now moving toward the real-world denoising problem. In this paper, we propose a grouped residual dense network (GRDN), which is an extended and generalized architecture of the state-of-the-art residual dense network (RDN). The core part of RDN is defined as grouped residual dense block (GRDB) and used as a building module of GRDN. We experimentally show that the image denoising performance can be significantly improved by cascading GRDBs. In addition to the network architecture design, we also develop a new generative adversarial network-based real-world noise modeling method. We demonstrate the superiority of the proposed methods by achieving the highest score in terms of both the peak signal-to-noise ratio and the structural similarity in the NTIRE2019 Real Image Denoising Challenge - Track 2:sRGB.

Submitted 27 May, 2019; originally announced May 2019.

Comments: To appear in CVPR 2019 workshop. Winner of the NTIRE2019 Challenge on Image Denoising, Track 2: sRGB

arXiv:1701.06811 [pdf, other]

Socio-technical Smart Grid Optimization via Decentralized Charge Control of Electric Vehicles

Authors: Evangelos Pournaras, Seoho Jung, Srivatsan Yadhunathan, Huiting Zhang, Xingliang Fang

Abstract: The penetration of electric vehicles becomes a catalyst for the sustainability of Smart Cities. However, unregulated battery charging remains a challenge causing high energy costs, power peaks or even blackouts. This paper studies this challenge from a socio-technical perspective: social dynamics such as the participation in demand-response programs, the discomfort experienced by alternative suggested vehicle usage times and even the fairness in terms of how equally discomfort is experienced among the population are highly intertwined with Smart Grid reliability. To address challenges of such a socio-technical nature, this paper introduces a fully decentralized and participatory learning mechanism for privacy-preserving coordinated charging control of electric vehicles that regulates three Smart Grid socio-technical aspects: (i) reliability, (ii) discomfort and (iii) fairness. In contrast to related work, a novel autonomous software agent exclusively uses local knowledge to generate energy demand plans for its vehicle that encode different battery charging regimes. Agents interact to learn and make collective decisions of which plan to execute so that power peaks and energy cost are reduced system-wide. Evaluation with real-world data confirms the improvement of drivers' comfort and fairness using the proposed planning method, while this improvement is assessed in terms of reliability and cost reduction under a varying number of participating vehicles.
These findings have significant relevance and impact for power utilities and system operators in designing more reliable and socially responsible Smart Grids with high penetration of electric vehicles.

Submitted 21 May, 2019; v1 submitted 24 January, 2017; originally announced January 2017.

Showing 1–50 of 51 results for author: Jung, S