Search | arXiv e-print repository

arXiv:2407.18892 [pdf, other]

SHANGUS: Deep Reinforcement Learning Meets Heuristic Optimization for Speedy Frontier-Based Exploration of Autonomous Vehicles in Unknown Spaces

Authors: Seunghyeop Nam, Tuan Anh Nguyen, Eunmi Choi, Dugki Min

Abstract: This paper introduces SHANGUS, an advanced framework combining Deep Reinforcement Learning (DRL) with heuristic optimization to improve frontier-based exploration efficiency in unknown environments, particularly for intelligent vehicles in autonomous air services, search and rescue operations, and space exploration robotics. SHANGUS harnesses DRL's adaptability and heuristic prioritization, marked… ▽ More This paper introduces SHANGUS, an advanced framework combining Deep Reinforcement Learning (DRL) with heuristic optimization to improve frontier-based exploration efficiency in unknown environments, particularly for intelligent vehicles in autonomous air services, search and rescue operations, and space exploration robotics. SHANGUS harnesses DRL's adaptability and heuristic prioritization, markedly enhancing exploration efficiency, reducing completion time, and minimizing travel distance. The strategy involves a frontier selection node to identify unexplored areas and a DRL navigation node using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for robust path planning and dynamic obstacle avoidance. Extensive experiments in ROS2 and Gazebo simulation environments show SHANGUS surpasses representative traditional methods like the Nearest Frontier (NF), Novel Frontier-Based Exploration Algorithm (CFE), and Goal-Driven Autonomous Exploration (GDAE) algorithms, especially in complex scenarios, excelling in completion time, travel distance, and exploration rate. This scalable solution is suitable for real-time autonomous navigation in fields such as industrial automation, autonomous driving, household robotics, and space exploration. Future research will integrate additional sensory inputs and refine heuristic functions to further boost SHANGUS's efficiency and robustness. △ Less

Submitted 26 July, 2024; originally announced July 2024.

arXiv:2406.15725 [pdf, other]

Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding Boxes

Authors: Hyeonuk Nam, Deokki Min, Seungdeok Choi, Inhan Choi, Yong-Hwa Park

Abstract: To tackle sound event detection (SED) task, we propose frequency dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency warping and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of 3 branches: audio teacher-student transformer (ATST) branch, BEATs branch and CNN branch including either partial… ▽ More To tackle sound event detection (SED) task, we propose frequency dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency warping and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of 3 branches: audio teacher-student transformer (ATST) branch, BEATs branch and CNN branch including either partial dilated frequency dynamic convolution (PDFD) or squeeze-and-Excitation (SE) with time-frame frequency-wise SE (tfwSE). To train MAESTRO labels with coarse temporal resolution, we apply max pooling on prediction for the MAESTRO dataset. Using best ensemble model, we apply self training to obtain pseudo label from DESED weak set, DESED unlabeled set and AudioSet. AudioSet labels are filtered to focus on high-confidence pseudo labels and AudioSet pseudo labels are used to train on DESED labels only. We used change-detection-based sound event bounding boxes (cSEBBs) as post processing for ensemble models on self training and submission models. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: DCASE 2024 Challenge Task 4 technical report

arXiv:2406.05341 [pdf, other]

Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection

Authors: Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Junhyeok Lee, Yong-Hwa Park

Abstract: Frequency dynamic convolution (FDY conv) has shown the state-of-the-art performance in sound event detection (SED) using frequency-adaptive kernels obtained by frequency-varying combination of basis kernels. However, FDY conv lacks an explicit mean to diversify frequency-adaptive kernels, potentially limiting the performance. In addition, size of basis kernels is limited while time-frequency patte… ▽ More Frequency dynamic convolution (FDY conv) has shown the state-of-the-art performance in sound event detection (SED) using frequency-adaptive kernels obtained by frequency-varying combination of basis kernels. However, FDY conv lacks an explicit mean to diversify frequency-adaptive kernels, potentially limiting the performance. In addition, size of basis kernels is limited while time-frequency patterns span larger spectro-temporal range. Therefore, we propose dilated frequency dynamic convolution (DFD conv) which diversifies and expands frequency-adaptive kernels by introducing different dilation sizes to basis kernels. Experiments showed advantages of varying dilation sizes along frequency dimension, and analysis on attention weight variance proved dilated basis kernels are effectively diversified. By adapting class-wise median filter with intersection-based F1 score, proposed DFD-CRNN outperforms FDY-CRNN by 3.12% in terms of polyphonic sound detection score (PSDS). △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024

arXiv:2306.11427 [pdf]

Auditory Neural Response Inspired Sound Event Detection Based on Spectro-temporal Receptive Field

Authors: Deokki Min, Hyeonuk Nam, Yong-Hwa Park

Abstract: Sound event detection (SED) is one of tasks to automate function by human auditory system which listens and understands auditory scenes. Therefore, we were inspired to make SED recognize sound events in the way human auditory system does. Spectro-temporal receptive field (STRF), an approach to describe the relationship between perceived sound at ear and transformed neural response in the auditory… ▽ More Sound event detection (SED) is one of tasks to automate function by human auditory system which listens and understands auditory scenes. Therefore, we were inspired to make SED recognize sound events in the way human auditory system does. Spectro-temporal receptive field (STRF), an approach to describe the relationship between perceived sound at ear and transformed neural response in the auditory cortex, is closely related to recognition of sound. In this work, we utilized STRF as a kernel of the first convolutional layer in SED model to extract neural response from input sound to make SED model similar to human auditory system. In addition, we constructed two-branched SED model named as Two Branch STRFNet (TB-STRFNet) composed of STRF branch and baseline branch. While STRF branch extracts sound event information from auditory neural response, baseline branch extracts sound event information directly from the mel spectrogram just as conventional SED models do. TB-STRFNet outperformed the DCASE baseline by 4.3% in terms of threshold-independent macro F1 score, achieving 4th rank in DCASE Challenge 2023 Task 4b. We further improved TB-STRFNet by applying frequency dynamic convolution (FDYConv) which also leveraged domain knowledge on acoustics. As a result, two branch model applied with FDYConv on both branches outperformed the DCASE baseline by 6.2% in terms of the same metric. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: Submitted to DCASE 2023 Workshop

arXiv:2306.11277 [pdf, other]

Frequency & Channel Attention for Computationally Efficient Sound Event Detection

Authors: Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Yong-Hwa Park

Abstract: We explore on various attention methods on frequency and channel dimensions for sound event detection (SED) in order to enhance performance with minimal increase in computational cost while leveraging domain knowledge to address the frequency dimension of audio data. We have introduced frequency dynamic convolution (FDY conv) in a previous work to release the translational equivariance issue assoc… ▽ More We explore on various attention methods on frequency and channel dimensions for sound event detection (SED) in order to enhance performance with minimal increase in computational cost while leveraging domain knowledge to address the frequency dimension of audio data. We have introduced frequency dynamic convolution (FDY conv) in a previous work to release the translational equivariance issue associated with 2D convolution on the frequency dimension of 2D audio data. Although this approach demonstrated state-of-the-art SED performance, it resulted in a model with 150% more trainable parameters. To achieve comparable SED performance with computationally efficient methods for practicality, we explore on lighter alternative attention methods. In addition, we focus on attention methods applied to frequency and channel dimensions. Joint application Squeeze-and-excitation (SE) module and time-frame frequency-wise SE (tfwSE) to apply attention on both frequency and channel dimensions shows comparable performance to SED model with FDY conv with only 2.7% more trainable parameters compared to the baseline model. In addition, we performed class-wise comparison of various attention methods to further discuss various attention methods' characteristics. △ Less

Submitted 28 August, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: Accepted to DCASE 2023 workshop

arXiv:2303.13407 [pdf, other]

doi 10.1109/ICASSP49357.2023.10097142

Adaptive Endpointing with Deep Contextual Multi-armed Bandits

Authors: Do June Min, Andreas Stolcke, Anirudh Raju, Colin Vaz, Di He, Venkatesh Ravichandran, Viet Anh Trinh

Abstract: Current endpointing (EP) solutions learn in a supervised framework, which does not allow the model to incorporate feedback and improve in an online setting. Also, it is a common practice to utilize costly grid-search to find the best configuration for an endpointing model. In this paper, we aim to provide a solution for adaptive endpointing by proposing an efficient method for choosing an optimal… ▽ More Current endpointing (EP) solutions learn in a supervised framework, which does not allow the model to incorporate feedback and improve in an online setting. Also, it is a common practice to utilize costly grid-search to find the best configuration for an endpointing model. In this paper, we aim to provide a solution for adaptive endpointing by proposing an efficient method for choosing an optimal endpointing configuration given utterance-level audio features in an online setting, while avoiding hyperparameter grid-search. Our method does not require ground truth labels, and only uses online learning from reward signals without requiring annotated labels. Specifically, we propose a deep contextual multi-armed bandit-based approach, which combines the representational power of neural networks with the action exploration behavior of Thompson modeling algorithms. We compare our approach to several baselines, and show that our deep bandit models also succeed in reducing early cutoff errors while maintaining low latency. △ Less

Submitted 23 March, 2023; originally announced March 2023.

Journal ref: Proc. IEEE ICASSP, June 2023

arXiv:2211.09383 [pdf, other]

Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models

Authors: Minki Kang, Dongchan Min, Sung Ju Hwang

Abstract: There has been a significant progress in Text-To-Speech (TTS) synthesis technology in recent years, thanks to the advancement in neural generative modeling. However, existing methods on any-speaker adaptive TTS have achieved unsatisfactory performance, due to their suboptimal accuracy in mimicking the target speakers' styles. In this work, we present Grad-StyleSpeech, which is an any-speaker adapt… ▽ More There has been a significant progress in Text-To-Speech (TTS) synthesis technology in recent years, thanks to the advancement in neural generative modeling. However, existing methods on any-speaker adaptive TTS have achieved unsatisfactory performance, due to their suboptimal accuracy in mimicking the target speakers' styles. In this work, we present Grad-StyleSpeech, which is an any-speaker adaptive TTS framework that is based on a diffusion model that can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech. Grad-StyleSpeech significantly outperforms recent speaker-adaptive TTS baselines on English benchmarks. Audio samples are available at https://nardien.github.io/grad-stylespeech-demo. △ Less

Submitted 13 March, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: ICASSP 2023

arXiv:2208.10922 [pdf, other]

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Authors: Dongchan Min, Minyoung Song, Eunji Ko, Sung Ju Hwang

Abstract: We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflects the given… ▽ More We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflects the given audio. This is made possible with several newly devised components: 1) A contrastive lip-sync discriminator for accurate lip synchronization, 2) A conditional sequential variational autoencoder that learns the latent motion space disentangled from the lip movements, such that we can independently manipulate the motions and lip movements while preserving the identity. 3) An auto-regressive prior augmented with normalizing flow to learn a complex audio-to-motion multi-modal latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way when another motion source video is given but also in a completely audio-driven manner by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model is able to synthesize talking head videos with impressive perceptual quality which are accurately lip-synced with the input audios, largely outperforming state-of-the-art baselines. △ Less

Submitted 15 March, 2024; v1 submitted 23 August, 2022; originally announced August 2022.

arXiv:2206.12059 [pdf]

Data Augmentation and Squeeze-and-Excitation Network on Multiple Dimension for Sound Event Localization and Detection in Real Scenes

Authors: Byeong-Yun Ko, Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Seung-Deok Choi, Yong-Hwa Park

Abstract: Performance of sound event localization and detection (SELD) in real scenes is limited by small size of SELD dataset, due to difficulty in obtaining sufficient amount of realistic multi-channel audio data recordings with accurate label. We used two main strategies to solve problems arising from the small real SELD dataset. First, we applied various data augmentation methods on all data dimensions:… ▽ More Performance of sound event localization and detection (SELD) in real scenes is limited by small size of SELD dataset, due to difficulty in obtaining sufficient amount of realistic multi-channel audio data recordings with accurate label. We used two main strategies to solve problems arising from the small real SELD dataset. First, we applied various data augmentation methods on all data dimensions: channel, frequency and time. We also propose original data augmentation method named Moderate Mixup in order to simulate situations where noise floor or interfering events exist. Second, we applied Squeeze-and-Excitation block on channel and frequency dimensions to efficiently extract feature characteristics. Result of our trained models on the STARSS22 test dataset achieved the best ER, F1, LE, and LR of 0.53, 49.8%, 16.0deg., and 56.2% respectively. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: Technical Report submitted for DCASE2022 Challenge Task3

arXiv:2206.11645 [pdf, ps, other]

Frequency Dependent Sound Event Detection for DCASE 2022 Challenge Task 4

Authors: Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Byeong-Yun Ko, Seung-Deok Choi, Yong-Hwa Park

Abstract: While many deep learning methods on other domains have been applied to sound event detection (SED), differences between original domains of the methods and SED have not been appropriately considered so far. As SED uses audio data with two dimensions (time and frequency) for input, thorough comprehension on these two dimensions is essential for application of methods from other domains on SED. Prev… ▽ More While many deep learning methods on other domains have been applied to sound event detection (SED), differences between original domains of the methods and SED have not been appropriately considered so far. As SED uses audio data with two dimensions (time and frequency) for input, thorough comprehension on these two dimensions is essential for application of methods from other domains on SED. Previous works proved that methods those address on frequency dimension are especially powerful in SED. By applying FilterAugment and frequency dynamic convolution those are frequency dependent methods proposed to enhance SED performance, our submitted models achieved best PSDS1 of 0.4704 and best PSDS2 of 0.8224. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: Technical Reprot submitted for DCASE2022 Challenge Task4

arXiv:2106.03153 [pdf, other]

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Authors: Dongchan Min, Dong Bok Lee, Eunho Yang, Sung Ju Hwang

Abstract: With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, that are also short in length. However, existing methods either require to fine-tune the model or achieve low adaptation quality witho… ▽ More With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, that are also short in length. However, existing methods either require to fine-tune the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN) which aligns gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice with single short-duration (1-3 sec) speech audio, significantly outperforming baselines. △ Less

Submitted 16 June, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

Comments: Accepted by ICML 2021

arXiv:2011.10897

Reinforcement learning with distance-based incentive/penalty (DIP) updates for highly constrained industrial control systems

Authors: Hyungjun Park, Daiki Min, Jong-hyun Ryu, Dong Gu Choi

Abstract: Typical reinforcement learning (RL) methods show limited applicability for real-world industrial control problems because industrial systems involve various constraints and simultaneously require continuous and discrete control. To overcome these challenges, we devise a novel RL algorithm that enables an agent to handle a highly constrained action space. This algorithm has two main features. First… ▽ More Typical reinforcement learning (RL) methods show limited applicability for real-world industrial control problems because industrial systems involve various constraints and simultaneously require continuous and discrete control. To overcome these challenges, we devise a novel RL algorithm that enables an agent to handle a highly constrained action space. This algorithm has two main features. First, we devise two distance-based Q-value update schemes, incentive update and penalty update, in a distance-based incentive/penalty update technique to enable the agent to decide discrete and continuous actions in the feasible region and to update the value of these types of actions. Second, we propose a method for defining the penalty cost as a shadow price-weighted penalty. This approach affords two advantages compared to previous methods to efficiently induce the agent to not select an infeasible action. We apply our algorithm to an industrial control problem, microgrid system operation, and the experimental results demonstrate its superiority. △ Less

Submitted 19 May, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

Comments: We request withdrawal of this article due to a definition error on methodology and problem definition (Section 3-4; pages 2-5)

arXiv:2006.16659 [pdf, other]

Delayed Q-update: A novel credit assignment technique for deriving an optimal operation policy for the Grid-Connected Microgrid

Authors: Hyungjun Park, Daiki Min, Jong-hyun Ryu, Dong Gu Choi

Abstract: A microgrid is an innovative system that integrates distributed energy resources to supply electricity demand within electrical boundaries. This study proposes an approach for deriving a desirable microgrid operation policy that enables sophisticated controls in the microgrid system using the proposed novel credit assignment technique, delayed-Q update. The technique employs novel features such as… ▽ More A microgrid is an innovative system that integrates distributed energy resources to supply electricity demand within electrical boundaries. This study proposes an approach for deriving a desirable microgrid operation policy that enables sophisticated controls in the microgrid system using the proposed novel credit assignment technique, delayed-Q update. The technique employs novel features such as the ability to tackle and resolve the delayed effective property of the microgrid, which prevents learning agents from deriving a well-fitted policy under sophisticated controls. The proposed technique tracks the history of the charging period and retroactively assigns an adjusted value to the ESS charging control. The operation policy derived using the proposed approach is well-fitted for the real effects of ESS operation because of the process of the technique. Therefore, it supports the search for a near-optimal operation policy under a sophisticatedly controlled microgrid environment. To validate our technique, we simulate the operation policy under a real-world grid-connected microgrid system and demonstrate the convergence to a near-optimal policy by comparing performance measures of our policy with benchmark policy and optimal policy. △ Less

Submitted 20 October, 2020; v1 submitted 30 June, 2020; originally announced June 2020.

arXiv:1610.00089 [pdf]

System Identification of NN-based Model Reference Control of RUAV during Hover

Authors: Bhaskar Prasad Rimal, Idris E. Putro, Agus Budiyono, Dugki Min, Eunmi Choi

Abstract: UAV control system is a huge and complex system, and to design and test a UAV control system is time-cost and money-cost. This paper considered the simulation of identification of a nonlinear system dynamics using artificial neural networks approach. This experiment develops a neural network model of the plant that we want to control. In the control design stage, experiment uses the neural network… ▽ More UAV control system is a huge and complex system, and to design and test a UAV control system is time-cost and money-cost. This paper considered the simulation of identification of a nonlinear system dynamics using artificial neural networks approach. This experiment develops a neural network model of the plant that we want to control. In the control design stage, experiment uses the neural network plant model to design (or train) the controller. We use Matlab to train the network and simulate the behavior. This chapter provides the mathematical overview of MRC technique and neural network architecture to simulate nonlinear identification of UAV systems. MRC provides a direct and effective method to control a complex system without an equation-driven model. NN approach provides a good framework to implement MEC by identifying complicated models and training a controller for it. △ Less

Submitted 1 October, 2016; originally announced October 2016.

Comments: 26 pages, Book Chapter, Artificial Neural Networks- Industrial and Control Engineering Applications, INTECH, April, 2011

Showing 1–14 of 14 results for author: Min, D