
Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data

Abstract: Code-switching entails mixing multiple languages. It is an increasingly common phenomenon in social media texts. Usually, code-mixed texts are written in a single script, even though the languages involved have different scripts. Pre-trained multilingual models primarily utilize the data in the native script of the language. In existing studies, the code-switched texts are utilized as they are. However, using the native script for each language can generate better representations of the text owing to the pre-trained knowledge. Therefore, a cross-language-script knowledge sharing architecture utilizing the cross attention and alignment of the representations of text in individual language scripts was proposed in this study. Experimental results on two different datasets containing Nepali-English and Hindi-English code-switched texts demonstrate the effectiveness of the proposed method. The interpretation of the model using a model explainability technique illustrates the sharing of language-specific knowledge between language-specific representations.

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.00365 [pdf, other]

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

Authors: Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji

Abstract: Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, and hence degrades reconstruction accuracy. To mitigate this problem, we propose a novel unified framework to stochastically learn hierarchical discrete representation on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.

Submitted 28 March, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

Comments: 34 pages with 17 figures, accepted for TMLR

arXiv:2309.09223 [pdf, other]

Zero- and Few-shot Sound Event Localization and Detection

Authors: Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

Abstract: Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable by embedding from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embedding and corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with complete training data in an evaluation dataset.

Submitted 17 January, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2024

arXiv:2309.09121 [pdf, other]

Heuristic-based Incremental Probabilistic Roadmap for Efficient UAV Exploration in Dynamic Environments

Authors: Zhefan Xu, Christopher Suzuki, Xiaoyang Zhan, Kenji Shimada

Abstract: Autonomous exploration in dynamic environments necessitates a planner that can proactively respond to changes and make efficient and safe decisions for robots. Although plenty of sampling-based works have shown success in exploring static environments, their inherent sampling randomness and limited utilization of previous samples often result in sub-optimal exploration efficiency. Additionally, most of these methods struggle with efficient replanning and collision avoidance in dynamic settings. To overcome these limitations, we propose the Heuristic-based Incremental Probabilistic Roadmap Exploration (HIRE) planner for UAVs exploring dynamic environments. The proposed planner adopts an incremental sampling strategy based on the probabilistic roadmap constructed by heuristic sampling toward the unexplored region next to the free space, defined as the heuristic frontier regions. The heuristic frontier regions are detected by applying a lightweight vision-based method to the different levels of the occupancy map. Moreover, our dynamic module ensures that the planner dynamically updates roadmap information based on the environment changes and avoids dynamic obstacles. Simulation and physical experiments prove that our planner can efficiently and safely explore dynamic environments.

Submitted 16 September, 2023; originally announced September 2023.

arXiv:2309.08544 [pdf, other]

Quadcopter Trajectory Time Minimization and Robust Collision Avoidance via Optimal Time Allocation

Authors: Zhefan Xu, Kenji Shimada

Abstract: Autonomous navigation requires robots to generate trajectories for collision avoidance efficiently. Although plenty of previous works have proven successful in generating smooth and spatially collision-free trajectories, their solutions often suffer from suboptimal time efficiency and potential unsafety, particularly when accounting for uncertainties in robot perception and control. To address this issue, this paper presents the Robust Optimal Time Allocation (ROTA) framework. This framework is designed to optimize the time progress of the trajectories temporally, serving as a post-processing tool to enhance trajectory time efficiency and safety under uncertainties. In this study, we begin by formulating a non-convex optimization problem aimed at minimizing trajectory execution time while incorporating constraints on collision probability as the robot approaches obstacles. Subsequently, we introduce the concept of the trajectory braking zone and adopt the chance-constrained formulation for robust collision avoidance in the braking zones. Finally, the non-convex optimization problem is reformulated into a second-order cone programming problem to achieve real-time performance. Through simulations and physical flight experiments, we demonstrate that the proposed approach effectively reduces trajectory execution time while enabling robust collision avoidance in complex environments.

Submitted 15 September, 2023; originally announced September 2023.

arXiv:2306.09126 [pdf, other]

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Authors: Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

Abstract: While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.

Submitted 14 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 27 pages, 9 figures, accepted for publication in NeurIPS 2023 Track on Datasets and Benchmarks

arXiv:2305.10734 [pdf, other]

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

Authors: Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

Abstract: Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider the combined use of generative and predictive decoders. The predictive decoder allows us to further exploit the complementarity between predictive and diffusion-based generative SE. In this paper, we propose a unified system that jointly uses generative and predictive decoders across two levels. The encoder encodes both generative and predictive information at the shared encoding level. At the decoded feature level, we fuse the two features decoded by the generative and predictive decoders. Specifically, the two SE modules are fused in the initial and final diffusion steps: the initial fusion initializes the diffusion process with the predictive SE to improve convergence, and the final fusion combines the two complementary SE outputs to enhance SE performance. Experiments conducted on the Voice-Bank dataset demonstrate that incorporating predictive information leads to faster decoding and higher PESQ scores compared with other score-based diffusion SE (StoRM and SGMSE+).

Submitted 28 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

arXiv:2305.06701 [pdf, ps, other]

Extending Audio Masked Autoencoders Toward Audio Restoration

Authors: Zhi Zhong, Hao Shi, Masato Hirano, Kazuki Shimada, Kazuya Tateishi, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Audio classification and restoration are among major downstream tasks in audio signal processing. However, restoration derives less of a benefit from pretrained models compared to the overwhelming success of pretrained models in classification tasks. Due to such unbalanced benefits, there has been rising interest in how to improve the performance of pretrained models for restoration tasks, e.g., speech enhancement (SE). Previous works have shown that the features extracted by pretrained audio encoders are effective for SE tasks, but these speech-specialized encoder-only models usually require extra decoders to become compatible with SE, and involve complicated pretraining procedures or complex data augmentation. Therefore, in pursuit of a universal audio model, the audio masked autoencoder (MAE), whose backbone is the autoencoder of Vision Transformers (ViT-AE), is extended from audio classification to SE, a representative restoration task with well-established evaluation standards. ViT-AE learns to restore masked audio signal via a mel-to-mel mapping during pretraining, which is similar to restoration tasks like SE. We propose variations of ViT-AE for a better SE performance, where the mel-to-mel variations yield high scores in non-intrusive metrics and the STFT-oriented variation is effective at intrusive metrics such as PESQ. Different variations can be used in accordance with the scenarios. Comprehensive evaluations reveal that MAE pretraining is beneficial to SE tasks and helps the ViT-AE to better generalize to out-of-domain distortions. We further found that large-scale noisy data of general audio sources, rather than clean speech, is sufficiently effective for pretraining.

Submitted 17 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: WASPAA 2023. Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2305.05857 [pdf, other]

Diffusion-based Signal Refiner for Speech Separation

Authors: Masato Hirano, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji

Abstract: We have developed a diffusion-based speech refiner that improves the reference-free perceptual quality of the audio predicted by preceding single-channel speech separation models. Although modern deep neural network-based speech separation models have shown high performance in reference-based metrics, they often produce perceptually unnatural artifacts. The recent advancements made to diffusion models motivated us to tackle this problem by restoring the degraded parts of initial separations with a generative approach. Utilizing the denoising diffusion restoration model (DDRM) as a basis, we propose a shared DDRM-based refiner that generates samples conditioned on the global information of preceding outputs from arbitrary speech separation models. We experimentally show that our refiner can provide a clearer harmonic structure of speech and improve the reference-free metric of perceptual quality for arbitrary preceding model architectures. Furthermore, we tune the variance of the measurement noise based on preceding outputs, which results in higher scores in both reference-free and reference-based metrics. The separation quality can also be further improved by blending the discriminative and generative outputs.

Submitted 12 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: Under review

arXiv:2303.00132 [pdf, other]

doi 10.1109/LRA.2023.3334683

Onboard dynamic-object detection and tracking for autonomous robot navigation with RGB-D camera

Authors: Zhefan Xu, Xiaoyang Zhan, Yumeng Xiu, Christopher Suzuki, Kenji Shimada

Abstract: Deploying autonomous robots in crowded indoor environments usually requires them to have accurate dynamic obstacle perception. Although plenty of previous works in the autonomous driving field have investigated the 3D object detection problem, the usage of dense point clouds from a heavy Light Detection and Ranging (LiDAR) sensor and their high computation cost for learning-based data processing make those methods not applicable to small robots, such as vision-based UAVs with small onboard computers. To address this issue, we propose a lightweight 3D dynamic obstacle detection and tracking (DODT) method based on an RGB-D camera, which is designed for low-power robots with limited computing power. Our method adopts a novel ensemble detection strategy, combining multiple computationally efficient but low-accuracy detectors to achieve real-time high-accuracy obstacle detection. Besides, we introduce a new feature-based data association and tracking method to prevent mismatches utilizing point clouds' statistical features. In addition, our system includes an optional and auxiliary learning-based module to enhance the obstacle detection range and dynamic obstacle identification. The proposed method is implemented in a small quadcopter, and the results show that our method can achieve the lowest position error (0.11 m) and a comparable velocity error (0.23 m/s) among the benchmarked algorithms running on the robot's onboard computer. The flight experiments prove that the tracking results from the proposed method can make the robot efficiently alter its trajectory for navigating dynamic environments. Our software is available on GitHub as an open-source ROS package.

Submitted 23 November, 2023; v1 submitted 28 February, 2023; originally announced March 2023.

Comments: 8 pages, 10 figures, 2 tables

Journal ref: IEEE Robotics and Automation Letters, Volume: 9, Issue: 1, January 2024. Page(s): 651 - 658

arXiv:2302.08136 [pdf, ps, other]

An Attention-based Approach to Hierarchical Multi-label Music Instrument Classification

Authors: Zhi Zhong, Masato Hirano, Kazuki Shimada, Kazuya Tateishi, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Although music is typically multi-label, many works have studied hierarchical music tagging with simplified settings such as single-label data. Moreover, a framework to describe various joint training methods under the multi-label setting is lacking. In order to discuss the above topics, we introduce the hierarchical multi-label music instrument classification task. The task provides a realistic setting where multi-instrument real music data is assumed. Various hierarchical methods that jointly train a DNN are summarized and explored in the context of the fusion of deep learning and conventional techniques. For effective joint training in the multi-label setting, we propose two methods to model the connection between fine- and coarse-level tags: one uses rule-based grouped max-pooling, and the other uses an attention mechanism obtained in a data-driven manner. Our evaluation reveals that the proposed methods have advantages over the method without joint training. In addition, the decision procedure within the proposed methods can be interpreted by visualizing attention maps or referring to fixed rules.

Submitted 16 February, 2023; originally announced February 2023.

Comments: To appear at ICASSP 2023

arXiv:2301.08422 [pdf, other]

doi 10.1109/LRA.2023.3290415

A vision-based autonomous UAV inspection framework for unknown tunnel construction sites with dynamic obstacles

Authors: Zhefan Xu, Baihan Chen, Xiaoyang Zhan, Yumeng Xiu, Christopher Suzuki, Kenji Shimada

Abstract: Tunnel construction using the drill-and-blast method requires the 3D measurement of the excavation front to evaluate underbreak locations. Considering the inspection and measurement task's safety, cost, and efficiency, deploying lightweight autonomous robots, such as unmanned aerial vehicles (UAVs), becomes more necessary and popular. Most of the previous works use a prior map for inspection viewpoint determination and do not consider dynamic obstacles. To maximally increase the level of autonomy, this paper proposes a vision-based UAV inspection framework for dynamic tunnel environments without using a prior map. Our approach utilizes a hierarchical planning scheme, decomposing the inspection problem into different levels. The high-level decision maker first determines the task for the robot and generates the target point. Then, the mid-level path planner finds the waypoint path and optimizes the collision-free static trajectory. Finally, the static trajectory will be fed into the low-level local planner to avoid dynamic obstacles and navigate to the target point. Besides, our framework contains a novel dynamic map module that can simultaneously track dynamic obstacles and represent static obstacles based on an RGB-D camera. After inspection, the Structure-from-Motion (SfM) pipeline is applied to generate the 3D shape of the target. To the best of our knowledge, this is the first time autonomous inspection has been realized in unknown and dynamic tunnel environments. Our flight experiments in a real tunnel prove that our method can autonomously inspect the tunnel excavation front surface. Our software is available on GitHub as an open-source ROS package.

Submitted 12 January, 2024; v1 submitted 19 January, 2023; originally announced January 2023.

Comments: 8 pages, 8 figures

Journal ref: IEEE Robotics and Automation Letters, Volume: 8, Issue: 8, June 2023. Page(s): 4983 - 4990

arXiv:2212.00290 [pdf, other]

doi 10.1016/j.compind.2023.103885

Component Segmentation of Engineering Drawings Using Graph Convolutional Networks

Authors: Wentai Zhang, Joe Joseph, Yue Yin, Liuyue Xie, Tomotake Furuhata, Soji Yamakawa, Kenji Shimada, Levent Burak Kara

Abstract: We present a data-driven framework to automate the vectorization and machine interpretation of 2D engineering part drawings. In industrial settings, most manufacturing engineers still rely on manual reads to identify the topological and manufacturing requirements from drawings submitted by designers. The interpretation process is laborious and time-consuming, which severely inhibits the efficiency of part quotation and manufacturing tasks. While recent advances in image-based computer vision methods have demonstrated great potential in interpreting natural images through semantic segmentation approaches, the application of such methods in parsing engineering technical drawings into semantically accurate components remains a significant challenge. The severe pixel sparsity in engineering drawings also restricts the effective featurization of image-based data-driven methods. To overcome these challenges, we propose a deep-learning-based framework that predicts the semantic type of each vectorized component. Taking a raster image as input, we vectorize all components through thinning, stroke tracing, and cubic Bézier fitting. Then a graph of such components is generated based on the connectivity between the components. Finally, a graph convolutional neural network is trained on this graph data to identify the semantic type of each component. We test our framework in the context of semantic segmentation of text, dimension, and contour components in engineering drawings. Results show that our method yields the best performance compared to recent image- and graph-based segmentation methods.

Submitted 14 March, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

Comments: Preprint accepted to Computers in Industry

arXiv:2209.08258 [pdf, other]

doi 10.1109/ICRA48891.2023.10161194

A real-time dynamic obstacle tracking and mapping system for UAV navigation and collision avoidance with an RGB-D camera

Authors: Zhefan Xu, Xiaoyang Zhan, Baihan Chen, Yumeng Xiu, Chenhao Yang, Kenji Shimada

Abstract: Real-time dynamic environment perception has become vital for autonomous robots in crowded spaces. Although the popular voxel-based mapping methods can efficiently represent 3D obstacles with arbitrarily complex shapes, they can hardly distinguish between static and dynamic obstacles, leading to limited obstacle avoidance performance. While plenty of sophisticated learning-based dynamic obstacle detection algorithms exist in autonomous driving, the quadcopter's limited computation resources cannot achieve real-time performance using those approaches. To address these issues, we propose a real-time dynamic obstacle tracking and mapping system for quadcopter obstacle avoidance using an RGB-D camera. The proposed system first utilizes a depth image with an occupancy voxel map to generate potential dynamic obstacle regions as proposals. With the obstacle region proposals, the Kalman filter and our continuity filter are applied to track each dynamic obstacle. Finally, the environment-aware trajectory prediction method is proposed based on the Markov chain using the states of tracked dynamic obstacles. We implemented the proposed system with our custom quadcopter and navigation planner. The simulation and physical experiments show that our methods can successfully track and represent obstacles in dynamic environments in real-time and safely avoid obstacles. Our software is available on GitHub as an open-source ROS package.

Submitted 12 January, 2024; v1 submitted 17 September, 2022; originally announced September 2022.

Journal ref: 2023 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2209.07003 [pdf, other]

doi 10.1109/ICRA48891.2023.10160638

Vision-aided UAV navigation and dynamic obstacle avoidance using gradient-based B-spline trajectory optimization

Authors: Zhefan Xu, Yumeng Xiu, Xiaoyang Zhan, Baihan Chen, Kenji Shimada

Abstract: Navigating dynamic environments requires the robot to generate collision-free trajectories and actively avoid moving obstacles. Most previous works designed path planning algorithms based on one single map representation, such as the geometric, occupancy, or ESDF map. Although they have shown success in static environments, due to the limitation of map representation, those methods cannot reliably handle static and dynamic obstacles simultaneously. To address the problem, this paper proposes a gradient-based B-spline trajectory optimization algorithm utilizing the robot's onboard vision. The depth vision enables the robot to track and represent dynamic objects geometrically based on the voxel map. The proposed optimization first adopts the circle-based guide-point algorithm to approximate the costs and gradients for avoiding static obstacles. Then, with the vision-detected moving objects, our receding-horizon distance field is simultaneously used to prevent dynamic collisions. Finally, the iterative re-guide strategy is applied to generate the collision-free trajectory. The simulation and physical experiments prove that our method can run in real-time to navigate dynamic environments safely. Our software is available on GitHub as an open-source package.

Submitted 12 January, 2024; v1 submitted 14 September, 2022; originally announced September 2022.

Journal ref: 2023 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2207.04196 [pdf, other]

doi 10.1109/LRA.2022.3195189

Robotic Depowdering for Additive Manufacturing Via Pose Tracking

Authors: Zhenwei Liu, Junyi Geng, Xikai Dai, Tomasz Swierzewski, Kenji Shimada

Abstract: With the rapid development of powder-based additive manufacturing, depowdering, a process of removing unfused powder that covers 3D-printed parts, has become a major bottleneck to further improving productivity. Traditional manual depowdering is extremely time-consuming and costly, and some prior automated systems either require pre-depowdering or lack adaptability to different 3D-printed parts. To solve these problems, we introduce a robotic system that automatically removes unfused powder from the surface of 3D-printed parts. The key component is a visual perception system, which consists of a pose-tracking module that tracks the 6D pose of powder-occluded parts in real time, and a progress estimation module that estimates the depowdering completion percentage. The tracking module can be run efficiently on a laptop CPU at up to 60 FPS. Experiments show that our depowdering system can remove unfused powder from the surface of various 3D-printed parts without causing any damage. To the best of our knowledge, this is one of the first vision-based robotic depowdering systems that adapt to parts with various shapes without the need for pre-depowdering.

Submitted 4 September, 2022; v1 submitted 9 July, 2022; originally announced July 2022.

Comments: Video link: https://www.youtube.com/watch?v=AUIkyULAhqM

Journal ref: 2022 IEEE Robotics and Automation Letters

arXiv:2206.01948 [pdf, other]

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Authors: Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

Abstract: This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprised of spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events in the dataset belonging to 13 target sound classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. The dataset serves as the development and evaluation dataset for Task 3 of the DCASE2022 Challenge on Sound Event Localization and Detection and introduces significant new challenges for the task compared to the previous iterations, which were based on synthetic spatialized sound scene recordings. Dataset specifications are detailed, including the recording and annotation process, target classes and their presence, and details on the development and evaluation splits. Additionally, the report presents the baseline system that accompanies the dataset in the challenge, with emphasis on the differences from the baseline of the previous iterations; namely, the introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format. Results of the baseline indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available at https://zenodo.org/record/6387880.

Submitted 2 September, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

arXiv:2202.13590 [pdf, other]

doi 10.3390/electronics11071014

LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation

Authors: Keita Nonaka, Kazutaka Yamanouchi, Tomohiro I, Tsuyoshi Okita, Kazutaka Shimada, Hiroshi Sakamoto

Abstract: In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in Neural Machine Translation. Among such methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches. However, the compression-based approach has a drawback in that generating multiple segmentations is difficult due to its determinism. To overcome this difficulty, we focus on a probabilistic string algorithm, called locally-consistent parsing (LCP), that has been applied to achieve optimum compression. Employing the probabilistic mechanism of LCP, we propose LCP-dropout for multiple subword segmentation that improves BPE/BPE-dropout, and show that it outperforms various baselines, especially when learning from small training data.

Submitted 19 March, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

Comments: 12 pages

Journal ref: Electronics 11(7), Article number 1014, 2022

arXiv:2112.08073 [pdf, other]

doi 10.1145/3486622.3493947

Analysis of Leading Communities Contributing to arXiv Information Distribution on Twitter

Authors: Kyosuke Shimada, Kazuhiro Kazama, Mitsuo Yoshida, Ikki Ohmukai, Sho Sato

Abstract: To analyze the impact that arXiv is having on the world, in this paper we propose an arXiv information distribution model on Twitter, which has a three-layer structure: arXiv papers, information spreaders, and information collectors. First, we use the HITS algorithm to analyze the arXiv information diffusion network with users as nodes, which is created from three types of behavior on Twitter regarding arXiv papers: tweeting, retweeting, and liking. Next, we extract communities from the network of information spreaders with positive authority and hub degrees using the Louvain method, and analyze the relationship and roles of information spreaders in communities using research field, linguistic, and temporal characteristics. From our analysis using the tweet and arXiv datasets, we found that information about arXiv papers circulates on Twitter from information spreaders to information collectors, and that multiple communities of information spreaders are formed according to their research fields. It was also found that different communities were formed in the same research field, depending on the research or cultural background of the information spreaders. We were able to identify two types of key persons: information spreaders who lead the relevant field in the international community and information spreaders who bridge the regional and international communities using English and their native language. In addition, we found that it takes some time to gain trust as an information spreader.

Submitted 15 December, 2021; originally announced December 2021.

Comments: The 20th IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT '21)

arXiv:2110.07124 [pdf, other]

Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training

Authors: Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, Yuki Mitsufuji

Abstract: Sound event localization and detection (SELD) involves identifying the direction-of-arrival (DOA) and the event class. The SELD methods with a class-wise output format make the model predict activities of all sound event classes and corresponding locations. The class-wise methods can output activity-coupled Cartesian DOA (ACCDOA) vectors, which enable us to solve a SELD task with a single target using a single network. However, there is still a challenge in detecting the same event class from multiple locations. To overcome this problem while maintaining the advantages of the class-wise format, we extended ACCDOA to a multi-ACCDOA format and proposed auxiliary duplicating permutation invariant training (ADPIT). The multi-ACCDOA format (a class- and track-wise output format) enables the model to solve the cases with overlaps from the same class. The class-wise ADPIT scheme enables each track of the multi-ACCDOA format to learn with the same target as the single-ACCDOA format. In evaluations with the DCASE 2021 Task 3 dataset, the model trained with the multi-ACCDOA format and with the class-wise ADPIT detects overlapping events from the same class while maintaining its performance in the other cases. Also, the proposed method performed comparably to state-of-the-art SELD methods with fewer parameters.

Submitted 27 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: 5 pages, 3 figures, accepted for publication in IEEE ICASSP 2022

arXiv:2110.06501 [pdf, other]

Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection

Authors: Yuichiro Koyama, Kazuhide Shigemi, Masafumi Takahashi, Kazuki Shimada, Naoya Takahashi, Emiru Tsunoo, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Recording and annotating real sound events for a sound event localization and detection (SELD) task is time consuming, and data augmentation techniques are often favored when the amount of data is limited. However, how to augment the spatial information in a dataset, including unlabeled directional interference events, remains an open research question. Furthermore, directional interference events make it difficult to accurately extract spatial characteristics from target sound events. To address this problem, we propose an impulse response simulation framework (IRS) that augments spatial characteristics using simulated room impulse responses (RIR). RIRs corresponding to a microphone array assumed to be placed in various rooms are accurately simulated, and the source signals of the target sound events are extracted from a mixture. The simulated RIRs are then convolved with the extracted source signals to obtain an augmented multi-channel training dataset. Evaluation results obtained using the TAU-NIGENS Spatial Sound Events 2021 dataset show that the IRS contributes to improving the overall SELD performance. Additionally, we conducted an ablation study to discuss the contribution and need for each component within the IRS.

Submitted 28 April, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: 5 pages, 2 figures, accepted for publication in IEEE ICASSP 2022
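
The convolution step described in the abstract above (simulated RIRs convolved with extracted source signals to produce multi-channel audio) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `spatialize`, the array shapes, and the use of plain `numpy.convolve` are assumptions made for illustration, and the RIR simulation and source extraction steps are not shown.

```python
import numpy as np


def spatialize(source: np.ndarray, rirs: np.ndarray) -> np.ndarray:
    """Convolve a mono source with one RIR per microphone channel.

    source: shape (T,), a single extracted source signal (assumed mono).
    rirs:   shape (M, L), one simulated room impulse response per microphone.
    Returns an array of shape (M, T + L - 1), one convolved signal per channel.
    """
    # Hypothetical helper: full linear convolution, channel by channel.
    return np.stack([np.convolve(source, h) for h in rirs])


# Toy example: channel 0 passes the signal through, channel 1 delays it by one sample.
source = np.array([1.0, 2.0, 3.0])
rirs = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
augmented = spatialize(source, rirs)
```

With real audio, an FFT-based convolution would likely be preferable for long impulse responses; the direct form is used here only to keep the sketch short.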

arXiv:2110.06126 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747312

Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection

Authors: Ricardo Falcon-Perez, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Data augmentation methods have shown great importance in diverse supervised learning problems where labeled data is scarce or costly to obtain. For sound event localization and detection (SELD) tasks several augmentation methods have been proposed, with most borrowing ideas from other domains such as images, speech, or monophonic audio. However, only a few exploit the spatial properties of a full 3D audio scene. We propose Spatial Mixup, an application of parametric spatial audio effects for data augmentation, which modifies the directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain. Similarly to beamforming, these modifications enhance or suppress signals arriving from certain directions, although the effect is less pronounced. This enables deep learning models to achieve invariance to small spatial perturbations. The method is evaluated with experiments on the DCASE 2021 Task 3 dataset, where spatial mixup increases performance over a non-augmented baseline, and compares to other well known augmentation methods. Furthermore, combining spatial mixup with other methods greatly improves performance.

Submitted 12 October, 2021; originally announced October 2021.

Comments: 5 pages, 2 figures, 4 tables. Submitted to the 2022 International Conference on Acoustics, Speech, & Signal Processing (ICASSP)

arXiv:2109.07024 [pdf, other]

DPMPC-Planner: A real-time UAV trajectory planning framework for complex static environments with dynamic obstacles

Authors: Zhefan Xu, Di Deng, Yiping Dong, Kenji Shimada

Abstract: Safe UAV navigation is challenging due to the complex environment structures, dynamic obstacles, and uncertainties from measurement noises and unpredictable moving obstacle behaviors. Although plenty of recent works achieve safe navigation in complex static environments with sophisticated mapping algorithms, such as occupancy map and ESDF map, these methods cannot reliably handle dynamic environments due to the mapping limitation from moving obstacles. To address the limitation, this paper proposes a trajectory planning framework to achieve safe navigation considering complex static environments with dynamic obstacles. To reliably handle dynamic obstacles, we divide the environment representation into static mapping and dynamic object representation, which can be obtained from computer vision methods. Our framework first generates a static trajectory based on the proposed iterative corridor shrinking algorithm. Then, reactive chance-constrained model predictive control with temporal goal tracking is applied to avoid dynamic obstacles with uncertainties. The simulation results in various environments demonstrate the ability of our algorithm to navigate safely in complex static environments with dynamic obstacles.

Submitted 12 March, 2022; v1 submitted 14 September, 2021; originally announced September 2021.

Comments: 7 pages, 8 figures

Journal ref: 2022 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2106.10806 [pdf, other]

Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

Authors: Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji

Abstract: This report describes our systems submitted to the DCASE2021 challenge task 3: sound event localization and detection (SELD) with directional interference. Our previous system based on activity-coupled Cartesian direction of arrival (ACCDOA) representation enables us to solve a SELD task with a single target. This ACCDOA-based system with efficient network architecture called RD3Net and data augmentation techniques outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection. Using the ACCDOA-based system as a base, we perform model ensembles by averaging outputs of several systems trained with different conditions such as input features, training folds, and model architectures. We also use the event independent network v2 (EINV2)-based system to increase the diversity of the model ensembles. To generalize the models, we further propose impulse response simulation (IRS), which generates simulated multi-channel signals by convolving simulated room impulse responses (RIRs) with source signals extracted from the original dataset. Our systems significantly improved over the baseline system on the development dataset.

Submitted 20 June, 2021; originally announced June 2021.

Comments: 5 pages, 3 figures, submitted to DCASE2021 task3

arXiv:2105.05163 [pdf, other]

An Efficient Bayes Coding Algorithm for the Non-Stationary Source in Which Context Tree Model Varies from Interval to Interval

Authors: Koshi Shimada, Shota Saito, Toshiyasu Matsushima

Abstract: The context tree source is a source model in which the occurrence probability of symbols is determined from a finite past sequence, and is a broader class of sources that includes i.i.d. and Markov sources. In the source model proposed in this paper, the subsequence in each interval is generated from a different context tree model. The Bayes code for such sources requires weighting of the posterior probability distributions for the change patterns of the context tree source and for all possible context tree models. Therefore, the challenge is how to reduce this exponential-order computational complexity. In this paper, we assume a special class of prior probability distribution of change patterns and context tree models, and propose an efficient Bayes coding algorithm whose computational complexity is of polynomial order.

Submitted 13 May, 2021; v1 submitted 11 May, 2021; originally announced May 2021.

arXiv:2101.04757 [pdf, other]

doi 10.1093/jcde/qwad046

Airfoil GAN: Encoding and Synthesizing Airfoils for Aerodynamic Shape Optimization

Authors: Yuyang Wang, Kenji Shimada, Amir Barati Farimani

Abstract: The current design of aerodynamic shapes, like airfoils, involves computationally intensive simulations to explore the possible design space. Usually, such design relies on the prior definition of design parameters and places restrictions on synthesizing novel shapes. In this work, we propose a data-driven shape encoding and generating method, which automatically learns representations from existing airfoils and uses the learned representations to generate new airfoils. The representations are then used in the optimization of synthesized airfoil shapes based on their aerodynamic performance. Our model is built upon VAEGAN, a neural network that combines Variational Autoencoder with Generative Adversarial Network and is trained by the gradient-based technique. Our model can (1) encode the existing airfoil into a latent vector and reconstruct the airfoil from that, (2) generate novel airfoils by randomly sampling the latent vectors and mapping the vectors to the airfoil coordinate domain, and (3) synthesize airfoils with desired aerodynamic properties by optimizing learned features via a genetic algorithm. Our experiments show that the learned features encode shape information thoroughly and comprehensively without predefined design parameters. By interpolating/extrapolating feature vectors or sampling from Gaussian noise, the model can automatically synthesize novel airfoil shapes, some of which possess competitive or even better aerodynamic properties compared to airfoils used for model training purposes. By optimizing shapes on the learned latent domain via a genetic algorithm, synthesized airfoils can evolve to target aerodynamic properties. This demonstrates an efficient learning-based airfoil design framework, which encodes and optimizes the airfoil on the latent domain and synthesizes promising airfoil candidates for required aerodynamic performance.

Submitted 6 July, 2023; v1 submitted 12 January, 2021; originally announced January 2021.

Comments: Published in Journal of Computational Design and Engineering. 13 pages, 13 figures, 1 table

arXiv:2011.05323 [pdf, other]

Robotic Exploration of Unknown 2D Environment Using a Frontier-based Automatic-Differentiable Information Gain Measure

Authors: Di Deng, Runlin Duan, Jiahong Liu, Kuangjie Sheng, Kenji Shimada

Abstract: At the heart of path-planning methods for autonomous robotic exploration is a heuristic which encourages exploring unknown regions of the environment. Such heuristics are typically computed using frontier-based or information-theoretic methods. Frontier-based methods define the information gain of an exploration path as the number of boundary cells, or frontiers, which are visible from the path. However, the discrete and non-differentiable nature of this measure of information gain makes it difficult to optimize using gradient-based methods. In contrast, information-theoretic methods define information gain as the mutual information between the sensor's measurements and the explored map. However, computation of the gradient of mutual information involves finite differencing and is thus computationally expensive. This work proposes an exploration planning framework that combines sampling-based path planning and gradient-based path optimization. The main contribution of this framework is a novel reformulation of information gain as a differentiable function. This allows us to simultaneously optimize information gain with other differentiable quality measures, such as smoothness. The proposed planning framework's effectiveness is verified both in simulation and in hardware experiments using a Turtlebot3 Burger robot.

Submitted 10 November, 2020; originally announced November 2020.

arXiv:2011.05288 [pdf, other]

Frontier-based Automatic-differentiable Information Gain Measure for Robotic Exploration of Unknown 3D Environments

Authors: Di Deng, Zhefan Xu, Wenbo Zhao, Kenji Shimada

Abstract: The path planning problem for autonomous exploration of an unknown region by a robotic agent typically employs frontier-based or information-theoretic heuristics. Frontier-based heuristics typically evaluate the information gain of a viewpoint by the number of visible frontier voxels, which is a discrete measure that can only be optimized by sampling. On the other hand, information-theoretic heuristics compute information gain as the mutual information between the map and the sensor's measurement. Although the gradient of such measures can be computed, the computation involves costly numerical differentiation. In this work, we add a novel fuzzy logic filter in the counting of visible frontier voxels surrounding a viewpoint, which allows the gradient of the information gain with respect to the viewpoint to be efficiently computed using automatic differentiation. This enables us to simultaneously optimize information gain with other differentiable quality measures such as path length. Using multiple simulation environments, we demonstrate that the proposed gradient-based optimization method consistently improves the information gain and other quality measures of exploration paths.

Submitted 10 November, 2020; originally announced November 2020.

arXiv:2011.05275 [pdf, other]

Coordinated Aerial-Ground Robot Exploration via Monte-Carlo View Quality Rendering

Authors: Di Deng, Zhefan Xu, Wenbo Zhao, Kenji Shimada

Abstract: We present a framework for a ground-aerial robotic team to explore large, unstructured, and unknown environments. In such exploration problems, the effectiveness of existing exploration-boosting heuristics often scales poorly with the environments' size and complexity. This work proposes a novel framework combining incremental frontier distribution, goal selection with Monte-Carlo view quality rendering, and an automatic-differentiable information gain measure to improve exploration efficiency. In simulations with multiple complex environments, we demonstrate that the proposed method effectively utilizes collaborative aerial and ground robots, consistently guides agents to informative viewpoints, improves exploration paths' information gain, and reduces planning time.

Submitted 10 November, 2020; originally announced November 2020.

arXiv:2010.15306 [pdf, other]

ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection

Authors: Kazuki Shimada, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Neural-network (NN)-based methods show high performance in sound event localization and detection (SELD). Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target. The two-branch representation with a single network has to decide how to balance the two objectives during optimization. Using two networks dedicated to each task increases system complexity and network size. To address these problems, we propose an activity-coupled Cartesian DOA (ACCDOA) representation, which assigns a sound event activity to the length of a corresponding Cartesian DOA vector. The ACCDOA representation enables us to solve a SELD task with a single target and has two advantages: it avoids both the need to balance the objectives and an increase in model size. In experimental evaluations with the DCASE 2020 Task 3 dataset, the ACCDOA representation outperformed the two-branch representation in SELD metrics with a smaller network size. The ACCDOA-based SELD system also performed better than state-of-the-art SELD systems in terms of localization and location-dependent detection.

Submitted 14 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

Comments: 5 pages, 5 figures, accepted for publication in IEEE ICASSP 2021
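
The core idea stated in the abstract above, assigning a sound event activity to the length of a Cartesian DOA vector, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names, the azimuth/elevation convention, and the 0.5 activity threshold used for decoding are assumptions not given in the listing.

```python
import numpy as np


def encode_accdoa(activity, azimuth, elevation):
    """Build per-class ACCDOA-style targets.

    activity:  shape (C,), event activity per class in [0, 1].
    azimuth:   shape (C,), radians (assumed convention).
    elevation: shape (C,), radians (assumed convention).
    Returns shape (C, 3): a unit DOA vector scaled by the activity.
    """
    unit = np.stack(
        [
            np.cos(elevation) * np.cos(azimuth),
            np.cos(elevation) * np.sin(azimuth),
            np.sin(elevation),
        ],
        axis=-1,
    )
    return activity[..., None] * unit


def decode_accdoa(vectors, threshold=0.5):
    """Recover activity and direction from vectors of shape (C, 3).

    The vector length is read as activity; the threshold is an assumed value.
    """
    norm = np.linalg.norm(vectors, axis=-1)
    active = norm > threshold
    # Avoid division by zero for inactive (zero-length) vectors.
    safe_norm = np.where(norm > 0, norm, 1.0)
    unit = vectors / safe_norm[..., None]
    azimuth = np.arctan2(unit[..., 1], unit[..., 0])
    elevation = np.arcsin(np.clip(unit[..., 2], -1.0, 1.0))
    return active, azimuth, elevation


# Two classes: the first active at azimuth 90 degrees, the second inactive.
targets = encode_accdoa(
    np.array([1.0, 0.0]),
    np.array([np.pi / 2, 0.0]),
    np.array([0.0, 0.0]),
)
active, az, el = decode_accdoa(targets)
```

Because detection and localization share one vector per class, a single regression target suffices, which is the property the abstract contrasts with the two-branch format.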

arXiv:2010.07429 [pdf, other]

doi 10.1109/LRA.2021.3062008

Autonomous UAV Exploration of Dynamic Environments via Incremental Sampling and Probabilistic Roadmap

Authors: Zhefan Xu, Di Deng, Kenji Shimada

Abstract: Autonomous exploration requires robots to generate informative trajectories iteratively. Although sampling-based methods are highly efficient in unmanned aerial vehicle exploration, many of these methods do not effectively utilize the sampled information from the previous planning iterations, leading to redundant computation and longer exploration time. Also, few have explicitly shown their exploration ability in dynamic environments even though they can run in real time. To overcome these limitations, we propose a novel dynamic exploration planner (DEP) for exploring unknown environments using incremental sampling and Probabilistic Roadmap (PRM). In our sampling strategy, nodes are added incrementally and distributed evenly in the explored region, yielding the best viewpoints. To further shorten exploration time and ensure safety, our planner optimizes paths locally and refines them based on the Euclidean Signed Distance Function (ESDF) map. Meanwhile, as a multi-query planner, PRM allows the proposed planner to quickly search alternative paths to avoid dynamic obstacles for safe exploration. Simulation experiments show that our method safely explores dynamic environments and outperforms the benchmark planners in terms of exploration time, path length, and computational time.

Submitted 20 March, 2021; v1 submitted 14 October, 2020; originally announced October 2020.

Comments: 8 Pages, 9 Figures, and 5 Tables. Video Link: https://youtu.be/ileyP4DRBjU. Github Link: https://github.com/Zhefan-Xu/DEP

Journal ref: IEEE Robotics and Automation Letters, Volume: 6, Issue: 2, April 2021. Page(s): 2729 - 2736

arXiv:2009.08924 [pdf, other]

Multi-Resolution Graph Neural Network for Large-Scale Pointcloud Segmentation

Authors: Liuyue Xie, Tomotake Furuhata, Kenji Shimada

Abstract: In this paper, we propose a multi-resolution deep-learning architecture to semantically segment dense large-scale pointclouds. Dense pointcloud data require a computationally expensive feature encoding process before semantic segmentation. Previous work has used different approaches to drastically downsample from the original pointcloud so common computing hardware can be utilized. While these approaches can relieve the computation burden to some extent, they are still limited in their processing capability for multiple scans. We present MuGNet, a memory-efficient, end-to-end graph neural network framework to perform semantic segmentation on large-scale pointclouds. We reduce the computation demand by utilizing a graph neural network on the preformed pointcloud graphs and retain the precision of the segmentation with a bidirectional network that fuses feature embedding at different resolutions. Our framework has been validated on benchmark datasets including Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) and Virtual KITTI Dataset. We demonstrate that our framework can process up to 45 room scans at once on a single 11 GB GPU while still surpassing other graph-based solutions for segmentation on S3DIS with an 88.5% (+3%) overall accuracy and 69.8% (+7.7%) mIOU accuracy.

Submitted 18 September, 2020; originally announced September 2020.

Journal ref: Conference on Robot Learning, 2020, 184

arXiv:2007.13065 [pdf, other]

Multi-UAV Coverage Path Planning for the Inspection of Large and Complex Structures

Authors: Wei Jing, Di Deng, Yan Wu, Kenji Shimada

Abstract: We present a multi-UAV Coverage Path Planning (CPP) framework for the inspection of large-scale, complex 3D structures. In the proposed sampling-based coverage path planning method, we formulate the multi-UAV inspection applications as a multi-agent coverage path planning problem. By combining two NP-hard problems, the Set Covering Problem (SCP) and the Vehicle Routing Problem (VRP), a Set-Covering Vehicle Routing Problem (SC-VRP) is formulated and subsequently solved by a modified Biased Random Key Genetic Algorithm (BRKGA) with novel, efficient encoding strategies and local improvement heuristics. We test our proposed method on several complex 3D structures with the 3D model extracted from OpenStreetMap. The proposed method outperforms previous methods by reducing the length of the planned inspection path by up to 48%.

Submitted 26 July, 2020; originally announced July 2020.

Comments: Accepted by IROS2020

arXiv:2006.12014 [pdf, other]

Sound Event Localization and Detection Using Activity-Coupled Cartesian DOA Vector and RD3net

Authors: Kazuki Shimada, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Our systems submitted to the DCASE2020 task 3: Sound Event Localization and Detection (SELD) are described in this report. We consider two systems: a single-stage system that solves sound event localization (SEL) and sound event detection (SED) simultaneously, and a two-stage system that first handles the SED and SEL tasks individually and later combines those results. As the single-stage system, we propose a unified training framework that uses an activity-coupled Cartesian DOA vector (ACCDOA) representation as a single target for both the SED and SEL tasks. To efficiently estimate sound event locations and activities, we further propose RD3Net, which incorporates recurrent and convolution layers with dense skip connections and dilation. To generalize the models, we apply three data augmentation techniques: equalized mixture data augmentation (EMDA), rotation of first-order Ambisonic (FOA) signals, and multichannel extension of SpecAugment. Our systems demonstrate a significant improvement over the baseline system.

Submitted 7 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: Submitted to DCASE2020 task3

arXiv:1911.09864 [pdf, other]

Constrained Heterogeneous Vehicle Path Planning for Large-area Coverage

Authors: Di Deng, Wei Jing, Yuhe Fu, Ziyin Huang, Jiahong Liu, Kenji Shimada

Abstract: There is a strong demand for covering a large area autonomously by multiple UAVs (Unmanned Aerial Vehicles) supported by a ground vehicle. Limited by UAVs' battery life and communication distance, complete coverage of large areas typically involves multiple take-offs and landings to recharge batteries, and the transportation of UAVs between operation areas by a ground vehicle. In this paper, we introduce a novel large-area-coverage planning framework which collectively optimizes the paths for aerial and ground vehicles. Our method first partitions a large area into sub-areas, each of which a given fleet of UAVs can cover without recharging batteries. UAV operation routes, or trails, are then generated for each sub-area. Next, the assignment of trails to different UAVs and the order in which UAVs visit their assigned trails are simultaneously optimized to minimize the total UAV flight distance. Finally, a ground vehicle transportation path which visits all sub-areas is found by solving an asymmetric traveling salesman problem (ATSP). Although finding the globally optimal trail assignment and transition paths can be formulated as a Mixed Integer Quadratic Program (MIQP), the MIQP is intractable even for small problems. We show that the solution time can be reduced to close-to-real-time levels by first finding a feasible solution using a Random Key Genetic Algorithm (RKGA), which is then locally optimized by solving a much smaller MIQP.

Submitted 22 November, 2019; originally announced November 2019.

arXiv:1910.13724 [pdf, other]

Metric Learning with Background Noise Class for Few-shot Detection of Rare Sound Events

Authors: Kazuki Shimada, Yuichiro Koyama, Akira Inoue

Abstract: Few-shot learning systems for sound event recognition have gained interest since they require only a few examples to adapt to new target classes without fine-tuning. However, such systems have only been applied to chunks of sounds for classification or verification. In this paper, we aim to achieve few-shot detection of rare sound events from query sequences that contain not only the target events but also other events and background noise. Therefore, it is required to prevent false positive reactions to both the other events and background noise. We propose metric learning with a background noise class for few-shot detection. The contribution is to present the explicit inclusion of background noise as an independent class, a suitable loss function that emphasizes this additional class, and a corresponding sampling strategy that assists training. It provides a feature space where the event classes and the background noise class are sufficiently separated. Evaluations on few-shot detection tasks, using DCASE 2017 task2 and ESC-50, show that our proposed method outperforms metric learning without considering the background noise class. The few-shot detection performance is also comparable to that of the DCASE 2017 task2 baseline system, which requires a huge amount of annotated audio data.

Submitted 18 February, 2020; v1 submitted 30 October, 2019; originally announced October 2019.

Comments: 5 pages, 5 figures, accepted for publication in IEEE ICASSP 2020

arXiv:1908.02901 [pdf, other]

Coverage Path Planning using Path Primitive Sampling and Primitive Coverage Graph for Visual Inspection

Authors: Wei Jing, Di Deng, Zhe Xiao, Yong Liu, Kenji Shimada

Abstract: Planning the path to gather the surface information of the target objects is crucial to improving the efficiency and reducing the overall cost of visual inspection applications with Unmanned Aerial Vehicles (UAVs). The Coverage Path Planning (CPP) problem is often formulated for these inspection applications because of the coverage requirement. Traditionally, researchers plan and optimize the viewpoints to capture the surface information first, and then optimize the path to visit the selected viewpoints. In this paper, we propose a novel planning method to directly sample and plan the inspection path for a camera-equipped UAV to acquire visual and geometric information of the target structures in a video stream setting in a complex 3D environment. The proposed planning method first generates via-points and path primitives around the target object by using sampling methods based on voxel dilation and subtraction. A novel Primitive Coverage Graph (PCG) is then proposed to encode the topological information, flying distances, and visibility information, with the sampled via-points and path primitives. Finally, graph search is performed to find the resultant path in the PCG to complete the inspection task with the coverage requirements. The effectiveness of the proposed method is demonstrated through simulation and field tests in this paper.

Submitted 7 August, 2019; originally announced August 2019.

Comments: Accepted by IROS 2019, 8 pages

arXiv:1906.08809 [pdf, other]

A Deep Reinforcement Learning Approach for Global Routing

Authors: Haiguang Liao, Wentai Zhang, Xuliang Dong, Barnabas Poczos, Kenji Shimada, Levent Burak Kara

Abstract: Global routing has been a historically challenging problem in electronic circuit design, where the challenge is to connect a large and arbitrary number of circuit components with wires without violating the design rules for the printed circuit boards or integrated circuits. Similar routing problems also exist in the design of complex hydraulic systems, pipe systems and logistic networks. Existing solutions typically consist of greedy algorithms and hard-coded heuristics. As such, existing approaches suffer from a lack of model flexibility and non-optimum solutions. As an alternative approach, this work presents a deep reinforcement learning method for solving the global routing problem in a simulated environment. At the heart of the proposed method is deep reinforcement learning that enables an agent to produce an optimal policy for routing based on the variety of problems it is presented with, leveraging the conjoint optimization mechanism of deep reinforcement learning. The conjoint optimization mechanism is explained and demonstrated in detail; the best network structure and the parameters of the learned model are explored. Based on the fine-tuned model, routing solutions and rewards are presented and analyzed. The results indicate that the approach can outperform the benchmark method of a sequential A* method, suggesting a promising potential for deep reinforcement learning for global routing and other routing or path planning problems in general.
Another major contribution of this work is the development of a global routing problem sets generator with the ability to generate parameterized global routing problem sets with different sizes and constraints, enabling evaluation of different routing algorithms and the generation of training datasets for future data-driven routing approaches.

Submitted 20 June, 2019; originally announced June 2019.

Comments: Preprint submitted to ASME JMD

arXiv:1904.07964 [pdf, other]

3D Shape Synthesis for Conceptual Design and Optimization Using Variational Autoencoders

Authors: Wentai Zhang, Zhangsihao Yang, Haoliang Jiang, Suyash Nigam, Soji Yamakawa, Tomotake Furuhata, Kenji Shimada, Levent Burak Kara

Abstract: We propose a data-driven 3D shape design method that can learn a generative model from a corpus of existing designs, and use this model to produce a wide range of new designs. The approach learns an encoding of the samples in the training corpus using an unsupervised variational autoencoder-decoder architecture, without the need for an explicit parametric representation of the original designs. To facilitate the generation of smooth final surfaces, we develop a 3D shape representation based on a distance transformation of the original 3D data, rather than using the commonly utilized binary voxel representation. Once established, the generator maps the latent space representations to the high-dimensional distance transformation fields, which are then automatically surfaced to produce 3D representations amenable to physics simulations or other objective function evaluation modules. We demonstrate our approach for the computational design of gliders that are optimized to attain prescribed performance scores. Our results show that when combined with genetic optimization, the proposed approach can generate a rich set of candidate concept designs that achieve prescribed functional goals, even when the original dataset has only a few or no solutions that achieve these goals.

Submitted 16 April, 2019; originally announced April 2019.

Comments: Preprint accepted by ASME IDETC/CIE 2019

arXiv:1903.09341 [pdf, other]

doi 10.1109/TASLP.2019.2907015

Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition

Authors: Kazuki Shimada, Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara

Abstract: This paper describes multichannel speech enhancement for improving automatic speech recognition (ASR) in noisy environments. Recently, minimum variance distortionless response (MVDR) beamforming has been widely used because it works well if the steering vector of speech and the spatial covariance matrix (SCM) of noise are given. To estimate such spatial information, conventional studies take a supervised approach that classifies each time-frequency (TF) bin into noise or speech by training a deep neural network (DNN). The performance of ASR, however, is degraded in an unknown noisy environment. To solve this problem, we take an unsupervised approach that decomposes each TF bin into the sum of speech and noise by using multichannel nonnegative matrix factorization (MNMF). This enables us to accurately estimate the SCMs of speech and noise not from observed noisy mixtures but from separated speech and noise components. In this paper we propose online MVDR beamforming by effectively initializing and incrementally updating the parameters of MNMF. Another main contribution is to comprehensively investigate the performances of ASR obtained by various types of spatial filters, i.e., time-invariant and variant versions of MVDR beamformers and those of rank-1 and full-rank multichannel Wiener filters, in combination with MNMF. The experimental results showed that the proposed method outperformed the state-of-the-art DNN-based beamforming method in unknown environments that did not match training data.

Submitted 31 March, 2019; v1 submitted 21 March, 2019; originally announced March 2019.

arXiv:1807.02740 [pdf, other]

doi 10.1016/j.cad.2019.02.006

Data-driven Upsampling of Point Clouds

Authors: Wentai Zhang, Haoliang Jiang, Zhangsihao Yang, Soji Yamakawa, Kenji Shimada, Levent Burak Kara

Abstract: High-quality upsampling of sparse 3D point clouds is critically useful for a wide range of geometric operations such as reconstruction, rendering, meshing, and analysis. In this paper, we propose a data-driven algorithm that enables an upsampling of 3D point clouds without the need for hard-coded rules. Our approach uses a deep network with Chamfer distance as the loss function, capable of learning the latent features in point clouds belonging to different object categories. We evaluate our algorithm across different amplification factors, with upsampling learned and performed on objects belonging to the same category as well as different categories. We also explore the desirable characteristics of input point clouds as a function of the distribution of the point samples. Finally, we demonstrate the performance of our algorithm in single-category training versus multi-category training scenarios. The final proposed model is compared against a baseline, optimization-based upsampling method. Results indicate that our algorithm is capable of generating more uniform and accurate upsamplings.

Submitted 27 December, 2018; v1 submitted 7 July, 2018; originally announced July 2018.

Comments: Preprint submitted to CAD

Journal ref: Computer-Aided Design, Volume 112, Pages 1-13, 2019

arXiv:1803.02723 [pdf, other]

Heterogeneous Vehicles Routing for Water Canal Damage Assessment

Authors: Di Deng, Tao Pang, Prasanth Palli, Fang Shu, Kenji Shimada

Abstract: In Japan, inspection of irrigation water canals has been mostly conducted manually. However, the huge demand for more regular inspections as infrastructure ages, coupled with the limited time window available for inspection, has rendered manual inspection increasingly insufficient. With shortened inspection time and reduced labor cost, automated inspection using a combination of unmanned aerial vehicles (UAVs) and ground vehicles (cars) has emerged as an attractive alternative to manual inspection. In this paper, we propose a path planning framework that generates optimal plans for UAVs and cars to inspect water canals in a large agricultural area (tens of square kilometers). In addition to optimality, the paths need to satisfy several constraints, in order to guarantee UAV navigation safety and to abide by local traffic regulations. In the proposed framework, the canal and road networks are first modeled as two graphs, which are then partitioned into smaller subgraphs that can be covered by a given fleet of UAVs within one battery charge. The problem of finding optimal paths for both UAVs and cars on the graphs, subject to the constraints, is formulated as a mixed-integer quadratic program (MIQP). The proposed framework can also quickly generate new plans when a current plan is interrupted. The effectiveness of the proposed framework is validated by simulation results showing the successful generation of plans covering all given canal segments, and the ability to quickly revise the plan when conditions change.

Submitted 7 March, 2018; originally announced March 2018.

Showing 1–42 of 42 results for author: Shimada, K