Search | arXiv e-print repository

Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Authors: Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Byungsoo Ko, Jonghwan Hyeon, Ho-Jin Choi

Abstract: Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works focus on (1) image-sharing behavior in singular sessions, leading to limited long-term social interaction, and (2) a lack of personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation datas… ▽ More Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works focus on (1) image-sharing behavior in singular sessions, leading to limited long-term social interaction, and (2) a lack of personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation dataset that covers a wide range of social personas in a multi-modality format, time intervals, and images. To construct Stark automatically, we propose a novel multi-modal contextualization framework, Mcu, that generates long-term multi-modal dialogue distilled from ChatGPT and our proposed Plan-and-Execute image aligner. Using our Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset in human evaluation. We make our source code and dataset publicly available. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: Project website: https://stark-dataset.github.io

arXiv:2406.09998 [pdf, other]

Understanding Pedestrian Movement Using Urban Sensing Technologies: The Promise of Audio-based Sensors

Authors: Chaeyeon Han, Pavan Seshadri, Yiwei Ding, Noah Posner, Bon Woo Koo, Animesh Agrawal, Alexander Lerch, Subhrajit Guhathakurta

Abstract: While various sensors have been deployed to monitor vehicular flows, sensing pedestrian movement is still nascent. Yet walking is a significant mode of travel in many cities, especially those in Europe, Africa, and Asia. Understanding pedestrian volumes and flows is essential for designing safer and more attractive pedestrian infrastructure and for controlling periodic overcrowding. This study dis… ▽ More While various sensors have been deployed to monitor vehicular flows, sensing pedestrian movement is still nascent. Yet walking is a significant mode of travel in many cities, especially those in Europe, Africa, and Asia. Understanding pedestrian volumes and flows is essential for designing safer and more attractive pedestrian infrastructure and for controlling periodic overcrowding. This study discusses a new approach to scale up urban sensing of people with the help of novel audio-based technology. It assesses the benefits and limitations of microphone-based sensors as compared to other forms of pedestrian sensing. A large-scale dataset called ASPED is presented, which includes high-quality audio recordings along with video recordings used for labeling the pedestrian count data. The baseline analyses highlight the promise of using audio sensors for pedestrian tracking, although algorithmic and technological improvements to make the sensors practically usable continue. This study also demonstrates how the data can be leveraged to predict pedestrian trajectories. Finally, it discusses the use cases and scenarios where audio-based pedestrian sensing can support better urban and transportation planning. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: submitted to Urban Informatics

arXiv:2406.09442 [pdf]

An insertable glucose sensor using a compact and cost-effective phosphorescence lifetime imager and machine learning

Authors: Artem Goncharov, Zoltan Gorocs, Ridhi Pradhan, Brian Ko, Ajmal Ajmal, Andres Rodriguez, David Baum, Marcell Veszpremi, Xilin Yang, Maxime Pindrys, Tianle Zheng, Oliver Wang, Jessica C. Ramella-Roman, Michael J. McShane, Aydogan Ozcan

Abstract: Optical continuous glucose monitoring (CGM) systems are emerging for personalized glucose management owing to their lower cost and prolonged durability compared to conventional electrochemical CGMs. Here, we report a computational CGM system, which integrates a biocompatible phosphorescence-based insertable biosensor and a custom-designed phosphorescence lifetime imager (PLI). This compact and cos… ▽ More Optical continuous glucose monitoring (CGM) systems are emerging for personalized glucose management owing to their lower cost and prolonged durability compared to conventional electrochemical CGMs. Here, we report a computational CGM system, which integrates a biocompatible phosphorescence-based insertable biosensor and a custom-designed phosphorescence lifetime imager (PLI). This compact and cost-effective PLI is designed to capture phosphorescence lifetime images of an insertable sensor through the skin, where the lifetime of the emitted phosphorescence signal is modulated by the local concentration of glucose. Because this phosphorescence signal has a very long lifetime compared to tissue autofluorescence or excitation leakage processes, it completely bypasses these noise sources by measuring the sensor emission over several tens of microseconds after the excitation light is turned off. The lifetime images acquired through the skin are processed by neural network-based models for misalignment-tolerant inference of glucose levels, accurately revealing normal, low (hypoglycemia) and high (hyperglycemia) concentration ranges. Using a 1-mm thick skin phantom mimicking the optical properties of human skin, we performed in vitro testing of the PLI using glucose-spiked samples, yielding 88.8% inference accuracy, also showing resilience to random and unknown misalignments within a lateral distance of ~4.7 mm with respect to the position of the insertable sensor underneath the skin phantom. Furthermore, the PLI accurately identified larger lateral misalignments beyond 5 mm, prompting user intervention for re-alignment. The misalignment-resilient glucose concentration inference capability of this compact and cost-effective phosphorescence lifetime imager makes it an appealing wearable diagnostics tool for real-time tracking of glucose and other biomarkers. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: 24 Pages, 4 Figures

arXiv:2405.16685 [pdf]

EdgeSphere: A Three-Tier Architecture for Cognitive Edge Computing

Authors: Christian Makaya, Keith Grueneberg, Bongjun Ko, David Wood, Nirmit Desai, Xiping Wang

Abstract: Computing at the edge is increasingly important as Internet of Things (IoT) devices at the edge generate massive amounts of data and pose challenges in transporting all that data to the Cloud where they can be analyzed. On the other hand, harnessing the edge data is essential for offering cognitive applications, if the challenges, such as device capabilities, connectivity, and heterogeneity can be… ▽ More Computing at the edge is increasingly important as Internet of Things (IoT) devices at the edge generate massive amounts of data and pose challenges in transporting all that data to the Cloud where they can be analyzed. On the other hand, harnessing the edge data is essential for offering cognitive applications, if the challenges, such as device capabilities, connectivity, and heterogeneity can be overcome. This paper proposes a novel three-tier architecture, called EdgeSphere, which harnesses resources of the edge devices, to analyze the data in situ at the edge. In contrast to the state-of-the-art cloud and mobile applications, EdgeSphere applications span across cloud, edge gateways, and edge devices. At its core, EdgeSphere builds on Apache Mesos to optimize resources usage and scheduling. EdgeSphere has been applied to practical scenarios and this paper describes the engineering challenges faced as well as innovative solutions. △ Less

Submitted 26 May, 2024; originally announced May 2024.

arXiv:2405.12648 [pdf, other]

Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency

Authors: Hyeongjin Kim, Sangwon Kim, Dasom Ahn, Jong Taek Lee, Byoung Chul Ko

Abstract: Scene graph generation (SGG) is an important task in image understanding because it represents the relationships between objects in an image as a graph structure, making it possible to understand the semantic relationships between objects intuitively. Previous SGG studies used a message-passing neural networks (MPNN) to update features, which can effectively reflect information about surrounding o… ▽ More Scene graph generation (SGG) is an important task in image understanding because it represents the relationships between objects in an image as a graph structure, making it possible to understand the semantic relationships between objects intuitively. Previous SGG studies used a message-passing neural networks (MPNN) to update features, which can effectively reflect information about surrounding objects. However, these studies have failed to reflect the co-occurrence of objects during SGG generation. In addition, they only addressed the long-tail problem of the training dataset from the perspectives of sampling and learning methods. To address these two problems, we propose CooK, which reflects the Co-occurrence Knowledge between objects, and the learnable term frequency-inverse document frequency (TF-l-IDF) to solve the long-tail problem. We applied the proposed model to the SGG benchmark dataset, and the results showed a performance improvement of up to 3.8% compared with existing state-of-the-art models in SGGen subtask. The proposed method exhibits generalization ability from the results obtained, showing uniform performance improvement for all MPNN models. △ Less

Submitted 21 May, 2024; originally announced May 2024.

Comments: Accepted by ICML2024

arXiv:2404.05256 [pdf, other]

StyleForge: Enhancing Text-to-Image Synthesis for Any Artistic Styles with Dual Binding

Authors: Junseo Park, Beomseok Ko, Hyeryung Jang

Abstract: Recent advancements in text-to-image models, such as Stable Diffusion, have showcased their ability to create visual images from natural language prompts. However, existing methods like DreamBooth struggle with capturing arbitrary art styles due to the abstract and multifaceted nature of stylistic attributes. We introduce Single-StyleForge, a novel approach for personalized text-to-image synthesis… ▽ More Recent advancements in text-to-image models, such as Stable Diffusion, have showcased their ability to create visual images from natural language prompts. However, existing methods like DreamBooth struggle with capturing arbitrary art styles due to the abstract and multifaceted nature of stylistic attributes. We introduce Single-StyleForge, a novel approach for personalized text-to-image synthesis across diverse artistic styles. Using approximately 15 to 20 images of the target style, Single-StyleForge establishes a foundational binding of a unique token identifier with a broad range of attributes of the target style. Additionally, auxiliary images are incorporated for dual binding that guides the consistent representation of crucial elements such as people within the target style. Furthermore, we present Multi-StyleForge, which enhances image quality and text alignment by binding multiple tokens to partial style attributes. Experimental evaluations across six distinct artistic styles demonstrate significant improvements in image quality and perceptual fidelity, as measured by FID, KID, and CLIP scores. △ Less

Submitted 17 July, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: 20 pages, 12 figuers

arXiv:2404.01954 [pdf, other]

HyperCLOVA X Technical Report

Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han , et al. (371 additional authors not shown)

Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t… ▽ More We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs. △ Less

Submitted 13 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 44 pages; updated authors list and fixed author names

arXiv:2312.05814 [pdf, other]

Neural Speech Embeddings for Speech Synthesis Based on Deep Generative Networks

Authors: Seo-Hyun Lee, Young-Eun Lee, Soowon Kim, Byung-Kwan Ko, Jun-Young Kim, Seong-Whan Lee

Abstract: Brain-to-speech technology represents a fusion of interdisciplinary applications encompassing fields of artificial intelligence, brain-computer interfaces, and speech synthesis. Neural representation learning based intention decoding and speech synthesis directly connects the neural activity to the means of human linguistic communication, which may greatly enhance the naturalness of communication.… ▽ More Brain-to-speech technology represents a fusion of interdisciplinary applications encompassing fields of artificial intelligence, brain-computer interfaces, and speech synthesis. Neural representation learning based intention decoding and speech synthesis directly connects the neural activity to the means of human linguistic communication, which may greatly enhance the naturalness of communication. With the current discoveries on representation learning and the development of the speech synthesis technologies, direct translation of brain signals into speech has shown great promise. Especially, the processed input features and neural speech embeddings which are given to the neural network play a significant role in the overall performance when using deep generative models for speech generation from brain signals. In this paper, we introduce the current brain-to-speech technology with the possibility of speech synthesis from brain signals, which may ultimately facilitate innovation in non-verbal communication. Also, we perform comprehensive analysis on the neural features and neural speech embeddings underlying the neurophysiological activation while performing speech, which may play a significant role in the speech synthesis works. △ Less

Submitted 26 February, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

Comments: 4 pages

arXiv:2311.01192 [pdf, other]

Semantic Scene Graph Generation Based on an Edge Dual Scene Graph and Message Passing Neural Network

Authors: Hyeongjin Kim, Sangwon Kim, Jong Taek Lee, Byoung Chul Ko

Abstract: Along with generative AI, interest in scene graph generation (SGG), which comprehensively captures the relationships and interactions between objects in an image and creates a structured graph-based representation, has significantly increased in recent years. However, relying on object-centric and dichotomous relationships, existing SGG methods have a limited ability to accurately predict detailed… ▽ More Along with generative AI, interest in scene graph generation (SGG), which comprehensively captures the relationships and interactions between objects in an image and creates a structured graph-based representation, has significantly increased in recent years. However, relying on object-centric and dichotomous relationships, existing SGG methods have a limited ability to accurately predict detailed relationships. To solve these problems, a new approach to the modeling multiobject relationships, called edge dual scene graph generation (EdgeSGG), is proposed herein. EdgeSGG is based on a edge dual scene graph and Dual Message Passing Neural Network (DualMPNN), which can capture rich contextual interactions between unconstrained objects. To facilitate the learning of edge dual scene graphs with a symmetric graph structure, the proposed DualMPNN learns both object- and relation-centric features for more accurately predicting relation-aware contexts and allows fine-grained relational updates between objects. A comparative experiment with state-of-the-art (SoTA) methods was conducted using two public datasets for SGG operations and six metrics for three subtasks. Compared with SoTA approaches, the proposed model exhibited substantial performance improvements across all SGG subtasks. Furthermore, experiment on long-tail distributions revealed that incorporating the relationships between objects effectively mitigates existing long-tail problems. △ Less

Submitted 2 November, 2023; originally announced November 2023.

arXiv:2309.10825 [pdf, other]

Latent Disentanglement in Mesh Variational Autoencoders Improves the Diagnosis of Craniofacial Syndromes and Aids Surgical Planning

Authors: Simone Foti, Alexander J. Rickart, Bongjin Koo, Eimear O' Sullivan, Lara S. van de Lande, Athanasios Papaioannou, Roman Khonsari, Danail Stoyanov, N. u. Owase Jeelani, Silvia Schievano, David J. Dunaway, Matthew J. Clarkson

Abstract: The use of deep learning to undertake shape analysis of the complexities of the human head holds great promise. However, there have traditionally been a number of barriers to accurate modelling, especially when operating on both a global and local level. In this work, we will discuss the application of the Swap Disentangled Variational Autoencoder (SD-VAE) with relevance to Crouzon, Apert and Muen… ▽ More The use of deep learning to undertake shape analysis of the complexities of the human head holds great promise. However, there have traditionally been a number of barriers to accurate modelling, especially when operating on both a global and local level. In this work, we will discuss the application of the Swap Disentangled Variational Autoencoder (SD-VAE) with relevance to Crouzon, Apert and Muenke syndromes. Although syndrome classification is performed on the entire mesh, it is also possible, for the first time, to analyse the influence of each region of the head on the syndromic phenotype. By manipulating specific parameters of the generative model, and producing procedure-specific new shapes, it is also possible to simulate the outcome of a range of craniofacial surgical procedures. This opens new avenues to advance diagnosis, aids surgical planning and allows for the objective evaluation of surgical outcomes. △ Less

Submitted 5 September, 2023; originally announced September 2023.

arXiv:2309.06531 [pdf, other]

ASPED: An Audio Dataset for Detecting Pedestrians

Authors: Pavan Seshadri, Chaeyeon Han, Bon-Woo Koo, Noah Posner, Subhrajit Guhathakurta, Alexander Lerch

Abstract: We introduce the new audio analysis task of pedestrian detection and present a new large-scale dataset for this task. While the preliminary results prove the viability of using audio approaches for pedestrian detection, they also show that this challenging task cannot be easily solved with standard approaches. We introduce the new audio analysis task of pedestrian detection and present a new large-scale dataset for this task. While the preliminary results prove the viability of using audio approaches for pedestrian detection, they also show that this challenging task cannot be easily solved with standard approaches. △ Less

Submitted 16 January, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

Comments: 4+1 pages, ICASSP 2024

arXiv:2306.07274 [pdf, other]

CryoChains: Heterogeneous Reconstruction of Molecular Assembly of Semi-flexible Chains from Cryo-EM Images

Authors: Bongjin Koo, Julien Martel, Ariana Peck, Axel Levy, Frédéric Poitevin, Nina Miolane

Abstract: Cryogenic electron microscopy (cryo-EM) has transformed structural biology by allowing to reconstruct 3D biomolecular structures up to near-atomic resolution. However, the 3D reconstruction process remains challenging, as the 3D structures may exhibit substantial shape variations, while the 2D image acquisition suffers from a low signal-to-noise ratio, requiring to acquire very large datasets that… ▽ More Cryogenic electron microscopy (cryo-EM) has transformed structural biology by allowing to reconstruct 3D biomolecular structures up to near-atomic resolution. However, the 3D reconstruction process remains challenging, as the 3D structures may exhibit substantial shape variations, while the 2D image acquisition suffers from a low signal-to-noise ratio, requiring to acquire very large datasets that are time-consuming to process. Current reconstruction methods are precise but computationally expensive, or faster but lack a physically-plausible model of large molecular shape variations. To fill this gap, we propose CryoChains that encodes large deformations of biomolecules via rigid body transformation of their chains, while representing their finer shape variations with the normal mode analysis framework of biophysics. Our synthetic data experiments on the human GABA\textsubscript{B} and heat shock protein show that CryoChains gives a biophysically-grounded quantification of the heterogeneous conformations of biomolecules, while reconstructing their 3D molecular structures at an improved resolution compared to the current fastest, interpretable deep learning method. △ Less

Submitted 15 July, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

arXiv:2303.08906 [pdf, other]

VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression

Authors: Won Jo, Geuntaek Lim, Gwangjin Lee, Hyunwoo Kim, Byungsoo Ko, Yukyung Choi

Abstract: In content-based video retrieval (CBVR), dealing with large-scale collections, efficiency is as important as accuracy; thus, several video-level feature-based studies have actively been conducted. Nevertheless, owing to the severe difficulty of embedding a lengthy and untrimmed video into a single feature, these studies have been insufficient for accurate retrieval compared to frame-level feature-… ▽ More In content-based video retrieval (CBVR), dealing with large-scale collections, efficiency is as important as accuracy; thus, several video-level feature-based studies have actively been conducted. Nevertheless, owing to the severe difficulty of embedding a lengthy and untrimmed video into a single feature, these studies have been insufficient for accurate retrieval compared to frame-level feature-based studies. In this paper, we show that appropriate suppression of irrelevant frames can provide insight into the current obstacles of the video-level approaches. Furthermore, we propose a Video-to-Video Suppression network (VVS) as a solution. VVS is an end-to-end framework that consists of an easy distractor elimination stage to identify which frames to remove and a suppression weight generation stage to determine the extent to suppress the remaining frames. This structure is intended to effectively describe an untrimmed video with varying content and meaningless information. Its efficacy is proved via extensive experiments, and we show that our approach is not only state-of-the-art in video-level approaches but also has a fast inference time despite possessing retrieval capabilities close to those of frame-level approaches. Code is available at https://github.com/sejong-rcv/VVS △ Less

Submitted 19 December, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: AAAI-24

arXiv:2302.12798 [pdf, other]

doi 10.1111/cgf.14793

3D Generative Model Latent Disentanglement via Local Eigenprojection

Authors: Simone Foti, Bongjin Koo, Danail Stoyanov, Matthew J. Clarkson

Abstract: Designing realistic digital humans is extremely complex. Most data-driven generative models used to simplify the creation of their underlying geometric shape do not offer control over the generation of local shape attributes. In this paper, we overcome this limitation by introducing a novel loss function grounded in spectral geometry and applicable to different neural-network-based generative mode… ▽ More Designing realistic digital humans is extremely complex. Most data-driven generative models used to simplify the creation of their underlying geometric shape do not offer control over the generation of local shape attributes. In this paper, we overcome this limitation by introducing a novel loss function grounded in spectral geometry and applicable to different neural-network-based generative models of 3D head and body meshes. Encouraging the latent variables of mesh variational autoencoders (VAEs) or generative adversarial networks (GANs) to follow the local eigenprojections of identity attributes, we improve latent disentanglement and properly decouple the attribute creation. Experimental results show that our local eigenprojection disentangled (LED) models not only offer improved disentanglement with respect to the state-of-the-art, but also maintain good generation capabilities with training times comparable to the vanilla implementations of the models. △ Less

Submitted 4 April, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

Comments: Computer Graphics Forum 2023

arXiv:2212.10926 [pdf, other]

The Internet of Bio-Nano Things in Blood Vessels: System Design and Prototypes

Authors: Changmin Lee, Bon-Hong Koo, Chan-Byoung Chae, Robert Schober

Abstract: In this paper, we investigate the Internet of Bio-Nano Things (IoBNT) which relates to networks formed by molecular communications. By providing a means of communication through the ubiquitously connected blood vessels (arteries, veins, and capillaries), molecular communication-based IoBNT enables a host of new eHealth applications. For example, an organ monitoring sensor can transfer internal bod… ▽ More In this paper, we investigate the Internet of Bio-Nano Things (IoBNT) which relates to networks formed by molecular communications. By providing a means of communication through the ubiquitously connected blood vessels (arteries, veins, and capillaries), molecular communication-based IoBNT enables a host of new eHealth applications. For example, an organ monitoring sensor can transfer internal body signals through the IoBNT for health monitoring applications. We empirically show that blood vessel channels introduce a new set of challenges for the design of molecular communication systems in comparison to free-space channels. We then propose cylindrical duct channel models and discuss the corresponding system designs conforming to the channel characteristics. Furthermore, based on prototype implementations, we confirm that molecular communication techniques can be utilized for composing the IoBNT. We believe that the promising results presented in this work, together with the rich research challenges that lie ahead, are strong indicators that IoBNT with molecular communications can drive novel applications for emerging eHealth systems. △ Less

Submitted 21 December, 2022; originally announced December 2022.

arXiv:2212.05638 [pdf, other]

Cross-Modal Learning with 3D Deformable Attention for Action Recognition

Authors: Sangwon Kim, Dasom Ahn, Byoung Chul Ko

Abstract: An important challenge in vision-based action recognition is the embedding of spatiotemporal features with two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D d… ▽ More An important challenge in vision-based action recognition is the embedding of spatiotemporal features with two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D deformability, local joint stride, and temporal stride attention. The two cross-modal tokens are input into the 3D deformable attention module to create a cross-attention token with a reflected spatiotemporal correlation. Local joint stride attention is applied to spatially combine attention and pose tokens. Temporal stride attention temporally reduces the number of input tokens in the attention module and supports temporal expression learning without the simultaneous use of all tokens. The deformable transformer iterates L-times and combines the last cross-modal token for classification. The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and PennAction datasets, and showed results better than or similar to pre-trained state-of-the-art methods even without a pre-training process. In addition, by visualizing important joints and correlations during action recognition through spatial joint and temporal stride attention, the possibility of achieving an explainable potential for action recognition is presented. △ Less

Submitted 17 August, 2023; v1 submitted 11 December, 2022; originally announced December 2022.

Comments: Accepted by ICCV2023

arXiv:2212.04119 [pdf, other]

DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset

Authors: Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Jonghwan Hyeon, Ho-Jin Choi

Abstract: As sharing images in an instant message is a crucial factor, there has been active research on learning an image-text multi-modal dialogue models. However, training a well-generalized multi-modal dialogue model remains challenging due to the low quality and limited diversity of images per dialogue in existing multi-modal dialogue datasets. In this paper, we propose an automated pipeline to constru… ▽ More As sharing images in an instant message is a crucial factor, there has been active research on learning an image-text multi-modal dialogue models. However, training a well-generalized multi-modal dialogue model remains challenging due to the low quality and limited diversity of images per dialogue in existing multi-modal dialogue datasets. In this paper, we propose an automated pipeline to construct a multi-modal dialogue dataset, ensuring both dialogue quality and image diversity without requiring minimum human effort. In our pipeline, to guarantee the coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments - specifically, the utterance, speaker, rationale, and image description. Furthermore, we leverage CLIP similarity to maintain consistency between aligned multiple images to the utterance. Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset that surpasses existing datasets in terms of quality and diversity in human evaluation. Our comprehensive experiments highlight that when multi-modal dialogue models are trained using our dataset, their generalization performance on unseen dialogue datasets is significantly enhanced. We make our source code and dataset publicly available. △ Less

Submitted 29 March, 2024; v1 submitted 8 December, 2022; originally announced December 2022.

Comments: NAACL 2024

arXiv:2212.04114 [pdf, other]

Group Generalized Mean Pooling for Vision Transformer

Authors: Byungsoo Ko, Han-Gyu Kim, Byeongho Heo, Sangdoo Yun, Sanghyuk Chun, Geonmo Gu, Wonjae Kim

Abstract: Vision Transformer (ViT) extracts the final representation from either class token or an average of all patch tokens, following the architecture of Transformer in Natural Language Processing (NLP) or Convolutional Neural Networks (CNNs) in computer vision. However, studies for the best way of aggregating the patch tokens are still limited to average pooling, while widely-used pooling strategies, s… ▽ More Vision Transformer (ViT) extracts the final representation from either class token or an average of all patch tokens, following the architecture of Transformer in Natural Language Processing (NLP) or Convolutional Neural Networks (CNNs) in computer vision. However, studies for the best way of aggregating the patch tokens are still limited to average pooling, while widely-used pooling strategies, such as max and GeM pooling, can be considered. Despite their effectiveness, the existing pooling strategies do not consider the architecture of ViT and the channel-wise difference in the activation maps, aggregating the crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. As ViT groups the channels via a multi-head attention mechanism, grouping the channels by GGeM leads to lower head-wise dependence while amplifying important channels on the activation maps. Exploiting GGeM shows 0.1%p to 0.7%p performance boosts compared to the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models in ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating the superiority of GGeM for a variety of tasks. GGeM is a simple algorithm in that only a few lines of code are necessary for implementation. △ Less

Submitted 8 December, 2022; originally announced December 2022.

arXiv:2212.02047 [pdf, other]

Towards Neural Decoding of Imagined Speech based on Spoken Speech

Authors: Seo-Hyun Lee, Young-Eun Lee, Soowon Kim, Byung-Kwan Ko, Seong-Whan Lee

Abstract: Decoding imagined speech from human brain signals is a challenging and important issue that may enable human communication via brain signals. While imagined speech can be the paradigm for silent communication via brain signals, it is always hard to collect enough stable data to train the decoding model. Meanwhile, spoken speech data is relatively easy and to obtain, implying the significance of ut… ▽ More Decoding imagined speech from human brain signals is a challenging and important issue that may enable human communication via brain signals. While imagined speech can be the paradigm for silent communication via brain signals, it is always hard to collect enough stable data to train the decoding model. Meanwhile, spoken speech data is relatively easy and to obtain, implying the significance of utilizing spoken speech brain signals to decode imagined speech. In this paper, we performed a preliminary analysis to find out whether if it would be possible to utilize spoken speech electroencephalography data to decode imagined speech, by simply applying the pre-trained model trained with spoken speech brain signals to decode imagined speech. While the classification performance of imagined speech data solely used to train and validation was 30.5 %, the transferred performance of spoken speech based classifier to imagined speech data displayed average accuracy of 26.8 % which did not have statistically significant difference compared to the imagined speech based classifier (p = 0.0983, chi-square = 4.64). For more comprehensive analysis, we compared the result with the visual imagery dataset, which would naturally be less related to spoken speech compared to the imagined speech. As a result, visual imagery have shown solely trained performance of 31.8 % and transferred performance of 26.3 % which had shown statistically significant difference between each other (p = 0.022, chi-square = 7.64). Our results imply the potential of applying spoken speech to decode imagined speech, as well as their underlying common features. △ Less

Submitted 14 February, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

Comments: 4 pages, 2 figures

arXiv:2210.07503 [pdf, other]

STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition

Authors: Dasom Ahn, Sangwon Kim, Hyunsu Hong, Byoung Chul Ko

Abstract: In action recognition, although the combination of spatio-temporal videos and skeleton features can improve the recognition performance, a separate model and balancing feature representation for cross-modal data are required. To solve these problems, we propose Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a recognizable vector. First, from t… ▽ More In action recognition, although the combination of spatio-temporal videos and skeleton features can improve the recognition performance, a separate model and balancing feature representation for cross-modal data are required. To solve these problems, we propose Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a recognizable vector. First, from the input video and skeleton sequence, video frames are output as global grid tokens and skeletons are output as joint map tokens, respectively. These tokens are then aggregated into multi-class tokens and input into STAR-transformer. The STAR-transformer encoder layer consists of a full self-attention (FAttn) module and a proposed zigzag spatio-temporal attention (ZAttn) module. Similarly, the continuous decoder consists of a FAttn module and a proposed binary spatio-temporal attention (BAttn) module. STAR-transformer learns an efficient multi-feature representation of the spatio-temporal features by properly arranging pairings of the FAttn, ZAttn, and BAttn modules. Experimental results on the Penn-Action, NTU RGB+D 60, and 120 datasets show that the proposed method achieves a promising improvement in performance in comparison to previous state-of-the-art methods. △ Less

Submitted 14 October, 2022; originally announced October 2022.

Comments: Accepted by WACV 2023

MSC Class: 68T07

arXiv:2210.02254 [pdf, other]

Granularity-aware Adaptation for Image Retrieval over Multiple Tasks

Authors: Jon Almazán, Byungsoo Ko, Geonmo Gu, Diane Larlus, Yannis Kalantidis

Abstract: Strong image search models can be learned for a specific domain, ie. set of labels, provided that some labeled images of that domain are available. A practical visual search model, however, should be versatile enough to solve multiple retrieval tasks simultaneously, even if those cover very different specialized domains. Additionally, it should be able to benefit from even unlabeled images from th… ▽ More Strong image search models can be learned for a specific domain, ie. set of labels, provided that some labeled images of that domain are available. A practical visual search model, however, should be versatile enough to solve multiple retrieval tasks simultaneously, even if those cover very different specialized domains. Additionally, it should be able to benefit from even unlabeled images from these various retrieval tasks. This is the more practical scenario that we consider in this paper. We address it with the proposed Grappa, an approach that starts from a strong pretrained model, and adapts it to tackle multiple retrieval tasks concurrently, using only unlabeled images from the different task domains. We extend the pretrained model with multiple independently trained sets of adaptors that use pseudo-label sets of different sizes, effectively mimicking different pseudo-granularities. We reconcile all adaptor sets into a single unified model suited for all retrieval tasks by learning fusion layers that we guide by propagating pseudo-granularity attentions across neighbors in the feature space. Results on a benchmark composed of six heterogeneous retrieval tasks show that the unsupervised Grappa model improves the zero-shot performance of a state-of-the-art self-supervised learning model, and in some places reaches or improves over a task label-aware oracle that selects the most fitting pseudo-granularity per task. △ Less

Submitted 5 October, 2022; originally announced October 2022.

Comments: ECCV 2022

arXiv:2209.15121 [pdf, other]

Heterogeneous reconstruction of deformable atomic models in Cryo-EM

Authors: Youssef Nashed, Ariana Peck, Julien Martel, Axel Levy, Bongjin Koo, Gordon Wetzstein, Nina Miolane, Daniel Ratner, Frédéric Poitevin

Abstract: Cryogenic electron microscopy (cryo-EM) provides a unique opportunity to study the structural heterogeneity of biomolecules. Being able to explain this heterogeneity with atomic models would help our understanding of their functional mechanisms but the size and ruggedness of the structural space (the space of atomic 3D cartesian coordinates) presents an immense challenge. Here, we describe a heter… ▽ More Cryogenic electron microscopy (cryo-EM) provides a unique opportunity to study the structural heterogeneity of biomolecules. Being able to explain this heterogeneity with atomic models would help our understanding of their functional mechanisms but the size and ruggedness of the structural space (the space of atomic 3D cartesian coordinates) presents an immense challenge. Here, we describe a heterogeneous reconstruction method based on an atomistic representation whose deformation is reduced to a handful of collective motions through normal mode analysis. Our implementation uses an autoencoder. The encoder jointly estimates the amplitude of motion along the normal modes and the 2D shift between the center of the image and the center of the molecule . The physics-based decoder aggregates a representation of the heterogeneity readily interpretable at the atomic level. We illustrate our method on 3 synthetic datasets corresponding to different distributions along a simulated trajectory of adenylate kinase transitioning from its open to its closed structures. We show for each distribution that our approach is able to recapitulate the intermediate atomic models with atomic-level accuracy. △ Less

Submitted 29 September, 2022; originally announced September 2022.

Comments: 8 pages, 1 figure

arXiv:2208.04545 [pdf, ps, other]

Massive MIMO Channel Prediction Using Machine Learning: Power of Domain Transformation

Authors: Beomsoo Ko, Hwanjin Kim, Junil Choi

Abstract: To compensate the loss from outdated channel state information in wideband massive multiple-input multipleoutput (MIMO) systems, channel prediction can be performed by leveraging the temporal correlation of wireless channels. Machine learning (ML)-based channel predictors for massive MIMO systems were designed recently; however, the time overhead to collect a large amount of training data directly… ▽ More To compensate the loss from outdated channel state information in wideband massive multiple-input multipleoutput (MIMO) systems, channel prediction can be performed by leveraging the temporal correlation of wireless channels. Machine learning (ML)-based channel predictors for massive MIMO systems were designed recently; however, the time overhead to collect a large amount of training data directly affects the latency of the system. In this paper, we propose a novel ML-based channel prediction technique, which can reduce the time overhead to collect the training data by transforming the domain of channels from subcarrier to antenna in wideband massive MIMO systems. Numerical results show that the proposed technique can not only reduce the time overhead but also give additional performance gain compared to the ML-based channel prediction techniques without the domain transformation. △ Less

Submitted 9 August, 2022; originally announced August 2022.

arXiv:2208.04541 [pdf, ps, other]

Coverage Increase at THz Frequencies: A Cooperative Rate-Splitting Approach

Authors: Hyesang Cho, Beomsoo Ko, Bruno Clerckx, Junil Choi

Abstract: Numerous studies claim that terahertz (THz) communication will be an essential piece of sixth-generation wireless communication systems. Its promising potential also comes with major challenges, in particular the reduced coverage due to harsh propagation loss, hardware constraints, and blockage vulnerability. To increase the coverage of THz communication, we revisit cooperative communication. We p… ▽ More Numerous studies claim that terahertz (THz) communication will be an essential piece of sixth-generation wireless communication systems. Its promising potential also comes with major challenges, in particular the reduced coverage due to harsh propagation loss, hardware constraints, and blockage vulnerability. To increase the coverage of THz communication, we revisit cooperative communication. We propose a new type of cooperative rate-splitting (CRS) called extraction-based CRS (eCRS). Furthermore, we explore two extreme cases of eCRS, namely, identical eCRS and distinct eCRS. To enable the proposed eCRS framework, we design a novel THz cooperative channel model by considering unique characteristics of THz communication. Through mathematical derivations and convex optimization techniques considering the THz cooperative channel model, we derive local optimal solutions for the two cases of eCRS and a global optimal closed form solution for a specific scenario. Finally, we propose a novel channel estimation technique that not only specifies the channel value, but also the time delay of the channel from each cooperating user equipment to fully utilize the THz cooperative channel. In simulation results, we verify the validity of the two cases of our proposed framework and channel estimation technique. △ Less

Submitted 9 August, 2022; originally announced August 2022.

Comments: 13 pages, 8 figures, submitted to IEEE Transactions on Wireless Communications (TWC)

arXiv:2206.12059 [pdf]

Data Augmentation and Squeeze-and-Excitation Network on Multiple Dimension for Sound Event Localization and Detection in Real Scenes

Authors: Byeong-Yun Ko, Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Seung-Deok Choi, Yong-Hwa Park

Abstract: Performance of sound event localization and detection (SELD) in real scenes is limited by small size of SELD dataset, due to difficulty in obtaining sufficient amount of realistic multi-channel audio data recordings with accurate label. We used two main strategies to solve problems arising from the small real SELD dataset. First, we applied various data augmentation methods on all data dimensions:… ▽ More Performance of sound event localization and detection (SELD) in real scenes is limited by small size of SELD dataset, due to difficulty in obtaining sufficient amount of realistic multi-channel audio data recordings with accurate label. We used two main strategies to solve problems arising from the small real SELD dataset. First, we applied various data augmentation methods on all data dimensions: channel, frequency and time. We also propose original data augmentation method named Moderate Mixup in order to simulate situations where noise floor or interfering events exist. Second, we applied Squeeze-and-Excitation block on channel and frequency dimensions to efficiently extract feature characteristics. Result of our trained models on the STARSS22 test dataset achieved the best ER, F1, LE, and LR of 0.53, 49.8%, 16.0deg., and 56.2% respectively. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: Technical Report submitted for DCASE2022 Challenge Task3

arXiv:2206.00518 [pdf, other]

Efficient Scheduling of Data Augmentation for Deep Reinforcement Learning

Authors: Byungchan Ko, Jungseul Ok

Abstract: In deep reinforcement learning (RL), data augmentation is widely considered as a tool to induce a set of useful priors about semantic consistency and improve sample efficiency and generalization performance. However, even when the prior is useful for generalization, distilling it to RL agent often interferes with RL training and degenerates sample efficiency. Meanwhile, the agent is forgetful of t… ▽ More In deep reinforcement learning (RL), data augmentation is widely considered as a tool to induce a set of useful priors about semantic consistency and improve sample efficiency and generalization performance. However, even when the prior is useful for generalization, distilling it to RL agent often interferes with RL training and degenerates sample efficiency. Meanwhile, the agent is forgetful of the prior due to the non-stationary nature of RL. These observations suggest two extreme schedules of distillation: (i) over the entire training; or (ii) only at the end. Hence, we devise a stand-alone network distillation method to inject the consistency prior at any time (even after RL), and a simple yet efficient framework to automatically schedule the distillation. Specifically, the proposed framework first focuses on mastering train environments regardless of generalization by adaptively deciding which {\it or no} augmentation to be used for the training. After this, we add the distillation to extract the remaining benefits for generalization from all the augmentations, which requires no additional new samples. In our experiments, we demonstrate the utility of the proposed framework, in particular, that considers postponing the augmentation to the end of RL training. △ Less

Submitted 1 March, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2102.08581

Journal ref: Neurips2022

arXiv:2203.14463 [pdf, other]

Large-scale Bilingual Language-Image Contrastive Learning

Authors: Byungsoo Ko, Geonmo Gu

Abstract: This paper is a technical report to share our experience and findings building a Korean and English bilingual multimodal model. While many of the multimodal datasets focus on English and multilingual multimodal research uses machine-translated texts, employing such machine-translated texts is limited to describing unique expressions, cultural information, and proper noun in languages other than En… ▽ More This paper is a technical report to share our experience and findings building a Korean and English bilingual multimodal model. While many of the multimodal datasets focus on English and multilingual multimodal research uses machine-translated texts, employing such machine-translated texts is limited to describing unique expressions, cultural information, and proper noun in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with such training schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) training multimodal model without cross-lingual relation can learn the relation via visual semantics; 3) our bilingual KELIP can capture cultural differences of visual semantics for the same meaning of words; 4) a large-scale multimodal model can be used for multimodal feature analogy. We hope that this work will provide helpful experience and findings for future research. We provide an open-source pre-trained KELIP. △ Less

Submitted 14 April, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

Comments: Accepted by ICLRW2022

arXiv:2203.03166 [pdf]

HRTF measurement for accurate sound localization cues

Authors: Gyeong-Tae Lee, Sang-Min Choi, Byeong-Yun Ko, Yong-Hwa Park

Abstract: A new database of head-related transfer functions (HRTFs) for accurate sound source localization is presented through precise measurement and post-processing in terms of improved frequency bandwidth and causality of head-related impulse responses (HRIRs) for accurate spectral cue (SC) and interaural time difference (ITD), respectively. The improvement effects of the proposed methods on binaural so… ▽ More A new database of head-related transfer functions (HRTFs) for accurate sound source localization is presented through precise measurement and post-processing in terms of improved frequency bandwidth and causality of head-related impulse responses (HRIRs) for accurate spectral cue (SC) and interaural time difference (ITD), respectively. The improvement effects of the proposed methods on binaural sound localization cues were investigated. To achieve sufficient frequency bandwidth with a single source, a one-way sealed speaker module was designed to obtain wide band frequency response based on electro-acoustics, whereas most existing HRTF databases rely on a two-way vented loudspeaker that has multiple sources. The origin transfer function at the head center was obtained by the proposed measurement scheme using a 0 degree on-axis microphone to ensure accurate spectral cue pattern of HRTFs, whereas in the previous measurements with a 90 degree off-axis microphone, the magnitude response of the origin transfer function fluctuated and decreased with increasing frequency, causing erroneous SCs of HRTFs. To prevent discontinuity of ITD due to non-causality of ipsilateral HRTFs, obtained HRIRs were circularly shifted by time delay considering the head radius of the measurement subject. Finally, various sound localization cues such as ITD, interaural level difference (ILD), SC, and horizontal plane directivity (HPD) were derived from the presented HRTFs, and improvements on binaural sound localization cues were examined. As a result, accurate SC patterns of HRTFs were confirmed through the proposed measurement scheme using the 0 degree on-axis microphone, and continuous ITD patterns were obtained due to the non-causality compensation. Source codes and presented HRTF database are available to relevant research groups at GitHub (https://github.com/han-saram/HRTF-HATS-KAIST). △ Less

Submitted 5 April, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: 39 pages, 27 figures, and 1 table

arXiv:2112.08816 [pdf, other]

Deep Hash Distillation for Image Retrieval

Authors: Young Kyun Jang, Geonmo Gu, Byungsoo Ko, Isaac Kang, Nam Ik Cho

Abstract: In hash-based image retrieval systems, degraded or transformed inputs usually generate different codes from the original, deteriorating the retrieval accuracy. To mitigate this issue, data augmentation can be applied during training. However, even if augmented samples of an image are similar in real feature space, the quantization can scatter them far away in Hamming space. This results in represe… ▽ More In hash-based image retrieval systems, degraded or transformed inputs usually generate different codes from the original, deteriorating the retrieval accuracy. To mitigate this issue, data augmentation can be applied during training. However, even if augmented samples of an image are similar in real feature space, the quantization can scatter them far away in Hamming space. This results in representation discrepancies that can impede training and degrade performance. In this work, we propose a novel self-distilled hashing scheme to minimize the discrepancy while exploiting the potential of augmented data. By transferring the hash knowledge of the weakly-transformed samples to the strong ones, we make the hash code insensitive to various transformations. We also introduce hash proxy-based similarity learning and binary cross entropy-based quantization loss to provide fine quality hash codes. Ultimately, we construct a deep hashing framework that not only improves the existing deep hashing approaches, but also achieves the state-of-the-art retrieval results. Extensive experiments are conducted and confirm the effectiveness of our work. △ Less

Submitted 13 July, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: ECCV2022

arXiv:2111.12448 [pdf, other]

3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces

Authors: Simone Foti, Bongjin Koo, Danail Stoyanov, Matthew J. Clarkson

Abstract: Learning a disentangled, interpretable, and structured latent representation in 3D generative models of faces and bodies is still an open problem. The problem is particularly acute when control over identity features is required. In this paper, we propose an intuitive yet effective self-supervised approach to train a 3D shape variational autoencoder (VAE) which encourages a disentangled latent rep… ▽ More Learning a disentangled, interpretable, and structured latent representation in 3D generative models of faces and bodies is still an open problem. The problem is particularly acute when control over identity features is required. In this paper, we propose an intuitive yet effective self-supervised approach to train a 3D shape variational autoencoder (VAE) which encourages a disentangled latent representation of identity features. Curating the mini-batch generation by swapping arbitrary features across different shapes allows to define a loss function leveraging known differences and similarities in the latent representations. Experimental results conducted on 3D meshes show that state-of-the-art methods for latent disentanglement are not able to disentangle identity features of faces and bodies. Our proposed method properly decouples the generation of such features while maintaining good representation and reconstruction capabilities. △ Less

Submitted 23 March, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: Accepted for publication at CVPR2022

arXiv:2111.09963 [pdf, other]

doi 10.1145/3487553.3524215

Beyond NDCG: behavioral testing of recommender systems with RecList

Authors: Patrick John Chia, Jacopo Tagliabue, Federico Bianchi, Chloe He, Brian Ko

Abstract: As with most Machine Learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis and deployment-specific tests must be employed to ensure the desired quality in actual deployments. In this paper, we propose RecList, a behavioral-based testing methodology. Rec… ▽ More As with most Machine Learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis and deployment-specific tests must be employed to ensure the desired quality in actual deployments. In this paper, we propose RecList, a behavioral-based testing methodology. RecList organizes recommender systems by use case and introduces a general plug-and-play procedure to scale up behavioral testing. We demonstrate its capabilities by analyzing known algorithms and black-box commercial systems, and we release RecList as an open source, extensible package for the community. △ Less

Submitted 27 March, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

Comments: Paper accepted to the WebConf 2022

arXiv:2107.03649 [pdf]

Heavily Augmented Sound Event Detection utilizing Weak Predictions

Authors: Hyeonuk Nam, Byeong-Yun Ko, Gyeong-Tae Lee, Seong-Hu Kim, Won-Ho Jung, Sang-Min Choi, Yong-Hwa Park

Abstract: The performances of Sound Event Detection (SED) systems are greatly limited by the difficulty in generating large strongly labeled dataset. In this work, we used two main approaches to overcome the lack of strongly labeled data. First, we applied heavy data augmentation on input features. Data augmentation methods used include not only conventional methods used in speech/audio domains but also our… ▽ More The performances of Sound Event Detection (SED) systems are greatly limited by the difficulty in generating large strongly labeled dataset. In this work, we used two main approaches to overcome the lack of strongly labeled data. First, we applied heavy data augmentation on input features. Data augmentation methods used include not only conventional methods used in speech/audio domains but also our proposed method named FilterAugment. Second, we propose two methods to utilize weak predictions to enhance weakly supervised SED performance. As a result, we obtained the best PSDS1 of 0.4336 and best PSDS2 of 0.8161 on the DESED real validation dataset. This work is submitted to DCASE 2021 Task4 and is ranked on the 3rd place. Code availa-ble: https://github.com/frednam93/FilterAugSED. △ Less

Submitted 14 September, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

Comments: Won 3rd place on IEEE DCASE 2021 Task 4

arXiv:2107.01793 [pdf, other]

MIMO Operations in Molecular Communications: Theory, Prototypes, and Open Challenges

Authors: Bon-Hong Koo, Changmin Lee, Ali E. Pusane, Tuna Tugcu, Chan-Byoung Chae

Abstract: The Internet of Bio-nano Things is a significant development for next generation communication technologies. Because conventional wireless communication technologies face challenges in realizing new applications (e.g., in-body area networks for health monitoring) and necessitate the substitution of information carriers, researchers have shifted their interest to molecular communications (MC). Alth… ▽ More The Internet of Bio-nano Things is a significant development for next generation communication technologies. Because conventional wireless communication technologies face challenges in realizing new applications (e.g., in-body area networks for health monitoring) and necessitate the substitution of information carriers, researchers have shifted their interest to molecular communications (MC). Although remarkable progress has been made in this field over the last decade, advances have been far from acceptable for the achievement of its application objectives. A crucial problem of MC is the low data rate and high error rate inherent in particle dynamics specifications, in contrast to wave-based conventional communications. Therefore, it is important to investigate the resources by which MC can obtain additional information paths and provide strategies to exploit these resources. This study aims to examine techniques involving resource aggregation and exploitation to provide prospective directions for future progress in MC. In particular, we focus on state-of-the-art studies on multiple-input multiple-output (MIMO) systems. We discuss the possible advantages of applying MIMO to various MC system models. Furthermore, we survey various studies that aimed to achieve MIMO gains for the respective models, from theoretical background to prototypes. Finally, we conclude this study by summarizing the challenges that need to be addressed. △ Less

Submitted 5 July, 2021; originally announced July 2021.

Comments: 7 pages, 4 figures, accepted in Communications Magazine

arXiv:2107.00414 [pdf, other]

MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting

Authors: Anne Lauscher, Brandon Ko, Bailey Kuehl, Sophie Johnson, David Jurgens, Arman Cohan, Kyle Lo

Abstract: Citation context analysis (CCA) is an important task in natural language processing that studies how and why scholars discuss each others' work. Despite decades of study, traditional frameworks for CCA have largely relied on overly-simplistic assumptions of how authors cite, which ignore several important phenomena. For instance, scholarly papers often contain rich discussions of cited work that s… ▽ More Citation context analysis (CCA) is an important task in natural language processing that studies how and why scholars discuss each others' work. Despite decades of study, traditional frameworks for CCA have largely relied on overly-simplistic assumptions of how authors cite, which ignore several important phenomena. For instance, scholarly papers often contain rich discussions of cited work that span multiple sentences and express multiple intents concurrently. Yet, CCA is typically approached as a single-sentence, single-label classification task, and thus existing datasets fail to capture this interesting discourse. In our work, we address this research gap by proposing a novel framework for CCA as a document-level context extraction and labeling task. We release MultiCite, a new dataset of 12,653 citation contexts from over 1,200 computational linguistics papers. Not only is it the largest collection of expert-annotated citation contexts to-date, MultiCite contains multi-sentence, multi-label citation contexts within full paper texts. Finally, we demonstrate how our dataset, while still usable for training classic CCA models, also supports the development of new types of models for CCA beyond fixed-width text classification. We release our code and dataset at https://github.com/allenai/multicite. △ Less

Submitted 31 July, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

arXiv:2106.00186 [pdf, other]

Towards Light-weight and Real-time Line Segment Detection

Authors: Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, Minchul Shin

Abstract: Previous deep learning-based line segment detection (LSD) suffers from the immense model size and high computational cost for line prediction. This constrains them from real-time inference on computationally restricted environments. In this paper, we propose a real-time and light-weight line segment detector for resource-constrained environments named Mobile LSD (M-LSD). We design an extremely eff… ▽ More Previous deep learning-based line segment detection (LSD) suffers from the immense model size and high computational cost for line prediction. This constrains them from real-time inference on computationally restricted environments. In this paper, we propose a real-time and light-weight line segment detector for resource-constrained environments named Mobile LSD (M-LSD). We design an extremely efficient LSD architecture by minimizing the backbone network and removing the typical multi-module process for line prediction found in previous methods. To maintain competitive performance with a light-weight network, we present novel training schemes: Segments of Line segment (SoL) augmentation, matching and geometric loss. SoL augmentation splits a line segment into multiple subparts, which are used to provide auxiliary line data during the training process. Moreover, the matching and geometric loss allow a model to capture additional geometric cues. Compared with TP-LSD-Lite, previously the best real-time LSD method, our model (M-LSD-tiny) achieves competitive performance with 2.5% of model size and an increase of 130.5% in inference speed on GPU. Furthermore, our model runs at 56.8 FPS and 48.6 FPS on the latest Android and iPhone mobile devices, respectively. To the best of our knowledge, this is the first real-time deep LSD available on mobile devices. Our code is available. △ Less

Submitted 26 April, 2022; v1 submitted 31 May, 2021; originally announced June 2021.

Comments: Accepted by AAAI2022

arXiv:2104.03015 [pdf, other]

RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network

Authors: Minchul Shin, Yoonjae Cho, Byungsoo Ko, Geonmo Gu

Abstract: In this paper, we study the compositional learning of images and texts for image retrieval. The query is given in the form of an image and text that describes the desired modifications to the image; the goal is to retrieve the target image that satisfies the given modifications and resembles the query by composing information in both the text and image modalities. To remedy this, we propose a nove… ▽ More In this paper, we study the compositional learning of images and texts for image retrieval. The query is given in the form of an image and text that describes the desired modifications to the image; the goal is to retrieve the target image that satisfies the given modifications and resembles the query by composing information in both the text and image modalities. To remedy this, we propose a novel architecture designed for the image-text composition task and show that the proposed structure can effectively encode the differences between the source and target images conditioned on the text. Furthermore, we introduce a new joint training technique based on the graph convolutional network that is generally applicable for any existing composition methods in a plug-and-play manner. We found that the proposed technique consistently improves performance and achieves state-of-the-art scores on various benchmarks. To avoid misleading experimental results caused by trivial training hyper-parameters, we reproduce all individual baselines and train models with a unified training environment. We expect this approach to suppress undesirable effects from irrelevant components and emphasize the image-text composition module's ability. Also, we achieve the state-of-the-art score without restricting the training environment, which implies the superiority of our method considering the gains from hyper-parameter tuning. The code, including all the baseline methods, are released https://github.com/nashory/rtic-gcn-pytorch. △ Less

Submitted 25 October, 2021; v1 submitted 7 April, 2021; originally announced April 2021.

arXiv:2103.16940 [pdf, other]

Learning with Memory-based Virtual Classes for Deep Metric Learning

Authors: Byungsoo Ko, Geonmo Gu, Han-Gyu Kim

Abstract: The core of deep metric learning (DML) involves learning visual similarities in high-dimensional embedding space. One of the main challenges is to generalize from seen classes of training data to unseen classes of test data. Recent works have focused on exploiting past embeddings to increase the number of instances for the seen classes. Such methods achieve performance improvement via augmentation… ▽ More The core of deep metric learning (DML) involves learning visual similarities in high-dimensional embedding space. One of the main challenges is to generalize from seen classes of training data to unseen classes of test data. Recent works have focused on exploiting past embeddings to increase the number of instances for the seen classes. Such methods achieve performance improvement via augmentation, while the strong focus on seen classes still remains. This can be undesirable for DML, where training and test data exhibit entirely different classes. In this work, we present a novel training strategy for DML called MemVir. Unlike previous works, MemVir memorizes both embedding features and class weights to utilize them as additional virtual classes. The exploitation of virtual classes not only utilizes augmented information for training but also alleviates a strong focus on seen classes for better generalization. Moreover, we embed the idea of curriculum learning by slowly adding virtual classes for a gradual increase in learning difficulty, which improves the learning stability as well as the final performance. MemVir can be easily applied to many existing loss functions without any modification. Extensive experimental results on famous benchmarks demonstrate the superiority of MemVir over state-of-the-art competitors. Code of MemVir is publicly available. △ Less

Submitted 8 October, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

Comments: Accepted by ICCV2021

arXiv:2103.15454 [pdf, other]

Proxy Synthesis: Learning with Synthetic Classes for Deep Metric Learning

Authors: Geonmo Gu, Byungsoo Ko, Han-Gyu Kim

Abstract: One of the main purposes of deep metric learning is to construct an embedding space that has well-generalized embeddings on both seen (training) classes and unseen (test) classes. Most existing works have tried to achieve this using different types of metric objectives and hard sample mining strategies with given training data. However, learning with only the training data can be overfitted to the… ▽ More One of the main purposes of deep metric learning is to construct an embedding space that has well-generalized embeddings on both seen (training) classes and unseen (test) classes. Most existing works have tried to achieve this using different types of metric objectives and hard sample mining strategies with given training data. However, learning with only the training data can be overfitted to the seen classes, leading to the lack of generalization capability on unseen classes. To address this problem, we propose a simple regularizer called Proxy Synthesis that exploits synthetic classes for stronger generalization in deep metric learning. The proposed method generates synthetic embeddings and proxies that work as synthetic classes, and they mimic unseen classes when computing proxy-based losses. Proxy Synthesis derives an embedding space considering class relations and smooth decision boundaries for robustness on unseen classes. Our method is applicable to any proxy-based losses, including softmax and its variants. Extensive experiments on four famous benchmarks in image retrieval tasks demonstrate that Proxy Synthesis significantly boosts the performance of proxy-based losses and achieves state-of-the-art performance. △ Less

Submitted 29 March, 2021; originally announced March 2021.

Comments: Accepted by AAAI2021

arXiv:2102.08581 [pdf, other]

Efficient Scheduling of Data Augmentation for Deep Reinforcement Learning

Authors: Byungchan Ko, Jungseul Ok

Abstract: In deep reinforcement learning (RL), data augmentation is widely considered as a tool to induce a set of useful priors about semantic consistency and improve sample efficiency and generalization performance. However, even when the prior is useful for generalization, distilling it to RL agent often interferes with RL training and degenerates sample efficiency. Meanwhile, the agent is forgetful of t… ▽ More In deep reinforcement learning (RL), data augmentation is widely considered as a tool to induce a set of useful priors about semantic consistency and improve sample efficiency and generalization performance. However, even when the prior is useful for generalization, distilling it to RL agent often interferes with RL training and degenerates sample efficiency. Meanwhile, the agent is forgetful of the prior due to the non-stationary nature of RL. These observations suggest two extreme schedules of distillation: (i) over the entire training; or (ii) only at the end. Hence, we devise a stand-alone network distillation method to inject the consistency prior at any time (even after RL), and a simple yet efficient framework to automatically schedule the distillation. Specifically, the proposed framework first focuses on mastering train environments regardless of generalization by adaptively deciding which {\it or no} augmentation to be used for the training. After this, we add the distillation to extract the remaining benefits for generalization from all the augmentations, which requires no additional new samples. In our experiments, we demonstrate the utility of the proposed framework, in particular, that considers postponing the augmentation to the end of RL training. △ Less

Submitted 18 October, 2022; v1 submitted 17 February, 2021; originally announced February 2021.

Journal ref: Neurips 2022

arXiv:2011.12713 [pdf]

A Secure Deep Probabilistic Dynamic Thermal Line Rating Prediction

Authors: N. Safari, S. M. Mazhari, C. Y. Chung, S. B. Ko

Abstract: Accurate short-term prediction of overhead line (OHL) transmission ampacity can directly affect the efficiency of power system operation and planning. Any overestimation of the dynamic thermal line rating (DTLR) can lead to lifetime degradation and failure of OHLs, safety hazards, etc. This paper presents a secure yet sharp probabilistic prediction model for the hour-ahead forecasting of the DTLR.… ▽ More Accurate short-term prediction of overhead line (OHL) transmission ampacity can directly affect the efficiency of power system operation and planning. Any overestimation of the dynamic thermal line rating (DTLR) can lead to lifetime degradation and failure of OHLs, safety hazards, etc. This paper presents a secure yet sharp probabilistic prediction model for the hour-ahead forecasting of the DTLR. The security of the proposed DTLR limits the frequency of DTLR prediction exceeding the actual DTLR. The model is based on an augmented deep learning architecture that makes use of a wide range of predictors, including historical climatology data and latent variables obtained during DTLR calculation. Furthermore, by introducing a customized cost function, the deep neural network is trained to consider the DTLR security based on the required probability of exceedance while minimizing deviations of the predicted DTLRs from the actual values. The proposed probabilistic DTLR is developed and verified using recorded experimental data. The simulation results validate the superiority of the proposed DTLR compared to state-of-the-art prediction models using well-known evaluation metrics. △ Less

Submitted 21 November, 2020; originally announced November 2020.

Comments: The work is accepted for publication in Journal of Modern Power Systems and Clean Energy

arXiv:2011.04512 [pdf, other]

Auxiliary Sequence Labeling Tasks for Disfluency Detection

Authors: Dongyub Lee, Byeongil Ko, Myeong Cheol Shin, Taesun Whang, Daniel Lee, Eun Hwa Kim, EungGyun Kim, Jaechoon Jo

Abstract: Detecting disfluencies in spontaneous speech is an important preprocessing step in natural language processing and speech recognition applications. Existing works for disfluency detection have focused on designing a single objective only for disfluency detection, while auxiliary objectives utilizing linguistic information of a word such as named entity or part-of-speech information can be effectiv… ▽ More Detecting disfluencies in spontaneous speech is an important preprocessing step in natural language processing and speech recognition applications. Existing works for disfluency detection have focused on designing a single objective only for disfluency detection, while auxiliary objectives utilizing linguistic information of a word such as named entity or part-of-speech information can be effective. In this paper, we focus on detecting disfluencies on spoken transcripts and propose a method utilizing named entity recognition (NER) and part-of-speech (POS) as auxiliary sequence labeling (SL) tasks for disfluency detection. First, we investigate cases that utilizing linguistic information of a word can prevent mispredicting important words and can be helpful for the correct detection of disfluencies. Second, we show that training a disfluency detection model with auxiliary SL tasks can improve its F-score in disfluency detection. Then, we analyze which auxiliary SL tasks are influential depending on baseline models. Experimental results on the widely used English Switchboard dataset show that our method outperforms the previous state-of-the-art in disfluency detection. △ Less

Submitted 5 April, 2021; v1 submitted 23 October, 2020; originally announced November 2020.

Comments: Submitted to INTERSPEECH 2021

arXiv:2009.03871 [pdf, other]

doi 10.1007/978-3-030-60365-6_19

Intraoperative Liver Surface Completion with Graph Convolutional VAE

Authors: Simone Foti, Bongjin Koo, Thomas Dowrick, Joao Ramalhinho, Moustafa Allam, Brian Davidson, Danail Stoyanov, Matthew J. Clarkson

Abstract: In this work we propose a method based on geometric deep learning to predict the complete surface of the liver, given a partial point cloud of the organ obtained during the surgical laparoscopic procedure. We introduce a new data augmentation technique that randomly perturbs shapes in their frequency domain to compensate the limited size of our dataset. The core of our method is a variational auto… ▽ More In this work we propose a method based on geometric deep learning to predict the complete surface of the liver, given a partial point cloud of the organ obtained during the surgical laparoscopic procedure. We introduce a new data augmentation technique that randomly perturbs shapes in their frequency domain to compensate the limited size of our dataset. The core of our method is a variational autoencoder (VAE) that is trained to learn a latent space for complete shapes of the liver. At inference time, the generative part of the model is embedded in an optimisation procedure where the latent representation is iteratively updated to generate a model that matches the intraoperative partial point cloud. The effect of this optimisation is a progressive non-rigid deformation of the initially generated shape. Our method is qualitatively evaluated on real data and quantitatively evaluated on synthetic data. We compared with a state-of-the-art rigid registration algorithm, that our method outperformed in visible areas. △ Less

Submitted 12 July, 2021; v1 submitted 8 September, 2020; originally announced September 2020.

arXiv:2005.12739 [pdf, other]

An Effective Pipeline for a Real-world Clothes Retrieval System

Authors: Yang-Ho Ji, HeeJae Jun, Insik Kim, Jongtack Kim, Youngjoon Kim, Byungsoo Ko, Hyong-Keun Kook, Jingeun Lee, Sangwon Lee, Sanghyuk Park

Abstract: In this paper, we propose an effective pipeline for clothes retrieval system which has sturdiness on large-scale real-world fashion data. Our proposed method consists of three components: detection, retrieval, and post-processing. We firstly conduct a detection task for precise retrieval on target clothes, then retrieve the corresponding items with the metric learning-based model. To improve the r… ▽ More In this paper, we propose an effective pipeline for clothes retrieval system which has sturdiness on large-scale real-world fashion data. Our proposed method consists of three components: detection, retrieval, and post-processing. We firstly conduct a detection task for precise retrieval on target clothes, then retrieve the corresponding items with the metric learning-based model. To improve the retrieval robustness against noise and misleading bounding boxes, we apply post-processing methods such as weighted boxes fusion and feature concatenation. With the proposed methodology, we achieved 2nd place in the DeepFashion2 Clothes Retrieval 2020 challenge. △ Less

Submitted 26 May, 2020; originally announced May 2020.

Comments: 2nd place solution on DeepFashion2 clothes retrieval challenge in CVPR2020 workshop (CVFAD)

arXiv:2005.03510 [pdf, other]

Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization

Authors: Dongyub Lee, Myeongcheol Shin, Taesun Whang, Seungwoo Cho, Byeongil Ko, Daniel Lee, Eunggyun Kim, Jaechoon Jo

Abstract: Text summarization refers to the process that generates a shorter form of text from the source document preserving salient information. Many existing works for text summarization are generally evaluated by using recall-oriented understudy for gisting evaluation (ROUGE) scores. However, as ROUGE scores are computed based on n-gram overlap, they do not reflect semantic meaning correspondences betwee… ▽ More Text summarization refers to the process that generates a shorter form of text from the source document preserving salient information. Many existing works for text summarization are generally evaluated by using recall-oriented understudy for gisting evaluation (ROUGE) scores. However, as ROUGE scores are computed based on n-gram overlap, they do not reflect semantic meaning correspondences between generated and reference summaries. Because Korean is an agglutinative language that combines various morphemes into a word that express several meanings, ROUGE is not suitable for Korean summarization. In this paper, we propose evaluation metrics that reflect semantic meanings of a reference summary and the original document, Reference and Document Aware Semantic Score (RDASS). We then propose a method for improving the correlation of the metrics with human judgment. Evaluation results show that the correlation with human judgment is significantly higher for our evaluation metrics than for ROUGE scores. △ Less

Submitted 1 November, 2020; v1 submitted 29 April, 2020; originally announced May 2020.

Comments: COLING 2020

arXiv:2003.02546 [pdf, other]

Embedding Expansion: Augmentation in Embedding Space for Deep Metric Learning

Authors: Byungsoo Ko, Geonmo Gu

Abstract: Learning the distance metric between pairs of samples has been studied for image retrieval and clustering. With the remarkable success of pair-based metric learning losses, recent works have proposed the use of generated synthetic points on metric learning losses for augmentation and generalization. However, these methods require additional generative networks along with the main network, which ca… ▽ More Learning the distance metric between pairs of samples has been studied for image retrieval and clustering. With the remarkable success of pair-based metric learning losses, recent works have proposed the use of generated synthetic points on metric learning losses for augmentation and generalization. However, these methods require additional generative networks along with the main network, which can lead to a larger model size, slower training speed, and harder optimization. Meanwhile, post-processing techniques, such as query expansion and database augmentation, have proposed the combination of feature points to obtain additional semantic information. In this paper, inspired by query expansion and database augmentation, we propose an augmentation method in an embedding space for pair-based metric learning losses, called embedding expansion. The proposed method generates synthetic points containing augmented information by a combination of feature points and performs hard negative pair mining to learn with the most informative feature representations. Because of its simplicity and flexibility, it can be used for existing metric learning losses without affecting model size, training speed, or optimization difficulty. Finally, the combination of embedding expansion and representative metric learning losses outperforms the state-of-the-art losses and previous sample generation methods in both image retrieval and clustering tasks. The implementation is publicly available. △ Less

Submitted 23 April, 2020; v1 submitted 5 March, 2020; originally announced March 2020.

Comments: Accepted by CVPR 2020

arXiv:2002.06328 [pdf]

Many-to-Many Voice Conversion using Conditional Cycle-Consistent Adversarial Networks

Authors: Shindong Lee, BongGu Ko, Keonnyeong Lee, In-Chul Yoo, Dongsuk Yook

Abstract: Voice conversion (VC) refers to transforming the speaker characteristics of an utterance without altering its linguistic contents. Many works on voice conversion require to have parallel training data that is highly expensive to acquire. Recently, the cycle-consistent adversarial network (CycleGAN), which does not require parallel training data, has been applied to voice conversion, showing the st… ▽ More Voice conversion (VC) refers to transforming the speaker characteristics of an utterance without altering its linguistic contents. Many works on voice conversion require to have parallel training data that is highly expensive to acquire. Recently, the cycle-consistent adversarial network (CycleGAN), which does not require parallel training data, has been applied to voice conversion, showing the state-of-the-art performance. The CycleGAN based voice conversion, however, can be used only for a pair of speakers, i.e., one-to-one voice conversion between two speakers. In this paper, we extend the CycleGAN by conditioning the network on speakers. As a result, the proposed method can perform many-to-many voice conversion among multiple speakers using a single generative adversarial network (GAN). Compared to building multiple CycleGANs for each pair of speakers, the proposed method reduces the computational and spatial cost significantly without compromising the sound quality of the converted voice. Experimental results using the VCC2018 corpus confirm the efficiency of the proposed method. △ Less

Submitted 15 February, 2020; originally announced February 2020.

arXiv:2001.11658 [pdf, other]

Symmetrical Synthesis for Deep Metric Learning

Authors: Geonmo Gu, Byungsoo Ko

Abstract: Deep metric learning aims to learn embeddings that contain semantic similarity information among data points. To learn better embeddings, methods to generate synthetic hard samples have been proposed. Existing methods of synthetic hard sample generation are adopting autoencoders or generative adversarial networks, but this leads to more hyper-parameters, harder optimization, and slower training sp… ▽ More Deep metric learning aims to learn embeddings that contain semantic similarity information among data points. To learn better embeddings, methods to generate synthetic hard samples have been proposed. Existing methods of synthetic hard sample generation are adopting autoencoders or generative adversarial networks, but this leads to more hyper-parameters, harder optimization, and slower training speed. In this paper, we address these problems by proposing a novel method of synthetic hard sample generation called symmetrical synthesis. Given two original feature points from the same class, the proposed method firstly generates synthetic points with each other as an axis of symmetry. Secondly, it performs hard negative pair mining within the original and synthetic points to select a more informative negative pair for computing the metric learning loss. Our proposed method is hyper-parameter free and plug-and-play for existing metric learning losses without network modification. We demonstrate the superiority of our proposed method over existing methods for a variety of loss functions on clustering and image retrieval tasks. Our implementations is publicly available. △ Less

Submitted 23 April, 2020; v1 submitted 30 January, 2020; originally announced January 2020.

Comments: Accepted by AAAI 2020

arXiv:2001.08300 [pdf, other]

Overcoming Noisy and Irrelevant Data in Federated Learning

Authors: Tiffany Tuor, Shiqiang Wang, Bong Jun Ko, Changchang Liu, Kin K. Leung

Abstract: Many image and vision applications require a large amount of data for model training. Collecting all such data at a central location can be challenging due to data privacy and communication bandwidth restrictions. Federated learning is an effective way of training a machine learning model in a distributed manner from local data collected by client devices, which does not require exchanging the raw… ▽ More Many image and vision applications require a large amount of data for model training. Collecting all such data at a central location can be challenging due to data privacy and communication bandwidth restrictions. Federated learning is an effective way of training a machine learning model in a distributed manner from local data collected by client devices, which does not require exchanging the raw data among clients. A challenge is that among the large variety of data collected at each client, it is likely that only a subset is relevant for a learning task while the rest of data has a negative impact on model training. Therefore, before starting the learning process, it is important to select the subset of data that is relevant to the given federated learning task. In this paper, we propose a method for distributedly selecting relevant data, where we use a benchmark model trained on a small benchmark dataset that is task-specific, to evaluate the relevance of individual data samples at each client and select the data with sufficiently high relevance. Then, each client only uses the selected subset of its data in the federated learning process. The effectiveness of our proposed approach is evaluated on multiple real-world image datasets in a simulated system with a large number of clients, showing up to $25\%$ improvement in model accuracy compared to training with all data. △ Less

Submitted 22 June, 2020; v1 submitted 22 January, 2020; originally announced January 2020.

Comments: Accepted version in the 25th International Conference on Pattern Recognition (ICPR)

arXiv:2001.04721

Interpretation and Simplification of Deep Forest

Authors: Sangwon Kim, Mira Jeong, Byoung Chul Ko

Abstract: This paper proposes a new method for interpreting and simplifying a black box model of a deep random forest (RF) using a proposed rule elimination. In deep RF, a large number of decision trees are connected to multiple layers, thereby making an analysis difficult. It has a high performance similar to that of a deep neural network (DNN), but achieves a better generalizability. Therefore, in this st… ▽ More This paper proposes a new method for interpreting and simplifying a black box model of a deep random forest (RF) using a proposed rule elimination. In deep RF, a large number of decision trees are connected to multiple layers, thereby making an analysis difficult. It has a high performance similar to that of a deep neural network (DNN), but achieves a better generalizability. Therefore, in this study, we consider quantifying the feature contributions and frequency of the fully trained deep RF in the form of a decision rule set. The feature contributions provide a basis for determining how features affect the decision process in a rule set. Model simplification is achieved by eliminating unnecessary rules by measuring the feature contributions. Consequently, the simplified model has fewer parameters and rules than before. Experiment results have shown that a feature contribution analysis allows a black box model to be decomposed for quantitatively interpreting a rule set. The proposed method was successfully applied to various deep RF models and benchmark datasets while maintaining a robust performance despite the elimination of a large number of rules. △ Less

Submitted 11 December, 2020; v1 submitted 14 January, 2020; originally announced January 2020.

Comments: Major issues on the experiments

arXiv:1909.12326 [pdf, other]

Model Pruning Enables Efficient Federated Learning on Edge Devices

Authors: Yuang Jiang, Shiqiang Wang, Victor Valls, Bong Jun Ko, Wei-Han Lee, Kin K. Leung, Leandros Tassiulas

Abstract: Federated learning (FL) allows model training from local data collected by edge/mobile devices while preserving data privacy, which has wide applicability to image and vision applications. A challenge is that client devices in FL usually have much more limited computation and communication resources compared to servers in a datacenter. To overcome this challenge, we propose PruneFL -- a novel FL a… ▽ More Federated learning (FL) allows model training from local data collected by edge/mobile devices while preserving data privacy, which has wide applicability to image and vision applications. A challenge is that client devices in FL usually have much more limited computation and communication resources compared to servers in a datacenter. To overcome this challenge, we propose PruneFL -- a novel FL approach with adaptive and distributed parameter pruning, which adapts the model size during FL to reduce both communication and computation overhead and minimize the overall training time, while maintaining a similar accuracy as the original model. PruneFL includes initial pruning at a selected client and further pruning as part of the FL process. The model size is adapted during this process, which includes maximizing the approximate empirical risk reduction divided by the time of one FL round. Our experiments with various datasets on edge devices (e.g., Raspberry Pi) show that: (i) we significantly reduce the training time compared to conventional FL and various other pruning-based methods; (ii) the pruned model with automatically determined size converges to an accuracy that is very similar to the original model, and it is also a lottery ticket of the original model. △ Less

Submitted 6 April, 2022; v1 submitted 26 September, 2019; originally announced September 2019.

Comments: Accepted for publication in IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

Showing 1–50 of 67 results for author: Ko, B