Computer-Aided Fall Recognition Using a Three-Stream Spatial-Temporal GCN Model with Adaptive Feature Aggregation

Authors: Jungpil Shin, Abu Saleh Musa Miah, Rei Egawa, Koki Hirooka, Md. Al Mehedi Hasan, Yoichi Tomioka, Yong Seok Hwang

Abstract: The prevention of falls is paramount in modern healthcare, particularly for the elderly, as falls can lead to severe injuries or even fatalities. Additionally, the growing incidence of falls among the elderly, coupled with the urgent need to prevent suicide attempts resulting from medication overdose, underscores the critical importance of accurate and efficient fall detection methods. In this scenario, a computer-aided fall detection system is inevitable to save elderly people's lives worldwide. Many researchers have been working to develop fall detection systems. However, the existing fall detection systems often struggle with issues such as unsatisfactory performance accuracy, limited robustness, high computational complexity, and sensitivity to environmental factors due to a lack of effective features. In response to these challenges, this paper proposes a novel three-stream spatial-temporal feature-based fall detection system. Our system incorporates joint skeleton-based spatial and temporal Graph Convolutional Network (GCN) features, joint motion-based spatial and temporal GCN features, and residual connections-based features. Each stream employs adaptive graph-based feature aggregation and consecutive separable convolutional neural networks (Sep-TCN), significantly reducing computational complexity and model parameters compared to prior systems. Experimental results across multiple datasets demonstrate the superior effectiveness and efficiency of our proposed system, with accuracies of 99.51%, 99.15%, 99.79%, and 99.85% achieved on the ImViA, UR-Fall, Fall-UP and FU-Kinect datasets, respectively. The remarkable performance of our system highlights its superiority, efficiency, and generalizability in real-world fall detection scenarios, offering significant advancements in healthcare and societal well-being.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.11217 [pdf, other]

Beyond skyrmion spin texture from quantum Kelvin-Helmholtz instability

Authors: SeungJung Huh, Wooyoung Yun, Gabin Yun, Samgyu Hwang, Kiryang Kwon, Junhyeok Hur, Seungho Lee, Hiromitsu Takeuchi, Se Kwon Kim, Jae-yoon Choi

Abstract: Topology profoundly influences diverse fields of science, providing a powerful framework for classifying phases of matter and predicting nontrivial excitations, such as solitons, vortices, and skyrmions. These topological defects are typically characterized by integer numbers, called topological charges, representing the winding number in their order parameter field. The classification and prediction of topological defects, however, become challenging when singularities are included within the integration domain for calculating the topological charge. While such exotic nonlinear excitations have been proposed in the superfluid $^3$He-A phase and spinor Bose-Einstein condensate of atomic gases, experimental observation of these structures and studies of their stability have long been elusive. Here we report the observation of a singular skyrmion that goes beyond the framework of topology in a ferromagnetic superfluid. The exotic skyrmions are sustained by undergoing anomalous symmetry breaking associated with the eccentric spin singularity and carry half of the elementary charge, distinctive from conventional skyrmions or merons. By successfully realizing the universal regime of the quantum Kelvin-Helmholtz instability, we identified the eccentric fractional skyrmions, produced by emission from a magnetic domain wall and a spontaneous splitting of an integer skyrmion with spin singularities. The singular skyrmions are stable and can be observed after 2 s of hold time. Our results confirm the universality between classical and quantum Kelvin-Helmholtz instabilities and broaden our understanding of the complex nonlinear dynamics of nontrivial textures beyond skyrmions in topological quantum systems.

Submitted 20 August, 2024; originally announced August 2024.

Comments: 13 pages, 5 main figures and 7 supplemental figures

arXiv:2408.07557 [pdf]

Band-selective simulation of photoelectron intensity and converging Berry phase in trilayer graphene

Authors: Hayoon Im, Sue Hyeon Hwang, Minhee Kang, Kyoo Kim, Haeyong Kang, Choongyu Hwang

Abstract: Berry phase is one of the key elements to understand quantum-mechanical phenomena such as the Aharonov-Bohm effect and the unconventional Hall effect in graphene. The Berry phase in monolayer and bilayer graphene has been manifested by the anisotropic distribution of photoelectron intensity along a closed loop in the momentum space as well as its rotation by a characteristic angle upon rotating light polarization. Here we report the band-selective simulation of photoelectron intensity of trilayer graphene to understand its Berry phase within the tight-binding formalism. ABC- and ABA-stacked trilayer graphene show characteristic rotational angles of photoelectron intensity distribution, as predicted from their well-known Berry phases. Surprisingly, however, in ABA-stacked trilayer graphene, the rotational angle changes upon approaching the band touching point between the conduction and valence bands, which suggests that the Berry phase changes as a function of binding energy. The binding energy-dependent Berry phase is attributed to the enhanced hybridization of the two electron bands of ABA-stacked trilayer graphene that converge at the band touching point, resulting in the converging Berry phase. These findings will provide an efficient way of tuning Berry phase and hence exotic phenomena stemming from the Berry phase.

Submitted 14 August, 2024; originally announced August 2024.

Journal ref: Appl. Sci. Converg. Technol. 33, 91 (2024)

arXiv:2408.06010 [pdf, other]

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

Authors: Jisoo Kim, Jungbin Cho, Joonho Park, Soonmin Hwang, Da Eun Kim, Geon Kim, Youngjae Yu

Abstract: Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (Dynamic Emotion Embedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (Temporally Hierarchical VQ-VAE) as an expressive and robust motion prior overcoming limitations of VAEs and VQ-VAEs. Utilizing these strong priors, we develop DEEPTalk, a talking head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync. Source code will be made publicly available soon.

Submitted 12 August, 2024; originally announced August 2024.

Comments: First two authors contributed equally

arXiv:2408.05930 [pdf, ps, other]

Evolution of the Fermi surface of 1T-VSe$_2$ across a structural phase transition

Authors: Turgut Yilmaz, Xiao Tong, Jerzy T. Sadowski, Sooyeon Hwang, Kenneth Evans-Lutterodt, Kim Kisslinger, Elio Vescovo

Abstract: The electronic origin of the structural transition in 1T-VSe$_2$ is re-evaluated through an extensive angle-resolved photoemission spectroscopy experiment. The components of the band structure, missing in previous reports, are revealed. Earlier observations, shown to be temperature independent and therefore not correlated with the phase transition, are explained in terms of the increased complexity of the band structure close to the Fermi level. Only the overall size of the Fermi surface is found to be positively correlated with the phase transition at 110 K. These observations, quite distant from the charge density wave scenario commonly considered for 1T-VSe$_2$, bring fresh perspectives toward the correct description of structural transitions in dichalcogenide materials.

Submitted 12 August, 2024; originally announced August 2024.

Comments: 7 pages, 4 figures

arXiv:2408.05917 [pdf]

Inverse design of Non-parameterized Ventilated Acoustic Resonator via Variational Autoencoder with Acoustic Response-encoded Latent Space

Authors: Min Woo Cho, Seok Hyeon Hwang, Jun-Young Jang, Jin Yeong Song, Sun-kwang Hwang, Kyoung Je Cha, Dong Yong Park, Kyungjun Song, Sang Min Park

Abstract: Ventilated acoustic resonators (VARs), a type of acoustic metamaterial, have emerged as an alternative for sound attenuation in environments that require ventilation, owing to their excellent low-frequency attenuation performance and flexible shape adaptability. However, due to the non-linear acoustic responses of VARs, VAR designs are generally obtained within a limited parameterized design space, and the design relies on iterating numerical simulations, which consumes a considerable amount of computational time and resources. This paper proposes an acoustic response-encoded variational autoencoder (AR-VAE), a novel variational autoencoder-based generative design model for the efficient and accurate inverse design of VARs even with non-parameterized designs. The AR-VAE matches the high-dimensional acoustic response with the VAR cross-section image in the dimension-reduced latent space, which enables the AR-VAE to generate various non-parameterized VAR cross-section images with the target acoustic response. AR-VAE generates non-parameterized VARs from target acoustic responses, which show a 25-fold reduction in mean squared error compared to conventional deep learning-based parameter searching methods while exhibiting lower average mean squared error and peak frequency variance. By combining the VARs inverse-designed by AR-VAE, a multi-cavity VAR was devised for broadband and multitarget peak frequency attenuation. The proposed design method presents a new approach for structural inverse design with a high-dimensional non-linear physical response.

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2408.04261 [pdf, other]

Unveiling Hidden Visual Information: A Reconstruction Attack Against Adversarial Visual Information Hiding

Authors: Jonggyu Jang, Hyeonsu Lyu, Seongjin Hwang, Hyun Jong Yang

Abstract: This paper investigates the security vulnerabilities of adversarial-example-based image encryption by executing data reconstruction (DR) attacks on encrypted images. A representative image encryption method is the adversarial visual information hiding (AVIH), which uses type-I adversarial example training to protect gallery datasets used in image recognition tasks. In the AVIH method, the type-I adversarial example approach creates images that appear completely different but are still recognized by machines as the original ones. Additionally, the AVIH method can restore encrypted images to their original forms using a predefined private key generative model. For the best security, assigning a unique key to each image is recommended; however, storage limitations may necessitate some images sharing the same key model. This raises a crucial security question for AVIH: How many images can safely share the same key model without being compromised by a DR attack? To address this question, we introduce a dual-strategy DR attack against the AVIH encryption method by incorporating (1) generative-adversarial loss and (2) augmented identity loss, which prevent DR from overfitting -- an issue akin to that in machine learning. Our numerical results validate this approach through image recognition and re-identification benchmarks, demonstrating that our strategy can significantly enhance the quality of reconstructed images, thereby requiring fewer key-sharing encrypted images. Our source code to reproduce our results will be available soon.

Submitted 8 August, 2024; originally announced August 2024.

Comments: 12 pages

arXiv:2408.00994 [pdf, other]

ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models

Authors: Hojae Han, Jaejin Kim, Jaeseok Yoo, Youngwon Lee, Seung-won Hwang

Abstract: This paper aims to extend the code generation capability of large language models (LLMs) to automatically manage comprehensive software requirements from given textual descriptions. Such requirements include both functional (i.e., achieving expected behavior for inputs) and non-functional (e.g., time/space performance, robustness, maintainability) requirements. However, textual descriptions can either express requirements verbosely or may even omit some of them. We introduce ARCHCODE, a novel framework that leverages in-context learning to organize requirements observed in descriptions and to extrapolate unexpressed requirements from them. ARCHCODE generates requirements from given descriptions, conditioning them to produce code snippets and test cases. Each test case is tailored to one of the requirements, allowing for the ranking of code snippets based on the compliance of their execution results with the requirements. Public benchmarks show that ARCHCODE better satisfies functional requirements, significantly improving Pass@k scores. Furthermore, we introduce HumanEval-NFR, the first evaluation of LLMs' non-functional requirements in code generation, demonstrating ARCHCODE's superiority over baseline methods. The implementation of ARCHCODE and the HumanEval-NFR benchmark are both publicly accessible.

Submitted 1 August, 2024; originally announced August 2024.

Comments: Accepted by ACL 2024 main conference
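
The ArchCode abstract above describes ranking code snippets by how well their execution results comply with requirement-specific test cases. A minimal sketch of that idea follows; it is not the authors' implementation. The candidate functions, the test-case tuple layout, and the names `compliance` and `rank_candidates` are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical test-case layout: (requirement label, input, expected output).
TestCase = Tuple[str, int, int]


def compliance(candidate: Callable[[int], int], tests: List[TestCase]) -> int:
    """Count how many requirement-specific tests a candidate satisfies."""
    passed = 0
    for _requirement, arg, expected in tests:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            # A crash counts as failing that requirement.
            pass
    return passed


def rank_candidates(candidates: Dict[str, Callable[[int], int]],
                    tests: List[TestCase]) -> List[str]:
    """Return candidate names sorted by compliance score, best first."""
    scores = {name: compliance(fn, tests) for name, fn in candidates.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Toy candidates for the task "return the absolute value of an integer".
def always_positive(x: int) -> int:
    return x if x >= 0 else -x


def forgets_negatives(x: int) -> int:
    return x  # wrong for negative inputs


def divides(x: int) -> int:
    return 100 // x  # wrong everywhere, and raises ZeroDivisionError at 0


candidates = {
    "always_positive": always_positive,
    "forgets_negatives": forgets_negatives,
    "divides": divides,
}
tests = [
    ("functional: positive input", 3, 3),
    ("functional: negative input", -4, 4),
    ("robustness: zero input", 0, 0),
]

ranking = rank_candidates(candidates, tests)
print(ranking)
```

Here each test is tied to one requirement, so the score doubles as a count of satisfied requirements. A real system would presumably weight or group requirements rather than count them uniformly.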

arXiv:2407.20806 [pdf, other]

ARCLE: The Abstraction and Reasoning Corpus Learning Environment for Reinforcement Learning

Authors: Hosung Lee, Sejin Kim, Seungpil Lee, Sanha Hwang, Jihwan Lee, Byung-Jun Lee, Sundong Kim

Abstract: This paper introduces ARCLE, an environment designed to facilitate reinforcement learning research on the Abstraction and Reasoning Corpus (ARC). Addressing this inductive reasoning benchmark with reinforcement learning presents these challenges: a vast action space, a hard-to-reach goal, and a variety of tasks. We demonstrate that an agent with proximal policy optimization can learn individual tasks through ARCLE. The adoption of non-factorial policies and auxiliary losses led to performance enhancements, effectively mitigating issues associated with action spaces and goal attainment. Based on these insights, we propose several research directions and motivations for using ARCLE, including MAML, GFlowNets, and World Models.

Submitted 30 July, 2024; originally announced July 2024.

Comments: Accepted by CoLLAs 2024, Project page: https://github.com/confeitoHS/arcle

arXiv:2407.18602 [pdf, other]

Testing Lyman Alpha Emitters and Lyman-Break Galaxies as Tracers of Large-Scale Structures at High Redshifts

Authors: Sang Hyeok Im, Ho Seong Hwang, Jaehong Park, Jaehyun Lee, Hyunmi Song, Stephen Appleby, Yohan Dubois, C. Gareth Few, Brad K. Gibson, Juhan Kim, Yonghwi Kim, Changbom Park, Christophe Pichon, Jihye Shin, Owain N. Snaith, Maria Celeste Artale, Eric Gawiser, Lucia Guaita, Woong-Seob Jeong, Kyoung-Soo Lee, Nelson Padilla, Vandana Ramakrishnan, Paulina Troncoso, Yujin Yang

Abstract: We test whether Lyman alpha emitters (LAEs) and Lyman-break galaxies (LBGs) can be good tracers of high-z large-scale structures, using the Horizon Run 5 cosmological hydrodynamical simulation. We identify LAEs using the Lyα emission line luminosity and its equivalent width, and LBGs using the broad-band magnitudes at z~2.4, 3.1, and 4.5. We first compare the spatial distributions of LAEs, LBGs, all galaxies, and dark matter around the filamentary structures defined by dark matter. The comparison shows that both LAEs and LBGs are more concentrated toward the dark matter filaments than dark matter. We also find an empirical fitting formula for the vertical density profile of filaments as a binomial power-law relation of the distance to the filaments. We then compare the spatial distributions of the samples around the filaments defined by themselves. LAEs and LBGs are again more concentrated toward their filaments than dark matter. We also find overall consistency between filamentary structures defined by LAEs, LBGs, and dark matter, with median spatial offsets that are smaller than the mean separation of the sample. These results support the idea that LAEs and LBGs could be good tracers of large-scale structures of dark matter at high redshifts.

Submitted 26 July, 2024; originally announced July 2024.

Comments: 20 pages, 15 figures, 2 tables, accepted for publication in ApJ

arXiv:2407.17843 [pdf, other]

DragText: Rethinking Text Embedding in Point-based Image Editing

Authors: Gayoon Choi, Taejin Jeong, Sujung Hong, Jaehoon Joo, Seong Jae Hwang

Abstract: Point-based image editing enables accurate and flexible control through content dragging. However, the role of text embedding in the editing process has not been thoroughly investigated. A significant aspect that remains unexplored is the interaction between text and image embeddings. In this study, we show that during the progressive editing of an input image in a diffusion model, the text embedding remains constant. As the image embedding increasingly diverges from its initial state, the discrepancy between the image and text embeddings presents a significant challenge. Moreover, we found that the text prompt significantly influences the dragging process, particularly in maintaining content integrity and achieving the desired manipulation. To utilize these insights, we propose DragText, which optimizes text embedding in conjunction with the dragging process to pair with the modified image embedding. Simultaneously, we regularize the text optimization process to preserve the integrity of the original text prompt. Our approach can be seamlessly integrated with existing diffusion-based drag methods with only a few lines of code.

Submitted 25 July, 2024; originally announced July 2024.

Comments: 22 pages, 18 figures

arXiv:2407.13864 [pdf, other]

Chandra Survey in the AKARI North Ecliptic Pole Deep Field Optical/Infrared Identifications of X-ray Sources

Authors: T. Miyaji, B. A. Bravo-Navarro, J. Díaz Tello, M. Krumpe, M. Herrera-Endoqui, H. Ikeda, T. Takagi, N. Oi, A. Shogaki, S. Matsuura, H. Kim, M. A. Malkan, H. S. Hwang, T. Kim, T. Ishigaki, H. Hanami, S. J. Kim, Y. Ohyama, T. Goto, H. Matsuhara

Abstract: We present a catalog of optical and infrared identifications (ID) of X-ray sources in the AKARI North Ecliptic Pole (NEP) Deep field detected with Chandra covering $\sim 0.34\,{\rm deg^{2}}$ with 0.5-2 keV flux limits ranging $\sim 2 \mathrm{-} 20\times 10^{-16}\,{\rm erg\,s^{-1}\,cm^{-2}}$. The optical/near-infrared counterparts of the X-ray sources are taken from our Hyper Suprime Cam (HSC)/Subaru and Wide-Field InfraRed Camera (WIRCam)/Canada-France-Hawaii Telescope (CFHT) data because these have much more accurate source positions due to their spatial resolution than that of Chandra and longer wavelength infrared data. We concentrate our identifications in the HSC $g$ band and WIRCam $K_{\rm s}$ band-based catalogs. To select the best counterpart, we utilize a novel extension of the likelihood-ratio (LR) analysis, where we use the X-ray flux as well as $g - K_{\rm s}$ colors to calculate the likelihood ratio. Spectroscopic and photometric redshifts of the counterparts are summarized. Also, simple X-ray spectroscopy is made on the sources with sufficient source counts. We present the resulting catalog in an electronic form. The main ID catalog contains 403 X-ray sources and includes X-ray fluxes, luminosities, $g$ and $K_{\rm s}$ band magnitudes, redshifts, and their sources, optical spectroscopic properties, as well as intrinsic absorption column densities and power-law indices from simple X-ray spectroscopy. The identified X-ray sources include 27 Milky-Way objects, 57 type I AGNs, 131 other AGNs, and 15 galaxies. The catalog serves as a basis for further investigations of the properties of the X-ray and near-infrared sources in this field. (Abridged)

Submitted 22 July, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

Comments: 16 pages, 9 figures, Three electronic (fits) tables are included in src. Accepted to Astronomy and Astrophysics

arXiv:2407.12325 [pdf, other]

Optimizing Query Generation for Enhanced Document Retrieval in RAG

Authors: Hamin Koo, Minseon Kim, Sung Ju Hwang

Abstract: Large Language Models (LLMs) excel in various language tasks but they often generate incorrect information, a phenomenon known as "hallucinations". Retrieval-Augmented Generation (RAG) aims to mitigate this by using document retrieval for accurate responses. However, RAG still faces hallucinations due to vague queries. This study aims to improve RAG by optimizing query generation with a query-document alignment score, refining queries using LLMs for better precision and efficiency of document retrieval. Experiments have shown that our approach improves document retrieval, resulting in an average accuracy gain of 1.6%.

Submitted 17 July, 2024; originally announced July 2024.
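
The abstract above mentions choosing better queries via a query-document alignment score but does not define that score. As a rough illustration only, the sketch below picks the candidate query that best aligns with a target document, using a Jaccard word-overlap score as a stand-in. That score, the function names, and the example strings are all assumptions for illustration, not the paper's method.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def alignment(query: str, document: str) -> float:
    """Toy alignment score: Jaccard overlap of word sets (an assumption)."""
    q, d = tokens(query), tokens(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def best_query(candidates: List[str], document: str) -> str:
    """Pick the candidate query with the highest alignment score."""
    return max(candidates, key=lambda q: alignment(q, document))


document = "the eiffel tower is 330 metres tall"
vague = "how tall is it"
specific = "how tall is the eiffel tower"

chosen = best_query([vague, specific], document)
print(chosen, alignment(chosen, document))
```

In a full pipeline the candidates would come from an LLM rephrasing step, and the score would more plausibly be computed from retrieval or embedding similarity; the selection loop stays the same.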

arXiv:2407.11348 [pdf, other]

Flatfish Disease Detection Based on Part Segmentation Approach and Disease Image Generation

Authors: Seo-Bin Hwang, Han-Young Kim, Chae-Yeon Heo, Hie-Yong Jung, Sung-Ju Jung, Yeong-Jun Cho

Abstract: The flatfish is a major farmed species consumed globally in large quantities. However, due to the densely populated farming environment, flatfish are susceptible to injuries and diseases, making early disease detection crucial. Traditionally, diseases were detected through visual inspection, but observing large numbers of fish is challenging. Automated approaches based on deep learning technologies have been widely used to address this problem, but accurate detection remains difficult due to the diversity of the fish and the lack of fish disease datasets. This study augments fish disease images using generative adversarial networks and image harmonization methods. Next, disease detectors are trained separately for three body parts (head, fins, and body) to address individual diseases properly. In addition, a flatfish disease image dataset called FlatIMG is created, and the proposed methods are verified on it. A flash salmon disease dataset is also tested to validate the generalizability of the proposed methods. The results achieved 12% higher performance than the baseline framework. This study is the first attempt to create a large-scale flatfish disease image dataset and propose an effective disease detection framework. Automatic disease monitoring could be achieved in farming environments based on the proposed methods and dataset.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 16 pages, 13 figures, 4 tables

arXiv:2407.10164 [pdf, other]

LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Authors: Sanmin Kim, Youngseok Kim, Sihwan Hwang, Hyeonjun Jeong, Dongsuk Kum

Abstract: Recent advancements in camera-based 3D object detection have introduced cross-modal knowledge distillation to bridge the performance gap with LiDAR 3D detectors, leveraging the precise geometric information in LiDAR point clouds. However, existing cross-modal knowledge distillation methods tend to overlook the inherent imperfections of LiDAR, such as the ambiguity of measurements on distant or occluded objects, which should not be transferred to the image detector. To mitigate these imperfections in LiDAR teacher, we propose a novel method that leverages aleatoric uncertainty-free features from ground truth labels. In contrast to conventional label guidance approaches, we approximate the inverse function of the teacher's head to effectively embed label inputs into feature space. This approach provides additional accurate guidance alongside LiDAR teacher, thereby boosting the performance of the image detector. Additionally, we introduce feature partitioning, which effectively transfers knowledge from the teacher modality while preserving the distinctive features of the student, thereby maximizing the potential of both modalities. Experimental results demonstrate that our approach improves mAP and NDS by 5.1 points and 4.9 points compared to the baseline model, proving the effectiveness of our approach. The code is available at https://github.com/sanmin0312/LabelDistill

Submitted 14 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.09941 [pdf, other]

Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers

Authors: Sukjun Hwang, Aakash Lahoti, Tri Dao, Albert Gu

Abstract: A wide array of sequence models are built on a framework modeled after Transformers, comprising alternating sequence mixer and channel mixer layers. This paper studies a unifying matrix mixer view of sequence mixers that can be conceptualized as a linear map on the input sequence. This framework encompasses a broad range of well-known sequence models, including the self-attention of Transformers as well as recent strong alternatives such as structured state space models (SSMs), and allows understanding downstream characteristics such as efficiency and expressivity through properties of their structured matrix class. We identify a key axis of matrix parameterizations termed sequence alignment, which increases the flexibility and performance of matrix mixers, providing insights into the strong performance of Transformers and recent SSMs such as Mamba. Furthermore, the matrix mixer framework offers a systematic approach to developing sequence mixers with desired properties, allowing us to develop several new sub-quadratic sequence models. In particular, we propose a natural bidirectional extension of the Mamba model (Hydra), parameterized as a quasiseparable matrix mixer, which demonstrates superior performance over other sequence models including Transformers on non-causal tasks. As a drop-in replacement for attention layers, Hydra outperforms BERT by 0.8 points on the GLUE benchmark and ViT by 2% Top-1 accuracy on ImageNet.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.07517 [pdf, other]

Parameter Efficient Fine Tuning for Multi-scanner PET to PET Reconstruction

Authors: Yumin Kim, Gayoon Choi, Seong Jae Hwang

Abstract: Reducing scan time in Positron Emission Tomography (PET) imaging while maintaining high-quality images is crucial for minimizing patient discomfort and radiation exposure. Due to the limited size of datasets and distribution discrepancy across scanners in medical imaging, fine-tuning in a parameter-efficient and effective manner is on the rise. Motivated by the potential of Parameter-Efficient Fine-Tuning (PEFT), we aim to address these issues by effectively leveraging PEFT to improve limited data and GPU resource issues in multi-scanner setups. In this paper, we introduce PETITE, Parameter-Efficient Fine-Tuning for MultI-scanner PET to PET REconstruction that uses fewer than 1% of the parameters. To the best of our knowledge, this study is the first to systematically explore the efficacy of diverse PEFT techniques in medical imaging reconstruction tasks via prevalent encoder-decoder-type deep models. This investigation, in particular, brings intriguing insights into PETITE as we show further improvements by treating encoder and decoder separately and mixing different PEFT methods, namely, Mix-PEFT. Using multi-scanner PET datasets comprised of five different scanners, we extensively test the cross-scanner PET scan time reduction performances (i.e., a model pre-trained on one scanner is fine-tuned on a different scanner) of 21 feasible Mix-PEFT combinations to derive optimal PETITE. We show that training with less than 1% of the parameters using PETITE performs on par with full fine-tuning (i.e., 100% of the parameters).

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.06716 [pdf, other]

Analyzing the Effectiveness of Listwise Reranking with Positional Invariance on Temporal Generalizability

Authors: Soyoung Yoon, Jongyoon Kim, Seung-won Hwang

Abstract: Benchmarking the performance of information retrieval (IR) methods is mostly conducted within a fixed set of documents (static corpora). However, in real-world web search engine environments, the document set is continuously updated and expanded. Addressing these discrepancies and measuring the temporal persistence of IR systems is crucial. By investigating the LongEval benchmark, specifically designed for such dynamic environments, our findings demonstrate the effectiveness of a listwise reranking approach, which proficiently handles inaccuracies induced by temporal distribution shifts. Among listwise rerankers, our findings show that ListT5, which effectively mitigates the positional bias problem by adopting the Fusion-in-Decoder architecture, is especially effective, and more so, as temporal drift increases, on the test-long subset.

Submitted 9 July, 2024; originally announced July 2024.

Comments: Accepted at CLEF 2024 LongEval track

arXiv:2407.05059 [pdf, other]

Slice-Consistent 3D Volumetric Brain CT-to-MRI Translation with 2D Brownian Bridge Diffusion Model

Authors: Kyobin Choo, Youngjun Jun, Mijin Yun, Seong Jae Hwang

Abstract: In neuroimaging, generally, brain CT is a more cost-effective and accessible imaging option compared to MRI. Nevertheless, CT exhibits inferior soft-tissue contrast and higher noise levels, yielding less precise structural clarity. In response, leveraging more readily available CT to construct its counterpart MRI, namely, medical image-to-image translation (I2I), serves as a promising solution. Particularly, while diffusion models (DMs) have recently risen as a powerhouse, they also come with a few practical caveats for medical I2I. First, DMs' inherent stochasticity from random noise sampling cannot guarantee consistent MRI generation that faithfully reflects its CT. Second, for 3D volumetric images which are prevalent in medical imaging, naively using 2D DMs leads to slice inconsistency, e.g., abnormal structural and brightness changes. While 3D DMs do exist, significant training costs and data dependency bring hesitation. As a solution, we propose novel style key conditioning (SKC) and inter-slice trajectory alignment (ISTA) sampling for the 2D Brownian bridge diffusion model. Specifically, SKC ensures a consistent imaging style (e.g., contrast) across slices, and ISTA interconnects the independent sampling of each slice, deterministically achieving style and shape consistent 3D CT-to-MRI translation. To the best of our knowledge, this study is the first to achieve high-quality 3D medical I2I based only on a 2D DM with no extra architectural models. Our experimental results show superior 3D medical I2I compared with existing 2D and 3D baselines, using an in-house CT-MRI dataset and the BraTS2023 FLAIR-T1 MRI dataset.

Submitted 6 July, 2024; originally announced July 2024.

Comments: 13 pages, 7 figures, Early accepted at Medical Image Computing and Computer Assisted Intervention (MICCAI) 2024

ACM Class: I.4.5; I.4.9; J.3

arXiv:2407.03280 [pdf, other]

Cooperative Multi-Agent Deep Reinforcement Learning Methods for UAV-aided Mobile Edge Computing Networks

Authors: Mintae Kim, Hoon Lee, Sangwon Hwang, Merouane Debbah, Inkyu Lee

Abstract: This paper presents a cooperative multi-agent deep reinforcement learning (MADRL) approach for unmanned aerial vehicle (UAV)-aided mobile edge computing (MEC) networks. A UAV with computing capability can provide task offloading services to ground internet-of-things devices (IDs). With partial observation of the entire network state, the UAV and the IDs individually determine their MEC strategies, i.e., UAV trajectory, resource allocation, and task offloading policy. This requires joint optimization of the decision-making process and coordination strategies among the UAV and the IDs. To address this difficulty, the proposed cooperative MADRL approach computes two types of action variables, namely message action and solution action, each of which is generated by dedicated actor neural networks (NNs). As a result, each agent can automatically encapsulate its coordination messages to enhance the MEC performance in a decentralized manner. The proposed actor structure is designed based on graph attention networks such that operations are possible regardless of the number of IDs. A scalable training algorithm is also proposed to train a group of NNs for arbitrary network configurations. Numerical results demonstrate the superiority of the proposed cooperative MADRL approach over conventional methods.

Submitted 3 July, 2024; originally announced July 2024.

Comments: 13 pages, 6 figures

arXiv:2407.02945 [pdf, other]

VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors

Authors: Sungwon Hwang, Min-Jung Kim, Taewoong Kang, Jayeon Kang, Jaegul Choo

Abstract: Neural rendering-based urban scene reconstruction methods commonly rely on images collected from driving vehicles with cameras facing and moving forward. Although these methods can successfully synthesize from views similar to the training camera trajectory, directing the novel view outside the training camera distribution does not guarantee on-par performance. In this paper, we tackle the Extrapolated View Synthesis (EVS) problem by evaluating the reconstructions on views such as looking left, right or downwards with respect to training camera distributions. To improve rendering quality for EVS, we initialize our model by constructing a dense LiDAR map, and propose to leverage prior scene knowledge such as a surface normal estimator and a large-scale diffusion model. Qualitative and quantitative comparisons demonstrate the effectiveness of our methods on EVS. To the best of our knowledge, we are the first to address the EVS problem in urban scene reconstruction. Link to our project page: https://vegs3d.github.io/.

Submitted 13 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

Comments: The first two authors contributed equally. Project Page: https://vegs3d.github.io/

arXiv:2407.00972 [pdf, other]

FALCON: Frequency Adjoint Link with CONtinuous Density Mask for Fast Single Image Dehazing

Authors: Donghyun Kim, Seil Kang, Seong Jae Hwang

Abstract: Image dehazing, addressing atmospheric interference like fog and haze, remains a pervasive challenge crucial for robust vision applications such as surveillance and remote sensing under adverse visibility. While various methodologies have evolved from early works predicting transmission matrix and atmospheric light features to deep learning and dehazing networks, they innately prioritize dehazing quality metrics, neglecting the need for real-time applicability in time-sensitive domains like autonomous driving. This work introduces FALCON (Frequency Adjoint Link with CONtinuous density mask), a single-image dehazing system achieving state-of-the-art performance on both quality and speed. Particularly, we develop a novel bottleneck module, namely, Frequency Adjoint Link, operating in the frequency space to globally expand the receptive field with minimal growth in network size. Further, we leverage the underlying haze distribution based on the atmospheric scattering model via a Continuous Density Mask (CDM) which serves as a continuous-valued mask input prior and a differentiable auxiliary loss. Comprehensive experiments involving multiple state-of-the-art methods and ablation analysis demonstrate FALCON's exceptional performance in both dehazing quality and speed (i.e., >180 frames per second), quantified by metrics such as FPS, PSNR, and SSIM.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00256 [pdf, other]

One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts

Authors: Ruochen Wang, Sohyun An, Minhao Cheng, Tianyi Zhou, Sung Ju Hwang, Cho-Jui Hsieh

Abstract: Large Language Models (LLMs) exhibit strong generalization capabilities to novel tasks when prompted with language instructions and in-context demos. Since this ability sensitively depends on the quality of prompts, various methods have been explored to automate the instruction design. While these methods demonstrated promising results, they also restricted the searched prompt to one instruction. Such simplification significantly limits their capacity, as a single demo-free instruction might not be able to cover the entire complex problem space of the targeted task. To alleviate this issue, we adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions; each sub-region is governed by a specialized expert, equipped with both an instruction and a set of demos. A two-phase process is developed to construct the specialized expert for each region: (1) demo assignment: inspired by the theoretical connection between in-context learning and kernel regression, we group demos into experts based on their semantic similarity; (2) instruction assignment: a region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect. The resulting method, codenamed Mixture-of-Prompts (MoP), achieves an average win rate of 81% against prior arts across several major benchmarks.

Submitted 28 June, 2024; originally announced July 2024.

Comments: ICML 2024. code available at https://github.com/ruocwang/mixture-of-prompts

MSC Class: 68T01

Journal ref: Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 2024

arXiv:2406.17808 [pdf, other]

Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache

Authors: Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang

Abstract: The context window within a transformer provides a form of active memory for the current task, which can be useful for few-shot learning and conditional generation, both of which depend heavily on previous context tokens. However, as the context length grows, the computational cost increases quadratically. Recent works have shown that saving a few initial tokens along with a fixed-sized sliding window leads to stable streaming generation with linear complexity in transformer-based Large Language Models (LLMs). However, they make suboptimal use of the fixed window by naively evicting all tokens unconditionally from the key-value (KV) cache once they reach the end of the window, resulting in tokens being forgotten and no longer able to affect subsequent predictions. To overcome this limitation, we propose a novel mechanism for storing longer sliding window contexts with the same total cache size by keeping separate cascading sub-cache buffers whereby each subsequent buffer conditionally accepts a fraction of the relatively more important tokens evicted from the previous buffer. Our method results in a dynamic KV cache that can store tokens from the more distant past than a fixed, static sliding window approach. Our experiments show improvements of 5.6% on long context generation (LongBench), 1.2% in streaming perplexity (PG19), and 0.6% in language understanding (MMLU STEM) using LLMs given the same fixed cache size. Additionally, we provide an efficient implementation that improves the KV cache latency from 1.33ms per caching operation to 0.54ms, a 59% speedup over previous work.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.16013 [pdf, other]

Database-Augmented Query Representation for Information Retrieval

Authors: Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, Jong C. Park

Abstract: Information retrieval models that aim to search for the documents relevant to the given query have shown many successes, which have been applied to diverse tasks. However, the query provided by the user is oftentimes very short, which challenges the retrievers to correctly fetch relevant documents. To tackle this, existing studies have proposed expanding the query with a couple of additional (user-related) features related to the query. Yet, they may be suboptimal to effectively augment the query, though there is plenty of information available to augment it in a relational database. Motivated by this, we present a novel retrieval framework called Database-Augmented Query representation (DAQu), which augments the original query with various (query-related) metadata across multiple tables. In addition, as the number of features in the metadata can be very large and there is no order among them, we encode them with our graph-based set encoding strategy, which considers hierarchies of features in the database without order. We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database, demonstrating that ours significantly enhances overall retrieval performance, compared to existing query augmentation methods.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.12543 [pdf, other]

doi 10.1103/PhysRevResearch.6.013215

Phase-controlled heat modulation with Aharonov-Bohm interferometers

Authors: Sun-Yong Hwang, Björn Sothmann, Rosa López

Abstract: A heat modulator is proposed based on a voltage-biased Aharonov-Bohm interferometer. Once an electrical bias is applied, Peltier effects give rise to a flow of heat that can be modulated by a magnetic flux. We determine the corresponding temperature changes using a simple thermal model. Our calculations demonstrate that the modulated temperature difference can be as large as 80 mK at a base temperature of about 600 mK, with relative temperature variations reaching 10%. Our model also predicts, quite generally, the emergence of spin-polarized heat flows without any ferromagnetic contacts, if Rashba spin-orbit interaction is combined with the applied magnetic flux, which potentially paves the way towards caloritronic information processing.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 8 pages, 4 figures

Journal ref: Phys. Rev. Research 6, 013215 (2024)

arXiv:2406.11672 [pdf, other]

Effective Rank Analysis and Regularization for Enhanced 3D Gaussian Splatting

Authors: Junha Hyung, Susung Hong, Sungwon Hwang, Jaeseong Lee, Jaegul Choo, Jin-Hwa Kim

Abstract: 3D reconstruction from multi-view images is one of the fundamental challenges in computer vision and graphics. Recently, 3D Gaussian Splatting (3DGS) has emerged as a promising technique capable of real-time rendering with high-quality 3D reconstruction. This method utilizes 3D Gaussian representation and tile-based splatting techniques, bypassing the expensive neural field querying. Despite its potential, 3DGS encounters challenges, including needle-like artifacts, suboptimal geometries, and inaccurate normals, due to the Gaussians converging into anisotropic Gaussians with one dominant variance. We propose using effective rank analysis to examine the shape statistics of 3D Gaussian primitives, and identify that the Gaussians indeed converge into needle-like shapes with effective rank 1. To address this, we introduce effective rank as a regularization, which constrains the structure of the Gaussians. Our new regularization method enhances normal and geometry reconstruction while reducing needle-like artifacts. The approach can be integrated as an add-on module to other 3DGS variants, improving their quality without compromising visual fidelity.

Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: project page: https://junhahyung.github.io/erankgs.github.io

arXiv:2406.11236 [pdf, other]

doi 10.1145/3643834.3661568

Expanding the Design Space of Computer Vision-based Interactive Systems for Group Dance Practice

Authors: Soohwan Lee, Seoyeong Hwang, Ian Oakley, Kyungho Lee

Abstract: Group dance, a sub-genre characterized by intricate motions made by a cohort of performers in tight synchronization, has a longstanding and culturally significant history and, in modern forms such as cheerleading, a broad base of current adherents. However, despite its popularity, learning group dance routines remains challenging. Based on the prior success of interactive systems to support individual dance learning, this paper argues that group dance settings are fertile ground for augmentation by interactive aids. To better understand these design opportunities, this paper presents a sequence of user-centered studies of and with amateur cheerleading troupes, spanning from the formative (interviews, observations) through the generative (an ideation workshop) to concept validation (technology probes and speed dating). The outcomes are a nuanced understanding of the lived practice of group dance learning, a set of interactive concepts to support those practices, and design directions derived from validating the proposed concepts. Through this empirical work, we expand the design space of interactive dance practice systems from the established context of single-user practice (primarily focused on gesture recognition) to a multi-user, group-based scenario focused on feedback and communication.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 20 pages, 10 figures, 1 table, to be published in the proceedings of the ACM Designing Interactive Systems Conference, 2024, (DIS '24)

Journal ref: ACM Designing Interactive Systems Conference, 2024, (DIS '24)

arXiv:2406.11125 [pdf, other]

Conversational Agents as Catalysts for Critical Thinking: Challenging Design Fixation in Group Design

Authors: Soohwan Lee, Seoyeong Hwang, Kyungho Lee

Abstract: This paper investigates the potential of LLM-based conversational agents (CAs) to enhance critical reflection and mitigate design fixation in group design work. By challenging AI-generated recommendations and prevailing group opinions, these agents address issues such as groupthink and promote a more dynamic and inclusive design process. Key design considerations include optimizing intervention timing, ensuring clarity in counterarguments, and balancing critical thinking with designers' satisfaction. CAs can also adapt to various roles, supporting individual and collective reflection. Our work aligns with the "Death of the Design Researcher?" workshop's goals, emphasizing the transformative potential of generative AI in reshaping design practices and promoting ethical considerations. By exploring innovative uses of generative AI in group design contexts, we aim to stimulate discussion and open new pathways for future research and development, ultimately contributing to practical tools and resources for design researchers.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 7 pages, 2 figures, DIS2024 Workshop on 'Death of Design Researcher'

arXiv:2406.10996 [pdf, other]

THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation

Authors: Seo Hyun Kim, Kai Tzu-iunn Ong, Taeyoon Kwon, Namyoung Kim, Keummin Ka, SeongHyeon Bae, Yohan Jo, Seung-won Hwang, Dongha Lee, Jinyoung Yeo

Abstract: Large language models (LLMs) are capable of processing lengthy dialogue histories during prolonged interaction with users without additional memory modules; however, their responses tend to overlook or incorrectly recall information from the past. In this paper, we revisit memory-augmented response generation in the era of LLMs. While prior work focuses on getting rid of outdated memories, we argue that such memories can provide contextual cues that help dialogue systems understand the development of past events and, therefore, benefit response generation. We present Theanine, a framework that augments LLMs' response generation with memory timelines, i.e., series of memories that demonstrate the development and causality of relevant past events. Along with Theanine, we introduce TeaFarm, a counterfactual-driven question-answering pipeline addressing the limitation of G-Eval in long-term conversations. Supplementary videos of our methods and the TeaBag dataset for TeaFarm evaluation are available at https://theanine-693b0.web.app/.

Submitted 16 June, 2024; originally announced June 2024.

Comments: Under Review

arXiv:2406.10995 [pdf, other]

Concept-skill Transferability-based Data Selection for Large Vision-Language Models

Authors: Jaewoo Lee, Boyang Li, Sung Ju Hwang

Abstract: Instruction tuning, or supervised finetuning on extensive task-specific data, is necessary for Large Vision-Language Models (LVLMs) to generalize well across a broad range of vision-language (VL) tasks. However, training on large VL datasets can become prohibitively expensive. In this work, we introduce COINCIDE, an effective and scalable data selection technique that uses a small model as a reference model to select visual instruction tuning data for efficient finetuning of a target LVLM, focusing on diversity and transferability. Specifically, we cluster the training data using internal activations from a small model, which identifies VL concept-skill compositions needed by a target LVLM. We then sample data from these diverse clusters by considering their density and transferability, or the ability to transfer well to other concept-skill compositions. This approach ensures the diversity of these compositions, which is vital for LVLM generalization. Extensive experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines on two distinct datasets: LLaVA-1.5 and Vision-Flan. Using only 20% of the LLaVA-1.5 dataset, COINCIDE achieves performance comparable to the LVLM finetuned on the whole dataset, with 70% reduction of the wall-clock running time. On the Vision-Flan dataset, our method achieves superior results with only 16.7% of the training data.

Submitted 16 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2406.09827 [pdf, other]

HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning

Authors: Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang

Abstract: In modern large language models (LLMs), increasing sequence lengths is a crucial challenge for enhancing their comprehension and coherence in handling complex tasks such as multi-modal question answering. However, handling long context sequences with LLMs is prohibitively costly due to the conventional attention mechanism's quadratic time and space complexity, and the context window size is limited by the GPU memory. Although recent works have proposed linear and sparse attention mechanisms to address this issue, their real-world applicability is often limited by the need to re-train pre-trained models. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which simultaneously reduces the training and inference time complexity from $O(T^2)$ to $O(T \log T)$ and the space complexity from $O(T^2)$ to $O(T)$. To this end, we devise a dynamic sparse attention mechanism that generates an attention mask through a novel tree-search-like algorithm for a given query on the fly. HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-$k$ most significant elements for each query. Moreover, it ensures that no token is overlooked, unlike the sliding window-based sub-quadratic attention methods, such as StreamingLLM. Extensive experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage while maintaining high generation performance with little or no degradation. As HiP allows pretrained LLMs to scale to millions of tokens on commodity GPUs with no additional engineering due to its easy plug-and-play deployment, we believe that our work will have a large practical impact, opening up the possibility of many long-context LLM applications that were previously infeasible.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 26 pages, 15 figures

arXiv:2406.07736 [pdf, other]

MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models

Authors: Dojun Park, Jiwoo Lee, Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha Hwang, Seonwoo Park, Sungeun Lee

Abstract: As the capabilities of LLMs expand, it becomes increasingly important to evaluate them beyond basic knowledge assessment, focusing on higher-level language understanding. This study introduces MultiPragEval, a robust test suite designed for the multilingual pragmatic evaluation of LLMs across English, German, Korean, and Chinese. Comprising 1200 question units categorized according to Grice's Cooperative Principle and its four conversational maxims, MultiPragEval enables an in-depth assessment of LLMs' contextual awareness and their ability to infer implied meanings. Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages, establishing a state-of-the-art in the field. Among open-source models, Solar-10.7B and Qwen1.5-14B emerge as strong competitors. This study not only leads the way in the multilingual evaluation of LLMs in pragmatic inference but also provides valuable insights into the nuanced capabilities necessary for advanced language comprehension in AI systems.

Submitted 11 June, 2024; originally announced June 2024.

Comments: 8 pages, under review

arXiv:2406.06748 [pdf, other]

Starling Formation-Flying Optical Experiment: Initial Operations and Flight Results

Authors: Justin Kruger, Soon S. Hwang, Simone D'Amico

Abstract: This paper presents initial flight results for distributed optical angles-only navigation of a swarm of small spacecraft, conducted during the Starling Formation-Flying Optical Experiment (StarFOX). StarFOX is a core payload of the NASA Starling mission, which consists of four CubeSats launched in 2023. Prior angles-only flight demonstrations have only featured one observer and target and have relied upon a-priori target orbit knowledge for initialization, translational maneuvers to resolve target range, and external absolute orbit updates to maintain convergence. StarFOX overcomes these limitations by applying the angles-only Absolute and Relative Trajectory Measurement System (ARTMS), which integrates three novel algorithms. Image Processing detects and tracks multiple targets in images from each satellite's on-board camera. Batch Orbit Determination computes initial swarm orbit estimates from bearing angle batches. Sequential Orbit Determination leverages an unscented Kalman filter to refine swarm state estimates over time. Multi-observer measurements shared over an intersatellite link are seamlessly fused to enable absolute and relative orbit determination. StarFOX flight data presents the first demonstrations of autonomous angles-only navigation for a satellite swarm, including multi-target and multi-observer relative navigation; autonomous initialization of navigation for unknown targets; and simultaneous absolute and relative orbit determination.
Relative positioning uncertainties of 1.3% of target range (1$σ$) are achieved for a single observer under challenging measurement conditions, reduced to 0.6% (1$σ$) with multiple observers. Results demonstrate promising performance with regards to ongoing StarFOX campaigns and the application of angles-only navigation to future distributed missions.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to the 38th Small Satellite Conference

arXiv:2406.04630 [pdf, other]

Low-Resource Cross-Lingual Summarization through Few-Shot Learning with Large Language Models

Authors: Gyutae Park, Seojin Hwang, Hwanhee Lee

Abstract: Cross-lingual summarization (XLS) aims to generate a summary in a target language different from the source language document. While large language models (LLMs) have shown promising zero-shot XLS performance, their few-shot capabilities on this task remain unexplored, especially for low-resource languages with limited parallel data. In this paper, we investigate the few-shot XLS performance of various models, including Mistral-7B-Instruct-v0.2, GPT-3.5, and GPT-4. Our experiments demonstrate that few-shot learning significantly improves the XLS performance of LLMs, particularly GPT-3.5 and GPT-4, in low-resource settings. However, the open-source model Mistral-7B-Instruct-v0.2 struggles to adapt effectively to the XLS task with limited examples. Our findings highlight the potential of few-shot learning for improving XLS performance and the need for further research in designing LLM architectures and pre-training objectives tailored for this task. We provide a future work direction to explore more effective few-shot learning strategies and to investigate the transfer learning capabilities of LLMs for cross-lingual summarization.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 7 pages, 3 figures

arXiv:2406.02223 [pdf, other]

doi 10.1109/ICASSP49357.2023.10097143

SMCL: Saliency Masked Contrastive Learning for Long-tailed Recognition

Authors: Sanglee Park, Seung-won Hwang, Jungmin So

Abstract: Real-world data often follow a long-tailed distribution with a high imbalance in the number of samples between classes. The problem with training from imbalanced data is that some background features, common to all classes, can be unobserved in classes with scarce samples. As a result, this background correlates to biased predictions into "major" classes. In this paper, we propose saliency masked contrastive learning, a new method that uses saliency masking and contrastive learning to mitigate the problem and improve the generalizability of a model. Our key idea is to mask the important part of an image using saliency detection and use contrastive learning to move the masked image towards minor classes in the feature space, so that background features present in the masked image are no longer correlated with the original class. Experiment results show that our method achieves state-of-the-art level performance on benchmark long-tailed datasets.

Submitted 4 June, 2024; originally announced June 2024.

Comments: accepted at ICASSP 2023

arXiv:2405.20729 [pdf, other]

Extreme Point Supervised Instance Segmentation

Authors: Hyeonjun Lee, Sehyun Hwang, Suha Kwak

Abstract: This paper introduces a novel approach to learning instance segmentation using extreme points, i.e., the topmost, leftmost, bottommost, and rightmost points, of each object. These points are readily available in the modern bounding box annotation process while offering strong clues for precise segmentation, and thus allow performance to be improved at the same annotation cost as box-supervised methods. Our work considers extreme points as a part of the true instance mask and propagates them to identify potential foreground and background points, which are all together used for training a pseudo label generator. Then pseudo labels given by the generator are in turn used for supervised learning of our final model. On three public benchmarks, our method significantly outperforms existing box-supervised methods, further narrowing the gap with its fully supervised counterpart. In particular, our model generates high-quality masks when a target object is separated into multiple parts, where previous box-supervised methods often fail.

Submitted 3 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

Comments: Accepted to CVPR 2024

arXiv:2405.20616 [pdf, other]

doi 10.3847/1538-4357/ad53c1

SCUBA-2 Ultra Deep Imaging EAO Survey (STUDIES). V. Confusion-limited Submillimeter Galaxy Number Counts at 450 $μ$m and Data Release for the COSMOS Field

Authors: Zhen-Kai Gao, Chen-Fatt Lim, Wei-Hao Wang, Chian-Chou Chen, Ian Smail, Scott C. Chapman, Xian Zhong Zheng, Hyunjin Shim, Tadayuki Kodama, Yiping Ao, Siou-Yu Chang, David L. Clements, James S. Dunlop, Luis C. Ho, Yun-Hsin Hsu, Chorng-Yuan Hwang, Ho Seong Hwang, M. P. Koprowski, Douglas Scott, Stephen Serjeant, Yoshiki Toba, Sheona A. Urquhart

Abstract: We present confusion-limited SCUBA-2 450-$μ$m observations in the COSMOS-CANDELS region as part of the JCMT Large Program, SCUBA-2 Ultra Deep Imaging EAO Survey (STUDIES). Our maps at 450 and 850 $μ$m cover an area of 450 arcmin$^2$. We achieved instrumental noise levels of $σ_{\mathrm{450}}=$ 0.59 mJy beam$^{-1}$ and $σ_{\mathrm{850}}=$ 0.09 mJy beam$^{-1}$ in the deepest area of each map. The corresponding confusion noise levels are estimated to be 0.65 and 0.36 mJy beam$^{-1}$. Above the 4 (3.5) $σ$ threshold, we detected 360 (479) sources at 450 $μ$m and 237 (314) sources at 850 $μ$m. We derive the deepest blank-field number counts at 450 $μ$m, covering the flux-density range of 2 to 43 mJy. These are in agreement with other SCUBA-2 blank-field and lensing-cluster observations, but are lower than various model counts. We compare the counts with those in other fields and find that the field-to-field variance observed at 450 $μ$m at the $R=6^\prime$ scale is consistent with Poisson noise, so there is no evidence of strong 2-D clustering at this scale. Additionally, we derive the integrated surface brightness at 450 $μ$m down to 2.1 mJy to be $57.3^{+1.0}_{-6.2}$ Jy deg$^{-2}$, contributing to (41$\pm$4)% of the 450-$μ$m extragalactic background light (EBL) measured by COBE and Planck. Our results suggest that the 450-$μ$m EBL may be fully resolved at $0.08^{+0.09}_{-0.08}$ mJy, which extremely deep lensing-cluster observations and next-generation submillimeter instruments with large aperture sizes may be able to achieve.

Submitted 31 May, 2024; originally announced May 2024.

Comments: 29 pages, 14 figures, accepted for publication in ApJ

arXiv:2405.18540 [pdf, other]

Learning diverse attacks on large language models for robust red-teaming and safety tuning

Authors: Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain

Abstract: Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.18042 [pdf, other]

Visualizing the loss landscape of Self-supervised Vision Transformer

Authors: Youngwan Lee, Jeffrey Ryan Willette, Jonghee Kim, Sung Ju Hwang

Abstract: The Masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers. However, even though MAE shows better generalization capability than fully supervised training from scratch, the reason why has not been explored. In another line of work, the Reconstruction Consistent Masked Auto Encoder (RC-MAE), has been proposed which adopts a self-distillation scheme in the form of an exponential moving average (EMA) teacher into MAE, and it has been shown that the EMA-teacher performs a conditional gradient correction during optimization. To further investigate the reason for better generalization of the self-supervised ViT when trained by MAE (MAE-ViT) and the effect of the gradient correction of RC-MAE from the perspective of optimization, we visualize the loss landscapes of the self-supervised vision transformer by both MAE and RC-MAE and compare them with the supervised ViT (Sup-ViT). Unlike previous loss landscape visualizations of neural networks based on classification task loss, we visualize the loss landscape of ViT by computing pre-training task loss. Through the lens of loss landscapes, we find two interesting observations: (1) MAE-ViT has a smoother and wider overall loss curvature than Sup-ViT. (2) The EMA-teacher allows MAE to widen the region of convexity in both pretraining and linear probing, leading to quicker convergence. To the best of our knowledge, this work is the first to investigate the self-supervised ViT through the lens of the loss landscape.

Submitted 28 May, 2024; originally announced May 2024.

Comments: NeurIPS 2023 Workshop: Self-Supervised Learning - Theory and Practice

arXiv:2405.17938 [pdf, other]

RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks

Authors: Seong-Hyeon Hwang, Minsu Kim, Steven Euijong Whang

Abstract: We study the problem of robust data augmentation for regression tasks in the presence of noisy data. Data augmentation is essential for generalizing deep learning models, but most of the techniques like the popular Mixup are primarily designed for classification tasks on image data. Recently, there are also Mixup techniques that are specialized to regression tasks like C-Mixup. In comparison to Mixup, which takes linear interpolations of pairs of samples, C-Mixup is more selective in which samples to mix based on their label distances for better regression performance. However, C-Mixup does not distinguish noisy versus clean samples, which can be problematic when mixing and lead to suboptimal model performance. At the same time, robust training has been heavily studied where the goal is to train accurate models against noisy data through multiple rounds of model training. We thus propose our data augmentation strategy RC-Mixup, which tightly integrates C-Mixup with multi-round robust training methods for a synergistic effect. In particular, C-Mixup improves robust training in identifying clean data, while robust training provides cleaner data to C-Mixup for it to perform better. A key advantage of RC-Mixup is that it is data-centric where the robust model training algorithm itself does not need to be modified, but can simply benefit from data mixing. We show in our experiments that RC-Mixup significantly outperforms C-Mixup and robust training baselines on noisy data benchmarks and can be integrated with various robust training methods.

Submitted 15 August, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: Accepted to KDD 2024

arXiv:2405.17918 [pdf, other]

Cost-Sensitive Multi-Fidelity Bayesian Optimization with Transfer of Learning Curve Extrapolation

Authors: Dong Bok Lee, Aoxuan Silvia Zhang, Byungjoo Kim, Junhyeon Park, Juho Lee, Sung Ju Hwang, Hae Beom Lee

Abstract: In this paper, we address the problem of cost-sensitive multi-fidelity Bayesian Optimization (BO) for efficient hyperparameter optimization (HPO). Specifically, we assume a scenario where users want to early-stop the BO when the performance improvement is not satisfactory with respect to the required computational cost. Motivated by this scenario, we introduce utility, which is a function predefined by each user and describes the trade-off between cost and performance of BO. This utility function, combined with our novel acquisition function and stopping criterion, allows us to dynamically choose for each BO step the best configuration that we expect to maximally improve the utility in future, and also automatically stop the BO around the maximum utility. Further, we improve the sample efficiency of existing learning curve (LC) extrapolation methods with transfer learning, while successfully capturing the correlations between different configurations to develop a sensible surrogate function for multi-fidelity BO. We validate our algorithm on various LC datasets and find that it outperforms all the previous multi-fidelity BO and transfer-BO baselines we consider, achieving a significantly better trade-off between cost and performance of BO.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.17373 [pdf]

Probing the Relationship between Defects and Enhanced Mobility in MoS2 Monolayers Grown by Mo Foil

Authors: Sudipta Majumder, Vaibhav Walve, Rahul Chand, Gokul M. A., Sooyeon Hwang, G. V. Pavan Kumar, Aparna Deshpande, Atikur Rahman

Abstract: Atomic vacancies, such as chalcogen vacancies in 2D TMDs, are important in changing the host material's electronic structure and transport properties. We present a straightforward one-step method for growing monolayer MoS2 utilizing oxidized Molybdenum (Mo) foil using CVD and delve into the transport properties of as-grown samples. Devices fabricated from these MoS2 sheets exhibit excellent electrical responses, with the standout device achieving mobility exceeding 100 cm$^2$ V$^{-1}$ s$^{-1}$. Structural analysis and optical signatures unveiled the presence of chalcogen defects within these samples. To decipher the influence of inherent defects on the electronic transport properties, we measured low-temperature transport on two distinct sets of devices exhibiting relatively high or low mobilities. Combining the thermally activated transport model with quantum capacitance calculations, we have shown the existence of shallow states near the conduction band, likely attributed to sulfur vacancies within MoS2. These vacancies are responsible for the hopping conduction of electrons in the device channel. Furthermore, our claims were substantiated through low-temperature scanning tunnelling microscopy measurements, which revealed an abundance of isolated and lateral double sulfur vacancies in Mo foil-grown samples. We found that these vacancies increase the density of states near the conduction band, inducing intrinsic n-type doping in the MoS2 channel. Consequently, this elevated conductivity enhances the field-effect mobility of MoS2 transistors.
Our study offers insights into chalcogen vacancies in CVD-grown monolayer MoS2 and highlights their beneficial impact on electronic transport properties.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.16567 [pdf, other]

Automatic Jailbreaking of the Text-to-Image Generative AI Systems

Authors: Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang

Abstract: Recent AI systems have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation based on large language models (LLMs). At the same time, there are diverse safety risks that can cause the generation of malicious content by circumventing the alignment in LLMs, which are often referred to as jailbreaking. However, most of the previous works only focused on the text-based jailbreaking in LLMs, and the jailbreaking of the text-to-image (T2I) generation system has been relatively overlooked. In this paper, we first evaluate the safety of the commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. From this empirical study, we find that Copilot and Gemini block only 12% and 17% of the attacks with naive prompts, respectively, while ChatGPT blocks 84% of them. Then, we further propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our automated jailbreaking framework leverages an LLM optimizer to generate prompts to maximize the degree of violation from the generated images without any weight updates or gradient computation. Surprisingly, our simple yet effective approach successfully jailbreaks ChatGPT with an 11.0% block rate, making it generate copyrighted content 76% of the time.
Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning techniques, but find that they are inadequate, which suggests the necessity of stronger defense mechanisms.

Submitted 28 May, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

Comments: Under review

arXiv:2405.11807 [pdf, other]

Dual-sided Peltier Elements for Rapid Thermal Feedback in Wearables

Authors: Seongjun Kang, Gwangbin Kim, Seokhyun Hwang, Jeongju Park, Ahmed Elsharkawy, SeungJun Kim

Abstract: This paper introduces a motor-driven Peltier device designed to deliver immediate thermal sensations within extended reality (XR) environments. The system incorporates eight motor-driven Peltier elements, facilitating swift transitions between warm and cool sensations by rotating preheated or cooled elements to opposite sides. A multi-layer structure, comprising aluminum and silicone layers, ensures user comfort and safety while maintaining optimal temperatures for thermal stimuli. Time-temperature characteristic analysis demonstrates the system's ability to provide warm and cool sensations efficiently, with a dual-sided lifetime of up to 206 seconds at a 2V input. Our system design is adaptable to various body parts and can be synchronized with corresponding visual stimuli to enhance the immersive sensation of virtual object interaction and information delivery.

Submitted 20 May, 2024; originally announced May 2024.

Comments: 3 pages, 4 figures, ICRA Wearable Workshop 2024 - 1st Workshop on Advancing Wearable Devices and Applications through Novel Design, Sensing, Actuation, and AI

arXiv:2405.11162 [pdf, other]

LG AI Research & KAIST at EHRSQL 2024: Self-Training Large Language Models with Pseudo-Labeled Unanswerable Questions for a Reliable Text-to-SQL System on EHRs

Authors: Yongrae Jo, Seongyun Lee, Minju Seo, Sung Ju Hwang, Moontae Lee

Abstract: Text-to-SQL models are pivotal for making Electronic Health Records (EHRs) accessible to healthcare professionals without SQL knowledge. With the advancements in large language models, these systems have become more adept at translating complex questions into SQL queries. Nonetheless, the critical need for reliability in healthcare necessitates these models to accurately identify unanswerable questions or uncertain predictions, preventing misinformation. To address this problem, we present a self-training strategy using pseudo-labeled unanswerable questions to enhance the reliability of text-to-SQL models for EHRs. This approach includes a two-stage training process followed by a filtering method based on the token entropy and query execution. Our methodology's effectiveness is validated by our top performance in the EHRSQL 2024 shared task, showcasing the potential to improve healthcare decision-making through more reliable text-to-SQL systems.

Submitted 17 May, 2024; originally announced May 2024.

Comments: NAACL 2024 Clinical NLP Workshop

arXiv:2405.00115 [pdf]

Direct Evidence of a Major Merger in the Perseus Cluster

Authors: Kim HyeongHan, M. James Jee, Wonki Lee, John ZuHone, Irina Zhuravleva, Wooseok Kang, Ho Seong Hwang

Abstract: Although the Perseus cluster has often been regarded as an archetypical relaxed galaxy cluster, several lines of evidence including ancient, large-scale cold fronts, asymmetric plasma morphology, filamentary galaxy distribution, etc., provide a conflicting view of its dynamical state, suggesting that the cluster might have experienced a major merger. However, the absence of a clear merging companion identified to date hampers our understanding of the evolutionary track of the Perseus cluster consistent with these observational features. In this paper, through careful weak lensing analysis, we successfully identified the missing subcluster halo ($M_{200}=1.70^{+0.73}_{-0.59}\times10^{14}~M_{\odot}$) at the >5$σ$ level centered on NGC1264, which is located ~430 kpc west of the Perseus main cluster core. Moreover, a significant ($>3σ$) mass bridge, which is also traced by the cluster member galaxies, is detected between the Perseus main and sub clusters, which serves as direct evidence of gravitational interaction. With idealized numerical simulations, we demonstrate that a ~3:1 off-axis major merger can create the cold front observed ~700 kpc east of the main cluster core and also generate the observed mass bridge through multiple core crossings.

Submitted 8 May, 2024; v1 submitted 30 April, 2024; originally announced May 2024.

Comments: The current version is a submitted manuscript

arXiv:2404.12250 [pdf, other]

Effects of Reduced Interlayer Interactions on the K-point Excitons of MoS$_2$ Nanoscrolls

Authors: Sagnik Chatterjee, Tamaghna Chowdhury, Pablo Díaz Núñez, Nicholas Kay, Manisha Rajput, Sooyeon Hwang, Ivan Timokhin, Artem Mishchenko, Atikur Rahman

Abstract: Transition metal dichalcogenide (TMD) nanoscrolls (NS) exhibit significant photoluminescence (PL) signals despite their multilayer structure, which cannot be explained by the strained multilayer description of NS. Here, we investigate the interlayer interactions in NS to address this discrepancy. The reduction of interlayer interactions in NS is attributed to two factors: (1) the symmetry-broken mixed stacking order between neighbouring layers due to misalignment, and (2) the high inhomogeneity in the strain landscape resulting from the unique Archimedean spiral-like geometry with positive eccentricity. These were confirmed through transmission electron microscopy, field emission scanning electron microscopy and atomic force microscopy. To probe the effect of reduction of interlayer interactions in multilayered MoS$_2$ nanoscrolls, low-temperature PL spectroscopy was employed investigating the behaviour of K-point excitons. The effects of reduced interlayer interactions on exciton-phonon coupling (EXPC), exciton energy, and exciton oscillator strength are discussed, providing insights into the unique properties of TMD nanoscrolls.

Submitted 18 April, 2024; originally announced April 2024.

Comments: S.C. and T.C. have contributed equally to this work

arXiv:2404.11310 [pdf, other]

Autonomous aerial perching and unperching using omnidirectional tiltrotor and switching controller

Authors: Dongjae Lee, Sunwoo Hwang, Jeonghyun Byun, Seung Jae Lee, H. Jin Kim

Abstract: Aerial unperching of multirotors has received little attention as opposed to perching that has been investigated to elongate operation time. This study presents a new aerial robot capable of both perching and unperching autonomously on/from a ferromagnetic surface during flight, and a switching controller to avoid rotor saturation and mitigate overshoot during transition between free-flight and perching. To enable stable perching and unperching maneuvers on/from a vertical surface, a lightweight ($\approx 1$ kg), fully actuated tiltrotor that can hover at $90^\circ$ pitch angle is first developed. We design a perching/unperching module composed of a single servomotor and a magnet, which is then mounted on the tiltrotor. A switching controller including exclusive control modes for transitions between free-flight and perching is proposed. Lastly, we propose a simple yet effective strategy to ensure robust perching in the presence of measurement and control errors and avoid collisions with the perching site immediately after unperching. We validate the proposed framework in experiments where the tiltrotor successfully performs perching and unperching on/from a vertical surface during flight. We further show effectiveness of the proposed transition mode in the switching controller by ablation studies where large overshoot and even collision with a perching site occur. To the best of the authors' knowledge, this work presents the first autonomous aerial unperching framework using a fully actuated tiltrotor.

Submitted 17 April, 2024; originally announced April 2024.

Comments: 7 pages, 10 figures, 2024 IEEE International Conference on Robotics and Automation (ICRA) accepted

arXiv:2404.07738 [pdf, other]

ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models

Authors: Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, Sung Ju Hwang

Abstract: Scientific Research, vital for improving human life, is hindered by its inherent complexity, slow pace, and the need for specialized experts. To enhance its productivity, we propose a ResearchAgent, a large language model-powered research idea writing agent, which automatically generates problems, methods, and experiment designs while iteratively refining them based on scientific literature. Specifically, starting with a core paper as the primary focus to generate ideas, our ResearchAgent is augmented not only with relevant publications through connecting information over an academic graph but also entities retrieved from an entity-centric knowledge store based on their underlying concepts, mined and shared across numerous papers. In addition, mirroring the human approach to iteratively improving ideas with peer discussions, we leverage multiple ReviewingAgents that provide reviews and feedback iteratively. Further, they are instantiated with human preference-aligned large language models whose criteria for evaluation are derived from actual human judgments. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines, showcasing its effectiveness in generating novel, clear, and valid research ideas based on human and model-based evaluation results.

Submitted 11 April, 2024; originally announced April 2024.

Showing 1–50 of 847 results for author: Hwang, S