Search | arXiv e-print repository

Investigating the effect of Mental Models in User Interaction with an Adaptive Dialog Agent

Authors: Lindsey Vanderlyn, Dirk Väth, Ngoc Thang Vu

Abstract: Mental models play an important role in whether user interaction with intelligent systems, such as dialog systems, is successful or not. Adaptive dialog systems present the opportunity to align a dialog agent's behavior with heterogeneous user expectations. However, there has been little research into what mental models users form when interacting with a task-oriented dialog system, how these models affect users' interactions, or what role system adaptation can play in this process, making it challenging to avoid damage to human-AI partnership. In this work, we collect a new publicly available dataset for exploring user mental models about information seeking dialog systems. We demonstrate that users have a variety of conflicting mental models about such systems, the validity of which directly impacts the success of their interactions and perceived usability of the system. Furthermore, we show that adapting a dialog agent's behavior to better align with users' mental models, even when done implicitly, can improve perceived usability, dialog efficiency, and success. To this end, we argue that implicit adaptation can be a valid strategy for task-oriented dialog systems, so long as developers first have a solid understanding of users' mental models.

Submitted 26 August, 2024; originally announced August 2024.

Comments: submitted to COLING 2025

arXiv:2408.14153

Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions

Authors: Lucas Möller, Pascal Tilli, Ngoc Thang Vu, Sebastian Padó

Abstract: Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and learn similarities between them. However, it is not understood how such models compare two inputs. Here, we address this research gap with two contributions. First, we derive a method to attribute predictions of any differentiable dual encoder onto feature-pair interactions between its inputs. Second, we apply our method to CLIP-type models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. However, this visual-linguistic grounding ability heavily varies between object classes, depends on the training data distribution, and largely improves after in-domain training. Using our method we can identify knowledge gaps about specific object classes in individual models and can monitor their improvement upon fine-tuning.

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.05300

High-Precision Ringdown Surrogate Model for Non-Precessing Binary Black Holes

Authors: Lorena Magaña Zertuche, Leo C. Stein, Keefe Mitman, Scott E. Field, Vijay Varma, Michael Boyle, Nils Deppe, Lawrence E. Kidder, Jordan Moxon, Harald P. Pfeiffer, Mark A. Scheel, Kyle C. Nelli, William Throwe, Nils L. Vu

Abstract: Highly precise and robust waveform models are required as improvements in detector sensitivity enable us to test general relativity with more precision than ever before. In this work, we introduce a spin-aligned surrogate ringdown model. This ringdown surrogate, NRSur3dq8_RD, is built with numerical waveforms produced using Cauchy-characteristic evolution. In addition, these waveforms are in the superrest frame of the remnant black hole allowing us to do a correct analysis of the ringdown spectrum. The novel prediction of our surrogate model is complex-valued quasinormal mode (QNM) amplitudes, with median relative errors of $10^{-2}-10^{-3}$ over the parameter space. Like previous remnant surrogates, we also predict the remnant black hole's mass and spin. The QNM mode amplitude errors translate into median errors on ringdown waveforms of $10^{-4}$. The high accuracy and QNM mode content provided by our surrogate will enable high-precision ringdown analyses such as tests of general relativity. Our ringdown model is publicly available through the python package surfinBH.

Submitted 9 August, 2024; originally announced August 2024.

Comments: 11+2 pages, 13 figures, 1 table. This new model is publicly available through surfinBH https://pypi.org/project/surfinBH/

arXiv:2408.00122

A Course Shared Task on Evaluating LLM Output for Clinical Questions

Authors: Yufang Hou, Thy Thy Tran, Doan Nam Long Vu, Yiwen Cao, Kai Li, Lukas Rohde, Iryna Gurevych

Abstract: This paper presents a shared task that we organized at the Foundations of Language Technology (FoLT) course in 2023/2024 at the Technical University of Darmstadt, which focuses on evaluating the output of Large Language Models (LLMs) in generating harmful answers to health-related clinical questions. We describe the task design considerations and report the feedback we received from the students. We expect the task and the findings reported in this paper to be relevant for instructors teaching natural language processing (NLP) and designing course assignments.

Submitted 31 July, 2024; originally announced August 2024.

Comments: accepted at the sixth Workshop on Teaching NLP (co-located with ACL 2024)

arXiv:2407.21061

Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain losses

Authors: Chia-Yu Li, Ngoc Thang Vu

Abstract: Training a semi-supervised end-to-end speech recognition system using noisy student training has significantly improved performance. However, this approach requires a substantial amount of paired speech-text and unlabeled speech, which is costly for low-resource languages. Therefore, this paper considers a more extreme case of semi-supervised end-to-end automatic speech recognition where there are limited paired speech-text, unlabeled speech (less than five hours), and abundant external text. Firstly, we observe improved performance by training the model using our previous work on semi-supervised learning "CycleGAN and inter-domain losses" solely with external text. Secondly, we enhance "CycleGAN and inter-domain losses" by incorporating automatic hyperparameter tuning, calling it "enhanced CycleGAN inter-domain losses." Thirdly, we integrate it into the noisy student training approach pipeline for low-resource scenarios. Our experimental results, conducted on six non-English languages from Voxforge and Common Voice, show a 20% word error rate reduction compared to the baseline teacher model and a 10% word error rate reduction compared to the baseline best student model, highlighting the significant improvements achieved through our proposed method.

Submitted 26 July, 2024; originally announced July 2024.

Comments: 10 pages (2 for references), 4 figures, published in SIGUL2024@LREC-COLING 2024

arXiv:2407.19877

Language-driven Grasp Detection with Mask-guided Attention

Authors: Tuan Van Vo, Minh Nhat Vu, Baoru Huang, An Vuong, Ngan Le, Thieu Vo, Anh Nguyen

Abstract: Grasp detection is an essential task in robotics with various industrial applications. However, traditional methods often struggle with occlusions and do not utilize language for grasping. Incorporating natural language into grasp detection remains a challenging and largely unexplored task. To address this gap, we propose a new method for language-driven grasp detection with mask-guided attention by utilizing the transformer attention mechanism with semantic segmentation features. Our approach integrates visual data, segmentation mask features, and natural language instructions, significantly improving grasp detection accuracy. Our work introduces a new framework for language-driven grasp detection, paving the way for language-driven robotic applications. Intensive experiments show that our method outperforms other recent baselines by a clear margin, with a 10.0% success score improvement. We further validate our method in real-world robotic experiments, confirming the effectiveness of our approach.

Submitted 29 July, 2024; originally announced July 2024.

Comments: Accepted at IROS 2024

arXiv:2407.18789

Granularity is crucial when applying differential privacy to text: An investigation for neural machine translation

Authors: Doan Nam Long Vu, Timour Igamberdiev, Ivan Habernal

Abstract: Applying differential privacy (DP) by means of the DP-SGD algorithm to protect individual data points during training is becoming increasingly popular in NLP. However, the choice of granularity at which DP is applied is often neglected. For example, neural machine translation (NMT) typically operates on the sentence-level granularity. From the perspective of DP, this setup assumes that each sentence belongs to a single person and any two sentences in the training dataset are independent. This assumption is however violated in many real-world NMT datasets, e.g. those including dialogues. For proper application of DP we thus must shift from sentences to entire documents. In this paper, we investigate NMT at both the sentence and document levels, analyzing the privacy/utility trade-off for both scenarios, and evaluating the risks of not using the appropriate privacy granularity in terms of leaking personally identifiable information (PII). Our findings indicate that the document-level NMT system is more resistant to membership inference attacks, emphasizing the significance of using the appropriate granularity when working with DP.

Submitted 26 July, 2024; originally announced July 2024.

arXiv:2407.17967

Lightweight Language-driven Grasp Detection using Conditional Consistency Model

Authors: Nghia Nguyen, Minh Nhat Vu, Baoru Huang, An Vuong, Ngan Le, Thieu Vo, Anh Nguyen

Abstract: Language-driven grasp detection is a fundamental yet challenging task in robotics with various industrial applications. In this work, we present a new approach for language-driven grasp detection that leverages the concept of lightweight diffusion models to achieve fast inference time. By integrating diffusion processes with grasping prompts in natural language, our method can effectively encode visual and textual information, enabling more accurate and versatile grasp positioning that aligns well with the text query. To overcome the long inference time problem in diffusion models, we leverage the image and text features as the condition in the consistency model to reduce the number of denoising timesteps during inference. The intensive experimental results show that our method outperforms other recent grasp detection methods and lightweight diffusion models by a clear margin. We further validate our method in real-world robotic experiments to demonstrate its fast inference time capability.

Submitted 25 July, 2024; originally announced July 2024.

Comments: Accepted at IROS 2024

arXiv:2407.13842

Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

Authors: Toan Nguyen, Minh Nhat Vu, Baoru Huang, An Vuong, Quan Vuong, Ngan Le, Thieu Vo, Anh Nguyen

Abstract: 6-DoF grasp detection has been a fundamental and challenging problem in robotic vision. While previous works have focused on ensuring grasp stability, they often do not consider human intention conveyed through natural language, hindering effective collaboration between robots and users in complex 3D environments. In this paper, we present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds. We first introduce Grasp-Anything-6D, a large-scale dataset for the language-driven 6-DoF grasp detection task with 1M point cloud scenes and more than 200M language-associated 3D grasp poses. We further introduce a novel diffusion model that incorporates a new negative prompt guidance learning strategy. The proposed negative prompt strategy directs the detection process toward the desired object while steering away from unwanted ones given the language input. Our method enables an end-to-end framework where humans can command the robot to grasp desired objects in a cluttered scene using natural language. Intensive experimental results show the effectiveness of our method in both benchmarking experiments and real-world scenarios, surpassing other baselines. In addition, we demonstrate the practicality of our approach in real-world robotic applications. Our project is available at https://airvlab.github.io/grasp-anything.

Submitted 25 July, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

Comments: Accepted at ECCV 2024

arXiv:2407.02937

Probing the Feasibility of Multilingual Speaker Anonymization

Authors: Sarina Meyer, Florian Lux, Ngoc Thang Vu

Abstract: In speaker anonymization, speech recordings are modified in a way that the identity of the speaker remains hidden. While this technology could help to protect the privacy of individuals around the globe, current research restricts this by focusing almost exclusively on English data. In this study, we extend a state-of-the-art anonymization system to nine languages by transforming language-dependent components to their multilingual counterparts. Experiments testing the robustness of the anonymized speech against privacy attacks and speech deterioration show an overall success of this system for all languages. The results suggest that speaker embeddings trained on English data can be applied across languages, and that the anonymization performance for a language is mainly affected by the quality of the speech synthesis component used for it.

Submitted 3 July, 2024; originally announced July 2024.

Comments: accepted at Interspeech 2024

arXiv:2407.01381

Polaritonic Chemistry using the Density Matrix Renormalization Group Method

Authors: Mikuláš Matoušek, Nam Vu, Niranjan Govind, Jonathan J. Foley IV, Libor Veis

Abstract: The emerging field of polaritonic chemistry explores the behavior of molecules under strong coupling with cavity modes. Despite recent developments in ab initio polaritonic methods for simulating polaritonic chemistry under electronic strong coupling, their capabilities are limited, especially in cases where the molecule also features strong electronic correlation. To bridge this gap, we have developed a novel method for cavity QED calculations utilizing the Density Matrix Renormalization Group (DMRG) algorithm in conjunction with the Pauli-Fierz Hamiltonian. Our approach is applied to investigate the effect of the cavity on the S0-S1 transition of n-oligoacenes, with n ranging from 2 to 5, encompassing 22 fully correlated π orbitals in the largest pentacene molecule. Our findings indicate that the influence of the cavity intensifies with larger acenes. Additionally, we demonstrate that, unlike the full determinantal representation, DMRG efficiently optimizes and eliminates excess photonic degrees of freedom, resulting in an asymptotically constant computational cost as the photonic basis increases.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00145

Co-evolving networks for opinion and social dynamics in agent-based models

Authors: Nataša Djurdjevac Conrad, Nhu Quang Vu, Sören Nagel

Abstract: The rise of digital social media has strengthened the coevolution of public opinions and social interactions, which shape social structures and collective outcomes in increasingly complex ways. Existing literature often explores this interplay as a one-directional influence, focusing on how opinions determine social ties within adaptive networks. However, this perspective overlooks the intrinsic dynamics driving social interactions, which can significantly influence how opinions form and evolve. In this work, we address this gap by introducing co-evolving opinion and social dynamics using stochastic agent-based models. Agents' mobility in a social space is governed by both their social and opinion similarity with others. Similarly, the dynamics of opinion formation is driven by the opinions of agents in their social vicinity. We analyze the underlying social and opinion interaction networks and explore the mechanisms influencing the appearance of emerging phenomena, like echo chambers and opinion consensus. To illustrate the model's potential for real-world analysis, we apply it to General Social Survey data on political identity and public opinion regarding governmental issues. Our findings highlight the model's strength in capturing the coevolution of social connections and individual opinions over time.

Submitted 28 June, 2024; originally announced July 2024.

MSC Class: 91Dxx; 05C82; 37Hxx

arXiv:2406.19038

Binary neutron star mergers using a discontinuous Galerkin-finite difference hybrid method

Authors: Nils Deppe, Francois Foucart, Marceline S. Bonilla, Michael Boyle, Nicholas J. Corso, Matthew D. Duez, Matthew Giesler, François Hébert, Lawrence E. Kidder, Yoonsoo Kim, Prayush Kumar, Isaac Legred, Geoffrey Lovelace, Elias R. Most, Jordan Moxon, Kyle C. Nelli, Harald P. Pfeiffer, Mark A. Scheel, Saul A. Teukolsky, William Throwe, Nils L. Vu

Abstract: We present a discontinuous Galerkin-finite difference hybrid scheme that allows high-order shock capturing with the discontinuous Galerkin method for general relativistic magnetohydrodynamics in dynamical spacetimes. We present several optimizations and stability improvements to our algorithm that allow the hybrid method to successfully simulate single, rotating, and binary neutron stars. The hybrid method achieves the efficiency of discontinuous Galerkin methods throughout almost the entire spacetime during the inspiral phase, while being able to robustly capture shocks and resolve the stellar surfaces. We also use Cauchy-Characteristic evolution to compute the first gravitational waveforms at future null infinity from binary neutron star mergers. The simulations presented here are the first successful binary neutron star inspiral and merger simulations using discontinuous Galerkin methods.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 31 pages, 8 figures, comments welcome!

arXiv:2406.09489

Language-driven Grasp Detection

Authors: An Dinh Vuong, Minh Nhat Vu, Baoru Huang, Nghia Nguyen, Hieu Le, Thieu Vo, Anh Nguyen

Abstract: Grasp detection is a persistent and intricate challenge with various industrial applications. Recently, many methods and datasets have been proposed to tackle the grasp detection problem. However, most of them do not consider using natural language as a condition to detect the grasp poses. In this paper, we introduce Grasp-Anything++, a new language-driven grasp detection dataset featuring 1M samples, over 3M objects, and upwards of 10M grasping instructions. We utilize foundation models to create a large-scale scene corpus with corresponding images and grasp prompts. We approach the language-driven grasp detection task as a conditional generation problem. Drawing on the success of diffusion models in generative tasks and given that language plays a vital role in this task, we propose a new language-driven grasp detection method based on diffusion models. Our key contribution is the contrastive training objective, which explicitly contributes to the denoising process to detect the grasp pose given the language instructions. We illustrate that our approach is theoretically supportive. The intensive experiments show that our method outperforms state-of-the-art approaches and allows real-world robotic grasping. Finally, we demonstrate our large-scale dataset enables zero-shot grasp detection and is a challenging benchmark for future work. Project website: https://airvlab.github.io/grasp-anything/

Submitted 13 June, 2024; originally announced June 2024.

Comments: 19 pages. Accepted to CVPR24

arXiv:2406.09039

Language-Driven Closed-Loop Grasping with Model-Predictive Trajectory Replanning

Authors: Huy Hoang Nguyen, Minh Nhat Vu, Florian Beck, Gerald Ebmer, Anh Nguyen, Andreas Kugi

Abstract: Combining a vision module inside a closed-loop control system for a seamless movement of a robot in a manipulation task is challenging due to the inconsistent update rates between utilized modules. This task is even more difficult in a dynamic environment, e.g., when objects are moving. This paper presents a modular zero-shot framework for language-driven manipulation of (dynamic) objects through a closed-loop control system with real-time trajectory replanning and an online 6D object pose localization. We segment an object within 0.5 s by leveraging a vision language model via language commands. Then, guided by natural language commands, a closed-loop system, including a unified pose estimation and tracking and online trajectory planning, is utilized to continuously track this object and compute the optimal trajectory in real-time. Our proposed zero-shot framework provides a smooth trajectory that avoids jerky movements and ensures the robot can grasp a non-stationary object. Experiment results exhibit the real-time capability of the proposed zero-shot modular framework for the trajectory optimization module to accurately and efficiently grasp moving objects, i.e., up to 30 Hz update rates for the online 6D pose localization module and 10 Hz update rates for the receding-horizon trajectory optimization. These advantages highlight the modular framework's potential applications in robotics and human-robot interaction; see the video at https://www.acin.tuwien.ac.at/en/6e64/.

Submitted 19 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: 9 pages, 6 figures

arXiv:2406.08410

Quasistationary hair for binary black hole initial data in scalar Gauss-Bonnet gravity

Authors: Peter James Nee, Guillermo Lara, Harald P. Pfeiffer, Nils L. Vu

Abstract: Recent efforts to numerically simulate compact objects in alternative theories of gravity have largely focused on the time-evolution equations. Another critical aspect is the construction of constraint-satisfying initial data with precise control over the properties of the systems under consideration. Here, we augment the extended conformal thin sandwich framework to construct quasistationary initial data for black hole systems in scalar Gauss-Bonnet theory and numerically implement it in the open-source SpECTRE code. Despite the resulting elliptic system being singular at black hole horizons, we demonstrate how to construct numerical solutions that extend smoothly across the horizon. We obtain quasistationary scalar hair configurations in the test-field limit for black holes with linear/angular momentum as well as for black hole binaries. For isolated black holes, we explicitly show that the scalar profile obtained is stationary by evolving the system in time and compare against previous formulations of scalar Gauss-Bonnet initial data. In the case of the binary, we find that the scalar hair near the black holes can be markedly altered by the presence of the other black hole. The initial data constructed here enables targeted simulations in scalar Gauss-Bonnet gravity with reduced initial transients.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 13 pages, 11 figures

arXiv:2406.07124

CHARME: A chain-based reinforcement learning approach for the minor embedding problem

Authors: Hoang M. Ngo, Nguyen H K. Do, Minh N. Vu, Tamer Kahveci, My T. Thai

Abstract: Quantum Annealing (QA) holds great potential for solving combinatorial optimization problems efficiently. However, the effectiveness of QA algorithms heavily relies on the embedding of problem instances, represented as logical graphs, into the quantum processing unit (QPU), whose topology is in the form of a limited connectivity graph; this is known as the minor embedding problem. Existing methods for the minor embedding problem suffer from scalability issues when confronted with larger problem sizes. In this paper, we propose a novel approach utilizing Reinforcement Learning (RL) techniques to address the minor embedding problem, named CHARME. CHARME includes three key components: a Graph Neural Network (GNN) architecture for policy modeling, a state transition algorithm ensuring solution validity, and an order exploration strategy for effective training. Through comprehensive experiments on synthetic and real-world instances, we demonstrate the efficiency of our proposed order exploration strategy as well as our proposed RL framework, CHARME. In detail, CHARME yields superior solutions compared to fast embedding methods such as Minorminer and ATOM. Moreover, our method surpasses the OCT-based approach, known for its slower runtime but high-quality solutions, in several cases. In addition, our proposed exploration enhances the efficiency of the training of the CHARME framework by providing better solutions compared to the greedy strategy.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06406 [pdf, other]

Controlling Emotion in Text-to-Speech with Natural Language Prompts

Authors: Thomas Bott, Florian Lux, Ngoc Thang Vu

Abstract: In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.

Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: accepted at Interspeech 2024

arXiv:2406.06403 [pdf, other]

Meta Learning Text-to-Speech Synthesis in over 7000 Languages

Authors: Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu

Abstract: In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.

Submitted 10 June, 2024; originally announced June 2024.

Comments: accepted at Interspeech 2024

arXiv:2405.09335 [pdf, other]

Prompting-based Synthetic Data Generation for Few-Shot Question Answering

Authors: Maximilian Schmidt, Andrea Bartezzaghi, Ngoc Thang Vu

Abstract: Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, in contrast, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.

Submitted 15 May, 2024; originally announced May 2024.

Comments: LREC-COLING 2024

arXiv:2405.08868 [pdf, other]

A Review of Gravitational Memory and BMS Frame Fixing in Numerical Relativity

Authors: Keefe Mitman, Michael Boyle, Leo C. Stein, Nils Deppe, Lawrence E. Kidder, Jordan Moxon, Harald P. Pfeiffer, Mark A. Scheel, Saul A. Teukolsky, William Throwe, Nils L. Vu

Abstract: Gravitational memory effects and the BMS freedoms exhibited at future null infinity have recently been resolved and utilized in numerical relativity simulations. With this, gravitational wave models and our understanding of the fundamental nature of general relativity have been vastly improved. In this paper, we review the history and intuition behind memory effects and BMS symmetries, how they manifest in gravitational waves, and how controlling the infinite number of BMS freedoms of numerical relativity simulations can crucially improve the waveform models that are used by gravitational wave detectors. We reiterate the fact that, with memory effects and BMS symmetries, not only can these next-generation numerical waveforms be used to observe never-before-seen physics, but they can also be used to test GR and learn new astrophysical information about our universe.

Submitted 14 May, 2024; originally announced May 2024.

Comments: 20 pages, 8 figures. Submitted to CGQ's focus issue: Gravitational-Wave Memory Effects: From Theory to Observation

arXiv:2405.06197 [pdf, other]

Improved frequency spectra of gravitational waves with memory in a binary-black-hole simulation

Authors: Yitian Chen, Michael Boyle, Nils Deppe, Lawrence E. Kidder, Keefe Mitman, Jordan Moxon, Kyle C. Nelli, Harald P. Pfeiffer, Mark A. Scheel, William Throwe, Nils L. Vu, Saul A. Teukolsky

Abstract: Numerical relativists can now produce gravitational waveforms with memory effects routinely and accurately. The gravitational-wave memory effect contains very low-frequency components, including a persistent offset. The presence of these components violates basic assumptions about time-shift behavior underpinning standard data-analysis techniques in gravitational-wave astronomy. This poses a challenge to the analysis of waveform spectra: How to preserve the low-frequency characteristics when transforming a time-domain waveform to the frequency domain. To tackle this challenge, we revisit the preprocessing procedures applied to the waveforms that contain memory effects. We find inconsistency between the zero-frequency limit of displacement memory and the low-frequency spectrum of the same memory preprocessed using the common scheme in literature. To resolve the inconsistency, we propose a new robust preprocessing scheme that produces the spectra of memory waveforms more faithfully. Using this new scheme, we inspect several characteristics of the spectrum of a memory waveform. In particular, we find a discernible beating pattern formed by the dominant oscillatory mode and the displacement memory. This pattern is absent in the spectrum of a waveform without memory. The difference between the memory and no-memory waveforms is too small to be observed by current-generation detectors in a single binary-black-hole event. Detecting the memory in a single event is likely to occur in the era of next-generation detectors.

Submitted 9 May, 2024; originally announced May 2024.

Comments: 24 pages, 11 figures, 5 tables

arXiv:2405.06120 [pdf, other]

A discontinuous Galerkin scheme for elliptic equations on extremely stretched grids

Authors: Nils L. Vu

Abstract: Discontinuous Galerkin (DG) methods for solving elliptic equations are gaining popularity in the computational physics community for their high-order spectral convergence and their potential for parallelization on computing clusters. However, problems in numerical relativity with extremely stretched grids, such as initial data problems for binary black holes that impose boundary conditions at large distances from the black holes, have proven challenging for DG methods. To alleviate this problem we have developed a primal DG scheme that is generically applicable to a large class of elliptic equations, including problems on curved and extremely stretched grids. The DG scheme accommodates two widely used initial data formulations in numerical relativity, namely the puncture formulation and the extended conformal thin-sandwich (XCTS) formulation. We find that our DG scheme is able to stretch the grid by a factor of $\sim 10^9$ and hence allows boundary conditions to be imposed at large distances. The scheme converges exponentially with resolution both for the smooth XCTS problem and for the non-smooth puncture problem. With this method we are able to generate high-quality initial data for binary black hole problems using a parallelizable DG scheme. The code is publicly available in the open-source SpECTRE numerical relativity code.

Submitted 9 May, 2024; originally announced May 2024.

Comments: 12 pages, 10 figures. Results are reproducible with the ancillary input files

arXiv:2404.10922 [pdf, other]

Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training

Authors: Pavel Denisov, Ngoc Thang Vu

Abstract: Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain.

Submitted 16 April, 2024; originally announced April 2024.

Comments: NAACL Findings 2024

arXiv:2404.10222 [pdf, other]

Simulating electronic structure on bosonic quantum computers

Authors: Rishab Dutta, Nam P. Vu, Ningyi Lyu, Chen Wang, Victor S. Batista

Abstract: Computation with quantum harmonic oscillators, or qumodes, is a promising and rapidly evolving approach towards quantum computing. In contrast to qubits, which are two-level quantum systems, bosonic qumodes can in principle have infinite discrete levels, and can also be represented with continuous variable bases. One of the most promising applications of quantum computing is simulating many-fermion problems such as molecular electronic structure. Although there has been a lot of recent progress on simulating many-fermion systems on qubit-based quantum hardware, these approaches cannot be easily extended to bosonic quantum devices due to the fundamental difference in physics represented by qubits and qumodes. In this work, we show how an electronic structure Hamiltonian can be transformed into a system of qumodes with a fermion to boson mapping scheme and apply it to simulate the electronic structure of the dihydrogen molecule as a system of two qumodes. Our work opens the door for simulating many-fermion systems by harnessing the power of bosonic quantum devices.

Submitted 27 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: 47 pages including references, 7 figures, revised

arXiv:2404.10214 [pdf, other]

doi 10.1021/acs.jctc.4c00544

Simulating Chemistry on Bosonic Quantum Devices

Authors: Rishab Dutta, Delmar G. A. Cabral, Ningyi Lyu, Nam P. Vu, Yuchen Wang, Brandon Allen, Xiaohan Dan, Rodrigo G. Cortiñas, Pouya Khazaei, Max Schäfer, Alejandro C. C. d. Albornoz, Scott E. Smart, Scott Nie, Michel H. Devoret, David A. Mazziotti, Prineha Narang, Chen Wang, James D. Whitfield, Angela K. Wilson, Heidi P. Hendrickson, Daniel A. Lidar, Francisco Pérez-Bernal, Lea F. Santos, Sabre Kais, Eitan Geva, et al. (1 additional author not shown)

Abstract: Bosonic quantum devices offer a novel approach to realize quantum computations, where the quantum two-level system (qubit) is replaced with the quantum (an)harmonic oscillator (qumode) as the fundamental building block of the quantum simulator. The simulation of chemical structure and dynamics can then be achieved by representing or mapping the system Hamiltonians in terms of bosonic operators. In this perspective, we review recent progress and future potential of using bosonic quantum devices for addressing a wide range of challenging chemical problems, including the calculation of molecular vibronic spectra, the simulation of gas-phase and solution-phase adiabatic and nonadiabatic chemical dynamics, the efficient solution of molecular graph theory problems, and the calculations of electronic structure.

Submitted 5 July, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: 40 pages including references, 13 figures, revised

arXiv:2404.07122 [pdf, other]

Driver Attention Tracking and Analysis

Authors: Dat Viet Thanh Nguyen, Anh Tran, Hoai Nam Vu, Cuong Pham, Minh Hoai

Abstract: We propose a novel method to estimate a driver's points-of-gaze using a pair of ordinary cameras mounted on the windshield and dashboard of a car. This is a challenging problem due to the dynamics of traffic environments with 3D scenes of unknown depths. This problem is further complicated by the volatile distance between the driver and the camera system. To tackle these challenges, we develop a novel convolutional network that simultaneously analyzes the image of the scene and the image of the driver's face. This network has a camera calibration module that can compute an embedding vector that represents the spatial configuration between the driver and the camera system. This calibration module improves the overall network's performance, which can be jointly trained end to end. We also address the lack of annotated data for training and evaluation by introducing a large-scale driving dataset with point-of-gaze annotations. This is an in situ dataset of real driving sessions in an urban city, containing synchronized images of the driving scene as well as the face and gaze of the driver. Experiments on this dataset show that the proposed method outperforms various baseline methods, having the mean prediction error of 29.69 pixels, which is relatively small compared to the $1280{\times}720$ resolution of the scene camera.

Submitted 11 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.04018 [pdf, ps, other]

doi 10.1145/3638529.3654140

Superior Genetic Algorithms for the Target Set Selection Problem Based on Power-Law Parameter Choices and Simple Greedy Heuristics

Authors: Benjamin Doerr, Martin S. Krejca, Nguyen Vu

Abstract: The target set selection problem (TSS) asks for a set of vertices such that an influence spreading process started in these vertices reaches the whole graph. The current state of the art for this NP-hard problem are three recently proposed randomized search heuristics, namely a biased random-key genetic algorithm (BRKGA) obtained from extensive parameter tuning, a max-min ant system (MMAS), and a MMAS using Q-learning with a graph convolutional network. We show that the BRKGA with two simple modifications and without the costly parameter tuning obtains significantly better results. Our first modification is to simply choose all parameters of the BRKGA in each iteration randomly from a power-law distribution. The resulting parameterless BRKGA is already competitive with the tuned BRKGA, as our experiments on the previously used benchmarks show. We then add a natural greedy heuristic, namely to repeatedly discard small-degree vertices that are not necessary for reaching the whole graph. The resulting algorithm consistently outperforms all of the state-of-the-art algorithms. Besides providing a superior algorithm for the TSS problem, this work shows that randomized parameter choices and elementary greedy heuristics can give better results than complex algorithms and costly parameter tuning.

Submitted 5 April, 2024; originally announced April 2024.

Journal ref: GECCO '24: Proceedings of the Genetic and Evolutionary Computation Conference, pages 169-177, ACM, 2024

arXiv:2403.17647 [pdf, other]

Intrinsic Subgraph Generation for Interpretable Graph based Visual Question Answering

Authors: Pascal Tilli, Ngoc Thang Vu

Abstract: The large success of deep learning based methods in Visual Question Answering (VQA) has concurrently increased the demand for explainable methods. Most methods in Explainable Artificial Intelligence (XAI) focus on generating post-hoc explanations rather than taking an intrinsic approach, the latter characterizing an interpretable model. In this work, we introduce an interpretable approach for graph-based VQA and demonstrate competitive performance on the GQA dataset. This approach bridges the gap between interpretability and performance. Our model is designed to intrinsically produce a subgraph during the question-answering process as its explanation, providing insight into the decision making. To evaluate the quality of these generated subgraphs, we compare them against established post-hoc explainability methods for graph neural networks, and perform a human evaluation. Moreover, we present quantitative metrics that correlate with the evaluations of human assessors, acting as automatic metrics for the generated explanatory subgraphs. Our implementation is available at https://github.com/DigitalPhonetics/Intrinsic-Subgraph-Generation-for-VQA.

Submitted 27 March, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

Comments: Accepted at LREC-COLING 2024

arXiv:2403.17582 [pdf, other]

Towards a Zero-Data, Controllable, Adaptive Dialog System

Authors: Dirk Väth, Lindsey Vanderlyn, Ngoc Thang Vu

Abstract: Conversational Tree Search (Väth et al., 2023) is a recent approach to controllable dialog systems, where domain experts shape the behavior of a Reinforcement Learning agent through a dialog tree. The agent learns to efficiently navigate this tree, while adapting to information needs, e.g., domain familiarity, of different users. However, the need for additional training data hinders deployment in new domains. To address this, we explore approaches to generate this data directly from dialog trees. We improve the original approach, and show that agents trained on synthetic data can achieve comparable dialog success to models trained on human data, both when using a commercial Large Language Model for generation, or when using a smaller open-source model, running on a single GPU. We further demonstrate the scalability of our approach by collecting and testing on two new datasets: ONBOARD, a new domain helping foreign residents moving to a new city, and the medical domain DIAGNOSE, a subset of Wikipedia articles related to scalp and head symptoms. Finally, we perform human testing, where no statistically significant differences were found in either objective or subjective measures between models trained on human and generated data.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.08705 [pdf, other]

Scalarization of isolated black holes in scalar Gauss-Bonnet theory in the fixing-the-equations approach

Authors: Guillermo Lara, Harald P. Pfeiffer, Nikolas A. Wittek, Nils L. Vu, Kyle C. Nelli, Alexander Carpenter, Geoffrey Lovelace, Mark A. Scheel, William Throwe

Abstract: One of the most promising avenues to perform numerical evolutions in theories beyond General Relativity is the fixing-the-equations approach, a proposal in which new "driver" equations are added to the evolution equations in a way that allows for stable numerical evolutions. In this direction, we extend the numerical relativity code SpECTRE to evolve a "fixed" version of scalar Gauss-Bonnet theory in the decoupling limit, a phenomenologically interesting theory that allows for hairy black hole solutions in vacuum. We focus on isolated black hole systems both with and without linear and angular momentum, and propose a new driver equation to improve the recovery of such stationary solutions. We demonstrate the effectiveness of the latter by numerically evolving black holes that undergo spontaneous scalarization using different driver equations. Finally, we evaluate the accuracy of the obtained solutions by comparing with the original unaltered theory.

Submitted 13 March, 2024; originally announced March 2024.

Comments: 16 pages, 12 figures

arXiv:2403.05338 [pdf, other]

Explaining Pre-Trained Language Models with Attribution Scores: An Analysis in Low-Resource Settings

Authors: Wei Zhou, Heike Adel, Hendrik Schuff, Ngoc Thang Vu

Abstract: Attribution scores indicate the importance of different input parts and can, thus, explain model behaviour. Currently, prompt-based models are gaining popularity, i.a., due to their easier adaptability in low-resource settings. However, the quality of attribution scores extracted from prompt-based models has not been investigated yet. In this work, we address this topic by analyzing attribution scores extracted from prompt-based models w.r.t. plausibility and faithfulness and comparing them with attribution scores extracted from fine-tuned models and large language models. In contrast to previous work, we introduce training size as another dimension into the analysis. We find that using the prompting paradigm (with either encoder-based or decoder-based models) yields more plausible explanations than fine-tuning the models in low-resource settings and Shapley Value Sampling consistently outperforms attention and Integrated Gradients in terms of leading to more plausible and faithful explanations.

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.04784 [pdf, other]

Analysis of Privacy Leakage in Federated Large Language Models

Authors: Minh N. Vu, Truc Nguyen, Tre' R. Jeter, My T. Thai

Abstract: With the rapid adoption of Federated Learning (FL) as the training and tuning protocol for applications utilizing Large Language Models (LLMs), recent research highlights the need for significant modifications to FL to accommodate the large-scale of LLMs. While substantial adjustments to the protocol have been introduced as a response, comprehensive privacy analysis for the adapted FL protocol is currently lacking. To address this gap, our work delves into an extensive examination of the privacy analysis of FL when used for training LLMs, both from theoretical and practical perspectives. In particular, we design two active membership inference attacks with guaranteed theoretical success rates to assess the privacy leakages of various adapted FL configurations. Our theoretical findings are translated into practical attacks, revealing substantial privacy vulnerabilities in popular LLMs, including BERT, RoBERTa, DistilBERT, and OpenAI's GPTs, across multiple real-world language datasets. Additionally, we conduct thorough experiments to evaluate the privacy leakage of these models when data is protected by state-of-the-art differential privacy (DP) mechanisms.

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2402.04769 [pdf, other]

Hierarchical Motion Planning and Offline Robust Model Predictive Control for Autonomous Vehicles

Authors: Hung Duy Nguyen, Minh Nhat Vu, Nguyen Ngoc Nam, Kyoungseok Han

Abstract: Driving vehicles in complex scenarios under harsh conditions is the biggest challenge for autonomous vehicles (AVs). To address this issue, we propose hierarchical motion planning and robust control strategy using the front-active steering system in complex scenarios with various slippery road adhesion coefficients while considering vehicle uncertain parameters. Behaviors of human vehicles (HVs) are considered and modeled in the form of a car-following model via the Intelligent Driver Model (IDM). Then, in the upper layer, the motion planner first generates an optimal trajectory by using the artificial potential field (APF) algorithm to formulate any surrounding objects, e.g., road marks, boundaries, and static/dynamic obstacles. To track the generated optimal trajectory, in the lower layer, an offline-constrained output feedback robust model predictive control (RMPC) is employed for the linear parameter varying (LPV) system by applying linear matrix inequality (LMI) optimization method that ensures the robustness against the model parameter uncertainties. Furthermore, by augmenting the system model, our proposed approach, called offline RMPC, achieves outstanding efficiency compared to three existing RMPC approaches, e.g., offset-offline RMPC, online RMPC, and offline RMPC without an augmented model (offline RMPC w/o AM), in both improving computing time and reducing input vibrations.

Submitted 7 February, 2024; originally announced February 2024.

Comments: 6 pages, 9 illustrations, Accepted for publication in American Control Conference (ACC) 2024

arXiv:2402.04730 [pdf, other]

Model Predictive Trajectory Optimization With Dynamically Changing Waypoints for Serial Manipulators

Authors: Florian Beck, Minh Nhat Vu, Christian Hartl-Nesic, Andreas Kugi

Abstract: Systematically including dynamically changing waypoints as desired discrete actions, for instance, resulting from superordinate task planning, has been challenging for online model predictive trajectory optimization with short planning horizons. This paper presents a novel waypoint model predictive control (wMPC) concept for online replanning tasks. The main idea is to split the planning horizon at the waypoint when it becomes reachable within the current planning horizon and reduce the horizon length towards the waypoints and goal points. This approach keeps the computational load low and provides flexibility in adapting to changing conditions in real time. The presented approach achieves competitive path lengths and trajectory durations compared to (global) offline RRT-type planners in a multi-waypoint scenario. Moreover, the ability of wMPC to dynamically replan tasks online is experimentally demonstrated on a KUKA LBR iiwa 14 R820 robot in a dynamic pick-and-place scenario.

Submitted 7 February, 2024; originally announced February 2024.

Comments: 8 pages, 6 figures

arXiv:2402.02819 [pdf, other]

doi 10.1103/PhysRevD.109.124030

Striking the right tone: toward a self-consistent framework for measuring black hole ringdowns

Authors: Teagan A. Clarke, Maximiliano Isi, Paul D. Lasky, Eric Thrane, Michael Boyle, Nils Deppe, Lawrence E. Kidder, Keefe Mitman, Jordan Moxon, Kyle C. Nelli, William Throwe, Nils L. Vu

Abstract: The ringdown portion of a binary black hole merger consists of a sum of modes, each containing an infinite number of tones that are exponentially damped sinusoids. In principle, these can be measured as gravitational waves with observatories like LIGO/Virgo/KAGRA; however, in practice it is unclear how many tones can be meaningfully resolved. We investigate the consistency and resolvability of the overtones of the quadrupolar $\ell = m = 2$ mode by starting at late times when the gravitational waveform is expected to be well-approximated by the $\ell m n = 220$ tone alone. We present a Bayesian inference framework to measure the tones in numerical relativity data. We measure tones at different start times, checking for consistency: we classify a tone as stably recovered if and only if the 95% credible intervals for amplitude and phase at time $t$ overlap with the credible intervals at all subsequent times. We test a set of tones including the first four overtones of the fundamental mode and the 320 tone and find that the 220 and 221 tones can be measured consistently with the inclusion of additional overtones. The 222 tone measurements can be stabilised when we include the 223 tone, but only in a narrow time window, after which it is too weak to measure. The 223 tone recovery appears to be unstable, and does not become stable with the introduction of the 224 tone. We find that $N=3$ tones can be stably recovered simultaneously. However, when analysing $N \geq 4$ tones, the amplitude of one tone is consistent with zero. Thus, within our framework, one can identify only $N=3$ tones with non-zero amplitude that are simultaneously stable.

Submitted 11 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: 14 pages, 8 figures, 2 tables. Published in PRD

arXiv:2401.17676 [pdf, other]

Observer-based Controller Design for Oscillation Damping of a Novel Suspended Underactuated Aerial Platform

Authors: Hemjyoti Das, Minh Nhat Vu, Tobias Egle, Christian Ott

Abstract: In this work, we present a novel actuation strategy for a suspended aerial platform. By utilizing an underactuation approach, we demonstrate the successful oscillation damping of the proposed platform, modeled as a spherical double pendulum. A state estimator is designed in order to obtain the deflection angles of the platform, which uses only onboard IMU measurements. The state estimator is an extended Kalman filter (EKF) with intermittent measurements obtained at different frequencies. An optimal state feedback controller and a PD+ controller are designed in order to dampen the oscillations of the platform in the joint space and task space respectively. The proposed underactuated platform is found to be more energy-efficient than an omnidirectional platform and requires fewer actuators. The effectiveness of our proposed system is validated using both simulations and experimental studies.

Submitted 31 January, 2024; originally announced January 2024.

Comments: 7 pages, 11 figures, Accepted for publication to ICRA 2024

arXiv:2401.09059 [pdf, other]

Autonomous Catheterization with Open-source Simulator and Expert Trajectory

Authors: Tudor Jianu, Baoru Huang, Tuan Vo, Minh Nhat Vu, Jingxuan Kang, Hoan Nguyen, Olatunji Omisore, Pierre Berthet-Rayne, Sebastiano Fichera, Anh Nguyen

Abstract: Endovascular robots have been actively developed in both academia and industry. However, progress toward autonomous catheterization is often hampered by the widespread use of closed-source simulators and physical phantoms. Additionally, the acquisition of large-scale datasets for training machine learning algorithms with endovascular robots is usually infeasible due to expensive medical procedures. In this chapter, we introduce CathSim, the first open-source simulator for endovascular intervention to address these limitations. CathSim emphasizes real-time performance to enable rapid development and testing of learning algorithms. We validate CathSim against the real robot and show that our simulator can successfully mimic the behavior of the real robot. Based on CathSim, we develop a multimodal expert navigation network and demonstrate its effectiveness in downstream endovascular navigation tasks. The intensive experimental results suggest that CathSim has the potential to significantly accelerate research in the autonomous catheterization field. Our project is publicly available at https://github.com/airvlab/cathsim.

Submitted 19 January, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: Code: https://github.com/airvlab/cathsim

arXiv:2401.00805 [pdf, other]

Nonlinear Effects In Black Hole Ringdown From Scattering Experiments I: spin and initial data dependence of quadratic mode coupling

Authors: Hengrui Zhu, Justin L. Ripley, Frans Pretorius, Sizheng Ma, Keefe Mitman, Robert Owen, Michael Boyle, Yitian Chen, Nils Deppe, Lawrence E. Kidder, Jordan Moxon, Kyle C. Nelli, Harald P. Pfeiffer, Mark A. Scheel, William Throwe, Nils L. Vu

Abstract: We investigate quadratic quasinormal mode coupling in black hole spacetime through numerical simulations of single perturbed black holes using both numerical relativity and second-order black hole perturbation theory. Focusing on the dominant $\ell=|m|=2$ quadrupolar modes, we find good agreement (within $\sim10\%$) between these approaches, with discrepancies attributed to truncation error and uncertainties from mode fitting. Our results align with earlier studies extracting the coupling coefficients from select binary black hole merger simulations, showing consistency for the same remnant spins. Notably, the coupling coefficient is insensitive to a diverse range of initial data, including configurations that led to a significant (up to $5\%$) increase in the remnant black hole mass. These findings present opportunities for testing the nonlinear dynamics of general relativity with ground-based gravitational wave observatories. Lastly, we provide evidence of a bifurcation in coupling coefficients between counter-rotating and co-rotating quasinormal modes as black hole spin increases.

Submitted 1 January, 2024; originally announced January 2024.

arXiv:2312.08588 [pdf, other]

Black Hole Spectroscopy for Precessing Binary Black Hole Coalescences

Authors: Hengrui Zhu, Harrison Siegel, Keefe Mitman, Maximiliano Isi, Will M. Farr, Michael Boyle, Nils Deppe, Lawrence E. Kidder, Sizheng Ma, Jordan Moxon, Kyle C. Nelli, Harald P. Pfeiffer, Mark A. Scheel, Saul A. Teukolsky, William Throwe, Vijay Varma, Nils L. Vu

Abstract: The spectroscopic study of black hole quasinormal modes in gravitational-wave ringdown observations is hindered by our ignorance of which modes should dominate astrophysical signals for different binary configurations, limiting tests of general relativity and astrophysics. In this work, we present a description of the quasinormal modes that are excited in the ringdowns of comparable mass, quasi-circular precessing binary black hole coalescences -- a key region of parameter space that has yet to be fully explored within the framework of black hole spectroscopy. We suggest that the remnant perturbation for precessing and non-precessing systems is approximately the same up to a rotation, which implies that the relative amplitudes of the quasinormal modes in both systems are also related by a rotation. We present evidence for this by analyzing an extensive catalog of numerical relativity simulations. Additional structure in the amplitudes is connected to the system's kick velocity and other asymmetries in the orbital dynamics. We find that the ringdowns of precessing systems need not be dominated by the $(\ell,m)=(2,\pm 2)$ quasinormal modes, and that instead the $(2,\pm 1)$ or $(2,0)$ quasinormal modes can dominate. Our results are consistent with a ringdown analysis of the LIGO-Virgo gravitational wave signal GW190521, and may also help in understanding phenomenological inspiral-merger-ringdown waveform model systematics.

Submitted 18 July, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: Data Release and Analysis Scripts: https://github.com/HengruiPrinceton/precession_ringdown

arXiv:2311.14465 [pdf, other]

DP-NMT: Scalable Differentially-Private Machine Translation

Authors: Timour Igamberdiev, Doan Nam Long Vu, Felix Künnecke, Zhuo Yu, Jannik Holmer, Ivan Habernal

Abstract: Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing models, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.

Submitted 24 April, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

Comments: Accepted at EACL 2024

arXiv:2310.17502 [pdf, other]

doi 10.21437/Interspeech.2023-858

Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions

Authors: Florian Lux, Pascal Tilli, Sarina Meyer, Ngoc Thang Vu

Abstract: Customizing voice and speaking style in a speech synthesis system with intuitive and fine-grained controls is challenging, given that little data with appropriate labels is available. Furthermore, editing an existing human's voice also comes with ethical concerns. In this paper, we propose a method to generate artificial speaker embeddings that cannot be linked to a real human while offering intuitive and fine-grained control over the voice and speaking style of the embeddings, without requiring any labels for speaker or style. The artificial and controllable embeddings can be fed to a speech synthesis system, conditioned on embeddings of real humans during training, without sacrificing privacy during inference.

Submitted 26 October, 2023; originally announced October 2023.

Comments: Published at ISCA Interspeech 2023 https://www.isca-speech.org/archive/interspeech_2023/lux23_interspeech.html

arXiv:2310.17499 [pdf, other]

The IMS Toucan System for the Blizzard Challenge 2023

Authors: Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler, Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu

Abstract: For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final wave. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open source code and demo are available.

Submitted 26 October, 2023; originally announced October 2023.

Comments: Published at the Blizzard Challenge Workshop 2023, colocated with the Speech Synthesis Workshop 2023, a satellite event of Interspeech 2023

arXiv:2310.16618 [pdf, other]

Real-time 6-DoF Pose Estimation by an Event-based Camera using Active LED Markers

Authors: Gerald Ebmer, Adam Loch, Minh Nhat Vu, Germain Haessig, Roberto Mecca, Markus Vincze, Christian Hartl-Nesic, Andreas Kugi

Abstract: Real-time applications for autonomous operations depend largely on fast and robust vision-based localization systems. Since image processing tasks require processing large amounts of data, the computational resources often limit the performance of other processes. To overcome this limitation, traditional marker-based localization systems are widely used since they are easy to integrate and achieve reliable accuracy. However, classical marker-based localization systems significantly depend on standard cameras with low frame rates, which often lack accuracy due to motion blur. In contrast, event-based cameras provide high temporal resolution and a high dynamic range, which can be utilized for fast localization tasks, even under challenging visual conditions. This paper proposes a simple but effective event-based pose estimation system using active LED markers (ALM) for fast and accurate pose estimation. The proposed algorithm is able to operate in real time with a latency below 0.5 ms while maintaining output rates of 3 kHz. Experimental results in static and dynamic scenarios are presented to demonstrate the performance of the proposed approach in terms of computational speed and absolute accuracy, using the OptiTrack system as the basis for measurement.

Submitted 25 October, 2023; originally announced October 2023.

Comments: 14 pages, 12 figures, this paper has been accepted to WACV 2024

arXiv:2310.15948 [pdf, other]

Language-driven Scene Synthesis using Multi-conditional Diffusion Model

Authors: An Vuong, Minh Nhat Vu, Toan Tien Nguyen, Baoru Huang, Dzung Nguyen, Thieu Vo, Anh Nguyen

Abstract: Scene synthesis is a challenging problem with several industrial applications. Recently, substantial efforts have been directed to synthesize the scene using human motions, room layouts, or spatial graphs as the input. However, few studies have addressed this problem from multiple modalities, especially combining text prompts. In this paper, we propose a language-driven scene synthesis task, which is a new task that integrates text prompts, human motion, and existing objects for scene synthesis. Unlike other single-condition synthesis tasks, our problem involves multiple conditions and requires a strategy for processing and encoding them into a unified space. To address the challenge, we present a multi-conditional diffusion model, which differs from the implicit unification approach of other diffusion literature by explicitly predicting the guiding points for the original data distribution. We demonstrate that our approach is theoretically supportive. The intensive experiment results illustrate that our method outperforms state-of-the-art benchmarks and enables natural scene editing applications. The source code and dataset can be accessed at https://lang-scene-synth.github.io/.

Submitted 24 October, 2023; originally announced October 2023.

Comments: Accepted to NeurIPS 2023

arXiv:2310.15262 [pdf, other]

Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study

Authors: Injy Hamed, Nizar Habash, Ngoc Thang Vu

Abstract: Code-switching (CSW) text generation has been receiving increasing attention as a solution to address data scarcity. In light of this growing interest, we need more comprehensive studies comparing different augmentation approaches. In this work, we compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT), in the context of Egyptian Arabic-English CSW. We assess the effectiveness of the approaches on machine translation and the quality of augmentations through human evaluation. We show that BT and CSW predictive-based lexical replacement, being trained on CSW parallel data, perform best on both tasks. Linguistic theories and random lexical replacement prove to be effective in the absence of CSW parallel data, where both approaches achieve similar results.

Submitted 23 October, 2023; originally announced October 2023.

Comments: Findings of EMNLP 2023

arXiv:2310.06103 [pdf, other]

Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding

Authors: Pavel Denisov, Ngoc Thang Vu

Abstract: A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models, however their evaluation often lacks multilingual setup and tasks that require prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four languages in a generative manner, including the prediction of lexical fillers. We investigate how the proposed method can be improved by pretraining on widely available speech recognition data using several training objectives. Pretraining on 7000 hours of multilingual data allows us to outperform the state-of-the-art ultimately on two SLU datasets and partly on two more SLU datasets. Finally, we examine the cross-lingual capabilities of the proposed model and improve on the best known result on the PortMEDIA-Language dataset by almost half, achieving a Concept/Value Error Rate of 23.65%.

Submitted 9 October, 2023; originally announced October 2023.

Comments: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2023

arXiv:2309.10932 [pdf, other]

Open-Vocabulary Affordance Detection using Knowledge Distillation and Text-Point Correlation

Authors: Tuan Van Vo, Minh Nhat Vu, Baoru Huang, Toan Nguyen, Ngan Le, Thieu Vo, Anh Nguyen

Abstract: Affordance detection presents intricate challenges and has a wide range of robotic applications. Previous works have faced limitations such as the complexities of 3D object shapes, the wide range of potential affordances on real-world objects, and the lack of open-vocabulary support for affordance understanding. In this paper, we introduce a new open-vocabulary affordance detection method in 3D point clouds, leveraging knowledge distillation and text-point correlation. Our approach employs pre-trained 3D models through knowledge distillation to enhance feature extraction and semantic understanding in 3D point clouds. We further introduce a new text-point correlation method to learn the semantic links between point cloud features and open-vocabulary labels. The intensive experiments show that our approach outperforms previous works and adapts to new affordance labels and unseen objects. Notably, our method achieves an improvement of 7.96% in mIOU score compared to the baselines. Furthermore, it offers real-time inference, which is well suited for robotic manipulation applications.

Submitted 19 September, 2023; originally announced September 2023.

Comments: 8 pages

arXiv:2309.10911 [pdf, other]

Language-Conditioned Affordance-Pose Detection in 3D Point Clouds

Authors: Toan Nguyen, Minh Nhat Vu, Baoru Huang, Tuan Van Vo, Vy Truong, Ngan Le, Thieu Vo, Bac Le, Anh Nguyen

Abstract: Affordance detection and pose estimation are of great importance in many robotic applications. Their combination helps the robot gain an enhanced manipulation capability, in which the generated pose can facilitate the corresponding affordance task. Previous methods for affordance-pose joint learning are limited to a predefined set of affordances, thus limiting the adaptability of robots in real-world environments. In this paper, we propose a new method for language-conditioned affordance-pose joint learning in 3D point clouds. Given a 3D point cloud object, our method detects the affordance region and generates appropriate 6-DoF poses for any unconstrained affordance label. Our method consists of an open-vocabulary affordance detection branch and a language-guided diffusion model that generates 6-DoF poses based on the affordance text. We also introduce a new high-quality dataset for the task of language-driven affordance-pose joint learning. Intensive experimental results demonstrate that our proposed method works effectively on a wide range of open-vocabulary affordances and outperforms other baselines by a large margin. In addition, we illustrate the usefulness of our method in real-world robotic applications. Our code and dataset are publicly available at https://3DAPNet.github.io

Submitted 19 September, 2023; originally announced September 2023.

Comments: Project page: https://3DAPNet.github.io

arXiv:2309.09818 [pdf, other]

Grasp-Anything: Large-scale Grasp Dataset from Foundation Models

Authors: An Dinh Vuong, Minh Nhat Vu, Hieu Le, Baoru Huang, Binh Huynh, Thieu Vo, Andreas Kugi, Anh Nguyen

Abstract: Foundation models such as ChatGPT have made significant strides in robotic tasks due to their universal representation of real-world domains. In this paper, we leverage foundation models to tackle grasp detection, a persistent challenge in robotics with broad industrial applications. Despite numerous grasp datasets, their object diversity remains limited compared to real-world figures. Fortunately, foundation models possess an extensive repository of real-world knowledge, including objects we encounter in our daily lives. As a consequence, a promising solution to the limited representation in previous grasp datasets is to harness the universal knowledge embedded in these foundation models. We present Grasp-Anything, a new large-scale grasp dataset synthesized from foundation models to implement this solution. Grasp-Anything excels in diversity and magnitude, boasting 1M samples with text descriptions and more than 3M objects, surpassing prior datasets. Empirically, we show that Grasp-Anything successfully facilitates zero-shot grasp detection on vision-based tasks and real-world robotic experiments. Our dataset and code are available at https://grasp-anything-2023.github.io.

Submitted 18 September, 2023; originally announced September 2023.

Comments: Project page: https://grasp-anything-2023.github.io

Showing 1–50 of 187 results for author: Vũ, N