Search | arXiv e-print repository

Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

Authors: Lam Pham, Phat Lam, Truong Nguyen, Huyen Nguyen, Alexander Schindler

Abstract: In this paper, we propose a deep learning based system for the task of deepfake audio detection. In particular, the raw input audio is first transformed into various spectrograms using three transformation methods, Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), and Wavelet Transform (WT), combined with different auditory-based filters: Mel, Gammatone, linear filters (LF), and discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on three deep learning approaches. The first approach is to train directly on the spectrograms using our proposed baseline models: a CNN-based model (CNN-baseline), an RNN-based model (RNN-baseline), and a C-RNN model (C-RNN baseline). The second approach is transfer learning from computer vision models such as ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, ShuffleNet-V2, Swin-T, ConvNeXt-Tiny, GoogLeNet, MNASNet, and RegNet. In the third approach, we leverage the state-of-the-art audio pre-trained models Whisper, Seamless, SpeechBrain, and Pyannote to extract audio embeddings from the input spectrograms. The audio embeddings are then passed to a multilayer perceptron (MLP) model to detect fake or real audio samples. Finally, high-performance deep learning models from these approaches are fused to achieve the best performance. We evaluated our proposed models on the ASVspoof 2019 benchmark dataset. Our best ensemble model achieved an Equal Error Rate (EER) of 0.03, which is highly competitive with top-performing systems in the ASVspoof 2019 challenge. Experimental results also highlight the potential of selective spectrograms and deep learning approaches to enhance the task of audio deepfake detection.
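
For readers unfamiliar with the Equal Error Rate reported above, the following is a minimal sketch of how an EER can be approximated from detector scores. It is not the authors' evaluation code or the official ASVspoof scoring tool; the function name `equal_error_rate`, the score convention (higher means more likely bona fide), and the toy data are assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumed convention (hypothetical, not from the paper):
      scores: higher means "more likely bona fide"
      labels: 1 for bona fide, 0 for spoof
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):               # unique() returns sorted values
        accept = scores >= t
        far = np.mean(accept[labels == 0])    # spoof samples wrongly accepted
        frr = np.mean(~accept[labels == 1])   # bona fide samples wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:                    # keep the threshold where FAR and FRR are closest
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores: one spoof (0.6) outranks one bona fide sample (0.4).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

The EER is the operating point at which the false acceptance rate equals the false rejection rate, so a lower value indicates a better detector.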

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01110 [pdf]

SecGenAI: Enhancing Security of Cloud-based Generative AI Applications within Australian Critical Technologies of National Interest

Authors: Christoforus Yoga Haryanto, Minh Hieu Vu, Trung Duc Nguyen, Emily Lomempow, Yulia Nurliana, Sona Taheri

Abstract: The rapid advancement of Generative AI (GenAI) technologies offers transformative opportunities within Australia's critical technologies of national interest while introducing unique security challenges. This paper presents SecGenAI, a comprehensive security framework for cloud-based GenAI applications, with a focus on Retrieval-Augmented Generation (RAG) systems. SecGenAI addresses functional, infrastructure, and governance requirements, integrating end-to-end security analysis to generate specifications emphasizing data privacy, secure deployment, and shared responsibility models. Aligned with Australian Privacy Principles, AI Ethics Principles, and guidelines from the Australian Cyber Security Centre and Digital Transformation Agency, SecGenAI mitigates threats such as data leakage, adversarial attacks, and model inversion. The framework's novel approach combines advanced machine learning techniques with robust security measures, ensuring compliance with Australian regulations while enhancing the reliability and trustworthiness of GenAI systems. This research contributes to the field of intelligent systems by providing actionable strategies for secure GenAI implementation in industry, fostering innovation in AI applications, and safeguarding national interests.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 10 pages, 4 figures, 9 tables, submitted to the 2024 11th International Conference on Soft Computing & Machine Intelligence (ISCMI 2024)

arXiv:2407.00895 [pdf, other]

Large-Amplitude, Easy-Plane Spin-Orbit Torque Oscillators Driven by Out-of-Plane Spin Current: A Micromagnetic Study

Authors: Daniel Kubler, David A. Smith, Tommy Nguyen, Fernando Ramos-Diaz, Satoru Emori, Vivek P. Amin

Abstract: Spin torque oscillators are spintronic devices that generate a periodic output signal from a non-periodic input, making them promising candidates for applications like microwave communications and neuromorphic computing. However, traditional spin torque oscillators suffer from a limited precessional cone angle and thermal stability, as well as a need for an applied bias magnetic field. We use micromagnetic simulations to demonstrate a novel spin torque oscillator that relies on spin-orbit effects in ferromagnets to overcome these limitations. The key mechanism behind this oscillator is the generation of an out-of-plane spin current, in which both the spin flow and the spin orientation are out-of-plane. The torque from this spin current enables easy-plane coherent magnetic precession with a large cone angle and high thermal stability over a micron-scale lateral area. Moreover, the precession occurs about an internal field in the free layer, thereby eliminating the need for an external bias field. We demonstrate the feasibility of an easy-plane spin-orbit torque oscillator at room temperature over a wide parameter space, including the ratio of the out-of-plane spin current to the conventional spin-Hall spin current, presenting exciting possibilities for this novel spintronic device.

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00710 [pdf, other]

Weighted Missing Linear Discriminant Analysis: An Explainable Approach for Classification with Missing Data

Authors: Tuan L. Vo, Uyen Dang, Thu Nguyen

Abstract: As Artificial Intelligence (AI) models are gradually being adopted in real-life applications, the explainability of the model used is critical, especially in high-stakes areas such as medicine and finance. Among the commonly used models, Linear Discriminant Analysis (LDA) is a widely used classification tool that is also explainable thanks to its ability to model class distributions and maximize class separation through linear feature combinations. Nevertheless, real-world data is frequently incomplete, presenting significant challenges for classification tasks and model explanations. In this paper, we propose a novel approach to LDA under missing data, termed Weighted missing Linear Discriminant Analysis (WLDA), which directly classifies observations in data containing missing values, without imputation, by estimating the parameters directly on the missing data and using a weight matrix for missing values to penalize missing entries during classification. Furthermore, we analyze the theoretical properties and examine the explainability of the proposed technique in a comprehensive manner. Experimental results demonstrate that WLDA outperforms conventional methods by a significant margin, particularly in scenarios where missing values are present in both training and test sets.

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00609 [pdf, other]

ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Authors: Quang P. M. Pham, Khoi T. N. Nguyen, Lan C. Ngo, Truong Do, Truong Son Hy

Abstract: Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision.

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00535 [pdf, other]

AI-powered multimodal modeling of personalized hemodynamics in aortic stenosis

Authors: Caglar Ozturk, Daniel H. Pak, Luca Rosalia, Debkalpa Goswami, Mary E. Robakowski, Raymond McKay, Christopher T. Nguyen, James S. Duncan, Ellen T. Roche

Abstract: Aortic stenosis (AS) is the most common valvular heart disease in developed countries. High-fidelity preclinical models can improve AS management by enabling therapeutic innovation, early diagnosis, and tailored treatment planning. However, their use is currently limited by complex workflows necessitating lengthy expert-driven manual operations. Here, we propose an AI-powered computational framework for accelerated and democratized patient-specific modeling of AS hemodynamics from computed tomography. First, we demonstrate that our automated meshing algorithms can generate task-ready geometries for both computational and benchtop simulations with higher accuracy and 100 times faster than existing approaches. Then, we show that our approach can be integrated with fluid-structure interaction and soft robotics models to accurately recapitulate a broad spectrum of clinical hemodynamic measurements of diverse AS patients. The efficiency and reliability of these algorithms make them an ideal complementary tool for personalized high-fidelity modeling of AS biomechanics, hemodynamics, and treatment planning.

Submitted 29 June, 2024; originally announced July 2024.

Comments: CO and DHP contributed equally to this work. JSD and ETR are corresponding authors

arXiv:2407.00411 [pdf, other]

Explainability of Machine Learning Models under Missing Data

Authors: Tuan L. Vo, Thu Nguyen, Hugo L. Hammer, Michael A. Riegler, Pal Halvorsen

Abstract: Missing data is a prevalent issue that can significantly impair model performance and interpretability. This paper briefly summarizes the development of the field of missing data with respect to Explainable Artificial Intelligence and experimentally investigates the effects of various imputation methods on the calculation of Shapley values, a popular technique for interpreting complex machine learning models. We compare different imputation strategies and assess their impact on feature importance and interaction as determined by Shapley values. We also theoretically analyze the effects of missing values on Shapley values. Importantly, our findings reveal that the choice of imputation method can introduce biases that could lead to changes in the Shapley values, thereby affecting the interpretability of the model. Moreover, a lower test prediction mean squared error (MSE) may not imply a lower MSE in Shapley values, and vice versa. Also, while XGBoost can handle missing data directly, using XGBoost directly on missing data can seriously affect interpretability compared to imputing the data before training XGBoost. This study provides a comprehensive evaluation of imputation methods in the context of model interpretation, offering practical guidance for selecting appropriate techniques based on dataset characteristics and analysis objectives. The results underscore the importance of considering imputation effects to ensure robust and reliable insights from machine learning models.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.20077 [pdf, other]

HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model

Authors: Hieu T. Nguyen, Yiwen Chen, Vikram Voleti, Varun Jampani, Huaizu Jiang

Abstract: We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as conditions for the diffusion model to produce images at nearby locations. The global floorplan and attention design in the diffusion model ensure the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights. Project page: https://neu-vi.github.io/houseCrafter/

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.19753 [pdf, other]

Backdoor Attack in Prompt-Based Continual Learning

Authors: Trang Nguyen, Anh Tran, Nhat Ho

Abstract: Prompt-based approaches offer a cutting-edge solution to data privacy issues in continual learning, particularly in scenarios involving multiple data suppliers where long-term storage of private user data is prohibited. Despite delivering state-of-the-art performance, their impressive remembering capability can become a double-edged sword, raising security concerns as it might inadvertently retain poisoned knowledge injected during learning from private user data. Following this insight, in this paper, we expose continual learning to a potential threat: backdoor attack, which drives the model to follow a desired adversarial target whenever a specific trigger is present while still performing normally on clean samples. We highlight three critical challenges in executing backdoor attacks on incremental learners and propose corresponding solutions: (1) Transferability: We employ a surrogate dataset and manipulate prompt selection to transfer backdoor knowledge to data from other suppliers; (2) Resiliency: We simulate static and dynamic states of the victim to ensure the backdoor trigger remains robust during intense incremental learning processes; and (3) Authenticity: We apply binary cross-entropy loss as an anti-cheating factor to prevent the backdoor trigger from devolving into adversarial noise. Extensive experiments across various benchmark datasets and continual learners validate our continual backdoor framework, achieving up to 100% attack success rate, with further ablation studies confirming our contributions' effectiveness.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.19445 [pdf, other]

X-Ray Constraints on Dark Photon Tridents

Authors: Tim Linden, Thong T. Q. Nguyen, Tim M. P. Tait

Abstract: Dark photons that are sufficiently light and/or weakly-interacting represent a compelling vision of dark matter. Dark photon decay into three photons, which we call the dark photon trident, can be the dominant channel when the dark photon mass falls below the electron pair threshold and can produce a significant flux of x-rays. We use 16 years of data from INTEGRAL/SPI to constrain sub-MeV dark photon decay, producing new world's-best constraints on the kinetic mixing parameter for dark photon masses between 61 keV and 1022 keV, and comment on the potential for future x-ray observatories to discover the trident decay process.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 4+3 pages, 4 figures. Comments are welcome!

arXiv:2406.18851 [pdf, other]

LICO: Large Language Models for In-Context Molecular Optimization

Authors: Tung Nguyen, Aditya Grover

Abstract: Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark comprising over 20 objective functions.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.17381 [pdf, other]

Forget but Recall: Incremental Latent Rectification in Continual Learning

Authors: Nghia D. Nguyen, Hieu Trung Nguyen, Ang Li, Hoang Pham, Viet Anh Nguyen, Khoa D. Doan

Abstract: Intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which hinders remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches either retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored CL direction for incremental learning called Incremental Latent Rectification or ILR. In a nutshell, ILR learns to propagate with correction (or rectify) the representation from the current trained DNN backward to the representation space of the old task, where performing predictive decisions is easier. This rectification process only employs a chain of small representation mapping networks, called rectifier units. Empirical experiments on several continual learning benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods.

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17376 [pdf, other]

Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

Abstract: Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.

Submitted 25 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.16777 [pdf, other]

Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024

Authors: Sai Koneru, Thai-Binh Nguyen, Ngoc-Quan Pham, Danni Liu, Zhaolin Li, Alexander Waibel, Jan Niehues

Abstract: Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we integrate Mistral-7B (mistralai/Mistral-7B-Instruct-v0.1) into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of 0.3% in Word Error Rate and 0.65% in COMET on the tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating the LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16685 [pdf, other]

A locking-free isogeometric thin shell formulation based on higher order accurate local strain projection via approximate dual splines

Authors: Thi-Hoa Nguyen, René R. Hiemstra, Dominik Schillinger

Abstract: We present a novel isogeometric discretization approach for the Kirchhoff-Love shell formulation based on the Hellinger-Reissner variational principle. For mitigating membrane locking, we discretize the independent strains with spline basis functions that are one degree lower than those used for the displacements. To enable computationally efficient condensation of the independent strains, we first discretize the variations of the independent strains with approximate dual splines to obtain a projection matrix that is close to a diagonal matrix. We then diagonalize this strain projection matrix via row-sum lumping. The combination of approximate dual test functions with row-sum lumping enables the direct condensation of the independent strain fields at the quadrature point level, while maintaining higher-order accuracy at optimal rates of convergence. We illustrate the numerical properties and the performance of our approach through numerical benchmarks, including a curved Euler-Bernoulli beam and the examples of the shell obstacle course.
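
The row-sum lumping step mentioned in the abstract above can be illustrated generically. The sketch below is not taken from the paper; the helper name `row_sum_lump` and the small example matrix are assumptions, and the spline-specific construction of the projection matrix is omitted entirely.

```python
import numpy as np

def row_sum_lump(matrix):
    """Replace a square matrix by the diagonal matrix of its row sums."""
    m = np.asarray(matrix, dtype=float)
    return np.diag(m.sum(axis=1))

# Hypothetical near-diagonal projection matrix (illustrative values only).
projection = np.array([
    [0.96, 0.03, 0.01],
    [0.02, 0.95, 0.03],
    [0.01, 0.04, 0.95],
])
lumped = row_sum_lump(projection)

# A diagonal system is solved by elementwise division, with no factorization.
rhs = np.array([1.0, 2.0, 3.0])
solution = rhs / np.diag(lumped)
```

Because the lumped matrix is diagonal, applying its inverse costs one division per entry, which is what makes condensation at the quadrature-point level inexpensive.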

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16656 [pdf, ps, other]

Permutation Codes Correcting Multiple Deletions

Authors: Shuche Wang, The Nguyen, Yeow Meng Chee, Van Khu Vu

Abstract: Permutation codes in the Ulam metric, which can correct multiple deletions, have been investigated extensively recently owing to their applications. In this work, we are interested in the maximum size of the permutation codes in the Ulam metric and aim to design permutation codes that can correct multiple deletions with efficient decoding algorithms. We first present an improvement on the Gilbert-Varshamov bound of the maximum size of these permutation codes which is the best-known lower bound. Next, we focus on designing permutation codes in the Ulam metric with a decoding algorithm. These constructed codes are the best-known permutation codes that can correct multiple deletions. In particular, the constructed permutation codes can correct $t$ deletions with at most $(3t-1) \log n+o(\log n)$ bits of redundancy where $n$ is the length of the code. Finally, we provide an efficient decoding algorithm for our constructed permutation codes.
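
As background for the Ulam metric named in the abstract above: the Ulam distance between two permutations of the same $n$ symbols equals $n$ minus the length of their longest common subsequence. The sketch below computes it through a longest-increasing-subsequence reduction; the function name `ulam_distance` is an assumption, and this is not the paper's code construction or decoder.

```python
from bisect import bisect_left

def ulam_distance(p, q):
    """Ulam distance between two permutations of the same symbols.

    Computed as n minus the longest common subsequence length, where the
    LCS of two permutations reduces to a longest increasing subsequence.
    """
    position_in_p = {symbol: index for index, symbol in enumerate(p)}
    relabeled = [position_in_p[symbol] for symbol in q]
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for value in relabeled:
        slot = bisect_left(tails, value)
        if slot == len(tails):
            tails.append(value)
        else:
            tails[slot] = value
    return len(p) - len(tails)

# Moving the symbol 1 from the front to the back is a single translocation.
example_distance = ulam_distance([1, 2, 3, 4], [2, 3, 4, 1])
```

The reduction runs in $O(n \log n)$ time, compared with $O(n^2)$ for a general longest-common-subsequence table.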

Submitted 24 June, 2024; originally announced June 2024.

Comments: 9 pages

arXiv:2406.15749 [pdf, ps, other]

Decay of CP-even Higgs $H\rightarrow h γγ$ in Two Higgs Doublet Model: (I) one-loop analytic results, ward identity checks

Authors: Khiem Hong Phan, Dzung Tri Tran, Thanh Huy Nguyen

Abstract: We present the first analytical expressions for one-loop induced contributions to the decay channels of the CP-even Higgs $H\rightarrow h γγ$, with $h$ being the standard model-like Higgs boson, within the framework of the Two Higgs Doublet Model. One-loop form factors for the decay processes are written in terms of the scalar Passarino-Veltman functions following the general notations of the package LoopTools as well as the library Collier. Subsequently, physical results for the decay processes can be generated numerically by using one of the above-mentioned packages. The analytical expressions shown in this paper are verified by several numerical checks, for example, the ultraviolet (UV) and the infrared (IR) finiteness of the one-loop amplitude. Furthermore, the amplitude must obey the so-called Ward identity due to on-shell photons in the final states. The identity is also tested numerically in this work. We find that the numerical results for the checks show good stability. In phenomenological studies, the differential decay rates as functions of the invariant mass of the two photons in the final state of $H\rightarrow h γγ$ are first studied in the parameter space for all types of Two Higgs Doublet Models.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 39 pages, 8 Figures, 9 Tables

Report number: DTU_2024-03

arXiv:2406.15633 [pdf, other]

Good things come in three: Generating SO Post Titles with Pre-Trained Models, Self Improvement and Post Ranking

Authors: Duc Anh Le, Anh M. T. Bui, Phuong T. Nguyen, Davide Di Ruscio

Abstract: Stack Overflow is a prominent Q&A forum, supporting developers in seeking suitable resources on programming-related matters. Having high-quality question titles is an effective means to attract developers' attention. Unfortunately, this is often underestimated, leaving room for improvement. Research has been conducted, predominantly leveraging pre-trained models to generate titles from code snippets and problem descriptions. Yet, getting high-quality titles is still a challenging task, attributed to both the quality of the input data (e.g., containing noise and ambiguity) and inherent constraints in sequence generation models. In this paper, we present FILLER as a solution to generating Stack Overflow post titles using a fine-tuned language model with self-improvement and post ranking. Our study focuses on enhancing pre-trained language models for generating titles for Stack Overflow posts, employing a training and subsequent fine-tuning paradigm for these models. To this end, we integrate the model's predictions into the training process, enabling it to learn from its errors, thereby lessening the effects of exposure bias. Moreover, we apply a post-ranking method to produce a variety of sample candidates, subsequently selecting the most suitable one. To evaluate FILLER, we perform experiments using benchmark datasets, and the empirical findings indicate that our model provides high-quality recommendations. Moreover, it significantly outperforms all the baselines, including Code2Que, SOTitle, CCBERT, M3NSCT5, and GPT3.5-turbo. A user study also shows that FILLER provides more relevant titles, with respect to SOTitle and GPT3.5-turbo.

Submitted 21 June, 2024; originally announced June 2024.

Comments: The paper has been peer-reviewed and accepted for publication to the International Symposium on Empirical Software Engineering and Measurement (ESEM 2024)

arXiv:2406.15119 [pdf, other]

Speech Emotion Recognition under Resource Constraints with Data Distillation

Authors: Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller

Abstract: Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performances comparable to those developed using the original full emotional speech dataset.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14835 [pdf, other]

ToVo: Toxicity Taxonomy via Voting

Authors: Tinh Son Luong, Thanh-Thien Le, Thang Viet Doan, Linh Ngo Van, Thien Huu Nguyen, Diep Thi-Ngoc Nguyen

Abstract: Existing toxic detection models face significant limitations, such as lack of transparency, customization, and reproducibility. These challenges stem from the closed-source nature of their training data and the paucity of explanations for their evaluation mechanism. To address these issues, we propose a dataset creation mechanism that integrates voting and chain-of-thought processes, producing a high-quality open-source dataset for toxic content detection. Our methodology ensures diverse classification metrics for each sample and includes both classification scores and explanatory reasoning for the classifications. We utilize the dataset created through our proposed mechanism to train our model, which is then compared against existing widely-used detectors. Our approach not only enhances transparency and customizability but also facilitates better fine-tuning for specific use cases. This work contributes a robust framework for developing toxic content detection models, emphasizing openness and adaptability, thus paving the way for more effective and user-specific content moderation solutions.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14784 [pdf, other]

Active Learning for Fair and Stable Online Allocations

Authors: Riddhiman Bhattacharya, Thanh Nguyen, Will Wei Sun, Mohit Tawarmalani

Abstract: We explore an active learning approach for dynamic fair resource allocation problems. Unlike previous work that assumes full feedback from all agents on their allocations, we consider feedback from a select subset of agents at each epoch of the online resource allocation process. Despite this restriction, our proposed algorithms provide regret bounds that are sub-linear in number of time-periods for various measures that include fairness metrics commonly used in resource allocation problems and stability considerations in matching mechanisms. The key insight of our algorithms lies in adaptively identifying the most informative feedback using dueling upper and lower confidence bounds. With this strategy, we show that efficient decision-making does not require extensive feedback and produces efficient outcomes for a variety of problem classes.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14572 [pdf, other]

Bioptic -- A Target-Agnostic Potency-Based Small Molecules Search Engine

Authors: Vlad Vinogradov, Ivan Izmailov, Simon Steshin, Kong T. Nguyen

Abstract: Recent successes in virtual screening have been made possible by large models and extensive chemical libraries. However, combining these elements is challenging: the larger the model, the more expensive it is to run, making ultra-large libraries unfeasible. To address this, we developed a target-agnostic, efficacy-based molecule search model, which allows us to find structurally dissimilar molecules with similar biological activities. We used best practices to design a fast retrieval system, based on processor-optimized SIMD instructions, enabling us to screen the ultra-large 40B Enamine REAL library with a 100\% recall rate. We extensively benchmarked our model and several state-of-the-art models for both speed performance and retrieval quality of novel molecules.

Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.13781 [pdf, other]

A Primal-Dual Framework for Transformers and Neural Networks

Authors: Tan M. Nguyen, Tam Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher

Abstract: Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.

Submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted to ICLR 2023, 26 pages, 4 figures, 14 tables

arXiv:2406.13770 [pdf, other]

Elliptical Attention

Authors: Stefan K. Nielsen, Laziz U. Abdullaev, Rachel Teo, Tan M. Nguyen

Abstract: Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision. This dot-product self-attention computes attention weights among the input tokens using Euclidean distance, which makes the model prone to representation collapse and vulnerable to contaminated samples. In this paper, we propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance. In particular, we define a hyper-ellipsoidal neighborhood around each query to increase the attention weights of the tokens lying in the contextually important directions. We term this novel class of attention Elliptical Attention. Our Elliptical Attention provides two benefits: 1) reducing representation collapse and 2) enhancing the model's robustness as the Elliptical Attention pays more attention to contextually relevant information rather than focusing on some small subset of informative features. We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 38 pages, 7 figures, 12 tables
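
The contrast drawn in the Elliptical Attention abstract above (Euclidean versus Mahalanobis distance for attention weights) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the unscaled softmax, and the restriction to a diagonal metric given by the vector `m` are all assumptions made for illustration. With `m` set to all ones and unit-norm keys, the distance-based weights coincide with plain dot-product softmax weights, which is the sense in which dot-product attention is Euclidean.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def dot_product_attention_weights(Q, K):
    """Unscaled dot-product attention weights, shape (n_queries, n_keys)."""
    return softmax(Q @ K.T)


def mahalanobis_attention_weights(Q, K, m):
    """Attention weights from a diagonal Mahalanobis distance.

    m is a hypothetical vector of positive per-coordinate scales, i.e. the
    diagonal of the metric matrix. A larger m[d] penalises query-key
    differences along coordinate d more strongly.
    """
    diff = Q[:, None, :] - K[None, :, :]              # (n_q, n_k, d)
    sq_dist = np.einsum("qkd,d->qk", diff ** 2, m)   # weighted squared distance
    return softmax(-0.5 * sq_dist)
```

For example, with a query at the origin and keys `[1, 0]` and `[0, 1]`, choosing `m = [4, 1]` shifts weight toward the second key, because the first key differs from the query along the more heavily weighted coordinate.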

arXiv:2406.13762 [pdf, other]

Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis

Authors: Rachel S. Y. Teo, Tan M. Nguyen

Abstract: The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. Similar to the development of most deep learning models, the construction of these attention mechanisms relies on heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then formulate the exact formula for the value matrix in self-attention, theoretically and empirically demonstrating that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation tasks.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 33 pages, 5 figures, 12 tables

arXiv:2406.13725 [pdf, other]

Tree-Sliced Wasserstein Distance on a System of Lines

Authors: Viet-Hoang Tran, Trang Pham, Tho Tran, Tam Le, Tan M. Nguyen

Abstract: Sliced Wasserstein (SW) distance in Optimal Transport (OT) is widely used in various applications thanks to its statistical effectiveness and computational efficiency. On the other hand, Tree Wasserstein (TW) and Tree-sliced Wasserstein (TSW) are instances of OT for probability measures where the ground cost is a tree metric. TSW also has a low computational complexity, i.e. linear in the number of edges in the tree. In particular, TSW is identical to SW when the tree is a chain. While SW is prone to loss of topological information of input measures due to relying on one-dimensional projection, TSW is more flexible and has a higher degree of freedom by choosing a tree rather than a line to alleviate the curse of dimensionality in SW. However, for practical applications, popular tree metric sampling methods are heavily built upon given supports, which limits their capacity to adapt to new supports. In this paper, we propose the Tree-Sliced Wasserstein distance on a System of Lines (TSW-SL), which brings a connection between SW and TSW. Compared to SW and TSW, our TSW-SL benefits from the higher degree of freedom of TSW while being suitable to dynamic settings as SW. In TSW-SL, we use a variant of the Radon Transform to project measures onto a system of lines, resulting in measures on a space with a tree metric, then leverage TW to efficiently compute distances between them. We empirically verify the advantages of TSW-SL over the traditional SW by conducting a variety of experiments on gradient flows, image style transfer, and generative models.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 33 pages, 6 figures, 2 tables, 4 algorithms

arXiv:2406.13587 [pdf, other]

The Precursor Small Aperture Telescope (PreSAT) CMB polarimeter

Authors: Matthew A. Petroff, Zeeshan Ahmed, James J. Bock, Marion Dierickx, Sofia Fatigoni, David C. Goldfinger, Paul K. Grimes, Shawn W. Henderson, Kirit S. Karkare, John M. Kovac, Hien T. Nguyen, Scott N. Paine, Anna R. Polish, Clement Pryke, Thibault Romand, Benjamin L. Schmitt, Abigail G. Vieregg

Abstract: The search for the polarized imprint of primordial gravitational waves in the cosmic microwave background (CMB) as direct evidence of cosmic inflation requires exquisite sensitivity and control over systematics. The next-generation CMB-S4 project intends to improve upon current-generation experiments by deploying a significantly greater number of highly-sensitive detectors, combined with refined instrument components based on designs from field-proven instruments. The Precursor Small Aperture Telescope (PreSAT) is envisioned as an early step to this next generation, which will test prototype CMB-S4 components and technologies within an existing BICEP Array receiver, with the aim of enabling full-stack laboratory testing and early risk retirement, along with direct correlation of laboratory component-level performance measurements with deployed system performance. The instrument will utilize new 95/155GHz dichroic dual-linear-polarization prototype detectors developed for CMB-S4, cooled to 100mK via the installation of an adiabatic demagnetization refrigerator, along with a prototype readout chain and prototype optics manufactured with wide-band anti-reflection coatings. The experience gained by integrating, deploying, and calibrating PreSAT will also help inform planning for CMB-S4 small aperture telescope commissioning, calibration, and operations well in advance of the fabrication of CMB-S4 production hardware.

Submitted 10 July, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: 12 pages, 4 figures, submitted to Proc. SPIE

arXiv:2406.13262 [pdf]

doi 10.34190/eccws.23.1.2258

Machine Learning Applications of Quantum Computing: A Review

Authors: Thien Nguyen, Tuomo Sipola, Jari Hautamäki

Abstract: At the intersection of quantum computing and machine learning, this review paper explores the transformative impact these technologies are having on the capabilities of data processing and analysis, far surpassing the bounds of traditional computational methods. Drawing upon an in-depth analysis of 32 seminal papers, this review delves into the interplay between quantum computing and machine learning, focusing on transcending the limitations of classical computing in advanced data processing and applications. This review emphasizes the potential of quantum-enhanced methods in enhancing cybersecurity, a critical sector that stands to benefit significantly from these advancements. The literature review, primarily leveraging Science Direct as an academic database, delves into the transformative effects of quantum technologies on machine learning, drawing insights from a diverse collection of studies and scholarly articles. While the focus is primarily on the growing significance of quantum computing in cybersecurity, the review also acknowledges the promising implications for other sectors as the field matures. Our systematic approach categorizes sources based on quantum machine learning algorithms, applications, challenges, and potential future developments, uncovering that quantum computing is increasingly being implemented in practical machine learning scenarios. The review highlights advancements in quantum-enhanced machine learning algorithms and their potential applications in sectors such as cybersecurity, emphasizing the need for industry-specific solutions while considering ethical and security concerns. By presenting an overview of the current state and projecting future directions, the paper sets a foundation for ongoing research and strategic advancement in quantum machine learning.

Submitted 24 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: Proceedings of the 23rd European Conference on Cyber Warfare and Security (ECCWS 2024)

arXiv:2406.13096 [pdf]

Electric field enhances the electronic and diffusion properties of penta-graphene nanoribbons for application in lithium-ion batteries: a first-principles study

Authors: Thi Nhan Tran, Nguyen Vo Anh Duy, Nguyen Hoang Hieu, Truc Anh Nguyen, Nguyen To Van, Viet Bac Thi Phung, Peter Schall, Minh Triet Dang

Abstract: Enhancing the electronic and diffusion properties of lithium-ion batteries is crucial for improving the performance of the fast-growing energy storage devices. Recently, the fast-charging capability of commercial-like lithium-ion anodes with the least modification of the current manufacturing technology has been of great interest. Here we use first-principles methods with density functional theory and the climbing image-nudged elastic band method to evaluate the impact of an external electric field on the stability, electronic and diffusion properties of penta-graphene nanoribbons upon lithium adsorption. We show that by adsorbing a lithium atom, these semiconductor nanoribbons become metallic with a formation energy of -0.22 eV. The lithium-ion mobility of this material is comparable to that of a common carbon graphite layer. Under a relatively small vertical electric field, these lithium-ion systems become even more stable, and their diffusion coefficient is enhanced significantly, becoming ~719 times higher than that of the material in the absence of an applied electric field and ~521 times higher than in the case of commercial graphitic carbon layers. Our results highlight the role of an external electric field as a novel switch to improve the efficiency of lithium-ion batteries with penta-graphene nanoribbon electrodes and open a new horizon for the use of more environmentally friendly pentagonal materials as anode materials in the lithium-ion battery industry.

Submitted 25 July, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

Comments: 21 pages, 5 figures

arXiv:2406.12507 [pdf, other]

Improving the Evaluation and Actionability of Explanation Methods for Multivariate Time Series Classification

Authors: Davide Italo Serramazza, Thach Le Nguyen, Georgiana Ifrim

Abstract: Explanation for Multivariate Time Series Classification (MTSC) is an important topic that is underexplored. There are very few quantitative evaluation methodologies and even fewer examples of actionable explanation, where the explanation methods are shown to objectively improve specific computational tasks on time series data. In this paper we focus on analyzing InterpretTime, a recent evaluation methodology for attribution methods applied to MTSC. We showcase some significant weaknesses of the original methodology and propose ideas to improve both its accuracy and efficiency. Unlike related work, we go beyond evaluation and also showcase the actionability of the produced explainer ranking, by using the best attribution methods for the task of channel selection in MTSC. We find that perturbation-based methods such as SHAP and Feature Ablation work well across a set of datasets, classifiers and tasks and outperform gradient-based methods. We apply the best ranked explainers to channel selection for MTSC and show significant data size reduction and improved classifier accuracy.

Submitted 12 August, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.11982 [pdf, other]

Nonlinear photocurrent in quantum materials for broadband photodetection

Authors: Yulin Shen, Louis Primeau, Jiangxu Li, Tuan-Dung Nguyen, David Mandrus, Yuxuan Cosmi Lin, Yang Zhang

Abstract: Unlocking the vast potential of optical sensing technology has long been hindered by the challenges of achieving fast, sensitive, and broadband photodetection at ambient temperatures. In this review, we summarize recent progress in the study of nonlinear photocurrent in topological quantum materials, and its application in broadband photodetection without the use of p-n junction based semiconductor diodes. The intrinsic quadratic transverse current-input voltage relation is used to rectify the alternating electric field from incident radio, terahertz or infrared waves into a direct current, without a bias voltage and at zero magnetic field. We review novel photocurrents in several material systems, including topological Weyl semimetals, chiral crystals, ferroelectric materials, and low dimensional topological insulators. These quantum materials hold tremendous promise for broadband high-frequency rectification and photodetection, featuring substantial responsivity and detectivity.

Submitted 17 June, 2024; originally announced June 2024.

Comments: Review article, 24 pages + 19 figures

arXiv:2406.11794 [pdf, other]

DataComp-LM: In search of the next generation of training sets for language models

Authors: Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, et al. (34 additional authors not shown)

Abstract: We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.

Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Project page: https://www.datacomp.ai/dclm/

arXiv:2406.11233 [pdf, other]

Probing the Decision Boundaries of In-context Learning in Large Language Models

Authors: Siyan Zhao, Tung Nguyen, Aditya Grover

Abstract: In-context learning is a key paradigm in large language models (LLMs) that enables them to generalize to new tasks and domains by simply prompting these models with a few exemplars without explicit parameter updates. Many attempts have been made to understand in-context learning in LLMs as a function of model scale, pretraining data, and other factors. In this work, we propose a new mechanism to probe and understand in-context learning from the lens of decision boundaries for in-context binary classification. Decision boundaries are straightforward to visualize and provide important information about the qualitative behavior of the inductive biases of standard classifiers. To our surprise, we find that the decision boundaries learned by current LLMs in simple binary classification tasks are often irregular and non-smooth, regardless of linear separability in the underlying task. This paper investigates the factors influencing these decision boundaries and explores methods to enhance their generalizability. We assess various approaches, including training-free and fine-tuning methods for LLMs, the impact of model architecture, and the effectiveness of active prompting techniques for smoothing decision boundaries in a data-efficient manner. Our findings provide a deeper understanding of in-context learning dynamics and offer practical improvements for enhancing robustness and generalizability of in-context learning.

Submitted 24 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 18 pages, code at https://github.com/siyan-zhao/ICL_decision_boundary

arXiv:2406.11040 [pdf, other]

Decays $Z\to e_ae_b$ in a 3-3-1 model with neutral leptons

Authors: T. T. Hong, L. T. Hue, L. T. T. Phuong, N. H. T. Nha, T. Phong Nguyen

Abstract: We investigate the 3-3-1 model with neutral leptons (called the 331$NL$ for short) and by that, we will point out that this model can simultaneously explain the lepton flavor violating (LFV) decays of the $Z$ boson $Z \to e_a e_b$, Standard model-like Higgs boson decay $h\to e_be_a$, and the charged leptons $e_b\to e_a \gamma$ consistent with the recent experimental data. In addition, the numerical results show strict relations among these decay rates of $Z$ and $h$ which are predicted by this model. As a result, the decay channels can be determined theoretically if one of them is detected by experiments.

Submitted 19 August, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

Comments: 24 pages, 5 figures

arXiv:2406.09732 [pdf, ps, other]

Finding pure Nash equilibria in large random games

Authors: Andrea Collevecchio, Tuan-Minh Nguyen, Ziwen Zhong

Abstract: Best Response Dynamics (BRD) is a class of strategy updating rules to find Pure Nash Equilibria (PNE) in a game. At each step, a player is randomly picked, and the player switches to a "best response" strategy based on the strategies chosen by others, so that the new strategy profile maximises their payoff. If no such strategy exists, a different player will be chosen randomly. When no player wants to change their strategy anymore, the process reaches a PNE and will not deviate from it. On the other hand, either PNE may not exist, or BRD could be "trapped" within a subgame that has no PNE. We consider a random game with $N$ players, each with two actions available, and i.i.d. payoffs, in which the payoff distribution may have an atom, i.e. ties are allowed. We study a class of random walks in a random medium on the $N$-dimensional hypercube induced by the random game. The medium contains two types of obstacles corresponding to PNE and traps. The class of processes we analyze includes BRD, simple random walks on the hypercube, and many other nearest neighbour processes. We prove that, with high probability, these processes reach a PNE before hitting any trap.
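
Editor's illustration (not from the paper): the best-response step described in this abstract can be sketched for a two-action game as below. This is a minimal sketch under stated assumptions: the helper names (`make_random_game`, `flip`, `is_pne`, `best_response_dynamics`) are hypothetical, payoffs are drawn lazily from a uniform distribution so ties almost never occur, and a step cap stands in for the "trap" case, which the sketch does not detect explicitly.

```python
import random


def make_random_game(seed=0):
    """Return payoff(player, profile) with i.i.d. uniform payoffs (assumption).

    Payoffs are drawn lazily and cached, so each (player, profile) pair
    always maps to the same value once it has been queried.
    """
    rng = random.Random(seed)
    cache = {}

    def payoff(player, profile):
        key = (player, profile)
        if key not in cache:
            cache[key] = rng.random()
        return cache[key]

    return payoff


def flip(profile, player):
    """Profile in which `player` switches to their other action (0 <-> 1)."""
    return profile[:player] + (1 - profile[player],) + profile[player + 1:]


def is_pne(payoff, profile):
    """True if no player strictly gains by switching unilaterally."""
    return all(
        payoff(i, profile) >= payoff(i, flip(profile, i))
        for i in range(len(profile))
    )


def best_response_dynamics(payoff, start, max_steps=10_000, seed=0):
    """Run BRD from `start`; return a PNE, or None if the step cap is hit."""
    rng = random.Random(seed)
    profile = tuple(start)
    for _ in range(max_steps):
        if is_pne(payoff, profile):
            return profile
        # Pick a player uniformly at random (the same player may be drawn
        # again on a later step; this is a simplification).
        player = rng.randrange(len(profile))
        alternative = flip(profile, player)
        # With two actions, the best response is the other action exactly
        # when it gives a strictly higher payoff.
        if payoff(player, alternative) > payoff(player, profile):
            profile = alternative
    return None  # no PNE reached within the cap (possibly trapped)
```

For example, `best_response_dynamics(make_random_game(seed=1), (0, 0, 0, 0))` returns either a profile that satisfies `is_pne` or `None` when the cap is reached.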

Submitted 16 August, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: 20 pages, 4 figures, 1 table

MSC Class: 91A10; 91A06; 60K35; 60K37

arXiv:2406.09400 [pdf, other]

Yo'LLaVA: Your Personalized Language and Vision Assistant

Authors: Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee

Abstract: Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user's pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in our surroundings. For example, one might ask, "What should I buy for my dog's birthday?"; as opposed to a generic inquiry about "What should I buy for a dog's birthday?". Similarly, when looking at a friend's image, the interest lies in seeing their activities (e.g., "my friend is holding a cat"), rather than merely observing generic human actions (e.g., "a man is holding a cat"). In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject. We propose Yo'LLaVA, which learns to embed a personalized subject into a set of latent tokens given a handful of example images of the subject. Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).

Submitted 13 June, 2024; originally announced June 2024.

Comments: Project page: https://thaoshibe.github.io/YoLLaVA

arXiv:2406.08733 [pdf, other]

doi 10.1145/3441000.3441031

A Tangible Multi-Display Toolkit to Support the Collaborative Design Exploration of AV-Pedestrian Interfaces

Authors: Marius Hoggenmuller, Martin Tomitsch, Callum Parker, Trung Thanh Nguyen, Dawei Zhou, Stewart Worrall, Eduardo Nebot

Abstract: The advent of cyber-physical systems, such as robots and autonomous vehicles (AVs), brings new opportunities and challenges for the domain of interaction design. Though there is consensus about the value of human-centred development, there is a lack of documented tailored methods and tools for involving multiple stakeholders in design exploration processes. In this paper we present a novel approach using a tangible multi-display toolkit. Orchestrating computer-generated imagery across multiple displays, the toolkit enables multiple viewing angles and perspectives to be captured simultaneously (e.g. top-view, first-person pedestrian view). Participants are able to directly interact with the simulated environment through tangible objects. At the same time, the objects physically simulate the interface's behaviour (e.g. through an integrated LED display). We evaluated the toolkit in design sessions with experts to collect feedback and input on the design of an AV-pedestrian interface. The paper reports on how the combination of tangible objects and multiple displays supports collaborative design explorations.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07460 [pdf, ps, other]

Existence and asymptotic autonomous robustness of random attractors for three-dimensional stochastic globally modified Navier-Stokes equations on unbounded domains

Authors: Bui Kim My, Ho Thi Hang, Kush Kinra, Manil T. Mohan, Pham Tri Nguyen

Abstract: In this article, we discuss the existence and asymptotically autonomous robustness (AAR) (almost surely) of random attractors for 3D stochastic globally modified Navier-Stokes equations (SGMNSE) on Poincaré domains (which may be bounded or unbounded). Our aim is to investigate the existence and AAR of random attractors for 3D SGMNSE when the time-dependent forcing converges to a time-independent function under the perturbation of linear multiplicative noise as well as additive noise. The main approach is to provide a way to justify that, on some uniformly tempered universe, the usual pullback asymptotic compactness of the solution operators is uniform across an infinite time-interval $(-\infty,\tau]$. The backward uniform ``tail-smallness'' and ``flattening-property'' of the solutions over $(-\infty,\tau]$ have been demonstrated to achieve this goal. To the best of our knowledge, this is the first attempt to establish the existence as well as AAR of random attractors for 3D SGMNSE on unbounded domains.

Submitted 9 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2208.06808

MSC Class: 37L55; 76D05; 35B41; 37B55; 35B40

arXiv:2406.06863 [pdf, other]

Ollabench: Evaluating LLMs' Reasoning for Human-centric Interdependent Cybersecurity

Authors: Tam n. Nguyen

Abstract: Large Language Models (LLMs) have the potential to enhance Agent-Based Modeling by better representing complex interdependent cybersecurity systems, improving cybersecurity threat modeling and risk management. However, evaluating LLMs in this context is crucial for legal compliance and effective application development. Existing LLM evaluation frameworks often overlook the human factor and cognitive computing capabilities essential for interdependent cybersecurity. To address this gap, I propose OllaBench, a novel evaluation framework that assesses LLMs' accuracy, wastefulness, and consistency in answering scenario-based information security compliance and non-compliance questions. OllaBench is built on a foundation of 24 cognitive behavioral theories and empirical evidence from 38 peer-reviewed papers. OllaBench was used to evaluate 21 LLMs, including both open-weight and commercial models from OpenAI, Anthropic, Google, Microsoft, Meta and so on. The results reveal that while commercial LLMs have the highest overall accuracy scores, there is significant room for improvement. Smaller low-resolution open-weight LLMs are not far behind in performance, and there are significant differences in token efficiency and consistency among the evaluated models. OllaBench provides a user-friendly interface and supports a wide range of LLM platforms, making it a valuable tool for researchers and solution developers in the field of human-centric interdependent cybersecurity and beyond.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 12 pages, 7 figures, 2 tables. The final conference/journal version may have significantly more content updates.

ACM Class: I.2.0; J.4

arXiv:2406.06551 [pdf]

doi 10.1109/IRMMW-THz57677.2023.10298883

A Simple View on Large-Signal Resonant-Tunneling-Diode Dynamics

Authors: Petr Ourednik, Dinh Tuan Nguyen, Michael Feiginov

Abstract: We present a model for an accurate description of the large-signal resonant-tunneling-diode (RTD) dynamics, which allows for a simple and intuitive analysis in terms of dynamical trajectories in a phase space. We show that the RTD admittance can be accurately described by a simple RLRC equivalent circuit, which has a universal configuration, but with different circuit parameters in the large- and small-signal cases.

Submitted 27 May, 2024; originally announced June 2024.

Comments: 2 pages, 2 figures, accepted version

Journal ref: 2023 48th International Conference on Infrared, Millimeter, and Terahertz Waves (IRMMW-THz), Montreal, QC, Canada, 2023, pp. 1-2

arXiv:2406.06409 [pdf, ps, other]

On the structure of the value function of optimal exit time problems

Authors: Piermarco Cannarsa, Marco Mazzola, Khai T. Nguyen

Abstract: In this paper, we study an optimal exit time problem with general running and terminal costs and a target $\mathcal{S}\subset\mathbb{R}^d$ having an inner ball property for a nonlinear control system that satisfies mild controllability assumptions. In particular, Petrov's condition at the boundary of $\mathcal{S}$ is not required and the value function $V$ may fail to be locally Lipschitz. In such a weakened set-up, we first establish a representation formula for proximal (horizontal) supergradients of $V$ by using transported proximal normal vectors. This allows us to obtain an external sphere condition for the hypograph of $V$ which yields several regularity properties. In particular, $V$ is almost everywhere twice differentiable and the Hausdorff dimension of its singularities is not greater than $d-1/2$. Furthermore, besides optimality conditions for trajectories of the optimal control problem, we extend the analysis to propagation of singularities and differentiability properties of the value function. An upper bound for the Hausdorff measure of the singular set is also studied, which implies that $V$ is a function of special bounded variation.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 50 pages

MSC Class: 49N60; 49N05; 49J52; 49E30

arXiv:2406.06239 [pdf, other]

I-MPN: Inductive Message Passing Network for Efficient Human-in-the-Loop Annotation of Mobile Eye Tracking Data

Authors: Hoang H. Le, Duy M. H. Nguyen, Omair Shahzad Bhatti, Laszlo Kopacsi, Thinh P. Ngo, Binh T. Nguyen, Michael Barz, Daniel Sonntag

Abstract: Comprehending how humans process visual information in dynamic settings is crucial for psychology and designing user-centered interactions. While mobile eye-tracking systems combining egocentric video and gaze signals can offer valuable insights, manual analysis of these recordings is time-intensive. In this work, we present a novel human-centered learning algorithm designed for automated object recognition within mobile eye-tracking settings. Our approach seamlessly integrates an object detector with a spatial relation-aware inductive message-passing network (I-MPN), harnessing node profile information and capturing object correlations. Such mechanisms enable us to learn embedding functions capable of generalizing to new object angle views, facilitating rapid adaptation and efficient reasoning in dynamic contexts as users navigate their environment. Through experiments conducted on three distinct video sequences, our interactive-based method showcases significant performance improvements over fixed training/testing algorithms, even when trained on considerably smaller annotated samples collected through user feedback. Furthermore, we demonstrate exceptional efficiency in data annotation processes and surpass prior interactive methods that use complete object detectors, combine detectors with convolutional networks, or employ interactive video segmentation.

Submitted 7 July, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Updated version

arXiv:2406.05615 [pdf, other]

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Authors: Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

Abstract: Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on the challenges, we summarize their methods from model architecture, model training, and data perspectives. We also conduct performance comparison among the methods, and discuss promising directions for future research.

Submitted 1 July, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

Comments: Accepted at ACL 2024 (Findings)

arXiv:2406.05349 [pdf, other]

Blurry-Consistency Segmentation Framework with Selective Stacking on Differential Interference Contrast 3D Breast Cancer Spheroid

Authors: Thanh-Huy Nguyen, Thi Kim Ngan Ngo, Mai Anh Vu, Ting-Yuan Tu

Abstract: The ability of three-dimensional (3D) spheroid modeling to study the invasive behavior of breast cancer cells has drawn increased attention. The deep learning-based image processing framework is very effective at speeding up the cell morphological analysis process. Out-of-focus photos taken while capturing 3D cells under several z-slices, however, could negatively impact the deep learning model. In this work, we created a new algorithm to handle blurry images while preserving the stacked image quality. Furthermore, we proposed a unique training architecture that leverages consistency training to help reduce the bias of the model when dense-slice stacking is applied. Additionally, the model's stability is increased under the sparse-slice stacking effect by utilizing the self-training approach. The new blurring stacking technique and training flow are combined with the suggested architecture and self-training mechanism to provide an innovative yet easy-to-use framework. Our methods produced noteworthy experimental outcomes in terms of both quantitative and qualitative aspects.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.04994 [pdf, other]

Unguided structure learning of DAGs for count data

Authors: Thi Kim Hue Nguyen, Monica Chiogna, Davide Risso

Abstract: Mainly motivated by the problem of modelling directional dependence relationships for multivariate count data in high-dimensional settings, we present a new algorithm, called learnDAG, for learning the structure of directed acyclic graphs (DAGs). In particular, the proposed algorithm tackles the problem of learning DAGs from observational data in two main steps: (i) estimation of candidate parent sets; and (ii) feature selection. We experimentally compare learnDAG to several popular competitors in recovering the true structure of the graphs in situations where relatively moderate sample sizes are available. Furthermore, to further support our algorithm, a validation of the algorithm is presented through the analysis of real datasets.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.03699 [pdf, other]

M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering

Authors: Anand Subramanian, Viktor Schlegel, Abhinav Ramesh Kashyap, Thanh-Tung Nguyen, Vijay Prakash Dwivedi, Stefan Winkler

Abstract: There is vivid research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite their popularity, there is a lack of understanding of the extent and contributing factors that allow LLMs to recall relevant knowledge and combine it with presented information in the clinical and biomedical domain: a fundamental pre-requisite for success on down-stream tasks. Addressing this gap, we use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains. Our multifaceted analysis of the performance of 15 LLMs, further broken down by sub-domain, source of knowledge and model architecture, uncovers success factors such as instruction tuning that lead to improved recall and comprehension. We further show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results, even generalising to unseen specialist sub-domains. We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models' capabilities to simply recall necessary knowledge and to integrate it with the presented context. To foster research and collaboration in this field we share M-QALM, our resources, standardised methodology, and evaluation results, with the research community to facilitate further advancements in clinical knowledge representation learning within language models.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted at ACL 2024 (Findings)

arXiv:2406.02555 [pdf, ps, other]

PhoWhisper: Automatic Speech Recognition for Vietnamese

Authors: Thanh-Thien Le, Linh The Nguyen, Dat Quoc Nguyen

Abstract: We introduce PhoWhisper in five versions for Vietnamese automatic speech recognition. PhoWhisper's robustness is achieved through fine-tuning the Whisper model on an 844-hour dataset that encompasses diverse Vietnamese accents. Our experimental study demonstrates state-of-the-art performances of PhoWhisper on benchmark Vietnamese ASR datasets. We have open-sourced PhoWhisper at: https://github.com/VinAIResearch/PhoWhisper

Submitted 27 March, 2024; originally announced June 2024.

Comments: Accepted to ICLR 2024 Tiny Papers Track

arXiv:2406.02440 [pdf, ps, other]

Simplicial complexes and matroids with vanishing $T^2$

Authors: Alexandru Constantinescu, Patricia Klein, Thai Thanh Nguyen, Anurag Singh, Lorenzo Venturello

Abstract: We investigate quotients by radical monomial ideals for which $T^2$, the second cotangent cohomology module, vanishes. The dimension of the graded components of $T^2$, and thus their vanishing, depends only on the combinatorics of the corresponding simplicial complex. We give both a complete characterization and a full list of one dimensional complexes with $T^2=0$. We characterize the graded components of $T^2$ when the simplicial complex is a uniform matroid. Finally, we show that $T^2$ vanishes for all matroids of corank at most two and conjecture that all connected matroids with vanishing $T^2$ are of corank at most two.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 13 pages

arXiv:2406.02317 [pdf, other]

Generative Conditional Distributions by Neural (Entropic) Optimal Transport

Authors: Bao Nguyen, Binh Nguyen, Hieu Trung Nguyen, Viet Anh Nguyen

Abstract: Learning conditional distributions is challenging because the desired outcome is not a single distribution but multiple distributions that correspond to multiple instances of the covariates. We introduce a novel neural entropic optimal transport method designed to effectively learn generative models of conditional distributions, particularly in scenarios characterized by limited sample sizes. Our method relies on the minimax training of two neural networks: a generative network parametrizing the inverse cumulative distribution functions of the conditional distributions and another network parametrizing the conditional Kantorovich potential. To prevent overfitting, we regularize the objective function by penalizing the Lipschitz constant of the network output. Our experiments on real-world datasets show the effectiveness of our algorithm compared to state-of-the-art conditional distribution learning techniques. Our implementation can be found at https://github.com/nguyenngocbaocmt02/GENTLE.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 15 pages, 8 figures

arXiv:2406.01029 [pdf, other]

CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos

Authors: Trong-Thuan Nguyen, Pha Nguyen, Xin Li, Jackson Cothren, Alper Yilmaz, Khoa Luu

Abstract: Video scene graph generation (VidSGG) has emerged as a transformative approach to capturing and interpreting the intricate relationships among objects and their temporal dynamics in video sequences. In this paper, we introduce the new AeroEye dataset that focuses on multi-object relationship modeling in aerial videos. Our AeroEye dataset features various drone scenes and includes a visually comprehensive and precise collection of predicates that capture the intricate relationships and spatial arrangements among objects. To this end, we propose the novel Cyclic Graph Transformer (CYCLO) approach that allows the model to capture both direct and long-range temporal dependencies by continuously updating the history of interactions in a circular manner. The proposed approach also allows one to handle sequences with inherent cyclical patterns and process object relationships in the correct sequential order. Therefore, it can effectively capture periodic and overlapping relationships while minimizing information loss. The extensive experiments on the AeroEye dataset demonstrate the effectiveness of the proposed CYCLO model, demonstrating its potential to perform scene understanding on drone videos. Finally, the CYCLO method consistently achieves State-of-the-Art (SOTA) results on two in-the-wild scene graph generation benchmarks, i.e., PVSG and ASPIRe.

Submitted 3 June, 2024; originally announced June 2024.

Showing 101–150 of 3,810 results for author: Nguyen, T