Search | arXiv e-print repository

Audio xLSTMs: Learning Self-supervised audio representations with xLSTMs

Authors: Sarthak Yadav, Sergios Theodoridis, Zheng-Hua Tan

Abstract: While the transformer has emerged as the eminent neural architecture, several independent lines of research have emerged to address its limitations. Recurrent neural approaches have also observed a lot of renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM architecture. However, while xLSTMs have shown competitive performance compared to the transformer, their viability for learning self-supervised general-purpose audio representations has not yet been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach to learn audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 20% in relative performance across a set of ten diverse downstream tasks while having up to 45% fewer parameters.

Submitted 29 August, 2024; originally announced August 2024.

Comments: Under review at ICASSP 2025. arXiv admin note: text overlap with arXiv:2406.02178

arXiv:2408.13126 [pdf, other]

CathAction: A Benchmark for Endovascular Intervention Understanding

Authors: Baoru Huang, Tuan Vo, Chayun Kongtongvattana, Giulio Dagnino, Dennis Kundrat, Wenqiang Chi, Mohamed Abdelaziz, Trevor Kwok, Tudor Jianu, Tuong Do, Hieu Le, Minh Nguyen, Hoan Nguyen, Erman Tjiputra, Quang Tran, Jianyang Xie, Yanda Meng, Binod Bhattarai, Zhaorui Tan, Hongbin Liu, Hong Seng Gan, Wei Wang, Xi Yang, Qiufeng Wang, Jionglong Su, et al. (13 additional authors not shown)

Abstract: Real-time visual feedback from catheterization analysis is crucial for enhancing surgical safety and efficiency during endovascular interventions. However, existing datasets are often limited to specific tasks, small scale, and lack the comprehensive annotations necessary for broader endovascular intervention understanding. To tackle these limitations, we introduce CathAction, a large-scale dataset for catheterization understanding. Our CathAction dataset encompasses approximately 500,000 annotated frames for catheterization action understanding and collision detection, and 25,000 ground truth masks for catheter and guidewire segmentation. For each task, we benchmark recent related works in the field. We further discuss the challenges of endovascular interventions compared to traditional computer vision tasks and point out open research questions. We hope that CathAction will facilitate the development of endovascular intervention understanding methods that can be applied to real-world applications. The dataset is available at https://airvlab.github.io/cathaction/.

Submitted 30 August, 2024; v1 submitted 23 August, 2024; originally announced August 2024.

Comments: 10 pages. Webpage: https://airvlab.github.io/cathaction/

arXiv:2408.12025 [pdf, other]

Exploring Large Language Models for Feature Selection: A Data-centric Perspective

Authors: Dawei Li, Zhen Tan, Huan Liu

Abstract: The rapid advancement of Large Language Models (LLMs) has significantly influenced various domains, leveraging their exceptional few-shot and zero-shot learning capabilities. In this work, we aim to explore and understand LLM-based feature selection methods from a data-centric perspective. We begin by categorizing existing feature selection methods with LLMs into two groups: data-driven feature selection, which requires sample values to do statistical inference, and text-based feature selection, which utilizes prior knowledge of LLMs to do semantic associations using descriptive context. We conduct extensive experiments in both classification and regression tasks with LLMs of various sizes (e.g., GPT-4, ChatGPT and LLaMA-2). Our findings emphasize the effectiveness and robustness of text-based feature selection methods and showcase their potential using a real-world medical application. We also discuss the challenges and future opportunities in employing LLMs for feature selection, offering insights for further research and development in this emerging field.

Submitted 21 August, 2024; originally announced August 2024.

Comments: Preprint, under review

arXiv:2408.11900 [pdf, other]

Quantum highway: Observation of minimal and maximal speed limits for few and many-body states

Authors: Zitian Zhu, Lei Gao, Zehang Bao, Liang Xiang, Zixuan Song, Shibo Xu, Ke Wang, Jiachen Chen, Feitong Jin, Xuhao Zhu, Yu Gao, Yaozu Wu, Chuanyu Zhang, Ning Wang, Yiren Zou, Ziqi Tan, Aosai Zhang, Zhengyi Cui, Fanhao Shen, Jiarun Zhong, Tingting Li, Jinfeng Deng, Xu Zhang, Hang Dong, Pengfei Zhang, et al. (8 additional authors not shown)

Abstract: Tracking the time evolution of a quantum state allows one to verify the thermalization rate or the propagation speed of correlations in generic quantum systems. Inspired by the energy-time uncertainty principle, bounds have been demonstrated on the maximal speed at which a quantum state can change, resulting in immediate and practical tasks. Based on a programmable superconducting quantum processor, we test the dynamics of various emulated quantum mechanical systems encompassing single- and many-body states. We show that one can test the known quantum speed limits and that modifying a single Hamiltonian parameter allows the observation of the crossover of the different bounds on the dynamics. We also unveil the observation of minimal quantum speed limits in addition to more common maximal ones, i.e., the lowest rate of change of a unitarily evolved quantum state. Our results establish a comprehensive experimental characterization of quantum speed limits and pave the way for their subsequent study in engineered non-unitary conditions.

Submitted 21 August, 2024; originally announced August 2024.

Comments: 9 pages, 4 figures + supplementary information

arXiv:2408.10387 [pdf, other]

Neural Infalling Cloud Equations (NICE): Increasing the Efficacy of Subgrid Models and Scientific Equation Discovery using Neural ODEs and Symbolic Regression

Authors: Zun Yi Brent Tan

Abstract: It is now well established that galactic systems are inherently multiphase, and that understanding the roles and interactions of the various phases is key towards a more complete picture of galaxy formation and evolution. For example, these interactions play a pivotal role in the cycling of baryons which fuels star formation. It remains a challenge that the transport and dynamics of cold clouds in their surrounding hot environment are governed by complex small scale processes (such as the interplay of turbulence and radiative cooling) that determine how the phases exchange mass, momentum and energy. Large scale models thus require subgrid prescriptions in the form of models validated on small scale simulations, which can take the form of a system of coupled differential equations. In this work, we explore using neural ordinary differential equations which embed a neural network as a term in the subgrid model to capture an uncertain physical process. We then apply Symbolic Regression on the learned model to potentially discover new insights into the physics of cloud-environment interactions. We test this on both generated mock data and actual simulation data. We also extend the neural ODE to include a secondary neural term. We show that neural ODEs in tandem with Symbolic Regression can be used to enhance the accuracy and efficiency of subgrid models, and/or discover the underlying equations to improve generality and scientific understanding. We highlight the potential of this scientific machine learning approach as a natural extension to the traditional modelling paradigm, both for the development of semi-analytic models and for physically interpretable equation discovery in complex non-linear systems.

Submitted 19 August, 2024; originally announced August 2024.

Comments: 9 Pages, 8 Figures, submitted to MNRAS

arXiv:2408.09555 [pdf]

doi 10.1016/j.jmii.2024.01.002

Predicting potential SARS-CoV-2 spillover and spillback in animals

Authors: Zi Hian Tan, Kian Yan Yong, Jian-Jun Shu

Abstract: The COVID-19 pandemic is spreading rapidly around the world, causing countries to impose lockdowns and efforts to develop vaccines on a global scale. However, human-to-animal and animal-to-human transmission cannot be ignored, as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) can spread rapidly in farmed and wild animals. This could create a worrying cycle of SARS-CoV-2 spillover from humans to animals and spillback of new strains back into humans, rendering vaccines ineffective. This study provides a key indicator of animals that may be potential susceptible hosts for SARS-CoV-2 and coronavirus infections by analysing the phylogenetic distance between host angiotensin-converting enzyme 2 and the coronavirus spike protein. Crucially, our analysis identifies animals that are at elevated risk from a spillover and spillback incident. One group of animals has been identified as potentially susceptible to SARS-CoV-2 by harbouring a parasitic coronavirus spike protein similar to the SARS-CoV-2 spike protein. These animals may serve as amplification hosts in spillover events from zoonotic reservoirs. Tracing interspecies transmission in multi-host environments based solely on in vitro and in vivo examinations of animal susceptibility or serology is a time-consuming task. This approach allows rapid identification of high-risk animals to prioritize research and assessment of the risk of zoonotic disease transmission in the environment. It is a tool to rapidly identify zoonotic species that may cause outbreaks or participate in expansion cycles of coexistence with their hosts. This prevents the spread of coronavirus infections between species, preventing spillover and spillback incidents from occurring.

Submitted 18 August, 2024; originally announced August 2024.

Journal ref: Journal of Microbiology Immunology and Infection, Vol. 57, No. 2, pp. 225-237, 2024

arXiv:2408.09465 [pdf, other]

MedMAP: Promoting Incomplete Multi-modal Brain Tumor Segmentation with Alignment

Authors: Tianyi Liu, Zhaorui Tan, Muyin Chen, Xi Yang, Haochuan Jiang, Kaizhu Huang

Abstract: Brain tumor segmentation is often based on multiple magnetic resonance imaging (MRI). However, in clinical practice, certain modalities of MRI may be missing, which presents a more difficult scenario. To cope with this challenge, Knowledge Distillation, Domain Adaption, and Shared Latent Space have emerged as commonly promising strategies. However, recent efforts typically overlook the modality gaps and thus fail to learn important invariant feature representations across different modalities. Such a drawback consequently leads to limited performance for missing modality models. To ameliorate these problems, pre-trained models are used in natural visual segmentation tasks to minimize the gaps. However, promising pre-trained models are often unavailable in medical image segmentation tasks. Along this line, in this paper, we propose a novel paradigm that aligns latent features of involved modalities to a well-defined distribution anchor as the substitution of the pre-trained model. As a major contribution, we prove that our novel training paradigm ensures a tight evidence lower bound, thus theoretically certifying its effectiveness. Extensive experiments on different backbones validate that the proposed paradigm can enable invariant feature representations and produce models with narrowed modality gaps. Models with our alignment paradigm show their superior performance on both BraTS2018 and BraTS2020 datasets.

Submitted 18 August, 2024; originally announced August 2024.

arXiv:2408.09070 [pdf, other]

CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts

Authors: Qingkai Zeng, Yuyang Bai, Zhaoxuan Tan, Zhenyu Wu, Shangbin Feng, Meng Jiang

Abstract: Taxonomies play a crucial role in various applications by providing a structural representation of knowledge. The task of taxonomy expansion involves integrating emerging concepts into existing taxonomies by identifying appropriate parent concepts for these new query concepts. Previous approaches typically relied on self-supervised methods that generate annotation data from existing taxonomies. However, these methods are less effective when the existing taxonomy is small (fewer than 100 entities). In this work, we introduce CodeTaxo, a novel approach that leverages large language models through code language prompts to capture the taxonomic structure. Extensive experiments on five real-world benchmarks from different domains demonstrate that CodeTaxo consistently achieves superior performance across all evaluation metrics, significantly outperforming previous state-of-the-art methods. The code and data are available at https://github.com/QingkaiZeng/CodeTaxo-Pub.

Submitted 16 August, 2024; originally announced August 2024.

arXiv:2408.08084 [pdf, other]

An Efficient Replay for Class-Incremental Learning with Pre-trained Models

Authors: Weimin Yin, Bin Chen, Chunzhao Xie, Zhenhao Tan

Abstract: In general class-incremental learning, researchers typically use sample sets as a tool to avoid catastrophic forgetting during continuous learning. At the same time, researchers have also noted the differences between class-incremental learning and Oracle training and have attempted to make corrections. In recent years, researchers have begun to develop class-incremental learning algorithms utilizing pre-trained models, achieving significant results. This paper observes that in class-incremental learning, the steady state among the weight guided by each class center is disrupted, which is significantly correlated with catastrophic forgetting. Based on this, we propose a new method for overcoming forgetting. In some cases, by retaining only a single sample unit of each class in memory for replay and applying simple gradient constraints, very good results can be achieved. Experimental results indicate that under the condition of pre-trained models, our method can achieve competitive performance with very low computational cost and by simply using the cross-entropy loss.

Submitted 15 August, 2024; originally announced August 2024.

arXiv:2408.06320 [pdf, other]

Dynamics of ballistic photocurrents driven by Coulomb scattering

Authors: Liang Z. Tan, Xavier Andrade, Sangeeta Rajpurohit, Alfredo A. Correa, Tadashi Ogitsu

Abstract: First principles real-time time dependent density functional theory (rt-TDDFT) calculations reveal the existence of ballistic photocurrents generated by Coulomb scattering, a form of photocurrent that has not previously been considered as a mechanism for the bulk photovoltaic effect. With monolayer GeS as an example, it is predicted that ballistic currents can exceed shift currents under experimentally accessible conditions. Moreover, these simulations reveal recombination pathways that work to modify shift current magnitudes beyond perturbation theory predictions.

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2408.05479 [pdf, other]

ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack

Authors: Ziyi Gao, Kai Chen, Zhipeng Wei, Tingshu Mou, Jingjing Chen, Zhiyu Tan, Hao Li, Yu-Gang Jiang

Abstract: Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with high transferability compared to previous unrestricted attacks and restricted attacks. However, existing works on diffusion-based unrestricted attacks are mostly focused on images yet are seldom explored in videos. In this paper, we propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which is the first framework to generate imperceptible adversarial video clips with higher transferability. Specifically, to achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy that optimizes perturbations in diffusion models' latent space at each denoising step. TALO offers iterative and accurate updates to generate more powerful adversarial frames. TALO can further reduce memory consumption in gradient computation. Moreover, to achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism by matching and merging tokens across video frames in the self-attention module, resulting in temporally consistent adversarial videos. ReToMe concurrently facilitates inter-frame interactions into the attack process, inducing more diverse and robust gradients, thus leading to better adversarial transferability. Extensive experiments demonstrate the efficacy of ReToMe-VA, particularly in surpassing state-of-the-art attacks in adversarial transferability by more than 14.16% on average.

Submitted 10 August, 2024; originally announced August 2024.

arXiv:2408.03407 [pdf, other]

Deep Clustering via Distribution Learning

Authors: Guanfang Dong, Zijie Tan, Chenqiu Zhao, Anup Basu

Abstract: Distribution learning finds probability density functions from a set of data samples, whereas clustering aims to group similar data points to form clusters. Although there are deep clustering methods that employ distribution learning methods, past work still lacks theoretical analysis regarding the relationship between clustering and distribution learning. Thus, in this work, we provide a theoretical analysis to guide the optimization of clustering via distribution learning. To achieve better results, we embed deep clustering guided by a theoretical analysis. Furthermore, the distribution learning method cannot always be directly applied to data. To overcome this issue, we introduce a clustering-oriented distribution learning method called Monte-Carlo Marginalization for Clustering. We integrate Monte-Carlo Marginalization for Clustering into Deep Clustering, resulting in Deep Clustering via Distribution Learning (DCDL). Eventually, the proposed DCDL achieves promising results compared to state-of-the-art methods on popular datasets. Considering a clustering task, the new distribution learning method outperforms previous methods as well.

Submitted 6 August, 2024; originally announced August 2024.

arXiv:2408.02629 [pdf, other]

VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Authors: Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li

Abstract: The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.

Submitted 5 August, 2024; originally announced August 2024.

Comments: project page: https://sais-fuxi.github.io/projects/vidgen-1m

arXiv:2408.02306 [pdf, other]

Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization

Authors: Changtao Miao, Qi Chu, Tao Gong, Zhentao Tan, Zhenchao Jin, Wanyi Zhuang, Man Luo, Honggang Hu, Nenghai Yu

Abstract: With the advancement of face manipulation technology, forgery images in multi-face scenarios are gradually becoming a more complex and realistic challenge. Despite this, detection and localization methods for such multi-face manipulations remain underdeveloped. Traditional manipulation localization methods either indirectly derive detection results from localization masks, resulting in limited detection performance, or employ a naive two-branch structure to simultaneously obtain detection and localization results, which cannot effectively benefit the localization capability due to limited interaction between two tasks. This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization. The MoNFAP primarily introduces two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM). The FUP integrates detection and localization tasks using a token learning strategy and multiple forgery-aware transformers, which facilitates the use of classification information to enhance localization capability. Besides, motivated by the crucial role of noise information in forgery detection, the MNM leverages multiple noise extractors based on the concept of the mixture of experts to enhance the general RGB features, further boosting the performance of our framework. Finally, we establish a comprehensive benchmark for multi-face detection and localization and the proposed MoNFAP achieves significant performance. The codes will be made available.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.00001 [pdf, other]

Replication in Visual Diffusion Models: A Survey and Outlook

Authors: Wenhao Wang, Yifan Sun, Zongxin Yang, Zhengdong Hu, Zhentao Tan, Yi Yang

Abstract: Visual diffusion models have revolutionized the field of creative AI, producing high-quality and diverse content. However, they inevitably memorize training images or videos, subsequently replicating their concepts, content, or styles during inference. This phenomenon raises significant concerns about privacy, security, and copyright within generated outputs. In this survey, we provide the first comprehensive review of replication in visual diffusion models, marking a novel contribution to the field by systematically categorizing the existing studies into unveiling, understanding, and mitigating this phenomenon. Specifically, unveiling mainly refers to the methods used to detect replication instances. Understanding involves analyzing the underlying mechanisms and factors that contribute to this phenomenon. Mitigation focuses on developing strategies to reduce or eliminate replication. Beyond these aspects, we also review papers focusing on its real-world influence. For instance, in the context of healthcare, replication is critically worrying due to privacy concerns related to patient data. Finally, the paper concludes with a discussion of the ongoing challenges, such as the difficulty in detecting and benchmarking replication, and outlines future directions including the development of more robust mitigation techniques. By synthesizing insights from diverse studies, this paper aims to equip researchers and practitioners with a deeper understanding at the intersection between AI technology and social good. We release this project at https://github.com/WangWenhao0716/Awesome-Diffusion-Replication.

Submitted 7 July, 2024; originally announced August 2024.

Comments: The first survey focuses on replication in visual diffusion models. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2407.21264 [pdf, other]

Model Attribution in LLM-Generated Disinformation: A Domain Generalization Approach with Supervised Contrastive Learning

Authors: Alimohammad Beigi, Zhen Tan, Nivedh Mudiam, Canyu Chen, Kai Shu, Huan Liu

Abstract: Model attribution for LLM-generated disinformation poses a significant challenge in understanding its origins and mitigating its spread. This task is especially challenging because modern large language models (LLMs) produce disinformation with human-like quality. Additionally, the diversity in prompting methods used to generate disinformation complicates accurate source attribution. These methods introduce domain-specific features that can mask the fundamental characteristics of the models. In this paper, we introduce the concept of model attribution as a domain generalization problem, where each prompting method represents a unique domain. We argue that an effective attribution model must be invariant to these domain-specific features. It should also be proficient in identifying the originating models across all scenarios, reflecting real-world detection challenges. To address this, we introduce a novel approach based on Supervised Contrastive Learning. This method is designed to enhance the model's robustness to variations in prompts and focuses on distinguishing between different source LLMs. We evaluate our model through rigorous experiments involving three common prompting methods: "open-ended", "rewriting", and "paraphrasing", and three advanced LLMs: "llama 2", "chatgpt", and "vicuna". Our results demonstrate the effectiveness of our approach in model attribution tasks, achieving state-of-the-art performance across diverse and unseen datasets.

Submitted 14 August, 2024; v1 submitted 30 July, 2024; originally announced July 2024.

Comments: 10 pages, 2 figures, accepted at DSAA 2024

arXiv:2407.20920 [pdf, other]

SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition

Authors: Hao Tan, Zichang Tan, Jun Li, Jun Wan, Zhen Lei, Stan Z. Li

Abstract: Multi-label image recognition is a fundamental task in computer vision. Recently, Vision-Language Models (VLMs) have made notable advancements in this area. However, previous methods fail to effectively leverage the rich knowledge in language models and often incorporate label semantics into visual features unidirectionally. To overcome these problems, we propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of VLMs. Specifically, we develop an in-context learning approach to associate the inherent knowledge from LLMs. Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually and then aggregate them carefully through the quaternion network. Moreover, we present Gated Dual-Modal Alignments (GDMA) to bidirectionally interact visual and linguistic modalities while eliminating redundant cross-modal information, enabling more efficient region-level alignments. Rather than making the final prediction by a sharp manner in previous works, we propose a soft aggregator to jointly consider results from all image regions. With the help of flexible prompting and gated alignments, SSPA is generalizable to specific domains. Extensive experiments on nine datasets from three domains (i.e., natural, pedestrian attributes and remote sensing) demonstrate the state-of-the-art performance of SSPA. Further analyses verify the effectiveness of SSP and the interpretability of GDMA. The code will be made public.

Submitted 30 July, 2024; originally announced July 2024.

Comments: 13 pages, 8 figures

arXiv:2407.20516 [pdf, other]

Machine Unlearning in Generative AI: A Survey

Authors: Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, Meng Jiang

Abstract: Generative AI technologies have been deployed in many places, such as (multimodal) large language models and vision generative models. Their remarkable performance should be attributed to massive training data and emergent reasoning abilities. However, the models would memorize and generate sensitive, biased, or dangerous information originated from the training data especially those from web crawl. New machine unlearning (MU) techniques are being developed to reduce or eliminate undesirable knowledge and its effects from the models, because those that were designed for traditional classification tasks could not be applied for Generative AI. We offer a comprehensive survey on many things about MU in Generative AI, such as a new problem formulation, evaluation methods, and a structured discussion on the advantages and limitations of different kinds of MU techniques. It also presents several critical challenges and promising directions in MU research. A curated list of readings can be found: https://github.com/franciscoliu/GenAI-MU-Reading.

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.17451 [pdf, other]

BlueTempNet: A Temporal Multi-network Dataset of Social Interactions in Bluesky Social

Authors: Ujun Jeong, Bohan Jiang, Zhen Tan, H. Russell Bernard, Huan Liu

Abstract: Decentralized social media platforms like Bluesky Social (Bluesky) have made it possible to publicly disclose some user behaviors with millisecond-level precision. Embracing Bluesky's principles of open-source and open-data, we present the first collection of the temporal dynamics of user-driven social interactions. BlueTempNet integrates multiple types of networks into a single multi-network, including user-to-user interactions (following and blocking users) and user-to-community interactions (creating and joining communities). Communities are user-formed groups in custom Feeds, where users subscribe to posts aligned with their interests. Following Bluesky's public data policy, we collect existing Bluesky Feeds, including the users who liked and generated these Feeds, and provide tools to gather users' social interactions within a date range. This data-collection strategy captures past user behaviors and supports the future data collection of user behavior.

Submitted 24 July, 2024; originally announced July 2024.

Comments: to appear in IEEE Data Description

arXiv:2407.17436 [pdf, other]

AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies

Authors: Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, Bo Li

Abstract: Foundation models (FMs) provide societal benefits but also amplify risks. Governments, companies, and researchers have proposed regulatory frameworks, acceptable use policies, and safety benchmarks in response. However, existing public benchmarks often define safety categories based on previous literature, intuitions, or common sense, leading to disjointed sets of categories for risks specified in recent regulations and policies, which makes it challenging to evaluate and compare FMs across these benchmarks. To bridge this gap, we introduce AIR-Bench 2024, the first AI safety benchmark aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in our AI risks study, AIR 2024. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-Bench 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality. We evaluate leading language models on AIR-Bench 2024, uncovering insights into their alignment with specified safety concerns. By bridging the gap between public benchmarks and practical AI risks, AIR-Bench 2024 provides a foundation for assessing model safety across jurisdictions, fostering the development of safer and more responsible AI systems.

Submitted 5 August, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.14827 [pdf, other]

How does the velocity anisotropy of halo stars, dark matter and satellite galaxies depend on host halo properties?

Authors: Jiaxin He, Wenting Wang, Zhaozhou Li, Jiaxin Han, Vicente Rodriguez-Gomez, Donghai Zhao, Xianguang Meng, Yipeng Jing, Shi Shao, Rui Shi, Zhenlin Tan

Abstract: We investigate the mass ($M_{200}$) and concentration ($c_{200}$) dependencies of the velocity anisotropy ($β$) profiles for different components in the dark matter halo, including halo stars, dark matter and subhalos, using systems from the IllustrisTNG simulations. Beyond a critical radius, $β$ becomes more radial with the increase of $M_{200}$, reflecting more prominent radial accretion around massive halos. The critical radius is $r\sim r_s$, $0.3~r_s$ and $r_s$ for halo stars, dark matter and subhalos, with $r_s$ the scale radius of host halos. This dependence on $M_{200}$ is the strongest for subhalos, and the weakest for halo stars. In central regions, $β$ of halo stars and dark matter particles gets more isotropic with the increase of $M_{200}$ in TNG300 due to baryons. By contrast, $β$ of dark matter from the dark matter only TNG300-Dark run shows much weaker dependence on $M_{200}$ within $r_s$. Dark matter in TNG300 is slightly more isotropic than in TNG300-Dark at $0.2~r_s<r<10~r_s$ and $\log_{10}M_{200}/M_\odot<13.8$. Halo stars and dark matter also become more radial with the increase in $c_{200}$, at fixed $M_{200}$. Halo stars are more radial than the $β$ profile of dark matter by approximately a constant beyond $r_s$. Dark matter particles are more radial than subhalos. The differences can be understood as subhalos on more radial orbits are easier to get stripped, contributing more stars and dark matter to the diffuse components.
We provide a fitting formula to the difference between the $β$ of halo stars and of dark matter at $r>r_s$ as $β_\mathrm{star}-β_\mathrm{DM}=(-0.028 \pm 0.008)\log_{10}M_{200}/M_\odot + (0.690\pm0.010)$.

Submitted 20 July, 2024; originally announced July 2024.

Comments: submitted to ApJ
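
The fitting formula quoted in the abstract of the velocity-anisotropy entry above can be evaluated directly. The following is only a minimal sketch, not the authors' code: the function name `beta_offset` is hypothetical, only the central values of the quoted coefficients are used, and the quoted uncertainties are ignored because the abstract gives no covariance between slope and intercept. According to the abstract the fit applies at $r>r_s$.

```python
def beta_offset(log10_m200: float) -> float:
    """Central value of beta_star - beta_DM at r > r_s.

    Uses the coefficients quoted in the abstract; the function name and
    interface are illustrative assumptions, not part of the paper.
    """
    slope = -0.028      # quoted as -0.028 +/- 0.008
    intercept = 0.690   # quoted as 0.690 +/- 0.010
    return slope * log10_m200 + intercept


if __name__ == "__main__":
    # Evaluate the fit for a few host halo masses, given as log10(M200 / Msun).
    for log10_m in (12.0, 13.0, 14.0):
        offset = beta_offset(log10_m)
        print(f"log10(M200/Msun) = {log10_m:.1f} -> beta_star - beta_DM = {offset:.3f}")
```

Because the slope is negative, the offset between halo stars and dark matter shrinks slowly as the host halo mass grows, which matches the abstract's statement that the difference is approximately constant beyond $r_s$.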

arXiv:2407.11030 [pdf, other]

DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs

Authors: Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen

Abstract: In this paper, we introduce Dynamic Layer Operations (DLO), a novel approach for vertically scaling transformer-based Large Language Models (LLMs) by dynamically expanding, activating, or skipping layers using a sophisticated routing policy based on layerwise feature similarity. Unlike traditional Mixture-of-Experts (MoE) methods that focus on extending the model width, our approach targets model depth, addressing the redundancy observed across layer representations for various input samples. Our framework is integrated with the Supervised Fine-Tuning (SFT) stage, eliminating the need for resource-intensive Continual Pre-Training (CPT). Experimental results demonstrate that DLO not only outperforms the original unscaled models but also achieves comparable results to densely expanded models with significantly improved efficiency. Our work offers a promising direction for building efficient yet powerful LLMs. We will release our implementation and model weights upon acceptance.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.10852 [pdf, other]

Cut-Preserving Vertex Sparsifiers for Planar and Quasi-bipartite Graphs

Authors: Yu Chen, Zihan Tan

Abstract: We study vertex sparsification for preserving cuts. Given a graph $G$ with a subset $|T|=k$ of its vertices called terminals, a quality-$q$ cut sparsifier is a graph $G'$ that contains $T$, such that, for any partition $(T_1,T_2)$ of $T$ into non-empty subsets, the value of the min-cut in $G'$ separating $T_1$ from $T_2$ is within factor $q$ from the value of the min-cut in $G$ separating $T_1$ from $T_2$. The construction of cut sparsifiers with good (small) quality and size has been a central problem in graph compression for years. Planar graphs and quasi-bipartite graphs are two important special families studied in this research direction. The main results in this paper are new cut sparsifier constructions for them in the high-quality regime (where $q=1$ or $1+\varepsilon$ for small $\varepsilon>0$). We first show that every planar graph admits a planar quality-$(1+\varepsilon)$ cut sparsifier of size $\tilde O(k/\text{poly}(\varepsilon))$, which is in sharp contrast with the lower bound of $2^{Ω(k)}$ for the quality-$1$ case. We then show that every quasi-bipartite graph admits a quality-$1$ cut sparsifier of size $2^{\tilde O(k^2)}$. This is the second to improve over the doubly-exponential bound for general graphs (previously only planar graphs have been shown to have single-exponential size quality-$1$ cut sparsifiers). Lastly, we show that contraction, a common approach for constructing cut sparsifiers adopted in most previous works, does not always give optimal bounds for cut sparsifiers.
We demonstrate this by showing that the optimal size bound for quality-$(1+\varepsilon)$ contraction-based cut sparsifiers for quasi-bipartite graphs lies in the range $[k^{\tilde{Ω}(1/\varepsilon)},k^{O(1/\varepsilon^2)}]$, while in previous work an upper bound of $\tilde O(k/\varepsilon^2)$ was achieved via a non-contraction approach.

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.10468 [pdf, other]

LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis

Authors: Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang

Abstract: Latent diffusion models have shown promising results in audio generation, making notable advancements over traditional methods. However, their performance, while impressive with short audio clips, faces challenges when extended to longer audio sequences. These challenges are due to the model's self-attention mechanism and training predominantly on 10-second clips, which complicates the extension to longer audio without adaptation. In response to these issues, we introduce a novel approach, LiteFocus, that enhances the inference of existing audio latent diffusion models in long audio synthesis. Having observed the attention pattern in self-attention, we employ a dual sparse form for attention calculation, designated as same-frequency focus and cross-frequency compensation, which curtails the attention computation under same-frequency constraints, while enhancing audio quality through cross-frequency refillment. LiteFocus demonstrates a substantial reduction in inference time with a diffusion-based TTA model, by 1.99x when synthesizing 80-second audio clips, while also obtaining improved audio quality.

Submitted 15 July, 2024; originally announced July 2024.

Comments: Interspeech 2024; Code: https://github.com/Yuanshi9815/LiteFocus

arXiv:2407.09333 [pdf, other]

HETOCompiler: An MLIR-based crypTOgraphic Compilation Framework for HEterogeneous Devices

Authors: Zhiyuan Tan, Liutong Han, Mingjie Xing, Yanjun Wu

Abstract: Hash algorithms are fundamental tools in cryptography, offering irreversible and sensitive transformations of input data for various security purposes. As computing architectures evolve towards heterogeneous systems, efficiently harnessing diverse computing resources for hash encryption algorithms becomes crucial. This paper presents HETOCompiler, a novel cryptography compilation framework designed for heterogeneous systems. Leveraging Multi-Level Intermediate Representation (MLIR), HETOCompiler abstracts syntax and semantics for cryptographic primitives and heterogeneous computing models, facilitating efficient compilation of high-level hash encryption algorithms into executable programs compatible with diverse devices. Experimental results demonstrate significant performance improvements over existing OpenSSL library, with average enhancements of 49.3x, 1.5x, and 23.4x for SHA-1, MD5, and SM3 algorithms respectively.

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.08974 [pdf, other]

Topology-enhanced machine learning model (Top-ML) for anticancer peptide prediction

Authors: Joshua Zhi En Tan, JunJie Wee, Xue Gong, Kelin Xia

Abstract: Recently, therapeutic peptides have demonstrated great promise for cancer treatment. To explore powerful anticancer peptides, artificial intelligence (AI)-based approaches have been developed to systematically screen potential candidates. However, the lack of efficient featurization of peptides has become a bottleneck for these machine-learning models. In this paper, we propose a topology-enhanced machine learning model (Top-ML) for anticancer peptide prediction. Our Top-ML employs peptide topological features derived from its sequence "connection" information characterized by vector and spectral descriptors. Our Top-ML model has been validated on two widely used AntiCP 2.0 benchmark datasets and has achieved state-of-the-art performance. Our results highlight the potential of leveraging novel topology-based featurization to accelerate the identification of anticancer peptides.

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.07958 [pdf, other]

Bayesian Detector Combination for Object Detection with Crowdsourced Annotations

Authors: Zhi Qin Tan, Olga Isupova, Gustavo Carneiro, Xiatian Zhu, Yunpeng Li

Abstract: Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise, especially in crowdsourcing scenarios. Most prior object detection methods assume accurate annotations; a few recent works have studied object detection with noisy crowdsourced annotations, with evaluation on distinct synthetic crowdsourced datasets of varying setups under artificial assumptions. To address these algorithmic limitations and evaluation inconsistency, we first propose a novel Bayesian Detector Combination (BDC) framework to more effectively train object detectors with noisy crowdsourced annotations, with the unique ability of automatically inferring the annotators' label qualities. Unlike previous approaches, BDC is model-agnostic, requires no prior knowledge of the annotators' skill level, and seamlessly integrates with existing object detection models. Due to the scarcity of real-world crowdsourced datasets, we introduce large synthetic datasets by simulating varying crowdsourcing scenarios. This allows consistent evaluation of different models at scale. Extensive experiments on both real and synthetic crowdsourced datasets show that BDC outperforms existing state-of-the-art methods, demonstrating its superiority in leveraging crowdsourced data for object detection. Our code and data are available at https://github.com/zhiqin1998/bdc.

Submitted 10 July, 2024; originally announced July 2024.

Comments: Accepted at ECCV 2024

arXiv:2407.07418 [pdf, other]

Dynamics of asymmetrically deformed skyrmion driven by internal forces and strain force in a flower-shaped magnetic nanostructure

Authors: Zhen-Yu Tan, Ji-Pei Chen, Yu-Ke Shi, Yuan Chen, Ming-Hui Qin, Xing-Sen Gao, Jun-Ming Liu

Abstract: Magnetic skyrmions emerge as promising quasi-particles for encoding information in next-generation spintronic devices. Their innate flexibility in shape is essential for the applications although they were often ideally treated as rigid particles. In this work, we investigated the voltage-controlled uniform strain mediated dynamics of deformed skyrmions in heterostructures with a flower-shaped magnetic nanostructure, using micromagnetic simulations. The simulated results revealed the possible states of isolated skyrmion nucleated in the nanostructure, which can be mutually switched by applying suitable in-plane strain pulses. In addition, it was found that the skyrmion motions are driven by the emerging internal forces and strain force, which originate from the asymmetric deformation of skyrmion structures. Furthermore, an analytical model of deformed skyrmions was proposed to interpret the dependences of internal forces and strain force on the asymmetric deformation of skyrmion, with some formulae derived for these forces in a semi-analytical approach. Further calculations based on these formulae verified the forces appearing in the skyrmion motion, with the resulting forces showing consistency with the simulated data. This suggested that our semi-analytical model successfully captures the main physics responsible for the motion of deformed skyrmion in the nanostructure.
Our work extends the understanding of the mechanics emerging in deformed skyrmion, and provides an effective approach for deterministic manipulation of deformed skyrmion motion via strain forces and internal forces, which may be instructive to design of skyrmion-based spintronic devices.

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.07053 [pdf, other]

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Authors: Wenqi Zhang, Zhenglin Cheng, Yuanyu He, Mengna Wang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Mingqian He, Yanna Ma, Weiming Lu, Yueting Zhuang

Abstract: Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or layouts, and visual reasoning capabilities remains quite rudimentary. They often struggle with simple daily tasks, such as reading time from a clock, understanding a flowchart, or planning a route using a road map. In light of this, we design a multi-modal self-instruct, utilizing large language models and their code capabilities to synthesize massive abstract images and visual reasoning instructions across daily scenarios. Our strategy effortlessly creates a multimodal benchmark with 11,193 instructions for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. This benchmark, constructed with simple lines and geometric elements, exposes the shortcomings of most advanced LMMs like Claude-3.5-Sonnet and GPT-4o in abstract image understanding, spatial relations reasoning, and visual element induction. Besides, to verify the quality of our synthetic data, we fine-tune an LMM using 62,476 synthetic chart, table and road map instructions. The results demonstrate improved chart understanding and map navigation performance, and also demonstrate potential benefits for other visual reasoning tasks. Our code is available at: https://github.com/zwq2018/Multi-modal-Self-instruct.

Submitted 8 August, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: code: https://github.com/zwq2018/Multi-modal-Self-instruct dataset: https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct Leaderboard: https://multi-modal-self-instruct.github.io/

arXiv:2407.02408 [pdf, other]

CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models

Authors: Song Wang, Peng Wang, Tong Zhou, Yushun Dong, Zhen Tan, Jundong Li

Abstract: As Large Language Models (LLMs) are increasingly deployed to handle various natural language processing (NLP) tasks, concerns regarding the potential negative societal impacts of LLM-generated content have also arisen. To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets. However, existing bias evaluation efforts often focus on only a particular type of bias and employ inconsistent evaluation metrics, leading to difficulties in comparison across different datasets and LLMs. To address these limitations, we collect a variety of datasets designed for the bias evaluation of LLMs, and further propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks. The curation of CEB is based on our newly proposed compositional taxonomy, which characterizes each dataset from three dimensions: bias types, social groups, and tasks. By combining the three dimensions, we develop a comprehensive evaluation strategy for the bias in LLMs. Our experiments demonstrate that the levels of bias vary across these dimensions, thereby providing guidance for the development of specific bias mitigation methods.

Submitted 2 July, 2024; originally announced July 2024.

Comments: 37 pages, 32 figures

arXiv:2407.00390 [pdf, other]

Advancing Process Verification for Large Language Models via Tree-Based Preference Learning

Authors: Mingqian He, Yongliang Shen, Wenqi Zhang, Zeqi Tan, Weiming Lu

Abstract: Large Language Models (LLMs) have demonstrated remarkable potential in handling complex reasoning tasks by generating step-by-step rationales. Some methods have proven effective in boosting accuracy by introducing extra verifiers to assess these paths. However, existing verifiers, typically trained on binary-labeled reasoning paths, fail to fully utilize the relative merits of intermediate steps, thereby limiting the effectiveness of the feedback provided. To overcome this limitation, we propose Tree-based Preference Learning Verifier (Tree-PLV), a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training. Compared to traditional binary classification, step-level preferences more finely capture the nuances between reasoning steps, allowing for a more precise evaluation of the complete reasoning path. We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing benchmarks. For instance, Tree-PLV achieved substantial performance gains over the Mistral-7B self-consistency baseline on GSM8K (67.55% to 82.79%), MATH (17.00% to 26.80%), CSQA (68.14% to 72.97%), and StrategyQA (82.86% to 83.25%). Additionally, our study explores the appropriate granularity for applying preference learning, revealing that step-level guidance provides feedback that better aligns with the evaluation of the reasoning process.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.19417 [pdf, other]

"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models

Authors: Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Song Wang, Jundong Li, Tianlong Chen, Huan Liu

Abstract: Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases, improving their performance in applications like fact-checking and information searching. In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases by injecting deceptive content into the retrieval database, intentionally changing the model's behavior. This threat is critical as it mirrors real-world usage scenarios where RAG systems interact with publicly accessible knowledge bases, such as web scrapings and user-contributed data pools. To be more realistic, we target a realistic setting where the adversary has no knowledge of users' queries, knowledge base data, and the LLM parameters. We demonstrate that it is possible to exploit the model successfully through crafted content uploads with access to the retriever. Our findings emphasize an urgent need for security measures in the design and deployment of RAG systems to prevent potential manipulation and ensure the integrity of machine-generated content.

Submitted 26 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2406.18981 [pdf, other]

Exact Fisher zeros and thermofield dynamics across a quantum critical point

Authors: Yang Liu, Songtai Lv, Yuchen Meng, Zefan Tan, Erhai Zhao, Haiyuan Zou

Abstract: By setting the inverse temperature $β$ loose to occupy the complex plane, Michael E. Fisher showed that the zeros of the complex partition function $Z$, if approaching the real $β$ axis, reveal a thermodynamic phase transition. More recently, Fisher zeros have been used to mark the dynamical phase transition in quench dynamics. The success of Fisher zeros, however, seems limited, and it is unclear how they can be employed to shed light on quantum phase transitions or the non-unitary dynamics of open quantum systems. Here we answer this question by a comprehensive analysis of the (analytically continued) one-dimensional transverse field Ising model. We exhaust all the Fisher zeros to show that in the thermodynamic limit they congregate into a remarkably simple pattern in the form of continuous open or closed lines. These Fisher lines evolve smoothly as the coupling constant is tuned, and a qualitative change identifies the quantum critical point. By exploiting the connection between $Z$ and the thermofield double states, we obtain analytical expressions for the short- and long-time dynamics of the survival amplitude and the scaling of recurrence time at the quantum critical point. We further point out that $Z$ can be realized and probed in monitored quantum circuits. The analytical results are corroborated by numerical tensor renormalization group, which elevates the approach outlined here to a powerful tool for interacting quantum systems.

Submitted 8 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

Comments: 11 pages; 3+4 figures

arXiv:2406.17992 [pdf, other]

Catching Chameleons: Detecting Evolving Disinformation Generated using Large Language Models

Authors: Bohan Jiang, Chengshuai Zhao, Zhen Tan, Huan Liu

Abstract: Despite recent advancements in detecting disinformation generated by large language models (LLMs), current efforts overlook the ever-evolving nature of this disinformation. In this work, we investigate a challenging yet practical research problem of detecting evolving LLM-generated disinformation. Disinformation evolves constantly through the rapid development of LLMs and their variants. As a consequence, the detection model faces significant challenges. First, it is inefficient to train separate models for each disinformation generator. Second, the performance decreases in scenarios when evolving LLM-generated disinformation is encountered in sequential order. To address this problem, we propose DELD (Detecting Evolving LLM-generated Disinformation), a parameter-efficient approach that jointly leverages the general fact-checking capabilities of pre-trained language models (PLM) and the independent disinformation generation characteristics of various LLMs. In particular, the learned characteristics are concatenated sequentially to facilitate knowledge accumulation and transformation. DELD addresses the issue of label scarcity by integrating the semantic embeddings of disinformation with trainable soft prompts to elicit model-specific knowledge. Our experiments show that DELD significantly outperforms state-of-the-art methods. Moreover, our method provides critical insights into the unique patterns of disinformation generation across different LLMs, offering valuable perspectives in this line of research.

Submitted 25 June, 2024; originally announced June 2024.

Comments: 10 pages, 5 figures

arXiv:2406.17841 [pdf, other]

Probing many-body Bell correlation depth with superconducting qubits

Authors: Ke Wang, Weikang Li, Shibo Xu, Mengyao Hu, Jiachen Chen, Yaozu Wu, Chuanyu Zhang, Feitong Jin, Xuhao Zhu, Yu Gao, Ziqi Tan, Aosai Zhang, Ning Wang, Yiren Zou, Tingting Li, Fanhao Shen, Jiarun Zhong, Zehang Bao, Zitian Zhu, Zixuan Song, Jinfeng Deng, Hang Dong, Xu Zhang, Pengfei Zhang, Wenjie Jiang , et al. (10 additional authors not shown)

Abstract: Quantum nonlocality describes a stronger form of quantum correlation than that of entanglement. It refutes Einstein's belief of local realism and is among the most distinctive and enigmatic features of quantum mechanics. It is a crucial resource for achieving quantum advantages in a variety of practical applications, ranging from cryptography and certified random number generation via self-testing to machine learning. Nevertheless, the detection of nonlocality, especially in quantum many-body systems, is notoriously challenging. Here, we report an experimental certification of genuine multipartite Bell correlations, which signal nonlocality in quantum many-body systems, up to 24 qubits with a fully programmable superconducting quantum processor. In particular, we employ energy as a Bell correlation witness and variationally decrease the energy of a many-body system across a hierarchy of thresholds, below which an increasing Bell correlation depth can be certified from experimental data. As an illustrating example, we variationally prepare the low-energy state of a two-dimensional honeycomb model with 73 qubits and certify its Bell correlations by measuring an energy that surpasses the corresponding classical bound with up to 48 standard deviations. In addition, we variationally prepare a sequence of low-energy states and certify their genuine multipartite Bell correlations up to 24 qubits via energies measured efficiently by parity oscillation and multiple quantum coherence techniques. Our results establish a viable approach for preparing and certifying multipartite Bell correlations, which provide not only a finer benchmark beyond entanglement for quantum devices, but also a valuable guide towards exploiting multipartite Bell correlation in a wide spectrum of practical applications.

Submitted 25 June, 2024; originally announced June 2024.

Comments: 11 pages, 6 figures + 14 pages, 6 figures

arXiv:2406.16562 [pdf, other]

EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models

Authors: Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Mengping Yang, Cheng Zhang, Hao Li

Abstract: The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We Supervised Fine-Tune (SFT) the MLLM to align closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.

Submitted 26 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

Comments: Github Repository: https://github.com/SAIS-FUXI/EvalAlign

arXiv:2406.16260 [pdf, other]

Video-Infinity: Distributed Long Video Generation

Authors: Zhenxiong Tan, Xingyi Yang, Songhua Liu, Xinchao Wang

Abstract: Diffusion models have recently achieved remarkable results for video generation. Despite the encouraging performances, the generated videos are typically constrained to a small number of frames, resulting in clips lasting merely a few seconds. The primary challenges in producing longer videos include the substantial memory requirements and the extended processing time required on a single GPU. A straightforward solution would be to split the workload across multiple GPUs, which, however, leads to two issues: (1) ensuring all GPUs communicate effectively to share timing and context information, and (2) modifying existing video diffusion models, which are usually trained on short sequences, to create longer videos without additional training. To tackle these, in this paper we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. Specifically, we propose two coherent mechanisms: Clip parallelism and Dual-scope attention. Clip parallelism optimizes the gathering and sharing of context information across GPUs which minimizes communication overhead, while Dual-scope attention modulates the temporal self-attention to balance local and global contexts efficiently across the devices. Together, the two mechanisms join forces to distribute the workload and enable the fast generation of long videos. Under an 8 x Nvidia 6000 Ada GPU (48G) setup, our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than the prior methods.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.16003 [pdf]

Unidirectional Chiral Emission via Twisted Bi-layer Metasurfaces

Authors: Dmitrii Gromyko, Shu An, Sergey Gorelik, Jiahui Xu, Li Jun Lim, Henry Yit Loong Lee, Febiana Tjiptoharsono, Zhi-Kuang Tan, Cheng-Wei Qiu, Zhaogang Dong, Lin Wu

Abstract: Controlling and channelling light emissions from unpolarized quantum dots into specific directions with chiral polarization remains a key challenge in modern photonics. Stacked metasurface designs offer a potential compact solution for chirality and directionality engineering. However, experimental observations of directional chiral radiation from resonant metasurfaces with quantum emitters remain obscure. In this paper, we present experimental observations of unidirectional chiral emission from a twisted bi-layer metasurface via multi-dimensional control, including twist angle, interlayer distance, and lateral displacement between the top and bottom layers, as enabled by doublet alignment lithography (DAL). First, maintaining alignment, the metasurface demonstrates a resonant intrinsic optical chirality with near-unity circular dichroism of 0.94 and reflectance difference of 74%, where a high circular dichroism greater than 0.9 persists across a wide range of angles from -11 to 11 degrees. Second, engineered lateral displacement induces a unidirectional chiral resonance, resulting in unidirectional chiral emission from the quantum dots deposited onto the metasurface. Our bi-layer metasurfaces offer a universal compact platform for efficient radiation manipulation over a wide angular range, promising potential applications in miniaturized lasers, grating couplers, and chiral nanoantennas.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 16 pages, 4 figures

arXiv:2406.15992 [pdf, other]

Can LLM Graph Reasoning Generalize beyond Pattern Memorization?

Authors: Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xiaochuang Han, Tianxing He, Yulia Tsvetkov

Abstract: Large language models (LLMs) demonstrate great potential for problems with implicit graphical structures, while recent works seek to enhance the graph reasoning capabilities of LLMs through specialized instruction tuning. The resulting 'graph LLMs' are evaluated with in-distribution settings only, thus it remains underexplored whether LLMs are learning generalizable graph reasoning skills or merely memorizing patterns in the synthetic training data. To this end, we propose the NLGift benchmark, an evaluation suite of LLM graph reasoning generalization: whether LLMs could go beyond semantic, numeric, structural, reasoning patterns in the synthetic training data and improve utility on real-world graph-based tasks. Extensive experiments with two LLMs across four graph reasoning tasks demonstrate that while generalization on simple patterns (semantic, numeric) is somewhat satisfactory, LLMs struggle to generalize across reasoning and real-world patterns, casting doubt on the benefit of synthetic graph tuning for real-world tasks with underlying network structures. We explore three strategies to improve LLM graph reasoning generalization, and we find that while post-training alignment is most promising for real-world tasks, empowering LLM graph reasoning to go beyond pattern memorization remains an open research question.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 16 pages, 6 figures, Code and data will be publicly available at https://github.com/MatthewYZhang/NLGift

ACM Class: I.2.7

arXiv:2406.13906 [pdf, other]

Semi-supervised Regression Analysis with Model Misspecification and High-dimensional Data

Authors: Ye Tian, Peng Wu, Zhiqiang Tan

Abstract: The accessibility of vast volumes of unlabeled data has sparked growing interest in semi-supervised learning (SSL) and covariate shift transfer learning (CSTL). In this paper, we present an inference framework for estimating regression coefficients in conditional mean models within both SSL and CSTL settings, while allowing for the misspecification of conditional mean models. We develop an augmented inverse probability weighted (AIPW) method, employing regularized calibrated estimators for both propensity score (PS) and outcome regression (OR) nuisance models, with PS and OR models being sequentially dependent. We show that when the PS model is correctly specified, the proposed estimator achieves consistency, asymptotic normality, and valid confidence intervals, even with possible OR model misspecification and high-dimensional data. Moreover, by suppressing detailed technical choices, we demonstrate that previous methods can be unified within our AIPW framework. Our theoretical findings are verified through extensive simulation studies and a real-world data application.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13779 [pdf, other]

doi 10.1145/3637528.3672065

FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering

Authors: Tianchi Cai, Zhiwen Tan, Xierui Song, Tao Sun, Jiyan Jiang, Yunqi Xu, Yinger Zhang, Jinjie Gu

Abstract: Retrieval Augmented Generation (RAG) has become prevalent in question-answering (QA) tasks due to its ability to use search engines to enhance the quality of long-form question-answering (LFQA). Despite the emergence of various open source methods and web-enhanced commercial systems such as Bing Chat, two critical problems remain unsolved, i.e., the lack of factuality and clear logic in the generated long-form answers. In this paper, we remedy these issues via a systematic study on answer generation in web-enhanced LFQA. Specifically, we first propose a novel outline-enhanced generator to achieve clear logic in the generation of multifaceted answers and construct two datasets accordingly. Then we propose a factuality optimization method based on a carefully designed doubly fine-grained RLHF framework, which contains automatic evaluation and reward modeling in different levels of granularity. Our generic framework comprises conventional fine-grained RLHF methods as special cases. Extensive experiments verify the superiority of our proposed Factuality-optimized RAG (FoRAG) method on both English and Chinese benchmarks. In particular, when applying our method to Llama2-7B-chat, the derived model FoRAG-L-7B outperforms WebGPT-175B in terms of three commonly used metrics (i.e., coherence, helpfulness, and factuality), while the number of parameters is much smaller (only 1/24 of that of WebGPT-175B). Our datasets and models are made publicly available for better reproducibility: https://huggingface.co/forag.

Submitted 19 June, 2024; originally announced June 2024.

Report number: 30th

Journal ref: KDD 2024

arXiv:2406.12867 [pdf, ps, other]

Classification of quasi-affine Generalized Dynkin Diagrams with Rank $3$ and Rank $2$

Authors: Zhengtang Tan, Shouchuan Zhang

Abstract: All quasi-affine connected Generalized Dynkin Diagrams with rank $3$ and $2$ are found. All quasi-affine Nichols (Lie braided) algebras with rank $3$ and $2$ are also found.

Submitted 9 April, 2024; originally announced June 2024.

Comments: 338 pages

MSC Class: 16W30; 16G10

arXiv:2406.12313 [pdf]

A framework for developing a knowledge management platform

Authors: Marie Lisandra Zepeda Mendoza, Sonali Agarwal, James A. Blackshaw, Vanesa Bol, Audrey Fazzi, Filippo Fiorini, Amy Louise Foreman, Nancy George, Brett R. Johnson, Brian Martin, Dave McComb, Euphemia Mutasa-Gottgens, Helen Parkinson, Martin Romacker, Rolf Russell, Valérien Ségard, Shawn Zheng Kai Tan, Wei Kheng Teh, F. P. Winstanley, Benedict Wong, Adrian M. Smith

Abstract: Knowledge management (KM) involves collecting, organizing, storing, and disseminating information to improve decision-making, innovation, and performance. Implementing KM at scale has become essential for organizations to effectively leverage vast accessible data. This paper is a compilation of concepts that emerged from KM workshops hosted by EMBL-EBI, attended by SMEs and industry. We provide guidance on envisioning, executing, evaluating, and evolving knowledge management platforms. We emphasize essential considerations such as setting knowledge domain boundaries and measuring success, as well as the importance of making knowledge accessible for downstream applications and non-computational users, and we highlight the personal and organizational skills necessary for success. We stress the importance of collaboration and the need for convergence on shared principles and commitment to provide or seek resources to advance KM. The community is invited to join the journey of KM and contribute to the advancement of the field by applying and improving on the guidelines described.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 18 pages, 1 figure

arXiv:2406.10471 [pdf, other]

Personalized Pieces: Efficient Personalized Large Language Models through Collaborative Efforts

Authors: Zhaoxuan Tan, Zheyuan Liu, Meng Jiang

Abstract: Personalized large language models (LLMs) aim to tailor interactions, content, and recommendations to individual user preferences. While parameter-efficient fine-tuning (PEFT) methods excel in performance and generalization, they are costly and limit communal benefits when used individually. To this end, we introduce Personalized Pieces (Per-Pcs), a framework that allows users to safely share and assemble personalized PEFT efficiently with collaborative efforts. Per-Pcs involves selecting sharers, breaking their PEFT into pieces, and training gates for each piece. These pieces are added to a pool, from which target users can select and assemble personalized PEFT using their history data. This approach preserves privacy and enables fine-grained user modeling without excessive storage and computation demands. Experimental results show Per-Pcs outperforms non-personalized and PEFT retrieval baselines, offering performance comparable to OPPU with significantly lower resource use across six tasks. Further analysis highlights Per-Pcs's robustness concerning sharer count and selection strategy, pieces sharing ratio, and scalability in computation time and storage space. Per-Pcs's modularity promotes safe sharing, making LLM personalization more efficient, effective, and widely accessible through collaborative efforts.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09899 [pdf, other]

Learning Solution-Aware Transformers for Efficiently Solving Quadratic Assignment Problem

Authors: Zhentao Tan, Yadong Mu

Abstract: Recently, various optimization problems, such as Mixed Integer Linear Programming Problems (MILPs), have undergone comprehensive investigation, leveraging the capabilities of machine learning. This work focuses on learning-based solutions for efficiently solving the Quadratic Assignment Problem (QAP), which stands as a formidable challenge in combinatorial optimization. While many instances of simpler problems admit a fully polynomial-time approximation scheme (FPTAS), QAP is shown to be strongly NP-hard. Even finding an FPTAS for QAP is difficult, in the sense that the existence of an FPTAS implies $P = NP$. Current research on QAPs suffers from limited scale and computational inefficiency. To attack the aforementioned issues, we here propose the first solution of its kind for QAP in the learn-to-improve category. This work encodes facility and location nodes separately, instead of forming computationally intensive association graphs prevalent in current approaches. This design choice enables scalability to larger problem sizes. Furthermore, a Solution AWare Transformer (SAWT) architecture integrates the incumbent solution matrix with the attention score to effectively capture higher-order information of the QAPs. Our model's effectiveness is validated through extensive experiments on self-generated QAP instances of varying sizes and the QAPLIB benchmark.

Submitted 19 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by ICML 2024

arXiv:2406.08785 [pdf, other]

BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection

Authors: Wenjie Wang, Yehao Lu, Guangcong Zheng, Shuigen Zhan, Xiaoqing Ye, Zichang Tan, Jingdong Wang, Gaoang Wang, Xi Li

Abstract: Vision-based roadside 3D object detection has attracted rising attention in the autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding perception range. Previous work mainly focuses on accurately estimating depth or height for 2D-to-3D mapping, ignoring the position approximation error in the voxel pooling process. Inspired by this insight, we propose a novel voxel pooling strategy to reduce such error, dubbed BEVSpread. Specifically, instead of bringing the image features contained in a frustum point to a single BEV grid, BEVSpread considers each frustum point as a source and spreads the image features to the surrounding BEV grids with adaptive weights. To achieve superior propagation performance, a specific weight function is designed to dynamically control the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration, BEVSpread achieves inference time comparable to that of the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that, as a plug-in, BEVSpread can significantly improve the performance of existing frustum-based BEV methods by a large margin of (1.12, 5.26, 3.01) AP in vehicle, pedestrian and cyclist.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.06911 [pdf, other]

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Authors: Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang

Abstract: Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component is facilitated to compute in parallel on separate devices. The proposed strategy significantly reduces inference latency while minimally impacting the generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances. The code is available at https://github.com/czg1225/AsyncDiff.

Submitted 27 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Work in progress. Project Page: https://czg1225.github.io/asyncdiff_page/

arXiv:2406.06904 [pdf, other]

Person Transfer in the Field: Examining Real World Sequential Human-Robot Interaction Between Two Robots

Authors: Xiang Zhi Tan, Elizabeth J. Carter, Aaron Steinfeld

Abstract: With more robots being deployed in the world, users will likely interact with multiple robots sequentially when receiving services. In this paper, we describe an exploratory field study in which unsuspecting participants experienced a "person transfer" -- a scenario in which they first interacted with one stationary robot before another mobile robot joined to complete the interaction. In our 7-hour study spanning 4 days, we recorded 18 instances of person transfers with 40+ individuals. We also interviewed 11 participants after the interaction to further understand their experience. We used the recorded video and interview data to extract interesting insights about in-the-field sequential human-robot interaction, such as mobile robot handovers, trust in person transfer, and the importance of the robots' positions. Our findings expose pitfalls and present important factors to consider when designing sequential human-robot interaction.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to RO-MAN 2024

arXiv:2406.06295 [pdf, other]

Zero-Shot Audio Captioning Using Soft and Hard Prompts

Authors: Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma

Abstract: In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, which, however, has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space. In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP. We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, facilitating the model to generate captions more robustly and accurately. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing

arXiv:2406.06160 [pdf, other]

The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems

Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

Abstract: The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but whether this is also required for speech enhancement is unknown. Moreover, studies that investigated the effect of training dataset size did not control for the data diversity. It is thus unclear whether the performance improvement was due to the increased dataset size or diversity. Therefore, we systematically investigate the effect of training dataset size on the performance of popular state-of-the-art discriminative and diffusion-based speech enhancement systems. We control for the data diversity by using a fixed set of speech utterances, noise segments and binaural room impulse responses to generate datasets of different sizes. We find that the diffusion-based systems do not benefit from increasing the training dataset size as much as the discriminative systems. They perform the best relative to the discriminative systems with datasets of 10 h or less, but they are outperformed by the discriminative systems with datasets of 100 h or more.

Submitted 10 June, 2024; originally announced June 2024.

Showing 1–50 of 732 results for author: Tan, Z