AgentMove: Predicting Human Mobility Anywhere Using Large Language Model based Agentic Framework

Authors: Jie Feng, Yuwei Du, Jie Zhao, Yong Li

Abstract: Human mobility prediction plays a crucial role in various real-world applications. Although deep learning based models have shown promising results over the past decade, their reliance on extensive private mobility data for training and their inability to perform zero-shot predictions have hindered further advancements. Recently, attempts have been made to apply large language models (LLMs) to the mobility prediction task. However, their performance has been constrained by the absence of a systematic workflow design. They directly generate the final output using LLMs, which limits the potential of LLMs to uncover complex mobility patterns and underestimates their extensive reserve of global geospatial knowledge. In this paper, we introduce AgentMove, a systematic agentic prediction framework to achieve generalized mobility prediction for any city worldwide. In AgentMove, we first decompose the mobility prediction task into three sub-tasks and then design corresponding modules to complete these sub-tasks: a spatial-temporal memory for individual mobility pattern mining, a world knowledge generator for modeling the effects of urban structure, and a collective knowledge extractor for capturing the shared patterns among the population. Finally, we combine the results of the three modules and conduct a reasoning step to generate the final predictions. Extensive experiments on mobility data from two sources in 12 cities demonstrate that AgentMove outperforms the best baseline by more than 8% in various metrics, and that it shows robust predictions with various LLMs as base and less geographical bias across cities. Code and data can be found at https://github.com/tsinghua-fib-lab/AgentMove.

Submitted 25 August, 2024; originally announced August 2024.

Comments: 13 pages

arXiv:2408.13980 [pdf, other]

FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

Authors: Daixun Li, Weiying Xie, Mingxiang Cao, Yunke Wang, Jiaqing Zhang, Yunsong Li, Leyuan Fang, Chang Xu

Abstract: Multimodal image fusion and segmentation enhance scene understanding in autonomous driving by integrating data from various sensors. However, current models struggle to efficiently segment densely packed elements in such scenes, due to the absence of comprehensive fusion features that can guide mid-process fine-tuning and focus attention on relevant areas. The Segment Anything Model (SAM) has emerged as a transformative segmentation method. It provides more effective prompts through its flexible prompt encoder, compared to transformers lacking fine-tuned control. Nevertheless, SAM has not been extensively studied in the domain of multimodal fusion for natural images. In this paper, we introduce SAM into multimodal image segmentation for the first time, proposing a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules to enhance SAM's multimodal fusion and segmentation capabilities. Specifically, we first obtain latent space features of the two modalities through vector quantization and embed them into a cross-attention-based inter-domain fusion module to establish long-range dependencies between modalities. Then, we use these comprehensive fusion features as prompts to guide precise pixel-level segmentation. Extensive experiments on several public datasets demonstrate that the proposed method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios, achieving at least 3.9% higher segmentation mIoU than the state-of-the-art approaches.

Submitted 25 August, 2024; originally announced August 2024.

arXiv:2408.13977 [pdf, other]

Say Your Reason: Extract Contextual Rules In Situ for Context-aware Service Recommendation

Authors: Yuxuan Li, Jiahui Li, Lihang Pan, Chun Yu, Yuanchun Shi

Abstract: This paper introduces SayRea, an interactive system that facilitates the extraction of contextual rules for personalized context-aware service recommendations in mobile scenarios. The system monitors a user's execution of registered services on their smartphones (via accessibility service) and proactively requests a single-sentence reason from the user. By utilizing a Large Language Model (LLM), SayRea parses the reason and predicts contextual relationships between the observed service and potential contexts (such as setting the alarm clock deep in the evening). In this way, SayRea can significantly reduce the cognitive load on users in anticipating future needs and selecting contextual attributes. A 10-day field study involving 20 participants showed that SayRea accumulated an average of 62.4 rules per user and successfully recommended 45% of service usage. The participants provided positive feedback on the system's usability, interpretability, and controllability. The findings highlight SayRea's effectiveness in personalized service recommendations and its potential to enhance user experience in mobile scenarios.

Submitted 25 August, 2024; originally announced August 2024.

arXiv:2408.13759 [pdf, other]

MASQ: Multi-Agent Reinforcement Learning for Single Quadruped Robot Locomotion

Authors: Qi Liu, Jingxiang Guo, Sixu Lin, Shuaikang Ma, Jinxuan Zhu, Yanjie Li

Abstract: This paper proposes a novel method to improve locomotion learning for a single quadruped robot using multi-agent deep reinforcement learning (MARL). Many existing methods use single-agent reinforcement learning for an individual robot or MARL for the cooperative task in multi-robot systems. Unlike existing methods, this paper proposes using MARL for the locomotion learning of a single quadruped robot. We develop a learning structure called Multi-Agent Reinforcement Learning for Single Quadruped Robot Locomotion (MASQ), considering each leg as an agent to explore the action space of the quadruped robot, sharing a global critic, and learning collaboratively. Experimental results indicate that MASQ not only speeds up learning convergence but also enhances robustness in real-world settings, suggesting that applying MASQ to single robots such as quadrupeds could surpass traditional single-robot reinforcement learning approaches. Our study provides insightful guidance on integrating MARL with single-robot locomotion learning.

Submitted 25 August, 2024; originally announced August 2024.

arXiv:2408.13750 [pdf, other]

Multi-Agent Target Assignment and Path Finding for Intelligent Warehouse: A Cooperative Multi-Agent Deep Reinforcement Learning Perspective

Authors: Qi Liu, Jianqi Gao, Dongjie Zhu, Xizheng Pang, Pengbin Chen, Jingxiang Guo, Yanjie Li

Abstract: Multi-agent target assignment and path planning (TAPF) are two key problems in intelligent warehouses. However, most literature addresses only one of these two problems at a time. In this study, we propose a method to simultaneously solve target assignment and path planning from the perspective of cooperative multi-agent deep reinforcement learning (RL). To the best of our knowledge, this is the first work to model the TAPF problem for intelligent warehouses as a cooperative multi-agent deep RL problem, and the first to simultaneously address TAPF based on multi-agent deep RL. Furthermore, previous literature rarely considers the physical dynamics of agents. In this study, the physical dynamics of the agents are considered. Experimental results show that our method performs well in various task settings, which means that the target assignment is solved reasonably well and the planned path is close to the shortest. Moreover, our method is more time-efficient than the baselines.

Submitted 25 August, 2024; originally announced August 2024.

arXiv:2408.13738 [pdf, other]

Poor-Supervised Evaluation for SuperLLM via Mutual Consistency

Authors: Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li

Abstract: The guidance from capability evaluations has greatly propelled the progress of both human society and Artificial Intelligence. However, as LLMs evolve, it becomes challenging to construct evaluation benchmarks for them with accurate labels on hard tasks that approach the boundaries of human capabilities. To credibly conduct evaluation without accurate labels (denoted as poor-supervised evaluation), we propose the PoEM framework. We first prove that the capability of a model can be equivalently assessed by the consistency between it and a certain reference model, when their prediction distributions are independent and the sample size is infinite. Because these conditions are not fully met in practice, we further introduce an algorithm that treats humans (when available) and the models under evaluation as reference models, alternately conducting model weight calibration and filtering during the E-step and M-step. Comprehensive experiments across 3 types of tasks with 16 mainstream LLMs have shown that PoEM under poor supervision can achieve an average Pearson correlation coefficient of 0.98 with supervised evaluation results, demonstrating good effectiveness, efficiency and generalizability. More generally, PoEM has advanced the evolution of the evaluation paradigm from human-centric to human&model-centric by treating both of them as reference models, mitigating the limitations of human evaluation in the era of LLMs.

Submitted 25 August, 2024; originally announced August 2024.

Comments: ACL findings

arXiv:2408.13728 [pdf, other]

3D-RCNet: Learning from Transformer to Build a 3D Relational ConvNet for Hyperspectral Image Classification

Authors: Haizhao Jing, Liuwei Wan, Xizhe Xue, Haokui Zhang, Ying Li

Abstract: Recently, the Vision Transformer (ViT) model has replaced the classical Convolutional Neural Network (ConvNet) in various computer vision tasks due to its superior performance. Even in the hyperspectral image (HSI) classification field, ViT-based methods show promising potential. Nevertheless, ViT encounters notable difficulties in processing HSI data. Its self-attention mechanism, which exhibits quadratic complexity, escalates computational costs. Additionally, ViT's substantial demand for training samples does not align with the practical constraints posed by the expensive labeling of HSI data. To overcome these challenges, we propose a 3D relational ConvNet named 3D-RCNet, which inherits the strengths of both ConvNet and ViT, resulting in high performance in HSI classification. We embed the self-attention mechanism of Transformer into the convolutional operation of ConvNet to design a 3D relational convolutional operation and use it to build the final 3D-RCNet. The proposed 3D-RCNet maintains the high computational efficiency of ConvNet while enjoying the flexibility of ViT. Additionally, the proposed 3D relational convolutional operation is a plug-and-play operation, which can be inserted into previous ConvNet-based HSI classification methods seamlessly. Empirical evaluations on three representative benchmark HSI datasets show that the proposed model outperforms previous ConvNet-based and ViT-based HSI approaches.

Submitted 25 August, 2024; originally announced August 2024.

arXiv:2408.13705 [pdf, other]

Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval

Authors: Lifeng Zhou, Yuke Li, Rui Deng, Yuting Yang, Haoqi Zhu

Abstract: The success of speech-image retrieval relies on establishing an effective alignment between speech and image. Existing methods often model cross-modal interaction through simple cosine similarity of the global feature of each modality, which falls short in capturing fine-grained details within modalities. To address this issue, we introduce an effective framework and a novel learning task named cross-modal denoising (CMD) to enhance cross-modal interaction and achieve finer-level cross-modal alignment. Specifically, CMD is a denoising task designed to reconstruct semantic features from noisy features within one modality by interacting with features from another modality. Notably, CMD operates exclusively during model training and can be removed during inference without adding extra inference time. The experimental results demonstrate that our framework outperforms the state-of-the-art method by 2.0% in mean R@1 on the Flickr8k dataset and by 1.7% in mean R@1 on the SpokenCOCO dataset for the speech-image retrieval tasks. These experimental results validate the efficiency and effectiveness of our framework.

Submitted 14 August, 2024; originally announced August 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2408.13119
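The abstract above contrasts CMD with a baseline that scores speech-image pairs by cosine similarity of one global feature per modality, and it reports results as R@1. As a minimal sketch of that baseline and metric only (not of CMD itself), the snippet below ranks images for each speech query and counts how often the paired image comes first. The toy feature vectors, the function names, and the convention that pairs share an index are all assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_1(speech_feats, image_feats):
    """Fraction of speech queries whose top-ranked image is the paired one.

    Assumption: speech_feats[i] and image_feats[i] form a ground-truth pair.
    """
    hits = 0
    for i, query in enumerate(speech_feats):
        scores = [cosine_similarity(query, img) for img in image_feats]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(speech_feats)

# Toy global features (hypothetical values, one vector per utterance/image).
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
images = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
print(recall_at_1(speech, images))
```

Because a single global vector summarizes each input, two images that differ only in local detail can receive nearly identical scores, which is the limitation the abstract attributes to this kind of baseline.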

arXiv:2408.13687 [pdf, other]

Quantum error correction below the surface code threshold

Authors: Rajeev Acharya, Laleh Aghababaie-Beni, Igor Aleiner, Trond I. Andersen, Markus Ansmann, Frank Arute, Kunal Arya, Abraham Asfaw, Nikita Astrakhantsev, Juan Atalaya, Ryan Babbush, Dave Bacon, Brian Ballard, Joseph C. Bardin, Johannes Bausch, Andreas Bengtsson, Alexander Bilmes, Sam Blackwell, Sergio Boixo, Gina Bortoli, Alexandre Bourassa, Jenna Bovaird, Leon Brill, Michael Broughton, David A. Browne , et al. (224 additional authors not shown)

Abstract: Quantum error correction provides a path to reach practical quantum computing by combining multiple physical qubits into a logical qubit, where the logical error rate is suppressed exponentially as more qubits are added. However, this exponential suppression only occurs if the physical error rate is below a critical threshold. In this work, we present two surface code memories operating below this threshold: a distance-7 code and a distance-5 code integrated with a real-time decoder. The logical error rate of our larger quantum memory is suppressed by a factor of $Λ$ = 2.14 $\pm$ 0.02 when increasing the code distance by two, culminating in a 101-qubit distance-7 code with 0.143% $\pm$ 0.003% error per cycle of error correction. This logical memory is also beyond break-even, exceeding its best physical qubit's lifetime by a factor of 2.4 $\pm$ 0.3. We maintain below-threshold performance when decoding in real time, achieving an average decoder latency of 63 $μ$s at distance-5 up to a million cycles, with a cycle time of 1.1 $μ$s. To probe the limits of our error-correction performance, we run repetition codes up to distance-29 and find that logical performance is limited by rare correlated error events occurring approximately once every hour, or 3 $\times$ 10$^9$ cycles. Our results present device performance that, if scaled, could realize the operational requirements of large scale fault-tolerant quantum algorithms.

Submitted 24 August, 2024; originally announced August 2024.

Comments: 10 pages, 4 figures, Supplementary Information
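The figures quoted in the abstract above can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the abstract (cycle time, cycles between correlated events, Λ, and the distance-7 error per cycle). The extrapolation helper assumes that Λ stays constant as the code distance changes; that is an illustrative assumption, not a result claimed in the abstract.

```python
CYCLE_TIME_S = 1.1e-6          # cycle time from the abstract: 1.1 microseconds
CYCLES_BETWEEN_EVENTS = 3e9    # correlated error events: one per 3 x 10^9 cycles
LAMBDA = 2.14                  # suppression factor per distance increase of two
ERR_D7 = 0.143e-2              # logical error per cycle at distance 7 (0.143%)

# 3e9 cycles at 1.1 us each is about 3300 s, i.e. roughly 55 minutes,
# consistent with "approximately once every hour".
seconds_between_events = CYCLES_BETWEEN_EVENTS * CYCLE_TIME_S
minutes_between_events = seconds_between_events / 60

def logical_error(distance, ref_distance=7, ref_error=ERR_D7, lam=LAMBDA):
    """Extrapolate error per cycle, ASSUMING a constant suppression factor.

    Each increase of the code distance by two divides the error by `lam`.
    """
    steps = (distance - ref_distance) / 2
    return ref_error / lam ** steps

print(f"{seconds_between_events:.0f} s = {minutes_between_events:.0f} min")
for d in (5, 7, 9):
    print(f"distance {d}: {100 * logical_error(d):.4f}% per cycle")
```

Under that constant-Λ assumption a distance-9 code would sit near 0.067% per cycle. The repetition-code result in the abstract is a reminder that rare correlated events can set a floor that such a simple extrapolation ignores.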

arXiv:2408.13578 [pdf, other]

Adaptive Graded Denoising of Seismic Data Based on Noise Estimation and Local Similarity

Authors: Xueting Yang, Yong Li, Zhangquan Liao, Yingtian Liu, Junheng Peng

Abstract: Seismic data denoising is an important part of seismic data processing, which directly relates to the follow-up processing of seismic data. To address this issue, many authors have proposed methods based on rank reduction, sparse transformation, domain transformation, and deep learning. However, when the seismic data is noisy, complex and uneven, these methods often lead to over-denoising or under-denoising. To solve this problem, we propose a novel method called noise level estimation and similarity segmentation for graded denoising. Specifically, we first assess the average noise level of the entire seismic data and denoise it using the block matching and three-dimensional filtering (BM3D) method. Then, the denoised data is contrasted with the residual using local similarity, pinpointing regions where noise levels deviate significantly from the average. These regions are then re-evaluated and denoised, while the remaining data is retained intact. Finally, we integrate the data retained after the first denoising with the re-denoised data to obtain complete and cleaner data. This method is verified on a theoretical model and actual seismic data. The experimental results show that this method has a good effect on seismic data with uneven noise.

Submitted 24 August, 2024; originally announced August 2024.

Comments: This article has been submitted to geophysics

MSC Class: 86-10 ACM Class: I.4.4

arXiv:2408.13499 [pdf, other]

R2G: Reasoning to Ground in 3D Scenes

Authors: Yixuan Li, Zan Wang, Wei Liang

Abstract: We propose Reasoning to Ground (R2G), a neural symbolic model that grounds the target objects within 3D scenes in a reasoning manner. In contrast to prior works, R2G explicitly models the 3D scene with a semantic concept-based scene graph, recurrently simulates the attention transferring across object entities, and thus makes the process of grounding the target objects with the highest probability interpretable. Specifically, we embed multiple object properties within the graph nodes and spatial relations among entities within the edges, utilizing a predefined semantic vocabulary. To guide attention transferring, we employ learning or prompting-based methods to analyze the referential utterance and convert it into reasoning instructions within the same semantic space. In each reasoning round, R2G either (1) merges current attention distribution with the similarity between the instruction and embedded entity properties or (2) shifts the attention across the scene graph based on the similarity between the instruction and embedded spatial relations. The experiments on Sr3D/Nr3D benchmarks show that R2G achieves results comparable with prior works while maintaining improved interpretability, breaking a new path for 3D language grounding.

Submitted 24 August, 2024; originally announced August 2024.

arXiv:2408.13471 [pdf, other]

Disentangled Generative Graph Representation Learning

Authors: Xinyue Hu, Zhibin Duan, Xinyang Liu, Yuxin Li, Bo Chen, Mingyuan Zhou

Abstract: Recently, generative graph models have shown promising results in learning graph representations through self-supervised methods. However, most existing generative graph representation learning (GRL) approaches rely on random masking across the entire graph, which overlooks the entanglement of learned representations. This oversight results in non-robustness and a lack of explainability. Furthermore, disentangling the learned representations remains a significant challenge and has not been sufficiently explored in GRL research. Based on these insights, this paper introduces DiGGR (Disentangled Generative Graph Representation Learning), a self-supervised learning framework. DiGGR aims to learn latent disentangled factors and utilizes them to guide graph mask modeling, thereby enhancing the disentanglement of learned representations and enabling end-to-end joint learning. Extensive experiments on 11 public datasets for two different graph learning tasks demonstrate that DiGGR consistently outperforms many previous self-supervised methods, verifying the effectiveness of the proposed approach.

Submitted 24 August, 2024; originally announced August 2024.

arXiv:2408.13457 [pdf, other]

Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning

Authors: Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li

Abstract: Self-consistency (SC), a widely used decoding strategy for chain-of-thought reasoning, shows significant gains across various multi-step reasoning tasks but comes with a high cost due to multiple sampling with a preset size. Its variants, Adaptive self-consistency (ASC) and Early-stopping self-consistency (ESC), dynamically adjust the number of samples based on the posterior distribution of a set of pre-samples, reducing the cost of SC with minimal impact on performance. Neither method, however, exploits prior information about question difficulty. This often results in unnecessary repeated sampling for easy questions that could be accurately answered with just one attempt, wasting resources. To tackle this problem, we propose Difficulty-Adaptive Self-Consistency (DSC), which leverages the difficulty information from both prior and posterior perspectives to adaptively allocate inference resources, further reducing the cost of SC. To demonstrate the effectiveness of DSC, we conduct extensive experiments on three popular categories of reasoning tasks (arithmetic, commonsense and symbolic reasoning) on six benchmarks. The empirical results show that DSC consistently surpasses the strong baselines ASC and ESC in terms of cost by a significant margin, while attaining comparable performance.

Submitted 24 August, 2024; originally announced August 2024.

Comments: Preprint
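To make the cost trade-off in the abstract above concrete, here is a minimal sketch of self-consistency as majority voting over sampled answers, with a simple posterior-style stop. The stopping rule (halt once the last `window` samples agree), the function name, and the canned samplers are hypothetical stand-ins chosen for illustration; they are not the criteria used by ASC, ESC, or DSC, and a real `sample_answer` would call an LLM.

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_answer, max_samples=10, window=3):
    """Majority-vote over sampled answers, stopping early on agreement.

    sample_answer: zero-argument callable returning one answer string
                   (hypothetical stand-in for one chain-of-thought sample).
    Returns (winning_answer, number_of_samples_drawn).
    """
    answers = []
    for _ in range(max_samples):
        answers.append(sample_answer())
        recent = answers[-window:]
        # Illustrative stop rule: the last `window` samples are identical.
        if len(recent) == window and len(set(recent)) == 1:
            break
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, len(answers)

# Canned samplers: an "easy" question always yields the same answer,
# a "hard" one yields scattered answers with "7" as the most frequent.
easy = cycle(["42"])
hard = cycle(["7", "9", "7", "8", "7"])
print(self_consistency(lambda: next(easy)))
print(self_consistency(lambda: next(hard)))
```

Even with early stopping, the easy question still costs `window` samples here. The abstract's point is that prior knowledge of difficulty could let such questions be answered with a single attempt.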

arXiv:2408.13385 [pdf, other]

MICM: Rethinking Unsupervised Pretraining for Enhanced Few-shot Learning

Authors: Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li, Ruixuan Li

Abstract: Humans exhibit a remarkable ability to learn quickly from a limited number of labeled samples, a capability that starkly contrasts with that of current machine learning systems. Unsupervised Few-Shot Learning (U-FSL) seeks to bridge this divide by reducing reliance on annotated datasets during initial training phases. In this work, we first quantitatively assess the impacts of Masked Image Modeling (MIM) and Contrastive Learning (CL) on few-shot learning tasks. Our findings highlight the respective limitations of MIM and CL in terms of discriminative and generalization abilities, which contribute to their underperformance in U-FSL contexts. To address these trade-offs between generalization and discriminability in unsupervised pretraining, we introduce a novel paradigm named Masked Image Contrastive Modeling (MICM). MICM creatively combines the targeted object learning strength of CL with the generalized visual feature learning capability of MIM, significantly enhancing its efficacy in downstream few-shot learning inference. Extensive experimental analyses confirm the advantages of MICM, demonstrating significant improvements in both generalization and discrimination capabilities for few-shot learning. Our comprehensive quantitative evaluations further substantiate the superiority of MICM, showing that our two-stage U-FSL framework based on MICM markedly outperforms existing leading baselines.

Submitted 23 August, 2024; originally announced August 2024.

Comments: ACMMM 2024 (Oral)

arXiv:2408.13373 [pdf, other]

Learning Unknowns from Unknowns: Diversified Negative Prototypes Generator for Few-Shot Open-Set Recognition

Authors: Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Yuhua Li, Ruixuan Li

Abstract: Few-shot open-set recognition (FSOR) is a challenging task that requires a model to recognize known classes and identify unknown classes with limited labeled data. Existing approaches, particularly Negative-Prototype-Based methods, generate negative prototypes based solely on known class data. However, as the unknown space is infinite while the known space is limited, these methods suffer from limited representation capability. To address this limitation, we propose a novel approach, termed Diversified Negative Prototypes Generator (DNPG), which adopts the principle of "learning unknowns from unknowns." Our method leverages the unknown space information learned from base classes to generate more representative negative prototypes for novel classes. During the pre-training phase, we learn the unknown space representation of the base classes. This representation, along with inter-class relationships, is then utilized in the meta-learning process to construct negative prototypes for novel classes. To prevent prototype collapse and ensure adaptability to varying data compositions, we introduce the Swap Alignment (SA) module. Our DNPG model, by learning from the unknown space, generates negative prototypes that cover a broader unknown space, thereby achieving state-of-the-art performance on three standard FSOR datasets.

Submitted 23 August, 2024; originally announced August 2024.

Comments: ACMMM 2024

arXiv:2408.13289 [pdf]

doi 10.1109/TSTE.2024.3449909

Optimal Dispatch Strategy for a Multi-microgrid Cooperative Alliance Using a Two-Stage Pricing Mechanism

Authors: Yonghui Nie, Zhi Li, Jie Zhang, Lei Gao, Yang Li, Hengyu Zhou

Abstract: To coordinate resources among multi-level stakeholders and enhance the integration of electric vehicles (EVs) into multi-microgrids, this study proposes an optimal dispatch strategy within a multi-microgrid cooperative alliance using a nuanced two-stage pricing mechanism. Initially, the strategy assesses electric energy interactions between microgrids and distribution networks to establish a foundation for collaborative scheduling. The two-stage pricing mechanism initiates with a leader-follower game, wherein the microgrid operator acts as the leader and users as followers. Subsequently, it adjusts EV tariffs based on the game's equilibrium, taking into account factors such as battery degradation and travel needs to optimize EVs' electricity consumption. Furthermore, a bi-level optimization model refines power interactions and pricing strategies across the network, significantly enhancing demand response capabilities and economic outcomes. Simulation results demonstrate that this strategy not only increases renewable energy consumption but also reduces energy costs, thereby improving the overall efficiency and sustainability of the system.

Submitted 23 August, 2024; originally announced August 2024.

Comments: Accepted by IEEE Transactions on Sustainable Energy, Paper no. TSTE-00122-2024

arXiv:2408.13252 [pdf, other]

LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Authors: Shuai Yang, Jing Tan, Mengchen Zhang, Tong Wu, Yixuan Li, Gordon Wetzstein, Ziwei Liu, Dahua Lin

Abstract: 3D immersive scene generation is a challenging yet critical task in computer vision and graphics. A desired virtual 3D scene should 1) exhibit omnidirectional view consistency, and 2) allow for free exploration in complex scene hierarchies. Existing methods either rely on successive scene expansion via inpainting or employ panorama representation to represent large FOV scene environments. However, the generated scene suffers from semantic drift during expansion and is unable to handle occlusion among scene hierarchies. To tackle these challenges, we introduce LayerPano3D, a novel framework for full-view, explorable panoramic 3D scene generation from a single text prompt. Our key insight is to decompose a reference 2D panorama into multiple layers at different depth levels, where each layer reveals the unseen space from the reference views via diffusion prior. LayerPano3D comprises multiple dedicated designs: 1) we introduce a novel text-guided anchor view synthesis pipeline for high-quality, consistent panorama generation. 2) We pioneer the Layered 3D Panorama as underlying representation to manage complex scene hierarchies and lift it into 3D Gaussians to splat detailed 360-degree omnidirectional scenes with unconstrained viewing paths. Extensive experiments demonstrate that our framework generates state-of-the-art 3D panoramic scene in both full view consistency and immersive exploratory experience. We believe that LayerPano3D holds promise for advancing 3D panoramic scene creation with numerous applications.

Submitted 23 August, 2024; originally announced August 2024.

Comments: Project page: https://ys-imtech.github.io/projects/LayerPano3D/

arXiv:2408.13134 [pdf, ps, other]

Optimal order time discretizations for stochastic semilinear wave equations with multiplicative noise

Authors: Xiaobing Feng, Yukun Li, Liet Vo

Abstract: This paper is concerned with developing and analyzing two novel implicit temporal discretization methods for the stochastic semilinear wave equations with multiplicative noise. The proposed methods are natural extensions of well-known time-discrete schemes for deterministic wave equations, hence, they are easy to implement. It is proved that both methods are energy-stable. Moreover, the first method is shown to converge with the linear order in the energy norm, while the second method converges with the $\mathcal{O}(\tau^{\frac{3}{2}})$ order in the $L^2$-norm, which is optimal with respect to the time regularity of the solution to the underlying stochastic PDE. The convergence analyses of both methods, which are different and quite involved, require some novel numerical techniques to overcome difficulties caused by the nonlinear noise term and the interplay between nonlinear drift and diffusion. Numerical experiments are provided to validate the sharpness of the theoretical error estimate results.

Submitted 23 August, 2024; originally announced August 2024.

Comments: 28 pages, 0 figure, 3 tables

MSC Class: 65N12; 65N15; 65N30

arXiv:2408.13119 [pdf, ps, other]

Coarse-to-fine Alignment Makes Better Speech-image Retrieval

Authors: Lifeng Zhou, Yuke Li

Abstract: In this paper, we propose a novel framework for speech-image retrieval. We utilize speech-image contrastive (SIC) learning tasks to align speech and image representations at a coarse level and speech-image matching (SIM) learning tasks to further refine the fine-grained cross-modal alignment. SIC and SIM learning tasks are jointly trained in a unified manner. To optimize the learning process, we utilize an embedding queue that facilitates efficient sampling of high-quality and diverse negative representations during SIC learning. Additionally, it enhances the learning of SIM tasks by effectively mining hard negatives based on contrastive similarities calculated in SIC tasks. To further optimize learning under noisy supervision, we incorporate momentum distillation into the training process. Experimental results show that our framework outperforms the state-of-the-art method by more than 4% in R@1 on two benchmark datasets for the speech-image retrieval tasks. Moreover, as observed in zero-shot experiments, our framework demonstrates excellent generalization capabilities.

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2408.12914 [pdf, ps, other]

A Recursion-Based SNR Determination Method for Short Packet Transmission: Analysis and Applications

Authors: Chengzhe Yin, Rui Zhang, Yongzhao Li, Yuhan Ruan, Tao Li, Jiaheng Lu

Abstract: The short packet transmission (SPT) has gained much attention in recent years. In SPT, the most significant characteristic is that the finite blocklength code (FBC) is adopted. With FBC, the signal-to-noise ratio (SNR) cannot be expressed as an explicit function with respect to the other transmission parameters. This raises the following two problems for the resource allocation in SPTs: (i) The exact value of the SNR is hard to determine, and (ii) The property of SNR w.r.t. the other parameters is hard to analyze, which hinders the efficient optimization of them. To simultaneously tackle these problems, we have developed a recursion method in our prior work. To emphasize the significance of this method, we further analyze the convergence rate of the recursion method and investigate the property of the recursion function in this paper. Specifically, we first analyze the convergence rate of the recursion method, which indicates it can determine the SNR with low complexity. Then, we analyze the property of the recursion function, which facilitates the optimization of the other parameters during the recursion. Finally, we also enumerate some applications for the recursion method. Simulation results indicate that the recursion method converges faster than the other SNR determination methods. Besides, the results also show that the recursion-based methods can almost achieve the optimal solution of the application cases.

Submitted 23 August, 2024; originally announced August 2024.

arXiv:2408.12897 [pdf, other]

When Diffusion MRI Meets Diffusion Model: A Novel Deep Generative Model for Diffusion MRI Generation

Authors: Xi Zhu, Wei Zhang, Yijie Li, Lauren J. O'Donnell, Fan Zhang

Abstract: Diffusion MRI (dMRI) is an advanced imaging technique characterizing tissue microstructure and white matter structural connectivity of the human brain. The demand for high-quality dMRI data is growing, driven by the need for better resolution and improved tissue contrast. However, acquiring high-quality dMRI data is expensive and time-consuming. In this context, deep generative modeling emerges as a promising solution to enhance image quality while minimizing acquisition costs and scanning time. In this study, we propose a novel generative approach to perform dMRI generation using deep diffusion models. It can generate high dimension (4D) and high resolution data preserving the gradients information and brain structure. We demonstrated our method through an image mapping task aimed at enhancing the quality of dMRI images from 3T to 7T. Our approach demonstrates highly enhanced performance in generating dMRI images when compared to the current state-of-the-art (SOTA) methods. This achievement underscores a substantial progression in enhancing dMRI quality, highlighting the potential of our novel generative approach to revolutionize dMRI imaging standards.

Submitted 23 August, 2024; originally announced August 2024.

Comments: 11 pages, 3 figures

arXiv:2408.12821 [pdf, other]

Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery

Authors: Zhenyuan Yang, Xuhui Lin, Qinyi He, Ziye Huang, Zhengliang Liu, Hanqi Jiang, Peng Shu, Zihao Wu, Yiwei Li, Stephen Law, Gengchen Mai, Tianming Liu, Tao Yang

Abstract: The emergence of Large Language Models (LLMs) and multimodal foundation models (FMs) has generated heightened interest in their applications that integrate vision and language. This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior by evaluating their performance across various tasks. The assessments include street furniture identification, pedestrian and car counts, and road width measurement in Street View Imagery; building function classification, building age analysis, building height analysis, and building structure classification in the Built Environment; and interior room classification, interior design style analysis, interior furniture counts, and interior length measurement in Interior. The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding, but highlight limitations in detailed recognition and counting tasks. While zero-shot learning shows potential, performance varies depending on the problem domains and image complexities. This study provides new insights into the strengths and weaknesses of multimodal foundation models for practical challenges in Street View Imagery, Built Environment, and Interior. Overall, the findings demonstrate foundational multimodal intelligence, emphasizing the potential of FMs to drive forward interdisciplinary applications at the intersection of computer vision and language.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.12803 [pdf, other]

Multi-Treatment Multi-Task Uplift Modeling for Enhancing User Growth

Authors: Yuxiang Wei, Zhaoxin Qiu, Yingjie Li, Yuke Sun, Xiaoling Li

Abstract: As a key component in boosting online user growth, uplift modeling aims to measure individual user responses (e.g., whether to play the game) to various treatments, such as gaming bonuses, thereby enhancing business outcomes. However, previous research typically considers a single-task, single-treatment setting, where only one treatment exists and the overall treatment effect is measured by a single type of user response. In this paper, we propose a Multi-Treatment Multi-Task (MTMT) uplift network to estimate treatment effects in a multi-task scenario. We identify the multi-treatment problem as a causal inference problem with a tiered response, comprising a base effect (from offering a treatment) and an incremental effect (from offering a specific type of treatment), where the base effect can be numerically much larger than the incremental effect. Specifically, MTMT separately encodes user features and treatments. The user feature encoder uses a multi-gate mixture of experts (MMOE) network to encode relevant user features, explicitly learning inter-task relations. The resultant embeddings are used to measure natural responses per task. Furthermore, we introduce a treatment-user feature interaction module to model correlations between each treatment and user feature. Consequently, we separately measure the base and incremental treatment effect for each task based on the produced treatment-aware representations. Experimental results based on an offline public dataset and an online proprietary dataset demonstrate the effectiveness of MTMT in single/multi-treatment and single/multi-task settings. Additionally, MTMT has been deployed in our gaming platform to improve user experience.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.12798 [pdf, other]

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Authors: Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Jun Sun

Abstract: Generative Large Language Models (LLMs) have made significant strides across various tasks, but they remain vulnerable to backdoor attacks, where specific triggers in the prompt cause the LLM to generate adversary-desired responses. While most backdoor research has focused on vision or text classification tasks, backdoor attacks in text generation have been largely overlooked. In this work, we introduce BackdoorLLM, the first comprehensive benchmark for studying backdoor attacks on LLMs. BackdoorLLM features: 1) a repository of backdoor benchmarks with a standardized training pipeline, 2) diverse attack strategies, including data poisoning, weight poisoning, hidden state attacks, and chain-of-thought attacks, 3) extensive evaluations with over 200 experiments on 8 attacks across 7 scenarios and 6 model architectures, and 4) key insights into the effectiveness and limitations of backdoors in LLMs. We hope BackdoorLLM will raise awareness of backdoor threats and contribute to advancing AI safety. The code is available at https://github.com/bboylyg/BackdoorLLM.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.12767 [pdf, other]

When In-memory Computing Meets Spiking Neural Networks -- A Perspective on Device-Circuit-System-and-Algorithm Co-design

Authors: Abhishek Moitra, Abhiroop Bhattacharjee, Yuhang Li, Youngeun Kim, Priyadarshini Panda

Abstract: This review explores the intersection of bio-plausible artificial intelligence in the form of Spiking Neural Networks (SNNs) with the analog In-Memory Computing (IMC) domain, highlighting their collective potential for low-power edge computing environments. Through detailed investigation at the device, circuit, and system levels, we highlight the pivotal synergies between SNNs and IMC architectures. Additionally, we emphasize the critical need for comprehensive system-level analyses, considering the inter-dependencies between algorithms, devices, circuit & system parameters, crucial for optimal performance. An in-depth analysis leads to identification of key system-level bottlenecks arising from device limitations which can be addressed using SNN-specific algorithm-hardware co-design techniques. This review underscores the imperative for holistic device to system design space co-exploration, highlighting the critical aspects of hardware and algorithm research endeavors for low-power neuromorphic solutions.

Submitted 22 August, 2024; originally announced August 2024.

Comments: 19 Pages, 13 Figures

arXiv:2408.12748 [pdf, other]

SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection

Authors: Mengya Hu, Rui Xu, Deren Lei, Yaxi Li, Mingyu Wang, Emily Ching, Eslam Kamal, Alex Deng

Abstract: Large language models (LLMs) are highly capable but face latency challenges in real-time applications, such as conducting online hallucination detection. To overcome this issue, we propose a novel framework that leverages a small language model (SLM) classifier for initial detection, followed by an LLM as a constrained reasoner to generate detailed explanations for detected hallucinated content. This study optimizes the real-time interpretable hallucination detection by introducing effective prompting techniques that align LLM-generated explanations with SLM decisions. Empirical experiment results demonstrate its effectiveness, thereby enhancing the overall user experience.

Submitted 22 August, 2024; originally announced August 2024.

Comments: preprint under review

arXiv:2408.12725 [pdf, other]

DUNE Phase II: Scientific Opportunities, Detector Concepts, Technological Solutions

Authors: DUNE Collaboration, A. Abed Abud, B. Abi, R. Acciarri, M. A. Acero, M. R. Adames, G. Adamov, M. Adamowski, D. Adams, M. Adinolfi, C. Adriano, A. Aduszkiewicz, J. Aguilar, F. Akbar, K. Allison, S. Alonso Monsalve, M. Alrashed, A. Alton, R. Alvarez, T. Alves, H. Amar, P. Amedo, J. Anderson, C. Andreopoulos, M. Andreotti, et al. (1347 additional authors not shown)

Abstract: The international collaboration designing and constructing the Deep Underground Neutrino Experiment (DUNE) at the Long-Baseline Neutrino Facility (LBNF) has developed a two-phase strategy toward the implementation of this leading-edge, large-scale science project. The 2023 report of the US Particle Physics Project Prioritization Panel (P5) reaffirmed this vision and strongly endorsed DUNE Phase I and Phase II, as did the European Strategy for Particle Physics. While the construction of the DUNE Phase I is well underway, this White Paper focuses on DUNE Phase II planning. DUNE Phase-II consists of a third and fourth far detector (FD) module, an upgraded near detector complex, and an enhanced 2.1 MW beam. The fourth FD module is conceived as a "Module of Opportunity", aimed at expanding the physics opportunities, in addition to supporting the core DUNE science program, with more advanced technologies. This document highlights the increased science opportunities offered by the DUNE Phase II near and far detectors, including long-baseline neutrino oscillation physics, neutrino astrophysics, and physics beyond the standard model. It describes the DUNE Phase II near and far detector technologies and detector design concepts that are currently under consideration. A summary of key R&D goals and prototyping phases needed to realize the Phase II detector technical designs is also provided. DUNE's Phase II detectors, along with the increased beam power, will complete the full scope of DUNE, enabling a multi-decadal program of groundbreaking science with neutrinos.

Submitted 22 August, 2024; originally announced August 2024.

Report number: FERMILAB-TM-2833-LBNF

arXiv:2408.12451 [pdf, other]

Dissipation and Interaction-Controlled Non-Hermitian Skin Effects

Authors: Yang Li, Zhao-Fan Cai, Tao Liu, Franco Nori

Abstract: Non-Hermitian skin effects (NHSEs) have recently been investigated extensively at the single-particle level. When many-body interactions become dominant, novel non-Hermitian physical phenomena can emerge. In this work, we theoretically study NHSEs controlled by dissipation and interaction. We consider a 1D zigzag Bose-Hubbard lattice, subject to magnetic flux, staggered onsite single-particle loss, and uniform onsite two-particle loss. When the two-particle loss is small, two-body bound eigenstates (i.e., doublons) are all localized at the same boundary due to the interplay of the magnetic flux and staggered single-particle loss. While, for strong two-particle loss, the localization direction of doublons is unexpectedly reversed. This is attributed to the effective strong nonreciprocal hopping of doublons contributing from the virtual second-order and third-order hopping processes of particle pairs in combination with the magnetic flux, the strong two-particle loss, and the many-body interaction. Moreover, a two-particle gain can induce the same skin-localization of doublons, which can be utilized to dynamically observe the NHSE and its reversal of doublons controlled by interactions. Our results open up a new avenue for exploring novel non-Hermitian phenomena in many-body systems.

Submitted 24 August, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

Comments: 16 pages, 9 figures; Comments are welcome

arXiv:2408.12420 [pdf, other]

Dataset | Mindset = Explainable AI | Interpretable AI

Authors: Caesar Wu, Rajkumar Buyya, Yuan Fang Li, Pascal Bouvry

Abstract: We often use "explainable Artificial Intelligence (XAI)" and "interpretable AI (IAI)" interchangeably when we apply various XAI tools for a given dataset to explain the reasons that underpin machine learning (ML) outputs. However, these notions can sometimes be confusing because interpretation often has a subjective connotation, while explanations lean towards objective facts. We argue that XAI is a subset of IAI. The concept of IAI is beyond the sphere of a dataset. It includes the domain of a mindset. At the core of this ambiguity is the duality of reasons, in which we can reason either outwards or inwards. When directed outwards, we want the reasons to make sense through the laws of nature. When turned inwards, we want the reasons to be happy, guided by the laws of the heart. While XAI and IAI share reason as the common notion for the goal of transparency, clarity, fairness, reliability, and accountability in the context of ethical AI and trustworthy AI (TAI), their differences lie in that XAI emphasizes the post-hoc analysis of a dataset, and IAI requires a priori mindset of abstraction. This hypothesis can be proved by empirical experiments based on an open dataset and harnessed by High-Performance Computing (HPC). The demarcation of XAI and IAI is indispensable because it would be impossible to determine regulatory policies for many AI applications, especially in healthcare, human resources, banking, and finance. We aim to clarify these notions and lay the foundation of XAI, IAI, EAI, and TAI for many practitioners and policymakers in future AI applications and research.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.12414 [pdf, other]

BIPeC: A Combined Change-Point Analyzer to Identify Performance Regressions in Large-scale Database Systems

Authors: Zhan Lyu, Thomas Bach, Yong Li, Nguyen Minh Le, Lars Hoemke

Abstract: Performance testing in large-scale database systems like SAP HANA is a crucial yet labor-intensive task, involving extensive manual analysis of thousands of measurements, such as CPU time and elapsed time. Manual maintenance of these metrics is time-consuming and susceptible to human error, making early detection of performance regressions challenging. We address these issues by proposing an automated approach to detect performance regressions in such measurements. Our approach integrates Bayesian inference with the Pruned Exact Linear Time (PELT) algorithm, enhancing the detection of change points and performance regressions with high precision and efficiency compared to previous approaches. Our method minimizes false negatives and ensures SAP HANA's system's reliability and performance quality. The proposed solution can accelerate testing and contribute to more sustainable performance management practices in large-scale data management environments.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.12373 [pdf, other]

Cell-ontology guided transcriptome foundation model

Authors: Xinyu Yuan, Zhihao Zhan, Zuobai Zhang, Manqi Zhou, Jianan Zhao, Boyu Han, Yue Li, Jian Tang

Abstract: Transcriptome foundation models (TFMs) hold great promise of deciphering the transcriptomic language that dictates diverse cell functions by self-supervised learning on large-scale single-cell gene expression data, and ultimately unraveling the complex mechanisms of human diseases. However, current TFMs treat cells as independent samples and ignore the taxonomic relationships between cell types, which are available in cell ontology graphs. We argue that effectively leveraging this ontology information during the TFM pre-training can improve learning biologically meaningful gene co-expression patterns while preserving TFM as a general purpose foundation model for downstream zero-shot and fine-tuning tasks. To this end, we present single cell, Cell-ontology guided TFM (scCello). We introduce cell-type coherence loss and ontology alignment loss, which are minimized along with the masked gene expression prediction loss during the pre-training. The novel loss components guide scCello to learn the cell-type-specific representation and the structural relation between cell types from the cell ontology graph, respectively. We pre-trained scCello on 22 million cells from CellxGene database leveraging their cell-type labels mapped to the cell ontology graph from Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability performance over the existing TFMs on biologically important tasks including identifying novel cell types of unseen cells, prediction of cell-type-specific marker genes, and cancer drug responses.

Submitted 22 August, 2024; originally announced August 2024.

Comments: All anonymous reviewers' constructive suggestions are appreciated. The next version will be updated soon

arXiv:2408.12370 [pdf, other]

doi 10.1140/epjc/s10052-024-13164-z

Basis-independent quantum coherence and its distribution under relativistic motion

Authors: Ming-Ming Du, Hong-Wei Li, Zhen Tao, Shu-Ting Shen, Xiao-Jing Yan, Xi-Yun Li, Wei Zhong, Yu-Bo Sheng, Lan Zhou

Abstract: Recent studies have increasingly focused on the effect of relativistic motion on quantum coherence. Prior research predominantly examined the influence of relative motion on basis-dependent quantum coherence, underscoring its susceptibility to decoherence under accelerated conditions. Yet, the effect of relativistic motion on basis-independent quantum coherence, which is critical for understanding the intrinsic quantum features of a system, remains an interesting open question. This paper addresses this question by examining how total, collective, and localized coherence are affected by acceleration and coupling strength. Our analysis reveals that both total and collective coherence significantly decrease with increasing acceleration and coupling strength, ultimately vanishing at high levels of acceleration. This underscores the profound impact of Unruh thermal noise. Conversely, localized coherence exhibits relative stability, decreasing to zero only under the extreme condition of infinite acceleration. Moreover, we demonstrate that collective, localized, and basis-independent coherence collectively satisfy the triangle inequality. These findings are crucial for enhancing our understanding of quantum information dynamics in environments subjected to high acceleration and offer valuable insights on the behavior of quantum coherence under relativistic conditions.

Submitted 22 August, 2024; originally announced August 2024.

Comments: 7 pages, 3 figures

arXiv:2408.12236 [pdf, other]

MedDiT: A Knowledge-Controlled Diffusion Transformer Framework for Dynamic Medical Image Generation in Virtual Simulated Patient

Authors: Yanzeng Li, Cheng Zeng, Jinchao Zhang, Jie Zhou, Lei Zou

Abstract: Medical education relies heavily on Simulated Patients (SPs) to provide a safe environment for students to practice clinical skills, including medical image analysis. However, the high cost of recruiting qualified SPs and the lack of diverse medical imaging datasets have presented significant challenges. To address these issues, this paper introduces MedDiT, a novel knowledge-controlled conversational framework that can dynamically generate plausible medical images aligned with simulated patient symptoms, enabling diverse diagnostic skill training. Specifically, MedDiT integrates various patient Knowledge Graphs (KGs), which describe the attributes and symptoms of patients, to dynamically prompt Large Language Models' (LLMs) behavior and control the patient characteristics, mitigating hallucination during medical conversation. Additionally, a well-tuned Diffusion Transformer (DiT) model is incorporated to generate medical images according to the specified patient attributes in the KG. In this paper, we present the capabilities of MedDiT through a practical demonstration, showcasing its ability to act in diverse simulated patient cases and generate the corresponding medical images. This can provide an abundant and interactive learning experience for students, advancing medical education by offering an immersive simulation platform for future healthcare professionals. The work sheds light on the feasibility of incorporating advanced technologies like LLM, KG, and DiT in education applications, highlighting their potential to address the challenges faced in simulated patient-based medical education.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.12201 [pdf, ps, other]

Prescribing positive curvature with conical singularities on $\mathbb S^2$

Authors: Jingyi Chen, Yuxiang Li, Yunqing Wu

Abstract: For conformal metrics with conical singularities and positive curvature on $\mathbb S^2$, we prove a convergence theorem and apply it to obtain a criterion for nonexistence in an open region of the prescribing data. The core of our study is a fine analysis of the bubble trees and an area identity in the convergence process.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.12195 [pdf, ps, other]

Prescribing negative curvature with cusps and conical singularities on compact surface

Authors: Jingyi Chen, Yuxiang Li, Yunqing Wu

Abstract: On a compact surface, we prove existence and uniqueness of the conformal metric whose curvature is prescribed by a negative function away from finitely many points where the metric has prescribed angles presenting cusps or conical singularities.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.12161 [pdf, other]

Rebalancing Multi-Label Class-Incremental Learning

Authors: Kaile Du, Yifan Zhou, Fan Lyu, Yuyang Li, Junzhou Xie, Yixi Shen, Fuyuan Hu, Guangcan Liu

Abstract: Multi-label class-incremental learning (MLCIL) is essential for real-world multi-label applications, allowing models to learn new labels while retaining previously learned knowledge continuously. However, recent MLCIL approaches can only achieve suboptimal performance due to the oversight of the positive-negative imbalance problem, which manifests at both the label and loss levels because of the task-level partial label issue. The imbalance at the label level arises from the substantial absence of negative labels, while the imbalance at the loss level stems from the asymmetric contributions of the positive and negative loss parts to the optimization. To address the issue above, we propose a Rebalance framework for both the Loss and Label levels (RebLL), which integrates two key modules: asymmetric knowledge distillation (AKD) and online relabeling (OR). AKD is proposed to rebalance at the loss level by emphasizing the negative label learning in classification loss and down-weighting the contribution of overconfident predictions in distillation loss. OR is designed for label rebalance, which restores the original class distribution in memory by online relabeling the missing classes. Our comprehensive experiments on the PASCAL VOC and MS-COCO datasets demonstrate that this rebalancing strategy significantly improves performance, achieving new state-of-the-art results even with a vanilla CNN backbone.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.12076 [pdf, other]

ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

Authors: Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, Yu Cheng

Abstract: Large language models (LLMs) have achieved impressive advancements across numerous disciplines, yet the critical issue of knowledge conflicts, a major source of hallucinations, has rarely been studied. Only a few studies have explored the conflicts between the inherent knowledge of LLMs and the retrieved contextual knowledge. However, a thorough assessment of knowledge conflict in LLMs is still missing. Motivated by this research gap, we present ConflictBank, the first comprehensive benchmark developed to systematically evaluate knowledge conflicts from three aspects: (i) conflicts encountered in retrieved knowledge, (ii) conflicts within the models' encoded knowledge, and (iii) the interplay between these conflict forms. Our investigation delves into four model families and twelve LLM instances, meticulously analyzing conflicts stemming from misinformation, temporal discrepancies, and semantic divergences. Based on our proposed novel construction framework, we create 7,453,853 claim-evidence pairs and 553,117 QA pairs. We present numerous findings on model scale, conflict causes, and conflict types. We hope our ConflictBank benchmark will help the community better understand model behavior in conflicts and develop more reliable LLMs.

Submitted 21 August, 2024; originally announced August 2024.

Comments: Under Review

arXiv:2408.11982 [pdf, other]

AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results

Authors: Maksim Smirnov, Aleksandr Gushchin, Anastasia Antsiferova, Dmitry Vatolin, Radu Timofte, Ziheng Jia, Zicheng Zhang, Wei Sun, Jiaying Qian, Yuqin Cao, Yinan Sun, Yuxin Zhu, Xiongkuo Min, Guangtao Zhai, Kanjar De, Qing Luo, Ao-Xiang Zhang, Peng Zhang, Haibo Lei, Linyan Jiang, Yaqing Li, Wenhui Meng, Xiaoheng Tan, Haiqiang Wang, Xiaozhong Xu, et al. (11 additional authors not shown)

Abstract: Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dataset of 459 videos, encoded with 14 codecs of various compression standards (AVC/H.264, HEVC/H.265, AV1, and VVC/H.266) and containing a comprehensive collection of compression artifacts. To measure the methods' performance, we employed traditional correlation coefficients between their predictions and subjective scores, which were collected via large-scale crowdsourced pairwise human comparisons. For training purposes, participants were provided with the Compressed Video Quality Assessment Dataset (CVQAD), a previously developed dataset of 1022 videos. Up to 30 participating teams registered for the challenge, while we report the results of 6 teams, which submitted valid final solutions and code for reproducing the results. Moreover, we calculated and present the performance of state-of-the-art VQA methods on the developed dataset, providing a comprehensive benchmark for future research. The dataset, results, and online leaderboard are publicly available at https://challenges.videoprocessing.ai/challenges/compressedvideo-quality-assessment.html.

Submitted 28 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.11850 [pdf, other]

Parallel Speculative Decoding with Adaptive Draft Length

Authors: Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu

Abstract: Speculative decoding (SD), where an extra draft model is employed to provide multiple draft tokens first and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods suffer from the mutual waiting problem, i.e., the target model gets stuck when the draft model is guessing tokens, and vice versa. This problem is directly incurred by the asynchronous execution of the draft model and the target model, and is exacerbated due to the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). Specifically, PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. PEARL parallels the drafting phase and the verification phase via applying the two strategies, and achieves adaptive draft length for different scenarios, which effectively alleviates the mutual waiting problem. Moreover, we theoretically demonstrate that the mean accepted tokens of PEARL is more than existing draft-then-verify works. Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, leading to a superior speedup performance up to 3.79$\times$ and 1.52$\times$, compared to auto-regressive decoding and vanilla speculative decoding, respectively.

Submitted 13 August, 2024; originally announced August 2024.

arXiv:2408.11849 [pdf, other]

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

Authors: Yinghao Aaron Li, Xilin Jiang, Jordan Darefsky, Ge Zhu, Nima Mesgarani

Abstract: The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to enable end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily due to the extensive dataset and computational resources required. The conventional approach of cascading automatic speech recognition (ASR), LLM, and text-to-speech (TTS) models in a pipeline, while effective, suffers from unnatural prosody because it lacks direct interactions between the input audio and its transcribed text and the output audio. These systems are also limited by their inherent latency from the ASR process for real-time applications. This paper introduces Style-Talker, an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation. Style-Talker takes user input audio and uses transcribed chat history and speech styles to generate both the speaking style and text for the response. Subsequently, the TTS model synthesizes the speech, which is then played back to the user. While the response speech is being played, the input speech undergoes ASR processing to extract the transcription and speaking style, serving as the context for the ensuing dialogue turn. This novel pipeline accelerates the traditional cascade ASR-LLM-TTS systems while integrating rich paralinguistic information from input speech. Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence while being more than 50% faster.

Submitted 13 August, 2024; originally announced August 2024.

Comments: CoLM 2024

arXiv:2408.11843 [pdf, other]

Editable Fairness: Fine-Grained Bias Mitigation in Language Models

Authors: Ruizhe Chen, Yichen Li, Jianfei Yang, Joey Tianyi Zhou, Zuozhu Liu

Abstract: Generating fair and accurate predictions plays a pivotal role in deploying large language models (LLMs) in the real world. However, existing debiasing methods inevitably generate unfair or incorrect predictions as they are designed and evaluated to achieve parity across different social groups but leave aside individual commonsense facts, resulting in modified knowledge that elicits unreasonable or undesired predictions. In this paper, we first establish a new bias mitigation benchmark, BiaScope, which systematically assesses performance by leveraging newly constructed datasets and metrics on knowledge retention and generalization. Then, we propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST identifies the decisive layer responsible for storing social biases and then calibrates its outputs by integrating a small modular network, considering both bias mitigation and knowledge-preserving demands. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines with superior debiasing performance while not compromising the overall model capability for knowledge retention and downstream predictions. This highlights the potential of fine-grained debiasing strategies to achieve fairness in LLMs. Code will be publicly available.

Submitted 7 August, 2024; originally announced August 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2405.09341

arXiv:2408.11824

AppAgent v2: Advanced Agent for Flexible Mobile Interactions

Authors: Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei

Abstract: With the advancement of Multimodal Large Language Models (MLLM), LLM-driven visual agents are increasingly impacting software interfaces, particularly those with graphical user interfaces. This work introduces a novel LLM-based multimodal agent framework for mobile devices. This framework, capable of navigating mobile devices, emulates human-like interactions. Our agent constructs a flexible action space that enhances adaptability across various applications including parser, text and vision descriptions. The agent operates through two main phases: exploration and deployment. During the exploration phase, functionalities of user interface elements are documented either through agent-driven or manual explorations into a customized structured knowledge base. In the deployment phase, RAG technology enables efficient retrieval and update from this knowledge base, thereby empowering the agent to perform tasks effectively and accurately. This includes performing complex, multi-step operations across various applications, thereby demonstrating the framework's adaptability and precision in handling customized task workflows. Our experimental results across various benchmarks demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios. Our code will be open source soon.

Submitted 23 August, 2024; v1 submitted 5 August, 2024; originally announced August 2024.

Comments: Pre-print version, some content needs to be supplemented

arXiv:2408.11681 [pdf, other]

Variational autoencoder inverse mapper for extraction of Compton form factors: Benchmarks and conditional learning

Authors: Fayaz Hossen, Douglas Adams, Joshua Bautista, Yaohang Li, Gia-Wei Chern, Simonetta Liuti, Marie Boer, Marija Cuic, Gari R. Goldstein, Michael Engelhardt, Huey-Wen Li

Abstract: Deeply virtual exclusive scattering processes (DVES) serve as precise probes of nucleon quark and gluon distributions in coordinate space. These distributions are derived from generalized parton distributions (GPDs) via Fourier transform relative to proton momentum transfer. QCD factorization theorems enable DVES to be parameterized by Compton form factors (CFFs), which are convolutions of GPDs with perturbatively calculable kernels. Accurate extraction of CFFs from DVCS, benefiting from interference with the Bethe-Heitler (BH) process and a simpler final state structure, is essential for inferring GPDs. This paper focuses on extracting CFFs from DVCS data using a variational autoencoder inverse mapper (VAIM) and its constrained variant (C-VAIM). VAIM is shown to be consistent with Markov Chain Monte Carlo (MCMC) methods in extracting multiple CFF solutions for given kinematics, while C-VAIM effectively captures correlations among CFFs across different kinematic values, providing more constrained solutions. This study represents a crucial first step towards a comprehensive analysis pipeline towards the extraction of GPDs.

Submitted 21 August, 2024; originally announced August 2024.

Comments: 12 pages, 9 figures

arXiv:2408.11463 [pdf, other]

Low-Light Object Tracking: A Benchmark

Authors: Pengzhi Zhong, Xiaoyu Guo, Defeng Huang, Xiaojun Peng, Yian Li, Qijun Zhao, Shuiwang Li

Abstract: In recent years, the field of visual tracking has made significant progress with the application of large-scale training datasets. These datasets have supported the development of sophisticated algorithms, enhancing the accuracy and stability of visual object tracking. However, most research has primarily focused on favorable illumination circumstances, neglecting the challenges of tracking in low-light environments. In low-light scenes, lighting may change dramatically, targets may lack distinct texture features, and in some scenarios, targets may not be directly observable. These factors can lead to a severe decline in tracking performance. To address this issue, we introduce LLOT, a benchmark specifically designed for Low-Light Object Tracking. LLOT comprises 269 challenging sequences with a total of over 132K frames, each carefully annotated with bounding boxes. This specially designed dataset aims to promote innovation and advancement in object tracking techniques for low-light conditions, addressing challenges not adequately covered by existing benchmarks. To assess the performance of existing methods on LLOT, we conducted extensive tests on 39 state-of-the-art tracking algorithms. The results highlight a considerable gap in low-light tracking performance. In response, we propose H-DCPT, a novel tracker that incorporates historical and darkness clue prompts to set a stronger baseline. H-DCPT outperformed all 39 evaluated methods in our experiments, demonstrating significant improvements. We hope that our benchmark and H-DCPT will stimulate the development of novel and accurate methods for tracking objects in low-light conditions. The LLOT and code are available at https://github.com/OpenCodeGithub/H-DCPT.

Submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.11449 [pdf, other]

Enabling Small Models for Zero-Shot Classification through Model Label Learning

Authors: Jia Zhang, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li

Abstract: Vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot ability in image classification tasks by aligning text and images but suffer inferior performance compared with task-specific expert models. On the contrary, expert models excel in their specialized domains but lack zero-shot ability for new tasks. How to obtain both the high performance of expert models and zero-shot ability is an important research direction. In this paper, we attempt to demonstrate that by constructing a model hub and aligning models with their functionalities using model labels, new tasks can be solved in a zero-shot manner by effectively selecting and reusing models in the hub. We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities through a Semantic Directed Acyclic Graph (SDAG) and leverages an algorithm, Classification Head Combination Optimization (CHCO), to select capable models for new tasks. Compared with the foundation model paradigm, it is less costly and more scalable, i.e., the zero-shot ability grows with the sizes of the model hub. Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL, demonstrating that expert models can be effectively reused for zero-shot tasks. Our code will be released publicly.

Submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.11432 [pdf, other]

doi 10.1145/3664647.3680673

T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval

Authors: Yili Li, Jing Yu, Keke Gai, Bang Liu, Gang Xiong, Qi Wu

Abstract: Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This method considers the matching between each candidate video and the query, but it incurs a significant time cost and will increase notably with the increase of candidates. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers and retrieving candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines with only 30%-50% of the original retrieval time to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at https://github.com/Lilidamowang/T2VIndexer-generativeSearch.

Submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.11426 [pdf, other]

AS-LIO: Spatial Overlap Guided Adaptive Sliding Window LiDAR-Inertial Odometry for Aggressive FOV Variation

Authors: Tianxiang Zhang, Xuanxuan Zhang, Zongbo Liao, Xin Xia, You Li

Abstract: LiDAR-Inertial Odometry (LIO) demonstrates outstanding accuracy and stability in general low-speed and smooth motion scenarios. However, in high-speed and intense motion scenarios, such as sharp turns, two primary challenges arise: firstly, due to the limitations of IMU frequency, the error in estimating significantly non-linear motion states escalates; secondly, drastic changes in the Field of View (FOV) may diminish the spatial overlap between LiDAR frame and pointcloud map (or between frames), leading to insufficient data association and constraint degradation. To address these issues, we propose a novel Adaptive Sliding window LIO framework (AS-LIO) guided by the Spatial Overlap Degree (SOD). Initially, we assess the SOD between the LiDAR frames and the registered map, directly evaluating the adverse impact of current FOV variation on pointcloud alignment. Subsequently, we design an adaptive sliding window to manage the continuous LiDAR stream and control state updates, dynamically adjusting the update step according to the SOD. This strategy enables our odometry to adaptively adopt higher update frequency to precisely characterize trajectory during aggressive FOV variation, thus effectively reducing the non-linear error in positioning. Meanwhile, the historical constraints within the sliding window reinforce the frame-to-map data association, ensuring the robustness of state estimation. Experiments show that our AS-LIO framework can quickly perceive and respond to challenging FOV change, outperforming other state-of-the-art LIO frameworks in terms of accuracy and robustness.

Submitted 21 August, 2024; originally announced August 2024.

Comments: 8 pages, 6 figures

arXiv:2408.11329 [pdf, ps, other]

Full-Duplex ISAC-Enabled D2D Underlaid Cellular Networks: Joint Transceiver Beamforming and Power Allocation

Authors: Tao Jiang, Ming Jin, Qinghua Guo, Yinhong Liu, Yaming Li

Abstract: Integrating device-to-device (D2D) communication into cellular networks can significantly reduce the transmission burden on base stations (BSs). Besides, integrated sensing and communication (ISAC) is envisioned as a key feature in future wireless networks. In this work, we consider a full-duplex ISAC-based D2D underlaid system, and propose a joint beamforming and power allocation scheme to improve the performance of the coexisting ISAC and D2D networks. To enhance spectral efficiency, a sum rate maximization problem is formulated for the full-duplex ISAC-based D2D underlaid system, which is non-convex. To solve the non-convex optimization problem, we propose a successive convex approximation (SCA)-based iterative algorithm and prove its convergence. Numerical results are provided to validate the effectiveness of the proposed scheme with the iterative algorithm, demonstrating that the proposed scheme outperforms state-of-the-art ones in both communication and sensing performance.

Submitted 21 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

Comments: This work has been submitted to IEEE Transactions on Wireless Communications on 7 June, 2024

arXiv:2408.11298 [pdf, other]

Towards a first principles light-front Hamiltonian for the nucleon

Authors: Siqi Xu, Yiping Liu, Chandan Mondal, Jiangshan Lan, Xingbo Zhao, Yang Li, James P. Vary

Abstract: We solve the nucleon's wave functions from the eigenstates of the light-front quantum chromodynamics Hamiltonian for the first time, using a fully relativistic and nonperturbative approach based on light-front quantization, without an explicit confining potential. These eigenstates are determined for the three-quark, three-quark-gluon, and three-quark-quark-antiquark Fock representations, making them suitable for low-resolution probes. From this, we calculate the nucleon's quark and gluon matter densities, helicity, and transversity distributions, which show qualitative consistency with experimental extractions. We also compute the contributions of quark and gluon helicity to the proton spin and the tensor charges. The obtained light-front wave functions represent a significant advancement towards a unified description of various hadron distribution functions in both longitudinal and transverse momentum space.

Submitted 20 August, 2024; originally announced August 2024.

arXiv:2408.10994 [pdf, other]

Microsatellite-based real-time quantum key distribution

Authors: Yang Li, Wen-Qi Cai, Ji-Gang Ren, Chao-Ze Wang, Meng Yang, Liang Zhang, Hui-Ying Wu, Liang Chang, Jin-Cai Wu, Biao Jin, Hua-Jian Xue, Xue-Jiao Li, Hui Liu, Guang-Wen Yu, Xue-Ying Tao, Ting Chen, Chong-Fei Liu, Wen-Bin Luo, Jie Zhou, Hai-Lin Yong, Yu-Huai Li, Feng-Zhi Li, Cong Jiang, Hao-Ze Chen, Chao Wu, et al. (16 additional authors not shown)

Abstract: A quantum network provides an infrastructure connecting quantum devices with revolutionary computing, sensing, and communication capabilities. As the best-known application of a quantum network, quantum key distribution (QKD) shares secure keys guaranteed by the laws of quantum mechanics. A quantum satellite constellation offers a solution to facilitate the quantum network on a global scale. The Micius satellite has verified the feasibility of satellite quantum communications; however, scaling up quantum satellite constellations is challenging, requiring small lightweight satellites, portable ground stations and real-time secure key exchange. Here we tackle these challenges and report the development of a quantum microsatellite capable of performing space-to-ground QKD using portable ground stations. The quantum microsatellite features a payload weighing approximately 23 kg, while the portable ground station weighs about 100 kg. These weights represent reductions by more than an order and two orders of magnitude, respectively, compared to the Micius satellite. Additionally, we multiplex bidirectional satellite-ground optical communication with quantum communication, enabling key distillation and secure communication in real-time. Using the microsatellite and the portable ground stations, we demonstrate satellite-based QKD with multiple ground stations and achieve the sharing of up to 0.59 million bits of secure keys during a single satellite pass. The compact quantum payload can be readily assembled on existing space stations or small satellites, paving the way for a satellite-constellation-based quantum and classical network for widespread real-life applications.

Submitted 20 August, 2024; originally announced August 2024.

Comments: 40 pages, 8 figures

Showing 51–100 of 17,264 results for author: Li, Y