
Showing 1–50 of 4,771 results for author: Zhang, Z

Searching in archive cs.
  1. arXiv:2407.13748  [pdf, other]

    cs.CV

    General Geometry-aware Weakly Supervised 3D Object Detection

    Authors: Guowen Zhang, Junsong Fan, Liyi Chen, Zhaoxiang Zhang, Zhen Lei, Lei Zhang

    Abstract: 3D object detection is an indispensable component for scene understanding. However, the annotation of large-scale 3D datasets requires significant human effort. To tackle this problem, many methods adopt weakly supervised 3D object detection that estimates 3D boxes by leveraging 2D boxes and scene/class-specific priors. However, these approaches generally depend on sophisticated manual priors, whi…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV24

  2. arXiv:2407.13719  [pdf, other]

    cs.CV

    HazeCLIP: Towards Language Guided Real-World Image Dehazing

    Authors: Ruiyi Wang, Wenhao Li, Xiaohong Liu, Chunyi Li, Zicheng Zhang, Xiongkuo Min, Guangtao Zhai

    Abstract: Existing methods have achieved remarkable performance in single image dehazing, particularly on synthetic datasets. However, they often struggle with real-world hazy images due to domain shift, limiting their practical applicability. This paper introduces HazeCLIP, a language-guided adaptation framework designed to enhance the real-world performance of pre-trained dehazing networks. Inspired by th…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 5 pages, 4 figures

  3. arXiv:2407.13362  [pdf, other]

    cs.CV

    Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation

    Authors: Pengfei Wang, Yuxi Wang, Shuai Li, Zhaoxiang Zhang, Zhen Lei, Lei Zhang

    Abstract: The scarcity of large-scale 3D-text paired data poses a great challenge on open vocabulary 3D scene understanding, and hence it is popular to leverage internet-scale 2D data and transfer their open vocabulary capabilities to 3D models through knowledge distillation. However, the existing distillation-based 3D scene understanding approaches rely on the representation capacity of 2D models, disregar…

    Submitted 18 July, 2024; originally announced July 2024.

  4. arXiv:2407.13255  [pdf, other]

    cs.IT eess.SP

    Interleaved Block-Sparse Transform

    Authors: Lei Liu, Ming Wang, Shufeng Li, Yuhao Chi, Ning Wei, ZhaoYang Zhang

    Abstract: Low-complexity Bayes-optimal memory approximate message passing (MAMP) is an efficient signal estimation algorithm in compressed sensing and multicarrier modulation. However, achieving replica Bayes optimality with MAMP necessitates a large-scale right-unitarily invariant transformation, which is prohibitive in practical systems due to its high computational complexity and hardware costs. To solve…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Submitted to the IEEE Journal

  5. arXiv:2407.13219  [pdf, other]

    cs.CV

    Multi-sentence Video Grounding for Long Video Generation

    Authors: Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Wenwu Zhu

    Abstract: Video generation has witnessed great success recently, but its application in generating long videos still remains challenging due to the difficulty in maintaining the temporal consistency of generated videos and the high memory cost during generation. To tackle the problems, in this paper, we propose a brave and new idea of Multi-sentence Video Grounding for Long Video Generation, connecting th…

    Submitted 18 July, 2024; originally announced July 2024.

  6. arXiv:2407.13147  [pdf, other]

    cs.CV

    DFMSD: Dual Feature Masking Stage-wise Knowledge Distillation for Object Detection

    Authors: Zhourui Zhang, Jun Li, Zhijian Wu, Jifeng Shen, Jianhua Xu

    Abstract: In recent years, current mainstream feature masking distillation methods mainly function by reconstructing selectively masked regions of a student network from the feature maps of a teacher network. In these methods, attention mechanisms can help to identify spatially important regions and crucial object-aware channel clues, such that the reconstructed features are encoded with sufficient discrimi…

    Submitted 18 July, 2024; originally announced July 2024.

  7. arXiv:2407.13139  [pdf, other]

    cs.CV

    Image Inpainting Models are Effective Tools for Instruction-guided Image Editing

    Authors: Xuan Ju, Junhao Zhuang, Zhaoyang Zhang, Yuxuan Bian, Qiang Xu, Ying Shan

    Abstract: This is the technical report for the winning solution of the CVPR2024 GenAI Media Generation Challenge Workshop's Instruction-guided Image Editing track. Instruction-guided image editing has been largely studied in recent years. The most advanced methods, such as SmartEdit and MGIE, usually combine large language models with diffusion models through joint training, where the former provides text u…

    Submitted 17 July, 2024; originally announced July 2024.

  8. arXiv:2407.13101  [pdf, other]

    cs.CL cs.AI

    Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach

    Authors: Zhouyu Jiang, Mengshu Sun, Lei Liang, Zhiqiang Zhang

    Abstract: Multi-hop question answering is a challenging task with distinct industrial relevance, and Retrieval-Augmented Generation (RAG) methods based on large language models (LLMs) have become a popular approach to tackle this task. Owing to the potential inability to retrieve all necessary information in a single iteration, a series of iterative RAG methods has been recently developed, showing significa…

    Submitted 17 July, 2024; originally announced July 2024.

  9. arXiv:2407.12828  [pdf, other]

    cs.CL cs.AI

    Why Does New Knowledge Create Messy Ripple Effects in LLMs?

    Authors: Jiaxin Qin, Zixuan Zhang, Chi Han, Manling Li, Pengfei Yu, Heng Ji

    Abstract: Extensive previous research has focused on post-training knowledge editing (KE) for language models (LMs) to ensure that knowledge remains accurate and up-to-date. One desired property and open question in KE is to let edited LMs correctly handle ripple effects, where the LM is expected to answer its logically related knowledge accurately. In this paper, we answer the question of why most KE methods s…

    Submitted 2 July, 2024; originally announced July 2024.

  10. arXiv:2407.12798  [pdf, other]

    cs.CV

    Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval

    Authors: Wenjun Li, Shudong Wang, Dong Zhao, Shenghui Xu, Zhaoming Pan, Zhimin Zhang

    Abstract: The key of the text-to-video retrieval (TVR) task lies in learning the unique similarity between each pair of text (consisting of words) and video (consisting of audio and image frames) representations. However, some problems exist in the representation alignment of video and text, such as a text, and further each word, are of different importance for video frames. Besides, audio usually carries a…

    Submitted 20 June, 2024; originally announced July 2024.

  11. arXiv:2407.12782  [pdf, other]

    cs.LG cs.CV

    Contrastive Adversarial Training for Unsupervised Domain Adaptation

    Authors: Jiahong Chen, Zhilin Zhang, Lucy Li, Behzad Shahrasbi, Arjun Mishra

    Abstract: Domain adversarial training has shown its effective capability for finding domain invariant feature representations and has been successfully adopted for various domain adaptation tasks. However, recent advances in large models (e.g., vision transformers) and the emergence of complex adaptation scenarios (e.g., DomainNet) make adversarial training easily biased towards the source domain and hardly adapte…

    Submitted 17 July, 2024; originally announced July 2024.

  12. arXiv:2407.12758  [pdf, other]

    cs.CV

    Mutual Information Guided Optimal Transport for Unsupervised Visible-Infrared Person Re-identification

    Authors: Zhizhong Zhang, Jiangming Wang, Xin Tan, Yanyun Qu, Junping Wang, Yong Xie, Yuan Xie

    Abstract: Unsupervised visible infrared person re-identification (USVI-ReID) is a challenging retrieval task that aims to retrieve cross-modality pedestrian images without using any label information. In this task, the large cross-modality variance makes it difficult to generate reliable cross-modality labels, and the lack of annotations also provides additional difficulties for learning modality-invariant…

    Submitted 17 July, 2024; originally announced July 2024.

  13. arXiv:2407.12727  [pdf, other]

    cs.CV

    NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model

    Authors: Zhongqun Zhang, Hengfei Wang, Ziwei Yu, Yihua Cheng, Angela Yao, Hyung Jin Chang

    Abstract: Modeling the physical contacts between the hand and object is standard for refining inaccurate hand poses and generating novel human grasp in 3D hand-object reconstruction. However, existing methods rely on geometric constraints that cannot be specified or controlled. This paper introduces a novel task of controllable 3D hand-object contact modeling with natural language descriptions. Challenges i…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  14. arXiv:2407.12622  [pdf, other]

    cs.CV

    Rethinking the Architecture Design for Efficient Generic Event Boundary Detection

    Authors: Ziwei Zheng, Zechuan Zhang, Yulin Wang, Shiji Song, Gao Huang, Le Yang

    Abstract: Generic event boundary detection (GEBD), inspired by human visual cognitive behaviors of consistently segmenting videos into meaningful temporal chunks, finds utility in various applications such as video editing. In this paper, we demonstrate that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in r…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: ACM MM 2024

  15. arXiv:2407.12582  [pdf, other]

    cs.CV cs.AI cs.RO

    Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection

    Authors: Hu Cao, Zehua Zhang, Yan Xia, Xinyi Li, Jiahao Xia, Guang Chen, Alois Knoll

    Abstract: In frame-based vision, object detection faces substantial performance degradation under challenging conditions due to the limited sensing capability of conventional cameras. Event cameras output sparse and asynchronous events, providing a potential solution to solve these problems. However, effectively fusing two heterogeneous modalities remains an open issue. In this work, we propose a novel hier…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  16. arXiv:2407.12580  [pdf, other]

    cs.CL cs.CV cs.IR

    E5-V: Universal Embeddings with Multimodal Large Language Models

    Authors: Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang

    Abstract: Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. Our findings highlight the significant potential of MLLMs in rep…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Code and models are available at https://github.com/kongds/E5-V

  17. arXiv:2407.12516  [pdf, other]

    cs.NE cs.AI cs.LG

    Online Pseudo-Zeroth-Order Training of Neuromorphic Spiking Neural Networks

    Authors: Mingqing Xiao, Qingyan Meng, Zongpeng Zhang, Di He, Zhouchen Lin

    Abstract: Brain-inspired neuromorphic computing with spiking neural networks (SNNs) is a promising energy-efficient computational approach. However, successfully training SNNs in a more biologically plausible and neuromorphic-hardware-friendly way is still challenging. Most recent methods leverage spatial and temporal backpropagation (BP), not adhering to neuromorphic properties. Despite the efforts of some…

    Submitted 17 July, 2024; originally announced July 2024.

  18. arXiv:2407.12448  [pdf, other]

    cs.LG

    Energy-Guided Diffusion Sampling for Offline-to-Online Reinforcement Learning

    Authors: Xu-Hui Liu, Tian-Shuo Liu, Shengyi Jiang, Ruifeng Chen, Zhilong Zhang, Xinwei Chen, Yang Yu

    Abstract: Combining offline and online reinforcement learning (RL) techniques is indeed crucial for achieving efficient and safe learning where data acquisition is expensive. Existing methods replay offline data directly in the online phase, resulting in a significant challenge of data distribution shift and subsequently causing inefficiency in online fine-tuning. To address this issue, we introduce an inno…

    Submitted 17 July, 2024; originally announced July 2024.

  19. arXiv:2407.12276  [pdf, other]

    cs.CV

    VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

    Authors: Zhen Qu, Xian Tao, Mukesh Prasad, Fei Shen, Zhengtao Zhang, Xinyi Gong, Guiguang Ding

    Abstract: Recently, large-scale vision-language models such as CLIP have demonstrated immense potential in the zero-shot anomaly segmentation (ZSAS) task, utilizing a unified model to directly detect anomalies on any unseen product with painstakingly crafted text prompts. However, existing methods often assume that the product category to be inspected is known, thus setting product-specific text prompts, which…

    Submitted 16 July, 2024; originally announced July 2024.

  20. arXiv:2407.12066  [pdf, other]

    cs.CV

    Temporally Grounding Instructional Diagrams in Unconstrained Videos

    Authors: Jiahao Zhang, Frederic Z. Zhang, Cristian Rodriguez, Yizhak Ben-Shabat, Anoop Cherian, Stephen Gould

    Abstract: We study the challenging problem of simultaneously localizing a sequence of queries in the form of instructional diagrams in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries such as the general mutual exclusiveness and the temp…

    Submitted 16 July, 2024; originally announced July 2024.

  21. arXiv:2407.11895  [pdf, other]

    cs.CV

    OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

    Authors: Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao

    Abstract: Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joi…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Homepage is http://omnibind.github.io

  22. arXiv:2407.11730  [pdf, other]

    cs.CV

    Monocular Occupancy Prediction for Scalable Indoor Scenes

    Authors: Hongxiao Yu, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang

    Abstract: Camera-based 3D occupancy prediction has recently garnered increasing attention in outdoor driving scenes. However, research in indoor scenes remains relatively unexplored. The core differences in indoor scenes lie in the complexity of scene scale and the variance in object size. In this paper, we propose a novel method, named ISO, for predicting indoor scene occupancy using monocular images. ISO…

    Submitted 16 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  23. arXiv:2407.11459  [pdf, other]

    eess.SP cs.LG

    RIMformer: An End-to-End Transformer for FMCW Radar Interference Mitigation

    Authors: Ziang Zhang, Guangzhi Chen, Youlong Weng, Shunchuan Yang, Zhiyu Jia, Jingxuan Chen

    Abstract: Frequency-modulated continuous-wave (FMCW) radar plays a pivotal role in the field of remote sensing. The increasing degree of FMCW radar deployment has increased the mutual interference, which weakens the detection capabilities of radars and threatens reliability and safety of systems. In this paper, a novel FMCW radar interference mitigation (RIM) method, termed as RIMformer, is proposed by usin…

    Submitted 17 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

  24. arXiv:2407.11424  [pdf, other]

    cs.CV

    Model Inversion Attacks Through Target-Specific Conditional Diffusion Models

    Authors: Ouxiang Li, Yanbin Hao, Zhicai Wang, Bin Zhu, Shuo Wang, Zaixi Zhang, Fuli Feng

    Abstract: Model inversion attacks (MIAs) aim to reconstruct private images from a target classifier's training set, thereby raising privacy concerns in AI applications. Previous GAN-based MIAs tend to suffer from inferior generative fidelity due to GAN's inherent flaws and biased optimization within latent space. To alleviate these issues, leveraging on diffusion models' remarkable synthesis capabilities, w…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Preprint. Under review

  25. arXiv:2407.11299  [pdf, other]

    cs.RO cs.CV

    FR-SLAM: A SLAM Improvement Method Based on Floor Plan Registration

    Authors: Jiantao Feng, Xinde Li, HyunCheol Park, Juan Liu, Zhentong Zhang

    Abstract: Simultaneous Localization and Mapping (SLAM) technology enables the construction of environmental maps and localization, serving as a key technique for indoor autonomous navigation of mobile robots. Traditional SLAM methods typically require exhaustive traversal of all rooms during indoor navigation to obtain a complete map, resulting in lengthy path planning times and prolonged time to reach targ…

    Submitted 15 July, 2024; originally announced July 2024.

  26. arXiv:2407.11253  [pdf, other]

    cs.LG cs.CE

    Separable Operator Networks

    Authors: Xinling Yu, Sean Hooten, Ziyue Liu, Yequan Zhao, Marco Fiorentino, Thomas Van Vaerenbergh, Zheng Zhang

    Abstract: Operator learning has become a powerful tool in machine learning for modeling complex physical systems. Although Deep Operator Networks (DeepONet) show promise, they require extensive data acquisition. Physics-informed DeepONets (PI-DeepONet) mitigate data scarcity but suffer from inefficient training processes. We introduce Separable Operator Networks (SepONet), a novel framework that significant…

    Submitted 15 July, 2024; originally announced July 2024.

  27. arXiv:2407.11239  [pdf, other]

    cs.LG

    From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

    Authors: Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang

    Abstract: Modern Large Language Models (LLMs) are composed of matrices with billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Being significantly large, such matrices can often be expressed in low-rank format with potential to relax resource requirements. Unlike prior works which focus on developing novel matrix decomposition algo…

    Submitted 15 July, 2024; originally announced July 2024.

  28. arXiv:2407.11052  [pdf, other]

    cs.LG cs.AI

    Revisiting, Benchmarking and Understanding Unsupervised Graph Domain Adaptation

    Authors: Meihan Liu, Zhen Zhang, Jiachen Tang, Jiajun Bu, Bingsheng He, Sheng Zhou

    Abstract: Unsupervised Graph Domain Adaptation (UGDA) involves the transfer of knowledge from a label-rich source graph to an unlabeled target graph under domain discrepancies. Despite the proliferation of methods designed for this emerging task, the lack of standard experimental settings and fair performance comparisons makes it challenging to understand which and when models perform well across different…

    Submitted 9 July, 2024; originally announced July 2024.

  29. arXiv:2407.10701  [pdf, other]

    cs.CL

    DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

    Authors: Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

    Abstract: Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, m…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Work in progress

  30. arXiv:2407.10680  [pdf, ps, other]

    cs.SI cs.NI

    Friedkin-Johnsen Model for Opinion Dynamics on Signed Graphs

    Authors: Xiaotian Zhou, Haoxin Sun, Wanyue Xu, Wei Li, Zhongzhi Zhang

    Abstract: A signed graph offers richer information than an unsigned graph, since it describes both collaborative and competitive relationships in social networks. In this paper, we study opinion dynamics on a signed graph, based on the Friedkin-Johnsen model. We first interpret the equilibrium opinion in terms of a defined random walk on an augmented signed graph, by representing the equilibrium opinion of…

    Submitted 17 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

  31. arXiv:2407.10671  [pdf, other]

    cs.CL cs.AI

    Qwen2 Technical Report

    Authors: An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin , et al. (37 additional authors not shown)

    Abstract: This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, a…

    Submitted 17 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: 25 pages, 1 figure

  32. arXiv:2407.10424  [pdf, other]

    cs.PL cs.AI

    CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization

    Authors: Yang Zhao, Di Huang, Chongxiao Li, Pengwei Jin, Ziyuan Nan, Tianyun Ma, Lei Qi, Yansong Pan, Zhenxing Zhang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen

    Abstract: The increasing complexity and high costs associated with modern processor design have led to a surge in demand for processor design automation. Instruction-tuned large language models (LLMs) have demonstrated remarkable performance in automatically generating code for general-purpose programming languages like Python. However, these methods fail on hardware description languages (HDLs) like Verilo…

    Submitted 18 July, 2024; v1 submitted 14 July, 2024; originally announced July 2024.

    Comments: 16 pages, 8 figures, conference

  33. arXiv:2407.10099  [pdf, other]

    cs.CV

    STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video

    Authors: Yang Liu, Zhiyong Zhang

    Abstract: The current methods of video-based 3D human pose estimation have achieved significant progress; however, they continue to confront the significant challenge of depth ambiguity. To address this limitation, this paper presents the spatio-temporal GraphFormer framework for 3D human pose estimation in video, which integrates body structure graph-based representations with spatio-temporal information.…

    Submitted 14 July, 2024; originally announced July 2024.

  34. arXiv:2407.10061  [pdf, other]

    cs.CV

    InfiniMotion: Mamba Boosts Memory in Transformer for Arbitrary Long Motion Generation

    Authors: Zeyu Zhang, Akide Liu, Qi Chen, Feng Chen, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang

    Abstract: Text-to-motion generation holds potential for film, gaming, and robotics, yet current methods often prioritize short motion generation, making it challenging to produce long motion sequences effectively: (1) Current methods struggle to handle long motion sequences as a single input due to prohibitively high computational cost; (2) Breaking down the generation of long motion sequences into shorter…

    Submitted 13 July, 2024; originally announced July 2024.

  35. arXiv:2407.09899  [pdf, other]

    cs.RO

    DexGrasp-Diffusion: Diffusion-based Unified Functional Grasp Synthesis Pipeline for Multi-Dexterous Robotic Hands

    Authors: Zhengshen Zhang, Lei Zhou, Chenchen Liu, Zhiyang Liu, Chengran Yuan, Sheng Guo, Ruiteng Zhao, Marcelo H. Ang Jr., Francis EH Tay

    Abstract: The versatility and adaptability of human grasping catalyze advancing dexterous robotic manipulation. While significant strides have been made in dexterous grasp generation, current research endeavors pivot towards optimizing object manipulation while ensuring functional integrity, emphasizing the synthesis of functional grasps following desired affordance instructions. This paper addresses the ch…

    Submitted 13 July, 2024; originally announced July 2024.

  36. arXiv:2407.09781  [pdf, other]

    cs.CV

    Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

    Authors: Ruihuang Li, Zhengqiang Zhang, Chenhang He, Zhiyuan Ma, Vishal M. Patel, Lei Zhang

    Abstract: Recent vision-language pre-training models have exhibited remarkable generalization ability in zero-shot recognition tasks. Previous open-vocabulary 3D scene understanding methods mostly focus on training 3D models using either image or text supervision while neglecting the collective strength of all modalities. In this work, we propose a Dense Multimodal Alignment (DMA) framework to densely co-em…

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  37. arXiv:2407.09590  [pdf, other]

    cs.CL cs.LG

    Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

    Authors: Zeliang Zhang, Xiaodong Liu, Hao Cheng, Chenliang Xu, Jianfeng Gao

    Abstract: By increasing model parameters but activating them sparsely when performing a task, the use of Mixture-of-Experts (MoE) architecture significantly improves the performance of Large Language Models (LLMs) without increasing the inference cost. However, the memory consumption due to the growing number of experts presents a challenge to the deployment of these models in many real world settings. Our…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: 13 pages, 6 figures

  38. arXiv:2407.09359  [pdf, other]

    cs.CV

    A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization

    Authors: Qiyu Chen, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang

    Abstract: Anomaly synthesis strategies can effectively enhance unsupervised anomaly detection. However, existing strategies have limitations in the coverage and controllability of anomaly synthesis, particularly for weak defects that are very similar to normal regions. In this paper, we propose Global and Local Anomaly co-Synthesis Strategy (GLASS), a novel unified framework designed to synthesize a broader…

    Submitted 12 July, 2024; originally announced July 2024.

  39. arXiv:2407.09250  [pdf]

    cs.NI cs.LG

    FedsLLM: Federated Split Learning for Large Language Models over Communication Networks

    Authors: Kai Zhao, Zhaohui Yang, Chongwen Huang, Xiaoming Chen, Zhaoyang Zhang

    Abstract: Addressing the challenges of deploying large language models in wireless communication networks, this paper combines low-rank adaptation technology (LoRA) with the splitfed learning framework to propose the federated split learning for large language models (FedsLLM) framework. The method introduced in this paper utilizes LoRA technology to reduce processing loads by dividing the network into clie…

    Submitted 12 July, 2024; originally announced July 2024.

  40. arXiv:2407.08561  [pdf, other]

    cs.CV

    MapLocNet: Coarse-to-Fine Feature Registration for Visual Re-Localization in Navigation Maps

    Authors: Hang Wu, Zhenghao Zhang, Siyuan Lin, Xiangru Mu, Qiang Zhao, Ming Yang, Tong Qin

    Abstract: Robust localization is the cornerstone of autonomous driving, especially in challenging urban environments where GPS signals suffer from multipath errors. Traditional localization approaches rely on high-definition (HD) maps, which consist of precisely annotated landmarks. However, building HD maps is expensive and challenging to scale up. Given these limitations, leveraging navigation maps has eme…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: IROS 2024 (Oral)

  41. arXiv:2407.08554  [pdf, other]

    cs.AI cs.HC

    Establishing Rigorous and Cost-effective Clinical Trials for Artificial Intelligence Models

    Authors: Wanling Gao, Yunyou Huang, Dandan Cui, Zhuoming Yu, Wenjing Liu, Xiaoshuang Liang, Jiahui Zhao, Jiyue Xie, Hao Li, Li Ma, Ning Ye, Yumiao Kang, Dingfeng Luo, Peng Pan, Wei Huang, Zhongmou Liu, Jizhong Hu, Gangyuan Zhao, Chongrong Jiang, Fan Huang, Tianyi Wei, Suqin Tang, Bingjie Xia, Zhifei Zhang, Jianfeng Zhan

    Abstract: A profound gap persists between artificial intelligence (AI) and clinical practice in medicine, primarily due to the lack of rigorous and cost-effective evaluation methodologies. State-of-the-art and state-of-the-practice AI model evaluations are limited to laboratory studies on medical datasets or direct clinical trials with no or solely patient-centered controls. Moreover, the crucial role of cl…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: 23 pages

  42. arXiv:2407.08526  [pdf, other]

    cs.CV

    BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of Sight

    Authors: Hang Wu, Zhenghao Zhang, Siyuan Lin, Tong Qin, Jin Pan, Qiang Zhao, Chunjing Xu, Ming Yang

    Abstract: Bird's-eye-view (BEV) representation is crucial for the perception function in autonomous driving tasks. It is difficult to balance the accuracy, efficiency and range of BEV representation. The existing works are restricted to a limited perception range within 50 meters. Extending the BEV representation range can greatly benefit downstream tasks such as topology reasoning, scene understanding, and…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: IEEE IV 2024

  43. arXiv:2407.08458  [pdf, other]

    cs.LG cs.NI eess.SP

    Joint Optimization of Age of Information and Energy Consumption in NR-V2X System based on Deep Reinforcement Learning

    Authors: Shulin Song, Zheng Zhang, Qiong Wu, Qiang Fan, Pingyi Fan

    Abstract: Autonomous driving may be the most important application scenario of the next generation, so the development of wireless access technologies enabling reliable and low-latency vehicle communication becomes crucial. To address this, 3GPP has developed Vehicle-to-Everything (V2X) specifications based on 5G New Radio (NR) technology, where Mode 2 Side-Link (SL) communication resembles Mode 4 in LTE-V2X, allo…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: This paper has been accepted by sensors. The source code has been released at: https://github.com/qiongwu86/Joint-Optimization-of-AoI-and-Energy-Consumption-in-NR-V2X-System-based-on-DRL

  44. arXiv:2407.08394  [pdf, other]

    cs.CV

    Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers

    Authors: Zhengbo Zhang, Li Xu, Duo Peng, Hossein Rahmani, Jun Liu

    Abstract: We introduce Diff-Tracker, a novel approach for the challenging unsupervised visual tracking task leveraging the pre-trained text-to-image diffusion model. Our main idea is to leverage the rich knowledge encapsulated within the pre-trained diffusion model, such as the understanding of image semantics and structural information, to address unsupervised visual tracking. To this end, we design an ini…

    Submitted 16 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  45. arXiv:2407.08296  [pdf, other]

    cs.LG

    Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

    Authors: Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang

    Abstract: Training Large Language Models (LLMs) is memory-intensive due to the large number of parameters and associated optimization states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent su…

    Submitted 11 July, 2024; originally announced July 2024.

  46. arXiv:2407.08196  [pdf, other]

    cs.AI

    SoupLM: Model Integration in Large Language and Multi-Modal Models

    Authors: Yue Bai, Zichen Zhang, Jiasen Lu, Yun Fu

    Abstract: Training large language models (LLMs) and multimodal LLMs necessitates significant computing resources, and existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks. For instance, LLaMA, Vicuna, and LLaVA are three LLM variants trained with LLaMA base models using very different training recipes, tasks, and data modalities. The traini…

    Submitted 11 July, 2024; originally announced July 2024.

  47. arXiv:2407.08021  [pdf, other]

    cs.MA

    Field Deployment of Multi-Agent Reinforcement Learning Based Variable Speed Limit Controllers

    Authors: Yuhang Zhang, Zhiyao Zhang, Marcos Quiñones-Grueiro, William Barbour, Clay Weston, Gautam Biswas, Daniel Work

    Abstract: This article presents the first field deployment of a multi-agent reinforcement-learning (MARL) based variable speed limit (VSL) control system on the I-24 freeway near Nashville, Tennessee. We describe how we train MARL agents in a traffic simulator and directly deploy the simulation-based policy on a 17-mile stretch of Interstate 24 with 67 VSL controllers. We use invalid action masking and seve…

    Submitted 10 July, 2024; originally announced July 2024.

  48. arXiv:2407.07791  [pdf, other]

    cs.CL

    Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities

    Authors: Tianjie Ju, Yiting Wang, Xinbei Ma, Pengzhou Cheng, Haodong Zhao, Yulong Wang, Lifeng Liu, Jian Xie, Zhuosheng Zhang, Gongshen Liu

    Abstract: The rapid adoption of large language models (LLMs) in multi-agent systems has highlighted their impressive capabilities in various applications, such as collaborative problem-solving and autonomous negotiation. However, the security implications of these LLM-based multi-agent systems have not been thoroughly investigated, particularly concerning the spread of manipulated knowledge. In this paper,…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: 18 Pages, working in progress

  49. arXiv:2407.07638  [pdf, other]

    cs.CV cs.AI

    Tuning Vision-Language Models with Candidate Labels by Prompt Alignment

    Authors: Zhifang Zhang, Beibei Li

    Abstract: Vision-language models (VLMs) can learn high-quality representations from a large-scale training dataset of image-text pairs. Prompt learning is a popular approach to fine-tuning VLM to adapt them to downstream tasks. Despite the satisfying performance, a major limitation of prompt learning is the demand for labelled data. In real-world scenarios, we may only obtain candidate labels (where the tru…

    Submitted 11 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

  50. arXiv:2407.07503  [pdf, other]

    cs.CV cs.IR

    Metasurface-based Snapshot Shortwave-Infrared Hyperspectral Image Reconstruction with Inter and Intra Prior Learning Network

    Authors: Linqiang Li, Jinglei Hao, Yongqiang Zhao, Pan Liu, Haofang Yan, Ziqin Zhang, Seong G. Kong

    Abstract: Shortwave-infrared (SWIR) spectral information, ranging from 1 μm to 2.5 μm, breaks the limitations of traditional color cameras in acquiring scene information and has been used in many fields. However, conventional SWIR hyperspectral imaging systems face challenges due to their bulky setups and low acquisition speed. In this work, we introduce a snapshot SWIR hyperspectral imaging system based on a…

    Submitted 10 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: 10 pages, 5 figures