Skip to main content

Showing 1–50 of 871 results for author: Lu, H

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.13500  [pdf, other

    cs.CV

    FADE: A Task-Agnostic Upsampling Operator for Encoder-Decoder Architectures

    Authors: Hao Lu, Wenze Liu, Hongtao Fu, Zhiguo Cao

    Abstract: The goal of this work is to develop a task-agnostic feature upsampling operator for dense prediction where the operator is required to facilitate not only region-sensitive tasks like semantic segmentation but also detail-sensitive tasks such as image matting. Prior upsampling operators often can work well in either type of the tasks, but not both. We argue that task-agnostic upsampling should dyna… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted to International Journal of Computer Vision. Extended version of ECCV 2022 paper at arXiv:2207.10392

  2. arXiv:2407.13483  [pdf, other

    cs.CV

    SCAPE: A Simple and Strong Category-Agnostic Pose Estimator

    Authors: Yujia Liang, Zixuan Ye, Wenze Liu, Hao Lu

    Abstract: Category-Agnostic Pose Estimation (CAPE) aims to localize keypoints on an object of any category given few exemplars in an in-context manner. Prior arts involve sophisticated designs, e.g., sundry modules for similarity calculation and a two-stage framework, or takes in extra heatmap generation and supervision. We notice that CAPE is essentially a task about feature matching, which can be solved w… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024. Code is available at https://github.com/tiny-smart/SCAPE

  3. arXiv:2407.13306  [pdf, ps, other

    cs.IT eess.SP

    Group Movable Antenna With Flexible Sparsity: Joint Array Position and Sparsity Optimization

    Authors: Haiquan Lu, Yong Zeng, Shi Jin, Rui Zhang

    Abstract: Movable antenna (MA) is a promising technology to exploit the spatial variation of wireless channel for performance enhancement, by dynamically varying the antenna position within a certain region. However, for multi-antenna communication systems, moving each antenna independently not only requires prohibitive complexity to find the optimal antenna positions, but also incurs sophisticated movement… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 5 pages, 5 figures

  4. arXiv:2407.12996  [pdf, other

    stat.ML cs.LG

    Sharpness-diversity tradeoff: improving flat ensembles with SharpBalance

    Authors: Haiquan Lu, Xiaotian Liu, Yefan Zhou, Qunli Li, Kurt Keutzer, Michael W. Mahoney, Yujun Yan, Huanrui Yang, Yaoqing Yang

    Abstract: Recent studies on deep ensembles have identified the sharpness of the local minima of individual learners and the diversity of the ensemble members as key factors in improving test-time performance. Building on this, our study investigates the interplay between sharpness and diversity within deep ensembles, illustrating their crucial role in robust generalization to both in-distribution (ID) and o… ▽ More

    Submitted 17 July, 2024; originally announced July 2024.

  5. arXiv:2407.12593  [pdf, other

    cs.CV

    EvSign: Sign Language Recognition and Translation with Streaming Events

    Authors: Pengyu Zhang, Hao Yin, Zeren Wang, Wenyue Chen, Shengming Li, Dong Wang, Huchuan Lu, and Xu Jia

    Abstract: Sign language is one of the most effective communication tools for people with hearing difficulties. Most existing works focus on improving the performance of sign language tasks on RGB videos, which may suffer from degraded recording conditions, such as fast movement of hands with motion blur and textured signer's appearance. The bio-inspired event camera, which asynchronously captures brightness… ▽ More

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: To appear on ECCV 2024

  6. arXiv:2407.12014  [pdf, other

    cs.HC cs.CY

    Surprising Performances of Students with Autism in Classroom with NAO Robot

    Authors: Qin Yang, Huan Lu, Dandan Liang, Shengrong Gong, Huanghao Feng

    Abstract: Autism is a developmental disorder that manifests in early childhood and persists throughout life, profoundly affecting social behavior and hindering the acquisition of learning and social skills in those diagnosed. As technological advancements progress, an increasing array of technologies is being utilized to support the education of students with Autism Spectrum Disorder (ASD), aiming to improv… ▽ More

    Submitted 26 June, 2024; originally announced July 2024.

  7. arXiv:2407.11503  [pdf, other

    cs.CV

    Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation

    Authors: Shijie Chang, Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu

    Abstract: Existing few-shot segmentation (FSS) methods mainly focus on prototype feature generation and the query-support matching mechanism. As a crucial prompt for generating prototype features, the pair of image-mask types in the support set has become the default setting. However, various types such as image, text, box, and mask all can provide valuable information regarding the objects in context, clas… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Preprint under review

  8. arXiv:2407.11031  [pdf, other

    cs.LG eess.SP

    Purification Of Contaminated Convolutional Neural Networks Via Robust Recovery: An Approach with Theoretical Guarantee in One-Hidden-Layer Case

    Authors: Hanxiao Lu, Zeyu Huang, Ren Wang

    Abstract: Convolutional neural networks (CNNs), one of the key architectures of deep learning models, have achieved superior performance on many machine learning tasks such as image classification, video recognition, and power systems. Despite their success, CNNs can be easily contaminated by natural noises and artificially injected noises such as backdoor attacks. In this paper, we propose a robust recover… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

  9. arXiv:2407.08106  [pdf, other

    cs.RO

    SGLC: Semantic Graph-Guided Coarse-Fine-Refine Full Loop Closing for LiDAR SLAM

    Authors: Neng Wang, Xieyuanli Chen, Chenghao Shi, Zhiqiang Zheng, Hongshan Yu, Huimin Lu

    Abstract: Loop closing is a crucial component in SLAM that helps eliminate accumulated errors through two main steps: loop detection and loop pose correction. The first step determines whether loop closing should be performed, while the second estimates the 6-DoF pose to correct odometry drift. Current methods mostly focus on developing robust descriptors for loop closure detection, often neglecting loop po… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: 8 pages, 4 figures

  10. arXiv:2407.07760  [pdf, other

    cs.CV cs.AI

    Learning Spatial-Semantic Features for Robust Video Object Segmentation

    Authors: Xin Li, Deshui Miao, Zhenyu He, Yaowei Wang, Huchuan Lu, Ming-Hsuan Yang

    Abstract: Tracking and segmenting multiple similar objects with complex or separate parts in long-term videos is inherently challenging due to the ambiguity of target parts and identity confusion caused by occlusion, background clutter, and long-term variations. In this paper, we propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries to… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Winner solution of the VOTS2024 Challenge

  11. arXiv:2407.07523  [pdf, other

    cs.CV cs.MM

    SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

    Authors: Haiwen Diao, Bo Wan, Xu Jia, Yunzhi Zhuge, Ying Zhang, Huchuan Lu, Long Chen

    Abstract: Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for adapting large pre-trained models to downstream tasks, greatly reducing trainable parameters while grappling with memory challenges during fine-tuning. To address it, memory-efficient series (METL) avoid backpropagating gradients through the large backbone. However, they compromise by exclusively relying o… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: 23 pages, 11 figures, Accepted by ECCV2024

  12. arXiv:2407.06770  [pdf, other

    cs.RO

    Pretraining-finetuning Framework for Efficient Co-design: A Case Study on Quadruped Robot Parkour

    Authors: Ci Chen, Jiyu Yu, Haojian Lu, Hongbo Gao, Rong Xiong, Yue Wang

    Abstract: In nature, animals with exceptional locomotion abilities, such as cougars, often possess asymmetric fore and hind legs, with their powerful hind legs acting as reservoirs of energy for leaps. This observation inspired us: could optimize the leg length of quadruped robots endow them with similar locomotive capabilities? In this paper, we propose an approach that co-optimizes the mechanical structur… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

  13. arXiv:2407.05407  [pdf, other

    cs.SD cs.AI eess.AS

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

    Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role… ▽ More

    Submitted 9 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

  14. arXiv:2407.04803  [pdf, other

    cs.LG cs.AI

    The Impact of Quantization and Pruning on Deep Reinforcement Learning Models

    Authors: Heng Lu, Mehdi Alemi, Reza Rawassizadeh

    Abstract: Deep reinforcement learning (DRL) has achieved remarkable success across various domains, such as video games, robotics, and, recently, large language models. However, the computational costs and memory requirements of DRL models often limit their deployment in resource-constrained environments. The challenge underscores the urgent need to explore neural network compression methods to make RDL mod… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  15. arXiv:2407.04093  [pdf, other

    cs.CL

    Stephanie: Step-by-Step Dialogues for Mimicking Human Interactions in Social Conversations

    Authors: Hao Yang, Hongyuan Lu, Xinhua Zeng, Yang Liu, Xiang Zhang, Haoran Yang, Yumeng Zhang, Shan Huang, Yiran Wei, Wai Lam

    Abstract: In the rapidly evolving field of natural language processing, dialogue systems primarily employ a single-step dialogue paradigm. Although this paradigm is efficient, it lacks the depth and fluidity of human interactions and does not appear natural. We introduce a novel \textbf{Step}-by-Step Dialogue Paradigm (Stephanie), designed to mimic the ongoing dynamic nature of human conversations. By emplo… ▽ More

    Submitted 12 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

  16. arXiv:2407.04051  [pdf, other

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang , et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp… ▽ More

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  17. arXiv:2407.03618  [pdf, other

    cs.IR cs.CL

    BM25S: Orders of magnitude faster lexical search via eager sparse scoring

    Authors: Xing Han Lù

    Abstract: We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy and Scipy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by pop… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Technical Report

  18. arXiv:2407.03169  [pdf, other

    cs.CL cs.SD eess.AS

    Investigating Decoder-only Large Language Models for Speech-to-text Translation

    Authors: Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri

    Abstract: Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs to the task of speech-to-text translation (S2TT). We propose a decoder-only architecture that enables the LLM to directly consume the encoded sp… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted to Interspeech 2024

  19. arXiv:2407.02252  [pdf, other

    cs.CV

    GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models

    Authors: Jian Ma, Yonglin Deng, Chen Chen, Haonan Lu, Zhenyu Yang

    Abstract: Posters play a crucial role in marketing and advertising, contributing significantly to industrial design by enhancing visual communication and brand visibility. With recent advances in controllable text-to-image diffusion models, more concise research is now focusing on rendering text within synthetic images. Despite improvements in text rendering accuracy, the field of end-to-end poster generati… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  20. arXiv:2407.02031  [pdf, other

    cs.DC cs.AI cs.LG

    SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

    Authors: Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang

    Abstract: This paper documents our characterization study and practices for serving text-to-image requests with stable diffusion models in production. We first comprehensively analyze inference request traces for commercial text-to-image applications. It commences with our observation that add-on modules, i.e., ControlNets and LoRAs, that augment the base stable diffusion models, are ubiquitous in generatin… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  21. arXiv:2407.01094  [pdf, other

    cs.CV

    Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

    Authors: Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng Zuo, Qixiang Ye, Jingdong Wang

    Abstract: Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  22. arXiv:2406.19632  [pdf, other

    cs.CV

    PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation

    Authors: Deyi Ji, Wenwei Jin, Hongtao Lu, Feng Zhao

    Abstract: The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these is… ▽ More

    Submitted 11 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

    Comments: IJCAI 2024

  23. arXiv:2406.19006  [pdf, other

    cs.CV

    VideoMambaPro: A Leap Forward for Mamba in Video Understanding

    Authors: Hui Lu, Albert Ali Salah, Ronald Poppe

    Abstract: Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative for transformers. However, Mamba's successes do not trivially extend to computer vision tasks, including those in video analysis. In this paper… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  24. arXiv:2406.18938  [pdf, other

    cs.IR

    Towards Personalized Federated Multi-scenario Multi-task Recommendation

    Authors: Yue Ding, Yanbiao Ji, Xun Cai, Xin Xin, Xiaofeng Gao, Hongtao Lu

    Abstract: In modern recommender system applications, such as e-commerce, predicting multiple targets like click-through rate (CTR) and post-view click-through \& conversion rate (CTCVR) is common. Multi-task recommender systems are gaining traction in research and practical use. Existing multi-task recommender systems tackle diverse business scenarios, merging and modeling these scenarios unlocks shared kno… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  25. arXiv:2406.18379  [pdf, other

    cs.CR cs.AI cs.SE

    MALSIGHT: Exploring Malicious Source Code and Benign Pseudocode for Iterative Binary Malware Summarization

    Authors: Haolang Lu, Hongrui Peng, Guoshun Nan, Jiaoyang Cui, Cheng Wang, Weifei Jin

    Abstract: Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks like malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: 17 pages, 14 figures

  26. arXiv:2406.17100  [pdf, other

    cs.CV

    Fine-tuning Diffusion Models for Enhancing Face Quality in Text-to-image Generation

    Authors: Zhenyi Liao, Qingsong Xie, Chen Chen, Hannan Lu, Zhijie Deng

    Abstract: Diffusion models (DMs) have achieved significant success in generating imaginative images given textual descriptions. However, they are likely to fall short when it comes to real-life scenarios with intricate details.The low-quality, unrealistic human faces in text-to-image generation are one of the most prominent issues, hindering the wide application of DMs in practice. Targeting addressing such… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Under review

  27. arXiv:2406.16740  [pdf, other

    cs.LG math.NA

    Learning the boundary-to-domain mapping using Lifting Product Fourier Neural Operators for partial differential equations

    Authors: Aditya Kashi, Arka Daw, Muralikrishnan Gopalakrishnan Meena, Hao Lu

    Abstract: Neural operators such as the Fourier Neural Operator (FNO) have been shown to provide resolution-independent deep learning models that can learn mappings between function spaces. For example, an initial condition can be mapped to the solution of a partial differential equation (PDE) at a future time-step using a neural operator. Despite the popularity of neural operators, their use to predict solu… ▽ More

    Submitted 1 July, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted by ICML 2024 AI for Science Workshop

    MSC Class: 65N99; 68T07 ACM Class: I.2.1; J.2

  28. arXiv:2406.16279  [pdf, other

    cs.CV

    SegNet4D: Effective and Efficient 4D LiDAR Semantic Segmentation in Autonomous Driving Environments

    Authors: Neng Wang, Ruibin Guo, Chenghao Shi, Hui Zhang, Huimin Lu, Zhiqiang Zheng, Xieyuanli Chen

    Abstract: 4D LiDAR semantic segmentation, also referred to as multi-scan semantic segmentation, plays a crucial role in enhancing the environmental understanding capabilities of autonomous vehicles. It entails identifying the semantic category of each point in the LiDAR scan and distinguishing whether it is dynamic, a critical aspect in downstream tasks such as path planning and autonomous navigation. Exist… ▽ More

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 10 pages, 5 figures

  29. arXiv:2406.14773  [pdf, other

    cs.CR

    Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data

    Authors: Shenglai Zeng, Jiankun Zhang, Pengfei He, Jie Ren, Tianqi Zheng, Hanqing Lu, Han Xu, Hui Liu, Yue Xing, Jiliang Tang

    Abstract: Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. However, when the retrieval process involves private data, RAG systems may face severe privacy risks, potentially leading to the leakage of sensitive information. To address this issue, we propose using synthetic data as a privacy-preserving al… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  30. arXiv:2406.14186  [pdf, other

    eess.IV cs.CV

    CriDiff: Criss-cross Injection Diffusion Framework via Generative Pre-train for Prostate Segmentation

    Authors: Tingwei Liu, Miao Zhang, Leiye Liu, Jialong Zhong, Shuyao Wang, Yongri Piao, Huchuan Lu

    Abstract: Recently, the Diffusion Probabilistic Model (DPM)-based methods have achieved substantial success in the field of medical image segmentation. However, most of these methods fail to enable the diffusion model to learn edge features and non-edge features effectively and to inject them efficiently into the diffusion backbone. Additionally, the domain gap between the images features and the diffusion… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Accepted in MICCAI 2024

  31. arXiv:2406.14129  [pdf, other

    cs.CV cs.CL cs.MM

    Towards Event-oriented Long Video Understanding

    Authors: Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

    Abstract: With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Work on progress

  32. arXiv:2406.11832  [pdf, other

    cs.CV cs.MM

    Unveiling Encoder-Free Vision-Language Models

    Authors: Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

    Abstract: Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 16 pages, 7 figures

  33. arXiv:2406.10475  [pdf, other

    cs.CV

    Discrete Latent Perspective Learning for Segmentation and Detection

    Authors: Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, Jieping Ye

    Abstract: In this paper, we address the challenge of Perspective-Invariant Learning in machine learning and computer vision, which involves enabling a network to understand images from varying perspectives to achieve consistent semantic interpretation. While standard approaches rely on the labor-intensive collection of multi-view images or limited data augmentation techniques, we propose a novel framework,… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: ICML 2024 Spotlight

  34. arXiv:2406.09367  [pdf, other

    cs.CV

    Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

    Authors: Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Weipeng Chen, Jing Liu

    Abstract: Video understanding is a crucial next step for multimodal large language models (MLLMs). To probe specific aspects of video understanding ability, existing video benchmarks typically require careful video selection based on the target capability, along with laborious annotation of query-response pairs to match the specific video content. This process is both challenging and resource-intensive. In… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  35. arXiv:2406.08115  [pdf, other

    cs.DC cs.AI

    Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey

    Authors: Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor C. M. Leung, Yanyi Guo, Xiping Hu

    Abstract: With rapidly increasing distributed deep learning workloads in large-scale data centers, efficient distributed deep learning framework strategies for resource allocation and workload scheduling have become the key to high-performance deep learning. The large-scale environment with large volumes of datasets, models, and computational and communication resources raises various unique challenges for… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  36. arXiv:2406.05768  [pdf, other

    cs.CV cs.AI

    MLCM: Multistep Consistency Distillation of Latent Diffusion Model

    Authors: Qingsong Xie, Zhenyi Liao, Chen chen, Zhijie Deng, Shixiang Tang, Haonan Lu

    Abstract: Distilling large latent diffusion models (LDMs) into ones that are fast to sample from is attracting growing research interest. However, the majority of existing methods face a dilemma where they either (i) depend on multiple individual distilled models for different sampling budgets, or (ii) sacrifice generation quality with limited (e.g., 2-4) and/or moderate (e.g., 5-8) sampling steps. To addre… ▽ More

    Submitted 11 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

  37. arXiv:2406.04548  [pdf, other

    cs.LG cs.IR cs.SI

    GNNAnatomy: Systematic Generation and Evaluation of Multi-Level Explanations for Graph Neural Networks

    Authors: Hsiao-Ying Lu, Yiran Li, Ujwal Pratap Krishna Kaluvakolanu Thyagarajan, Kwan-Liu Ma

    Abstract: Graph Neural Networks (GNNs) have proven highly effective in various machine learning (ML) tasks involving graphs, such as node/graph classification and link prediction. However, explaining the decisions made by GNNs poses challenges because of the aggregated relational information based on graph structure, leading to complex data transformations. Existing methods for explaining GNNs often face li… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

  38. arXiv:2406.03474  [pdf, other

    cs.CV

    AD-H: Autonomous Driving with Hierarchical Agents

    Authors: Zaibin Zhang, Shiyu Tang, Yuanhang Zhang, Talas Fu, Yifan Wang, Yang Liu, Dong Wang, Jing Shao, Lijun Wang, Huchuan Lu

    Abstract: Due to the impressive capabilities of multimodal large language models (MLLMs), recent works have focused on employing MLLM-based agents for autonomous driving in large-scale and dynamic environments. However, prevalent approaches often directly translate high-level instructions into low-level vehicle control signals, which deviates from the inherent language generation paradigm of MLLMs and fails… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

  39. arXiv:2406.03464  [pdf, other

    cs.LG

    Node-wise Filtering in Graph Neural Networks: A Mixture of Experts Approach

    Authors: Haoyu Han, Juanhui Li, Wei Huang, Xianfeng Tang, Hanqing Lu, Chen Luo, Hui Liu, Jiliang Tang

    Abstract: Graph Neural Networks (GNNs) have proven to be highly effective for node classification tasks across diverse graph structural patterns. Traditionally, GNNs employ a uniform global filter, typically a low-pass filter for homophilic graphs and a high-pass filter for heterophilic graphs. However, real-world graphs often exhibit a complex mix of homophilic and heterophilic patterns, rendering a single… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

  40. arXiv:2406.03176  [pdf, other

    cs.CV

    MMCL: Boosting Deformable DETR-Based Detectors with Multi-Class Min-Margin Contrastive Learning for Superior Prohibited Item Detection

    Authors: Mingyuan Li, Tong Jia, Hui Lu, Bowen Ma, Hao Wang, Dongyue Chen

    Abstract: Prohibited Item detection in X-ray images is one of the most effective security inspection methods.However, differing from natural light images, the unique overlapping phenomena in X-ray images lead to the coupling of foreground and background features, thereby lowering the accuracy of general object detectors.Therefore, we propose a Multi-Class Min-Margin Contrastive Learning (MMCL) method that,… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 14 pages, 6 figures

  41. arXiv:2406.02940  [pdf, other

    cs.SD eess.AS

    Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

    Authors: Haohan Guo, Fenglong Xie, Dongchao Yang, Hui Lu, Xixin Wu, Helen Meng

    Abstract: VQ-VAE, as a mainstream approach of speech tokenizer, has been troubled by ``index collapse'', where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into c… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

  42. arXiv:2406.02283  [pdf, other

    cs.RO

    Broadcasting Support Relations Recursively from Local Dynamics for Object Retrieval in Clutters

    Authors: Yitong Li, Ruihai Wu, Haoran Lu, Chuanruo Ning, Yan Shen, Guanqi Zhan, Hao Dong

    Abstract: In our daily life, cluttered objects are everywhere, from scattered stationery and books cluttering the table to bowls and plates filling the kitchen sink. Retrieving a target object from clutters is an essential while challenging skill for robots, for the difficulty of safely manipulating an object without disturbing others, which requires the robot to plan a manipulation sequence and first move… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: RSS 2024

  43. arXiv:2406.01333  [pdf, other

    cs.CL cs.AI

    Probing Language Models for Pre-training Data Detection

    Authors: Zhenhua Liu, Tong Zhu, Chuanyuan Tan, Haonan Lu, Bing Liu, Wenliang Chen

    Abstract: Large Language Models (LLMs) have shown their impressive capabilities, while also raising concerns about the data contamination problems due to privacy issues and leakage of benchmark datasets in the pre-training phase. Therefore, it is vital to detect the contamination by checking whether an LLM has been pre-trained on the target texts. Recent studies focus on the generated texts and compute perp… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL-2024 main conference

  44. arXiv:2405.20457  [pdf, other

    cs.SI cs.CY cs.HC

    Online network topology shapes personal narratives and hashtag generation

    Authors: J. Hunter Priniski, Bryce Linford, Sai Krishna, Fred Morstatter, Jeff Brantingham, Hongjing Lu

    Abstract: While narratives have shaped cognition and cultures for centuries, digital media and online social networks have introduced new narrative phenomena. With increased narrative agency, networked groups of individuals can directly contribute and steer narratives that center our collective discussions of politics, science, and morality. We report the results of an online network experiment on narrative… ▽ More

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Will be published in the 2024 Proceedings of the Cognitive Science Society

  45. arXiv:2405.19730  [pdf

    cs.AI cs.CV cs.LG

    Research on Foundation Model for Spatial Data Intelligence: China's 2024 White Paper on Strategic Development of Spatial Data Intelligence

    Authors: Shaohua Wang, Xing Xie, Yong Li, Danhuai Guo, Zhi Cai, Yu Liu, Yang Yue, Xiao Pan, Feng Lu, Huayi Wu, Zhipeng Gui, Zhiming Ding, Bolong Zheng, Fuzheng Zhang, Tao Qin, Jingyuan Wang, Chuang Tao, Zhengchao Chen, Hao Lu, Jiayi Li, Hongyang Chen, Peng Yue, Wenhao Yu, Yao Yao, Leilei Sun , et al. (9 additional authors not shown)

    Abstract: This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models. It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models, as well as the challenges they face. The report systematically elucidates the key technologies of spatial dat… ▽ More

    Submitted 29 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

    Comments: in Chinese language

  46. arXiv:2405.18003  [pdf, other

    cs.CV cs.AI

    MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling

    Authors: Bowen Zhang, Xiaofei Xie, Haotian Lu, Na Ma, Tianlin Li, Qing Guo

    Abstract: Diffusion-based video generation has achieved significant progress, yet generating multiple actions that occur sequentially remains a formidable task. Directly generating a video with sequential actions can be extremely challenging due to the scarcity of fine-grained action annotations and the difficulty in establishing temporal semantic correspondences and maintaining long-term consistency. To ta… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

  47. arXiv:2405.18001  [pdf, other

    cs.NI

    Network-Aware Reliability Modeling and Optimization for Microservice Placement

    Authors: Fangyu Zhang, Yuang Chen, Hancheng Lu, Yongsheng Huang

    Abstract: Optimizing microservice placement to enhance the reliability of services is crucial for improving the service level of microservice architecture-based mobile networks and Internet of Things (IoT) networks. Despite extensive research on service reliability, the impact of network load and routing on service reliability remains understudied, leading to suboptimal models and unsatisfactory performance… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: 13 pages, 8 figures, submitted to IEEE transaction for potential publication

  48. arXiv:2405.16886  [pdf, other

    cs.CV

    Hawk: Learning to Understand Open-World Video Anomalies

    Authors: Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, Ying-Cong Chen

    Abstract: Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs. However, current VAD systems are often limited by their superficial semantic understanding of scenes and minimal user interaction. Additionally, the prevalent data scarcity in existing datasets restricts their applicability in open-world scenarios. In t… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

  49. arXiv:2405.16471  [pdf, other

    cs.NI

    Performance Optimization in RSMA-assisted Uplink xURLLC IIoT Networks with Statistical QoS Provisioning

    Authors: Yuang Chen, Hancheng Lu, Chang Wu, Langtian Qin, Xiaobo Guo

    Abstract: Industry 5.0 and beyond networks have driven the emergence of numerous mission-critical applications, prompting contemplation of the neXt-generation ultra-reliable low-latency communication (xURLLC). To guarantee low-latency requirements, xURLLC heavily relies on short-blocklength packets with sporadic arrival traffic. As a disruptive multi-access technique, rate-splitting multiple access (RSMA) h… ▽ More

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: 13 pages, 8 figures, submitted to IEEE Transactions for potential publication

  50. arXiv:2405.16363  [pdf, other

    cs.IR cs.AI

    LLMs for User Interest Exploration in Large-scale Recommendation Systems

    Authors: Jianling Wang, Haokai Lu, Yifan Liu, He Ma, Yueqi Wang, Yang Gu, Shuzhou Zhang, Ningren Han, Shuchao Bi, Lexi Baugher, Ed Chi, Minmin Chen

    Abstract: Traditional recommendation systems are subject to a strong feedback loop by learning from and reinforcing past user-item interactions, which in turn limits the discovery of novel user interests. To address this, we introduce a hybrid hierarchical framework combining Large Language Models (LLMs) and classic recommendation models for user interest exploration. The framework controls the interfacing… ▽ More

    Submitted 7 June, 2024; v1 submitted 25 May, 2024; originally announced May 2024.