Showing 1–50 of 798 results for author: Zha, L

Searching in archive cs.
  1. arXiv:2408.16500  [pdf, other]

    cs.CV

    CogVLM2: Visual Language Models for Image and Video Understanding

    Authors: Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang

    Abstract: Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2…

    Submitted 29 August, 2024; originally announced August 2024.

  2. arXiv:2408.15947  [pdf, other]

    eess.IV cs.CV

    Auxiliary Input in Training: Incorporating Catheter Features into Deep Learning Models for ECG-Free Dynamic Coronary Roadmapping

    Authors: Yikang Liu, Lin Zhao, Eric Z. Chen, Xiao Chen, Terrence Chen, Shanhui Sun

    Abstract: Dynamic coronary roadmapping is a technology that overlays the vessel maps (the "roadmap") extracted from an offline image sequence of X-ray angiography onto a live stream of X-ray fluoroscopy in real-time. It aims to offer navigational guidance for interventional surgeries without the need for repeated contrast agent injections, thereby reducing the risks associated with radiation exposure and ki…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: MICCAI 2024

  3. arXiv:2408.13797  [pdf, ps, other]

    cs.CG

    Approximation Algorithms for Minimum Sum of Moving-Distance and Opening-Costs Target Coverage Problem

    Authors: Lei Zhao, Zhao Zhang

    Abstract: In this paper, we study the Minimum Sum of Moving-Distance and Opening-Costs Target Coverage problem (MinMD$+$OCTC). Given a set of targets and a set of base stations on the plane, an opening cost function for every base station, the opened base stations can emit mobile sensors with a radius of $r$ from base station to cover the targets. The goal of MinMD$+$OCTC is to cover all the targets and min…

    Submitted 25 August, 2024; originally announced August 2024.

  4. arXiv:2408.13459  [pdf, other]

    cs.CV

    Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model

    Authors: Chen Rao, Guangyuan Li, Zehua Lan, Jiakai Sun, Junsheng Luan, Wei Xing, Lei Zhao, Huaizhong Lin, Jianfeng Dong, Dalong Zhang

    Abstract: Current video deblurring methods have limitations in recovering high-frequency information since the regression losses are conservative with high-frequency details. Since Diffusion Models (DMs) have strong capabilities in generating high-frequency details, we consider introducing DMs into the video deblurring task. However, we found that directly applying DMs to the video deblurring task has the f…

    Submitted 24 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

    ACM Class: I.4.4

  5. arXiv:2408.12757  [pdf, other]

    cs.DC

    NanoFlow: Towards Optimal Large Language Model Serving Throughput

    Authors: Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci

    Abstract: The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems' performance. To boost throughput, various methods of inter-device paralle…

    Submitted 22 August, 2024; originally announced August 2024.

  6. arXiv:2408.11811  [pdf, other]

    cs.CV cs.RO

    EmbodiedSAM: Online Segment Any 3D Thing in Real Time

    Authors: Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu

    Abstract: Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFM) have revolutionized the field of 2D computer vision with…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: Project page: https://xuxw98.github.io/ESAM/

  7. arXiv:2408.08813  [pdf, other]

    cs.CV

    Retrieval-augmented Few-shot Medical Image Segmentation with Foundation Models

    Authors: Lin Zhao, Xiao Chen, Eric Z. Chen, Yikang Liu, Terrence Chen, Shanhui Sun

    Abstract: Medical image segmentation is crucial for clinical decision-making, but the scarcity of annotated data presents significant challenges. Few-shot segmentation (FSS) methods show promise but often require retraining on the target domain and struggle to generalize across different modalities. Similarly, adapting foundation models like the Segment Anything Model (SAM) for medical imaging has limitatio…

    Submitted 16 August, 2024; originally announced August 2024.

  8. Representation Learning of Geometric Trees

    Authors: Zheng Zhang, Allen Zhang, Ruth Nelson, Giorgio Ascoli, Liang Zhao

    Abstract: Geometric trees are characterized by their tree-structured layout and spatially constrained nodes and edges, which significantly impacts their topological attributes. This inherent hierarchical structure plays a crucial role in domains such as neuron morphology and river geomorphology, but traditional graph representation methods often overlook these specific characteristics of tree structures. To…

    Submitted 16 August, 2024; originally announced August 2024.

  9. arXiv:2408.08669  [pdf, other]

    cs.SD eess.AS

    HSDreport: Heart Sound Diagnosis with Echocardiography Reports

    Authors: Zihan Zhao, Pingjie Wang, Liudan Zhao, Yuchen Yang, Ya Zhang, Kun Sun, Xin Sun, Xin Zhou, Yu Wang, Yanfeng Wang

    Abstract: Heart sound auscultation holds significant importance in the diagnosis of congenital heart disease. However, existing methods for Heart Sound Diagnosis (HSD) tasks are predominantly limited to a few fixed categories, framing the HSD task as a rigid classification problem that does not fully align with medical practice and offers only limited information to physicians. Besides, such methods do not…

    Submitted 16 August, 2024; originally announced August 2024.

  10. arXiv:2408.06359  [pdf, other]

    eess.SP cs.AI cs.LG

    An Adaptive CSI Feedback Model Based on BiLSTM for Massive MIMO-OFDM Systems

    Authors: Hongrui Shen, Long Zhao, Kan Zheng, Yuhua Cao, Pingzhi Fan

    Abstract: Deep learning (DL)-based channel state information (CSI) feedback has the potential to improve the recovery accuracy and reduce the feedback overhead in massive multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) systems. However, the length of input CSI and the number of feedback bits should be adjustable in different scenarios, which cannot be efficiently achie…

    Submitted 26 July, 2024; originally announced August 2024.

    Comments: 13 pages, 14 figures, 3 tables

  11. arXiv:2408.05136  [pdf, ps, other]

    cs.LG

    Cycle-Configuration: A Novel Graph-theoretic Descriptor Set for Molecular Inference

    Authors: Bowen Song, Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu

    Abstract: In this paper, we propose a novel family of descriptors of chemical graphs, named cycle-configuration (CC), that can be used in the standard "two-layered (2L) model" of mol-infer, a molecular inference framework based on mixed integer linear programming (MILP) and machine learning (ML). Proposed descriptors capture the notion of ortho/meta/para patterns that appear in aromatic rings, which has bee…

    Submitted 9 August, 2024; originally announced August 2024.

  12. arXiv:2408.04532  [pdf, other]

    cs.LG

    How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

    Authors: Xingwu Chen, Lei Zhao, Difan Zou

    Abstract: Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of t…

    Submitted 8 August, 2024; originally announced August 2024.

  13. arXiv:2408.03244  [pdf]

    cs.LO

    Modular assurance of an Autonomous Ferry using Contract-Based Design and Simulation-based Verification Principles

    Authors: Jon Arne Glomsrud, Stephanie Kemna, Chanjei Vasanthan, Luman Zhao, Dag McGeorge, Tom Arne Pedersen, Tobias Rye Torben, Børge Rokseth, Dong Trong Nguyen

    Abstract: With the introduction of autonomous technology into our society, e.g. autonomous shipping, it is important to assess and assure the safety of autonomous systems in a real-world context. Simulation-based testing is a common approach to attempt to verify performance of autonomous systems, but assurance also requires formal evidence. This paper introduces the Assurance of Digital Assets (ADA) framewo…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: 12 pages, 3 figures, final draft submitted to ICMASS/MTEC 2024 conference

  14. arXiv:2408.03230  [pdf, other]

    cs.CV

    Contrastive Learning for Image Complexity Representation

    Authors: Shipeng Liu, Liang Zhao, Dengfeng Chen, Zhanping Song

    Abstract: Quantifying and evaluating image complexity can be instrumental in enhancing the performance of various computer vision tasks. Supervised learning can effectively learn image complexity features from well-annotated datasets. However, creating such datasets requires expensive manual annotation costs. The models may learn human subjective biases from it. In this work, we introduce the MoCo v2 framew…

    Submitted 6 August, 2024; originally announced August 2024.

  15. arXiv:2408.01775  [pdf, other]

    cs.HC

    3DStoryline: Immersive Visual Storytelling

    Authors: Haonan Yao, Lixiang Zhao, Boyuan Chen, Kaiwen Li, Hai-Ning Liang, Lingyun Yu

    Abstract: Storyline visualization has emerged as an innovative method for illustrating the development and changes in stories across various domains. Traditional approaches typically represent stories with one line per character, progressing from left to right. While effective for simpler narratives, this method faces significant challenges when dealing with complex stories involving multiple characters, as…

    Submitted 3 August, 2024; originally announced August 2024.

    Comments: 9 pages

  16. arXiv:2408.01615  [pdf, other]

    cs.RO

    Three-dimensional Morphological Reconstruction of Millimeter-Scale Soft Continuum Robots based on Dual-Stereo-Vision

    Authors: Tian-Ao Ren, Wenyan Liu, Tao Zhang, Lei Zhao, Hongliang Ren, Jiewen Lai

    Abstract: Continuum robots can be miniaturized to just a few millimeters in diameter. Among these, notched tubular continuum robots (NTCR) show great potential in many delicate applications. Existing works in robotic modeling focus on kinematics and dynamics but still face challenges in reproducing the robot's morphology -- a significant factor that can expand the research landscape of continuum robots, esp…

    Submitted 15 August, 2024; v1 submitted 2 August, 2024; originally announced August 2024.

    Comments: 6 pages, 6 figures, submitted to ROBIO 2024

  17. arXiv:2408.00699  [pdf, other]

    cs.LG

    Granular-Balls based Fuzzy Twin Support Vector Machine for Classification

    Authors: Lixi Zhao, Weiping Ding, Duoqian Miao, Guangming Lang

    Abstract: The twin support vector machine (TWSVM) classifier has attracted increasing attention because of its low computational complexity. However, its performance tends to degrade when samples are affected by noise. The granular-ball fuzzy support vector machine (GBFSVM) classifier partly alleviates the adverse effects of noise, but it relies solely on the distance between the granular-ball's center and…

    Submitted 1 August, 2024; originally announced August 2024.

  18. arXiv:2407.19843  [pdf, other]

    cs.HC

    The Second Joint Workshop on Cross Reality

    Authors: Nanjia Wang, Yue Li, Francesco Chiossi, Fabian Pointecker, Lixiang Zhao, Daniel Zielasko

    Abstract: The 2nd Joint Workshop on Cross Reality (JWCR'24), organized as part of ISMAR 2024, seeks to explore the burgeoning field of Cross Reality (CR), which encompasses the seamless integration and transition between various points on the reality-virtuality continuum (RVC) such as Virtual Reality (VR), Augmented Virtuality (AV), and Augmented Reality (AR). This hybrid workshop aims to build upon the fou…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: 5 pages

    Journal ref: 2024 IEEE International Symposium on Mixed and Augmented Reality

  19. arXiv:2407.16982  [pdf, other]

    cs.CV cs.AI

    Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

    Authors: Lirui Zhao, Tianshuo Yang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang, Rongrong Ji

    Abstract: This paper addresses an important problem of object addition for images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with consistent visual context, such as lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve the background consistency or involve…

    Submitted 23 July, 2024; originally announced July 2024.

  20. arXiv:2407.16204  [pdf, other]

    cs.CV

    CLII: Visual-Text Inpainting via Cross-Modal Predictive Interaction

    Authors: Liang Zhao, Qing Guo, Xiaoguang Li, Song Wang

    Abstract: Image inpainting aims to fill missing pixels in damaged images and has achieved significant progress with cutting-edge learning techniques. Nevertheless, state-of-the-art inpainting methods are mainly designed for nature images and cannot correctly recover text within scene text images, and training existing models on the scene text images cannot fix the issues. In this work, we identify the visual-…

    Submitted 23 July, 2024; originally announced July 2024.

  21. arXiv:2407.16192  [pdf, other]

    cs.IR cs.CL

    How to Leverage Personal Textual Knowledge for Personalized Conversational Information Retrieval

    Authors: Fengran Mo, Longxiang Zhao, Kaiyu Huang, Yue Dong, Degen Huang, Jian-Yun Nie

    Abstract: Personalized conversational information retrieval (CIR) combines conversational and personalizable elements to satisfy various users' complex information needs through multi-turn interaction based on their backgrounds. The key promise is that the personal textual knowledge base (PTKB) can improve the CIR effectiveness because the retrieval results can be more related to the user's background. Howe…

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: Accepted to CIKM 2024

  22. arXiv:2407.14833  [pdf, other]

    cs.HC

    SpatialTouch: Exploring Spatial Data Visualizations in Cross-reality

    Authors: Lixiang Zhao, Tobias Isenberg, Fuqi Xie, Hai-Ning Liang, Lingyun Yu

    Abstract: We propose and study a novel cross-reality environment that seamlessly integrates a monoscopic 2D surface (an interactive screen with touch and pen input) with a stereoscopic 3D space (an augmented reality HMD) to jointly host spatial data visualizations. This innovative approach combines the best of two conventional methods of displaying and manipulating spatial 3D data, enabling users to fluidly…

    Submitted 20 July, 2024; originally announced July 2024.

    Comments: 15 pages, 20 figures, IEEE VIS 2024

  23. arXiv:2407.14733  [pdf, other]

    cs.LG cs.AI cs.CL

    Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL

    Authors: Yunseon Choi, Sangmin Bae, Seonghyun Ban, Minchan Jeong, Chuheng Zhang, Lei Song, Li Zhao, Jiang Bian, Kee-Eung Kim

    Abstract: With the advent of foundation models, prompt tuning has positioned itself as an important technique for directing model behaviors and eliciting desired responses. Prompt tuning regards selecting appropriate keywords included into the input, thereby adapting to the downstream task without adjusting or fine-tuning the model parameters. There is a wide range of work in prompt tuning, from approaches…

    Submitted 19 July, 2024; originally announced July 2024.

  24. arXiv:2407.13642  [pdf, other]

    cs.CV

    Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

    Authors: Xiaoyu Zhu, Hao Zhou, Pengfei Xing, Long Zhao, Hao Xu, Junwei Liang, Alexander Hauptmann, Ting Liu, Andrew Gallagher

    Abstract: In this paper, we investigate the use of diffusion models which are pre-trained on large-scale image-caption pairs for open-vocabulary 3D semantic understanding. We propose a novel method, namely Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  25. arXiv:2407.12339  [pdf, other]

    cs.CV

    Exploring Deeper! Segment Anything Model with Depth Perception for Camouflaged Object Detection

    Authors: Zhenni Yu, Xiaoqin Zhang, Li Zhao, Yi Bin, Guobao Xiao

    Abstract: This paper introduces a new Segment Anything Model with Depth Perception (DSAM) for Camouflaged Object Detection (COD). DSAM exploits the zero-shot capability of SAM to realize precise segmentation in the RGB-D domain. It consists of the Prompt-Deeper Module and the Finer Module. The Prompt-Deeper Module utilizes knowledge distillation and the Bias Correction Module to achieve the interaction betw…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: ACM MM 2024

  26. arXiv:2407.12002  [pdf, other]

    cs.MM cs.CV

    A Multimodal Transformer for Live Streaming Highlight Prediction

    Authors: Jiaxin Deng, Shiyao Wang, Dong Shen, Liqin Zhao, Fan Yang, Guorui Zhou, Gaofeng Meng

    Abstract: Recently, live streaming platforms have gained immense popularity. Traditional video highlight detection mainly focuses on visual features and utilizes both past and future content for prediction. However, live streaming requires models to infer without future frames and process complex multimodal interactions, including images, audio and text comments. To address these issues, we propose a multim…

    Submitted 15 June, 2024; originally announced July 2024.

    Comments: Accepted at ICME 2024 as poster presentation. arXiv admin note: text overlap with arXiv:2306.14392

  27. arXiv:2407.11346  [pdf, other]

    cs.CE

    DEDEM: Discontinuity Embedded Deep Energy Method for solving fracture mechanics problems

    Authors: Luyang Zhao, Qian Shao

    Abstract: Physics-Informed Neural Networks (PINNs) have attracted great attention for their ability to address forward and inverse problems of partial differential equations. However, approximating discontinuous functions by neural networks poses a considerable challenge, which results in high computational demands and low accuracy when solving fracture mechanics problems within the standard PINNs framework. In this pa…

    Submitted 15 July, 2024; originally announced July 2024.

  28. arXiv:2407.11298  [pdf, other]

    cs.RO

    ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

    Authors: Yaoyao Qian, Xupeng Zhu, Ondrej Biza, Shuo Jiang, Linfeng Zhao, Haojie Huang, Yu Qi, Robert Platt

    Abstract: Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that makes use of GPT-4o's advanced contextual reasoning for heavy clutter environment grasping strategies. ThinkGrasp can effectively identify and generate grasp poses for target objects, even wh…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Project website: https://h-freax.github.io/thinkgrasp_page/

  29. arXiv:2407.09874  [pdf, other]

    cs.CV cs.AI

    SeFi-CD: A Semantic First Change Detection Paradigm That Can Detect Any Change You Want

    Authors: Ling Zhao, Zhenyang Huang, Dongsheng Kuang, Chengli Peng, Jun Gan, Haifeng Li

    Abstract: The existing change detection (CD) methods can be summarized as the visual-first change detection (ViFi-CD) paradigm, which first extracts change features from visual differences and then assigns them specific semantic information. However, CD is essentially dependent on change regions of interest (CRoIs), meaning that the CD results are directly determined by the semantics changes of interest, mak…

    Submitted 13 July, 2024; originally announced July 2024.

  30. arXiv:2407.09509  [pdf, other]

    q-bio.NC cs.HC

    Brain Dialogue Interface (BDI): A User-Friendly fMRI Model for Interactive Brain Decoding

    Authors: Heng Huang, Lin Zhao, Zihao Wu, Xiaowei Yu, Jing Zhang, Xintao Hu, Dajiang Zhu, Tianming Liu

    Abstract: Brain decoding techniques are essential for understanding the neurocognitive system. Although numerous methods have been introduced in this field, accurately aligning complex external stimuli with brain activities remains a formidable challenge. To alleviate alignment difficulties, many studies have simplified their models by employing single-task paradigms and establishing direct links between br…

    Submitted 17 June, 2024; originally announced July 2024.

  31. arXiv:2407.08348  [pdf, other]

    cs.AI cs.CL cs.LG

    Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On

    Authors: Liang Zeng, Liangjun Zhong, Liang Zhao, Tianwen Wei, Liu Yang, Jujie He, Cheng Cheng, Rui Hu, Yang Liu, Shuicheng Yan, Han Fang, Yahui Zhou

    Abstract: In this paper, we investigate the underlying factors that potentially enhance the mathematical reasoning capabilities of large language models (LLMs). We argue that the data scaling law for math reasoning capabilities in modern LLMs is far from being saturated, highlighting how the model's quality improves with increases in data quantity. To support this claim, we introduce the Skywork-Math model…

    Submitted 17 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

  32. arXiv:2407.07356  [pdf, other]

    cs.CV

    Video In-context Learning

    Authors: Wentao Zhang, Junliang Guo, Tianyu He, Li Zhao, Linli Xu, Jiang Bian

    Abstract: In-context learning for vision data has been underexplored compared with that in natural language. Previous works studied image in-context learning, urging models to generate a single image guided by demonstrations. In this paper, we propose and study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences, each semantically gu…

    Submitted 10 July, 2024; originally announced July 2024.

  33. arXiv:2407.06503  [pdf, other]

    cs.LG

    Preference-Guided Reinforcement Learning for Efficient Exploration

    Authors: Guojian Wang, Faguo Wu, Xiao Zhang, Tianyuan Chen, Xuyang Chen, Lin Zhao

    Abstract: In this paper, we investigate preference-based reinforcement learning (PbRL) that allows reinforcement learning (RL) agents to learn from human feedback. This is particularly valuable when defining a fine-grain reward function is not feasible. However, this approach is inefficient and impractical for promoting deep exploration in hard-exploration tasks with long horizons and sparse rewards. To tac…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 13 pages, 17 figures

  34. arXiv:2407.06159  [pdf, other]

    cs.CV

    A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

    Authors: Xiaoli Zhang, Liying Wang, Libo Zhao, Xiongfei Li, Siwei Ma

    Abstract: Multi-modality image fusion aims at fusing specific-modality and shared-modality information from two source images. To tackle the problem of insufficient feature extraction and lack of semantic awareness for complex scenes, this paper focuses on how to model correlation-driven decomposing features and reason high-level graph representation by efficiently extracting complementary features and mult…

    Submitted 3 August, 2024; v1 submitted 11 June, 2024; originally announced July 2024.

  35. arXiv:2407.05769  [pdf, other]

    cs.CV

    Boosting 3D Object Detection with Semantic-Aware Multi-Branch Framework

    Authors: Hao Jing, Anhong Wang, Lijun Zhao, Yakun Yang, Donghan Bu, Jing Zhang, Yifan Zhang, Junhui Hou

    Abstract: In autonomous driving, LiDAR sensors are vital for acquiring 3D point clouds, providing reliable geometric information. However, traditional sampling methods of preprocessing often ignore semantic features, leading to detail loss and ground point interference in 3D object detection. To address this, we propose a multi-branch two-stage 3D object detection framework using a Semantic-aware Multi-bran…

    Submitted 8 July, 2024; originally announced July 2024.

  36. arXiv:2407.02505  [pdf, other]

    cs.CE cs.LG physics.flu-dyn

    A MgNO Method for Multiphase Flow in Porous Media

    Authors: Xinliang Liu, Xia Yang, Chen-Song Zhang, Lian Zhang, Li Zhao

    Abstract: This research investigates the application of Multigrid Neural Operator (MgNO), a neural operator architecture inspired by multigrid methods, in the simulation for multiphase flow within porous media. The architecture is adjusted to manage a variety of crucial factors, such as permeability and porosity heterogeneity. The study extends MgNO to time-dependent porous media flow problems and validate…

    Submitted 16 June, 2024; originally announced July 2024.

  37. arXiv:2407.02190  [pdf, other]

    cs.RO

    I2EKF-LO: A Dual-Iteration Extended Kalman Filter Based LiDAR Odometry

    Authors: Wenlu Yu, Jie Xu, Chengwei Zhao, Lijun Zhao, Thien-Minh Nguyen, Shenghai Yuan, Mingming Bai, Lihua Xie

    Abstract: LiDAR odometry is a pivotal technology in the fields of autonomous driving and autonomous mobile robotics. However, most of the current works focus on nonlinear optimization methods, and many challenges still exist in using the traditional Iterative Extended Kalman Filter (IEKF) framework to tackle the problem: IEKF only iterates over the observation equation, relying on a rough estimate of the…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted by IROS 2024

  38. arXiv:2407.01853  [pdf, other]

    cs.CL cs.AI cs.LG

    Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets

    Authors: Sathish Reddy Indurthi, Wenxuan Zhou, Shamil Chollampatt, Ravi Agrawal, Kaiqiang Song, Lingxiao Zhao, Chenguang Zhu

    Abstract: Advancements in Large Language Models (LLMs) have significantly enhanced instruction-following capabilities. However, most Instruction Fine-Tuning (IFT) datasets are predominantly in English, limiting model performance in other languages. Traditional methods for creating multilingual IFT datasets such as translating existing English IFT datasets or converting existing NLP datasets into IFT dataset…

    Submitted 1 July, 2024; originally announced July 2024.

  39. arXiv:2407.01601  [pdf, other]

    cs.LG cs.AI

    Unveiling and Controlling Anomalous Attention Distribution in Transformers

    Authors: Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao

    Abstract: With the advent of large models based on the Transformer architecture, researchers have observed an anomalous phenomenon in the Attention mechanism--there is a very high attention on the first element, which is prevalent across Transformer-based models. It is crucial to understand it for the development of techniques focusing on attention distribution, such as Key-Value (KV) Cache compression and…

    Submitted 3 July, 2024; v1 submitted 26 June, 2024; originally announced July 2024.

  40. Self-consistent Deep Geometric Learning for Heterogeneous Multi-source Spatial Point Data Prediction

    Authors: Dazhou Yu, Xiaoyun Gong, Yun Li, Meikang Qiu, Liang Zhao

    Abstract: Multi-source spatial point data prediction is crucial in fields like environmental monitoring and natural resource management, where integrating data from various sensors is the key to achieving a holistic environmental understanding. Existing models in this area often fall short due to their domain-specific nature and lack a strategy for integrating information from various sources in the absence…

    Submitted 30 June, 2024; originally announced July 2024.

  41. PolygonGNN: Representation Learning for Polygonal Geometries with Heterogeneous Visibility Graph

    Authors: Dazhou Yu, Yuntong Hu, Yun Li, Liang Zhao

    Abstract: Polygon representation learning is essential for diverse applications, encompassing tasks such as shape coding, building pattern classification, and geographic question answering. While recent years have seen considerable advancements in this field, much of the focus has been on single polygons, overlooking the intricate inner- and inter-polygonal relationships inherent in multipolygons. To addres…

    Submitted 30 June, 2024; originally announced July 2024.

  42. arXiv:2407.00569  [pdf, other]

    cs.CV cs.AI cs.CL

    Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

    Authors: Weihong Zhong, Xiaocheng Feng, Liang Zhao, Qiming Li, Lei Huang, Yuxuan Gu, Weitao Ma, Yuan Xu, Bing Qin

    Abstract: Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs' subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, w…

    Submitted 3 August, 2024; v1 submitted 29 June, 2024; originally announced July 2024.

    Comments: Accepted to ACL 2024 Main Conference. 21 pages, 20 figures

  43. arXiv:2407.00056  [pdf, other]

    cs.IR cs.AI cs.SI

    MMBee: Live Streaming Gift-Sending Recommendations via Multi-Modal Fusion and Behaviour Expansion

    Authors: Jiaxin Deng, Shiyao Wang, Yuchen Wang, Jiansong Qi, Liqin Zhao, Guorui Zhou, Gaofeng Meng

    Abstract: Live streaming services are becoming increasingly popular due to real-time interactions and entertainment. Viewers can chat and send comments or virtual gifts to express their preferences for the streamers. Accurately modeling the gifting interaction not only enhances users' experience but also increases streamers' revenue. Previous studies on live streaming gifting prediction treat this task as a…

    Submitted 15 June, 2024; originally announced July 2024.

    Comments: Accepted at KDD 2024

  44. arXiv:2406.19311  [pdf, other]

    cs.CR cs.SD eess.AS

    Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems

    Authors: Zheng Fang, Tao Wang, Lingchen Zhao, Shenyi Zhang, Bowen Li, Yunjie Ge, Qi Li, Chao Shen, Qian Wang

    Abstract: In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024

  45. arXiv:2406.18583  [pdf, other]

    cs.CV cs.LG

    Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

    Authors: Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

    Abstract: Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lu…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Code at: https://github.com/Alpha-VLLM/Lumina-T2X

  46. arXiv:2406.15677  [pdf, other]

    cs.RO

    Open-vocabulary Pick and Place via Patch-level Semantic Maps

    Authors: Mingxi Jia, Haojie Huang, Zhewen Zhang, Chenghao Wang, Linfeng Zhao, Dian Wang, Jason Xinyu Liu, Robin Walters, Robert Platt, Stefanie Tellex

    Abstract: Controlling robots through natural language instructions in open-vocabulary scenarios is pivotal for enhancing human-robot collaboration and complex robot behavior synthesis. However, achieving this capability poses significant challenges due to the need for a system that can generalize from limited data to a wide range of tasks and environments. Existing methods rely on large, costly datasets and…

    Submitted 21 June, 2024; originally announced June 2024.

  47. arXiv:2406.14885  [pdf, other]

    cs.HC

    Ink and Algorithm: Exploring Temporal Dynamics in Human-AI Collaborative Writing

    Authors: Kaixun Yang, Yixin Cheng, Linxuan Zhao, Mladen Raković, Zachari Swiecki, Dragan Gašević, Guanliang Chen

    Abstract: The advent of Generative Artificial Intelligence (GAI) has revolutionized the field of writing, marking a shift towards human-AI collaborative writing in education. However, the dynamics of human-AI interaction in the collaborative writing process are not well understood, and thus it remains largely unknown how human learning can be effectively supported with such cutting-edge GAI technologies. In…

    Submitted 21 June, 2024; originally announced June 2024.

  48. arXiv:2406.14862  [pdf, other]

    cs.LG cs.CL cs.CV

    LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models

    Authors: Mengdan Zhu, Raasikh Kanjiani, Jiahui Lu, Andrew Choi, Qirui Ye, Liang Zhao

    Abstract: Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framewo…

    Submitted 28 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

  49. arXiv:2406.13215  [pdf, other]

    cs.CV cs.AI

    Neural Residual Diffusion Models for Deep Scalable Vision Generation

    Authors: Zhiyuan Ma, Liangliang Zhao, Biqing Qi, Bowen Zhou

    Abstract: The most advanced diffusion models have recently adopted increasingly deep stacked networks (e.g., U-Net or Transformer) to promote the generative emergence capabilities of vision generation models similar to large language models (LLMs). However, progressively deeper stacked networks will intuitively cause numerical propagation errors and reduce noisy prediction capabilities on generative data, w…

    Submitted 21 July, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

  50. arXiv:2406.13105  [pdf, other]

    cs.CV

    A transformer boosted UNet for smoke segmentation in complex backgrounds in multispectral LandSat imagery

    Authors: Jixue Liu, Jiuyong Li, Stefan Peters, Liang Zhao

    Abstract: Many studies have been done to detect smokes from satellite imagery. However, these prior methods are not still effective in detecting various smokes in complex backgrounds. Smokes present challenges in detection due to variations in density, color, lighting, and backgrounds such as clouds, haze, and/or mist, as well as the contextual nature of thin smoke. This paper addresses these challenges by…

    Submitted 18 June, 2024; originally announced June 2024.