Showing 1–50 of 1,005 results for author: Chen, G

Searching in archive cs.

  1. arXiv:2407.13092  [pdf, other]

    eess.IV cs.CV

    CC-DCNet: Dynamic Convolutional Neural Network with Contrastive Constraints for Identifying Lung Cancer Subtypes on Multi-modality Images

    Authors: Yuan Jin, Gege Ma, Geng Chen, Tianling Lyu, Jan Egger, Junhui Lyu, Shaoting Zhang, Wentao Zhu

    Abstract: The accurate diagnosis of pathological subtypes of lung cancer is of paramount importance for follow-up treatments and prognosis management. Assessment methods utilizing deep learning technologies have introduced novel approaches for clinical diagnosis. However, the majority of existing models rely solely on single-modality image input, leading to limited diagnostic accuracy. To this end, we prop…

    Submitted 17 July, 2024; originally announced July 2024.

  2. arXiv:2407.12709  [pdf, other]

    cs.CV

    MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

    Authors: Leyang Shen, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie

    Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, a generalist MLLM typically underperforms compared with a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoM…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Github: https://github.com/JiuTian-VL/MoME

  3. arXiv:2407.12582  [pdf, other]

    cs.CV cs.AI cs.RO

    Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection

    Authors: Hu Cao, Zehua Zhang, Yan Xia, Xinyi Li, Jiahao Xia, Guang Chen, Alois Knoll

    Abstract: In frame-based vision, object detection faces substantial performance degradation under challenging conditions due to the limited sensing capability of conventional cameras. Event cameras output sparse and asynchronous events, providing a potential solution to solve these problems. However, effectively fusing two heterogeneous modalities remains an open issue. In this work, we propose a novel hier…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  4. arXiv:2407.12387  [pdf, other]

    cs.CV

    HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation

    Authors: Tianpei Zou, Sanqing Qu, Zhijun Li, Alois Knoll, Lianghua He, Guang Chen, Changjun Jiang

    Abstract: 3D point cloud segmentation has received significant interest for its growing applications. However, the generalization ability of models suffers in dynamic scenarios due to the distribution shift between test and training data. To promote robustness and adaptability across diverse scenarios, test-time adaptation (TTA) has recently been introduced. Nevertheless, most existing TTA methods are devel…

    Submitted 17 July, 2024; originally announced July 2024.

    Journal ref: ECCV 2024

  5. arXiv:2407.11459  [pdf, other]

    eess.SP cs.LG

    RIMformer: An End-to-End Transformer for FMCW Radar Interference Mitigation

    Authors: Ziang Zhang, Guangzhi Chen, Youlong Weng, Shunchuan Yang, Zhiyu Jia, Jingxuan Chen

    Abstract: Frequency-modulated continuous-wave (FMCW) radar plays a pivotal role in the field of remote sensing. The increasing degree of FMCW radar deployment has increased the mutual interference, which weakens the detection capabilities of radars and threatens reliability and safety of systems. In this paper, a novel FMCW radar interference mitigation (RIM) method, termed as RIMformer, is proposed by usin…

    Submitted 17 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

  6. arXiv:2407.11096  [pdf, other]

    cs.LG cs.AI

    Static and multivariate-temporal attentive fusion transformer for readmission risk prediction

    Authors: Zhe Sun, Runzhi Li, Jing Wang, Gang Chen, Siyu Yan, Lihong Ma

    Abstract: Background: Accurate short-term readmission prediction of ICU patients is significant in improving the efficiency of resource assignment by assisting physicians in making discharge decisions. Clinically, both individual static and multivariate temporal data collected from ICU monitors play critical roles in short-term readmission prediction. Informative static and multivariate temporal feat…

    Submitted 14 July, 2024; originally announced July 2024.

  7. arXiv:2407.11018  [pdf, other]

    cs.NI eess.SP

    Online Multi-Task Offloading for Semantic-Aware Edge Computing Systems

    Authors: Xuyang Chen, Qu Luo, Gaojie Chen, Daquan Feng, Yao Sun

    Abstract: Mobile edge computing (MEC) provides low-latency offloading solutions for computationally intensive tasks, effectively improving the computing efficiency and battery life of mobile devices. However, for data-intensive tasks or scenarios with limited uplink bandwidth, network congestion might occur due to massive simultaneous offloading nodes, increasing transmission latency and affecting task perf…

    Submitted 28 June, 2024; originally announced July 2024.

  8. arXiv:2407.09792  [pdf, other]

    cs.RO

    Language-Augmented Symbolic Planner for Open-World Task Planning

    Authors: Guanqi Chen, Lei Yang, Ruixing Jia, Zhe Hu, Yizhou Chen, Wei Zhang, Wenping Wang, Jia Pan

    Abstract: Enabling robotic agents to perform complex long-horizon tasks has been a long-standing goal in robotics and artificial intelligence (AI). Despite the potential shown by large language models (LLMs), their planning capabilities remain limited to short-horizon tasks and they are unable to replace the symbolic planning approach. Symbolic planners, on the other hand, may encounter execution errors due…

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: Accepted by Robotics: Science and Systems (RSS) 2024

  9. arXiv:2407.05597  [pdf, other]

    cs.CV cs.GR

    GeoNLF: Geometry guided Pose-Free Neural LiDAR Fields

    Authors: Weiyi Xue, Zehan Zheng, Fan Lu, Haiyun Wei, Guang Chen, Changjun Jiang

    Abstract: Although recent efforts have extended Neural Radiance Fields (NeRF) into LiDAR point cloud synthesis, the majority of existing works exhibit a strong dependence on precomputed poses. However, point cloud registration methods struggle to achieve precise global pose estimation, whereas previous pose-free NeRFs overlook geometric consistency in global reconstruction. In light of this, we explore the…

    Submitted 8 July, 2024; originally announced July 2024.

  10. arXiv:2407.05311  [pdf, other]

    cs.CV

    MMAD: Multi-label Micro-Action Detection in Videos

    Authors: Kun Li, Dan Guo, Pengyu Liu, Guoliang Chen, Meng Wang

    Abstract: Human body actions are an important form of non-verbal communication in social interactions. This paper focuses on a specific subset of body actions known as micro-actions, which are subtle, low-intensity body movements that provide a deeper understanding of inner human feelings. In real-world scenarios, human micro-actions often co-occur, with multiple micro-actions overlapping in time, such as s…

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: Work in Progress

  11. arXiv:2407.04608  [pdf, other]

    math.OC cs.GT cs.MA

    A Multi-Player Potential Game Approach for Sensor Network Localization with Noisy Measurements

    Authors: Gehui Xu, Guanpu Chen, Baris Fidan, Yiguang Hong, Hongsheng Qi, Thomas Parisini, Karl H. Johansson

    Abstract: Sensor network localization (SNL) is a challenging problem due to its inherent non-convexity and the effects of noise in inter-node ranging measurements and anchor node position. We formulate a non-convex SNL problem as a multi-player non-convex potential game and investigate the existence and uniqueness of a Nash equilibrium (NE) in both the ideal setting without measurement noise and the practic…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2311.03326, arXiv:2401.02471

  12. arXiv:2407.04490  [pdf, other]

    cs.CV

    Micro-gesture Online Recognition using Learnable Query Points

    Authors: Pengyu Liu, Fei Wang, Kun Li, Guoliang Chen, Yanyan Wei, Shengeng Tang, Zhiliang Wu, Dan Guo

    Abstract: In this paper, we briefly introduce the solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track in the MiGA challenge at IJCAI 2024. The Micro-gesture Online Recognition task involves identifying the category and locating the start and end times of micro-gestures in video clips. Compared to the typical Temporal Action Detection task, the Micro-gesture Online Recogn…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Technical Report of HFUT-VUT for the MiGA challenge at IJCAI 2024

  13. arXiv:2407.03384  [pdf, other]

    physics.flu-dyn cs.CE

    Topological Separation of Vortices

    Authors: Adeel Zafar, Zahra Poorshayegh, Di Yang, Guoning Chen

    Abstract: Vortices and their analysis play a critical role in the understanding of complex phenomena in turbulent flows. Traditional vortex extraction methods, notably region-based techniques, often overlook the entanglement phenomenon, resulting in the inclusion of multiple vortices within a single extracted region. Their separation is necessary for quantifying different types of vortices and their statist…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted for presentation at IEEE Visualization (VIS) 2024 short paper track and will appear in the conference proceedings

  14. arXiv:2407.02457  [pdf, other]

    cs.MM

    Volume Tracking Based Reference Mesh Extraction for Time-Varying Mesh Compression

    Authors: Guodong Chen, Libor Vasa, Fulin Wang, Mallesham Dasari

    Abstract: Time-Varying meshes (TVMs), characterized by their varying connectivity and number of vertices, hold significant potential in immersive media and other various applications. However, their practical utilization is challenging due to their time-varying features and large file sizes. Creating a reference mesh that contains the most essential information is a promising approach to utilizing shared in…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 6 pages

  15. arXiv:2407.01902  [pdf, other]

    cs.CR cs.AI cs.CL

    SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

    Authors: Yan Yang, Zeguan Xiao, Xin Lu, Hongru Wang, Hailiang Huang, Guanhua Chen, Yun Chen

    Abstract: The widespread applications of large language models (LLMs) have brought about concerns regarding their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework to design jailbreak prompts automatically. I…

    Submitted 1 July, 2024; originally announced July 2024.

  16. arXiv:2407.01033  [pdf, other]

    cs.LG cs.NE

    Neural Networks Trained by Weight Permutation are Universal Approximators

    Authors: Yongqiang Cai, Gaohang Chen, Zhonghua Qiao

    Abstract: The universal approximation property is fundamental to the success of neural networks, and has traditionally been achieved by training networks without any constraints on their parameters. However, recent experimental research proposed a novel permutation-based training method, which exhibited a desired classification performance without modifying the exact weight values. In this paper, we provide…

    Submitted 1 July, 2024; originally announced July 2024.

    MSC Class: 41A30; 68T05; 68T07

  17. Locomotion as Manipulation with ReachBot

    Authors: Tony G. Chen, Stephanie Newdick, Julia Di, Carlo Bosio, Nitin Ongole, Mathieu Lapotre, Marco Pavone, Mark R. Cutkosky

    Abstract: Caves and lava tubes on the Moon and Mars are sites of geological and astrobiological interest but consist of terrain that is inaccessible with traditional robot locomotion. To support the exploration of these sites, we present ReachBot, a robot that uses extendable booms as appendages to manipulate itself with respect to irregular rock surfaces. The booms terminate in grippers equipped with micro…

    Submitted 1 July, 2024; originally announced July 2024.

    Journal ref: Science Robotics 2024

  18. MuGSI: Distilling GNNs with Multi-Granularity Structural Information for Graph Classification

    Authors: Tianjun Yao, Jiaqi Sun, Defu Cao, Kun Zhang, Guangyi Chen

    Abstract: Recent works have introduced GNN-to-MLP knowledge distillation (KD) frameworks to combine both GNN's superior performance and MLP's fast inference speed. However, existing KD frameworks are primarily designed for node classification within single graphs, leaving their applicability to graph classification largely unexplored. Two main challenges arise when extending KD for node classification to gr…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 12 pages, 4 figures. Accepted by TheWebConf2024

    ACM Class: I.2.6

  19. arXiv:2406.19364  [pdf, other]

    cs.CV

    SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues

    Authors: Yuxin Xie, Tao Zhou, Yi Zhou, Geng Chen

    Abstract: Weakly-supervised medical image segmentation is a challenging task that aims to reduce the annotation cost while keeping the segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and study the cross-modal fusion in training segmentation models, simultaneously. Our contribution consists of two key compon…

    Submitted 28 June, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

    Comments: accepted by MICCAI 2024

  20. arXiv:2406.19280  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

    Authors: Junying Chen, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang

    Abstract: The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-i…

    Submitted 27 June, 2024; originally announced June 2024.

  21. arXiv:2406.18070  [pdf, other]

    cs.CV

    EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

    Authors: Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao

    Abstract: In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed to cater to the uniqu…

    Submitted 30 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: Champion solutions in the EgoVis CVPR 2024 workshop

  22. arXiv:2406.17969  [pdf, other]

    cs.CL cs.AI

    Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective

    Authors: Hanqi Yan, Yanzheng Xiang, Guangyi Chen, Yifei Wang, Lin Gui, Yulan He

    Abstract: To better interpret the intrinsic mechanism of large language models (LLMs), recent studies focus on monosemanticity on its basic units. A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts. Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to m…

    Submitted 25 June, 2024; originally announced June 2024.

  23. arXiv:2406.17697  [pdf, other]

    cs.LG cs.AI cs.CV

    HGTDP-DTA: Hybrid Graph-Transformer with Dynamic Prompt for Drug-Target Binding Affinity Prediction

    Authors: Xi Xiao, Wentao Wang, Jiacheng Xie, Lijing Zhu, Gaofei Chen, Zhengji Li, Tianyang Wang, Min Xu

    Abstract: Drug target binding affinity (DTA) is a key criterion for drug screening. Existing experimental methods are time-consuming and rely on limited structural and domain information. While learning-based methods can model sequence and structural information, they struggle to integrate contextual data and often lack comprehensive modeling of drug-target interactions. In this study, we propose a novel DT…

    Submitted 25 June, 2024; originally announced June 2024.

  24. arXiv:2406.17530  [pdf, other]

    cs.CV cs.RO

    Point Tree Transformer for Point Cloud Registration

    Authors: Meiling Wang, Guangyan Chen, Yi Yang, Li Yuan, Yufeng Yue

    Abstract: Point cloud registration is a fundamental task in the fields of computer vision and robotics. Recent developments in transformer-based methods have demonstrated enhanced performance in this domain. However, the standard attention mechanism utilized in these methods often integrates many low-relevance points, thereby struggling to prioritize its attention weights on sparse yet meaningful points. Th…

    Submitted 25 June, 2024; originally announced June 2024.

  25. arXiv:2406.17517  [pdf, other]

    cs.LG cs.AI

    Preserving Node Distinctness in Graph Autoencoders via Similarity Distillation

    Authors: Ge Chen, Yulan Hu, Sheng Ouyang, Yong Liu, Cuicui Luo

    Abstract: Graph autoencoders (GAEs), as a kind of generative self-supervised learning approach, have shown great potential in recent years. GAEs typically rely on distance-based criteria, such as mean-square-error (MSE), to reconstruct the input graph. However, relying solely on a single reconstruction criterion may lead to a loss of distinctiveness in the reconstructed graph, causing nodes to collapse into…

    Submitted 25 June, 2024; originally announced June 2024.

  26. arXiv:2406.16486  [pdf, other]

    cs.AI

    Towards Comprehensive Preference Data Collection for Reward Modeling

    Authors: Yulan Hu, Qingyang Li, Sheng Ouyang, Ge Chen, Kaihui Chen, Lijun Mei, Xucheng Ye, Fuzheng Zhang, Yong Liu

    Abstract: Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models (LLMs) with human preferences, thereby enhancing the quality of responses generated. A critical component of RLHF is the reward model, which is trained on preference data and outputs a scalar reward during the inference stage. However, the collection of preference data still lacks thorough investig…

    Submitted 24 June, 2024; originally announced June 2024.

  27. arXiv:2406.15126  [pdf, other]

    cs.CL

    On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

    Authors: Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang

    Abstract: Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore,…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: A survey on LLMs-driven synthetic data generation, curation and evaluation

  28. arXiv:2406.14885  [pdf, other]

    cs.HC

    Ink and Algorithm: Exploring Temporal Dynamics in Human-AI Collaborative Writing

    Authors: Kaixun Yang, Yixin Cheng, Linxuan Zhao, Mladen Raković, Zachari Swiecki, Dragan Gašević, Guanliang Chen

    Abstract: The advent of Generative Artificial Intelligence (GAI) has revolutionized the field of writing, marking a shift towards human-AI collaborative writing in education. However, the dynamics of human-AI interaction in the collaborative writing process are not well understood, and thus it remains largely unknown how human learning can be effectively supported with such cutting-edge GAI technologies. In…

    Submitted 21 June, 2024; originally announced June 2024.

  29. arXiv:2406.13857  [pdf, other]

    cs.RO

    Martian Exploration of Lava Tubes (MELT) with ReachBot: Scientific Investigation and Concept of Operations

    Authors: Julia Di, Sara Cuevas-Quinones, Stephanie Newdick, Tony G. Chen, Marco Pavone, Mathieu G. A. Lapotre, Mark Cutkosky

    Abstract: As natural access points to the subsurface, lava tubes and other caves have become premier targets of planetary missions for astrobiological analyses. Few existing robotic paradigms, however, are able to explore such challenging environments. ReachBot is a robot that enables navigation in planetary caves by using extendable and retractable limbs to locomote. This paper outlines the potential scien…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: In International Conference on Space Robotics 2024

  30. arXiv:2406.12784  [pdf, other]

    cs.CL

    UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions

    Authors: Xunzhi Wang, Zhuowei Zhang, Qiongyu Li, Gaonan Chen, Mengting Hu, Zhiyu li, Bitong Luo, Hang Gao, Zhixin Han, Haotian Wang

    Abstract: The rapid development of large language models (LLMs) has shown promising practical results. However, their low interpretability often leads to errors in unforeseen circumstances, limiting their utility. Many works have focused on creating comprehensive evaluation systems, but previous benchmarks have primarily assessed problem-solving abilities while neglecting the response's uncertainty, which m…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Under review

  31. arXiv:2406.12629  [pdf, other]

    cs.CL cs.AI cs.CV

    SeTAR: Out-of-Distribution Detection with Selective Low-Rank Approximation

    Authors: Yixia Li, Boya Xiong, Guanhua Chen, Yun Chen

    Abstract: Out-of-distribution (OOD) detection is crucial for the safe deployment of neural networks. Existing CLIP-based approaches perform OOD detection by devising novel scoring functions or sophisticated fine-tuning methods. In this work, we propose SeTAR, a novel, training-free OOD detection method that leverages selective low-rank approximation of weight matrices in vision-language and vision-only mode…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Code is available at https://github.com/X1AOX1A/SeTAR

  32. arXiv:2406.11546  [pdf, other]

    eess.AS cs.CL cs.SD

    GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

    Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

    Abstract: The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired spee…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Under review

  33. arXiv:2406.11353  [pdf, other]

    cs.LG cs.CL

    MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts

    Authors: Guanjie Chen, Xinyu Zhao, Tianlong Chen, Yu Cheng

    Abstract: Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, the reliability assessment of MoE lags behind its surging applications. Moreover, when transferred to new domains such as in fine-tuning, MoE models sometimes underperform their dense counterparts. Motivated by the research gap and counter-intuitive phenomenon, we…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 9 pages, 8 figures, camera ready on ICML2024

  34. arXiv:2406.11317  [pdf, other]

    cs.AI cs.CL cs.CV cs.HC

    GUICourse: From General Vision Language Models to Versatile GUI Agents

    Authors: Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

    Abstract: Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (th…

    Submitted 17 June, 2024; originally announced June 2024.

  35. arXiv:2406.11288  [pdf, other]

    cs.CL cs.CV

    MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

    Authors: Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma

    Abstract: Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks, such as visual question answering and image captioning. These models embed multimodal facts within their parameters, rather than relying on external knowledge bases to store factual information explicitly. However, the content discerned by LVLMs may deviate from actual facts due to inherent bias or incorre…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 22 pages, 8 figures

  36. arXiv:2406.10858  [pdf, other]

    cs.CL cs.AI

    Step-level Value Preference Optimization for Mathematical Reasoning

    Authors: Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan

    Abstract: Direct Preference Optimization (DPO) using an implicit reward model has proven to be an effective alternative to reinforcement learning from human feedback (RLHF) for fine-tuning preference aligned large language models (LLMs). However, the overall preference annotations of responses do not fully capture the fine-grained quality of model outputs in complex multi-step reasoning tasks, such as mathe…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: Ongoing Work

  37. arXiv:2406.10484  [pdf, other]

    cs.CV

    Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

    Authors: Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, Longyin Wen

    Abstract: The emerging video LMMs (Large Multimodal Models) have achieved significant improvements on generic video understanding in the form of VQA (Visual Question Answering), where the raw videos are captured by cameras. However, a large portion of videos in real-world applications are edited videos, e.g., users usually cut and add effects/modifications to the raw video before publishing it on s…

    Submitted 14 June, 2024; originally announced June 2024.

  38. arXiv:2406.09481  [pdf, other]

    cs.CV cs.LG

    ELF-UA: Efficient Label-Free User Adaptation in Gaze Estimation

    Authors: Yong Wu, Yang Wang, Sanqing Qu, Zhijun Li, Guang Chen

    Abstract: We consider the problem of user-adaptive 3D gaze estimation. The performance of person-independent gaze estimation is limited due to interpersonal anatomical differences. Our goal is to provide a personalized gaze estimation model specifically adapted to a target user. Previous work on user-adaptive gaze estimation requires some labeled images of the target person data to fine-tune the model at te…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: This paper has been accepted by IJCAI'24

  39. arXiv:2406.09133  [pdf]

    cs.CL

    RH-SQL: Refined Schema and Hardness Prompt for Text-to-SQL

    Authors: Jiawen Yi, Guo Chen, Zixiang Shen

    Abstract: Text-to-SQL is a technology that converts natural language queries into the structured query language SQL. A novel research approach that has recently gained attention focuses on methods based on the complexity of SQL queries, achieving notable performance improvements. However, existing methods entail significant storage and training costs, which hampers their practical application. To address th…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 4 pages, 2 figures, 2024 6th International Conference on Electronic Engineering and Informatics (EEI 2024)

  40. arXiv:2406.09044  [pdf, other]

    cs.CL

    MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning

    Authors: Hanqing Wang, Zeguan Xiao, Yixia Li, Shuo Wang, Guanhua Chen, Yun Chen

    Abstract: Efficient finetuning of large language models (LLMs) aims to adapt the LLMs with reduced computation and memory cost. Previous LoRA-based approaches initialize the low-rank matrices with Gaussian distribution and zero values, while keeping the original weight matrices frozen. However, the trainable model parameters optimized in an unguided subspace might have interference with the well-learned sub…

    Submitted 13 June, 2024; originally announced June 2024.

  41. arXiv:2406.08756  [pdf, other]

    cs.DC cs.LG

    Optimizing Large Model Training through Overlapped Activation Recomputation

    Authors: Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, Kexin Huang, Xuan Zhan, Weijian Chen, Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen

    Abstract: Large model training has been using recomputation to alleviate the memory pressure and pipelining to exploit the parallelism of data, tensor, and devices. The existing recomputation approaches may incur up to 40% overhead when training real-world models, e.g., the GPT model with 22B parameters. This is because they are executed on demand in the critical training path. In this paper, we design a ne…

    Submitted 27 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: 13 pages

  42. arXiv:2406.08418  [pdf, other]

    cs.CV cs.AI

    OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, et al. (15 additional authors not shown)

    Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale an…

    Submitted 12 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  43. arXiv:2406.08119  [pdf]

    eess.AS cs.SD

    Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network

    Authors: Yanxiong Li, Jiaxin Tan, Guoqing Chen, Jialong Li, Yongjie Si, Qianhua He

    Abstract: This work presents an improved version of the system that we submitted to task 1 of the DCASE2023 challenge. We propose a method of low-complexity acoustic scene classification by a parallel attention-convolution network which consists of four modules, including pre-processing, fusion, global and local contextual information extraction. The proposed network is computationally efficient to capture global and local contextu…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted for publication at Interspeech 2024. 5 pages, 4 figures, 3 tables

  44. arXiv:2406.07476  [pdf, other]

    cs.CV cs.CL

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Authors: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

    Abstract: In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data…

    Submitted 17 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: ZC, SL, HZ, YX, and XL contributed equally to this project

  45. arXiv:2406.06649  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution

    Authors: Kai Liu, Haotong Qin, Yong Guo, Xin Yuan, Linghe Kong, Guihai Chen, Yulun Zhang

    Abstract: Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment, which allows advanced SR models to enjoy compact low-bit parameters and efficient integer/bitwise constructions for storage compression and inference acceleration, respectively. However, it is notorious that low-bit quantization degrades the accuracy of SR models compared to their ful…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 9 pages, 6 figures. The code and models will be available at https://github.com/Kai-Liu001/2DQuant

  46. arXiv:2406.05427  [pdf, other]

    cs.LG

    Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL

    Authors: Qi Lv, Xiang Deng, Gongwei Chen, Michael Yu Wang, Liqiang Nie

    Abstract: While conditional sequence modeling with the transformer architecture has demonstrated its effectiveness in dealing with offline reinforcement learning (RL) tasks, it struggles to handle out-of-distribution states and actions. Existing work attempts to address this issue by data augmentation with the learned policy or adding extra constraints with the value-based RL algorithm. However, these…

    Submitted 8 June, 2024; originally announced June 2024.

  47. arXiv:2406.05250  [pdf, other]

    cs.AI cs.AR cs.LG

    LLM-Enhanced Bayesian Optimization for Efficient Analog Layout Constraint Generation

    Authors: Guojin Chen, Keren Zhu, Seunggeun Kim, Hanqing Zhu, Yao Lai, Bei Yu, David Z. Pan

    Abstract: Analog layout synthesis faces significant challenges due to its dependence on manual processes, considerable time requirements, and performance instability. Current Bayesian Optimization (BO)-based techniques for analog layout synthesis, despite their potential for automation, suffer from slow convergence and extensive data needs, limiting their practical application. This paper presents the \text…

    Submitted 19 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

  48. arXiv:2406.04998  [pdf, other]

    cs.LG cs.AI cs.CV

    ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks

    Authors: Feiyang Wang, Xingquan Zuo, Hai Huang, Gang Chen

    Abstract: Many machine learning models are susceptible to adversarial attacks, with decision-based black-box attacks representing the most critical threat in real-world applications. These attacks are extremely stealthy, generating adversarial examples using hard labels obtained from the target machine learning model. This is typically realized by optimizing perturbation directions, guided by decision bound…

    Submitted 12 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: 10 pages, 5 figures, conference

  49. arXiv:2406.04575  [pdf, other]

    cs.LG cs.AI stat.AP stat.ML

    Optimization of geological carbon storage operations with multimodal latent dynamic model and deep reinforcement learning

    Authors: Zhongzheng Wang, Yuntian Chen, Guodong Chen, Dongxiao Zhang

    Abstract: Maximizing storage performance in geological carbon storage (GCS) is crucial for commercial deployment, but traditional optimization demands resource-intensive simulations, posing computational challenges. This study introduces the multimodal latent dynamic (MLD) model, a deep learning framework for fast flow prediction and well control optimization in GCS. The MLD model includes a representation…

    Submitted 6 June, 2024; originally announced June 2024.

  50. arXiv:2406.03345  [pdf, other]

    cs.LG cs.AI

    Feature Contamination: Neural Networks Learn Uncorrelated Features and Fail to Generalize

    Authors: Tianren Zhang, Chujie Zhao, Guanyu Chen, Yizhou Jiang, Feng Chen

    Abstract: Learning representations that generalize under distribution shifts is critical for building robust machine learning models. However, despite significant efforts in recent years, algorithmic advances in this direction have been limited. In this work, we seek to understand the fundamental difficulty of out-of-distribution generalization with deep neural networks. We first empirically show that perha…

    Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: ICML 2024