
Showing 1–50 of 270 results for author: Gan, Z

  1. arXiv:2408.15777  [pdf, other]

    cs.CV

    A Survey on Facial Expression Recognition of Static and Dynamic Emotions

    Authors: Yan Wang, Shaoqi Yan, Yang Liu, Wei Song, Jing Liu, Yang Chang, Xinji Mai, Xiping Hu, Wenqiang Zhang, Zhongxue Gan

    Abstract: Facial expression recognition (FER) aims to analyze emotional states from static images and dynamic sequences, which is pivotal in enhancing anthropomorphic communication among humans, robots, and digital avatars by leveraging AI technologies. As the FER field evolves from controlled laboratory environments to more complex in-the-wild scenarios, advanced methods have been rapidly developed and new…

    Submitted 28 August, 2024; originally announced August 2024.

  2. arXiv:2408.04957   

    cs.CV cs.AI

    LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description

    Authors: Yizhang Jin, Jian Li, Jiangning Zhang, Jianlong Hu, Zhenye Gan, Xin Tan, Yong Liu, Yabiao Wang, Chengjie Wang, Lizhuang Ma

    Abstract: Visual Spatial Description (VSD) aims to generate texts that describe the spatial relationships between objects within images. Traditional visual spatial relationship classification (VSRC) methods typically output the spatial relationship between two objects in an image, often neglecting world knowledge and lacking general language capabilities. In this paper, we propose a Large Language-and-Visio…

    Submitted 28 August, 2024; v1 submitted 9 August, 2024; originally announced August 2024.

    Comments: We have discovered a significant error in the paper that affects the main conclusions. To ensure the accuracy of our research, we have decided to withdraw this paper and will resubmit it after making the necessary corrections

  3. arXiv:2407.21762  [pdf, other]

    cs.RO

    ReplanVLM: Replanning Robotic Tasks with Visual Language Models

    Authors: Aoran Mei, Guo-Niu Zhu, Huaxiang Zhang, Zhongxue Gan

    Abstract: Large language models (LLMs) have gained increasing popularity in robotic task planning due to their exceptional abilities in text analytics and generation, as well as their broad knowledge of the world. However, they fall short in decoding visual cues. LLMs have limited direct perception of the world, which leads to a deficient grasp of the current state of the world. By contrast, the emergence o…

    Submitted 31 July, 2024; originally announced July 2024.

  4. arXiv:2407.17155  [pdf, other]

    cs.CV

    FIIH: Fully Invertible Image Hiding for Secure and Robust

    Authors: Lang Huang, Lin Huo, Zheng Gan, Xinrong He

    Abstract: Image hiding is the study of techniques for covert storage and transmission, which embeds a secret image into a container image and generates stego image to make it similar in appearance to a normal image. However, existing image hiding methods have a serious problem that the hiding and revealing process cannot be fully invertible, which results in the revealing network not being able to recover t…

    Submitted 24 July, 2024; originally announced July 2024.

  5. arXiv:2407.15841  [pdf, other]

    cs.CV

    SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

    Authors: Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan

    Abstract: We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture the detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Speci…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Technical report

  6. arXiv:2407.12344  [pdf, other]

    cs.CL cs.CY

    The Better Angels of Machine Personality: How Personality Relates to LLM Safety

    Authors: Jie Zhang, Dongrui Liu, Chen Qian, Ziyue Gan, Yong Liu, Yu Qiao, Jing Shao

    Abstract: Personality psychologists have analyzed the relationship between personality and safety behaviors in human society. Although Large Language Models (LLMs) demonstrate personality traits, the relationship between personality traits and safety abilities in LLMs still remains a mystery. In this paper, we discover that LLMs' personality traits are closely related to their safety abilities, i.e., toxici…

    Submitted 17 July, 2024; originally announced July 2024.

  7. arXiv:2407.06698  [pdf, ps, other]

    cs.CV cs.LG

    PSPU: Enhanced Positive and Unlabeled Learning by Leveraging Pseudo Supervision

    Authors: Chengjie Wang, Chengming Xu, Zhenye Gan, Jianlong Hu, Wenbing Zhu, Lizhuag Ma

    Abstract: Positive and Unlabeled (PU) learning, a binary classification model trained with only positive and unlabeled data, generally suffers from overfitted risk estimation due to inconsistent data distributions. To address this, we introduce a pseudo-supervised PU learning framework (PSPU), in which we train the PU model first, use it to gather confident samples for the pseudo supervision, and then apply…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: accepted by ICME2024

  8. arXiv:2407.06053  [pdf, other]

    cond-mat.mtrl-sci cs.LG quant-ph

    Learning local equivariant representations for quantum operators

    Authors: Zhanghao Zhouyin, Zixi Gan, Shishir Kumar Pandey, Linfeng Zhang, Qiangqiang Gu

    Abstract: Predicting quantum operator matrices such as Hamiltonian, overlap, and density matrices in the density functional theory (DFT) framework is crucial for understanding material properties. Current methods often focus on individual operators and struggle with efficiency and scalability for large systems. Here we introduce a novel deep learning model, SLEM (strictly localized equivariant message-passi…

    Submitted 16 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: 11 pages, 5 figures and 5 tables

  9. arXiv:2407.02477  [pdf, other]

    cs.CV cs.CL

    Understanding Alignment in Multimodal LLMs: A Comprehensive Study

    Authors: Elmira Amirloo, Jean-Philippe Fauconnier, Christoph Roesmann, Christian Kerl, Rinu Boney, Yusu Qian, Zirui Wang, Afshin Dehghan, Yinfei Yang, Zhe Gan, Peter Grasch

    Abstract: Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by pro…

    Submitted 2 July, 2024; originally announced July 2024.

  10. arXiv:2407.01509  [pdf, other]

    cs.CV cs.CL

    MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

    Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan

    Abstract: We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results fro…

    Submitted 25 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

  11. arXiv:2406.17225  [pdf, other]

    eess.IV cs.CV

    Multimodal Cross-Task Interaction for Survival Analysis in Whole Slide Pathological Images

    Authors: Songhan Jiang, Zhengyu Gan, Linghan Cai, Yifeng Wang, Yongbing Zhang

    Abstract: Survival prediction, utilizing pathological images and genomic profiles, is increasingly important in cancer analysis and prognosis. Despite significant progress, precise survival analysis still faces two main challenges: (1) The massive pixels contained in whole slide images (WSIs) complicate the process of pathological images, making it difficult to generate an effective representation of the tu…

    Submitted 24 June, 2024; originally announced June 2024.

  12. arXiv:2406.13434  [pdf, other]

    cs.RO

    Tactile Aware Dynamic Obstacle Avoidance in Crowded Environment with Deep Reinforcement Learning

    Authors: Yung Chuen Ng, Qi Wen, Lim, Chun Ye Tan, Zhen Hao Gan, Meng Yee, Chuah

    Abstract: Mobile robots operating in crowded environments require the ability to navigate among humans and surrounding obstacles efficiently while adhering to safety standards and socially compliant mannerisms. This scale of the robot navigation problem may be classified as both a local path planning and trajectory optimization problem. This work presents an array of force sensors that act as a tactile laye…

    Submitted 19 June, 2024; originally announced June 2024.

  13. arXiv:2406.07314  [pdf, other]

    cs.LG

    Rethinking the impact of noisy labels in graph classification: A utility and privacy perspective

    Authors: De Li, Xianxian Li, Zeming Gan, Qiyu Li, Bin Qu, Jinyan Wang

    Abstract: Graph neural networks based on message-passing mechanisms have achieved advanced results in graph classification tasks. However, their generalization performance degrades when noisy labels are present in the training data. Most existing noisy labeling approaches focus on the visual domain or graph node classification tasks and analyze the impact of noisy labels only from a utility perspective. Unl…

    Submitted 11 June, 2024; originally announced June 2024.

  14. arXiv:2406.03262  [pdf, other]

    cs.CV

    ADer: A Comprehensive Benchmark for Multi-class Visual Anomaly Detection

    Authors: Jiangning Zhang, Haoyang He, Zhenye Gan, Qingdong He, Yuxuan Cai, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu

    Abstract: Visual anomaly detection aims to identify anomalous regions in images through unsupervised learning paradigms, with increasing application demand and value in fields such as industrial inspection and medical lesion detection. Despite significant progress in recent years, there is a lack of comprehensive benchmarks to adequately evaluate the performance of various mainstream methods across differen…

    Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

  15. arXiv:2405.20795  [pdf, other]

    cs.CV cs.AI

    InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding

    Authors: Huaxiang Zhang, Yaojia Mu, Guo-Niu Zhu, Zhongxue Gan

    Abstract: Accurate visual understanding is imperative for advancing autonomous systems and intelligent robots. Despite the powerful capabilities of vision-language models (VLMs) in processing complex visual scenes, precisely recognizing obscured or ambiguously presented visual elements remains challenging. To tackle such issues, this paper proposes InsightSee, a multi-agent framework to enhance VLMs' interp…

    Submitted 31 May, 2024; originally announced May 2024.

  16. arXiv:2405.17579  [pdf, other]

    cs.RO

    Harnessing Natural Oscillations for High-Speed, Efficient Asymmetrical Locomotion in Quadrupedal Robots

    Authors: Jing Cheng, Yasser G. Alqaham, Zhenyu Gan

    Abstract: This study explores the dynamics of asymmetrical bounding gaits in quadrupedal robots, focusing on the integration of torso pitching and hip motion to enhance speed and stability. Traditional control strategies often enforce a fixed posture, minimizing natural body movements to simplify the control problem. However, this approach may overlook the inherent dynamical advantages found in natural loco…

    Submitted 27 May, 2024; originally announced May 2024.

  17. arXiv:2405.13751  [pdf, other]

    cs.RO cs.AI

    GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games

    Authors: Aoran Mei, Jianhua Wang, Guo-Niu Zhu, Zhongxue Gan

    Abstract: With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task…

    Submitted 22 May, 2024; originally announced May 2024.

  18. arXiv:2405.10739  [pdf, other]

    cs.CV cs.AI

    Efficient Multimodal Large Language Models: A Survey

    Authors: Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma

    Abstract: In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, e…

    Submitted 9 August, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

  19. arXiv:2405.06333  [pdf, other]

    math.NA

    Random Batch Ewald Method for Dielectrically Confined Coulomb Systems

    Authors: Zecheng Gan, Xuanzhao Gao, Jiuyang Liang, Zhenli Xu

    Abstract: Quasi two-dimensional Coulomb systems have drawn widespread interest. The reduced symmetry of these systems leads to complex collective behaviors, yet simultaneously poses significant challenges for particle-based simulations. In this paper, a novel method is presented for efficiently simulating a collection of charges confined in doubly-periodic slabs, with the extension to scenarios involving diel…

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: 24 pages, 7 figures

    MSC Class: 82M37; 65C35; 65T50; 65Y05

  20. arXiv:2404.07973  [pdf, other]

    cs.CV

    Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

    Authors: Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang

    Abstract: While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain limitations: constrained by the pre-trained fixed visual encoder and failed to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Preprint. 14 pages, 4 figures

  21. arXiv:2404.06836  [pdf, other]

    cs.CV

    O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

    Authors: Muer Tie, Julong Wei, Zhengjun Wang, Ke Wu, Shansuai Yuan, Kaizhao Zhang, Jie Jia, Jieru Zhao, Zhongxue Gan, Wenchao Ding

    Abstract: Online construction of open-ended language scenes is crucial for robotic applications, where open-vocabulary interactive scene understanding is required. Recently, neural implicit representation has provided a promising direction for online interactive mapping. However, implementing open-vocabulary scene understanding capability into online neural implicit mapping still faces three challenges: lac…

    Submitted 10 April, 2024; originally announced April 2024.

  22. arXiv:2404.06564  [pdf, other]

    cs.CV

    MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection

    Authors: Haoyang He, Yuhu Bai, Jiangning Zhang, Qingdong He, Hongxu Chen, Zhenye Gan, Chengjie Wang, Xiangtai Li, Guanzhong Tian, Lei Xie

    Abstract: Recent advancements in anomaly detection have seen the efficacy of CNN- and transformer-based approaches. However, CNNs struggle with long-range dependencies, while transformers are burdened by quadratic computational complexity. Mamba-based models, with their superior long-range modeling and linear efficiency, have garnered substantial attention. This study pioneers the application of Mamba to mu…

    Submitted 14 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  23. arXiv:2404.05719  [pdf, other]

    cs.CV cs.CL cs.HC

    Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

    Authors: Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

    Abstract: Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given…

    Submitted 8 April, 2024; originally announced April 2024.

  24. arXiv:2403.20159  [pdf, other]

    cs.CV

    HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes

    Authors: Ke Wu, Kaizhao Zhang, Zhiwei Zhang, Shanshuai Yuan, Muer Tie, Julong Wei, Zijun Xu, Jieru Zhao, Zhongxue Gan, Wenchao Ding

    Abstract: Online dense mapping of urban scenes forms a fundamental cornerstone for scene understanding and navigation of autonomous vehicles. Recent advancements in mapping methods are mainly based on NeRF, whose rendering speed is too slow to meet online requirements. 3D Gaussian Splatting (3DGS), with its rendering speed hundreds of times faster than NeRF, holds greater potential in online dense mapping.…

    Submitted 29 March, 2024; originally announced March 2024.

  25. arXiv:2403.17326  [pdf]

    cond-mat.mtrl-sci

    Unveiling the origin of unconventional moire ferroelectricity

    Authors: Ruirui Niu, Zhuoxian Li, Xiangyan Han, Qianling Liu, Zhuangzhuang Qu, Zhiyu Wang, Chunrui Han, Kenji Watanabe, Takashi Taniguchi, Kaihui Liu, Jinhai Mao, Wu Shi, Bo Peng, Zheng Vitto Han, Zizhao Gan, Jianming Lu

    Abstract: Interfacial ferroelectricity emerges in heterostructures consisting of nonpolar van der Waals (vdW) layers, greatly expanding the scope of two dimensional ferroelectrics. In particular, the unconventional moire ferroelectricity observed in bilayer graphene/boron nitride (BN) heterostructures, exhibits promising functionalities with topological current, superconductivity and synaptic responses. How…

    Submitted 25 March, 2024; originally announced March 2024.

  26. arXiv:2403.12580  [pdf, other]

    cs.CV

    Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

    Authors: Chengjie Wang, Wenbing Zhu, Bin-Bin Gao, Zhenye Gan, Jianning Zhang, Zhihao Gu, Shuguang Qian, Mingang Chen, Lizhuang Ma

    Abstract: Industrial anomaly detection (IAD) has garnered significant attention and experienced rapid development. However, the recent development of IAD approach has encountered certain difficulties due to dataset limitations. On the one hand, most of the state-of-the-art methods have achieved saturation (over 99% in AUROC) on mainstream datasets such as MVTec, and the differences of methods cannot be well…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: It is accepted by CVPR2024

  27. arXiv:2403.12362  [pdf, other]

    cs.CV cs.LG

    DMAD: Dual Memory Bank for Real-World Anomaly Detection

    Authors: Jianlong Hu, Xu Chen, Zhenye Gan, Jinlong Peng, Shengchuan Zhang, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Liujuan Cao, Rongrong Ji

    Abstract: Training a unified model is considered to be more suitable for practical industrial anomaly detection scenarios due to its generalization ability and storage efficiency. However, this multi-class setting, which exclusively uses normal data, overlooks the few but important accessible annotated anomalies in the real world. To address the challenge of real-world anomaly detection, we propose a new fr…

    Submitted 18 March, 2024; originally announced March 2024.

  28. arXiv:2403.10723  [pdf, other]

    eess.SY

    Leveraging Symmetries in Gaits for Reinforcement Learning: A Case Study on Quadrupedal Gaits

    Authors: Jiayu Ding, Xulin Chen, Garret E. Katz, Zhenyu Gan

    Abstract: In this research, we address the complex task of developing versatile and agile quadrupedal gaits for robotic platforms, a domain predominantly governed by model-based trajectory optimization methods. We propose an innovative, reference-free reinforcement learning framework that exploits the intrinsic symmetries of dynamic systems to synthesize a broad array of naturalistic quadrupedal locomotion…

    Submitted 14 June, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

  29. arXiv:2403.09611  [pdf, other]

    cs.CV cs.CL cs.LG

    MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    Authors: Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, et al. (7 additional authors not shown)

    Abstract: In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for la…

    Submitted 18 April, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  30. arXiv:2403.01521  [pdf, other]

    math.NA physics.comp-ph

    Fast Algorithm for Quasi-2D Coulomb Systems

    Authors: Zecheng Gan, Xuanzhao Gao, Jiuyang Liang, Zhenli Xu

    Abstract: Quasi-2D Coulomb systems are of fundamental importance and have attracted much attention in many areas nowadays. Their reduced symmetry gives rise to interesting collective behaviors, but also brings great challenges for particle-based simulations. Here, we propose a novel algorithm framework to address the $\mathcal O(N^2)$ simulation complexity associated with the long-range nature of Coulomb in…

    Submitted 3 March, 2024; originally announced March 2024.

    Comments: 39 pages

    MSC Class: 82M37; 65D15; 65C35

  31. arXiv:2402.13220  [pdf, other]

    cs.CV cs.CL

    How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

    Authors: Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan

    Abstract: The remarkable advancements in Multimodal Large Language Models (MLLMs) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. To quantitatively assess this vulnerability, we present MAD-Bench, a carefully curated benchmark that contains 1000 test samples divided into 5 cate…

    Submitted 23 July, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

  32. arXiv:2401.13372  [pdf, other]

    physics.optics cond-mat.mes-hall

    Influence of resonant plasmonic nanoparticles on optically accessing the valley degree of freedom in 2D semiconductors

    Authors: Tobias Bucher, Zlata Fedorova, Mostafa Abasifard, Rajeshkumar Mupparapu, Matthias J. Wurdack, Emad Najafidehaghani, Ziyang Gan, Heiko Knopf, Antony George, Falk Eilenberger, Thomas Pertsch, Andrey Turchanin, Isabelle Staude

    Abstract: The valley degree of freedom is one of the most intriguing properties of atomically thin transition metal dichalcogenides. Together with the possibility to address this degree of freedom by valley-contrasting optical selection rules, it has the potential to enable a completely new class of future electronic and optoelectronic devices. Resonant optical nanostructures emerge as promising tools for i…

    Submitted 20 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Tobias Bucher and Zlata Fedorova contributed equally to this work. 29 pages, 6 figures

  33. A Soft Continuum Robot with Self-Controllable Variable Curvature

    Authors: Xinran Wang, Qiujie Lu, Dongmyoung Lee, Zhongxue Gan, Nicolas Rojas

    Abstract: This paper introduces a new type of soft continuum robot, called SCoReS, which is capable of self-controlling continuously its curvature at the segment level; in contrast to previous designs which either require external forces or machine elements, or whose variable curvature capabilities are discrete -- depending on the number of locking mechanisms and segments. The ability to have a variable cur…

    Submitted 19 January, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

    Comments: Accepted for IEEE Robotics and Automation Letters in January 2024, Imperial's open access research REF 2029 open access policy

    Journal ref: IEEE Robotics and Automation Letters 2024

  34. arXiv:2401.00652  [pdf, other]

    cs.CV

    From Covert Hiding to Visual Editing: Robust Generative Video Steganography

    Authors: Xueying Mao, Xiaoxiao Hu, Wanli Peng, Zhenliang Gan, Qichao Ying, Zhenxing Qian, Sheng Li, Xinpeng Zhang

    Abstract: Traditional video steganography methods are based on modifying the covert space for embedding, whereas we propose an innovative approach that embeds secret message within semantic feature for steganography during the video editing process. Although existing traditional video steganography methods display a certain level of security and embedding capacity, they lack adequate robustness against comm…

    Submitted 31 December, 2023; originally announced January 2024.

    Comments: Under Review

  35. arXiv:2312.15611  [pdf, other]

    stat.ME stat.ML

    Inference of Dependency Knowledge Graph for Electronic Health Records

    Authors: Zhiwei Xu, Ziming Gan, Doudou Zhou, Shuting Shen, Junwei Lu, Tianxi Cai

    Abstract: The effective analysis of high-dimensional Electronic Health Record (EHR) data, with substantial potential for healthcare research, presents notable methodological challenges. Employing predictive modeling guided by a knowledge graph (KG), which enables efficient feature selection, can enhance both statistical efficiency and interpretability. While various methods have emerged for constructing KGs…

    Submitted 24 December, 2023; originally announced December 2023.

  36. arXiv:2312.13503  [pdf, other]

    cs.CV cs.AI

    InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

    Authors: Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe, Lijuan Wang

    Abstract: In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round even with external knowledge related to the visual content. Different from existing datasets where the answer is compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to b…

    Submitted 20 December, 2023; originally announced December 2023.

  37. A discovery of Two Slow Pulsars with FAST: "Ronin" from the Globular Cluster M15

    Authors: Dengke Zhou, Pei Wang, Di Li, Jianhua Fang, Chenchen Miao, Paulo C. C. Freire, Lei Zhang, Dandan Zhang, Huaxi Chen, Yi Feng, Yifan Xiao, Jintao Xie, Xu Zhang, Chenwu Jin, Han Wang, Yinan Ke, Xuerong Guo, Rushuang Zhao, Chenhui Niu, Weiwei Zhu, Mengyao Xue, Yabiao Wang, Jiafu Wu, Zhenye Gan, Zhongyi Sun, et al. (4 additional authors not shown)

    Abstract: Globular clusters harbor numerous millisecond pulsars, but long-period pulsars ($P \gtrsim 100$ ms) are rarely found. In this study, we employed a fast folding algorithm to analyze observational data from multiple globular clusters obtained by the Five-hundred-meter Aperture Spherical radio Telescope (FAST), aiming to detect the existence of long-period pulsars. We estimated the impact of the medi…

    Submitted 18 April, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

    Comments: Accepted by SCIENCE CHINA Physics, Mechanics & Astronomy

    Journal ref: Sci. China-Phys. Mech. Astron. 67, 269512 (2024)

  38. arXiv:2311.17647  [pdf, other]

    cs.CV cs.AI cs.CL

    Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

    Authors: Xiujun Li, Yujie Lu, Zhe Gan, Jianfeng Gao, William Yang Wang, Yejin Choi

    Abstract: Recent multimodal large language models (MLLMs) have shown promising instruction following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM), and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight be…

    Submitted 10 June, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

    Comments: Github: https://github.com/VIM-Bench/VIM_TOOL, Model and Data: https://huggingface.co/VIM-Bench

  39. arXiv:2311.16201  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

    Authors: Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev

    Abstract: Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation…

    Submitted 27 November, 2023; originally announced November 2023.

  40. arXiv:2310.13699  [pdf, other]

    cs.HC cs.ET

    Interaction in Metaverse: A Survey

    Authors: Hong Lin, Zirun Gan, Wensheng Gan, Zhenlian Qi, Yuehua Wang, Philip S. Yu

    Abstract: Human-computer interaction (HCI) emerged with the birth of the computer and has been upgraded through decades of development. Metaverse has attracted a lot of interest with its immersive experience, and HCI is the entrance to the Metaverse for people. It is predictable that HCI will determine the immersion of the Metaverse. However, the technologies of HCI in Metaverse are not mature enough. There…

    Submitted 27 September, 2023; originally announced October 2023.

    Comments: Preprint. 3 figures, 3 tables

  41. arXiv:2310.13398  [pdf, other]

    cs.CV

    OpenAnnotate3D: Open-Vocabulary Auto-Labeling System for Multi-modal 3D Data

    Authors: Yijie Zhou, Likun Cai, Xianhui Cheng, Zhongxue Gan, Xiangyang Xue, Wenchao Ding

    Abstract: In the era of big data and large models, automatic annotating functions for multi-modal data are of great significance for real-world AI-driven applications, such as autonomous driving and embodied AI. Unlike traditional closed-set annotation, open-vocabulary annotation is essential to achieve human-level cognition capability. However, there are few open-vocabulary auto-labeling systems for multi-…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: The source code will be released at https://github.com/Fudan-ProjectTitan/OpenAnnotate3D

  42. arXiv:2310.07704  [pdf, other]

    cs.CV cs.CL

    Ferret: Refer and Ground Anything Anywhere at Any Granularity

    Authors: Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang

    Abstract: We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to r…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: 30 pages, 10 figures. Code/Project Website: https://github.com/apple/ml-ferret

  43. arXiv:2310.07699  [pdf, other]

    cs.CV cs.AI cs.LG

    VeCLIP: Improving CLIP Training via Visual-enriched Captions

    Authors: Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, Meng Cao

    Abstract: Large-scale web-crawled datasets are fundamental for the success of pre-training vision-language models, such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges in achieving precise image-text alignment. Existing methods utilizing large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M.…

    Submitted 13 March, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

    Comments: CV/ML

  44. arXiv:2310.01382  [pdf, other]

    cs.CL cs.LG

    Compressing LLMs: The Truth is Rarely Pure and Never Simple

    Authors: Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang

    Abstract: Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs that achieve 50 - 60% sparsity and reduce the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncom…

    Submitted 16 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Accepted to ICLR 2024

  45. arXiv:2309.17102  [pdf, other]

    cs.CV

    Guiding Instruction-based Image Editing via Multimodal Large Language Models

    Authors: Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan

    Abstract: Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation…

    Submitted 5 February, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ICLR'24 (Spotlight); Project at https://mllm-ie.github.io; Code at https://github.com/tsujuifu/pytorch_mgie

  46. arXiv:2309.10020  [pdf, other]

    cs.CV cs.CL

    Multimodal Foundation Models: From Specialists to General-Purpose Assistants

    Authors: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao

    Abstract: This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: 119 pages, PDF file size 58MB; Tutorial website: https://vlp-tutorial.github.io/2023/

  47. arXiv:2308.08832  [pdf, other]

    astro-ph.HE

    Atypical radio pulsations from magnetar SGR 1935+2154

    Authors: Pei Wang, Jian Li, Long Ji, Xian Hou, Erbil Gugercinoglu, Di Li, Diego F. Torres, Yutong Chen, Jiarui Niu, Weiwei Zhu, Bing Zhang, En-wei Liang, Li Zhang, Mingyu Ge, Zigao Dai, Lin Lin, Jinlin Han, Yi Feng, Chenhui Niu, Yongkun Zhang, Dengjiang Zhou, Heng Xu, Chunfeng Zhang, Jinchen Jiang, Chenchen Miao, et al. (10 additional authors not shown)

    Abstract: Magnetars are neutron stars with extremely strong magnetic fields, frequently powering high-energy activity in X-rays. Pulsed radio emission following some X-ray outbursts has been detected, although its physical origin is unclear. It has long been speculated that the origin of magnetars' radio signals is different from that of canonical pulsars, although convincing evidence is still lacking. Fi…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: 47 pages, 11 figures

  48. arXiv:2308.07551  [pdf, other]

    cs.CV

    FLAME-based Multi-View 3D Face Reconstruction

    Authors: Wenzhuo Zheng, Junhao Zhao, Xiaohong Liu, Yongyang Pan, Zhenghao Gan, Haozhe Han, Ning Liu

    Abstract: At present, 3D face reconstruction has broad application prospects in various fields, but research on it is still in the development stage. In this paper, we aim to achieve better 3D face reconstruction quality by combining a multi-view training framework with the parametric face model FLAME, and we propose a multi-view training and testing model, MFNet (Multi-view Flame Network). We build a self-supervise…

    Submitted 25 September, 2023; v1 submitted 14 August, 2023; originally announced August 2023.

  49. UniG-Encoder: A Universal Feature Encoder for Graph and Hypergraph Node Classification

    Authors: Minhao Zou, Zhongxue Gan, Yutong Wang, Junheng Zhang, Dongyan Sui, Chun Guan, Siyang Leng

    Abstract: Graph and hypergraph representation learning has attracted increasing attention from various research fields. Despite the decent performance and fruitful applications of Graph Neural Networks (GNNs), Hypergraph Neural Networks (HGNNs), and their well-designed variants, on some commonly used benchmark graphs and hypergraphs, they are outperformed by even a simple Multi-Layer Perceptron. This observ…

    Submitted 3 August, 2023; originally announced August 2023.

  50. arXiv:2308.01194  [pdf, other]

    cs.CV

    Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement Augmentation

    Authors: Siao Liu, Zhaoyu Chen, Yang Liu, Yuzheng Wang, Dingkang Yang, Zhile Zhao, Ziqing Zhou, Xie Yi, Wei Li, Wenqiang Zhang, Zhongxue Gan

    Abstract: Learning a policy with great generalization to unseen environments remains challenging but critical in visual reinforcement learning. Despite the success of augmentation combination in supervised learning generalization, naively applying it to visual RL algorithms may damage the training efficiency, causing severe performance degradation. In this paper, we first conduct qualitative analy…

    Submitted 2 August, 2023; originally announced August 2023.

    Comments: Accepted by ICCV 2023