
Showing 101–150 of 743 results for author: Qiao, Y

  1. arXiv:2402.19465  [pdf, other]

    cs.CL cs.AI

    Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

    Authors: Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao

    Abstract: Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robu…

    Submitted 29 February, 2024; originally announced February 2024.

  2. arXiv:2402.19282  [pdf, other]

    cs.CL

    WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

    Authors: Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, Jia Yu, ChaoBin Zhang, Zhenxiang Li, Pei Chu, Yuan Qu, Jin Shi, Lindong Lu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Zhikai Lei, Jiawei Hong, Keyu Chen, Zhaoye Fei, Ruiliang Xu, Wei Li, Zhongying Tu, Lin Dahua, Yu Qiao, Hang Yan, et al. (1 additional author not shown)

    Abstract: This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy…

    Submitted 17 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

  3. arXiv:2402.18951  [pdf, other]

    cs.CV

    Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition

    Authors: Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang

    Abstract: Open-world video recognition is challenging since traditional networks are not generalized well on complex environment variations. Alternatively, foundation models with rich knowledge have recently shown their generalization power. However, how to apply such knowledge has not been fully explored for open-world video recognition. To this end, we propose a generic knowledge transfer pipeline, which…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 35 pages, 6 figures, 8 tables

  4. arXiv:2402.17511  [pdf, other]

    cs.RO cs.AI

    Rethinking Mutual Information for Language Conditioned Skill Discovery on Imitation Learning

    Authors: Zhaoxun Ju, Chao Yang, Hongbo Wang, Yu Qiao, Fuchun Sun

    Abstract: Language-conditioned robot behavior plays a vital role in executing complex tasks by associating human commands or instructions with perception and actions. The ability to compose long-horizon tasks based on unconstrained language instructions necessitates the acquisition of a diverse set of general-purpose skills. However, acquiring inherent primitive skills in a coupled and long-horizon environm…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: 16 pages

    ACM Class: I.2.6

  5. arXiv:2402.16880  [pdf, other]

    cs.LG cs.AI cs.CL

    BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

    Authors: Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

    Abstract: Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, and etc. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer…

    Submitted 19 April, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

  6. Infrared and visible Image Fusion with Language-driven Loss in CLIP Embedding Space

    Authors: Yuhao Wang, Lingjuan Miao, Zhiqiang Zhou, Lei Zhang, Yajun Qiao

    Abstract: Infrared-visible image fusion (IVIF) has attracted much attention owing to the highly-complementary properties of the two image modalities. Due to the lack of ground-truth fused images, the fusion output of current deep-learning based methods heavily depends on the loss functions defined mathematically. As it is hard to well mathematically define the fused image without ground truth, the performan…

    Submitted 25 February, 2024; originally announced February 2024.

  7. arXiv:2402.16117  [pdf, other]

    cs.RO cs.AI cs.CV

    RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

    Authors: Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo

    Abstract: Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various…

    Submitted 25 February, 2024; originally announced February 2024.

  8. arXiv:2402.15653  [pdf, other]

    cs.CV

    Low-Frequency Black-Box Backdoor Attack via Evolutionary Algorithm

    Authors: Yanqi Qiao, Dazhuang Liu, Rui Wang, Kaitai Liang

    Abstract: While convolutional neural networks (CNNs) have achieved success in computer vision tasks, it is vulnerable to backdoor attacks. Such attacks could mislead the victim model to make attacker-chosen prediction with a specific trigger pattern. Until now, the trigger injection of existing attacks is mainly limited to spatial domain. Recent works take advantage of perceptual properties of planting spec…

    Submitted 6 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

  9. arXiv:2402.14623  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV

    RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation

    Authors: Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo

    Abstract: Rapid progress in high-level task planning and code generation for open-world robot manipulation has been witnessed in Embodied AI. However, previous studies put much effort into general common sense reasoning and task planning capabilities of large-scale language or multi-modal models, relatively little effort on ensuring the deployability of generated code on real robots, and other fundamental c…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: 10 pages of main paper, 4 pages of appendix; 10 figures in main paper, 3 figures in appendix

    ACM Class: I.2.7; I.2.8; I.2.9; I.2.10

  10. arXiv:2402.12343  [pdf, other]

    cs.CL cs.AI cs.LG

    Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

    Authors: Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao

    Abstract: Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans. However, this paper introduces a training-free attack method capable of reversing safety alignment, converting the outcomes of stronger alignment into greater potential for harm by accessing only LLM output token distributions. Specifically, our method achieves this reversal by contrasting the output to…

    Submitted 6 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: ACL 2024

  11. arXiv:2402.12185  [pdf, other]

    cs.CV

    ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

    Authors: Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, Yu Qiao

    Abstract: Recently, many versatile Multi-modal Large Language Models (MLLMs) have emerged continuously. However, their capacity to query information depicted in visual charts and engage in reasoning based on the queried contents remains under-explored. In this paper, to comprehensively and rigorously benchmark the ability of the off-the-shelf MLLMs in the chart domain, we construct ChartX, a multi-modal eva…

    Submitted 19 February, 2024; originally announced February 2024.

    Comments: 22 pages, 15 figures. Code and dataset are available for downloading at: https://github.com/UniModal4Reasoning/ChartVLM

  12. arXiv:2402.10991  [pdf]

    cs.LG cs.AI

    Enhancing Convergence in Federated Learning: A Contribution-Aware Asynchronous Approach

    Authors: Changxin Xu, Yuxin Qiao, Zhanxin Zhou, Fanghao Ni, Jize Xiong

    Abstract: Federated Learning (FL) is a distributed machine learning paradigm that allows clients to train models on their data while preserving their privacy. FL algorithms, such as Federated Averaging (FedAvg) and its variants, have been shown to converge well in many scenarios. However, these methods require clients to upload their local updates to the server in a synchronous manner, which can be slow and…

    Submitted 3 March, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: 5 pages, 1 figure

  13. arXiv:2402.10350  [pdf, other]

    cs.LG cs.AI

    Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review

    Authors: Jing Su, Chufeng Jiang, Xin Jin, Yuxin Qiao, Tingsong Xiao, Hongda Ma, Rong Wei, Zhi Jing, Jiajun Xu, Junhong Lin

    Abstract: This systematic literature review comprehensively examines the application of Large Language Models (LLMs) in forecasting and anomaly detection, highlighting the current state of research, inherent challenges, and prospective future directions. LLMs have demonstrated significant potential in parsing and analyzing extensive datasets to identify patterns, predict future events, and detect anomalous…

    Submitted 15 February, 2024; originally announced February 2024.

  14. arXiv:2402.09283  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

    Authors: Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao

    Abstract: Large Language Models (LLMs) are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, an…

    Submitted 27 March, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

    Comments: Accepted to NAACL 2024

  15. arXiv:2402.09181  [pdf, other]

    eess.IV cs.CV

    OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

    Authors: Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

    Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in various multimodal tasks. However, their potential in the medical domain remains largely unexplored. A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions, which is essential in real-world medical applications. To solve this problem, in this pape…

    Submitted 21 April, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

  16. Particle Filter SLAM for Vehicle Localization

    Authors: Tianrui Liu, Changxin Xu, Yuxin Qiao, Chufeng Jiang, Jiqiang Yu

    Abstract: Simultaneous Localization and Mapping (SLAM) presents a formidable challenge in robotics, involving the dynamic construction of a map while concurrently determining the precise location of the robotic agent within an unfamiliar environment. This intricate task is further compounded by the inherent "chicken-and-egg" dilemma, where accurate mapping relies on a dependable estimation of the robot's lo…

    Submitted 19 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: 6 pages, Journal of Industrial Engineering and Applied Science

    Journal ref: Journal of Industrial Engineering and Applied Science 2024

  17. News Recommendation with Attention Mechanism

    Authors: Tianrui Liu, Changxin Xu, Yuxin Qiao, Chufeng Jiang, Weisheng Chen

    Abstract: This paper explores the area of news recommendation, a key component of online information sharing. Initially, we provide a clear introduction to news recommendation, defining the core problem and summarizing current methods and notable recent algorithms. We then present our work on implementing the NRAM (News Recommendation with Attention Mechanism), an attention-based approach for news recommend…

    Submitted 19 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: 7 pages, Journal of Industrial Engineering and Applied Science

    Journal ref: Journal of Industrial Engineering and Applied Science 2024

  18. arXiv:2402.05935  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

    Authors: Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao, Peng Gao

    Abstract: We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we…

    Submitted 26 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted by ICML 2024. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

  19. arXiv:2402.05655  [pdf, other]

    cs.CV cs.RO

    Real-time Holistic Robot Pose Estimation with Unknown States

    Authors: Shikun Ban, Juling Fan, Xiaoxuan Ma, Wentao Zhu, Yu Qiao, Yizhou Wang

    Abstract: Estimating robot pose from RGB images is a crucial problem in computer vision and robotics. While previous methods have achieved promising performance, most of them presume full knowledge of robot internal states, e.g. ground-truth robot joint angles. However, this assumption is not always valid in practical situations. In real-world applications such as multi-robot collaboration or human-robot in…

    Submitted 16 July, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted by ECCV 2024

  20. arXiv:2402.05044  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

    Authors: Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao

    Abstract: In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose SALAD-Bench, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy s…

    Submitted 7 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: Accepted at ACL 2024 Findings

  21. arXiv:2402.02338  [pdf, other]

    cs.NI cs.LG

    NetLLM: Adapting Large Language Models for Networking

    Authors: Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, Fangxin Wang

    Abstract: Many networking tasks now employ deep learning (DL) to solve complex prediction and system optimization problems. However, current design philosophy of DL-based algorithms entails intensive engineering overhead due to the manual design of deep neural networks (DNNs) for different networking tasks. Besides, DNNs tend to achieve poor generalization performance on unseen data distributions/environmen…

    Submitted 5 May, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

    Comments: This paper has been accepted by ACM SIGCOMM 2024

  22. arXiv:2402.01246  [pdf, other]

    cs.RO eess.SY

    LimSim++: A Closed-Loop Platform for Deploying Multimodal LLMs in Autonomous Driving

    Authors: Daocheng Fu, Wenjie Lei, Licheng Wen, Pinlong Cai, Song Mao, Min Dou, Botian Shi, Yu Qiao

    Abstract: The emergence of Multimodal Large Language Models ((M)LLMs) has ushered in new avenues in artificial intelligence, particularly for autonomous driving by offering enhanced understanding and reasoning capabilities. This paper introduces LimSim++, an extended version of LimSim designed for the application of (M)LLMs in autonomous driving. Acknowledging the limitations of existing simulation platform…

    Submitted 12 April, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted by 35th IEEE Intelligent Vehicles Symposium (IV 2024)

  23. arXiv:2402.00357  [pdf, other]

    cs.CV

    Safety of Multimodal Large Language Models on Images and Texts

    Authors: Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao

    Abstract: Attracted by the impressive power of Multimodal Large Language Models (MLLMs), the public is increasingly utilizing them to improve the efficiency of daily work. Nonetheless, the vulnerabilities of MLLMs to unsafe instructions bring huge safety risks when these models are deployed in real-world scenarios. In this paper, we systematically survey current efforts on the evaluation, attack, and defens…

    Submitted 20 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    Comments: Accepted at IJCAI 2024

  24. arXiv:2401.16420  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XCo…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Code and models are available at https://github.com/InternLM/InternLM-XComposer

  25. arXiv:2401.16265  [pdf, other]

    cs.CL cs.DC

    CO2: Efficient Distributed Training with Full Communication-Computation Overlap

    Authors: Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

    Abstract: The fundamental success of large language models hinges upon the efficacious implementation of large-scale distributed training techniques. Nevertheless, building a vast, high-performance cluster featuring high-speed communication interconnectivity is prohibitively costly, and accessible only to prominent entities. In this work, we aim to lower this barrier and democratize large-scale training wit…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: ICLR 2024 Spotlight. Yiran Zhong is the corresponding author. Code is available at: https://github.com/OpenNLPLab/CO2

  26. arXiv:2401.15071  [pdf, other]

    cs.CV

    From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

    Authors: Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, et al. (11 additional authors not shown)

    Abstract: Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectation of the broad public, even though the most powerful OpenAI's GPT-4 and Google's Gemini have been deployed. This paper strives to enhance unde…

    Submitted 29 January, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

  27. arXiv:2401.14356  [pdf, other]

    math-ph

    Exact surface energy of the Hubbard model with unparallel boundary magnetic fields

    Authors: Pei Sun, Yi Qiao, Junpeng Cao, Wen-Li Yang

    Abstract: In this paper, we study the exact physical quantities in the thermodynamic limit of the one-dimensional Hubbard model with unparallel boundary magnetic fields based on the off-diagonal Bethe ansatz solution. At the half-filling, we obtain the different patterns of Bethe roots of the reduced Bethe ansatz equations for the different boundary parameters. According to them, we obtain the densities of…

    Submitted 25 January, 2024; originally announced January 2024.

    Comments: 13 pages, 3 figures

  28. arXiv:2401.13898  [pdf, other]

    cs.LG

    Cross-Modal Prototype based Multimodal Federated Learning under Severely Missing Modality

    Authors: Huy Q. Le, Chu Myaet Thwal, Yu Qiao, Ye Lin Tun, Minh N. H. Nguyen, Choong Seon Hong

    Abstract: Multimodal federated learning (MFL) has emerged as a decentralized machine learning paradigm, allowing multiple clients with different modalities to collaborate on training a machine learning model across diverse data sources without sharing their private data. However, challenges, such as data heterogeneity and severely missing modalities, pose crucial hindrances to the robustness of MFL, signifi…

    Submitted 24 January, 2024; originally announced January 2024.

    Comments: 12 pages, 8 figures, 5 tables

  29. arXiv:2401.13627  [pdf, other]

    cs.CV

    Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

    Authors: Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, Chao Dong

    Abstract: We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image restoration method that harnesses generative prior and the power of model scaling up. Leveraging multi-modal techniques and advanced generative prior, SUPIR marks a significant advance in intelligent and realistic image restoration. As a pivotal catalyst within SUPIR, model scaling dramatically enhances its capabilities and…

    Submitted 3 April, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted by CVPR 2024

  30. arXiv:2401.13246  [pdf, other]

    cs.CL

    SEER: Facilitating Structured Reasoning and Explanation via Reinforcement Learning

    Authors: Guoxin Chen, Kexin Tang, Chao Yang, Fuying Ye, Yu Qiao, Yiming Qian

    Abstract: Elucidating the reasoning process with structured explanations from question to answer is crucial, as it significantly enhances the interpretability, traceability, and trustworthiness of question-answering (QA) systems. However, structured explanations demand models to perform intricately structured reasoning, which poses great challenges. Most existing methods focus on single-step reasoning throu…

    Submitted 4 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Camera ready version for ACL 2024 Main Conference

  31. arXiv:2401.11880  [pdf, other]

    cs.CL cs.AI cs.CR cs.MA

    PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety

    Authors: Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao

    Abstract: Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence. However, the potential misuse of this intelligence for malicious purposes presents significant risks. To date, comprehensive research on the safety issues associated with multi-agent systems remains limited. In this paper, we explore these concerns through the innovative…

    Submitted 17 February, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

  32. arXiv:2401.10755  [pdf, other]

    cs.SE

    Code Reviewer Recommendation Based on a Hypergraph with Multiplex Relationships

    Authors: Yu Qiao, Jian Wang, Can Cheng, Wei Tang, Peng Liang, Yuqi Zhao, Bing Li

    Abstract: Code review is an essential component of software development, playing a vital role in ensuring a comprehensive check of code changes. However, the continuous influx of pull requests and the limited pool of available reviewer candidates pose a significant challenge to the review process, making the task of assigning suitable reviewers to each review request increasingly difficult. To tackle this i…

    Submitted 19 January, 2024; originally announced January 2024.

    Comments: The 31st IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER)

  33. arXiv:2401.10208  [pdf, other]

    cs.CV cs.CL

    MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

    Authors: Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, Hongsheng Li, Yu Qiao, Jifeng Dai

    Abstract: Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that the fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in the multi-image scenarios. T…

    Submitted 2 April, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: 20 pages, 9 figures, 17 tables

  34. arXiv:2401.09414  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Vlogger: Make Your Dream A Vlog

    Authors: Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang

    Abstract: In this work, we present Vlogger, a generic AI system for generating a minute-level video blog (i.e., vlog) of user descriptions. Different from short videos with a few seconds, vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches. To break through this bottleneck, our Vlogger smartly leverages Large Language Model (LLM…

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: 16 pages, 8 figures, 11 tables

  35. arXiv:2401.08996  [pdf, other]

    cs.LG cs.AI

    MicroNAS: Zero-Shot Neural Architecture Search for MCUs

    Authors: Ye Qiao, Haocheng Xu, Yifan Zhang, Sitao Huang

    Abstract: Neural Architecture Search (NAS) effectively discovers new Convolutional Neural Network (CNN) architectures, particularly for accuracy optimization. However, prior approaches often require resource-intensive training on super networks or extensive architecture evaluations, limiting practical applications. To address these challenges, we propose MicroNAS, a hardware-aware zero-shot NAS framework de…

    Submitted 17 January, 2024; originally announced January 2024.

  36. arXiv:2401.06557  [pdf, other]

    cs.LG cs.AI cs.SI stat.ME

    Treatment-Aware Hyperbolic Representation Learning for Causal Effect Estimation with Social Networks

    Authors: Ziqiang Cui, Xing Tang, Yang Qiao, Bowei He, Liang Chen, Xiuqiang He, Chen Ma

    Abstract: Estimating the individual treatment effect (ITE) from observational data is a crucial research topic that holds significant value across multiple domains. How to identify hidden confounders poses a key challenge in ITE estimation. Recent studies have incorporated the structural information of social networks to tackle this challenge, achieving notable advancements. However, these methods utilize g…

    Submitted 12 January, 2024; originally announced January 2024.

    Comments: Accepted by SIAM SDM'24

  37. arXiv:2401.06197  [pdf, other]

    cs.CV

    Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

    Authors: Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai

    Abstract: We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: 1. removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power and 2. optimizing memory access to minimize redundant operat…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: Tech report; Code: https://github.com/OpenGVLab/DCNv4

  38. arXiv:2401.04872  [pdf, other]

    cs.CV cs.LG cs.RO

    Knowledge-aware Graph Transformer for Pedestrian Trajectory Prediction

    Authors: Yu Liu, Yuexin Zhang, Kunming Li, Yongliang Qiao, Stewart Worrall, You-Fu Li, He Kong

    Abstract: Predicting pedestrian motion trajectories is crucial for path planning and motion control of autonomous vehicles. Accurately forecasting crowd trajectories is challenging due to the uncertain nature of human motions in different environments. For training, recent deep learning-based prediction approaches mainly utilize information like trajectory history and interactions between pedestrians, among…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: This paper was accepted to and presented at the 26th IEEE International Conference on Intelligent Transportation Systems (ITSC), September 2023

  39. arXiv:2401.03048  [pdf, other]

    cs.CV

    Latte: Latent Diffusion Transformer for Video Generation

    Authors: Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao

    Abstract: We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatia…

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: Project page: https://maxin-cn.github.io/latte_project

  40. arXiv:2401.02384  [pdf, other]

    cs.CV

    ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

    Authors: Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo

    Abstract: Charts play a vital role in data visualization, understanding data patterns, and informed decision-making. However, their unique combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses challenges for general-purpose multimodal models. While vision-language models trained on chart data excel in comprehension, they struggle with generalization. To a…

    Submitted 15 February, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

    Comments: Updated and corrected experimental results, removal of inappropriate experiments, and a more comprehensive experimental setup

  41. arXiv:2401.01741  [pdf]

    cond-mat.mes-hall

    Spin-Transfer-Torque Induced Spatially Nonuniform Switching in Ferrimagnets

    Authors: Xue Zhang, Zhengde Xu, Jie Ren, Yixiao Qiao, Weijia Fan, Zhifeng Zhu

    Abstract: Ferrimagnet (FiM), (FeCo)1-xGdx, attracts research attention due to its ultrafast magnetic dynamics and finite net magnetization. Incorporating FiM into the magnetic tunnel junction will be beneficial to further improve the writing speed of magnetic random access memory (MRAM). It is commonly assumed that the FeCo and Gd atoms are switched together due to the strong exchange coupling, which remain…

    Submitted 3 January, 2024; originally announced January 2024.

    Journal ref: Appl. Phys. Lett. 124, 012405 (2024)

  42. arXiv:2401.00006  [pdf, other]

    cs.AI

    Building Open-Ended Embodied Agent via Language-Policy Bidirectional Adaptation

    Authors: Shaopeng Zhai, Jie Wang, Tianyi Zhang, Fuxian Huang, Qi Zhang, Ming Zhou, Jing Hou, Yu Qiao, Yu Liu

    Abstract: Building embodied agents on integrating Large Language Models (LLMs) and Reinforcement Learning (RL) have revolutionized human-AI interaction: researchers can now leverage language instructions to plan decision-making for open-ended tasks. However, existing research faces challenges in meeting the requirement of open-endedness. They typically either train LLM/RL models to adapt to a fixed counterp…

    Submitted 6 February, 2024; v1 submitted 12 December, 2023; originally announced January 2024.

  43. arXiv:2312.14238  [pdf, other]

    cs.CV

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Authors: Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

    Abstract: The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model…

    Submitted 15 January, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: 25 pages, 5 figures, 28 tables

  44. arXiv:2312.13716  [pdf, other]

    cs.LG cs.AI

    Critic-Guided Decision Transformer for Offline Reinforcement Learning

    Authors: Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, Yu Qiao

    Abstract: Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Accepted at AAAI 2024

  45. arXiv:2312.12232  [pdf, other]

    cs.CV

    Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

    Authors: Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, Yu Qiao

    Abstract: Recently, diffusion-based image generation methods are credited for their remarkable text-to-image generation capabilities, while still facing challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, which is a training-free scene text generation framework for any language. Our model outputs a photo-realistic image given a text of any langua…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted to AAAI 2024. Code: https://github.com/ecnuljzhang/brush-your-text

  46. arXiv:2312.12144  [pdf, other]

    cs.CV

    M-BEV: Masked BEV Perception for Robust Autonomous Driving

    Authors: Siran Chen, Yue Ma, Yu Qiao, Yali Wang

    Abstract: 3D perception is a critical problem in autonomous driving. Recently, the Bird-Eye-View (BEV) approach has attracted extensive attention, due to low-cost deployment and desirable vision detection capacity. However, the existing models ignore a realistic scenario during the driving procedure, i.e., one or more view cameras may be failed, which largely deteriorates the performance. To tackle this pro…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Github repository: https://github.com/Sranc3/M-BEV

  47. arXiv:2312.10163  [pdf, other]

    cs.CV cs.LG

    Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey

    Authors: Xu Liu, Tong Zhou, Yuanxin Wang, Yuping Wang, Qinjingwen Cao, Weizhi Du, Yonghuan Yang, Junjun He, Yu Qiao, Yiqing Shen

    Abstract: The advent of foundation models, which are pre-trained on vast datasets, has ushered in a new era of computer vision, characterized by their robustness and remarkable zero-shot generalization capabilities. Mirroring the transformative impact of foundation models like large language models (LLMs) in natural language processing, visual foundation models (VFMs) have become a catalyst for groundbreaki…

    Submitted 15 December, 2023; originally announced December 2023.

  48. arXiv:2312.10035  [pdf, other]

    cs.CV

    Point Transformer V3: Simpler, Faster, Stronger

    Authors: Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, Hengshuang Zhao

    Abstract: This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than b…

    Submitted 25 March, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: CVPR 2024, code available at Pointcept (https://github.com/Pointcept/PointTransformerV3)

  49. arXiv:2312.09451  [pdf, other]

    cs.CL

    MANTIS at #SMM4H 2023: Leveraging Hybrid and Ensemble Models for Detection of Social Anxiety Disorder on Reddit

    Authors: Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

    Abstract: This paper presents our system employed for the Social Media Mining for Health 2023 Shared Task 4: Binary classification of English Reddit posts self-reporting a social anxiety disorder diagnosis. We systematically investigate and contrast the efficacy of hybrid and ensemble models that harness specialized medical domain-adapted transformers in conjunction with BiLSTM neural networks. The evaluati…

    Submitted 28 November, 2023; originally announced December 2023.

    Comments: Accepted at the #SMM4H 2023 workshop, co-located with the AMIA Annual Symposium 2023

  50. arXiv:2312.09245  [pdf, other]

    cs.CV

    DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

    Authors: Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, Jifeng Dai

    Abstract: Large language models (LLMs) have opened up new possibilities for intelligent agents, endowing them with human-like thinking and cognitive abilities. In this work, we delve into the potential of large language models (LLMs) in autonomous driving (AD). We introduce DriveMLM, an LLM-based AD framework that can perform close-loop autonomous driving in realistic simulators. To this end, (1) we bridge…

    Submitted 25 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: Technical Report