Showing 1–50 of 229 results for author: Yan, M

Searching in archive cs. Search in all archives.
  1. arXiv:2407.13198  [pdf, other]

    cs.SD eess.AS

    DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

    Authors: Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan, Ji Zhang, Kai Yu, Mengyue Wu

    Abstract: Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assist… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  2. arXiv:2407.12232  [pdf, other]

    cs.AR

    RTL Verification for Secure Speculation Using Contract Shadow Logic

    Authors: Qinhan Tan, Yuheng Yang, Thomas Bourgeat, Sharad Malik, Mengjia Yan

    Abstract: Modern out-of-order processors face speculative execution attacks. Despite various proposed software and hardware mitigations to prevent such attacks, new attacks keep arising from unknown vulnerabilities. Thus, a formal and rigorous evaluation of the ability of hardware designs to deal with speculative execution attacks is urgently desired. This paper proposes a formal verification technique call… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: This paper has been accepted to ASPLOS 2025

  3. arXiv:2407.11790  [pdf, other]

    cs.LG cs.AI cs.AR cs.PF

    Characterizing and Understanding HGNN Training on GPUs

    Authors: Dengke Han, Mingyu Yan, Xiaochun Ye, Dongrui Fan, Ninghui Sun

    Abstract: Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Prior to their practical application, identifying the optimal HGNN model parameters tailored to specific tasks through extensive training is a time-consuming… ▽ More

    Submitted 17 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

    Comments: 23 pages, 14 figures, submitted to ACM TACO

  4. arXiv:2407.11034  [pdf]

    cs.LG

    Bridging Data Gaps in Healthcare: A Scoping Review of Transfer Learning in Biomedical Data Analysis

    Authors: Siqi Li, Xin Li, Kunyu Yu, Di Miao, Mingcheng Zhu, Mengying Yan, Yuhe Ke, Danny D'Agostino, Yilin Ning, Qiming Wu, Ziwen Wang, Yuqing Shang, Molei Liu, Chuan Hong, Nan Liu

    Abstract: Clinical and biomedical research in low-resource settings often faces significant challenges due to the need for high-quality data with sufficient sample sizes to construct effective models. These constraints hinder robust model training and prompt researchers to seek methods for leveraging existing knowledge from related studies to support new research efforts. Transfer learning (TL), a machine l… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  5. arXiv:2407.08265  [pdf, other]

    cs.CV

    Enhancing Thermal Infrared Tracking with Natural Language Modeling and Coordinate Sequence Generation

    Authors: Miao Yan, Ping Zhang, Haofei Zhang, Ruqian Hao, Juanxiu Liu, Xiaoyang Wang, Lin Liu

    Abstract: Thermal infrared tracking is an essential topic in computer vision tasks because of its advantage of all-weather imaging. However, most conventional methods utilize only hand-crafted features, while deep learning-based correlation filtering methods are limited by simple correlation operations. Transformer-based methods ignore temporal and coordinate information, which is critical for TIR tracking… ▽ More

    Submitted 18 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

  6. arXiv:2407.07462  [pdf, other]

    cs.CV cs.AI cs.LG

    MAN TruckScenes: A multimodal dataset for autonomous trucking in diverse conditions

    Authors: Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, Andrea Perl, Ulrich Voll, Min Yan, Markus Lienkamp

    Abstract: Autonomous trucking is a promising technology that can greatly impact modern logistics and the environment. Ensuring its safety on public roads is one of the main duties that requires an accurate perception of the environment. To achieve this, machine learning methods rely on large datasets, but to this day, no such datasets are available for autonomous trucks. In this work, we present MAN TruckSc… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

  7. arXiv:2406.10778  [pdf, other]

    cs.CE stat.AP

    Heterogeneous Entity Representation for Medicinal Synergy Prediction

    Authors: Jiawei Wu, Jun Wen, Mingyuan Yan, Anqi Dong, Can Chen

    Abstract: Medicinal synergy prediction is a powerful tool in drug discovery and development that harnesses the principles of combination therapy to enhance therapeutic outcomes by improving efficacy, reducing toxicity, and preventing drug resistance. While a myriad of computational methods has emerged for predicting synergistic drug combinations, a large portion of them may overlook the intricate, yet criti… ▽ More

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: 8 pages, 3 figures

    MSC Class: 92C50; 05C65; 68T07

  8. arXiv:2406.10227  [pdf, other]

    cs.CV cs.AI

    VideoGUI: A Benchmark for GUI Automation from Instructional Videos

    Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen WU, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

    Abstract: Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-c… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 24 pages, 16 tables, 17 figures

  9. arXiv:2406.09095  [pdf, other]

    cs.CL

    Modeling Comparative Logical Relation with Contrastive Learning for Text Generation

    Authors: Yuhao Dan, Junfeng Tian, Jie Zhou, Ming Yan, Ji Zhang, Qin Chen, Liang He

    Abstract: Data-to-Text Generation (D2T), a classic natural language generation problem, aims at producing fluent descriptions for structured input data, such as a table. Existing D2T works mainly focus on describing the superficial associative relations among entities, while ignoring the deep comparative logical relations, such as A is better than B in a certain aspect with a corresponding opinion, which is… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  10. arXiv:2406.04648  [pdf, other]

    cs.CV

    UCDNet: Multi-UAV Collaborative 3D Object Detection Network by Reliable Feature Mapping

    Authors: Pengju Tian, Peirui Cheng, Yuchao Wang, Zhechao Wang, Zhirui Wang, Menglong Yan, Xue Yang, Xian Sun

    Abstract: Multi-UAV collaborative 3D object detection can perceive and comprehend complex environments by integrating complementary information, with applications encompassing traffic monitoring, delivery services and agricultural management. However, the extremely broad observations in aerial remote sensing and significant perspective differences across multiple UAVs make it challenging to achieve precise… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  11. arXiv:2406.03210  [pdf, other]

    cs.IR

    Text-like Encoding of Collaborative Information in Large Language Models for Recommendation

    Authors: Yang Zhang, Keqin Bao, Ming Yan, Wenjie Wang, Fuli Feng, Xiangnan He

    Abstract: When adapting Large Language Models for Recommendation (LLMRec), it is crucial to integrate collaborative information. Existing methods achieve this by learning collaborative embeddings in LLMs' latent space from scratch or by mapping from external models. However, they fail to represent the information in a text-like format, which may not align optimally with LLMs. To bridge this gap, we introduc… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL 2024

    ACM Class: H.3.3

  12. arXiv:2406.03172  [pdf, other]

    cs.LG

    Initialization-enhanced Physics-Informed Neural Network with Domain Decomposition (IDPINN)

    Authors: Chenhao Si, Ming Yan

    Abstract: We propose a new physics-informed neural network framework, IDPINN, based on the enhancement of initialization and domain decomposition to improve prediction accuracy. We train a PINN using a small dataset to obtain an initial network structure, including the weighted matrix and bias, which initializes the PINN for each subdomain. Moreover, we leverage the smoothness condition on the interface to… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 20 pages, 14 figures

  13. arXiv:2406.01014  [pdf, other]

    cs.CL cs.CV

    Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

    Authors: Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

    Abstract: Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the tw… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 22 pages, 11 figures, 10 Tables

  14. arXiv:2406.00988  [pdf, other]

    cs.AR

    ADE-HGNN: Accelerating HGNNs through Attention Disparity Exploitation

    Authors: Dengke Han, Meng Wu, Runzhen Xue, Mingyu Yan, Xiaochun Ye, Dongrui Fan

    Abstract: Heterogeneous Graph Neural Networks (HGNNs) have recently demonstrated great power in handling heterogeneous graph data, rendering them widely applied in many critical real-world domains. Most HGNN models leverage attention mechanisms to significantly improve model accuracy, albeit at the cost of increased computational complexity and memory bandwidth requirements. Fortunately, the attention dispar… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 15 pages, 9 figures, accepted by Euro-Par 2024

  15. arXiv:2406.00683  [pdf, other]

    eess.IV cs.CV cs.MM

    Exploiting Frequency Correlation for Hyperspectral Image Reconstruction

    Authors: Muge Yan, Lizhi Wang, Lin Zhu, Hua Huang

    Abstract: Deep priors have emerged as potent methods in hyperspectral image (HSI) reconstruction. While most methods emphasize space-domain learning using image space priors like non-local similarity, frequency-domain learning using image frequency priors remains neglected, limiting the reconstruction capability of networks. In this paper, we first propose a Hyperspectral Frequency Correlation (HFC) prior r… ▽ More

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: 14 pages, 11 figures

  16. arXiv:2405.06247  [pdf, other]

    cs.LG cs.AI cs.CR

    Disttack: Graph Adversarial Attacks Toward Distributed GNN Training

    Authors: Yuxiang Zhang, Xin Liu, Meng Wu, Wei Yan, Mingyu Yan, Xiaochun Ye, Dongrui Fan

    Abstract: Graph Neural Networks (GNNs) have emerged as potent models for graph learning. Distributing the training process across multiple computing nodes is the most promising solution to address the challenges of ever-growing real-world graphs. However, current adversarial attack methods on GNNs neglect the characteristics and applications of the distributed scenario, leading to suboptimal performance and… ▽ More

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: Accepted by 30th International European Conference on Parallel and Distributed Computing (Euro-Par 2024)

  17. arXiv:2404.18166  [pdf, other]

    cs.IR

    Behavior-Contextualized Item Preference Modeling for Multi-Behavior Recommendation

    Authors: Mingshi Yan, Fan Liu, Jing Sun, Fuming Sun, Zhiyong Cheng, Yahong Han

    Abstract: In recommender systems, multi-behavior methods have demonstrated their effectiveness in mitigating issues like data sparsity, a common challenge in traditional single-behavior recommendation approaches. These methods typically infer user preferences from various auxiliary behaviors and apply them to the target behavior for recommendations. However, this direct transfer can introduce noise to the t… ▽ More

    Submitted 28 April, 2024; originally announced April 2024.

    Comments: This paper has been accepted by SIGIR 2024

  18. arXiv:2404.17238  [pdf, other]

    cs.IR

    TruthSR: Trustworthy Sequential Recommender Systems via User-generated Multimodal Content

    Authors: Meng Yan, Haibin Huang, Ying Liu, Juan Zhao, Xiyue Gao, Cai Xu, Ziyu Guan, Wei Zhao

    Abstract: Sequential recommender systems explore users' preferences and behavioral patterns from their historically generated data. Recently, researchers aim to improve sequential recommendation by utilizing massive user-generated multi-modal content, such as reviews, images, etc. This content often contains inevitable noise. Some studies attempt to reduce noise interference by suppressing cross-modal incon… ▽ More

    Submitted 26 April, 2024; originally announced April 2024.

  19. arXiv:2404.16635  [pdf, other]

    cs.CV

    TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

    Authors: Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang

    Abstract: Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficien… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: 13 pages, 11 figures

  20. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  21. arXiv:2404.10343  [pdf, other]

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such… ▽ More

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  22. GDR-HGNN: A Heterogeneous Graph Neural Networks Accelerator Frontend with Graph Decoupling and Recoupling

    Authors: Runzhen Xue, Mingyu Yan, Dengke Han, Yihan Teng, Zhimin Tang, Xiaochun Ye, Dongrui Fan

    Abstract: Heterogeneous Graph Neural Networks (HGNNs) have broadened the applicability of graph representation learning to heterogeneous graphs. However, the irregular memory access pattern of HGNNs leads to the buffer thrashing issue in HGNN accelerators. In this work, we identify an opportunity to address buffer thrashing in HGNN acceleration through an analysis of the topology of heterogeneous graphs. To… ▽ More

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: 6 pages, 10 figures, accepted by DAC'61

  23. arXiv:2404.02084  [pdf, other]

    cs.CV

    Adaptive Feature Fusion Neural Network for Glaucoma Segmentation on Unseen Fundus Images

    Authors: Jiyuan Zhong, Hu Ke, Ming Yan

    Abstract: Fundus image segmentation on unseen domains is challenging, especially for the over-parameterized deep models trained on the small medical datasets. To address this challenge, we propose a method named Adaptive Feature-fusion Neural Network (AFNN) for glaucoma segmentation on unseen domains, which mainly consists of three modules: domain adaptor, feature-fusion network, and self-supervised multi-t… ▽ More

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 17 pages, 11 figures

  24. arXiv:2404.00461  [pdf, other]

    cs.LG cs.AI cs.CL cs.CR

    Shortcuts Arising from Contrast: Effective and Covert Clean-Label Attacks in Prompt-Based Learning

    Authors: Xiaopeng Xie, Ming Yan, Xiwen Zhou, Chenlong Zhao, Suli Wang, Yong Zhang, Joey Tianyi Zhou

    Abstract: The prompt-based learning paradigm has demonstrated remarkable efficacy in enhancing the adaptability of pretrained language models (PLMs), particularly in few-shot scenarios. However, this learning paradigm has been shown to be vulnerable to backdoor attacks. The current clean-label attack, employing a specific prompt as a trigger, can achieve success without the need for external triggers and ensure… ▽ More

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: 10 pages, 6 figures, conference

    MSC Class: 68T50

    ACM Class: I.2.7

  25. arXiv:2403.19501  [pdf, other]

    cs.CV

    RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method

    Authors: Ming Yan, Yan Zhang, Shuqiang Cai, Shuqi Fan, Xincheng Lin, Yudi Dai, Siqi Shen, Chenglu Wen, Lan Xu, Yuexin Ma, Cheng Wang

    Abstract: Comprehensive capturing of human motions requires both accurate captures of complex poses and precise localization of the human within scenes. Most of the HPE datasets and methods primarily rely on RGB, LiDAR, or IMU data. However, solely using these modalities or a combination of them may not be adequate for HPE, particularly for complex and fast movements. For holistic human motion understanding… ▽ More

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: CVPR2024, Project website: http://www.lidarhumanmotion.net/reli11d/

  26. Collaborative Knowledge Infusion for Low-resource Stance Detection

    Authors: Ming Yan, Joey Tianyi Zhou, Ivor W. Tsang

    Abstract: Stance detection is the view towards a specific target by a given context (e.g., tweets, commercial reviews). Target-related knowledge is often needed to assist stance detection models in understanding the target well and making detection correctly. However, prevailing works for knowledge-infused stance detection predominantly incorporate target knowledge from a singular source that lacks… ▽ More

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: 13 pages, 3 figures, Big Data Mining and Analysis

  27. arXiv:2403.14589  [pdf, other]

    cs.AI cs.CL cs.LG

    ReAct Meets ActRe: When Language Agents Enjoy Training Data Autonomy

    Authors: Zonghan Yang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu

    Abstract: Language agents have demonstrated autonomous decision-making abilities by reasoning with foundation models. Recently, efforts have been made to train language agents for performance improvement, with multi-step reasoning and action trajectories as the training data. However, collecting such trajectories still requires considerable human effort, by either artificial annotation or implementations of… ▽ More

    Submitted 1 April, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

  28. arXiv:2403.13679  [pdf, other]

    cs.CL

    RoleInteract: Evaluating the Social Interaction of Role-Playing Agents

    Authors: Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, Jingren Zhou

    Abstract: Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their… ▽ More

    Submitted 21 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

  29. arXiv:2403.12895  [pdf, other]

    cs.CV

    mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

    Authors: Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

    Abstract: Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure informatio… ▽ More

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: 21 pages, 15 figures

  30. arXiv:2403.07943  [pdf, other]

    cs.LG cs.CR

    Revisiting Edge Perturbation for Graph Neural Network in Graph Data Augmentation and Attack

    Authors: Xin Liu, Yuxiang Zhang, Meng Wu, Mingyu Yan, Kun He, Wei Yan, Shirui Pan, Xiaochun Ye, Dongrui Fan

    Abstract: Edge perturbation is a basic method to modify graph structures. It can be categorized into two veins based on their effects on the performance of graph neural networks (GNNs), i.e., graph data augmentation and attack. Surprisingly, both veins of edge perturbation methods employ the same operations, yet yield opposite effects on GNNs' accuracy. A distinct boundary between these methods in using edg… ▽ More

    Submitted 10 March, 2024; originally announced March 2024.

    Comments: 14 pages

  31. arXiv:2403.07883  [pdf, other]

    cs.CV cs.AI

    Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

    Authors: Wei Ye, Chaoya Jiang, Haiyang Xu, Chenhao Ye, Chenliang Li, Ming Yan, Shikun Zhang, Songhang Huang, Fei Huang

    Abstract: Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational inefficiencies caused by lengthy visual sequences. To address this challenge, we introduce an efficient VLP approach called TRIPS, which stands for Text-Relevan… ▽ More

    Submitted 11 January, 2024; originally announced March 2024.

  32. arXiv:2403.03360  [pdf, other]

    cs.CR

    Bridge the Future: High-Performance Networks in Confidential VMs without Trusted I/O devices

    Authors: Mengyuan Li, Shashvat Srivastava, Mengjia Yan

    Abstract: Trusted I/O (TIO) is an appealing solution to improve I/O performance for confidential VMs (CVMs), with the potential to eliminate broad sources of I/O overhead. However, this paper emphasizes that not all types of I/O can derive substantial benefits from TIO, particularly network I/O. Given the obligatory use of encryption protocols for network traffic in CVM's threat model, TIO's approach of I/O… ▽ More

    Submitted 5 March, 2024; originally announced March 2024.

  33. arXiv:2403.03353  [pdf, ps, other]

    stat.ML cs.LG math.FA

    Hypothesis Spaces for Deep Learning

    Authors: Rui Wang, Yuesheng Xu, Mingsong Yan

    Abstract: This paper introduces a hypothesis space for deep learning that employs deep neural networks (DNNs). By treating a DNN as a function of two variables, the physical variable and parameter variable, we consider the primitive set of the DNNs for the parameter variable located in a set of the weight matrices and biases determined by a prescribed depth and widths of the DNNs. We then complete the linea… ▽ More

    Submitted 11 March, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

  34. arXiv:2403.03089  [pdf, other]

    q-bio.QM cs.AI cs.LG

    VQSynery: Robust Drug Synergy Prediction With Vector Quantization Mechanism

    Authors: Jiawei Wu, Mingyuan Yan, Dianbo Liu

    Abstract: The pursuit of optimizing cancer therapies is significantly advanced by the accurate prediction of drug synergy. Traditional methods, such as clinical trials, are reliable yet encumbered by extensive time and financial demands. The emergence of high-throughput screening and computational innovations has heralded a shift towards more efficient methodologies for exploring drug interactions. In this… ▽ More

    Submitted 5 March, 2024; originally announced March 2024.

  35. arXiv:2403.00249  [pdf, other]

    cs.CV

    Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

    Authors: Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

    Abstract: In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and text is not sufficiently involved in masked modeling. These two drawbacks limit the effect of MIM in facilitating cross-modal semantic alignment. In this work, we… ▽ More

    Submitted 29 February, 2024; originally announced March 2024.

    Comments: Accepted to LREC-COLING 2024

  36. arXiv:2402.17525  [pdf, other]

    cs.CV

    Diffusion Model-Based Image Editing: A Survey

    Authors: Yi Huang, Jiancheng Huang, Yifan Liu, Mingfu Yan, Jiaxi Lv, Jianzhuang Liu, Wei Xiong, He Zhang, Shifeng Chen, Liangliang Cao

    Abstract: Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. The core idea behind them is learning to reverse the process of gradually adding noise to images, allowing them to generate high-quality samples from a complex distribution. In this survey, we provid… ▽ More

    Submitted 16 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

  37. arXiv:2402.16769  [pdf, other]

    cs.CV

    Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

    Authors: Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

    Abstract: In video-text retrieval, most existing methods adopt the dual-encoder architecture for fast retrieval, which employs two individual encoders to extract global latent representations for videos and texts. However, they face challenges in capturing fine-grained semantic concepts. In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and… ▽ More

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Accepted to LREC-COLING 2024

  38. arXiv:2402.15960  [pdf, other]

    cs.AI

    Budget-Constrained Tool Learning with Planning

    Authors: Yuanhang Zheng, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu

    Abstract: Despite intensive efforts devoted to tool learning, the problem of budget-constrained tool learning, which focuses on resolving user queries within a specific budget constraint, has been widely overlooked. This paper proposes a novel method for budget-constrained tool learning. Our approach involves creating a preferable plan under the budget constraint before utilizing the tools. This plan outlin… ▽ More

    Submitted 10 June, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

    Comments: Accepted for Findings of ACL 2024

  39. arXiv:2402.15721  [pdf, other]

    cs.AI cs.CL

    Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

    Authors: Chaoya Jiang, Wei Ye, Mengfan Dong, Hongrui Jia, Haiyang Xu, Ming Yan, Ji Zhang, Shikun Zhang

    Abstract: Large Vision Language Models exhibit remarkable capabilities but struggle with hallucinations, i.e., inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this paper, we introdu… ▽ More

    Submitted 24 February, 2024; originally announced February 2024.

  40. arXiv:2402.14326  [pdf, other]

    cs.MM

    Think before You Leap: Content-Aware Low-Cost Edge-Assisted Video Semantic Segmentation

    Authors: Mingxuan Yan, Yi Wang, Xuedou Xiao, Zhiqing Luo, Jianhua He, Wei Wang

    Abstract: Offloading computing to edge servers is a promising solution to support growing video understanding applications at resource-constrained IoT devices. Recent efforts have been made to enhance the scalability of such systems by reducing inference costs on edge servers. However, existing research is not directly applicable to pixel-level vision tasks such as video semantic segmentation (VSS), partly… ▽ More

    Submitted 27 March, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: Accepted by ACM Multimedia 2023

  41. arXiv:2402.12835  [pdf, other]

    cs.CL cs.AI

    PANDA: Preference Adaptation for Enhancing Domain-Specific Abilities of LLMs

    Authors: An Liu, Zonghan Yang, Zhenhe Zhang, Qingyuan Hu, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu

    Abstract: While Large language models (LLMs) have demonstrated considerable capabilities across various natural language tasks, they often fall short of the performance achieved by domain-specific state-of-the-art models. One potential approach to enhance domain-specific capabilities of LLMs involves fine-tuning them using corresponding datasets. However, this method can be both resource and time-intensive,… ▽ More

    Submitted 17 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted as Findings of ACL 2024

  42. arXiv:2402.12750  [pdf, other]

    cs.CV cs.AI cs.CL

    Model Composition for Multimodal Large Language Models

    Authors: Chi Chen, Yiyang Du, Zheng Fang, Ziyue Wang, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, Yang Liu

    Abstract: Recent developments in Multimodal Large Language Models (MLLMs) have shown rapid progress, moving towards the goal of creating versatile MLLMs that understand inputs from various modalities. However, existing methods typically rely on joint training with paired multimodal instruction data, which is resource-intensive and challenging to extend to new modalities. In this paper, we propose a new para…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: Code will be available at https://github.com/THUNLP-MT/ModelCompose

  43. arXiv:2402.12195  [pdf, other]

    cs.CL

    Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

    Authors: Ziyue Wang, Chi Chen, Yiqi Zhu, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, Yang Liu

    Abstract: With the bloom of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks. However, they fall short in comprehending context involving multiple images. A primary reason for this shortcoming is that the visual features for each image are encoded in…

    Submitted 7 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: 17 pages, 5 figures

  44. arXiv:2402.12146  [pdf, other]

    cs.CL cs.AI cs.LG

    Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

    Authors: Zijun Liu, Boqun Kou, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu

    Abstract: Despite the strong performance of large language models (LLMs) across a wide range of tasks, they still have reliability issues. Previous studies indicate that strong LLMs like GPT-4-turbo excel in evaluating the reliability of responses from LLMs, but face efficiency and local deployment issues. Thus, to enable weak LLMs to effectively assess the reliability of LLM responses, we propose a novel c…

    Submitted 30 May, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: Preprint, under review. 28 pages

  45. arXiv:2402.01528  [pdf, other]

    cs.LG cs.CL

    Decoding Speculative Decoding

    Authors: Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman

    Abstract: Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens and then uses the target LLM to verify those draft tokens. The speedup provided by speculative decoding heavily depends on the choice of the draft model. In this…

    Submitted 26 April, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

  46. arXiv:2401.16158  [pdf, other]

    cs.CL cs.CV

    Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

    Authors: Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

    Abstract: Mobile device agents based on Multimodal Large Language Models (MLLMs) are becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. Based on the perceived vision context, it the…

    Submitted 18 April, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted by ICLR 2024 Workshop on Large Language Model (LLM) Agents

  47. arXiv:2401.07745  [pdf, other]

    cs.CV

    MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation

    Authors: Mi Yan, Jiazhao Zhang, Yan Zhu, He Wang

    Abstract: Open-vocabulary 3D instance segmentation is cutting-edge for its ability to segment 3D instances without predefined categories. However, progress in 3D lags behind its 2D counterpart due to limited annotated 3D data. To address this, recent works first generate 2D open-vocabulary masks through 2D models and then merge them into 3D instances based on metrics calculated between two neighboring frame…

    Submitted 10 April, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

  48. arXiv:2401.07324  [pdf, other]

    cs.AI cs.CL

    Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

    Authors: Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, Fei Huang

    Abstract: Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete various tasks in a self-directed fashion. The challenge of tool use demands that LLMs not only understand user queries and generate answers accurately but also excel in task planning, tool invocation, and result summarizati…

    Submitted 16 February, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

    Comments: In progress; GitHub repo: https://github.com/X-PLUG/Multi-LLM-Agent

  49. arXiv:2401.07013  [pdf, other]

    cs.CL

    Knowledge Distillation for Closed-Source Language Models

    Authors: Hongzhan Chen, Xiaojun Quan, Hehong Chen, Ming Yan, Ji Zhang

    Abstract: Closed-source language models such as GPT-4 have achieved remarkable performance. Many recent studies focus on enhancing the capabilities of smaller models through knowledge distillation from closed-source language models. However, due to the inability to directly access the weights, hidden states, and output distributions of these closed-source models, the distillation can only be performed by…

    Submitted 13 January, 2024; originally announced January 2024.

  50. arXiv:2312.17653  [pdf, other]

    cs.AI

    LARP: Language-Agent Role Play for Open-World Games

    Authors: Ming Yan, Ruihao Li, Hao Zhang, Hao Wang, Zhilan Yang, Ji Yan

    Abstract: Language agents have shown impressive problem-solving skills within defined settings and brief timelines. Yet, with the ever-evolving complexities of open-world simulations, there's a pressing need for agents that can flexibly adapt to complex environments and consistently maintain a long-term memory to ensure coherent actions. To bridge the gap between language agents and open-world games, we int…

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: 12 pages, 4 figures