Skip to main content

Showing 1–50 of 1,173 results for author: Wang, F

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.11466  [pdf, other

    cs.CY

    Navigating the Data Trading Crossroads: An Interdisciplinary Survey

    Authors: Yi Yu, Jingru Yu, Xuhong Wang, Juanjuan Li, Yilun Lin, Conghui He, Yanqing Yang, Yu Qiao, Li Li, Fei-Yue Wang

    Abstract: Data has been increasingly recognized as a critical factor in the future economy. However, constructing an efficient data trading market faces challenges such as privacy breaches, data monopolies, and misuse. Despite numerous studies proposing algorithms to protect privacy and methods for pricing data, a comprehensive understanding of these issues and systemic solutions remain elusive. This paper… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

  2. arXiv:2407.11398  [pdf, other

    cs.CV

    Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

    Authors: Yanqin Jiang, Chaohui Yu, Chenjie Cao, Fan Wang, Weiming Hu, Jin Gao

    Abstract: Recent advances in 4D generation mainly focus on generating 4D content by distilling pre-trained text or single-view image-conditioned models. It is inconvenient for them to take advantage of various off-the-shelf 3D assets with multi-view attributes, and their results suffer from spatiotemporal inconsistency owing to the inherent ambiguity in the supervision signals. In this work, we present Anim… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Project Page: https://animate3d.github.io/

  3. arXiv:2407.11100  [pdf, other

    cs.CR cs.AI cs.CL

    Building Intelligence Identification System via Large Language Model Watermarking: A Survey and Beyond

    Authors: Xuhong Wang, Haoyu Jiang, Yi Yu, Jingru Yu, Yilun Lin, Ping Yi, Yingchun Wang, Qiao Yu, Li Li, Fei-Yue Wang

    Abstract: Large Language Models (LLMs) are increasingly integrated into diverse industries, posing substantial security risks due to unauthorized replication and misuse. To mitigate these concerns, robust identification mechanisms are widely acknowledged as an effective strategy. Identification systems for LLMs now rely heavily on watermarking technology to manage and protect intellectual property and ensur… ▽ More

    Submitted 16 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: 59 pages, 7 figures

  4. arXiv:2407.08195  [pdf

    cs.AI cs.CL cs.MA

    A Text-to-Game Engine for UGC-Based Role-Playing Games

    Authors: Lei Zhang, Xuezheng Peng, Shuyi Yang, Feiyang Wang

    Abstract: The shift from professionally generated content (PGC) to user-generated content (UGC) has revolutionized various media formats, from text to video. With the rapid advancements in generative AI, a similar shift is set to transform the game industry, particularly in the realm of role-playing games (RPGs). This paper introduces a new framework for a text-to-game engine that utilizes foundation models… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: 13 pages,11 figures

  5. arXiv:2407.06546  [pdf, other

    cs.CV cs.RO

    Exploring the Causality of End-to-End Autonomous Driving

    Authors: Jiankun Li, Hao Li, Jiangjiang Liu, Zhikang Zou, Xiaoqing Ye, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

    Abstract: Deep learning-based models are widely deployed in autonomous driving areas, especially the increasingly noticed end-to-end solutions. However, the black-box property of these models raises concerns about their trustworthiness and safety for autonomous driving, and how to debug the causality has become a pressing concern. Despite some existing research on the explainability of autonomous driving, t… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

  6. arXiv:2407.06204  [pdf, other

    cs.LG cs.CL

    A Survey on Mixture of Experts

    Authors: Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang

    Abstract: Large language models (LLMs) have garnered unprecedented advancements across diverse fields, ranging from natural language processing to computer vision and beyond. The prowess of LLMs is underpinned by their substantial model size, extensive and diverse datasets, and the vast computational power harnessed during training, all of which contribute to the emergent abilities of LLMs (e.g., in-context… ▽ More

    Submitted 26 June, 2024; originally announced July 2024.

  7. arXiv:2407.05679  [pdf, other

    cs.CV cs.AI

    BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

    Authors: Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

    Abstract: World models are receiving increasing attention in autonomous driving for their ability to predict potential future scenarios. In this paper, we present BEVWorld, a novel approach that tokenizes multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for environment modeling. The world model consists of two parts: the multi-modal tokenizer and the latent BEV sequence… ▽ More

    Submitted 18 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: 10 pages

  8. arXiv:2407.05458  [pdf, other

    cs.AI

    A Survey of Models for Cognitive Diagnosis: New Developments and Future Directions

    Authors: Fei Wang, Weibo Gao, Qi Liu, Jiatong Li, Guanhao Zhao, Zheng Zhang, Zhenya Huang, Mengxiao Zhu, Shijin Wang, Wei Tong, Enhong Chen

    Abstract: Cognitive diagnosis has been developed for decades as an effective measurement tool to evaluate human cognitive status such as ability level and knowledge mastery. It has been applied to a wide range of fields including education, sport, psychological diagnosis, etc. By providing better awareness of cognitive status, it can serve as the basis for personalized services such as well-designed medical… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  9. arXiv:2407.05213  [pdf, other

    cs.CL cs.AI

    BadCLM: Backdoor Attack in Clinical Language Models for Electronic Health Records

    Authors: Weimin Lyu, Zexin Bi, Fusheng Wang, Chao Chen

    Abstract: The advent of clinical language models integrated into electronic health records (EHR) for clinical decision support has marked a significant advancement, leveraging the depth of clinical notes for improved decision-making. Despite their success, the potential vulnerabilities of these models remain largely unexplored. This paper delves into the realm of backdoor attacks on clinical language models… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

    Comments: AMIA 2024

  10. arXiv:2407.05017  [pdf, other

    cs.RO

    VIPS-Odom: Visual-Inertial Odometry Tightly-coupled with Parking Slots for Autonomous Parking

    Authors: Xuefeng Jiang, Fangyuan Wang, Rongzhang Zheng, Han Liu, Yixiong Huo, Jinzhang Peng, Lu Tian, Emad Barsoum

    Abstract: Precise localization is of great importance for autonomous parking task since it provides service for the downstream planning and control modules, which significantly affects the system performance. For parking scenarios, dynamic lighting, sparse textures, and the instability of global positioning system (GPS) signals pose challenges for most traditional localization methods. To address these diff… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

    Comments: A SLAM Method for Autonomous Parking

  11. arXiv:2407.04490  [pdf, other

    cs.CV

    Micro-gesture Online Recognition using Learnable Query Points

    Authors: Pengyu Liu, Fei Wang, Kun Li, Guoliang Chen, Yanyan Wei, Shengeng Tang, Zhiliang Wu, Dan Guo

    Abstract: In this paper, we briefly introduce the solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track in the MiGA challenge at IJCAI 2024. The Micro-gesture Online Recognition task involves identifying the category and locating the start and end times of micro-gestures in video clips. Compared to the typical Temporal Action Detection task, the Micro-gesture Online Recogn… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Technical Report of HFUT-VUT for the MiGA challenge at IJCAI 2024

  12. arXiv:2407.04461  [pdf, other

    cs.CV

    VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing

    Authors: Shang Liu, Chaohui Yu, Chenjie Cao, Wen Qian, Fan Wang

    Abstract: Recent research on texture synthesis for 3D shapes benefits a lot from dramatically developed 2D text-to-image diffusion models, including inpainting-based and optimization-based approaches. However, these methods ignore the modal gap between the 2D diffusion model and 3D objects, which primarily render 3D objects into 2D images and texture each image separately. In this paper, we revisit the text… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  13. arXiv:2407.02698  [pdf, other

    cs.CR cs.NI

    Navigating Connected Car Cybersecurity: Location Anomaly Detection with RAN Data

    Authors: Feng Wang, Yaron Koral, Kenichi Futamura

    Abstract: The cybersecurity of connected cars, integral to the broader Internet of Things (IoT) landscape, has become of paramount concern. Cyber-attacks, including hijacking and spoofing, pose significant threats to these technological advancements, potentially leading to unauthorized control over vehicular networks or creating deceptive identities. Given the difficulty of deploying comprehensive defensive… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted by VTC2024-Spring

  14. arXiv:2407.02457  [pdf, other

    cs.MM

    Volume Tracking Based Reference Mesh Extraction for Time-Varying Mesh Compression

    Authors: Guodong Chen, Libor Vasa, Fulin Wang, Mallesham Dasari

    Abstract: Time-Varying meshes (TVMs), characterized by their varying connectivity and number of vertices, hold significant potential in immersive media and other various applications. However, their practical utilization is challenging due to their time-varying features and large file sizes. Creating a reference mesh that contains the most essential information is a promising approach to utilizing shared in… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 6 pages

  15. arXiv:2407.02081  [pdf, other

    cs.DC

    On the Performance and Memory Footprint of Distributed Training: An Empirical Study on Transformers

    Authors: Zhengxian Lu, Fangyu Wang, Zhiwei Xu, Fei Yang, Tao Li

    Abstract: Transformer models have emerged as potent solutions to a wide array of multidisciplinary challenges. The deployment of Transformer architectures is significantly hindered by their extensive computational and memory requirements, necessitating the reliance on advanced efficient distributed training methodologies. Prior research has delved into the performance bottlenecks associated with distributed… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  16. arXiv:2407.02053  [pdf, ps, other

    cs.IT cs.CR

    Secure Semantic Communication via Paired Adversarial Residual Networks

    Authors: Boxiang He, Fanggang Wang, Tony Q. S. Quek

    Abstract: This letter explores the positive side of the adversarial attack for the security-aware semantic communication system. Specifically, a pair of matching pluggable modules is installed: one after the semantic transmitter and the other before the semantic receiver. The module at transmitter uses a trainable adversarial residual network (ARN) to generate adversarial examples, while the module at recei… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  17. arXiv:2407.00902  [pdf, other

    cs.CV cs.AI cs.CL cs.LG

    From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

    Authors: Nan Xu, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen

    Abstract: Motivated by in-context learning (ICL) capabilities of Large Language models (LLMs), multimodal LLMs with additional visual modality are also exhibited with similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively less work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled eval… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  18. arXiv:2407.00787  [pdf, other

    cs.IR cs.LG

    Enhancing Travel Decision-Making: A Contrastive Learning Approach for Personalized Review Rankings in Accommodations

    Authors: Reda Igebaria, Eran Fainman, Sarai Mizrachi, Moran Beladev, Fengjun Wang

    Abstract: User-generated reviews significantly influence consumer decisions, particularly in the travel domain when selecting accommodations. This paper contribution comprising two main elements. Firstly, we present a novel dataset of authentic guest reviews sourced from a prominent online travel platform, totaling over two million reviews from 50,000 distinct accommodations. Secondly, we propose an innovat… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  19. arXiv:2407.00187  [pdf, other

    cs.RO cs.CV cs.GR

    SMPLOlympics: Sports Environments for Physically Simulated Humanoids

    Authors: Zhengyi Luo, Jiashun Wang, Kangni Liu, Haotian Zhang, Chen Tessler, Jingbo Wang, Ye Yuan, Jinkun Cao, Zihui Lin, Fengyi Wang, Jessica Hodgins, Kris Kitani

    Abstract: We present SMPLOlympics, a collection of physically simulated environments that allow humanoids to compete in a variety of Olympic sports. Sports simulation offers a rich and standardized testing ground for evaluating and improving the capabilities of learning algorithms due to the diversity and physically demanding nature of athletic activities. As humans have been competing in these sports for m… ▽ More

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: Project page: https://smplolympics.github.io/SMPLOlympics

  20. arXiv:2407.00042  [pdf

    q-bio.NC cs.SI eess.SY

    Module control of network analysis in psychopathology

    Authors: Chunyu Pan, Quan Zhang, Yue Zhu, Shengzhou Kong, Juan Liu, Changsheng Zhang, Fei Wang, Xizhe Zhang

    Abstract: The network approach to characterizing psychopathology departs from traditional latent categorical and dimensional approaches. Causal interplay among symptoms contributed to dynamic psychopathology system. Therefore, analyzing the symptom clusters is critical for understanding mental disorders. Furthermore, despite extensive research studying the topological features of symptom networks, the contr… ▽ More

    Submitted 30 May, 2024; originally announced July 2024.

  21. arXiv:2406.20066  [pdf, other

    cs.CV

    ASSR-NeRF: Arbitrary-Scale Super-Resolution on Voxel Grid for High-Quality Radiance Fields Reconstruction

    Authors: Ding-Jiun Huang, Zi-Ting Chou, Yu-Chiang Frank Wang, Cheng Sun

    Abstract: NeRF-based methods reconstruct 3D scenes by building a radiance field with implicit or explicit representations. While NeRF-based methods can perform novel view synthesis (NVS) at arbitrary scale, the performance in high-resolution novel view synthesis (HRNVS) with low-resolution (LR) optimization often results in oversmoothing. On the other hand, single-image super-resolution (SR) aims to enhance… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  22. arXiv:2406.19853  [pdf, other

    cs.CL cs.AI

    YuLan: An Open-source Large Language Model

    Authors: Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang, Shufang Xie, Wayne Xin Zhao, Zhicheng Dou , et al. (13 additional authors not shown)

    Abstract: Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billi… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  23. arXiv:2406.19392  [pdf, other

    cs.CV

    ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

    Authors: Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, Yu-Chiang Frank Wang

    Abstract: We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e. human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, requiring advanced understanding of cause-and-effect relationships across vi… ▽ More

    Submitted 2 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

    Comments: Project page: https://rextime.github.io/

  24. arXiv:2406.19043  [pdf

    eess.IV cs.AI cs.CV cs.DB

    CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI

    Authors: Zi Wang, Fanwen Wang, Chen Qin, Jun Lyu, Ouyang Cheng, Shuo Wang, Yan Li, Mengyao Yu, Haoyu Zhang, Kunyuan Guo, Zhang Shi, Qirong Li, Ziqiang Xu, Yajing Zhang, Hao Li, Sha Hua, Binghua Chen, Longyu Sun, Mengting Sun, Qin Li, Ying-Hua Chu, Wenjia Bai, Jing Qin, Xiahai Zhuang, Claudia Prieto , et al. (7 additional authors not shown)

    Abstract: Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover h… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 19 pages, 3 figures, 2 tables

  25. arXiv:2406.18871  [pdf, other

    eess.AS cs.CL

    DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

    Abstract: Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby fa… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  26. arXiv:2406.18583  [pdf, other

    cs.CV cs.LG

    Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

    Authors: Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

    Abstract: Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lu… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Code at: https://github.com/Alpha-VLLM/Lumina-T2X

  27. arXiv:2406.17812  [pdf, other

    cs.LG cs.AI cs.DC

    Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars

    Authors: Wesley Brewer, Aditya Kashi, Sajal Dash, Aristeidis Tsaris, Junqi Yin, Mallikarjun Shankar, Feiyi Wang

    Abstract: In a post-ChatGPT world, this paper explores the potential of leveraging scalable artificial intelligence for scientific discovery. We propose that scaling up artificial intelligence on high-performance computing platforms is essential to address such complex problems. This perspective focuses on scientific use cases like cognitive simulations, large language models for scientific inquiry, medical… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: 17 pages, 5 figures

  28. arXiv:2406.16253  [pdf, other

    cs.CL

    LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing

    Authors: Jiangshu Du, Yibo Wang, Wenting Zhao, Zhongfen Deng, Shuaiqi Liu, Renze Lou, Henry Peng Zou, Pranav Narayanan Venkit, Nan Zhang, Mukund Srinath, Haoran Ranran Zhang, Vipul Gupta, Yinghui Li, Tao Li, Fei Wang, Qin Liu, Tianlin Liu, Pengzhi Gao, Congying Xia, Chen Xing, Jiayang Cheng, Zhaowei Wang, Ying Su, Raj Sanjay Shah, Ruohao Guo , et al. (15 additional authors not shown)

    Abstract: This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as th… ▽ More

    Submitted 25 June, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

  29. arXiv:2406.15762  [pdf, other

    cs.LG stat.ML

    Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow

    Authors: Zhichao Chen, Haoxuan Li, Fangyikang Wang, Odin Zhang, Hu Xu, Xiaoyu Jiang, Zhihuan Song, Eric H. Wang

    Abstract: Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but there remain two long-neglected issues to be addressed: (1). Inaccurate Imputation, which arises from inherently sample-diversification-pursuing generative process of DMs. (2). Difficult Training, which stems from intricate design required for the mask matrix in model training stage. To address these concerns within… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

  30. arXiv:2406.15459  [pdf, other

    cs.GT cs.CE cs.LG

    Large-Scale Contextual Market Equilibrium Computation through Deep Learning

    Authors: Yunxuan Ma, Yide Bian, Hao Xu, Weitao Yang, Jingshu Zhao, Zhijian Duan, Feng Wang, Xiaotie Deng

    Abstract: Market equilibrium is one of the most fundamental solution concepts in economics and social optimization analysis. Existing works on market equilibrium computation primarily focus on settings with a relatively small number of buyers. Motivated by this, our paper investigates the computation of market equilibrium in scenarios with a large-scale buyer population, where buyers and goods are represent… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 22 pages

  31. arXiv:2406.14825  [pdf, other

    cs.CL

    TemPrompt: Multi-Task Prompt Learning for Temporal Relation Extraction in RAG-based Crowdsourcing Systems

    Authors: Jing Yang, Yu Zhao, Linyao Yang, Xiao Wang, Long Chen, Fei-Yue Wang

    Abstract: Temporal relation extraction (TRE) aims to grasp the evolution of events or actions, and thus shape the workflow of associated tasks, so it holds promise in helping understand task requests initiated by requesters in crowdsourcing systems. However, existing methods still struggle with limited and unevenly distributed annotated data. Therefore, inspired by the abundant global knowledge stored withi… ▽ More

    Submitted 9 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: 12 pages, 9 figures

  32. arXiv:2406.14315  [pdf, other

    cs.DC

    AI-coupled HPC Workflow Applications, Middleware and Performance

    Authors: Wes Brewer, Ana Gainaru, Frédéric Suter, Feiyi Wang, Murali Emani, Shantenu Jha

    Abstract: AI integration is revolutionizing the landscape of HPC simulations, enhancing the importance, use, and performance of AI-driven HPC workflows. This paper surveys the diverse and rapidly evolving field of AI-driven HPC and provides a common conceptual basis for understanding AI-driven HPC workflows. Specifically, we use insights from different modes of coupling AI into HPC workflows to propose six… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  33. arXiv:2406.13445  [pdf, other

    cs.CV cs.AI

    Lost in UNet: Improving Infrared Small Target Detection by Underappreciated Local Features

    Authors: Wuzhou Quan, Wei Zhao, Weiming Wang, Haoran Xie, Fu Lee Wang, Mingqiang Wei

    Abstract: Many targets are often very small in infrared images due to the long-distance imaging meachnism. UNet and its variants, as popular detection backbone networks, downsample the local features early and cause the irreversible loss of these local features, leading to both the missed and false detection of small targets in infrared images. We propose HintU, a novel network to recover the local features… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  34. arXiv:2406.12834  [pdf, other

    cs.CV

    GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation

    Authors: Ci-Siang Lin, I-Jieh Liu, Min-Hung Chen, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang

    Abstract: Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence throughout the entire video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we aim to efficiently adapt foundation segmentation models for addressing RVOS from weak supervision with the proposed… ▽ More

    Submitted 23 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: CVPR Workshop (CVinW) 2024. Project page: https://jack24658735.github.io/groprompt/

  35. arXiv:2406.11933  [pdf, other

    cs.CV

    Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset

    Authors: Fengxiang Wang, Hongzhen Wang, Di Wang, Zonghao Guo, Zhenyu Zhong, Long Lan, Jing Zhang, Zhiyuan Liu, Maosong Sun

    Abstract: Masked Image Modeling (MIM) has emerged as a pivotal approach for developing foundational visual models in the field of remote sensing (RS). However, current RS datasets are limited in volume and diversity, which significantly constrains the capacity of MIM methods to learn generalizable representations. In this study, we introduce \textbf{RS-4M}, a large-scale dataset designed to enable highly ef… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  36. arXiv:2406.11839  [pdf, other

    cs.CV cs.AI cs.CL cs.LG

    mDPO: Conditional Preference Optimization for Multimodal Large Language Models

    Authors: Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen

    Abstract: Direct preference optimization (DPO) has shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the ima… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  37. arXiv:2406.11243  [pdf, other

    cs.CL cs.AI

    FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation

    Authors: Bangzheng Li, Ben Zhou, Xingyu Fu, Fei Wang, Dan Roth, Muhao Chen

    Abstract: Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-agnostic prompt metrics that can better estimate end-task performances. One popular approach is using perplexity as a way to measure models' familiarity… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  38. arXiv:2406.09411  [pdf, other

    cs.CV cs.AI cs.CL

    MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    Authors: Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen

    Abstract: We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a… ▽ More

    Submitted 1 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: typos corrected, references added, Project Page: https://muirbench.github.io/

  39. arXiv:2406.08894  [pdf, other

    cs.CV

    OpenMaterial: A Comprehensive Dataset of Complex Materials for 3D Reconstruction

    Authors: Zheng Dang, Jialu Huang, Fei Wang, Mathieu Salzmann

    Abstract: Recent advances in deep learning such as neural radiance fields and implicit neural representations have significantly propelled the field of 3D reconstruction. However, accurately reconstructing objects with complex optical properties, such as metals and glass, remains a formidable challenge due to their unique specular and light-transmission characteristics. To facilitate the development of solu… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  40. arXiv:2406.07645  [pdf, other

    cs.CV cs.MM

    SSNVC: Single Stream Neural Video Compression with Implicit Temporal Information

    Authors: Feng Wang, Haihang Ruan, Zhihuang Xie, Ronggang Wang, Xiangyu Yue

    Abstract: Recently, Neural Video Compression (NVC) techniques have achieved remarkable performance, even surpassing the best traditional lossy video codec. However, most existing NVC methods heavily rely on transmitting Motion Vector (MV) to generate accurate contextual features, which has the following drawbacks. (1) Compressing and transmitting MV requires specialized MV encoder and decoder, which makes m… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by DCC 2024 as Poster. This is the full paper

  41. arXiv:2406.07537  [pdf, other

    cs.CV

    Autoregressive Pretraining with Mamba in Vision

    Authors: Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie

    Abstract: The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature can well capitalize on the Mamba's unidirectional recurrent structur… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

  42. arXiv:2406.07061  [pdf, other

    eess.IV cs.CV

    Triage of 3D pathology data via 2.5D multiple-instance learning to guide pathologist assessments

    Authors: Gan Gao, Andrew H. Song, Fiona Wang, David Brenes, Rui Wang, Sarah S. L. Chow, Kevin W. Bishop, Lawrence D. True, Faisal Mahmood, Jonathan T. C. Liu

    Abstract: Accurate patient diagnoses based on human tissue biopsies are hindered by current clinical practice, where pathologists assess only a limited number of thin 2D tissue slices sectioned from 3D volumetric tissue. Recent advances in non-destructive 3D pathology, such as open-top light-sheet microscopy, enable comprehensive imaging of spatially heterogeneous tissue morphologies, offering the feasibili… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: CVPR CVMI 2024

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 6955-6965

  43. arXiv:2406.06577  [pdf, other

    cs.CL cs.AI

    RAG-based Crowdsourcing Task Decomposition via Masked Contrastive Learning with Prompts

    Authors: Jing Yang, Xiao Wang, Yu Zhao, Yuhang Liu, Fei-Yue Wang

    Abstract: Crowdsourcing is a critical technology in social manufacturing, which leverages an extensive and boundless reservoir of human resources to handle a wide array of complex tasks. The successful execution of these complex tasks relies on task decomposition (TD) and allocation, with the former being a prerequisite for the latter. Recently, pre-trained language models (PLMs)-based methods have garnered… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 13 pages, 9 figures

  44. arXiv:2406.05692  [pdf, other

    cs.SD cs.AI eess.AS

    SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion

    Authors: Bingsong Bai, Fengping Wang, Yingming Gao, Ya Li

    Abstract: Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality compared to traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, the models tend to generate audios with hoarseness, posing challenges in achieving high-quality vocal outputs. Therefore, in this paper, we prop… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  45. arXiv:2406.05515  [pdf, other

    cs.SD cs.CL eess.AS

    Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation

    Authors: Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim

    Abstract: Acoustic context effects, where surrounding changes in pitch, rate or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second language (L2) speakers of English (/… ▽ More

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  46. arXiv:2406.05000  [pdf, other

    cs.CV

    AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation

    Authors: Lianyu Pang, Jian Yin, Baoquan Zhao, Feize Wu, Fu Lee Wang, Qing Li, Xudong Mao

    Abstract: Recent advances in text-to-image models have enabled high-quality personalized image synthesis of user-provided concepts with flexible textual control. In this work, we analyze the limitations of two primary techniques in text-to-image personalization: Textual Inversion and DreamBooth. When integrating the learned concept into new prompts, Textual Inversion tends to overfit the concept, while Drea… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  47. arXiv:2406.04998  [pdf, other

    cs.LG cs.AI cs.CV

    ADBA:Approximation Decision Boundary Approach for Black-Box Adversarial Attacks

    Authors: Feiyang Wang, Xingquan Zuo, Hai Huang, Gang Chen

    Abstract: Many machine learning models are susceptible to adversarial attacks, with decision-based black-box attacks representing the most critical threat in real-world applications. These attacks are extremely stealthy, generating adversarial examples using hard labels obtained from the target machine learning model. This is typically realized by optimizing perturbation directions, guided by decision bound… ▽ More

    Submitted 12 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: 10 pages, 5 figures, conference

  48. arXiv:2406.04727  [pdf, other

    cs.LG cond-mat.soft cs.AI

    Predicting Polymer Properties Based on Multimodal Multitask Pretraining

    Authors: Fanmeng Wang, Wentao Guo, Minjie Cheng, Shen Yuan, Hongteng Xu, Zhifeng Gao

    Abstract: In the past few decades, polymers, high-molecular-weight compounds formed by bonding numerous identical or similar monomers covalently, have played an essential role in various scientific fields. In this context, accurate prediction of their properties is becoming increasingly crucial. Typically, the properties of a polymer, such as plasticity, conductivity, bio-compatibility, and so on, are highl… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  49. arXiv:2406.04340  [pdf, other

    cs.CV

    GLACE: Global Local Accelerated Coordinate Encoding

    Authors: Fangjinhua Wang, Xudong Jiang, Silvano Galliani, Christoph Vogel, Marc Pollefeys

    Abstract: Scene coordinate regression (SCR) methods are a family of visual localization methods that directly regress 2D-3D matches for camera pose estimation. They are effective in small-scale scenes but face significant challenges in large-scale scenes that are further amplified in the absence of ground truth 3D point clouds for supervision. Here, the model can only rely on reprojection constraints and ne… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Large-scale visual localization with a single optimizable MLP. CVPR 2024. Code: https://github.com/cvg/glace. Project page: https://xjiangan.github.io/glace

  50. arXiv:2406.03799  [pdf

    cs.CV cs.AI

    Enhanced Semantic Segmentation Pipeline for WeatherProof Dataset Challenge

    Authors: Nan Zhang, Xidan Zhang, Jianing Wei, Fangjun Wang, Zhiming Tan

    Abstract: This report describes the winning solution to the WeatherProof Dataset Challenge (CVPR 2024 UG2+ Track 3). Details regarding the challenge are available at https://cvpr2024ug2challenge.github.io/track3.html. We propose an enhanced semantic segmentation pipeline for this challenge. Firstly, we improve semantic segmentation models, using backbone pretrained with Depth Anything to improve UperNet mod… ▽ More

    Submitted 6 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.