
Showing 1–50 of 174 results for author: Bai, S

Searching in archive cs.
  1. arXiv:2407.13038 [pdf, other]

    cs.CV cs.LG

    Universal Facial Encoding of Codec Avatars from VR Headsets

    Authors: Shaojie Bai, Te-Li Wang, Chenghui Li, Akshay Venkatesh, Tomas Simon, Chen Cao, Gabriel Schwartz, Ryan Wrench, Jason Saragih, Yaser Sheikh, Shih-En Wei

    Abstract: Faithful real-time facial animation is essential for avatar-mediated telepresence in Virtual Reality (VR). To emulate authentic communication, avatar animation needs to be efficient and accurate: able to capture both extreme and subtle expressions within a few milliseconds to sustain the rhythm of natural conversations. The oblique and incomplete views of the face, variability in the donning of he…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: SIGGRAPH 2024 (ACM Transactions on Graphics (TOG))

    Journal ref: ACM Trans. Graph. 43, 4, Article 93 (July 2024), 22 pages.

  2. arXiv:2407.10671 [pdf, other]

    cs.CL cs.AI

    Qwen2 Technical Report

    Authors: An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, et al. (37 additional authors not shown)

    Abstract: This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, a…

    Submitted 17 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: 25 pages, 1 figure

  3. arXiv:2406.17005 [pdf, other]

    cs.CV

    PVUW 2024 Challenge on Complex Video Understanding: Methods and Results

    Authors: Henghui Ding, Chang Liu, Yunchao Wei, Nikhila Ravi, Shuting He, Song Bai, Philip Torr, Deshui Miao, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang, Zhensong Xu, Jiangtao Yao, Chengjing Wu, Ting Liu, Luoqi Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang, Mingqi Gao, Jingnan Luo, et al. (12 additional authors not shown)

    Abstract: Pixel-level Video Understanding in the Wild Challenge (PVUW) focus on complex video understanding. In this CVPR 2024 workshop, we add two new tracks, Complex Video Object Segmentation Track based on MOSE dataset and Motion Expression guided Video Segmentation track based on MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: MOSE Challenge: https://henghuiding.github.io/MOSE/ChallengeCVPR2024, MeViS Challenge: https://henghuiding.github.io/MeViS/ChallengeCVPR2024

  4. arXiv:2406.04322 [pdf, other]

    cs.CV

    DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

    Authors: Qihao Liu, Yi Zhang, Song Bai, Adam Kortylewski, Alan Yuille

    Abstract: We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, limiting them to single or few-class generation, our model is directly trained on extensive noisy and unaligned `in-the-wild' 3D assets, mitigating the key challenge…

    Submitted 6 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted to CVPR 2024. Code: https://github.com/qihao067/direct3d Project page: https://direct-3d.github.io/

  5. arXiv:2406.00532 [pdf, other]

    cs.AI cs.LG

    Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques

    Authors: Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik

    Abstract: Breast cancer (BC) stands as one of the most common malignancies affecting women worldwide, necessitating advancements in diagnostic methodologies for better clinical outcomes. This article provides a comprehensive exploration of the application of Explainable Artificial Intelligence (XAI) techniques in the detection and diagnosis of breast cancer. As Artificial Intelligence (AI) technologies cont…

    Submitted 1 June, 2024; originally announced June 2024.

  6. arXiv:2405.08779 [pdf, other]

    cs.LG

    Jacobian Regularizer-based Neural Granger Causality

    Authors: Wanqi Zhou, Shuanghao Bai, Shujian Yu, Qibin Zhao, Badong Chen

    Abstract: With the advancement of neural networks, diverse methods for neural Granger causality have emerged, which demonstrate proficiency in handling complex data, and nonlinear relationships. However, the existing framework of neural Granger causality has several limitations. It requires the construction of separate predictive models for each target variable, and the relationship depends on the sparsity…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 20 pages, 7 figures, ICML 2024

  7. arXiv:2405.08484 [pdf, other]

    quant-ph cs.LG nlin.CD stat.ML

    Universal replication of chaotic characteristics by classical and quantum machine learning

    Authors: Sheng-Chen Bai, Shi-Ju Ran

    Abstract: Replicating chaotic characteristics of non-linear dynamics by machine learning (ML) has recently drawn wide attentions. In this work, we propose that a ML model, trained to predict the state one-step-ahead from several latest historic states, can accurately replicate the bifurcation diagram and the Lyapunov exponents of discrete dynamic systems. The characteristics for different values of the hype…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 8 pages, 4 figures

  8. arXiv:2404.19287 [pdf, other]

    cs.CV

    Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

    Authors: Wanqi Zhou, Shuanghao Bai, Qibin Zhao, Badong Chen

    Abstract: Pretrained vision-language models (VLMs) like CLIP have shown impressive generalization performance across various downstream tasks, yet they remain vulnerable to adversarial attacks. While prior research has primarily concentrated on improving the adversarial robustness of image encoders to guard against attacks on images, the exploration of text-based and multimodal attacks has largely been over…

    Submitted 17 July, 2024; v1 submitted 30 April, 2024; originally announced April 2024.

    Comments: 16 pages, 14 figures

  9. arXiv:2404.19286 [pdf, other]

    cs.CV

    Soft Prompt Generation for Domain Generalization

    Authors: Shuanghao Bai, Yuedi Zhang, Wanqi Zhou, Zhirong Luan, Badong Chen

    Abstract: Large pre-trained vision language models (VLMs) have shown impressive zero-shot ability on downstream tasks with manually designed prompt. To further adapt VLMs to downstream tasks, soft prompt is proposed to replace manually designed prompt, which undergoes fine-tuning based on specific domain data. Prior prompt learning methods primarily learn a fixed prompt or residuled prompt from training sam…

    Submitted 12 July, 2024; v1 submitted 30 April, 2024; originally announced April 2024.

    Comments: 25 pages, 4 figures, accepted by ECCV 2024

  10. arXiv:2404.14724 [pdf]

    cs.RO

    Tightly Joined Positioning and Control Model for Unmanned Aerial Vehicles Based on Factor Graph Optimization

    Authors: Peiwen Yang, Weisong Wen, Shiyu Bai, Li-Ta Hsu

    Abstract: The execution of flight missions by unmanned aerial vehicles (UAV) primarily relies on navigation. In particular, the navigation pipeline has traditionally been divided into positioning and control, operating in a sequential loop. However, the existing navigation pipeline, where the positioning and control are decoupled, struggles to adapt to ubiquitous uncertainties arising from measurement noise…

    Submitted 22 April, 2024; originally announced April 2024.

  11. arXiv:2404.14471 [pdf, other]

    cs.CV

    Narrative Action Evaluation with Prompt-Guided Multimodal Interaction

    Authors: Shiyi Zhang, Sule Bai, Guangyi Chen, Lei Chen, Jiwen Lu, Junle Wang, Yansong Tang

    Abstract: In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate d…

    Submitted 26 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2024

  12. arXiv:2404.10499 [pdf, other]

    cs.CV cs.AI

    Robust Noisy Label Learning via Two-Stream Sample Distillation

    Authors: Sihan Bai, Sanping Zhou, Zheng Qin, Le Wang, Nanning Zheng

    Abstract: Noisy label learning aims to learn robust networks under the supervision of noisy labels, which plays a critical role in deep learning. Existing work either conducts sample selection or label correction to deal with noisy labels during the model training process. In this paper, we design a simple yet effective sample selection framework, termed Two-Stream Sample Distillation (TSSD), for noisy labe…

    Submitted 16 April, 2024; originally announced April 2024.

  13. arXiv:2404.03067 [pdf, other]

    cs.RO cs.CV

    Self-supervised 6-DoF Robot Grasping by Demonstration via Augmented Reality Teleoperation System

    Authors: Xiwen Dengxiong, Xueting Wang, Shi Bai, Yunbo Zhang

    Abstract: Most existing 6-DoF robot grasping solutions depend on strong supervision on grasp pose to ensure satisfactory performance, which could be laborious and impractical when the robot works in some restricted area. To this end, we propose a self-supervised 6-DoF grasp pose detection framework via an Augmented Reality (AR) teleoperation system that can efficiently learn human demonstrations and provide…

    Submitted 3 April, 2024; originally announced April 2024.

  14. arXiv:2404.01853 [pdf, other]

    cs.LG cs.CV

    Pairwise Similarity Distribution Clustering for Noisy Label Learning

    Authors: Sihan Bai

    Abstract: Noisy label learning aims to train deep neural networks using a large amount of samples with noisy labels, whose main challenge comes from how to deal with the inaccurate supervision caused by wrong labels. Existing works either take the label correction or sample selection paradigm to involve more samples with accurate labels into the training process. In this paper, we propose a simple yet effec…

    Submitted 2 April, 2024; originally announced April 2024.

  15. arXiv:2403.08506 [pdf, other]

    cs.LG cs.AI cs.CV

    DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning

    Authors: Sikai Bai, Jie Zhang, Shuaicheng Li, Song Guo, Jingcai Guo, Jun Hou, Tao Han, Xiaocheng Lu

    Abstract: Federated learning (FL) has emerged as a powerful paradigm for learning from decentralized data, and federated domain generalization further considers the test dataset (target domain) is absent from the decentralized training data (source domains). However, most existing FL methods assume that domain labels are provided during training, and their evaluation imposes explicit constraints on the numb…

    Submitted 11 March, 2024; originally announced March 2024.

    Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

  16. arXiv:2403.08192 [pdf, other]

    cs.CL q-bio.BM

    MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension

    Authors: Xingyu Lu, He Cao, Zijing Liu, Shengyuan Bai, Leqing Chen, Yuan Yao, Hai-Tao Zheng, Yu Li

    Abstract: Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information, posing challenges to accurate molecular comprehension. Traditional evaluation metrics for generated content fail to assess a model's accuracy in molecular understanding. To rectify the absence of factual evaluation, we present MoleculeQA, a novel quest…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 19 pages, 8 figures

  17. arXiv:2403.06764 [pdf, other]

    cs.CV cs.AI cs.CL

    An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

    Authors: Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang

    Abstract: In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we i…

    Submitted 25 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: 21 pages, 8 figures, code is released at https://github.com/pkunlp-icler/FastV

  18. arXiv:2402.14577 [pdf, other]

    cs.CV

    Debiasing Text-to-Image Diffusion Models

    Authors: Ruifei He, Chuhui Xue, Haoru Tan, Wenqing Zhang, Yingchen Yu, Song Bai, Xiaojuan Qi

    Abstract: Learning-based Text-to-Image (TTI) models like Stable Diffusion have revolutionized the way visual content is generated in various domains. However, recent research has shown that nonnegligible social bias exists in current state-of-the-art TTI systems, which raises important concerns. In this work, we target resolving the social bias in TTI diffusion models. We begin by formalizing the problem se…

    Submitted 22 February, 2024; originally announced February 2024.

  19. arXiv:2401.15865 [pdf, other]

    cs.CV

    LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection

    Authors: Sifan Zhou, Liang Li, Xinyu Zhang, Bo Zhang, Shipeng Bai, Miao Sun, Ziyu Zhao, Xiaobo Lu, Xiangxiang Chu

    Abstract: Due to highly constrained computing power and memory, deploying 3D lidar-based detectors on edge devices equipped in autonomous vehicles and robots poses a crucial challenge. Being a convenient and straightforward model compression approach, Post-Training Quantization (PTQ) has been widely adopted in 2D vision tasks. However, applying it directly to 3D lidar-based tasks inevitably leads to perform…

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: Accepted in ICLR 2024

  20. arXiv:2401.11002 [pdf, other]

    cs.CV cs.AI

    Fast Registration of Photorealistic Avatars for VR Facial Animation

    Authors: Chaitanya Patel, Shaojie Bai, Te-Li Wang, Jason Saragih, Shih-En Wei

    Abstract: Virtual Reality (VR) bares promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a photorealistic avatar of one's likeness while wearing a VR headset. Although high quality registration of person-specific avatars to headset-mounted camera (HMC) images is possible in an offline setting, the performance of generic realtime mode…

    Submitted 19 January, 2024; originally announced January 2024.

    Comments: Project page: https://chaitanya100100.github.io/FastRegistration/

  21. arXiv:2401.02620 [pdf, other]

    cs.AI cs.GR

    Progress and Prospects in 3D Generative AI: A Technical Overview including 3D human

    Authors: Song Bai, Jie Li

    Abstract: While AI-generated text and 2D images continue to expand its territory, 3D generation has gradually emerged as a trend that cannot be ignored. Since the year 2023 an abundant amount of research papers has emerged in the domain of 3D generation. This growth encompasses not just the creation of 3D objects, but also the rapid development of 3D character and motion generation. Several key factors cont…

    Submitted 4 January, 2024; originally announced January 2024.

  22. arXiv:2401.01885 [pdf, other]

    cs.CV

    From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

    Authors: Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard

    Abstract: We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency…

    Submitted 3 January, 2024; originally announced January 2024.

  23. arXiv:2401.00616 [pdf, other]

    cs.CV

    GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields

    Authors: Xiao Pan, Zongxin Yang, Shuai Bai, Yi Yang

    Abstract: In this paper, we focus on the One-shot Novel View Synthesis (O-NVS) task which targets synthesizing photo-realistic novel views given only one reference image per scene. Previous One-shot Generalizable Neural Radiance Fields (OG-NeRF) methods solve this task in an inference-time finetuning-free manner, yet suffer the blurry issue due to the encoder-only architecture that highly relies on the limi…

    Submitted 29 March, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

    Comments: Submitted to Journal

  24. arXiv:2312.09589 [pdf, other]

    cs.CV

    Improving Cross-domain Few-shot Classification with Multilayer Perceptron

    Authors: Shuanghao Bai, Wanqi Zhou, Zhirong Luan, Donglin Wang, Badong Chen

    Abstract: Cross-domain few-shot classification (CDFSC) is a challenging and tough task due to the significant distribution discrepancies across different domains. To address this challenge, many approaches aim to learn transferable representations. Multilayer perceptron (MLP) has shown its capability to learn transferable representations in various downstream tasks, such as unsupervised image classification…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: 5 pages, 4 figures

  25. arXiv:2312.09553 [pdf, other]

    cs.CV

    Prompt-based Distribution Alignment for Unsupervised Domain Adaptation

    Authors: Shuanghao Bai, Min Zhang, Wanqi Zhou, Siteng Huang, Zhirong Luan, Donglin Wang, Badong Chen

    Abstract: Recently, despite the unprecedented success of large pre-trained visual-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that the unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target dom…

    Submitted 26 January, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: 13 pages, 6 figures

  26. arXiv:2312.09158 [pdf, other]

    cs.CV

    General Object Foundation Model for Images and Videos at Scale

    Authors: Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai

    Abstract: We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Project homepage: https://glee-vision.github.io

  27. arXiv:2312.04089 [pdf, other]

    cs.CV

    Open-Vocabulary Segmentation with Semantic-Assisted Calibration

    Authors: Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, Yansong Tang

    Abstract: This paper studies open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with generalized contextual prior of CLIP. As the core of open-vocabulary understanding, alignment of visual content with the semantics of unbounded text has become the bottleneck of this field. To address this challenge, recent works propose to utilize CLIP as an additional cl…

    Submitted 7 December, 2023; originally announced December 2023.

  28. arXiv:2312.02481 [pdf, other]

    cs.CV cs.AI

    Learning to Holistically Detect Bridges from Large-Size VHR Remote Sensing Imagery

    Authors: Yansheng Li, Junwei Luo, Yongjun Zhang, Yihua Tan, Jin-Gang Yu, Song Bai

    Abstract: Bridge detection in remote sensing images (RSIs) plays a crucial role in various applications, but it poses unique challenges compared to the detection of other objects. In RSIs, bridges exhibit considerable variations in terms of their spatial scales and aspect ratios. Therefore, to ensure the visibility and integrity of bridges, it is essential to perform holistic bridge detection in large-size…

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: 16 pages, 11 figures, 6 tables; due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract appearing here is slightly shorter than that in the PDF file

  29. arXiv:2310.06218 [pdf, other]

    cs.LG cs.AI

    SUBP: Soft Uniform Block Pruning for 1xN Sparse CNNs Multithreading Acceleration

    Authors: Jingyang Xiang, Siqi Li, Jun Chen, Shipeng Bai, Yukai Ma, Guang Dai, Yong Liu

    Abstract: The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1$\times$N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: 14 pages, 4 figures, Accepted by 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  30. arXiv:2309.16609 [pdf, other]

    cs.CL

    Qwen Technical Report

    Authors: Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, et al. (23 additional authors not shown)

    Abstract: Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Q…

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: 59 pages, 5 figures

  31. arXiv:2309.07698 [pdf, other]

    cs.CV

    Dataset Condensation via Generative Model

    Authors: David Junhao Zhang, Heng Wang, Chuhui Xue, Rui Yan, Wenqing Zhang, Song Bai, Mike Zheng Shou

    Abstract: Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into the pixels format. However, it suffers from slow optimization speed and large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation meth…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: old work, done in 2022

  32. Ethical Framework for Harnessing the Power of AI in Healthcare and Beyond

    Authors: Sidra Nasir, Rizwan Ahmed Khan, Samita Bai

    Abstract: In the past decade, the deployment of deep learning (Artificial Intelligence (AI)) methods has become pervasive across a spectrum of real-world applications, often in safety-critical contexts. This comprehensive research article rigorously investigates the ethical dimensions intricately linked to the rapid evolution of AI technologies, with a particular focus on the healthcare domain. Delving deep…

    Submitted 31 August, 2023; originally announced September 2023.

    Journal ref: IEEE Access 2024

  33. arXiv:2308.16890 [pdf, other]

    cs.CV cs.CL

    TouchStone: Evaluating Vision-Language Models by Language Models

    Authors: Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, Jingren Zhou

    Abstract: Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptor with large language models (LLMs). However, current assessments mainly focus on recognizing and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual s…

    Submitted 4 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: https://github.com/OFA-Sys/TouchStone

  34. arXiv:2308.12966 [pdf, other]

    cs.CV cs.CL

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Authors: Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou

    Abstract: In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyon…

    Submitted 12 October, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

    Comments: Code, demo and models are available at https://github.com/QwenLM/Qwen-VL

  35. arXiv:2308.07209 [pdf, other]

    cs.LG cs.CV eess.IV

    Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning

    Authors: Shipeng Bai, Jun Chen, Xintian Shen, Yixuan Qian, Yong Liu

    Abstract: Structured pruning and quantization are promising approaches for reducing the inference time and memory footprint of neural networks. However, most existing methods require the original training dataset to fine-tune the model. This not only brings heavy resource consumption but also is not possible for applications with sensitive or proprietary data due to privacy and security concerns. Therefore,…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: ICCV2023

  36. arXiv:2308.06739 [pdf, other]

    cs.CV

    Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

    Authors: David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou

    Abstract: Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and pose additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models, have shown great potential for benefiting image recognition. Although promising, there has been…

    Submitted 13 August, 2023; originally announced August 2023.

  37. arXiv:2308.04269 [pdf, other]

    cs.CV cs.AI

    Lossy and Lossless (L$^2$) Post-training Model Size Compression

    Authors: Yumeng Shi, Shihao Bai, Xiuying Wei, Ruihao Gong, Jianlei Yang

    Abstract: Deep neural networks have delivered remarkable performance and have been widely used in various visual tasks. However, their huge size causes significant inconvenience for transmission and storage. Many previous studies have explored model size compression. However, these studies often approach various lossy and lossless compression methods in isolation, leading to challenges in achieving high com…

    Submitted 8 August, 2023; originally announced August 2023.

  38. arXiv:2308.00353 [pdf, other]

    cs.CV

    Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

    Authors: Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi

    Abstract: Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Interne…

    Submitted 1 August, 2023; originally announced August 2023.

    Comments: submit to TPAMI

  39. arXiv:2307.05358 [pdf, other]

    cs.LG cs.AI

    Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators

    Authors: Sikai Bai, Shuaicheng Li, Weiming Zhuang, Jie Zhang, Song Guo, Kunlin Yang, Jun Hou, Shuai Zhang, Junyu Gao, Shuai Yi

    Abstract: Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) emerges to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled…

    Submitted 11 March, 2024; v1 submitted 11 July, 2023; originally announced July 2023.

    Journal ref: The 38th Annual AAAI Conference on Artificial Intelligence, 2024

  40. arXiv:2307.00498 [pdf, other]

    cs.LG cs.CV

    Data-Free Quantization via Mixed-Precision Compensation without Fine-Tuning

    Authors: Jun Chen, Shipeng Bai, Tianxin Huang, Mengmeng Wang, Guanzhong Tian, Yong Liu

    Abstract: Neural network quantization is a very promising solution in the field of model compression, but its resulting accuracy highly depends on a training/fine-tuning process and requires the original data. This not only brings heavy computation and time costs but also is not conducive to privacy and sensitive information protection. Therefore, a few recent works are starting to focus on data-free quanti…

    Submitted 2 July, 2023; originally announced July 2023.

    Comments: This paper has been accepted for publication in the Pattern Recognition

    Journal ref: Pattern Recognition 2023

  41. arXiv:2306.16718 [pdf, other]

    cs.CV

    Metric-aligned Sample Selection and Critical Feature Sampling for Oriented Object Detection

    Authors: Peng Sun, Yongbin Zheng, Wenqi Wu, Wanying Xu, Shengjian Bai

    Abstract: Arbitrary-oriented object detection is a relatively emerging but challenging task. Although remarkable progress has been made, there still remain many unsolved issues due to the large diversity of patterns in orientation, scale, aspect ratio, and visual appearance of objects in aerial images. Most of the existing methods adopt a coarse-grained fixed label assignment strategy and suffer from the in…

    Submitted 10 July, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

  42. arXiv:2306.14435  [pdf, other]

    cs.CV cs.LG

    DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

    Authors: Yujun Shi, Chuhui Xue, Jun Hao Liew, Jiachun Pan, Hanshu Yan, Wenqing Zhang, Vincent Y. F. Tan, Song Bai

    Abstract: Accurate and controllable image editing is a challenging task that has attracted significant attention recently. Notably, DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However, due to its reliance on generative adversarial networks (GANs), its generality is limited by the capacity of pretrained GAN models. In this…

    Submitted 7 April, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

    Comments: Code is released at https://github.com/Yujun-Shi/DragDiffusion

  43. arXiv:2306.00974  [pdf, other]

    cs.CV

    Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search

    Authors: Qihao Liu, Adam Kortylewski, Yutong Bai, Song Bai, Alan Yuille

    Abstract: Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly. Common failures include: (i) natural-looking text prompts generating images with the wrong content, or (ii) different random samples of the latent variables that generate vastly different, and even unrelated, outputs despite being conditioned on the same text prompt. In this work, we aim to study and understand the f…

    Submitted 29 November, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Project page: https://sage-diffusion.github.io/

  44. arXiv:2305.15643  [pdf, other]

    cs.LG math.OC stat.ML

    Federated Composite Saddle Point Optimization

    Authors: Site Bai, Brian Bullins

    Abstract: Federated learning (FL) approaches for saddle point problems (SPP) have recently gained in popularity due to the critical role they play in machine learning (ML). Existing works mostly target smooth unconstrained objectives in Euclidean space, whereas ML problems often involve constraints or non-smooth regularization, which results in a need for composite optimization. Addressing these issues, we…

    Submitted 24 May, 2023; originally announced May 2023.

  45. arXiv:2305.11676  [pdf, other]

    cs.CV

    Learning Global-aware Kernel for Image Harmonization

    Authors: Xintian Shen, Jiangning Zhang, Jun Chen, Shipeng Bai, Yue Han, Yabiao Wang, Chengjie Wang, Yong Liu

    Abstract: Image harmonization aims to solve the visual inconsistency problem in composited images by adaptively adjusting the foreground pixels with the background as references. Existing methods employ local color transformation or region matching between foreground and background, which neglects powerful proximity prior and independently distinguishes fore-/back-ground as a whole part for harmonization. A…

    Submitted 17 August, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: 10 pages, 10 figures

  46. arXiv:2305.11172  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS

    ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

    Authors: Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou

    Abstract: In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This desi…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: 30 pages, 9 figures, 18 tables

  47. arXiv:2305.01239  [pdf, other]

    cs.CV cs.AI

    DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning

    Authors: Xiaocheng Lu, Ziming Liu, Song Guo, Jingcai Guo, Fushuo Huo, Sikai Bai, Tao Han

    Abstract: Compositional Zero-shot Learning (CZSL) aims to recognize novel concepts composed of known knowledge without training samples. Standard CZSL either identifies visual primitives or enhances unseen composed entities, and as a result, entanglement between state and object primitives cannot be fully utilized. Admittedly, vision-language models (VLMs) could naturally cope with CZSL through tuning promp…

    Submitted 2 May, 2023; originally announced May 2023.

  48. arXiv:2303.09735  [pdf, other]

    cs.CV

    SRFormer: Permuted Self-Attention for Single Image Super-Resolution

    Authors: Yupeng Zhou, Zhen Li, Chun-Le Guo, Song Bai, Ming-Ming Cheng, Qibin Hou

    Abstract: Previous works have shown that increasing the window size for Transformer-based image super-resolution models (e.g., SwinIR) can significantly improve the model performance but the computation overhead is also considerable. In this paper, we present SRFormer, a simple but novel method that can enjoy the benefit of large window self-attention but introduces even less computational burden. The core…

    Submitted 16 March, 2023; originally announced March 2023.

  49. arXiv:2303.08242  [pdf, other]

    stat.ML cs.LG stat.AP

    Optimal Sampling Designs for Multi-dimensional Streaming Time Series with Application to Power Grid Sensor Data

    Authors: Rui Xie, Shuyang Bai, Ping Ma

    Abstract: The Internet of Things (IoT) system generates massive high-speed temporally correlated streaming data and is often connected with online inference tasks under computational or energy constraints. Online analysis of these streaming time series data often faces a trade-off between statistical efficiency and computational cost. One important approach to balance this trade-off is sampling, where only…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: Accepted by The Annals of Applied Statistics

  50. arXiv:2303.08132  [pdf, other]

    cs.CV

    InstMove: Instance Motion for Object-centric Video Segmentation

    Authors: Qihao Liu, Junfeng Wu, Yi Jiang, Xiang Bai, Alan Yuille, Song Bai

    Abstract: Despite significant efforts, cutting-edge video segmentation methods still remain sensitive to occlusion and rapid movement, due to their reliance on the appearance of objects in the form of object embeddings, which are vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but essentially it only considers pixel-level motion, which still relies o…

    Submitted 30 March, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

    Comments: Accepted to CVPR 2023; Code: https://github.com/wjf5203/VNext