Search | arXiv e-print repository

Dataset Condensation via Generative Model

Authors: David Junhao Zhang, Heng Wang, Chuhui Xue, Rui Yan, Wenqing Zhang, Song Bai, Mike Zheng Shou

Abstract: Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into the pixels format. However, it suffers from slow optimization speed and large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation methods from scaling up to large datasets with diverse classes. Moreover, the relations among condensed samples have been neglected and hence the feature distribution of condensed samples is often not diverse. To solve these problems, we propose to condense the dataset into another format, a generative model. Such a novel format allows for the condensation of large datasets because the size of the generative model remains relatively stable as the number of classes or image resolution increases. Furthermore, an intra-class and an inter-class loss are proposed to model the relation of condensed samples. Intra-class loss aims to create more diverse samples for each class by pushing each sample away from the others of the same class. Meanwhile, inter-class loss increases the discriminability of samples by widening the gap between the centers of different classes. Extensive comparisons with state-of-the-art methods and our ablation studies confirm the effectiveness of our method and its individual component. To our best knowledge, we are the first to successfully conduct condensation on ImageNet-1k.

Submitted 14 September, 2023; originally announced September 2023.

Comments: old work, done in 2022

arXiv:2309.00064 [pdf, other]

doi 10.1109/ACCESS.2024.3369912

Ethical Framework for Harnessing the Power of AI in Healthcare and Beyond

Authors: Sidra Nasir, Rizwan Ahmed Khan, Samita Bai

Abstract: In the past decade, the deployment of deep learning (Artificial Intelligence (AI)) methods has become pervasive across a spectrum of real-world applications, often in safety-critical contexts. This comprehensive research article rigorously investigates the ethical dimensions intricately linked to the rapid evolution of AI technologies, with a particular focus on the healthcare domain. Delving deeply, it explores a multitude of facets including transparency, adept data management, human oversight, educational imperatives, and international collaboration within the realm of AI advancement. Central to this article is the proposition of a conscientious AI framework, meticulously crafted to accentuate values of transparency, equity, answerability, and a human-centric orientation. The second contribution of the article is the in-depth and thorough discussion of the limitations inherent to AI systems. It astutely identifies potential biases and the intricate challenges of navigating multifaceted contexts. Lastly, the article unequivocally accentuates the pressing need for globally standardized AI ethics principles and frameworks. Simultaneously, it aptly illustrates the adaptability of the ethical framework proposed herein, positioned skillfully to surmount emergent challenges.

Submitted 31 August, 2023; originally announced September 2023.

Journal ref: IEEE Access 2024

arXiv:2308.16890 [pdf, other]

TouchStone: Evaluating Vision-Language Models by Language Models

Authors: Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, Jingren Zhou

Abstract: Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptor with large language models (LLMs). However, current assessments mainly focus on recognizing and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. Firstly, we construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Secondly, by integrating detailed image annotations we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs for directly evaluating the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLMs' evaluation and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.

Submitted 4 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

Comments: https://github.com/OFA-Sys/TouchStone

arXiv:2308.12966 [pdf, other]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Authors: Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou

Abstract: In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.

Submitted 12 October, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

Comments: Code, demo and models are available at https://github.com/QwenLM/Qwen-VL

arXiv:2308.09988 [pdf, ps, other]

doi 10.3934/era.2023292

On $p$-Laplacian Kirchhoff-Schrödinger-Poisson type systems with critical growth on the Heisenberg group

Authors: Shujie Bai, Yueqiang Song, Dušan D. Repovš

Abstract: In this article, we investigate the Kirchhoff-Schrödinger-Poisson type systems on the Heisenberg group of the following form: \begin{equation*} \left\{ \begin{array}{lll} -\left(a+b\int_{\Omega}|\nabla_{H} u|^{p}\,d\xi\right)\Delta_{H,p}u-\mu\phi|u|^{p-2}u=\lambda|u|^{q-2}u+|u|^{Q^{\ast}-2}u &\mbox{in}\ \Omega, \\ -\Delta_{H}\phi=|u|^{p} &\mbox{in}\ \Omega, \\ u=\phi=0 &\mbox{on}\ \partial\Omega, \end{array} \right. \end{equation*} where $a,b$ are positive real numbers, $\Omega\subset \mathbb{H}^N$ is a bounded region with smooth boundary, $1<p<Q$, $Q = 2N + 2$ is the homogeneous dimension of the Heisenberg group $\mathbb{H}^N$, $Q^{\ast}=\frac{pQ}{Q-p}$, $q\in(2p, Q^{\ast})$, and $\Delta_{H,p}u=\mbox{div}(|\nabla_{H} u|^{p-2}\nabla_{H} u)$ is the $p$-horizontal Laplacian. Under some appropriate conditions for the parameters $\mu$ and $\lambda$, we establish existence and multiplicity results for the system above. To some extent, we generalize the results of An and Liu (Israel J. Math., 2020) and Liu et al. (Adv. Nonlinear Anal., 2022).

Submitted 19 August, 2023; originally announced August 2023.

MSC Class: 35J20; 35R03; 46E35

Journal ref: Electron. Res. Arch. 31:9 (2023), 5749-5765

arXiv:2308.07209 [pdf, other]

Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning

Authors: Shipeng Bai, Jun Chen, Xintian Shen, Yixuan Qian, Yong Liu

Abstract: Structured pruning and quantization are promising approaches for reducing the inference time and memory footprint of neural networks. However, most existing methods require the original training dataset to fine-tune the model. This not only brings heavy resource consumption but also is not possible for applications with sensitive or proprietary data due to privacy and security concerns. Therefore, a few data-free methods are proposed to address this problem, but they perform data-free pruning and quantization separately, which does not explore the complementarity of pruning and quantization. In this paper, we propose a novel framework named Unified Data-Free Compression (UDFC), which performs pruning and quantization simultaneously without any data and fine-tuning process. Specifically, UDFC starts with the assumption that the partial information of a damaged (e.g., pruned or quantized) channel can be preserved by a linear combination of other channels, and then derives the reconstruction form from the assumption to restore the information loss due to compression. Finally, we formulate the reconstruction error between the original network and its compressed network, and theoretically deduce the closed-form solution. We evaluate the UDFC on the large-scale image classification task and obtain significant improvements over various network architectures and compression methods. For example, we achieve a 20.54% accuracy improvement on ImageNet dataset compared to SOTA method with 30% pruning ratio and 6-bit quantization on ResNet-34.

Submitted 14 August, 2023; originally announced August 2023.

Comments: ICCV2023

arXiv:2308.06739 [pdf, other]

Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

Authors: David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou

Abstract: Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and pose additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models, have shown great potential for benefiting image recognition. Although promising, there has been inadequate exploration dedicated to unsupervised learning on diffusion-generated images. To address this, we start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques (i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions by fully exploiting the aforementioned free attention masks. Our approach is validated through extensive experiments that show consistent improvements in baseline models across various downstream tasks, including image classification, detection, segmentation, and image-text retrieval. By utilizing our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.

Submitted 13 August, 2023; originally announced August 2023.

arXiv:2308.04269 [pdf, other]

Lossy and Lossless (L$^2$) Post-training Model Size Compression

Authors: Yumeng Shi, Shihao Bai, Xiuying Wei, Ruihao Gong, Jianlei Yang

Abstract: Deep neural networks have delivered remarkable performance and have been widely used in various visual tasks. However, their huge size causes significant inconvenience for transmission and storage. Many previous studies have explored model size compression. However, these studies often approach various lossy and lossless compression methods in isolation, leading to challenges in achieving high compression ratios efficiently. This work proposes a post-training model size compression method that combines lossy and lossless compression in a unified way. We first propose a unified parametric weight transformation, which ensures different lossy compression methods can be performed jointly in a post-training manner. Then, a dedicated differentiable counter is introduced to guide the optimization of lossy compression to arrive at a more suitable point for later lossless compression. Additionally, our method can easily control a desired global compression ratio and allocate adaptive ratios for different layers. Finally, our method can achieve a stable $10\times$ compression ratio without sacrificing accuracy and a $20\times$ compression ratio with minor accuracy loss in a short time. Our code is available at https://github.com/ModelTC/L2_Compression .

Submitted 8 August, 2023; originally announced August 2023.

arXiv:2308.00353 [pdf, other]

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Authors: Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi

Abstract: Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and thus the ability to accurately locate novel objects.
We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g. 34.5%$\sim$65.3%), instance segmentation (e.g. 21.8%$\sim$54.0%) and panoptic segmentation (e.g. 14.7%$\sim$43.3%). Code will be available.

Submitted 1 August, 2023; originally announced August 2023.

Comments: submit to TPAMI

arXiv:2307.07701 [pdf, other]

doi 10.1002/qute.202300218

Quantum metrology in the noisy intermediate-scale quantum era

Authors: Lin Jiao, Wei Wu, Si-Yuan Bai, Jun-Hong An

Abstract: Quantum metrology pursues the physical realization of higher-precision measurements to physical quantities than the classically achievable limit by exploiting quantum features, such as entanglement and squeezing, as resources. It has potential applications in developing next-generation frequency standards, magnetometers, radar, and navigation. However, the ubiquitous decoherence in the quantum world degrades the quantum resources and forces the precision back to or even worse than the classical limit, which is called the no-go theorem of noisy quantum metrology and greatly hinders its applications. Therefore, how to realize the promised performance of quantum metrology in realistic noisy situations attracts much attention in recent years. We will review the principle, categories, and applications of quantum metrology. Special attention will be paid to different quantum resources that can bring quantum superiority in enhancing sensitivity. Then, we will introduce the no-go theorem of noisy quantum metrology and its active control under different kinds of noise-induced decoherence situations.

Submitted 28 November, 2023; v1 submitted 15 July, 2023; originally announced July 2023.

Comments: Minireview of quantum metrology based on Lectures given at the summer school "Fundamental and Frontiers of Quantum Metrology and Quantum Computation" held in Bohai University, China, from 23 July to 8 August

Journal ref: Adv Quantum Technol. 2023, 2300218

arXiv:2307.05358 [pdf, other]

Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators

Authors: Sikai Bai, Shuaicheng Li, Weiming Zhuang, Jie Zhang, Song Guo, Kunlin Yang, Jun Hou, Shuai Zhang, Junyu Gao, Shuai Yi

Abstract: Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) emerges to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. This work studies a more practical and challenging scenario of FSSL, where data distribution is different not only across clients but also within a client between labeled and unlabeled data. To address this challenge, we propose a novel FSSL framework with dual regulators, FedDure. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with two regulators. Theoretically, we show the convergence guarantee of the dual regulators. Empirically, we demonstrate that FedDure is superior to the existing methods across a wide range of settings, notably by more than 11 on CIFAR-10 and CINIC-10 datasets.

Submitted 11 March, 2024; v1 submitted 11 July, 2023; originally announced July 2023.

Journal ref: The 38th Annual AAAI Conference on Artificial Intelligence, 2024

arXiv:2307.00498 [pdf, other]

Data-Free Quantization via Mixed-Precision Compensation without Fine-Tuning

Authors: Jun Chen, Shipeng Bai, Tianxin Huang, Mengmeng Wang, Guanzhong Tian, Yong Liu

Abstract: Neural network quantization is a very promising solution in the field of model compression, but its resulting accuracy highly depends on a training/fine-tuning process and requires the original data. This not only brings heavy computation and time costs but also is not conducive to privacy and sensitive information protection. Therefore, a few recent works are starting to focus on data-free quantization. However, data-free quantization does not perform well while dealing with ultra-low precision quantization. Although researchers utilize generative methods of synthetic data to address this problem partially, data synthesis needs to take a lot of computation and time. In this paper, we propose a data-free mixed-precision compensation (DF-MPC) method to recover the performance of an ultra-low precision quantized model without any data and fine-tuning process. By assuming the quantized error caused by a low-precision quantized layer can be restored via the reconstruction of a high-precision quantized layer, we mathematically formulate the reconstruction loss between the pre-trained full-precision model and its layer-wise mixed-precision quantized model. Based on our formulation, we theoretically deduce the closed-form solution by minimizing the reconstruction loss of the feature maps. Since DF-MPC does not require any original/synthetic data, it is a more efficient method to approximate the full-precision model.
Experimentally, our DF-MPC is able to achieve higher accuracy for an ultra-low precision quantized model compared to the recent methods without any data and fine-tuning process.

Submitted 2 July, 2023; originally announced July 2023.

Comments: This paper has been accepted for publication in the Pattern Recognition

Journal ref: Pattern Recognition 2023

arXiv:2306.16718 [pdf, other]

Metric-aligned Sample Selection and Critical Feature Sampling for Oriented Object Detection

Authors: Peng Sun, Yongbin Zheng, Wenqi Wu, Wanying Xu, Shengjian Bai

Abstract: Arbitrary-oriented object detection is a relatively emerging but challenging task. Although remarkable progress has been made, there still remain many unsolved issues due to the large diversity of patterns in orientation, scale, aspect ratio, and visual appearance of objects in aerial images. Most of the existing methods adopt a coarse-grained fixed label assignment strategy and suffer from the inconsistency between the classification score and localization accuracy. First, to align the metric inconsistency between sample selection and regression loss calculation caused by fixed IoU strategy, we introduce affine transformation to evaluate the quality of samples and propose a distance-based label assignment strategy. The proposed metric-aligned selection (MAS) strategy can dynamically select samples according to the shape and rotation characteristic of objects. Second, to further address the inconsistency between classification and localization, we propose a critical feature sampling (CFS) module, which performs localization refinement on the sampling location for classification task to extract critical features accurately. Third, we present a scale-controlled smooth $L_1$ loss (SC-Loss) to adaptively select high quality samples by changing the form of regression loss function based on the statistics of proposals during training. Extensive experiments are conducted on four challenging rotated object detection datasets DOTA, FAIR1M-1.0, HRSC2016, and UCAS-AOD. The results show the state-of-the-art accuracy of the proposed detector.

Submitted 10 July, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

arXiv:2306.14435 [pdf, other]

DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

Authors: Yujun Shi, Chuhui Xue, Jun Hao Liew, Jiachun Pan, Hanshu Yan, Wenqing Zhang, Vincent Y. F. Tan, Song Bai

Abstract: Accurate and controllable image editing is a challenging task that has attracted significant attention recently. Notably, DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However, due to its reliance on generative adversarial networks (GANs), its generality is limited by the capacity of pretrained GAN models. In this work, we extend this editing framework to diffusion models and propose a novel approach DragDiffusion. By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images. Our approach involves optimizing the diffusion latents to achieve precise spatial control. The supervision signal of this optimization process is from the diffusion model's UNet features, which are known to contain rich semantic and geometric information. Moreover, we introduce two additional techniques, namely LoRA fine-tuning and latent-MasaCtrl, to further preserve the identity of the original image. Lastly, we present a challenging benchmark dataset called DragBench -- the first benchmark to evaluate the performance of interactive point-based image editing methods. Experiments across a wide range of challenging cases (e.g., images with multiple objects, diverse object categories, various styles, etc.) demonstrate the versatility and generality of DragDiffusion. Code: https://github.com/Yujun-Shi/DragDiffusion.

Submitted 7 April, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

Comments: Code is released at https://github.com/Yujun-Shi/DragDiffusion

arXiv:2306.00974 [pdf, other]

Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search

Authors: Qihao Liu, Adam Kortylewski, Yutong Bai, Song Bai, Alan Yuille

Abstract: Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly. Common failures include: (i) natural-looking text prompts generating images with the wrong content, or (ii) different random samples of the latent variables that generate vastly different, and even unrelated, outputs despite being conditioned on the same text prompt. In this work, we aim to study and understand the failure modes of TDMs in more detail. To achieve this, we propose SAGE, the first adversarial search method on TDMs that systematically explores the discrete prompt space and the high-dimensional latent space, to automatically discover undesirable behaviors and failure cases in image generation. We use image classifiers as surrogate loss functions during searching, and employ human inspections to validate the identified failures. For the first time, our method enables efficient exploration of both the discrete and intricate human language space and the challenging latent space, overcoming the gradient vanishing problem. Then, we demonstrate the effectiveness of SAGE on five widely used generative models and reveal four typical failure modes: (1) We find a variety of natural text prompts that generate images failing to capture the semantics of input texts. We further discuss the underlying causes and potential solutions based on the results. (2) We find regions in the latent space that lead to distorted images independent of the text prompt, suggesting that parts of the latent space are not well-structured.
(3) We also find latent samples that result in natural-looking images unrelated to the text prompt, implying a possible misalignment between the latent and prompt spaces. (4) By appending a single adversarial token embedding to any input prompts, we can generate a variety of specified target objects. Project page: https://sage-diffusion.github.io/

Submitted 29 November, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: Project page: https://sage-diffusion.github.io/

arXiv:2305.15643 [pdf, other]

Federated Composite Saddle Point Optimization

Authors: Site Bai, Brian Bullins

Abstract: Federated learning (FL) approaches for saddle point problems (SPP) have recently gained in popularity due to the critical role they play in machine learning (ML). Existing works mostly target smooth unconstrained objectives in Euclidean space, whereas ML problems often involve constraints or non-smooth regularization, which results in a need for composite optimization. Addressing these issues, we propose Federated Dual Extrapolation (FeDualEx), an extra-step primal-dual algorithm, which is the first of its kind that encompasses both saddle point optimization and composite objectives under the FL paradigm. Both the convergence analysis and the empirical evaluation demonstrate the effectiveness of FeDualEx in these challenging settings. In addition, even for the sequential version of FeDualEx, we provide rates for the stochastic composite saddle point setting which, to our knowledge, are not found in prior literature.

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2305.11676 [pdf, other]

Learning Global-aware Kernel for Image Harmonization

Authors: Xintian Shen, Jiangning Zhang, Jun Chen, Shipeng Bai, Yue Han, Yabiao Wang, Chengjie Wang, Yong Liu

Abstract: Image harmonization aims to solve the visual inconsistency problem in composited images by adaptively adjusting the foreground pixels with the background as references. Existing methods employ local color transformation or region matching between foreground and background, which neglects powerful proximity prior and independently distinguishes fore-/back-ground as a whole part for harmonization. As a result, they still show a limited performance across varied foreground objects and scenes. To address this issue, we propose a novel Global-aware Kernel Network (GKNet) to harmonize local regions with comprehensive consideration of long-distance background references. Specifically, GKNet includes two parts, i.e., harmony kernel prediction and harmony kernel modulation branches. The former includes a Long-distance Reference Extractor (LRE) to obtain long-distance context and Kernel Prediction Blocks (KPB) to predict multi-level harmony kernels by fusing global information with local features. To achieve this goal, a novel Selective Correlation Fusion (SCF) module is proposed to better select relevant long-distance background references for local harmonization. The latter employs the predicted kernels to harmonize foreground regions with both local and global awareness. Abundant experiments demonstrate the superiority of our method for image harmonization over state-of-the-art methods, e.g., achieving 39.53 dB PSNR that surpasses the best counterpart by +0.78 dB; decreasing fMSE/MSE by 11.5%/6.7% compared with the SoTA method. Code will be available at https://github.com/XintianShen/GKNet.

Submitted 17 August, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

Comments: 10 pages, 10 figures

arXiv:2305.11172 [pdf, other]

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Authors: Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou

Abstract: In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.

Submitted 18 May, 2023; originally announced May 2023.

Comments: 30 pages, 9 figures, 18 tables

arXiv:2305.10545 [pdf]

Recycling Silicon Scrap for Spherical Si-C composite as High-Performance Lithium-ion Battery Anodes

Authors: Bhagath Sreenarayanan, Marta Vicencio, Shuang Bai, Bingyu Lu, Ou Mao, Shiva Adireddy, Wurigumula Bao, Ying Shirley Meng

Abstract: The growth of the semiconductor and solar industry has been exponential in the last two decades due to the computing and energy demands of the world. Silicon (Si) is one of the main constituents for both sectors and, thus, is used in large quantities. As a result, a lot of Si waste is generated mainly by these two industries. For a sustainable world, the circular economy is the key; thus, the waste produced must be upcycled/recycled/reused to complete the circular chain. Herein, we show that an upcycled/recycled Si can be used with carbon as a composite anode material, with high Si content (~40 wt.%) and loading of 3-4 mAh/cm^2 for practical use in lithium-ion batteries. The unique spherical jackfruit-like structure of the Si-C composite can minimize the total lithium inventory loss compared to the conventional Si-C composite and pure Si, resulting in superior electrochemical performance. The superior electrochemical performance of Si-C composites enables the cell energy density of ~325 Wh/kg (with NMC cathode) and ~260 Wh/kg (with LFP cathode), respectively. The results demonstrate that Si-based industrial waste can be upcycled for high-performance Li-ion battery anodes through a controllable, scalable, and energy-efficient route.

Submitted 17 May, 2023; originally announced May 2023.

arXiv:2305.01239 [pdf, other]

DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning

Authors: Xiaocheng Lu, Ziming Liu, Song Guo, Jingcai Guo, Fushuo Huo, Sikai Bai, Tao Han

Abstract: Compositional Zero-shot Learning (CZSL) aims to recognize novel concepts composed of known knowledge without training samples. Standard CZSL either identifies visual primitives or enhances unseen composed entities, and as a result, entanglement between state and object primitives cannot be fully utilized. Admittedly, vision-language models (VLMs) could naturally cope with CZSL through tuning prompts, while uneven entanglement leads prompts to be dragged into local optimum. In this paper, we take a further step to introduce a novel Disentangled and Recurrent Prompt Tuning framework termed DRPT to better tap the potential of VLMs in CZSL. Specifically, the state and object primitives are deemed as learnable tokens of vocabulary embedded in prompts and tuned on seen compositions. Instead of jointly tuning state and object, we devise a disentangled and recurrent tuning strategy to suppress the traction force caused by entanglement and gradually optimize the token parameters, leading to a better prompt space. Notably, we develop a progressive fine-tuning procedure that allows for incremental updates to the prompts, optimizing the object first, then the state, and vice versa. Meanwhile, the optimization of state and object is independent, thus clearer features can be learned to further alleviate the issue of entangling misleading optimization. Moreover, we quantify and analyze the entanglement in CZSL and supplement entanglement rebalancing optimization schemes. DRPT surpasses representative state-of-the-art methods on extensive benchmark datasets, demonstrating superiority in both accuracy and efficiency.

Submitted 2 May, 2023; originally announced May 2023.

arXiv:2305.01156 [pdf, other]

doi 10.1103/PhysRevA.109.033518

Non-Markovian quantum interconnect formed by a surface plasmon polariton waveguide

Authors: Chun-Jie Yang, Xin-Yue Liu, Shi-Qiang Xia, Si-Yuan Bai, Jun-Hong An

Abstract: Allowing the generation of effective interactions between distant quantum emitters (QEs) via flying photons, quantum interconnect (QI) is essentially a light-matter interface and acts as a building block in quantum technologies. A surface plasmon polariton (SPP) supported by a metallic waveguide provides an ideal interface to explore strong light-matter couplings and to realize QI. However, the loss of SPP in metal makes the mediated entanglement of the QEs damp with the increase of the distance and time, which hinders its applications. We propose a scheme of non-Markovian QI formed by the SPP of a metallic nanowire. A mechanism to make the generated entanglement of the QEs persistent is discovered. We find that, as long as bound states are formed in the energy spectrum of total QE-SPP system, the damping of the SPP-mediated entanglement is overcome even in the presence of the metal absorption to the SPP. Our finding enriches our understanding of light-matter couplings in absorptive medium and paves the way for using the SPP in designing QI.

Submitted 20 March, 2024; v1 submitted 1 May, 2023; originally announced May 2023.

Journal ref: Phys. Rev. A 109, 033518 (2024)

arXiv:2304.06817 [pdf]

Elucidating the Role of Prelithiation in Si-based Anodes for Interface Stabilization

Authors: Shuang Bai, Wurigumula Bao, Kun Qian, Bing Han, Weikang Li, Baharak Sayahpour, Bhagath Sreenarayanan, Darren H. S. Tan, So-yeon Ham, Ying Shirley Meng

Abstract: Prelithiation as a facile and effective method to compensate the lithium inventory loss in the initial cycle has progressed considerably both on anode and cathode sides. However, much less research has been devoted to the prelithiation effect on the interface stabilization for long-term cycling of Si-based anodes. An in-depth quantitative analysis of the interface that forms during the prelithiation of SiO$_x$ is presented here and the results are compared with prelithiation of Si anodes. Local structure probe combined with detailed electrochemical analysis reveals that a characteristic mosaic interface is formed on both prelithiated SiO$_x$ and Si anodes. This mosaic interface, containing multiple lithium silicate phases, is fundamentally different from the solid electrolyte interface (SEI) formed without prelithiation. The ideal conductivity and mechanical properties of lithium silicates enable improved cycling stability of both prelithiated anodes. With a higher ratio of lithium silicates due to the oxygen participation, prelithiated SiO$_{1.3}$ anode improves the initial coulombic efficiency to 94% in full cell and delivers good cycling retention after hundreds of cycles under lean electrolyte conditions. The insights provided in this work could be used to further optimize high Si loading based anode in future high energy density batteries.

Submitted 13 April, 2023; originally announced April 2023.

arXiv:2304.03319 [pdf, ps, other]

Joint sum-max limit for a class of long-range dependent processes with heavy tails

Authors: Shuyang Bai, He Tang

Abstract: We consider a class of stationary processes exhibiting both long-range dependence and heavy tails. Separate limit theorems for sums and for extremes have been established recently in literature with novel objects appearing in the limits. In this article, we establish the joint sum-max limit theorems for this class of processes. In the finite-variance case, the limit consists of two independent components: a fractional Brownian motion arising from the sum, and a long-range dependent random sup measure arising from the maximum. In the infinite-variance case, we obtain in the limit two dependent components: a stable process and a random sup measure whose dependence structure is described through the local time and range of a stable subordinator. For establishing the limit theorem in the latter case, we also develop a joint convergence result for the local time and range of subordinators, which may be of independent interest.

Submitted 11 September, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

Comments: 26 pages

MSC Class: 60F17; 60G10

arXiv:2303.08242 [pdf, other]

Optimal Sampling Designs for Multi-dimensional Streaming Time Series with Application to Power Grid Sensor Data

Authors: Rui Xie, Shuyang Bai, Ping Ma

Abstract: The Internet of Things (IoT) system generates massive high-speed temporally correlated streaming data and is often connected with online inference tasks under computational or energy constraints. Online analysis of these streaming time series data often faces a trade-off between statistical efficiency and computational cost. One important approach to balance this trade-off is sampling, where only a small portion of the sample is selected for the model fitting and update. Motivated by the demands of dynamic relationship analysis of IoT system, we study the data-dependent sample selection and online inference problem for a multi-dimensional streaming time series, aiming to provide low-cost real-time analysis of high-speed power grid electricity consumption data. Inspired by D-optimality criterion in design of experiments, we propose a class of online data reduction methods that achieve an optimal sampling criterion and improve the computational efficiency of the online analysis. We show that the optimal solution amounts to a strategy that is a mixture of Bernoulli sampling and leverage score sampling. The leverage score sampling involves auxiliary estimations that have a computational advantage over recursive least squares updates. Theoretical properties of the auxiliary estimations involved are also discussed. When applied to European power grid consumption data, the proposed leverage score based sampling methods outperform the benchmark sampling method in online estimation and prediction. The general applicability of the sampling-assisted online estimation method is assessed via simulation studies.

Submitted 14 March, 2023; originally announced March 2023.

Comments: Accepted by The Annals of Applied Statistics

arXiv:2303.08132 [pdf, other]

InstMove: Instance Motion for Object-centric Video Segmentation

Authors: Qihao Liu, Junfeng Wu, Yi Jiang, Xiang Bai, Alan Yuille, Song Bai

Abstract: Despite significant efforts, cutting-edge video segmentation methods still remain sensitive to occlusion and rapid movement, due to their reliance on the appearance of objects in the form of object embeddings, which are vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but essentially it only considers pixel-level motion, which still relies on appearance similarity and hence is often inaccurate under occlusion and fast movement. In this work, we study the instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings, and features physical interpretations, making it more accurate and robust toward occlusion and fast-moving objects. To better fit in with the video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns the dynamic model through a memory network to predict its position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boost their performance. Specifically, we improve the previous arts by 1.5 AP on OVIS dataset, which features heavy occlusions, and 4.9 AP on YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serves as a powerful solution in complex scenarios for object-centric video segmentation.

Submitted 30 March, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

Comments: Accepted to CVPR 2023; Code: https://github.com/wjf5203/VNext

arXiv:2303.06340 [pdf, other]

Intelligent diagnostic scheme for lung cancer screening with Raman spectra data by tensor network machine learning

Authors: Yu-Jia An, Sheng-Chen Bai, Lin Cheng, Xiao-Guang Li, Cheng-en Wang, Xiao-Dong Han, Gang Su, Shi-Ju Ran, Cong Wang

Abstract: Artificial intelligence (AI) has brought tremendous impacts on biomedical sciences from academic researches to clinical applications, such as in biomarkers' detection and diagnosis, optimization of treatment, and identification of new therapeutic targets in drug discovery. However, the contemporary AI technologies, particularly deep machine learning (ML), severely suffer from non-interpretability, which might uncontrollably lead to incorrect predictions. Interpretability is particularly crucial to ML for clinical diagnosis as the consumers must gain necessary sense of security and trust from firm grounds or convincing interpretations. In this work, we propose a tensor-network (TN)-ML method to reliably predict lung cancer patients and their stages via screening Raman spectra data of volatile organic compounds (VOCs) in exhaled breath, which are generally suitable as biomarkers and are considered to be an ideal way for non-invasive lung cancer screening. The prediction of TN-ML is based on the mutual distances of the breath samples mapped to the quantum Hilbert space. Thanks to the quantum probabilistic interpretation, the certainty of the predictions can be quantitatively characterized. The accuracy of the samples with high certainty is almost 100%. The incorrectly-classified samples exhibit obviously lower certainty, and thus can be decipherably identified as anomalies, which will be handled by human experts to guarantee high reliability. Our work sheds light on shifting the "AI for biomedical sciences" from the conventional non-interpretable ML schemes to the interpretable human-ML interactive approaches, for the purpose of high accuracy and reliability.

Submitted 11 March, 2023; originally announced March 2023.

Comments: 10 pages, 7 figures

arXiv:2303.04366 [pdf, other]

Semantically Consistent Multi-view Representation Learning

Authors: Yiyang Zhou, Qinghai Zheng, Shunshun Bai, Jihua Zhu

Abstract: In this work, we devote ourselves to the challenging task of Unsupervised Multi-view Representation Learning (UMRL), which requires learning a unified feature representation from multiple views in an unsupervised manner. Existing UMRL methods mainly concentrate on the learning process in the feature space while ignoring the valuable semantic information hidden in different views. To address this issue, we propose a novel Semantically Consistent Multi-view Representation Learning (SCMRL), which makes efforts to excavate underlying multi-view semantic consensus information and utilize the information to guide the unified feature representation learning. Specifically, SCMRL consists of a within-view reconstruction module and a unified feature representation learning module, which are elegantly integrated by the contrastive learning strategy to simultaneously align semantic labels of both view-specific feature representations and the learned unified feature representation. In this way, the consensus information in the semantic space can be effectively exploited to constrain the learning process of unified feature representation. Compared with several state-of-the-art algorithms, extensive experiments demonstrate its superiority.

Submitted 7 March, 2023; originally announced March 2023.

Comments: 19 pages, 4 figures

arXiv:2303.00392 [pdf, other]

doi 10.1103/PhysRevLett.131.050801

Floquet Engineering to Overcome No-Go Theorem of Noisy Quantum Metrology

Authors: Si-Yuan Bai, Jun-Hong An

Abstract: Permitting a more precise measurement to physical quantities than the classical limit by using quantum resources, quantum metrology holds a promise in developing many revolutionary technologies. However, the noise-induced decoherence forces its superiority to disappear, which is called no-go theorem of noisy quantum metrology and constrains its application. We propose a scheme to overcome the no-go theorem by Floquet engineering. It is found that, by applying a periodic driving on the atoms of the Ramsey spectroscopy, the ultimate sensitivity to measure their frequency characterized by quantum Fisher information returns to the ideal $t^2$ scaling with the encoding time whenever a Floquet bound state is formed by the system consisting of each driven atom and its local noise. Combining with the optimal control, this mechanism also allows us to retrieve the ideal Heisenberg-limit scaling with the atom number $N$. Our result gives an efficient way to avoid the no-go theorem of noisy quantum metrology and to realize high-precision measurements.

Submitted 1 August, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

Journal ref: Phys. Rev. Lett. 131, 050801 (2023)

arXiv:2302.01872 [pdf, other]

MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

Authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H. S. Torr, Song Bai

Abstract: Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study the tracking and segmenting objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of MOSE dataset is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames. To analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4 different settings on the proposed MOSE dataset and conduct comprehensive comparisons. The experiments show that current VOS algorithms cannot well perceive objects in complex scenes. For example, under the semi-supervised VOS setting, the highest J&F by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ~90% J&F performance on DAVIS. The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges under complex scenes and more efforts are desired to explore these challenges in the future. The proposed MOSE dataset has been released at https://henghuiding.github.io/MOSE.

Submitted 3 February, 2023; originally announced February 2023.

Comments: MOSE Dataset Report

Journal ref: ICCV 2023

arXiv:2212.06384 [pdf, other]

PV3D: A 3D Generative Model for Portrait Video Generation

Authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Wenqing Zhang, Song Bai, Jiashi Feng, Mike Zheng Shou

Abstract: Recent advances in generative adversarial networks (GANs) have demonstrated the capabilities of generating stunning photo-realistic portrait images. While some prior works have applied such image GANs to unconditional 2D portrait video generation and static 3D portrait synthesis, there are few works successfully extending GANs for generating 3D-aware portrait videos. In this work, we propose PV3D, the first generative framework that can synthesize multi-view consistent portrait videos. Specifically, our method extends the recent static 3D-aware image GAN to the video domain by generalizing the 3D implicit neural representation to model the spatio-temporal space. To introduce motion dynamics to the generation process, we develop a motion generator by stacking multiple motion layers to generate motion features via modulated convolution. To alleviate motion ambiguities caused by camera/human motions, we propose a simple yet effective camera condition strategy for PV3D, enabling both temporal and multi-view consistent video generation. Moreover, PV3D introduces two discriminators for regularizing the spatial and temporal domains to ensure the plausibility of the generated portrait videos. These elaborated designs enable PV3D to generate 3D-aware motion-plausible portrait videos with high-quality appearance and geometry, significantly outperforming prior works. As a result, PV3D is able to support many downstream applications such as animating static portraits and view-consistent video motion editing. Code and models are released at https://showlab.github.io/pv3d.

Submitted 20 June, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

Comments: Accepted to ICLR2023, Project Page https://showlab.github.io/pv3d

arXiv:2212.04408 [pdf, other]

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

Authors: Jinze Bai, Rui Men, Hao Yang, Xuancheng Ren, Kai Dang, Yichang Zhang, Xiaohuan Zhou, Peng Wang, Sinan Tan, An Yang, Zeyu Cui, Yu Han, Shuai Bai, Wenbin Ge, Jianxin Ma, Junyang Lin, Jingren Zhou, Chang Zhou

Abstract: Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Being, hopefully, an alternative to approaching general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly-diverse example tasks in OFASys, with which we also develop a first-in-kind, single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% performance in average with only 16% parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys

Submitted 8 December, 2022; originally announced December 2022.

arXiv:2212.02837 [pdf, other]

Pretrained Diffusion Models for Unified Human Motion Synthesis

Authors: Jianxin Ma, Shuai Bai, Chang Zhou

Abstract: Generative modeling of human motion has broad applications in computer animation, virtual reality, and robotics. Conventional approaches develop separate models for different motion synthesis tasks, and typically use a model of a small size to avoid overfitting the scarce data available in each setting. It remains an open question whether developing a single unified model is feasible, which may 1) benefit the acquirement of novel skills by combining skills learned from multiple tasks, and 2) help in increasing the model capacity without overfitting by combining multiple data sources. Unification is challenging because 1) it involves diverse control signals as well as targets of varying granularity, and 2) motion datasets may use different skeletons and default poses. In this paper, we present MoFusion, a framework for unified motion synthesis. MoFusion employs a Transformer backbone to ease the inclusion of diverse control signals via cross attention, and pretrains the backbone as a diffusion model to support multi-granularity synthesis ranging from motion completion of a body part to whole-body motion generation. It uses a learnable adapter to accommodate the differences between the default skeletons used by the pretraining and the fine-tuning data. Empirical results show that pretraining is vital for scaling the model size without overfitting, and demonstrate MoFusion's potential in various tasks, e.g., text-to-motion, motion completion, and zero-shot mixing of multiple control signals. Project page: https://ofa-sys.github.io/MoFusion/.

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2211.16312 [pdf, other]

PLA: Language-Driven Open-Vocabulary 3D Scene Understanding

Authors: Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi

Abstract: Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D, which allows explicitly associating 3D and semantic-rich captions. Further, to foster coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8% $\sim$ 44.7% hIoU and 14.5% $\sim$ 50.4% hAP$_{50}$ in open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. See the project website at https://dingry.github.io/projects/PLA.

Submitted 22 March, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: CVPR2023

arXiv:2211.15846 [pdf, other]

LUMix: Improving Mixup by Better Modelling Label Uncertainty

Authors: Shuyang Sun, Jie-Neng Chen, Ruifei He, Alan Yuille, Philip Torr, Song Bai

Abstract: Modern deep networks can be better generalized when trained with noisy samples and regularization techniques. Mixup and CutMix have been proven to be effective for data augmentation to help avoid overfitting. Previous Mixup-based methods linearly combine images and labels to generate additional training data. However, this is problematic if the object does not occupy the whole image as we demonstrate in Figure 1. Correctly assigning the label weights is hard even for human beings and there is no clear criterion to measure it. To tackle this problem, in this paper, we propose LUMix, which models such uncertainty by adding label perturbation during training. LUMix is simple as it can be implemented in just a few lines of code and can be universally applied to any deep networks, e.g., CNNs and Vision Transformers, with minimal computational cost. Extensive experiments show that our LUMix can consistently boost the performance for networks with a wide range of diversity and capacity on ImageNet, e.g., $+0.7\%$ for a small model DeiT-S and $+0.6\%$ for a large variant XCiT-L. We also demonstrate that LUMix can lead to better robustness when evaluated on ImageNet-O and ImageNet-A. The source code can be found at https://github.com/kevin-ssy/LUMix.

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.09973 [pdf, other]

The Runner-up Solution for YouTube-VIS Long Video Challenge 2022

Authors: Junfeng Wu, Yi Jiang, Qihao Liu, Xiang Bai, Song Bai

Abstract: This technical report describes our 2nd-place solution for the ECCV 2022 YouTube-VIS Long Video Challenge. We adopt the previously proposed online video instance segmentation method IDOL for this challenge. In addition, we use pseudo labels to further help contrastive learning, so as to obtain more temporally consistent instance embedding to improve tracking performance between frames. The proposed method obtains 40.2 AP on the YouTube-VIS 2022 long video dataset and was ranked second place in this challenge. We hope our simple and effective method could benefit further research.

Submitted 17 November, 2022; originally announced November 2022.

Comments: The Runner-up Solution for YouTube-VIS Long Video Challenge 2022, ECCV 2022 Workshop. arXiv admin note: text overlap with arXiv:2207.10661

arXiv:2211.09961 [pdf, other]

Path Independent Equilibrium Models Can Better Exploit Test-Time Computation

Authors: Cem Anil, Ashwini Pokle, Kaiqu Liang, Johannes Treutlein, Yuhuai Wu, Shaojie Bai, Zico Kolter, Roger Grosse

Abstract: Designing networks capable of attaining better performance with an increased inference budget is important to facilitate generalization to harder problem instances. Recent efforts have shown promising results in this direction by making use of depth-wise recurrent networks. We show that a broad class of architectures named equilibrium models display strong upwards generalization, and find that stronger performance on harder examples (which require more iterations of inference to get correct) strongly correlates with the path independence of the system -- its tendency to converge to the same steady-state behaviour regardless of initialization, given enough computation. Experimental interventions made to promote path independence result in improved generalization on harder problem instances, while those that penalize it degrade this ability. Path independence analyses are also useful on a per-example basis: for equilibrium models that have good in-distribution performance, path independence on out-of-distribution samples strongly correlates with accuracy. Our results help explain why equilibrium models are capable of strong upwards generalization and motivates future work that harnesses path independence as a general modelling principle to facilitate scalable test-time usage.

Submitted 17 November, 2022; originally announced November 2022.

Comments: NeurIPS 2022

arXiv:2210.12971 [pdf, other]

doi 10.1109/TPAMI.2023.3312749

Holistically-Attracted Wireframe Parsing: From Supervised to Self-Supervised Learning

Authors: Nan Xue, Tianfu Wu, Song Bai, Fu-Dong Wang, Gui-Song Xia, Liangpei Zhang, Philip H. S. Torr

Abstract: This article presents Holistically-Attracted Wireframe Parsing (HAWP), a method for geometric analysis of 2D images containing wireframes formed by line segments and junctions. HAWP utilizes a parsimonious Holistic Attraction (HAT) field representation that encodes line segments using a closed-form 4D geometric vector field. The proposed HAWP consists of three sequential components empowered by end-to-end and HAT-driven designs: (1) generating a dense set of line segments from HAT fields and endpoint proposals from heatmaps, (2) binding the dense line segments to sparse endpoint proposals to produce initial wireframes, and (3) filtering false positive proposals through a novel endpoint-decoupled line-of-interest aligning (EPD LOIAlign) module that captures the co-occurrence between endpoint proposals and HAT fields for better verification. Thanks to our novel designs, HAWPv2 shows strong performance in fully supervised learning, while HAWPv3 excels in self-supervised learning, achieving superior repeatability scores and efficient training (24 GPU hours on a single GPU). Furthermore, HAWPv3 exhibits a promising potential for wireframe parsing in out-of-distribution images without providing ground truth labels of wireframes.

Submitted 5 September, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

Comments: Journal extension of arXiv:2003.01663; Accepted by IEEE TPAMI; Code is available at https://github.com/cherubicxn/hawp

arXiv:2210.11714 [pdf, other]

Design a Sustainable Micro-mobility Future: Trends and Challenges in the United States and European Union Using Natural Language Processing Techniques

Authors: Lilit Avetisyan, Chengxin Zhang, Sue Bai, Ehsan Moradi Pari, Fred Feng, Shan Bao, Feng Zhou

Abstract: Micro-mobility is promising to contribute to sustainable cities in the future with its efficiency and low cost. To better design such a sustainable future, it is necessary to understand the trends and challenges. Thus, we examined people's opinions on micro-mobility in the US and the EU using Tweets. We used topic modeling based on advanced natural language processing techniques and categorized the data into seven topics: promotion and service, mobility, technical features, acceptance, recreation, infrastructure and regulations. Furthermore, using sentiment analysis, we investigated people's positive and negative attitudes towards specific aspects of these topics and compared the patterns of the trends and challenges in the US and the EU. We found that 1) promotion and service included the majority of Twitter discussions in the both regions, 2) the EU had more positive opinions than the US, 3) micro-mobility devices were more widely used for utilitarian mobility and recreational purposes in the EU than in the US, and 4) compared to the EU, people in the US had many more concerns related to infrastructure and regulation issues. These findings help us understand the trends and challenges and prioritize different aspects in micro-mobility to improve their safety and experience across the two areas for designing a more sustainable micro-mobility future.

Submitted 29 October, 2022; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: 33 pages, 4 figures

ACM Class: I.5

arXiv:2210.07574 [pdf, other]

Is synthetic data from generative models ready for image recognition?

Authors: Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, Xiaojuan Qi

Abstract: Recent text-to-image generation models have shown promising results in generating high-fidelity photo-realistic images. Though the results are astonishing to human eyes, how applicable these generated images are for recognition tasks remains under-explored. In this work, we extensively study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks, and focus on two perspectives: synthetic data for improving classification models in data-scarce settings (i.e. zero-shot and few-shot), and synthetic data for large-scale model pre-training for transfer learning. We showcase the powerfulness and shortcomings of synthetic data from existing generative models, and propose strategies for better applying synthetic data for recognition tasks. Code: https://github.com/CVMI-Lab/SyntheticData.

Submitted 15 February, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: ICLR 2023, spotlight

arXiv:2210.00226 [pdf, other]

Towards Understanding and Mitigating Dimensional Collapse in Heterogeneous Federated Learning

Authors: Yujun Shi, Jian Liang, Wenqing Zhang, Vincent Y. F. Tan, Song Bai

Abstract: Federated learning aims to train models collaboratively across different clients without the sharing of data for privacy considerations. However, one major challenge for this learning paradigm is the data heterogeneity problem, which refers to the discrepancies between the local data distributions among various clients. To tackle this problem, we first study how data heterogeneity affects the representations of the globally aggregated models. Interestingly, we find that heterogeneous data results in the global model suffering from severe dimensional collapse, in which representations tend to reside in a lower-dimensional space instead of the ambient space. Moreover, we observe a similar phenomenon on models locally trained on each client and deduce that the dimensional collapse on the global model is inherited from local models. In addition, we theoretically analyze the gradient flow dynamics to shed light on how data heterogeneity result in dimensional collapse for local models. To remedy this problem caused by the data heterogeneity, we propose FedDecorr, a novel method that can effectively mitigate dimensional collapse in federated learning. Specifically, FedDecorr applies a regularization term during local training that encourages different dimensions of representations to be uncorrelated. FedDecorr, which is implementation-friendly and computationally-efficient, yields consistent improvements over baselines on standard benchmark datasets. Code: https://github.com/bytedance/FedDecorr.

Submitted 7 April, 2024; v1 submitted 1 October, 2022; originally announced October 2022.

Comments: camera ready version of ICLR 2023

arXiv:2209.08599 [pdf, other]

Arnold conjecture over integers

Authors: Shaoyun Bai, Guangbo Xu

Abstract: For any closed symplectic manifold, we show that the number of 1-periodic orbits of a nondegenerate Hamiltonian thereon is bounded from below by a version of total Betti number over Z of the ambient space taking account of the total Betti number over Q and torsions of all characteristic. The proof is based on constructing a Hamiltonian Floer theory over the Novikov ring with integer coefficients, which generalizes our earlier work for constructing integer-valued Gromov-Witten type invariants. In the course of the construction, we build a Hamiltonian Floer flow category with compatible smooth global Kuranishi charts. This generalizes a recent work of Abouzaid-McLean-Smith, which might be of independent interest.

Submitted 18 September, 2022; originally announced September 2022.

Comments: 168 pages, 2 figures. Comments welcome!

arXiv:2209.01386 [pdf, other]

SaleNet: A low-power end-to-end CNN accelerator for sustained attention level evaluation using EEG

Authors: Chao Zhang, Zijian Tang, Taoming Guo, Jiaxin Lei, Jiaxin Xiao, Anhe Wang, Shuo Bai, Milin Zhang

Abstract: This paper proposes SaleNet - an end-to-end convolutional neural network (CNN) for sustained attention level evaluation using prefrontal electroencephalogram (EEG). A bias-driven pruning method is proposed together with group convolution, global average pooling (GAP), near-zero pruning, weight clustering and quantization for the model compression, achieving a total compression ratio of 183.11x. The compressed SaleNet obtains a state-of-the-art subject-independent sustained attention level classification accuracy of 84.2% on the recorded 6-subject EEG database in this work. The SaleNet is implemented on a Artix-7 FPGA with a competitive power consumption of 0.11 W and an energy-efficiency of 8.19 GOps/W.

Submitted 3 September, 2022; originally announced September 2022.

Comments: 5 pages, 4 figures, to be published in IEEE International Symposium on Circuits and Systems (ISCAS) 2022

arXiv:2209.00224 [pdf, ps, other]

1st Place Solution to ECCV 2022 Challenge on Out of Vocabulary Scene Text Understanding: End-to-End Recognition of Out of Vocabulary Words

Authors: Zhangzi Zhu, Chuhui Xue, Yu Hao, Wenqing Zhang, Song Bai

Abstract: Scene text recognition has attracted increasing interest in recent years due to its wide range of applications in multilingual translation, autonomous driving, etc. In this report, we describe our solution to the Out of Vocabulary Scene Text Understanding (OOV-ST) Challenge, which aims to extract out-of-vocabulary (OOV) words from natural scene images. Our oCLIP-based model achieves 28.59% in h-mean which ranks 1st in end-to-end OOV word recognition track of OOV Challenge in ECCV2022 TiE Workshop.

Submitted 1 September, 2022; originally announced September 2022.

Comments: Report to ECCV TiE OOV competition

arXiv:2208.03524 [pdf]

doi 10.2139/ssrn.4253498

Deep Learning-enabled Spatial Phase Unwrapping for 3D Measurement

Authors: Xiaolong Luo, Wanzhong Song, Songlin Bai, Yu Li, Zhihe Zhao

Abstract: In terms of 3D imaging speed and system cost, the single-camera system projecting single-frequency patterns is the ideal option among all proposed Fringe Projection Profilometry (FPP) systems. This system necessitates a robust spatial phase unwrapping (SPU) algorithm. However, robust SPU remains a challenge in complex scenes. Quality-guided SPU algorithms need more efficient ways to identify the unreliable points in phase maps before unwrapping. End-to-end deep learning SPU methods face generality and interpretability problems. This paper proposes a hybrid method combining deep learning and traditional path-following for robust SPU in FPP. This hybrid SPU scheme demonstrates better robustness than traditional quality-guided SPU methods, better interpretability than end-to-end deep learning scheme, and generality on unseen data. Experiments on the real dataset of multiple illumination conditions and multiple FPP systems differing in image resolution, the number of fringes, fringe direction, and optics wavelength verify the effectiveness of the proposed method.

Submitted 6 August, 2022; originally announced August 2022.

Comments: 26 pages

ACM Class: I.4.5

Journal ref: Optics & Laser Technology, 163 (2023) 109340

arXiv:2208.02747 [pdf, ps, other]

Runner-Up Solution to ECCV 2022 Challenge on Out of Vocabulary Scene Text Understanding: Cropped Word Recognition

Authors: Zhangzi Zhu, Yu Hao, Wenqing Zhang, Chuhui Xue, Song Bai

Abstract: This report presents our 2nd place solution to ECCV 2022 challenge on Out-of-Vocabulary Scene Text Understanding (OOV-ST): Cropped Word Recognition. This challenge is held in the context of ECCV 2022 workshop on Text in Everything (TiE), which aims to extract out-of-vocabulary words from natural scene images. In the competition, we first pre-train SCATTER on the synthetic datasets, then fine-tune the model on the training set with data augmentations. Meanwhile, two additional models are trained specifically for long and vertical texts. Finally, we combine the output from different models with different layers, different backbones, and different seeds as the final results. Our solution achieves a word accuracy of 59.45% when considering out-of-vocabulary words only.

Submitted 31 August, 2022; v1 submitted 4 August, 2022; originally announced August 2022.

arXiv:2208.00090 [pdf, other]

Explicit Occlusion Reasoning for Multi-person 3D Human Pose Estimation

Authors: Qihao Liu, Yi Zhang, Song Bai, Alan Yuille

Abstract: Occlusion poses a great threat to monocular multi-person 3D human pose estimation due to large variability in terms of the shape, appearance, and position of occluders. While existing methods try to handle occlusion with pose priors/constraints, data augmentation, or implicit reasoning, they still fail to generalize to unseen poses or occlusion cases and may make large mistakes when multiple people are present. Inspired by the remarkable ability of humans to infer occluded joints from visible cues, we develop a method to explicitly model this process that significantly improves bottom-up multi-person human pose estimation with or without occlusions. First, we split the task into two subtasks: visible keypoints detection and occluded keypoints reasoning, and propose a Deeply Supervised Encoder Distillation (DSED) network to solve the second one. To train our model, we propose a Skeleton-guided human Shape Fitting (SSF) approach to generate pseudo occlusion labels on the existing datasets, enabling explicit occlusion reasoning. Experiments show that explicitly learning from occlusions improves human pose estimation. In addition, exploiting feature-level information of visible joints allows us to reason about occluded joints more accurately. Our method outperforms both the state-of-the-art top-down and bottom-up methods on several benchmarks.

Submitted 29 July, 2022; originally announced August 2022.

Comments: ECCV 2022

arXiv:2207.12955 [pdf, other]

Contextual Text Block Detection towards Scene Text Understanding

Authors: Chuhui Xue, Jiaxing Huang, Shijian Lu, Changhu Wang, Song Bai

Abstract: Most existing scene text detectors focus on detecting characters or words that only capture partial text messages due to missing contextual information. For a better understanding of text in scenes, it is more desired to detect contextual text blocks (CTBs) which consist of one or multiple integral text units (e.g., characters, words, or phrases) in natural reading order and transmit certain complete text messages. This paper presents contextual text detection, a new setup that detects CTBs for better understanding of texts in scenes. We formulate the new setup by a dual detection task which first detects integral text units and then groups them into a CTB. To this end, we design a novel scene text clustering technique that treats integral text units as tokens and groups them (belonging to the same CTB) into an ordered token sequence. In addition, we create two datasets SCUT-CTW-Context and ReCTS-Context to facilitate future research, where each CTB is well annotated by an ordered sequence of integral text units. Further, we introduce three metrics that measure contextual text detection in local accuracy, continuity, and global accuracy. Extensive experiments show that our method accurately detects CTBs which effectively facilitates downstream tasks such as text classification and translation. The project is available at https://sg-vilab.github.io/publication/xue2022contextual/.

Submitted 26 July, 2022; originally announced July 2022.

Comments: Accepted by ECCV2022

arXiv:2207.10661 [pdf, other]

In Defense of Online Models for Video Instance Segmentation

Authors: Junfeng Wu, Qihao Liu, Yi Jiang, Song Bai, Alan Yuille, Xiang Bai

Abstract: In recent years, video instance segmentation (VIS) has been largely advanced by offline models, while online models gradually attracted less attention possibly due to their inferior performance. However, online methods have their inherent advantage in handling long video sequences and ongoing videos while offline models fail due to the limit of computational resources. Therefore, it would be highly desirable if online models can achieve comparable or even better performance than offline models. By dissecting current online models and offline models, we demonstrate that the main cause of the performance gap is the error-prone association between frames caused by the similar appearance among different instances in the feature space. Observing this, we propose an online framework based on contrastive learning that is able to learn more discriminative instance embeddings for association and fully exploit history information for stability. Despite its simplicity, our method outperforms all online and offline methods on three benchmarks. Specifically, we achieve 49.5 AP on YouTube-VIS 2019, a significant improvement of 13.2 AP and 2.1 AP over the prior online and offline art, respectively. Moreover, we achieve 30.2 AP on OVIS, a more challenging dataset with significant crowding and occlusions, surpassing the prior art by 14.8 AP. The proposed method won first place in the video instance segmentation track of the 4th Large-scale Video Object Segmentation Challenge (CVPR2022). We hope the simplicity and effectiveness of our method, as well as our insight into current methods, could shed light on the exploration of VIS models.

Submitted 21 July, 2022; originally announced July 2022.

Comments: ECCV 2022, Oral

arXiv:2207.09161 [pdf, other]

Single Stage Virtual Try-on via Deformable Attention Flows

Authors: Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, Hongxia Yang

Abstract: Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image. Existing methods usually build up multi-stage frameworks to deal with clothes warping and body blending respectively, or rely heavily on intermediate parser-based labels which may be noisy or even inaccurate. To solve the above challenges, we propose a single-stage try-on framework by developing a novel Deformable Attention Flow (DAFlow), which applies the deformable attention scheme to multi-flow estimation. With pose keypoints as the guidance only, the self- and cross-deformable attention flows are estimated for the reference person and the garment images, respectively. By sampling multiple flow fields, the feature-level and pixel-level information from different semantic areas are simultaneously extracted and merged through the attention mechanism. It enables clothes warping and body synthesizing at the same time which leads to photo-realistic results in an end-to-end manner. Extensive experiments on two try-on datasets demonstrate that our proposed method achieves state-of-the-art performance both qualitatively and quantitatively. Furthermore, additional experiments on the other two image editing tasks illustrate the versatility of our method for multi-view synthesis and image animation.

Submitted 19 July, 2022; originally announced July 2022.

Comments: ECCV 2022

arXiv:2207.06118 [pdf, other]

Stability of Weighted Majority Voting under Estimated Weights

Authors: Shaojie Bai, Dongxia Wang, Tim Muller, Peng Cheng, Jiming Chen

Abstract: Weighted Majority Voting (WMV) is a well-known optimal decision rule for collective decision making, given the probability of sources to provide accurate information (trustworthiness). However, in reality, the trustworthiness is not a known quantity to the decision maker - they have to rely on an estimate called trust. A (machine learning) algorithm that computes trust is called unbiased when it has the property that it does not systematically overestimate or underestimate the trustworthiness. To formally analyse the uncertainty to the decision process, we introduce and analyse two important properties of such unbiased trust values: stability of correctness and stability of optimality. Stability of correctness means that the decision accuracy that the decision maker believes they achieved is equal to the actual accuracy. We prove stability of correctness holds. Stability of optimality means that the decisions made based on trust, are equally good as they would have been if they were based on trustworthiness. Stability of optimality does not hold. We analyse the difference between the two, and bounds thereon. We also present an overview of how sensitive decision correctness is to changes in trust and trustworthiness.

Submitted 30 June, 2024; v1 submitted 13 July, 2022; originally announced July 2022.

Comments: 15 pages, 16 figures
