Search | arXiv e-print repository

Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Authors: Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang Zhang, Wenhan Luo, Ping Tan, Wenping Wang, Qifeng Liu, Yike Guo

Abstract: In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficacy, and low resolution, resulting in poor-quality multiview images. Specifically, these methods assume that the input images should… ▽ More In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficacy, and low resolution, resulting in poor-quality multiview images. Specifically, these methods assume that the input images should comply with a predefined camera type, e.g. a perspective camera with a fixed focal length, leading to distorted shapes when the assumption fails. Moreover, the full-image or dense multiview attention they employ leads to an exponential explosion of computational complexity as image resolution increases, resulting in prohibitively expensive training costs. To bridge the gap between assumption and reality, Era3D first proposes a diffusion-based camera prediction module to estimate the focal length and elevation of the input image, which allows our method to generate images without shape distortions. Furthermore, a simple but efficient attention layer, named row-wise attention, is used to enforce epipolar priors in the multiview diffusion, facilitating efficient cross-view information fusion. Consequently, compared with state-of-the-art methods, Era3D generates high-quality multiview images with up to a 512*512 resolution while reducing computation complexity by 12x times. Comprehensive experiments demonstrate that Era3D can reconstruct high-quality and detailed 3D meshes from diverse single-view input images, significantly outperforming baseline multiview diffusion methods. Project page: https://penghtyx.github.io/Era3D/. △ Less

Submitted 29 May, 2024; v1 submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.04880 [pdf, other]

The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

Authors: Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun

Abstract: With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio currently exhibits widespread, high deception, and type versatility, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on… ▽ More With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio currently exhibits widespread, high deception, and type versatility, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method, the conversion from neural codec to waveform. We initially construct the Codecfake dataset, an open-source large-scale dataset, including 2 languages, over 1M audio samples, and various test conditions, focus on ALM-based audio detection. As countermeasure, to achieve universal detection of deepfake audio and tackle domain ascent bias issue of original SAM, we propose the CSAM strategy to learn a domain balanced and generalized minima. In our experiments, we first demonstrate that ADD model training with the Codecfake dataset can effectively detects ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online. △ Less

Submitted 15 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

arXiv:2404.14047 [pdf, other]

An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs

Authors: Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno

Abstract: The LLaMA family has become one of the most powerful open-source Large Language Models (LLMs) and the popular LLM backbones of Multimodal Large Language Models (MLLMs), widely applied in Computer Vision (CV) and Natural Language Understanding (NLU) tasks. Notably, LLaMA3 models have recently been released and achieve impressive performance across various with super-large scale pre-training on over… ▽ More The LLaMA family has become one of the most powerful open-source Large Language Models (LLMs) and the popular LLM backbones of Multimodal Large Language Models (MLLMs), widely applied in Computer Vision (CV) and Natural Language Understanding (NLU) tasks. Notably, LLaMA3 models have recently been released and achieve impressive performance across various with super-large scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-width. This exploration can potentially unveil new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing performance degradation problems that suffer in LLM compression. Specifically, we comprehensively evaluate the 10 existing post-training quantization and LoRA-finetuning methods of LLaMA3 on 1-8 bits and diverse datasets to reveal LLaMA3's low-bit quantization performance. To uncover the capabilities of low-bit quantized MLLM, we assessed the performance of the LLaMA3-based LLaVA-Next-8B model under 2-4 ultra-low bits with post-training quantization methods. Our experimental results indicate that LLaMA3 still suffers non-negligent degradation in linguistic and visual contexts, particularly under ultra-low bit widths. This highlights the significant performance gap under low bit-width that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, driving LLMs and MLLMs to achieve higher accuracy at lower bit to enhance practicality. △ Less

Submitted 19 July, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.13013 [pdf, other]

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Authors: Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi

Abstract: We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encode… ▽ More We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: https://groma-mllm.github.io/. △ Less

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.11869 [pdf, other]

Node-like as a Whole: Structure-aware Searching and Coarsening for Graph Classification

Authors: Xiaorui Qi, Qijie Bai, Yanlong Wen, Haiwei Zhang, Xiaojie Yuan

Abstract: Graph Transformers (GTs) have made remarkable achievements in graph-level tasks. However, most existing works regard graph structures as a form of guidance or bias for enhancing node representations, which focuses on node-central perspectives and lacks explicit representations of edges and structures. One natural question is, can we treat graph structures node-like as a whole to learn high-level f… ▽ More Graph Transformers (GTs) have made remarkable achievements in graph-level tasks. However, most existing works regard graph structures as a form of guidance or bias for enhancing node representations, which focuses on node-central perspectives and lacks explicit representations of edges and structures. One natural question is, can we treat graph structures node-like as a whole to learn high-level features? Through experimental analysis, we explore the feasibility of this assumption. Based on our findings, we propose a novel multi-view graph representation learning model via structure-aware searching and coarsening (GRLsc) on GT architecture for graph classification. Specifically, we build three unique views, original, coarsening, and conversion, to learn a thorough structural representation. We compress loops and cliques via hierarchical heuristic graph coarsening and restrict them with well-designed constraints, which builds the coarsening view to learn high-level interactions between structures. We also introduce line graphs for edge embeddings and switch to edge-central perspective to construct the conversion view. Experiments on eight real-world datasets demonstrate the improvements of GRLsc over 28 baselines from various architectures. △ Less

Submitted 25 July, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.11525 [pdf, other]

JointViT: Modeling Oxygen Saturation Levels with Joint Supervision on Long-Tailed OCTA

Authors: Zeyu Zhang, Xuyin Qi, Mingxi Chen, Guangxi Li, Ryan Pham, Ayub Qassim, Ella Berry, Zhibin Liao, Owen Siggs, Robert Mclaughlin, Jamie Craig, Minh-Son To

Abstract: The oxygen saturation level in the blood (SaO2) is crucial for health, particularly in relation to sleep-related breathing disorders. However, continuous monitoring of SaO2 is time-consuming and highly variable depending on patients' conditions. Recently, optical coherence tomography angiography (OCTA) has shown promising development in rapidly and effectively screening eye-related lesions, offeri… ▽ More The oxygen saturation level in the blood (SaO2) is crucial for health, particularly in relation to sleep-related breathing disorders. However, continuous monitoring of SaO2 is time-consuming and highly variable depending on patients' conditions. Recently, optical coherence tomography angiography (OCTA) has shown promising development in rapidly and effectively screening eye-related lesions, offering the potential for diagnosing sleep-related disorders. To bridge this gap, our paper presents three key contributions. Firstly, we propose JointViT, a novel model based on the Vision Transformer architecture, incorporating a joint loss function for supervision. Secondly, we introduce a balancing augmentation technique during data preprocessing to improve the model's performance, particularly on the long-tail distribution within the OCTA dataset. Lastly, through comprehensive experiments on the OCTA dataset, our proposed method significantly outperforms other state-of-the-art methods, achieving improvements of up to 12.28% in overall accuracy. This advancement lays the groundwork for the future utilization of OCTA in diagnosing sleep-related disorders. See project website https://steve-zeyu-zhang.github.io/JointViT △ Less

Submitted 28 July, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted to MIUA 2024 Oral

arXiv:2404.09613 [pdf, other]

Efficient and accurate neural field reconstruction using resistive memory

Authors: Yifei Yu, Shaocong Wang, Woyu Zhang, Xinyuan Zhang, Xiuzhe Wu, Yangu He, Jichang Yang, Yue Zhang, Ning Lin, Bo Wang, Xi Chen, Songqi Wang, Xumeng Zhang, Xiaojuan Qi, Zhongrui Wang, Dashan Shang, Qi Liu, Kwang-Ting Cheng, Ming Liu

Abstract: Human beings construct perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering a superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods… ▽ More Human beings construct perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering a superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods on digital computers face both software and hardware challenges. On the software front, difficulties arise from storage inefficiencies in conventional explicit signal representation. Hardware obstacles include the von Neumann bottleneck, which limits data transfer between the CPU and memory, and the limitations of CMOS circuits in supporting parallel processing. We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs. Software-wise, we employ neural field to implicitly represent signals via neural networks, which is further compressed using low-rank decomposition and structured pruning. Hardware-wise, we design a resistive memory-based computing-in-memory (CIM) platform, featuring a Gaussian Encoder (GE) and an MLP Processing Engine (PE). The GE harnesses the intrinsic stochasticity of resistive memory for efficient input encoding, while the PE achieves precise weight mapping through a Hardware-Aware Quantization (HAQ) circuit. We demonstrate the system's efficacy on a 40nm 256Kb resistive memory-based in-memory computing macro, achieving huge energy efficiency and parallelism improvements without compromising reconstruction quality in tasks like 3D CT sparse reconstruction, novel view synthesis, and novel view synthesis for dynamic scenes. This work advances the AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications. △ Less

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.05648 [pdf, other]

Resistive Memory-based Neural Differential Equation Solver for Score-based Diffusion Model

Authors: Jichang Yang, Hegan Chen, Jia Chen, Songqi Wang, Shaocong Wang, Yifei Yu, Xi Chen, Bo Wang, Xinyuan Zhang, Binbin Cui, Yi Li, Ning Lin, Meng Xu, Yi Li, Xiaoxin Xu, Xiaojuan Qi, Zhongrui Wang, Xumeng Zhang, Dashan Shang, Han Wang, Qi Liu, Kwang-Ting Cheng, Ming Liu

Abstract: Human brains image complicated scenes when reading a novel. Replicating this imagination is one of the ultimate goals of AI-Generated Content (AIGC). However, current AIGC methods, such as score-based diffusion, are still deficient in terms of rapidity and efficiency. This deficiency is rooted in the difference between the brain and digital computers. Digital computers have physically separated st… ▽ More Human brains image complicated scenes when reading a novel. Replicating this imagination is one of the ultimate goals of AI-Generated Content (AIGC). However, current AIGC methods, such as score-based diffusion, are still deficient in terms of rapidity and efficiency. This deficiency is rooted in the difference between the brain and digital computers. Digital computers have physically separated storage and processing units, resulting in frequent data transfers during iterative calculations, incurring large time and energy overheads. This issue is further intensified by the conversion of inherently continuous and analog generation dynamics, which can be formulated by neural differential equations, into discrete and digital operations. Inspired by the brain, we propose a time-continuous and analog in-memory neural differential equation solver for score-based diffusion, employing emerging resistive memory. The integration of storage and computation within resistive memory synapses surmount the von Neumann bottleneck, benefiting the generative speed and energy efficiency. The closed-loop feedback integrator is time-continuous, analog, and compact, physically implementing an infinite-depth neural network. Moreover, the software-hardware co-design is intrinsically robust to analog noise. We experimentally validate our solution with 180 nm resistive memory in-memory computing macros. Demonstrating equivalent generative quality to the software baseline, our system achieved remarkable enhancements in generative speed for both unconditional and conditional generation tasks, by factors of 64.8 and 156.5, respectively. Moreover, it accomplished reductions in energy consumption by factors of 5.2 and 4.1. Our approach heralds a new horizon for hardware solutions in edge computing for generative AI applications. △ Less

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.02026 [pdf]

Infrared nanosensors of pico- to micro-newton forces

Authors: Natalie Fardian-Melamed, Artiom Skripka, Changhwan Lee, Benedikt Ursprung, Thomas P. Darlington, Ayelet Teitelboim, Xiao Qi, Maoji Wang, Jordan M. Gerton, Bruce E. Cohen, Emory M. Chan, P. James Schuck

Abstract: Mechanical force is an essential feature for many physical and biological processes.1-12 Remote measurement of mechanical signals with high sensitivity and spatial resolution is needed for diverse applications, including robotics,13 biophysics,14-20 energy storage,21-24 and medicine.25-27 Nanoscale luminescent force sensors excel at measuring piconewton forces,28-32 while larger sensors have prove… ▽ More Mechanical force is an essential feature for many physical and biological processes.1-12 Remote measurement of mechanical signals with high sensitivity and spatial resolution is needed for diverse applications, including robotics,13 biophysics,14-20 energy storage,21-24 and medicine.25-27 Nanoscale luminescent force sensors excel at measuring piconewton forces,28-32 while larger sensors have proven powerful in probing micronewton forces.33,34 However, large gaps remain in the force magnitudes that can be probed remotely from subsurface or interfacial sites, and no individual, non-invasive sensor is capable of measuring over the large dynamic range needed to understand many systems.35,36 Here, we demonstrate Tm3+-doped avalanching nanoparticle37 force sensors that can be addressed remotely by deeply penetrating near-infrared (NIR) light and can detect piconewton to micronewton forces with a dynamic range spanning more than four orders of magnitude. Using atomic force microscopy coupled with single-nanoparticle optical spectroscopy, we characterize the mechanical sensitivity of the photon avalanching process and reveal its exceptional force responsiveness. By manipulating the Tm3+ concentrations and energy transfer within the nanosensors, we demonstrate different optical force-sensing modalities, including mechanobrightening and mechanochromism. The adaptability of these nanoscale optical force sensors, along with their multiscale sensing capability, enable operation in the dynamic and versatile environments present in real-world, complex structures spanning biological organisms to nanoelectromechanical systems (NEMS). △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2404.00409 [pdf, other]

3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting

Authors: Xiaoyang Lyu, Yang-Tian Sun, Yi-Hua Huang, Xiuzhe Wu, Ziyi Yang, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi

Abstract: In this paper, we present an implicit surface reconstruction method with 3D Gaussian Splatting (3DGS), namely 3DGSR, that allows for accurate 3D reconstruction with intricate details while inheriting the high efficiency and rendering quality of 3DGS. The key insight is incorporating an implicit signed distance field (SDF) within 3D Gaussians to enable them to be aligned and jointly optimized. Firs… ▽ More In this paper, we present an implicit surface reconstruction method with 3D Gaussian Splatting (3DGS), namely 3DGSR, that allows for accurate 3D reconstruction with intricate details while inheriting the high efficiency and rendering quality of 3DGS. The key insight is incorporating an implicit signed distance field (SDF) within 3D Gaussians to enable them to be aligned and jointly optimized. First, we introduce a differentiable SDF-to-opacity transformation function that converts SDF values into corresponding Gaussians' opacities. This function connects the SDF and 3D Gaussians, allowing for unified optimization and enforcing surface constraints on the 3D Gaussians. During learning, optimizing the 3D Gaussians provides supervisory signals for SDF learning, enabling the reconstruction of intricate details. However, this only provides sparse supervisory signals to the SDF at locations occupied by Gaussians, which is insufficient for learning a continuous SDF. Then, to address this limitation, we incorporate volumetric rendering and align the rendered geometric attributes (depth, normal) with those derived from 3D Gaussians. This consistency regularization introduces supervisory signals to locations not covered by discrete 3D Gaussians, effectively eliminating redundant surfaces outside the Gaussian sampling range. Our extensive experimental results demonstrate that our 3DGSR method enables high-quality 3D surface reconstruction while preserving the efficiency and rendering quality of 3DGS. Besides, our method competes favorably with leading surface reconstruction techniques while offering a more efficient learning process and much better rendering qualities. The code will be available at https://github.com/CVMI-Lab/3DGSR. △ Less

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.19314 [pdf, other]

Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction

Authors: Xiaoyang Lyu, Chirui Chang, Peng Dai, Yang-Tian Sun, Xiaojuan Qi

Abstract: Scene reconstruction from multi-view images is a fundamental problem in computer vision and graphics. Recent neural implicit surface reconstruction methods have achieved high-quality results; however, editing and manipulating the 3D geometry of reconstructed scenes remains challenging due to the absence of naturally decomposed object entities and complex object/background compositions. In this pap… ▽ More Scene reconstruction from multi-view images is a fundamental problem in computer vision and graphics. Recent neural implicit surface reconstruction methods have achieved high-quality results; however, editing and manipulating the 3D geometry of reconstructed scenes remains challenging due to the absence of naturally decomposed object entities and complex object/background compositions. In this paper, we present Total-Decom, a novel method for decomposed 3D reconstruction with minimal human interaction. Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition. Total-Decom requires minimal human annotations while providing users with real-time control over the granularity and quality of decomposition. We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications, such as animation and scene editing. The code is available at https://github.com/CVMI-Lab/Total-Decom.git. △ Less

Submitted 30 March, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: 8 pages, 7 figures, accepted by CVPR 2024

arXiv:2403.14760 [pdf, other]

Can 3D Vision-Language Models Truly Understand Natural Language?

Authors: Weipeng Deng, Jihan Yang, Runyu Ding, Jiahui Liu, Yijiang Li, Xiaojuan Qi, Edith Ngai

Abstract: Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new avenues for human interaction with embodied agents or robots using natural language. Despite this progress, we find a notable limitation: existing 3D-VL models exhibit sensitivity to the styles of language input, struggling to understand sentences with the same semantic meaning but written in different variants. This observa… ▽ More Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new avenues for human interaction with embodied agents or robots using natural language. Despite this progress, we find a notable limitation: existing 3D-VL models exhibit sensitivity to the styles of language input, struggling to understand sentences with the same semantic meaning but written in different variants. This observation raises a critical question: Can 3D vision-language models truly understand natural language? To test the language understandability of 3D-VL models, we first propose a language robustness task for systematically assessing 3D-VL models across various tasks, benchmarking their performance when presented with different language style variants. Importantly, these variants are commonly encountered in applications requiring direct interaction with humans, such as embodied robotics, given the diversity and unpredictability of human language. We propose a 3D Language Robustness Dataset, designed based on the characteristics of human language, to facilitate the systematic study of robustness. Our comprehensive evaluation uncovers a significant drop in the performance of all existing models across various 3D-VL tasks. Even the state-of-the-art 3D-LLM fails to understand some variants of the same sentences. Further in-depth analysis suggests that the existing models have a fragile and biased fusion module, which stems from the low diversity of the existing dataset. Finally, we propose a training-free module driven by LLM, which improves language robustness. Datasets and code will be available at github. △ Less

Submitted 3 July, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: https://github.com/VincentDENGP/3D-LR

arXiv:2403.12104 [pdf, other]

Topology reconstruction for asymmetric systems by isomorphic mapping or perturbation approximation

Authors: Yunlin Li, Jingguang Chen, Xingchao Qi, Langlang Xiong, Xianjun Wang, Yufu Liu, Fang Guan, Lei Shi, Xunya Jiang

Abstract: The systems without symmetries, e.g. the spatial and chiral symmetries, are generally thought to be improper for topological study and no conventional integral topological invariant can be well defined. In this work, with multi-band asymmetric Rice-Mele-like systems as examples, for the first time we show that the topology of all gaps can be reconstructed by two general methods and topological ori… ▽ More The systems without symmetries, e.g. the spatial and chiral symmetries, are generally thought to be improper for topological study and no conventional integral topological invariant can be well defined. In this work, with multi-band asymmetric Rice-Mele-like systems as examples, for the first time we show that the topology of all gaps can be reconstructed by two general methods and topological origin of many phenomena are revealed. A new integral topological invariant, i.e. the renormalized real-space winding number, can properly characterize the topology and bulk-edge correspondence of such systems. For the first method, an isomorphic mapping relationship between a Rice-Mele-like system and its chiral counterpart is set up, which accounts for the topology reconstruction in the half-filling gaps. For the second method, the Hilbert space of asymmetric systems could be reduced into degenerate subspaces by perturbation approximation, so that the topology in subspaces accounts for the topology reconstruction in the fractional-filling gaps. Surprisingly, the topology reconstructed by perturbation approximation exhibits extraordinary robustness since the topological edge states even exist far beyond the weak perturbation limit. We also show that both methods can be widely used for other asymmetric systems, e.g. the two-dimensional (2D) Rice-Mele systems and the superconductor systems. At last, for the asymmetric photonic systems, we predict different topological edge states by our topology-reconstruction theory and experimentally observe them in the laboratory, which agrees with each other very well. Our findings open a door for investigating new topological phenomena in asymmetric systems by various topological reconstruction methods which should greatly expand the category of topology study. △ Less

Submitted 24 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.12035 [pdf, other]

CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

Authors: Bojia Zi, Shihao Zhao, Xianbiao Qi, Jianan Wang, Yukai Shi, Qianyu Chen, Bin Liang, Kam-Fai Wong, Lei Zhang

Abstract: Recent advancements in video generation have been remarkable, yet many existing methods struggle with issues of consistency and poor text-video alignment. Moreover, the field lacks effective techniques for text-guided video inpainting, a stark contrast to the well-explored domain of text-guided image inpainting. To this end, this paper proposes a novel text-guided video inpainting model that achie… ▽ More Recent advancements in video generation have been remarkable, yet many existing methods struggle with issues of consistency and poor text-video alignment. Moreover, the field lacks effective techniques for text-guided video inpainting, a stark contrast to the well-explored domain of text-guided image inpainting. To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability and compatibility. Specifically, we introduce a simple but efficient motion capture module to preserve motion consistency, and design an instance-aware region selection instead of a random region selection to obtain better textual controllability, and utilize a novel strategy to inject some personalized models into our CoCoCo model and thus obtain better model compatibility. Extensive experiments show that our model can generate high-quality video clips. Meanwhile, our model shows better motion consistency, textual controllability and model compatibility. More details are shown in [cococozibojia.github.io](cococozibojia.github.io). △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.10071 [pdf, other]

Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

Authors: Baoquan Zhang, Huaibin Wang, Luo Chuyao, Xutao Li, Liang Guotao, Yunming Ye, Xiaochen Qi, Yao He

Abstract: Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in image synthesis, which aims to represent an image with a discrete token sequence. Existing studies effectively address this problem by learning a discrete codebook from scratch and in a code-independent manner to quantize continuous representations into discrete tokens. However, learning a codebook from scratch and in a co… ▽ More Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in image synthesis, which aims to represent an image with a discrete token sequence. Existing studies effectively address this problem by learning a discrete codebook from scratch and in a code-independent manner to quantize continuous representations into discrete tokens. However, learning a codebook from scratch and in a code-independent manner is highly challenging, which may be a key reason causing codebook collapse, i.e., some code vectors can rarely be optimized without regard to the relationship between codes and good codebook priors such that die off finally. In this paper, inspired by pretrained language models, we find that these language models have actually pretrained a superior codebook via a large number of text corpus, but such information is rarely exploited in VQIM. To this end, we propose a novel codebook transfer framework with part-of-speech, called VQCT, which aims to transfer a well-trained codebook from pretrained language models to VQIM for robust codebook learning. Specifically, we first introduce a pretrained codebook from language models and part-of-speech knowledge as priors. Then, we construct a vision-related codebook with these priors for achieving codebook transfer. Finally, a novel codebook transfer network is designed to exploit abundant semantic relationships between codes contained in pretrained codebooks for robust VQIM codebook learning. Experimental results on four datasets show that our VQCT method achieves superior VQIM performance over previous state-of-the-art methods. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR 2024

arXiv:2403.06384 [pdf, other]

Precision Spectroscopy and Nuclear Structure Parameters in 7Li+ ion

Authors: Hua Guan, Xiao-Qiu Qi, Peng-Peng Zhou, Wei Sun, Shao-Long Chen, Xu-Rui Chang, Yao Huang, Pei-Pei Zhang, Zong-Chao Yan, G. W. F. Drake, Ai-Xi Chen, Zhen-Xiang Zhong, Ting-Yun Shi, Ke-Lin Gao

Abstract: The optical Ramsey technique is used to obtain precise measurements of the hyperfine splittings in the $2\,^3\!S_1$ and $2\,^3\!P_J$ states of $^7$Li$^+$. Together with bound-state quantum electrodynamic theory, the Zemach radius and quadrupole moment of the $^7$Li nucleus are determined to be $3.35(1)$~fm and $-3.86(5)$~fm$^2$ respectively, with the quadrupole moment deviating from the recommende… ▽ More The optical Ramsey technique is used to obtain precise measurements of the hyperfine splittings in the $2\,^3\!S_1$ and $2\,^3\!P_J$ states of $^7$Li$^+$. Together with bound-state quantum electrodynamic theory, the Zemach radius and quadrupole moment of the $^7$Li nucleus are determined to be $3.35(1)$~fm and $-3.86(5)$~fm$^2$ respectively, with the quadrupole moment deviating from the recommended value of $-4.00(3)$~fm$^2$ by $1.75σ$. Furthermore, we determine the quadrupole moment ratio of $^6$Li to $^7$Li as $0.101(13)$, exhibiting a $6σ$ deviation from the previous measured value of $0.020161(13)$ by LiF molecular spectroscopy. The results taken together provide a sensitive test of nuclear structure models. △ Less

Submitted 10 March, 2024; originally announced March 2024.

arXiv:2403.05895 [pdf, other]

DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos

Authors: Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan, Xiaojuan Qi

Abstract: Although considerable advancements have been attained in self-supervised depth estimation from monocular videos, most existing methods often treat all objects in a video as static entities, which however violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. In this paper, we propose a self-supervised method to jointly learn 3D motion and dep… ▽ More Although considerable advancements have been attained in self-supervised depth estimation from monocular videos, most existing methods often treat all objects in a video as static entities, which however violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. In this paper, we propose a self-supervised method to jointly learn 3D motion and depth from monocular videos. Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion. Depth and motion networks work collaboratively to faithfully model the geometry and dynamics of real-world scenes, which, in turn, benefits both depth and 3D motion estimation. Their predictions are further combined to synthesize a novel video frame for self-supervised training. As a core component of our framework, DO3D is a new motion disentanglement module that learns to predict camera ego-motion and instance-aware 3D object motion separately. To alleviate the difficulties in estimating non-rigid 3D object motions, they are decomposed to object-wise 6-DoF global transformations and a pixel-wise local 3D motion deformation field. Qualitative and quantitative experiments are conducted on three benchmark datasets, including KITTI, Cityscapes, and VKITTI2, where our model delivers superior performance in all evaluated settings. For the depth estimation task, our model outperforms all compared research works in the high-resolution setting, attaining an absolute relative depth error (abs rel) of 0.099 on the KITTI benchmark. Besides, our optical flow estimation results (an overall EPE of 7.09 on KITTI) also surpass state-of-the-art methods and largely improve the estimation of dynamic regions, demonstrating the effectiveness of our motion model. Our code will be available. △ Less

Submitted 9 March, 2024; originally announced March 2024.

Comments: 24 pages, 14 figures, Tech Report

arXiv:2403.04098 [pdf]

Intrinsic Optical Bistability of Photon Avalanching Nanocrystals

Authors: Artiom Skripka, Zhuolei Zhang, Xiao Qi, Benedikt Ursprung, Peter Ercius, Bruce E. Cohen, P. James Schuck, Daniel Jaque, Emory M. Chan

Abstract: Optically bistable materials respond to a single input with two possible optical outputs, contingent upon excitation history. Such materials would be ideal for optical switching and memory, yet limited understanding of intrinsic optical bistability (IOB) prevents development of nanoscale IOB materials suitable for devices. Here, we demonstrate IOB in Nd3+-doped KPb2Cl5 avalanching nanoparticles (A… ▽ More Optically bistable materials respond to a single input with two possible optical outputs, contingent upon excitation history. Such materials would be ideal for optical switching and memory, yet limited understanding of intrinsic optical bistability (IOB) prevents development of nanoscale IOB materials suitable for devices. Here, we demonstrate IOB in Nd3+-doped KPb2Cl5 avalanching nanoparticles (ANPs), which switch with high contrast between luminescent and non-luminescent states, with hysteresis characteristic of bistability. We elucidate a nonthermal mechanism in which IOB originates from suppressed nonradiative relaxation in Nd3+ ions and from the positive feedback of photon avalanching, resulting in extreme, >200th-order optical nonlinearities. Modulation of laser pulsing tunes hysteresis widths, and dual-laser excitation enables transistor-like optical switching. This control over nanoscale IOB establishes ANPs for photonic devices in which light is used to manipulate light. △ Less

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2403.02579 [pdf, other]

Geometric Dynamics of Signal Propagation Predict Trainability of Transformers

Authors: Aditya Cowsik, Tamra Nebabu, Xiao-Liang Qi, Surya Ganguli

Abstract: We investigate forward signal propagation and gradient back propagation in deep, randomly initialized transformers, yielding simple necessary and sufficient conditions on initialization hyperparameters that ensure trainability of deep transformers. Our approach treats the evolution of the representations of $n$ tokens as they propagate through the transformer layers in terms of a discrete time dyn… ▽ More We investigate forward signal propagation and gradient back propagation in deep, randomly initialized transformers, yielding simple necessary and sufficient conditions on initialization hyperparameters that ensure trainability of deep transformers. Our approach treats the evolution of the representations of $n$ tokens as they propagate through the transformer layers in terms of a discrete time dynamical system of $n$ interacting particles. We derive simple update equations for the evolving geometry of this particle system, starting from a permutation symmetric simplex. Our update equations show that without MLP layers, this system will collapse to a line, consistent with prior work on rank collapse in transformers. However, unlike prior work, our evolution equations can quantitatively track particle geometry in the additional presence of nonlinear MLP layers, and it reveals an order-chaos phase transition as a function of initialization hyperparameters, like the strength of attentional and MLP residual connections and weight variances. In the ordered phase the particles are attractive and collapse to a line, while in the chaotic phase the particles are repulsive and converge to a regular $n$-simplex. We analytically derive two Lyapunov exponents: an angle exponent that governs departures from the edge of chaos in this particle system, and a gradient exponent that governs the rate of exponential growth or decay of backpropagated gradients. We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents at the beginning of training, and that the simultaneous vanishing of these two exponents yields a simple necessary and sufficient condition to achieve minimal test loss. △ Less

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.18133 [pdf, other]

Classes Are Not Equal: An Empirical Study on Image Recognition Fairness

Authors: Jiequan Cui, Beier Zhu, Xin Wen, Xiaojuan Qi, Bei Yu, Hanwang Zhang

Abstract: In this paper, we present an empirical study on image recognition fairness, i.e., extreme class accuracy disparity on balanced data like ImageNet. We experimentally demonstrate that classes are not equal and the fairness issue is prevalent for image classification models across various datasets, network architectures, and model capacities. Moreover, several intriguing properties of fairness are id… ▽ More In this paper, we present an empirical study on image recognition fairness, i.e., extreme class accuracy disparity on balanced data like ImageNet. We experimentally demonstrate that classes are not equal and the fairness issue is prevalent for image classification models across various datasets, network architectures, and model capacities. Moreover, several intriguing properties of fairness are identified. First, the unfairness lies in problematic representation rather than classifier bias. Second, with the proposed concept of Model Prediction Bias, we investigate the origins of problematic representation during optimization. Our findings reveal that models tend to exhibit greater prediction biases for classes that are more challenging to recognize. It means that more other classes will be confused with harder classes. Then the False Positives (FPs) will dominate the learning in optimization, thus leading to their poor accuracy. Further, we conclude that data augmentation and representation learning algorithms improve overall performance by promoting fairness to some degree in image classification. The Code is available at https://github.com/dvlab-research/Parametric-Contrastive-Learning. △ Less

Submitted 12 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: CVPR 2024

arXiv:2402.15870 [pdf, other]

Spec-Gaussian: Anisotropic View-Dependent Appearance for 3D Gaussian Splatting

Authors: Ziyi Yang, Xinyu Gao, Yangtian Sun, Yihua Huang, Xiaoyang Lyu, Wen Zhou, Shaohui Jiao, Xiaojuan Qi, Xiaogang Jin

Abstract: The recent advancements in 3D Gaussian splatting (3D-GS) have not only facilitated real-time rendering through modern GPU rasterization pipelines but have also attained state-of-the-art rendering quality. Nevertheless, despite its exceptional rendering quality and performance on standard datasets, 3D-GS frequently encounters difficulties in accurately modeling specular and anisotropic components.… ▽ More The recent advancements in 3D Gaussian splatting (3D-GS) have not only facilitated real-time rendering through modern GPU rasterization pipelines but have also attained state-of-the-art rendering quality. Nevertheless, despite its exceptional rendering quality and performance on standard datasets, 3D-GS frequently encounters difficulties in accurately modeling specular and anisotropic components. This issue stems from the limited ability of spherical harmonics (SH) to represent high-frequency information. To overcome this challenge, we introduce Spec-Gaussian, an approach that utilizes an anisotropic spherical Gaussian (ASG) appearance field instead of SH for modeling the view-dependent appearance of each 3D Gaussian. Additionally, we have developed a coarse-to-fine training strategy to improve learning efficiency and eliminate floaters caused by overfitting in real-world scenes. Our experimental results demonstrate that our method surpasses existing approaches in terms of rendering quality. Thanks to ASG, we have significantly improved the ability of 3D-GS to model scenes with specular and anisotropic components without increasing the number of 3D Gaussians. This improvement extends the applicability of 3D GS to handle intricate scenarios with specular and anisotropic surfaces. △ Less

Submitted 24 February, 2024; originally announced February 2024.

arXiv:2402.14968 [pdf, other]

Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

Authors: Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei Xiao

Abstract: Despite the general capabilities of Large Language Models (LLM), these models still request fine-tuning or adaptation with customized data when meeting specific business demands. However, this process inevitably introduces new threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack) under the setting of Language-Model-as-a-Service (LMaaS), where the model's safety has been s… ▽ More Despite the general capabilities of Large Language Models (LLM), these models still request fine-tuning or adaptation with customized data when meeting specific business demands. However, this process inevitably introduces new threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack) under the setting of Language-Model-as-a-Service (LMaaS), where the model's safety has been significantly compromised by fine-tuning users' uploaded examples contain just a few harmful examples. Though potential defenses have been proposed that the service providers can integrate safety examples into the fine-tuning dataset to reduce safety issues, such approaches require incorporating a substantial amount of data, making it inefficient. To effectively defend against the FJAttack with limited safety examples under LMaaS, we propose the Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks. In particular, service providers will construct prefixed safety examples with a secret prompt, acting as a "backdoor trigger". By integrating prefixed safety examples into the fine-tuning dataset, the subsequent fine-tuning process effectively acts as the "backdoor attack", establishing a strong correlation between the secret prompt and safety generations. Consequently, safe responses are ensured once service providers prepend this secret prompt ahead of any user input during inference. Our comprehensive experiments demonstrate that through the Backdoor Enhanced Safety Alignment with adding as few as 11 prefixed safety examples, the maliciously fine-tuned LLMs will achieve similar safety performance as the original aligned models without harming the benign performance. Furthermore, we also present the effectiveness of our method in a more practical setting where the fine-tuning data consists of both FJAttack examples and the fine-tuning task data. △ Less

Submitted 20 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.14577 [pdf, other]

Debiasing Text-to-Image Diffusion Models

Authors: Ruifei He, Chuhui Xue, Haoru Tan, Wenqing Zhang, Yingchen Yu, Song Bai, Xiaojuan Qi

Abstract: Learning-based Text-to-Image (TTI) models like Stable Diffusion have revolutionized the way visual content is generated in various domains. However, recent research has shown that nonnegligible social bias exists in current state-of-the-art TTI systems, which raises important concerns. In this work, we target resolving the social bias in TTI diffusion models. We begin by formalizing the problem se… ▽ More Learning-based Text-to-Image (TTI) models like Stable Diffusion have revolutionized the way visual content is generated in various domains. However, recent research has shown that nonnegligible social bias exists in current state-of-the-art TTI systems, which raises important concerns. In this work, we target resolving the social bias in TTI diffusion models. We begin by formalizing the problem setting and use the text descriptions of bias groups to establish an unsafe direction for guiding the diffusion process. Next, we simplify the problem into a weight optimization problem and attempt a Reinforcement solver, Policy Gradient, which shows sub-optimal performance with slow convergence. Further, to overcome limitations, we propose an iterative distribution alignment (IDA) method. Despite its simplicity, we show that IDA shows efficiency and fast convergence in resolving the social bias in TTI diffusion models. Our code will be released. △ Less

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.12791 [pdf, other]

Dual-polarization huge photonic spin Hall shift and deep-subwavelength sensing based on topological singularities in one-dimensional photonic crystals

Authors: Yufu Liu, Xianjun Wang, Yunlin Li, Haoran Zhang, Langlang Xiong, Xingchao Qi, Zhen Lai, Xuezhi Wang, Xunya Jiang

Abstract: Although several efforts have been taken to enhance the photonic spin Hall shift in deep-subwavelength region, according to effective medium theory, the fundamental confliction between near-zero reflection coefficient and near-zero incident angle still hinders the further application. Here, we reveal a fundamental breakdown of effective medium theory due to the existing of topological singularity… ▽ More Although several efforts have been taken to enhance the photonic spin Hall shift in deep-subwavelength region, according to effective medium theory, the fundamental confliction between near-zero reflection coefficient and near-zero incident angle still hinders the further application. Here, we reveal a fundamental breakdown of effective medium theory due to the existing of topological singularity in deep-subwavelength region in one-dimensional photonic crystals. We find that near the topological singularity, huge photonic spin Hall shift can be achieved for s-polarization and p-polarization. At the topological singularity, the reflected filed is split as dipole-like distribution with zero photonic spin Hall shift for both-polarizations, which is resulted from the interfere of the spin-maintained normal light and spin-flipped abnormal light. Based on the theoretical research, dual-polarizations thickness and dielectric constant sensing devices can be designed in deep-subwavelength region. Further more, by applying more complicated layered structure, multi-channels dual-polarizations detection and broadband dual-polarizations huge spin Hall shift platform can be designed. This work paves the way to exploring the topological properties and polarization control of photonic crystals and provides a prospective method for the design of multi-channels sensitive detection spin optical devices. △ Less

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.05162 [pdf, other]

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Authors: Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson

Abstract: Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from util… ▽ More Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs. △ Less

Submitted 1 July, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: 22 pages, 9 figures. Project page is available at https://boyiwei.com/alignment-attribution/

arXiv:2402.04291 [pdf, other]

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Authors: Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

Abstract: Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can extremely reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintai… ▽ More Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can extremely reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM achieving for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families and evaluation metrics, outperforms SOTA quantization methods of LLM by significant margins. Moreover, BiLLM enables the binarization process of the LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency. Our code is available at https://github.com/Aaronhuang-778/BiLLM. △ Less

Submitted 15 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: 19 pages

arXiv:2402.03908 [pdf, other]

EscherNet: A Generative Model for Scalable View Synthesis

Authors: Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, Andrew J. Davison

Abstract: We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scala… ▽ More We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scalability in view synthesis -- it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed number of 3 reference views to 3 target views. As a result, EscherNet not only addresses zero-shot novel view synthesis, but also naturally unifies single- and multi-image 3D reconstruction, combining these diverse tasks into a single, cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance in multiple benchmarks, even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: https://kxhit.github.io/EscherNet. △ Less

Submitted 19 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: CVPR2024 Project Page: https://kxhit.github.io/EscherNet

arXiv:2402.03310 [pdf, other]

V-IRL: Grounding Virtual Intelligence in Real Life

Authors: Jihan Yang, Runyu Ding, Ellis Brown, Xiaojuan Qi, Saining Xie

Abstract: There is a sensory gulf between the Earth that humans inhabit and the digital realms in which modern AI agents are created. To develop AI agents that can sense, think, and act as flexibly as humans in real-world settings, it is imperative to bridge the realism gap between the digital and physical worlds. How can we embody agents in an environment as rich and diverse as the one we inhabit, without… ▽ More There is a sensory gulf between the Earth that humans inhabit and the digital realms in which modern AI agents are created. To develop AI agents that can sense, think, and act as flexibly as humans in real-world settings, it is imperative to bridge the realism gap between the digital and physical worlds. How can we embody agents in an environment as rich and diverse as the one we inhabit, without the constraints imposed by real hardware and control? Towards this end, we introduce V-IRL: a platform that enables agents to scalably interact with the real world in a virtual yet realistic environment. Our platform serves as a playground for developing agents that can accomplish various practical tasks and as a vast testbed for measuring progress in capabilities spanning perception, decision-making, and interaction with real-world data across the entire globe. △ Less

Submitted 18 July, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: Project page: https://virl-platform.github.io

arXiv:2401.11372 [pdf, other]

Back-stepping Experience Replay with Application to Model-free Reinforcement Learning for a Soft Snake Robot

Authors: Xinda Qi, Dong Chen, Zhaojian Li, Xiaobo Tan

Abstract: In this paper, we propose a novel technique, Back-stepping Experience Replay (BER), that is compatible with arbitrary off-policy reinforcement learning (RL) algorithms. BER aims to enhance learning efficiency in systems with approximate reversibility, reducing the need for complex reward shaping. The method constructs reversed trajectories using back-stepping transitions to reach random or fixed t… ▽ More In this paper, we propose a novel technique, Back-stepping Experience Replay (BER), that is compatible with arbitrary off-policy reinforcement learning (RL) algorithms. BER aims to enhance learning efficiency in systems with approximate reversibility, reducing the need for complex reward shaping. The method constructs reversed trajectories using back-stepping transitions to reach random or fixed targets. Interpretable as a bi-directional approach, BER addresses inaccuracies in back-stepping transitions through a distillation of the replay experience during learning. Given the intricate nature of soft robots and their complex interactions with environments, we present an application of BER in a model-free RL approach for the locomotion and navigation of a soft snake robot, which is capable of serpentine motion enabled by anisotropic friction between the body and ground. In addition, a dynamic simulator is developed to assess the effectiveness and efficiency of the BER algorithm, in which the robot demonstrates successful learning (reaching a 100% success rate) and adeptly reaches random targets, achieving an average speed 48% faster than that of the best baseline approach. △ Less

Submitted 20 January, 2024; originally announced January 2024.

Comments: Submitted to the IEEE for possible publication

arXiv:2401.06377 [pdf, other]

Design and Nonlinear Modeling of a Modular Cable Driven Soft Robotic Arm

Authors: Xinda Qi, Yu Mei, Dong Chen, Zhaojian Li, Xiaobo Tan

Abstract: We propose a novel multi-section cable-driven soft robotic arm inspired by octopus tentacles along with a new modeling approach. Each section of the modular manipulator is made of a soft tubing backbone, a soft silicon arm body, and two rigid endcaps, which connect adjacent sections and decouple the actuation cables of different sections. The soft robotic arm is made with casting after the rigid e… ▽ More We propose a novel multi-section cable-driven soft robotic arm inspired by octopus tentacles along with a new modeling approach. Each section of the modular manipulator is made of a soft tubing backbone, a soft silicon arm body, and two rigid endcaps, which connect adjacent sections and decouple the actuation cables of different sections. The soft robotic arm is made with casting after the rigid endcaps are 3D-printed, achieving low-cost and convenient fabrication. To capture the nonlinear effect of cables pushing into the soft silicon arm body, which results from the absence of intermediate rigid cable guides for higher compliance, an analytical static model is developed to capture the relationship between the bending curvature and the cable lengths. The proposed model shows superior prediction performance in experiments over that of a baseline model, especially under large bending conditions. Based on the nonlinear static model, a kinematic model of a multi-section arm is further developed and used to derive a motion planning algorithm. Experiments show that the proposed soft arm has high flexibility and a large workspace, and the tracking errors under the algorithm based on the proposed modeling approach are up to 52$\%$ smaller than those with the algorithm derived from the baseline model. The presented modeling approach is expected to be applicable to a broad range of soft cable-driven actuators and manipulators. △ Less

Submitted 15 May, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: The paper has been accepted by IEEE Transactions on Mechatronics

arXiv:2401.05750 [pdf, other]

GO-NeRF: Generating Virtual Objects in Neural Radiance Fields

Authors: Peng Dai, Feitong Tan, Xin Yu, Yinda Zhang, Xiaojuan Qi

Abstract: Despite advances in 3D generation, the direct creation of 3D objects within an existing 3D scene represented as NeRF remains underexplored. This process requires not only high-quality 3D object generation but also seamless composition of the generated 3D content into the existing NeRF. To this end, we propose a new method, GO-NeRF, capable of utilizing scene context for high-quality and harmonious… ▽ More Despite advances in 3D generation, the direct creation of 3D objects within an existing 3D scene represented as NeRF remains underexplored. This process requires not only high-quality 3D object generation but also seamless composition of the generated 3D content into the existing NeRF. To this end, we propose a new method, GO-NeRF, capable of utilizing scene context for high-quality and harmonious 3D object generation within an existing NeRF. Our method employs a compositional rendering formulation that allows the generated 3D objects to be seamlessly composited into the scene utilizing learned 3D-aware opacity maps without introducing unintended scene modification. Moreover, we also develop tailored optimization objectives and training strategies to enhance the model's ability to exploit scene context and mitigate artifacts, such as floaters, originating from 3D object generation within a scene. Extensive experiments on both feed-forward and $360^o$ scenes show the superior performance of our proposed GO-NeRF in generating objects harmoniously composited with surrounding scenes and synthesizing high-quality novel view images. Project page at {\url{https://daipengwa.github.io/GO-NeRF/}. △ Less

Submitted 11 January, 2024; originally announced January 2024.

Comments: 12 pages

MSC Class: ACM-class

arXiv:2401.02475 [pdf, other]

Space-time generalization of mutual information

Authors: Paolo Glorioso, Xiao-Liang Qi, Zhenbin Yang

Abstract: The mutual information characterizes correlations between spatially separated regions of a system. Yet, in experiments we often measure dynamical correlations, which involve probing operators that are also separated in time. Here, we introduce a space-time generalization of mutual information which, by construction, satisfies several natural properties of the mutual information and at the same tim… ▽ More The mutual information characterizes correlations between spatially separated regions of a system. Yet, in experiments we often measure dynamical correlations, which involve probing operators that are also separated in time. Here, we introduce a space-time generalization of mutual information which, by construction, satisfies several natural properties of the mutual information and at the same time characterizes correlations across subsystems that are separated in time. In particular, this quantity, that we call the \emph{space-time mutual information}, bounds all dynamical correlations. We construct this quantity based on the idea of the quantum hypothesis testing. As a by-product, our definition provides a transparent interpretation in terms of an experimentally accessible setup. We draw connections with other notions in quantum information theory, such as quantum channel discrimination. Finally, we study the behavior of the space-time mutual information in several settings and contrast its long-time behavior in many-body localizing and thermalizing systems. △ Less

Submitted 4 January, 2024; originally announced January 2024.

Comments: 33 pages, 12 figures

arXiv:2401.00149 [pdf, ps, other]

Properties of new even and odd nonlinear coherent states with different parameters

Authors: Cheng Zhang, Rui-Jiao Miao, Xiao-Qiu Qi

Abstract: We construct a class of nonlinear coherent states (NLCSs) by introducing a more general nonlinear function and study their non-classical properties, specifically the second-order correlation function $g^{(2)}(0)$, Mandel parameter $Q$, squeezing, amplitude squared squeezing and Wigner function of the optical field. The results indicate that the non-classical properties of the new types of even and… ▽ More We construct a class of nonlinear coherent states (NLCSs) by introducing a more general nonlinear function and study their non-classical properties, specifically the second-order correlation function $g^{(2)}(0)$, Mandel parameter $Q$, squeezing, amplitude squared squeezing and Wigner function of the optical field. The results indicate that the non-classical properties of the new types of even and odd NLCSs crucially depend on nonlinear functions. More concretely, we find that the new even NLCSs could exhibit the photon-bunching effect whereas the new odd NLCSs could show photon-antibunching effect. The degree of squeezing is also significantly affected by the parameter selection of these NLCSs. By employing various forms of nonlinear functions, it becomes possible to construct NLCSs with diverse properties, thereby providing a theoretical foundation for corresponding experimental investigations. △ Less

Submitted 30 December, 2023; originally announced January 2024.

arXiv:2312.14937 [pdf, other]

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Authors: Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, Xiaojuan Qi

Abstract: Novel view synthesis for dynamic scenes is still a challenging problem in computer vision and graphics. Recently, Gaussian splatting has emerged as a robust technique to represent static scenes and enable high-quality and real-time novel view synthesis. Building upon this technique, we propose a new representation that explicitly decomposes the motion and appearance of dynamic scenes into sparse c… ▽ More Novel view synthesis for dynamic scenes is still a challenging problem in computer vision and graphics. Recently, Gaussian splatting has emerged as a robust technique to represent static scenes and enable high-quality and real-time novel view synthesis. Building upon this technique, we propose a new representation that explicitly decomposes the motion and appearance of dynamic scenes into sparse control points and dense Gaussians, respectively. Our key idea is to use sparse control points, significantly fewer in number than the Gaussians, to learn compact 6 DoF transformation bases, which can be locally interpolated through learned interpolation weights to yield the motion field of 3D Gaussians. We employ a deformation MLP to predict time-varying 6 DoF transformations for each control point, which reduces learning complexities, enhances learning abilities, and facilitates obtaining temporal and spatial coherent motion patterns. Then, we jointly learn the 3D Gaussians, the canonical space locations of control points, and the deformation MLP to reconstruct the appearance, geometry, and dynamics of 3D scenes. During learning, the location and number of control points are adaptively adjusted to accommodate varying motion complexities in different regions, and an ARAP loss following the principle of as rigid as possible is developed to enforce spatial continuity and local rigidity of learned motions. Finally, thanks to the explicit sparse motion representation and its decomposition from appearance, our method can enable user-controlled motion editing while retaining high-fidelity appearances. Extensive experiments demonstrate that our approach outperforms existing approaches on novel view synthesis with a high rendering speed and enables novel appearance-preserved motion editing applications. Project page: https://yihua7.github.io/SC-GS-web/ △ Less

Submitted 31 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: Code link: https://github.com/yihua7/SC-GS

arXiv:2312.12824 [pdf, other]

FedSODA: Federated Cross-assessment and Dynamic Aggregation for Histopathology Segmentation

Authors: Yuan Zhang, Yaolei Qi, Xiaoming Qi, Lotfi Senhadji, Yongyue Wei, Feng Chen, Guanyu Yang

Abstract: Federated learning (FL) for histopathology image segmentation involving multiple medical sites plays a crucial role in advancing the field of accurate disease diagnosis and treatment. However, it is still a task of great challenges due to the sample imbalance across clients and large data heterogeneity from disparate organs, variable segmentation tasks, and diverse distribution. Thus, we propose a… ▽ More Federated learning (FL) for histopathology image segmentation involving multiple medical sites plays a crucial role in advancing the field of accurate disease diagnosis and treatment. However, it is still a task of great challenges due to the sample imbalance across clients and large data heterogeneity from disparate organs, variable segmentation tasks, and diverse distribution. Thus, we propose a novel FL approach for histopathology nuclei and tissue segmentation, FedSODA, via synthetic-driven cross-assessment operation (SO) and dynamic stratified-layer aggregation (DA). Our SO constructs a cross-assessment strategy to connect clients and mitigate the representation bias under sample imbalance. Our DA utilizes layer-wise interaction and dynamic aggregation to diminish heterogeneity and enhance generalization. The effectiveness of our FedSODA has been evaluated on the most extensive histopathology image segmentation dataset from 7 independent datasets. The code is available at https://github.com/yuanzhang7/FedSODA. △ Less

Submitted 20 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP2024

arXiv:2312.11960 [pdf, ps, other]

Offensive Alliances in Signed Graphs

Authors: Zhidan Feng, Henning Fernau, Kevin Mann, Xingqin Qi

Abstract: Signed graphs have been introduced to enrich graph structures expressing relationships between persons or general social entities, introducing edge signs to reflect the nature of the relationship, e.g., friendship or enmity. Independently, offensive alliances have been defined and studied for undirected, unsigned graphs. We join both lines of research and define offensive alliances in signed graph… ▽ More Signed graphs have been introduced to enrich graph structures expressing relationships between persons or general social entities, introducing edge signs to reflect the nature of the relationship, e.g., friendship or enmity. Independently, offensive alliances have been defined and studied for undirected, unsigned graphs. We join both lines of research and define offensive alliances in signed graphs, hence considering the nature of relationships. Apart from some combinatorial results, mainly on k-balanced and k-anti-balanced signed graphs (where the latter is a newly introduced family of signed graphs), we focus on the algorithmic complexity of finding smallest offensive alliances, looking at a number of parameterizations. While the parameter solution size leads to an FPT result for unsigned graphs, we obtain W[2]-completeness for the signed setting. We introduce new parameters for signed graphs, e.g., distance to weakly balanced signed graphs, that could be of independent interest. We show that these parameters yield FPT results. Here, we make use of the recently introduced parameter neighborhood diversity for signed graphs. △ Less

Submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.11843 [pdf, other]

Enhancing Social Decision-Making of Autonomous Vehicles: A Mixed-Strategy Game Approach With Interaction Orientation Identification

Authors: Jiaqi Liu, Xiao Qi, Peng Hang, Jian Sun

Abstract: The integration of Autonomous Vehicles (AVs) into existing human-driven traffic systems poses considerable challenges, especially within environments where human and machine interactions are frequent and complex, such as at unsignalized intersections. To deal with these challenges, we introduce a novel framework predicated on dynamic and socially-aware decision-making game theory to augment the so… ▽ More The integration of Autonomous Vehicles (AVs) into existing human-driven traffic systems poses considerable challenges, especially within environments where human and machine interactions are frequent and complex, such as at unsignalized intersections. To deal with these challenges, we introduce a novel framework predicated on dynamic and socially-aware decision-making game theory to augment the social decision-making prowess of AVs in mixed driving environments. This comprehensive framework is delineated into three primary modules: Interaction Orientation Identification, Mixed-Strategy Game Modeling, and Expert Mode Learning. We introduce 'Interaction Orientation' as a metric to evaluate the social decision-making tendencies of various agents, incorporating both environmental factors and trajectory characteristics. The mixed-strategy game model developed as part of this framework considers the evolution of future traffic scenarios and includes a utility function that balances safety, operational efficiency, and the unpredictability of environmental conditions. To adapt to real-world driving complexities, our framework utilizes a dynamic optimization framework for assimilating and learning from expert human driving strategies. These strategies are compiled into a comprehensive strategy library, serving as a reference for future decision-making processes. The proposed approach is validated through extensive driving datasets and human-in-loop driving experiments, and the results demonstrate marked enhancements in decision timing and precision. △ Less

Submitted 4 April, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.11084 [pdf, other]

Multi-Agent Reinforcement Learning for Connected and Automated Vehicles Control: Recent Advancements and Future Prospects

Authors: Min Hua, Dong Chen, Xinda Qi, Kun Jiang, Zemin Eitan Liu, Quan Zhou, Hongming Xu

Abstract: Connected and automated vehicles (CAVs) have emerged as a potential solution to the future challenges of developing safe, efficient, and eco-friendly transportation systems. However, CAV control presents significant challenges, given the complexity of interconnectivity and coordination required among the vehicles. To address this, multi-agent reinforcement learning (MARL), with its notable advance… ▽ More Connected and automated vehicles (CAVs) have emerged as a potential solution to the future challenges of developing safe, efficient, and eco-friendly transportation systems. However, CAV control presents significant challenges, given the complexity of interconnectivity and coordination required among the vehicles. To address this, multi-agent reinforcement learning (MARL), with its notable advancements in addressing complex problems in autonomous driving, robotics, and human-vehicle interaction, has emerged as a promising tool for enhancing the capabilities of CAVs. However, there is a notable absence of current reviews on the state-of-the-art MARL algorithms in the context of CAVs. Therefore, this paper delivers a comprehensive review of the application of MARL techniques within the field of CAV control. The paper begins by introducing MARL, followed by a detailed explanation of its unique advantages in addressing complex mobility and traffic scenarios that involve multiple agents. It then presents a comprehensive survey of MARL applications on the extent of control dimensions for CAVs, covering critical and typical scenarios such as platooning control, lane-changing, and unsignalized intersections. In addition, the paper provides a comprehensive review of the prominent simulation platforms used to create reliable environments for training in MARL. Lastly, the paper examines the current challenges associated with deploying MARL within CAV control and outlines potential solutions that can effectively overcome these issues. Through this review, the study highlights the tremendous potential of MARL to enhance the performance and collaboration of CAV control in terms of safety, travel efficiency, and economy. △ Less

Submitted 16 January, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.09424 [pdf, other]

Open Domain Knowledge Extraction for Knowledge Graphs

Authors: Kun Qian, Anton Belyi, Fei Wu, Samira Khorshidi, Azadeh Nikfarjam, Rahul Khot, Yisi Sang, Katherine Luna, Xianqi Chu, Eric Choi, Yash Govind, Chloe Seivwright, Yiwen Sun, Ahmed Fakhry, Theo Rekatsinas, Ihab Ilyas, Xiaoguang Qi, Yunyao Li

Abstract: The quality of a knowledge graph directly impacts the quality of downstream applications (e.g. the number of answerable questions using the graph). One ongoing challenge when building a knowledge graph is to ensure completeness and freshness of the graph's entities and facts. In this paper, we introduce ODKE, a scalable and extensible framework that sources high-quality entities and facts from ope… ▽ More The quality of a knowledge graph directly impacts the quality of downstream applications (e.g. the number of answerable questions using the graph). One ongoing challenge when building a knowledge graph is to ensure completeness and freshness of the graph's entities and facts. In this paper, we introduce ODKE, a scalable and extensible framework that sources high-quality entities and facts from open web at scale. ODKE utilizes a wide range of extraction models and supports both streaming and batch processing at different latency. We reflect on the challenges and design decisions made and share lessons learned when building and deploying ODKE to grow an industry-scale open domain knowledge graph. △ Less

Submitted 30 October, 2023; originally announced December 2023.

Comments: 7 pages, 7 figures, 5 tables, preprint technical report, no code or data is released

MSC Class: 68T30 (primary) ACM Class: F.4.1; I.2.4

arXiv:2312.09262 [pdf, other]

Random resistive memory-based deep extreme point learning machine for unified visual processing

Authors: Shaocong Wang, Yizhao Gao, Yi Li, Woyu Zhang, Yifei Yu, Bo Wang, Ning Lin, Hegan Chen, Yue Zhang, Yang Jiang, Dingchen Wang, Jia Chen, Peng Dai, Hao Jiang, Peng Lin, Xumeng Zhang, Xiaojuan Qi, Xiaoxin Xu, Hayden So, Zhongrui Wang, Dashan Shang, Qi Liu, Kwang-Ting Cheng, Ming Liu

Abstract: Visual sensors, including 3D LiDAR, neuromorphic DVS sensors, and conventional frame cameras, are increasingly integrated into edge-side intelligent machines. Realizing intensive multi-sensory data analysis directly on edge intelligent machines is crucial for numerous emerging edge applications, such as augmented and virtual reality and unmanned aerial vehicles, which necessitates unified data rep… ▽ More Visual sensors, including 3D LiDAR, neuromorphic DVS sensors, and conventional frame cameras, are increasingly integrated into edge-side intelligent machines. Realizing intensive multi-sensory data analysis directly on edge intelligent machines is crucial for numerous emerging edge applications, such as augmented and virtual reality and unmanned aerial vehicles, which necessitates unified data representation, unprecedented hardware energy efficiency and rapid model training. However, multi-sensory data are intrinsically heterogeneous, causing significant complexity in the system development for edge-side intelligent machines. In addition, the performance of conventional digital hardware is limited by the physically separated processing and memory units, known as the von Neumann bottleneck, and the physical limit of transistor scaling, which contributes to the slowdown of Moore's law. These limitations are further intensified by the tedious training of models with ever-increasing sizes. We propose a novel hardware-software co-design, random resistive memory-based deep extreme point learning machine (DEPLM), that offers efficient unified point set analysis. We show the system's versatility across various data modalities and two different learning tasks. Compared to a conventional digital hardware-based system, our co-design system achieves huge energy efficiency improvements and training cost reduction when compared to conventional systems. Our random resistive memory-based deep extreme point learning machine may pave the way for energy-efficient and training-friendly edge AI across various data modalities and tasks. △ Less

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2312.08754 [pdf, other]

UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation

Authors: Zexiang Liu, Yangguang Li, Youtian Lin, Xin Yu, Sida Peng, Yan-Pei Cao, Xiaojuan Qi, Xiaoshui Huang, Ding Liang, Wanli Ouyang

Abstract: Recent advancements in text-to-3D generation technology have significantly advanced the conversion of textual descriptions into imaginative well-geometrical and finely textured 3D objects. Despite these developments, a prevalent limitation arises from the use of RGB data in diffusion or reconstruction models, which often results in models with inherent lighting and shadows effects that detract fro… ▽ More Recent advancements in text-to-3D generation technology have significantly advanced the conversion of textual descriptions into imaginative well-geometrical and finely textured 3D objects. Despite these developments, a prevalent limitation arises from the use of RGB data in diffusion or reconstruction models, which often results in models with inherent lighting and shadows effects that detract from their realism, thereby limiting their usability in applications that demand accurate relighting capabilities. To bridge this gap, we present UniDream, a text-to-3D generation framework by incorporating unified diffusion priors. Our approach consists of three main components: (1) a dual-phase training process to get albedo-normal aligned multi-view diffusion and reconstruction models, (2) a progressive generation procedure for geometry and albedo-textures based on Score Distillation Sample (SDS) using the trained reconstruction and diffusion models, and (3) an innovative application of SDS for finalizing PBR generation while keeping a fixed albedo based on Stable Diffusion model. Extensive evaluations demonstrate that UniDream surpasses existing methods in generating 3D objects with clearer albedo textures, smoother surfaces, enhanced realism, and superior relighting capabilities. △ Less

Submitted 13 July, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: Accepted to ECCV 2024

arXiv:2312.04044 [pdf, other]

Residual Graph Convolutional Network for Bird's-Eye-View Semantic Segmentation

Authors: Qiuxiao Chen, Xiaojun Qi

Abstract: Retrieving spatial information and understanding the semantic information of the surroundings are important for Bird's-Eye-View (BEV) semantic segmentation. In the application of autonomous driving, autonomous vehicles need to be aware of their surroundings to drive safely. However, current BEV semantic segmentation techniques, deep Convolutional Neural Networks (CNNs) and transformers, have diffi… ▽ More Retrieving spatial information and understanding the semantic information of the surroundings are important for Bird's-Eye-View (BEV) semantic segmentation. In the application of autonomous driving, autonomous vehicles need to be aware of their surroundings to drive safely. However, current BEV semantic segmentation techniques, deep Convolutional Neural Networks (CNNs) and transformers, have difficulties in obtaining the global semantic relationships of the surroundings at the early layers of the network. In this paper, we propose to incorporate a novel Residual Graph Convolutional (RGC) module in deep CNNs to acquire both the global information and the region-level semantic relationship in the multi-view image domain. Specifically, the RGC module employs a non-overlapping graph space projection to efficiently project the complete BEV information into graph space. It then builds interconnected spatial and channel graphs to extract spatial information between each node and channel information within each node (i.e., extract contextual relationships of the global features). Furthermore, it uses a downsample residual process to enhance the coordinate feature reuse to maintain the global information. The segmentation data augmentation and alignment module helps to simultaneously augment and align BEV features and ground truth to geometrically preserve their alignment to achieve better segmentation results. Our experimental results on the nuScenes benchmark dataset demonstrate that the RGC network outperforms four state-of-the-art networks and its four variants in terms of IoU and mIoU. The proposed RGC network achieves a higher mIoU of 3.1% than the best state-of-the-art network, BEVFusion. Code and models will be released. △ Less

Submitted 7 December, 2023; originally announced December 2023.

Comments: 8 pages, 5 figures, this paper has been accepted by and will be presented at the WACV 2024

arXiv:2311.17963 [pdf, other]

M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation

Authors: Xiaowei Chi, Rongyu Zhang, Zhengkai Jiang, Yijiang Liu, Yatian Wang, Xingqun Qi, Wenhan Luo, Peng Gao, Shanghang Zhang, Qifeng Liu, Yike Guo

Abstract: While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generations, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose \textbf{$M^{2}Chat$}, a novel unified multimodal LLM framework for generating interleaved text-image conversation across various… ▽ More While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generations, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose \textbf{$M^{2}Chat$}, a novel unified multimodal LLM framework for generating interleaved text-image conversation across various scenarios. Specifically, we propose an $M^{3}Adapter$ that efficiently integrates granular low-level visual information and high-level semantic features from multi-modality prompts. Upon the well-aligned fused feature, $M^{3}Adapter$ tailors a learnable gating strategy to balance the model creativity and consistency across various tasks adaptively. Moreover, to further enhance the effectiveness of $M^{3}Adapter$ while preserving the coherence of semantic context comprehension, we introduce a two-stage $M^{3}FT$ fine-tuning strategy. This strategy optimizes disjoint groups of parameters for image-text alignment and visual-instruction respectively. Extensive experiments demonstrate our $M^{2}Chat$ surpasses state-of-the-art counterparts across diverse benchmarks, showcasing its prowess in interleaving generation, storytelling, and multimodal dialogue systems. The demo and code are available at \red{https://mattie-e.github.io/M2Chat.github.io}. △ Less

Submitted 13 April, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.17532 [pdf, other]

Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

Authors: Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, Yike Guo

Abstract: Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While the existing methods enable generating the gestures to follow a single emotion label, they overlook that long gesture sequence modeling with emotion transition is more practical in real scenes. In addition, the lack of large-scale available datasets with emo… ▽ More Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While the existing methods enable generating the gestures to follow a single emotion label, they overlook that long gesture sequence modeling with emotion transition is more practical in real scenes. In addition, the lack of large-scale available datasets with emotional transition speech and corresponding 3D human gestures also limits the addressing of this task. To fulfill this goal, we first incorporate the ChatGPT-4 and an audio inpainting approach to construct the high-fidelity emotion transition human speeches. Considering obtaining the realistic 3D pose annotations corresponding to the dynamically inpainted emotion transition audio is extremely difficult, we propose a novel weakly supervised training strategy to encourage authority gesture transitions. Specifically, to enhance the coordination of transition gestures w.r.t different emotional ones, we model the temporal association representation between two different emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision based on a learnable mixed emotion label for transition gestures. Last, we present a keyframe sampler to supply effective initial posture cues in long sequences, enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts on our newly defined emotion transition task and datasets. Our code and dataset will be released on the project page: https://xingqunqi-lab.github.io/Emo-Transition-Gesture/. △ Less

Submitted 27 March, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: Accepted by CVPR 2024

arXiv:2311.13200 [pdf, other]

Self-guided Few-shot Semantic Segmentation for Remote Sensing Imagery Based on Large Vision Models

Authors: Xiyu Qi, Yifan Wu, Yongqiang Mao, Wenhui Zhang, Yidan Zhang

Abstract: The Segment Anything Model (SAM) exhibits remarkable versatility and zero-shot learning abilities, owing largely to its extensive training data (SA-1B). Recognizing SAM's dependency on manual guidance given its category-agnostic nature, we identified unexplored potential within few-shot semantic segmentation tasks for remote sensing imagery. This research introduces a structured framework designed… ▽ More The Segment Anything Model (SAM) exhibits remarkable versatility and zero-shot learning abilities, owing largely to its extensive training data (SA-1B). Recognizing SAM's dependency on manual guidance given its category-agnostic nature, we identified unexplored potential within few-shot semantic segmentation tasks for remote sensing imagery. This research introduces a structured framework designed for the automation of few-shot semantic segmentation. It utilizes the SAM model and facilitates a more efficient generation of semantically discernible segmentation outcomes. Central to our methodology is a novel automatic prompt learning approach, leveraging prior guided masks to produce coarse pixel-wise prompts for SAM. Extensive experiments on the DLRSD datasets underline the superiority of our approach, outperforming other available few-shot methodologies. △ Less

Submitted 22 November, 2023; originally announced November 2023.

arXiv:2311.12315 [pdf, other]

AcademicGPT: Empowering Academic Research

Authors: Shufa Wei, Xiaolong Xu, Xianbiao Qi, Xi Yin, Jun Xia, Jingyi Ren, Peijun Tang, Yuxiang Zhong, Yihao Chen, Xiaoqin Ren, Yuxin Liang, Liankai Huang, Kai Xie, Weikang Gui, Wei Tan, Shuanglong Sun, Yongquan Hu, Qinxian Liu, Nanjin Li, Chihao Dai, Lihua Wang, Xiaohui Liu, Lei Zhang, Yutao Xie

Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Yet, many of these advanced LLMs are tailored for broad, general-purpose applications. In this technical report, we introduce AcademicGPT, designed specifically to empower academic research. AcademicGPT is a continual training model derived from LLaMA2-70B. Our training corpus… ▽ More Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Yet, many of these advanced LLMs are tailored for broad, general-purpose applications. In this technical report, we introduce AcademicGPT, designed specifically to empower academic research. AcademicGPT is a continual training model derived from LLaMA2-70B. Our training corpus mainly consists of academic papers, thesis, content from some academic domain, high-quality Chinese data and others. While it may not be extensive in data scale, AcademicGPT marks our initial venture into a domain-specific GPT tailored for research area. We evaluate AcademicGPT on several established public benchmarks such as MMLU and CEval, as well as on some specialized academic benchmarks like PubMedQA, SCIEval, and our newly-created ComputerScienceQA, to demonstrate its ability from general knowledge ability, to Chinese ability, and to academic ability. Building upon AcademicGPT's foundation model, we also developed several applications catered to the academic area, including General Academic Question Answering, AI-assisted Paper Reading, Paper Review, and AI-assisted Title and Abstract Generation. △ Less

Submitted 20 November, 2023; originally announced November 2023.

Comments: Technical Report. arXiv admin note: text overlap with arXiv:2310.12081, arXiv:2310.10053 by other authors

arXiv:2311.08424 [pdf, other]

Exploring Multi-Programming-Language Commits and Their Impacts on Software Quality: An Empirical Study on Apache Projects

Authors: Zengyang Li, Xiaoxiao Qi, Qinyi Yu, Peng Liang, Ran Mo, Chen Yang

Abstract: Context: Modern software systems (e.g., Apache Spark) are usually written in multiple programming languages (PLs). There is little understanding on the phenomenon of multi-programming-language commits (MPLCs), which involve modified source files written in multiple PLs. Objective: This work aims to explore MPLCs and their impacts on development difficulty and software quality. Methods: We performe… ▽ More Context: Modern software systems (e.g., Apache Spark) are usually written in multiple programming languages (PLs). There is little understanding on the phenomenon of multi-programming-language commits (MPLCs), which involve modified source files written in multiple PLs. Objective: This work aims to explore MPLCs and their impacts on development difficulty and software quality. Methods: We performed an empirical study on eighteen non-trivial Apache projects with 197,566 commits. Results: (1) the most commonly used PL combination consists of all the four PLs, i.e., C/C++, Java, JavaScript, and Python; (2) 9% of the commits from all the projects are MPLCs, and the proportion of MPLCs in 83% of the projects goes to a relatively stable level; (3) more than 90% of the MPLCs from all the projects involve source files in two PLs; (4) the change complexity of MPLCs is significantly higher than that of non-MPLCs; (5) issues fixed in MPLCs take significantly longer to be resolved than issues fixed in non-MPLCs in 89% of the projects; (6) MPLCs do not show significant effects on issue reopen; (7) source files undergoing MPLCs tend to be more bug-prone; and (8) MPLCs introduce more bugs than non-MPLCs. Conclusions: MPLCs are related to increased development difficulty and decreased software quality. △ Less

Submitted 12 November, 2023; originally announced November 2023.

Comments: Preprint accepted for publication in Journal of Systems and Software, 2022. arXiv admin note: substantial text overlap with arXiv:2103.11691

arXiv:2311.07164 [pdf, other]

Pruning random resistive memory for optimizing analogue AI

Authors: Yi Li, Songqi Wang, Yaping Zhao, Shaocong Wang, Woyu Zhang, Yangu He, Ning Lin, Binbin Cui, Xi Chen, Shiming Zhang, Hao Jiang, Peng Lin, Xumeng Zhang, Xiaojuan Qi, Zhongrui Wang, Xiaoxin Xu, Dashan Shang, Qi Liu, Kwang-Ting Cheng, Ming Liu

Abstract: The rapid advancement of artificial intelligence (AI) has been marked by the large language models exhibiting human-like intelligence. However, these models also present unprecedented challenges to energy consumption and environmental sustainability. One promising solution is to revisit analogue computing, a technique that predates digital computing and exploits emerging analogue electronic device… ▽ More The rapid advancement of artificial intelligence (AI) has been marked by the large language models exhibiting human-like intelligence. However, these models also present unprecedented challenges to energy consumption and environmental sustainability. One promising solution is to revisit analogue computing, a technique that predates digital computing and exploits emerging analogue electronic devices, such as resistive memory, which features in-memory computing, high scalability, and nonvolatility. However, analogue computing still faces the same challenges as before: programming nonidealities and expensive programming due to the underlying devices physics. Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning to optimize the topology of a randomly weighted analogue resistive memory neural network. Software-wise, the topology of a randomly weighted neural network is optimized by pruning connections rather than precisely tuning resistive memory weights. Hardware-wise, we reveal the physical origin of the programming stochasticity using transmission electron microscopy, which is leveraged for large-scale and low-cost implementation of an overparameterized random neural network containing high-performance sub-networks. We implemented the co-design on a 40nm 256K resistive memory macro, observing 17.3% and 19.9% accuracy improvements in image and audio classification on FashionMNIST and Spoken digits datasets, as well as 9.8% (2%) improvement in PR (ROC) in image segmentation on DRIVE datasets, respectively. This is accompanied by 82.1%, 51.2%, and 99.8% improvement in energy efficiency thanks to analogue in-memory computing. By embracing the intrinsic stochasticity and in-memory computing, this work may solve the biggest obstacle of analogue computing systems and thus unleash their immense potential for next-generation AI hardware. △ Less

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.06956 [pdf, other]

SegReg: Segmenting OARs by Registering MR Images and CT Annotations

Authors: Zeyu Zhang, Xuyin Qi, Bowen Zhang, Biao Wu, Hien Le, Bora Jeong, Zhibin Liao, Yunxiang Liu, Johan Verjans, Minh-Son To, Richard Hartley

Abstract: Organ at risk (OAR) segmentation is a critical process in radiotherapy treatment planning such as head and neck tumors. Nevertheless, in clinical practice, radiation oncologists predominantly perform OAR segmentations manually on CT scans. This manual process is highly time-consuming and expensive, limiting the number of patients who can receive timely radiotherapy. Additionally, CT scans offer lo… ▽ More Organ at risk (OAR) segmentation is a critical process in radiotherapy treatment planning such as head and neck tumors. Nevertheless, in clinical practice, radiation oncologists predominantly perform OAR segmentations manually on CT scans. This manual process is highly time-consuming and expensive, limiting the number of patients who can receive timely radiotherapy. Additionally, CT scans offer lower soft-tissue contrast compared to MRI. Despite MRI providing superior soft-tissue visualization, its time-consuming nature makes it infeasible for real-time treatment planning. To address these challenges, we propose a method called SegReg, which utilizes Elastic Symmetric Normalization for registering MRI to perform OAR segmentation. SegReg outperforms the CT-only baseline by 16.78% in mDSC and 18.77% in mIoU, showing that it effectively combines the geometric accuracy of CT with the superior soft-tissue contrast of MRI, making accurate automated OAR segmentation for clinical practice become possible. See project website https://steve-zeyu-zhang.github.io/SegReg △ Less

Submitted 1 March, 2024; v1 submitted 12 November, 2023; originally announced November 2023.

Comments: Accepted to ISBI 2024

arXiv:2311.05470 [pdf, other]

Designing ship hull forms using generative adversarial networks

Authors: Kazuo Yonekura, Kotaro Omori, Xinran Qi, Katsuyuki Suzuki

Abstract: We proposed a GAN-based method to generate a ship hull form. Unlike mathematical hull forms that require geometrical parameters to generate ship hull forms, the proposed method requires desirable ship performance parameters, i.e., the drag coefficient and tonnage. The requirements of ship owners are generally focused on the ship performance and not the geometry itself. Hence, the proposed model is… ▽ More We proposed a GAN-based method to generate a ship hull form. Unlike mathematical hull forms that require geometrical parameters to generate ship hull forms, the proposed method requires desirable ship performance parameters, i.e., the drag coefficient and tonnage. The requirements of ship owners are generally focused on the ship performance and not the geometry itself. Hence, the proposed model is useful for obtaining the ship hull form based on an owner's requirements. The GAN model was trained using a ship hull form dataset generated using the generalized Wigley hull form. The proposed method was evaluated through numerical experiments and successfully generated ship data with small errors. △ Less

Submitted 9 November, 2023; originally announced November 2023.

Showing 51–100 of 623 results for author: Qi, X