EXIM: A Hybrid Explicit-Implicit Representation for Text-Guided 3D Shape Generation

Authors: Zhengzhe Liu, Jingyu Hu, Ka-Hei Hui, Xiaojuan Qi, Daniel Cohen-Or, Chi-Wing Fu

Abstract: This paper presents a new text-guided technique for generating 3D shapes. The technique leverages a hybrid 3D shape representation, namely EXIM, combining the strengths of explicit and implicit representations. Specifically, the explicit stage controls the topology of the generated 3D shapes and enables local modifications, whereas the implicit stage refines the shape and paints it with plausible colors. Also, the hybrid approach separates the shape and color and generates color conditioned on shape to ensure shape-color consistency. Unlike the existing state-of-the-art methods, we achieve high-fidelity shape generation from natural-language descriptions without the need for time-consuming per-shape optimization or reliance on human-annotated texts during training or test-time optimization. Further, we demonstrate the applicability of our approach to generate indoor scenes with consistent styles using text-induced 3D shapes. Through extensive experiments, we demonstrate the compelling quality of our results and the high coherency of our generated shapes with the input texts, surpassing the performance of existing methods by a significant margin. Code and models are released at https://github.com/liuzhengzhe/EXIM.

Submitted 30 November, 2023; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: SIGGRAPH Asia 2023 & TOG Project page: https://liuzhengzhe.github.io/EXIM.github.io/

arXiv:2310.19415 [pdf, other]

Text-to-3D with Classifier Score Distillation

Authors: Xin Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Song-Hai Zhang, Xiaojuan Qi

Abstract: Text-to-3D generation has made remarkable progress recently, particularly with methods based on Score Distillation Sampling (SDS) that leverage pre-trained 2D diffusion models. While the usage of classifier-free guidance is well acknowledged to be crucial for successful optimization, it is considered an auxiliary trick rather than the most essential component. In this paper, we re-evaluate the role of classifier-free guidance in score distillation and discover a surprising finding: the guidance alone is enough for effective text-to-3D generation tasks. We name this method Classifier Score Distillation (CSD), which can be interpreted as using an implicit classification model for generation. This new perspective reveals new insights for understanding existing techniques. We validate the effectiveness of CSD across a variety of text-to-3D tasks including shape generation, texture synthesis, and shape editing, achieving results superior to those of state-of-the-art methods. Our project page is https://xinyu-andy.github.io/Classifier-Score-Distillation
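The contrast the abstract draws between SDS with classifier-free guidance and "the guidance alone" can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, noise predictions are plain lists of floats standing in for image-shaped tensors, and the timestep weighting and the Jacobian of the renderer are omitted.

```python
# Illustrative sketch only: names and the flat-list representation are
# assumptions made for this example, not taken from the CSD codebase.

def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: push the unconditional prediction
    toward the text-conditioned one by a factor of guidance_scale."""
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def sds_direction(eps_cond, eps_uncond, noise, guidance_scale):
    """SDS-style update direction: guided noise prediction minus the
    noise that was actually added to the rendered image."""
    guided = cfg_noise(eps_cond, eps_uncond, guidance_scale)
    return [g - n for g, n in zip(guided, noise)]

def csd_direction(eps_cond, eps_uncond):
    """Guidance-only direction: the difference between conditional and
    unconditional predictions, with the injected-noise term dropped."""
    return [c - u for c, u in zip(eps_cond, eps_uncond)]
```

Under this reading, the guidance-only direction is the part of the SDS direction that scales with the guidance weight, which is what the abstract describes as an implicit classifier.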

Submitted 31 October, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

Comments: Our project page is https://xinyu-andy.github.io/Classifier-Score-Distillation

arXiv:2310.18725 [pdf, other]

The Evolution of the Interplay Between Input Distributions and Linear Regions in Networks

Authors: Xuan Qi, Yi Wei

Abstract: It is commonly recognized that the expressiveness of deep neural networks is contingent upon a range of factors, encompassing their depth, width, and other relevant considerations. Currently, the practical performance of the majority of deep neural networks remains uncertain. For ReLU (Rectified Linear Unit) networks with piecewise linear activations, the number of linear convex regions serves as a natural metric to gauge the network's expressivity. In this paper, we count the number of linear convex regions in deep neural networks based on ReLU. In particular, we prove that for any one-dimensional input, there exists a minimum threshold for the number of neurons required to express it. We also empirically observe that for the same network, intricate inputs hinder its capacity to express linear regions. Furthermore, we unveil the iterative refinement process of decision boundaries in ReLU networks during training. We aspire for our research to serve as an inspiration for network optimization endeavors and to aid in the exploration and analysis of the behaviors exhibited by deep networks.
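To make "counting linear regions" concrete for a one-dimensional input, the sketch below samples an interval and counts how often the ReLU on/off pattern changes. It is an assumption-laden illustration, not the paper's counting procedure: the function names and the grid-sampling approach are hypothetical, the count is of activation-pattern regions (an upper bound on distinct linear pieces of the output), and regions narrower than the grid spacing are missed.

```python
# Illustrative sketch only: a sampling-based count of activation-pattern
# regions for a small ReLU MLP on a 1-D input interval.

def activation_pattern(x, layers):
    """Return the on/off state of every ReLU unit for scalar input x.
    `layers` is a list of (weights, biases); weights is a list of rows."""
    h = [x]
    pattern = []
    for weights, biases in layers:
        pre = [sum(w * v for w, v in zip(row, h)) + b
               for row, b in zip(weights, biases)]
        pattern.extend(p > 0 for p in pre)
        h = [max(p, 0.0) for p in pre]  # ReLU
    return tuple(pattern)

def count_regions_1d(layers, lo, hi, samples=2001):
    """Count maximal runs of identical activation patterns on [lo, hi]."""
    count = 1
    prev = activation_pattern(lo, layers)
    for k in range(1, samples):
        x = lo + (hi - lo) * k / (samples - 1)
        cur = activation_pattern(x, layers)
        if cur != prev:
            count += 1
            prev = cur
    return count

# One hidden layer with two units whose kinks sit at x = 0 and x = 1.
toy_layers = [([[1.0], [1.0]], [0.0, -1.0])]
toy_regions = count_regions_1d(toy_layers, -2.0, 2.0)
```

For the toy network, two kinks split the interval into three pieces, so `toy_regions` is 3.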

Submitted 6 November, 2023; v1 submitted 28 October, 2023; originally announced October 2023.

Comments: Under review

arXiv:2310.16667 [pdf, other]

CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

Authors: Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi

Abstract: Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept shall exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performances and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.

Submitted 25 October, 2023; originally announced October 2023.

Comments: Accepted by NeurIPS 2023

arXiv:2310.14664 [pdf, other]

Data Pruning via Moving-one-Sample-out

Authors: Haoru Tan, Sitong Wu, Fei Du, Yukang Chen, Zhibin Wang, Fan Wang, Xiaojuan Qi

Abstract: In this paper, we propose a novel data-pruning approach called moving-one-sample-out (MoSo), which aims to identify and remove the least informative samples from the training set. The core insight behind MoSo is to determine the importance of each sample by assessing its impact on the optimal empirical risk. This is achieved by measuring the extent to which the empirical risk changes when a particular sample is excluded from the training set. Instead of using the computationally expensive leaving-one-out-retraining procedure, we propose an efficient first-order approximator that only requires gradient information from different training stages. The key idea behind our approximation is that samples with gradients that are consistently aligned with the average gradient of the training set are more informative and should receive higher scores, which could be intuitively understood as follows: if the gradient from a specific sample is consistent with the average gradient vector, it implies that optimizing the network using the sample will yield a similar effect on all remaining samples. Experimental results demonstrate that MoSo effectively mitigates severe performance degradation at high pruning ratios and achieves satisfactory performance across various settings.
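The gradient-alignment idea in the abstract can be sketched as a scoring rule: average, over several training stages, the inner product between each sample's gradient and the mean gradient of the set. The code below is a simplified illustration and not the MoSo estimator itself. The function names are hypothetical, gradients are plain lists of floats, and any learning-rate weighting or leave-one-out correction the paper may use is omitted.

```python
# Illustrative sketch only: a simplified alignment score, assuming
# grads_per_stage[t][i] is the gradient of sample i at training stage t.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def alignment_scores(grads_per_stage):
    """Average, over stages, of <mean gradient, per-sample gradient>."""
    num_samples = len(grads_per_stage[0])
    scores = [0.0] * num_samples
    for stage in grads_per_stage:
        avg = mean_vector(stage)
        for i, g in enumerate(stage):
            scores[i] += dot(avg, g)
    return [s / len(grads_per_stage) for s in scores]

def prune(grads_per_stage, keep):
    """Return the indices of the `keep` highest-scoring samples."""
    scores = alignment_scores(grads_per_stage)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Three samples, two stages; sample 2 always opposes the other two.
toy_grads = [
    [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, -1.0]],
]
toy_scores = alignment_scores(toy_grads)
toy_kept = prune(toy_grads, keep=2)
```

In the toy data, the sample whose gradient consistently opposes the mean receives a negative score and is the one pruned.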

Submitted 25 October, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: Accepted by the Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2310.12281 [pdf, other]

Enhancing the Performance of Automated Grade Prediction in MOOC using Graph Representation Learning

Authors: Soheila Farokhi, Aswani Yaramala, Jiangtao Huang, Muhammad F. A. Khan, Xiaojun Qi, Hamid Karimi

Abstract: In recent years, Massive Open Online Courses (MOOCs) have gained significant traction as a rapidly growing phenomenon in online learning. Unlike traditional classrooms, MOOCs offer a unique opportunity to cater to a diverse audience from different backgrounds and geographical locations. Renowned universities and MOOC-specific providers, such as Coursera, offer MOOC courses on various subjects. Automated assessment tasks like grade and early dropout predictions are necessary due to the high enrollment and limited direct interaction between teachers and learners. However, current automated assessment approaches overlook the structural links between different entities involved in the downstream tasks, such as the students and courses. Our hypothesis suggests that these structural relationships, manifested through an interaction graph, contain valuable information that can enhance the performance of the task at hand. To validate this, we construct a unique knowledge graph for a large MOOC dataset, which will be publicly available to the research community. Furthermore, we utilize graph embedding techniques to extract latent structural information encoded in the interactions between entities in the dataset. These techniques do not require ground truth labels and can be utilized for various tasks. Finally, by combining entity-specific features, behavioral features, and extracted structural features, we enhance the performance of predictive machine learning models in student assignment grade prediction. Our experiments demonstrate that structural features can significantly improve the predictive performance of downstream assessment tasks. The code and data are available at https://github.com/DSAatUSU/MOOPer_grade_prediction

Submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.10644 [pdf, other]

TOSS: High-quality Text-guided Novel View Synthesis from a Single Image

Authors: Yukai Shi, Jianan Wang, He Cao, Boshi Tang, Xianbiao Qi, Tianyu Yang, Yukun Huang, Shilong Liu, Lei Zhang, Heung-Yeung Shum

Abstract: In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image. While Zero-1-to-3 has demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a pure image-to-image translation problem. This approach suffers from the challengingly under-constrained nature of single-view NVS: the process lacks means of explicit user control and often results in implausible NVS generations. To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space. TOSS fine-tunes text-to-image Stable Diffusion pre-trained on large-scale text-image pairs and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments are conducted with results showing that our proposed TOSS outperforms Zero-1-to-3 with more plausible, controllable and multiview-consistent NVS results. We further support these results with comprehensive ablations that underscore the effectiveness and potential of the introduced semantic guidance and architecture design.

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.07995 [pdf, other]

HeightFormer: A Multilevel Interaction and Image-adaptive Classification-regression Network for Monocular Height Estimation with Aerial Images

Authors: Zhan Chen, Yidan Zhang, Xiyu Qi, Yongqiang Mao, Xin Zhou, Lulu Niu, Hui Wu, Lei Wang, Yunping Ge

Abstract: Height estimation has long been a pivotal topic within measurement and remote sensing disciplines, proving critical for endeavours such as 3D urban modelling, MR and autonomous driving. Traditional methods utilise stereo matching or multisensor fusion, both well-established techniques that typically necessitate multiple images from varying perspectives and adjunct sensors like SAR, leading to substantial deployment costs. Single image height estimation has emerged as an attractive alternative, boasting a larger data source variety and simpler deployment. However, current methods suffer from limitations such as fixed receptive fields, a lack of global information interaction, leading to noticeable instance-level height deviations. The inherent complexity of height prediction can result in a blurry estimation of object edge depth when using mainstream regression methods based on fixed height division. This paper presents a comprehensive solution for monocular height estimation in remote sensing, termed HeightFormer, combining multilevel interactions and image-adaptive classification-regression. It features the Multilevel Interaction Backbone (MIB) and Image-adaptive Classification-regression Height Generator (ICG). MIB supplements the fixed sample grid in CNN of the conventional backbone network with tokens of different interaction ranges. It is complemented by a pixel-, patch-, and feature map-level hierarchical interaction mechanism, designed to relay spatial geometry information across different scales and introducing a global receptive field to enhance the quality of instance-level height estimation. The ICG dynamically generates height partition for each image and reframes the traditional regression task, using a refinement from coarse to fine classification-regression that significantly mitigates the innate ill-posedness issue and drastically improves edge sharpness.

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.07178 [pdf, other]

Engineering of energy band and its impact on light transmission in non-reciprocal Hermitian hourglass lattice

Authors: Junhao Yang, Yuandan Wang, Yu Lin, Wenjing Zhang, Guoguo Xin, Xinyuan Qi

Abstract: We study a quasi-one-dimensional non-reciprocal Hermitian hourglass photonic lattice that can accomplish multiple functions. Under the effect of non-reciprocal coupling, this lattice can produce an energy isolation effect, two kinds of flat bands, and energy band inversion. The excitation and propagation of a single energy band and multiple energy bands can be realized; in the flat band condition, the system has compact localized states, and the flat bands can be excited by a straightforward method. In addition, we investigate the edge states under the open boundary condition; a double edge state appears by using a defect in the system. Our findings advance the theory of energy band regulation in artificial photonic lattices.

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.04030 [pdf]

Robust inference with GhostKnockoffs in genome-wide association studies

Authors: Xinran Qi, Michael E. Belloy, Jiaqi Gu, Xiaoxia Liu, Hua Tang, Zihuai He

Abstract: Genome-wide association studies (GWASs) have been extensively adopted to depict the underlying genetic architecture of complex diseases. Motivated by GWASs' limitations in identifying small effect loci to understand complex traits' polygenicity and fine-mapping putative causal variants from proxy ones, we propose a knockoff-based method which only requires summary statistics from GWASs and demonstrate its validity in the presence of relatedness. We show that GhostKnockoffs inference is robust to its input Z-scores as long as they are from valid marginal association tests and their correlations are consistent with the correlations among the corresponding genetic variants. The property generalizes GhostKnockoffs to other GWASs settings, such as the meta-analysis of multiple overlapping studies and studies based on association test statistics deviated from score tests. We demonstrate GhostKnockoffs' performance using empirical simulation and a meta-analysis of nine European ancestral genome-wide association studies and whole exome/genome sequencing studies. Both results demonstrate that GhostKnockoffs identify more putative causal variants with weak genotype-phenotype associations that are missed by conventional GWASs.

Submitted 6 October, 2023; originally announced October 2023.

arXiv:2310.03693 [pdf, other]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Authors: Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson

Abstract: Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing -- even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2309.16987 [pdf, other]

SpikeMOT: Event-based Multi-Object Tracking with Sparse Motion Features

Authors: Song Wang, Zhu Wang, Can Li, Xiaojuan Qi, Hayden Kwok-Hay So

Abstract: In comparison to conventional RGB cameras, the superior temporal resolution of event cameras allows them to capture rich information between frames, making them prime candidates for object tracking. Yet in practice, despite their theoretical advantages, the body of work on event-based multi-object tracking (MOT) remains in its infancy, especially in real-world settings where events from complex background and camera motion can easily obscure the true target motion. In this work, an event-based multi-object tracker, called SpikeMOT, is presented to address these challenges. SpikeMOT leverages spiking neural networks to extract sparse spatiotemporal features from event streams associated with objects. The resulting spike train representations are used to track the object movement at high frequency, while a simultaneous object detector provides updated spatial information of these objects at an equivalent frame rate. To evaluate the effectiveness of SpikeMOT, we introduce DSEC-MOT, the first large-scale event-based MOT benchmark incorporating fine-grained annotations for objects experiencing severe occlusions, frequent trajectory intersections, and long-term re-identification in real-world contexts. Extensive experiments employing DSEC-MOT and another event-based dataset, named FE240hz, demonstrate SpikeMOT's capability to achieve high tracking accuracy amidst challenging real-world scenarios, advancing the state-of-the-art in event-based multi-object tracking.

Submitted 29 September, 2023; originally announced September 2023.

arXiv:2309.07109 [pdf, ps, other]

Real-time Monitoring for the Next Core-Collapse Supernova in JUNO

Authors: Angel Abusleme, Thomas Adam, Shakeel Ahmad, Rizwan Ahmed, Sebastiano Aiello, Muhammad Akram, Abid Aleem, Fengpeng An, Qi An, Giuseppe Andronico, Nikolay Anfimov, Vito Antonelli, Tatiana Antoshkina, Burin Asavapibhop, João Pedro Athayde Marcondes de André, Didier Auguste, Weidong Bai, Nikita Balashov, Wander Baldini, Andrea Barresi, Davide Basilico, Eric Baussan, Marco Bellato, Marco Beretta, Antonio Bergnoli, et al. (606 additional authors not shown)

Abstract: The core-collapse supernova (CCSN) is considered one of the most energetic astrophysical events in the universe. The early and prompt detection of neutrinos before (pre-SN) and during the supernova (SN) burst presents a unique opportunity for multi-messenger observations of CCSN events. In this study, we describe the monitoring concept and present the sensitivity of the system to pre-SN and SN neutrinos at the Jiangmen Underground Neutrino Observatory (JUNO), a 20 kton liquid scintillator detector currently under construction in South China. The real-time monitoring system is designed to ensure both prompt alert speed and comprehensive coverage of progenitor stars. It incorporates prompt monitors on the electronic board as well as online monitors at the data acquisition stage. Assuming a false alert rate of 1 per year, this monitoring system exhibits sensitivity to pre-SN neutrinos up to a distance of approximately 1.6 (0.9) kiloparsecs and SN neutrinos up to about 370 (360) kiloparsecs for a progenitor mass of 30 solar masses, considering both normal and inverted mass ordering scenarios. The pointing ability of the CCSN is evaluated by analyzing the accumulated event anisotropy of inverse beta decay interactions from pre-SN or SN neutrinos. This, along with the early alert, can play a crucial role in facilitating follow-up multi-messenger observations of the next galactic or nearby extragalactic CCSN.

Submitted 4 December, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: 24 pages, 9 figures, accepted for the publication at JCAP

arXiv:2309.06801 [pdf, ps, other]

Defensive Alliances in Signed Networks

Authors: Emmanuel Arrighi, Zhidan Feng, Henning Fernau, Kevin Mann, Xingqin Qi, Petra Wolf

Abstract: The analysis of (social) networks and multi-agent systems is a central theme in Artificial Intelligence. One line of research deals with finding groups of agents that could work together to achieve a certain goal. To this end, different notions of so-called clusters or communities have been introduced in the literature of graphs and networks. Among these, defensive alliance is a kind of quantitative group structure. However, all studies on the alliance so far have ignored one aspect that is central to the formation of alliances on a very intuitive level, assuming that the agents are preconditioned concerning their attitude towards other agents: they prefer to be in some group (alliance) together with the agents they like, so that they are happy to help each other towards their common aim, possibly then working against the agents outside of their group that they dislike. Signed networks were introduced in the psychology literature to model liking and disliking between agents, generalizing graphs in a natural way. Hence, we propose the novel notion of a defensive alliance in the context of signed networks. We then investigate several natural algorithmic questions related to this notion. These, and also combinatorial findings, connect our notion to that of correlation clustering, which is a well-established idea of finding groups of agents within a signed network. Also, we introduce a new structural parameter for signed graphs, signed neighborhood diversity snd, and exhibit a parameterized algorithm that finds a smallest defensive alliance in a signed graph.

Submitted 23 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

arXiv:2309.04814 [pdf, other]

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Authors: Xiuzhe Wu, Pengfei Hu, Yang Wu, Xiaoyang Lyu, Yan-Pei Cao, Ying Shan, Wenming Yang, Zhongqian Sun, Xiaojuan Qi

Abstract: Synthesizing realistic videos according to a given speech is still an open challenge. Previous works have been plagued by issues such as inaccurate lip shape generation and poor image quality. The key reason is that only motions and appearances on limited facial areas (e.g., lip area) are mainly driven by the input speech. Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training. We thus propose a decomposition-synthesis-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance to facilitate effective learning from limited training data, resulting in the generation of natural-looking videos. First, given a fixed head pose (i.e., canonical space), we present a speech-driven implicit model for lip image generation which concentrates on learning speech-sensitive motion and appearance. Next, to model the major speech-insensitive motion (i.e., head movement), we introduce a geometry-aware mutual explicit mapping (GAMEM) module that establishes geometric mappings between different head poses. This allows us to paste generated lip images at the canonical space onto head images with arbitrary poses and synthesize talking videos with natural head movements. In addition, a Blend-Net and a contrastive sync loss are introduced to enhance the overall synthesis performance. Quantitative and qualitative results on three benchmarks demonstrate that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip.

Submitted 9 September, 2023; originally announced September 2023.

arXiv:2309.02411 [pdf, other]

Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices

Authors: Bojia Zi, Xianbiao Qi, Lingzhi Wang, Jianan Wang, Kam-Fai Wong, Lei Zhang

Abstract: In this paper, we present Delta-LoRA, which is a novel parameter-efficient approach to fine-tune large language models (LLMs). In contrast to LoRA and other low-rank adaptation methods such as AdaLoRA, Delta-LoRA not only updates the low-rank matrices $\mathbf{A}$ and $\mathbf{B}$, but also propagates the learning to the pre-trained weights $\mathbf{W}$ via updates utilizing the delta of the product of two low-rank matrices ($\mathbf{A}^{(t+1)}\mathbf{B}^{(t+1)} - \mathbf{A}^{(t)}\mathbf{B}^{(t)}$). Such a strategy effectively addresses the limitation that the incremental update of low-rank matrices is inadequate for learning representations suitable for downstream tasks. Moreover, as the update of $\mathbf{W}$ does not need to compute the gradients of $\mathbf{W}$ and store their momentums, Delta-LoRA shares comparable memory requirements and computational costs with LoRA. Extensive experiments show that Delta-LoRA significantly outperforms existing low-rank adaptation methods. We further support these results with comprehensive analyses that underscore the effectiveness of Delta-LoRA.
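The weight update described in the Delta-LoRA abstract can be illustrated with a minimal sketch. It only shows the arithmetic of adding the delta of the low-rank product to the pre-trained weights; the helper names `matmul` and `delta_lora_step` and the `scale` factor are assumptions made for illustration and are not taken from the paper or its code. Plain Python lists are used so the example has no dependencies.

```python
# Illustrative sketch only: names and the `scale` factor are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def delta_lora_step(W, A_old, B_old, A_new, B_new, scale=1.0):
    """Return W + scale * (A_new @ B_new - A_old @ B_old).

    No gradient of W is needed: the update reuses the change in the
    low-rank product between two consecutive optimizer steps.
    """
    prev = matmul(A_old, B_old)
    curr = matmul(A_new, B_new)
    return [[W[i][j] + scale * (curr[i][j] - prev[i][j])
             for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: a 2x2 weight matrix with rank-1 factors.
W = [[1.0, 0.0], [0.0, 1.0]]
A_old, B_old = [[1.0], [0.0]], [[0.0, 1.0]]   # product: [[0, 1], [0, 0]]
A_new, B_new = [[1.0], [1.0]], [[0.0, 2.0]]   # product: [[0, 2], [0, 2]]
W_next = delta_lora_step(W, A_old, B_old, A_new, B_new)
```

In this toy case the delta of the product is [[0, 1], [0, 2]], so `W_next` differs from `W` only in its second column.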

Submitted 5 September, 2023; originally announced September 2023.

arXiv:2309.00223 [pdf, other]

The FruitShell French synthesis system at the Blizzard 2023 Challenge

Authors: Xin Qi, Xiaopeng Wang, Zhiyong Wang, Wang Liu, Mingming Ding, Shuchen Shi

Abstract: This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023. The challenge consists of two tasks: generating high-quality speech from female speakers and generating speech that closely resembles specific individuals. Regarding the competition data, we conducted a screening process to remove missing or erroneous text data. We organized all symbols except for phonemes and eliminated symbols that had no pronunciation or zero duration. Additionally, we added word boundary and start/end symbols to the text, which we have found to improve speech quality based on our previous experience. For the Spoke task, we performed data augmentation according to the competition rules. We used an open-source G2P model to transcribe the French texts into phonemes. As the G2P model uses the International Phonetic Alphabet (IPA), we applied the same transcription process to the provided competition data for standardization. However, due to compiler limitations in recognizing special symbols from the IPA chart, we followed the rules to convert all phonemes into the phonetic scheme used in the competition data. Finally, we resampled all competition audio to a uniform sampling rate of 16 kHz. We employed a VITS-based acoustic model with the HiFi-GAN vocoder. For the Spoke task, we trained a multi-speaker model and incorporated speaker information into the duration predictor, vocoder, and flow layers of the model. The evaluation results of our system showed a quality MOS score of 3.6 for the Hub task and 3.4 for the Spoke task, placing our system at an average level among all participating teams.

Submitted 20 August, 2024; v1 submitted 31 August, 2023; originally announced September 2023.

arXiv:2308.12439 [pdf, other]

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Authors: Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal

Abstract: We present a novel defense against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor functionality of a given backdoored model to a backdoor expert model. The approach is straightforward -- finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 17 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer).

Submitted 5 October, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

arXiv:2308.10490 [pdf, other]

Texture Generation on 3D Meshes with Point-UV Diffusion

Authors: Xin Yu, Peng Dai, Wenbo Li, Lan Ma, Zhengzhe Liu, Xiaojuan Qi

Abstract: In this work, we focus on synthesizing high-quality textures on 3D meshes. We present Point-UV diffusion, a coarse-to-fine pipeline that marries the denoising diffusion model with UV mapping to generate 3D consistent and high-quality texture images in UV space. We start with introducing a point diffusion model to synthesize low-frequency texture components with our tailored style guidance to tackle the biased color distribution. The derived coarse texture offers global consistency and serves as a condition for the subsequent UV diffusion stage, aiding in regularizing the model to generate a 3D consistent UV texture image. Then, a UV diffusion model with hybrid conditions is developed to enhance the texture fidelity in the 2D UV space. Our method can process meshes of any genus, generating diversified, geometry-compatible, and high-fidelity textures. Code is available at https://cvmi-lab.github.io/Point-UV-Diffusion

Submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023, Oral

arXiv:2308.02933 [pdf, other]

InnovationInsights: A Visual Analytics Approach for Understanding the Dual Frontiers of Science and Technology

Authors: Yifang Wang, Yifan Qian, Xiaoyu Qi, Nan Cao, Dashun Wang

Abstract: Science has long been viewed as a key driver of economic growth and rising standards of living. Knowledge about how scientific advances support marketplace inventions is therefore essential for understanding the role of science in propelling real-world applications and technological progress. The increasing availability of large-scale datasets tracing scientific publications and patented inventions and the complex interactions among them offers us new opportunities to explore the evolving dual frontiers of science and technology at an unprecedented level of scale and detail. However, we lack suitable visual analytics approaches to analyze such complex interactions effectively. Here we introduce InnovationInsights, an interactive visual analysis system for researchers, research institutions, and policymakers to explore the complex linkages between science and technology, and to identify critical innovations, inventors, and potential partners. The system first identifies important associations between scientific papers and patented inventions through a set of statistical measures introduced by our experts from the field of the Science of Science. A series of visualization views are then used to present these associations in the data context. In particular, we introduce the Interplay Graph to visualize patterns and insights derived from the data, helping users effectively navigate citation relationships between papers and patents. This visualization thereby helps them identify the origins of technical inventions and the impact of scientific research. We evaluate the system through two case studies with experts followed by expert interviews. We further engage a premier research institution to test-run the system, helping its institution leaders to extract new insights for innovation.

Submitted 8 August, 2023; v1 submitted 5 August, 2023; originally announced August 2023.

arXiv:2308.00353 [pdf, other]

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Authors: Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi

Abstract: Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and thus the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g. 34.5%$\sim$65.3%), instance segmentation (e.g. 21.8%$\sim$54.0%) and panoptic segmentation (e.g. 14.7%$\sim$43.3%). Code will be available.

Submitted 1 August, 2023; originally announced August 2023.

Comments: Submitted to TPAMI

arXiv:2307.16620 [pdf, other]

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

Authors: Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu

Abstract: The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior methods are prone to segment a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent one in another video. This would bring ambiguity in training our object segmentation network as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of audio is unknown, especially for multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks and these scores will highlight corresponding sounding instances while suppressing inaudible ones. When we enforce the attended instance masks to resemble the ground-truth mask, we are able to establish audio-visual semantic correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.

Submitted 31 July, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

Comments: This paper has been received by ACM MM 23

arXiv:2307.16118 [pdf, other]

MTD-GPT: A Multi-Task Decision-Making GPT Model for Autonomous Driving at Unsignalized Intersections

Authors: Jiaqi Liu, Peng Hang, Xiao Qi, Jianqiang Wang, Jian Sun

Abstract: Autonomous driving technology is poised to transform transportation systems. However, achieving safe and accurate multi-task decision-making in complex scenarios, such as unsignalized intersections, remains a challenge for autonomous vehicles. This paper presents a novel approach to this issue with the development of a Multi-Task Decision-Making Generative Pre-trained Transformer (MTD-GPT) model. Leveraging the inherent strengths of reinforcement learning (RL) and the sophisticated sequence modeling capabilities of the Generative Pre-trained Transformer (GPT), the MTD-GPT model is designed to simultaneously manage multiple driving tasks, such as left turns, straight-ahead driving, and right turns at unsignalized intersections. We initially train a single-task RL expert model, sample expert data in the environment, and subsequently utilize a mixed multi-task dataset for offline GPT training. This approach abstracts the multi-task decision-making problem in autonomous driving as a sequence modeling task. The MTD-GPT model is trained and evaluated across several decision-making tasks, demonstrating performance that is either superior or comparable to that of state-of-the-art single-task decision-making models.

Submitted 29 July, 2023; originally announced July 2023.

Comments: Accepted by ITSC 2023

arXiv:2307.15950 [pdf, other]

Teaching Autonomous Vehicles to Express Interaction Intent during Unprotected Left Turns: A Human-Driving-Prior-Based Trajectory Planning Approach

Authors: Jiaqi Liu, Xiao Qi, Ying Ni, Jian Sun, Peng Hang

Abstract: Incorporating Autonomous Vehicles (AVs) into existing transportation systems necessitates examining their coexistence with Human-driven Vehicles (HVs) in mixed traffic environments. Central to this coexistence is the AVs' ability to emulate human-like interaction intentions within traffic scenarios. We introduce a novel framework for planning unprotected left-turn trajectories for AVs, designed to mirror human driving behaviors and effectively communicate social intentions. This framework consists of three phases: trajectory generation, evaluation, and selection. In the trajectory generation phase, we utilize real human-driving trajectory data to establish constraints for a predicted trajectory space, creating candidate motion trajectories that reflect intent. The evaluation phase incorporates maximum entropy inverse reinforcement learning (ME-IRL) to gauge human trajectory preferences, considering aspects like traffic efficiency, driving comfort, and interactive safety. During the selection phase, a Boltzmann distribution-based approach is employed to assign rewards and probabilities to the candidate trajectories, promoting human-like decision-making. We validate our framework using an authentic trajectory dataset and conduct a comparative analysis with various baseline methods. Our results, derived from simulator tests and human-in-the-loop driving experiments, affirm our framework's superiority in mimicking human-like driving, expressing intent, and computational efficiency. For additional information on this research, please visit https://shorturl.at/jqu35.
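The Boltzmann-based selection phase mentioned in this abstract can be illustrated with a generic sketch: each candidate trajectory receives a probability proportional to the exponential of its reward. The function name `boltzmann_probabilities`, the `temperature` parameter, and the reward values are assumptions for illustration; in the paper the rewards come from the ME-IRL evaluation phase, whose details are not given in the abstract.

```python
import math

def boltzmann_probabilities(rewards, temperature=1.0):
    """Map candidate rewards to probabilities p_i proportional to exp(r_i / T).

    `temperature` is a hypothetical knob: lower values concentrate
    probability on the highest-reward candidate.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum reward before exponentiating for numerical stability.
    r_max = max(rewards)
    weights = [math.exp((r - r_max) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

# Placeholder rewards for three candidate trajectories.
probs = boltzmann_probabilities([1.0, 2.0, 0.5])
# Index of the most probable candidate.
best = max(range(len(probs)), key=probs.__getitem__)
```

Sampling from `probs` rather than always taking `best` keeps some of the variability seen in human choices, which is one common reason to prefer a Boltzmann rule over a hard argmax.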

Submitted 25 November, 2023; v1 submitted 29 July, 2023; originally announced July 2023.

arXiv:2307.15689 [pdf, other]

Engineering entanglement geometry via spacetime-modulated measurements

Authors: Aditya Cowsik, Matteo Ippoliti, Xiao-Liang Qi

Abstract: We introduce a general approach to realize quantum states with holographic entanglement structure via monitored dynamics. Starting from random unitary circuits in $1+1$ dimensions, we introduce measurements with a spatiotemporally-modulated density. Exploiting the known critical properties of the measurement-induced entanglement transition, this allows us to engineer arbitrary geometries for the bulk space (with a fixed topology). These geometries in turn control the entanglement structure of the boundary (output) state. We demonstrate our approach by giving concrete protocols for two geometries of interest in two dimensions: the hyperbolic half-plane and a spatial section of the BTZ black hole. We numerically verify signatures of the underlying entanglement geometry, including a direct imaging of entanglement wedges by using locally-entangled reference qubits. Our results provide a concrete platform for realizing geometric entanglement structures on near-term quantum simulators.

Submitted 28 July, 2023; originally announced July 2023.

Comments: 5 pages, 3 figures

arXiv:2307.15061 [pdf, other]

The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation

Authors: Lingdong Kong, Yaru Niu, Shaoyuan Xie, Hanjiang Hu, Lai Xing Ng, Benoit R. Cottereau, Ding Zhao, Liangjun Zhang, Hesheng Wang, Wei Tsang Ooi, Ruijie Zhu, Ziyang Song, Li Liu, Tianzhu Zhang, Jun Yu, Mohan Jing, Pengwei Li, Xiaohua Qi, Cheng Jin, Yingfeng Chen, Jie Hou, Jie Zhang, Zhen Kan, Qiang Ling, Liang Peng , et al. (18 additional authors not shown)

Abstract: Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and struggle to provide reliable depth predictions in such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions have emerged, with novel designs spanning the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses along with insightful observations are drawn to better understand the rationale behind each design. We hope this challenge could lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.

Submitted 27 July, 2023; originally announced July 2023.

Comments: Technical Report; 65 pages, 34 figures, 24 tables; Code at https://github.com/ldkong1205/RoboDepth

arXiv:2307.09316 [pdf, other]

MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds

Authors: Jiahui Liu, Chirui Chang, Jianhui Liu, Xiaoyang Wu, Lan Ma, Xiaojuan Qi

Abstract: 3D semantic segmentation on multi-scan large-scale point clouds plays an important role in autonomous systems. Unlike the single-scan-based semantic segmentation task, this task requires distinguishing the motion states of points in addition to their semantic categories. However, methods designed for single-scan-based segmentation tasks perform poorly on the multi-scan task due to the lack of an effective way to integrate temporal information. We propose MarS3D, a plug-and-play motion-aware module for semantic segmentation on multi-scan 3D point clouds. This module can be flexibly combined with single-scan models to allow them to have multi-scan perception abilities. The model encompasses two key designs: the Cross-Frame Feature Embedding module for enriching representation learning and the Motion-Aware Feature Learning module for enhancing motion awareness. Extensive experiments show that MarS3D can improve the performance of the baseline model by a large margin. The code is available at https://github.com/CVMI-Lab/MarS3D.

Submitted 18 July, 2023; originally announced July 2023.

Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023

arXiv:2307.08388 [pdf, other]

Dynamic Snake Convolution based on Topological Geometric Constraints for Tubular Structure Segmentation

Authors: Yaolei Qi, Yuting He, Xiaoming Qi, Yuan Zhang, Guanyu Yang

Abstract: Accurate segmentation of topological tubular structures, such as blood vessels and roads, is crucial in various fields, ensuring accuracy and efficiency in downstream tasks. However, many factors complicate the task, including thin local structures and variable global morphologies. In this work, we note the specificity of tubular structures and use this knowledge to guide our DSCNet to simultaneously enhance perception in three stages: feature extraction, feature fusion, and loss constraint. First, we propose a dynamic snake convolution to accurately capture the features of tubular structures by adaptively focusing on slender and tortuous local structures. Subsequently, we propose a multi-view feature fusion strategy to complement the attention to features from multiple perspectives during feature fusion, ensuring the retention of important information from different global morphologies. Finally, a continuity constraint loss function, based on persistent homology, is proposed to constrain the topological continuity of the segmentation better. Experiments on 2D and 3D datasets show that our DSCNet provides better accuracy and continuity on the tubular structure segmentation task compared with several methods. Our codes will be publicly available.

Submitted 18 August, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: Accepted by ICCV 2023

arXiv:2307.00771 [pdf, other]

Resistive memory-based zero-shot liquid state machine for multimodal event data learning

Authors: Ning Lin, Shaocong Wang, Yi Li, Bo Wang, Shuhui Shi, Yangu He, Woyu Zhang, Yifei Yu, Yue Zhang, Xiaojuan Qi, Xiaoming Chen, Hao Jiang, Xumeng Zhang, Peng Lin, Xiaoxin Xu, Qi Liu, Zhongrui Wang, Dashan Shang, Ming Liu

Abstract: The human brain is a complex spiking neural network (SNN) that learns multimodal signals in a zero-shot manner by generalizing existing knowledge. Remarkably, the brain achieves this with minimal power consumption, using event-based signals that propagate within its structure. However, mimicking the human brain in neuromorphic hardware presents both hardware and software challenges. Hardware limitations, such as the slowdown of Moore's law and the von Neumann bottleneck, hinder the efficiency of digital computers. On the software side, SNNs are known for their difficult training, especially when learning multimodal signals. To overcome these challenges, we propose a hardware-software co-design that combines a fixed and random liquid state machine (LSM) SNN encoder with trainable artificial neural network (ANN) projections. The LSM is physically implemented using analogue resistive memory, leveraging the inherent stochasticity of resistive switching to generate random weights. This highly efficient and nanoscale in-memory computing approach effectively addresses the von Neumann bottleneck and the slowdown of Moore's law. The ANN projections are implemented digitally, allowing for easy optimization using contrastive loss, which helps to overcome the difficulties associated with SNN training. We experimentally implement this co-design on a 40nm 256Kb in-memory computing macro. We first demonstrate LSM-based event encoding through supervised classification and linear probing on the N-MNIST and N-TIDIGITS datasets.

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2306.16687 [pdf, other]

Bulk Reconstruction from Generalized Free Fields

Authors: Tamra M. Nebabu, Xiaoliang Qi

Abstract: We propose a generalized protocol for constructing a dual free bulk theory from any boundary model of generalized free fields (GFFs). To construct the bulk operators, we employ a linear ansatz similar to the Hamilton-Kabat-Lifschytz and Lowe (HKLL) construction. However, unlike the HKLL construction, our protocol relies only on boundary data with no presupposed form for the bulk equations of motion, so our reconstructed bulk is fully emergent. For a (1+1)d bulk, imposing the bulk operator algebra as well as a causal structure is sufficient to determine the bulk operators and dynamics uniquely up to an unimportant local basis choice. We study the bulk construction for several two-sided SYK models with and without coupling between the two sides, and find good agreement with known results in the low-temperature conformal limit. In particular, we find bulk features consistent with the presence of a black hole horizon for the TFD state, and characterize the infalling fermion modes. We are also able to extract bulk quantities such as the curvature and bulk state correlators in terms of boundary quantities. In the presence of coupling between the two SYK models, we are able to observe evidence of the shockwave geometry and the traversable wormhole geometry using the two-sided mutual information between the reconstructed bulk operators. Our results show evidence that features of the geometric bulk can survive away from the low-temperature conformal limit. Furthermore, the generality of the protocol allows it to be applied to other boundary theories with no canonical holographic bulk.

Submitted 29 June, 2023; originally announced June 2023.

arXiv:2306.16064 [pdf, other]

Federated Generative Learning with Foundation Models

Authors: Jie Zhang, Xiaohua Qi, Bo Zhao

Abstract: Existing approaches in Federated Learning (FL) mainly focus on sending model parameters or gradients from clients to a server. However, these methods are plagued by significant inefficiency, privacy, and security concerns. Thanks to the emerging foundation generative models, we propose a novel federated learning framework, namely Federated Generative Learning. In this framework, each client can create text embeddings that are tailored to their local data, and send embeddings to the server. Then the informative training data can be synthesized remotely on the server using foundation generative models with these embeddings, which can benefit FL tasks. Our proposed framework offers several advantages, including increased communication efficiency, robustness to data heterogeneity, substantial performance improvements, and enhanced privacy protection. We validate these benefits through extensive experiments conducted on 12 datasets. For example, on the ImageNet100 dataset with a highly skewed data distribution, our method outperforms FedAvg by 12% in a single communication round, compared to FedAvg's performance over 200 communication rounds. We have released the code for all experiments conducted in this study.

Submitted 31 May, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

arXiv:2306.13213 [pdf, other]

Visual Adversarial Examples Jailbreak Aligned Large Language Models

Authors: Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal

Abstract: Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a 'few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.

Submitted 16 August, 2023; v1 submitted 22 June, 2023; originally announced June 2023.

arXiv:2306.12422 [pdf, other]

DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation

Authors: Yukun Huang, Jianan Wang, Yukai Shi, Boshi Tang, Xianbiao Qi, Lei Zhang

Abstract: Text-to-image diffusion models pre-trained on billions of image-text pairs have recently enabled 3D content creation by optimizing a randomly initialized differentiable 3D representation with score distillation. However, the optimization process suffers from slow convergence, and the resultant 3D models often exhibit two limitations: (a) quality concerns such as missing attributes and distorted shape and texture; (b) extremely low diversity compared to text-guided image synthesis. In this paper, we show that the conflict between the 3D optimization process and uniform timestep sampling in score distillation is the main reason for these limitations. To resolve this conflict, we propose to prioritize timestep sampling with monotonically non-increasing functions, which aligns the 3D optimization process with the sampling process of the diffusion model. Extensive experiments show that our simple redesign significantly improves 3D content creation with faster convergence, better quality and diversity.

Submitted 6 May, 2024; v1 submitted 21 June, 2023; originally announced June 2023.

Comments: ICLR 2024

arXiv:2306.09338 [pdf, other]

Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant

Authors: Xianbiao Qi, Jianan Wang, Lei Zhang

Abstract: This article provides a comprehensive understanding of optimization in deep learning, with a primary focus on the challenges of gradient vanishing and gradient exploding, which normally lead to diminished model representational ability and training instability, respectively. We analyze these two challenges through several strategic measures, including the improvement of gradient flow and the imposition of constraints on a network's Lipschitz constant. To help understand the current optimization methodologies, we categorize them into two classes: explicit optimization and implicit optimization. Explicit optimization methods involve direct manipulation of optimizer parameters, including weight, gradient, learning rate, and weight decay. Implicit optimization methods, by contrast, focus on improving the overall landscape of a network by enhancing its modules, such as residual shortcuts, normalization methods, attention mechanisms, and activations. In this article, we provide an in-depth analysis of these two optimization classes and undertake a thorough examination of the Jacobian matrices and the Lipschitz constants of many widely used deep learning modules, highlighting existing issues as well as potential improvements. Moreover, we also conduct a series of analytical experiments to substantiate our theoretical discussions. This article does not aim to propose a new optimizer or network. Rather, our intention is to present a comprehensive understanding of optimization in deep learning. We hope that this article will assist readers in gaining deeper insight into this field and encourage the development of more robust, efficient, and high-performing models.

Submitted 12 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: International Digital Economy Academy (IDEA)

arXiv:2306.09076 [pdf, other]

Awayvirus: A Playful and Tangible Approach to Improve Children's Hygiene Habits in Family Education

Authors: Xiang Qi, Yaxiong Lei, Shijing He, Shuxin Cheng

Abstract: Although various playful and educational tools have been developed to support children's learning abilities, limited work focuses on tangible toys designed to improve and maintain children's hygiene perception, habits and awareness, as well as fostering their collaboration and social abilities in home education contexts. We developed Awayvirus to address this research and design gap, aiming to help children gain knowledge of hygiene habits through tangible blocks. Our findings indicate that a playful tangible interaction method can effectively increase children's interest in learning and encourage parents to become actively involved in their children's hygiene and health education. Additionally, Awayvirus seeks to build a collaborative bridge between children and parents, promoting communication strategies while mitigating the adverse effects of the challenging post-pandemic period.

Submitted 15 June, 2023; originally announced June 2023.

Comments: 10 pages, 3 figures, Paper accepted by INTERACT23 short paper track

ACM Class: H.5

arXiv:2306.07760 [pdf, other]

Urania: Visualizing Data Analysis Pipelines for Natural Language-Based Data Exploration

Authors: Yi Guo, Nan Cao, Xiaoyu Qi, Haoyang Li, Danqing Shi, Jing Zhang, Qing Chen, Daniel Weiskopf

Abstract: Exploratory Data Analysis (EDA) is an essential yet tedious process for examining a new dataset. To facilitate it, natural language interfaces (NLIs) can help people intuitively explore the dataset via data-oriented questions. However, existing NLIs primarily focus on providing accurate answers to questions, with few offering explanations or presentations of the data analysis pipeline used to uncover the answer. Such presentations are crucial for EDA as they enhance the interpretability and reliability of the answer, while also helping users understand the analysis process and derive insights. To fill this gap, we introduce Urania, a natural language interactive system that is able to visualize the data analysis pipelines used to resolve input questions. It integrates a natural language interface that allows users to explore data via questions, and a novel data-aware question decomposition algorithm that resolves each input question into a data analysis pipeline. This pipeline is visualized in the form of a datamation, with animated presentations of analysis operations and their corresponding data changes. Through two quantitative experiments and expert interviews, we demonstrated that our data-aware question decomposition algorithm outperforms the state-of-the-art technique in terms of execution accuracy, and that Urania can help people explore datasets better. Finally, we discuss the observations from the studies and potential future work.

Submitted 13 June, 2023; originally announced June 2023.

arXiv:2306.07505 [pdf]

Deep learning radiomics for assessment of gastroesophageal varices in people with compensated advanced chronic liver disease

Authors: Lan Wang, Ruiling He, Lili Zhao, Jia Wang, Zhengzi Geng, Tao Ren, Guo Zhang, Peng Zhang, Kaiqiang Tang, Chaofei Gao, Fei Chen, Liting Zhang, Yonghe Zhou, Xin Li, Fanbin He, Hui Huan, Wenjuan Wang, Yunxiao Liang, Juan Tang, Fang Ai, Tingyu Wang, Liyun Zheng, Zhongwei Zhao, Jiansong Ji, Wei Liu, et al. (22 additional authors not shown)

Abstract: Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV). Design: A prospective multicenter study was conducted in patients with compensated advanced chronic liver disease. 305 patients were enrolled from 12 hospitals, and finally 265 patients were included, with 1136 liver stiffness measurement (LSM) images and 1042 spleen stiffness measurement (SSM) images generated by 2D-SWE. We leveraged deep learning methods to uncover associations between image features and patient risk, and thus constructed models to predict GEV and HRV. Results: A multi-modality Deep Learning Risk Prediction model (DLRP) was constructed to assess GEV and HRV, based on LSM and SSM images, and clinical information. Validation analysis revealed that the AUCs of DLRP were 0.91 for GEV (95% CI 0.90 to 0.93, p < 0.05) and 0.88 for HRV (95% CI 0.86 to 0.89, p < 0.01), which were significantly and robustly better than canonical risk indicators, including the value of LSM and SSM. Moreover, DLRP was better than the model using individual parameters, including LSM and SSM images. In HRV prediction, the 2D-SWE images of SSM outperformed those of LSM (p < 0.01). Conclusion: DLRP shows excellent performance in predicting GEV and HRV over canonical risk indicators LSM and SSM. Additionally, the 2D-SWE images of SSM provided more information for better accuracy in predicting HRV than the LSM.

Submitted 12 June, 2023; originally announced June 2023.

arXiv:2306.07265 [pdf, other]

detrex: Benchmarking Detection Transformers

Authors: Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, Lei Zhang

Abstract: The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the current field lacks a unified and comprehensive benchmark specifically tailored for DETR-based models. To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks, including object detection, segmentation, and pose estimation. We conduct extensive experiments under detrex and perform a comprehensive benchmark for DETR-based models. Moreover, we enhance the performance of detection transformers through the refinement of training hyper-parameters, providing strong baselines for supported algorithms. We hope that detrex could offer research communities a standardized and unified platform to evaluate and compare different DETR-based models while fostering a deeper understanding and driving advancements in DETR-based instance recognition. Our code is available at https://github.com/IDEA-Research/detrex. The project is currently being actively developed. We encourage the community to use the detrex codebase for further development and contributions.

Submitted 13 June, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

Comments: project link: https://github.com/IDEA-Research/detrex

arXiv:2306.06200 [pdf]

doi 10.1021/acs.nanolett.3c01955

A generalized approach to photon avalanche upconversion in luminescent nanocrystals

Authors: Artiom Skripka, Minji Lee, Xiao Qi, Jia-Ahn Pan, Haoran Yang, Changhwan Lee, P. James Schuck, Bruce E. Cohen, Daniel Jaque, Emory M. Chan

Abstract: Photon avalanching nanoparticles (ANPs) exhibit extremely nonlinear upconverted emission valuable for sub-diffraction imaging, nanoscale sensing, and optical computing. Avalanching has been demonstrated with Tm³⁺, Nd³⁺ or Pr³⁺-doped nanocrystals, but their emission is limited to 600 and 800 nm, restricting applications. Here, we utilize Gd³⁺-assisted energy migration to tune the emission wavelengths of Tm³⁺-sensitized ANPs and generate highly nonlinear emission of Eu³⁺, Tb³⁺, Ho³⁺, and Er³⁺ ions. The upconversion intensities of these spectrally discrete ANPs scale with the nonlinearity factor s = 10-17 under 1064 nm excitation at power densities as low as 6 kW/cm². This strategy for imprinting avalanche behavior on remote emitters can be extended to fluorophores adjacent to ANPs, as we demonstrate with CdS/CdSe/CdS core/shell/shell quantum dots. ANPs with rationally designed energy transfer networks provide the means to transform conventional linear emitters into highly nonlinear ones, expanding the use of photon avalanching in biological, chemical, and photonic applications.

Submitted 9 June, 2023; originally announced June 2023.

Comments: 13 pages, 5 figures

arXiv:2305.18891 [pdf, other]

EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation

Authors: Xingqun Qi, Chen Liu, Lincheng Li, Jie Hou, Haoran Xin, Xin Yu

Abstract: Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that emotion is one of the key factors of authentic co-speech gesture generation. In this work, we propose EmotionGesture, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures. Our code and dataset will be released at the project page: https://xingqunqi-lab.github.io/Emotion-Gesture-Web/

Submitted 3 January, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

Comments: Under review

arXiv:2305.14691 [pdf, other]

Label-Efficient Learning in Agriculture: A Comprehensive Review

Authors: Jiajia Li, Dong Chen, Xinda Qi, Zhaojian Li, Yanbo Huang, Daniel Morris, Xiaobo Tan

Abstract: The past decade has witnessed many great successes of machine learning (ML) and deep learning (DL) applications in agricultural systems, including weed control, plant disease diagnosis, agricultural robotics, and precision livestock management. Despite tremendous progress, one downside of such ML/DL models is that they generally rely on large-scale labeled datasets for training, and the performance of such models is strongly influenced by the size and quality of available labeled data samples. In addition, collecting, processing, and labeling such large-scale datasets is extremely costly and time-consuming, partially due to the rising cost of human labor. Therefore, developing label-efficient ML/DL methods for agricultural applications has received significant interest among researchers and practitioners. In fact, there are more than 50 papers on developing and applying deep-learning-based label-efficient techniques to address various agricultural problems since 2016, which motivates the authors to provide a timely and comprehensive review of recent label-efficient ML/DL methods in agricultural applications. To this end, we first develop a principled taxonomy to organize these methods according to the degree of supervision, including weak supervision (i.e., active learning and semi-/weakly- supervised learning), and no supervision (i.e., un-/self- supervised learning), supplemented by representative state-of-the-art label-efficient ML/DL methods. In addition, a systematic review of various agricultural applications exploiting these label-efficient algorithms, such as precision agriculture, plant phenotyping, and postharvest quality assessment, is presented. Finally, we discuss the current problems and challenges, as well as future research directions. A well-classified paper list can be accessed at https://github.com/DongChen06/Label-efficient-in-Agriculture.

Submitted 23 May, 2023; originally announced May 2023.

Comments: 34 pages, 23 figures

arXiv:2305.13948 [pdf, other]

Decoupled Kullback-Leibler Divergence Loss

Authors: Jiequan Cui, Zhuotao Tian, Zhisheng Zhong, Xiaojuan Qi, Bei Yu, Hanwang Zhang

Abstract: In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and observe that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of 1) a weighted Mean Square Error (wMSE) loss and 2) a Cross-Entropy loss incorporating soft labels. From our analysis of the DKL loss, we have identified two areas for improvement. Firstly, we address the limitation of DKL in scenarios like knowledge distillation by breaking its asymmetry property in training optimization. This modification ensures that the wMSE component is always effective during training, providing extra constructive cues. Secondly, we introduce global information into DKL for intra-class consistency regularization. With these two enhancements, we derive the Improved Kullback-Leibler (IKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100 and ImageNet datasets, focusing on adversarial training and knowledge distillation tasks. The proposed approach achieves new state-of-the-art performance on both tasks, demonstrating its substantial practical merits. Code and models will be available soon at https://github.com/jiequancui/DKL.

Submitted 23 May, 2023; originally announced May 2023.

Comments: under review

arXiv:2305.13869 [pdf, other]

Trend-Based SAC Beam Control Method with Zero-Shot in Superconducting Linear Accelerator

Authors: Xiaolong Chen, Xin Qi, Chunguang Su, Yuan He, Zhijun Wang, Kunxiang Sun, Chao Jin, Weilong Chen, Shuhui Liu, Xiaoying Zhao, Duanyang Jia, Man Yi

Abstract: The superconducting linear accelerator is a highly flexible facility for modern scientific discoveries, necessitating weekly reconfiguration and tuning. Accordingly, minimizing setup time is essential to afford users ample experimental time. We propose a trend-based soft actor-critic (TBSAC) beam control method with strong robustness, allowing the agents to be trained in a simulated environment and applied directly to the real accelerator in a zero-shot manner. To validate the effectiveness of our method, two typical beam control tasks were performed, on the China Accelerator Facility for Superheavy Elements (CAFe II) and on a light particle injector (LPI), respectively. The orbit correction tasks were performed separately in three cryomodules of CAFe II; the time required for tuning was reduced to one-tenth of that needed by human experts, and the RMS values of the corrected orbit were all less than 1 mm. The transmission efficiency optimization task was conducted in the LPI, where our agent optimized the transmission efficiency of the radio-frequency quadrupole (RFQ) to over 85% within 2 minutes. The outcomes of these two experiments substantiate that our proposed TBSAC approach can efficiently and effectively accomplish beam commissioning tasks while upholding the same standard as skilled human experts. As such, our method exhibits potential for future applications in other accelerator commissioning fields.

Submitted 25 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.12529 [pdf, other]

DreamWaltz: Make a Scene with Complex 3D Animatable Avatars

Authors: Yukun Huang, Jianan Wang, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, Lei Zhang

Abstract: We present DreamWaltz, a novel framework for generating and animating complex 3D avatars given text guidance and a parametric human body prior. While recent methods have shown encouraging results for text-to-3D generation of common objects, creating high-quality and animatable 3D avatars remains challenging. To create high-quality 3D avatars, DreamWaltz proposes 3D-consistent occlusion-aware Score Distillation Sampling (SDS) to optimize implicit neural representations with canonical poses. It provides view-aligned supervision via 3D-aware skeleton conditioning, which enables complex avatar generation without artifacts and multiple faces. For animation, our method learns an animatable 3D avatar representation from abundant image priors of a diffusion model conditioned on various poses, which could animate complex non-rigged avatars given arbitrary poses without retraining. Extensive evaluations demonstrate that DreamWaltz is an effective and robust approach for creating 3D avatars that can take on complex shapes and appearances as well as novel poses for animation. The proposed framework further enables the creation of complex scenes with diverse compositions, including avatar-avatar, avatar-object and avatar-scene interactions. See https://dreamwaltz3d.github.io/ for more vivid 3D avatar and animation results.

Submitted 5 November, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: To appear in NeurIPS 2023; Project page: https://dreamwaltz3d.github.io/

arXiv:2305.04454 [pdf, other]

Frequency combs induced by optical feedback and harmonic order tunability in quantum cascade lasers

Authors: Carlo Silvestri, Xiaoqiong Qi, Thomas Taimre, Aleksandar D. Rakić

Abstract: This study investigates the interaction between frequency combs and optical feedback effects in Quantum Cascade Lasers (QCLs). The theoretical analysis reveals new phenomena arising from the interplay between comb generation and feedback. By considering the bias current corresponding to free-running single mode emission, the introduction of optical feedback can trigger the generation of frequency combs, including both fundamental and harmonic combs. This presents opportunities to extend the comb region and generate harmonic frequency combs with different orders through optimization of external cavity parameters such as losses and length. Furthermore, the study demonstrates that optical feedback can selectively tune the harmonic order of a pre-existing free-running comb by adjusting the external cavity length, particularly for feedback ratios around 1%, which are readily achievable in experimental setups. Under strong feedback conditions (Acket parameter C>4.6), mixed states emerge, displaying the features of both laser and external cavity dynamics. While the study is predominantly centered on Terahertz QCLs, we have also confirmed that the described phenomena occur when utilizing mid-infrared QCL parameters. This work establishes a connection between comb technology and the utilization of optical feedback, providing new avenues for exploration and advancement in the field. In fact, the novel reported phenomena open a pathway towards new methodologies across various domains, such as design of tunable comb sources, hyperspectral imaging, multimode coherent sensing, and multi-channel communication.

Submitted 16 October, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

arXiv:2304.14503 [pdf]

UHRNet: A Deep Learning-Based Method for Accurate 3D Reconstruction from a Single Fringe-Pattern

Authors: Yixiao Wang, Canlin Zhou, Xingyang Qi, Hui Li

Abstract: The quick and accurate retrieval of an object's height from a single fringe pattern in Fringe Projection Profilometry has been a topic of ongoing research. While a single-shot fringe-to-depth CNN-based method can restore the height map directly from a single pattern, its accuracy is currently inferior to the traditional phase-shifting technique. To improve this method's accuracy, we propose a U-shaped High-resolution Network (UHRNet). The network uses the UNet encoding and decoding structure as its backbone, with a Multi-Level convolution Block and a High resolution Fusion Block applied to extract local and global features. We also designed a compound loss function by combining the Structural Similarity Index Measure Loss (SSIMLoss) function and a chunked L2 loss function to improve 3D reconstruction details. We conducted several experiments to demonstrate the validity and robustness of our proposed method. The average RMSE of 3D reconstruction by our method is only 0.443 mm, which is 41.13% of that of the UNet method and 33.31% of that of Wang et al.'s hNet method. Our experimental results show that our proposed method can increase the accuracy of 3D reconstruction from a single fringe pattern.

Submitted 23 April, 2023; originally announced April 2023.

arXiv:2304.12988 [pdf, other]

doi 10.59275/j.melba.2023-7e96

Multi-Scale Feature Fusion using Parallel-Attention Block for COVID-19 Chest X-ray Diagnosis

Authors: Xiao Qi, David J. Foran, John L. Nosher, Ilker Hacihaliloglu

Abstract: Under the global COVID-19 crisis, accurate diagnosis of COVID-19 from Chest X-ray (CXR) images is critical. To reduce intra- and inter-observer variability during radiological assessment, computer-aided diagnostic tools have been utilized to supplement medical decision-making and subsequent disease management. Computational methods with high accuracy and robustness are required for rapid triaging of patients and aiding radiologists in the interpretation of the collected data. In this study, we propose a novel multi-feature fusion network using parallel attention blocks to fuse the original CXR images and local-phase feature-enhanced CXR images at multiple scales. We examine our model on various COVID-19 datasets acquired from different organizations to assess its generalization ability. Our experiments demonstrate that our method achieves state-of-the-art performance and has improved generalization capability, which is crucial for widespread deployment.

Submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2023:008

Journal ref: Machine.Learning.for.Biomedical.Imaging. 2 (2023)

arXiv:2304.12652 [pdf, other]

Hybrid Neural Rendering for Large-Scale Scenes with Motion Blur

Authors: Peng Dai, Yinda Zhang, Xin Yu, Xiaoyang Lyu, Xiaojuan Qi

Abstract: Rendering novel view images is highly desirable for many applications. Despite recent progress, it remains challenging to render high-fidelity and view-consistent novel views of large-scale scenes from in-the-wild images with inevitable artifacts (e.g., motion blur). To this end, we develop a hybrid neural rendering model that makes image-based representation and neural 3D representation join forces to render high-quality, view-consistent images. Besides, images captured in the wild inevitably contain artifacts, such as motion blur, which deteriorate the quality of rendered images. Accordingly, we propose strategies to simulate blur effects on the rendered images to mitigate the negative influence of blurry images and to reduce their importance during training based on precomputed quality-aware weights. Extensive experiments on real and synthetic data demonstrate that our model surpasses state-of-the-art point-based methods for novel view synthesis. The code is available at https://daipengwa.github.io/Hybrid-Rendering-ProjectPage.

Submitted 9 July, 2023; v1 submitted 25 April, 2023; originally announced April 2023.

arXiv:2304.11537 [pdf, ps, other]

Bounds for eccentricity-based parameters of graphs

Authors: Yunfang Tang, Xuli Qi, Douglas B. West

Abstract: The \emph{eccentricity} of a vertex $u$ in a graph $G$, denoted by $e_G(u)$, is the maximum distance from $u$ to other vertices in $G$. We study extremal problems for the average eccentricity and the first and second Zagreb eccentricity indices, denoted by $σ_0(G)$, $σ_1(G)$, and $σ_2(G)$, respectively. These are defined by $σ_0(G)=\frac{1}{|V(G)|}\sum_{u\in V(G)}e_G(u)$, $σ_1(G)=\sum_{u\in V(G)}e_G^2(u)$, and $σ_2(G)=\sum_{uv\in E(G)}e_G(u)e_G(v)$. We study lower and upper bounds on these parameters among $n$-vertex connected graphs with fixed diameter, chromatic number, clique number, or matching number. Most of the bounds are sharp, with the corresponding extremal graphs characterized.

Submitted 23 April, 2023; originally announced April 2023.

Comments: 27 pages
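
The three parameters defined in the abstract above can be computed directly from their definitions. The following is a minimal sketch, not code from the paper: it assumes a simple connected undirected graph stored as an adjacency dictionary, and the function names `eccentricities` and `eccentricity_indices` are hypothetical names chosen for illustration.

```python
from collections import deque


def eccentricities(adj):
    """Return {u: e_G(u)} for a connected graph given as {vertex: [neighbors]}."""
    ecc = {}
    for source in adj:
        # Breadth-first search gives shortest-path distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(adj):
            raise ValueError("graph must be connected")
        # Eccentricity is the largest distance from `source` to any vertex.
        ecc[source] = max(dist.values())
    return ecc


def eccentricity_indices(adj):
    """Return (sigma_0, sigma_1, sigma_2) as defined in the abstract.

    Assumes a simple graph (no self-loops, no parallel edges).
    """
    ecc = eccentricities(adj)
    # Average eccentricity.
    sigma0 = sum(ecc.values()) / len(adj)
    # First Zagreb eccentricity index: sum of squared eccentricities.
    sigma1 = sum(e * e for e in ecc.values())
    # Collect each undirected edge exactly once.
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    # Second Zagreb eccentricity index: sum over edges of e(u) * e(v).
    sigma2 = sum(ecc[u] * ecc[v] for u, v in map(tuple, edges))
    return sigma0, sigma1, sigma2


# Path on four vertices 0-1-2-3: eccentricities are 3, 2, 2, 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(eccentricity_indices(path))  # (2.5, 26, 16)
```

For the four-vertex path, the eccentricities 3, 2, 2, 3 give an average of 10/4 = 2.5, a sum of squares of 26, and an edge sum of 6 + 4 + 6 = 16. Running one breadth-first search per vertex is adequate for small examples; it is not meant as an efficient method for large graphs.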

arXiv:2304.09856 [pdf, other]

LipsFormer: Introducing Lipschitz Continuity to Vision Transformers

Authors: Xianbiao Qi, Jianan Wang, Yihao Chen, Yukai Shi, Lei Zhang

Abstract: We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability by learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property to ensure training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need for careful learning rate tuning such as warmup, yielding faster convergence and better generalization. As a result, on the ImageNet 1K dataset, LipsFormer-Swin-Tiny, based on Swin Transformer, training for 300 epochs can obtain 82.7\% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin, training for 300 epochs achieves a top-1 accuracy of 83.5\% with 4.7G FLOPs and 24M parameters. The code will be released at \url{https://github.com/IDEA-Research/LipsFormer}.

Submitted 19 April, 2023; originally announced April 2023.

Comments: To appear in ICLR 2023, our code will be public at https://github.com/IDEA-Research/LipsFormer

Showing 101–150 of 623 results for author: Qi, X