Search | arXiv e-print repository

Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

Authors: Tian Ye, Zicheng Xu, Yuanzhi Li, Zeyuan Allen-Zhu

Abstract: Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to "self-correct" their mistakes via multi-round prompting. In this paper, we follow this line of work but… ▽ More Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to "self-correct" their mistakes via multi-round prompting. In this paper, we follow this line of work but focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. Using a synthetic math dataset, we show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy directly (i.e., through simple auto-regression, without multi-round prompting) compared to pretraining on the same amount of error-free data. We also delve into many details, such as (1) how this approach differs from beam search, (2) how such data can be prepared, (3) whether masking is needed on the erroneous tokens, (4) the amount of error required, (5) whether such data can be deferred to the fine-tuning stage, and many others. △ Less

Submitted 29 August, 2024; originally announced August 2024.

Comments: arXiv admin note: text overlap with arXiv:2407.20311

arXiv:2408.12541 [pdf, other]

Clarifying the Role of the Mantel-Haenszel Risk Difference Estimator in Randomized Clinical Trials

Authors: Xiaoyu Qiu, Yuhan Qian, Jaehwan Yi, Jinqiu Wang, Yu Du, Yanyao Yi, Ting Ye

Abstract: The Mantel-Haenszel (MH) risk difference estimator, commonly used in randomized clinical trials for binary outcomes, calculates a weighted average of stratum-specific risk difference estimators. Traditionally, this method requires the stringent assumption that risk differences are homogeneous across strata, also known as the common risk difference assumption. In our article, we relax this assumpti… ▽ More The Mantel-Haenszel (MH) risk difference estimator, commonly used in randomized clinical trials for binary outcomes, calculates a weighted average of stratum-specific risk difference estimators. Traditionally, this method requires the stringent assumption that risk differences are homogeneous across strata, also known as the common risk difference assumption. In our article, we relax this assumption and adopt a modern perspective, viewing the MH risk difference estimator as an approach for covariate adjustment in randomized clinical trials, distinguishing its use from that in meta-analysis and observational studies. We demonstrate that the MH risk difference estimator consistently estimates the average treatment effect within a standard super-population framework, which is often the primary interest in randomized clinical trials, in addition to estimating a weighted average of stratum-specific risk difference. We rigorously study its properties under both the large-stratum and sparse-stratum asymptotic regimes. Furthermore, for either estimand, we propose a unified robust variance estimator that improves over the popular variance estimators by Greenland and Robins (1985) and Sato et al. (1989) and has provable consistency across both asymptotic regimes, regardless of assuming common risk differences. Extensions of our theoretical results also provide new insights into the Cochran-Mantel-Haenszel test and the post-stratification estimator. Our findings are thoroughly validated through simulations and a clinical trial example. △ Less

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.11785 [pdf, other]

doi 10.1145/3664647.3681236

Timeline and Boundary Guided Diffusion Network for Video Shadow Detection

Authors: Haipeng Zhou, Honqiu Wang, Tian Ye, Zhaohu Xing, Jun Ma, Ping Li, Qiong Wang, Lei Zhu

Abstract: Video Shadow Detection (VSD) aims to detect the shadow masks with frame sequence. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadow. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD where we take account of the past-future temporal guidanc… ▽ More Video Shadow Detection (VSD) aims to detect the shadow masks with frame sequence. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadow. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD where we take account of the past-future temporal guidance and boundary information jointly. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity of the long-term and short-term frames for the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA) to utilize the edge contexts for capturing the characteristics of shadows. Moreover, we are the first to introduce the Diffusion model for VSD in which we explore a Space-Time Encoded Embedding (STEE) to inject the temporal guidance for Diffusion to conduct shadow detection. Benefiting from these designs, our model can not only capture the temporal information but also the shadow property. Extensive experiments show that the performance of our approach overtakes the state-of-the-art methods, verifying the effectiveness of our components. We release the codes, weights, and results at \url{https://github.com/haipengzhou856/TBGDiff}. △ Less

Submitted 21 August, 2024; originally announced August 2024.

Comments: ACM MM2024

arXiv:2408.09204 [pdf, other]

A Series of (Net) Spin-down Glitches in PSR J1522-5735: Insights from the Vortex Creep and Vortex Bending Models

Authors: S. Q. Zhou, W. T. Ye, M. Y. Ge, E. GügercinoğLu, S. J. Zheng, C. Yu, J. P. Yuan, J. Zhang

Abstract: Through a detailed timing analysis of $\textit{Fermi}$-LAT data, the rotational behavior of the $γ$-ray pulsar PSR J1522$-$5735 was tracked from August 2008 (MJD 54692) to January 2024 (MJD 60320). During this 15.4-year period, two over-recovery glitches and four anti-glitches were identified, marking a rare occurrence in rotation-powered pulsars (RPPs). The magnitudes of these (net) spin-down gli… ▽ More Through a detailed timing analysis of $\textit{Fermi}$-LAT data, the rotational behavior of the $γ$-ray pulsar PSR J1522$-$5735 was tracked from August 2008 (MJD 54692) to January 2024 (MJD 60320). During this 15.4-year period, two over-recovery glitches and four anti-glitches were identified, marking a rare occurrence in rotation-powered pulsars (RPPs). The magnitudes of these (net) spin-down glitches were determined to be $|Δν_{\rm g}/ν| \sim 10^{-8}$, well above the estimated detectability limit. For the two over-recovery glitches, the respective recovery fractions $Q$ are $2.1(7)$ and $1.4(2)$. Further analysis showed no substantial variations in either the flux or pulse profile shape in any of these events, suggesting that small (net) spin-down glitches, unlike large events observed in magnetars and magnetar-like RPPs, may occur without leaving an impact on the magnetosphere. Within the framework of the vortex creep and vortex bending models, anti-glitches and over-recoveries indicate the recoupling of vortex lines that moved inward as a result of a crustquake; meanwhile, the apparent fluctuations in the spin-down rate after the glitches occur as a result of the coupling of the oscillations of bent vortex lines to the magnetosphere. △ Less

Submitted 17 August, 2024; originally announced August 2024.

Comments: 8 pages, 3 figures, 2 tables, submitted to ApJL. Comments Welcome!

arXiv:2408.07032 [pdf]

QIris: Quantum Implementation of Rainbow Table Attacks

Authors: Lee Jun Quan, Tan Jia Ye, Goh Geok Ling, Vivek Balachandran

Abstract: This paper explores the use of Grover's Algorithm in the classical rainbow table, uncovering the potential of integrating quantum computing techniques with conventional cryptographic methods to develop a Quantum Rainbow Table Proof-of-Concept. This leverages on Quantum concepts and algorithms which includes the principle of qubit superposition, entanglement and teleportation, coupled with Grover's… ▽ More This paper explores the use of Grover's Algorithm in the classical rainbow table, uncovering the potential of integrating quantum computing techniques with conventional cryptographic methods to develop a Quantum Rainbow Table Proof-of-Concept. This leverages on Quantum concepts and algorithms which includes the principle of qubit superposition, entanglement and teleportation, coupled with Grover's Algorithm to enable a more efficient search through the rainbow table. The paper also details on the hardware constraints and the work around to produce better results in the implementation stages. Through this work we develop a working prototype of quantum rainbow table and demonstrate how quantum computing could significantly improve the speed of cyber tools such as password crackers and thus impact the cyber security landscape. △ Less

Submitted 13 August, 2024; originally announced August 2024.

arXiv:2407.20311 [pdf, other]

Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

Authors: Tian Ye, Zicheng Xu, Yuanzhi Li, Zeyuan Allen-Zhu

Abstract: Recent advances in language models have demonstrated their capability to solve mathematical reasoning problems, achieving near-perfect accuracy on grade-school level math benchmarks like GSM8K. In this paper, we formally study how language models solve these problems. We design a series of controlled experiments to address several fundamental questions: (1) Can language models truly develop reason… ▽ More Recent advances in language models have demonstrated their capability to solve mathematical reasoning problems, achieving near-perfect accuracy on grade-school level math benchmarks like GSM8K. In this paper, we formally study how language models solve these problems. We design a series of controlled experiments to address several fundamental questions: (1) Can language models truly develop reasoning skills, or do they simply memorize templates? (2) What is the model's hidden (mental) reasoning process? (3) Do models solve math questions using skills similar to or different from humans? (4) Do models trained on GSM8K-like datasets develop reasoning skills beyond those necessary for solving GSM8K problems? (5) What mental process causes models to make reasoning mistakes? (6) How large or deep must a model be to effectively solve GSM8K-level math questions? Our study uncovers many hidden mechanisms by which language models solve mathematical questions, providing insights that extend beyond current understandings of LLMs. △ Less

Submitted 29 July, 2024; originally announced July 2024.

Comments: video appeared in ICML 2024 tutorial

arXiv:2407.18035 [pdf, other]

RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models

Authors: Haoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, Lei Zhu

Abstract: Natural images captured by mobile devices often suffer from multiple types of degradation, such as noise, blur, and low light. Traditional image restoration methods require manual selection of specific tasks, algorithms, and execution sequences, which is time-consuming and may yield suboptimal results. All-in-one models, though capable of handling multiple tasks, typically support only a limited r… ▽ More Natural images captured by mobile devices often suffer from multiple types of degradation, such as noise, blur, and low light. Traditional image restoration methods require manual selection of specific tasks, algorithms, and execution sequences, which is time-consuming and may yield suboptimal results. All-in-one models, though capable of handling multiple tasks, typically support only a limited range and often produce overly smooth, low-fidelity outcomes due to their broad data distribution fitting. To address these challenges, we first define a new pipeline for restoring images with multiple degradations, and then introduce RestoreAgent, an intelligent image restoration system leveraging multimodal large language models. RestoreAgent autonomously assesses the type and extent of degradation in input images and performs restoration through (1) determining the appropriate restoration tasks, (2) optimizing the task sequence, (3) selecting the most suitable models, and (4) executing the restoration. Experimental results demonstrate the superior performance of RestoreAgent in handling complex degradation, surpassing human experts. Furthermore, the system modular design facilitates the fast integration of new tasks and models, enhancing its flexibility and scalability for various applications. △ Less

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.14900 [pdf, other]

AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement

Authors: Yunlong Lin, Tian Ye, Sixiang Chen, Zhenqi Fu, Yingying Wang, Wenhao Chai, Zhaohu Xing, Lei Zhu, Xinghao Ding

Abstract: Existing low-light image enhancement (LIE) methods have achieved noteworthy success in solving synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) the collection of distorted/clean image pairs is often impractical and sometimes even unavailable, and 2) accurately modeling complex degradations presents… ▽ More Existing low-light image enhancement (LIE) methods have achieved noteworthy success in solving synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) the collection of distorted/clean image pairs is often impractical and sometimes even unavailable, and 2) accurately modeling complex degradations presents a non-trivial problem. To overcome them, we propose the Attribute Guidance Diffusion framework (AGLLDiff), a training-free method for effective real-world LIE. Instead of specifically defining the degradation process, AGLLDiff shifts the paradigm and models the desired attributes, such as image exposure, structure and color of normal-light images. These attributes are readily available and impose no assumptions about the degradation process, which guides the diffusion sampling process to a reliable high-quality solution space. Extensive experiments demonstrate that our approach outperforms the current leading unsupervised LIE methods across benchmarks in terms of distortion-based and perceptual-based metrics, and it performs well even in sophisticated wild degradation. △ Less

Submitted 23 July, 2024; v1 submitted 20 July, 2024; originally announced July 2024.

Comments: 21 pages, 9 figures

arXiv:2407.10923 [pdf, other]

OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting

Authors: Penglei Gao, Kai Yao, Tiandi Ye, Steven Wang, Yuan Yao, Xiaofeng Wang

Abstract: In this paper, we tackle the recently popular topic of generating 360-degree images given the conventional narrow field of view (NFoV) images that could be taken from a single camera or cellphone. This task aims to predict the reasonable and consistent surroundings from the NFoV images. Existing methods for feature extraction and fusion, often built with transformer-based architectures, incur subs… ▽ More In this paper, we tackle the recently popular topic of generating 360-degree images given the conventional narrow field of view (NFoV) images that could be taken from a single camera or cellphone. This task aims to predict the reasonable and consistent surroundings from the NFoV images. Existing methods for feature extraction and fusion, often built with transformer-based architectures, incur substantial memory usage and computational expense. They also have limitations in maintaining visual continuity across the entire 360-degree images, which could cause inconsistent texture and style generation. To solve the aforementioned issues, we propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba to utilize its long-sequence modelling and spatial continuity. Furthermore, incorporating textual information is an effective strategy for guiding image generation, enriching the process with detailed context and increasing diversity. Efficiently extracting textual features and integrating them with image attributes presents a significant challenge for 360-degree image out-painting. To address this, we develop two modules, Visual-textual Consistency Refiner (VCR) and Global-local Mamba Adapter (GMA). VCR enhances contextual richness by fusing the modified text features with the image features, while GMA provides adaptive state-selective conditions by capturing the information flow from global to local representations. Our proposed method achieves state-of-the-art performance with extensive experiments on two broadly used 360-degree image datasets, including indoor and outdoor settings. △ Less

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.09480 [pdf, other]

Using Artificial Intelligence to Unlock Crowdfunding Success for Small Businesses

Authors: Teng Ye, Jingnan Zheng, Junhui Jin, Jingyi Qiu, Wei Ai, Qiaozhu Mei

Abstract: While small businesses are increasingly turning to online crowdfunding platforms for essential funding, over 40% of these campaigns may fail to raise any money, especially those from low socio-economic areas. We utilize the latest advancements in AI technology to identify crucial factors that influence the success of crowdfunding campaigns and to improve their fundraising outcomes by strategically… ▽ More While small businesses are increasingly turning to online crowdfunding platforms for essential funding, over 40% of these campaigns may fail to raise any money, especially those from low socio-economic areas. We utilize the latest advancements in AI technology to identify crucial factors that influence the success of crowdfunding campaigns and to improve their fundraising outcomes by strategically optimizing these factors. Our best-performing machine learning model accurately predicts the fundraising outcomes of 81.0% of campaigns, primarily based on their textual descriptions. Interpreting the machine learning model allows us to provide actionable suggestions on improving the textual description before launching a campaign. We demonstrate that by augmenting just three aspects of the narrative using a large language model, a campaign becomes more preferable to 83% human evaluators, and its likelihood of securing financial support increases by 11.9%. Our research uncovers the effective strategies for crafting descriptions for small business fundraising campaigns and opens up a new realm in integrating large language models into crowdfunding methodologies. △ Less

Submitted 24 April, 2024; originally announced July 2024.

arXiv:2407.01191 [pdf, other]

MARS: Multimodal Active Robotic Sensing for Articulated Characterization

Authors: Hongliang Zeng, Ping Zhang, Chengjiong Wu, Jiahua Wang, Tingyu Ye, Fang Li

Abstract: Precise perception of articulated objects is vital for empowering service robots. Recent studies mainly focus on point cloud, a single-modal approach, often neglecting vital texture and lighting details and assuming ideal conditions like optimal viewpoints, unrepresentative of real-world scenarios. To address these limitations, we introduce MARS, a novel framework for articulated object characteri… ▽ More Precise perception of articulated objects is vital for empowering service robots. Recent studies mainly focus on point cloud, a single-modal approach, often neglecting vital texture and lighting details and assuming ideal conditions like optimal viewpoints, unrepresentative of real-world scenarios. To address these limitations, we introduce MARS, a novel framework for articulated object characterization. It features a multi-modal fusion module utilizing multi-scale RGB features to enhance point cloud features, coupled with reinforcement learning-based active sensing for autonomous optimization of observation viewpoints. In experiments conducted with various articulated object instances from the PartNet-Mobility dataset, our method outperformed current state-of-the-art methods in joint parameter estimation accuracy. Additionally, through active sensing, MARS further reduces errors, demonstrating enhanced efficiency in handling suboptimal viewpoints. Furthermore, our method effectively generalizes to real-world articulated objects, enhancing robot interactions. Code is available at https://github.com/robhlzeng/MARS. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.17342 [pdf, other]

Masked Generative Extractor for Synergistic Representation and 3D Generation of Point Clouds

Authors: Hongliang Zeng, Ping Zhang, Fang Li, Jiahua Wang, Tingyu Ye, Pengteng Guo

Abstract: Representation and generative learning, as reconstruction-based methods, have demonstrated their potential for mutual reinforcement across various domains. In the field of point cloud processing, although existing studies have adopted training strategies from generative models to enhance representational capabilities, these methods are limited by their inability to genuinely generate 3D shapes. To… ▽ More Representation and generative learning, as reconstruction-based methods, have demonstrated their potential for mutual reinforcement across various domains. In the field of point cloud processing, although existing studies have adopted training strategies from generative models to enhance representational capabilities, these methods are limited by their inability to genuinely generate 3D shapes. To explore the benefits of deeply integrating 3D representation learning and generative learning, we propose an innovative framework called \textit{Point-MGE}. Specifically, this framework first utilizes a vector quantized variational autoencoder to reconstruct a neural field representation of 3D shapes, thereby learning discrete semantic features of point patches. Subsequently, we design a sliding masking ratios to smooth the transition from representation learning to generative learning. Moreover, our method demonstrates strong generalization capability in learning high-capacity models, achieving new state-of-the-art performance across multiple downstream tasks. In shape classification, Point-MGE achieved an accuracy of 94.2% (+1.0%) on the ModelNet40 dataset and 92.9% (+5.5%) on the ScanObjectNN dataset. Experimental results also confirmed that Point-MGE can generate high-quality 3D shapes in both unconditional and conditional settings. △ Less

Submitted 15 August, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.15478 [pdf]

Impact of the Top SiO2 Interlayer Thickness on Memory Window of Si Channel FeFET with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) Gate Structure

Authors: Tao Hu, Xianzhou Shao, Mingkai Bai, Xinpei Jia, Saifei Dai, Xiaoqing Sun, Runhao Han, Jia Yang, Xiaoyu Ke, Fengbin Tian, Shuai Yang, Junshuai Chai, Hao Xu, Xiaolei Wang, Wenwu Wang, Tianchun Ye

Abstract: We study the impact of top SiO2 interlayer thickness on the memory window (MW) of Si channel ferroelectric field-effect transistor (FeFET) with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) gate structure. We find that the MW increases with the increasing thickness of the top SiO2 interlayer, and such an increase exhibits a two-stage linear dependence. The physical origin is the presence of the different… ▽ More We study the impact of top SiO2 interlayer thickness on the memory window (MW) of Si channel ferroelectric field-effect transistor (FeFET) with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) gate structure. We find that the MW increases with the increasing thickness of the top SiO2 interlayer, and such an increase exhibits a two-stage linear dependence. The physical origin is the presence of the different interfacial charges trapped at the top SiO2/Hf0.5Zr0.5O2 interface. Moreover, we investigate the dependence of endurance characteristics on initial MW. We find that the endurance characteristic degrades with increasing the initial MW. By inserting a 3.4 nm SiO2 dielectric interlayer between the gate metal TiN and the ferroelectric Hf0.5Zr0.5O2, we achieve a MW of 6.3 V and retention over 10 years. Our work is helpful in the device design of FeFET. △ Less

Submitted 16 June, 2024; originally announced June 2024.

Comments: 6 pages, 12 figures. arXiv admin note: substantial text overlap with arXiv:2404.15825

arXiv:2406.11935 [pdf, other]

Iterative or Innovative? A Problem-Oriented Perspective for Code Optimization

Authors: Tong Ye, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji, Wenhai Wang

Abstract: Large language models (LLMs) have demonstrated strong capabilities in solving a wide range of programming tasks. However, LLMs have rarely been explored for code optimization. In this paper, we explore code optimization with a focus on performance enhancement, specifically aiming to optimize code for minimal execution time. The recently proposed first PIE dataset for performance optimization const… ▽ More Large language models (LLMs) have demonstrated strong capabilities in solving a wide range of programming tasks. However, LLMs have rarely been explored for code optimization. In this paper, we explore code optimization with a focus on performance enhancement, specifically aiming to optimize code for minimal execution time. The recently proposed first PIE dataset for performance optimization constructs program optimization pairs based on iterative submissions from the same programmer for the same problem. However, this approach restricts LLMs to local performance improvements, neglecting global algorithmic innovation. Therefore, we adopt a completely different perspective by reconstructing the optimization pairs into a problem-oriented approach. This allows for the integration of various ingenious ideas from different programmers tackling the same problem. Experimental results demonstrate that adapting LLMs to problem-oriented optimization pairs significantly enhances their optimization capabilities. Meanwhile, we identified performance bottlenecks within the problem-oriented perspective. By employing model merge, we further overcame bottlenecks and ultimately elevated the program optimization ratio ($51.76\%\rightarrow76.65\%$) and speedup ($2.65\times\rightarrow5.09\times$) to new levels. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11247 [pdf, other]

STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft

Authors: Zhonghan Zhao, Wenhao Chai, Xuan Wang, Ke Ma, Kewei Chen, Dongxu Guo, Tian Ye, Yanting Zhang, Hongwei Wang, Gaoang Wang

Abstract: Building an embodied agent system with a large language model (LLM) as its core is a promising direction. Due to the significant costs and uncontrollable factors associated with deploying and training such agents in the real world, we have decided to begin our exploration within the Minecraft environment. Our STEVE Series agents can complete basic tasks in a virtual environment and more challengin… ▽ More Building an embodied agent system with a large language model (LLM) as its core is a promising direction. Due to the significant costs and uncontrollable factors associated with deploying and training such agents in the real world, we have decided to begin our exploration within the Minecraft environment. Our STEVE Series agents can complete basic tasks in a virtual environment and more challenging tasks such as navigation and even creative tasks, with an efficiency far exceeding previous state-of-the-art methods by a factor of $2.5\times$ to $7.3\times$. We begin our exploration with a vanilla large language model, augmenting it with a vision encoder and an action codebase trained on our collected high-quality dataset STEVE-21K. Subsequently, we enhanced it with a Critic and memory to transform it into a complex system. Finally, we constructed a hierarchical multi-agent system. Our recent work explored how to prune the agent system through knowledge distillation. In the future, we will explore more potential applications of STEVE agents in the real world. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: CVPR 2024 Embodied AI Workshop

arXiv:2406.07801 [pdf, other]

PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models

Authors: Runyan Yang, Huibao Yang, Xiqing Zhang, Tiantian Ye, Ying Liu, Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng

Abstract: Recently, there have been attempts to integrate various speech processing tasks into a unified model. However, few previous works directly demonstrated that joint optimization of diverse tasks in multitask speech models has positive influence on the performance of individual tasks. In this paper we present a multitask speech model -- PolySpeech, which supports speech recognition, speech synthesis,… ▽ More Recently, there have been attempts to integrate various speech processing tasks into a unified model. However, few previous works directly demonstrated that joint optimization of diverse tasks in multitask speech models has positive influence on the performance of individual tasks. In this paper we present a multitask speech model -- PolySpeech, which supports speech recognition, speech synthesis, and two speech classification tasks. PolySpeech takes multi-modal language model as its core structure and uses semantic representations as speech inputs. We introduce semantic speech embedding tokenization and speech reconstruction methods to PolySpeech, enabling efficient generation of high-quality speech for any given speaker. PolySpeech shows competitiveness across various tasks compared to single-task models. In our experiments, multitask optimization achieves performance comparable to single-task optimization and is especially beneficial for specific tasks. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: 5 pages, 2 figures

arXiv:2405.16133 [pdf, other]

Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

Authors: Tong Ye, Yangkai Du, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji, Wenhai Wang

Abstract: Large Language Models (LLMs) have exhibited remarkable proficiency in generating code. However, the misuse of LLM-generated (Synthetic) code has prompted concerns within both educational and industrial domains, highlighting the imperative need for the development of synthetic code detectors. Existing methods for detecting LLM-generated content are primarily tailored for general text and often stru… ▽ More Large Language Models (LLMs) have exhibited remarkable proficiency in generating code. However, the misuse of LLM-generated (Synthetic) code has prompted concerns within both educational and industrial domains, highlighting the imperative need for the development of synthetic code detectors. Existing methods for detecting LLM-generated content are primarily tailored for general text and often struggle with code content due to the distinct grammatical structure of programming languages and massive "low-entropy" tokens. Building upon this, our work proposes a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants. Our method relies on the intuition that the differences between the LLM-rewritten and original codes tend to be smaller when the original code is synthetic. We utilize self-supervised contrastive learning to train a code similarity model and assess our approach on two synthetic code detection benchmarks. Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts, with an improvement of 20.5% in the APPS benchmark and 29.1% in the MBPP benchmark. △ Less

Submitted 29 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

Comments: Previously submitted to EMNLP2023

arXiv:2405.16046 [pdf, ps, other]

Sensitivity Analysis for Attributable Effects in Case$^2$ Studies

Authors: Kan Chen, Ting Ye, Dylan S. Small

Abstract: The case$^2$ study, also referred to as the case-case study design, is a valuable approach for conducting inference for treatment effects. Unlike traditional case-control studies, the case$^2$ design compares treatment in two types of cases with the same disease. A key quantity of interest is the attributable effect, which is the number of cases of disease among treated units which are caused by t… ▽ More The case$^2$ study, also referred to as the case-case study design, is a valuable approach for conducting inference for treatment effects. Unlike traditional case-control studies, the case$^2$ design compares treatment in two types of cases with the same disease. A key quantity of interest is the attributable effect, which is the number of cases of disease among treated units which are caused by the treatment. Two key assumptions that are usually made for making inferences about the attributable effect in case$^2$ studies are 1.) treatment does not cause the second type of case, and 2.) the treatment does not alter an individual's case type. However, these assumptions are not realistic in many real-data applications. In this article, we present a sensitivity analysis framework to scrutinize the impact of deviations from these assumptions on obtained results. We also include sensitivity analyses related to the assumption of unmeasured confounding, recognizing the potential bias introduced by unobserved covariates. The proposed methodology is exemplified through an investigation into whether having violent behavior in the last year of life increases suicide risk via 1993 National Mortality Followback Survey dataset. △ Less

Submitted 25 May, 2024; originally announced May 2024.

Comments: 25 pages, 2 Figures, 4 Tables

arXiv:2405.05811 [pdf, other]

Parallel Cross Strip Attention Network for Single Image Dehazing

Authors: Lihan Tong, Yun Liu, Tian Ye, Weijia Li, Liyuan Chen, Erkang Chen

Abstract: The objective of single image dehazing is to restore hazy images and produce clear, high-quality visuals. Traditional convolutional models struggle with long-range dependencies due to their limited receptive field size. While Transformers excel at capturing such dependencies, their quadratic computational complexity in relation to feature map resolution makes them less suitable for pixel-to-pixel… ▽ More The objective of single image dehazing is to restore hazy images and produce clear, high-quality visuals. Traditional convolutional models struggle with long-range dependencies due to their limited receptive field size. While Transformers excel at capturing such dependencies, their quadratic computational complexity in relation to feature map resolution makes them less suitable for pixel-to-pixel dense prediction tasks. Moreover, fixed kernels or tokens in most models do not adapt well to varying blur sizes, resulting in suboptimal dehazing performance. In this study, we introduce a novel dehazing network based on Parallel Stripe Cross Attention (PCSA) with a multi-scale strategy. PCSA efficiently integrates long-range dependencies by simultaneously capturing horizontal and vertical relationships, allowing each pixel to capture contextual cues from an expanded spatial domain. To handle different sizes and shapes of blurs flexibly, We employs a channel-wise design with varying convolutional kernel sizes and strip lengths in each PCSA to capture context information at different scales.Additionally, we incorporate a softmax-based adaptive weighting mechanism within PCSA to prioritize and leverage more critical features. △ Less

Submitted 9 May, 2024; originally announced May 2024.

Comments: 10 pages , 4 figures, CTISC'24

Report number: C052

arXiv:2404.17747 [pdf, other]

MMA-UNet: A Multi-Modal Asymmetric UNet Architecture for Infrared and Visible Image Fusion

Authors: Jingxue Huang, Xilai Li, Tianshu Tan, Xiaosong Li, Tao Ye

Abstract: Multi-modal image fusion (MMIF) maps useful information from various modalities into the same representation space, thereby producing an informative fused image. However, the existing fusion algorithms tend to symmetrically fuse the multi-modal images, causing the loss of shallow information or bias towards a single modality in certain regions of the fusion results. In this study, we analyzed the… ▽ More Multi-modal image fusion (MMIF) maps useful information from various modalities into the same representation space, thereby producing an informative fused image. However, the existing fusion algorithms tend to symmetrically fuse the multi-modal images, causing the loss of shallow information or bias towards a single modality in certain regions of the fusion results. In this study, we analyzed the spatial distribution differences of information in different modalities and proved that encoding features within the same network is not conducive to achieving simultaneous deep feature space alignment for multi-modal images. To overcome this issue, a Multi-Modal Asymmetric UNet (MMA-UNet) was proposed. We separately trained specialized feature encoders for different modal and implemented a cross-scale fusion strategy to maintain the features from different modalities within the same representation space, ensuring a balanced information fusion process. Furthermore, extensive fusion and downstream task experiments were conducted to demonstrate the efficiency of MMA-UNet in fusing infrared and visible image information, producing visually natural and semantically rich fusion results. Its performance surpasses that of the state-of-the-art comparison fusion methods. △ Less

Submitted 11 July, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.17176 [pdf, other]

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

Authors: Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, Gaoang Wang

Abstract: Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features for video understanding, and they only perform well on short videos. For long… ▽ More Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features for video understanding, and they only perform well on short videos. For long videos, the computational complexity and memory costs associated with long-term temporal connections are significantly increased, posing additional challenges.Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers being employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. We lift pre-trained multi-modal large language models for understanding long videos without incorporating additional trainable temporal modules, employing a zero-shot approach. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long video, 2K temporal grounding labels, and 14K manual annotations for validation of the effectiveness of our method. The code along with the dataset can be accessed via the following https://github.com/rese1f/MovieChat. △ Less

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.15825 [pdf]

Impact of Top SiO2 interlayer Thickness on Memory Window of Si Channel FeFET with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) Gate Structure

Authors: Tao Hu, Xianzhou Shao, Mingkai Bai, Xinpei Jia, Saifei Dai, Xiaoqing Sun, Runhao Han, Jia Yang, Xiaoyu Ke, Fengbin Tian, Shuai Yang, Junshuai Chai, Hao Xu, Xiaolei Wang, Wenwu Wang, Tianchun Ye

Abstract: We study the impact of top SiO2 interlayer thickness on memory window of Si channel FeFET with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) gate structure. The memory window increases with thicker top SiO2. We realize the memory window of 6.3 V for 3.4 nm top SiO2. Moreover, we find that the endurance characteristic degrades with increasing the initial memory window. We study the impact of top SiO2 interlayer thickness on memory window of Si channel FeFET with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) gate structure. The memory window increases with thicker top SiO2. We realize the memory window of 6.3 V for 3.4 nm top SiO2. Moreover, we find that the endurance characteristic degrades with increasing the initial memory window. △ Less

Submitted 24 April, 2024; originally announced April 2024.

Comments: 4 page 7 figures

arXiv:2404.14248 [pdf, other]

NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and Results

Authors: Xiaoning Liu, Zongwei Wu, Ao Li, Florin-Alexandru Vasluianu, Yulun Zhang, Shuhang Gu, Le Zhang, Ce Zhu, Radu Timofte, Zhi Jin, Hongjun Wu, Chenxi Wang, Haitao Ling, Yuanhao Cai, Hao Bian, Yuxin Zheng, Jing Lin, Alan Yuille, Ben Shao, Jin Guo, Tianli Liu, Mohao Wu, Yixu Feng, Shuo Hou, Haotian Lin , et al. (87 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlig… ▽ More This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlighting, extreme darkness, and night scenes. A notable total of 428 participants registered for the challenge, with 22 teams ultimately making valid submissions. This paper meticulously evaluates the state-of-the-art advancements in enhancing low-light images, reflecting the significant progress and creativity in this field. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: NTIRE 2024 Challenge Report

arXiv:2404.03225 [pdf, other]

FACTUAL: A Novel Framework for Contrastive Learning Based Robust SAR Image Classification

Authors: Xu Wang, Tian Ye, Rajgopal Kannan, Viktor Prasanna

Abstract: Deep Learning (DL) Models for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR), while delivering improved performance, have been shown to be quite vulnerable to adversarial attacks. Existing works improve robustness by training models on adversarial samples. However, by focusing mostly on attacks that manipulate images randomly, they neglect the real-world feasibility of such atta… ▽ More Deep Learning (DL) Models for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR), while delivering improved performance, have been shown to be quite vulnerable to adversarial attacks. Existing works improve robustness by training models on adversarial samples. However, by focusing mostly on attacks that manipulate images randomly, they neglect the real-world feasibility of such attacks. In this paper, we propose FACTUAL, a novel Contrastive Learning framework for Adversarial Training and robust SAR classification. FACTUAL consists of two components: (1) Differing from existing works, a novel perturbation scheme that incorporates realistic physical adversarial attacks (such as OTSA) to build a supervised adversarial pre-training network. This network utilizes class labels for clustering clean and perturbed images together into a more informative feature space. (2) A linear classifier cascaded after the encoder to use the computed representations to predict the target labels. By pre-training and fine-tuning our model on both clean and adversarial samples, we show that our model achieves high prediction accuracy on both cases. Our model achieves 99.7% accuracy on clean samples, and 89.6% on perturbed samples, both outperforming previous state-of-the-art methods. △ Less

Submitted 4 April, 2024; originally announced April 2024.

Comments: 2024 IEEE Radar Conference

arXiv:2403.18493 [pdf, other]

VersaT2I: Improving Text-to-Image Models with Versatile Reward

Authors: Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang

Abstract: Recent text-to-image (T2I) models have benefited from large-scale and high-quality data, demonstrating impressive performance. However, these T2I models still struggle to produce images that are aesthetically pleasing, geometrically accurate, faithful to text, and of good low-level quality. We present VersaT2I, a versatile training framework that can boost the performance with multiple rewards of… ▽ More Recent text-to-image (T2I) models have benefited from large-scale and high-quality data, demonstrating impressive performance. However, these T2I models still struggle to produce images that are aesthetically pleasing, geometrically accurate, faithful to text, and of good low-level quality. We present VersaT2I, a versatile training framework that can boost the performance with multiple rewards of any T2I model. We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc. Then, for every quality aspect, we select high-quality images in this aspect generated by the model as the training set to finetune the T2I model using the Low-Rank Adaptation (LoRA). Furthermore, we introduce a gating function to combine multiple quality aspects, which can avoid conflicts between different quality aspects. Our method is easy to extend and does not require any manual annotation, reinforcement learning, or model architecture changes. Extensive experiments demonstrate that VersaT2I outperforms the baseline methods across various quality criteria. △ Less

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.18318 [pdf, other]

Uncertainty-Aware SAR ATR: Defending Against Adversarial Attacks via Bayesian Neural Networks

Authors: Tian Ye, Rajgopal Kannan, Viktor Prasanna, Carl Busart

Abstract: Adversarial attacks have demonstrated the vulnerability of Machine Learning (ML) image classifiers in Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) systems. An adversarial attack can deceive the classifier into making incorrect predictions by perturbing the input SAR images, for example, with a few scatterers attached to the on-ground objects. Therefore, it is critical to devel… ▽ More Adversarial attacks have demonstrated the vulnerability of Machine Learning (ML) image classifiers in Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) systems. An adversarial attack can deceive the classifier into making incorrect predictions by perturbing the input SAR images, for example, with a few scatterers attached to the on-ground objects. Therefore, it is critical to develop robust SAR ATR systems that can detect potential adversarial attacks by leveraging the inherent uncertainty in ML classifiers, thereby effectively alerting human decision-makers. In this paper, we propose a novel uncertainty-aware SAR ATR for detecting adversarial attacks. Specifically, we leverage the capability of Bayesian Neural Networks (BNNs) in performing image classification with quantified epistemic uncertainty to measure the confidence for each input SAR image. By evaluating the uncertainty, our method alerts when the input SAR image is likely to be adversarially generated. Simultaneously, we also generate visual explanations that reveal the specific regions in the SAR image where the adversarial scatterers are likely to to be present, thus aiding human decision-making with hints of evidence of adversarial attacks. Experiments on the MSTAR dataset demonstrate that our approach can identify over 80% adversarial SAR images with fewer than 20% false alarms, and our visual explanations can identify up to over 90% of scatterers in an adversarial SAR image. △ Less

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.13260 [pdf, other]

A Bayesian Approach for Selecting Relevant External Data (BASE): Application to a study of Long-Term Outcomes in a Hemophilia Gene Therapy Trial

Authors: Tianyu Pan, Xiang Zhang, Weining Shen, Ting Ye

Abstract: Gene therapies aim to address the root causes of diseases, particularly those stemming from rare genetic defects that can be life-threatening or severely debilitating. While there has been notable progress in the development of gene therapies in recent years, understanding their long-term effectiveness remains challenging due to a lack of data on long-term outcomes, especially during the early sta… ▽ More Gene therapies aim to address the root causes of diseases, particularly those stemming from rare genetic defects that can be life-threatening or severely debilitating. While there has been notable progress in the development of gene therapies in recent years, understanding their long-term effectiveness remains challenging due to a lack of data on long-term outcomes, especially during the early stages of their introduction to the market. To address the critical question of estimating long-term efficacy without waiting for the completion of lengthy clinical trials, we propose a novel Bayesian framework. This framework selects pertinent data from external sources, often early-phase clinical trials with more comprehensive longitudinal efficacy data that could lead to an improved inference of the long-term efficacy outcome. We apply this methodology to predict the long-term factor IX (FIX) levels of HEMGENIX (etranacogene dezaparvovec), the first FDA-approved gene therapy to treat adults with severe Hemophilia B, in a phase 3 study. Our application showcases the capability of the framework to estimate the 5-year FIX levels following HEMGENIX therapy, demonstrating sustained FIX levels induced by HEMGENIX infusion. Additionally, we provide theoretical insights into the methodology by establishing its posterior convergence properties. △ Less

Submitted 9 April, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.12056 [pdf, other]

Enhancing Digital Hologram Reconstruction Using Reverse-Attention Loss for Untrained Physics-Driven Deep Learning Models with Uncertain Distance

Authors: Xiwen Chen, Hao Wang, Zhao Zhang, Zhenmin Li, Huayu Li, Tong Ye, Abolfazl Razi

Abstract: Untrained Physics-based Deep Learning (DL) methods for digital holography have gained significant attention due to their benefits, such as not requiring an annotated training dataset, and providing interpretability since utilizing the governing laws of hologram formation. However, they are sensitive to the hard-to-obtain precise object distance from the imaging plane, posing the… ▽ More Untrained Physics-based Deep Learning (DL) methods for digital holography have gained significant attention due to their benefits, such as not requiring an annotated training dataset, and providing interpretability since utilizing the governing laws of hologram formation. However, they are sensitive to the hard-to-obtain precise object distance from the imaging plane, posing the $\textit{Autofocusing}$ challenge. Conventional solutions involve reconstructing image stacks for different potential distances and applying focus metrics to select the best results, which apparently is computationally inefficient. In contrast, recently developed DL-based methods treat it as a supervised task, which again needs annotated data and lacks generalizability. To address this issue, we propose $\textit{reverse-attention loss}$, a weighted sum of losses for all possible candidates with learnable weights. This is a pioneering approach to addressing the Autofocusing challenge in untrained deep-learning methods. Both theoretical analysis and experiments demonstrate its superiority in efficiency and accuracy. Interestingly, our method presents a significant reconstruction performance over rival methods (i.e. alternating descent-like optimization, non-weighted loss integration, and random distance assignment) and even is almost equal to that achieved with a precisely known object distance. For example, the difference is less than 1dB in PSNR and 0.002 in SSIM for the target sample in our experiment. △ Less

Submitted 10 January, 2024; originally announced March 2024.

arXiv:2403.11822 [pdf]

Discovery of self-assembled Ru/Si heterostructures with unique periodic nanostripe patterns boosting hydrogen evolution

Authors: Weizheng Cai, Xinyi He, Tian-Nan Ye, Xinmeng Hu, Chuanlong Liu, Masato Sasase, Masaaki Kitano, Toshio Kamiya, Hideo Hosono, Jiazhen Wu

Abstract: Two-dimensional (2D) heterostructuring is a versatile methodology for designing nanoarchitecture catalytic systems that allow for reconstruction and modulation of interfaces and electronic structures. However, catalysts with such structures are extremely scarce due to limited synthetic strategies. Here, we report a highly ordered 2D Ru/Si nano-heterostructures (RSHS) by acid etching of the LaRuSi… ▽ More Two-dimensional (2D) heterostructuring is a versatile methodology for designing nanoarchitecture catalytic systems that allow for reconstruction and modulation of interfaces and electronic structures. However, catalysts with such structures are extremely scarce due to limited synthetic strategies. Here, we report a highly ordered 2D Ru/Si nano-heterostructures (RSHS) by acid etching of the LaRuSi electride. RSHS shows a superior electrocatalytic activity for hydrogen evolution with an overpotential of 14 mV at 10 mA/cm2 in alkaline media. Both experimental analysis and first-principles calculations demonstrate that the electronic states of Ru can be tuned by strong interactions of the interfacial Ru-Si, leading to an optimized hydrogen adsorption energy. Moreover, due to the synergistic effect of Ru and Si, the energy barrier of water dissociation is significantly reduced. The unique nanostripe structure with abundant interfaces in RSHS will provide a paradigm for construction of efficient catalysts with tunable electronic states and dual active sites. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: 20 pages, 5 figures

arXiv:2403.10340 [pdf, other]

Thermal-NeRF: Neural Radiance Fields from an Infrared Camera

Authors: Tianxiang Ye, Qi Wu, Junyuan Deng, Guoqing Liu, Liu Liu, Songpengcheng Xia, Liang Pang, Wenxian Yu, Ling Pei

Abstract: In recent years, Neural Radiance Fields (NeRFs) have demonstrated significant potential in encoding highly-detailed 3D geometry and environmental appearance, positioning themselves as a promising alternative to traditional explicit representation for 3D scene reconstruction. However, the predominant reliance on RGB imaging presupposes ideal lighting conditions: a premise frequently unmet in roboti… ▽ More In recent years, Neural Radiance Fields (NeRFs) have demonstrated significant potential in encoding highly-detailed 3D geometry and environmental appearance, positioning themselves as a promising alternative to traditional explicit representation for 3D scene reconstruction. However, the predominant reliance on RGB imaging presupposes ideal lighting conditions: a premise frequently unmet in robotic applications plagued by poor lighting or visual obstructions. This limitation overlooks the capabilities of infrared (IR) cameras, which excel in low-light detection and present a robust alternative under such adverse scenarios. To tackle these issues, we introduce Thermal-NeRF, the first method that estimates a volumetric scene representation in the form of a NeRF solely from IR imaging. By leveraging a thermal mapping and structural thermal constraint derived from the thermal characteristics of IR imaging, our method showcasing unparalleled proficiency in recovering NeRFs in visually degraded scenes where RGB-based methods fall short. We conduct extensive experiments to demonstrate that Thermal-NeRF can achieve superior quality compared to existing methods. Furthermore, we contribute a dataset for IR-based NeRF applications, paving the way for future research in IR NeRF reconstruction. △ Less

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.08282 [pdf, other]

Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation

Authors: Zhonghan Zhao, Kewei Chen, Dongxu Guo, Wenhao Chai, Tian Ye, Yanting Zhang, Gaoang Wang

Abstract: Due to the dynamic and unpredictable open-world setting, navigating complex environments in Minecraft poses significant challenges for multi-agent systems. Agents must interact with the environment and coordinate their actions with other agents to achieve common objectives. However, traditional approaches often struggle to efficiently manage inter-agent communication and task distribution, crucial… ▽ More Due to the dynamic and unpredictable open-world setting, navigating complex environments in Minecraft poses significant challenges for multi-agent systems. Agents must interact with the environment and coordinate their actions with other agents to achieve common objectives. However, traditional approaches often struggle to efficiently manage inter-agent communication and task distribution, crucial for effective multi-agent navigation. Furthermore, processing and integrating multi-modal information (such as visual, textual, and auditory data) is essential for agents to comprehend their goals and navigate the environment successfully and fully. To address this issue, we design the HAS framework to auto-organize groups of LLM-based agents to complete navigation tasks. In our approach, we devise a hierarchical auto-organizing navigation system, which is characterized by 1) a hierarchical system for multi-agent organization, ensuring centralized planning and decentralized execution; 2) an auto-organizing and intra-communication mechanism, enabling dynamic group adjustment under subtasks; 3) a multi-modal information platform, facilitating multi-modal perception to perform the three navigation tasks with one system. To assess organizational behavior, we design a series of navigation tasks in the Minecraft environment, which includes searching and exploring. We aim to develop embodied organizations that push the boundaries of embodied AI, moving it towards a more human-like organizational structure. △ Less

Submitted 18 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

Comments: ICLR 2024 Workshop on LLM Agents

arXiv:2403.06144 [pdf, other]

Simulating Family Conversations using LLMs: Demonstration of Parenting Styles

Authors: Frank Tian-fang Ye, Xiaozi Gao

Abstract: This study presents a framework for conducting psychological and linguistic research through simulated conversations using large language models (LLMs). The proposed methodology offers significant advantages, particularly for simulating human interactions involving potential unethical language or behaviors that would be impermissible in traditional experiments with human participants. As a demonst… ▽ More This study presents a framework for conducting psychological and linguistic research through simulated conversations using large language models (LLMs). The proposed methodology offers significant advantages, particularly for simulating human interactions involving potential unethical language or behaviors that would be impermissible in traditional experiments with human participants. As a demonstration, we employed LLMs to simulate family conversations across four parenting styles (authoritarian, authoritative, permissive, and uninvolved). In general, we observed that the characteristics of the four parenting styles were portrayed in the simulated conversations. Several strategies could be used to improve the simulation quality, such as including context awareness, employing a few-shot prompting approach or fine-tuning models to cater to specific simulation requirements. Overall, this study introduces a promising methodology for conducting psychological and linguistic research through simulated conversations, while acknowledging the current limitations and proposing potential solutions for future refinement and improvement. △ Less

Submitted 10 March, 2024; originally announced March 2024.

arXiv:2403.01198 [pdf]

Organic solvent boosts charge storage and charging dynamics of conductive MOF supercapacitors

Authors: Ming Chen, Taizheng Wu, Liang Niu, Ting Ye, Wenlei Dai, Liang Zeng, Alexei A. Kornyshev, Zhenxiang Wang, Zhou Liu, Guang Feng

Abstract: Conductive metal-organic frameworks (c-MOFs) and ionic liquids (ILs) have emerged as auspicious combinations for high-performance supercapacitors. However, the nanoconfinement from c-MOFs and high viscosity of ILs slow down the charging process. This hindrance can, however, be resolved by adding solvent. Here, we performed constant-potential molecular simulations to scrutinize the solvent impact o… ▽ More Conductive metal-organic frameworks (c-MOFs) and ionic liquids (ILs) have emerged as auspicious combinations for high-performance supercapacitors. However, the nanoconfinement from c-MOFs and high viscosity of ILs slow down the charging process. This hindrance can, however, be resolved by adding solvent. Here, we performed constant-potential molecular simulations to scrutinize the solvent impact on charge storage and charging dynamics of MOF-IL-based supercapacitors. We find conditions for >100% enhancement in capacity and ~6 times increase in charging speed. These improvements were confirmed by synthesizing near-ideal c-MOFs and developing multiscale models linking molecular simulations to electrochemical measurements. Fundamentally, our findings elucidate that the solvent acts as an ionophobic agent to induce a substantial enhancement in charge storage, and as an ion traffic police to eliminate convoluted counterion and co-ion motion paths and create two distinct ion transport highways to accelerate charging dynamics. This work paves the way for the optimal design of MOF supercapacitors. △ Less

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2402.16043 [pdf, other]

LuaTaint: A Static Taint Analysis System for Web Interface Framework Vulnerability of IoT Devices

Authors: Jiahui Xiang, Wenhai Wang, Tong Ye, Peiyu Liu

Abstract: IoT devices are currently facing continuous malicious attacks due to their widespread use. Among these IoT devices, web vulnerabilities are also widely exploited because of their inherent characteristics, such as improper permission controls and insecure interfaces. Recently, the embedded system web interface framework has become highly diverse, and specific vulnerabilities can arise if developers… ▽ More IoT devices are currently facing continuous malicious attacks due to their widespread use. Among these IoT devices, web vulnerabilities are also widely exploited because of their inherent characteristics, such as improper permission controls and insecure interfaces. Recently, the embedded system web interface framework has become highly diverse, and specific vulnerabilities can arise if developers forget to detect user input parameters or if the detection process is not strict enough. Therefore, discovering vulnerabilities in the web interfaces of IoT devices accurately and comprehensively through an automated method is a major challenge. This paper aims to work out the challenge. We have developed an automated vulnerability detection system called LuaTaint for the typical web interface framework, LuCI. The system employs static taint analysis to address web security issues on mobile terminal platforms to ensure detection coverage. It integrates rules pertaining to page handler control logic within the taint detection process to improve its extensibility. We also implemented a post-processing step with the assistance of large language models to enhance accuracy and reduce the need for manual analysis. We have created a prototype of LuaTaint and tested it on 92 IoT firmwares from 8 well-known vendors. LuaTaint has discovered 68 unknown vulnerabilities. △ Less

Submitted 25 February, 2024; originally announced February 2024.

arXiv:2402.14226 [pdf, other]

Broadband noise and quasi-periodic oscillation characteristics of the X-ray pulsar RX J0440.9+4431

Authors: P. P. Li, L. Tao, R. C. Ma, M. Y. Ge, Q. C. Zhao, S. J. Zhao, L. Zhang, Q. C. Bu, L. D. Kong, Y. L. Tuo, L. Ji, S. Zhang, J. L. Qu, S. N. Zhang, Y. Huang, X. Ma, W. T. Ye, Q. C. Shui

Abstract: We present a comprehensive timing analysis on the Be/X-ray binary pulsar RX J0440.9+4431 using observations from \textit{NICER} and \textit{Insight}-HXMT during the 2022--2023 outburst. The power density spectrum (PDS) of RX J0440.9+4431 exhibits typical aperiodic variability in X-ray flux across a wide frequency range. During a super-critical accretion state, we detect quasi-periodic oscillations… ▽ More We present a comprehensive timing analysis on the Be/X-ray binary pulsar RX J0440.9+4431 using observations from \textit{NICER} and \textit{Insight}-HXMT during the 2022--2023 outburst. The power density spectrum (PDS) of RX J0440.9+4431 exhibits typical aperiodic variability in X-ray flux across a wide frequency range. During a super-critical accretion state, we detect quasi-periodic oscillations (QPOs) at 0.2--0.5\,Hz in the light curves of five pulses for RX J0440.9+4431. The observed QPOs manifest during flares, while the flares appear at the peaks of the pulse profiles on a timescale of seconds and are primarily caused by an increase in hard photons. These flares can be explained by increased material ingestion in the accretion column at a fixed phase, primarily generating hard photons. Alternatively, an increase in accretion rate, independent of phase, may result in highly beamed hard photons within the accretion column, causing the flares. We argue the origin of QPOs to instabilities within the accretion flow. Additionally, we find that the break frequencies in the noise power spectra align well with $\propto L_{\mathrm{x}}^{3 / 7}$ across three orders of magnitude in the luminosity, which points to a relatively strong magnetic field in RX J0440.9+4431, estimated to be \textasciitilde$10^{13}$\,G. △ Less

Submitted 21 February, 2024; originally announced February 2024.

Comments: 8 pages, 7 figures. Accepted in MNRAS

arXiv:2402.00307 [pdf, other]

Debiased Multivariable Mendelian Randomization

Authors: Yinxiang Wu, Hyunseung Kang, Ting Ye

Abstract: Multivariable Mendelian randomization (MVMR) uses genetic variants as instrumental variables to infer the direct effect of multiple exposures on an outcome. Compared to univariable Mendelian randomization, MVMR is less prone to horizontal pleiotropy and enables estimation of the direct effect of each exposure on the outcome. However, MVMR faces greater challenges with weak instruments -- genetic v… ▽ More Multivariable Mendelian randomization (MVMR) uses genetic variants as instrumental variables to infer the direct effect of multiple exposures on an outcome. Compared to univariable Mendelian randomization, MVMR is less prone to horizontal pleiotropy and enables estimation of the direct effect of each exposure on the outcome. However, MVMR faces greater challenges with weak instruments -- genetic variants that are weakly associated with some exposures conditional on the other exposures. This article focuses on MVMR using summary data from genome-wide association studies (GWAS). We provide a new asymptotic regime to analyze MVMR estimators with many weak instruments, allowing for linear combinations of exposures to have different degrees of instrument strength, and formally show that the popular multivariable inverse-variance weighted (MV-IVW) estimator's asymptotic behavior is highly sensitive to instruments' strength. We then propose a multivariable debiased IVW (MV-dIVW) estimator, which effectively reduces the asymptotic bias from weak instruments in MV-IVW, and introduce an adjusted version, MV-adIVW, for improved finite-sample robustness. We establish the theoretical properties of our proposed estimators and extend them to handle balanced horizontal pleiotropy. We conclude by demonstrating the performance of our proposed methods in simulated and real datasets. We implement this method in the R package mr.divw. △ Less

Submitted 23 March, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

arXiv:2401.15992 [pdf, other]

Pulsed Iron line Emission from the First Galactic Ultraluminous X-ray Pulsar Swift J0243.6+6124

Authors: Y. X. Xiao, Y. J. Xu, M. Y. Ge, F. J. Lu, S. N. Zhang, S. Zhang, L. Tao, J. L. Qu, P. J. Wang, L. D. Kong, Y. L. Tuo, Y. You, S. J. Zhao, J. Q. Peng, Y. F. Du, Y. H. Zhang, W. T. Ye

Abstract: We report the phase-resolved spectral results of the first Galactic Pulsating Ultra-Luminous X-ray source (PULX) Swift J0243.6+6124, modeling at its 2017-2018 outburst peak using data collected by the Hard X-ray Modulation Telescope (Insight-HXMT). The broad energy coverage of Insight-HXMT allows us to obtain more accurate spectral continuum to reduce the coupling of broad iron line profiles with… ▽ More We report the phase-resolved spectral results of the first Galactic Pulsating Ultra-Luminous X-ray source (PULX) Swift J0243.6+6124, modeling at its 2017-2018 outburst peak using data collected by the Hard X-ray Modulation Telescope (Insight-HXMT). The broad energy coverage of Insight-HXMT allows us to obtain more accurate spectral continuum to reduce the coupling of broad iron line profiles with other components. We use three different continuum spectrum models but obtain similar iron line results. For the first time, we detected the pulse characteristics of the broad iron line in a PULX. The variation in width and intensity of this iron line with $σ\sim 1.2-1.5$\,keV has a phase offset of about 0.25 from the pulse phase. We suggest that the uneven irradiation of the thick inner disk by the accretion column produces the modulated variation of the broad iron line. In addition, the non-pulsed narrow line is suggested to come from the outer disk region. △ Less

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.13560 [pdf, other]

SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation

Authors: Zhaohu Xing, Tian Ye, Yijun Yang, Guang Liu, Lei Zhu

Abstract: The Transformer architecture has shown a remarkable ability in modeling global relationships. However, it poses a significant computational challenge when processing high-dimensional medical images. This hinders its development and widespread adoption in this task. Mamba, as a State Space Model (SSM), recently emerged as a notable manner for long-range dependencies in sequential modeling, excellin… ▽ More The Transformer architecture has shown a remarkable ability in modeling global relationships. However, it poses a significant computational challenge when processing high-dimensional medical images. This hinders its development and widespread adoption in this task. Mamba, as a State Space Model (SSM), recently emerged as a notable manner for long-range dependencies in sequential modeling, excelling in natural language processing filed with its remarkable memory efficiency and computational speed. Inspired by its success, we introduce SegMamba, a novel 3D medical image \textbf{Seg}mentation \textbf{Mamba} model, designed to effectively capture long-range dependencies within whole volume features at every scale. Our SegMamba, in contrast to Transformer-based methods, excels in whole volume feature modeling from a state space model standpoint, maintaining superior processing speed, even with volume features at a resolution of {$64\times 64\times 64$}. Comprehensive experiments on the BraTS2023 dataset demonstrate the effectiveness and efficiency of our SegMamba. The code for SegMamba is available at: https://github.com/ge-xing/SegMamba △ Less

Submitted 25 February, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

Comments: Code has released

arXiv:2401.03692 [pdf, other]

Boosting Column Generation with Graph Neural Networks for Joint Rider Trip Planning and Crew Shift Scheduling

Authors: Jiawei Lu, Tinghan Ye, Wenbo Chen, Pascal Van Hentenryck

Abstract: Optimizing service schedules is pivotal to the reliable, efficient, and inclusive on-demand mobility. This pressing challenge is further exacerbated by the increasing needs of an aging population, the over-subscription of existing services, and the lack of effective solution methods. This study addresses the intricacies of service scheduling, by jointly optimizing rider trip planning and crew sche… ▽ More Optimizing service schedules is pivotal to the reliable, efficient, and inclusive on-demand mobility. This pressing challenge is further exacerbated by the increasing needs of an aging population, the over-subscription of existing services, and the lack of effective solution methods. This study addresses the intricacies of service scheduling, by jointly optimizing rider trip planning and crew scheduling for a complex dynamic mobility service. The resulting optimization problems are extremely challenging computationally for state-of-the-art methods. To address this fundamental gap, this paper introduces the Joint Rider Trip Planning and Crew Shift Scheduling Problem (JRTPCSSP) and a novel solution method, called AGGNNI-CG (Attention and Gated GNN- Informed Column Generation), that hybridizes column generation and machine learning to obtain near-optimal solutions to the JRTPCSSP with the real-time constraints of the application. The key idea of the machine-learning component is to dramatically reduce the number of paths to explore in the pricing component, accelerating the most time-consuming component of the column generation. The machine learning component is a graph neural network with an attention mechanism and a gated architecture, that is particularly suited to cater for the different input sizes coming from daily operations. AGGNNI-CG has been applied to a challenging, real-world dataset from the Paratransit system of Chatham County in Georgia. It produces dramatic improvements compared to the baseline column generation approach, which typically cannot produce feasible solutions in reasonable time on both medium-sized and large-scale complex instances. AGGNNI-CG also produces significant improvements in service compared to the existing system. △ Less

Submitted 8 January, 2024; originally announced January 2024.

arXiv:2312.16777 [pdf, other]

Multiphoton fusion of light nuclei in intense laser fields

Authors: Binbing Wu, Zhengfeng Fan, Difa Ye, Tao Ye, Congzhang Gao, Chengxin Yu, Xuefeng Xu, Cunbo Zhang, Jie Liu

Abstract: We investigate the fusion cross sections of light nuclei in the presence of linearly polarized intense laser fields. By combining the Coulomb-Volkov solutions with the complex spherical square-well nuclear potential, we derive an explicit formulation of the multiphoton cross section in a self-consistent manner. Our analysis is specifically focused on deuteron-triton (DT) and proton-boron (p… ▽ More We investigate the fusion cross sections of light nuclei in the presence of linearly polarized intense laser fields. By combining the Coulomb-Volkov solutions with the complex spherical square-well nuclear potential, we derive an explicit formulation of the multiphoton cross section in a self-consistent manner. Our analysis is specifically focused on deuteron-triton (DT) and proton-boron (p$^{11}{\rm B}$) fusion reactions, both of which have garnered widespread attention. The theoretical results reveal that, under conditions of longer laser wavelengths and lower incident particle kinetic energies, a few thousands of photons can participate in the fusion reactions, resulting in a substantial enhancement of fusion cross sections by almost ten orders of magnitude. We elucidate the multiphoton mechanism underlying these findings and discuss their implications. △ Less

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2312.13470 [pdf, ps, other]

Coffee: Cost-Effective Edge Caching for 360 Degree Live Video Streaming

Authors: Chen Li, Tingwei Ye, Tongyu Zong, Liyang Sun, Houwei Cao, Yong Liu

Abstract: While live 360 degree video streaming delivers immersive viewing experience, it poses significant bandwidth and latency challenges for content delivery networks. Edge servers are expected to play an important role in facilitating live streaming of 360 degree videos. In this paper, we propose a novel predictive edge caching algorithm (Coffee) for live 360 degree video that employ collaborative FoV… ▽ More While live 360 degree video streaming delivers immersive viewing experience, it poses significant bandwidth and latency challenges for content delivery networks. Edge servers are expected to play an important role in facilitating live streaming of 360 degree videos. In this paper, we propose a novel predictive edge caching algorithm (Coffee) for live 360 degree video that employ collaborative FoV prediction and predictive tile prefetching to reduce bandwidth consumption, streaming cost and improve the streaming quality and robustness. Our light-weight caching algorithms exploit the unique tile consumption patterns of live 360 degree video streaming to achieve high tile caching gains. Through extensive experiments driven by real 360 degree video streaming traces, we demonstrate that edge caching algorithms specifically designed for live 360 degree video streaming can achieve high streaming cost savings with small edge cache space consumption. Coffee, guided by viewer FoV predictions, significantly reduces back-haul traffic up to 76% compared to state-of-the-art edge caching algorithms. Furthermore, we develop a transcoding-aware variant (TransCoffee) and evaluate it using comprehensive experiments, which demonstrate that TransCoffee can achieve 63\% lower cost compared to state-of-the-art transcoding-aware approaches. △ Less

Submitted 27 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

arXiv:2312.08874 [pdf, other]

Agent Attention: On the Integration of Softmax and Linear Attention

Authors: Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Siyuan Pan, Pengfei Wan, Shiji Song, Gao Huang

Abstract: The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, the Agent Attention… ▽ More The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, the Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module. The agent tokens first act as the agent for the query tokens $Q$ to aggregate information from $K$ and $V$, and then broadcast the information back to $Q$. Given the number of agent tokens can be designed to be much smaller than the number of query tokens, the agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Therefore, agent attention seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention has shown remarkable performance in high-resolution scenarios, owning to its linear attention nature. For instance, when applied to Stable Diffusion, our agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code is available at https://github.com/LeapLabTHU/Agent-Attention. △ Less

Submitted 15 July, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: ECCV 2024

arXiv:2312.08606 [pdf, other]

VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook

Authors: Wenbin Zou, Hongxia Gao, Tian Ye, Liang Chen, Weipeng Yang, Shasha Huang, Hongsheng Chen, Sixiang Chen

Abstract: Night photography often struggles with challenges like low light and blurring, stemming from dark environments and prolonged exposures. Current methods either disregard priors and directly fitting end-to-end networks, leading to inconsistent illumination, or rely on unreliable handcrafted priors to constrain the network, thereby bringing the greater error to the final result. We believe in the str… ▽ More Night photography often struggles with challenges like low light and blurring, stemming from dark environments and prolonged exposures. Current methods either disregard priors and directly fitting end-to-end networks, leading to inconsistent illumination, or rely on unreliable handcrafted priors to constrain the network, thereby bringing the greater error to the final result. We believe in the strength of data-driven high-quality priors and strive to offer a reliable and consistent prior, circumventing the restrictions of manual priors. In this paper, we propose Clearer Night Image Restoration with Vector-Quantized Codebook (VQCNIR) to achieve remarkable and consistent restoration outcomes on real-world and synthetic benchmarks. To ensure the faithful restoration of details and illumination, we propose the incorporation of two essential modules: the Adaptive Illumination Enhancement Module (AIEM) and the Deformable Bi-directional Cross-Attention (DBCA) module. The AIEM leverages the inter-channel correlation of features to dynamically maintain illumination consistency between degraded features and high-quality codebook features. Meanwhile, the DBCA module effectively integrates texture and structural information through bi-directional cross-attention and deformable convolution, resulting in enhanced fine-grained detail and structural fidelity across parallel decoders. Extensive experiments validate the remarkable benefits of VQCNIR in enhancing image quality under low-light conditions, showcasing its state-of-the-art performance on both synthetic and real-world datasets. The code is available at https://github.com/AlexZou14/VQCNIR. △ Less

Submitted 16 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: This paper is accepted by AAAI2024

arXiv:2312.06940 [pdf, other]

doi 10.1109/HPEC58863.2023.10363455.

Benchmarking Deep Learning Classifiers for SAR Automatic Target Recognition

Authors: Jacob Fein-Ashley, Tian Ye, Rajgopal Kannan, Viktor Prasanna, Carl Busart

Abstract: Synthetic Aperture Radar SAR Automatic Target Recognition ATR is a key technique of remote-sensing image recognition which can be supported by deep neural networks The existing works of SAR ATR mostly focus on improving the accuracy of the target recognition while ignoring the systems performance in terms of speed and storage which is critical to real-world applications of SAR ATR For decision-mak… ▽ More Synthetic Aperture Radar SAR Automatic Target Recognition ATR is a key technique of remote-sensing image recognition which can be supported by deep neural networks The existing works of SAR ATR mostly focus on improving the accuracy of the target recognition while ignoring the systems performance in terms of speed and storage which is critical to real-world applications of SAR ATR For decision-makers aiming to identify a proper deep learning model to deploy in a SAR ATR system it is important to understand the performance of different candidate deep learning models and determine the best model accordingly This paper comprehensively benchmarks several advanced deep learning models for SAR ATR with multiple distinct SAR imagery datasets Specifically we train and test five SAR image classifiers based on Residual Neural Networks ResNet18 ResNet34 ResNet50 Graph Neural Network GNN and Vision Transformer for Small-Sized Datasets (SS-ViT) We select three datasets MSTAR GBSAR and SynthWakeSAR that offer heterogeneity We evaluate and compare the five classifiers concerning their classification accuracy runtime performance in terms of inference throughput and analytical performance in terms of number of parameters number of layers model size and number of operations Experimental results show that the GNN classifier outperforms with respect to throughput and latency However it is also shown that no clear model winner emerges from all of our chosen metrics and a one model rules all case is doubtful in the domain of SAR ATR △ Less

Submitted 11 December, 2023; originally announced December 2023.

Comments: 6 Pages

arXiv:2312.03775 [pdf, other]

FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability

Authors: Linze Li, Sunqi Fan, Hengjun Pu, Zhaodong Bing, Yao Tang, Tianzhu Ye, Tong Yang, Liangyu Chen, Jiajun Liang

Abstract: Over recent years, diffusion models have facilitated significant advancements in video generation. Yet, the creation of face-related videos still confronts issues such as low facial fidelity, lack of frame consistency, limited editability and uncontrollable human poses. To address these challenges, we introduce a facial animation generation method that enhances both face identity fidelity and edit… ▽ More Over recent years, diffusion models have facilitated significant advancements in video generation. Yet, the creation of face-related videos still confronts issues such as low facial fidelity, lack of frame consistency, limited editability and uncontrollable human poses. To address these challenges, we introduce a facial animation generation method that enhances both face identity fidelity and editing capabilities while ensuring frame consistency. This approach incorporates the concept of an anchor frame to counteract the degradation of generative ability in original text-to-image models when incorporating a motion module. We propose two strategies towards this objective: training-free and training-based anchor frame methods. Our method's efficacy has been validated on multiple representative DreamBooth and LoRA models, delivering substantial improvements over the original outcomes in terms of facial fidelity, text-to-image editability, and video motion. Moreover, we introduce conditional control using a 3D parametric face model to capture accurate facial movements and expressions. This solution augments the creative possibilities for facial animation generation through the integration of multiple control signals. For additional samples, please visit https://paper-faac.github.io/. △ Less

Submitted 20 December, 2023; v1 submitted 5 December, 2023; originally announced December 2023.

arXiv:2312.02912 [pdf, other]

Realistic Scatterer Based Adversarial Attacks on SAR Image Classifiers

Authors: Tian Ye, Rajgopal Kannan, Viktor Prasanna, Carl Busart, Lance Kaplan

Abstract: Adversarial attacks have highlighted the vulnerability of classifiers based on machine learning for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) tasks. An adversarial attack perturbs SAR images of on-ground targets such that the classifiers are misled into making incorrect predictions. However, many existing attacking techniques rely on arbitrary manipulation of SAR images whi… ▽ More Adversarial attacks have highlighted the vulnerability of classifiers based on machine learning for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) tasks. An adversarial attack perturbs SAR images of on-ground targets such that the classifiers are misled into making incorrect predictions. However, many existing attacking techniques rely on arbitrary manipulation of SAR images while overlooking the feasibility of executing the attacks on real-world SAR imagery. Instead, adversarial attacks should be able to be implemented by physical actions, for example, placing additional false objects as scatterers around the on-ground target to perturb the SAR image and fool the SAR ATR. In this paper, we propose the On-Target Scatterer Attack (OTSA), a scatterer-based physical adversarial attack. To ensure the feasibility of its physical execution, we enforce a constraint on the positioning of the scatterers. Specifically, we restrict the scatterers to be placed only on the target instead of in the shadow regions or the background. To achieve this, we introduce a positioning score based on Gaussian kernels and formulate an optimization problem for our OTSA attack. Using a gradient ascent method to solve the optimization problem, the OTSA can generate a vector of parameters describing the positions, shapes, sizes and amplitudes of the scatterers to guide the physical execution of the attack that will mislead SAR image classifiers. The experimental results show that our attack obtains significantly higher success rates under the positioning constraint compared with the existing method. △ Less

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2311.18173 [pdf]

Quantification of cardiac capillarization in single-immunostained myocardial slices using weakly supervised instance segmentation

Authors: Zhao Zhang, Xiwen Chen, William Richardson, Bruce Z. Gao, Abolfazl Razi, Tong Ye

Abstract: Decreased myocardial capillary density has been reported as an important histopathological feature associated with various heart disorders. Quantitative assessment of cardiac capillarization typically involves double immunostaining of cardiomyocytes (CMs) and capillaries in myocardial slices. In contrast, single immunostaining of basement membrane components is a straightforward approach to simult… ▽ More Decreased myocardial capillary density has been reported as an important histopathological feature associated with various heart disorders. Quantitative assessment of cardiac capillarization typically involves double immunostaining of cardiomyocytes (CMs) and capillaries in myocardial slices. In contrast, single immunostaining of basement membrane components is a straightforward approach to simultaneously label CMs and capillaries, presenting fewer challenges in background staining. However, subsequent image analysis always requires manual work in identifying and segmenting CMs and capillaries. Here, we developed an image analysis tool, AutoQC, to automatically identify and segment CMs and capillaries in immunofluorescence images of collagen type IV, a predominant basement membrane protein within the myocardium. In addition, commonly used capillarization-related measurements can be derived from segmentation masks. AutoQC features a weakly supervised instance segmentation algorithm by leveraging the power of a pre-trained segmentation model via prompt engineering. AutoQC outperformed YOLOv8-Seg, a state-of-the-art instance segmentation model, in both instance segmentation and capillarization assessment. Furthermore, the training of AutoQC required only a small dataset with bounding box annotations instead of pixel-wise annotations, leading to a reduced workload during network training. AutoQC provides an automated solution for quantifying cardiac capillarization in basement-membrane-immunostained myocardial slices, eliminating the need for manual image analysis once it is trained. △ Less

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.15209 [pdf, other]

See and Think: Embodied Agent in Virtual Environment

Authors: Zhonghan Zhao, Wenhao Chai, Xuan Wang, Li Boyi, Shengyu Hao, Shidong Cao, Tian Ye, Gaoang Wang

Abstract: Large language models (LLMs) have achieved impressive pro-gress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. STEVE comprises three key components: vision perception, language instruction, and code action. Vision perception involves interpre… ▽ More Large language models (LLMs) have achieved impressive pro-gress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. STEVE comprises three key components: vision perception, language instruction, and code action. Vision perception involves interpreting visual information in the environment, which is then integrated into the LLMs component with agent state and task instruction. Language instruction is responsible for iterative reasoning and decomposing complex tasks into manageable guidelines. Code action generates executable skill actions based on retrieval in skill database, enabling the agent to interact effectively within the Minecraft environment. We also collect STEVE-21K dataset, which includes 600+ vision-environment pairs, 20K knowledge question-answering pairs, and 200+ skill-code pairs. We conduct continuous block search, knowledge question and answering, and tech tree mastery to evaluate the performance. Extensive experiments show that STEVE achieves at most 1.5x faster unlocking key tech trees and 2.5x quicker in block search tasks. △ Less

Submitted 9 July, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

Comments: ECCV 2024. First three authors contribute equally to this work. Project Website https://rese1f.github.io/STEVE/

arXiv:2311.12358 [pdf, other]

Federated Learning via Consensus Mechanism on Heterogeneous Data: A New Perspective on Convergence

Authors: Shu Zheng, Tiandi Ye, Xiang Li, Ming Gao

Abstract: Federated learning (FL) on heterogeneous data (non-IID data) has recently received great attention. Most existing methods focus on studying the convergence guarantees for the global objective. While these methods can guarantee the decrease of the global objective in each communication round, they fail to ensure risk decrease for each client. In this paper, to address the problem,we propose FedCOME… ▽ More Federated learning (FL) on heterogeneous data (non-IID data) has recently received great attention. Most existing methods focus on studying the convergence guarantees for the global objective. While these methods can guarantee the decrease of the global objective in each communication round, they fail to ensure risk decrease for each client. In this paper, to address the problem,we propose FedCOME, which introduces a consensus mechanism to enforce decreased risk for each client after each training round. In particular, we allow a slight adjustment to a client's gradient on the server side, which generates an acute angle between the corrected gradient and the original ones of other clients. We theoretically show that the consensus mechanism can guarantee the convergence of the global objective. To generalize the consensus mechanism to the partial participation FL scenario, we devise a novel client sampling strategy to select the most representative clients for the global data distribution. Training on these selected clients with the consensus mechanism could empirically lead to risk decrease for clients that are not selected. Finally, we conduct extensive experiments on four benchmark datasets to show the superiority of FedCOME against other state-of-the-art methods in terms of effectiveness, efficiency and fairness. For reproducibility, we make our source code publicly available at: \url{https://github.com/fedcome/fedcome}. △ Less

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2311.11638 [pdf, other]

Reti-Diff: Illumination Degradation Image Restoration with Retinex-based Latent Diffusion Model

Authors: Chunming He, Chengyu Fang, Yulun Zhang, Tian Ye, Kai Li, Longxiang Tang, Zhenhua Guo, Xiu Li, Sina Farsiu

Abstract: Illumination degradation image restoration (IDIR) techniques aim to improve the visibility of degraded images and mitigate the adverse effects of deteriorated illumination. Among these algorithms, diffusion model (DM)-based methods have shown promising performance but are often burdened by heavy computational demands and pixel misalignment issues when predicting the image-level distribution. To ta… ▽ More Illumination degradation image restoration (IDIR) techniques aim to improve the visibility of degraded images and mitigate the adverse effects of deteriorated illumination. Among these algorithms, diffusion model (DM)-based methods have shown promising performance but are often burdened by heavy computational demands and pixel misalignment issues when predicting the image-level distribution. To tackle these problems, we propose to leverage DM within a compact latent space to generate concise guidance priors and introduce a novel solution called Reti-Diff for the IDIR task. Reti-Diff comprises two key components: the Retinex-based latent DM (RLDM) and the Retinex-guided transformer (RGformer). To ensure detailed reconstruction and illumination correction, RLDM is empowered to acquire Retinex knowledge and extract reflectance and illumination priors. These priors are subsequently utilized by RGformer to guide the decomposition of image features into their respective reflectance and illumination components. Following this, RGformer further enhances and consolidates the decomposed features, resulting in the production of refined images with consistent content and robustness to handle complex degradation scenarios. Extensive experiments show that Reti-Diff outperforms existing methods on three IDIR tasks, as well as downstream applications. Code will be available at \url{https://github.com/ChunmingHe/Reti-Diff}. △ Less

Submitted 9 March, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

Comments: 20 pages, 11 figures, 11 tables

Showing 1–50 of 248 results for author: Ye, T