Search | arXiv e-print repository

arXiv:2408.17380

Traffic expertise meets residual RL: Knowledge-informed model-based residual reinforcement learning for CAV trajectory control

Authors: Zihao Sheng, Zilin Huang, Sikai Chen

Abstract: Model-based reinforcement learning (RL) is anticipated to exhibit higher sample efficiency compared to model-free RL by utilizing a virtual environment model. However, it is challenging to obtain sufficiently accurate representations of the environmental dynamics due to uncertainties in complex systems and environments. An inaccurate environment model may degrade the sample efficiency and performance of model-based RL. Furthermore, while model-based RL can improve sample efficiency, it often still requires substantial training time to learn from scratch, potentially limiting its advantages over model-free approaches. To address these challenges, this paper introduces a knowledge-informed model-based residual reinforcement learning framework aimed at enhancing learning efficiency by infusing established expert knowledge into the learning process and avoiding the issue of beginning from zero. Our approach integrates traffic expert knowledge into a virtual environment model, employing the Intelligent Driver Model (IDM) for basic dynamics and neural networks for residual dynamics, thus ensuring adaptability to complex scenarios. We propose a novel strategy that combines traditional control methods with residual RL, facilitating efficient learning and policy optimization without the need to learn from scratch. The proposed approach is applied to CAV trajectory control tasks for the dissipation of stop-and-go waves in mixed traffic flow. Experimental results demonstrate that our proposed approach enables the CAV agent to achieve superior performance in trajectory control compared to the baseline agents in terms of sample efficiency, traffic flow smoothness and traffic mobility. The source code and supplementary materials are available at https://github.com/zihaosheng/traffic-expertise-RL/.

Submitted 30 August, 2024; originally announced August 2024.

arXiv:2408.17081

Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training

Authors: Zizheng Huang, Haoxing Chen, Jiaqi Li, Jun Lan, Huijia Zhu, Weiqiang Wang, Limin Wang

Abstract: Recent Vision Mamba models not only have much lower complexity for processing higher-resolution images and longer videos but also achieve competitive performance with Vision Transformers (ViTs). However, they are prone to overfitting and have thus only been presented up to base size (about 80M). It is still unclear how vanilla Vision Mamba (Vim) can be efficiently scaled up to larger sizes, which is essential for further exploitation. In this paper, we propose a stochastic layer-wise shuffle regularization, which enables successfully scaling non-hierarchical Vision Mamba to a large size (about 300M) in a supervised setting. Specifically, our base and large-scale ShuffleMamba models can outperform the supervised ViTs of similar size by 0.8% and 1.0% classification accuracy on ImageNet1k, respectively, without auxiliary data. When evaluated on the ADE20K semantic segmentation and COCO detection tasks, our ShuffleMamba models also show significant improvements. Without bells and whistles, the stochastic layer-wise shuffle has the following highlights: (1) Plug and play: it does not change model architectures and is omitted at inference. (2) Simple but effective: it can mitigate overfitting in Vim training and only introduces random token permutation operations. (3) Intuitive: the token sequences in deeper layers are more likely to be shuffled as they are expected to be more semantic and less sensitive to patch positions. Code and models will be available at https://github.com/huangzizheng01/ShuffleMamba.

Submitted 30 August, 2024; originally announced August 2024.

arXiv:2408.16398

Pair Counting without Binning -- A New Approach to Correlation Functions in Clustering Statistics

Authors: Shiyu Yue, Longlong Feng, Wenjie Ju, Jun Pan, Zhiqi Huang, Feng Fang, Zhuoyang Li, Yan-Chuan Cai, Weishan Zhu

Abstract: This paper presents a novel perspective on correlation functions in the clustering analysis of the large-scale structure of the universe. We first recognise that pair counting in bins of radial separation is equivalent to evaluating counts-in-cells (CIC), which can be modelled using a filtered density field with a binning-window function. This insight leads to an in situ expression for the two-point correlation function (2PCF). Essentially, the core idea underlying our method is to introduce a window function to define the binning scheme, enabling pair-counting without binning. This approach develops a concept of generalised 2PCF, which extends beyond conventional discrete pair counting by accommodating non-sharp-edged window functions. To extend this framework to N-point correlation functions (NPCF) using current optimal edge-corrected estimators, we developed a binning scheme independent of the specific parameterisation of polyhedral configurations. In particular, we demonstrate a fast algorithm for the three-point correlation function (3PCF), where triplet counting is accomplished by assigning either a spherical tophat or a Gaussian filter to each vertex of triangles. Additionally, we derive analytical expressions for the 3PCF using a multipole expansion in Legendre polynomials, accounting for filtered field (binning) corrections. Numerical tests using several suites of N-body simulation samples show that our approach aligns remarkably well with the theoretical predictions. Our method provides an exact solution for quantifying binning effects in practical measurements and offers a high-speed algorithm, enabling high-order clustering analysis in extremely large datasets from ongoing and upcoming surveys such as Euclid, LSST, and DESI.

Submitted 29 August, 2024; originally announced August 2024.

Comments: 17 pages, 12 figures, submitted to MNRAS

arXiv:2408.15529

Quasi-Lindblad pseudomode theory for open quantum systems

Authors: Gunhee Park, Zhen Huang, Yuanran Zhu, Chao Yang, Garnet Kin-Lic Chan, Lin Lin

Abstract: We introduce a new framework to study the dynamics of open quantum systems with linearly coupled Gaussian baths. Our approach replaces the continuous bath with an auxiliary discrete set of pseudomodes with dissipative dynamics, but we further relax the complete positivity requirement in the Lindblad master equation and formulate a quasi-Lindblad pseudomode theory. We show that this quasi-Lindblad pseudomode formulation directly leads to a representation of the bath correlation function in terms of a complex weighted sum of complex exponentials, an expansion that is known to be rapidly convergent in practice and thus leads to a compact set of pseudomodes. The pseudomode representation is not unique and can differ by a gauge choice. When the global dynamics can be simulated exactly, the system dynamics is unique and independent of the specific pseudomode representation. However, the gauge choice may affect the stability of the global dynamics, and we provide an analysis of why and when the global dynamics can retain stability despite losing positivity. We showcase the performance of this formulation across various spectral densities in both bosonic and fermionic problems, finding significant improvements over conventional pseudomode formulations.

Submitted 28 August, 2024; originally announced August 2024.

Comments: 13 pages, 6 figures (main text); 8 pages, 1 figure (Supplementary Material)

arXiv:2408.14881

MEET-U project II: Curvature perturbations from kinetic preheating after $α$-attractor inflation

Authors: Zhiqi Huang, Xichang Ouyang, Yu Cui, Jianqi Liu, Yanhong Yao, Zehong Qiu, Guangyao Yu, Lu Huang, Zhuoyang Li, Chi-Fong Wong

Abstract: Preheating at the end of inflation is a violent nonlinear process that efficiently transfers the energy of the inflaton to a second field, the preheat field. When the preheat field is light during inflation and its background value modulates the preheating process, the superhorizon isocurvature perturbations of the preheat field may be converted to curvature perturbations that leave an imprint on the cosmic microwave background and the large-scale structure of the universe. We use high-precision lattice simulations to study kinetic preheating after $α$-attractor inflation, a case where the effective mass of the preheat field is naturally suppressed during inflation. By comparing the expansion e-folds between different Hubble patches, we find that the conversion from isocurvature perturbations to curvature perturbations is very inefficient and can hardly be detected by cosmological observations.

Submitted 27 August, 2024; originally announced August 2024.

Report number: SYSU-SPA-2024; MSC Class: 83F05; ACM Class: J.2

arXiv:2408.14765

CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

Authors: Weijia Li, Jun He, Junyan Ye, Huaping Zhong, Zhimeng Zheng, Zilong Huang, Dahua Lin, Conghui He

Abstract: Satellite-to-street view synthesis aims at generating a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibited remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design the satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates the above controls via an enhanced cross-view attention module. To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms the current state of the art on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes. The code and models of this work will be released at https://opendatalab.github.io/CrossViewDiff/.

Submitted 26 August, 2024; originally announced August 2024.

Comments: 21 pages, 11 figures

arXiv:2408.14354

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

Authors: Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang

Abstract: GitHub issue resolving is a critical task in software engineering, recently gaining significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate the issue-resolving capabilities of large language models (LLMs), but it has so far focused only on Python. However, supporting more programming languages is also important, as there is a strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it. As is well known, developing a high-quality multilingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.

Submitted 26 August, 2024; originally announced August 2024.

Comments: This work is in progress

arXiv:2408.14282

All-microwave readout, spectroscopy, and dynamic polarization of individual nuclear spins in a crystal

Authors: J. Travesedo, J. O'Sullivan, L. Pallegoix, Z. W. Huang, P. Hogan, P. Goldner, T. Chaneliere, S. Bertaina, D. Esteve, P. Abgrall, D. Vion, E. Flurin, P. Bertet

Abstract: Pushing the sensitivity of nuclear magnetic resonance spectroscopy to the single spin level would have a major impact in chemistry and biology and is the goal of intense research efforts. Individual nuclear spins have been detected via their hyperfine coupling to an individual electronic paramagnetic system, itself measured by optical or electrical means. These methods are however only applicable when suitable optical transitions or electron-spin-to-charge conversion mechanisms exist, and a more universal method is currently lacking. Here, we report spectroscopic measurements of individual $^{183}\mathrm{W}$ nuclear spins in a CaWO$_4$ crystal via their hyperfine interaction with a neighboring $\mathrm{Er}^{3+}$ ion detected by microwave photon counting at millikelvin temperatures. We observe real-time quantum jumps of the nuclear spin state, a proof of their individual nature. We perform single-spin ELDOR-detected NMR spectroscopy by microwave driving the zero- and double-quantum transitions of the $^{183}$W--Er$^{3+}$ coupled system. By repeated driving of these transitions, we also achieve single-spin solid-effect dynamical nuclear polarization. Relying exclusively on microwave driving and microwave detection, the methods reported here apply in principle to any nuclear spin coupled to a paramagnetic impurity, and therefore open the way to single-nuclear-spin spectroscopy in a large class of samples.

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.13890

Making Large Language Models Better Planners with Reasoning-Decision Alignment

Authors: Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, Xiaodan Liang

Abstract: Data-driven approaches for autonomous driving (AD) have been widely adopted in the past decade but are confronted with dataset bias and uninterpretability. Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios. They find that the pretrain-finetune paradigm of LLMs on downstream data with the Chain-of-Thought (CoT) reasoning process can enhance explainability and scene understanding. However, this popular strategy suffers from the notorious problem of misalignment between the crafted CoTs and the consequent decision-making, which remains untouched by previous LLM-based AD methods. To address this problem, we motivate an end-to-end decision-making model based on a multimodality-augmented LLM, which simultaneously executes CoT reasoning and carries out planning results. Furthermore, we propose a reasoning-decision alignment constraint between the paired CoTs and planning results, imposing the correspondence between reasoning and decision-making. Moreover, we redesign the CoTs to enable the model to comprehend complex scenarios and enhance decision-making performance. We dub our proposed large language planners with reasoning-decision alignment as RDA-Driver. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate the effectiveness of our RDA-Driver in enhancing the performance of end-to-end AD systems. Specifically, our RDA-Driver achieves state-of-the-art planning performance on the nuScenes dataset with 0.80 L2 error and 0.32 collision rate, and also achieves leading results on the challenging DriveLM-nuScenes benchmark with 0.82 L2 error and 0.38 collision rate.

Submitted 25 August, 2024; originally announced August 2024.

arXiv:2408.13671

Ultrafast Charge Transfer Dynamics at the MoS$_2$/Au Interface Observed via Optical Spectroscopy under Ambient Conditions

Authors: Tao Yang, Zhipeng Huang, Stephan Sleziona, Eckart Hasselbrink, Peter Kratzer, Marika Schleberger, R. Kramer Campen, Yujin Tong

Abstract: To take advantage of the exceptional properties of atomically thin transition metal dichalcogenides (TMDC) for advanced devices and catalysts, integration with metallic surfaces is an efficacious approach for facilitating charge carrier injection and extraction from TMDC monolayers. Light-matter interactions predominantly occur at the K point in TMDC monolayers, making the charge carrier dynamics at this point essential for their optimal performance. However, direct access to and comprehensive understanding of the charge carrier dynamics at the K point of a TMDC monolayer on a metal substrate remain challenging. In this study, we employed azimuth- and polarization-dependent final-state sum frequency generation (FS-SFG) spectroscopy to investigate the ultrafast dynamics of charge transfer at the K point of a MoS$_2$ monolayer interfaced with an Au substrate. We observed an ultrafast injection (sub-20 fs) of photoexcited hot electrons from the Au substrate to the conduction band minimum (CBM) of the MoS$_2$ monolayer. Subsequently, driven by an internal electric field induced by charge redistribution, injected hot electrons in MoS$_2$ experience a relaxation and fast return ($\sim2$ ps) from the CBM and a trap state mediated slow return ($\sim60$ ps) process. The direct optical observation of the full electron dynamics at the K point of a MoS$_2$ monolayer under ambient conditions provides valuable insights into the mechanisms of charge carrier transfer across the TMDC-metal interface, informing the design of advanced TMDC-based devices with enhanced charge transfer rates.

Submitted 24 August, 2024; originally announced August 2024.

Comments: 16 pages, 3 figures and supplemental material

arXiv:2408.13385

MICM: Rethinking Unsupervised Pretraining for Enhanced Few-shot Learning

Authors: Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li, Ruixuan Li

Abstract: Humans exhibit a remarkable ability to learn quickly from a limited number of labeled samples, a capability that starkly contrasts with that of current machine learning systems. Unsupervised Few-Shot Learning (U-FSL) seeks to bridge this divide by reducing reliance on annotated datasets during initial training phases. In this work, we first quantitatively assess the impacts of Masked Image Modeling (MIM) and Contrastive Learning (CL) on few-shot learning tasks. Our findings highlight the respective limitations of MIM and CL in terms of discriminative and generalization abilities, which contribute to their underperformance in U-FSL contexts. To address these trade-offs between generalization and discriminability in unsupervised pretraining, we introduce a novel paradigm named Masked Image Contrastive Modeling (MICM). MICM creatively combines the targeted object learning strength of CL with the generalized visual feature learning capability of MIM, significantly enhancing its efficacy in downstream few-shot learning inference. Extensive experimental analyses confirm the advantages of MICM, demonstrating significant improvements in both generalization and discrimination capabilities for few-shot learning. Our comprehensive quantitative evaluations further substantiate the superiority of MICM, showing that our two-stage U-FSL framework based on MICM markedly outperforms existing leading baselines.

Submitted 23 August, 2024; originally announced August 2024.

Comments: ACMMM 2024 (Oral)

arXiv:2408.13008

Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models

Authors: Adnan Haider, Xingyu Na, Erik McDermott, Tim Ng, Zhen Huang, Xiaodan Zhuang

Abstract: This paper introduces a novel training framework called Focused Discriminative Training (FDT) to further improve streaming word-piece end-to-end (E2E) automatic speech recognition (ASR) models trained using either CTC or an interpolation of CTC and attention-based encoder-decoder (AED) loss. The proposed approach presents a novel framework to identify and improve a model's recognition on challenging segments of audio. Notably, this training framework is independent of hidden Markov models (HMMs) and lattices, eliminating the need for substantial decision-making regarding HMM topology, lexicon, and graph generation, as typically required in standard discriminative training approaches. Compared to additional fine-tuning with MMI or MWER loss on the encoder, FDT is shown to be more effective in achieving greater reductions in Word Error Rate (WER) on streaming models trained on LibriSpeech. Additionally, this method is shown to be effective in further improving a converged word-piece streaming E2E model trained on 600k hours of assistant and dictation data.

Submitted 23 August, 2024; originally announced August 2024.

Comments: UK Speech 2024, Submitted to SLT 2024

arXiv:2408.12821

Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery

Authors: Zhenyuan Yang, Xuhui Lin, Qinyi He, Ziye Huang, Zhengliang Liu, Hanqi Jiang, Peng Shu, Zihao Wu, Yiwei Li, Stephen Law, Gengchen Mai, Tianming Liu, Tao Yang

Abstract: The emergence of Large Language Models (LLMs) and multimodal foundation models (FMs) has generated heightened interest in their applications that integrate vision and language. This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior by evaluating their performance across various tasks. The assessments include street furniture identification, pedestrian and car counts, and road width measurement in Street View Imagery; building function classification, building age analysis, building height analysis, and building structure classification in the Built Environment; and interior room classification, interior design style analysis, interior furniture counts, and interior length measurement in Interior. The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding, but highlight limitations in detailed recognition and counting tasks. While zero-shot learning shows potential, performance varies depending on the problem domains and image complexities. This study provides new insights into the strengths and weaknesses of multimodal foundation models for practical challenges in Street View Imagery, Built Environment, and Interior. Overall, the findings demonstrate foundational multimodal intelligence, emphasizing the potential of FMs to drive forward interdisciplinary applications at the intersection of computer vision and language.