Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training

Authors: Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, Meng Jiang

Abstract: Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning with intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual comprehension due to inadequate visual-centric supervision, which leads to inaccurate interpretation of math figures. To address this issue, we propose a two-step training pipeline VCAR, which emphasizes the Visual Comprehension training in Addition to mathematical Reasoning learning. It first improves the visual comprehension ability of MLLMs through the visual description generation task, followed by another training step on generating rationales with the assistance of descriptions. Experimental results on two popular benchmarks demonstrate that VCAR substantially outperforms baseline methods solely relying on rationale supervision, especially on problems with high visual demands.

Submitted 25 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.14405 [pdf, other]

Learning H-Infinity Locomotion Control

Authors: Junfeng Long, Wenye Yu, Quanyi Li, Zirui Wang, Dahua Lin, Jiangmiao Pang

Abstract: Stable locomotion in precipitous environments is an essential task for quadruped robots, requiring the ability to resist various external disturbances. Recent neural policies enhance robustness against disturbances by learning to resist external forces sampled from a fixed distribution in the simulated environment. However, the force generation process doesn't consider the robot's current state, making it difficult to identify the most effective direction and magnitude that can push the robot to the most unstable but recoverable state. Thus, challenging cases in the buffer are insufficient to optimize robustness. In this paper, we propose to model the robust locomotion learning process as an adversarial interaction between the locomotion policy and a learnable disturbance that is conditioned on the robot state to generate appropriate external forces. To make the joint optimization stable, our novel $H_{\infty}$ constraint mandates the bound of the ratio between the cost and the intensity of the external forces. We verify the robustness of our approach in both simulated environments and real-world deployment, on quadrupedal locomotion tasks and a more challenging task where the quadruped performs locomotion merely on hind legs. Training and deployment code will be made public.

Submitted 12 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: Project Page: https://junfeng-long.github.io/HINF/

arXiv:2404.14129 [pdf, other]

doi 10.3847/1538-4357/ad4d86

Discovery of a long thermonuclear X-ray burst from the ultra-compact binary 4U 1850$-$087

Authors: Yongqi Lu, Zhaosheng Li, Wenhui Yu, Yuanyue Pan, Maurizio Falanga

Abstract: We report the detection of a long X-ray burst triggered on MJD 60171.65 from the ultra-compact binary 4U 1850$-$087 by the Monitor of All-sky X-ray Image and Neutron Star Interior Composition Explorer (NICER). We analyse the NICER data observed in between MJD 60095.19$-$60177.43, including one observation covered part of the long X-ray burst tail, i.e., $0.15-3.8$ hr after the trigger. The persistent spectra are quite similar and well described by a combination of multi-color disk blackbody, with the inner temperature of 0.5 keV, and a thermally comptonized continuum with the asymptotic power-law photon index of $Γ\sim2.2$, and electron temperature of $kT_{\rm e}\sim20-30$ keV. The persistent fluxes were around $3.8\times10^{-10}~{\rm erg~cm^{-2}~s^{-1}}$, corresponding to a local accretion rate of $1\%~\dot{m}_{\rm Edd}$. Part of time-resolved burst spectra show a clear deviation from the blackbody model, which can be improved by considering the enhanced persistent emission due to the Poynting-Robertson drag, or the reflected disk emission illuminated by the burst. From the burst flux during the cooling tail, we estimate the burst duration, $τ\approx 0.78$ hr, the burst fluence, $E_\mathrm{b} \approx 4.1 \times 10^{41}$ ergs, and the ignition column depth, $y_{\rm ign}\approx 3.5\times10^{10}~{\rm g~cm^{-2}}$. We propose that the long X-ray burst is powered by unstable burning of pure helium in deep layer. Moreover, we identify significant 1 keV emission lines in the burst spectra, which may originate from the surrounding disk.

Submitted 30 July, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 9 pages, 6 figures, to match the published version in ApJ

Journal ref: ApJ 969, 15 (2024)

arXiv:2404.13848 [pdf, other]

DSDRNet: Disentangling Representation and Reconstruct Network for Domain Generalization

Authors: Juncheng Yang, Zuchao Li, Shuai Xie, Wei Yu, Shijun Li

Abstract: Domain generalization faces challenges due to the distribution shift between training and testing sets, and the presence of unseen target domains. Common solutions include domain alignment, meta-learning, data augmentation, or ensemble learning, all of which rely on domain labels or domain adversarial techniques. In this paper, we propose a Dual-Stream Separation and Reconstruction Network, dubbed DSDRNet. It is a disentanglement-reconstruction approach that integrates features of both inter-instance and intra-instance through dual-stream fusion. The method introduces novel supervised signals by combining inter-instance semantic distance and intra-instance similarity. Incorporating Adaptive Instance Normalization (AdaIN) into a two-stage cyclic reconstruction process enhances self-disentangled reconstruction signals to facilitate model convergence. Extensive experiments on four benchmark datasets demonstrate that DSDRNet outperforms other popular methods in terms of domain generalization capabilities.

Submitted 21 April, 2024; originally announced April 2024.

Comments: This paper is accepted to IJCNN 2024

arXiv:2404.13640 [pdf, other]

Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer

Authors: Kepeng Xu, Li Xu, Gang He, Wenxin Yu, Yunsong Li

Abstract: Multiple complex degradations are coupled in low-quality video faces in the real world. Therefore, blind video face restoration is a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Restoring each frame independently in a naive manner inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) without pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent artifact-free results. Specifically, we pre-train a temporal-spatial vector quantized auto-encoder on high-quality video face datasets to extract expressive context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released on https://github.com/kepengxu/PGTFormer.

Submitted 21 April, 2024; originally announced April 2024.

Comments: 9 pages

arXiv:2404.13392 [pdf, other]

Beamforming Design for Integrated Sensing and Communications Using Uplink-Downlink Duality

Authors: Kareem M. Attiah, Wei Yu

Abstract: This paper presents a novel optimization framework for beamforming design in integrated sensing and communication systems where a base station seeks to minimize the Bayesian Cramér-Rao bound of a sensing problem while satisfying quality of service constraints for the communication users. Prior approaches formulate the design problem as a semidefinite program for which acquiring a beamforming solution is computationally expensive. In this work, we show that the computational burden can be considerably alleviated. To achieve this, we transform the design problem to a tractable form that not only provides a new understanding of Cramér-Rao bound optimization, but also allows for an uplink-downlink duality relation to be developed. Such a duality result gives rise to an efficient algorithm that enables the beamforming design problem to be solved at a much lower complexity as compared to the-state-of-the-art methods.

Submitted 20 April, 2024; originally announced April 2024.

Comments: 6 pages, 2 figures, accepted at ISIT2024

arXiv:2404.12879 [pdf, other]

Unlocking Multi-View Insights in Knowledge-Dense Retrieval-Augmented Generation

Authors: Guanhua Chen, Wenhan Yu, Lei Sha

Abstract: While Retrieval-Augmented Generation (RAG) plays a crucial role in the application of Large Language Models (LLMs), existing retrieval methods in knowledge-dense domains like law and medicine still suffer from a lack of multi-perspective views, which are essential for improving interpretability and reliability. Previous research on multi-view retrieval often focused solely on different semantic forms of queries, neglecting the expression of specific domain knowledge perspectives. This paper introduces a novel multi-view RAG framework, MVRAG, tailored for knowledge-dense domains that utilizes intention-aware query rewriting from multiple domain viewpoints to enhance retrieval precision, thereby improving the effectiveness of the final inference. Experiments conducted on legal and medical case retrieval demonstrate significant improvements in recall and precision rates with our framework. Our multi-perspective retrieval approach unleashes the potential of multi-view information enhancing RAG tasks, accelerating the further application of LLMs in knowledge-intensive fields.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.12588 [pdf, other]

Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

Authors: Juncheng Yang, Zuchao Li, Shuai Xie, Weiping Zhu, Wei Yu, Shijun Li

Abstract: Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource limitations. While some methods overcome the need for training by leveraging image modality cache and retrieval, they overlook the text modality's importance and cross-modal cues for the efficient adaptation of parameters in visual-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both text and image modalities. It then leverages retrieval through visual-language bimodal information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling different modal similarities to assess their respective contributions. Additionally, it explores hard samples based on differences in cross-modal affinity and enhances model performance through adaptive adjustment of sample learning intensity. Extensive experimental results on benchmark datasets demonstrate that XMAdapter outperforms previous adapter-based methods significantly regarding accuracy, generalization, and efficiency.

Submitted 18 April, 2024; originally announced April 2024.

Comments: This paper is accepted to ICME 2024

arXiv:2404.12081 [pdf, other]

doi 10.1109/TGRS.2024.3424300

MaskCD: A Remote Sensing Change Detection Network Based on Mask Classification

Authors: Weikang Yu, Xiaokang Zhang, Samiran Das, Xiao Xiang Zhu, Pedram Ghamisi

Abstract: Change detection (CD) from remote sensing (RS) images using deep learning has been widely investigated in the literature. It is typically regarded as a pixel-wise labeling task that aims to classify each pixel as changed or unchanged. Although per-pixel classification networks in encoder-decoder structures have shown dominance, they still suffer from imprecise boundaries and incomplete object delineation at various scenes. For high-resolution RS images, partly or totally changed objects are more worthy of attention rather than a single pixel. Therefore, we revisit the CD task from the mask prediction and classification perspective and propose MaskCD to detect changed areas by adaptively generating categorized masks from input image pairs. Specifically, it utilizes a cross-level change representation perceiver (CLCRP) to learn multiscale change-aware representations and capture spatiotemporal relations from encoded features by exploiting deformable multihead self-attention (DeformMHSA). Subsequently, a masked-attention-based detection transformers (MA-DETR) decoder is developed to accurately locate and identify changed objects based on masked attention and self-attention mechanisms. It reconstructs the desired changed objects by decoding the pixel-wise representations into learnable mask proposals and making final predictions from these candidates. Experimental results on five benchmark datasets demonstrate the proposed approach outperforms other state-of-the-art models. Codes and pretrained models are available online (https://github.com/EricYu97/MaskCD).

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.11979 [pdf, other]

MTGA: Multi-view Temporal Granularity aligned Aggregation for Event-based Lip-reading

Authors: Wenhao Zhang, Jun Wang, Yong Luo, Lei Yu, Wei Yu, Zheng He

Abstract: Lip-reading is to utilize the visual information of the speaker's lip movements to recognize words and sentences. Existing event-based lip-reading solutions integrate different frame rate branches to learn spatio-temporal features of varying granularities. However, aggregating events into event frames inevitably leads to the loss of fine-grained temporal information within frames. To remedy this drawback, we propose a novel framework termed Multi-view Temporal Granularity aligned Aggregation (MTGA). Specifically, we first present a novel event representation method, namely time-segmented voxel graph list, where the most significant local voxels are temporally connected into a graph list. Then we design a spatio-temporal fusion module based on temporal granularity alignment, where the global spatial features extracted from event frames, together with the local relative spatial and temporal features contained in voxel graph list are effectively aligned and integrated. Finally, we design a temporal aggregation module that incorporates positional encoding, which enables the capture of local absolute spatial and global temporal information. Experiments demonstrate that our method outperforms both the event-based and video-based lip-reading counterparts. Our code will be publicly available.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.11249 [pdf, other]

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

Authors: Wenbo Zhang, Yifan Zhang, Jianfeng Lin, Binqiang Huang, Jinlu Zhang, Wenhao Yu

Abstract: Pre-trained vision-language (V-L) models such as CLIP have shown excellent performance in many downstream cross-modal tasks. However, most of them are only applicable to the English context. Subsequent research has focused on this problem and proposed improved models, such as CN-CLIP and AltCLIP, to facilitate their applicability to Chinese and even other languages. Nevertheless, these models suffer from high latency and a large memory footprint in inference, which limits their further deployment on resource-constrained edge devices. In this work, we propose a conceptually simple yet effective multilingual CLIP Compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English context. In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual vision-language feature distillation and alignment. During the first stage, lightweight image/text student models are designed to learn robust visual/multilingual textual feature representation ability from corresponding teacher models, respectively. Subsequently, the multilingual vision-language alignment stage enables effective alignment of visual and multilingual textual features to further improve the model's multilingual performance. Comprehensive experiments in zero-shot image classification, conducted based on the ELEVATER benchmark, showcase that DC-CLIP achieves superior performance in the English context and competitive performance in the Chinese context, even with less training data, when compared to existing models of similar parameter magnitude. The evaluation demonstrates the effectiveness of our designed training mechanism.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.10948 [pdf, other]

First double-differential cross section measurement of neutral-current $π^0$ production in neutrino-argon scattering in the MicroBooNE detector

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, A. Barnard, G. Barr, D. Barrow, J. Barrow, V. Basque, J. Bateman, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book, et al. (166 additional authors not shown)

Abstract: We report the first double-differential cross section measurement of neutral-current neutral pion (NC$π^0$) production in neutrino-argon scattering, as well as single-differential measurements of the same channel in terms of final states with and without protons. The kinematic variables of interest for these measurements are the $π^0$ momentum and the $π^0$ scattering angle with respect to the neutrino beam. A total of 4971 candidate NC$π^0$ events fully-contained within the MicroBooNE detector are selected using data collected at a mean neutrino energy of $\sim 0.8$ GeV from $6.4\times10^{20}$ protons on target from the Booster Neutrino Beam at the Fermi National Accelerator Laboratory. After extensive data-driven model validation to ensure unbiased unfolding, the Wiener-SVD method is used to extract nominal flux-averaged cross sections. The results are compared to predictions from commonly used neutrino event generators, which tend to overpredict the measured NC$π^0$ cross section, especially in the 0.2-0.5 GeV/c $π^0$ momentum range, at forward scattering angles, and when at least one proton is present in the final state. These measurements show sensitivity to a variety of features that complicate the description of NC$π^0$ production including the form factors describing the elementary neutrino interaction and the final state interactions of the outgoing particles in the residual argon nucleus. This data will help improve the modeling of NC$π^0$ production, which represents a major background in measurements of charge-parity violation in the neutrino sector and in searches for new physics beyond the Standard Model.

Submitted 16 April, 2024; originally announced April 2024.

Report number: FERMILAB-PUB-24-0125

arXiv:2404.09949 [pdf, other]

Measurement of the differential cross section for neutral pion production in charged-current muon neutrino interactions on argon with the MicroBooNE detector

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, G. Barr, D. Barrow, J. Barrow, V. Basque, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book, M. B. Brunetti, L. Camilleri, et al. (163 additional authors not shown)

Abstract: We present a measurement of neutral pion production in charged-current interactions using data recorded with the MicroBooNE detector exposed to Fermilab's booster neutrino beam. The signal comprises one muon, one neutral pion, any number of nucleons, and no charged pions. Studying neutral pion production in the MicroBooNE detector provides an opportunity to better understand neutrino-argon interactions, and is crucial for future accelerator-based neutrino oscillation experiments. Using a dataset corresponding to $6.86 \times 10^{20}$ protons on target, we present single-differential cross sections in muon and neutral pion momenta, scattering angles with respect to the beam for the outgoing muon and neutral pion, as well as the opening angle between the muon and neutral pion. Data extracted cross sections are compared to generator predictions. We report good agreement between the data and the models for scattering angles, except for an over-prediction by generators at muon forward angles. Similarly, the agreement between data and the models as a function of momentum is good, except for an underprediction by generators in the medium momentum ranges, $200-400$ MeV for muons and $100-200$ MeV for pions.

Submitted 6 May, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Report number: FERMILAB-PUB-24-0142-CSAID-PPD

arXiv:2404.09276 [pdf, other]

Algorithm xxx: Faster Randomized SVD with Dynamic Shifts

Authors: Xu Feng, Wenjian Yu, Yuyang Xie, Jie Tang

Abstract: Aiming to provide a faster and convenient truncated SVD algorithm for large sparse matrices from real applications (i.e. for computing a few of largest singular values and the corresponding singular vectors), a dynamically shifted power iteration technique is applied to improve the accuracy of the randomized SVD method. This results in a dynamic shifts based randomized SVD (dashSVD) algorithm, which also collaborates with the skills for handling sparse matrices. An accuracy-control mechanism is included in the dashSVD algorithm to approximately monitor the per vector error bound of computed singular vectors with negligible overhead. Experiments on real-world data validate that the dashSVD algorithm largely improves the accuracy of randomized SVD algorithm or attains same accuracy with fewer passes over the matrix, and provides an efficient accuracy-control mechanism to the randomized SVD computation, while demonstrating the advantages on runtime and parallel efficiency. A bound of the approximation error of the randomized SVD with the shifted power iteration is also proved.

Submitted 14 April, 2024; originally announced April 2024.

Comments: 26 pages, accepted by ACM Transactions on Mathematical Software

arXiv:2404.08675 [pdf, other]

RecGPT: Generative Personalized Prompts for Sequential Recommendation via ChatGPT Training Paradigm

Authors: Yabin Zhang, Wenhui Yu, Erhan Zhang, Xu Chen, Lantao Hu, Peng Jiang, Kun Gai

Abstract: ChatGPT has achieved remarkable success in natural language understanding. Considering that recommendation is indeed a conversation between users and the system with items as words, which has similar underlying pattern with ChatGPT, we design a new chat framework in item index level for the recommendation task. Our novelty mainly contains three parts: model, training and inference. For the model part, we adopt Generative Pre-training Transformer (GPT) as the sequential recommendation model and design a user modular to capture personalized information. For the training part, we adopt the two-stage paradigm of ChatGPT, including pre-training and fine-tuning. In the pre-training stage, we train GPT model by auto-regression. In the fine-tuning stage, we train the model with prompts, which include both the newly-generated results from the model and the user's feedback. For the inference part, we predict several user interests as user representations in an autoregressive manner. For each interest vector, we recall several items with the highest similarity and merge the items recalled by all interest vectors into the final result. We conduct experiments with both offline public datasets and online A/B test to demonstrate the effectiveness of our proposed method.

Submitted 6 April, 2024; originally announced April 2024.
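
Illustrative note on the RecGPT entry above: the inference step in the abstract (recall the most similar items for each predicted interest vector, then merge the recalled sets) can be sketched as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the dot-product similarity, the first-seen de-duplicating merge, and the names `recall_and_merge`, `item_embeddings`, and `per_interest_k` are all hypothetical, since the abstract specifies neither the similarity measure nor the merge rule.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity measure)."""
    return sum(a * b for a, b in zip(u, v))


def recall_and_merge(interest_vectors, item_embeddings, per_interest_k):
    """Recall the top-k items per interest vector and merge them.

    interest_vectors: list of vectors, one per predicted user interest.
    item_embeddings:  dict mapping item id -> embedding vector.
    per_interest_k:   number of items recalled for each interest.

    The merge keeps first-seen order and drops duplicates; this rule is an
    assumption made for illustration only.
    """
    merged = []
    seen = set()
    for interest in interest_vectors:
        # Pick the k items whose embeddings score highest against this interest.
        top = heapq.nlargest(
            per_interest_k,
            item_embeddings.items(),
            key=lambda pair: dot(interest, pair[1]),
        )
        for item_id, _ in top:
            if item_id not in seen:
                seen.add(item_id)
                merged.append(item_id)
    return merged


# Toy data: five items in a 2-D embedding space, two interests.
items = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
    "e": [0.5, 0.5],
}
print(recall_and_merge([[1.0, 0.0], [0.0, 1.0]], items, per_interest_k=2))
```

With the toy data, each interest recalls two distinct items, so the merged list has four entries; when two interests recall the same item, it appears only once.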

arXiv:2404.07545 [pdf, other]

Stereo-LiDAR Depth Estimation with Deformable Propagation and Learned Disparity-Depth Conversion

Authors: Ang Li, Anning Hu, Wei Xi, Wenxian Yu, Danping Zou

Abstract: Accurate and dense depth estimation with stereo cameras and LiDAR is an important task for automatic driving and robotic perception. While sparse hints from LiDAR points have improved cost aggregation in stereo matching, their effectiveness is limited by the low density and non-uniform distribution. To address this issue, we propose a novel stereo-LiDAR depth estimation network with Semi-Dense hint Guidance, named SDG-Depth. Our network includes a deformable propagation module for generating a semi-dense hint map and a confidence map by propagating sparse hints using a learned deformable window. These maps then guide cost aggregation in stereo matching. To reduce the triangulation error in depth recovery from disparity, especially in distant regions, we introduce a disparity-depth conversion module. Our method is both accurate and efficient. The experimental results on benchmark tests show its superior performance. Our code is available at https://github.com/SJTU-ViSYS/SDG-Depth.

Submitted 11 April, 2024; originally announced April 2024.

Comments: Accepted in ICRA 2024. 8 pages, 6 figures

arXiv:2404.07490 [pdf, other]

Low-energy spin dynamics in a Kitaev material Na3Ni2BiO6 investigated by NMR

Authors: Xinyu Shi, Yi Cui, Yanyan Shangguan, Xiaoyu Xu, Zhanlong Wu, Ze Hu, Shuo Li, Kefan Du, Ying Chen, Long Ma, Zhengxin Liu, Jinsheng Wen, Jinshan Zhang, Weiqiang Yu

Abstract: We performed 23Na NMR and magnetization measurements on an S = 1, quasi-2D honeycomb lattice antiferromagnet Na3Ni2BiO6. A large positive Curie-Weiss constant of 22.9 K is observed. The NMR spectra at low fields are consistent with a "zigzag" magnetic order, indicating a large easy-axis anisotropy. With field applied along the c* axis, the NMR spectra confirm the existence of a 1/3-magnetization plateau phase between 5.1 T and 7.1 T. The transition from the zigzag order to the 1/3-magnetization plateau phase is also found to be a first-order type. A monotonic decrease of the spin gap is revealed in the 1/3-magnetization plateau phase, which reaches zero at a quantum critical field Hc = 8.35 T before entering the fully polarized phase. These data suggest the existence of exchange frustration in the system along with strong ferromagnetic interactions, hosting the possibility for Kitaev physics. Besides, well below the ordered phase, the 1/T1 at high fields shows either a level off or an enhancement upon cooling below 3 K, which suggests the existence of low-energy fluctuations.

Submitted 11 April, 2024; originally announced April 2024.

Comments: 7 pages, 7 figures

arXiv:2404.06037 [pdf, other]

A Survey of Distributed Graph Algorithms on Massive Graphs

Authors: Lingkai Meng, Yu Shao, Long Yuan, Longbin Lai, Peng Cheng, Xue Li, Wenyuan Yu, Wenjie Zhang, Xuemin Lin, Jingren Zhou

Abstract: Distributed processing of large-scale graph data has many practical applications and has been widely studied. In recent years, a lot of distributed graph processing frameworks and algorithms have been proposed. While many efforts have been devoted to analyzing these, with most analyzing them based on programming models, less research focuses on understanding their challenges in distributed environments. Applying graph tasks to distributed environments is not easy, often facing numerous challenges through our analysis, including parallelism, load balancing, communication overhead, and bandwidth. In this paper, we provide an extensive overview of the current state-of-the-art in this field by outlining the challenges and solutions of distributed graph algorithms. We first conduct a systematic analysis of the inherent challenges in distributed graph processing, followed by presenting an overview of existing general solutions. Subsequently, we survey the challenges highlighted in recent distributed graph processing papers and the strategies adopted to address them. Finally, we discuss the current research trends and identify potential future opportunities.

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.05260 [pdf, other]

SRAM-PG: Power Delivery Network Benchmarks from SRAM Circuits

Authors: Shan Shen, Zhiqiang Liu, Wenjian Yu

Abstract: Designing the power delivery network (PDN) in very large-scale integrated (VLSI) circuits is increasingly important, especially for nowadays low-power integrated circuit (IC) design. In order to ensure that the designed PDN enables a low level of voltage drop and noise which is required for the success of IC design, accurate analysis of PDN is largely demanded and brings a challenge of computation during the whole process of IC design. This promotes the research of efficient and scalable simulation methods for PDN. However, the lack of sufficient public PDN benchmarks hinders the relevant research. To this end, we construct and release a set of PDN benchmarks (named SRAM-PG) from SRAM circuit design in this work. The benchmarks are obtained from realistic and state-of-the-art SRAM designs, following a workflow for generating the post-layout PDN netlists with full RC parasitics. With careful modeling of load currents, the benchmarks reflect the dynamic work mode of the IC and can be used for both transient and DC analysis. The benchmarks are derived from the designs for diverse applications. And, sharing them in the public domain with detailed descriptions would largely benefit the relevant research. The whole set of benchmarks is available at https://github.com/ShenShan123/SRAM-PG.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Oral presentation at ISQED'24

arXiv:2404.04538 [pdf, other]

Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning

Authors: Juncheng Yang, Zuchao Li, Shuai Xie, Wei Yu, Shijun Li, Bo Du

Abstract: The chain-of-thought technique has been received well in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also models each step as a reasoning aggregation graph to cope with the overlooked multiple aspects of thinking in single-step reasoning. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.

Submitted 6 April, 2024; originally announced April 2024.

Comments: This paper is accepted to LREC-COLING 2024

arXiv:2404.03411 [pdf, ps, other]

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

Authors: Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu

Abstract: Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs) and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproduction and fair comparison. Besides, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs, such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluated results and find that (1) GPT4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. (2) Llama2 and Qwen-VL-Chat are more robust compared to other open-source models. (3) The transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found here https://anonymous.4open.science/r/red_teaming_gpt4-C1CE/README.md .

Submitted 4 April, 2024; originally announced April 2024.

Comments: technical report

arXiv:2404.00505 [pdf, other]

doi 10.1109/TMLCN.2024.3384329

Transfer Learning with Reconstruction Loss

Authors: Wei Cui, Wei Yu

Abstract: In most applications of utilizing neural networks for mathematical optimization, a dedicated model is trained for each specific optimization objective. However, in many scenarios, several distinct yet correlated objectives or tasks often need to be optimized on the same set of problem inputs. Instead of independently training a different neural network for each problem separately, it would be more efficient to exploit the correlations between these objectives and to train multiple neural network models with shared model parameters and feature representations. To achieve this, this paper first establishes the concept of common information: the shared knowledge required for solving the correlated tasks, then proposes a novel approach for model training by adding into the model an additional reconstruction stage associated with a new reconstruction loss. This loss is for reconstructing the common information starting from a selected hidden layer in the model. The proposed approach encourages the learned features to be general and transferable, and therefore can be readily used for efficient transfer learning. For numerical simulations, three applications are studied: transfer learning on classifying MNIST handwritten digits, the device-to-device wireless network power allocation, and the multiple-input-single-output network downlink beamforming and localization. Simulation results suggest that the proposed approach is highly efficient in data and model complexity, is resilient to over-fitting, and has competitive performances.

Submitted 11 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

Comments: 16 pages, 5 figures. To appear in IEEE Transactions on Machine Learning in Communications and Networking (TMLCN)

arXiv:2403.19574 [pdf, other]

Measurement of double-differential cross sections for mesonless charged-current muon neutrino interactions on argon with final-state protons using the MicroBooNE detector

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, G. Barr, D. Barrow, J. Barrow, V. Basque, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book, M. B. Brunetti, L. Camilleri , et al. (163 additional authors not shown)

Abstract: Charged-current neutrino interactions with final states containing zero mesons and at least one proton are of high interest for current and future accelerator-based neutrino oscillation experiments. Using the Booster Neutrino Beam and the MicroBooNE detector at Fermi National Accelerator Laboratory, we have obtained the first double-differential cross section measurements of this channel for muon neutrino scattering on an argon target with a proton momentum threshold of 0.25 GeV/c. We also report a flux-averaged total cross section of $\sigma = (11.8 \pm 1.2) \times 10^{-38}$ cm$^2$/Ar and several single-differential measurements which extend and improve upon previous results. Statistical and systematic uncertainties are quantified with a full treatment of correlations across 359 kinematic bins, including correlations between distributions describing different observables. The resulting data set provides the most detailed information obtained to date for testing models of mesonless neutrino-argon scattering.

Submitted 16 April, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: 83 pages, 67 figures (including supplemental material). For v2, added oversized files in extended data release

Report number: FERMILAB-PUB-24-0120-AD-CSAID-LBNF-PPD-TD

arXiv:2403.19128 [pdf, other]

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

Authors: Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang

Abstract: Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous schemas, previous works usually design task-specific architectures and objectives for individual tasks, which inadvertently leads to modal isolation and complex workflow. In this paper, we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective: point-conditioned text generation, and the unified input & output representation: prompt & structured sequences. Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performances on 7 datasets for the three visually-situated text parsing tasks, despite its unified, concise design. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.

Submitted 27 March, 2024; originally announced March 2024.

Comments: CVPR 2024

arXiv:2403.18272 [pdf, other]

Recovery of High-energy Low-frequency Quasi-periodic Oscillations from Black Hole X-ray Binary MAXI J1535-571 with a Hilbert-Huang Transform Method

Authors: Qingcang Shui, Shu Zhang, Shuangnan Zhang, Yupeng Chen, Lingda Kong, Jingqiang Peng, Long Ji, Pengju Wang, Zhi Chang, Zhuoli Yu, Hongxing Yin, Jinlu Qu, Lian Tao, Mingyu Ge, Xiang Ma, Liang Zhang, Wei Yu, Jian Li

Abstract: We propose a method based on the Hilbert-Huang transform (HHT) to recover the high-energy waveform of low-frequency quasi-periodic oscillations (LFQPOs). Based on the method, we successfully obtain the modulation of the phase-folded light curve above 170 keV using the QPO phase reconstructed at lower energies in MAXI J1535-571 with Insight-HXMT observations. A comprehensive simulation study is conducted to demonstrate that such modulation indeed originates from the QPO. Thus the highest energies turn out to significantly exceed the upper limit of ~100 keV for QPOs reported previously using the Fourier method, marking the first opportunity to study QPO properties above 100 keV in this source. Detailed analyses of these high-energy QPO profiles reveal different QPO properties between the 30-100 keV and 100-200 keV energy ranges: the phase lag remains relatively stable, and the amplitude slightly increases below ~100 keV, whereas above this threshold, soft phase lags and a decrease in amplitude are observed. Given the reports of a hard tail detection in broad spectroscopy, we propose that the newly discovered QPO properties above 100 keV are dominated by the hard tail component, possibly stemming from a relativistic jet. Our findings also indicate a strong correlation between the QPOs originating from the jet and corona, supporting the scenario of jet-corona coupling precession. We emphasize that our proposed HHT-based method can serve as an efficient manner in expanding the high energy band for studying QPOs, thereby enhancing our understanding of their origin.

Submitted 27 March, 2024; originally announced March 2024.

Comments: 21 pages, 15 figures, accepted for publication in ApJL

arXiv:2403.18197 [pdf, other]

LocoMan: Advancing Versatile Quadrupedal Dexterity with Lightweight Loco-Manipulators

Authors: Changyi Lin, Xingyu Liu, Yuxiang Yang, Yaru Niu, Wenhao Yu, Tingnan Zhang, Jie Tan, Byron Boots, Ding Zhao

Abstract: Quadrupedal robots have emerged as versatile agents capable of locomoting and manipulating in complex environments. Traditional designs typically rely on the robot's inherent body parts or incorporate top-mounted arms for manipulation tasks. However, these configurations may limit the robot's operational dexterity, efficiency and adaptability, particularly in cluttered or constrained spaces. In this work, we present LocoMan, a dexterous quadrupedal robot with a novel morphology to perform versatile manipulation in diverse constrained environments. By equipping a Unitree Go1 robot with two low-cost and lightweight modular 3-DoF loco-manipulators on its front calves, LocoMan leverages the combined mobility and functionality of the legs and grippers for complex manipulation tasks that require precise 6D positioning of the end effector in a wide workspace. To harness the loco-manipulation capabilities of LocoMan, we introduce a unified control framework that extends the whole-body controller (WBC) to integrate the dynamics of loco-manipulators. Through experiments, we validate that the proposed whole-body controller can accurately and stably follow desired 6D trajectories of the end effector and torso, which, when combined with the large workspace from our design, facilitates a diverse set of challenging dexterous loco-manipulation tasks in confined spaces, such as opening doors, plugging into sockets, picking objects in narrow and low-lying spaces, and bimanual manipulation.

Submitted 26 March, 2024; originally announced March 2024.

Comments: Project page: https://linchangyi1.github.io/LocoMan

arXiv:2403.15637 [pdf, other]

CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments

Authors: Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Mohamed Elnoor, Anuj Zore, Brian Ichter, Fei Xia, Jie Tan, Wenhao Yu, Dinesh Manocha

Abstract: We present ConVOI, a novel method for autonomous robot navigation in real-world indoor and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two ways: first, we leverage their zero-shot image classification capability to identify the context or scenario (e.g., indoor corridor, outdoor terrain, crosswalk, etc.) of the robot's surroundings, and formulate context-based navigation behaviors as simple text prompts (e.g., "stay on the pavement"). Second, we utilize their state-of-the-art semantic understanding and logical reasoning capabilities to compute a suitable trajectory given the identified context. To this end, we propose a novel multi-modal visual marking approach to annotate the obstacle-free regions in the RGB image used as input to the VLM with numbers, by correlating it with a local occupancy map of the environment. The marked numbers ground image locations in the real-world, direct the VLM's attention solely to navigable locations, and elucidate the spatial relationships between them and terrains depicted in the image to the VLM. Next, we query the VLM to select numbers on the marked image that satisfy the context-based behavior text prompt, and construct a reference path using the selected numbers. Finally, we propose a method to extrapolate the reference trajectory when the robot's environmental context has not changed to prevent unnecessary VLM queries. We use the reference trajectory to guide a motion planner, and demonstrate that it leads to human-like behaviors (e.g., not cutting through a group of people, using crosswalks, etc.) in various real-world indoor and outdoor scenarios.

Submitted 22 March, 2024; originally announced March 2024.

Comments: 9 pages, 4 figures

arXiv:2403.14240 [pdf, other]

Weak Supervision with Arbitrary Single Frame for Micro- and Macro-expression Spotting

Authors: Wang-Wang Yu, Xian-Shi Zhang, Fu-Ya Luo, Yijun Cao, Kai-Fu Yang, Hong-Mei Yan, Yong-Jie Li

Abstract: Frame-level micro- and macro-expression spotting methods require time-consuming frame-by-frame observation during annotation. Meanwhile, video-level spotting lacks sufficient information about the location and number of expressions during training, resulting in significantly inferior performance compared with fully-supervised spotting. To bridge this gap, we propose a point-level weakly-supervised expression spotting (PWES) framework, where each expression requires to be annotated with only one random frame (i.e., a point). To mitigate the issue of sparse label distribution, the prevailing solution is pseudo-label mining, which, however, introduces new problems: localizing contextual background snippets results in inaccurate boundaries and discarding foreground snippets leads to fragmentary predictions. Therefore, we design the strategies of multi-refined pseudo label generation (MPLG) and distribution-guided feature contrastive learning (DFCL) to address these problems. Specifically, MPLG generates more reliable pseudo labels by merging class-specific probabilities, attention scores, fused features, and point-level labels. DFCL is utilized to enhance feature similarity for the same categories and feature variability for different categories while capturing global representations across the entire datasets. Extensive experiments on the CAS(ME)^2, CAS(ME)^3, and SAMM-LV datasets demonstrate PWES achieves promising performance comparable to that of recent fully-supervised methods.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.14168 [pdf, other]

M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

Authors: Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang

Abstract: Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the texts and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations. In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M$^3$AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biology topics. With high-quality human annotations of the slide text and spoken words, in particular high-valued name entities, the dataset can be used for multiple audio-visual recognition and understanding tasks. Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M$^3$AV makes it a challenging dataset.

Submitted 4 June, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: ACL 2024 Main Conference. Project website: https://jack-zc8.github.io/M3AV-dataset-page

arXiv:2403.13127 [pdf, other]

Timing analysis of the newly discovered black hole candidate Swift J1727.8-1613 with Insight-HXMT

Authors: Wei Yu, Qing-Cui Bu, Shuang-Nan Zhang, He-Xin Liu, Liang Zhang, Lorenzo Ducci, Lian Tao, Andrea Santangelo, Victor Doroshenko, Yue Huang, Zi-Xu Yang, Jin-Lu Qu

Abstract: We present the results obtained from an X-ray timing study of the new black hole candidate (BHC) Swift J1727.8-1613. The work is based on Hard X-ray Modulation Telescope (Insight-HXMT) observations carried out during the 2023 outburst. Prominent type-C low-frequency Quasi-periodic Oscillations (LFQPOs) are detected throughout the observations. With the substantial effective area of the Insight-HXMT at high energies, we examine the energy dependence of various parameters, including the centroid frequency, fractional rms, and phase lags of the type-C QPOs. Our findings align closely with those observed in high-inclination systems. During the initial stage of the outburst, a peaked noise component is also detected, the frequency of which is highly correlated with the LFQPO frequency, aligning with the Psaltis-Belloni-van der Klis (PBK) relation. By assuming that the peaked noise originates from the precession of the accretion disc, the spin of this source can be constrained. Our results suggest that this source may possess a high spin.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.12471 [pdf, other]

Theoretical Modeling and Bio-inspired Trajectory Optimization of A Multiple-locomotion Origami Robot

Authors: Keqi Zhu, Haotian Guo, Wei Yu, Hassen Nigatu, Tong Li, Huixu Dong

Abstract: Recent research on mobile robots has focused on increasing their adaptability to unpredictable and unstructured environments using soft materials and structures. However, the determination of key design parameters and control over these compliant robots are predominantly iterated through experiments, lacking a solid theoretical foundation. To improve their efficiency, this paper aims to provide mathematics modeling over two locomotion, crawling and swimming. Specifically, a dynamic model is first devised to reveal the influence of the contact surfaces' frictional coefficients on displacements in different motion phases. Besides, a swimming kinematics model is provided using coordinate transformation, based on which, we further develop an algorithm that systematically plans human-like swimming gaits, with maximum thrust obtained. The proposed algorithm is highly generalizable and has the potential to be applied in other soft robots with multiple joints. Simulation experiments have been conducted to illustrate the effectiveness of the proposed modeling.

Submitted 19 March, 2024; originally announced March 2024.

Comments: 8 pages

arXiv:2403.10340 [pdf, other]

Thermal-NeRF: Neural Radiance Fields from an Infrared Camera

Authors: Tianxiang Ye, Qi Wu, Junyuan Deng, Guoqing Liu, Liu Liu, Songpengcheng Xia, Liang Pang, Wenxian Yu, Ling Pei

Abstract: In recent years, Neural Radiance Fields (NeRFs) have demonstrated significant potential in encoding highly-detailed 3D geometry and environmental appearance, positioning themselves as a promising alternative to traditional explicit representation for 3D scene reconstruction. However, the predominant reliance on RGB imaging presupposes ideal lighting conditions: a premise frequently unmet in robotic applications plagued by poor lighting or visual obstructions. This limitation overlooks the capabilities of infrared (IR) cameras, which excel in low-light detection and present a robust alternative under such adverse scenarios. To tackle these issues, we introduce Thermal-NeRF, the first method that estimates a volumetric scene representation in the form of a NeRF solely from IR imaging. By leveraging a thermal mapping and structural thermal constraint derived from the thermal characteristics of IR imaging, our method showcases unparalleled proficiency in recovering NeRFs in visually degraded scenes where RGB-based methods fall short. We conduct extensive experiments to demonstrate that Thermal-NeRF can achieve superior quality compared to existing methods. Furthermore, we contribute a dataset for IR-based NeRF applications, paving the way for future research in IR NeRF reconstruction.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.09004 [pdf, ps, other]

Meta-Learning-Based Fronthaul Compression for Cloud Radio Access Networks

Authors: Ruihua Qiao, Tao Jiang, Wei Yu

Abstract: This paper investigates the fronthaul compression problem in a user-centric cloud radio access network, in which single-antenna users are served by a central processor (CP) cooperatively via a cluster of remote radio heads (RRHs). To satisfy the fronthaul capacity constraint, this paper proposes a transform-compress-forward scheme, which consists of well-designed transformation matrices and uniform quantizers. The transformation matrices perform dimension reduction in the uplink and dimension expansion in the downlink. To reduce the communication overhead for designing the transformation matrices, this paper further proposes a deep learning framework to first learn a suboptimal transformation matrix at each RRH based on the local channel state information (CSI), and then to refine it iteratively. To facilitate the refinement process, we propose an efficient signaling scheme that only requires the transmission of low-dimensional effective CSI and its gradient between the CP and RRH, and further, a meta-learning based gated recurrent unit network to reduce the number of signaling transmission rounds. For the sum-rate maximization problem, simulation results show that the proposed two-stage neural network can perform close to the fully cooperative global CSI based benchmark with significantly reduced communication overhead for both the uplink and the downlink. Moreover, using the first stage alone can already outperform the existing local CSI based benchmark.

Submitted 13 March, 2024; originally announced March 2024.

Comments: 15 Pages, 13 Figures; accepted in IEEE Transactions on Wireless Communications

arXiv:2403.08651 [pdf, other]

HAIFIT: Human-to-AI Fashion Image Translation

Authors: Jianan Jiang, Xinglin Li, Weiren Yu, Di Wu

Abstract: In the realm of fashion design, sketches serve as the canvas for expressing an artist's distinctive drawing style and creative vision, capturing intricate details like stroke variations and texture nuances. The advent of sketch-to-image cross-modal translation technology has notably aided designers. However, existing methods often compromise these sketch details during image generation, resulting in images that deviate from the designer's intended concept. This limitation hampers the ability to offer designers a precise preview of the final output. To overcome this challenge, we introduce HAIFIT, a novel approach that transforms sketches into high-fidelity, lifelike clothing images by integrating multi-scale features and capturing extensive feature map dependencies from diverse perspectives. Through extensive qualitative and quantitative evaluations conducted on our self-collected dataset, our method demonstrates superior performance compared to existing methods in generating photorealistic clothing images. Our method excels in preserving the distinctive style and intricate details essential for fashion design applications. In addition, our method also has obvious advantages in model training and inference speed, contributing to reducing designers' time costs and improving design efficiency.

Submitted 13 August, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

Comments: 10 pages, 8 figures

arXiv:2403.07698 [pdf, ps, other]

The Kazdan-Warner problem on compact Kähler surfaces

Authors: Weike Yu

Abstract: In this paper, we investigate a Kazdan-Warner problem on compact Kähler surfaces with negative Gauduchon degree, which corresponds to prescribing sign-changing Chern scalar curvatures. By the method of our recent paper [J. Funct. Anal. 285 (2023): 109948], we establish a Chen-Li type existence theorem on compact Kähler surfaces when the candidate curvature function is of negative average. Moreover, we give an alternative proof of Ding-Liu's theorem [Trans. Amer. Math. Soc. 347 (1995) 1059-1066] on prescribing sign-changing Gaussian curvatures by using the $\sup+\inf$ inequality due to H. Brezis, Y. Y. Li and I. Shafrir.

Submitted 12 March, 2024; originally announced March 2024.

Comments: 12 pages

MSC Class: 32Q15; 35J60

arXiv:2403.05262 [pdf, other]

Debiasing Multimodal Large Language Models

Authors: Yi-Fan Zhang, Weichen Yu, Qingsong Wen, Xue Wang, Zhang Zhang, Liang Wang, Rong Jin, Tieniu Tan

Abstract: In the realms of computer vision and natural language processing, Large Vision-Language Models (LVLMs) have become indispensable tools, proficient in generating textual descriptions based on visual inputs. Despite their advancements, our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior rather than the input image. Our empirical experiments underscore the persistence of this bias, as LVLMs often provide confident answers even in the absence of relevant images or given incongruent visual input. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies. Firstly, for tasks such as classification or multi-choice question-answering (QA), we propose a ``calibration'' step through affine transformation to adjust the output distribution. This ``Post-Hoc debias'' approach ensures uniform scores for each answer when the image is absent, serving as an effective regularization technique to alleviate the influence of LLM priors. For more intricate open-ended generation tasks, we extend this method to ``Debias sampling'', drawing inspiration from contrastive decoding methods. Furthermore, our investigation sheds light on the instability of LVLMs across various decoding configurations. Through systematic exploration of different settings, we significantly enhance performance, surpassing reported results and raising concerns about the fairness of existing evaluations.
Comprehensive experiments substantiate the effectiveness of our proposed strategies in mitigating biases. These strategies not only prove beneficial in minimizing hallucinations but also contribute to the generation of more helpful and precise illustrations.

Submitted 27 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

Comments: 38 pages, 17 figures

arXiv:2403.01714 [pdf, other]

doi 10.1103/PhysRevB.109.184407

Molecular intercalation in the van der Waals antiferromagnets FePS3 and NiPS3

Authors: Cong Li, Ze Hu, Xiaofei Hou, Sheng Xu, Zhanlong Wu, Kefan Du, Shuo Li, Xiaoyu Xu, Ying Chen, Zeyu Wang, Tiancheng Mu, Tian-Long Xia, Yanfeng Guo, B. Normand, Weiqiang Yu, Yi Cui

Abstract: We have performed electrochemical treatment of the van der Waals antiferromagnetic materials FePS$_3$ and NiPS$_3$ with the ionic liquid EMIM-BF$_4$, achieving significant molecular intercalation. Mass analysis of the intercalated compounds, EMIM$_x$-FePS$_3$ and EMIM$_x$-NiPS$_3$, indicated respective intercalation levels, $x$, of approximately 27\% and 37\%, and X-ray diffraction measurements demonstrated a massive (over 50\%) enhancement of the $c$-axis lattice parameters. To investigate the consequences of these changes for the magnetic properties, we performed magnetic susceptibility and $^{31}$P nuclear magnetic resonance (NMR) studies of both systems. For EMIM$_x$-FePS$_3$, intercalation reduces the magnetic ordering temperature from $T_N = 120$~K to 78~K, and we find a spin gap in the antiferromagnetic phase that drops from 45~K to 30~K. For EMIM$_x$-NiPS$_3$, the ordering temperature is almost unaffected (changing from 148~K to 145~K), but a change towards nearly isotropic spin fluctuations suggests an alteration of the magnetic Hamiltonian. Such relatively modest changes, given that the huge extension of the $c$ axes is expected to cause a very strong suppression of any interlayer interactions, point unequivocally to the conclusion that the magnetic properties of both parent compounds are determined solely by two-dimensional (2D), intralayer physics.
The changes in transition temperatures and low-temperature spin dynamics in both compounds therefore indicate that intercalation also results in a significant modulation of the intralayer magnetic interactions, which we propose is due to charge doping and localization on the P sites. Our study offers chemical intercalation with ionic liquids as an effective method to control not only the interlayer but also the intralayer interactions in quasi-2D magnetic materials.

Submitted 3 March, 2024; originally announced March 2024.

Journal ref: Physical Review B 109, 184407 (2024)

arXiv:2403.01457 [pdf, other]

Logic Rules as Explanations for Legal Case Retrieval

Authors: Zhongxiang Sun, Kepu Zhang, Weijie Yu, Haoyu Wang, Jun Xu

Abstract: In this paper, we address the issue of using logic rules to explain the results from legal case retrieval. The task is critical to legal case retrieval because the users (e.g., lawyers or judges) are highly specialized and require the system to provide logical, faithful, and interpretable explanations before making legal decisions. Recently, research efforts have been made to learn explainable legal case retrieval models. However, these methods usually select rationales (key sentences) from the legal cases as explanations, failing to provide faithful and logically correct explanations. In this paper, we propose Neural-Symbolic enhanced Legal Case Retrieval (NS-LCR), a framework that explicitly conducts reasoning on the matching of legal cases through learning case-level and law-level logic rules. The learned rules are then integrated into the retrieval process in a neuro-symbolic manner. Benefiting from the logical and interpretable nature of the logic rules, NS-LCR is equipped with built-in faithful explainability. We also show that NS-LCR is a model-agnostic framework that can be plugged in for multiple legal retrieval models. To showcase NS-LCR's superiority, we enhance existing benchmarks by adding manually annotated logic rules and introducing a novel explainability metric using Large Language Models (LLMs). Our comprehensive experiments reveal NS-LCR's effectiveness for ranking, alongside its proficiency in delivering reliable explanations for legal case retrieval.

Submitted 3 March, 2024; originally announced March 2024.

Comments: Accepted by LREC-COLING 2024

arXiv:2403.00134 [pdf, other]

Active Sensing for Reciprocal MIMO Channels

Authors: Tao Jiang, Wei Yu

Abstract: This paper addresses the design of transmit precoder and receive combiner matrices to support $N_{\rm s}$ independent data streams over a time-division duplex (TDD) point-to-point massive multiple-input multiple-output (MIMO) channel with either a fully digital or a hybrid structure. The optimal precoder and combiner design amounts to finding the top-$N_{\rm s}$ singular vectors of the channel matrix, but the explicit estimation of the entire high-dimensional channel would require significant pilot overhead. Alternatively, prior works suggest finding the precoding and combining matrices directly by exploiting channel reciprocity and by using the power iteration method, but the performance of this approach degrades in the low SNR regime. To tackle this challenging problem, this paper proposes a learning-based active sensing framework, where the transmitter and the receiver send pilots alternately using sensing beamformers that are actively designed as functions of previously received pilots. This is accomplished by using recurrent neural networks to summarize information from the historical observations into hidden state vectors, then using fully connected neural networks to learn the appropriate sensing beamformers in the next pilot stage and finally the transmit precoding and receive combiner matrices for data communications. Simulations demonstrate that the learning-based method outperforms existing approaches significantly and maintains superior performance even in the low SNR regime for both the fully digital and hybrid MIMO scenarios.

Submitted 6 June, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

Comments: This paper is accepted in IEEE Transactions on Signal Processing

arXiv:2402.19385 [pdf, other]

Towards Safe and Reliable Autonomous Driving: Dynamic Occupancy Set Prediction

Authors: Wenbo Shao, Jiahui Xu, Wenhao Yu, Jun Li, Hong Wang

Abstract: In the rapidly evolving field of autonomous driving, reliable prediction is pivotal for vehicular safety. However, trajectory predictions often deviate from actual paths, particularly in complex and challenging environments, leading to significant errors. To address this issue, our study introduces a novel method for Dynamic Occupancy Set (DOS) prediction. It effectively combines advanced trajectory prediction networks with a DOS prediction module, overcoming the shortcomings of existing models. It provides a comprehensive and adaptable framework for predicting the potential occupancy sets of traffic participants. The innovative contributions of this study include the development of a novel DOS prediction model specifically tailored for navigating complex scenarios, the introduction of precise DOS mathematical representations, and the formulation of optimized loss functions that collectively advance the safety and efficiency of autonomous systems. Through rigorous validation, our method demonstrates marked improvements over traditional models, establishing a new benchmark for safety and operational efficiency in intelligent transportation systems.

Submitted 2 June, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted by IEEE IV 2024

arXiv:2402.19281 [pdf, other]

doi 10.1103/PhysRevLett.133.041801

First simultaneous measurement of differential muon-neutrino charged-current cross sections on argon for final states with and without protons using MicroBooNE data

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, G. Barr, D. Barrow, J. Barrow, V. Basque, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book, M. B. Brunetti, L. Camilleri, et al. (163 additional authors not shown)

Abstract: We report the first double-differential neutrino-argon cross section measurement made simultaneously for final states with and without protons for the inclusive muon neutrino charged-current interaction channel. The proton kinematics of this channel are further explored with a differential cross section measurement as a function of the leading proton's kinetic energy that extends across the detection threshold. These measurements utilize data collected using the MicroBooNE detector from 6.4$\times10^{20}$ protons on target from the Fermilab Booster Neutrino Beam with a mean neutrino energy of $\sim$0.8 GeV. Extensive data-driven model validation utilizing the conditional constraint formalism is employed. This motivates enlarging the uncertainties with an empirical reweighting approach to minimize the possibility of extracting biased cross section results. The extracted nominal flux-averaged cross sections are compared to widely used event generator predictions revealing severe mismodeling of final states without protons for muon neutrino charged-current interactions, possibly from insufficient treatment of final state interactions. These measurements provide a wealth of new information useful for improving event generators which will enhance the sensitivity of precision measurements in neutrino experiments.

Submitted 27 July, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Report number: FERMILAB-PUB-24-0045

Journal ref: Phys. Rev. Lett. 133, 041801 (2024)

arXiv:2402.19216 [pdf, other]

doi 10.1103/PhysRevD.110.013006

Inclusive cross section measurements in final states with and without protons for charged-current $ν_μ$-Ar scattering in MicroBooNE

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, G. Barr, D. Barrow, J. Barrow, V. Basque, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book, M. B. Brunetti, L. Camilleri, et al. (164 additional authors not shown)

Abstract: A detailed understanding of inclusive muon neutrino charged-current interactions on argon is crucial to the study of neutrino oscillations in current and future experiments using liquid argon time projection chambers. To that end, we report a comprehensive set of differential cross section measurements for this channel that simultaneously probe the leptonic and hadronic systems by dividing the channel into final states with and without protons. Measurements of the proton kinematics and proton multiplicity of the final state are also presented. For these measurements, we utilize data collected with the MicroBooNE detector from 6.4$\times10^{20}$ protons on target from the Fermilab Booster Neutrino Beam at a mean neutrino energy of approximately 0.8 GeV. We present in detail the cross section extraction procedure, including the unfolding, and model validation that uses data to model comparisons and the conditional constraint formalism to detect mismodeling that may introduce biases to extracted cross sections that are larger than their uncertainties. The validation exposes insufficiencies in the overall model, motivating the inclusion of an additional data-driven reweighting systematic to ensure the accuracy of the unfolding. The extracted results are compared to a number of event generators and their performance is discussed with a focus on the regions of phase-space that indicate the greatest need for modeling improvements.

Submitted 27 July, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Report number: FERMILAB-PUB-24-0044

Journal ref: Phys. Rev. D 110, 013006 (2024)

arXiv:2402.19173 [pdf, other]

StarCoder 2 and The Stack v2: The Next Generation

Authors: Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, et al. (41 additional authors not shown)

Abstract: The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.18109 [pdf, other]

doi 10.1007/s11042-023-17517-w

Dual-Context Aggregation for Universal Image Matting

Authors: Qinglin Liu, Xiaoqian Lv, Wei Yu, Changyong Guo, Shengping Zhang

Abstract: Natural image matting aims to estimate the alpha matte of the foreground from a given image. Various approaches have been explored to address this problem, such as interactive matting methods that use guidance such as click or trimap, and automatic matting methods tailored to specific objects. However, existing matting methods are designed for specific objects or guidance, neglecting the common requirement of aggregating global and local contexts in image matting. As a result, these methods often encounter challenges in accurately identifying the foreground and generating precise boundaries, which limits their effectiveness in unforeseen scenarios. In this paper, we propose a simple and universal matting framework, named Dual-Context Aggregation Matting (DCAM), which enables robust image matting with arbitrary guidance or without guidance. Specifically, DCAM first adopts a semantic backbone network to extract low-level features and context features from the input image and guidance. Then, we introduce a dual-context aggregation network that incorporates global object aggregators and local appearance aggregators to iteratively refine the extracted context features. By performing both global contour segmentation and local boundary refinement, DCAM exhibits robustness to diverse types of guidance and objects. Finally, we adopt a matting decoder network to fuse the low-level features and the refined context features for alpha matte estimation.
Experimental results on five matting datasets demonstrate that the proposed DCAM outperforms state-of-the-art matting methods in both automatic matting and interactive matting tasks, which highlights the strong universality and high performance of DCAM. The source code is available at https://github.com/Windaway/DCAM.

Submitted 28 February, 2024; originally announced February 2024.

Journal ref: Multimed Tools Appl (2023)

arXiv:2402.17416 [pdf, other]

doi 10.1088/1674-1056/ad3c32

Semiclassical approach to spin dynamics of a ferromagnetic S=1 chain

Authors: Chengchen Li, Yi Cui, Weiqiang Yu, Rong Yu

Abstract: Motivated by recent experimental progress in the quasi-one-dimensional quantum magnet NiNb$_2$O$_6$, we study the spin dynamics of an S=1 ferromagnetic Heisenberg chain with single-ion anisotropy by using a semiclassical molecular dynamics approach. This system undergoes a quantum phase transition from a ferromagnetic to a paramagnetic state under a transverse magnetic field, and the magnetic responses reflecting this transition are well described by our semiclassical method. We show that at low temperature the transverse component of the dynamical structure factor depicts clearly the magnon dispersion, and the longitudinal component exhibits two continua associated with single- and two-magnon excitations, respectively. These spin excitation spectra show interesting temperature dependence as effects of magnon interactions. Our findings shed light on experimental detection of spin excitations in a large class of quasi-one-dimensional magnets.

Submitted 31 July, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Journal ref: Chinese Phys. B 33 (2024) 067501

arXiv:2402.15365 [pdf, other]

Efficient semi-supervised inference for logistic regression under case-control studies

Authors: Zhuojun Quan, Yuanyuan Lin, Kani Chen, Wen Yu

Abstract: Semi-supervised learning has received increasing attention in statistics and machine learning. In semi-supervised learning settings, a labeled data set with both outcomes and covariates and an unlabeled data set with covariates only are collected. We consider an inference problem in semi-supervised settings where the outcome in the labeled data is binary and the labeled data is collected by case-control sampling. Case-control sampling is an effective sampling scheme for alleviating imbalance structure in binary data. Under the logistic model assumption, case-control data can still provide a consistent estimator for the slope parameter of the regression model. However, the intercept parameter is not identifiable. Consequently, the marginal case proportion cannot be estimated from case-control data. We find that with the availability of the unlabeled data, the intercept parameter can be identified in the semi-supervised learning setting. We construct the likelihood function of the observed labeled and unlabeled data and obtain the maximum likelihood estimator via an iterative algorithm. The proposed estimator is shown to be consistent, asymptotically normal, and semiparametrically efficient. Extensive simulation studies are conducted to show the finite sample performance of the proposed method. The results imply that the unlabeled data not only helps to identify the intercept but also improves the estimation efficiency of the slope parameter. Meanwhile, the marginal case proportion can be estimated accurately by the proposed method.

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.14308 [pdf, other]

Ground-Fusion: A Low-cost Ground SLAM System Robust to Corner Cases

Authors: Jie Yin, Ang Li, Wei Xi, Wenxian Yu, Danping Zou

Abstract: We introduce Ground-Fusion, a low-cost sensor fusion simultaneous localization and mapping (SLAM) system for ground vehicles. Our system features efficient initialization, effective sensor anomaly detection and handling, real-time dense color mapping, and robust localization in diverse environments. We tightly integrate RGB-D images, inertial measurements, wheel odometer and GNSS signals within a factor graph to achieve accurate and reliable localization both indoors and outdoors. To ensure successful initialization, we propose an efficient strategy that comprises three different methods: stationary, visual, and dynamic, tailored to handle diverse cases. Furthermore, we develop mechanisms to detect sensor anomalies and degradation, handling them adeptly to maintain system accuracy. Our experimental results on both public and self-collected datasets demonstrate that Ground-Fusion outperforms existing low-cost SLAM systems in corner cases. We release the code and datasets at https://github.com/SJTU-ViSYS/Ground-Fusion.

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.12084 [pdf, other]

Simultaneous multi-wavelength observations of the repeating fast radio burst FRB 20190520B with Swift and FAST

Authors: Zhen Yan, Wenfei Yu, Kim L. Page, Jie Lin, Di Li, Chenhui Niu, Casey Law, Bing Zhang, Shami Chatterjee, Xian Zhang, Reshma Anna-Thomas

Abstract: Fast radio bursts (FRBs) are bright, millisecond-duration radio bursts of cosmic origin. Several dozen FRBs have been found to repeat. Among them, those precisely localized provide the best opportunity to probe their multi-wavelength counterparts, local environment, and host galaxy, which would reveal their origins. Here we report our X-ray, ultraviolet (UV) and optical observations with the $Swift$ satellite, performed simultaneously with radio-band observations of the repeating FRB 20190520B by the Five-hundred-meter Aperture Spherical radio Telescope (FAST), aiming at the detection of possible multi-wavelength bursts in association with radio bursts and of a multi-wavelength counterpart of the persistent radio source (PRS). While a total of 10 radio bursts were detected by FAST at the same time as the $Swift$ observations, we detected neither X-ray, UV, or optical bursts accompanying the radio bursts nor a persistent multi-wavelength counterpart of the PRS. We obtained the energy upper limits ($3\sigma$) on any multi-wavelength bursts as $5.03 \times 10^{47}$ erg in the hard X-ray band (15-150 keV), $7.98 \times 10^{45}$ erg in the soft X-ray band (0.3-10 keV), and $4.51 \times 10^{44}$ erg in the U band, respectively. The energy ratio between soft X-ray (0.3-10 keV) and radio emission of the bursts is constrained as $<6\times10^{7}$, and the ratio between optical (U band) and radio as $<1.19\times10^{6}$.
The $3\sigma$ luminosity upper limits at the position of the PRS are 1.04$\times10^{47}$ (15-150 keV), 8.81$\times10^{42}$ (0.3-10 keV), 9.26$\times10^{42}$ (UVW1), and 2.54$\times10^{42}$ erg s$^{-1}$ (U), respectively. We show that the PRS is much more radio loud than representative pulsar wind nebulae, supernova remnants, extended jets of Galactic X-ray binaries and ultraluminous X-ray sources, suggestive of boosted radio emission of the PRS.

Submitted 19 February, 2024; originally announced February 2024.

Comments: 12 pages, 7 figures, submitted to ApJ

arXiv:2402.11743 [pdf, other]

doi 10.1109/TWC.2023.3335362

Hybrid Online-Offline Learning for Task Offloading in Mobile Edge Computing Systems

Authors: Muhammad Sohaib, Sang-Woon Jeon, Wei Yu

Abstract: We consider a multi-user multi-server mobile edge computing (MEC) system, in which users arrive on a network randomly over time and generate computation tasks, which are either computed locally on their own computing devices or offloaded to one of the MEC servers. Under such a dynamic network environment, we propose a novel task offloading policy based on hybrid online-offline learning, which can efficiently reduce the overall computation delay and energy consumption using only information available at the MEC servers nearest to each user. We provide a practical signaling and learning framework that can train deep neural networks for both online and offline learning and can adjust its offloading policy based on the queuing status of each MEC server and network dynamics. Numerical results demonstrate that the proposed scheme significantly reduces the average computation delay for a broad class of network environments compared to conventional offloading methods. It is further shown that the proposed hybrid online-offline learning framework can be extended to a general cost function reflecting both delay and energy-dependent metrics.

Submitted 27 February, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

Comments: accepted by IEEE Transactions on Wireless Communications

Journal ref: IEEE Transactions on Wireless Communications (2023)

arXiv:2402.11450 [pdf, other]

Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Authors: Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng, Montserrat Gonzalez Arenas, Maria Attarian, Maria Bauza, Matthew Bennice, Alex Bewley, Adil Dostmohamed, Chuyuan Kelly Fu, Nimrod Gileadi, Marissa Giustina, Keerthana Gopalakrishnan, Leonard Hasenclever, Jan Humplik, Jasmine Hsu, Nikhil Joshi, Ben Jyenis, Chase Kew, Sean Kirmani, Tsang-Wei Edward Lee, Kuang-Huei Lee, Assaf Hurwitz Michaely, Joss Moore, et al. (25 additional authors not shown)

Abstract: Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for only as long as it fits within the context size of the LLM, and can be forgotten over longer interactions. In this work, we investigate fine-tuning the robot code-writing LLMs to remember their in-context interactions and improve their teachability, i.e., how efficiently they adapt to human inputs (measured by the average number of corrections before the user considers the task successful). Our key observation is that when human-robot interactions are viewed as a partially observable Markov decision process (in which human language inputs are observations, and robot code outputs are actions), then training an LLM to complete previous interactions is training a transition dynamics model -- that can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments -- improving non-expert teaching success rates of unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9.
Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.

Submitted 31 May, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

Showing 101–150 of 1,468 results for author: Yu, W