Search | arXiv e-print repository

arXiv:2408.09429 [pdf, other]

Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models

Authors: Kening Zheng, Junkai Chen, Yibo Yan, Xin Zou, Xuming Hu

Abstract: Hallucination issues persistently plague current multimodal large language models (MLLMs). Existing research primarily focuses on object-level or attribute-level hallucinations, sidelining the more sophisticated relation hallucinations that necessitate advanced reasoning abilities from MLLMs. Besides, recent benchmarks regarding relation hallucinations lack in-depth evaluation and effective mitigation. Moreover, their datasets are typically derived from a systematic annotation process, which could introduce inherent biases due to the predefined process. To handle the aforementioned challenges, we introduce Reefknot, a comprehensive benchmark specifically targeting relation hallucinations, consisting of over 20,000 samples derived from real-world scenarios. Specifically, we first provide a systematic definition of relation hallucinations, integrating perspectives from perceptive and cognitive domains. Furthermore, we construct the relation-based corpus utilizing the representative scene graph dataset Visual Genome (VG), from which semantic triplets follow real-world distributions. Our comparative evaluation across three distinct tasks revealed a substantial shortcoming in the capabilities of current MLLMs to mitigate relation hallucinations. Finally, we advance a novel confidence-based mitigation strategy tailored to tackle the relation hallucinations problem. Across three datasets, including Reefknot, we observed an average reduction of 9.75% in the hallucination rate. We believe our paper offers valuable insights into achieving trustworthy multimodal intelligence. Our dataset and code will be released upon paper acceptance.

Submitted 18 August, 2024; originally announced August 2024.

arXiv:2408.08802 [pdf, other]

PriorMapNet: Enhancing Online Vectorized HD Map Construction with Priors

Authors: Rongxuan Wang, Xin Lu, Xiaoyang Liu, Xiaoyi Zou, Tongyi Cao, Ying Li

Abstract: Online vectorized High-Definition (HD) map construction is crucial for subsequent prediction and planning tasks in autonomous driving. Following the MapTR paradigm, recent works have made noteworthy achievements. However, reference points are randomly initialized in mainstream methods, leading to unstable matching between predictions and ground truth. To address this issue, we introduce PriorMapNet to enhance online vectorized HD map construction with priors. We propose the PPS-Decoder, which provides reference points with position and structure priors. Fitted from the map elements in the dataset, prior reference points lower the learning difficulty and achieve stable matching. Furthermore, we propose the PF-Encoder to enhance the image-to-BEV transformation with BEV feature priors. Besides, we propose the DMD cross-attention, which decouples cross-attention along multi-scale and multi-sample respectively to achieve efficiency. Our proposed PriorMapNet achieves state-of-the-art performance in the online vectorized HD map construction task on the nuScenes and Argoverse2 datasets. The code will be released publicly soon.

Submitted 20 August, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

arXiv:2408.02878 [pdf]

Ultrahigh-speed thin-film lithium niobate optical coherent receiver

Authors: Xiaojun Xie, Chao Wei, Xingchen He, Yake Chen, Chenghao Wang, Jihui Sun, Lin Jiang, Jia Ye, Xihua Zou, Wei Pan, Lianshan Yan

Abstract: The rapid advancement of the thin-film lithium niobate platform has established it as a premier choice for high-performance photonics integration. High-speed optical coherent receivers are essential for supporting the large communication capacities required by data center interconnects. Although high-speed photodiodes have been demonstrated on the thin-film LiNbO3 platform, the development of an ultrahigh-speed optical coherent receiver on this platform has not yet been realized. Here, we propose and experimentally demonstrate an ultra-wideband PD and ultrahigh-speed optical coherent receiver on an InP-LiNbO3 wafer-level heterogeneous integration platform. The fabricated single PD exhibits a record-high bandwidth of 140 GHz and successfully receives a high-quality 100-Gbaud pulse amplitude modulation (PAM4) signal. Furthermore, a thin-film LiNbO3 optical coherent receiver, featuring a large balanced detection bandwidth of 60 GHz, a large common mode rejection ratio (CMRR) exceeding 20 dB, and a low energy consumption of 9.6 fJ per bit, enables an ultrahigh-speed coherent reception with advanced modulation formats. The single-polarization I-Q coherent receiver, incorporating a compact 2x4 90° optical hybrid and a balanced photodetector array, achieves a receiving capacity of 600 Gbps per channel with 100-Gbaud 64 quadrature amplitude modulation (QAM) signal and 512 Gbps per channel with 128-Gbaud 16 QAM signal. Additionally, we demonstrate a long-distance reception of 100 Gbaud quadrature phase-shift keying (QPSK) and 16 QAM signals over transmission distances of 1040 km and 25 km. A seven-channel single-polarization I-Q coherent receiving chip achieves a total receiving capacity of 3.584 Tbps. This heterogeneous-integrated thin-film LiNbO3 optical coherent receiver shows the potential for Pbps-scale applications in future hyperscale data center interconnects.

Submitted 5 August, 2024; originally announced August 2024.
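
The capacity figures quoted in the abstract above can be cross-checked with simple arithmetic. The sketch below assumes that each per-channel figure is the raw line rate (symbol rate times bits per symbol, with no coding overhead subtracted) and that the seven-channel total is seven times the 128-Gbaud 16-QAM rate. The abstract does not state either assumption explicitly, and the function name is illustrative only.

```python
import math


def raw_line_rate_gbps(symbol_rate_gbaud: float, constellation_points: int) -> float:
    """Raw line rate = symbol rate x log2(constellation size)."""
    bits_per_symbol = math.log2(constellation_points)
    return symbol_rate_gbaud * bits_per_symbol


# Figures taken from the abstract; the interpretation is an assumption.
rate_64qam = raw_line_rate_gbps(100, 64)   # 100 Gbaud, 64-QAM (6 bits/symbol)
rate_16qam = raw_line_rate_gbps(128, 16)   # 128 Gbaud, 16-QAM (4 bits/symbol)
seven_channel_tbps = 7 * rate_16qam / 1000  # assumed: 7 channels at the 16-QAM rate

print(rate_64qam, rate_16qam, seven_channel_tbps)
```

Under these assumptions the computed values match the 600 Gbps, 512 Gbps, and 3.584 Tbps figures reported in the abstract.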

arXiv:2407.17062 [pdf, other]

MK-like spectral classification for hot subdwarf stars with LAMOST spectra

Authors: Xuan Zou, Zhenxin Lei

Abstract: An MK-like spectral classification has been conducted for 1224 hot subdwarf stars with LAMOST DR9 low-resolution spectra. The whole sample was divided into four categories according to the spectral line characteristics: He-normal, He-weak, He-strong C and He-strong. Each selected spectrum was assigned a spectral class, a luminosity class and a helium class by comparing the line depth and width with standard spectra selected in LAMOST. Relationships between atmospheric parameters and spectral classification were also presented.

Submitted 24 July, 2024; originally announced July 2024.

Comments: 12 pages, 10 figures, 2 tables, accepted for publication in PASJ

arXiv:2407.15354 [pdf, other]

Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

Authors: Zhili Chen, Shuangjie Xu, Maosheng Ye, Zian Qian, Xiaoyi Zou, Dit-Yan Yeung, Qifeng Chen

Abstract: The Bird's-Eye-View (BEV) representation is a critical factor that directly impacts the 3D object detection performance, but the traditional BEV grid representation induces quadratic computational cost as the spatial resolution grows. To address this limitation, we present a new camera-based 3D object detector with high-resolution vector representation: VectorFormer. The presented high-resolution vector representation is combined with the lower-resolution BEV representation to efficiently exploit 3D geometry from multi-camera images at a high resolution through our two novel modules: vector scattering and gathering. To this end, the learned vector representation with richer scene contexts can serve as the decoding query for final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in NDS and inference time. Furthermore, we investigate query-BEV-based methods incorporated with our proposed vector representation and observe a consistent performance improvement.

Submitted 21 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024. Project page: https://github.com/zlichen/VectorFormer

arXiv:2407.02534 [pdf, other]

Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything

Authors: Xiaotian Zou, Ke Li, Yongkang Chen

Abstract: Large Visual Language Models (VLMs) such as GPT-4V have achieved remarkable success in generating comprehensive and nuanced responses. Researchers have proposed various benchmarks for evaluating the capabilities of VLMs. With the integration of visual and text inputs in VLMs, new security issues emerge, as malicious attackers can exploit multiple modalities to achieve their objectives. This has led to increasing attention on the vulnerabilities of VLMs to jailbreak. Most existing research focuses on generating adversarial images or nonsensical images to jailbreak these models. However, no prior work has evaluated whether the logic understanding capabilities of VLMs on flowcharts can influence jailbreak. Therefore, to fill this gap, this paper first introduces a novel dataset, Flow-JD, specifically designed to evaluate the logic-based flowchart jailbreak capabilities of VLMs. We conduct an extensive evaluation on GPT-4o, GPT-4V, and five other SOTA open-source VLMs, and the jailbreak rate is up to 92.8%. Our research reveals significant vulnerabilities in current VLMs concerning image-to-text jailbreak, and these findings underscore the urgency for the development of robust and effective future defenses.

Submitted 26 August, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.18358 [pdf, other]

Microscopic characteristics of SF6 partial discharge induced by a floating linear metal particle

Authors: Zihao Feng, Yuanyuan Jiang, Liyang Zhang, Zhigang Liu, Kai Wang, Xinxin Wang, Xiaobing Zou, Haiyun Luo, Yangyang Fu

Abstract: Direct current (DC) gas insulated transmission lines (GILs) have been widely used in power transmission, but might be threatened by partial discharge due to the presence of floating impurities (e.g., dust and metal particles) inside the sealed chamber. In this letter, by using a 2D fluid model we characterize the microscopic properties of the partial discharge induced by a floating linear metal particle in SF6 (both the discharge propagation and interaction between space charge and metal particle) under negative high voltage direct current (HVDC) conditions. Due to the strong electronegativity of SF6, the spatiotemporal distributions of the charged species (electrons, positive and negative ions), space charge, and reduced electric field are rather different from those in air. Notably, a negative ion region is observed around the top tip of the metal particle, and it plays an important role in the generation and propagation of primary and secondary streamers in SF6, which may lead to severe motion characteristics of the particle and aliasing of partial discharge signals. Additionally, we analyze the charging process and electric force reversal phenomenon, which may provide a more precise understanding of the underlying mechanisms of the firefly motion previously reported for DC GILs.

Submitted 20 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.18007 [pdf, other]

Deep Mamba Multi-modal Learning

Authors: Jian Zhu, Xin Zou, Yu Cui, Zhangmin Huang, Chenshu Hu, Bo Lyu

Abstract: Inspired by the excellent performance of Mamba networks, we propose a novel Deep Mamba Multi-modal Learning (DMML). It can be used to achieve the fusion of multi-modal features. We apply DMML to the field of multimedia retrieval and propose an innovative Deep Mamba Multi-modal Hashing (DMMH) method. It combines the advantages of algorithm accuracy and inference speed. We validated the effectiveness of DMMH on three public datasets and achieved state-of-the-art results.

Submitted 9 April, 2024; originally announced June 2024.

Comments: Deep Mamba Multi-modal Learning; Deep Mamba Multi-modal Hashing

arXiv:2406.17952 [pdf, other]

LINSCAN -- A Linearity Based Clustering Algorithm

Authors: Andrew Dennehy, Xiaoyu Zou, Shabnam J. Semnani, Yuri Fialko, Alexander Cloninger

Abstract: DBSCAN and OPTICS are powerful algorithms for identifying clusters of points in domains where few assumptions can be made about the structure of the data. In this paper, we leverage these strengths and introduce a new algorithm, LINSCAN, designed to seek lineated clusters that are difficult to find and isolate with existing methods. In particular, by embedding points as normal distributions approximating their local neighborhoods and leveraging a distance function derived from the Kullback-Leibler divergence, LINSCAN can detect and distinguish lineated clusters that are spatially close but have orthogonal covariances. We demonstrate how LINSCAN can be applied to seismic data to identify active faults, including intersecting faults, and determine their orientation. Finally, we discuss the properties a generalization of DBSCAN and OPTICS must have in order to retain the stability benefits of these algorithms.

Submitted 25 June, 2024; originally announced June 2024.
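
To illustrate why a Kullback-Leibler-based distance can separate spatially close clusters with orthogonal covariances, as the abstract above describes, here is a minimal sketch. It uses the standard closed-form KL divergence between two 2D Gaussians and symmetrizes it by summing both directions. The abstract does not specify LINSCAN's actual distance function, so the symmetrization and all names here are assumptions for illustration.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested tuples."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix (assumes it is non-singular)."""
    d = det2(m)
    return ((m[1][1] / d, -m[0][1] / d), (-m[1][0] / d, m[0][0] / d))


def kl_gauss2d(mu0, cov0, mu1, cov1):
    """Closed-form KL(N(mu0, cov0) || N(mu1, cov1)) for 2D Gaussians."""
    inv1 = inv2(cov1)
    # trace(inv1 @ cov0)
    trace = sum(inv1[i][j] * cov0[j][i] for i in range(2) for j in range(2))
    dx = (mu1[0] - mu0[0], mu1[1] - mu0[1])
    # Mahalanobis term (mu1 - mu0)^T inv1 (mu1 - mu0)
    maha = sum(dx[i] * inv1[i][j] * dx[j] for i in range(2) for j in range(2))
    return 0.5 * (trace + maha - 2 + math.log(det2(cov1) / det2(cov0)))


def symmetric_kl(mu0, cov0, mu1, cov1):
    """Assumed symmetrization: sum of the two directed divergences."""
    return kl_gauss2d(mu0, cov0, mu1, cov1) + kl_gauss2d(mu1, cov1, mu0, cov0)


# Two elongated neighborhoods centered at the same point but oriented orthogonally.
origin = (0.0, 0.0)
horizontal = ((1.0, 0.0), (0.0, 0.01))
vertical = ((0.01, 0.0), (0.0, 1.0))

d_same = symmetric_kl(origin, horizontal, origin, horizontal)
d_orthogonal = symmetric_kl(origin, horizontal, origin, vertical)
print(d_same, d_orthogonal)
```

Although both neighborhoods share the same center, so their Euclidean distance is zero, the divergence between the orthogonally oriented distributions is large while the divergence between identical ones vanishes.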

arXiv:2406.16020 [pdf, other]

AudioBench: A Universal Benchmark for Audio Large Language Models

Authors: Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen

Abstract: We introduce AudioBench, a new benchmark designed to evaluate audio large language models (AudioLLMs). AudioBench encompasses 8 distinct tasks and 26 carefully selected or newly curated datasets, focusing on speech understanding, voice interpretation, and audio scene understanding. Despite the rapid advancement of large language models, including multimodal versions, a significant gap exists in comprehensive benchmarks for thoroughly evaluating their capabilities. AudioBench addresses this gap by providing relevant datasets and evaluation metrics. In our study, we evaluated the capabilities of four models across various aspects and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-source code, data, and leaderboard will offer a robust testbed for future model developments.

Submitted 25 June, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

Comments: 20 pages; v2 - typo update; Code: https://github.com/AudioLLMs/AudioBench

arXiv:2406.14069 [pdf, other]

Towards Multi-modality Fusion and Prototype-based Feature Refinement for Clinically Significant Prostate Cancer Classification in Transrectal Ultrasound

Authors: Hong Wu, Juan Fu, Hongsheng Ye, Yuming Zhong, Xuebin Zou, Jianhua Zhou, Yi Wang

Abstract: Prostate cancer is a highly prevalent cancer and ranks as the second leading cause of cancer-related deaths in men globally. Recently, the utilization of multi-modality transrectal ultrasound (TRUS) has gained significant traction as a valuable technique for guiding prostate biopsies. In this study, we propose a novel learning framework for clinically significant prostate cancer (csPCa) classification using multi-modality TRUS. The proposed framework employs two separate 3D ResNet-50 networks to extract distinctive features from B-mode and shear wave elastography (SWE). Additionally, an attention module is incorporated to effectively refine B-mode features and aggregate the extracted features from both modalities. Furthermore, we utilize a few-shot segmentation task to enhance the capacity of the classification encoder. Due to the limited availability of csPCa masks, a prototype correction module is employed to extract representative prototypes of csPCa. The performance of the framework is assessed on a large-scale dataset consisting of 512 TRUS videos with biopsy-proven prostate cancer. The results demonstrate the strong capability in accurately identifying csPCa, achieving an area under the curve (AUC) of 0.86. Moreover, the framework generates visual class activation mapping (CAM), which can serve as valuable assistance for localizing csPCa. These CAM images may offer valuable guidance during TRUS-guided targeted biopsies, enhancing the efficacy of the biopsy procedure. The code is available at https://github.com/2313595986/SmileCode.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.12943 [pdf]

A square cross-section FOV rotational CL (SC-CL) and its analytical reconstruction method

Authors: Xiang Zou, Wuliang Shi, Muge Du, Yuxiang Xing

Abstract: Rotational computed laminography (CL) has broad application potential in three-dimensional imaging of plate-like objects, as it only requires X-rays to pass through the tested object in the thickness direction during the imaging process. In this study, a square cross-section FOV rotational CL (SC-CL) was proposed. Then, the FDK-type analytical reconstruction algorithm applicable to the SC-CL was derived. On this basis, the proposed method was validated through numerical experiments.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12018 [pdf, other]

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling

Authors: Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, Jackie Chi Kit Cheung

Abstract: Long sequence modeling has gained broad interest as large language models (LLMs) continue to advance. Recent research has identified that a large portion of hidden states within the key-value caches of Transformer models can be discarded (also termed evicted) without affecting the perplexity performance in generating long sequences. However, we show that these methods, despite preserving perplexity performance, often drop information that is important for solving downstream tasks, a problem which we call information neglect. To address this issue, we introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states. In addition, we design a method for chunked sequence processing to further improve efficiency. Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget, while preserving language modeling perplexity.

Submitted 17 June, 2024; originally announced June 2024.

Comments: Work in progress

arXiv:2406.06887 [pdf, other]

PLUM: Preference Learning Plus Test Cases Yields Better Code Language Models

Authors: Dylan Zhang, Shizhe Diao, Xueyan Zou, Hao Peng

Abstract: Instruction-finetuned code language models (LMs) have shown promise in various programming tasks. They are trained, using a language modeling objective, on natural language instructions and gold code snippet pairs. Recent evidence suggests that these models, never exposed to incorrect solutions during training, often struggle to distinguish between correct and incorrect solutions. This observation raises our inquiry: Can preference learning, which trains models to prefer correct solutions over incorrect ones, help push the boundaries of code LMs even further? We propose PLUM, a novel preference learning framework augmented with test cases tailored for code LMs. PLUM aims to investigate the key success factors and potential benefits of preference learning in code LMs, which remain elusive despite its success in aligning LMs with human values. PLUM consists of three stages: (1) generating test cases for natural language instructions, (2) sampling candidate solutions from the policy and evaluating them against the test cases to create a preference dataset, which is then used to (3) train the policy with a preference learning algorithm. Experiments demonstrate that PLUM substantially improves the performance of existing code LMs on established code generation benchmarks such as HumanEval (+) and MBPP (+), even for the state-of-the-art open-source language model CodeQwen-1.5-7B-Chat. PLUM complements the supervised fine-tuning (SFT) stage, demonstrating synergistic effects.

Submitted 10 June, 2024; originally announced June 2024.
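
Stage (2) of the PLUM abstract above, evaluating sampled candidates against test cases to build a preference dataset, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: candidates are plain Python callables rather than model-sampled code, a candidate counts as correct only if it passes every test, and each passing candidate is paired with each failing one. All function and field names are hypothetical.

```python
from itertools import product


def passes_all(candidate, tests):
    """Return True if the candidate gives the expected output on every test."""
    for arg, expected in tests:
        try:
            if candidate(arg) != expected:
                return False
        except Exception:
            # A crashing candidate is treated as incorrect.
            return False
    return True


def build_preference_pairs(instruction, candidates, tests):
    """Pair every passing candidate (chosen) with every failing one (rejected)."""
    passed = [name for name, fn in candidates.items() if passes_all(fn, tests)]
    failed = [name for name in candidates if name not in passed]
    return [
        {"prompt": instruction, "chosen": good, "rejected": bad}
        for good, bad in product(passed, failed)
    ]


# Hypothetical instruction, candidates, and test cases for illustration.
instruction = "Return the square of x."
candidates = {
    "x * x": lambda x: x * x,
    "x + x": lambda x: x + x,   # passes for x=2 but fails for x=3
    "x ** 2": lambda x: x ** 2,
}
tests = [(2, 4), (3, 9)]

pairs = build_preference_pairs(instruction, candidates, tests)
for pair in pairs:
    print(pair)
```

The resulting list of prompt, chosen, and rejected records has the general shape that preference learning algorithms consume in stage (3); which algorithm PLUM uses is not specified in the abstract.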

arXiv:2405.19647 [pdf, other]

FTS: A Framework to Find a Faithful TimeSieve

Authors: Songning Lai, Ninghui Feng, Jiechao Gao, Hao Wang, Haochen Sui, Xin Zou, Jiayu Yang, Wenshuo Chen, Hang Zhao, Xuming Hu, Yutao Yue

Abstract: The field of time series forecasting has garnered significant attention in recent years, prompting the development of advanced models like TimeSieve, which demonstrates impressive performance. However, an analysis reveals certain unfaithfulness issues, including high sensitivity to random seeds, input and layer noise perturbations and parametric perturbations. Recognizing these challenges, we embark on a quest to define the concept of Faithful TimeSieve (FTS), a model that consistently delivers reliable and robust predictions. To address these issues, we propose a novel framework aimed at identifying and rectifying unfaithfulness in TimeSieve. Our framework is designed to enhance the model's stability and faithfulness, ensuring that its outputs are less susceptible to the aforementioned factors. Experimentation validates the effectiveness of our proposed framework, demonstrating improved faithfulness in the model's behavior.

Submitted 10 August, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

Journal ref: IJCAI2024 workshop

arXiv:2405.18991 [pdf, other]

EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

Authors: Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, Jun Huang

Abstract: This paper presents EasyAnimate, an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes. We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block. It is used to capture temporal dynamics, thereby ensuring the production of consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate video with different styles. It can also generate videos with different frame rates and resolutions during both training and inference phases, suitable for both images and videos. Moreover, we introduce slice VAE, a novel approach to condense the temporal axis, facilitating the generation of long-duration videos. Currently, EasyAnimate exhibits the proficiency to generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT model training (both the baseline model and LoRA model), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate. We are continuously working to enhance the performance of our method.

Submitted 5 July, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

Comments: 8 pages, 6 figures

arXiv:2405.14135 [pdf, other]

Learning Geospatial Region Embedding with Heterogeneous Graph

Authors: Xingchen Zou, Jiani Huang, Xixuan Hao, Yuhao Yang, Haomin Wen, Yibo Yan, Chao Huang, Yuxuan Liang

Abstract: Learning effective geospatial embeddings is crucial for a series of geospatial applications such as city analytics and earth monitoring. However, learning comprehensive region representations presents two significant challenges: first, the deficiency of effective intra-region feature representation; and second, the difficulty of learning from intricate inter-region dependencies. In this paper, we present GeoHG, an effective heterogeneous graph structure for learning comprehensive region embeddings for various downstream tasks. Specifically, we tailor satellite image representation learning through geo-entity segmentation and point-of-interest (POI) integration for expressive intra-regional features. Furthermore, GeoHG unifies informative spatial interdependencies and socio-environmental attributes into a powerful heterogeneous graph to encourage explicit modeling of higher-order inter-regional relationships. The intra-regional features and inter-regional correlations are seamlessly integrated by a model-agnostic graph learning framework for diverse downstream tasks. Extensive experiments demonstrate the effectiveness of GeoHG in geo-prediction tasks compared to existing methods, even under extreme data scarcity (with just 5% of training data). With interpretable region representations, GeoHG exhibits strong generalization capabilities across regions. We will release code and data upon paper notification.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.08327 [pdf, other]

Multiband Simultaneous Photometry of Type II SN 2023ixf with Mephisto and the Twin 50-cm Telescopes

Authors: Yuan-Pei Yang, Xiangkun Liu, Yu Pan, Xinzhong Er, Dezi Liu, Yuan Fang, Guowang Du, Yongzhi Cai, Xian Xu, Xinlei Chen, Xingzhu Zou, Helong Guo, Chenxu Liu, Yehao Cheng, Brajesh Kumar, Xiaowei Liu

Abstract: SN 2023ixf, recently reported in the nearby galaxy M101 at a distance of $6.85~{\rm Mpc}$, was one of the closest and brightest core-collapse supernovae (CCSNe) in the last decade. In this work, we present multi-wavelength photometric observations of SN 2023ixf with the Multi-channel Photometric Survey Telescope (Mephisto) in $uvgr$ bands and with the twin 50-cm telescopes in $griz$ bands. We find that the bolometric luminosity reached the maximum value of $3\times10^{43}~{\rm erg~s^{-1}}$ at 3.9 days after the explosion and fully settled onto the radioactive tail at $\sim90$ days. The effective temperature decreased from $3.2\times10^4~{\rm K}$ at the first observation and approached a constant of $\sim(3000-4000)~{\rm K}$ after the first two months. The evolution of the photospheric radius is consistent with a homologous expansion with a velocity of $8700~{\rm km~s^{-1}}$ in the first two months, and it shrank subsequently. Based on the radioactive tail, the initial nickel mass is about $M_{\rm Ni}\sim 0.098M_\odot$. The explosion energy and the ejecta mass are estimated to be $E\simeq(1.0-5.7)\times10^{51}~{\rm erg}$ and $M_{\rm ej}\simeq(3.8-16)M_\odot$, respectively. The peak bolometric luminosity is proposed to be contributed by the interaction between the ejecta and the circumstellar medium (CSM). We find a shocked CSM mass of $M_{\rm CSM}\sim0.013M_\odot$, a CSM density of $\rho_{\rm CSM}\sim2.5\times10^{-13}~{\rm g~cm^{-3}}$ and a mass loss rate of the progenitor of $\dot M\sim0.022M_\odot~{\rm yr^{-1}}$.

Submitted 14 May, 2024; originally announced May 2024.

Comments: 15 pages, 7 figures, 3 tables. Accepted for publication in ApJ. Comments welcome!

arXiv:2405.07964 [pdf, other]

Early-phase simultaneous multiband observations of the Type II supernova SN 2024ggi with Mephisto

Authors: Xinlei Chen, Brajesh Kumar, Xinzhong Er, Helong Guo, Yuan-Pei Yang, Weikang Lin, Yuan Fang, Guowang Du, Chenxu Liu, Jiewei Zhao, Tianyu Zhang, Yuxi Bao, Xingzhu Zou, Yu Pan, Yu Wang, Xufeng Zhu, Kaushik Chatterjee, Xiangkun Liu, Dezi Liu, Edoardo P. Lagioia, Geeta Rangwal, Shiyan Zhong, Jinghua Zhang, Jianhui Lian, Yongzhi Cai, et al. (2 additional authors not shown)

Abstract: We present early-phase good-cadence (hour-to-day) simultaneous multiband ($ugi$ and $vrz$ bands) imaging of the nearby supernova SN 2024ggi, which exploded in the nearby galaxy NGC 3621. A quick follow-up was conducted within less than a day after the explosion and continued for $\sim$23 days. The $uvg$ band light curves display a rapid rise ($\sim$1.4 mag day$^{-1}$) to maximum in $\sim$4 days and absolute magnitude $M_{g}\sim$--17.75 mag. The post-peak decay rate in redder bands is $\sim$0.01 mag day$^{-1}$. Different colors (e.g., $u-g$ and $v-r$) of SN 2024ggi are slightly redder than SN 2023ixf. A significant rise ($\sim$12.5 kK) in black-body temperature (optical) was noticed within $\sim$2 days after the explosion, which successively decreased, indicating shock breakout inside a dense circumstellar medium (CSM) surrounding the progenitor. Using semianalytical modeling, the ejecta mass and progenitor radius were estimated as 1.2 $M_\odot$ and $\sim$550 $R_\odot$. The archival deep images ($g$, $r$, $i$, and $z$ bands) from the Dark Energy Camera Legacy Survey were examined, and a possible progenitor was detected in each band ($\sim$22--22.5 mag) and had a mass range of 14--17 $M_\odot$.

Submitted 2 August, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: Pages 12, Table 1, Figures 7

Journal ref: ApJL, 2024, 971:L2

arXiv:2405.01204 [pdf, other]

Towards Cross-Scale Attention and Surface Supervision for Fractured Bone Segmentation in CT

Authors: Yu Zhou, Xiahao Zou, Yi Wang

Abstract: Bone segmentation is an essential step for the preoperative planning of fracture trauma surgery. The automated segmentation of fractured bone from computed tomography (CT) scans remains challenging, due to the large differences of fractures in position and morphology, and also the inherent anatomical characteristics of different bone structures. To alleviate these issues, we propose a cross-scale attention mechanism as well as a surface supervision strategy for fractured bone segmentation in CT. Specifically, a cross-scale attention mechanism is introduced to effectively aggregate the features among different scales to provide more powerful fracture representation. Moreover, a surface supervision strategy is employed, which explicitly constrains the network to pay more attention to the bone boundary. The efficacy of the proposed method is evaluated on a public dataset containing CT scans with hip fractures. The evaluation metrics are Dice similarity coefficient (DSC), average symmetric surface distance (ASSD), and Hausdorff distance (95HD). The proposed method achieves an average DSC of 93.36%, ASSD of 0.85 mm, and 95HD of 7.51 mm. Our method offers an effective fracture segmentation approach for pelvic CT examinations, and has the potential to be used for improving the segmentation performance of other types of fractures.

Submitted 2 May, 2024; originally announced May 2024.
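
For readers unfamiliar with the first metric named in the abstract above, the Dice similarity coefficient between a predicted mask and a reference mask is twice their overlap divided by the sum of their sizes. The following is a minimal sketch on flattened binary masks, not the authors' evaluation code; the function name, the toy masks, and the convention for two empty masks are all assumptions made for illustration.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks (flat sequences of 0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Sum of the two foreground sizes.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks, invented for illustration only.
pred_mask = [0, 1, 1, 1, 0, 0, 1, 0]
true_mask = [0, 1, 1, 0, 0, 0, 1, 1]
print(f"DSC = {dice_coefficient(pred_mask, true_mask):.3f}")
```

A DSC of 1 means the masks coincide exactly and 0 means they do not overlap at all; the surface-distance metrics (ASSD, 95HD) need boundary extraction and are not sketched here.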

arXiv:2404.09569 [pdf]

Surprising pressure-induced magnetic transformations from Helimagnetic order to Antiferromagnetic state in NiI2

Authors: Qiye Liu, Wenjie Su, Yue Gu, Xi Zhang, Xiuquan Xia, Le Wang, Ke Xiao, Xiaodong Cui, Xiaolong Zou, Bin Xi, Jia-Wei Mei, Jun-Feng Dai

Abstract: Interlayer magnetic interactions play a pivotal role in determining the magnetic arrangement within van der Waals (vdW) magnets, and the remarkable tunability of these interactions through applied pressure further enhances their significance. Here, we investigate NiI2 flakes, a representative vdW magnet, under hydrostatic pressures up to 11 GPa. We reveal a notable increase in magnetic transition temperatures for both helimagnetic and antiferromagnetic states, and find that a reversible transition from helimagnetic to antiferromagnetic (AFM) phases at approximately 7 GPa challenges established theoretical and experimental expectations. While the increase in transition temperature aligns with pressure-enhanced overall exchange interaction strengths, we identify the significant role of the second-nearest neighbor interlayer interaction, which competes with intra-layer frustration and favors the AFM state as demonstrated in the Monte Carlo simulations. Experimental and simulated results converge on the existence of an intermediate helimagnetic ordered state in NiI2 before transitioning to the AFM state. These findings underscore the pivotal role of interlayer interactions in shaping the magnetic ground state, providing fresh perspectives for innovative applications in nanoscale magnetic device design.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.09433 [pdf, other]

MarsQE: Semantic-Informed Quality Enhancement for Compressed Martian Image

Authors: Chengfeng Liu, Mai Xu, Qunliang Xing, Xin Zou

Abstract: Lossy image compression is essential for Mars exploration missions, due to the limited bandwidth between Earth and Mars. However, the compression may introduce visual artifacts that complicate the geological analysis of the Martian surface. Existing quality enhancement approaches, primarily designed for Earth images, fall short for Martian images due to a lack of consideration for the unique Martian semantics. In response to this challenge, we conduct an in-depth analysis of Martian images, yielding two key insights based on semantics: the presence of texture similarities and the compact nature of texture representations in Martian images. Inspired by these findings, we introduce MarsQE, an innovative, semantic-informed, two-phase quality enhancement approach specifically designed for Martian images. The first phase involves the semantic-based matching of texture-similar reference images, and the second phase enhances image quality by transferring texture patterns from these reference images to the compressed image. We also develop a post-enhancement network to further reduce compression artifacts and achieve superior compression quality. Our extensive experiments demonstrate that MarsQE significantly outperforms existing approaches for Earth images, establishing a new benchmark for the quality enhancement on Martian images.

Submitted 14 April, 2024; originally announced April 2024.

arXiv:2404.07458 [pdf, other]

I-mode Plasma Confinement Improvement by Real-time Lithium Injection and its Classification on EAST Tokamak

Authors: X. M. Zhong, X. L. Zou, A. D. Liu, Y. T. Song, G. Zhuang, H. Q. Liu, L. Q. Xu, E. Z. Li, B. Zhang, G. Z. Zuo, Z. Wang, C. Zhou, J. Zhang, W. X. Shi, L. T. Gao, S. F. Wang, W. Gao, T. Q. Jia, Q. Zang, H. L. Zhao, M. Wang, H. D. Xu, X. J. Wang, X. Gao, X. D. Lin, et al. (3 additional authors not shown)

Abstract: I-mode is a promising regime for future fusion reactors due to the high energy confinement and the moderate particle confinement. However, the effect of lithium, which has been widely applied for particle recycling and impurity control, on I-mode plasma is still unclear. Recently, experiments with real-time lithium powder injection into I-mode plasma have been carried out on the EAST tokamak. It was found that the confinement performance of the I-mode can be improved by the lithium powder injection, which can strongly reduce electron turbulence (ET) and then trigger ion turbulence (IT). Four different regimes of I-mode have been identified in EAST. The Type I I-mode plasma is characterized by the weakly coherent mode (WCM) and the geodesic-acoustic mode (GAM). The Type II I-mode features the WCM and the edge temperature ring oscillation (ETRO). The Type III I-mode corresponds to the plasma with the co-existence of ETRO, GAM, and WCM. The Type IV I-mode denotes the plasma with only WCM but without ETRO and GAM. It has been observed that WCM and ETRO are increased with lithium powder injection due to the reduction of ion and electron turbulence, and the enhancement of the pedestal electron temperature gradient. EAST experiments demonstrate that lithium powder injection is an effective tool for real-time control and confinement improvement of I-mode plasma.

Submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.00727 [pdf, other]

A Controlled Reevaluation of Coreference Resolution Models

Authors: Ian Porada, Xiyuan Zou, Jackie Chi Kit Cheung

Abstract: All state-of-the-art coreference resolution (CR) models involve finetuning a pretrained language model. Whether the superior performance of one CR model over another is due to the choice of language model or other factors, such as the task-specific architecture, is difficult or impossible to determine due to lack of a standardized experimental setup. To resolve this ambiguity, we systematically evaluate five CR models and control for certain design decisions including the pretrained language model used by each. When controlling for language model size, encoder-based CR models outperform more recent decoder-based models in terms of both accuracy and inference speed. Surprisingly, among encoder-based CR models, more recent models are not always more accurate, and the oldest CR model that we test generalizes the best to out-of-domain textual genres. We conclude that controlling for the choice of language model reduces most, but not all, of the increase in F1 score reported in the past five years.

Submitted 22 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

Comments: LREC-COLING 2024

arXiv:2403.19980 [pdf, other]

A Parallel Attention Network for Cattle Face Recognition

Authors: Jiayu Li, Xuechao Zou, Shiying Wang, Ben Chen, Junliang Xing, Pin Tao

Abstract: Cattle face recognition holds paramount significance in domains such as animal husbandry and behavioral research. Despite significant progress in confined environments, applying these accomplishments in wild settings remains challenging. Thus, we create the first large-scale cattle face recognition dataset, ICRWE, for wild environments. It encompasses 483 cattle and 9,816 high-resolution image samples. Each sample undergoes annotation for face features, light conditions, and face orientation. Furthermore, we introduce a novel parallel attention network, PANet. PANet comprises several cascaded Transformer modules, each of which incorporates two parallel Position Attention Modules (PAM) and Feature Mapping Modules (FMM). PAM focuses on local and global features at each image position through parallel channel attention, and FMM captures intricate feature patterns through non-linear mappings. Experimental results indicate that PANet achieves a recognition accuracy of 88.03% on the ICRWE dataset, establishing itself as the current state-of-the-art approach. The source code is available in the supplementary materials.

Submitted 29 March, 2024; originally announced March 2024.

Comments: Accepted by ICME 2024

arXiv:2403.19950 [pdf, other]

doi 10.1609/aaai.v38i15.29673

Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data

Authors: Xin Zou, Weiwei Liu

Abstract: Out-of-distribution (OOD) generalization has attracted increasing research attention in recent years, due to its promising experimental results in real-world applications. In this paper, we study the confidence set prediction problem in the OOD generalization setting. Split conformal prediction (SCP) is an efficient framework for handling the confidence set prediction problem. However, the validity of SCP requires the examples to be exchangeable, which is violated in the OOD setting. Empirically, we show that trivially applying SCP results in a failure to maintain the marginal coverage when the unseen target domain is different from the source domain. To address this issue, we develop a method for forming confident prediction sets in the OOD setting and theoretically prove the validity of our method. Finally, we conduct experiments on simulated data to empirically verify the correctness of our theory and the validity of our proposed method.

Submitted 28 March, 2024; originally announced March 2024.

Journal ref: AAAI (2024) Vol. 38, No. 15, pages 17263-17270
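
As background for the abstract above, standard split conformal prediction (the baseline the paper says breaks down under distribution shift, not the authors' OOD method) can be sketched in a few lines. The sketch assumes a regression setting with absolute-residual scores; `predict`, the calibration pairs, and the miscoverage level `alpha` are hypothetical placeholders.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this coverage level: the set is unbounded.
        return math.inf
    return sorted(scores)[rank - 1]


def split_conformal_interval(predict, calibration, x_new, alpha=0.1):
    """Prediction interval for x_new from held-out (x, y) calibration pairs."""
    # Nonconformity score: absolute residual of the already-fitted model.
    scores = [abs(y - predict(x)) for x, y in calibration]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q


def predict(x):
    """Hypothetical pre-fitted model, used only for this example."""
    return 2.0 * x


# Synthetic calibration data with residuals of magnitude 0.1, 0.2, ..., 1.9.
calibration = [(float(i), 2.0 * i + (0.1 * i if i % 2 else -0.1 * i)) for i in range(1, 20)]
low, high = split_conformal_interval(predict, calibration, x_new=5.0, alpha=0.1)
print(f"90% interval at x = 5: [{low:.2f}, {high:.2f}]")
```

The marginal coverage guarantee of at least 1 - alpha holds only when calibration and test points are exchangeable, which is exactly the assumption the abstract identifies as violated in the OOD setting.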

arXiv:2403.14135 [pdf, other]

Powerful Lossy Compression for Noisy Images

Authors: Shilv Cai, Xiaoguo Liang, Shuning Cao, Luxin Yan, Sheng Zhong, Liqun Chen, Xu Zou

Abstract: Image compression and denoising represent fundamental challenges in image processing with many real-world applications. To address practical demands, current solutions can be categorized into two main strategies: 1) sequential methods; and 2) joint methods. However, sequential methods have the disadvantage of error accumulation as there is information loss between multiple individual models. Recently, the academic community began to make some attempts to tackle this problem through end-to-end joint methods. Most of them ignore that different regions of noisy images have different characteristics. To solve these problems, in this paper, our proposed signal-to-noise ratio (SNR) aware joint solution exploits local and non-local features for image compression and denoising simultaneously. We design an end-to-end trainable network, which includes the main encoder branch, the guidance branch, and the SNR-aware branch. We conducted extensive experiments on both synthetic and real-world datasets, demonstrating that our joint solution outperforms existing state-of-the-art methods.

Submitted 26 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: Accepted by ICME 2024

arXiv:2403.11373 [pdf, other]

Reconstruct before Query: Continual Missing Modality Learning with Decomposed Prompt Collaboration

Authors: Shu Zhao, Xiaohan Zou, Tan Yu, Huijuan Xu

Abstract: Pre-trained large multi-modal models (LMMs) exploit fine-tuning to adapt to diverse user applications. Nevertheless, fine-tuning may face challenges due to deactivated sensors (e.g., cameras turned off for privacy or technical issues), yielding modality-incomplete data and leading to inconsistency in training data and the data for inference. Additionally, continuous training leads to catastrophic forgetting, diluting the knowledge in pre-trained LMMs. To overcome these challenges, we introduce a novel task, Continual Missing Modality Learning (CMML), to investigate how models can generalize when data of certain modalities is missing during continual fine-tuning. Our preliminary benchmarks reveal that existing methods suffer from a significant performance drop in CMML, even with the aid of advanced continual learning techniques. Therefore, we devise a framework termed Reconstruct before Query (RebQ). It decomposes prompts into modality-specific ones and breaks them into components stored in pools accessible via a key-query mechanism, which facilitates Parameter-Efficient Fine-Tuning and enhances knowledge transferability for subsequent tasks. Meanwhile, our RebQ leverages extensive multi-modal knowledge from pre-trained LMMs to reconstruct the data of missing modality. Comprehensive experiments demonstrate that RebQ effectively reconstructs the missing modality information and retains pre-trained knowledge. Specifically, compared with the baseline, RebQ improves average precision from 20.00 to 50.92 and decreases average forgetting from 75.95 to 8.56.
Code and datasets are available on https://github.com/Tree-Shu-Zhao/RebQ.pytorch

Submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.10920 [pdf, other]

Batch-oriented Element-wise Approximate Activation for Privacy-Preserving Neural Networks

Authors: Peng Zhang, Ao Duan, Xianglu Zou, Yuhong Liu

Abstract: Privacy-Preserving Neural Networks (PPNN) are advanced to perform inference without breaching user privacy, which can serve as an essential tool for medical diagnosis to simultaneously achieve big data utility and privacy protection. As one of the key techniques to enable PPNN, Fully Homomorphic Encryption (FHE) is facing a great challenge that homomorphic operations cannot be easily adapted for non-linear activation calculations. In this paper, batch-oriented element-wise data packing and approximate activation are proposed, which train linear low-degree polynomials to approximate the non-linear activation function ReLU. Compared with other approximate activation methods, the proposed fine-grained, trainable approximation scheme can effectively reduce the accuracy loss caused by approximation errors. Meanwhile, due to element-wise data packing, a large batch of images can be packed and inferred concurrently, leading to a much higher utility ratio of ciphertext slots. Therefore, although the total inference time increases sharply, the amortized time for each image actually decreases, especially when the batch size increases. Furthermore, knowledge distillation is adopted in the training process to further enhance the inference accuracy. Experimental results show that when ciphertext inference is performed on 4096 input images, compared with the current most efficient channel-wise method, the inference accuracy is improved by 1.65%, and the amortized inference time is reduced by 99.5%.

Submitted 16 March, 2024; originally announced March 2024.
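
The idea in the abstract above of replacing ReLU with a low-degree polynomial, which needs only the additions and multiplications that homomorphic encryption supports, can be illustrated with a plain least-squares fit. This is a sketch under stated assumptions: the paper trains its polynomials per element, whereas here a single quadratic is fitted once on an assumed input range of [-1, 1], and all function names are hypothetical.

```python
def relu(x):
    return x if x > 0 else 0.0


def fit_quadratic(xs, ys):
    """Least-squares fit of c0 + c1*x + c2*x^2 via the 3x3 normal equations."""
    # Power sums s[k] = sum(x^k) and moments t[k] = sum(y * x^k).
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix [A | b] with A[i][j] = s[i + j].
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def poly_activation(x, coeffs):
    """Evaluate the quadratic in Horner form: additions and multiplications only."""
    c0, c1, c2 = coeffs
    return c0 + x * (c1 + x * c2)


# Assumed input range [-1, 1], sampled on a uniform grid.
xs = [i / 100 for i in range(-100, 101)]
coeffs = fit_quadratic(xs, [relu(x) for x in xs])
max_err = max(abs(poly_activation(x, coeffs) - relu(x)) for x in xs)
print("coefficients:", [round(c, 4) for c in coeffs])
print("max abs error on [-1, 1]:", round(max_err, 4))
```

The fit is only meaningful inside the range it was computed on; outside it a quadratic diverges from ReLU quickly, which is one reason a trainable, fine-grained scheme such as the one the abstract describes can reduce accuracy loss.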

arXiv:2403.08572 [pdf, other]

Caformer: Rethinking Time Series Analysis from Causal Perspective

Authors: Kexuan Zhang, Xiaobei Zou, Yang Tang

Abstract: Time series analysis is a vital task with broad applications in various domains. However, effectively capturing cross-dimension and cross-time dependencies in non-stationary time series poses significant challenges, particularly in the context of environmental factors. The spurious correlation induced by the environment confounds the causal relationships between cross-dimension and cross-time dependencies. In this paper, we introduce a novel framework called Caformer (Causal Transformer) for time series analysis from a causal perspective. Specifically, our framework comprises three components: Dynamic Learner, Environment Learner, and Dependency Learner. The Dynamic Learner unveils dynamic interactions among dimensions, the Environment Learner mitigates spurious correlations caused by environment with a back-door adjustment, and the Dependency Learner aims to infer robust interactions across both time and dimensions. Our Caformer demonstrates consistent state-of-the-art performance across five mainstream time series analysis tasks, including long- and short-term forecasting, imputation, classification, and anomaly detection, with proper interpretability.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.02601 [pdf, other]

Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning

Authors: Haoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Haoze Sun, Xueyi Zou, Zhensong Zhang, Youliang Yan, Lei Zhu

Abstract: For image super-resolution (SR), bridging the gap between the performance on synthetic datasets and real-world degradation scenarios remains a challenge. This work introduces a novel "Low-Res Leads the Way" (LWay) training framework, merging Supervised Pre-training with Self-supervised Learning to enhance the adaptability of SR models to real-world images. Our approach utilizes a low-resolution (LR) reconstruction network to extract degradation embeddings from LR images, merging them with super-resolved outputs for LR reconstruction. Leveraging unseen LR images for self-supervised learning guides the model to adapt its modeling space to the target domain, facilitating fine-tuning of SR models without requiring paired high-resolution (HR) images. The integration of Discrete Wavelet Transform (DWT) further refines the focus on high-frequency details. Extensive evaluations show that our method significantly improves the generalization and detail restoration capabilities of SR models on unseen real-world datasets, outperforming existing methods. Our training regime is universally compatible, requiring no network architecture modifications, making it a practical solution for real-world SR applications.

Submitted 4 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024
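
To make concrete how a Discrete Wavelet Transform isolates high-frequency detail, as mentioned in the abstract above, here is a one-level Haar transform on a 1D signal. The abstract does not say which wavelet the authors use or how the transform enters training, so the Haar choice, the 1D setting, and the function names are assumptions for illustration.

```python
import math


def haar_dwt_1d(signal):
    """One level of the orthonormal Haar DWT; returns (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) * scale for a, b in pairs]  # low-frequency content
    detail = [(a - b) * scale for a, b in pairs]  # high-frequency content
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: interleave the reconstructed sample pairs."""
    scale = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * scale, (a - d) * scale])
    return out


# A flat region followed by a sharp edge inside the last sample pair.
row = [4.0, 4.0, 4.0, 4.0, 4.0, 9.0]
approx, detail = haar_dwt_1d(row)
print("detail coefficients:", [round(d, 3) for d in detail])
```

The detail coefficients vanish over the flat region and are non-zero only where neighbouring samples differ, so weighting or masking with them concentrates attention on edges and texture.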

arXiv:2402.19348 [pdf, other]

doi 10.1016/j.inffus.2024.102606

Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook

Authors: Xingchen Zou, Yibo Yan, Xixuan Hao, Yuehong Hu, Haomin Wen, Erdong Liu, Junbo Zhang, Yong Li, Tianrui Li, Yu Zheng, Yuxuan Liang

Abstract: As cities continue to burgeon, Urban Computing emerges as a pivotal discipline for sustainable development by harnessing the power of cross-domain data fusion from diverse sources (e.g., geographical, traffic, social media, and environmental data) and modalities (e.g., spatio-temporal, visual, and textual modalities). Recently, we are witnessing a rising trend that utilizes various deep-learning methods to facilitate cross-domain data fusion in smart cities. To this end, we propose the first survey that systematically reviews the latest advancements in deep learning-based data fusion methods tailored for urban computing. Specifically, we first delve into data perspective to comprehend the role of each modality and data source. Secondly, we classify the methodology into four primary categories: feature-based, alignment-based, contrast-based, and generation-based fusion methods. Thirdly, we further categorize multi-modal urban applications into seven types: urban planning, transportation, economy, public safety, society, environment, and energy. Compared with previous surveys, we focus more on the synergy of deep learning methods with urban computing applications. Furthermore, we shed light on the interplay between Large Language Models (LLMs) and urban computing, postulating future research directions that could revolutionize the field. We firmly believe that the taxonomy, progress, and prospects delineated in our survey stand poised to significantly enrich the research community.
The summary of the comprehensive and up-to-date paper list can be found at https://github.com/yoshall/Awesome-Multimodal-Urban-Computing.

Submitted 16 June, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Journal ref: Inform.Fusion.113(2025)102606

arXiv:2402.18828 [pdf, other]

Strongly-tilted field induced Hamiltonian dimerization and nested quantum scars in the 1D spinless Fermi-Hubbard model

Authors: Wei-Jie Huang, Yu-Biao Wu, Guang-Can Guo, Wu-Ming Liu, Xu-Bo Zou

Abstract: We investigate the quantum dynamics of the 1D spinless Fermi-Hubbard model with a linear-tilted potential. Surprisingly, in a strong resonance regime, we show that the model can be described by the kinetically constrained effective Hamiltonian, and it can be spontaneously divided into two commuting parts dubbed Hamiltonian dimerization, which consist of a sum of constrained two-site hopping terms acting on odd or even bonds. Specifically, it is shown that each part can be independently mapped onto the well-known PXP model; therefore, the dimerized Hamiltonian is equivalent to a two-fold PXP model. As a consequence, we numerically demonstrate that this system can host the so-called quantum many-body scars, which present persistent dynamical revivals and ergodicity-breaking behaviors. However, in sharp contrast with traditional quantum many-body scars, here the scarring states in our model driven by different parts of the Hamiltonian will oscillate in different periods, and those of double parts can display a biperiodic oscillation pattern, both originating from the Hamiltonian dimerization. Besides, the condition of off-resonance is also discussed, and we show the crossover from quantum many-body scar to ergodicity breaking utilizing level statistics. Our model provides a platform for understanding the interplay of Hilbert space fragmentation and the constrained quantum systems.

Submitted 28 February, 2024; originally announced February 2024.

Comments: 12 pages, 10 figures

arXiv:2402.15758 [pdf, other]

Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens

Authors: Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Hongen Shao, Xiaofeng Zou

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their widespread application is hindered by the resource-intensive decoding process. To address this challenge, current approaches have incorporated additional decoding heads to enable parallel prediction of multiple subsequent tokens, thereby achieving inference acceleration. Nevertheless, the accuracy of these decoding heads falls short of the auto-regressive decoding approach. In light of these limitations, we propose Chimera, a novel framework specifically designed for speculative sampling. Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words. To ensure both accuracy and efficiency, we present two strategies within the lightweight draft model. Firstly, we focus on capturing short-range dependencies at the bottom layer. Secondly, we leverage the readily available representations from the original LLM. Through empirical evaluation on the Vicuna and LlaMA-2 series, Chimera demonstrates impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach. This highlights the potential of our proposed framework in significantly improving the efficiency of large language models during the decoding process.

Submitted 18 April, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

arXiv:2402.14857 [pdf, other]

Is the System Message Really Important to Jailbreaks in Large Language Models?

Authors: Xiaotian Zou, Yongkang Chen, Ke Li

Abstract: The rapid evolution of Large Language Models (LLMs) has rendered them indispensable in modern society. While security measures are typically taken to align LLMs with human values prior to release, recent studies have unveiled a concerning phenomenon named "Jailbreak". This term refers to the unexpected and potentially harmful responses generated by LLMs when prompted with malicious questions. Most existing research focuses on generating jailbreak prompts, but system message configurations vary significantly in experiments. In this paper, we aim to answer a question: Is the system message really important for jailbreaks in LLMs? We conduct experiments in mainstream LLMs to generate jailbreak prompts with varying system messages: short, long, and none. We discover that different system messages have distinct resistances to jailbreaks. Therefore, we explore the transferability of jailbreaks across LLMs with different system messages. Furthermore, we propose the System Messages Evolutionary Algorithm (SMEA) to generate system messages that are more resistant to jailbreak prompts, even with minor changes. Through SMEA, we get a robust system messages population with little change in the length of system messages. Our research not only bolsters LLM security but also raises the bar for jailbreaks, fostering advancements in this field of study.

Submitted 18 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: 13 pages, 3 figures

arXiv:2402.06096 [pdf, other]

Doppler Tracking Data of Martian Mission Tianwen-I and Upper Limit of Stochastic Gravitational Wave Background

Authors: Xiaoming Bi, Zhongkai Guo, Xiaobo Zou, Yong Huang, Peijia Li, Jianfeng Cao, Lue Chen, Wenlin Tang, Yun Kau Lau

Abstract: Two way ranging data for spacecraft tracking of China's first Martian mission Tianwen-I is analysed. Shortly before the spacecraft entered the Mars parking orbit, the two way coherent microwave link between the spacecraft and the Earth resembles a long arm gravitational wave interferometer, with both the spacecraft and the Earth regarded as in an approximate free falling state. By carefully selecting and analysing data segments of the time series of the two way ranging data during this time span, a parametric statistical model is built for the data segments and an upper limit for the stochastic gravitational wave background (SGWB) is then estimated within the frequency window 0.1 Hz to 0.1 mHz. The upper bound improves considerably on those obtained before. In particular, around the deci-Hz band, there is a three orders of magnitude improvement on the bound obtained previously by the two way ranging data of the Chang'e 3 mission. Scientific applications of the upper bound are then considered, and a weak upper bound is worked out for axions, which are a promising candidate for ultra light dark matter.

Submitted 8 February, 2024; originally announced February 2024.

Comments: 10 pages, 8 figures

arXiv:2402.03767 [pdf]

Magnetic Field Gated and Current Controlled Spintronic Mem-transistor Neuron -based Spiking Neural Networks

Authors: Aijaz H. Lone, Meng Tang, Daniel N. Rahimi, Xuecui Zou, Dongxing Zheng, Hossein Fariborzi, Xixiang Zhang, Gianluca Setti

Abstract: Spintronic devices, such as the domain walls and skyrmions, have shown significant potential for applications in energy-efficient data storage and beyond CMOS computing architectures. In recent years, spiking neural networks have shown more bio-plausibility. Based on the magnetic multilayer spintronic devices, we demonstrate the magnetic field-gated leaky integrate and fire (LIF) neuron characteristics for the spiking neural network applications. The LIF characteristics are controlled by the current pulses, which drive the domain wall, and an external magnetic field is used as the bias to tune the firing properties of the neuron. Thus, the device works like a gate-controlled LIF neuron, acting like a spintronic Mem-Transistor device. We develop a LIF neuron model based on the measured characteristics to show the device integration in the system-level SNNs. We extend the study and propose a scaled version of the demonstrated device with a multilayer spintronic domain wall magnetic tunnel junction as a LIF neuron. Using the combination of SOT and the variation of the demagnetization energy across the thin film, the modified LIF neuron characteristics are realized in the proposed devices. The neuron device characteristics are modeled as the modified LIF neuron model. Finally, we integrate the measured and simulated neuron models in the 3-layer spiking neural network and convolutional spiking neural network (CSNN) framework to test these spiking neuron models for classification of the MNIST and FMNIST datasets. In both architectures, the network achieves classification accuracy above 96%. Considering the good system-level performance, mem-transistor properties, and promise for scalability, the presented devices show excellent properties for neuromorphic computing applications.

Submitted 6 February, 2024; originally announced February 2024.

Comments: 33 pages, 10 figures

arXiv:2401.14427 [pdf, other]

Beimingwu: A Learnware Dock System

Authors: Zhi-Hao Tan, Jian-Dong Liu, Xiao-Dong Bi, Peng Tan, Qin-Cheng Zheng, Hai-Tian Liu, Yi Xie, Xiao-Chuan Zou, Yang Yu, Zhi-Hua Zhou

Abstract: The learnware paradigm proposed by Zhou [2016] aims to enable users to reuse numerous existing well-trained models instead of building machine learning models from scratch, with the hope of solving new user tasks even beyond models' original purposes. In this paradigm, developers worldwide can submit their high-performing models spontaneously to the learnware dock system (formerly known as learnware market) without revealing their training data. Once the dock system accepts the model, it assigns a specification and accommodates the model. This specification allows the model to be adequately identified and assembled to reuse according to future users' needs, even if they have no prior knowledge of the model. This paradigm greatly differs from the current big model direction and it is expected that a learnware dock system housing millions or more high-performing models could offer excellent capabilities for both planned tasks where big models are applicable; and unplanned, specialized, data-sensitive scenarios where big models are not present or applicable. This paper describes Beimingwu, the first open-source learnware dock system providing foundational support for future research of learnware paradigm. The system significantly streamlines the model development for new user tasks, thanks to its integrated architecture and engine design, extensive engineering implementations and optimizations, and the integration of various algorithms for learnware identification and reuse. Notably, this is possible even for users with limited data and minimal expertise in machine learning, without compromising the raw data's security. Beimingwu supports the entire process of learnware paradigm. The system lays the foundation for future research in learnware-related algorithms and systems, and prepares the ground for hosting a vast array of learnwares and establishing a learnware ecosystem.

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2401.06715 [pdf, other]

Reframing Tax Law Entailment as Analogical Reasoning

Authors: Xinrui Zou, Ming Zhang, Nathaniel Weir, Benjamin Van Durme, Nils Holzenberger

Abstract: Statutory reasoning refers to the application of legislative provisions to a series of case facts described in natural language. We re-frame statutory reasoning as an analogy task, where each instance of the analogy task involves a combination of two instances of statutory reasoning. This increases the dataset size by two orders of magnitude, and introduces an element of interpretability. We show that this task is roughly as difficult to Natural Language Processing models as the original task. Finally, we come back to statutory reasoning, solving it with a combination of a retrieval mechanism and analogy models, and showing some progress on prior comparable work.

Submitted 12 January, 2024; originally announced January 2024.

arXiv:2312.13637 [pdf]

Layer-dependent evolution of electronic structures and correlations in rhombohedral multilayer graphene

Authors: Yue-Ying Zhou, Yang Zhang, Shihao Zhang, Hao Cai, Ling-Hui Tong, Yuan Tian, Tongtong Chen, Qiwei Tian, Chen Zhang, Yiliu Wang, Xuming Zou, Xingqiang Liu, Yuanyuan Hu, Li Zhang, Lijie Zhang, Wen-Xiao Wang, Lei Liao, Zhihui Qin, Long-Jing Yin

Abstract: The recent discovery of superconductivity and magnetism in trilayer rhombohedral graphene (RG) establishes an ideal, untwisted platform to study strong correlation electronic phenomena. However, the correlated effects in multilayer RG have received limited attention, and, particularly, the evolution of the correlations with increasing layer number remains an unresolved question. Here, we show the observation of layer-dependent electronic structures and correlations in RG multilayers from 3 to 9 layers by using scanning tunneling microscopy and spectroscopy. We explicitly determine layer-enhanced low-energy flat bands and interlayer coupling strength. The former directly demonstrates the further flattening of low-energy bands in thicker RG, and the latter indicates the presence of varying interlayer interactions in RG multilayers. Moreover, we find significant splitting of the flat bands, ranging from ~50-80 meV, under liquid nitrogen temperature when they are partially filled, indicating the emergence of interaction-induced strongly correlated states. Particularly, the strength of the correlated states is notably enhanced in thicker RG and reaches its maximum in the six-layer, validating directly theoretical predictions and establishing abundant new candidates for strongly correlated systems. Our results provide valuable insights into the layer dependence of the electronic properties in RG, paving the way for investigating robust and highly accessible correlated phases in simpler systems.

Submitted 21 December, 2023; originally announced December 2023.

Comments: 21 pages, 4 figures

arXiv:2312.12236 [pdf, ps, other]

Generalization Analysis of Machine Learning Algorithms via the Worst-Case Data-Generating Probability Measure

Authors: Xinying Zou, Samir M. Perlaza, Iñaki Esnaola, Eitan Altman

Abstract: In this paper, the worst-case probability measure over the data is introduced as a tool for characterizing the generalization capabilities of machine learning algorithms. More specifically, the worst-case probability measure is a Gibbs probability measure and the unique solution to the maximization of the expected loss under a relative entropy constraint with respect to a reference probability measure. Fundamental generalization metrics, such as the sensitivity of the expected loss, the sensitivity of the empirical risk, and the generalization gap are shown to have closed-form expressions involving the worst-case data-generating probability measure. Existing results for the Gibbs algorithm, such as characterizing the generalization gap as a sum of mutual information and lautum information, up to a constant factor, are recovered. A novel parallel is established between the worst-case data-generating probability measure and the Gibbs algorithm. Specifically, the Gibbs probability measure is identified as a fundamental commonality of the model space and the data space for machine learning algorithms.

Submitted 19 December, 2023; originally announced December 2023.

Comments: To appear in the Proceedings of the AAAI Conference on Artificial Intelligence (7 + 2 pages)

Report number: INRIA Technical Report RR-9515

arXiv:2312.07532 [pdf, other]

Interfacing Foundation Models' Embeddings

Authors: Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang

Abstract: Foundation models possess strong capabilities in reasoning and memorizing across modalities. To further unleash the power of foundation models, we present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity. As shown in the teaser figure, a lightweight transformer interface without tuning any foundation model weights is enough for segmentation, grounding, and retrieval in an interleaved manner. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights. (2) Interleavable. With the benefit of multi-task multi-modal training, the proposed interface creates an interleaved shared embedding space. (3) Extendable. The proposed interface is adaptive to new tasks and new models. In light of the interleaved embedding space, we introduce FIND-Bench, which introduces new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval. We are the first work aligning foundation models' embeddings for interleaved understanding. Meanwhile, our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings.

Submitted 15 July, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

Comments: CODE: https://github.com/UX-Decoder/FIND

arXiv:2312.07141 [pdf, other]

Multilingual large language models leak human stereotypes across language boundaries

Authors: Yang Trista Cao, Anna Sotnikova, Jieyu Zhao, Linda X. Zou, Rachel Rudinger, Hal Daume III

Abstract: Multilingual large language models have been increasingly popular for their proficiency in processing and generating text across various languages. Previous research has shown that the presence of stereotypes and biases in monolingual large language models can be attributed to the nature of their training data, which is collected from humans and reflects societal biases. Multilingual language models undergo the same training procedure as monolingual ones, albeit with training data sourced from various languages. This raises the question: do stereotypes present in one social context leak across languages within the model? In our work, we first define the term "stereotype leakage" and propose a framework for its measurement. With this framework, we investigate how stereotypical associations leak across four languages: English, Russian, Chinese, and Hindi. To quantify the stereotype leakage, we employ an approach from social psychology, measuring stereotypes via group-trait associations. We evaluate human stereotypes and stereotypical associations manifested in multilingual large language models such as mBERT, mT5, and GPT-3.5. Our findings show a noticeable leakage of positive, negative, and non-polar associations across all languages. Notably, Hindi within multilingual models appears to be the most susceptible to influence from other languages, while Chinese is the least. Additionally, GPT-3.5 exhibits a better alignment with human scores than other models. WARNING: This paper contains model outputs which could be offensive in nature.

Submitted 8 May, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.02949 [pdf, other]

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Authors: Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang

Abstract: With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for grounding and chat are usually separate, and their chat performance drops dramatically when asked to ground. The problem is the lack of a dataset for grounded visual chat (GVC). Existing grounding datasets only contain short captions. To address this issue, we have created GVC data that allows for the combination of grounding and chat capabilities. To better evaluate the GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that can support GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding .

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2312.02646 [pdf, other]

SAMSGL: Series-Aligned Multi-Scale Graph Learning for Spatio-Temporal Forecasting

Authors: Xiaobei Zou, Luolin Xiong, Yang Tang, Jürgen Kurths

Abstract: Spatio-temporal forecasting in various domains, like traffic prediction and weather forecasting, is a challenging endeavor, primarily due to the difficulties in modeling propagation dynamics and capturing high-dimensional interactions among nodes. Despite the significant strides made by graph-based networks in spatio-temporal forecasting, there remain two pivotal factors closely related to forecasting performance that need further consideration: time delays in propagation dynamics and multi-scale high-dimensional interactions. In this work, we present a Series-Aligned Multi-Scale Graph Learning (SAMSGL) framework, aiming to enhance forecasting performance. In order to handle time delays in spatial interactions, we propose a series-aligned graph convolution layer to facilitate the aggregation of non-delayed graph signals, thereby mitigating the influence of time delays for the improvement in accuracy. To understand global and local spatio-temporal interactions, we develop a spatio-temporal architecture via multi-scale graph learning, which encompasses two essential components: multi-scale graph structure learning and graph-fully connected (Graph-FC) blocks. The multi-scale graph structure learning includes a global graph structure to learn both delayed and non-delayed node embeddings, as well as a local one to learn node variations influenced by neighboring factors. The Graph-FC blocks synergistically fuse spatial and temporal information to boost prediction accuracy. To evaluate the performance of SAMSGL, we conduct experiments on meteorological and traffic forecasting datasets, which demonstrate its effectiveness and superiority.

Submitted 27 May, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

Comments: Accepted by Chaos

arXiv:2311.18327 [pdf]

Deep Reinforcement Learning Based Optimal Energy Management of Multi-energy Microgrids with Uncertainties

Authors: Yang Cui, Yang Xu, Yang Li, Yijian Wang, Xinpeng Zou

Abstract: Multi-energy microgrid (MEMG) offers an effective approach to deal with energy demand diversification and new energy consumption on the consumer side. In MEMG, it is critical to deploy an energy management system (EMS) for efficient utilization of energy and reliable operation of the system. To help EMS formulate optimal dispatching schemes, a deep reinforcement learning (DRL)-based MEMG energy management scheme with renewable energy source (RES) uncertainty is proposed in this paper. To accurately describe the operating state of the MEMG, the off-design performance model of energy conversion devices is considered in scheduling. The nonlinear optimal dispatching model is expressed as a Markov decision process (MDP) and is then addressed by the twin delayed deep deterministic policy gradient (TD3) algorithm. In addition, to accurately describe the uncertainty of RES, the conditional-least squares generative adversarial networks (C-LSGANs) method based on RES forecast power is proposed to construct the scenarios set of RES power generation. The generated data of RES is used for scheduling to obtain caps and floors for the purchase of electricity and natural gas. Based on this, the superior energy supply sector can formulate solutions in advance to tackle the uncertainty of RES. Finally, the simulation analysis demonstrates the validity and superiority of the method.

Submitted 30 November, 2023; originally announced November 2023.

Comments: Accepted by CSEE Journal of Power and Energy Systems

arXiv:2311.16512 [pdf, other]

CoSeR: Bridging Image and Language for Cognitive Super-Resolution

Authors: Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, Yujiu Yang

Abstract: Existing super-resolution (SR) models primarily focus on restoring local texture details, often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding, which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention", consolidating all conditional information into a single module. Consequently, our method successfully restores semantically correct and photorealistic details, demonstrating state-of-the-art performance across multiple benchmarks. Code: https://github.com/VINHYU/CoSeR

Submitted 20 December, 2023; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: Project page: https://coser-main.github.io ; GitHub repository: https://github.com/VINHYU/CoSeR

arXiv:2311.13601 [pdf, other]

Visual In-Context Prompting

Authors: Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao

Abstract: In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities, but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation to segment the most relevant object, falling short of addressing many generic vision tasks like open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both tasks. In particular, we build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as the context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities to refer and detect, yielding competitive performance to close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B, our model achieves $57.7$ PQ on COCO and $23.2$ PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv.

Submitted 22 November, 2023; originally announced November 2023.

Comments: technical report

arXiv:2311.12083 [pdf, other]

PanBench: Towards High-Resolution and High-Performance Pansharpening

Authors: Shiying Wang, Xuechao Zou, Kai Li, Junliang Xing, Pin Tao

Abstract: Pansharpening, a pivotal task in remote sensing, involves integrating low-resolution multispectral images with high-resolution panchromatic images to synthesize an image that is both high-resolution and retains multispectral information. These pansharpened images enhance precision in land cover classification, change detection, and environmental monitoring within remote sensing data analysis. While deep learning techniques have shown significant success in pansharpening, existing methods often face limitations in their evaluation, focusing on restricted satellite data sources, single scene types, and low-resolution images. This paper addresses this gap by introducing PanBench, a high-resolution multi-scene dataset containing all mainstream satellites and comprising 5,898 pairs of samples. Each pair includes a four-channel (RGB + near-infrared) multispectral image of 256x256 pixels and a mono-channel panchromatic image of 1,024x1,024 pixels. To achieve high-fidelity synthesis, we propose a Cascaded Multiscale Fusion Network (CMFNet) for Pansharpening. Extensive experiments validate the effectiveness of CMFNet. We have released the dataset, source code, and pre-trained models in the supplementary, fostering further research in remote sensing.

Submitted 20 November, 2023; originally announced November 2023.

Comments: 10 pages, 5 figures

arXiv:2311.05437 [pdf, other]

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Authors: Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li

Abstract: LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios.

Submitted 9 November, 2023; originally announced November 2023.

Comments: 25 pages, 25M file size. Project Page: https://llava-vl.github.io/llava-plus/

Showing 1–50 of 388 results for author: Zou, X