-
GRANDlib: A simulation pipeline for the Giant Radio Array for Neutrino Detection (GRAND)
Authors:
GRAND Collaboration,
Rafael Alves Batista,
Aurélien Benoit-Lévy,
Teresa Bister,
Martina Bohacova,
Mauricio Bustamante,
Washington Carvalho,
Yiren Chen,
LingMei Cheng,
Simon Chiche,
Jean-Marc Colley,
Pablo Correa,
Nicoleta Cucu Laurenciu,
Zigao Dai,
Rogerio M. de Almeida,
Beatriz de Errico,
Sijbrand de Jong,
João R. T. de Mello Neto,
Krijn D. de Vries,
Valentin Decoene,
Peter B. Denton,
Bohao Duan,
Kaikai Duan,
Ralph Engel,
William Erba,
et al. (90 additional authors not shown)
Abstract:
The operation of upcoming ultra-high-energy cosmic-ray, gamma-ray, and neutrino radio-detection experiments, like the Giant Radio Array for Neutrino Detection (GRAND), poses significant computational challenges involving the production of numerous simulations of particle showers and their detection, and a high data throughput. GRANDlib is an open-source software tool designed to meet these challenges. Its primary goal is to perform end-to-end simulations of the detector operation, from the interaction of ultra-high-energy particles, through -- by interfacing with external air-shower simulations -- the ensuing particle shower development and its radio emission, to its detection by antenna arrays and its processing by data-acquisition systems. Additionally, GRANDlib manages the visualization, storage, and retrieval of experimental and simulated data. We present an overview of GRANDlib to serve as the basis of future GRAND analyses.
Submitted 20 August, 2024;
originally announced August 2024.
-
Unveiling the jet angular broadening with $γ-$jet in high-energy nuclear collisions
Authors:
Sa Wang,
Yao Li,
Jin-Wen Kang,
Ben-Wei Zhang
Abstract:
Medium modification of jet substructure within the hot and dense nuclear matter has attracted enormous interest from the heavy-ion physics community in recent years. Measurements of inclusive jet show the angular narrowing in nucleus-nucleus collisions, while the recent CMS results of the photon-tagged jets ($γ-$jet) indicate hints of broadening. In this work, we conduct a theoretical study on the angular structure of inclusive jet and $γ-$jet with a transport approach considering the jet energy loss and the medium response in the quark-gluon plasma. We carry out the girth modification of $γ-$jet in $0-30\%$ PbPb collisions at $\sqrt{s_{NN}}=$ 5.02 TeV, which shows a satisfactory agreement with the recent CMS measurement. We explore the connection between the selection bias and the jet kinematics when choosing different $x_{jγ}=p_T^{\rm jet}/p_T^γ$ threshold. Importantly, we quantitatively demonstrate that $γ-$jet provides significant advantages to reduce the selection bias and can effectively collect jets sufficiently quenched in PbPb collisions compared to the inclusive jet, which is critical to capture the jet angular broadening observed by CMS. We further estimate the contributions of the medium-induced gluon radiation and the medium response to the broadening of the jet angular substructure.
Submitted 20 August, 2024;
originally announced August 2024.
-
ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining
Authors:
Qi Ma,
Yue Li,
Bin Ren,
Nicu Sebe,
Ender Konukoglu,
Theo Gevers,
Luc Van Gool,
Danda Pani Paudel
Abstract:
3D Gaussian Splatting (3DGS) has become the de facto method of 3D representation in many vision tasks. This calls for the 3D understanding directly in this representation space. To facilitate the research in this direction, we first build a large-scale dataset of 3DGS using the commonly used ShapeNet and ModelNet datasets. Our dataset ShapeSplat consists of 65K objects from 87 unique categories, whose labels are in accordance with the respective datasets. The creation of this dataset utilized the compute equivalent of 2 GPU years on a TITAN XP GPU.
We utilize our dataset for unsupervised pretraining and supervised finetuning for classification and segmentation tasks. To this end, we introduce Gaussian-MAE, which highlights the unique benefits of representation learning from Gaussian parameters. Through exhaustive experiments, we provide several valuable insights. In particular, we show that (1) the distribution of the optimized GS centroids significantly differs from the uniformly sampled point cloud (used for initialization) counterpart; (2) this change in distribution results in degradation in classification but improvement in segmentation tasks when using only the centroids; (3) to leverage additional Gaussian parameters, we propose Gaussian feature grouping in a normalized feature space, along with splats pooling layer, offering a tailored solution to effectively group and embed similar Gaussians, which leads to notable improvement in finetuning tasks.
Submitted 20 August, 2024;
originally announced August 2024.
-
Revisiting the measurements and interpretations of DLVO forces
Authors:
Bo Feng,
Xiantang Liu,
Xinmin Liu,
Yingli Li,
Hang Li
Abstract:
The DLVO theory and electrical double layer (EDL) theory are the foundations of colloid and interface science. With the invention and development of surface forces apparatus (SFA) and atomic force microscope (AFM), the measurements and interpretations of DLVO forces (i.e., mainly measuring the EDL force (electrostatic force) FEDL and van der Waals force FvdW, and interpreting the potential ψ, charge density σ, and Hamaker constant H) can be greatly facilitated by various surface force measurement techniques, and would have been very promising in advancing the DLVO theory, EDL theory, and colloid and interface science. However, although numerous studies have been conducted, pervasive anomalous results can be identified throughout the literature, mainly including: (1) the fitted ψ/σ is normally extremely small (ψ can be close to or (much) smaller than ψζ (zeta potential)) and varies greatly; (2) the fitted ψ/σ can exceed the allowable range of calculation; and (3) the measured FvdW and the fitted H vary greatly. Based on rigorous and comprehensive arguments, we have reasonably explained the pervasive anomalous results in the literature and further speculated that these anomalous results exist but have gone unnoticed and unquestioned owing to two important aspects: (1) the pervasive unreasonable understandings of EDL theory and (2) the commonly neglected systematic errors. Consequently, we believe that the related studies have been seriously hampered. We therefore call for re-examination and re-analysis of related experimental results and theoretical understandings by careful consideration of the EDL theory and systematic errors. On these bases, we can interpret the experimental results properly and promote the development of EDL theory, colloid and interface science, and many related fields.
Submitted 20 August, 2024;
originally announced August 2024.
-
EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech
Authors:
Xin Qi,
Ruibo Fu,
Zhengqi Wen,
Jianhua Tao,
Shuchen Shi,
Yi Lu,
Zhiyong Wang,
Xiaopeng Wang,
Yuankun Xie,
Yukun Liu,
Guanjun Li,
Xuefei Liu,
Yongwei Li
Abstract:
In the current era of Artificial Intelligence Generated Content (AIGC), a Low-Rank Adaptation (LoRA) method has emerged. It uses a plugin-based approach to learn new knowledge with lower parameter quantities and computational costs, and it can be plugged in and out based on the specific sub-tasks, offering high flexibility. However, the current application schemes primarily incorporate LoRA into the pre-introduced conditional parts of the speech models. This fixes the position of LoRA, limiting the flexibility and scalability of its application. Therefore, we propose the Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech (EELE) method. Starting from a general neutral speech model, we do not pre-introduce emotional information but instead use the LoRA plugin to design a flexible adaptive scheme that endows the model with emotional generation capabilities. Specifically, we initially train the model using only neutral speech data. After training is complete, we insert LoRA into different modules and fine-tune the model with emotional speech data to find the optimal insertion scheme. Through experiments, we compare and test the effects of inserting LoRA at different positions within the model and assess LoRA's ability to learn various emotions, effectively proving the validity of our method. Additionally, we explore the impact of the rank size of LoRA and the difference compared to directly fine-tuning the entire model.
Submitted 20 August, 2024;
originally announced August 2024.
-
A Noval Feature via Color Quantisation for Fake Audio Detection
Authors:
Zhiyong Wang,
Xiaopeng Wang,
Yuankun Xie,
Ruibo Fu,
Zhengqi Wen,
Jianhua Tao,
Yukun Liu,
Guanjun Li,
Xin Qi,
Yi Lu,
Xuefei Liu,
Yongwei Li
Abstract:
In the field of deepfake detection, previous studies focus on using reconstruction or mask and prediction methods to train pre-trained models, which are then transferred to fake audio detection training where the encoder is used to extract features, such as wav2vec2.0 and Masked Auto Encoder. These methods have proven that using real audio for reconstruction pre-training can better help the model distinguish fake audio. However, the disadvantage lies in poor interpretability, meaning it is hard to intuitively present the differences between deepfake and real audio. This paper proposes a novel feature extraction method via color quantisation which constrains the reconstruction to use a limited number of colors for the spectral image-like input. The proposed method ensures that the reconstructed input differs from the original, which allows for intuitive observation of the focus areas in the spectral reconstruction. Experiments conducted on the ASVspoof2019 dataset demonstrate that the proposed method achieves better classification performance compared to using the original spectral as input, and that pretraining the recolor network can also benefit fake audio detection.
Submitted 20 August, 2024;
originally announced August 2024.
-
Adversarial Attack for Explanation Robustness of Rationalization Models
Authors:
Yuankai Zhang,
Lingxiao Kong,
Haozhao Wang,
Ruixuan Li,
Jun Wang,
Yuhua Li,
Wei Liu
Abstract:
Rationalization models, which select a subset of input text as the rationale (crucial for humans to understand and trust predictions), have recently emerged as a prominent research area in eXplainable Artificial Intelligence. However, most previous studies focus on improving the quality of the rationale, ignoring its robustness to malicious attack. Specifically, whether the rationalization models can still generate high-quality rationale under the adversarial attack remains unknown. To explore this, this paper proposes UAT2E, which aims to undermine the explainability of rationalization models without altering their predictions, thereby eliciting distrust in these models from human users. UAT2E employs the gradient-based search on triggers and then inserts them into the original input to conduct both the non-target and target attack. Experimental results on five datasets reveal the vulnerability of rationalization models in terms of explanation, where they tend to select more meaningless tokens under attacks. Based on this, we make a series of recommendations for improving rationalization models in terms of explanation.
Submitted 20 August, 2024;
originally announced August 2024.
-
PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection
Authors:
Tri Cao,
Chengyu Huang,
Yuexin Li,
Huilin Wang,
Amy He,
Nay Oo,
Bryan Hooi
Abstract:
Phishing attacks are a major threat to online security, exploiting user vulnerabilities to steal sensitive information. Various methods have been developed to counteract phishing, each with varying levels of accuracy, but they also encounter notable limitations. In this study, we introduce PhishAgent, a multimodal agent that combines a wide range of tools, integrating both online and offline knowledge bases with Multimodal Large Language Models (MLLMs). This combination leads to broader brand coverage, which enhances brand recognition and recall. Furthermore, we propose a multimodal information retrieval framework designed to extract the top k relevant items from offline knowledge bases, utilizing all available information from a webpage, including logos, HTML, and URLs. Our empirical results, based on three real-world datasets, demonstrate that the proposed framework significantly enhances detection accuracy and reduces both false positives and false negatives, while maintaining model efficiency. Additionally, PhishAgent shows strong resilience against various types of adversarial attacks.
Submitted 20 August, 2024;
originally announced August 2024.
-
A Noncontact Technique for Wave Measurement Based on Thermal Stereography and Deep Learning
Authors:
Deyu Li,
Longfei Xiao,
Handi Wei,
Yan Li,
Binghua Zhang
Abstract:
The accurate measurement of the wave field and its spatiotemporal evolution is essential in many hydrodynamic experiments and engineering applications. The binocular stereo imaging technique has been widely used to measure waves. However, the optical properties of indoor water surfaces, including transparency, specular reflection, and texture absence, pose challenges for image processing and stereo reconstruction. This study proposed a novel technique that combined thermal stereography and deep learning to achieve fully noncontact wave measurements. The optical imaging properties of water in the long-wave infrared spectrum were found to be suitable for stereo matching, effectively avoiding the issues in the visible-light spectrum. After capturing wave images using thermal stereo cameras, a reconstruction strategy involving deep learning techniques was proposed to improve stereo matching performance. A generative approach was employed to synthesize a dataset with ground-truth disparity from unannotated infrared images. This dataset was then fed to a pretrained stereo neural network for fine-tuning to achieve domain adaptation. Wave flume experiments were conducted to validate the feasibility and accuracy of the proposed technique. The final reconstruction results indicated great agreement and high accuracy with a mean bias of less than 2.1% compared with the measurements obtained using wave probes, suggesting that the novel technique effectively measures the spatiotemporal distribution of wave surface in hydrodynamic experiments.
Submitted 20 August, 2024;
originally announced August 2024.
-
Learning Instruction-Guided Manipulation Affordance via Large Models for Embodied Robotic Tasks
Authors:
Dayou Li,
Chenkun Zhao,
Shuo Yang,
Lin Ma,
Yibin Li,
Wei Zhang
Abstract:
We study the task of language instruction-guided robotic manipulation, in which an embodied robot is supposed to manipulate the target objects based on the language instructions. In previous studies, the predicted manipulation regions of the target object typically do not change with specification from the language instructions, which means that the language perception and manipulation prediction are separate. However, in human behavioral patterns, the manipulation regions of the same object will change for different language instructions. In this paper, we propose Instruction-Guided Affordance Net (IGANet) for predicting affordance maps of instruction-guided robotic manipulation tasks by utilizing powerful priors from vision and language encoders pre-trained on large-scale datasets. We develop a Vision-Language-Models (VLMs)-based data augmentation pipeline, which can generate a large amount of data automatically for model training. Besides, with the help of Large Language Models (LLMs), actions can be effectively executed to finish the tasks defined by instructions. A series of real-world experiments revealed that our method can achieve better performance with generated data. Moreover, our model can generalize better to scenarios with unseen objects and language instructions.
Submitted 20 August, 2024;
originally announced August 2024.
-
Cores and weights of multipartitions and blocks of Ariki-Koike algebras
Authors:
Yanbo Li,
Kai Meng Tan
Abstract:
Let $e$ be an integer at least two. We define the $e$-core and the $e$-weight of a multipartition associated with a multicharge as the $e$-core and the $e$-weight of its image under the Uglov map. We do not place any restriction on the multicharge for these definitions. We show how these definitions lead to the definition of the $e$-core and the $e$-weight of a block of an Ariki-Koike algebra with quantum parameter $e$, and an analogue of Nakayama's `Conjecture' that classifies these blocks. Our definition of $e$-weight of such a block coincides with that first defined by Fayers. We further generalise the notion of a $[w:k]$-pair for Iwahori-Hecke algebra of type $A$ to the Ariki-Koike algebras, and obtain a sufficient condition for such a pair to be Scopes equivalent.
Submitted 28 August, 2024; v1 submitted 20 August, 2024;
originally announced August 2024.
-
Vision Calorimeter for Anti-neutron Reconstruction: A Baseline
Authors:
Hongtian Yu,
Yangu Li,
Mingrui Wu,
Letian Shen,
Yue Liu,
Yunxuan Song,
Qixiang Ye,
Xiaorui Lyu,
Yajun Mao,
Yangheng Zheng,
Yunfan Liu
Abstract:
In high-energy physics, anti-neutrons ($\bar{n}$) are fundamental particles that frequently appear as final-state particles, and the reconstruction of their kinematic properties provides an important probe for understanding the governing principles. However, this poses significant instrumental challenges for the electromagnetic calorimeter (EMC), a typical experimental sensor that recovers the information of incident $\bar{n}$ insufficiently. In this study, we introduce Vision Calorimeter (ViC), a baseline method for anti-neutron reconstruction that leverages deep learning detectors to analyze the implicit relationships between EMC responses and incident $\bar{n}$ characteristics. Our motivation is that the energy distributions of $\bar{n}$ samples deposited in the EMC cell arrays embody rich contextual information. Converted to 2-D images, such contextual energy distributions can be used to predict the status of $\bar{n}$ (i.e., incident position and momentum) through a deep learning detector along with pseudo bounding boxes and a specified training objective. Experimental results demonstrate that ViC substantially outperforms the conventional reconstruction approach, reducing the prediction error of incident position by 42.81% (from 17.31$^{\circ}$ to 9.90$^{\circ}$). More importantly, this study for the first time realizes the measurement of incident $\bar{n}$ momentum, underscoring the potential of deep learning detectors for particle reconstruction. Code is available at https://github.com/yuhongtian17/ViC.
Submitted 20 August, 2024;
originally announced August 2024.
-
Where to Fetch: Extracting Visual Scene Representation from Large Pre-Trained Models for Robotic Goal Navigation
Authors:
Yu Li,
Dayou Li,
Chenkun Zhao,
Ruifeng Wang,
Ran Song,
Wei Zhang
Abstract:
To complete a complex task where a robot navigates to a goal object and fetches it, the robot needs to have a good understanding of the instructions and the surrounding environment. Large pre-trained models have shown capabilities to interpret tasks defined via language descriptions. However, previous methods attempting to integrate large pre-trained models with daily tasks are not competent in many robotic goal navigation tasks due to poor understanding of the environment. In this work, we present a visual scene representation built with large-scale visual language models to form a feature representation of the environment capable of handling natural language queries. Combined with large language models, this method can parse language instructions into action sequences for a robot to follow, and accomplish goal navigation by querying the scene representation. Experiments demonstrate that our method enables the robot to follow a wide range of instructions and complete complex goal navigation tasks.
Submitted 20 August, 2024;
originally announced August 2024.
-
Generative Diffusion Models for High Dimensional Channel Estimation
Authors:
Xingyu Zhou,
Le Liang,
Jing Zhang,
Peiwen Jiang,
Yong Li,
Shi Jin
Abstract:
Along with the prosperity of generative artificial intelligence (AI), its potential for solving conventional challenges in wireless communications has also surfaced. Inspired by this trend, we investigate the application of the advanced diffusion models (DMs), a representative class of generative AI models, to high dimensional wireless channel estimation. By capturing the structure of multiple-input multiple-output (MIMO) wireless channels via a deep generative prior encoded by DMs, we develop a novel posterior inference method for channel reconstruction. We further adapt the proposed method to recover channel information from low-resolution quantized measurements. Additionally, to enhance the over-the-air viability, we integrate the DM with the unsupervised Stein's unbiased risk estimator to enable learning from noisy observations and circumvent the requirements for ground truth channel data that is hardly available in practice. Results reveal that the proposed estimator achieves high-fidelity channel recovery while reducing estimation latency by a factor of 10 compared to state-of-the-art schemes, facilitating real-time implementation. Moreover, our method outperforms existing estimators while reducing the pilot overhead by half, showcasing its scalability to ultra-massive antenna arrays.
Submitted 19 August, 2024;
originally announced August 2024.
-
Interplay of Quantum Resources in Nonlocality Tests
Authors:
Hai-Hao Dong,
Yuwei Zhu,
Su-Yi Cheng,
Xingjian Zhang,
Cheng-Long Li,
Ying-Zhao Li,
Hao Li,
Lixing You,
Xiongfeng Ma,
Qiang Zhang,
Jian-Wei Pan
Abstract:
Nonlocality, evidenced by the violation of Bell inequalities, not only signifies entanglement but also highlights measurement incompatibility in quantum systems. Utilizing the generalized Clauser-Horne-Shimony-Holt (CHSH) Bell inequality, our high-efficiency optical setup achieves a loophole-free violation of $2.0132$. This result provides a device-independent lower bound on entanglement, quantified as the entanglement of formation at $0.0159$. Moreover, by tuning the parameters of the generalized Bell inequality, we enhance the estimation of measurement incompatibility, which is quantified by an effective overlap of $4.3883 \times 10^{-5}$. To explore the intricate interplay among nonlocality, entanglement, and measurement incompatibility, we generate mixed states, allowing for flexible modulation of entanglement via fast switching among the four Bell states using Pockels cells, achieving a fidelity above $99.10\%$. Intriguingly, our results reveal a counterintuitive relationship where increasing incompatibility initially boosts nonlocality but eventually leads to its reduction. Typically, maximal nonlocality does not coincide with maximal incompatibility. This experimental study sheds light on the optimal management of quantum resources for Bell-inequality-based quantum information processing.
Submitted 19 August, 2024;
originally announced August 2024.
-
Recognizing Beam Profiles from Silicon Photonics Gratings using Transformer Model
Authors:
Yu Dian Lim,
Hong Yu Li,
Simon Chun Kiat Goh,
Xiangyu Wang,
Peng Zhao,
Chuan Seng Tan
Abstract:
Over the past decade, there has been extensive work in developing integrated silicon photonics (SiPh) gratings for the optical addressing of trapped ion qubits in the ion trap quantum computing community. However, when viewing beam profiles from infrared (IR) cameras, it is often difficult to determine the corresponding heights where the beam profiles are located. In this work, we developed transformer models to recognize the corresponding height categories of beam profiles of light from SiPh gratings. The model is trained using two techniques: (1) input patches, and (2) input sequence. The model trained with input patches achieved a recognition accuracy of 0.938, while the model trained with input sequence shows a lower accuracy of 0.895. However, when repeating the model training for 150 cycles, the model trained with input patches shows inconsistent accuracy, ranging from 0.445 to 0.959, while the model trained with input sequence exhibits more consistent accuracy values between 0.789 and 0.936. The obtained outcomes can be expanded to various applications, including auto-focusing of light beam and auto-adjustment of z-axis stage to acquire desired beam profiles.
Submitted 22 August, 2024; v1 submitted 19 August, 2024;
originally announced August 2024.
-
Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models
Authors:
Aviv Bick,
Kevin Y. Li,
Eric P. Xing,
J. Zico Kolter,
Albert Gu
Abstract:
Transformer architectures have become a dominant paradigm for domains like language modeling but suffer in many inference settings due to their quadratic-time self-attention. Recently proposed subquadratic architectures, such as Mamba, have shown promise, but have been pretrained with substantially fewer computational resources than the strongest Transformer models. In this work, we present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs). The key idea of our approach is that we can view both Transformers and SSMs as applying different forms of mixing matrices over the token sequences. We can thus progressively distill the Transformer architecture by matching different degrees of granularity in the SSM: first matching the mixing matrices themselves, then the hidden units at each block, and finally the end-to-end predictions. Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens. Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models. MOHAWK allows models like SSMs to leverage computational resources invested in training Transformer-based architectures, highlighting a new avenue for building such models.
Submitted 19 August, 2024;
originally announced August 2024.
-
LoopSplat: Loop Closure by Registering 3D Gaussian Splats
Authors:
Liyuan Zhu,
Yue Li,
Erik Sandström,
Shengyu Huang,
Konrad Schindler,
Iro Armeni
Abstract:
Simultaneous Localization and Mapping (SLAM) based on 3D Gaussian Splats (3DGS) has recently shown promise towards more accurate, dense 3D scene maps. However, existing 3DGS-based methods fail to address the global consistency of the scene via loop closure and/or global bundle adjustment. To this end, we propose LoopSplat, which takes RGB-D images as input and performs dense mapping with 3DGS submaps and frame-to-model tracking. LoopSplat triggers loop closure online and computes relative loop edge constraints between submaps directly via 3DGS registration, leading to improvements in efficiency and accuracy over traditional global-to-local point cloud registration. It uses a robust pose graph optimization formulation and rigidly aligns the submaps to achieve global consistency. Evaluation on the synthetic Replica and real-world TUM-RGBD, ScanNet, and ScanNet++ datasets demonstrates competitive or superior tracking, mapping, and rendering compared to existing methods for dense RGB-D SLAM. Code is available at loopsplat.github.io.
Submitted 19 August, 2024; v1 submitted 19 August, 2024;
originally announced August 2024.
-
Finite dimensional 2-cyclic Jacobian algebras
Authors:
Yiyu Li,
Liangang Peng
Abstract:
In this paper, we start with a class of quivers containing only 2-cycles and loops, referred to as 2-cyclic quivers. We prove that there exists a potential on these quivers that ensures the resulting quiver with potential is Jacobian-finite. As an application, we first demonstrate through covering theory that a Jacobian-finite potential exists on a class of 2-acyclic quivers. Secondly, by using the 2-cyclic Caldero-Chapoton formula defined in Section 4.2, the $τ$-rigid modules obtained from the Jacobian algebras of our proven Jacobian-finite 2-cyclic quiver with potential can categorify Paquette-Schiffler's generalized cluster algebras in three specific cases: one for a disk with two marked points and one 3-puncture, one for a sphere with one puncture, one 3-puncture and one orbifold point, and another for a sphere with one puncture and two 3-punctures.
Submitted 19 August, 2024;
originally announced August 2024.
-
P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders
Authors:
Xuechao Chen,
Ying Chen,
Jialin Li,
Qiang Nie,
Yong Liu,
Qixing Huang,
Yang Li
Abstract:
3D pre-training is crucial to 3D perception tasks. However, limited by the difficulties in collecting clean 3D data, 3D pre-training has consistently faced data scaling challenges. Inspired by semi-supervised learning, which leverages limited labeled data and a large amount of unlabeled data, in this work we propose a novel self-supervised pre-training framework utilizing real 3D data and pseudo-3D data lifted from images by a large depth estimation model. Another challenge lies in efficiency. Previous methods such as Point-BERT and Point-MAE employ k nearest neighbors to embed 3D tokens, requiring quadratic time complexity. To efficiently pre-train on such a large amount of data, we propose a linear-time-complexity token embedding strategy and a training-efficient 2D reconstruction target. Our method achieves state-of-the-art performance in 3D classification and few-shot learning while maintaining high pre-training and downstream fine-tuning efficiency.
Submitted 19 August, 2024;
originally announced August 2024.
-
Boosting Open-Domain Continual Learning via Leveraging Intra-domain Category-aware Prototype
Authors:
Yadong Lu,
Shitian Zhao,
Boxiang Yun,
Dongsheng Jiang,
Yin Li,
Qingli Li,
Yan Wang
Abstract:
Despite recent progress in enhancing the efficacy of Open-Domain Continual Learning (ODCL) in Vision-Language Models (VLM), failing to (1) correctly identify the Task-ID of a test image and (2) use only the category set corresponding to the Task-ID, while preserving the knowledge related to each domain, cannot address the two primary challenges of ODCL: forgetting old knowledge and maintaining zero-shot capabilities, as well as the confusions caused by category-relatedness between domains. In this paper, we propose a simple yet effective solution: leveraging intra-domain category-aware prototypes for ODCL in CLIP (DPeCLIP), where the prototype is the key to bridging the above two processes. Concretely, we propose a training-free Task-ID discriminator method, by utilizing prototypes as classifiers for identifying Task-IDs. Furthermore, to maintain the knowledge corresponding to each domain, we incorporate intra-domain category-aware prototypes as domain prior prompts into the training process. Extensive experiments conducted on 11 different datasets demonstrate the effectiveness of our approach, achieving 2.37% and 1.14% average improvement in class-incremental and task-incremental settings, respectively.
Submitted 19 August, 2024;
originally announced August 2024.
-
Privacy Technologies for Financial Intelligence
Authors:
Yang Li,
Thilina Ranbaduge,
Kee Siong Ng
Abstract:
Financial crimes like terrorism financing and money laundering can have real impacts on society, including the abuse and mismanagement of public funds, an increase in societal problems such as drug trafficking and illicit gambling with attendant economic costs, and the loss of innocent lives in the case of terrorism activities. Complex financial crimes can be hard to detect primarily because data related to different pieces of the overall puzzle is usually distributed across a network of financial institutions, regulators, and law-enforcement agencies, and cannot be easily shared due to privacy constraints. Recent advances in Privacy-Preserving Data Matching and Machine Learning provide an opportunity for regulators and the financial industry to come together to solve the risk-discovery problem with technology. This paper provides a survey of the financial intelligence landscape and where opportunities lie for privacy technologies to improve the state-of-the-art in financial-crime detection.
Submitted 19 August, 2024;
originally announced August 2024.
-
Enhancing quantum phase synchronization through squeezed-reservoir engineering
Authors:
Xing Xiao,
Tian-Xiang Lu,
Wo-Jun Zhong,
Yan-Ling Li
Abstract:
We investigate the enhancement of quantum phase synchronization in a two-level system (TLS) coupled to a squeezed reservoir. Our study reveals that the squeezed reservoir induces a stable limit cycle in the TLS, enhancing the quantum phase synchronization. We utilize the Husimi $Q$-function to describe the phase portrait of the driven TLS, and the $S$-function to quantitatively illustrate the effects of signal strength and detuning on phase synchronization. Remarkably, we demonstrate that the squeezed reservoir imparts its squeezing characteristics to the TLS, leading to a more localized and pronounced synchronization. Additionally, we observe typical features of the Arnold tongue in the synchronization regions. The experimental feasibility of our findings is discussed in the context of a circuit QED system, suggesting that squeezed-reservoir engineering is an effective approach for achieving quantum phase synchronization.
Submitted 19 August, 2024;
originally announced August 2024.
-
Predicting Long-term Dynamics of Complex Networks via Identifying Skeleton in Hyperbolic Space
Authors:
Ruikun Li,
Huandong Wang,
Jinghua Piao,
Qingmin Liao,
Yong Li
Abstract:
Learning complex network dynamics is fundamental for understanding, modeling, and controlling real-world complex systems. Though great efforts have been made to predict the future states of nodes on networks, the capability of capturing long-term dynamics remains largely limited. This is because existing methods overlook the fact that long-term dynamics in complex networks are predominantly governed by their inherent low-dimensional manifolds, i.e., skeletons. Therefore, we propose the Dynamics-Invariant Skeleton Neural Network (DiskNet), which identifies skeletons of complex networks based on the renormalization group structure in hyperbolic space to preserve both topological and dynamics properties. Specifically, we first condense complex networks with various dynamics into simple skeletons through physics-informed hyperbolic embeddings. Further, we design graph neural ordinary differential equations to capture the condensed dynamics on the skeletons. Finally, we recover the skeleton networks and dynamics to the original ones using a degree-based super-resolution module. Extensive experiments across three representative dynamics as well as five real-world and two synthetic networks demonstrate the superior performance of the proposed DiskNet, which outperforms the state-of-the-art baselines by an average of 10.18% in terms of long-term prediction accuracy. Code for reproduction is available at: https://github.com/tsinghua-fib-lab/DiskNet.
Submitted 19 August, 2024;
originally announced August 2024.
-
TDNetGen: Empowering Complex Network Resilience Prediction with Generative Augmentation of Topology and Dynamics
Authors:
Chang Liu,
Jingtao Ding,
Yiwen Song,
Yong Li
Abstract:
Predicting the resilience of complex networks, which represents the ability to retain fundamental functionality amidst external perturbations or internal failures, plays a critical role in understanding and improving real-world complex systems. Traditional theoretical approaches grounded in nonlinear dynamical systems rely on prior knowledge of network dynamics. On the other hand, data-driven approaches frequently encounter the challenge of insufficient labeled data, a predicament commonly observed in real-world scenarios. In this paper, we introduce a novel resilience prediction framework for complex networks, designed to tackle this issue through generative data augmentation of network topology and dynamics. The core idea is the strategic utilization of the inherent joint distribution present in unlabeled network data, facilitating the learning process of the resilience predictor by illuminating the relationship between network topology and dynamics. Experimental results on three network datasets demonstrate that our proposed framework, TDNetGen, can achieve high prediction accuracy of up to 85%-95%. Furthermore, the framework still demonstrates a pronounced augmentation capability in extreme low-data regimes, thereby underscoring its utility and robustness in enhancing the prediction of network resilience. Our code is open-sourced at https://github.com/tsinghua-fib-lab/TDNetGen.
△ Less
Submitted 19 August, 2024;
originally announced August 2024.
-
A Population-to-individual Tuning Framework for Adapting Pretrained LM to On-device User Intent Prediction
Authors:
Jiahui Gong,
Jingtao Ding,
Fanjin Meng,
Guilong Chen,
Hong Chen,
Shen Zhao,
Haisheng Lu,
Yong Li
Abstract:
Mobile devices, especially smartphones, can support rich functions and have developed into indispensable tools in daily life. With the rise of generative AI services, smartphones can potentially transform into personalized assistants, anticipating user needs and scheduling services accordingly. Predicting user intents on smartphones, and reflecting anticipated activities based on past interactions and context, remains a pivotal step towards this vision. Existing research predominantly focuses on specific domains, neglecting the challenge of modeling diverse event sequences across dynamic contexts. Leveraging pre-trained language models (PLMs) offers a promising avenue, yet adapting PLMs to on-device user intent prediction presents significant challenges. To address these challenges, we propose PITuning, a Population-to-Individual Tuning framework. PITuning enhances common pattern extraction through dynamic event-to-intent transition modeling and addresses long-tailed preferences via adaptive unlearning strategies. Experimental results on real-world datasets demonstrate PITuning's superior intent prediction performance, highlighting its ability to capture long-tailed preferences and its practicality for on-device prediction scenarios.
Submitted 19 August, 2024;
originally announced August 2024.
-
Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
Authors:
Yunxin Li,
Haoyuan Shi,
Baotian Hu,
Longyue Wang,
Jiashun Zhu,
Jinyi Xu,
Zhen Zhao,
Min Zhang
Abstract:
Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages: Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director's script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output.
Submitted 19 August, 2024;
originally announced August 2024.
-
R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation
Authors:
Xiao Wang,
Yuehang Li,
Fuling Wang,
Shiao Wang,
Chuanfu Li,
Bo Jiang
Abstract:
Inspired by the tremendous success of Large Language Models (LLMs), existing X-ray medical report generation methods attempt to leverage large models to achieve better performance. They usually adopt a Transformer to extract the visual features of a given X-ray image and then feed them into the LLM for text generation. How to extract more effective information for the LLMs to help them improve final results is an urgent problem that needs to be solved. Additionally, the use of visual Transformer models also brings high computational complexity. To address these issues, this paper proposes a novel context-guided efficient X-ray medical report generation framework. Specifically, we introduce Mamba as the vision backbone with linear complexity, and the performance obtained is comparable to that of the strong Transformer model. More importantly, we perform context retrieval from the training set for samples within each mini-batch during the training phase, utilizing both positively and negatively related samples to enhance feature representation and discriminative learning. Subsequently, we feed the vision tokens, context information, and prompt statements to invoke the LLM for generating high-quality medical reports. Extensive experiments on three X-ray report generation datasets (i.e., IU-Xray, MIMIC-CXR, CheXpert Plus) fully validated the effectiveness of our proposed model. The source code of this work will be released on https://github.com/Event-AHU/Medical_Image_Analysis.
Submitted 19 August, 2024;
originally announced August 2024.
-
Optical trapping with optical magnetic field and photonic Hall effect forces
Authors:
Yanzeng Li,
Emmanuel Valenton,
Spoorthi Nagasamudram,
John Parker,
Marcos Perez,
Uttam Manna,
Mahua Biswas,
Stuart A. Rice,
Norbert F. Scherer
Abstract:
Optical trapping is having an ever-increasing impact in science (particularly biophysics, photonics and, most recently, quantum optomechanics) owing to its superior capability for manipulating nanoscale structures and materials. However, essentially all experimental optical trapping studies in the optical dipole regime have, to date, been dominated by the interaction between a material's electric polarizability, $α_{e}$, and the electric part of the incident electromagnetic field, and therefore described by electric field intensity gradient forces. Optical trapping based on optical magnetic light-matter interactions has not been experimentally addressed despite its immediate extension of the boundaries of optical trapping research and applications. This paper addresses this long-standing deficiency through the realization of optical magnetic trapping of large index of refraction (i.e., Si) nanoparticles and also presents a formalism for quantitative understanding of the experimental findings. Our experimental optical trapping results require including optical magnetic polarizability, $α_{m}$, and electric-magnetic scattering forces associated with the Photonic Hall effect that are qualitatively and quantitatively validated by Maxwell stress tensor calculations. Our findings bring new opportunities for nanoparticle manipulation, potentially relax the limitations Ashkin claimed based on the optical Earnshaw's theorem, motivate optical matter formation by optical magnetic interactions, and suggest new N-body effects and symmetry breaking to drive dynamics of optical matter systems.
Submitted 19 August, 2024;
originally announced August 2024.
-
Community-Centric Graph Unlearning
Authors:
Yi Li,
Shichao Zhang,
Guixian Zhang,
Debo Cheng
Abstract:
Graph unlearning technology has become increasingly important since the advent of the 'right to be forgotten' and the growing concerns about the privacy and security of artificial intelligence. Graph unlearning aims to quickly eliminate the effects of specific data on graph neural networks (GNNs). However, most existing deterministic graph unlearning frameworks follow a balanced partition-submodel training-aggregation paradigm, resulting in a lack of structural information between subgraph neighborhoods and redundant unlearning parameter calculations. To address this issue, we propose a novel Graph Structure Mapping Unlearning paradigm (GSMU) and a novel method based on it named Community-centric Graph Eraser (CGE). CGE maps community subgraphs to nodes, thereby enabling the reconstruction of a node-level unlearning operation within a reduced mapped graph. CGE exponentially reduces both the amount of training data and the number of unlearning parameters. Extensive experiments conducted on five real-world datasets and three widely used GNN backbones have verified the high performance and efficiency of our CGE method, highlighting its potential in the field of graph unlearning.
Submitted 19 August, 2024;
originally announced August 2024.
-
LightWeather: Harnessing Absolute Positional Encoding to Efficient and Scalable Global Weather Forecasting
Authors:
Yisong Fu,
Fei Wang,
Zezhi Shao,
Chengqing Yu,
Yujie Li,
Zhao Chen,
Zhulin An,
Yongjun Xu
Abstract:
Recently, Transformers have gained traction in weather forecasting for their capability to capture long-term spatial-temporal correlations. However, their complex architectures result in large parameter counts and extended training times, limiting their practical application and scalability to global-scale forecasting. This paper aims to explore the key factor for accurate weather forecasting and design more efficient solutions. Interestingly, our empirical findings reveal that absolute positional encoding is what really works in Transformer-based weather forecasting models, which can explicitly model the spatial-temporal correlations even without attention mechanisms. We theoretically prove that its effectiveness stems from the integration of geographical coordinates and real-world time features, which are intrinsically related to the dynamics of weather. Based on this, we propose LightWeather, a lightweight and effective model for station-based global weather forecasting. We employ absolute positional encoding and a simple MLP in place of other components of Transformer. With under 30k parameters and less than one hour of training time, LightWeather achieves state-of-the-art performance on global weather datasets compared to other advanced DL methods. The results underscore the superiority of integrating spatial-temporal knowledge over complex architectures, providing novel insights for DL in weather forecasting.
Submitted 19 August, 2024;
originally announced August 2024.
-
Gravitational form factor $D$ of charmonium from shear stress
Authors:
Tianyang Hu,
Xianghui Cao,
Siqi Xu,
Yang Li,
Xingbo Zhao,
James P. Vary
Abstract:
Based on our recent analysis of the hadronic matrix element of the stress-energy tensor in covariant light front dynamics, we extract the charmonium gravitational form factor $D(Q^2)$ from shear stress $T^{12}$. This is in contrast to our recent work using the (light-front) energy density $T^{+-}$. Indeed, by comparing these two currents, we identify terms that are responsible for the violation of the current conservation. Numerical results based on basis light-front quantization show that the violation effects are small and the $D$-terms extracted from the two currents are close to each other, hence validating our previous work using $T^{+-}$.
Submitted 18 August, 2024;
originally announced August 2024.
-
Dissecting a strongly coupled scalar nucleon
Authors:
Xianghui Cao,
Yang Li,
James P. Vary
Abstract:
We continue our investigation of the stress within a strongly coupled scalar nucleon, and now dissect the gravitational form factors into contributions from its constituents, the (mock) nucleon and the (mock) pion. The computation is based on a non-perturbative solution of the scalar Yukawa model in the light-front Hamiltonian formalism with a Fock sector expansion including up to one nucleon and two pions. By employing the "good currents" $T^{++}_i$, $T^{+-}_i$ and $T^{12}_i$, we extract the full set of gravitational form factors $A_i$, $D_i$, $\bar c_i$ without the contamination of the spurious form factors, and free of uncanceled UV divergences. With these results, we decompose the mass of the system into its constituents and compute the matter and mechanical radii, gaining insights into the strongly coupled system.
Submitted 18 August, 2024;
originally announced August 2024.
-
A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition
Authors:
Yangze Li,
Xiong Wang,
Songjun Cao,
Yike Zhang,
Long Ma,
Lei Xie
Abstract:
Audio-LLM introduces the audio modality into a large language model (LLM) to enable a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed the presence of illusions and repetition issues in audio-LLM, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM by introducing an ASR expert as a transcription tokenizer and a hybrid Autoregressive (AR) Non-autoregressive (NAR) decoding approach to solve the above problems. Experiments on the 10k-hour WenetSpeech Mandarin corpus show that our approach achieves relative CER reductions of 12.2% and 9.6% on the Test_Net and Test_Meeting evaluation sets compared with the baseline. Notably, we reduce the decoding repetition rate on the evaluation set to zero, showing that the decoding repetition problem has been solved fundamentally.
Submitted 18 August, 2024;
originally announced August 2024.
-
First Discovery and Confirmation of PN Candidates Found from AI and Deep Learning Techniques Applied to VPHAS+ Survey Data
Authors:
Yushan Li,
Quentin Parker,
Peng Jia
Abstract:
Context. We have developed deep learning (DL) and AI-based tools to search extant narrow-band wide-field H$α$ surveys of the Galactic Plane for elusive planetary nebulae (PNe) which are hidden in dense star fields towards the Galactic center. They are faint, low-surface brightness, usually resolved sources, which are not discovered by previous automatic searches that depend on photometric data for point-like sources. These sources are very challenging to find by traditional visual inspection in such crowded fields and many have been missed. We have successfully adopted a novel 'Swin-Transformer' AI algorithm, which we described in detail in the preceding Techniques paper (Paper I). Aims. Here, we present preliminary results from our first spectroscopic follow-up run for 31 top-quality PN candidates found by the algorithm from the high-resolution H$α$ survey VPHAS+. This survey has not yet undergone extensive manual, systematic searching. Methods. Our candidate PNe were observed with the SpUpNIC spectrograph on the 1.9 m telescope at the South African Astronomical Observatory (SAAO) in June 2023. We performed standard IRAF spectroscopic reduction and then followed our normal HASH PN identification and classification procedures. Results. Our reduced spectra confirmed that these candidates include 22 true, likely, and possible PNe (70.97\%), 3 emission-line galaxies, 2 emission-line stars, 2 late-type star contaminants, and 2 other H$α$ sources including a newly identified detached fragment of SNR RCW 84. We present the imaging and spectral data of these candidates and a preliminary analysis of their properties. These data provide strong input to help evaluate and refine the behavior of the AI algorithm when searching for PNe in wide-field H$α$ surveys.
Submitted 18 August, 2024;
originally announced August 2024.
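The classification counts quoted in the abstract above can be checked with a few lines of arithmetic. The sketch below only re-derives the total and the percentages from the numbers stated in the abstract; the dictionary labels and the function name `fractions` are illustrative choices, not part of the authors' pipeline.

```python
# Classification counts as reported in the abstract for the observed candidates.
counts = {
    "true, likely, and possible PNe": 22,
    "emission-line galaxies": 3,
    "emission-line stars": 2,
    "late-type star contaminants": 2,
    "other H-alpha sources": 2,
}


def fractions(counts):
    """Return the total number of candidates and each class as a percentage."""
    total = sum(counts.values())
    return total, {label: 100.0 * n / total for label, n in counts.items()}


total, pct = fractions(counts)
print(f"total candidates: {total}")
for label, value in pct.items():
    print(f"{label}: {value:.2f}%")
```

The counts sum to the 31 candidates mentioned in the abstract, and the PN fraction rounds to the quoted 70.97%.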
-
Image-Based Geolocation Using Large Vision-Language Models
Authors:
Yi Liu,
Junchen Ding,
Gelei Deng,
Yuekang Li,
Tianwei Zhang,
Weisong Sun,
Yaowen Zheng,
Jingquan Ge,
Yang Liu
Abstract:
Geolocation is now a vital aspect of modern life, offering numerous benefits but also presenting serious privacy concerns. The advent of large vision-language models (LVLMs) with advanced image-processing capabilities introduces new risks, as these models can inadvertently reveal sensitive geolocation information. This paper presents the first in-depth study analyzing the challenges posed by traditional deep learning and LVLM-based geolocation methods. Our findings reveal that LVLMs can accurately determine geolocations from images, even without explicit geographic training.
To address these challenges, we introduce \tool{}, an innovative framework that significantly enhances image-based geolocation accuracy. \tool{} employs a systematic chain-of-thought (CoT) approach, mimicking human geoguessing strategies by carefully analyzing visual and contextual cues such as vehicle types, architectural styles, natural landscapes, and cultural elements. Extensive testing on a dataset of 50,000 ground-truth data points shows that \tool{} outperforms both traditional models and human benchmarks in accuracy. It achieves an impressive average score of 4550.5 in the GeoGuessr game, with an 85.37\% win rate, and delivers highly precise geolocation predictions, with the closest distances as accurate as 0.3 km. Furthermore, our study highlights issues related to dataset integrity, leading to the creation of a more robust dataset and a refined framework that leverages LVLMs' cognitive capabilities to improve geolocation precision. These findings underscore \tool{}'s superior ability to interpret complex visual data, the urgent need to address emerging security vulnerabilities posed by LVLMs, and the importance of responsible AI development to ensure user privacy protection.
Submitted 18 August, 2024;
originally announced August 2024.
-
Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition
Authors:
Qifei Li,
Yingming Gao,
Yuhua Wen,
Cong Wang,
Ya Li
Abstract:
To address the limitation in multimodal emotion recognition (MER) performance arising from inter-modal information fusion, we propose Foal-Net, a novel MER framework based on multitask learning in which fusion occurs after alignment. The framework is designed to enhance the effectiveness of modality fusion and includes two auxiliary tasks: audio-video emotion alignment (AVEL) and cross-modal emotion label matching (MEM). First, AVEL aligns the emotional information in audio-video representations through contrastive learning. Then, a modal fusion network integrates the aligned features. Meanwhile, MEM assesses whether the emotions of the current sample pair are the same, assisting modal information fusion and guiding the model to focus more on emotional information. Experimental results on the IEMOCAP corpus show that Foal-Net outperforms state-of-the-art methods and that emotion alignment is necessary before modal fusion.
Submitted 18 August, 2024;
originally announced August 2024.
-
Discovery of terahertz-frequency orbitally-coupled magnons in a kagome ferromagnet
Authors:
Mengqian Che,
Weizhao Chen,
Maoyuan Wang,
F. Michael Bartram,
Liangyang Liu,
Xuebin Dong,
Jinjin Liu,
Yidian Li,
Hao Lin,
Zhiwei Wang,
Enke Liu,
Yugui Yao,
Zhe Yuan,
Guang-Ming Zhang,
Luyi Yang
Abstract:
In ferromagnetic materials, magnons, the quanta of spin waves, typically resonate in the gigahertz range. Beyond conventional magnons, theoretical studies have predicted magnons associated with orbital magnetic moments, but their direct observation has remained challenging. Here, we present the discovery of two distinct terahertz orbitally-coupled magnon resonances in the topological kagome ferromagnet Co3Sn2S2. Using time-resolved Kerr rotation spectroscopy, we pinpoint two magnon resonances at 0.61 and 0.49 THz at 6 K, surpassing all previously reported magnon resonances in ferromagnets due to strong magnetocrystalline anisotropy. These dual modes originate from the strong coupling of localized spin and orbital magnetic moments. These findings unveil a novel category of magnons stemming from orbital magnetic moments, and position Co3Sn2S2 as a promising candidate for high-speed terahertz spintronic applications.
Submitted 18 August, 2024;
originally announced August 2024.
-
Transition signatures for electron-positron pair creation in space-time inhomogeneous electric field
Authors:
C. K. Li,
X. X. Zhou,
Q. Chen,
B. An,
Y. J. Li,
N. S. Lin,
Y. Wan
Abstract:
The process of electron-positron pair creation through multi-photon absorption in a space-time dependent electric field is analyzed using computational quantum field theory. Our findings reveal two distinct pair creation channels: the symmetric and asymmetric transition channels. We propose that the asymmetric transition channel arises from the inherent spatial inhomogeneity of intense laser pulses. By mapping the field-theoretical model of laser-assisted multi-photon pair creation onto a quantum-mechanical time-dependent framework, a semi-analytical solution that captures the asymmetric transition signatures of vacuum decay is derived. Additionally, it is demonstrated that neglecting spatial inhomogeneity leads to erroneous transition amplitudes and incorrect identification of pair creation channels, even when the dipole approximation holds.
Submitted 18 August, 2024;
originally announced August 2024.
-
Quantitative uniform exponential acceleration of averages along decaying waves
Authors:
Zhicheng Tong,
Yong Li
Abstract:
In this study, utilizing a specific exponential weighting function, we investigate the uniform exponential convergence of weighted Birkhoff averages along decaying waves and delve into several related variants. A key distinction from traditional scenarios is evident here: despite reduced regularity in observables, our method still maintains exponential convergence. In particular, we develop new techniques that yield very precise rates of exponential convergence, as evidenced by numerical simulations. Furthermore, this innovative approach extends to quantitative analyses involving different weighting functions employed by others, surpassing the limitations inherent in prior research. It also enhances the exponential convergence rates of weighted Birkhoff averages along quasi-periodic orbits via analytic observables. To the best of our knowledge, this is the first result on the uniform exponential acceleration beyond averages along quasi-periodic or almost periodic orbits, particularly from a quantitative perspective.
Submitted 18 August, 2024;
originally announced August 2024.
-
OU-CoViT: Copula-Enhanced Bi-Channel Multi-Task Vision Transformers with Dual Adaptation for OU-UWF Images
Authors:
Yang Li,
Jianing Deng,
Chong Zhong,
Danjuan Yang,
Meiyan Li,
A. H. Welsh,
Aiyi Liu,
Xingtao Zhou,
Catherine C. Liu,
Bo Fu
Abstract:
Myopia screening using cutting-edge ultra-widefield (UWF) fundus imaging and joint modeling of multiple discrete and continuous clinical scores presents a promising new paradigm for multi-task problems in Ophthalmology. The bi-channel framework that arises from the Ophthalmic phenomenon of ``interocular asymmetries'' of both eyes (OU) calls for new employment on the SOTA transformer-based models. However, the application of copula models for multiple mixed discrete-continuous labels on deep learning (DL) is challenging. Moreover, the application of advanced large transformer-based models to small medical datasets is challenging due to overfitting and computational resource constraints. To resolve these challenges, we propose OU-CoViT: a novel Copula-Enhanced Bi-Channel Multi-Task Vision Transformers with Dual Adaptation for OU-UWF images, which can i) incorporate conditional correlation information across multiple discrete and continuous labels within a deep learning framework (by deriving the closed form of a novel Copula Loss); ii) take OU inputs subject to both high correlation and interocular asymmetries using a bi-channel model with dual adaptation; and iii) enable the adaptation of large vision transformer (ViT) models to small medical datasets. Solid experiments demonstrate that OU-CoViT significantly improves prediction performance compared to single-channel baseline models with empirical loss. Furthermore, the novel architecture of OU-CoViT allows generalizability and extensions of our dual adaptation and Copula Loss to various ViT variants and large DL models on small medical datasets. Our approach opens up new possibilities for joint modeling of heterogeneous multi-channel input and mixed discrete-continuous clinical scores in medical practices and has the potential to advance AI-assisted clinical decision-making in various medical domains beyond Ophthalmology.
Submitted 18 August, 2024;
originally announced August 2024.
-
ELASTIC: Efficient Linear Attention for Sequential Interest Compression
Authors:
Jiaxin Deng,
Shiyao Wang,
Song Lu,
Yinfeng Li,
Xinchen Luo,
Yuanjun Liu,
Peixing Xu,
Guorui Zhou
Abstract:
State-of-the-art sequential recommendation models rely heavily on the transformer's attention mechanism. However, the quadratic computational and memory complexity of self-attention has limited its scalability for modeling users' long-range behaviour sequences. To address this problem, we propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression, which requires only linear time complexity and decouples model capacity from computational cost. Specifically, ELASTIC introduces fixed-length interest experts with a linear dispatcher attention mechanism that compresses long-term behaviour sequences into a significantly more compact representation, reducing GPU memory usage by up to 90% with a 2.7x inference speed-up. The proposed linear dispatcher attention mechanism significantly reduces the quadratic complexity and makes the model feasible for adequately modeling extremely long sequences. Moreover, to retain the capacity for modeling various user interests, ELASTIC initializes a vast learnable interest memory bank and sparsely retrieves compressed user interests from the memory with negligible computational overhead. The proposed interest memory retrieval technique significantly expands the cardinality of the available interest space while keeping the same computational cost, thereby striking a trade-off between recommendation accuracy and efficiency. To validate the effectiveness of ELASTIC, we conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders. Experimental results demonstrate that ELASTIC consistently outperforms baselines by a significant margin and also highlight its computational efficiency when modeling long sequences. We will make our implementation code publicly available.
Submitted 20 August, 2024; v1 submitted 18 August, 2024;
originally announced August 2024.
-
Mutual Information Multinomial Estimation
Authors:
Yanzhi Chen,
Zijing Ou,
Adrian Weller,
Yingzhen Li
Abstract:
Estimating mutual information (MI) is a fundamental yet challenging task in data science and machine learning. This work proposes a new estimator for mutual information. Our main discovery is that a preliminary estimate of the data distribution can dramatically help the estimation. This preliminary estimate serves as a bridge between the joint and the marginal distributions, and by comparing with this bridge distribution we can easily obtain the true difference between the joint and the marginal distributions. Experiments on diverse tasks, including non-Gaussian synthetic problems with known ground truth and real-world applications, demonstrate the advantages of our method.
Submitted 18 August, 2024;
originally announced August 2024.
-
Flemme: A Flexible and Modular Learning Platform for Medical Images
Authors:
Guoqing Zhang,
Jingyun Yang,
Yang Li
Abstract:
With the rapid development of computer vision and the emergence of powerful network backbones and architectures, the application of deep learning in medical imaging has become increasingly significant. Unlike natural images, medical images lack huge volumes of data but feature more modalities, making it difficult to train a general model that performs satisfactorily across various datasets. In practice, practitioners often have to manually create and test models combining independent backbones and architectures, which is a laborious and time-consuming process. We propose Flemme, a FLExible and Modular learning platform for MEdical images. Our platform separates encoders from the model architectures so that different models can be constructed via various combinations of supported encoders and architectures. We construct encoders using building blocks based on convolution, transformer, and state-space model (SSM) to process both 2D and 3D image patches. A base architecture is implemented following an encoder-decoder style, with several derived architectures for image segmentation, reconstruction, and generation tasks. In addition, we propose a general hierarchical architecture incorporating a pyramid loss to optimize and fuse vertical features. Experiments demonstrate that this simple design leads to an average improvement of 5.60% in Dice score and 7.81% in mean intersection over union (mIoU) for segmentation models, as well as an enhancement of 5.57% in peak signal-to-noise ratio (PSNR) and 8.22% in structural similarity (SSIM) for reconstruction models. We further utilize Flemme as an analytical tool to assess the effectiveness and efficiency of various encoders across different tasks. Code is available at https://github.com/wlsdzyzl/flemme.
Submitted 18 August, 2024;
originally announced August 2024.
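For readers unfamiliar with the two segmentation metrics named in the abstract above, here is a minimal sketch of the standard definitions of the Dice score and intersection over union (IoU) for binary masks. It is not taken from the Flemme code base; the function name `dice_and_iou` and the convention of scoring two empty masks as a perfect match are assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Dice score and IoU for two flat binary masks (sequences of 0/1).

    Assumed convention: two empty masks count as a perfect match.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    union = pred_size + target_size - intersection
    if union == 0:
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_size + target_size)
    iou = intersection / union
    return dice, iou


# Toy example: one overlapping pixel out of two predicted and two true pixels.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)
```

The mean IoU (mIoU) is then the average of the per-class IoU values when a segmentation has several classes.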
-
Analysis of the Effect of Tilted Corner Cube Reflector Arrays on Lunar Laser Ranging
Authors:
Jin Cao,
Rufeng Tang,
Kai Huang,
Zhulian Li,
Yongzhang Yang,
Kai Huang,
Jintao Li,
Yuqiang Li
Abstract:
This paper primarily investigates the effect of the tilt of corner cube reflector (CCR) arrays on lunar laser ranging (LLR). A mathematical model was established to study the random errors caused by the tilt of the CCR arrays. The study found that, ideally, when the laser ranging pulse width is 10 picoseconds or less, it is possible to distinguish from which specific corner cubes within the CCR array each peak in the echo signal originates. Consequently, partial data from the echo can be extracted for signal processing, significantly reducing random errors and improving the single-shot precision of LLR. The distance obtained by extracting part of the echo can be reduced to the center position of the array, thereby providing multiple higher-precision ranging results from each measurement. This not only improves the precision of LLR but also increases the data volume. A simulation experiment based on the 1.2 m laser ranging system at Yunnan Observatories was conducted. By extracting one peak for signal processing, the single-shot precision improved from 32.24 mm to 2.52 mm, validating the theoretical analysis results. Finally, an experimental laser ranging system based on a 53 cm binocular telescope system was established for ground experiments. The experimental results indicated that the echo signal could identify the tilt state of the CCR array. By extracting the peak returned by the central CCR for signal processing, the ranging precision was greatly improved. Through theoretical analyses, simulation experiments, and ground experiments, a solution to reduce the random errors caused by the tilt of the CCR array was provided. This offers an approach to enhance the single-shot precision of future LLR and provides a reference for upgrading ground-based equipment at future laser ranging stations.
Submitted 21 August, 2024; v1 submitted 17 August, 2024;
originally announced August 2024.
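To give a sense of scale for the 10-picosecond pulse width mentioned in the abstract above, the sketch below applies the generic two-way time-of-flight relation, range = c·t/2. This is not the authors' mathematical model, which the abstract does not spell out; the function names are illustrative, and the precision ratio simply divides the two single-shot figures quoted in the abstract.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum (exact by definition)


def one_way_range_mm(two_way_time_s):
    """One-way distance in millimetres covered during a two-way time interval."""
    return C_M_PER_S * two_way_time_s / 2.0 * 1e3


def two_way_delay_ps(depth_offset_mm):
    """Two-way delay in picoseconds between reflectors offset in depth."""
    return 2.0 * (depth_offset_mm * 1e-3) / C_M_PER_S * 1e12


# A 10 ps pulse spans roughly 1.5 mm of one-way range.
print(one_way_range_mm(10e-12))

# Ratio of the simulated single-shot precisions quoted in the abstract.
improvement = 32.24 / 2.52
print(improvement)
```

Under this simple relation, corner cubes whose depth offsets along the line of sight exceed about 1.5 mm would return echoes separated by more than one 10 ps pulse width.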
-
CogLM: Tracking Cognitive Development of Large Language Models
Authors:
Xinglin Wang,
Peiwen Yuan,
Shaoxiong Feng,
Yiwei Li,
Boyuan Pan,
Heda Wang,
Yao Hu,
Kan Li
Abstract:
Piaget's Theory of Cognitive Development (PTC) posits that the development of cognitive levels forms the foundation for human learning across various abilities. As Large Language Models (LLMs) have recently shown remarkable abilities across a wide variety of tasks, we are curious about the cognitive levels of current LLMs: to what extent they have developed and how this development has been achieved. To this end, we construct a benchmark CogLM (Cognitive Ability Evaluation for Language Model) based on PTC to assess the cognitive levels of LLMs. CogLM comprises 1,220 questions spanning 10 cognitive abilities crafted by more than 20 human experts, providing a comprehensive testbed for the cognitive levels of LLMs. Through extensive experiments across multiple mainstream LLMs with CogLM, we find that: (1) Human-like cognitive abilities have emerged in advanced LLMs (GPT-4), comparable to those of a 20-year-old human. (2) The parameter size and optimization objective are two key factors affecting the cognitive levels of LLMs. (3) The performance on downstream tasks is positively correlated with the level of cognitive abilities. These findings fill the gap in research on the cognitive abilities of LLMs, tracing the development of LLMs from a cognitive perspective and guiding the future direction of their evolution.
Submitted 17 August, 2024;
originally announced August 2024.
-
Evidence for hybrid gamma-ray emission from the supernova remnant G150.3+4.5
Authors:
Yuan Li,
Siming Liu,
Gwenael Giacinti
Abstract:
The supernova remnant (SNR) G150.3+4.5 was first identified in radio, exhibiting a hard GeV spectrum and a $\sim 1.5^\circ$ radius. Radio observations revealed a bright arc with an index of $\sim -0.40$, which stands in contrast to the index of $\sim -0.69$ for the rest. This arc is coincident with the point-like \emph{Fermi} source 4FGL J0426.5+5434 and the KM2A source 1LHAASO J0428+5531. The rest of the SNR has a hard GeV spectrum and a soft TeV spectrum, implying a spectral cut-off or break near 1 TeV. Since there is no X-ray counterpart and no pulse signal detected, the gamma-ray ($γ$-ray) emission mechanism of the SNR and the point-like source appears puzzling. In this work, we reanalyse the $γ$-ray emission using 14 yr of data recorded by the \emph{Fermi} Large Area Telescope and find that the spectrum of the northern half-sphere is compatible with a broken power law with a break at $146 \pm 11$ GeV and photon indices of $Γ_{\rm{Northlobe}} = 1.54\pm0.04_{\rm{stat}}\pm0.07_{\rm{syst}}$ ($2.28\pm0.08_{\rm{stat}}\pm0.12_{\rm{syst}}$) below (above) the break. In addition, the southern half-sphere can be described well with a single power law with $Γ_{\rm{Southlobe}} = 1.95\pm0.07_{\rm{stat}}\pm0.09_{\rm{syst}}$. Since the southern half-sphere is well correlated with CO emission, we propose that the $γ$-ray emission of the northern half-sphere could be dominated by relativistic electrons via inverse-Compton processes, while the southern half-sphere is dominated by cosmic rays via hadronic processes. 4FGL J0426.5+5434 may result from the illumination of a cloud by escaping cosmic rays or a recent shock-cloud interaction. Observations from LHAASO-KM2A thus favour the possibility of a cosmic-ray PeVatron candidate; however, leptonic scenarios cannot be ruled out. Further multi-wavelength observations are warranted to confirm the hadronic nature of 1LHAASO J0428+5531.
Submitted 17 August, 2024;
originally announced August 2024.
-
Depth-guided Texture Diffusion for Image Semantic Segmentation
Authors:
Wei Sun,
Yuan Li,
Qixiang Ye,
Jianbin Jiao,
Yanzhao Zhou
Abstract:
Depth information provides valuable insights into 3D structure, especially the outlines of objects, which can be utilized to improve semantic segmentation. However, a naive fusion of depth information can disrupt features and compromise accuracy due to the modality gap between depth and vision. In this work, we introduce a Depth-guided Texture Diffusion approach that effectively tackles this challenge. Our method extracts low-level features from edges and textures to create a texture image. This image is then selectively diffused across the depth map, enhancing structural information vital for precisely extracting object outlines. By integrating this enriched depth map with the original RGB image into a joint feature embedding, our method effectively bridges the disparity between the depth map and the image, enabling more accurate semantic segmentation. We conduct comprehensive experiments across diverse, commonly used datasets spanning a wide range of semantic segmentation tasks, including Camouflaged Object Detection (COD), Salient Object Detection (SOD), and indoor semantic segmentation. With source-free estimated depth or depth captured by depth cameras, our method consistently outperforms existing baselines and achieves new state-of-the-art results, demonstrating the effectiveness of our Depth-guided Texture Diffusion for image semantic segmentation.
Submitted 17 August, 2024;
originally announced August 2024.
-
MoRA: LoRA Guided Multi-Modal Disease Diagnosis with Missing Modality
Authors:
Zhiyi Shi,
Junsik Kim,
Wanhua Li,
Yicong Li,
Hanspeter Pfister
Abstract:
Multi-modal pre-trained models efficiently extract and fuse features from different modalities with low memory requirements for fine-tuning. Despite this efficiency, their application in disease diagnosis is under-explored. A significant challenge is the frequent occurrence of missing modalities, which impairs performance. Additionally, fine-tuning the entire pre-trained model demands substantial computational resources. To address these issues, we introduce Modality-aware Low-Rank Adaptation (MoRA), a computationally efficient method. MoRA projects each input to a low intrinsic dimension but uses different modality-aware up-projections for modality-specific adaptation in cases of missing modalities. Practically, MoRA integrates into the first block of the model, significantly improving performance when a modality is missing. It requires minimal computational resources, with less than 1.6% of the trainable parameters needed compared to training the entire model. Experimental results show that MoRA outperforms existing techniques in disease diagnosis, demonstrating superior performance, robustness, and training efficiency.
Submitted 16 August, 2024;
originally announced August 2024.
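The abstract above describes a shared projection to a low intrinsic dimension followed by modality-aware up-projections. The following is a rough sketch of that idea in plain Python, written only from the abstract's one-sentence description. The class name `ModalityAwareLowRank`, the zero initialisation of the up-projections (borrowed from the usual LoRA convention), and the modality labels are all assumptions, not the authors' implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class ModalityAwareLowRank:
    """Hypothetical adapter: one shared down-projection, one up-projection per modality."""

    def __init__(self, dim, rank, modalities, seed=0):
        rng = random.Random(seed)
        self.dim, self.rank = dim, rank
        # Shared down-projection (rank x dim), small random initialisation.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        # Per-modality up-projections (dim x rank), zero-initialised so the
        # adapter starts as a no-op, as is conventional for LoRA.
        self.up = {m: [[0.0] * rank for _ in range(dim)] for m in modalities}

    def delta(self, x, modality):
        """Low-rank update added to the frozen layer's output for input x."""
        return matvec(self.up[modality], matvec(self.down, x))

    def num_params(self):
        """Trainable parameters: shared down-projection plus all up-projections."""
        return self.rank * self.dim + len(self.up) * self.dim * self.rank


adapter = ModalityAwareLowRank(dim=8, rank=2, modalities=["image", "text"])
print(adapter.num_params())
print(adapter.delta([1.0] * 8, "image"))
```

In this toy setting the adapter holds far fewer parameters than a full `dim x dim` weight update once `rank` is much smaller than `dim`, which is the general motivation for low-rank adaptation. The specific 1.6% figure in the abstract refers to the authors' full model and is not reproduced here.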
-
AI-assisted super-resolution cosmological simulations IV: An emulator for deterministic realizations
Authors:
Xiaowen Zhang,
Patrick Lachance,
Ankita Dasgupta,
Rupert A. C. Croft,
Tiziana Di Matteo,
Yueying Ni,
Simeon Bird,
Yin Li
Abstract:
Super-resolution (SR) models in cosmological simulations use deep learning (DL) to rapidly supplement low-resolution (LR) runs with statistically correct, fine details. The SR technique preserves large-scale structures by conditioning on a low-resolution (LR) version of the simulation. On smaller scales, the generative deep learning (DL) process is stochastic, resulting in numerous possible SR realizations, each with unique small-scale structures. Validation of SR then relies on making sure that a specific statistic of interest is accurately reproduced by comparing SR and high resolution (HR) runs. In this study, we develop an emulator designed to reproduce the individual small-scale structures of an HR simulation as closely as possible. We process an SR realization alongside a specific High-Resolution Initial Condition (HRIC), transforming the SR output to emulate the results of a full simulation with that HRIC. By comparing visualizations, individual halo measures and cross-correlating Fourier modes we show that the emulated SR runs closely align with the corresponding HR simulation, even on length scales an order of magnitude smaller than the LR run. Additionally, small halos are trained to match the HR simulation, and the subhalo mass function is more accurately reproduced. These results show the promise of this method for generating numerous fast and accurate simulations and mock observations for large galaxy surveys.
Submitted 16 August, 2024;
originally announced August 2024.