
LongVLM: Efficient Long Video Understanding via Large Language Models

Authors: Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang

Abstract: Empowered by Large Language Models (LLMs), recent advancements in Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still face challenges in achieving detailed understanding due to overlooking local information in long-term videos. To tackle this challenge, we introduce LongVLM, a simple yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach proposes to decompose long videos into multiple short-term segments and encode local features for each segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we propose to integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate the superior capabilities of our model over the previous state-of-the-art methods. Qualitative examples show that our model produces more precise responses for long video understanding. Code is available at https://github.com/ziplab/LongVLM.

Submitted 20 July, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

Comments: Accepted by ECCV 2024

arXiv:2404.03176 [pdf, other]

Information-Theoretic Generalization Bounds for Deep Neural Networks

Authors: Haiyun He, Christina Lee Yu, Ziv Goldfeld

Abstract: Deep neural networks (DNNs) exhibit an exceptional capacity for generalization in practical applications. This work aims to capture the effect and benefits of depth for supervised learning via information-theoretic generalization bounds. We first derive two hierarchical bounds on the generalization error in terms of the Kullback-Leibler (KL) divergence or the 1-Wasserstein distance between the train and test distributions of the network internal representations. The KL divergence bound shrinks as the layer index increases, while the Wasserstein bound implies the existence of a layer that serves as a generalization funnel, which attains a minimal 1-Wasserstein distance. Analytic expressions for both bounds are derived under the setting of binary Gaussian classification with linear DNNs. To quantify the contraction of the relevant information measures when moving deeper into the network, we analyze the strong data processing inequality (SDPI) coefficient between consecutive layers of three regularized DNN models: Dropout, DropConnect, and Gaussian noise injection. This enables refining our generalization bounds to capture the contraction as a function of the network architecture parameters. Specializing our results to DNNs with a finite parameter space and the Gibbs algorithm reveals that deeper yet narrower network architectures generalize better in those examples, although how broadly this statement applies remains a question.

Submitted 3 April, 2024; originally announced April 2024.

Comments: 25 pages, 5 figures

arXiv:2404.02101 [pdf, other]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Authors: Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, Ceyuan Yang

Abstract: Controllability plays a crucial role in video generation since it allows users to create desired content. However, existing models have largely overlooked the precise control of camera pose, which serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for text-to-video (T2V) models. After precisely parameterizing the camera trajectory, a plug-and-play camera module is then trained on a T2V model, leaving others untouched. Additionally, a comprehensive study on the effect of various datasets is also conducted, suggesting that videos with diverse camera distribution and similar appearances indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise and domain-adaptive camera control, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs. Our project website is at: https://hehao13.github.io/projects-CameraCtrl/.

Submitted 2 April, 2024; originally announced April 2024.

Comments: Project page: https://hehao13.github.io/projects-CameraCtrl/ Code: https://github.com/hehao13/CameraCtrl

arXiv:2404.01664 [pdf, other]

doi 10.1016/j.trc.2024.104586

Nonreciprocal interactions in crowd dynamics: investigating the impact of moving threats on pedestrian speed preferences

Authors: Shaocong Xie, Rui Ye, Xiaolian Li, Zhongyi Huang, Shuchao Cao, Wei Lv, Hong He, Ping Zhang, Zhiming Fang, Jun Zhang, Weiguo Song

Abstract: Nonreciprocal interaction crowd systems, such as human-human, human-vehicle, and human-robot systems, often have serious impacts on pedestrian safety and social order. A more comprehensive understanding of these systems is needed to optimize system stability and efficiency. Despite the importance of these interactions, empirical research in this area remains limited. Thus, in our study we explore this underresearched area, focusing on scenarios where nonreciprocity plays a critical role, such as mass stabbings, which pose a substantial risk to public safety. We conducted the first experiments on this system and analysed high-accuracy data obtained from these experiments. The extent of the direct threat zone is determined by the speed of the moving threat and the radius of danger occurrence. We further categorize potential threats into direct, adjacent, and rear-view zones, quantifying the level of threat for pedestrians. Our study revealed that a pedestrian's desired velocity correlated positively with potential threat intensity, increasing until near the direct threat zone. An emerging steady state is observed when escape routes are blocked by moving threats. This deviation affects the density-velocity relationship, making it distinct from the general relationship. This deviation signifies unique pedestrian behaviour in the presence of moving threats. Additionally, the rate of change in the angle for pedestrian motion in various desired directions is synchronized. This indicates the emergence of collective intelligence in nonreciprocal interaction crowd systems. As a result, our study may constitute a pioneering step towards understanding nonreciprocal interactions in crowd systems through laboratory experiments. These findings may enhance pedestrian safety and inform not only government crowd management strategies but also individual self-protection measures.

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2404.01453 [pdf, other]

Unveiling Divergent Inductive Biases of LLMs on Temporal Data

Authors: Sindhu Kishore, Hangfeng He

Abstract: Unraveling the intricate details of events in natural language necessitates a subtle understanding of temporal dynamics. Despite the adeptness of Large Language Models (LLMs) in discerning patterns and relationships from data, their inherent comprehension of temporal dynamics remains a formidable challenge. This research meticulously explores these intrinsic challenges within LLMs, with a specific emphasis on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data. Employing two distinct prompt types, namely Question Answering (QA) format and Textual Entailment (TE) format, our analysis probes into both implicit and explicit events. The findings underscore noteworthy trends, revealing disparities in the performance of GPT-3.5 and GPT-4. Notably, biases toward specific temporal relationships come to light, with GPT-3.5 demonstrating a preference for "AFTER" in the QA format for both implicit and explicit events, while GPT-4 leans towards "BEFORE". Furthermore, a consistent pattern surfaces wherein GPT-3.5 tends towards "TRUE", and GPT-4 exhibits a preference for "FALSE" in the TE format for both implicit and explicit events. This persistent discrepancy between GPT-3.5 and GPT-4 in handling temporal data highlights the intricate nature of inductive bias in LLMs, suggesting that the evolution of these models may not merely mitigate bias but may introduce new layers of complexity.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2404.00309 [pdf, other]

Model-Driven Deep Learning for Distributed Detection with Binary Quantization

Authors: Wei Guo, Meng He, Chuan Huang, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief

Abstract: Within the realm of rapidly advancing wireless sensor networks (WSNs), distributed detection assumes a significant role in various practical applications. However, a critical challenge lies in maintaining robust detection performance while operating within the constraints of limited bandwidth and energy resources. This paper introduces a novel approach that combines model-driven deep learning (DL) with binary quantization to strike a balance between communication overhead and detection performance in WSNs. We begin by establishing the lower bound of detection error probability for distributed detection using the maximum a posteriori (MAP) criterion. Furthermore, we prove the global optimality of employing identical local quantizers across sensors, thereby maximizing the corresponding Chernoff information. Subsequently, the paper derives the minimum MAP detection error probability (MAPDEP) by implementing identical binary probabilistic quantizers across the sensors. Moreover, the paper establishes the equivalence between utilizing all quantized data and their average as input to the detector at the fusion center (FC). In particular, we derive the Kullback-Leibler (KL) divergence, which measures the difference between the true posterior probability and the output of the proposed detector. Leveraging the MAPDEP and KL divergence as loss functions, the paper proposes a model-driven DL method to separately train the probability controller module in the quantizer and the detector module at the FC. Numerical results validate the convergence and effectiveness of the proposed method, which achieves near-optimal performance with reduced complexity for Gaussian hypothesis testing.

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2404.00246 [pdf, other]

Your Co-Workers Matter: Evaluating Collaborative Capabilities of Language Models in Blocks World

Authors: Guande Wu, Chen Zhao, Claudio Silva, He He

Abstract: Language agents that interact with the world on their own have great potential for automating digital tasks. While large language model (LLM) agents have made progress in understanding and executing tasks such as textual games and webpage control, many real-world tasks also require collaboration with humans or other LLMs in equal roles, which involves intent understanding, task coordination, and communication. To test LLMs' ability to collaborate, we design a blocks-world environment, where two agents, each having unique goals and skills, build a target structure together. To complete the goals, they can act in the world and communicate in natural language. Under this environment, we design increasingly challenging settings to evaluate different collaboration perspectives, from independent to more complex, dependent tasks. We further adopt chain-of-thought prompts that include intermediate reasoning steps to model the partner's state and identify and correct execution errors. Both human-machine and machine-machine experiments show that LLM agents have strong grounding capacities, and our approach significantly improves the evaluation metric.

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2404.00057 [pdf, other]

PerOS: Personalized Self-Adapting Operating Systems in the Cloud

Authors: Hongyu Hè

Abstract: Operating systems (OSes) are foundational to computer systems, managing hardware resources and ensuring secure environments for diverse applications. However, despite their enduring importance, the fundamental design objectives of OSes have seen minimal evolution over decades. Traditionally prioritizing aspects like speed, memory efficiency, security, and scalability, these objectives often overlook the crucial aspect of intelligence as well as personalized user experience. The lack of intelligence becomes increasingly critical amid technological revolutions, such as the remarkable advancements in machine learning (ML). Today's personal devices, evolving into intimate companions for users, pose unique challenges for traditional OSes like Linux and iOS, especially with the emergence of specialized hardware featuring heterogeneous components. Furthermore, the rise of large language models (LLMs) in ML has introduced transformative capabilities, reshaping user interactions and software development paradigms. While existing literature predominantly focuses on leveraging ML methods for system optimization or accelerating ML workloads, there is a significant gap in addressing personalized user experiences at the OS level. To tackle this challenge, this work proposes PerOS, a personalized OS ingrained with LLM capabilities. PerOS aims to provide tailored user experiences while safeguarding privacy and personal data through declarative interfaces, self-adaptive kernels, and secure data management in a scalable cloud-centric architecture; therein lies the main research question of this work: How can we develop intelligent, secure, and scalable OSes that deliver personalized experiences to thousands of users?

Submitted 26 March, 2024; originally announced April 2024.

Comments: 29 pages, 3 figures

arXiv:2403.15172 [pdf, other]

Magnetically arrested disks in FR I radio galaxies

Authors: Han He, Bei You, Ning Jiang, Xinwu Cao, Jingfu Hu, Zhenfeng Sheng, Su Yao, Bozena Czerny

Abstract: A sample of 17 FR I radio galaxies constructed from the 3CR catalog, which is characterized by edge-darkened radio structures, is studied. The optical core luminosities derived from Hubble Space Telescope observations are used to estimate the Eddington ratios, which are found to be below $10^{-3.4}$ for this sample. This is supported by the Baldwin-Phillips-Terlevich optical diagnostic diagrams derived from spectroscopic observations with the Telescopio Nazionale Galileo, suggesting that these sources are low-ionization nuclear emission-line regions (LINERs). It implies that the accretion in these FR I sources can be modeled as advection-dominated accretion flows (ADAFs). Given the low accretion rate, the predicted jet power with a fast-spinning black hole (BH) $a=0.95$ in the Blandford-Znajek mechanism is lower than the estimated one for almost all the sources in our sample. Such powerful jets indicate the presence of magnetically arrested disks (MAD) in the inner region of the ADAF, in the sense that the magnetic fields in the inner accretion zone are strong. Moreover, we show that, even in the MAD scenario, the BH spins in the sample are most likely moderate and/or fast with $a\gtrsim0.5$.

Submitted 22 March, 2024; originally announced March 2024.

Comments: 10 pages, 10 figures, 3 tables, Accepted for publication in MNRAS

arXiv:2403.14961 [pdf, ps, other]

Anderson Acceleration with Truncated Gram-Schmidt

Authors: Ziyuan Tang, Tianshi Xu, Huan He, Yousef Saad, Yuanzhe Xi

Abstract: Anderson Acceleration (AA) is a popular algorithm designed to enhance the convergence of fixed-point iterations. In this paper, we introduce a variant of AA based on a Truncated Gram-Schmidt process (AATGS) which has a few advantages over the classical AA. In particular, an attractive feature of AATGS is that its iterates obey a three-term recurrence in the situation when it is applied to solving symmetric linear problems and this can lead to a considerable reduction of memory and computational costs. We analyze the convergence of AATGS in both full-depth and limited-depth scenarios and establish its equivalence to the classical AA in the linear case. We also report on the effectiveness of AATGS through a set of numerical experiments, ranging from solving nonlinear partial differential equations to tackling nonlinear optimization problems. In particular, the performance of the method is compared with that of the classical AA algorithms.

Submitted 16 July, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

MSC Class: 65F10; 68W25; 65B99; 65N22
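
For readers unfamiliar with the baseline that this entry builds on, the following is a minimal sketch of classical, limited-depth Anderson Acceleration for a fixed-point problem x = g(x). It is not the AATGS variant from the paper, and it is not taken from the authors' code. The function name `anderson_acceleration`, the parameter `depth`, and the least-squares formulation via `numpy.linalg.lstsq` are illustrative assumptions.

```python
import numpy as np

def anderson_acceleration(g, x0, depth=5, tol=1e-10, max_iter=100):
    """Classical Anderson Acceleration (mixing parameter 1) for x = g(x).

    Hypothetical helper for illustration; this is NOT the AATGS variant.
    Returns the final iterate and the number of iterations performed.
    """
    x = np.asarray(x0, dtype=float)
    f = g(x) - x                      # residual f(x) = g(x) - x
    dX, dF = [], []                   # recent differences of iterates / residuals
    for k in range(max_iter):
        if np.linalg.norm(f) < tol:
            return x, k
        if dF:
            X = np.column_stack(dX)
            F = np.column_stack(dF)
            # gamma minimizes ||f - F @ gamma||_2 over the stored history
            gamma, *_ = np.linalg.lstsq(F, f, rcond=None)
            x_new = x + f - (X + F) @ gamma
        else:
            x_new = x + f             # first step: plain fixed-point update
        f_new = g(x_new) - x_new
        dX.append(x_new - x)
        dF.append(f_new - f)
        if len(dX) > depth:           # limited depth: drop the oldest pair
            dX.pop(0)
            dF.pop(0)
        x, f = x_new, f_new
    return x, max_iter

# Usage on a small contractive linear map g(x) = A x + b (made-up data).
A = np.array([[0.5, 0.1], [0.2, 0.3]])
b = np.array([1.0, 2.0])
x_star, iters = anderson_acceleration(lambda x: A @ x + b, np.zeros(2))
```

On a linear problem such as this one, the accelerated iteration should reach the fixed point in far fewer steps than the plain iteration x ← g(x), which is the behavior the abstract alludes to when it relates AA to the linear case.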

arXiv:2403.13250 [pdf, other]

Facilitating Pornographic Text Detection for Open-Domain Dialogue Systems via Knowledge Distillation of Large Language Models

Authors: Huachuan Qiu, Shuai Zhang, Hongliang He, Anqi Li, Zhenzhong Lan

Abstract: Pornographic content occurring in human-machine interaction dialogues can cause severe side effects for users in open-domain dialogue systems. However, research on detecting pornographic language within human-machine interaction dialogues is an important subject that is rarely studied. To advance in this direction, we introduce CensorChat, a dialogue monitoring dataset aimed at detecting whether the dialogue session contains pornographic content. To this end, we collect real-life human-machine interaction dialogues in the wild and break them down into single utterances and single-turn dialogues, with the last utterance spoken by the chatbot. We propose utilizing knowledge distillation of large language models to annotate the dataset. Specifically, first, the raw dataset is annotated by four open-source large language models, with the majority vote determining the label. Second, we use ChatGPT to update the empty label from the first step. Third, to ensure the quality of the validation and test sets, we utilize GPT-4 for label calibration. If the current label does not match the one generated by GPT-4, we employ a self-criticism strategy to verify its correctness. Finally, to facilitate the detection of pornographic text, we develop a series of text classifiers using a pseudo-labeled dataset. Detailed data analysis demonstrates that leveraging knowledge distillation techniques with large language models provides a practical and cost-efficient method for developing pornographic text detectors.

Submitted 19 March, 2024; originally announced March 2024.

Comments: Accepted to CSCWD 2024 (27th International Conference on Computer Supported Cooperative Work in Design). arXiv admin note: text overlap with arXiv:2309.09749
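
The first annotation step described in this entry (four annotators, majority vote, with some labels left empty for a later pass) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function name `majority_label`, the label strings, and the rule that a vote without a strict majority yields an empty label (`None`) are all hypothetical, since the abstract does not say when a label is left empty.

```python
from collections import Counter
from typing import Optional, Sequence

def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators.

    Assumption: when no label has more than half of the votes (for example
    a 2-2 split among four annotators), the label is left empty (None) so
    that a later step can fill it in.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

# Hypothetical votes from four annotator models for two utterances.
clear_case = majority_label(["unsafe", "unsafe", "unsafe", "safe"])
tied_case = majority_label(["unsafe", "safe", "unsafe", "safe"])
```

Here `clear_case` receives a label directly, while `tied_case` stays empty and would be passed to the second annotation step.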

arXiv:2403.11832 [pdf, other]

Precise measurement of the cosmic-ray spectrum and $\left \langle \ln A \right \rangle$ by LHAASO -- connecting the Galactic to the extragalactic components

Authors: Xing-Jian Lv, Xiao-Jun Bi, Kun Fang, Yi-Qing Guo, Hui-Hai He, Ling-Ling Ma, Peng-Fei Yin, Qiang Yuan, Meng-Jie Zhao

Abstract: Recently, the LHAASO Collaboration reported precise measurements of the cosmic-ray (CR) all-particle energy spectrum and mean logarithmic mass $\left \langle \ln A \right \rangle$ from 0.3 PeV to 30 PeV. Combining the CR measurements by AMS-02 and DAMPE in space with those by LHAASO and Auger on the ground, we construct a model to reproduce all these measurements from tens of GeV to tens of EeV. We find that the LHAASO measurement is crucial in the model construction because it connects the Galactic component to the extragalactic component. The precise measurements of CR spectra for individual species by AMS-02 and DAMPE, together with the newest LHAASO results, clearly indicate three Galactic CR components: a soft low-energy background, a hard high-energy component, and a local source contribution. However, the LHAASO data show that above $\sim 10^{16}$ eV a non-negligible extragalactic component must be included. Combining the Auger and LHAASO results, we find that the extragalactic CRs require at least two components, at lower and higher energies. Thanks to the precise measurements by LHAASO, the constraints on the model parameters are quite stringent. The spectral features and mass measurements across the whole energy range are all well reproduced by the model.

Submitted 18 March, 2024; originally announced March 2024.

Comments: 11 pages, 2 figures, 4 tables

arXiv:2403.10010 [pdf, other]

doi 10.1103/PhysRevLett.132.131002

Measurements of All-Particle Energy Spectrum and Mean Logarithmic Mass of Cosmic Rays from 0.3 to 30 PeV with LHAASO-KM2A

Authors: The LHAASO Collaboration, Zhen Cao, F. Aharonian, Q. An, A. Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen , et al. (256 additional authors not shown)

Abstract: We present measurements of the all-particle energy spectrum and mean logarithmic mass of cosmic rays in the energy range of 0.3-30 PeV using data collected from LHAASO-KM2A between September 2021 and December 2022, which is based on a nearly composition-independent energy reconstruction method, achieving unprecedented accuracy. Our analysis reveals the position of the knee at $3.67 \pm 0.05 \pm 0.15$ PeV. Below the knee, the spectral index is found to be $-2.7413 \pm 0.0004 \pm 0.0050$, while above the knee, it is $-3.128 \pm 0.005 \pm 0.027$, with the sharpness of the transition measured with a statistical error of 2%. The mean logarithmic mass of cosmic rays is almost heavier than helium in the whole measured energy range. It decreases from 1.7 at 0.3 PeV to 1.3 at 3 PeV, representing a 24% decline following a power law with an index of $-0.1200 \pm 0.0003 \pm 0.0341$. This is equivalent to an increase in abundance of light components. Above the knee, the mean logarithmic mass exhibits a power law trend towards heavier components, which is the reverse of the behavior observed in the all-particle energy spectrum. Additionally, the knee position and the change in power-law index are approximately the same. These findings suggest that the knee observed in the all-particle spectrum corresponds to the knee of the light component, rather than the medium-heavy components.

Submitted 26 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

Comments: 8 pages, 3 figures

Journal ref: Physical Review Letters 132, 131002 (2024)
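
The numbers quoted in this abstract for the mean logarithmic mass below the knee can be cross-checked with a few lines of arithmetic. The sketch below assumes the power law has the form ⟨ln A⟩ ∝ E^index with the central index value of -0.12, and uses the rounded endpoint values from the abstract (1.7 at 0.3 PeV, 1.3 at 3 PeV). The helper name `power_law` is illustrative, and the result is only expected to agree with the quoted figures up to rounding.

```python
def power_law(value_at_ref: float, energy: float, ref_energy: float, index: float) -> float:
    """Evaluate value_at_ref * (energy / ref_energy) ** index."""
    return value_at_ref * (energy / ref_energy) ** index

# Extrapolate <ln A> from 0.3 PeV to 3 PeV with the quoted index of -0.12.
ln_a_at_3_pev = power_law(1.7, energy=3.0, ref_energy=0.3, index=-0.12)

# Relative decline implied by the rounded endpoint values in the abstract.
relative_decline = (1.7 - 1.3) / 1.7

print(f"<ln A> at 3 PeV from the power law: {ln_a_at_3_pev:.2f}")
print(f"relative decline from rounded endpoints: {relative_decline:.1%}")
```

Under these assumptions the extrapolated value comes out close to 1.3 and the relative decline close to the quoted 24%, so the three figures in the abstract are mutually consistent at the level of rounding.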

arXiv:2403.09611 [pdf, other]

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Authors: Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman , et al. (7 additional authors not shown)

Abstract: In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.

Submitted 18 April, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.09339 [pdf, other]

doi 10.1364/OPTICA.520697

Field test of mode-pairing quantum key distribution

Authors: Hao-Tao Zhu, Yizhi Huang, Wen-Xin Pan, Chao-Wu Zhou, Jianjun Tang, Hong He, Ming Cheng, Xiandu Jin, Mi Zou, Shibiao Tang, Xiongfeng Ma, Teng-Yun Chen, Jian-Wei Pan

Abstract: Quantum key distribution is a cornerstone of quantum technology, offering information-theoretically secure keys for remote parties. With many quantum communication networks established globally, the mode-pairing protocol stands out for its efficacy over inter-city distances using simple setups, emerging as a promising solution. In this study, we deploy the mode-pairing scheme on existing inter-city fiber links, conducting field tests across distances ranging from tens to about a hundred kilometers. Our system achieves a key rate of $1.217$ kbit/s in a $195.85$ km symmetric link and $3.089$ kbit/s in a $127.92$ km asymmetric link without global phase locking. The results demonstrate that the mode-pairing protocol can achieve key rates comparable to those of a single quantum link between two trusted nodes on the Beijing-Shanghai backbone line, effectively reducing the need for half of the trusted nodes. These field tests confirm the mode-pairing scheme's adaptability, efficiency, and practicality, positioning it as a highly suitable protocol for quantum networks.

Submitted 14 March, 2024; originally announced March 2024.

Comments: 15 pages, 5 figures, 6 tables

Journal ref: Optica 11, 883-888 (2024)

arXiv:2403.09092 [pdf, other]

doi 10.1145/3589334.3645385

MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection

Authors: Yupeng Li, Haorui He, Jin Bai, Dacheng Wen

Abstract: The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. Methods trained on purely one single news source can hardly be applicable to real-world scenarios. Our pilot experiment demonstrates that the F1 score of the state-of-the-art method that learns from a large Chinese fake news detection dataset, Weibo-21, drops significantly from 0.943 to 0.470 when the test data is changed to multi-source news data, failing to identify more than one-third of the multi-source fake news. To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Notably, such news has been fact-checked by 14 authoritative fact-checking agencies worldwide. In addition, various existing Chinese fake news detection methods are thoroughly evaluated on our proposed dataset in cross-source, multi-source, and unseen source ways. MCFEND, as a benchmark dataset, aims to advance Chinese fake news detection approaches in real-world scenarios.

Submitted 24 July, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

Comments: Accepted by the ACM Web Conference 2024 (WWW 2024) oral, dataset available: https://github.com/TrustworthyComp

arXiv:2403.07837 [pdf, other]

Topological Protection of Optical Skyrmions through Complex Media

Authors: An Aloysius Wang, Zimo Zhao, Yifei Ma, Yuxi Cai, Runchen Zhang, Xiaoyi Shang, Yunqi Zhang, Ji Qin, Zhi Kai Pong, Tade Marozsak, Binguo Chen, Honghui He, Lin Luo, Martin J Booth, Steve J Elston, Stephen M Morris, Chao He

Abstract: Optical Skyrmions have many important properties that make them ideal units for high-density data applications, including the ability to carry digital information through a discrete topological number and the independence of spatially varying polarization to other dimensions. More importantly, the topological nature of the optical Skyrmion heuristically suggests a strong degree of robustness to perturbations, which is crucial for reliably carrying information in noisy environments. However, the study of the topological robustness of optical Skyrmions is still in its infancy. Here, we quantify this robustness precisely by proving that the topological nature of the Skyrmion arises from its structure on the boundary and, by duality, is therefore resilient to complex perturbations provided they respect the relevant boundary conditions of the unperturbed Skyrmion. We then present experimental evidence validating this robustness in the context of paraxial Skyrmion beams against different polarization aberrations. Our work provides a framework for handling various perturbations of Skyrmion fields and offers guarantees of robustness in a general sense. This, in turn, has implications for applications of the optical Skyrmion where their topological nature is exploited explicitly, and, in particular, provides an underpinning for the use of Skyrmions in optical communications and photonic computing.

Submitted 6 August, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.07581 [pdf, other]

LLMvsSmall Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model

Authors: Linmei Hu, Hongyu He, Duokang Wang, Ziwang Zhao, Yingxia Shao, Liqiang Nie

Abstract: Personality detection aims to detect one's personality traits underlying in social media posts. One challenge of this task is the scarcity of ground-truth personality traits which are collected from self-report questionnaires. Most existing methods learn post features directly by fine-tuning the pre-trained language models under the supervision of limited personality labels. This leads to inferior quality of post features and consequently affects the performance. In addition, they treat personality traits as one-hot classification labels, overlooking the semantic information within them. In this paper, we propose a large language model (LLM) based text augmentation enhanced personality detection model, which distills the LLM's knowledge to enhance the small model for personality detection, even when the LLM fails in this task. Specifically, we enable LLM to generate post analyses (augmentations) from the aspects of semantic, sentiment, and linguistic, which are critical for personality detection. By using contrastive learning to pull them together in the embedding space, the post encoder can better capture the psycho-linguistic information within the post representations, thus improving personality detection. Furthermore, we utilize the LLM to enrich the information of personality labels for enhancing the detection performance. Experimental results on the benchmark datasets demonstrate that our model outperforms the state-of-the-art methods on personality detection.

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.03128 [pdf, other]

Probing Light Inelastic Dark Matter from Direct Detection

Authors: Hong-Jian He, Yu-Chen Wang, Jiaming Zheng

Abstract: Different dark matter (DM) candidates could have different types of DM-lepton and/or DM-quark interactions. For direct detection experiments, this leads to diversity in the recoil spectra, where both DM-electron and DM-nucleus scatterings may contribute. Furthermore, kinematic effects such as those of the inelastic scattering can also play an important role in shaping the recoil spectra. In this work, we systematically study signatures of the light exothermic inelastic DM from the recoil spectra including both the DM-electron scattering and Migdal effect. Such inelastic DM has mass around (sub-)GeV scale and the DM mass-splitting ranges from 1keV to 30keV. We analyze the direct detection sensitivities to such light inelastic DM. For different inelastic DM masses and mass-splittings, we find that the DM-electron recoil and Migdal effect can contribute significantly and differently to the direct detection signatures. Hence, it is important to perform combined analysis to include both the DM-electron recoil and Migdal effect. We further demonstrate that this analysis has strong impacts on the cosmological and laboratory bounds for the inelastic DM.

Submitted 11 March, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: 30 pages, refined version, references added

arXiv:2402.17555 [pdf, other]

Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label

Authors: Xinliang Zhang, Lei Zhu, Hangzhou He, Lujia Jin, Yanye Lu

Abstract: Scribble-based weakly-supervised semantic segmentation using sparse scribble supervision is gaining traction as it reduces annotation costs when compared to fully annotated alternatives. Existing methods primarily generate pseudo-labels by diffusing labeled pixels to unlabeled ones with local cues for supervision. However, this diffusion process fails to exploit global semantics and class-specific cues, which are important for semantic segmentation. In this study, we propose a class-driven scribble promotion network, which utilizes both scribble annotations and pseudo-labels informed by image-level classes and global semantics for supervision. Directly adopting pseudo-labels might misguide the segmentation model, thus we design a localization rectification module to correct foreground representations in the feature space. To further combine the advantages of both supervisions, we also introduce a distance entropy loss for uncertainty reduction, which adapts per-pixel confidence weights according to the reliable region determined by the scribble and pseudo-label's boundary. Experiments on the ScribbleSup dataset with different qualities of scribble annotations outperform all the previous methods, demonstrating the superiority and robustness of our method. The code is available at https://github.com/Zxl19990529/Class-driven-Scribble-Promotion-Network.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.17352 [pdf, other]

Search for neutrino emission from the Cygnus Bubble based on LHAASO $γ$-ray observations

Authors: Wenlian Li, Tian-Qi Huang, Donglian Xu, Huihai He

Abstract: The Cygnus region, which contains massive molecular and atomic clouds and young stars, is a promising Galactic neutrino source candidate. Cosmic rays transport in the region can produce neutrinos and $γ$-rays. Recently, the Large High Altitude Air Shower Observatory (LHAASO) detected an ultrahigh-energy $γ$-ray bubble (Cygnus Bubble) in this region. Using publicly available track events detected by the IceCube Neutrino Observatory in 7 years of full detector operation, we conduct searches for correlated neutrino signals from the Cygnus Bubble with neutrino emission templates based on LHAASO $γ$-ray observations. No significant signals were found for any employed templates. With the 7 TeV $γ$-ray flux template, we set a flux upper limit of 90% confidence level (C.L.) for the neutrino emission from the Cygnus Bubble to be $5.7\times10^{-13}\, \mathrm{TeV}^{-1}\mathrm{cm}^{-2}\mathrm{s}^{-1}$ at 5 TeV.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.16901 [pdf, other]

FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics

Authors: ChenRui Duan, Zelin Zang, Yongjie Xu, Hang He, Zihan Liu, Zijia Song, Ju-Sheng Zheng, Stan Z. Li

Abstract: Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer representations, limiting the capture of structurally relevant gene contexts. To address these limitations and further our understanding of complex relationships between metagenomic sequences and their functions, we introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene group-level pre-training, providing insights into inter-gene contextual information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for gene-level pre-training to model gene sequence-function relationships. MGM and TEM-CL constitute our novel metagenomic language model FGBERT, pre-trained on 100 million metagenomic sequences. We demonstrate the superiority of our proposed FGBERT on eight datasets.

Submitted 24 February, 2024; originally announced February 2024.

arXiv:2402.16557 [pdf, other]

A randomized algorithm for simultaneously diagonalizing symmetric matrices by congruence

Authors: Haoze He, Daniel Kressner

Abstract: A family of symmetric matrices $A_1,\ldots, A_d$ is SDC (simultaneous diagonalization by congruence, also called non-orthogonal joint diagonalization) if there is an invertible matrix $X$ such that every $X^T A_k X$ is diagonal. In this work, a novel randomized SDC (RSDC) algorithm is proposed that reduces SDC to a generalized eigenvalue problem by considering two (random) linear combinations of the family. We establish exact recovery: RSDC achieves diagonalization with probability $1$ if the family is exactly SDC. Under a mild regularity assumption, robust recovery is also established: Given a family that is $ε$-close to SDC then RSDC diagonalizes, with high probability, the family up to an error of norm $\mathcal{O}(ε)$. Under a positive definiteness assumption, which often holds in applications, stronger results are established, including a bound on the condition number of the transformation matrix. For practical use, we suggest to combine RSDC with an optimization algorithm. The performance of the resulting method is verified for synthetic data, image separation and EEG analysis tasks. It turns out that our newly developed method outperforms existing optimization-based methods in terms of efficiency while achieving a comparable level of accuracy.

Submitted 15 August, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

MSC Class: 65F15; 65F30; 68W20; 15A22; 15A27

arXiv:2402.15764 [pdf, other]

Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models

Authors: Haoran Liao, Jidong Tian, Shaohua Hu, Hao He, Yaohui Jin

Abstract: Large language models (LLMs) still grapple with complex tasks like mathematical reasoning. Despite significant efforts invested in improving prefix prompts or reasoning process, the crucial role of problem context might have been neglected. Accurate recognition of inputs is fundamental for solving mathematical tasks, as ill-formed problems could potentially mislead LLM's reasoning. In this study, we propose a new approach named Problem Elaboration Prompting (PEP) to enhance the mathematical capacities of LLMs. Specifically, PEP decomposes and elucidates the problem context before reasoning, therefore enhancing the context modeling and parsing efficiency. Experiments across datasets and models demonstrate promising performances: (1) PEP demonstrates an overall enhancement in various mathematical tasks. For instance, with the GPT-3.5 model, PEP exhibits improvements of 9.93% and 8.80% on GSM8k through greedy decoding and self-consistency, respectively. (2) PEP can be easily implemented and integrated with other prompting methods. (3) PEP shows particular strength in handling distraction problems.

Submitted 26 March, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

arXiv:2402.15152 [pdf, other]

On the Duality Between Sharpness-Aware Minimization and Adversarial Training

Authors: Yihao Zhang, Hangzhou He, Jingyu Zhu, Huanran Chen, Yifei Wang, Zeming Wei

Abstract: Adversarial Training (AT), which adversarially perturbs the input samples during training, has been acknowledged as one of the most effective defenses against adversarial attacks, yet suffers from inevitably decreased clean accuracy. Instead of perturbing the samples, Sharpness-Aware Minimization (SAM) perturbs the model weights during training to find a flatter loss landscape and improve generalization. However, as SAM is designed for better clean accuracy, its effectiveness in enhancing adversarial robustness remains unexplored. In this work, considering the duality between SAM and AT, we investigate the adversarial robustness derived from SAM. Intriguingly, we find that using SAM alone can improve adversarial robustness. To understand this unexpected property of SAM, we first provide empirical and theoretical insights into how SAM can implicitly learn more robust features, and conduct comprehensive experiments to show that SAM can improve adversarial robustness notably without sacrificing any clean accuracy, shedding light on the potential of SAM to be a substitute for AT when accuracy comes at a higher priority. Code is available at https://github.com/weizeming/SAM_AT.

Submitted 5 June, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: ICML 2024

arXiv:2402.15134 [pdf, other]

Deep Coupling Network For Multivariate Time Series Forecasting

Authors: Kun Yi, Qi Zhang, Hui He, Kaize Shi, Liang Hu, Ning An, Zhendong Niu

Abstract: Multivariate time series (MTS) forecasting is crucial in many real-world applications. To achieve accurate MTS forecasting, it is essential to simultaneously consider both intra- and inter-series relationships among time series data. However, previous work has typically modeled intra- and inter-series relationships separately and has disregarded multi-order interactions present within and between time series data, which can seriously degrade forecasting accuracy. In this paper, we reexamine intra- and inter-series relationships from the perspective of mutual information and accordingly construct a comprehensive relationship learning mechanism tailored to simultaneously capture the intricate multi-order intra- and inter-series couplings. Based on the mechanism, we propose a novel deep coupling network for MTS forecasting, named DeepCN, which consists of a coupling mechanism dedicated to explicitly exploring the multi-order intra- and inter-series relationships among time series data concurrently, a coupled variable representation module aimed at encoding diverse variable patterns, and an inference module facilitating predictions through one forward step. Extensive experiments conducted on seven real-world datasets demonstrate that our proposed DeepCN achieves superior performance compared with the state-of-the-art baselines.

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.14407 [pdf, other]

Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning

Authors: Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, Xuelong Li

Abstract: Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast amount of human videos exist, capturing intricate tasks and interactions with the physical world. Promising prospects arise for utilizing actionless human videos for pre-training and transferring the knowledge to facilitate robot policy learning through limited robot demonstrations. In this paper, we introduce a novel framework that leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. We start by compressing both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning trained on a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches with superior generalization ability. Our project website is available at https://video-diff.github.io/.

Submitted 22 February, 2024; originally announced February 2024.

Comments: 21 pages

arXiv:2402.14036 [pdf, other]

Quantum Annealing and Graph Neural Networks for Solving TSP with QUBO

Authors: Haoqi He

Abstract: This paper explores the application of Quadratic Unconstrained Binary Optimization (QUBO) models in solving the Travelling Salesman Problem (TSP) through Quantum Annealing algorithms and Graph Neural Networks. Quantum Annealing (QA), a quantum-inspired optimization method that exploits quantum tunneling to escape local minima, is used to solve QUBO formulations of TSP instances on Coherent Ising Machines (CIMs). The paper also presents a novel approach where QUBO is employed as a loss function within a GNN architecture tailored for solving TSP efficiently. By leveraging GNN's capability to learn graph representations, this method finds approximate solutions to TSP with improved computational time compared to traditional exact solvers. The paper details how to construct a QUBO model for TSP by encoding city visits into binary variables and formulating constraints that guarantee valid tours. It further discusses the implementation of QUBO-based Quantum Annealing algorithm for TSP (QQA-TSP) and its feasibility demonstration using quantum simulation platforms. In addition, it introduces a Graph Neural Network solution for TSP (QGNN-TSP), which learns the underlying structure of the problem and produces competitive solutions via gradient descent over a QUBO-based loss function. The experimental results compare the performance of QQA-TSP against state-of-the-art classical solvers such as dynamic programming, Concorde, and Gurobi, while also presenting empirical outcomes from training and evaluating QGNN-TSP on various TSP datasets. The study highlights the promise of combining deep learning techniques with quantum-inspired optimization methods for solving NP-hard problems like TSP, suggesting future directions for enhancing GNN architectures and applying QUBO frameworks to more complex combinatorial optimization tasks.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.13529 [pdf]

doi 10.1155/2021/6638730

Multitier Service Migration Framework Based on Mobility Prediction in Mobile Edge Computing

Authors: Run Yang, Hui He, Weizhe Zhang

Abstract: Mobile edge computing (MEC) pushes computing resources to the edge of the network and distributes them at the edge of the mobile network. Offloading computing tasks to the edge instead of the cloud can reduce computing latency and backhaul load simultaneously. However, new challenges incurred by user mobility and limited coverage of MEC server service arise. Services should be dynamically migrated between multiple MEC servers to maintain service performance due to user movement. Tackling this problem is nontrivial because it is arduous to predict user movement, and service migration will generate service interruptions and redundant network traffic. Service interruption time must be minimized, and redundant network traffic should be reduced to ensure service quality. In this paper, container live migration technology based on prediction is studied, and an online prediction method based on map data that does not rely on prior knowledge such as user trajectories is proposed to address this challenge in terms of mobility prediction accuracy.

Submitted 20 February, 2024; originally announced February 2024.

Comments: 13 pages, 9 figures

Journal ref: Wireless Communications and Mobile Computing, 2021

arXiv:2402.13471 [pdf]

Thermal transport in a 2D amorphous material

Authors: Yuxi Wang, Xingxing Zhang, Wujuan Yan, Nianjie Liang, Haiyu He, Xinwei Tao, Ang Li, Fuwei Yang, Buxuan Li, Te-Huan Liu, Jia Zhu, Wu Zhou, Wei Wang, Lin Zhou, Bai Song

Abstract: Two-dimensional (2D) crystals proved revolutionary soon after graphene was discovered in 2004. However, 2D amorphous materials only became accessible in 2020 and remain largely unexplored. In particular, the thermophysical properties of amorphous materials are of great interest upon transition from 3D to 2D. Here, we probe thermal transport in 2D amorphous carbon. A cross-plane thermal conductivity ($κ$) down to 0.079 $\rm{Wm}^{-1}K^{-1}$ is measured for van der Waals stacked multilayers at room temperature, which is among the lowest reported to date. Meanwhile, an unexpectedly high in-plane $κ$ is obtained for freestanding monolayers which is a few times larger than what is predicted by conventional wisdom for 3D amorphous carbon with similar $\rm{sp}^{2}$ fraction. Our molecular dynamics simulations reveal the role of disorder and highlight the impact of dimensionality. Amorphous materials at the 2D limit open up new avenues for understanding and manipulating heat at the atomic scale.

Submitted 22 March, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.12749 [pdf]

Me LLaMA: Foundation Large Language Models for Medical Applications

Authors: Qianqian Xie, Qingyu Chen, Aokun Chen, Cheng Peng, Yan Hu, Fongci Lin, Xueqing Peng, Jimin Huang, Jeffrey Zhang, Vipina Keloth, Xinyu Zhou, Huan He, Lucila Ohno-Machado, Yonghui Wu, Hua Xu, Jiang Bian

Abstract: Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA have hinted at their potential to revolutionize medical applications, yet their application in clinical settings often reveals limitations due to a lack of specialized training on medical-specific data. In response to this challenge, this study introduces Me-LLaMA, a novel medical LLM family that includes foundation models - Me-LLaMA 13/70B, along with their chat-enhanced versions - Me-LLaMA 13/70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 using large medical datasets. Our methodology leverages a comprehensive domain-specific data suite, including a large-scale, continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) across six critical medical tasks with 12 datasets. Our extensive evaluation using the MIBE shows that Me-LLaMA models achieve overall better performance than existing open-source medical LLMs in zero-shot, few-shot and supervised learning abilities. With task-specific instruction tuning, Me-LLaMA models outperform ChatGPT on 7 out of 8 datasets and GPT-4 on 5 out of 8 datasets. In addition, we investigated the catastrophic forgetting problem, and our results show that Me-LLaMA models outperform other open-source medical LLMs in mitigating this issue. Me-LLaMA is one of the largest open-source medical foundation LLMs that use both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared to other open-source medical LLMs, rendering it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.

Submitted 11 April, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: 21 pages, 3 figures, 8 tables

arXiv:2402.12530 [pdf, other]

Parallel Structures in Pre-training Data Yield In-Context Learning

Authors: Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, He He

Abstract: Pre-trained language models (LMs) are capable of in-context learning (ICL): they can adapt to a task with only a few examples given in the prompt without any parameter update. However, it is unclear where this capability comes from as there is a stark distribution shift between pre-training text and ICL prompts. In this work, we study what patterns of the pre-training data contribute to ICL. We find that LMs' ICL ability depends on $\textit{parallel structures}$ in the pre-training data -- pairs of phrases following similar templates in the same context window. Specifically, we detect parallel structures by checking whether training on one phrase improves prediction of the other, and conduct ablation experiments to study their effect on ICL. We show that removing parallel structures in the pre-training data reduces LMs' ICL accuracy by 51% (vs 2% from random ablation). This drop persists even when excluding common patterns such as n-gram repetitions and long-range dependency, showing the diversity and generality of parallel structures. A closer look at the detected parallel structures indicates that they cover diverse linguistic tasks and span long distances in the data.

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.10071 [pdf, other]

Approximate Message Passing-Enhanced Graph Neural Network for OTFS Data Detection

Authors: Wenhao Zhuang, Yuyi Mao, Hengtao He, Lei Xie, Shenghui Song, Yao Ge, Zhi Ding

Abstract: Orthogonal time frequency space (OTFS) modulation has emerged as a promising solution to support high-mobility wireless communications, for which cost-effective data detectors are critical. Although graph neural network (GNN)-based data detectors can achieve decent detection accuracy at reasonable computational cost, they fail to best harness prior information of transmitted data. To further minimize the data detection error of OTFS systems, this letter develops an AMP-GNN-based detector, leveraging the approximate message passing (AMP) algorithm to iteratively improve the symbol estimates of a GNN. Given that the inter-Doppler interference (IDI) symbols incur substantial computational overhead to the constructed GNN, learning-based IDI approximation is implemented to sustain low detection complexity. Simulation results demonstrate a remarkable bit error rate (BER) performance achieved by the proposed AMP-GNN-based detector compared to existing baselines. Meanwhile, the proposed IDI approximation scheme avoids a large amount of computation with negligible BER degradation.

Submitted 14 April, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: 8 pages, 7 figures, and 3 tables. Part of this article was submitted to IEEE for possible publication

arXiv:2402.08994 [pdf, other]

CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding

Authors: Qiongyi Zhou, Changde Du, Shengpei Wang, Huiguang He

Abstract: The study of decoding visual neural information faces challenges in generalizing single-subject decoding models to multiple subjects, due to individual differences. Moreover, the limited availability of data from a single subject has a constraining impact on model performance. Although prior multi-subject decoding methods have made significant progress, they still suffer from several limitations, including difficulty in extracting global neural response features, linear scaling of model parameters with the number of subjects, and inadequate characterization of the relationship between neural responses of different subjects to various stimuli. To overcome these limitations, we propose a CLIP-guided Multi-sUbject visual neural information SEmantic Decoding (CLIP-MUSED) method. Our method consists of a Transformer-based feature extractor to effectively model global neural representations. It also incorporates learnable subject-specific tokens that facilitate the aggregation of multi-subject data without a linear increase of parameters. Additionally, we employ representational similarity analysis (RSA) to guide token representation learning based on the topological relationship of visual stimuli in the representation space of CLIP, enabling full characterization of the relationship between neural responses of different subjects under different stimuli. Finally, token representations are used for multi-subject semantic decoding. Our proposed method outperforms single-subject decoding methods and achieves state-of-the-art performance among the existing multi-subject methods on two fMRI datasets. Visualization results provide insights into the effectiveness of our proposed method. Code is available at https://github.com/CLIP-MUSED/CLIP-MUSED.

Submitted 14 February, 2024; originally announced February 2024.

Comments: Accepted by ICLR2024

arXiv:2402.01795 [pdf, other]

doi 10.1109/IV55156.2024.10588417

Few-Shot Scenario Testing for Autonomous Vehicles Based on Neighborhood Coverage and Similarity

Authors: Shu Li, Jingxuan Yang, Honglin He, Yi Zhang, Jianming Hu, Shuo Feng

Abstract: Testing and evaluating the safety performance of autonomous vehicles (AVs) is essential before large-scale deployment. Practically, the number of testing scenarios permissible for a specific AV is severely limited by tight constraints on testing budgets and time. Under such strictly limited numbers of tests, existing testing methods often lead to significant uncertainty or difficulty in quantifying evaluation results. In this paper, we formulate this problem, for the first time, as the "few-shot testing" (FST) problem and propose a systematic framework to address this challenge. To alleviate the considerable uncertainty inherent in a small testing scenario set, we frame the FST problem as an optimization problem and search for the testing scenario set based on neighborhood coverage and similarity. Specifically, under the guidance of better generalization ability of the testing scenario set on AVs, we dynamically adjust this set and the contribution of each testing scenario to the evaluation result based on coverage, leveraging the prior information of surrogate models (SMs). With certain hypotheses on SMs, a theoretical upper bound of evaluation error is established to verify the sufficiency of evaluation accuracy within the given limited number of tests. The experiment results on cut-in scenarios demonstrate a notable reduction in evaluation error and variance of our method compared to conventional testing methods, especially for situations with a strict limit on the number of scenarios.

Submitted 22 April, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

arXiv:2401.16534 [pdf, other]

Democratizing the Creation of Animatable Facial Avatars

Authors: Yilin Zhu, Dalton Omens, Haodi He, Ron Fedkiw

Abstract: In high-end visual effects pipelines, a customized (and expensive) light stage system is (typically) used to scan an actor in order to acquire both geometry and texture for various expressions. Aiming towards democratization, we propose a novel pipeline for obtaining geometry and texture as well as enough expression information to build a customized person-specific animation rig without using a light stage or any other high-end hardware (or manual cleanup). A key novel idea consists of warping real-world images to align with the geometry of a template avatar and subsequently projecting the warped image into the template avatar's texture; importantly, this allows us to leverage baked-in real-world lighting/texture information in order to create surrogate facial features (and bridge the domain gap) for the sake of geometry reconstruction. Not only can our method be used to obtain a neutral expression geometry and de-lit texture, but it can also be used to improve avatars after they have been imported into an animation system (noting that such imports tend to be lossy, while also hallucinating various features). Since a default animation rig will contain template expressions that do not correctly correspond to those of a particular individual, we use a Simon Says approach to capture various expressions and build a person-specific animation rig (that moves like they do). Our aforementioned warping/projection method has high enough efficacy to reconstruct geometry corresponding to each expression.

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.16476 [pdf, other]

Unraveling the Mystery of the Low CO-to-H$_2$ Conversion Factor in Starburst Galaxies: RADEX Modeling of the Antennae

Authors: Hao He, Christine D. Wilson, Jiayi Sun, Yu-Hsuan Teng, Erik Rosolowsky, Ashley R. Bemis

Abstract: CO emission has been widely used as a tracer of molecular gas mass. However, it is a long-standing issue to accurately constrain the CO-to-H$_2$ conversion factor ($α_{\mathrm{CO}}$) that converts CO luminosity to molecular gas mass, especially in starburst galaxies. We present the first resolved $α_{\mathrm{CO}}$ modeling results with multiple ALMA CO and $^{13}$CO transition observations at both giant molecular cloud (GMC) scale at 150 pc and kpc scale for one of the closest starburst mergers, the Antennae. By combining our CO modeling results and measurements of 350 GHz dust continuum, we find that most GMCs in the Antennae have $α_{\mathrm{CO}}$ values $\sim$4 times smaller than the commonly adopted Milky Way value (4.3). We find $α_{\mathrm{CO}}$ at GMC scales shows a strong dependence on CO intensity, $^{13}$CO/CO ratio and GMC velocity dispersion, which is consistent with various theoretical and simulation predictions. Specifically, we suggest that the $^{13}$CO/CO line ratio and the velocity dispersion can be used to infer $α_{\mathrm{CO}}$ in starburst regions. By applying our modeled $α_{\mathrm{CO}}$ in GMC analyses, we find that GMCs in the Antennae are less gravitationally bound than in normal spiral galaxies, which is more consistent with what is predicted by merger simulations. At kpc scale, we find that our modeled $α_{\mathrm{CO}}$ values are smaller than the modeled $α_{\mathrm{CO}}$ at GMC scale by 40%, which can be due to inclusion of a diffuse gas component with lower $α_{\mathrm{CO}}$ values. We find a similar correlation of $α_{\mathrm{CO}}$ and CO intensity at kpc scales to that at GMC scales.

Submitted 9 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: 22 pages and 16 figures in the main text; accepted to ApJ

arXiv:2401.14984 [pdf, ps, other]

Projection of Elliptic Orbits and Branching Laws

Authors: Hongyu He

Abstract: Let $G$ be a Lie group, and $H\subset G$ a closed subgroup. Let $π$ be an irreducible unitary representation of $G$. In this paper, we briefly discuss the orbit method and its application to the branching problem $π|_{H}$. We use the Gan-Gross-Prasad branching law for $(G, H)= ( U(p,q), U(p, q-1) )$ as an example to illustrate the relation between $\pro_{\f u(p, q-1)}^{\f u(p,q)} \mc O(λ)$ and the branching law of the discrete series $D_λ|_{U(p,q-1)}$ for $λ$ a regular elliptic element. We also discuss some results regarding branching laws and wave front sets. The presentation of this paper does not follow the historical timeline of development.

Submitted 26 January, 2024; originally announced January 2024.

arXiv:2401.14453 [pdf, other]

Hidden Gems on a Ring: Infant Massive Clusters and Their Formation Timeline Unveiled by ALMA, HST, and JWST in NGC 3351

Authors: Jiayi Sun, Hao He, Kyle Batschkun, Rebecca C. Levy, Kimberly Emig, M. Jimena Rodriguez, Hamid Hassani, Adam K. Leroy, Eva Schinnerer, Eve C. Ostriker, Christine D. Wilson, Alberto D. Bolatto, Elisabeth A. C. Mills, Erik Rosolowsky, Janice C. Lee, Daniel A. Dale, Kirsten L. Larson, David A. Thilker, Leonardo Ubeda, Bradley C. Whitmore, Thomas G. Williams, Ashley. T. Barnes, Frank Bigiel, Melanie Chevance, Simon C. O. Glover, et al. (16 additional authors not shown)

Abstract: We study young massive clusters (YMCs) in their embedded "infant" phase with $\sim0.\!^{\prime\prime}1$ ALMA, HST, and JWST observations targeting the central starburst ring in NGC 3351, a nearby Milky Way analog galaxy. Our new ALMA data reveal 18 bright and compact (sub-)millimeter continuum sources, of which 8 have counterparts in JWST images and only 6 have counterparts in HST images. Based on the ALMA continuum and molecular line data, as well as ancillary measurements for the HST and JWST counterparts, we identify 14 sources as infant star clusters with high stellar and/or gas masses (${\sim}10^5\;\mathrm{M_\odot}$), small radii (${\lesssim}\,5\;\mathrm{pc}$), large escape velocities ($6{-}10\;\mathrm{km/s}$), and short free-fall times ($0.5{-}1\;\mathrm{Myr}$). Their multiwavelength properties motivate us to divide them into four categories, likely corresponding to four evolutionary stages from starless clumps to exposed HII region-cluster complexes. Leveraging age estimates for HST-identified clusters in the same region, we infer an evolutionary timeline going from $\sim$1-2 Myr before cluster formation as starless clumps, to $\sim$4-6 Myr after as exposed HII region-cluster complexes. Finally, we show that the YMCs make up a substantial fraction of recent star formation across the ring, exhibit a non-uniform azimuthal distribution without a very coherent evolutionary trend along the ring, and are capable of driving large-scale gas outflows.

Submitted 10 April, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: 27 pages, 12 figures; ApJ accepted

arXiv:2401.14039 [pdf, other]

Threshold displacement energy map of Frenkel pair generation in $\rm Ga_2O_3$ from machine-learning-driven molecular dynamics simulations

Authors: Huan He, Junlei Zhao, Jesper Byggmästar, Ru He, Kai Nordlund, Chaohui He, Flyura Djurabekova

Abstract: $β$ phase gallium oxide ($β$-$\rm Ga_2O_3$) demonstrates tremendous potential for electronics applications and offers promising prospects for integration into future space systems with the necessity of high radiation resistance. Therefore, a comprehensive understanding of the threshold displacement energy (TDE) and the radiation-induced formation of Frenkel pairs (FPs) in this material is vital but has not yet been thoroughly studied. In this work, we performed over 5,000 molecular dynamics simulations using our machine-learning potentials to determine the TDE and investigate the formation of FPs. The average TDEs for the two Ga sites, Ga1 (tetrahedral site) and Ga2 (octahedral site), are 22.9 and 20.0 eV, respectively, while the average TDEs for the three O sites are nearly uniform, ranging from 17.0 to 17.4 eV. The generated TDE maps reveal significant differences in displacement behavior between these five atomic sites. Our developed defect identification methods successfully categorize various types of FPs in this material, with more than ten types of Ga FPs being produced during our simulations. O atoms are found to form two main types of FPs, and the O split interstitial site on the O1 site is the most common. Finally, the recombination behavior and barriers of Ga and O FPs indicate that the O FP has a higher possibility of recovery upon annealing. Our findings provide important insights into the studies of radiation damage and defects in $\rm Ga_2O_3$ and can contribute to the design and development of $\rm Ga_2O_3$-based devices.

Submitted 28 February, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

arXiv:2401.13986 [pdf, other]

Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning

Authors: Yanda Chen, Chandan Singh, Xiaodong Liu, Simiao Zuo, Bin Yu, He He, Jianfeng Gao

Abstract: Large language models (LLMs) often generate convincing, fluent explanations. However, different from humans, they often generate inconsistent explanations on different inputs. For example, an LLM may generate the explanation "all birds can fly" when answering the question "Can sparrows fly?" but meanwhile answer "no" to the related question "Can penguins fly?". Explanations should be consistent across related examples so that they allow a human to simulate the LLM's decision process on multiple examples. We propose explanation-consistency finetuning (EC-finetuning), a method that adapts LLMs to generate more consistent natural-language explanations on related examples. EC-finetuning involves finetuning LLMs on synthetic data that is carefully constructed to contain consistent explanations. Across a variety of question-answering datasets in various domains, EC-finetuning yields a 10.0% relative explanation consistency improvement on four finetuning datasets, and generalizes to seven out-of-distribution datasets not seen during finetuning (+4.5% relative). Code is available at https://github.com/yandachen/explanation-consistency-finetuning .

Submitted 25 January, 2024; originally announced January 2024.

Comments: arXiv admin note: text overlap with arXiv:2307.08678

arXiv:2401.13919 [pdf, other]

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Authors: Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu

Abstract: The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents.

Submitted 6 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted to ACL 2024 (main). Code and data are released at https://github.com/MinorJerry/WebVoyager

arXiv:2401.11566 [pdf, other]

A detectable ultra-high-energy cosmic ray outburst from GRB 221009A

Authors: Hao-Ning He, B. Thoedore Zhang, Yi-Zhong Fan

Abstract: Gamma-ray bursts (GRBs) have been proposed as one of the promising sources of ultra-high-energy cosmic rays (UHECRs), but observational evidence is still lacking. The nearby B.O.A.T. (brightest of all time) GRB 221009A, a once-in-1000-year event, is able to accelerate protons to $\sim 10^{3}$ EeV. Protons arriving at the Milky Way are dominated by neutron-decay-induced protons. The inter-galactic magnetic fields would not yield a sizable delay of the $\geq 10{\rm~EeV}$ cosmic rays if their strength is $\lesssim 10^{-13}{\rm~G}$, while Galactic magnetic fields would cause a significant time delay. We predict that a UHECR burst from GRB 221009A would be detectable by the Pierre Auger Observatory and the TA$\times$4 within $\sim$ 10 years. The detection of such a UHECR outburst will provide direct evidence for UHECR acceleration in GRBs.

Submitted 21 January, 2024; originally announced January 2024.

arXiv:2401.11155 [pdf, other]

Deep Learning-Based Adaptive Joint Source-Channel Coding using Hypernetworks

Authors: Songjie Xie, Hengtao He, Hongru Li, Shenghui Song, Jun Zhang, Ying-Jun Angela Zhang, Khaled B. Letaief

Abstract: Deep learning-based joint source-channel coding (DJSCC) is expected to be a key technique for the next-generation wireless networks. However, the existing DJSCC schemes still face the challenge of channel adaptability as they are typically trained under specific channel conditions. In this paper, we propose a generic framework for channel-adaptive DJSCC by utilizing hypernetworks. To tailor the hypernetwork-based framework for communication systems, we propose a memory-efficient hypernetwork parameterization and then develop a channel-adaptive DJSCC network, named Hyper-AJSCC. Compared with existing adaptive DJSCC based on the attention mechanism, Hyper-AJSCC introduces far fewer parameters and can be seamlessly combined with various existing DJSCC networks without any substantial modifications to their neural network architecture. Extensive experiments demonstrate the better adaptability to channel conditions and higher memory efficiency of Hyper-AJSCC compared with state-of-the-art baselines.

Submitted 20 January, 2024; originally announced January 2024.

arXiv:2401.07080 [pdf, other]

GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching

Authors: Haibin He, Maoyuan Ye, Jing Zhang, Juhua Liu, Dacheng Tao

Abstract: Beyond the text detection and recognition tasks in image text spotting, video text spotting presents an augmented challenge with the inclusion of tracking. While advanced end-to-end trainable methods have shown commendable performance, the pursuit of multi-task optimization may pose the risk of producing sub-optimal outcomes for individual tasks. In this paper, we highlight a main bottleneck in the state-of-the-art video text spotter: the limited recognition capability. In response to this issue, we propose to efficiently turn an off-the-shelf query-based image text spotter into a specialist on video and present a simple baseline termed GoMatching, which focuses the training efforts on tracking while maintaining strong recognition performance. To adapt the image text spotter to video datasets, we add a rescoring head to rescore each detected instance's confidence via efficient tuning, leading to a better tracking candidate pool. Additionally, we design a long-short term matching module, termed LST-Matcher, to enhance the spotter's tracking capability by integrating both long- and short-term matching results via Transformer. Based on the above simple designs, GoMatching achieves impressive performance on two public benchmarks, e.g., setting a new record on the ICDAR15-video dataset, and one novel test set with arbitrary-shaped text, while saving considerable training budgets. The code will be released at https://github.com/Hxyz-123/GoMatching.

Submitted 13 January, 2024; originally announced January 2024.

arXiv:2401.06340 [pdf, other]

A Temporal-Spectral Fusion Transformer with Subject-Specific Adapter for Enhancing RSVP-BCI Decoding

Authors: Xujin Li, Wei Wei, Shuang Qiu, Huiguang He

Abstract: The Rapid Serial Visual Presentation (RSVP)-based Brain-Computer Interface (BCI) is an efficient technology for target retrieval using electroencephalography (EEG) signals. The performance improvement of traditional decoding methods relies on a substantial amount of training data from new test subjects, which increases preparation time for BCI systems. Several studies introduce data from existing subjects to reduce the dependence of performance improvement on data from new subjects, but their optimization strategy based on adversarial learning with extensive data increases training time during the preparation procedure. Moreover, most previous methods only focus on the single-view information of EEG signals, but ignore the information from other views which may further improve performance. To enhance decoding performance while reducing preparation time, we propose a Temporal-Spectral fusion transformer with Subject-specific Adapter (TSformer-SA). Specifically, a cross-view interaction module is proposed to facilitate information transfer and extract common representations across two-view features extracted from EEG temporal signals and spectrogram images. Then, an attention-based fusion module fuses the features of two views to obtain comprehensive discriminative features for classification. Furthermore, a multi-view consistency loss is proposed to maximize the feature similarity between two views of the same EEG signal. Finally, we propose a subject-specific adapter to rapidly transfer the knowledge of the model trained on data from existing subjects to decode data from new subjects. Experimental results show that TSformer-SA significantly outperforms comparison methods and achieves outstanding performance with limited training data from new subjects. This facilitates efficient decoding and rapid deployment of BCI systems in practical use.

Submitted 11 July, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: 19 pages, 10 figures

MSC Class: 68T07 ACM Class: I.5.4

arXiv:2401.04741 [pdf, other]

Masked AutoEncoder for Graph Clustering without Pre-defined Cluster Number k

Authors: Yuanchi Ma, Hui He, Zhongxiang Lei, Zhendong Niu

Abstract: Graph clustering algorithms with autoencoder structures have recently gained popularity due to their efficient performance and low training cost. However, existing graph autoencoder clustering algorithms based on GCN or GAT not only lack good generalization ability, but also make it difficult to determine the number of clusters automatically. To solve this problem, we propose a new framework called Graph Clustering with Masked Autoencoders (GCMA). It employs our designed fusion autoencoder based on the graph masking method for the fusion coding of graphs. It introduces our improved density-based clustering algorithm as a second decoder while decoding with multi-target reconstruction. By decoding the mask embedding, our model can capture more generalized and comprehensive knowledge. The number of clusters and clustering results can be output end-to-end while improving the generalization ability. Extensive experiments demonstrate the superiority of GCMA, as a nonparametric class method, over state-of-the-art baselines.

Submitted 9 January, 2024; originally announced January 2024.

arXiv:2401.04502 [pdf]

Observation of Higher Order Nodal Line Semimetal in Phononic Crystals

Authors: Qiyun Ma, Zhenhang Pu, Liping Ye, Jiuyang Lu, Xueqin Huang, Manzhu Ke, Hailong He, Weiyin Deng, Zhengyou Liu

Abstract: Higher-order topological insulators and semimetals, which generalize the conventional bulk-boundary correspondence, have attracted extensive research interest. Among them, higher-order Weyl semimetals feature two-fold linear crossing points in three-dimensional (3D) momentum space, 2D Fermi-arc surface states, and 1D hinge states. Higher-order nodal-point semimetals possessing Weyl points or Dirac points have been implemented. However, higher-order nodal-line or nodal-surface semimetals remain to be further explored in experiments in spite of many previous theoretical efforts. In this work, we realize a second-order nodal-line semimetal in 3D phononic crystals. The bulk nodal lines, 2D drumhead surface states guaranteed by Zak phases, and 1D flat hinge states attributed to kz-dependent quadrupole moments, are observed in simulations and experiments. Our findings of nondispersive surface and hinge states may promote applications in acoustic sensing and energy harvesting.

Submitted 12 January, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

Comments: accepted for publication in PRL

arXiv:2401.02038 [pdf, other]

Understanding LLMs: A Comprehensive Overview from Training to Inference

Authors: Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge

Abstract: The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. There's an increasing focus on cost-efficient training and deployment within this context. Low-cost training and deployment of LLMs represent the future development trend. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion on training includes various aspects, including data preprocessing, training architecture, pre-training tasks, parallel training, and relevant content related to model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization. It also explores LLMs' utilization and provides insights into their future development.

Submitted 5 January, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Comments: 30 pages, 6 figures

arXiv:2401.01113 [pdf, other]

CRB Minimization for RIS-aided mmWave Integrated Sensing and Communications

Authors: Wanting Lyu, Songjie Yang, Yue Xiu, Ya Li, Hongjun He, Chau Yuen, Zhongpei Zhang

Abstract: In this paper, reconfigurable intelligent surface (RIS) is employed in a millimeter wave (mmWave) integrated sensing and communications (ISAC) system. To alleviate the multi-hop attenuation, the semi-self sensing RIS approach is adopted, wherein sensors are configured at the RIS to receive the radar echo signal. Focusing on the estimation accuracy, the Cramer-Rao bound (CRB) for estimating the direction-of-the-angles is derived as the metric for sensing performance. A joint optimization problem on hybrid beamforming and RIS phaseshifts is proposed to minimize the CRB, while maintaining satisfactory communication performance evaluated by the achievable data rate. The CRB minimization problem is first transformed as a more tractable form based on Fisher information matrix (FIM). To solve the complex non-convex problem, a double layer loop algorithm is proposed based on penalty concave-convex procedure (penalty-CCCP) and block coordinate descent (BCD) method with two sub-problems. Successive convex approximation (SCA) algorithm and second order cone (SOC) constraints are employed to tackle the non-convexity in the hybrid beamforming optimization. To optimize the unit modulus constrained analog beamforming and phase shifts, manifold optimization (MO) is adopted. Finally, the numerical results verify the effectiveness of the proposed CRB minimization algorithm, and show the performance improvement compared with other baselines. Additionally, the proposed hybrid beamforming algorithm can achieve approximately 96% of the sensing performance exhibited by the full digital approach within only a limited number of radio frequency (RF) chains.

Submitted 2 January, 2024; originally announced January 2024.

Showing 101–150 of 1,330 results for author: He, H