Search | arXiv e-print repository

Video-Language Alignment Pre-training via Spatio-Temporal Graph Transformer

Authors: Shi-Xue Zhang, Hongfa Wang, Xiaobin Zhu, Weibo Gu, Tianjin Zhang, Chun Yang, Wei Liu, Xu-Cheng Yin

Abstract: Video-language alignment is a crucial multi-modal task that benefits various downstream applications, e.g., video-text retrieval and video question answering. Existing methods either utilize multi-modal information in video-text pairs or apply global and local alignment techniques to promote alignment precision. However, these methods often fail to fully explore the spatio-temporal relationships a… ▽ More Video-language alignment is a crucial multi-modal task that benefits various downstream applications, e.g., video-text retrieval and video question answering. Existing methods either utilize multi-modal information in video-text pairs or apply global and local alignment techniques to promote alignment precision. However, these methods often fail to fully explore the spatio-temporal relationships among vision tokens within video and across different video-text pairs. In this paper, we propose a novel Spatio-Temporal Graph Transformer module to uniformly learn spatial and temporal contexts for video-language alignment pre-training (dubbed STGT). Specifically, our STGT combines spatio-temporal graph structure information with attention in transformer block, effectively utilizing the spatio-temporal contexts. In this way, we can model the relationships between vision tokens, promoting video-text alignment precision for benefiting downstream tasks. In addition, we propose a self-similarity alignment loss to explore the inherent self-similarity in the video and text. With the initial optimization achieved by contrastive learning, it can further promote the alignment accuracy between video and text. Experimental results on challenging downstream tasks, including video-text retrieval and video question answering, verify the superior performance of our method. △ Less

Submitted 16 July, 2024; originally announced July 2024.

Comments: under review

arXiv:2407.01553 [pdf, other]

Fish-bone diagram of research issue: Gain a bird's-eye view on a specific research topic

Authors: JingHong Li, Huy Phan, Wen Gu, Koichi Ota, Shinobu Hasegawa

Abstract: Novice researchers often face difficulties in understanding a multitude of academic papers and grasping the fundamentals of a new research field. To solve such problems, the knowledge graph supporting research survey is gradually being developed. Existing keyword-based knowledge graphs make it difficult for researchers to deeply understand abstract concepts. Meanwhile, novice researchers may find… ▽ More Novice researchers often face difficulties in understanding a multitude of academic papers and grasping the fundamentals of a new research field. To solve such problems, the knowledge graph supporting research survey is gradually being developed. Existing keyword-based knowledge graphs make it difficult for researchers to deeply understand abstract concepts. Meanwhile, novice researchers may find it difficult to use ChatGPT effectively for research surveys due to their limited understanding of the research field. Without the ability to ask proficient questions that align with key concepts, obtaining desired and accurate answers from this large language model (LLM) could be inefficient. This study aims to help novice researchers by providing a fish-bone diagram that includes causal relationships, offering an overview of the research topic. The diagram is constructed using the issue ontology from academic papers, and it offers a broad, highly generalized perspective of the research field, based on relevance and logical factors. Furthermore, we evaluate the strengths and improvable points of the fish-bone diagram derived from this study's development pattern, emphasizing its potential as a viable tool for supporting research survey. △ Less

Submitted 10 July, 2024; v1 submitted 30 April, 2024; originally announced July 2024.

Comments: This paper has been accepted by IEEE SMC 2024

arXiv:2406.11904 [pdf, other]

Pay Attention to Weak Ties: A Heterogeneous Multiplex Representation Learning Framework for Link Prediction

Authors: Weiwei Gu, Linbi Lv, Gang Lu, Ruiqi Li

Abstract: Graph neural networks (GNNs) can learn effective node representations that significantly improve link prediction accuracy. However, most GNN-based link prediction algorithms are incompetent to predict weak ties connecting different communities. Most link prediction algorithms are designed for networks with only one type of relation between nodes but neglect the fact that many complex systems, incl… ▽ More Graph neural networks (GNNs) can learn effective node representations that significantly improve link prediction accuracy. However, most GNN-based link prediction algorithms are incompetent to predict weak ties connecting different communities. Most link prediction algorithms are designed for networks with only one type of relation between nodes but neglect the fact that many complex systems, including transportation and social networks, consisting of multi-modalities of interactions that correspond to different nature of interactions and dynamics that can be modeled as multiplex network, where different types of relation are represented in different layers. This paper proposes a Multi-Relations-aware Graph Neural Network (MRGNN) framework to learn effective node representations for multiplex networks and make more accurate link predictions, especially for weak ties. Specifically, our model utilizes an intra-layer node-level feature propagation process and an inter-layer representation merge process, which applies a simple yet effective logistic or semantic attention voting mechanism to adaptively aggregate information from different layers. Extensive experiments on four diversified multiplex networks show that MRGNN outperforms the state-of-the-art multiplex link prediction algorithms on overall prediction accuracy, and works pretty well on forecasting weak ties △ Less

Submitted 15 June, 2024; originally announced June 2024.

arXiv:2404.11946 [pdf, other]

doi 10.1109/TIV.2023.3338483

S4TP: Social-Suitable and Safety-Sensitive Trajectory Planning for Autonomous Vehicles

Authors: Xiao Wang, Ke Tang, Xingyuan Dai, Jintao Xu, Quancheng Du, Rui Ai, Yuxiao Wang, Weihao Gu

Abstract: In public roads, autonomous vehicles (AVs) face the challenge of frequent interactions with human-driven vehicles (HDVs), which render uncertain driving behavior due to varying social characteristics among humans. To effectively assess the risks prevailing in the vicinity of AVs in social interactive traffic scenarios and achieve safe autonomous driving, this article proposes a social-suitable and… ▽ More In public roads, autonomous vehicles (AVs) face the challenge of frequent interactions with human-driven vehicles (HDVs), which render uncertain driving behavior due to varying social characteristics among humans. To effectively assess the risks prevailing in the vicinity of AVs in social interactive traffic scenarios and achieve safe autonomous driving, this article proposes a social-suitable and safety-sensitive trajectory planning (S4TP) framework. Specifically, S4TP integrates the Social-Aware Trajectory Prediction (SATP) and Social-Aware Driving Risk Field (SADRF) modules. SATP utilizes Transformers to effectively encode the driving scene and incorporates an AV's planned trajectory during the prediction decoding process. SADRF assesses the expected surrounding risk degrees during AVs-HDVs interactions, each with different social characteristics, visualized as two-dimensional heat maps centered on the AV. SADRF models the driving intentions of the surrounding HDVs and predicts trajectories based on the representation of vehicular interactions. S4TP employs an optimization-based approach for motion planning, utilizing the predicted HDVs'trajectories as input. With the integration of SADRF, S4TP executes real-time online optimization of the planned trajectory of AV within lowrisk regions, thus improving the safety and the interpretability of the planned trajectory. We have conducted comprehensive tests of the proposed method using the SMARTS simulator. Experimental results in complex social scenarios, such as unprotected left turn intersections, merging, cruising, and overtaking, validate the superiority of our proposed S4TP in terms of safety and rationality. S4TP achieves a pass rate of 100% across all scenarios, surpassing the current state-of-the-art methods Fanta of 98.25% and Predictive-Decision of 94.75%. △ Less

Submitted 18 April, 2024; originally announced April 2024.

Comments: 12 pages,4 figures, published to IEEE Transactions on Intelligent Vehicles

arXiv:2404.00231 [pdf, ps, other]

Attention-based Shape-Deformation Networks for Artifact-Free Geometry Reconstruction of Lumbar Spine from MR Images

Authors: Linchen Qian, Jiasong Chen, Linhai Ma, Timur Urakov, Weiyong Gu, Liang Liang

Abstract: Lumbar disc degeneration, a progressive structural wear and tear of lumbar intervertebral disc, is regarded as an essential role on low back pain, a significant global health concern. Automated lumbar spine geometry reconstruction from MR images will enable fast measurement of medical parameters to evaluate the lumbar status, in order to determine a suitable treatment. Existing image segmentation-… ▽ More Lumbar disc degeneration, a progressive structural wear and tear of lumbar intervertebral disc, is regarded as an essential role on low back pain, a significant global health concern. Automated lumbar spine geometry reconstruction from MR images will enable fast measurement of medical parameters to evaluate the lumbar status, in order to determine a suitable treatment. Existing image segmentation-based techniques often generate erroneous segments or unstructured point clouds, unsuitable for medical parameter measurement. In this work, we present $\textit{UNet-DeformSA}$ and $\textit{TransDeformer}$: novel attention-based deep neural networks that reconstruct the geometry of the lumbar spine with high spatial accuracy and mesh correspondence across patients, and we also present a variant of $\textit{TransDeformer}$ for error estimation. Specially, we devise new attention modules with a new attention formula, which integrate image features and tokenized contour features to predict the displacements of the points on a shape template without the need for image segmentation. The deformed template reveals the lumbar spine geometry in an image. Experiment results show that our networks generate artifact-free geometry outputs, and the variant of $\textit{TransDeformer}$ can predict the errors of a reconstructed geometry. Our code is available at https://github.com/linchenq/TransDeformer-Mesh. △ Less

Submitted 30 April, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

arXiv:2403.19615 [pdf, other]

SA-GS: Scale-Adaptive Gaussian Splatting for Training-Free Anti-Aliasing

Authors: Xiaowei Song, Jv Zheng, Shiran Yuan, Huan-ang Gao, Jingwei Zhao, Xiang He, Weihao Gu, Hao Zhao

Abstract: In this paper, we present a Scale-adaptive method for Anti-aliasing Gaussian Splatting (SA-GS). While the state-of-the-art method Mip-Splatting needs modifying the training procedure of Gaussian splatting, our method functions at test-time and is training-free. Specifically, SA-GS can be applied to any pretrained Gaussian splatting field as a plugin to significantly improve the field's anti-alisin… ▽ More In this paper, we present a Scale-adaptive method for Anti-aliasing Gaussian Splatting (SA-GS). While the state-of-the-art method Mip-Splatting needs modifying the training procedure of Gaussian splatting, our method functions at test-time and is training-free. Specifically, SA-GS can be applied to any pretrained Gaussian splatting field as a plugin to significantly improve the field's anti-alising performance. The core technique is to apply 2D scale-adaptive filters to each Gaussian during test time. As pointed out by Mip-Splatting, observing Gaussians at different frequencies leads to mismatches between the Gaussian scales during training and testing. Mip-Splatting resolves this issue using 3D smoothing and 2D Mip filters, which are unfortunately not aware of testing frequency. In this work, we show that a 2D scale-adaptive filter that is informed of testing frequency can effectively match the Gaussian scale, thus making the Gaussian primitive distribution remain consistent across different testing frequencies. When scale inconsistency is eliminated, sampling rates smaller than the scene frequency result in conventional jaggedness, and we propose to integrate the projected 2D Gaussian within each pixel during testing. This integration is actually a limiting case of super-sampling, which significantly improves anti-aliasing performance over vanilla Gaussian Splatting. Through extensive experiments using various settings and both bounded and unbounded scenes, we show SA-GS performs comparably with or better than Mip-Splatting. Note that super-sampling and integration are only effective when our scale-adaptive filtering is activated. Our codes, data and models are available at https://github.com/zsy1987/SA-GS. △ Less

Submitted 28 March, 2024; originally announced March 2024.

Comments: Project page: https://kevinsong729.github.io/project-pages/SA-GS/ Code: https://github.com/zsy1987/SA-GS

arXiv:2403.18926 [pdf, other]

XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

Authors: Yuanhang Yang, Shiyi Qi, Wenchao Gu, Chaozheng Wang, Cuiyun Gao, Zenglin Xu

Abstract: Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency since a significant number of parameters are unnecessarily involved in computations via multiplying values by zero or low activation values. To address this issue, we present \tool, a novel MoE designed to… ▽ More Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency since a significant number of parameters are unnecessarily involved in computations via multiplying values by zero or low activation values. To address this issue, we present \tool, a novel MoE designed to enhance both the efficacy and efficiency of sparse MoE models. \tool leverages small experts and a threshold-based router to enable tokens to selectively engage only essential parameters. Our extensive experiments on language modeling and machine translation tasks demonstrate that \tool can enhance model performance while decreasing the computation load at MoE layers by over 50\% without sacrificing performance. Furthermore, we present the versatility of \tool by applying it to dense models, enabling sparse computation during inference. We provide a comprehensive analysis and make our code available at https://github.com/ysngki/XMoE. △ Less

Submitted 24 May, 2024; v1 submitted 27 February, 2024; originally announced March 2024.

Comments: ACL2024 Findings

arXiv:2403.18762 [pdf, other]

ModaLink: Unifying Modalities for Efficient Image-to-PointCloud Place Recognition

Authors: Weidong Xie, Lun Luo, Nanfei Ye, Yi Ren, Shaoyi Du, Minhang Wang, Jintao Xu, Rui Ai, Weihao Gu, Xieyuanli Chen

Abstract: Place recognition is an important task for robots and autonomous cars to localize themselves and close loops in pre-built maps. While single-modal sensor-based methods have shown satisfactory performance, cross-modal place recognition that retrieving images from a point-cloud database remains a challenging problem. Current cross-modal methods transform images into 3D points using depth estimation… ▽ More Place recognition is an important task for robots and autonomous cars to localize themselves and close loops in pre-built maps. While single-modal sensor-based methods have shown satisfactory performance, cross-modal place recognition that retrieving images from a point-cloud database remains a challenging problem. Current cross-modal methods transform images into 3D points using depth estimation for modality conversion, which are usually computationally intensive and need expensive labeled data for depth supervision. In this work, we introduce a fast and lightweight framework to encode images and point clouds into place-distinctive descriptors. We propose an effective Field of View (FoV) transformation module to convert point clouds into an analogous modality as images. This module eliminates the necessity for depth estimation and helps subsequent modules achieve real-time performance. We further design a non-negative factorization-based encoder to extract mutually consistent semantic features between point clouds and images. This encoder yields more distinctive global descriptors for retrieval. Experimental results on the KITTI dataset show that our proposed methods achieve state-of-the-art performance while running in real time. Additional evaluation on the HAOMO dataset covering a 17 km trajectory further shows the practical generalization capabilities. We have released the implementation of our methods as open source at: https://github.com/haomo-ai/ModaLink.git. △ Less

Submitted 27 March, 2024; originally announced March 2024.

Comments: 8 pages, 11 figures, conference

arXiv:2403.07566 [pdf, other]

An Improved Strategy for Blood Glucose Control Using Multi-Step Deep Reinforcement Learning

Authors: Weiwei Gu, Senquan Wang

Abstract: Blood Glucose (BG) control involves keeping an individual's BG within a healthy range through extracorporeal insulin injections is an important task for people with type 1 diabetes. However,traditional patient self-management is cumbersome and risky. Recent research has been devoted to exploring individualized and automated BG control approaches, among which Deep Reinforcement Learning (DRL) shows… ▽ More Blood Glucose (BG) control involves keeping an individual's BG within a healthy range through extracorporeal insulin injections is an important task for people with type 1 diabetes. However,traditional patient self-management is cumbersome and risky. Recent research has been devoted to exploring individualized and automated BG control approaches, among which Deep Reinforcement Learning (DRL) shows potential as an emerging approach. In this paper, we use an exponential decay model of drug concentration to convert the formalization of the BG control problem, which takes into account the delay and prolongedness of drug effects, from a PAE-POMDP (Prolonged Action Effect-Partially Observable Markov Decision Process) to a MDP, and we propose a novel multi-step DRL-based algorithm to solve the problem. The Prioritized Experience Replay (PER) sampling method is also used in it. Compared to single-step bootstrapped updates, multi-step learning is more efficient and reduces the influence from biasing targets. Our proposed method converges faster and achieves higher cumulative rewards compared to the benchmark in the same training environment, and improves the time-in-range (TIR), the percentage of time the patient's BG is within the target range, in the evaluation phase. Our work validates the effectiveness of multi-step reinforcement learning in BG control, which may help to explore the optimal glycemic control measure and improve the survival of diabetic patients. △ Less

Submitted 15 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.04282 [pdf, other]

doi 10.23919/JSC.2023.0018

Improving link prediction accuracy of network embedding algorithms via rich node attribute information

Authors: Weiwei Gu, Jinqiang Hou, Weiyi Gu

Abstract: Complex networks are widely used to represent an abundance of real-world relations ranging from social networks to brain networks. Inferring missing links or predicting future ones based on the currently observed network is known as the link prediction task.Recent network embedding based link prediction algorithms have demonstrated ground-breaking performance on link prediction accuracy. Those alg… ▽ More Complex networks are widely used to represent an abundance of real-world relations ranging from social networks to brain networks. Inferring missing links or predicting future ones based on the currently observed network is known as the link prediction task.Recent network embedding based link prediction algorithms have demonstrated ground-breaking performance on link prediction accuracy. Those algorithms usually apply node attributes as the initial feature input to accelerate the convergence speed during the training process. However, they do not take full advantage of node feature information. In this paper,besides applying feature attributes as the initial input, we make better utilization of node attribute information by building attributable networks and plugging attributable networks into some typical link prediction algorithms and naming this algorithm Attributive Graph Enhanced Embedding (AGEE). AGEE is able to automatically learn the weighting trades-off between the structure and the attributive networks. Numerical experiments show that AGEE can improve the link prediction accuracy by around 3% compared with link prediction framework SEAL, Variational Graph AutoEncoder (VGAE), and Node2vec. △ Less

Submitted 7 March, 2024; originally announced March 2024.

Journal ref: Journal of Social Computing, 2023, 4(4): 326-336

arXiv:2402.04854 [pdf, other]

Hierarchical Tree-structured Knowledge Graph For Academic Insight Survey

Authors: Jinghong Li, Huy Phan, Wen Gu, Koichi Ota, Shinobu Hasegawa

Abstract: Research surveys have always posed a challenge for beginner researchers who lack of research training. These researchers struggle to understand the directions within their research topic, and the discovery of new research findings within a short time. One way to provide intuitive assistance to beginner researchers is by offering relevant knowledge graphs(KG) and recommending related academic paper… ▽ More Research surveys have always posed a challenge for beginner researchers who lack of research training. These researchers struggle to understand the directions within their research topic, and the discovery of new research findings within a short time. One way to provide intuitive assistance to beginner researchers is by offering relevant knowledge graphs(KG) and recommending related academic papers. However, existing navigation knowledge graphs primarily rely on keywords in the research field and often fail to present the logical hierarchy among multiple related papers clearly. Moreover, most recommendation systems for academic papers simply rely on high text similarity, which can leave researchers confused as to why a particular article is being recommended. They may lack of grasp important information about the insight connection between "Issue resolved" and "Issue finding" that they hope to obtain. To address these issues, this study aims to support research insight surveys for beginner researchers by establishing a hierarchical tree-structured knowledge graph that reflects the inheritance insight of research topics and the relevance insight among the academic papers. △ Less

Submitted 4 July, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: This paper has been accepted by 'The 18TH International Conference on INnovations in Intelligent SysTems and Applications (INISTA 2024)'

arXiv:2402.01723 [pdf, other]

An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios

Authors: Zongjie Li, Wenying Qiu, Pingchuan Ma, Yichen Li, You Li, Sijia He, Baozheng Jiang, Shuai Wang, Weixi Gu

Abstract: Recent years have witnessed the rapid development of large language models (LLMs) in various domains. To better serve the large number of Chinese users, many commercial vendors in China have adopted localization strategies, training and providing local LLMs specifically customized for Chinese users. Furthermore, looking ahead, one of the key future applications of LLMs will be practical deployment… ▽ More Recent years have witnessed the rapid development of large language models (LLMs) in various domains. To better serve the large number of Chinese users, many commercial vendors in China have adopted localization strategies, training and providing local LLMs specifically customized for Chinese users. Furthermore, looking ahead, one of the key future applications of LLMs will be practical deployment in industrial production by enterprises and users in those sectors. However, the accuracy and robustness of LLMs in industrial scenarios have not been well studied. In this paper, we present a comprehensive empirical study on the accuracy and robustness of LLMs in the context of the Chinese industrial production area. We manually collected 1,200 domain-specific problems from 8 different industrial sectors to evaluate LLM accuracy. Furthermore, we designed a metamorphic testing framework containing four industrial-specific stability categories with eight abilities, totaling 13,631 questions with variants to evaluate LLM robustness. In total, we evaluated 9 different LLMs developed by Chinese vendors, as well as four different LLMs developed by global vendors. Our major findings include: (1) Current LLMs exhibit low accuracy in Chinese industrial contexts, with all LLMs scoring less than 0.6. (2) The robustness scores vary across industrial sectors, and local LLMs overall perform worse than global ones. (3) LLM robustness differs significantly across abilities. Global LLMs are more robust under logical-related variants, while advanced local LLMs perform better on problems related to understanding Chinese industrial terminology. Our study results provide valuable guidance for understanding and promoting the industrial domain capabilities of LLMs from both development and industrial enterprise perspectives. The results further motivate possible research directions and tooling support. △ Less

Submitted 26 January, 2024; originally announced February 2024.

arXiv:2401.14385 [pdf, other]

Entropic Quantum Central Limit Theorem and Quantum Inverse Sumset Theorem

Authors: Kaifeng Bu, Weichen Gu, Arthur Jaffe

Abstract: We establish an entropic, quantum central limit theorem and quantum inverse sumset theorem in discrete-variable quantum systems describing qudits or qubits. Both results are enabled by using our recently-discovered quantum convolution. We show that the exponential rate of convergence of the entropic central limit theorem is bounded by the magic gap. We also establish an ``quantum, entropic inverse… ▽ More We establish an entropic, quantum central limit theorem and quantum inverse sumset theorem in discrete-variable quantum systems describing qudits or qubits. Both results are enabled by using our recently-discovered quantum convolution. We show that the exponential rate of convergence of the entropic central limit theorem is bounded by the magic gap. We also establish an ``quantum, entropic inverse sumset theorem,'' by introducing a quantum doubling constant. Furthermore, we introduce a ``quantum Ruzsa divergence'', and we pose a conjecture called ``convolutional strong subaddivity,'' which leads to the triangle inequality for the quantum Ruzsa divergence. A byproduct of this work is a magic measure to quantify the nonstabilizer nature of a state, based on the quantum Ruzsa divergence. △ Less

Submitted 25 January, 2024; originally announced January 2024.

Comments: 23 pages

arXiv:2401.09627 [pdf]

SymTC: A Symbiotic Transformer-CNN Net for Instance Segmentation of Lumbar Spine MRI

Authors: Jiasong Chen, Linchen Qian, Linhai Ma, Timur Urakov, Weiyong Gu, Liang Liang

Abstract: Intervertebral disc disease, a prevalent ailment, frequently leads to intermittent or persistent low back pain, and diagnosing and assessing of this disease rely on accurate measurement of vertebral bone and intervertebral disc geometries from lumbar MR images. Deep neural network (DNN) models may assist clinicians with more efficient image segmentation of individual instances (disks and vertebrae… ▽ More Intervertebral disc disease, a prevalent ailment, frequently leads to intermittent or persistent low back pain, and diagnosing and assessing of this disease rely on accurate measurement of vertebral bone and intervertebral disc geometries from lumbar MR images. Deep neural network (DNN) models may assist clinicians with more efficient image segmentation of individual instances (disks and vertebrae) of the lumbar spine in an automated way, which is termed as instance image segmentation. In this work, we proposed SymTC, an innovative lumbar spine MR image segmentation model that combines the strengths of Transformer and Convolutional Neural Network (CNN). Specifically, we designed a parallel dual-path architecture to merge CNN layers and Transformer layers, and we integrated a novel position embedding into the self-attention module of Transformer, enhancing the utilization of positional information for more accurate segmentation. To further improves model performance, we introduced a new data augmentation technique to create synthetic yet realistic MR image dataset, named SSMSpine, which is made publicly available. We evaluated our SymTC and the other 15 existing image segmentation models on our private in-house dataset and the public SSMSpine dataset, using two metrics, Dice Similarity Coefficient and 95% Hausdorff Distance. The results show that our SymTC has the best performance for segmenting vertebral bones and intervertebral discs in lumbar spine MR images. The SymTC code and SSMSpine dataset are available at https://github.com/jiasongchen/SymTC. △ Less

Submitted 1 April, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.06175 [pdf, other]

MTAD: Tools and Benchmarks for Multivariate Time Series Anomaly Detection

Authors: Jinyang Liu, Wenwei Gu, Zhuangbin Chen, Yichen Li, Yuxin Su, Michael R. Lyu

Abstract: Key Performance Indicators (KPIs) are essential time-series metrics for ensuring the reliability and stability of many software systems. They faithfully record runtime states to facilitate the understanding of anomalous system behaviors and provide informative clues for engineers to pinpoint the root causes. The unprecedented scale and complexity of modern software systems, however, make the volum… ▽ More Key Performance Indicators (KPIs) are essential time-series metrics for ensuring the reliability and stability of many software systems. They faithfully record runtime states to facilitate the understanding of anomalous system behaviors and provide informative clues for engineers to pinpoint the root causes. The unprecedented scale and complexity of modern software systems, however, make the volume of KPIs explode. Consequently, many traditional methods of KPI anomaly detection become impractical, which serves as a catalyst for the fast development of machine learning-based solutions in both academia and industry. However, there is currently a lack of rigorous comparison among these KPI anomaly detection methods, and re-implementation demands a non-trivial effort. Moreover, we observe that different works adopt independent evaluation processes with different metrics. Some of them may not fully reveal the capability of a model and some are creating an illusion of progress. To better understand the characteristics of different KPI anomaly detectors and address the evaluation issue, in this paper, we provide a comprehensive review and evaluation of twelve state-of-the-art methods, and propose a novel metric called salience. Particularly, the selected methods include five traditional machine learning-based methods and seven deep learning-based methods. These methods are evaluated with five multivariate KPI datasets that are publicly available. A unified toolkit with easy-to-use interfaces is also released. We report the benchmark results in terms of accuracy, salience, efficiency, and delay, which are of practical importance for industrial deployment. We believe our work can contribute as a basis for future academic research and industrial application. △ Less

Submitted 10 January, 2024; originally announced January 2024.

Comments: The code and datasets are available at https://github.com/OpsPAI/MTAD

arXiv:2401.01918 [pdf, other]

Distilling Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection

Authors: Haowen Zheng, Dong Cao, Jintao Xu, Rui Ai, Weihao Gu, Yang Yang, Yanyan Liang

Abstract: Striking a balance between precision and efficiency presents a prominent challenge in the bird's-eye-view (BEV) 3D object detection. Although previous camera-based BEV methods achieved remarkable performance by incorporating long-term temporal information, most of them still face the problem of low efficiency. One potential solution is knowledge distillation. Existing distillation methods only foc… ▽ More Striking a balance between precision and efficiency presents a prominent challenge in the bird's-eye-view (BEV) 3D object detection. Although previous camera-based BEV methods achieved remarkable performance by incorporating long-term temporal information, most of them still face the problem of low efficiency. One potential solution is knowledge distillation. Existing distillation methods only focus on reconstructing spatial features, while overlooking temporal knowledge. To this end, we propose TempDistiller, a Temporal knowledge Distiller, to acquire long-term memory from a teacher detector when provided with a limited number of frames. Specifically, a reconstruction target is formulated by integrating long-term temporal knowledge through self-attention operation applied to feature teachers. Subsequently, novel features are generated for masked student features via a generator. Ultimately, we utilize this reconstruction target to reconstruct the student features. In addition, we also explore temporal relational knowledge when inputting full frames for the student model. We verify the effectiveness of the proposed method on the nuScenes benchmark. The experimental results show our method obtain an enhancement of +1.6 mAP and +1.1 NDS compared to the baseline, a speed improvement of approximately 6 FPS after compressing temporal knowledge, and the most accurate velocity estimation. △ Less

Submitted 8 January, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

arXiv:2312.13219 [pdf, other]

Interactive Visual Task Learning for Robots

Authors: Weiwei Gu, Anant Sah, Nakul Gopalan

Abstract: We present a framework for robots to learn novel visual concepts and tasks via in-situ linguistic interactions with human users. Previous approaches have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts along with their attributes and representations to a concept hierarchy. We extend the approaches that focus on learning visual concept hierarchi… ▽ More We present a framework for robots to learn novel visual concepts and tasks via in-situ linguistic interactions with human users. Previous approaches have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts along with their attributes and representations to a concept hierarchy. We extend the approaches that focus on learning visual concept hierarchies by enabling them to learn novel concepts and solve unseen robotics tasks with them. To enable a visual concept learner to solve robotics tasks one-shot, we developed two distinct techniques. Firstly, we propose a novel approach, Hi-Viscont(HIerarchical VISual CONcept learner for Task), which augments information of a novel concept to its parent nodes within a concept hierarchy. This information propagation allows all concepts in a hierarchy to update as novel concepts are taught in a continual learning setting. Secondly, we represent a visual task as a scene graph with language annotations, allowing us to create novel permutations of a demonstrated task zero-shot in-situ. We present two sets of results. Firstly, we compare Hi-Viscont with the baseline model (FALCON) on visual question answering(VQA) in three domains. While being comparable to the baseline model on leaf level concepts, Hi-Viscont achieves an improvement of over 9% on non-leaf concepts on average. We compare our model's performance against the baseline FALCON model. Our framework achieves 33% improvements in success rate metric, and 19% improvements in the object level accuracy compared to the baseline model. With both of these results we demonstrate the ability of our model to learn tasks and concepts in a continual learning setting on the robot. △ Less

Submitted 20 December, 2023; originally announced December 2023.

Comments: In Proceedings of The 38th Annual AAAI Conference on Artificial Intelligence

arXiv:2312.09038 [pdf, other]

Object Recognition from Scientific Document based on Compartment Refinement Framework

Authors: Jinghong Li, Wen Gu, Koichi Ota, Shinobu Hasegawa

Abstract: With the rapid development of the internet in the past decade, it has become increasingly important to extract valuable information from vast resources efficiently, which is crucial for establishing a comprehensive digital ecosystem, particularly in the context of research surveys and comprehension. The foundation of these tasks focuses on accurate extraction and deep mining of data from scientifi… ▽ More With the rapid development of the internet in the past decade, it has become increasingly important to extract valuable information from vast resources efficiently, which is crucial for establishing a comprehensive digital ecosystem, particularly in the context of research surveys and comprehension. The foundation of these tasks focuses on accurate extraction and deep mining of data from scientific documents, which are essential for building a robust data infrastructure. However, parsing raw data or extracting data from complex scientific documents have been ongoing challenges. Current data extraction methods for scientific documents typically use rule-based (RB) or machine learning (ML) approaches. However, using rule-based methods can incur high coding costs for articles with intricate typesetting. Conversely, relying solely on machine learning methods necessitates annotation work for complex content types within the scientific document, which can be costly. Additionally, few studies have thoroughly defined and explored the hierarchical layout within scientific documents. The lack of a comprehensive definition of the internal structure and elements of the documents indirectly impacts the accuracy of text classification and object recognition tasks. From the perspective of analyzing the standard layout and typesetting used in the specified publication, we propose a new document layout analysis framework called CTBR(Compartment & Text Blocks Refinement). Firstly, we define scientific documents into hierarchical divisions: base domain, compartment, and text blocks. Next, we conduct an in-depth exploration and classification of the meanings of text blocks. Finally, we utilize the results of text block classification to implement object recognition within scientific documents based on rule-based compartment segmentation. △ Less

Submitted 4 July, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: arXiv admin note: text overlap with arXiv:2305.17401

arXiv:2311.17663 [pdf, other]

Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

Authors: Junyi Ma, Xieyuanli Chen, Jiawei Huang, Jingyi Xu, Zhen Luo, Jintao Xu, Weihao Gu, Rui Ai, Hesheng Wang

Abstract: Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving applications. Recent occupancy estimation techniques using only camera images as input can provide dense occupancy representations of large-scale scenes based on the current observation. However, they are mostly limited to representing the current 3D space and… ▽ More Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving applications. Recent occupancy estimation techniques using only camera images as input can provide dense occupancy representations of large-scale scenes based on the current observation. However, they are mostly limited to representing the current 3D space and do not consider the future state of surrounding objects along the time axis. To extend camera-only occupancy estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting, evaluating the surrounding scene changes in a near future. We build our benchmark based on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, which provides sequential occupancy states of general movable and static objects, as well as their 3D backward centripetal flow. To establish this benchmark for future research with comprehensive comparisons, we introduce four baseline types from diverse camera-based perception and prediction implementations, including a static-world occupancy model, voxelization of point cloud prediction, 2D-3D instance-based prediction, and our proposed novel end-to-end 4D occupancy forecasting network. Furthermore, the standardized evaluation protocol for preset multiple tasks is also provided to compare the performance of all the proposed baselines on present and future occupancy estimation with respect to objects of interest in autonomous driving scenarios. The dataset and our implementation of all four baselines in the proposed Cam4DOcc benchmark will be released here: https://github.com/haomo-ai/Cam4DOcc. △ Less

Submitted 7 December, 2023; v1 submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.14514 [pdf, other]

FRAD: Front-Running Attacks Detection on Ethereum using Ternary Classification Model

Authors: Yuheng Zhang, Pin Liu, Guojun Wang, Peiqiang Li, Wanyi Gu, Houji Chen, Xuelei Liu, Jinyao Zhu

Abstract: With the evolution of blockchain technology, the issue of transaction security, particularly on platforms like Ethereum, has become increasingly critical. Front-running attacks, a unique form of security threat, pose significant challenges to the integrity of blockchain transactions. In these attack scenarios, malicious actors monitor other users' transaction activities, then strategically submit… ▽ More With the evolution of blockchain technology, the issue of transaction security, particularly on platforms like Ethereum, has become increasingly critical. Front-running attacks, a unique form of security threat, pose significant challenges to the integrity of blockchain transactions. In these attack scenarios, malicious actors monitor other users' transaction activities, then strategically submit their own transactions with higher fees. This ensures their transactions are executed before the monitored transactions are included in the block. The primary objective of this paper is to delve into a comprehensive classification of transactions associated with front-running attacks, which aims to equip developers with specific strategies to counter each type of attack. To achieve this, we introduce a novel detection method named FRAD (Front-Running Attacks Detection on Ethereum using Ternary Classification Model). This method is specifically tailored for transactions within decentralized applications (DApps) on Ethereum, enabling accurate classification of front-running attacks involving transaction displacement, insertion, and suppression. Our experimental validation reveals that the Multilayer Perceptron (MLP) classifier offers the best performance in detecting front-running attacks, achieving an impressive accuracy rate of 84.59% and F1-score of 84.60%. △ Less

Submitted 24 November, 2023; originally announced November 2023.

Comments: 14 pages, 8 figures

arXiv:2310.02609 [pdf, other]

RLTrace: Synthesizing High-Quality System Call Traces for OS Fuzz Testing

Authors: Wei Chen, Huaijin Wang, Weixi Gu, Shuai Wang

Abstract: Securing operating system (OS) kernel is one central challenge in today's cyber security landscape. The cutting-edge testing technique of OS kernel is software fuzz testing. By mutating the program inputs with random variations for iterations, fuzz testing aims to trigger program crashes and hangs caused by potential bugs that can be abused by the inputs. To achieve high OS code coverage, the de f… ▽ More Securing operating system (OS) kernel is one central challenge in today's cyber security landscape. The cutting-edge testing technique of OS kernel is software fuzz testing. By mutating the program inputs with random variations for iterations, fuzz testing aims to trigger program crashes and hangs caused by potential bugs that can be abused by the inputs. To achieve high OS code coverage, the de facto OS fuzzer typically composes system call traces as the input seed to mutate and to interact with OS kernels. Hence, quality and diversity of the employed system call traces become the prominent factor to decide the effectiveness of OS fuzzing. However, these system call traces to date are generated with hand-coded rules, or by analyzing system call logs of OS utility programs. Our observation shows that such system call traces can only subsume common usage scenarios of OS system calls, and likely omit hidden bugs. In this research, we propose a deep reinforcement learning-based solution, called RLTrace, to synthesize diverse and comprehensive system call traces as the seed to fuzz OS kernels. During model training, the deep learning model interacts with OS kernels and infers optimal system call traces w.r.t. our learning goal -- maximizing kernel code coverage. Our evaluation shows that RLTrace outperforms other seed generators by producing more comprehensive system call traces, subsuming system call corner usage cases and subtle dependencies. By feeding the de facto OS fuzzer, SYZKALLER, with system call traces synthesized by RLTrace, we show that SYZKALLER can achieve higher code coverage for testing Linux kernels. Furthermore, RLTrace found one vulnerability in the Linux kernel (version 5.5-rc6), which is publicly unknown to the best of our knowledge by the time of writing. △ Less

Submitted 4 October, 2023; originally announced October 2023.

Comments: Information Security Conference 2023

arXiv:2308.00624 [pdf, other]

JIANG: Chinese Open Foundation Language Model

Authors: Qinhua Duan, Wenchao Gu, Yujia Chen, Wenxin Mao, Zewen Tian, Hui Cao

Abstract: With the advancements in large language model technology, it has showcased capabilities that come close to those of human beings across various tasks. This achievement has garnered significant interest from companies and scientific research institutions, leading to substantial investments in the research and development of these models. While numerous large models have emerged during this period,… ▽ More With the advancements in large language model technology, it has showcased capabilities that come close to those of human beings across various tasks. This achievement has garnered significant interest from companies and scientific research institutions, leading to substantial investments in the research and development of these models. While numerous large models have emerged during this period, the majority of them have been trained primarily on English data. Although they exhibit decent performance in other languages, such as Chinese, their potential remains limited due to factors like vocabulary design and training corpus. Consequently, their ability to fully express their capabilities in Chinese falls short. To address this issue, we introduce the model named JIANG (Chinese pinyin of ginger) specifically designed for the Chinese language. We have gathered a substantial amount of Chinese corpus to train the model and have also optimized its structure. The extensive experimental results demonstrate the excellent performance of our model. △ Less

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2307.14041 [pdf, other]

GovernR: Provenance and Confidentiality Guarantees In Research Data Repositories

Authors: Anwitaman Datta, Chua Chiah Soon, Wangfan Gu

Abstract: We propose cryptographic protocols to incorporate time provenance guarantees while meeting confidentiality and controlled sharing needs for research data. We demonstrate the efficacy of these mechanisms by developing and benchmarking a practical tool, GovernR, which furthermore takes into usability issues and is compatible with a popular open-sourced research data storage platform, Dataverse. In d… ▽ More We propose cryptographic protocols to incorporate time provenance guarantees while meeting confidentiality and controlled sharing needs for research data. We demonstrate the efficacy of these mechanisms by developing and benchmarking a practical tool, GovernR, which furthermore takes into usability issues and is compatible with a popular open-sourced research data storage platform, Dataverse. In doing so, we identify and provide a solution addressing an important gap (though applicable to only niche use cases) in practical research data management. △ Less

Submitted 26 July, 2023; originally announced July 2023.

Comments: 2 Figures, 3 Tables

arXiv:2307.10869 [pdf, other]

Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection

Authors: Wenwei Gu, Jinyang Liu, Zhuangbin Chen, Jianping Zhang, Yuxin Su, Jiazhen Gu, Cong Feng, Zengyin Yang, Michael Lyu

Abstract: Performance issues permeate large-scale cloud service systems, which can lead to huge revenue losses. To ensure reliable performance, it's essential to accurately identify and localize these issues using service monitoring metrics. Given the complexity and scale of modern cloud systems, this task can be challenging and may require extensive expertise and resources beyond the capacity of individual… ▽ More Performance issues permeate large-scale cloud service systems, which can lead to huge revenue losses. To ensure reliable performance, it's essential to accurately identify and localize these issues using service monitoring metrics. Given the complexity and scale of modern cloud systems, this task can be challenging and may require extensive expertise and resources beyond the capacity of individual humans. Some existing methods tackle this problem by analyzing each metric independently to detect anomalies. However, this could incur overwhelming alert storms that are difficult for engineers to diagnose manually. To pursue better performance, not only the temporal patterns of metrics but also the correlation between metrics (i.e., relational patterns) should be considered, which can be formulated as a multivariate metrics anomaly detection problem. However, most of the studies fall short of extracting these two types of features explicitly. Moreover, there exist some unlabeled anomalies mixed in the training data, which may hinder the detection performance. To address these limitations, we propose the Relational- Temporal Anomaly Detection Model (RTAnomaly) that combines the relational and temporal information of metrics. RTAnomaly employs a graph attention layer to learn the dependencies among metrics, which will further help pinpoint the anomalous metrics that may cause the anomaly effectively. In addition, we exploit the concept of positive unlabeled learning to address the issue of potential anomalies in the training data. To evaluate our method, we conduct experiments on a public dataset and two industrial datasets. RTAnomaly outperforms all the baseline models by achieving an average F1 score of 0.929 and Hit@3 of 0.920, demonstrating its superiority. △ Less

Submitted 1 August, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

arXiv:2306.17797 [pdf, other]

HIDFlowNet: A Flow-Based Deep Network for Hyperspectral Image Denoising

Authors: Li Pang, Weizhen Gu, Xiangyong Cao, Xiangyu Rui, Jiangjun Peng, Shuang Xu, Gang Yang, Deyu Meng

Abstract: Hyperspectral image (HSI) denoising is essentially ill-posed since a noisy HSI can be degraded from multiple clean HSIs. However, current deep learning-based approaches ignore this fact and restore the clean image with deterministic mapping (i.e., the network receives a noisy HSI and outputs a clean HSI). To alleviate this issue, this paper proposes a flow-based HSI denoising network (HIDFlowNet)… ▽ More Hyperspectral image (HSI) denoising is essentially ill-posed since a noisy HSI can be degraded from multiple clean HSIs. However, current deep learning-based approaches ignore this fact and restore the clean image with deterministic mapping (i.e., the network receives a noisy HSI and outputs a clean HSI). To alleviate this issue, this paper proposes a flow-based HSI denoising network (HIDFlowNet) to directly learn the conditional distribution of the clean HSI given the noisy HSI and thus diverse clean HSIs can be sampled from the conditional distribution. Overall, our HIDFlowNet is induced from the flow methodology and contains an invertible decoder and a conditional encoder, which can fully decouple the learning of low-frequency and high-frequency information of HSI. Specifically, the invertible decoder is built by staking a succession of invertible conditional blocks (ICBs) to capture the local high-frequency details since the invertible network is information-lossless. The conditional encoder utilizes down-sampling operations to obtain low-resolution images and uses transformers to capture correlations over a long distance so that global low-frequency information can be effectively extracted. Extensive experimental results on simulated and real HSI datasets verify the superiority of our proposed HIDFlowNet compared with other state-of-the-art methods both quantitatively and visually. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: 10 pages, 8 figures

arXiv:2306.15369 [pdf, other]

A Meta-analytical Comparison of Naive Bayes and Random Forest for Software Defect Prediction

Authors: Ch Muhammad Awais, Wei Gu, Gcinizwe Dlamini, Zamira Kholmatova, Giancarlo Succi

Abstract: Is there a statistical difference between Naive Bayes and Random Forest in terms of recall, f-measure, and precision for predicting software defects? By utilizing systematic literature review and meta-analysis, we are answering this question. We conducted a systematic literature review by establishing criteria to search and choose papers, resulting in five studies. After that, using the meta-data… ▽ More Is there a statistical difference between Naive Bayes and Random Forest in terms of recall, f-measure, and precision for predicting software defects? By utilizing systematic literature review and meta-analysis, we are answering this question. We conducted a systematic literature review by establishing criteria to search and choose papers, resulting in five studies. After that, using the meta-data and forest-plots of five chosen papers, we conducted a meta-analysis to compare the two models. The results have shown that there is no significant statistical evidence that Naive Bayes perform differently from Random Forest in terms of recall, f-measure, and precision. △ Less

Submitted 27 June, 2023; originally announced June 2023.

Comments: 11 pages, 8 figures, Conference Paper

Journal ref: Intelligent Systems Design and Applications. ISDA 2022. Lecture Notes in Networks and Systems, vol 716

arXiv:2306.10200 [pdf, other]

Privacy-Enhancing Technologies for Financial Data Sharing

Authors: Panagiotis Chatzigiannis, Wanyun Catherine Gu, Srinivasan Raghuraman, Peter Rindal, Mahdi Zamani

Abstract: Today, financial institutions (FIs) store and share consumers' financial data for various reasons such as offering loans, processing payments, and protecting against fraud and financial crime. Such sharing of sensitive data have been subject to data breaches in the past decade. While some regulations (e.g., GDPR, FCRA, and CCPA) help to prevent institutions from freely sharing clients' sensitive… ▽ More Today, financial institutions (FIs) store and share consumers' financial data for various reasons such as offering loans, processing payments, and protecting against fraud and financial crime. Such sharing of sensitive data have been subject to data breaches in the past decade. While some regulations (e.g., GDPR, FCRA, and CCPA) help to prevent institutions from freely sharing clients' sensitive information, some regulations (e.g., BSA 1970) require FIs to share certain financial data with government agencies to combat financial crime. This creates an inherent tension between the privacy and the integrity of financial transactions. In the past decade, significant progress has been made in building efficient privacy-enhancing technologies that allow computer systems and networks to validate encrypted data automatically. In this paper, we investigate some of these technologies to identify the benefits and limitations of each, in particular, for use in data sharing among FIs. As a case study, we look into the emerging area of Central Bank Digital Currencies (CBDCs) and how privacy-enhancing technologies can be integrated into the CBDC architecture. Our study, however, is not limited to CBDCs and can be applied to other financial scenarios with tokenized bank deposits such as cross-border payments, real-time settlements, and card payments. △ Less

Submitted 16 June, 2023; originally announced June 2023.

arXiv:2306.09292 [pdf, other]

Stabilizer Testing and Magic Entropy

Authors: Kaifeng Bu, Weichen Gu, Arthur Jaffe

Abstract: We introduce systematic protocols to perform stabilizer testing for quantum states and gates. These protocols are based on quantum convolutions and swap-tests, realized by quantum circuits that implement the quantum convolution for both qubit and qudit systems. We also introduce ''magic entropy'' to quantify magic in quantum states and gates, in a way which may be measurable experimentally. We introduce systematic protocols to perform stabilizer testing for quantum states and gates. These protocols are based on quantum convolutions and swap-tests, realized by quantum circuits that implement the quantum convolution for both qubit and qudit systems. We also introduce ''magic entropy'' to quantify magic in quantum states and gates, in a way which may be measurable experimentally. △ Less

Submitted 15 June, 2023; originally announced June 2023.

Comments: 35 pages

arXiv:2305.18743 [pdf, other]

Decomposed Human Motion Prior for Video Pose Estimation via Adversarial Training

Authors: Wenshuo Chen, Xiang Zhou, Zhengdi Yu, Weixi Gu, Kai Zhang

Abstract: Estimating human pose from video is a task that receives considerable attention due to its applicability in numerous 3D fields. The complexity of prior knowledge of human body movements poses a challenge to neural network models in the task of regressing keypoints. In this paper, we address this problem by incorporating motion prior in an adversarial way. Different from previous methods, we propos… ▽ More Estimating human pose from video is a task that receives considerable attention due to its applicability in numerous 3D fields. The complexity of prior knowledge of human body movements poses a challenge to neural network models in the task of regressing keypoints. In this paper, we address this problem by incorporating motion prior in an adversarial way. Different from previous methods, we propose to decompose holistic motion prior to joint motion prior, making it easier for neural networks to learn from prior knowledge thereby boosting the performance on the task. We also utilize a novel regularization loss to balance accuracy and smoothness introduced by motion prior. Our method achieves 9\% lower PA-MPJPE and 29\% lower acceleration error than previous methods tested on 3DPW. The estimator proves its robustness by achieving impressive performance on in-the-wild dataset. △ Less

Submitted 24 September, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.17401 [pdf, other]

doi 10.1109/INISTA59065.2023.10310320

A Framework For Refining Text Classification and Object Recognition from Academic Articles

Authors: Jinghong Li, Koichi Ota, Wen Gu, Shinobu Hasegawa

Abstract: With the widespread use of the internet, it has become increasingly crucial to extract specific information from vast amounts of academic articles efficiently. Data mining techniques are generally employed to solve this issue. However, data mining for academic articles is challenging since it requires automatically extracting specific patterns in complex and unstructured layout documents. Current… ▽ More With the widespread use of the internet, it has become increasingly crucial to extract specific information from vast amounts of academic articles efficiently. Data mining techniques are generally employed to solve this issue. However, data mining for academic articles is challenging since it requires automatically extracting specific patterns in complex and unstructured layout documents. Current data mining methods for academic articles employ rule-based(RB) or machine learning(ML) approaches. However, using rule-based methods incurs a high coding cost for complex typesetting articles. On the other hand, simply using machine learning methods requires annotation work for complex content types within the paper, which can be costly. Furthermore, only using machine learning can lead to cases where patterns easily recognized by rule-based methods are mistakenly extracted. To overcome these issues, from the perspective of analyzing the standard layout and typesetting used in the specified publication, we emphasize implementing specific methods for specific characteristics in academic articles. We have developed a novel Text Block Refinement Framework (TBRF), a machine learning and rule-based scheme hybrid. We used the well-known ACL proceeding articles as experimental data for the validation experiment. The experiment shows that our approach achieved over 95% classification accuracy and 90% detection accuracy for tables and figures. △ Less

Submitted 2 July, 2024; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: This paper has been accepted at 'The International Symposium on Innovations in Intelligent Systems and Applications 2023 (INISTA 2023)'

arXiv:2303.15587 [pdf]

Linguistically Informed ChatGPT Prompts to Enhance Japanese-Chinese Machine Translation: A Case Study on Attributive Clauses

Authors: Wenshi Gu

Abstract: In the field of Japanese-Chinese translation linguistics, the issue of correctly translating attributive clauses has persistently proven to be challenging. Present-day machine translation tools often fail to accurately translate attributive clauses from Japanese to Chinese. In light of this, this paper investigates the linguistic problem underlying such difficulties, namely how does the semantic r… ▽ More In the field of Japanese-Chinese translation linguistics, the issue of correctly translating attributive clauses has persistently proven to be challenging. Present-day machine translation tools often fail to accurately translate attributive clauses from Japanese to Chinese. In light of this, this paper investigates the linguistic problem underlying such difficulties, namely how does the semantic role of the modified noun affect the selection of translation patterns for attributive clauses, from a linguistic perspective. To ad-dress these difficulties, a pre-edit scheme is proposed, which aims to enhance the accuracy of translation. Furthermore, we propose a novel two-step prompt strategy, which combines this pre-edit scheme with ChatGPT, currently the most widely used large language model. This prompt strategy is capable of optimizing translation input in zero-shot scenarios and has been demonstrated to improve the average translation accuracy score by over 35%. △ Less

Submitted 27 March, 2023; originally announced March 2023.

arXiv:2303.01043 [pdf, other]

I2P-Rec: Recognizing Images on Large-scale Point Cloud Maps through Bird's Eye View Projections

Authors: Shuhang Zheng, Yixuan Li, Zhu Yu, Beinan Yu, Si-Yuan Cao, Minhang Wang, Jintao Xu, Rui Ai, Weihao Gu, Lun Luo, Hui-Liang Shen

Abstract: Place recognition is an important technique for autonomous cars to achieve full autonomy since it can provide an initial guess to online localization algorithms. Although current methods based on images or point clouds have achieved satisfactory performance, localizing the images on a large-scale point cloud map remains a fairly unexplored problem. This cross-modal matching task is challenging due… ▽ More Place recognition is an important technique for autonomous cars to achieve full autonomy since it can provide an initial guess to online localization algorithms. Although current methods based on images or point clouds have achieved satisfactory performance, localizing the images on a large-scale point cloud map remains a fairly unexplored problem. This cross-modal matching task is challenging due to the difficulty in extracting consistent descriptors from images and point clouds. In this paper, we propose the I2P-Rec method to solve the problem by transforming the cross-modal data into the same modality. Specifically, we leverage on the recent success of depth estimation networks to recover point clouds from images. We then project the point clouds into Bird's Eye View (BEV) images. Using the BEV image as an intermediate representation, we extract global features with a Convolutional Neural Network followed by a NetVLAD layer to perform matching. The experimental results evaluated on the KITTI dataset show that, with only a small set of training data, I2P-Rec achieves recall rates at Top-1\% over 80\% and 90\%, when localizing monocular and stereo images on point cloud maps, respectively. We further evaluate I2P-Rec on a 1 km trajectory dataset collected by an autonomous logistics car and show that I2P-Rec can generalize well to previously unseen environments. △ Less

Submitted 14 August, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

Comments: Accepted by IROS 2023

arXiv:2212.02029 [pdf, other]

The LG Fibration

Authors: Daniel Livschitz, Weiqing Gu

Abstract: Deep Learning has significantly impacted the application of data-to-decision throughout research and industry, however, they lack a rigorous mathematical foundation, which creates situations where algorithmic results fail to be practically invertible. In this paper we present a nearly invertible mapping between $\mathbb{R}^{2^n}$ and $\mathbb{R}^{n+1}$ via a topological connection between… ▽ More Deep Learning has significantly impacted the application of data-to-decision throughout research and industry, however, they lack a rigorous mathematical foundation, which creates situations where algorithmic results fail to be practically invertible. In this paper we present a nearly invertible mapping between $\mathbb{R}^{2^n}$ and $\mathbb{R}^{n+1}$ via a topological connection between $S^{2^n-1}$ and $S^n$. Throughout the paper we utilize the algebra of Multicomplex rotation groups and polyspherical coordinates to define two maps: the first is a contraction from $S^{2^n-1}$ to $\displaystyle \otimes^n_{k=1} SO(2)$, and the second is a projection from $\displaystyle \otimes^n_{k=1} SO(2)$ to $S^{n}$. Together these form a composite map that we call the LG Fibration. In analogy to the generation of Hopf Fibration using Hypercomplex geometry from $S^{(2n-1)} \mapsto CP^n$, our fibration uses Multicomplex geometry to project $S^{2^n-1}$ onto $S^n$. We also investigate the algebraic properties of the LG Fibration, ultimately deriving a distance difference function to determine which pairs of vectors have an invariant inner product under the transformation. The LG Fibration has applications to Machine Learning and AI, in analogy to the current applications of Hopf Fibrations in adaptive UAV control. Furthermore, the ability to invert the LG Fibration for nearly all elements allows for the development of Machine Learning algorithms that may avoid the issues of uncertainty and reproducibility that currently plague contemporary methods. The primary result of this paper is a novel method of nearly invertible geometric dimensional reduction from $S^{2^n-1}$ to $S^n$, which has the capability to extend the research in both mathematics and AI, including but not limited to the fields of homotopy groups of spheres, algebraic topology, machine learning, and algebraic biology. △ Less

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.15656 [pdf, other]

SuperFusion: Multilevel LiDAR-Camera Fusion for Long-Range HD Map Generation

Authors: Hao Dong, Xianjing Zhang, Jintao Xu, Rui Ai, Weihao Gu, Huimin Lu, Juho Kannala, Xieyuanli Chen

Abstract: High-definition (HD) semantic map generation of the environment is an essential component of autonomous driving. Existing methods have achieved good performance in this task by fusing different sensor modalities, such as LiDAR and camera. However, current works are based on raw data or network feature-level fusion and only consider short-range HD map generation, limiting their deployment to realis… ▽ More High-definition (HD) semantic map generation of the environment is an essential component of autonomous driving. Existing methods have achieved good performance in this task by fusing different sensor modalities, such as LiDAR and camera. However, current works are based on raw data or network feature-level fusion and only consider short-range HD map generation, limiting their deployment to realistic autonomous driving applications. In this paper, we focus on the task of building the HD maps in both short ranges, i.e., within 30 m, and also predicting long-range HD maps up to 90 m, which is required by downstream path planning and control tasks to improve the smoothness and safety of autonomous driving. To this end, we propose a novel network named SuperFusion, exploiting the fusion of LiDAR and camera data at multiple levels. We use LiDAR depth to improve image depth estimation and use image features to guide long-range LiDAR feature prediction. We benchmark our SuperFusion on the nuScenes dataset and a self-recorded dataset and show that it outperforms the state-of-the-art baseline methods with large margins on all intervals. Additionally, we apply the generated HD map to a downstream path planning task, demonstrating that the long-range HD maps predicted by our method can lead to better path planning for autonomous vehicles. Our code and self-recorded dataset will be available at https://github.com/haomo-ai/SuperFusion. △ Less

Submitted 16 March, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

arXiv:2210.06824 [pdf, other]

An Empirical Study on Finding Spans

Authors: Weiwei Gu, Boyuan Zheng, Yunmo Chen, Tongfei Chen, Benjamin Van Durme

Abstract: We present an empirical study on methods for span finding, the selection of consecutive tokens in text for some downstream tasks. We focus on approaches that can be employed in training end-to-end information extraction systems, and find there is no definitive solution without considering task properties, and provide our observations to help with future design choices: 1) a tagging approach often… ▽ More We present an empirical study on methods for span finding, the selection of consecutive tokens in text for some downstream tasks. We focus on approaches that can be employed in training end-to-end information extraction systems, and find there is no definitive solution without considering task properties, and provide our observations to help with future design choices: 1) a tagging approach often yields higher precision while span enumeration and boundary prediction provide higher recall; 2) span type information can benefit a boundary prediction approach; 3) additional contextualization does not help span finding in most cases. △ Less

Submitted 13 October, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: Accepted to EMNLP 2022

arXiv:2210.06600 [pdf, other]

Iterative Document-level Information Extraction via Imitation Learning

Authors: Yunmo Chen, William Gantt, Weiwei Gu, Tongfei Chen, Aaron Steven White, Benjamin Van Durme

Abstract: We present a novel iterative extraction model, IterX, for extracting complex relations, or templates (i.e., N-tuples representing a mapping from named slots to spans of text) within a document. Documents may feature zero or more instances of a template of any given type, and the task of template extraction entails identifying the templates in a document and extracting each template's slot values.… ▽ More We present a novel iterative extraction model, IterX, for extracting complex relations, or templates (i.e., N-tuples representing a mapping from named slots to spans of text) within a document. Documents may feature zero or more instances of a template of any given type, and the task of template extraction entails identifying the templates in a document and extracting each template's slot values. Our imitation learning approach casts the problem as a Markov decision process (MDP), and relieves the need to use predefined template orders to train an extractor. It leads to state-of-the-art results on two established benchmarks -- 4-ary relation extraction on SciREX and template extraction on MUC-4 -- as well as a strong baseline on the new BETTER Granular task. △ Less

Submitted 1 May, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted to EACL 2023

arXiv:2209.08055 [pdf, other]

A Transformer-Based Approach for Improving App Review Response Generation

Authors: Weizhe Zhang, Wenchao Gu, Cuiyun Gao, Michael R. Lyu

Abstract: Mobile apps are becoming an integral part of people's daily life by providing various functionalities, such as messaging and gaming. App developers try their best to ensure user experience during app development and maintenance to improve the rating of their apps on app platforms and attract more user downloads. Previous studies indicated that responding to users' reviews tends to change their att… ▽ More Mobile apps are becoming an integral part of people's daily life by providing various functionalities, such as messaging and gaming. App developers try their best to ensure user experience during app development and maintenance to improve the rating of their apps on app platforms and attract more user downloads. Previous studies indicated that responding to users' reviews tends to change their attitude towards the apps positively. Users who have been replied are more likely to update the given ratings. However, reading and responding to every user review is not an easy task for developers since it is common for popular apps to receive tons of reviews every day. Thus, automation tools for review replying are needed. To address the need above, the paper introduces a Transformer-based approach, named TRRGen, to automatically generate responses to given user reviews. TRRGen extracts apps' categories, rating, and review text as the input features. By adapting a Transformer-based model, TRRGen can generate appropriate replies for new reviews. Comprehensive experiments and analysis on the real-world datasets indicate that the proposed approach can generate high-quality replies for users' reviews and significantly outperform current state-of-art approaches on the task. The manual validation results on the generated replies further demonstrate the effectiveness of the proposed approach. △ Less

Submitted 25 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: Accepted to Software: Practice and Experience (SPE)

arXiv:2209.00465 [pdf, other]

doi 10.1609/aaai.v37i11.26549

On Grounded Planning for Embodied Tasks with Language Models

Authors: Bill Yuchen Lin, Chengsong Huang, Qian Liu, Wenda Gu, Sam Sommerer, Xiang Ren

Abstract: Language models (LMs) have demonstrated their capability in possessing commonsense knowledge of the physical world, a crucial aspect of performing tasks in everyday life. However, it remains unclear **whether LMs have the capacity to generate grounded, executable plans for embodied tasks.** This is a challenging task as LMs lack the ability to perceive the environment through vision and feedback f… ▽ More Language models (LMs) have demonstrated their capability in possessing commonsense knowledge of the physical world, a crucial aspect of performing tasks in everyday life. However, it remains unclear **whether LMs have the capacity to generate grounded, executable plans for embodied tasks.** This is a challenging task as LMs lack the ability to perceive the environment through vision and feedback from the physical environment. In this paper, we address this important research question and present the first investigation into the topic. Our novel problem formulation, named **G-PlanET**, inputs a high-level goal and a data table about objects in a specific environment, and then outputs a step-by-step actionable plan for a robotic agent to follow. To facilitate the study, we establish an **evaluation protocol** and design a dedicated metric to assess the quality of the plans. Our experiments demonstrate that the use of tables for encoding the environment and an iterative decoding strategy can significantly enhance the LMs' ability in grounded planning. Our analysis also reveals interesting and non-trivial findings. △ Less

Submitted 15 July, 2023; v1 submitted 29 August, 2022; originally announced September 2022.

Comments: Accepted to AAAI 2023 Project website: https://yuchenlin.xyz/g-planet/

arXiv:2208.11908 [pdf, other]

Adaptive Perception Transformer for Temporal Action Localization

Authors: Yizheng Ouyang, Tianjin Zhang, Weibo Gu, Hongfa Wang

Abstract: Temporal action localization aims to predict the boundary and category of each action instance in untrimmed long videos. Most of previous methods based on anchors or proposals neglect the global-local context interaction in entire video sequences. Besides, their multi-stage designs cannot generate action boundaries and categories straightforwardly. To address the above issues, this paper proposes… ▽ More Temporal action localization aims to predict the boundary and category of each action instance in untrimmed long videos. Most of previous methods based on anchors or proposals neglect the global-local context interaction in entire video sequences. Besides, their multi-stage designs cannot generate action boundaries and categories straightforwardly. To address the above issues, this paper proposes a end-to-end model, called Adaptive Perception transformer (AdaPerFormer for short). Specifically, AdaPerFormer explores a dual-branch attention mechanism. One branch takes care of the global perception attention, which can model entire video sequences and aggregate global relevant contexts. While the other branch concentrates on the local convolutional shift to aggregate intra-frame and inter-frame information through our bidirectional shift operation. The end-to-end nature produces the boundaries and categories of video actions without extra steps. Extensive experiments together with ablation studies are provided to reveal the effectiveness of our design. Our method obtains competitive performance on the THUMOS14 and ActivityNet-1.3 dataset. △ Less

Submitted 15 September, 2022; v1 submitted 25 August, 2022; originally announced August 2022.

arXiv:2208.05144 [pdf]

Machine Learning-based EEG Applications and Markets

Authors: Weiqing Gu, Bohan Yang, Ryan Chang

Abstract: This paper addresses both the various EEG applications and the current EEG market ecosystem propelled by machine learning. Increasingly available open medical and health datasets using EEG encourage data-driven research with a promise of improving neurology for patient care through knowledge discovery and machine learning data science algorithm development. This effort leads to various kinds of EE… ▽ More This paper addresses both the various EEG applications and the current EEG market ecosystem propelled by machine learning. Increasingly available open medical and health datasets using EEG encourage data-driven research with a promise of improving neurology for patient care through knowledge discovery and machine learning data science algorithm development. This effort leads to various kinds of EEG developments and currently forms a new EEG market. This paper attempts to do a comprehensive survey on the EEG market and covers the six significant applications of EEG, including diagnosis/screening, drug development, neuromarketing, daily health, metaverse, and age/disability assistance. The highlight of this survey is on the compare and contrast between the research field and the business market. Our survey points out the current limitations of EEG and indicates the future direction of research and business opportunity for every EEG application listed above. Based on our survey, more research on machine learning-based EEG applications will lead to a more robust EEG-related market. More companies will use the research technology and apply it to real-life settings. As the EEG-related market grows, the EEG-related devices will collect more EEG data, and there will be more EEG data available for researchers to use in their study, coming back as a virtuous cycle. Our market analysis indicates that research related to the use of EEG data and machine learning in the six applications listed above points toward a clear trend in the growth and development of the EEG ecosystem and machine learning world. △ Less

Submitted 10 August, 2022; originally announced August 2022.

Comments: 35 pages

arXiv:2207.07278 [pdf, other]

Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization

Authors: Mengyin Liu, Chao Zhu, Hongyu Gao, Weibo Gu, Hongfa Wang, Wei Liu, Xu-cheng Yin

Abstract: With the prosperity of e-commerce industry, various modalities, e.g., vision and language, are utilized to describe product items. It is an enormous challenge to understand such diversified data, especially via extracting the attribute-value pairs in text sequences with the aid of helpful image regions. Although a series of previous works have been dedicated to this task, there remain seldomly inv… ▽ More With the prosperity of e-commerce industry, various modalities, e.g., vision and language, are utilized to describe product items. It is an enormous challenge to understand such diversified data, especially via extracting the attribute-value pairs in text sequences with the aid of helpful image regions. Although a series of previous works have been dedicated to this task, there remain seldomly investigated obstacles that hinder further improvements: 1) Parameters from up-stream single-modal pretraining are inadequately applied, without proper jointly fine-tuning in a down-stream multi-modal task. 2) To select descriptive parts of images, a simple late fusion is widely applied, regardless of priori knowledge that language-related information should be encoded into a common linguistic embedding space by stronger encoders. 3) Due to diversity across products, their attribute sets tend to vary greatly, but current approaches predict with an unnecessary maximal range and lead to more potential false positives. To address these issues, we propose in this paper a novel approach to boost multi-modal e-commerce attribute value extraction via unified learning scheme and dynamic range minimization: 1) Firstly, a unified scheme is designed to jointly train a multi-modal task with pretrained single-modal parameters. 2) Secondly, a text-guided information range minimization method is proposed to adaptively encode descriptive parts of each modality into an identical space with a powerful pretrained linguistic model. 3) Moreover, a prototype-guided attribute range minimization method is proposed to first determine the proper attribute set of the current product, and then select prototypes to guide the prediction of the chosen attributes. Experiments on the popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over the other state-of-the-art techniques. △ Less

Submitted 6 April, 2023; v1 submitted 14 July, 2022; originally announced July 2022.

arXiv:2207.02201 [pdf, other]

Efficient Spatial-Temporal Information Fusion for LiDAR-Based 3D Moving Object Segmentation

Authors: Jiadai Sun, Yuchao Dai, Xianjing Zhang, Jintao Xu, Rui Ai, Weihao Gu, Xieyuanli Chen

Abstract: Accurate moving object segmentation is an essential task for autonomous driving. It can provide effective information for many downstream tasks, such as collision avoidance, path planning, and static map construction. How to effectively exploit the spatial-temporal information is a critical question for 3D LiDAR moving object segmentation (LiDAR-MOS). In this work, we propose a novel deep neural n… ▽ More Accurate moving object segmentation is an essential task for autonomous driving. It can provide effective information for many downstream tasks, such as collision avoidance, path planning, and static map construction. How to effectively exploit the spatial-temporal information is a critical question for 3D LiDAR moving object segmentation (LiDAR-MOS). In this work, we propose a novel deep neural network exploiting both spatial-temporal information and different representation modalities of LiDAR scans to improve LiDAR-MOS performance. Specifically, we first use a range image-based dual-branch structure to separately deal with spatial and temporal information that can be obtained from sequential LiDAR scans, and later combine them using motion-guided attention modules. We also use a point refinement module via 3D sparse convolution to fuse the information from both LiDAR range image and point cloud representations and reduce the artifacts on the borders of the objects. We verify the effectiveness of our proposed approach on the LiDAR-MOS benchmark of SemanticKITTI. Our method outperforms the state-of-the-art methods significantly in terms of LiDAR-MOS IoU. Benefiting from the devised coarse-to-fine architecture, our method operates online at sensor frame rate. The implementation of our method is available as open source at: https://github.com/haomo-ai/MotionSeg3D. △ Less

Submitted 5 July, 2022; originally announced July 2022.

Comments: Accepted by IROS2022. Code: https://github.com/haomo-ai/MotionSeg3D

arXiv:2205.14413 [pdf]

Discrimination-Based Double Auction for Maximizing Social Welfare in the Electricity and Heating Market Considering Privacy Preservation

Authors: Lu Wang, Wei Gu, Shuai Lu, Haifeng Qiu, Zhi Wu

Abstract: This paper proposes a doubled-sided auction mechanism with price discrimination for social welfare (SW) maximization in the electricity and heating market. In this mechanism, energy service providers (ESPs) submit offers and load aggregators (LAs) submit bids to an energy trading center (ETC) to maximize their utility; in turn, the selfless ETC as an auctioneer leverages dis-criminatory price weig… ▽ More This paper proposes a doubled-sided auction mechanism with price discrimination for social welfare (SW) maximization in the electricity and heating market. In this mechanism, energy service providers (ESPs) submit offers and load aggregators (LAs) submit bids to an energy trading center (ETC) to maximize their utility; in turn, the selfless ETC as an auctioneer leverages dis-criminatory price weights to regulate the behaviors of ESPs and LAs, which combines the individual benefits of each stakeholder with the overall social welfare to achieve the global optimum. Nash games are employed to describe the interactions between players with the same market role. Theoretically, we first prove the existence and uniqueness of the Nash equilibrium; then, considering the requirement of game players to preserve privacy, a distributed algorithm based on the alternating direction method of multipliers is developed to implement distributed bidding and analytical target cascading algorithm is applied to reach the balance of demand and supply. We validated the proposed mechanism using case studies on a city-level distribution system. The results indicated that the achieved SW improved by 4%-15% compared with other mechanisms, and also verified the effectiveness of the distributed algorithm. △ Less

Submitted 28 May, 2022; originally announced May 2022.

arXiv:2204.03293 [pdf, other]

CoCoSoDa: Effective Contrastive Learning for Code Search

Authors: Ensheng Shi, Yanlin Wang, Wenchao Gu, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun

Abstract: Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved the performance of code search. However, there is still a lot of room for improvement in using contrastive learning for code search. In this paper, we propose C… ▽ More Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved the performance of code search. However, there is still a lot of room for improvement in using contrastive learning for code search. In this paper, we propose CoCoSoDa to effectively utilize contrastive learning for code search via two key factors in contrastive learning: data augmentation and negative samples. Specifically, soft data augmentation is to dynamically masking or replacing some tokens with their types for input sequences to generate positive samples. Momentum mechanism is used to generate large and consistent representations of negative samples in a mini-batch through maintaining a queue and a momentum encoder. In addition, multimodal contrastive learning is used to pull together representations of code-query pairs and push apart the unpaired code snippets and queries. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 14 baselines and especially exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% on average MRR scores, respectively. (2) The ablation studies show the effectiveness of each component of our approach. (3) We adapt our techniques to several different pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT and observe a significant boost in their performance in code search. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore reasons behind the good performance of our model. △ Less

Submitted 12 February, 2023; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Accepted by ICSE 2023 (The 45th International Conference on Software Engineering)

arXiv:2203.15287 [pdf, other]

Accelerating Code Search with Deep Hashing and Code Classification

Authors: Wenchao Gu, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Michael R. Lyu

Abstract: Code search is to search reusable code snippets from source code corpus based on natural languages queries. Deep learning-based methods of code search have shown promising results. However, previous methods focus on retrieval accuracy but lacked attention to the efficiency of the retrieval process. We propose a novel method CoSHC to accelerate code search with deep hashing and code classification,… ▽ More Code search is to search reusable code snippets from source code corpus based on natural languages queries. Deep learning-based methods of code search have shown promising results. However, previous methods focus on retrieval accuracy but lacked attention to the efficiency of the retrieval process. We propose a novel method CoSHC to accelerate code search with deep hashing and code classification, aiming to perform an efficient code search without sacrificing too much accuracy. To evaluate the effectiveness of CoSHC, we apply our method to five code search models. Extensive experimental results indicate that compared with previous code search baselines, CoSHC can save more than 90% of retrieval time meanwhile preserving at least 99% of retrieval accuracy. △ Less

Submitted 30 March, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: Accepted to 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)

arXiv:2203.03397 [pdf, other]

doi 10.1109/LRA.2022.3178797

OverlapTransformer: An Efficient and Rotation-Invariant Transformer Network for LiDAR-Based Place Recognition

Authors: Junyi Ma, Jun Zhang, Jintao Xu, Rui Ai, Weihao Gu, Xieyuanli Chen

Abstract: Place recognition is an important capability for autonomously navigating vehicles operating in complex environments and under changing conditions. It is a key component for tasks such as loop closing in SLAM or global localization. In this paper, we address the problem of place recognition based on 3D LiDAR scans recorded by an autonomous vehicle. We propose a novel lightweight neural network expl… ▽ More Place recognition is an important capability for autonomously navigating vehicles operating in complex environments and under changing conditions. It is a key component for tasks such as loop closing in SLAM or global localization. In this paper, we address the problem of place recognition based on 3D LiDAR scans recorded by an autonomous vehicle. We propose a novel lightweight neural network exploiting the range image representation of LiDAR sensors to achieve fast execution with less than 2 ms per frame. We design a yaw-angle-invariant architecture exploiting a transformer network, which boosts the place recognition performance of our method. We evaluate our approach on the KITTI and Ford Campus datasets. The experimental results show that our method can effectively detect loop closures compared to the state-of-the-art methods and generalizes well across different environments. To evaluate long-term place recognition performance, we provide a novel dataset containing LiDAR sequences recorded by a mobile robot in repetitive places at different times. The implementation of our method and dataset are released here: https://github.com/haomo-ai/OverlapTransformer △ Less

Submitted 19 April, 2023; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: Accepted by RAL/IROS 2022

arXiv:2202.13686 [pdf, other]

Points-of-Interest Relationship Inference with Spatial-enriched Graph Neural Networks

Authors: Yile Chen, Xiucheng Li, Gao Cong, Cheng Long, Zhifeng Bao, Shang Liu, Wanli Gu, Fuzheng Zhang

Abstract: As a fundamental component in location-based services, inferring the relationship between points-of-interests (POIs) is very critical for service providers to offer good user experience to business owners and customers. Most of the existing methods for relationship inference are not targeted at POI, thus failing to capture unique spatial characteristics that have huge effects on POI relationships.… ▽ More As a fundamental component in location-based services, inferring the relationship between points-of-interests (POIs) is very critical for service providers to offer good user experience to business owners and customers. Most of the existing methods for relationship inference are not targeted at POI, thus failing to capture unique spatial characteristics that have huge effects on POI relationships. In this work we propose PRIM to tackle POI relationship inference for multiple relation types. PRIM features four novel components, including a weighted relational graph neural network, category taxonomy integration, a self-attentive spatial context extractor, and a distance-specific scoring function. Extensive experiments on two real-world datasets show that PRIM achieves the best results compared to state-of-the-art baselines and it is robust against data sparsity and is applicable to unseen cases in practice. △ Less

Submitted 28 February, 2022; originally announced February 2022.

arXiv:2202.06521 [pdf, other]

Source Code Summarization with Structural Relative Position Guided Transformer

Authors: Zi Gong, Cuiyun Gao, Yasheng Wang, Wenchao Gu, Yun Peng, Zenglin Xu

Abstract: Source code summarization aims at generating concise and clear natural language descriptions for programming languages. Well-written code summaries are beneficial for programmers to participate in the software development and maintenance process. To learn the semantic representations of source code, recent efforts focus on incorporating the syntax structure of code into neural networks such as Tra… ▽ More Source code summarization aims at generating concise and clear natural language descriptions for programming languages. Well-written code summaries are beneficial for programmers to participate in the software development and maintenance process. To learn the semantic representations of source code, recent efforts focus on incorporating the syntax structure of code into neural networks such as Transformer. Such Transformer-based approaches can better capture the long-range dependencies than other neural networks including Recurrent Neural Networks (RNNs), however, most of them do not consider the structural relative correlations between tokens, e.g., relative positions in Abstract Syntax Trees (ASTs), which is beneficial for code semantics learning. To model the structural dependency, we propose a Structural Relative Position guided Transformer, named SCRIPT. SCRIPT first obtains the structural relative positions between tokens via parsing the ASTs of source code, and then passes them into two types of Transformer encoders. One Transformer directly adjusts the input according to the structural relative distance; and the other Transformer encodes the structural relative positions during computing the self-attention scores. Finally, we stack these two types of Transformer encoders to learn representations of source code. Experimental results show that the proposed SCRIPT outperforms the state-of-the-art methods by at least 1.6%, 1.4% and 2.8% with respect to BLEU, ROUGE-L and METEOR on benchmark datasets, respectively. We further show that how the proposed SCRIPT captures the structural relative dependencies. △ Less

Submitted 14 February, 2022; originally announced February 2022.

Comments: 12 pages, SANER 2022

arXiv:2201.08054 [pdf, other]

VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine Translation

Authors: Yihang Li, Shuichiro Shimizu, Weiqi Gu, Chenhui Chu, Sadao Kurohashi

Abstract: Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity, making visual information not so effective to generate appropriate translations. We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips with the following key features: (1)… ▽ More Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity, making visual information not so effective to generate appropriate translations. We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips with the following key features: (1) the parallel sentences are subtitles from movies and TV episodes; (2) the source subtitles are ambiguous, which means they have multiple possible translations with different meanings; (3) we divide the dataset into Polysemy and Omission according to the cause of ambiguity. We show that VISA is challenging for the latest MMT system, and we hope that the dataset can facilitate MMT research. The VISA dataset is available at: https://github.com/ku-nlp/VISA. △ Less

Submitted 26 May, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

Comments: Accepted by LREC2022

arXiv:2112.12653 [pdf, other]

Revisiting, Benchmarking and Exploring API Recommendation: How Far Are We?

Authors: Yun Peng, Shuqing Li, Wenwei Gu, Yichen Li, Wenxuan Wang, Cuiyun Gao, Michael Lyu

Abstract: Application Programming Interfaces (APIs), which encapsulate the implementation of specific functions as interfaces, greatly improve the efficiency of modern software development. As numbers of APIs spring up nowadays, developers can hardly be familiar with all the APIs, and usually need to search for appropriate APIs for usage. So lots of efforts have been devoted to improving the API recommendat… ▽ More Application Programming Interfaces (APIs), which encapsulate the implementation of specific functions as interfaces, greatly improve the efficiency of modern software development. As numbers of APIs spring up nowadays, developers can hardly be familiar with all the APIs, and usually need to search for appropriate APIs for usage. So lots of efforts have been devoted to improving the API recommendation task. However, it has been increasingly difficult to gauge the performance of new models due to the lack of a uniform definition of the task and a standardized benchmark. For example, some studies regard the task as a code completion problem; while others recommend relative APIs given natural language queries. To reduce the challenges and better facilitate future research, in this paper, we revisit the API recommendation task and aim at benchmarking the approaches. Specifically, the paper groups the approaches into two categories according to the task definition, i.e., query-based API recommendation and code-based API recommendation. We study 11 recently-proposed approaches along with 4 widely-used IDEs. One benchmark named as APIBench is then built for the two respective categories of approaches. Based on APIBench, we distill some actionable insights and challenges for API recommendation. We also achieve some implications and directions for improving the performance of recommending APIs, including data source selection, appropriate query reformulation, low resource setting, and cross-domain adaptation. △ Less

Submitted 23 December, 2021; originally announced December 2021.

Showing 1–50 of 70 results for author: Gu, W