-
UNR: Unified Notifiable RMA Library for HPC
Authors:
Guangnan Feng,
Jiabin Xie,
Dezun Dong,
Yutong Lu
Abstract:
Remote Memory Access (RMA) enables direct access to remote memory to achieve high performance for HPC applications. However, most modern parallel programming models lack schemes for the remote process to detect the completion of RMA operations. Many previous works have proposed programming models and extensions to notify the communication peer, but they did not solve the multi-NIC aggregation, portability, hardware-software co-design, and usability problems. In this work, we proposed a Unified Notifiable RMA (UNR) library for HPC to address these challenges. In addition, we demonstrate the best practice of utilizing UNR within a real-world scientific application, PowerLLEL. We deployed UNR across four HPC systems, each with a different interconnect. The results show that PowerLLEL powered by UNR achieves up to a 36% acceleration on 1728 nodes of the Tianhe-Xingyi supercomputing system.
Submitted 14 August, 2024;
originally announced August 2024.
-
TEAdapter: Supply abundant guidance for controllable text-to-music generation
Authors:
Jialing Zou,
Jiahao Mei,
Xudong Nan,
Jinghua Li,
Daoguo Dong,
Liang He
Abstract:
Although current text-guided music generation technology can cope with simple creative scenarios, achieving fine-grained control over individual text-modality conditions remains challenging as user demands become more intricate. Accordingly, we introduce the TEAcher Adapter (TEAdapter), a compact plugin designed to guide the generation process with diverse control information provided by users. In addition, we explore the controllable generation of extended music by leveraging TEAdapter control groups trained on data of distinct structural functionalities. In general, we consider controls over global, elemental, and structural levels. Experimental results demonstrate that the proposed TEAdapter enables multiple precise controls and ensures high-quality music generation. Our module is also lightweight and transferable to any diffusion model architecture. Code and demos will be available soon at https://github.com/Ashley1101/TEAdapter.
Submitted 9 August, 2024;
originally announced August 2024.
-
DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs
Authors:
Zhen Tan,
Daize Dong,
Xinyu Zhao,
Jie Peng,
Yu Cheng,
Tianlong Chen
Abstract:
In this paper, we introduce Dynamic Layer Operations (DLO), a novel approach for vertically scaling transformer-based Large Language Models (LLMs) by dynamically expanding, activating, or skipping layers using a sophisticated routing policy based on layerwise feature similarity. Unlike traditional Mixture-of-Experts (MoE) methods that focus on extending the model width, our approach targets model depth, addressing the redundancy observed across layer representations for various input samples. Our framework is integrated with the Supervised Fine-Tuning (SFT) stage, eliminating the need for resource-intensive Continual Pre-Training (CPT). Experimental results demonstrate that DLO not only outperforms the original unscaled models but also achieves comparable results to densely expanded models with significantly improved efficiency. Our work offers a promising direction for building efficient yet powerful LLMs. We will release our implementation and model weights upon acceptance.
Submitted 3 July, 2024;
originally announced July 2024.
-
Warming Up Cold-Start CTR Prediction by Learning Item-Specific Feature Interactions
Authors:
Yaqing Wang,
Hongming Piao,
Daxiang Dong,
Quanming Yao,
Jingbo Zhou
Abstract:
In recommendation systems, new items are continuously introduced, initially lacking interaction records but gradually accumulating them over time. Accurately predicting the click-through rate (CTR) for these items is crucial for enhancing both revenue and user experience. While existing methods focus on enhancing item ID embeddings for new items within general CTR models, they tend to adopt a global feature interaction approach, often overshadowing new items with sparse data by those with abundant interactions. Addressing this, our work introduces EmerG, a novel approach that warms up cold-start CTR prediction by learning item-specific feature interaction patterns. EmerG utilizes hypernetworks to generate an item-specific feature graph based on item characteristics, which is then processed by a Graph Neural Network (GNN). This GNN is specially tailored to provably capture feature interactions at any order through a customized message passing mechanism. We further design a meta learning strategy that optimizes parameters of hypernetworks and GNN across various item CTR prediction tasks, while only adjusting a minimal set of item-specific parameters within each task. This strategy effectively reduces the risk of overfitting when dealing with limited data. Extensive experiments on benchmark datasets validate that EmerG consistently performs the best given no, a few and sufficient instances of new items.
Submitted 14 July, 2024;
originally announced July 2024.
-
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
Authors:
Tong Zhu,
Xiaoye Qu,
Daize Dong,
Jiacheng Ruan,
Jingqi Tong,
Conghui He,
Yu Cheng
Abstract:
Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data-hungry and instability problems. Motivated by this limit, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models could maintain language abilities and route the input tokens to specific experts with part of the parameters activated. Empirically, by training 200B tokens, LLaMA-MoE-3.5B models significantly outperform dense models that contain similar activation parameters. The source codes and models are available at https://github.com/pjlab-sys4nlp/llama-moe .
Submitted 24 June, 2024;
originally announced June 2024.
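The expert-construction step described in the LLaMA-MoE abstract above can be illustrated with a toy sketch. It assumes a plain ReLU feed-forward layer rather than LLaMA's actual FFN, and the names `dense_ffn`, `split_experts`, and `expert_ffn` are hypothetical, not taken from the paper's code. It only shows why partitioning hidden neurons into disjoint groups is lossless before gating: the outputs of all experts sum to the dense output.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def dense_ffn(w_in, w_out, x):
    # w_in has shape hidden x d_model, w_out has shape d_model x hidden.
    return matvec(w_out, relu(matvec(w_in, x)))

def split_experts(hidden, n_experts, seed=0):
    # Assumption: a random, equal-sized, disjoint split of hidden neurons.
    if hidden % n_experts != 0:
        raise ValueError("hidden must be divisible by n_experts")
    idx = list(range(hidden))
    random.Random(seed).shuffle(idx)
    size = hidden // n_experts
    return [sorted(idx[i * size:(i + 1) * size]) for i in range(n_experts)]

def expert_ffn(w_in, w_out, x, neurons):
    # An expert keeps only the rows of w_in and columns of w_out it owns.
    h = relu(matvec([w_in[j] for j in neurons], x))
    return [sum(row[j] * a for j, a in zip(neurons, h)) for row in w_out]
```

In a sparse model only a few experts would be selected per token by a gate network, so the sum would cover a subset of the groups; that selection is not modeled here.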
-
Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts
Authors:
Tong Zhu,
Daize Dong,
Xiaoye Qu,
Jiacheng Ruan,
Wenliang Chen,
Yu Cheng
Abstract:
Mixture-of-Experts (MoE) models have shown remarkable capability in instruction tuning, especially when the number of tasks scales. However, previous methods simply merge all training tasks (e.g. creative writing, coding, and mathematics) and apply fixed sampling weights, without considering the importance of different tasks as the model training state changes. In this way, the most helpful data cannot be effectively distinguished, leading to suboptimal model performance. To reduce the potential redundancies of datasets, we make the first attempt and propose a novel dynamic data mixture for MoE instruction tuning. Specifically, inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets. Finally, we propose to dynamically adjust the sampling weight of datasets by their inter-redundancies, thus maximizing global performance under a limited training budget. The experimental results on two MoE models demonstrate the effectiveness of our approach on both downstream knowledge & reasoning tasks and open-ended queries. Code and models are available at https://github.com/Spico197/MoE-SFT .
Submitted 17 June, 2024;
originally announced June 2024.
-
Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
Authors:
Shwai He,
Daize Dong,
Liang Ding,
Ang Li
Abstract:
Scaling large language models has revolutionized the performance across diverse domains, yet the continual growth in model size poses significant challenges for real-world deployment. The Mixture of Experts (MoE) approach addresses this by dynamically selecting and activating only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE introduces potential redundancy (e.g., parameters) and extra costs (e.g., communication overhead). Despite numerous compression techniques developed for mitigating the redundancy in dense models, the compression of MoE remains under-explored. We first bridge this gap with a cutting-edge unified framework that not only seamlessly integrates mainstream compression methods but also helps systematically understand MoE compression. This framework approaches compression from two perspectives: Expert Slimming which compresses individual experts and Expert Trimming which removes structured modules. Within this framework, we explore the optimization space unexplored by existing methods, and further introduce aggressive Expert Trimming techniques, i.e., Layer Drop and Block Drop, to eliminate redundancy at larger scales. Based on these insights, we present a comprehensive recipe to guide practitioners in compressing MoE effectively. Extensive experimental results demonstrate the effectiveness of the compression methods under our framework and the proposed recipe, achieving a 6.05x speedup and only 20.0GB memory usage while maintaining over 92% of performance on Mixtral-8x7B. Code is released at https://github.com/DaizeDong/Unified-MoE-Compression.
Submitted 24 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
Full-Stack Allreduce on Multi-Rail Networks
Authors:
Enda Yu,
Dezun Dong,
Xiangke Liao
Abstract:
The high communication costs impede scalability in distributed systems. Multimodal models like Sora exacerbate this issue by requiring more resources than current networks can support. However, existing network architectures fail to address this gap. In this paper, we provide full-stack support for allreduce on multi-rail networks, aiming to overcome the scalability limitations of large-scale networks by facilitating collaborative data transfer across various networks. To achieve this, we propose the Nezha system, which integrates TCP, in-network computing protocol SHARP, and RDMA-based protocol GLEX. To maximize data transfer rates, Nezha incorporates a load balancing data allocation scheme based on cost feedback and combines exception handling to achieve reliable data transmission. Our experiments on a six-node cluster demonstrate that Nezha significantly enhances allreduce performance by 58% to 87% in homogeneous dual-rail configurations and offers considerable acceleration in heterogeneous settings, contingent on the performance variance among networks.
Submitted 28 May, 2024;
originally announced May 2024.
-
Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation
Authors:
Shengyuan Liu,
Bo Wang,
Ye Ma,
Te Yang,
Xipeng Cao,
Quan Chen,
Han Li,
Di Dong,
Peng Jiang
Abstract:
Existing subject-driven text-to-image generation models suffer from tedious fine-tuning steps and struggle to maintain both text-image alignment and subject fidelity. For generating compositional subjects, it often encounters problems such as object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance to intervene in the generative process during inference time. This approach strengthens the attention map, allowing for precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation. Furthermore, we propose a novel metric GroundingScore to evaluate subject alignment thoroughly. The obtained quantitative results serve as compelling evidence showcasing the effectiveness of our proposed method. The code will be released soon.
Submitted 11 May, 2024;
originally announced May 2024.
-
Online Planning of Power Flows for Power Systems Against Bushfires Using Spatial Context
Authors:
Jianyu Xu,
Qiuzhuang Sun,
Yang Yang,
Huadong Mo,
Daoyi Dong
Abstract:
The 2019-20 Australia bushfire incurred numerous economic losses and significantly affected the operations of power systems. A power station or transmission line can be significantly affected due to bushfires, leading to an increase in operational costs. We study a fundamental but challenging problem of planning the optimal power flow (OPF) for power systems subject to bushfires. Considering the stochastic nature of bushfire spread, we develop a model to capture such dynamics based on Moore's neighborhood model. Under a periodic inspection scheme that reveals the in-situ bushfire status, we propose an online optimization modeling framework that sequentially plans the power flows in the electricity network. Our framework assumes that the spread of bushfires is non-stationary over time, and the spread and containment probabilities are unknown. To meet these challenges, we develop a contextual online learning algorithm that treats the in-situ geographical information of the bushfire as a 'spatial context'. The online learning algorithm learns the unknown probabilities sequentially based on the observed data and then makes the OPF decision accordingly. The sequential OPF decisions aim to minimize the regret function, which is defined as the cumulative loss against the clairvoyant strategy that knows the true model parameters. We provide a theoretical guarantee of our algorithm by deriving a bound on the regret function, which outperforms the regret bound achieved by other benchmark algorithms. Our model assumptions are verified by the real bushfire data from NSW, Australia, and we apply our model to two power systems to illustrate its applicability.
Submitted 20 April, 2024;
originally announced April 2024.
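The stochastic spread model mentioned in the bushfire abstract above can be pictured with a small grid simulation. This is only a sketch under stated assumptions: the grid layout, the per-step update order, and the names `moore_neighbors`, `step`, `p_spread`, and `p_contain` are hypothetical and not taken from the paper. In the paper the spread and containment probabilities are unknown and learned online; here they are fixed inputs.

```python
import random

# The Moore neighborhood of a cell is its eight surrounding cells.
MOORE_OFFSETS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)]

def moore_neighbors(r, c, rows, cols):
    # Keep only neighbors that fall inside the grid.
    return [(r + dr, c + dc) for dr, dc in MOORE_OFFSETS
            if 0 <= r + dr < rows and 0 <= c + dc < cols]

def step(burning, rows, cols, p_spread, p_contain, rng):
    # One time step: each burning cell is contained with probability
    # p_contain, and ignites each non-burning Moore neighbor
    # independently with probability p_spread (assumed dynamics).
    nxt = set()
    for (r, c) in burning:
        if rng.random() >= p_contain:
            nxt.add((r, c))  # fire persists in this cell
        for nb in moore_neighbors(r, c, rows, cols):
            if nb not in burning and rng.random() < p_spread:
                nxt.add(nb)
    return nxt
```

Calling `step` repeatedly with a seeded `random.Random` gives one sample path of the fire front; an online planner would observe such states at each inspection.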
-
iDAT: inverse Distillation Adapter-Tuning
Authors:
Jiacheng Ruan,
Jingsheng Gao,
Mingye Xie,
Daize Dong,
Suncheng Xiang,
Ting Liu,
Yuzhuo Fu
Abstract:
The Adapter-Tuning (AT) method involves freezing a pre-trained model and introducing trainable adapter modules to acquire downstream knowledge, thereby calibrating the model for better adaptation to downstream tasks. This paper proposes a distillation framework for the AT method instead of crafting a carefully designed adapter module, which aims to improve fine-tuning performance. For the first time, we explore the possibility of combining the AT method with knowledge distillation. Via statistical analysis, we observe significant differences in the knowledge acquisition between adapter modules of different models. Leveraging these differences, we propose a simple yet effective framework called inverse Distillation Adapter-Tuning (iDAT). Specifically, we designate the smaller model as the teacher and the larger model as the student. The two are jointly trained, and online knowledge distillation is applied to inject knowledge from a different perspective into the student model, significantly enhancing the fine-tuning performance on downstream tasks. Extensive experiments on the VTAB-1K benchmark with 19 image classification tasks demonstrate the effectiveness of iDAT. The results show that using an existing AT method within our iDAT framework can further yield a 2.66% performance gain, with only an additional 0.07M trainable parameters. Our approach compares favorably with state-of-the-art methods without bells and whistles. Our code is available at https://github.com/JCruan519/iDAT.
Submitted 23 March, 2024;
originally announced March 2024.
-
SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration
Authors:
Yanfei Song,
Bangzheng Pu,
Peng Wang,
Hongxu Jiang,
Dong Dong,
Yongxiang Cao,
Yiqing Shen
Abstract:
The Segment Anything Model (SAM) has garnered significant attention in segmentation tasks due to its zero-shot generalization ability. However, a broader application of SAMs to real-world practice has been restricted by their low inference speed and high computational memory demands, which mainly stem from the attention mechanism. Existing work concentrated on optimizing the encoder, yet has not adequately addressed the inefficiency of the attention mechanism itself, even when distilled to a smaller model, which thus leaves space for further improvement. In response, we introduce SAM-Lightening, a variant of SAM that features a re-engineered attention mechanism, termed Dilated Flash Attention. It not only facilitates higher parallelism, enhancing processing efficiency, but also retains compatibility with the existing FlashAttention. Correspondingly, we propose a progressive distillation to enable an efficient knowledge transfer from the vanilla SAM without costly training from scratch. Experiments on COCO and LVIS reveal that SAM-Lightening significantly outperforms the state-of-the-art methods in both run-time efficiency and segmentation accuracy. Specifically, it can achieve an inference speed of 7 milliseconds (ms) per image, for images of size 1024*1024 pixels, which is 30.1 times faster than the vanilla SAM and 2.1 times faster than the state of the art. Moreover, it takes only 244MB memory, which is 3.5% of the vanilla SAM. The code and weights are available at https://anonymous.4open.science/r/SAM-LIGHTENING-BC25/.
Submitted 17 March, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
A Graph is Worth K Words: Euclideanizing Graph using Pure Transformer
Authors:
Zhangyang Gao,
Daize Dong,
Cheng Tan,
Jun Xia,
Bozhen Hu,
Stan Z. Li
Abstract:
Can we model Non-Euclidean graphs as pure language or even Euclidean vectors while retaining their inherent information? The Non-Euclidean property has posed a long-term challenge in graph modeling. Despite recent graph neural networks and graph transformers efforts encoding graphs as Euclidean vectors, recovering the original graph from vectors remains a challenge. In this paper, we introduce GraphsGPT, featuring a Graph2Seq encoder that transforms Non-Euclidean graphs into learnable Graph Words in the Euclidean space, along with a GraphGPT decoder that reconstructs the original graph from Graph Words to ensure information equivalence. We pretrain GraphsGPT on 100M molecules and yield some interesting findings: (1) The pretrained Graph2Seq excels in graph representation learning, achieving state-of-the-art results on 8/9 graph classification and regression tasks. (2) The pretrained GraphGPT serves as a strong graph generator, demonstrated by its strong ability to perform both few-shot and conditional graph generation. (3) Graph2Seq+GraphGPT enables effective graph mixup in the Euclidean space, overcoming previously known Non-Euclidean challenges. (4) The edge-centric pretraining framework GraphsGPT demonstrates its efficacy in graph domain tasks, excelling in both representation and generation. Code is available at https://github.com/A4Bio/GraphsGPT.
Submitted 29 May, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
Augmenting Prototype Network with TransMix for Few-shot Hyperspectral Image Classification
Authors:
Chun Liu,
Longwei Yang,
Dongmei Dong,
Zheng Li,
Wei Yang,
Zhigang Han,
Jiayao Wang
Abstract:
Few-shot hyperspectral image classification aims to identify the class of each pixel in the images by marking only a few of these pixels. To obtain the spatial-spectral joint features of each pixel, fixed-size patches centered on each pixel are often used for classification. However, observing the classification results of existing methods, we found that boundary patches, which correspond to pixels located at the boundary of objects in the hyperspectral images, are hard to classify. These boundary patches are mixed with multi-class spectral information. Inspired by this, we propose to augment the prototype network with TransMix for few-shot hyperspectral image classification (APNT). While taking the prototype network as the backbone, it adopts the transformer as feature extractor to learn the pixel-to-pixel relation and pay different attention to different pixels. At the same time, instead of directly using the patches cut from the hyperspectral images for training, it randomly mixes up two patches to imitate the boundary patches and uses the synthetic patches to train the model, with the aim of enlarging the number of hard training samples and enhancing their diversity. Following the data augmentation technique TransMix, the attention returned by the transformer is also used to mix up the labels of two patches to generate better labels for synthetic patches. Compared with existing methods, the proposed method has demonstrated state-of-the-art performance and better robustness for few-shot hyperspectral image classification in our experiments.
Submitted 22 January, 2024;
originally announced January 2024.
-
TripleSurv: Triplet Time-adaptive Coordinate Loss for Survival Analysis
Authors:
Liwen Zhang,
Lianzhen Zhong,
Fan Yang,
Di Dong,
Hui Hui,
Jie Tian
Abstract:
A core challenge in survival analysis is to model the distribution of censored time-to-event data, where the event of interest may be a death, failure, or occurrence of a specific event. Previous studies have shown that ranking and maximum likelihood estimation (MLE) loss functions are widely used for survival analysis. However, ranking loss only focuses on the ranking of survival time and does not consider the potential effect of samples on exact survival time values. Furthermore, the MLE is unbounded and easily subject to outliers (e.g., censored data), which may cause poor modeling performance. To handle the complexities of the learning process and exploit valuable survival time values, we propose a time-adaptive coordinate loss function, TripleSurv, to achieve adaptive adjustments by introducing the differences in the survival time between sample pairs into the ranking, which can encourage the model to quantitatively rank the relative risk of pairs, ultimately enhancing the accuracy of predictions. Most importantly, TripleSurv is proficient in quantifying the relative risk between samples by ranking the ordering of pairs, and considers the time interval as a trade-off to calibrate the robustness of the model over the sample distribution. Our TripleSurv is evaluated on three real-world survival datasets and a public synthetic dataset. The results show that our method outperforms the state-of-the-art methods and exhibits good model performance and robustness on modeling various sophisticated data distributions with different censor rates. Our code will be available upon acceptance.
Submitted 5 January, 2024;
originally announced January 2024.
-
CodeFuse-Query: A Data-Centric Static Code Analysis System for Large-Scale Organizations
Authors:
Xiaoheng Xie,
Gang Fan,
Xiaojun Lin,
Ang Zhou,
Shijie Li,
Xunjin Zheng,
Yinan Liang,
Yu Zhang,
Na Yu,
Haokun Li,
Xinyu Chen,
Yingzhuang Chen,
Yi Zhen,
Dejun Dong,
Xianjin Fu,
Jinzhou Su,
Fuxiong Pan,
Pengshuai Luo,
Youzheng Feng,
Ruoxiang Hu,
Jing Fan,
Jinguo Zhou,
Xiao Xiao,
Peng Di
Abstract:
In the domain of large-scale software development, the demands for dynamic and multifaceted static code analysis exceed the capabilities of traditional tools. To bridge this gap, we present CodeFuse-Query, a system that redefines static code analysis through the fusion of Domain Optimized System Design and Logic Oriented Computation Design.
CodeFuse-Query reimagines code analysis as a data computation task, supporting scans of over 10 billion lines of code daily and more than 300 different tasks. It optimizes resource utilization, prioritizes data reusability, applies incremental code extraction, and introduces task types specifically for Code Change, underscoring its domain-optimized design. The system's logic-oriented facet employs Datalog, utilizing a unique two-tiered schema, COREF, to convert source code into data facts. Through Godel, a distinctive language, CodeFuse-Query enables formulation of complex tasks as logical expressions, harnessing Datalog's declarative prowess.
This paper provides empirical evidence of CodeFuse-Query's transformative approach, demonstrating its robustness, scalability, and efficiency. We also highlight its real-world impact and diverse applications, emphasizing its potential to reshape the landscape of static code analysis in the context of large-scale software development. Furthermore, in the spirit of collaboration and advancing the field, our project is open-sourced and the repository is available for public access.
Submitted 3 January, 2024;
originally announced January 2024.
-
Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
Authors:
Dota Tianai Dong,
Mariya Toneva
Abstract:
Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems with an understanding of the real world. Recent advances in video transformers that jointly learn from vision, text, and sound over time have made some progress toward this goal, but the degree to which these models integrate information from modalities still remains unclear. In this work, we present a promising approach for probing a pre-trained multimodal video transformer model by leveraging neuroscientific evidence of multimodal information processing in the brain. Using brain recordings of participants watching a popular TV show, we analyze the effects of multi-modal connections and interactions in a pre-trained multi-modal video transformer on the alignment with uni- and multi-modal brain regions. We find evidence that vision enhances masked prediction performance during language processing, providing support that cross-modal representations in models can benefit individual modalities. However, we don't find evidence of brain-relevant information captured by the joint multi-modal transformer representations beyond that captured by all of the individual modalities. We finally show that the brain alignment of the pre-trained joint representation can be improved by fine-tuning using a task that requires vision-language inferences. Overall, our results paint an optimistic picture of the ability of multi-modal transformers to integrate vision and language in partially brain-relevant ways but also show that improving the brain alignment of these models may require new approaches.
Submitted 13 November, 2023;
originally announced November 2023.
-
Mid-Long Term Daily Electricity Consumption Forecasting Based on Piecewise Linear Regression and Dilated Causal CNN
Authors:
Zhou Lan,
Ben Liu,
Yi Feng,
Danhuang Dong,
Peng Zhang
Abstract:
Daily electricity consumption forecasting is a classical problem. Existing forecasting algorithms tend to have decreased accuracy on special dates like holidays. This study decomposes the daily electricity consumption series into three components: trend, seasonal, and residual, and constructs a two-stage prediction method using piecewise linear regression as a filter and Dilated Causal CNN as a predictor. The specific steps involve setting breakpoints on the time axis and fitting the piecewise linear regression model with one-hot encoded information such as month, weekday, and holidays. For the challenging prediction of the Spring Festival, distance is introduced as a variable using a third-degree polynomial form in the model. The residual sequence obtained in the previous step is modeled using Dilated Causal CNN, and the final prediction of daily electricity consumption is the sum of the two-stage predictions. Experimental results demonstrate that this method achieves higher accuracy compared to existing approaches.
Submitted 23 October, 2023;
originally announced October 2023.
-
Learning Informative Latent Representation for Quantum State Tomography
Authors:
Hailan Ma,
Zhenhong Sun,
Daoyi Dong,
Dong Gong
Abstract:
Quantum state tomography (QST) is the process of reconstructing the complete state of a quantum system (mathematically described as a density matrix) through a series of different measurements. These measurements are performed on a number of identical copies of the quantum system, with outcomes gathered as frequencies. QST aims to recover the density matrix and the corresponding properties of the quantum state from the measured frequencies. Although an informationally complete set of measurements can specify quantum state accurately in an ideal scenario with a large number of identical copies, both measurements and identical copies are restricted and imperfect in practical scenarios, making QST highly ill-posed. The conventional QST methods usually assume adequate or accurate measured frequencies or rely on manually designed regularizers to handle the ill-posed reconstruction problem, suffering from limited applications in realistic scenarios. Recent advances in deep neural networks (DNNs) led to the emergence of deep learning (DL) in QST. However, existing DL-based QST approaches often employ generic DNN models that are not optimized for imperfect conditions of QST. In this paper, we propose a transformer-based autoencoder architecture tailored for QST with imperfect measurement data. Our method leverages a transformer-based encoder to extract an informative latent representation (ILR) from imperfect measurement data and employs a decoder to predict the quantum states based on the ILR. We anticipate that the high-dimensional ILR will capture more comprehensive information about quantum states. To achieve this, we conduct pre-training of the encoder using a pretext task that involves reconstructing high-quality frequencies from measured frequencies. Extensive simulations and experiments demonstrate the remarkable ability of the ILR in dealing with imperfect measurement data in QST.
Submitted 30 September, 2023;
originally announced October 2023.
-
EMID: An Emotional Aligned Dataset in Audio-Visual Modality
Authors:
Jialing Zou,
Jiahao Mei,
Guangze Ye,
Tianyu Huai,
Qiwei Shen,
Daoguo Dong
Abstract:
In this paper, we propose Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for the emotional matching of music and images, to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches that primarily focus on semantic correlations or roughly divided emotional relations, EMID emphasizes the significance of emotional consistency between music and images using an advanced 13-dimension emotional model. By incorporating emotional alignment into the dataset, it aims to establish pairs that closely align with human perceptual understanding, thereby raising the performance of auditory-visual cross-modal tasks. We also design a supplemental module named EMI-Adapter to optimize existing cross-modal alignment methods. To validate the effectiveness of the EMID, we conduct a psychological experiment, which has demonstrated that considering the emotional relationship between the two modalities effectively improves the accuracy of matching in abstract perspective. This research lays the foundation for future cross-modal research in domains such as psychotherapy and contributes to advancing the understanding and utilization of emotions in cross-modal alignment. The EMID dataset is available at https://github.com/ecnu-aigc/EMID.
Submitted 9 August, 2024; v1 submitted 15 August, 2023;
originally announced August 2023.
-
ColdNAS: Search to Modulate for User Cold-Start Recommendation
Authors:
Shiguang Wu,
Yaqing Wang,
Qinghe Jing,
Daxiang Dong,
Dejing Dou,
Quanming Yao
Abstract:
Making personalized recommendation for cold-start users, who only have a few interaction histories, is a challenging problem in recommendation systems. Recent works leverage hypernetworks to directly map user interaction histories to user-specific parameters, which are then used to modulate predictor by feature-wise linear modulation function. These works obtain the state-of-the-art performance. However, the physical meaning of scaling and shifting in recommendation data is unclear. Instead of using a fixed modulation function and deciding modulation position by expertise, we propose a modulation framework called ColdNAS for user cold-start problem, where we look for proper modulation structure, including function and position, via neural architecture search. We design a search space which covers broad models and theoretically prove that this search space can be transformed to a much smaller space, enabling an efficient and robust one-shot search algorithm. Extensive experimental results on benchmark datasets show that ColdNAS consistently performs the best. We observe that different modulation functions lead to the best performance on different datasets, which validates the necessity of designing a searching-based method.
Submitted 6 June, 2023;
originally announced June 2023.
-
Global Structure Knowledge-Guided Relation Extraction Method for Visually-Rich Document
Authors:
Xiangnan Chen,
Qian Xiao,
Juncheng Li,
Duo Dong,
Jun Lin,
Xiaozhong Liu,
Siliang Tang
Abstract:
Visual Relation Extraction (VRE) is a powerful means of discovering relationships between entities within visually-rich documents. Existing methods often focus on manipulating entity features to find pairwise relations, yet neglect the more fundamental structural information that links disparate entity pairs together. The absence of global structure information may make the model struggle to learn long-range relations and easily predict conflicting results. To alleviate such limitations, we propose a GlObal Structure knowledge-guided relation Extraction (GOSE) framework. GOSE begins by generating preliminary relation predictions on entity pairs extracted from a scanned image of the document. Subsequently, global structural knowledge is captured from the preceding iterative predictions and then incorporated into the representations of the entities. This "generate-capture-incorporate" cycle is repeated multiple times, allowing entity representations and global structure knowledge to be mutually reinforced. Extensive experiments validate that GOSE not only outperforms existing methods in the standard fine-tuning setting but also shows superior cross-lingual learning capabilities, and even yields stronger data-efficient performance in the low-resource setting. The code for GOSE will be available at https://github.com/chenxn2020/GOSE.
Submitted 27 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Time Optimal Ergodic Search
Authors:
Dayi Dong,
Henry Berger,
Ian Abraham
Abstract:
Robots with the ability to balance time against the thoroughness of search have the potential to provide time-critical assistance in applications such as search and rescue. Current advances in ergodic coverage-based search methods have enabled robots to completely explore and search an area in a fixed amount of time. However, optimizing time against the quality of autonomous ergodic search has yet to be demonstrated. In this paper, we investigate solutions to the time-optimal ergodic search problem for fast and adaptive robotic search and exploration. We pose the problem as a minimum time problem with an ergodic inequality constraint whose upper bound regulates and balances the granularity of search against time. Solutions to the problem are presented analytically using Pontryagin's conditions of optimality and demonstrated numerically through a direct transcription optimization approach. We show the efficacy of the approach in generating time-optimal ergodic search trajectories in simulation and with drone experiments in a cluttered environment. Obstacle avoidance is shown to be readily integrated into our formulation, and we perform ablation studies that investigate parameter dependence on optimized time and trajectory sensitivity for search.
Submitted 19 May, 2023;
originally announced May 2023.
-
Tomography of Quantum States from Structured Measurements via quantum-aware transformer
Authors:
Hailan Ma,
Zhenhong Sun,
Daoyi Dong,
Chunlin Chen,
Herschel Rabitz
Abstract:
Quantum state tomography (QST) is the process of reconstructing the state of a quantum system (mathematically described as a density matrix) through a series of different measurements, which can be solved by learning a parameterized function to translate experimentally measured statistics into physical density matrices. However, the specific structure of quantum measurements for characterizing a quantum state has been neglected in previous work. In this paper, we explore the similarity between highly structured sentences in natural language and intrinsically structured measurements in QST. To fully leverage the intrinsic quantum characteristics involved in QST, we design a quantum-aware transformer (QAT) model to capture the complex relationship between measured frequencies and density matrices. In particular, we query quantum operators in the architecture to facilitate informative representations of quantum data and integrate the Bures distance into the loss function to evaluate quantum state fidelity, thereby enabling the reconstruction of quantum states from measured data with high fidelity. Extensive simulations and experiments (on IBM quantum computers) demonstrate the superiority of the QAT in reconstructing quantum states with favorable robustness against experimental noise.
Submitted 17 November, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning
Authors:
Mingyang Geng,
Shangwen Wang,
Dezun Dong,
Haotian Wang,
Ge Li,
Zhi Jin,
Xiaoguang Mao,
Xiangke Liao
Abstract:
Code comment generation aims at generating natural language descriptions for a code snippet to facilitate developers' program comprehension activities. Despite being studied for a long time, a bottleneck for existing approaches is that given a code snippet, they can only generate one comment while developers usually need to know information from diverse perspectives such as what is the functionality of this code snippet and how to use it. To tackle this limitation, this study empirically investigates the feasibility of utilizing large language models (LLMs) to generate comments that can fulfill developers' diverse intents. Our intuition is based on the facts that (1) the code and its pairwise comment are used during the pre-training process of LLMs to build the semantic connection between the natural language and programming language, and (2) comments in the real-world projects, which are collected for the pre-training, usually contain different developers' intents. We thus postulate that the LLMs can already understand the code from different perspectives after the pre-training. Indeed, experiments on two large-scale datasets demonstrate the rationale of our insights: by adopting the in-context learning paradigm and giving adequate prompts to the LLM (e.g., providing it with ten or more examples), the LLM can significantly outperform a state-of-the-art supervised learning approach on generating comments with multiple intents. Results also show that customized strategies for constructing the prompts and post-processing strategies for reranking the results can both boost the LLM's performances, which shed light on future research directions for using LLMs to achieve comment generation.
Submitted 14 June, 2023; v1 submitted 22 April, 2023;
originally announced April 2023.
-
Auxiliary Task-based Deep Reinforcement Learning for Quantum Control
Authors:
Shumin Zhou,
Hailan Ma,
Sen Kuang,
Daoyi Dong
Abstract:
Due to its property of not requiring prior knowledge of the environment, reinforcement learning has significant potential for quantum control problems. In this work, we investigate the effectiveness of continuous control policies based on deep deterministic policy gradient. To solve the sparse reward signal in quantum learning control problems, we propose an auxiliary task-based deep reinforcement learning (AT-DRL) for quantum control. In particular, we first design a guided reward function based on the fidelity of quantum states that enables incremental fidelity improvement. Then, we introduce the concept of an auxiliary task whose network shares parameters with the main network to predict the reward provided by the environment (called the main task). The auxiliary task learns synchronously with the main task, allowing one to select the most relevant features of the environment, thus aiding the agent in comprehending how to achieve the desired state. The numerical simulations demonstrate that the proposed AT-DRL can provide a solution to the sparse reward in quantum systems, and has great potential in designing control pulses that achieve efficient quantum state preparation.
Submitted 28 February, 2023;
originally announced February 2023.
-
Xenos: Dataflow-Centric Optimization to Accelerate Model Inference on Edge Devices
Authors:
Zhang Runhua,
Jiang Hongxu,
Tian Fangzheng,
Geng Jinkun,
Li Xiaobin,
Ma Yuhang,
Zhu Chenhui,
Dong Dong,
Li Xin,
Wang Haojie
Abstract:
Edge computing has been emerging as a popular scenario for model inference. However, the inference performance on edge devices (e.g., Multi-Core DSP, FPGA, etc.) suffers from inefficiency due to the lack of highly optimized inference frameworks. Previous model inference frameworks are mainly developed in an operator-centric way, which provides insufficient acceleration to edge-based inference. Besides, the operator-centric framework incurs significant costs for continuous development and maintenance.
In this paper, we propose Xenos, which can automatically conduct dataflow-centric optimization of the computation graph and accelerate inference in two dimensions. Vertically, Xenos develops an operator linking technique to improve data locality by restructuring the inter-operator dataflow. Horizontally, Xenos develops a DSP-aware operator split technique to enable higher parallelism across multiple DSP units. Our evaluation demonstrates the effectiveness of vertical and horizontal dataflow optimization, which reduce the inference time by 21.2%–84.9% and 17.9%–96.2%, respectively. Besides, Xenos also outperforms the widely-used TVM by 3.22×–17.92×. Moreover, we extend Xenos to a distributed solution, which we call d-Xenos. d-Xenos employs multiple edge devices to jointly conduct the inference task and achieves a speedup of 3.68×–3.78× compared with a single device.
Submitted 1 February, 2023;
originally announced February 2023.
-
PAD-Net: An Efficient Framework for Dynamic Networks
Authors:
Shwai He,
Liang Ding,
Daize Dong,
Boan Liu,
Fuqiang Yu,
Dacheng Tao
Abstract:
Dynamic networks, e.g., Dynamic Convolution (DY-Conv) and the Mixture of Experts (MoE), have been extensively explored as they can considerably improve the model's representation power with acceptable computational cost. The common practice in implementing dynamic networks is to convert the given static layers into fully dynamic ones where all parameters are dynamic (at least within a single layer) and vary with the input. However, such a fully dynamic setting may cause redundant parameters and high deployment costs, limiting the applicability of dynamic networks to a broader range of tasks and models. The main contributions of our work are challenging the basic commonsense in dynamic networks and proposing a partially dynamic network, namely PAD-Net, to transform the redundant dynamic parameters into static ones. Also, we further design Iterative Mode Partition to partition dynamic and static parameters efficiently. Our method is comprehensively supported by large-scale experiments with two typical advanced dynamic architectures, i.e., DY-Conv and MoE, on both image classification and GLUE benchmarks. Encouragingly, we surpass the fully dynamic networks by $+0.7\%$ top-1 acc with only $30\%$ dynamic parameters for ResNet-50 and $+1.9\%$ average score in language understanding with only $50\%$ dynamic parameters for BERT. Code will be released at: https://github.com/Shwai-He/PAD-Net.
Submitted 31 May, 2023; v1 submitted 10 November, 2022;
originally announced November 2022.
-
Safety-Critical Ergodic Exploration in Cluttered Environments via Control Barrier Functions
Authors:
Cameron Lerch,
Dayi Dong,
Ian Abraham
Abstract:
In this paper, we address the problem of safe trajectory planning for autonomous search and exploration in constrained, cluttered environments. Guaranteeing safe (collision-free) trajectories is a challenging problem that has garnered significant attention due to its importance in the successful utilization of robots in search and exploration tasks. This work contributes a method that generates guaranteed safety-critical search trajectories in a cluttered environment. Our approach integrates safety-critical constraints using discrete control barrier functions (DCBFs) with ergodic trajectory optimization to enable safe exploration. Ergodic trajectory optimization plans continuous exploratory trajectories that guarantee complete coverage of a space. We demonstrate through simulated and experimental results on a drone that our approach is able to generate trajectories that enable safe and effective exploration. Furthermore, we show the efficacy of our approach for safe exploration using real-world single- and multi-drone platforms.
Submitted 29 April, 2023; v1 submitted 8 November, 2022;
originally announced November 2022.
-
2D and 3D CT Radiomic Features Performance Comparison in Characterization of Gastric Cancer: A Multi-center Study
Authors:
Lingwei Meng,
Di Dong,
Xin Chen,
Mengjie Fang,
Rongpin Wang,
Jing Li,
Zaiyi Liu,
Jie Tian
Abstract:
Objective: Radiomics, an emerging tool for medical image analysis, has the potential to precisely characterize gastric cancer (GC). Whether to use one-slice 2D annotation or whole-volume 3D annotation remains a long-standing debate, especially for heterogeneous GC. We comprehensively compared the representation and discrimination capacity of 2D and 3D radiomic features regarding GC, via three tasks.
Methods: A total of 539 GC patients from four centers were retrospectively enrolled and divided into the training and validation cohorts. Radiomic features were extracted separately from 2D or 3D regions of interest (ROIs) annotated by radiologists. Feature selection and model construction procedures were customized for each combination of two modalities (2D or 3D) and three tasks. Subsequently, six machine learning models (Model_2D^LNM, Model_3D^LNM; Model_2D^LVI, Model_3D^LVI; Model_2D^pT, Model_3D^pT) were derived and evaluated to reflect the modalities' performance in characterizing GC. Furthermore, we performed an auxiliary experiment to assess the modalities' performance when the resampling spacing is different.
Results: Regarding the three tasks, the yielded areas under the curve (AUCs) were: Model_2D^LNM's 0.712 (95% confidence interval, 0.613-0.811), Model_3D^LNM's 0.680 (0.584-0.775); Model_2D^LVI's 0.677 (0.595-0.761), Model_3D^LVI's 0.615 (0.528-0.703); Model_2D^pT's 0.840 (0.779-0.901), Model_3D^pT's 0.813 (0.747-0.879). Moreover, the auxiliary experiment indicated that Models_2D are statistically more advantageous than Models_3D with different resampling spacings.
Conclusion: Models constructed with 2D radiomic features showed performance comparable to those constructed with 3D features in characterizing GC.
Significance: Our work indicated that time-saving 2D annotation would be the better choice in GC, and provided a related reference for further radiomics-based research.
Submitted 29 October, 2022;
originally announced October 2022.
-
SparseAdapter: An Easy Approach for Improving the Parameter-Efficiency of Adapters
Authors:
Shwai He,
Liang Ding,
Daize Dong,
Miao Zhang,
Dacheng Tao
Abstract:
Adapter Tuning, which freezes the pretrained language models (PLMs) and only fine-tunes a few extra modules, has become an appealing, efficient alternative to full model fine-tuning. Although computationally efficient, recent Adapters often increase parameters (e.g. bottleneck dimension) to match the performance of full model fine-tuning, which we argue goes against their original intention. In this work, we re-examine the parameter-efficiency of Adapters through the lens of network pruning (we name this plug-in concept SparseAdapter) and find that SparseAdapter can achieve comparable or better performance than standard Adapters when the sparse ratio reaches up to 80%. Based on our findings, we introduce an easy but effective setting, "Large-Sparse", to improve the model capacity of Adapters under the same parameter budget. Experiments on five competitive Adapters upon three advanced PLMs show that with a proper sparse method (e.g. SNIP) and ratio (e.g. 40%), SparseAdapter can consistently outperform its corresponding counterparts. Encouragingly, with the Large-Sparse setting, we can obtain further appealing gains, even outperforming full fine-tuning by a large margin. Our code will be released at: https://github.com/Shwai-He/SparseAdapter.
Submitted 10 November, 2022; v1 submitted 9 October, 2022;
originally announced October 2022.
-
Nearly all $k$-SAT functions are unate
Authors:
József Balogh,
Dingding Dong,
Bernard Lidický,
Nitya Mani,
Yufei Zhao
Abstract:
We prove that $1-o(1)$ fraction of all $k$-SAT functions on $n$ Boolean variables are unate (i.e., monotone after first negating some variables), for any fixed positive integer $k$ and as $n \to \infty$. This resolves a conjecture by Bollobás, Brightwell, and Leader from 2003.
Submitted 3 October, 2023; v1 submitted 11 September, 2022;
originally announced September 2022.
-
Large-scale Knowledge Distillation with Elastic Heterogeneous Computing Resources
Authors:
Ji Liu,
Daxiang Dong,
Xi Wang,
An Qin,
Xingjian Li,
Patrick Valduriez,
Dejing Dou,
Dianhai Yu
Abstract:
Although more layers and more parameters generally improve the accuracy of models, such big models generally have high computational complexity and require large memory, which exceeds the capacity of small devices for inference and incurs long training time. In addition, it is difficult to afford the long training and inference times of big models even on high-performance servers. As an efficient approach to compress a large deep model (a teacher model) into a compact model (a student model), knowledge distillation emerges as a promising way to deal with big models. Existing knowledge distillation methods cannot exploit elastically available computing resources and therefore suffer from low efficiency. In this paper, we propose an Elastic Deep Learning framework for knowledge Distillation, i.e., EDL-Dist. The advantages of EDL-Dist are three-fold. First, the inference and the training processes are separated. Second, elastically available computing resources can be utilized to improve efficiency. Third, fault tolerance of the training and inference processes is supported. Extensive experiments show that the throughput of EDL-Dist is up to 3.125 times that of the baseline method (online knowledge distillation), while the accuracy is similar or higher.
Submitted 14 July, 2022;
originally announced July 2022.
-
Subspace Phase Retrieval
Authors:
Mengchu Xu,
Dekuan Dong,
Jian Wang
Abstract:
In recent years, phase retrieval has received much attention in statistics, applied mathematics and optical engineering. In this paper, we propose an efficient algorithm, termed Subspace Phase Retrieval (SPR), which can accurately recover an $n$-dimensional $k$-sparse complex-valued signal $\mathbf{x}$ given its $\Omega(k^2\log n)$ magnitude-only Gaussian samples if the minimum nonzero entry of $\mathbf{x}$ satisfies $|x_{\min}| = \Omega(\|\mathbf{x}\|/\sqrt{k})$. Furthermore, if the energy sum of the most significant $\sqrt{k}$ elements in $\mathbf{x}$ is comparable to $\|\mathbf{x}\|^2$, the SPR algorithm can exactly recover $\mathbf{x}$ with $\Omega(k \log n)$ magnitude-only samples, which attains the information-theoretic sampling complexity for sparse phase retrieval. Numerical experiments demonstrate that the proposed algorithm achieves state-of-the-art reconstruction performance compared to existing ones.
Submitted 7 April, 2024; v1 submitted 6 June, 2022;
originally announced June 2022.
-
On the number of error correcting codes
Authors:
Dingding Dong,
Nitya Mani,
Yufei Zhao
Abstract:
We show that for a fixed $q$, the number of $q$-ary $t$-error correcting codes of length $n$ is at most $2^{(1 + o(1)) H_q(n,t)}$ for all $t \leq (1 - q^{-1})n - C_q\sqrt{n \log n}$ (for sufficiently large constant $C_q$), where $H_q(n, t) = q^n / V_q(n,t)$ is the Hamming bound and $V_q(n,t)$ is the cardinality of the radius $t$ Hamming ball. This proves a conjecture of Balogh, Treglown, and Wagner, who showed the result for $t = o(n^{1/3} (\log n)^{-2/3})$.
Submitted 24 May, 2022;
originally announced May 2022.
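
As a small worked illustration of the quantities named in the abstract above, the following Python sketch evaluates the Hamming ball size $V_q(n,t)$ and the Hamming bound $H_q(n,t) = q^n / V_q(n,t)$. The function names are our own and are not taken from the paper; the ball size uses the standard count $\sum_{i=0}^{t} \binom{n}{i}(q-1)^i$.

```python
from math import comb


def hamming_ball_size(q: int, n: int, t: int) -> int:
    """V_q(n, t): number of q-ary words of length n within Hamming distance t of a fixed word."""
    # Choose i positions to change, and one of (q - 1) other symbols for each.
    return sum(comb(n, i) * (q - 1) ** i for i in range(t + 1))


def hamming_bound(q: int, n: int, t: int) -> float:
    """H_q(n, t) = q^n / V_q(n, t), the sphere-packing upper bound on code size."""
    return q ** n / hamming_ball_size(q, n, t)


if __name__ == "__main__":
    # Binary, length 7, single-error-correcting.
    print(hamming_ball_size(2, 7, 1))
    print(hamming_bound(2, 7, 1))
```

For $q = 2$, $n = 7$, $t = 1$ the ball contains 8 words and the bound equals 16, which the binary [7,4] Hamming code attains.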
-
A Dirichlet Process Mixture of Robust Task Models for Scalable Lifelong Reinforcement Learning
Authors:
Zhi Wang,
Chunlin Chen,
Daoyi Dong
Abstract:
While reinforcement learning (RL) algorithms are achieving state-of-the-art performance in various challenging tasks, they can easily encounter catastrophic forgetting or interference when faced with lifelong streaming information. In this paper, we propose a scalable lifelong RL method that dynamically expands the network capacity to accommodate new knowledge while preventing past memories from being perturbed. We use a Dirichlet process mixture to model the non-stationary task distribution, which captures task relatedness by estimating the likelihood of task-to-cluster assignments and clusters the task models in a latent space. We formulate the prior distribution of the mixture as a Chinese restaurant process (CRP) that instantiates new mixture components as needed. The update and expansion of the mixture are governed by the Bayesian non-parametric framework with an expectation maximization (EM) procedure, which dynamically adapts the model complexity without explicit task boundaries or heuristics. Moreover, we use the domain randomization technique to train robust prior parameters for the initialization of each task model in the mixture, so that the resulting model can better generalize and adapt to unseen tasks. With extensive experiments conducted on robot navigation and locomotion domains, we show that our method successfully facilitates scalable lifelong RL and outperforms relevant existing methods.
Submitted 22 May, 2022;
originally announced May 2022.
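
To make the Chinese restaurant process (CRP) prior mentioned in the abstract above concrete, here is a minimal Python sketch of sampling cluster assignments from a CRP. It illustrates only the generic prior, in which a new item joins an existing cluster with probability proportional to that cluster's size and opens a new cluster with probability proportional to a concentration parameter. The names `crp_assignments` and `alpha` are our own assumptions, and the sketch does not reproduce the paper's likelihood model or EM procedure.

```python
import random


def crp_assignments(num_items: int, alpha: float, seed: int = 0):
    """Sample cluster assignments for num_items items from a CRP with concentration alpha > 0."""
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items currently in cluster k
    assignments = []  # assignments[i] = cluster index chosen for item i
    for _ in range(num_items):
        # Existing clusters are weighted by size; the last slot (weight alpha) means "new cluster".
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # instantiate a new mixture component
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = crp_assignments(num_items=20, alpha=1.0)
    print(assignments)
    print(counts)
```

Larger values of `alpha` make new clusters more likely, so the number of components grows with the data rather than being fixed in advance.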
-
Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning
Authors:
Jinmei Liu,
Zhi Wang,
Chunlin Chen,
Daoyi Dong
Abstract:
Bayesian policy reuse (BPR) is a general policy transfer framework for selecting a source policy from an offline library by inferring the task belief based on some observation signals and a trained observation model. In this paper, we propose an improved BPR method to achieve more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal that contains limited information and cannot be obtained until the end of an episode. Instead, we employ the state transition sample, which is informative and instantaneous, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require numerous samples to estimate the probability distribution of the tabular-based observation model, which may be expensive and even infeasible to learn and maintain, especially when using the state transition sample as the signal. Hence, we propose a scalable observation model based on fitting state transition functions of source tasks from only a small number of samples, which can generalize to any signals observed in the target task. Moreover, we extend the offline-mode BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which can avoid negative transfer when faced with new unknown tasks. Experimental results show that our method can consistently facilitate faster and more efficient policy transfer.
Submitted 13 July, 2023; v1 submitted 16 April, 2022;
originally announced April 2022.
-
SD-Conv: Towards the Parameter-Efficiency of Dynamic Convolution
Authors:
Shwai He,
Chenbo Jiang,
Daize Dong,
Liang Ding
Abstract:
Dynamic convolution achieves better performance for efficient CNNs at the cost of a negligible increase in FLOPs. However, the performance increase cannot match the significantly expanded number of parameters, which is the main bottleneck in real-world applications. In contrast, mask-based unstructured pruning obtains a lightweight network by removing redundancy in the heavy network. In this paper, we propose a new framework, Sparse Dynamic Convolution (SD-Conv), to naturally integrate these two paths such that it can inherit the advantages of the dynamic mechanism and sparsity. We first design a binary mask derived from a learnable threshold to prune static kernels, significantly reducing the parameters and computational cost while achieving higher performance on ImageNet-1K. We further transfer pretrained models to a variety of downstream tasks, showing consistently better results than baselines. We hope our SD-Conv could be an efficient alternative to conventional dynamic convolutions.
Submitted 26 May, 2023; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Depthwise Convolution for Multi-Agent Communication with Enhanced Mean-Field Approximation
Authors:
Donghan Xie,
Zhi Wang,
Chunlin Chen,
Daoyi Dong
Abstract:
Multi-agent settings remain a fundamental challenge in the reinforcement learning (RL) domain due to partial observability and the lack of accurate real-time interactions across agents. In this paper, we propose a new method based on local communication learning to tackle the multi-agent RL (MARL) challenge when a large number of agents coexist. First, we design a new communication protocol that exploits the ability of depthwise convolution to efficiently extract local relations and learn local communication between neighboring agents. To facilitate multi-agent coordination, we explicitly learn the effect of joint actions by taking the policies of neighboring agents as inputs. Second, we introduce the mean-field approximation into our method to reduce the scale of agent interactions. To more effectively coordinate the behaviors of neighboring agents, we enhance the mean-field approximation with a supervised policy rectification network (PRN) for rectifying real-time agent interactions and with a learnable compensation term for correcting the approximation bias. The proposed method enables efficient coordination and outperforms several baseline approaches on the adaptive traffic signal control (ATSC) task and the StarCraft II multi-agent challenge (SMAC).
Submitted 1 January, 2023; v1 submitted 6 March, 2022;
originally announced March 2022.
-
Echo state graph neural networks with analogue random resistor arrays
Authors:
Shaocong Wang,
Yi Li,
Dingchen Wang,
Woyu Zhang,
Xi Chen,
Danian Dong,
Songqi Wang,
Xumeng Zhang,
Peng Lin,
Claudio Gallicchio,
Xiaoxin Xu,
Qi Liu,
Kwang-Ting Cheng,
Zhongrui Wang,
Dashan Shang,
Ming Liu
Abstract:
Recent years have witnessed an unprecedented surge of interest, from social networks to drug discovery, in learning representations of graph-structured data. However, graph neural networks, the machine learning models for handling graph-structured data, face significant challenges when running on conventional digital hardware, including the von Neumann bottleneck incurred by physically separated memory and processing units, the slowdown of Moore's law due to transistor scaling limits, and expensive training costs. Here we present a novel hardware-software co-design, the random resistor array-based echo state graph neural network, which addresses these challenges. The random resistor arrays not only harness low-cost, nanoscale and stackable resistors for highly efficient in-memory computing using simple physical laws, but also leverage the intrinsic stochasticity of dielectric breakdown to implement random projections in hardware for an echo state network that effectively minimizes the training cost thanks to its fixed and random weights. The system demonstrates state-of-the-art performance on both graph classification using the MUTAG and COLLAB datasets and node classification using the CORA dataset, achieving 34.2x, 93.2x, and 570.4x improvement of energy efficiency and 98.27%, 99.46%, and 95.12% reduction of training cost compared to conventional graph learning on digital hardware, respectively, which may pave the way for next-generation AI systems for graph learning.
Submitted 30 December, 2021;
originally announced December 2021.
-
Human factors engineering research on single pilot operations for large commercial aircraft: Status and prospect
Authors:
Wei Xu,
Yong Chen,
Wenjun Dong,
Dayong Dong,
Liezhong Ge
Abstract:
The civil aviation community is actively exploring and developing solutions for single pilot operations (SPO) for large commercial aircraft. Human factors engineering research for SPO has been launched, and the research mainly focuses on three solutions: flight deck airborne equipment upgrade, flight support from ground stations, and the combined SPO solution of "flight deck airborne equipment upgrade, flight support from ground stations". This paper reviews and analyzes the progress of human factors engineering research on SPO. The preliminary research outcome tends to support the combined SPO solution. However, the current human factors engineering research is not comprehensive and cannot provide a complete human factors engineering solution for SPO. For future human factors engineering research, this paper analyzes the key human factors issues on SPO and points out the gaps in the current research and the areas for future work. Finally, this paper puts forward an overall strategy and recommendations for future human factors engineering research on SPO.
Submitted 14 October, 2021;
originally announced October 2021.
-
Residual Tensor Train: A Quantum-inspired Approach for Learning Multiple Multilinear Correlations
Authors:
Yiwei Chen,
Yu Pan,
Daoyi Dong
Abstract:
States of quantum many-body systems are defined in a high-dimensional Hilbert space, where rich and complex interactions among subsystems can be modelled. In machine learning, complex multiple multilinear correlations may also exist within input features. In this paper, we present a quantum-inspired multilinear model, named Residual Tensor Train (ResTT), to capture the multiple multilinear correlations of features, from low to high orders, within a single model. ResTT is able to build a robust decision boundary in a high-dimensional space for solving fitting and classification tasks. In particular, we prove that the fully-connected layer and the Volterra series can be taken as special cases of ResTT. Furthermore, we derive the rule for weight initialization that stabilizes the training of ResTT based on a mean-field analysis. We prove that such a rule is much more relaxed than that of TT, which means ResTT can easily address the vanishing and exploding gradient problem that exists in the existing TT models. Numerical experiments demonstrate that ResTT outperforms the state-of-the-art tensor network and benchmark deep learning models on MNIST and Fashion-MNIST datasets. Moreover, ResTT achieves better performance than other statistical methods on two practical examples with limited data which are known to have complex feature interactions.
Submitted 1 August, 2022; v1 submitted 19 August, 2021;
originally announced August 2021.
-
CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation
Authors:
Enda Yu,
Dezun Dong,
Yemao Xu,
Shuo Ouyang,
Xiangke Liao
Abstract:
Communication overhead is the key challenge for distributed training. Gradient compression is a widely used approach to reduce communication traffic. When combined with a parallel communication mechanism such as pipelining, gradient compression can greatly alleviate the impact of communication overhead. However, two problems with gradient compression remain to be solved. First, gradient compression introduces extra computation cost, which delays the next training iteration. Second, gradient compression usually decreases convergence accuracy.
Submitted 6 September, 2021; v1 submitted 20 June, 2021;
originally announced June 2021.
-
JIZHI: A Fast and Cost-Effective Model-As-A-Service System for Web-Scale Online Inference at Baidu
Authors:
Hao Liu,
Qian Gao,
Jiang Li,
Xiaochao Liao,
Hao Xiong,
Guangxing Chen,
Wenlin Wang,
Guobao Yang,
Zhiwei Zha,
Daxiang Dong,
Dejing Dou,
Haoyi Xiong
Abstract:
In modern internet industries, deep learning based recommender systems have become an indispensable building block for a wide spectrum of applications, such as search engines, news feeds, and short video clips. However, it remains challenging to serve well-trained deep models for online real-time inference, given the time-varying web-scale traffic from billions of users, in a cost-effective manner. In this work, we present JIZHI, a Model-as-a-Service system that handles hundreds of millions of online inference requests per second to huge deep models with more than trillions of sparse parameters, for over twenty real-time recommendation services at Baidu, Inc. In JIZHI, the inference workflow of every recommendation request is transformed into a Staged Event-Driven Pipeline (SEDP), where each node in the pipeline refers to a staged computation or I/O intensive task processor. As real-time inference requests arrive, each modularized processor can be run in a fully asynchronous way and managed separately. Besides, JIZHI introduces heterogeneous and hierarchical storage to further accelerate the online inference process by reducing unnecessary computations and potential data access latency induced by ultra-sparse model parameters. Moreover, an intelligent resource manager has been deployed to maximize the throughput of JIZHI over the shared infrastructure by searching for the optimal resource allocation plan from historical logs and fine-tuning the load shedding policies based on intermediate system feedback. Extensive experiments have been done to demonstrate the advantages of JIZHI from the perspectives of end-to-end service latency, system-wide throughput, and resource consumption. JIZHI has helped Baidu save more than ten million US dollars in hardware and utility costs while handling 200% more traffic without sacrificing inference efficiency.
Submitted 3 June, 2021;
originally announced June 2021.
-
Rule-Based Reinforcement Learning for Efficient Robot Navigation with Space Reduction
Authors:
Yuanyang Zhu,
Zhi Wang,
Chunlin Chen,
Daoyi Dong
Abstract:
For real-world deployments, it is critical to allow robots to navigate in complex environments autonomously. Traditional methods usually maintain an internal map of the environment, and then design several simple rules, in conjunction with a localization and planning approach, to navigate through the internal map. These approaches often involve a variety of assumptions and prior knowledge. In contrast, recent reinforcement learning (RL) methods can provide a model-free, self-learning mechanism as the robot interacts with an initially unknown environment, but are expensive to deploy in real-world scenarios due to inefficient exploration. In this paper, we focus on efficient navigation with the RL technique and combine the advantages of these two kinds of methods into a rule-based RL (RuRL) algorithm for reducing the sample complexity and cost of time. First, we use the rule of wall-following to generate a closed-loop trajectory. Second, we employ a reduction rule to shrink the trajectory, which in turn effectively reduces the redundant exploration space. Besides, we give the detailed theoretical guarantee that the optimal navigation path is still in the reduced space. Third, in the reduced space, we utilize the Pledge rule to guide the exploration strategy for accelerating the RL process at the early stage. Experiments conducted on real robot navigation problems in hex-grid environments demonstrate that RuRL can achieve improved navigation performance.
Submitted 15 April, 2021;
originally announced April 2021.
-
Bayesian adversarial multi-node bandit for optimal smart grid protection against cyber attacks
Authors:
Jianyu Xu,
Bin Liu,
Huadong Mo,
Daoyi Dong
Abstract:
The cybersecurity of smart grids has become one of the key problems in developing reliable modern power and energy systems. This paper introduces a non-stationary adversarial cost with a variation constraint for smart grids, which enables us to investigate the problem of optimal smart grid protection against cyber attacks in a relatively practical scenario. In particular, a Bayesian multi-node bandit (MNB) model with adversarial costs is constructed and a new regret function is defined for this model. An algorithm called Thompson-Hedge is presented to solve the problem, and the superior performance of the proposed algorithm is proven in terms of the convergence rate of the regret function. The applicability of the algorithm to real smart grid scenarios is verified and the performance of the algorithm is also demonstrated by numerical examples.
Submitted 20 February, 2021;
originally announced April 2021.
-
Deep Reinforcement Learning with Quantum-inspired Experience Replay
Authors:
Qing Wei,
Hailan Ma,
Chunlin Chen,
Daoyi Dong
Abstract:
In this paper, a novel training paradigm inspired by quantum computation is proposed for deep reinforcement learning (DRL) with experience replay. In contrast to the traditional experience replay mechanism in DRL, the proposed deep reinforcement learning with quantum-inspired experience replay (DRL-QER) adaptively chooses experiences from the replay buffer according to the complexity and the replayed times of each experience (also called transition), to achieve a balance between exploration and exploitation. In DRL-QER, transitions are first formulated in quantum representations, and then the preparation operation and the depreciation operation are performed on the transitions. In this process, the preparation operation reflects the relationship between the temporal difference errors (TD-errors) and the importance of the experiences, while the depreciation operation is taken into account to ensure the diversity of the transitions. The experimental results on Atari 2600 games show that DRL-QER outperforms state-of-the-art algorithms such as DRL-PER and DCRL on most of these games with improved training efficiency, and is also applicable to memory-based DRL approaches such as double network and dueling network.
Submitted 6 January, 2021;
originally announced January 2021.
-
Curriculum-based Deep Reinforcement Learning for Quantum Control
Authors:
Hailan Ma,
Daoyi Dong,
Steven X. Ding,
Chunlin Chen
Abstract:
Deep reinforcement learning has been recognized as an efficient technique to design optimal strategies for different complex systems without prior knowledge of the control landscape. To achieve fast and precise control for quantum systems, we propose a novel deep reinforcement learning approach by constructing a curriculum consisting of a set of intermediate tasks defined by a fidelity threshold. Tasks in a curriculum can be statically determined using empirical knowledge or adaptively generated during the learning process. By transferring knowledge between two successive tasks and sequencing tasks according to their difficulties, the proposed curriculum-based deep reinforcement learning (CDRL) method enables the agent to focus on easy tasks in the early stage, then move on to difficult tasks, and eventually approach the final task. Numerical simulations on closed quantum systems and open quantum systems demonstrate that the proposed method exhibits improved control performance for quantum systems and also provides an efficient way to identify optimal strategies with fewer control pulses.
Submitted 2 January, 2021; v1 submitted 30 December, 2020;
originally announced December 2020.
-
SSD-SGD: Communication sparsification for distributed deep learning training
Authors:
Yemao Xu,
Dezun Dong,
Yawei Zhao,
Weixia Xu,
Xiangke Liao
Abstract:
Intensive communication and synchronization cost for gradients and parameters is the well-known bottleneck of distributed deep learning training. Based on the observations that Synchronous SGD (SSGD) obtains good convergence accuracy while asynchronous SGD (ASGD) delivers a faster raw training speed, we propose Several Steps Delay SGD (SSD-SGD) to combine their merits, aiming at tackling the communication bottleneck via communication sparsification. SSD-SGD explores both global synchronous updates in the parameter servers and asynchronous local updates in the workers in each periodic iteration. The periodic and flexible synchronization lets SSD-SGD achieve good convergence accuracy and fast training speed. To the best of our knowledge, we strike a new balance between synchronization quality and communication sparsification, and improve the trade-off between accuracy and training speed. Specifically, the core components of SSD-SGD include a proper warm-up stage, a steps delay stage, and our novel algorithm of global gradient for local update (GLU). GLU is critical for local update operations to effectively compensate for the delayed local weights. Furthermore, we implement SSD-SGD on the MXNet framework and comprehensively evaluate its performance with the CIFAR-10 and ImageNet datasets. Experimental results show that SSD-SGD can accelerate distributed training speed under different experimental configurations, by up to 110%, while achieving good convergence accuracy.
Submitted 9 April, 2021; v1 submitted 9 December, 2020;
originally announced December 2020.
-
RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering
Authors:
Yingqi Qu,
Yuchen Ding,
Jing Liu,
Kai Liu,
Ruiyang Ren,
Wayne Xin Zhao,
Daxiang Dong,
Hua Wu,
Haifeng Wang
Abstract:
In open-domain question answering, dense passage retrieval has become a new paradigm to retrieve relevant passages for finding answers. Typically, the dual-encoder architecture is adopted to learn dense representations of questions and passages for semantic matching. However, it is difficult to effectively train a dual-encoder due to challenges including the discrepancy between training and inference, the existence of unlabeled positives, and limited training data. To address these challenges, we propose an optimized training approach, called RocketQA, to improve dense passage retrieval. We make three major technical contributions in RocketQA, namely cross-batch negatives, denoised hard negatives and data augmentation. The experimental results show that RocketQA significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions. We also conduct extensive experiments to examine the effectiveness of the three strategies in RocketQA. Besides, we demonstrate that the performance of end-to-end QA can be improved based on our RocketQA retriever.
Submitted 12 May, 2021; v1 submitted 16 October, 2020;
originally announced October 2020.