
DART: Deep Adversarial Automated Red Teaming for LLM Safety

Authors: Bojian Jiang, Yi Jing, Tianhao Shen, Qing Yang, Deyi Xiong

Abstract: Manual red teaming is a commonly used method to identify vulnerabilities in large language models (LLMs), but it is costly and unscalable. In contrast, automated red teaming uses a Red LLM to automatically generate adversarial prompts to the Target LLM, offering a scalable way for safety vulnerability detection. However, the difficulty of building a powerful automated Red LLM lies in the fact that the safety vulnerabilities of the Target LLM are dynamically changing with the evolution of the Target LLM. To mitigate this issue, we propose a Deep Adversarial Automated Red Teaming (DART) framework in which the Red LLM and Target LLM are deeply and dynamically interacting with each other in an iterative manner. In each iteration, to generate as many successful attacks as possible, the Red LLM not only takes into account the responses from the Target LLM, but also adversarially adjusts its attacking directions by monitoring the global diversity of generated attacks across multiple iterations. Simultaneously, to explore dynamically changing safety vulnerabilities of the Target LLM, we allow the Target LLM to enhance its safety via an active learning based data selection mechanism. Experimental results demonstrate that DART significantly reduces the safety risk of the target LLM. For human evaluation on the Anthropic Harmless dataset, compared to the instruction-tuning target LLM, DART reduces the violation risk by 53.4%. We will release the datasets and code of DART soon.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2403.20014 [pdf, other]

PURPLE: Making a Large Language Model a Better SQL Writer

Authors: Tonghui Ren, Yuankai Fan, Zhenying He, Ren Huang, Jiaqi Dai, Can Huang, Yinan Jing, Kai Zhang, Yifan Yang, X. Sean Wang

Abstract: Large Language Model (LLM) techniques play an increasingly important role in Natural Language to SQL (NL2SQL) translation. LLMs trained on extensive corpora have strong natural language understanding and basic SQL generation abilities without additional tuning specific to NL2SQL tasks. Existing LLM-based NL2SQL approaches try to improve the translation by enhancing the LLMs with an emphasis on user intention understanding. However, LLMs sometimes fail to generate appropriate SQL due to their lack of knowledge in organizing complex logical operator composition. A promising method is to input the LLMs with demonstrations, which include known NL2SQL translations from various databases. LLMs can learn to organize operator compositions from the input demonstrations for the given task. In this paper, we propose PURPLE (Pre-trained models Utilized to Retrieve Prompts for Logical Enhancement), which improves accuracy by retrieving demonstrations containing the requisite logical operator composition for the NL2SQL task at hand, thereby guiding LLMs to produce better SQL translation. PURPLE achieves a new state-of-the-art performance of 80.5% exact-set match accuracy and 87.8% execution match accuracy on the validation set of the popular NL2SQL benchmark Spider. PURPLE maintains high accuracy across diverse benchmarks, budgetary constraints, and various LLMs, showing robustness and cost-effectiveness.

Submitted 29 March, 2024; originally announced March 2024.

Comments: 12 pages, accepted by ICDE 2024 (40th IEEE International Conference on Data Engineering)

arXiv:2403.19275 [pdf, other]

Knowledge Boundary and Persona Dynamic Shape A Better Social Media Agent

Authors: Junkai Zhou, Liang Pang, Ya Jing, Jia Gu, Huawei Shen, Xueqi Cheng

Abstract: Constructing personalized and anthropomorphic agents holds significant importance in the simulation of social networks. However, there are still two key problems in existing works: the agent possesses world knowledge that does not belong to its personas, and it cannot eliminate the interference of diverse persona information on current actions, which reduces the personalization and anthropomorphism of the agent. To solve the above problems, we construct the social media agent based on personalized knowledge and dynamic persona information. For personalized knowledge, we add external knowledge sources and match them with the persona information of agents, thereby giving the agent personalized world knowledge. For dynamic persona information, we use current action information to internally retrieve the persona information of the agent, thereby reducing the interference of diverse persona information on the current action. To make the agent suitable for social media, we design five basic modules for it: persona, planning, action, memory and reflection. To provide an interaction and verification environment for the agent, we build a social media simulation sandbox. In the experimental verification, automatic and human evaluations demonstrated the effectiveness of the agent we constructed.

Submitted 2 April, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.17745 [pdf, other]

Leave No Patient Behind: Enhancing Medication Recommendation for Rare Disease Patients

Authors: Zihao Zhao, Yi Jing, Fuli Feng, Jiancan Wu, Chongming Gao, Xiangnan He

Abstract: Medication recommendation systems have gained significant attention in healthcare as a means of providing tailored and effective drug combinations based on patients' clinical information. However, existing approaches often suffer from fairness issues, as recommendations tend to be more accurate for patients with common diseases compared to those with rare conditions. In this paper, we propose a novel model called Robust and Accurate REcommendations for Medication (RAREMed), which leverages the pretrain-finetune learning paradigm to enhance accuracy for rare diseases. RAREMed employs a transformer encoder with a unified input sequence approach to capture complex relationships among disease and procedure codes. Additionally, it introduces two self-supervised pre-training tasks, namely Sequence Matching Prediction (SMP) and Self Reconstruction (SR), to learn specialized medication needs and interrelations among clinical codes. Experimental results on two real-world datasets demonstrate that RAREMed provides accurate drug sets for both rare and common disease patients, thereby mitigating unfairness in medication recommendation systems.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.16208 [pdf, ps, other]

Convergence analysis of OT-Flow for sample generation

Authors: Yang Jing, Lei Li

Abstract: Deep generative models aim to learn the underlying distribution of data and generate new ones. Despite the diversity of generative models and their high-quality generation performance in practice, most of them lack rigorous theoretical convergence proofs. In this work, we aim to establish some convergence results for OT-Flow, one of the deep generative models. First, by reformulating the framework of the OT-Flow model, we establish the $Γ$-convergence of the formulation of OT-Flow to the corresponding optimal transport (OT) problem as the regularization term parameter $α$ goes to infinity. Second, since the loss function will be approximated by the Monte Carlo method in training, we establish the convergence between the discrete loss function and the continuous one when the sample number $N$ goes to infinity as well. Meanwhile, the approximation capability of the neural network provides an upper bound for the discrete loss function of the minimizers. The proofs in both aspects provide convincing assurances for OT-Flow.

Submitted 24 March, 2024; originally announced March 2024.

arXiv:2403.00211 [pdf, other]

Trustworthy Self-Attention: Enabling the Network to Focus Only on the Most Relevant References

Authors: Yu Jing, Tan Yujuan, Ren Ao, Liu Duo

Abstract: The prediction of optical flow for occluded points is still a difficult problem that has not yet been solved. Recent methods use self-attention to find relevant non-occluded points as references for estimating the optical flow of occluded points based on the assumption of self-similarity. However, they rely on visual features of a single image and weak constraints, which are not sufficient to constrain the trained network to focus on erroneous and weakly relevant reference points. We make full use of online occlusion recognition information to construct occlusion extended visual features and two strong constraints, allowing the network to learn to focus only on the most relevant references without requiring occlusion ground truth to participate in the training of the network. Our method adds very few network parameters to the original framework, making it very lightweight. Extensive experiments show that our model has the greatest cross-dataset generalization. Our method achieves much greater error reduction, 18.6%, 16.2%, and 20.1% for all points, non-occluded points, and occluded points respectively from the state-of-the-art GMA-base method, MATCHFlow(GMA), on Sintel Albedo pass. Furthermore, our model achieves state-of-the-art performance on the Sintel benchmarks, ranking #1 among all published methods on Sintel clean pass. The code will be open-source.

Submitted 26 March, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

Comments: Correct Figure 1

arXiv:2402.17144 [pdf, other]

Metasql: A Generate-then-Rank Framework for Natural Language to SQL Translation

Authors: Yuankai Fan, Zhenying He, Tonghui Ren, Can Huang, Yinan Jing, Kai Zhang, X. Sean Wang

Abstract: The Natural Language Interface to Databases (NLIDB) empowers non-technical users with database access through intuitive natural language (NL) interactions. Advanced approaches, utilizing neural sequence-to-sequence models or large-scale language models, typically employ auto-regressive decoding to generate unique SQL queries sequentially. While these translation models have greatly improved the overall translation accuracy, surpassing 70% on NLIDB benchmarks, the use of auto-regressive decoding to generate single SQL queries may result in sub-optimal outputs, potentially leading to erroneous translations. In this paper, we propose Metasql, a unified generate-then-rank framework that can be flexibly incorporated with existing NLIDBs to consistently improve their translation accuracy. Metasql introduces query metadata to control the generation of better SQL query candidates and uses learning-to-rank algorithms to retrieve globally optimized queries. Specifically, Metasql first breaks down the meaning of the given NL query into a set of possible query metadata, representing the basic concepts of the semantics. These metadata are then used as language constraints to steer the underlying translation model toward generating a set of candidate SQL queries. Finally, Metasql ranks the candidates to identify the best matching one for the given NL query. Extensive experiments are performed to study Metasql on two public NLIDB benchmarks. The results show that the performance of the translation models can be effectively improved using Metasql.

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.15140 [pdf, other]

A Relation-Interactive Approach for Message Passing in Hyper-relational Knowledge Graphs

Authors: Yonglin Jing

Abstract: Hyper-relational knowledge graphs (KGs) contain additional key-value pairs, providing more information about the relations. In many scenarios, the same relation can have distinct key-value pairs, making the original triple fact more recognizable and specific. Prior studies on hyper-relational KGs have established a solid standard method for hyper-relational graph encoding. In this work, we propose a message-passing-based graph encoder with global relation structure awareness ability, which we call ReSaE. Compared to the prior state-of-the-art approach, ReSaE emphasizes the interaction of relations during the message passing process and optimizes the readout structure for link prediction tasks. Overall, ReSaE gives an encoding solution for hyper-relational KGs and ensures stronger performance on downstream link prediction tasks. Our experiments demonstrate that ReSaE achieves state-of-the-art performance on multiple link prediction benchmarks. Furthermore, we also analyze the influence of different model structures on model performance.

Submitted 1 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

arXiv:2401.05879 [pdf]

YOIO: You Only Iterate Once by mining and fusing multiple necessary global information in the optical flow estimation

Authors: Yu Jing, Tan Yujuan, Ren Ao, Liu Duo

Abstract: Occlusions pose a significant challenge to optical flow algorithms, even those that rely on global evidence. We consider an occluded point to be one that is imaged in the reference frame but not in the next. Estimating the motion of these points is extremely difficult, particularly in the two-frame setting. Previous work used only the current frame as input, which could not guarantee providing correct global reference information for occluded points, and had problems such as long calculation time and poor accuracy in predicting optical flow at occluded points. To enable both high accuracy and efficiency, we fully mine and utilize the spatiotemporal information provided by the frame pair, design a loopback judgment algorithm to ensure that correct global reference information is obtained, mine multiple necessary global information, and design an efficient refinement module that fuses these global information. Specifically, we propose a YOIO framework, which consists of three main components: an initial flow estimator, a multiple global information extraction module, and a unified refinement module. We demonstrate that optical flow estimates in the occluded regions can be significantly improved in only one iteration without damaging the performance in non-occluded regions. Compared with GMA, the optical flow prediction accuracy of this method in the occluded area is improved by more than 10%, and the occ_out area exceeds 15%, while the calculation time is 27% shorter. This approach, running up to 18.9 fps with 436*1024 image resolution, obtains new state-of-the-art results on the challenging Sintel dataset among all published and unpublished approaches that can run in real-time, suggesting a new paradigm for accurate and efficient optical flow estimation.

Submitted 11 January, 2024; originally announced January 2024.

Comments: arXiv admin note: text overlap with arXiv:2104.02409 by other authors

arXiv:2312.13139 [pdf, other]

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Authors: Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, Tao Kong

Abstract: Generative pre-trained models have demonstrated remarkable effectiveness in language and vision domains by learning useful representations. In this paper, we extend the scope of this effectiveness by showing that visual robot manipulation can significantly benefit from large-scale video generative pre-training. We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly finetuned on robot data after being pre-trained on a large-scale video dataset. We perform extensive experiments on the challenging CALVIN benchmark and a real robot. On the CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%. In the setting of zero-shot unseen scene generalization, GR-1 improves the success rate from 53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline methods and shows strong potential in generalization to unseen scenes and objects. We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation. Project page: https://GR1-Manipulation.github.io

Submitted 21 December, 2023; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: Project page: https://GR1-Manipulation.github.io

arXiv:2311.09829 [pdf, other]

FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models

Authors: Yimin Jing, Renren Jin, Jiahao Hu, Huishi Qiu, Xiaohua Wang, Peng Wang, Deyi Xiong

Abstract: The effective assessment of the instruction-following ability of large language models (LLMs) is of paramount importance. A model that cannot adhere to human instructions might not be able to provide reliable and helpful responses. In pursuit of this goal, various benchmarks have been constructed to evaluate the instruction-following capacity of these models. However, these benchmarks are limited to a single language and are constructed using automated approaches, which restricts their applicability and the quality of the test examples they contain. To bridge this gap, we introduce the FollowEval benchmark in this paper. This benchmark is composed of instances in both English and Chinese, and all test examples are crafted by human experts. Furthermore, the FollowEval benchmark is designed to assess LLMs across five critical dimensions of instruction following: string manipulation, commonsense reasoning, logical reasoning, spatial reasoning, and response constraints. To enhance the complexity and present a sufficient challenge, each test example is designed to evaluate more than one dimension. We have evaluated various LLMs using the FollowEval benchmark and found that their performance significantly lags behind that of humans. This highlights the considerable room for improvement in the instruction-following ability of these models.

Submitted 16 November, 2023; originally announced November 2023.

Comments: Work in progress

arXiv:2311.01378 [pdf, other]

Vision-Language Foundation Models as Effective Robot Imitators

Authors: Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, Tao Kong

Abstract: Recent progress in vision language foundation models has shown their ability to understand multimodal data and resolve complicated vision language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition gives RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance by a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy.

Submitted 4 February, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

Comments: Fix typos. Project page: https://roboflamingo.github.io

arXiv:2309.05073 [pdf, other]

FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions

Authors: Jiong Wang, Fengyu Yang, Wenbo Gou, Bingliang Li, Danqi Yan, Ailing Zeng, Yijun Gao, Junle Wang, Yanqing Jing, Ruimao Zhang

Abstract: Estimating the 3D structure of the human body from natural scenes is a fundamental aspect of visual perception. 3D human pose estimation is a vital step in advancing fields like AIGC and human-robot interaction, serving as a crucial technique for understanding and interacting with human actions in real-world settings. However, the current datasets, often collected under single laboratory conditions using complex motion capture equipment and unvarying backgrounds, are insufficient. The absence of datasets on variable conditions is stalling the progress of this crucial task. To facilitate the development of 3D pose estimation, we present FreeMan, the first large-scale, multi-view dataset collected under real-world conditions. FreeMan was captured by synchronizing 8 smartphones across diverse scenarios. It comprises 11M frames from 8000 sequences, viewed from different perspectives. These sequences cover 40 subjects across 10 different scenarios, each with varying lighting conditions. We have also established a semi-automated pipeline containing error detection to reduce the workload of manual checking and ensure precise annotation. We provide comprehensive evaluation baselines for a range of tasks, underlining the significant challenges posed by FreeMan. Further evaluations of standard indoor/outdoor human sensing datasets reveal that FreeMan offers robust representation transferability in real and complex scenes. Code and data are available at https://wangjiongw.github.io/freeman.

Submitted 3 April, 2024; v1 submitted 10 September, 2023; originally announced September 2023.

Comments: CVPR2024 camera ready version. 19 pages, 16 figures. Project page: https://wangjiongw.github.io/freeman/ ; API: https://github.com/wangjiongw/FreeMan_API

arXiv:2308.03624 [pdf, other]

MOMA-Force: Visual-Force Imitation for Real-World Mobile Manipulation

Authors: Taozheng Yang, Ya Jing, Hongtao Wu, Jiafeng Xu, Kuankuan Sima, Guangzeng Chen, Qie Sima, Tao Kong

Abstract: In this paper, we present a novel method for mobile manipulators to perform multiple contact-rich manipulation tasks. While learning-based methods have the potential to generate actions in an end-to-end manner, they often suffer from insufficient action accuracy and robustness against noise. On the other hand, classical control-based methods can enhance system robustness, but at the cost of extensive parameter tuning. To address these challenges, we present MOMA-Force, a visual-force imitation method that seamlessly combines representation learning for perception, imitation learning for complex motion generation, and admittance whole-body control for system robustness and controllability. MOMA-Force enables a mobile manipulator to learn multiple complex contact-rich tasks with high success rates and small contact forces. In a real household setting, our method outperforms baseline methods in terms of task success rates. Moreover, our method achieves smaller contact forces and smaller force variances compared to baseline methods without force imitation. Overall, we offer a promising approach for efficient and robust mobile manipulation in the real world. Videos and more details can be found at https://visual-force-imitation.github.io

Submitted 7 August, 2023; originally announced August 2023.

Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

arXiv:2308.03620 [pdf, other]

Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods

Authors: Ya Jing, Xuelin Zhu, Xingbin Liu, Qie Sima, Taozheng Yang, Yunhai Feng, Tao Kong

Abstract: Visual pre-training with large-scale real-world data has made great progress in recent years, showing great potential in robot learning with pixel observations. However, the recipes of visual pre-training for robot manipulation tasks are yet to be built. In this paper, we thoroughly investigate the effects of visual pre-training strategies on robot manipulation tasks from three fundamental perspectives: pre-training datasets, model architectures and training methods. Several significant experimental findings are provided that are beneficial for robot learning. Further, we propose a visual pre-training scheme for robot manipulation termed Vi-PRoM, which combines self-supervised learning and supervised learning. Concretely, the former employs contrastive learning to acquire underlying patterns from large-scale unlabeled data, while the latter aims to learn visual semantics and temporal dynamics. Extensive experiments on robot manipulations in various simulation environments and the real robot demonstrate the superiority of the proposed scheme. Videos and more details can be found at https://explore-pretrain-robot.github.io.

Submitted 7 August, 2023; originally announced August 2023.

Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

arXiv:2307.16356 [pdf, other]

Interleaved Training for Massive MIMO Downlink via Exploring Spatial Correlation

Authors: Cheng Zhang, Chang Liu, Yindi Jing, Minjie Ding, Yongming Huang

Abstract: Interleaved training has been studied for single-user and multi-user massive MIMO downlink with either fully-digital or hybrid beamforming. However, the impact of channel correlation on its average training overhead is rarely addressed. In this paper, we explore the channel correlation to improve the interleaved training for single-user massive MIMO downlink. For the beam-domain interleaved training, we propose a modified scheme by optimizing the beam training codebook. The basic antenna-domain interleaved training is also improved by dynamically adjusting the training order of the base station (BS) antennas during the training process based on the values of the already trained channels. Exact and simplified approximate expressions of the average training length are derived in closed-form for the basic and modified beam-domain schemes and the basic antenna-domain scheme in correlated channels. For the modified antenna-domain scheme, a deep neural network (DNN)-based approximation is provided for fast performance evaluation. Analytical results and simulations verify the accuracy of our derived training length expressions and explicitly reveal the impact of system parameters on the average training length. In addition, the modified beam/antenna-domain schemes are shown to have a shorter average training length compared to the basic schemes.

Submitted 16 January, 2024; v1 submitted 30 July, 2023; originally announced July 2023.

Comments: 14 pages (double column), 8 figures. The paper has been accepted by IEEE Transactions on Wireless Communications

arXiv:2307.10730 [pdf, other]

Joint Port Selection Based Channel Acquisition for FDD Cell-Free Massive MIMO

Authors: Cheng Zhang, Pengguang Du, Minjie Ding, Yindi Jing, Yongming Huang

Abstract: In frequency division duplexing (FDD) cell-free massive MIMO, the acquisition of the channel state information (CSI) is very challenging because of the large overhead required for the training and feedback of the downlink channels of multiple cooperating base stations (BSs). In this paper, for systems with partial uplink-downlink channel reciprocity, and a general spatial domain channel model with variations in the average port power and correlation among port coefficients, we propose a joint-port-selection-based CSI acquisition and feedback scheme for the downlink transmission with zero-forcing precoding. The scheme uses an eigenvalue-decomposition-based transformation to reduce the feedback overhead by exploring the port correlation. We derive the sum-rate of the system for any port selection. Based on the sum-rate result, we propose a low-complexity greedy-search-based joint port selection (GS-JPS) algorithm. Moreover, to adapt to fast time-varying scenarios, a supervised deep learning-enhanced joint port selection (DL-JPS) algorithm is proposed. Simulations verify the effectiveness of our proposed schemes and their advantage over existing port-selection channel acquisition schemes.

Submitted 12 January, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: 15 pages, 11 figures. The paper has been accepted by IEEE TRANSACTIONS ON COMMUNICATIONS

arXiv:2306.07610 [pdf, other]

Soft Language Clustering for Multilingual Model Pre-training

Authors: Jiali Zeng, Yufan Jiang, Yongjing Yin, Yi Jing, Fandong Meng, Binghuai Lin, Yunbo Cao, Jie Zhou

Abstract: Multilingual pre-trained language models have demonstrated impressive (zero-shot) cross-lingual transfer abilities; however, their performance is hindered when the target language has distant typology from source languages or when pre-training data is limited in size. In this paper, we propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods. On the tasks of XTREME including text classification, sequence labeling, question answering, and sentence retrieval, both base- and large-size language models pre-trained with our proposed method exhibit consistent performance improvement. Furthermore, it provides substantial advantages for low-resource languages in unsupervised sentence retrieval and for target languages that differ greatly from the source language in cross-lingual transfer.

Submitted 13 June, 2023; originally announced June 2023.

arXiv:2305.16982 [pdf, other]

TranSFormer: Slow-Fast Transformer for Machine Translation

Authors: Bei Li, Yi Jing, Xu Tan, Zhen Xing, Tong Xiao, Jingbo Zhu

Abstract: Learning multiscale Transformer models has been evidenced as a viable approach to augmenting machine translation systems. Prior research has primarily focused on treating subwords as basic units in developing such systems. However, the incorporation of fine-grained character-level features into multiscale Transformer has not yet been explored. In this work, we present a Slow-Fast two-stream learning model, referred to as TranSFormer, which utilizes a "slow" branch to deal with subword sequences and a "fast" branch to deal with longer character sequences. This model is efficient since the fast branch is very lightweight by reducing the model width, and yet provides useful fine-grained features for the slow branch. Our TranSFormer shows consistent BLEU improvements (larger than 1 BLEU point) on several machine translation benchmarks.

Submitted 26 May, 2023; originally announced May 2023.

Comments: Accepted by Findings of ACL2023

arXiv:2304.14593 [pdf, other]

Deep Graph Reprogramming

Authors: Yongcheng Jing, Chongbin Yuan, Li Ju, Yiding Yang, Xinchao Wang, Dacheng Tao

Abstract: In this paper, we explore a novel model reusing task tailored for graph neural networks (GNNs), termed "deep graph reprogramming". We strive to reprogram a pre-trained GNN, without amending raw node features or model parameters, to handle a range of cross-level downstream tasks in various domains. To this end, we propose an innovative Data Reprogramming paradigm alongside a Model Reprogramming paradigm. The former aims to address the challenge of diversified graph feature dimensions for various tasks on the input side, while the latter alleviates the dilemma of fixed per-task-per-model behavior on the model side. For data reprogramming, we specifically devise an elaborated Meta-FeatPadding method to deal with heterogeneous input dimensions, and also develop a transductive Edge-Slimming as well as an inductive Meta-GraPadding approach for diverse homogeneous samples. Meanwhile, for model reprogramming, we propose a novel task-adaptive Reprogrammable-Aggregator, to endow the frozen model with larger expressive capacities in handling cross-domain tasks. Experiments on fourteen datasets across node/graph classification/regression, 3D object recognition, and distributed action recognition, demonstrate that the proposed methods yield gratifying results, on par with those by re-training from scratch.

Submitted 27 April, 2023; originally announced April 2023.

Comments: CVPR 2023 Highlight

arXiv:2304.11595 [pdf, other]

Segment Anything in Non-Euclidean Domains: Challenges and Opportunities

Authors: Yongcheng Jing, Xinchao Wang, Dacheng Tao

Abstract: The recent work known as Segment Anything (SA) has made significant strides in pushing the boundaries of semantic segmentation into the era of foundation models. The impact of SA has sparked extremely active discussions and ushered in an encouraging new wave of developing foundation models for the diverse tasks in the Euclidean domain, such as object detection and image inpainting. Despite the promising advances led by SA, the concept has yet to be extended to the non-Euclidean graph domain. In this paper, we explore a novel Segment Non-Euclidean Anything (SNA) paradigm that strives to develop foundation models that can handle the diverse range of graph data within the non-Euclidean domain, seeking to expand the scope of SA and lay the groundwork for future research in this direction. To achieve this goal, we begin by discussing the recent achievements in foundation models associated with SA. We then shed light on the unique challenges that arise when applying the SA concept to graph analysis, which involves understanding the differences between the Euclidean and non-Euclidean domains from both the data and task perspectives. Motivated by these observations, we present several preliminary solutions to tackle the challenges of SNA and detail their corresponding limitations, along with several potential directions to pave the way for future SNA research. Experiments on five Open Graph Benchmark (OGB) datasets across various tasks, including graph property classification and regression, as well as multi-label prediction, demonstrate that the performance of the naive SNA solutions has considerable room for improvement, pointing towards a promising avenue for future exploration of Graph General Intelligence.

Submitted 23 April, 2023; originally announced April 2023.

Comments: Work in progress

arXiv:2304.04135 [pdf, other]

Propheter: Prophetic Teacher Guided Long-Tailed Distribution Learning

Authors: Wenxiang Xu, Yongcheng Jing, Linyun Zhou, Wenqi Huang, Lechao Cheng, Zunlei Feng, Mingli Song

Abstract: The problem of deep long-tailed learning, a prevalent challenge in the realm of generic visual recognition, persists in a multitude of real-world applications. To tackle the heavily-skewed dataset issue in long-tailed classification, prior efforts have sought to augment existing deep models with elaborate class-balancing strategies, such as class rebalancing, data augmentation, and module improvement. Despite the encouraging performance, the limited class knowledge of the tailed classes in the training dataset still bottlenecks the performance of the existing deep models. In this paper, we propose an innovative long-tailed learning paradigm that breaks the bottleneck by guiding the learning of deep networks with external prior knowledge. This is specifically achieved by devising an elaborated "prophetic" teacher, termed "Propheter", that aims to learn the potential class distributions. The target long-tailed prediction model is then optimized under the instruction of the well-trained "Propheter", such that the distributions of different classes are as distinguishable as possible from each other. Experiments on eight long-tailed benchmarks across three architectures demonstrate that the proposed prophetic paradigm acts as a promising solution to the challenge of limited class knowledge in long-tailed datasets. The developed code is publicly available at https://github.com/tcmyxc/propheter.

Submitted 25 September, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

Comments: 12 pages

arXiv:2304.00782 [pdf, other]

NeMF: Inverse Volume Rendering with Neural Microflake Field

Authors: Youjia Zhang, Teng Xu, Junqing Yu, Yuteng Ye, Junle Wang, Yanqing Jing, Jingyi Yu, Wei Yang

Abstract: Recovering the physical attributes of an object's appearance from its images captured under an unknown illumination is challenging yet essential for photo-realistic rendering. Recent approaches adopt the emerging implicit scene representations and have shown impressive results. However, they unanimously adopt a surface-based representation, and hence cannot handle well scenes with very complex geometry, translucent objects, etc. In this paper, we propose to conduct inverse volume rendering, in contrast to surface-based rendering, by representing a scene using a microflake volume, which assumes the space is filled with infinitely small flakes and light reflects or scatters at each spatial location according to microflake distributions. We further adopt the coordinate networks to implicitly encode the microflake volume, and develop a differentiable microflake volume renderer to train the network in an end-to-end way in principle. Our NeMF enables effective recovery of appearance attributes for highly complex geometry and scattering objects, enables high-quality relighting and material editing, and especially simulates volume rendering effects, such as scattering, which is infeasible for surface-based approaches.

Submitted 3 April, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.10936 [pdf, other]

Learning to Explore Informative Trajectories and Samples for Embodied Perception

Authors: Ya Jing, Tao Kong

Abstract: We are witnessing significant progress on perception models, specifically those trained on large-scale internet images. However, efficiently generalizing these perception models to unseen embodied tasks, which would help various relevant applications (e.g., home robots), is insufficiently studied. Unlike static perception methods trained on pre-collected images, the embodied agent can move around in the environment and obtain images of objects from any viewpoint. Therefore, efficiently learning the exploration policy and collection method to gather informative training samples is the key to this task. To do this, we first build a 3D semantic distribution map to train the exploration policy self-supervised by introducing the semantic distribution disagreement and the semantic distribution uncertainty rewards. Note that the map is generated from multi-view observations and can weaken the impact of misidentification from an unfamiliar viewpoint. Our agent is then encouraged to explore the objects with different semantic distributions across viewpoints, or uncertain semantic distributions. With the explored informative trajectories, we propose to select hard samples on trajectories based on the semantic distribution uncertainty to reduce unnecessary observations that can be correctly identified. Experiments show that the perception model fine-tuned with our method outperforms the baselines trained with other exploration policies. Further, we demonstrate the robustness of our method in real-robot experiments.

Submitted 20 March, 2023; originally announced March 2023.

Comments: To be published in IEEE International Conference on Robotics and Automation (ICRA), 2023

arXiv:2212.05946 [pdf, other]

Evaluation and Improvement of Interpretability for Self-Explainable Part-Prototype Networks

Authors: Qihan Huang, Mengqi Xue, Wenqi Huang, Haofei Zhang, Jie Song, Yongcheng Jing, Mingli Song

Abstract: Part-prototype networks (e.g., ProtoPNet, ProtoTree, and ProtoPool) have attracted broad research interest for their intrinsic interpretability and comparable accuracy to non-interpretable counterparts. However, recent works find that the interpretability from prototypes is fragile, due to the semantic gap between the similarities in the feature space and those in the input space. In this work, we strive to address this challenge by making the first attempt to quantitatively and objectively evaluate the interpretability of the part-prototype networks. Specifically, we propose two evaluation metrics, termed consistency score and stability score, to evaluate the explanation consistency across images and the explanation robustness against perturbations, respectively, both of which are essential for explanations put into practice. Furthermore, we propose an elaborated part-prototype network with a shallow-deep feature alignment (SDFA) module and a score aggregation (SA) module to improve the interpretability of prototypes. We conduct systematic evaluation experiments and provide substantial discussions to uncover the interpretability of existing part-prototype networks. Experiments on three benchmarks across nine architectures demonstrate that our model achieves significantly superior performance to the state of the art, in both accuracy and interpretability. Our code is available at https://github.com/hqhQAQ/EvalProtoPNet.

Submitted 25 October, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

arXiv:2212.00532 [pdf, other]

EBHI-Seg: A Novel Enteroscope Biopsy Histopathological Haematoxylin and Eosin Image Dataset for Image Segmentation Tasks

Authors: Liyu Shi, Xiaoyan Li, Weiming Hu, Haoyuan Chen, Jing Chen, Zizhen Fan, Minghe Gao, Yujie Jing, Guotao Lu, Deguo Ma, Zhiyu Ma, Qingtao Meng, Dechao Tang, Hongzan Sun, Marcin Grzegorzek, Shouliang Qi, Yueyang Teng, Chen Li

Abstract: Background and Purpose: Colorectal cancer is a common fatal malignancy, the fourth most common cancer in men, and the third most common cancer in women worldwide. Timely detection of cancer in its early stages is essential for treating the disease. Currently, there is a lack of datasets for histopathological image segmentation of rectal cancer, which often hampers the assessment accuracy when computer technology is used to aid in diagnosis. Methods: This present study provided a new publicly available Enteroscope Biopsy Histopathological Hematoxylin and Eosin Image Dataset for Image Segmentation Tasks (EBHI-Seg). To demonstrate the validity and extensiveness of EBHI-Seg, the experimental results for EBHI-Seg are evaluated using classical machine learning methods and deep learning methods. Results: The experimental results showed that deep learning methods had a better image segmentation performance when utilizing EBHI-Seg. The maximum accuracy of the Dice evaluation metric for the classical machine learning method is 0.948, while the Dice evaluation metric for the deep learning method is 0.965. Conclusion: This publicly available dataset contained 5,170 images of six types of tumor differentiation stages and the corresponding ground truth images. The dataset can provide researchers with new segmentation algorithms for medical diagnosis of colorectal cancer, which can be used in the clinical setting to help doctors and patients.

Submitted 6 December, 2022; v1 submitted 1 December, 2022; originally announced December 2022.

arXiv:2211.08006 [pdf, other]

doi 10.23977/jipta.2023.060101

Auto-outlier Fusion Technique for Chest X-ray classification with Multi-head Attention Mechanism

Authors: Yuru Jing, Zixuan Li

Abstract: A chest X-ray is one of the most widely available radiological examinations for diagnosing and detecting various lung illnesses. The National Institutes of Health (NIH) provides an extensive database, ChestX-ray8 and ChestX-ray14, to help establish a deep learning community for analysing and predicting lung diseases. ChestX-ray14 consists of 112,120 frontal-view X-ray images of 30,805 distinct patients with fourteen text-mined disease image labels, where each image has multiple labels, and has been utilised in numerous studies in the past. To our current knowledge, no previous study has investigated outliers and multi-label impact for a single X-ray image during the preprocessing stage. The effect of outliers is mitigated in this paper by our proposed auto-outlier fusion technique. The image label is regenerated by concentrating on a particular factor in one image. The final cleaned dataset will be used to compare the mechanisms of multi-head self-attention and multi-head attention with generalised max-pooling.

Submitted 15 November, 2022; originally announced November 2022.

Comments: Accepted by the Journal of Image Processing Theory and Applications

arXiv:2211.07652 [pdf]

Machine Learning Performance Analysis to Predict Stroke Based on Imbalanced Medical Dataset

Authors: Yuru Jing

Abstract: Cerebral stroke, the second leading cause of death worldwide, has been a primary public health concern over the last few years. With the help of machine learning techniques, early detection of various stroke alerts becomes feasible, which can efficiently prevent or diminish the stroke. Medical datasets, however, are frequently unbalanced in their class labels, with a tendency to poorly predict minority classes. In this paper, the potential risk factors for stroke are investigated. Moreover, four distinct approaches are applied to improve the classification of the minority class in the imbalanced stroke dataset, and their performance is compared: the ensemble weight voting classifier, the Synthetic Minority Over-sampling Technique (SMOTE), Principal Component Analysis with K-Means Clustering (PCA-Kmeans), and Focal Loss with the Deep Neural Network (DNN). The analysis shows that SMOTE and PCA-Kmeans with DNN-Focal Loss work best for the limited size of a large, severely imbalanced dataset, outperforming Kaggle work by a factor of 2-4.

Submitted 14 November, 2022; originally announced November 2022.

Comments: Accepted by CAIBDA 2022

arXiv:2210.13076 [pdf, other]

Towards Unifying Reference Expression Generation and Comprehension

Authors: Duo Zheng, Tao Kong, Ya Jing, Jiaan Wang, Xiaojie Wang

Abstract: Reference Expression Generation (REG) and Comprehension (REC) are two highly correlated tasks. Modeling REG and REC simultaneously for utilizing the relation between them is a promising way to improve both. However, the problem of distinct inputs, as well as building connections between them in a single model, brings challenges to the design and training of the joint model. To address the problems, we propose a unified model for REG and REC, named UniRef. It unifies these two tasks with the carefully-designed Image-Region-Text Fusion layer (IRTF), which fuses the image, region and text via the image cross-attention and region cross-attention. Additionally, IRTF could generate pseudo input regions for the REC task to enable a uniform way for sharing the identical representation space across the REC and REG. We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train UniRef model on multi-granular corpora. The VMLM and TRP are directly related to REG and REC, respectively, but could help each other. We conduct extensive experiments on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg. Experimental results show that our model outperforms previous state-of-the-art methods on both REG and REC.

Submitted 24 October, 2022; originally announced October 2022.

Comments: Accepted to EMNLP 2022 (main conference)

arXiv:2209.07031 [pdf, other]

A semantic hierarchical graph neural network for text classification

Authors: Shuai Hua, Xinxin Li, Yunpeng Jing, Qunfeng Liu

Abstract: The key to the text classification task is language representation and important information extraction, and there are many related studies. In recent years, the research on graph neural network (GNN) in text classification has gradually emerged and shown its advantages, but the existing models mainly focus on directly inputting words as graph nodes into the GNN models, ignoring the different levels of semantic structure information in the samples. To address the issue, we propose a new hierarchical graph neural network (HieGNN) which extracts corresponding information from the word level, sentence level and document level respectively. Experimental results on several benchmark datasets are better than or similar to those of several baseline methods, which demonstrates that our model is able to obtain more useful information for classification from samples.

Submitted 14 September, 2022; originally announced September 2022.

Comments: 10 pages, 3 figures

arXiv:2208.12145 [pdf, other]

A deep learning framework for geodesics under spherical Wasserstein-Fisher-Rao metric and its application for weighted sample generation

Authors: Yang Jing, Jiaheng Chen, Lei Li, Jianfeng Lu

Abstract: Wasserstein-Fisher-Rao (WFR) distance is a family of metrics to gauge the discrepancy of two Radon measures, which takes into account both transportation and weight change. Spherical WFR distance is a projected version of WFR distance for probability measures so that the space of Radon measures equipped with WFR can be viewed as metric cone over the space of probability measures with spherical WFR. Compared to the case for Wasserstein distance, the understanding of geodesics under the spherical WFR is less clear and still an ongoing research focus. In this paper, we develop a deep learning framework to compute the geodesics under the spherical WFR metric, and the learned geodesics can be adopted to generate weighted samples. Our approach is based on a Benamou-Brenier type dynamic formulation for spherical WFR. To overcome the difficulty in enforcing the boundary constraint brought by the weight change, a Kullback-Leibler (KL) divergence term based on the inverse map is introduced into the cost function. Moreover, a new regularization term using the particle velocity is introduced as a substitute for the Hamilton-Jacobi equation for the potential in dynamic formula. When used for sample generation, our framework can be beneficial for applications with given weighted samples, especially in the Bayesian inference, compared to sample generation with previous flow models.

Submitted 25 August, 2022; originally announced August 2022.

arXiv:2207.11681 [pdf, other]

Learning Graph Neural Networks for Image Style Transfer

Authors: Yongcheng Jing, Yining Mao, Yiding Yang, Yibing Zhan, Mingli Song, Xinchao Wang, Dacheng Tao

Abstract: State-of-the-art parametric and non-parametric style transfer approaches are prone to either distorted local style patterns due to global statistics alignment, or unpleasing artifacts resulting from patch mismatching. In this paper, we study a novel semi-parametric neural style transfer framework that alleviates the deficiency of both parametric and non-parametric stylization. The core idea of our approach is to establish accurate and fine-grained content-style correspondences using graph neural networks (GNNs). To this end, we develop an elaborated GNN model with content and style local patches as the graph vertices. The style transfer procedure is then modeled as the attention-based heterogeneous message passing between the style and content nodes in a learnable manner, leading to adaptive many-to-one style-content correlations at the local patch level. In addition, an elaborated deformable graph convolutional operation is introduced for cross-scale style-content matching. Experimental results demonstrate that the proposed semi-parametric image stylization approach yields encouraging results on the challenging style patterns, preserving both global appearance and exquisite details. Furthermore, by controlling the number of edges at the inference stage, the proposed method also triggers novel functionalities like diversified patch-based stylization with a single model.

Submitted 13 February, 2023; v1 submitted 24 July, 2022; originally announced July 2022.

Comments: Accepted to ECCV 2022

arXiv:2207.07812 [pdf, other]

doi 10.1007/s11633-022-1391-7

A Survey on Collaborative DNN Inference for Edge Intelligence

Authors: Weiqing Ren, Yuben Qu, Chao Dong, Yuqian Jing, Hao Sun, Qihui Wu, Song Guo

Abstract: With the vigorous development of artificial intelligence (AI), the intelligent applications based on deep neural network (DNN) change people's lifestyles and the production efficiency. However, the huge amount of computation and data generated from the network edge becomes the major bottleneck, and traditional cloud-based computing mode has been unable to meet the requirements of real-time processing tasks. To solve the above problems, by embedding AI model training and inference capabilities into the network edge, edge intelligence (EI) becomes a cutting-edge direction in the field of AI. Furthermore, collaborative DNN inference among the cloud, edge, and end device provides a promising way to boost the EI. Nevertheless, at present, EI oriented collaborative DNN inference is still in its early stage, lacking a systematic classification and discussion of existing research efforts. Thus motivated, we have made a comprehensive investigation on the recent studies about EI oriented collaborative DNN inference. In this paper, we firstly review the background and motivation of EI. Then, we classify four typical collaborative DNN inference paradigms for EI, and analyze the characteristics and key technologies of them. Finally, we summarize the current challenges of collaborative DNN inference, discuss the future development trend and provide the future research direction.

Submitted 15 July, 2022; originally announced July 2022.

Journal ref: Mach. Intell. Res. 20 (2023) 370-395

arXiv:2206.09337 [pdf, other]

Learning Multiscale Transformer Models for Sequence Generation

Authors: Bei Li, Tong Zheng, Yi Jing, Chengbo Jiao, Tong Xiao, Jingbo Zhu

Abstract: Multiscale feature hierarchies have witnessed success in the computer vision area. This further motivates researchers to design multiscale Transformers for natural language processing, mostly based on the self-attention mechanism, for example by restricting the receptive field across heads or extracting local fine-grained features via convolutions. However, most existing works directly modeled local features but ignored the word-boundary information. This results in redundant and ambiguous attention distributions, which lack interpretability. In this work, we define those scales in different linguistic units, including sub-words, words and phrases. We built a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge. The proposed Universal MultiScale Transformer, namely UMST, was evaluated on two sequence generation tasks. Notably, it yielded consistent performance gains over the strong baseline on several test sets without sacrificing the efficiency.

Submitted 19 June, 2022; originally announced June 2022.

Comments: accepted by ICML2022

arXiv:2205.11192 [pdf, other]

Active Domain Adaptation with Multi-level Contrastive Units for Semantic Segmentation

Authors: Hao Zhang, Ruimao Zhang, Zhanglin Peng, Junle Wang, Yanqing Jing

Abstract: To further reduce the cost of semi-supervised domain adaptation (SSDA) labeling, a more effective way is to use active learning (AL) to annotate a selected subset with specific properties. However, domain adaptation tasks are always addressed in two interactive aspects: domain transfer and the enhancement of discrimination, which requires the selected data to be both uncertain under the model and diverse in feature space. Contrary to active learning in classification tasks, it is usually challenging to select pixels that contain both the above properties in segmentation tasks, leading to the complex design of pixel selection strategy. To address such an issue, we propose a novel Active Domain Adaptation scheme with Multi-level Contrastive Units (ADA-MCU) for semantic image segmentation. A simple pixel selection strategy followed with the construction of multi-level contrastive units is introduced to optimize the model for both domain adaptation and active supervised learning. In practice, MCUs are constructed from intra-image, cross-image, and cross-domain levels by using both labeled and unlabeled pixels. At each level, we define contrastive losses from center-to-center and pixel-to-pixel manners, with the aim of jointly aligning the category centers and reducing outliers near the decision boundaries. In addition, we also introduce a categories correlation matrix to implicitly describe the relationship between categories, which are used to adjust the weights of the losses for MCUs. Extensive experimental results on standard benchmarks show that the proposed method achieves competitive performance against state-of-the-art SSDA methods with 50% fewer labeled pixels and significantly outperforms state-of-the-art with a large margin by using the same level of annotation cost.

Submitted 25 May, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

arXiv:2205.03043 [pdf, other]

doi 10.24963/ijcai.2022/682

Sound2Synth: Interpreting Sound via FM Synthesizer Parameters Estimation

Authors: Zui Chen, Yansen Jing, Shengcheng Yuan, Yifei Xu, Jian Wu, Hang Zhao

Abstract: A synthesizer is a type of electronic musical instrument that is now widely used in modern music production and sound design. Each parameter configuration of a synthesizer produces a unique timbre and can be viewed as a unique instrument. Estimating the parameter configuration that best restores a sound timbre is an important yet complicated problem, known as the synthesizer parameters estimation problem. We proposed a multi-modal deep-learning-based pipeline Sound2Synth, together with a network structure Prime-Dilated Convolution (PDC) specially designed to solve this problem. Our method achieved not only SOTA but also the first real-world applicable results on Dexed synthesizer, a popular FM synthesizer.

Submitted 28 July, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

Comments: 8 pages, 8 figures. v2: IJCAI2022 published, format revisions and bugfixes

arXiv:2204.07205 [pdf]

Expanding the Reach of Research Computing: A Landscape Study

Authors: Dhruva K. Chakravorty, Sarah K. Janes, James V. Howell, Lisa M. Perez, Amy Schultz, Marie Goldie, Austin L. Gamble, Rajiv Malkan, Honggao Liu, Daniel Mireles, Yuanqi Jing, Zhenhua He, Tim Cockerill

Abstract: Research-computing continues to play an ever increasing role in academia. Access to computing resources, however, varies greatly between institutions. Sustaining the growing need for computing skills and access to advanced cyberinfrastructure requires that computing resources be available to students at all levels of scholarship, including community colleges. The National Science Foundation-funded Building Research Innovation in Community Colleges (BRICCs) community set out to understand the challenges faced by administrators, researchers and faculty in building a sustainable research computing continuum that extends to smaller and two-year terminal degree granting institutions. BRICCs purpose is to address the technology gaps, and encourage the development of curriculum needed to grow a computationally proficient research workforce. Toward addressing these goals, we performed a landscape study that culminated with a community workshop. Here, we present our key findings from workshop discussions and identify next steps to be taken by BRICCs, funding agencies, and the broader cyberinfrastructure community.

Submitted 18 April, 2022; v1 submitted 14 April, 2022; originally announced April 2022.

arXiv:2203.15958 [pdf, other]

High-resolution Face Swapping via Latent Semantics Disentanglement

Authors: Yangyang Xu, Bailin Deng, Junle Wang, Yanqing Jing, Jia Pan, Shengfeng He

Abstract: We present a novel high-resolution face swapping method using the inherent prior knowledge of a pre-trained GAN model. Although previous research can leverage generative priors to produce high-resolution results, their quality can suffer from the entangled semantics of the latent space. We explicitly disentangle the latent semantics by utilizing the progressive nature of the generator, deriving structure attributes from the shallow layers and appearance attributes from the deeper ones. Identity and pose information within the structure attributes are further separated by introducing a landmark-driven structure transfer latent direction. The disentangled latent code produces rich generative features that incorporate feature blending to produce a plausible swapping result. We further extend our method to video face swapping by enforcing two spatio-temporal constraints on the latent space and the image space. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art image/video face swapping methods in terms of hallucination quality and consistency. Code can be found at: https://github.com/cnnlstm/FSLSD_HiRes.

Submitted 29 March, 2022; originally announced March 2022.

Comments: Paper is accepted by CVPR2022

arXiv:2203.14812 [pdf]

An attention mechanism based convolutional network for satellite precipitation downscaling over China

Authors: Yinghong Jing, Liupeng Lin, Xinghua Li, Tongwen Li, Huanfeng Shen

Abstract: Precipitation is a key part of hydrological circulation and is a sensitive indicator of climate change. The Integrated Multi-satellitE Retrievals for the Global Precipitation Measurement (GPM) mission (IMERG) datasets are widely used for global and regional precipitation investigations. However, their local application is limited by the relatively coarse spatial resolution. Therefore, in this paper, an attention mechanism based convolutional network (AMCN) is proposed to downscale GPM IMERG monthly precipitation data. The proposed method is an end-to-end network, which consists of a global cross-attention module, a multi-factor cross-attention module, and a residual convolutional module, comprehensively considering the potential relationships between precipitation and complicated surface characteristics. In addition, a degradation loss function based on low-resolution precipitation is designed to physically constrain the network training, to improve the robustness of the proposed network under different time and scale variations. The experiments demonstrate that the proposed network significantly outperforms three baseline methods. Finally, a geographic difference analysis method is introduced to further improve the downscaled results by incorporating in-situ measurements for high-quality and fine-scale precipitation estimation.

Submitted 28 March, 2022; originally announced March 2022.

arXiv:2203.09176 [pdf, other]

ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation

Authors: Bei Li, Quan Du, Tao Zhou, Yi Jing, Shuhan Zhou, Xin Zeng, Tong Xiao, JingBo Zhu, Xuebo Liu, Min Zhang

Abstract: Residual networks are an Euler discretization of solutions to Ordinary Differential Equations (ODE). This paper explores a deeper relationship between Transformer and numerical ODE methods. We first show that a residual block of layers in Transformer can be described as a higher-order solution to ODE. Inspired by this, we design a new architecture, ODE Transformer, which is analogous to the Runge-Kutta method that is well motivated in ODE. As a natural extension to Transformer, ODE Transformer is easy to implement and efficient to use. Experimental results on the large-scale machine translation, abstractive summarization, and grammar error correction tasks demonstrate the high genericity of ODE Transformer. It can gain large improvements in model performance over strong baselines (e.g., 30.77 and 44.11 BLEU scores on the WMT'14 English-German and English-French benchmarks) at a slight cost in inference efficiency.

Submitted 17 March, 2022; originally announced March 2022.

Comments: Long paper accepted by ACL2022 main conference. arXiv admin note: substantial text overlap with arXiv:2104.02308

arXiv:2201.08603 [pdf, other]

Trireme: Exploring Hierarchical Multi-Level Parallelism for Domain Specific Hardware Acceleration

Authors: Georgios Zacharopoulos, Adel Ejjeh, Ying Jing, En-Yu Yang, Tianyu Jia, Iulian Brumar, Jeremy Intan, Muhammad Huzaifa, Sarita Adve, Vikram Adve, Gu-Yeon Wei, David Brooks

Abstract: The design of heterogeneous systems that include domain specific accelerators is a challenging and time-consuming process. While taking into account area constraints, designers must decide which parts of an application to accelerate in hardware and which to leave in software. Moreover, applications in domains such as Extended Reality (XR) offer opportunities for various forms of parallel execution, including loop level, task level and pipeline parallelism. To assist the design process and expose every possible level of parallelism, we present Trireme, a fully automated tool-chain that explores multiple levels of parallelism and produces domain specific accelerator designs and configurations that maximize performance, given an area budget. Experiments on demanding benchmarks from the XR domain revealed a speedup of up to 20x, as well as a speedup of up to 37x for smaller applications, compared to software-only implementations.

Submitted 21 January, 2022; originally announced January 2022.

Comments: 20 pages

arXiv:2201.05232 [pdf, other]

FARSI: Facebook AR System Investigator for Agile Domain-Specific System-on-Chip Exploration

Authors: Behzad Boroujerdian, Ying Jing, Amit Kumar, Lavanya Subramanian, Luke Yen, Vincent Lee, Vivek Venkatesan, Amit Jindal, Robert Shearer, Vijay Janapa Reddi

Abstract: Domain-specific SoCs (DSSoCs) are attractive solutions for domains with stringent power/performance/area constraints; however, they suffer from two fundamental complexities. On the one hand, their many specialized hardware blocks result in complex systems and thus high development effort. On the other, their many system knobs expand the complexity of design space, making the search for the optimal design difficult. Thus to reach prevalence, taming such complexities is necessary. This work identifies necessary features of an early-stage design space exploration (DSE) framework that targets the complex design space of DSSoCs and further provides an instance of one called FARSI, (F)acebook (AR) (S)ystem (I)nvestigator. Concretely, FARSI provides an agile system-level simulator with speed up and accuracy of 8,400X and 98.5% compared to Synopsys Platform Architect. FARSI also provides an efficient exploration heuristic and achieves up to 16X improvement in convergence time compared to naive simulated annealing (SA). This is done by augmenting SA with architectural reasoning such as locality exploitation and bottleneck relaxation. Furthermore, we embed various co-design capabilities and show that on average, they have a 32% impact on the convergence rate. Finally, we demonstrate that using simple development-cost-aware policies can lower the system complexity, both in terms of the component count and variation by as much as 150% and 118% (e.g., for Network-on-a-Chip subsystem).

Submitted 17 January, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

arXiv:2110.12751 [pdf, other]

Maximum Correntropy Criterion Regression models with tending-to-zero scale parameters

Authors: Ying Jing, Lianqiang Yang

Abstract: Maximum correntropy criterion regression (MCCR) models have been well studied within the frame of statistical learning when the scale parameters take fixed values or go to infinity. This paper studies the MCCR models with tending-to-zero scale parameters. It is revealed that the optimal learning rate of MCCR models is ${\mathcal{O}}(n^{-1})$ in the asymptotic sense when the sample size $n$ goes to infinity. In the case of finite samples, the performances on robustness of MCCR, Huber and the least square regression models are compared. The applications of these three methods on real data are also displayed.

Submitted 25 October, 2021; originally announced October 2021.

arXiv:2109.12872 [pdf, other]

Meta-Aggregator: Learning to Aggregate for 1-bit Graph Neural Networks

Authors: Yongcheng Jing, Yiding Yang, Xinchao Wang, Mingli Song, Dacheng Tao

Abstract: In this paper, we study a novel meta aggregation scheme towards binarizing graph neural networks (GNNs). We begin by developing a vanilla 1-bit GNN framework that binarizes both the GNN parameters and the graph features. Despite the lightweight architecture, we observed that this vanilla framework suffered from insufficient discriminative power in distinguishing graph topologies, leading to a dramatic drop in performance. This discovery motivates us to devise meta aggregators to improve the expressive power of vanilla binarized GNNs, of which the aggregation schemes can be adaptively changed in a learnable manner based on the binarized features. Towards this end, we propose two dedicated forms of meta neighborhood aggregators, an exclusive meta aggregator termed as Greedy Gumbel Neighborhood Aggregator (GNA), and a diffused meta aggregator termed as Adaptable Hybrid Neighborhood Aggregator (ANA). GNA learns to exclusively pick one single optimal aggregator from a pool of candidates, while ANA learns a hybrid aggregation behavior to simultaneously retain the benefits of several individual aggregators. Furthermore, the proposed meta aggregators may readily serve as a generic plugin module into existing full-precision GNNs. Experiments across various domains demonstrate that the proposed method yields results superior to the state of the art.

Submitted 27 September, 2021; originally announced September 2021.

Comments: Accepted to ICCV 2021

arXiv:2109.10485 [pdf, other]

The NiuTrans Machine Translation Systems for WMT21

Authors: Shuhan Zhou, Tao Zhou, Binghao Wei, Yingfeng Luo, Yongyu Mu, Zefan Zhou, Chenglong Wang, Xuanjun Zhou, Chuanhao Lv, Yi Jing, Laohu Wang, Jingnan Zhang, Canan Huang, Zhongxiang Yan, Chi Hu, Bei Li, Tong Xiao, Jingbo Zhu

Abstract: This paper describes NiuTrans neural machine translation systems of the WMT 2021 news translation tasks. We made submissions to 9 language directions, including English↔{Chinese, Japanese, Russian, Icelandic} and English→Hausa tasks. Our primary systems are built on several effective variants of Transformer, e.g., Transformer-DLCL, ODE-Transformer. We also utilize back-translation, knowledge distillation, post-ensemble, and iterative fine-tuning techniques to enhance the model performance further.

Submitted 21 September, 2021; originally announced September 2021.

arXiv:2108.11082 [pdf, other]

3D Face Recognition: A Survey

Authors: Yaping Jing, Xuequan Lu, Shang Gao

Abstract: Face recognition is one of the most studied research topics in the community. In recent years, the research on face recognition has shifted to using 3D facial surfaces, as more discriminating features can be represented by the 3D geometric information. This survey focuses on reviewing the 3D face recognition techniques developed in the past ten years which are generally categorized into conventional methods and deep learning methods. The categorized techniques are evaluated using detailed descriptions of the representative works. The advantages and disadvantages of the techniques are summarized in terms of accuracy, complexity and robustness to face variation (expression, pose and occlusions, etc). The main contribution of this survey is that it comprehensively covers both conventional methods and deep learning methods on 3D face recognition. In addition, a review of available 3D face databases is provided, along with the discussion of future research challenges and directions.

Submitted 25 August, 2021; originally announced August 2021.

arXiv:2107.11099 [pdf, other]

doi 10.1007/s11128-022-03442-8

RGB Image Classification with Quantum Convolutional Ansaetze

Authors: Yu Jing, Xiaogang Li, Yang Yang, Chonghang Wu, Wenbing Fu, Wei Hu, Yuanyuan Li, Hua Xu

Abstract: With the rapid growth of qubit numbers and coherence times in quantum hardware technology, implementing shallow neural networks on the so-called Noisy Intermediate-Scale Quantum (NISQ) devices has attracted a lot of interest. Many quantum (convolutional) circuit ansaetze are proposed for grayscale images classification tasks with promising empirical results. However, when applying these ansaetze on RGB images, the intra-channel information that is useful for vision tasks is not extracted effectively. In this paper, we propose two types of quantum circuit ansaetze to simulate convolution operations on RGB images, which differ in the way how inter-channel and intra-channel information are extracted. To the best of our knowledge, this is the first work of a quantum convolutional circuit to deal with RGB images effectively, with a higher test accuracy compared to the purely classical CNNs. We also investigate the relationship between the size of quantum circuit ansatz and the learnability of the hybrid quantum-classical convolutional neural network. Through experiments based on CIFAR-10 and MNIST datasets, we demonstrate that a larger size of the quantum circuit ansatz improves predictive performance in multiclass classification tasks, providing useful insights for near term quantum algorithm developments.

Submitted 22 February, 2022; v1 submitted 23 July, 2021; originally announced July 2021.

Comments: https://link.springer.com/article/10.1007/s11128-022-03442-8

Journal ref: Quantum Inf Process 21, 101 (2022)

arXiv:2103.16284 [pdf, other]

Locate then Segment: A Strong Pipeline for Referring Image Segmentation

Authors: Ya Jing, Tao Kong, Wei Wang, Liang Wang, Lei Li, Tieniu Tan

Abstract: Referring image segmentation aims to segment the objects referred by a natural language expression. Previous methods usually focus on designing an implicit and recurrent feature interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask without explicitly modeling the localization information of the referent instances. To tackle these problems, we view this task from another perspective by decoupling it into a "Locate-Then-Segment" (LTS) scheme. Given a language expression, people generally first perform attention to the corresponding target image regions, then generate a fine segmentation mask about the object based on its context. The LTS first extracts and fuses both visual and textual features to get a cross-modal representation, then applies a cross-model interaction on the visual-textual features to locate the referred object with position prior, and finally generates the segmentation result with a light-weight segmentation network. Our LTS is simple but surprisingly effective. On three popular benchmark datasets, the LTS outperforms all the previous state-of-the-art methods by a large margin (e.g., +3.2% on RefCOCO+ and +3.4% on RefCOCOg). In addition, our model is more interpretable with explicitly locating the object, which is also proved by visualization experiments. We believe this framework is promising to serve as a strong baseline for referring image segmentation.

Submitted 30 March, 2021; originally announced March 2021.

Comments: CVPR 2021

arXiv:2103.14123 [pdf, other]

Preliminary Experimental Results of Context-Aware Teams of Multiple Autonomous Agents Operating under Constrained Communications

Authors: Jose Martinez-Lorenzo, Jeff Hudack, Yutao Jing, Michael Shaham, Zixuan Liang, Abdullah Al Bashit, Yushu Wu, Weite Zhang, Matthew Skopin, Juan Heredia-Juesas, Yuntao Ma, Tristan Sweeney, Nicolas Ares, Ari Fox

Abstract: This work presents and experimentally tests the framework used by our context-aware, distributed team of small Unmanned Aerial Systems (SUAS) capable of operating in real-time, in an autonomous fashion, and under constrained communications. Our framework relies on a three-layered approach: (1) Operational layer, where fast temporal and narrow spatial decisions are made; (2) Tactical Layer, where temporal and spatial decisions are made for a team of agents; and (3) Strategical Layer, where slow temporal and wide spatial decisions are made for the team of agents. These three layers are coordinated by an ad-hoc, software-defined communications network, which ensures sparse, but timely delivery of messages amongst groups and teams of agents at each layer even under constrained communications. Experimental results are presented for a team of 10 small unmanned aerial systems tasked with searching and monitoring a person in an open area. At the operational layer, our use case presents an agent autonomously performing searching, detection, localization, classification, identification, tracking, and following of the person, while avoiding malicious collisions. At the tactical layer, our experimental use case presents the cooperative interaction of a group of multiple agents that enable the monitoring of the targeted person over wider spatial and temporal regions. At the strategic layer, our use case involves the detection of complex behaviors (i.e., the person being followed enters a car and runs away, or the person being followed exits the car and runs away) that require strategic responses to successfully accomplish the mission.

Submitted 25 March, 2021; originally announced March 2021.

Comments: 7 pages, 6 figures

arXiv:2103.10478 [pdf, other]

Unsupervised Doppler Radar-Based Activity Recognition for e-Healthcare

Authors: Yordanka Karayaneva, Sara Sharifzadeh, Wenda Li, Yanguo Jing, Bo Tan

Abstract: Passive radio frequency (RF) sensing and monitoring of human daily activities in elderly care homes is an emerging topic. Micro-Doppler radars are an appealing solution considering their non-intrusiveness, deep penetration, and high-distance range. Unsupervised activity recognition using Doppler radar data has not received attention, in spite of its importance in cases of unlabelled or poorly labelled activities in real scenarios. This study proposes two unsupervised feature extraction methods for human activity monitoring using Doppler streams: a local Discrete Cosine Transform (DCT)-based feature extraction method and a local entropy-based feature extraction method. In addition, Convolutional Variational Autoencoder (CVAE) feature extraction is applied to Doppler radar data for the first time. The three feature extraction architectures are compared with the previously used Convolutional Autoencoder (CAE) and linear feature extraction based on Principal Component Analysis (PCA) and 2DPCA. Unsupervised clustering is performed using K-Means and K-Medoids. The results show the superiority of the DCT-based method, the entropy-based method, and CVAE features over CAE, PCA, and 2DPCA, by more than 5%-20% in average accuracy. With regard to computation time, the two proposed methods are noticeably faster than the existing CVAE. Furthermore, for high-dimensional data visualisation, three manifold learning techniques are considered. The methods are compared for the projection of raw data as well as the encoded CVAE features. All three methods show an improved visualisation ability when applied to the encoded CVAE features.

Submitted 2 November, 2021; v1 submitted 18 March, 2021; originally announced March 2021.

Showing 1–50 of 92 results for author: Jing, Y