Search | arXiv e-print repository

Learning Manipulation by Predicting Interaction

Authors: Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li

Abstract: Representation learning approaches for robotic manipulation have boomed in recent years. Due to the scarcity of in-domain robot data, prevailing methodologies tend to leverage large-scale human video datasets to extract generalizable features for visuomotor policy learning. Despite the progress achieved, prior endeavors disregard the interactive dynamics that capture behavior patterns and physical… ▽ More Representation learning approaches for robotic manipulation have boomed in recent years. Due to the scarcity of in-domain robot data, prevailing methodologies tend to leverage large-scale human video datasets to extract generalizable features for visuomotor policy learning. Despite the progress achieved, prior endeavors disregard the interactive dynamics that capture behavior patterns and physical interaction during the manipulation process, resulting in an inadequate understanding of the relationship between objects and the environment. To this end, we propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI) and enhances the visual representation.Given a pair of keyframes representing the initial and final states, along with language instructions, our algorithm predicts the transition frame and detects the interaction object, respectively. These two learning objectives achieve superior comprehension towards "how-to-interact" and "where-to-interact". We conduct a comprehensive evaluation of several challenging robotic tasks.The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms as well as simulation environments. Code and checkpoints are publicly shared at https://github.com/OpenDriveLab/MPI. △ Less

Submitted 1 June, 2024; originally announced June 2024.

Comments: Accepted to RSS 2024. Project page: https://github.com/OpenDriveLab/MPI

arXiv:2403.04593 [pdf, other]

Embodied Understanding of Driving Scenarios

Authors: Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li

Abstract: Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios. Such understanding is typically founded upon Vision-Language Models (VLMs). Nevertheless, existing VLMs are restricted to the 2D domain, devoid of spatial awareness and long-horizon extrapolation proficiencies. We revisit the key aspects of autonomous driving… ▽ More Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios. Such understanding is typically founded upon Vision-Language Models (VLMs). Nevertheless, existing VLMs are restricted to the 2D domain, devoid of spatial awareness and long-horizon extrapolation proficiencies. We revisit the key aspects of autonomous driving and formulate appropriate rubrics. Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans. ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities. Besides, the model employs time-aware token selection to accurately inquire about temporal cues. We instantiate ELM on the reformulated multi-faced benchmark, and it surpasses previous state-of-the-art approaches in all aspects. All code, data, and models will be publicly shared. △ Less

Submitted 7 March, 2024; originally announced March 2024.

Comments: 43 pages, 16 figures

arXiv:2312.13010 [pdf, other]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Authors: Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, Heming Cui

Abstract: The advancement of natural language processing (NLP) has been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite their advancements, challenges in balancing code snippet generation with effective test case gen… ▽ More The advancement of natural language processing (NLP) has been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite their advancements, challenges in balancing code snippet generation with effective test case generation and execution persist. To address these issues, this paper introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution comprising a multi-agent framework with specialized agents: the programmer agent, the test designer agent, and the test executor agent. During the coding procedure, the programmer agent will focus on the code generation and refinement based on the test executor agent's feedback. The test designer agent will generate test cases for the generated code, and the test executor agent will run the code with the test cases and write the feedback to the programmer. This collaborative system ensures robust code generation, surpassing the limitations of single-agent models and traditional methodologies. Our extensive experiments on 9 code generation models and 12 enhancement approaches showcase AgentCoder's superior performance over existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder (GPT-4) achieves 96.3\% and 91.8\% pass@1 in HumanEval and MBPP datasets with an overall token overhead of 56.9K and 66.3K, while state-of-the-art obtains only 90.2\% and 78.9\% pass@1 with an overall token overhead of 138.2K and 206.5K. △ Less

Submitted 24 May, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: 24 pages, 12 figures

arXiv:2309.14345 [pdf, other]

Bias Testing and Mitigation in LLM-based Code Generation

Authors: Dong Huang, Qingwen Bu, Jie Zhang, Xiaofei Xie, Junjie Chen, Heming Cui

Abstract: Utilizing state-of-the-art Large Language Models (LLMs), automatic code generation models play a pivotal role in enhancing the productivity of software development procedures. As the adoption of LLMs becomes more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social bias and unfairness, such as those related to age, gender, and race? This is… ▽ More Utilizing state-of-the-art Large Language Models (LLMs), automatic code generation models play a pivotal role in enhancing the productivity of software development procedures. As the adoption of LLMs becomes more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social bias and unfairness, such as those related to age, gender, and race? This issue concerns the integrity, fairness, and ethical foundation of software applications that depend on the code generated by these models, yet is under-explored in the literature. This paper presents a novel bias testing framework that is specifically designed for code generation tasks. Based on this framework, we conduct an extensive evaluation of the bias in code generated by five state-of-the-art LLMs. Our findings reveal that 20.29% to 44.93% code functions generated by the models under study are biased when handling bias sensitive tasks (i.e., tasks that involve sensitive attributes such as age and gender). This indicates that the existing LLMs can be unfair in code generation, posing risks of unintended and harmful software behaviors. To mitigate bias for code generation models, we evaluate five bias mitigation prompt strategies, i.e., utilizing bias testing results to refine the code (zero-shot), one-, few-shot, and two Chain-of-Thought (CoT) prompts. Our evaluation results illustrate that these strategies are all effective in mitigating bias. Overall, one-shot and few-shot learning are the two most effective. For GPT-4, 80% to 90% code bias can be removed with one-shot learning. △ Less

Submitted 24 May, 2024; v1 submitted 3 September, 2023; originally announced September 2023.

Comments: Title changed

arXiv:2308.10531 [pdf, other]

SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression

Authors: Qingwen Bu, Sungrae Park, Minsoo Khang, Yichuan Cheng

Abstract: Existing techniques for text detection can be broadly classified into two primary groups: segmentation-based and regression-based methods. Segmentation models offer enhanced robustness to font variations but require intricate post-processing, leading to high computational overhead. Regression-based methods undertake instance-aware prediction but face limitations in robustness and data efficiency d… ▽ More Existing techniques for text detection can be broadly classified into two primary groups: segmentation-based and regression-based methods. Segmentation models offer enhanced robustness to font variations but require intricate post-processing, leading to high computational overhead. Regression-based methods undertake instance-aware prediction but face limitations in robustness and data efficiency due to their reliance on high-level representations. In our academic pursuit, we propose SRFormer, a unified DETR-based model with amalgamated Segmentation and Regression, aiming at the synergistic harnessing of the inherent robustness in segmentation representations, along with the straightforward post-processing of instance-level regression. Our empirical analysis indicates that favorable segmentation predictions can be obtained at the initial decoder layers. In light of this, we constrain the incorporation of segmentation branches to the first few decoder layers and employ progressive regression refinement in subsequent layers, achieving performance gains while minimizing computational load from the mask.Furthermore, we propose a Mask-informed Query Enhancement module. We take the segmentation result as a natural soft-ROI to pool and extract robust pixel representations, which are then employed to enhance and diversify instance queries. Extensive experimentation across multiple benchmarks has yielded compelling findings, highlighting our method's exceptional robustness, superior training and data efficiency, as well as its state-of-the-art performance. Our code is available at https://github.com/retsuh-bqw/SRFormer-Text-Det. △ Less

Submitted 24 December, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

Comments: Title changed. Accepted to AAAI'24

arXiv:2308.08784 [pdf, other]

CodeCoT: Tackling Code Syntax Errors in CoT Reasoning for Code Generation

Authors: Dong Huang, Qingwen Bu, Yuhao Qing, Heming Cui

Abstract: Chain-of-thought (CoT) has emerged as a groundbreaking tool in NLP, notably for its efficacy in complex reasoning tasks, such as mathematical proofs. However, its application in code generation faces a distinct challenge, i.e., although the code generated with CoT reasoning is logically correct, it faces the problem of syntax error (e.g., invalid syntax error report) during code execution, which c… ▽ More Chain-of-thought (CoT) has emerged as a groundbreaking tool in NLP, notably for its efficacy in complex reasoning tasks, such as mathematical proofs. However, its application in code generation faces a distinct challenge, i.e., although the code generated with CoT reasoning is logically correct, it faces the problem of syntax error (e.g., invalid syntax error report) during code execution, which causes the CoT result's pass@1 in HumanEval even lower than the zero-shot result. In this paper, we present Code Chain-of-Thought (CodeCoT) that integrates CoT with a self-examination process for code generation. CodeCoT begins with the LLMs using CoT for initial code development to ensure the generated code follows the correct logic flow. Then, CodeCoT will generate test cases to validate whether the code has syntax errors during the execution. CodeCoT then employs a self-examination phase, in which the generated code is executed against these test cases in the local environment. If the local environment raises error information (e.g., invalid syntax error), CodeCoT will iteratively refine the code based on the feedback information. Within this loop, CodeCoT can make sure their generated codes not only follow the logic flow of the code description, but the syntax error will also be addressed with the self-examination process. Our evaluation results reveal that CodeCoT improves the effectiveness of code generation. For example, CodeCoT increases pass@1 from 75.6% to 79.3% for the HumanEval dataset. △ Less

Submitted 22 February, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: Title changed

arXiv:2307.11565 [pdf, other]

Adversarial Feature Map Pruning for Backdoor

Authors: Dong Huang, Qingwen Bu

Abstract: Deep neural networks have been widely used in many critical applications, such as autonomous vehicles and medical diagnosis. However, their security is threatened by backdoor attacks, which are achieved by adding artificial patterns to specific training data. Existing defense strategies primarily focus on using reverse engineering to reproduce the backdoor trigger generated by attackers and subseq… ▽ More Deep neural networks have been widely used in many critical applications, such as autonomous vehicles and medical diagnosis. However, their security is threatened by backdoor attacks, which are achieved by adding artificial patterns to specific training data. Existing defense strategies primarily focus on using reverse engineering to reproduce the backdoor trigger generated by attackers and subsequently repair the DNN model by adding the trigger into inputs and fine-tuning the model with ground-truth labels. However, once the trigger generated by the attackers is complex and invisible, the defender cannot reproduce the trigger successfully then the DNN model will not be repaired, as the trigger is not effectively removed. In this work, we propose Adversarial Feature Map Pruning for Backdoor (FMP) to mitigate backdoor from the DNN. Unlike existing defense strategies, which focus on reproducing backdoor triggers, FMP attempts to prune backdoor feature maps, which are trained to extract backdoor information from inputs. After pruning these backdoor feature maps, FMP will fine-tune the model with a secure subset of training data. Our experiments demonstrate that, compared to existing defense strategies, FMP can effectively reduce the Attack Success Rate (ASR) even against the most complex and invisible attack triggers (e.g., FMP decreases the ASR to 2.86\% in CIFAR10, which is 19.2\% to 65.41\% lower than baselines). Second, unlike conventional defense methods that tend to exhibit low robust accuracy (that is, the accuracy of the model on poisoned data), FMP achieves a higher RA, indicating its superiority in maintaining model performance while mitigating the effects of backdoor attacks (e.g., FMP obtains 87.40\% RA in CIFAR10). Our code is publicly available at: https://github.com/retsuh-bqw/FMP. △ Less

Submitted 23 February, 2024; v1 submitted 21 July, 2023; originally announced July 2023.

Comments: Accepted to ICLR 2024

arXiv:2307.11563 [pdf, other]

Feature Map Testing for Deep Neural Networks

Authors: Dong Huang, Qingwen Bu, Yahao Qing, Yichao Fu, Heming Cui

Abstract: Due to the widespread application of deep neural networks~(DNNs) in safety-critical tasks, deep learning testing has drawn increasing attention. During the testing process, test cases that have been fuzzed or selected using test metrics are fed into the model to find fault-inducing test units (e.g., neurons and feature maps, activating which will almost certainly result in a model error) and repor… ▽ More Due to the widespread application of deep neural networks~(DNNs) in safety-critical tasks, deep learning testing has drawn increasing attention. During the testing process, test cases that have been fuzzed or selected using test metrics are fed into the model to find fault-inducing test units (e.g., neurons and feature maps, activating which will almost certainly result in a model error) and report them to the DNN developer, who subsequently repair them~(e.g., retraining the model with test cases). Current test metrics, however, are primarily concerned with the neurons, which means that test cases that are discovered either by guided fuzzing or selection with these metrics focus on detecting fault-inducing neurons while failing to detect fault-inducing feature maps. In this work, we propose DeepFeature, which tests DNNs from the feature map level. When testing is conducted, DeepFeature will scrutinize every internal feature map in the model and identify vulnerabilities that can be enhanced through repairing to increase the model's overall performance. Exhaustive experiments are conducted to demonstrate that (1) DeepFeature is a strong tool for detecting the model's vulnerable feature maps; (2) DeepFeature's test case selection has a high fault detection rate and can detect more types of faults~(comparing DeepFeature to coverage-guided selection techniques, the fault detection rate is increased by 49.32\%). (3) DeepFeature's fuzzer also outperforms current fuzzing techniques and generates valuable test cases more efficiently. △ Less

Submitted 21 July, 2023; originally announced July 2023.

Comments: 12 pages, 5 figures. arXiv admin note: text overlap with arXiv:2307.11011

arXiv:2307.11011 [pdf, other]

Neuron Sensitivity Guided Test Case Selection for Deep Learning Testing

Authors: Dong Huang, Qingwen Bu, Yichao Fu, Yuhao Qing, Bocheng Xiao, Heming Cui

Abstract: Deep Neural Networks~(DNNs) have been widely deployed in software to address various tasks~(e.g., autonomous driving, medical diagnosis). However, they could also produce incorrect behaviors that result in financial losses and even threaten human safety. To reveal the incorrect behaviors in DNN and repair them, DNN developers often collect rich unlabeled datasets from the natural world and label t… ▽ More Deep Neural Networks~(DNNs) have been widely deployed in software to address various tasks~(e.g., autonomous driving, medical diagnosis). However, they could also produce incorrect behaviors that result in financial losses and even threaten human safety. To reveal the incorrect behaviors in DNN and repair them, DNN developers often collect rich unlabeled datasets from the natural world and label them to test the DNN models. However, properly labeling a large number of unlabeled datasets is a highly expensive and time-consuming task. To address the above-mentioned problem, we propose NSS, Neuron Sensitivity guided test case Selection, which can reduce the labeling time by selecting valuable test cases from unlabeled datasets. NSS leverages the internal neuron's information induced by test cases to select valuable test cases, which have high confidence in causing the model to behave incorrectly. We evaluate NSS with four widely used datasets and four well-designed DNN models compared to SOTA baseline methods. The results show that NSS performs well in assessing the test cases' probability of fault triggering and model improvement capabilities. Specifically, compared with baseline approaches, NSS obtains a higher fault detection rate~(e.g., when selecting 5\% test case from the unlabeled dataset in MNIST \& LeNet1 experiment, NSS can obtain 81.8\% fault detection rate, 20\% higher than baselines). △ Less

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2307.09763 [pdf, other]

Towards Building More Robust Models with Frequency Bias

Authors: Qingwen Bu, Dong Huang, Heming Cui

Abstract: The vulnerability of deep neural networks to adversarial samples has been a major impediment to their broad applications, despite their success in various fields. Recently, some works suggested that adversarially-trained models emphasize the importance of low-frequency information to achieve higher robustness. While several attempts have been made to leverage this frequency characteristic, they ha… ▽ More The vulnerability of deep neural networks to adversarial samples has been a major impediment to their broad applications, despite their success in various fields. Recently, some works suggested that adversarially-trained models emphasize the importance of low-frequency information to achieve higher robustness. While several attempts have been made to leverage this frequency characteristic, they have all faced the issue that applying low-pass filters directly to input images leads to irreversible loss of discriminative information and poor generalizability to datasets with distinct frequency features. This paper presents a plug-and-play module called the Frequency Preference Control Module that adaptively reconfigures the low- and high-frequency components of intermediate feature representations, providing better utilization of frequency in robust learning. Empirical studies show that our proposed module can be easily incorporated into any adversarial training framework, further improving model robustness across different architectures and datasets. Additionally, experiments were conducted to examine how the frequency bias of robust models impacts the adversarial training process and its final robustness, revealing interesting insights. △ Less

Submitted 27 July, 2023; v1 submitted 19 July, 2023; originally announced July 2023.

Comments: Accepted by ICCV23

arXiv:2208.08083 [pdf, other]

Two Heads are Better than One: Robust Learning Meets Multi-branch Models

Authors: Dong Huang, Qingwen Bu, Yuhao Qing, Haowen Pi, Sen Wang, Heming Cui

Abstract: Deep neural networks (DNNs) are vulnerable to adversarial examples, in which DNNs are misled to false outputs due to inputs containing imperceptible perturbations. Adversarial training, a reliable and effective method of defense, may significantly reduce the vulnerability of neural networks and becomes the de facto standard for robust learning. While many recent works practice the data-centric phi… ▽ More Deep neural networks (DNNs) are vulnerable to adversarial examples, in which DNNs are misled to false outputs due to inputs containing imperceptible perturbations. Adversarial training, a reliable and effective method of defense, may significantly reduce the vulnerability of neural networks and becomes the de facto standard for robust learning. While many recent works practice the data-centric philosophy, such as how to generate better adversarial examples or use generative models to produce additional training data, we look back to the models themselves and revisit the adversarial robustness from the perspective of deep feature distribution as an insightful complementarity. In this paper, we propose Branch Orthogonality adveRsarial Training (BORT) to obtain state-of-the-art performance with solely the original dataset for adversarial training. To practice our design idea of integrating multiple orthogonal solution spaces, we leverage a simple and straightforward multi-branch neural network that eclipses adversarial attacks with no increase in inference time. We heuristically propose a corresponding loss function, branch-orthogonal loss, to make each solution space of the multi-branch model orthogonal. We evaluate our approach on CIFAR-10, CIFAR-100, and SVHN against \ell_{\infty} norm-bounded perturbations of size ε= 8/255, respectively. Exhaustive experiments are conducted to show that our method goes beyond all state-of-the-art methods without any tricks. Compared to all methods that do not use additional data for training, our models achieve 67.3% and 41.5% robust accuracy on CIFAR-10 and CIFAR-100 (improving upon the state-of-the-art by +7.23% and +9.07%). We also outperform methods using a training set with a far larger scale than ours. All our models and codes are available online at https://github.com/huangd1999/BORT. △ Less

Submitted 17 August, 2022; originally announced August 2022.

Comments: 10 pages, 5 Figures

arXiv:2105.13987 [pdf, other]

ScalingNet: extracting features from raw EEG data for emotion recognition

Authors: Jingzhao Hu, Chen Wang, Qiaomei Jia, Qirong Bu, Jun Feng

Abstract: Convolutional Neural Networks(CNNs) has achieved remarkable performance breakthrough in a variety of tasks. Recently, CNNs based methods that are fed with hand-extracted EEG features gradually produce a powerful performance on the EEG data based emotion recognition task. In this paper, we propose a novel convolutional layer allowing to adaptively extract effective data-driven spectrogram-like feat… ▽ More Convolutional Neural Networks(CNNs) has achieved remarkable performance breakthrough in a variety of tasks. Recently, CNNs based methods that are fed with hand-extracted EEG features gradually produce a powerful performance on the EEG data based emotion recognition task. In this paper, we propose a novel convolutional layer allowing to adaptively extract effective data-driven spectrogram-like features from raw EEG signals, which we reference as scaling layer. Further, it leverages convolutional kernels scaled from one data-driven pattern to exposed a frequency-like dimension to address the shortcomings of prior methods requiring hand-extracted features or their approximations. The proposed neural network architecture based on the scaling layer, references as ScalingNet, has achieved the state-of-the-art result across the established DEAP benchmark dataset. △ Less

Submitted 7 February, 2021; originally announced May 2021.

arXiv:2102.02588 [pdf, other]

Lookup subnet based Spatial Graph Convolutional neural Network

Authors: Jingzhao Hu, Xiaoqi Zhang, Qiaomei Jia, Chen Wang, Qirong Bu, Jun Feng

Abstract: Convolutional Neural Networks(CNNs) has achieved remarkable performance breakthrough in Euclidean structure data. Recently, aggregation-transformation based Graph Neural networks(GNNs) gradually produce a powerful performance on non-Euclidean data. In this paper, we propose a cross-correlation based graph convolution method allowing to naturally generalize CNNs to non-Euclidean domains and inherit… ▽ More Convolutional Neural Networks(CNNs) has achieved remarkable performance breakthrough in Euclidean structure data. Recently, aggregation-transformation based Graph Neural networks(GNNs) gradually produce a powerful performance on non-Euclidean data. In this paper, we propose a cross-correlation based graph convolution method allowing to naturally generalize CNNs to non-Euclidean domains and inherit the excellent natures of CNNs, such as local filters, parameter sharing, flexible receptive field, etc. Meanwhile, it leverages dynamically generated convolution kernel and cross-correlation operators to address the shortcomings of prior methods based on aggregation-transformation or their approximations. Our method has achieved or matched popular state-of-the-art results across three established graph benchmarks: the Cora, Citeseer, and Pubmed citation network datasets. △ Less

Submitted 4 February, 2021; originally announced February 2021.

arXiv:2002.03152 [pdf, other]

CTM: Collaborative Temporal Modeling for Action Recognition

Authors: Qian Liu, Tao Wang, Jie Liu, Yang Guan, Qi Bu, Longfei Yang

Abstract: With the rapid development of digital multimedia, video understanding has become an important field. For action recognition, temporal dimension plays an important role, and this is quite different from image recognition. In order to learn powerful feature of videos, we propose a Collaborative Temporal Modeling (CTM) block (Figure 1) to learn temporal information for action recognition. Besides a p… ▽ More With the rapid development of digital multimedia, video understanding has become an important field. For action recognition, temporal dimension plays an important role, and this is quite different from image recognition. In order to learn powerful feature of videos, we propose a Collaborative Temporal Modeling (CTM) block (Figure 1) to learn temporal information for action recognition. Besides a parameter-free identity shortcut, as a separate temporal modeling block, CTM includes two collaborative paths: a spatial-aware temporal modeling path, which we propose the Temporal-Channel Convolution Module (TCCM) with unshared parameters for each spatial position (H*W) to build, and a spatial-unaware temporal modeling path. CTM blocks can seamlessly be inserted into many popular networks to generate CTM Networks and bring the capability of learning temporal information to 2D CNN backbone networks, which only capture spatial information. Experiments on several popular action recognition datasets demonstrate that CTM blocks bring the performance improvements on 2D CNN baselines, and our method achieves the competitive results against the state-of-the-art methods. Code will be made publicly available. △ Less

Submitted 8 February, 2020; originally announced February 2020.

Showing 1–14 of 14 results for author: Bu, Q