Search | arXiv e-print repository

LVBench: An Extreme Long Video Understanding Benchmark

Authors: Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang

Abstract: Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sport… ▽ More Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension. Our data and code are publicly available at: https://lvbench.github.io. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.01641 [pdf, other]

Reciprocal Reward Influence Encourages Cooperation From Self-Interested Agents

Authors: John L. Zhou, Weizhe Hong, Jonathan C. Kao

Abstract: Emergent cooperation among self-interested individuals is a widespread phenomenon in the natural world, but remains elusive in interactions between artificially intelligent agents. Instead, naïve reinforcement learning algorithms typically converge to Pareto-dominated outcomes in even the simplest of social dilemmas. An emerging class of opponent-shaping methods have demonstrated the ability to re… ▽ More Emergent cooperation among self-interested individuals is a widespread phenomenon in the natural world, but remains elusive in interactions between artificially intelligent agents. Instead, naïve reinforcement learning algorithms typically converge to Pareto-dominated outcomes in even the simplest of social dilemmas. An emerging class of opponent-shaping methods have demonstrated the ability to reach prosocial outcomes by influencing the learning of other agents. However, they rely on higher-order derivatives through the predicted learning step of other agents or learning meta-game dynamics, which in turn rely on stringent assumptions over opponent learning rules or exponential sample complexity, respectively. To provide a learning rule-agnostic and sample-efficient alternative, we introduce Reciprocators, reinforcement learning agents which are intrinsically motivated to reciprocate the influence of an opponent's actions on their returns. This approach effectively seeks to modify other agents' $Q$-values by increasing their return following beneficial actions (with respect to the Reciprocator) and decreasing it after detrimental actions, guiding them towards mutually beneficial actions without attempting to directly shape policy updates. We show that Reciprocators can be used to promote cooperation in a variety of temporally extended social dilemmas during simultaneous learning. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 9 pages, 4 figures

arXiv:2405.13197 [pdf, other]

Global-Local Detail Guided Transformer for Sea Ice Recognition in Optical Remote Sensing Images

Authors: Zhanchao Huang, Wenjun Hong, Hua Su

Abstract: The recognition of sea ice is of great significance for reflecting climate change and ensuring the safety of ship navigation. Recently, many deep learning based methods have been proposed and applied to segment and recognize sea ice regions. However, the diverse scales of sea ice areas, the zigzag and fine edge contours, and the difficulty in distinguishing different types of sea ice pose challeng… ▽ More The recognition of sea ice is of great significance for reflecting climate change and ensuring the safety of ship navigation. Recently, many deep learning based methods have been proposed and applied to segment and recognize sea ice regions. However, the diverse scales of sea ice areas, the zigzag and fine edge contours, and the difficulty in distinguishing different types of sea ice pose challenges to existing sea ice recognition models. In this paper, a Global-Local Detail Guided Transformer (GDGT) method is proposed for sea ice recognition in optical remote sensing images. In GDGT, a global-local feature fusiont mechanism is designed to fuse global structural correlation features and local spatial detail features. Furthermore, a detail-guided decoder is developed to retain more high-resolution detail information during feature reconstruction for improving the performance of sea ice recognition. Experiments on the produced sea ice dataset demonstrated the effectiveness and advancement of GDGT. △ Less

Submitted 21 May, 2024; originally announced May 2024.

Comments: 5 pages, 5 figures

Journal ref: IEEE IGARSS 2024

arXiv:2405.04312 [pdf, other]

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

Authors: Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang

Abstract: Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory during generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is often limited to 1024*1024. In this work. we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inferenc… ▽ More Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory during generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is often limited to 1024*1024. In this work. we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inference process and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than 5x memory when generating 4096*4096 images. The project URL is https://github.com/THUDM/Inf-DiT. △ Less

Submitted 8 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

arXiv:2404.12623 [pdf, other]

End-to-End Verifiable Decentralized Federated Learning

Authors: Chaehyeon Lee, Jonathan Heiss, Stefan Tai, James Won-Ki Hong

Abstract: Verifiable decentralized federated learning (FL) systems combining blockchains and zero-knowledge proofs (ZKP) make the computational integrity of local learning and global aggregation verifiable across workers. However, they are not end-to-end: data can still be corrupted prior to the learning. In this paper, we propose a verifiable decentralized FL system for end-to-end integrity and authenticit… ▽ More Verifiable decentralized federated learning (FL) systems combining blockchains and zero-knowledge proofs (ZKP) make the computational integrity of local learning and global aggregation verifiable across workers. However, they are not end-to-end: data can still be corrupted prior to the learning. In this paper, we propose a verifiable decentralized FL system for end-to-end integrity and authenticity of data and computation extending verifiability to the data source. Addressing an inherent conflict of confidentiality and transparency, we introduce a two-step proving and verification (2PV) method that we apply to central system procedures: a registration workflow that enables non-disclosing verification of device certificates and a learning workflow that extends existing blockchain and ZKP-based FL systems through non-disclosing data authenticity proofs. Our evaluation on a prototypical implementation demonstrates the technical feasibility with only marginal overheads to state-of-the-art solutions. △ Less

Submitted 19 April, 2024; originally announced April 2024.

Comments: 9 pages, 5 figures, This article has been accepted for presentation at the IEEE International Conference on Blockchain and Cryptocurrency (ICBC 2024)

arXiv:2403.14943 [pdf, ps, other]

Primary Rate Maximization in Movable Antennas Empowered Symbiotic Radio Communications

Authors: Bin Lyu, Hao Liu, Wenqing Hong, Shimin Gong, Feng Tian

Abstract: In this paper, we propose a movable antenna (MA) empowered scheme for symbiotic radio (SR) communication systems. Specifically, multiple antennas at the primary transmitter (PT) can be flexibly moved to favorable locations to boost the channel conditions of the primary and secondary transmissions. The primary transmission is achieved by the active transmission from the PT to the primary user (PU),… ▽ More In this paper, we propose a movable antenna (MA) empowered scheme for symbiotic radio (SR) communication systems. Specifically, multiple antennas at the primary transmitter (PT) can be flexibly moved to favorable locations to boost the channel conditions of the primary and secondary transmissions. The primary transmission is achieved by the active transmission from the PT to the primary user (PU), while the backscatter device (BD) takes a ride over the incident signal from the PT to passively send the secondary signal to the PU. Under this setup, we consider a primary rate maximization problem by jointly optimizing the transmit beamforming and the positions of MAs at the PT under a practical bit error rate constraint on the secondary transmission. Then, an alternating optimization framework with the utilization of the successive convex approximation, semi-definite processing and simulated annealing (SA) modified particle swarm optimization (SA-PSO) methods is proposed to find the solution of the transmit beamforming and MAs' positions. Finally, numerical results are provided to demonstrate the performance improvement provided by the proposed MA empowered scheme and the proposed algorithm. △ Less

Submitted 22 March, 2024; originally announced March 2024.

Comments: To appear in IEEE VTC-Spring 2024. 6 Pages,5 figures

arXiv:2402.04236 [pdf, other]

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Authors: Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang

Abstract: Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, further resulting in failures in meticulous visual problems and unfaithful responses. Drawing inspiration from human cognition in solving visual problems… ▽ More Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, further resulting in failures in meticulous visual problems and unfaithful responses. Drawing inspiration from human cognition in solving visual problems (e.g., marking, zoom in), this paper introduces Chain of Manipulations, a mechanism that enables VLMs to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) with results (e.g., boxes, image) actively without involving external tools, while also allowing users to trace error causes. We study the roadmap to implement this mechanism, including (1) a flexible design of manipulations upon extensive analysis, (2) an efficient automated data generation pipeline, (3) a compatible VLM architecture capable of multi-turn multi-image, and (4) a model training process for versatile capabilities. With the design, we also manually annotate 6K high-quality samples for the challenging graphical mathematical problems. Our trained model, \textbf{CogCoM}, equipped with this mechanism with 17B parameters achieves state-of-the-art performance across 9 benchmarks from 4 categories, demonstrating the effectiveness while preserving the interpretability. Our code, model weights, and collected data are publicly available at https://github.com/THUDM/CogCoM. △ Less

Submitted 22 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: 19 pages, 9 figures

arXiv:2312.08914 [pdf, other]

CogAgent: A Visual Language Model for GUI Agents

Authors: Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang

Abstract: People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billi… ▽ More People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120*1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW, advancing the state of the art. The model and codes are available at https://github.com/THUDM/CogVLM . △ Less

Submitted 21 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: 27 pages, 19 figures

arXiv:2312.06708 [pdf, other]

Neutral Editing Framework for Diffusion-based Video Editing

Authors: Sunjae Yoon, Gwanhyeong Koo, Ji Woo Hong, Chang D. Yoo

Abstract: Text-conditioned image editing has succeeded in various types of editing based on a diffusion framework. Unfortunately, this success did not carry over to a video, which continues to be challenging. Existing video editing systems are still limited to rigid-type editing such as style transfer and object overlay. To this end, this paper proposes Neutral Editing (NeuEdit) framework to enable complex… ▽ More Text-conditioned image editing has succeeded in various types of editing based on a diffusion framework. Unfortunately, this success did not carry over to a video, which continues to be challenging. Existing video editing systems are still limited to rigid-type editing such as style transfer and object overlay. To this end, this paper proposes Neutral Editing (NeuEdit) framework to enable complex non-rigid editing by changing the motion of a person/object in a video, which has never been attempted before. NeuEdit introduces a concept of `neutralization' that enhances a tuning-editing process of diffusion-based editing systems in a model-agnostic manner by leveraging input video and text without any other auxiliary aids (e.g., visual masks, video captions). Extensive experiments on numerous videos demonstrate adaptability and effectiveness of the NeuEdit framework. The website of our work is available here: https://neuedit.github.io △ Less

Submitted 10 December, 2023; originally announced December 2023.

Comments: 18 pages, 14 figures

arXiv:2312.05454 [pdf, other]

Model Evaluation for Domain Identification of Unknown Classes in Open-World Recognition: A Proposal

Authors: Gusti Ahmad Fanshuri Alfarisy, Owais Ahmed Malik, Ong Wee Hong

Abstract: Open-World Recognition (OWR) is an emerging field that makes a machine learning model competent in rejecting the unknowns, managing them, and incrementally adding novel samples to the base knowledge. However, this broad objective is not practical for an agent that works on a specific task. Not all rejected samples will be used for learning continually in the future. Some novel images in the open e… ▽ More Open-World Recognition (OWR) is an emerging field that makes a machine learning model competent in rejecting the unknowns, managing them, and incrementally adding novel samples to the base knowledge. However, this broad objective is not practical for an agent that works on a specific task. Not all rejected samples will be used for learning continually in the future. Some novel images in the open environment may not belong to the domain of interest. Hence, identifying the unknown in the domain of interest is essential for a machine learning model to learn merely the important samples. In this study, we propose an evaluation protocol for estimating a model's capability in separating unknown in-domain (ID) and unknown out-of-domain (OOD). We evaluated using three approaches with an unknown domain and demonstrated the possibility of identifying the domain of interest using the pre-trained parameters through traditional transfer learning, Automated Machine Learning (AutoML), and Nearest Class Mean (NCM) classifier with First Integer Neighbor Clustering Hierarchy (FINCH). We experimented with five different domains: garbage, food, dogs, plants, and birds. The results show that all approaches can be used as an initial baseline yielding a good accuracy. In addition, a Balanced Accuracy (BACCU) score from a pre-trained model indicates a tendency to excel in one or more domains of interest. We observed that MobileNetV3 yielded the highest BACCU score for the garbage domain and surpassed complex models such as the transformer network. Meanwhile, our results also suggest that a strong representation in the pre-trained model is important for identifying unknown classes in the same domain. This study could open the bridge toward open-world recognition in domain-specific tasks where the relevancy of the unknown classes is vital. △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2311.03707 [pdf, other]

The NeurIPS 2022 Neural MMO Challenge: A Massively Multiagent Competition with Specialization and Trade

Authors: Enhong Liu, Joseph Suarez, Chenhui You, Bo Wu, Bingcheng Chen, Jun Hu, Jiaxin Chen, Xiaolong Zhu, Clare Zhu, Julian Togelius, Sharada Mohanty, Weijun Hong, Rui Du, Yibing Zhang, Qinwen Wang, Xinhang Li, Zheng Yuan, Xiang Li, Yuejia Huang, Kun Zhang, Hanhui Yang, Shiqi Tang, Phillip Isola

Abstract: In this paper, we present the results of the NeurIPS-2022 Neural MMO Challenge, which attracted 500 participants and received over 1,600 submissions. Like the previous IJCAI-2022 Neural MMO Challenge, it involved agents from 16 populations surviving in procedurally generated worlds by collecting resources and defeating opponents. This year's competition runs on the latest v1.6 Neural MMO, which in… ▽ More In this paper, we present the results of the NeurIPS-2022 Neural MMO Challenge, which attracted 500 participants and received over 1,600 submissions. Like the previous IJCAI-2022 Neural MMO Challenge, it involved agents from 16 populations surviving in procedurally generated worlds by collecting resources and defeating opponents. This year's competition runs on the latest v1.6 Neural MMO, which introduces new equipment, combat, trading, and a better scoring system. These elements combine to pose additional robustness and generalization challenges not present in previous competitions. This paper summarizes the design and results of the challenge, explores the potential of this environment as a benchmark for learning methods, and presents some practical reinforcement learning training approaches for complex tasks with sparse rewards. Additionally, we have open-sourced our baselines, including environment wrappers, benchmarks, and visualization tools for future research. △ Less

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2311.03079 [pdf, other]

CogVLM: Visual Expert for Pretrained Language Models

Authors: Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang

Abstract: We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision… ▽ More We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM. △ Less

Submitted 4 February, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

arXiv:2310.10606 [pdf, other]

BayRnTune: Adaptive Bayesian Domain Randomization via Strategic Fine-tuning

Authors: Tianle Huang, Nitish Sontakke, K. Niranjan Kumar, Irfan Essa, Stefanos Nikolaidis, Dennis W. Hong, Sehoon Ha

Abstract: Domain randomization (DR), which entails training a policy with randomized dynamics, has proven to be a simple yet effective algorithm for reducing the gap between simulation and the real world. However, DR often requires careful tuning of randomization parameters. Methods like Bayesian Domain Randomization (Bayesian DR) and Active Domain Randomization (Adaptive DR) address this issue by automatin… ▽ More Domain randomization (DR), which entails training a policy with randomized dynamics, has proven to be a simple yet effective algorithm for reducing the gap between simulation and the real world. However, DR often requires careful tuning of randomization parameters. Methods like Bayesian Domain Randomization (Bayesian DR) and Active Domain Randomization (Adaptive DR) address this issue by automating parameter range selection using real-world experience. While effective, these algorithms often require long computation time, as a new policy is trained from scratch every iteration. In this work, we propose Adaptive Bayesian Domain Randomization via Strategic Fine-tuning (BayRnTune), which inherits the spirit of BayRn but aims to significantly accelerate the learning processes by fine-tuning from previously learned policy. This idea leads to a critical question: which previous policy should we use as a prior during fine-tuning? We investigated four different fine-tuning strategies and compared them against baseline algorithms in five simulated environments, ranging from simple benchmark tasks to more complex legged robot environments. Our analysis demonstrates that our method yields better rewards in the same amount of timesteps compared to vanilla domain randomization or Bayesian DR. △ Less

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2309.09725 [pdf, ps, other]

Neural Collapse for Unconstrained Feature Model under Cross-entropy Loss with Imbalanced Data

Authors: Wanli Hong, Shuyang Ling

Abstract: Recent years have witnessed the huge success of deep neural networks (DNNs) in various tasks of computer vision and text processing. Interestingly, these DNNs with massive number of parameters share similar structural properties on their feature representation and last-layer classifier at terminal phase of training (TPT). Specifically, if the training data are balanced (each class shares the same… ▽ More Recent years have witnessed the huge success of deep neural networks (DNNs) in various tasks of computer vision and text processing. Interestingly, these DNNs with massive number of parameters share similar structural properties on their feature representation and last-layer classifier at terminal phase of training (TPT). Specifically, if the training data are balanced (each class shares the same number of samples), it is observed that the feature vectors of samples from the same class converge to their corresponding in-class mean features and their pairwise angles are the same. This fascinating phenomenon is known as Neural Collapse (N C), first termed by Papyan, Han, and Donoho in 2019. Many recent works manage to theoretically explain this phenomenon by adopting so-called unconstrained feature model (UFM). In this paper, we study the extension of N C phenomenon to the imbalanced data under cross-entropy loss function in the context of unconstrained feature model. Our contribution is multi-fold compared with the state-of-the-art results: (a) we show that the feature vectors exhibit collapse phenomenon, i.e., the features within the same class collapse to the same mean vector; (b) the mean feature vectors no longer form an equiangular tight frame. Instead, their pairwise angles depend on the sample size; (c) we also precisely characterize the sharp threshold on which the minority collapse (the feature vectors of the minority groups collapse to one single vector) will take place; (d) finally, we argue that the effect of the imbalance in datasize diminishes as the sample size grows. Our results provide a complete picture of the N C under the cross-entropy loss for the imbalanced data. Numerical experiments confirm our theoretical analysis. △ Less

Submitted 24 October, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: 38 pages, 10 figures

arXiv:2309.03350 [pdf, other]

Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

Authors: Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, Jie Tang

Abstract: Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that \emph{the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain}. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resol… ▽ More Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that \emph{the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain}. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one for diffusion model via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning. RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM and DiT by a large margin. All the codes and checkpoints are open-sourced at \url{https://github.com/THUDM/RelayDiffusion}. △ Less

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2308.15802 [pdf, other]

Benchmarking Robustness and Generalization in Multi-Agent Systems: A Case Study on Neural MMO

Authors: Yangkun Chen, Joseph Suarez, Junjie Zhang, Chenghui Yu, Bo Wu, Hanmo Chen, Hengman Zhu, Rui Du, Shanliang Qian, Shuai Liu, Weijun Hong, Jinke He, Yibing Zhang, Liang Zhao, Clare Zhu, Julian Togelius, Sharada Mohanty, Jiaxin Chen, Xiu Li, Xiaolong Zhu, Phillip Isola

Abstract: We present the results of the second Neural MMO challenge, hosted at IJCAI 2022, which received 1600+ submissions. This competition targets robustness and generalization in multi-agent systems: participants train teams of agents to complete a multi-task objective against opponents not seen during training. The competition combines relatively complex environment design with large numbers of agents… ▽ More We present the results of the second Neural MMO challenge, hosted at IJCAI 2022, which received 1600+ submissions. This competition targets robustness and generalization in multi-agent systems: participants train teams of agents to complete a multi-task objective against opponents not seen during training. The competition combines relatively complex environment design with large numbers of agents in the environment. The top submissions demonstrate strong success on this task using mostly standard reinforcement learning (RL) methods combined with domain-specific engineering. We summarize the competition design and results and suggest that, as an academic community, competitions may be a powerful approach to solving hard problems and establishing a solid benchmark for algorithms. We will open-source our benchmark including the environment wrapper, baselines, a visualization tool, and selected policies for further research. △ Less

Submitted 30 August, 2023; originally announced August 2023.

arXiv:2308.08538 [pdf, other]

doi 10.1177/02783649241238765

Proprioceptive Learning with Soft Polyhedral Networks

Authors: Xiaobo Liu, Xudong Han, Wei Hong, Fang Wan, Chaoyang Song

Abstract: Proprioception is the "sixth sense" that detects limb postures with motor neurons. It requires a natural integration between the musculoskeletal systems and sensory receptors, which is challenging among modern robots that aim for lightweight, adaptive, and sensitive designs at a low cost. Here, we present the Soft Polyhedral Network with an embedded vision for physical interactions, capable of ada… ▽ More Proprioception is the "sixth sense" that detects limb postures with motor neurons. It requires a natural integration between the musculoskeletal systems and sensory receptors, which is challenging among modern robots that aim for lightweight, adaptive, and sensitive designs at a low cost. Here, we present the Soft Polyhedral Network with an embedded vision for physical interactions, capable of adaptive kinesthesia and viscoelastic proprioception by learning kinetic features. This design enables passive adaptations to omni-directional interactions, visually captured by a miniature high-speed motion tracking system embedded inside for proprioceptive learning. The results show that the soft network can infer real-time 6D forces and torques with accuracies of 0.25/0.24/0.35 N and 0.025/0.034/0.006 Nm in dynamic interactions. We also incorporate viscoelasticity in proprioception during static adaptation by adding a creep and relaxation modifier to refine the predicted results. The proposed soft network combines simplicity in design, omni-adaptation, and proprioceptive sensing with high accuracy, making it a versatile solution for robotics at a low cost with more than 1 million use cycles for tasks such as sensitive and competitive grasping, and touch-based geometry reconstruction. This study offers new insights into vision-based proprioception for soft robots in adaptive grasping, soft manipulation, and human-robot interaction. △ Less

Submitted 16 August, 2023; originally announced August 2023.

Comments: 20 pages, 10 figures, 2 tables, submitted to the International Journal of Robotics Research for review

arXiv:2307.09729 [pdf, other]

NTIRE 2023 Quality Assessment of Video Enhancement Challenge

Authors: Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, Yilin Li, Wei Wu, Shuming Hu, Sibin Deng, Pengxiang Xiao, Ying Chen, Kai Li, Kai Zhao, Kun Yuan, Ming Sun, Heng Cong, Hao Wang, Lingzhi Fu , et al. (47 additional authors not shown)

Abstract: This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual… ▽ More This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance. △ Less

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2306.14649 [pdf, other]

CIMulator: A Comprehensive Simulation Platform for Computing-In-Memory Circuit Macros with Low Bit-Width and Real Memory Materials

Authors: Hoang-Hiep Le, Md. Aftab Baig, Wei-Chen Hong, Cheng-Hsien Tsai, Cheng-Jui Yeh, Fu-Xiang Liang, I-Ting Huang, Wei-Tzu Tsai, Ting-Yin Cheng, Sourav De, Nan-Yow Chen, Wen-Jay Lee, Ing-Chao Lin, Da-Wei Chang, Darsen D. Lu

Abstract: This paper presents a simulation platform, namely CIMulator, for quantifying the efficacy of various synaptic devices in neuromorphic accelerators for different neural network architectures. Nonvolatile memory devices, such as resistive random-access memory, ferroelectric field-effect transistor, and volatile static random-access memory devices, can be selected as synaptic devices. A multilayer pe… ▽ More This paper presents a simulation platform, namely CIMulator, for quantifying the efficacy of various synaptic devices in neuromorphic accelerators for different neural network architectures. Nonvolatile memory devices, such as resistive random-access memory, ferroelectric field-effect transistor, and volatile static random-access memory devices, can be selected as synaptic devices. A multilayer perceptron and convolutional neural networks (CNNs), such as LeNet-5, VGG-16, and a custom CNN named C4W-1, are simulated to evaluate the effects of these synaptic devices on the training and inference outcomes. The dataset used in the simulations are MNIST, CIFAR-10, and a white blood cell dataset. By applying batch normalization and appropriate optimizers in the training phase, neuromorphic systems with very low-bit-width or binary weights could achieve high pattern recognition rates that approach software-based CNN accuracy. We also introduce spiking neural networks with RRAM-based synaptic devices for the recognition of MNIST handwritten digits. △ Less

Submitted 26 June, 2023; originally announced June 2023.

arXiv:2304.13935 [pdf, ps, other]

doi 10.1109/ICBC56567.2023.10174934

Bitcoin Double-Spending Attack Detection using Graph Neural Network

Authors: Changhoon Kang, Jongsoo Woo, James Won-Ki Hong

Abstract: Bitcoin transactions include unspent transaction outputs (UTXOs) as their inputs and generate one or more newly owned UTXOs at specified addresses. Each UTXO can only be used as an input in a transaction once, and using it in two or more different transactions is referred to as a double-spending attack. Ultimately, due to the characteristics of the Bitcoin protocol, double-spending is impossible.… ▽ More Bitcoin transactions include unspent transaction outputs (UTXOs) as their inputs and generate one or more newly owned UTXOs at specified addresses. Each UTXO can only be used as an input in a transaction once, and using it in two or more different transactions is referred to as a double-spending attack. Ultimately, due to the characteristics of the Bitcoin protocol, double-spending is impossible. However, problems may arise when a transaction is considered final even though its finality has not been fully guaranteed in order to achieve fast payment. In this paper, we propose an approach to detecting Bitcoin double-spending attacks using a graph neural network (GNN). This model predicts whether all nodes in the network contain a given payment transaction in their own memory pool (mempool) using information only obtained from some observer nodes in the network. Our experiment shows that the proposed model can detect double-spending with an accuracy of at least 0.95 when more than about 1% of the entire nodes in the network are observer nodes. △ Less

Submitted 26 April, 2023; originally announced April 2023.

Comments: 3 pages, 1 table, Accepted as poster at IEEE ICBC 2023

arXiv:2304.03593 [pdf, other]

Deep Reinforcement Learning-Based Mapless Crowd Navigation with Perceived Risk of the Moving Crowd for Mobile Robots

Authors: Hafiq Anas, Ong Wee Hong, Owais Ahmed Malik

Abstract: Current state-of-the-art crowd navigation approaches are mainly deep reinforcement learning (DRL)-based. However, DRL-based methods suffer from the issues of generalization and scalability. To overcome these challenges, we propose a method that includes a Collision Probability (CP) in the observation space to give the robot a sense of the level of danger of the moving crowd to help the robot navig… ▽ More Current state-of-the-art crowd navigation approaches are mainly deep reinforcement learning (DRL)-based. However, DRL-based methods suffer from the issues of generalization and scalability. To overcome these challenges, we propose a method that includes a Collision Probability (CP) in the observation space to give the robot a sense of the level of danger of the moving crowd to help the robot navigate safely through crowds with unseen behaviors. We studied the effects of changing the number of moving obstacles to pay attention during navigation. During training, we generated local waypoints to increase the reward density and improve the learning efficiency of the system. Our approach was developed using deep reinforcement learning (DRL) and trained using the Gazebo simulator in a non-cooperative crowd environment with obstacles moving at randomized speeds and directions. We then evaluated our model on four different crowd-behavior scenarios. The results show that our method achieved a 100% success rate in all test settings. We compared our approach with a current state-of-the-art DRL-based approach, and our approach has performed significantly better, especially in terms of social safety. Importantly, our method can navigate in different crowd behaviors and requires no fine-tuning after being trained once. We further demonstrated the crowd navigation capability of our model in real-world tests. △ Less

Submitted 23 September, 2023; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: 6 pages, 7 figures

arXiv:2304.00536 [pdf, other]

Design of a Jumping Control Framework with Heuristic Landing for Bipedal Robots

Authors: Jingwen Zhang, Junjie Shen, Yeting Liu, Dennis W. Hong

Abstract: Generating dynamic jumping motions on legged robots remains a challenging control problem as the full flight phase and large landing impact are expected. Compared to quadrupedal robots or other multi-legged robots, bipedal robots place higher requirements for the control strategy given a much smaller footprint. To solve this problem, a novel heuristic landing planner is proposed in this paper. Wit… ▽ More Generating dynamic jumping motions on legged robots remains a challenging control problem as the full flight phase and large landing impact are expected. Compared to quadrupedal robots or other multi-legged robots, bipedal robots place higher requirements for the control strategy given a much smaller footprint. To solve this problem, a novel heuristic landing planner is proposed in this paper. With the momentum feedback during the flight phase, landing locations can be updated to minimize the influence of uncertainties from tracking errors or external disturbances when landing. To the best of our knowledge, this is the first approach to take advantage of the flight phase to reduce the impact of the jump landing which is implemented in the actual robot. By integrating it with a modified kino-dynamics motion planner with centroidal momentum and a low-level controller which explores the whole-body dynamics to hierarchically handle multiple tasks, a complete and versatile jumping control framework is designed in this paper. Extensive results of simulation and hardware jumping experiments on a miniature bipedal robot with proprioceptive actuation are provided to demonstrate that the proposed framework is able to achieve human-like efficient and robust jumping tasks, including directional jump, twisting jump, step jump, and somersaults. △ Less

Submitted 2 April, 2023; originally announced April 2023.

arXiv:2303.09597 [pdf, other]

Residual Physics Learning and System Identification for Sim-to-real Transfer of Policies on Buoyancy Assisted Legged Robots

Authors: Nitish Sontakke, Hosik Chae, Sangjoon Lee, Tianle Huang, Dennis W. Hong, Sehoon Ha

Abstract: The light and soft characteristics of Buoyancy Assisted Lightweight Legged Unit (BALLU) robots have a great potential to provide intrinsically safe interactions in environments involving humans, unlike many heavy and rigid robots. However, their unique and sensitive dynamics impose challenges to obtaining robust control policies in the real world. In this work, we demonstrate robust sim-to-real tr… ▽ More The light and soft characteristics of Buoyancy Assisted Lightweight Legged Unit (BALLU) robots have a great potential to provide intrinsically safe interactions in environments involving humans, unlike many heavy and rigid robots. However, their unique and sensitive dynamics impose challenges to obtaining robust control policies in the real world. In this work, we demonstrate robust sim-to-real transfer of control policies on the BALLU robots via system identification and our novel residual physics learning method, Environment Mimic (EnvMimic). First, we model the nonlinear dynamics of the actuators by collecting hardware data and optimizing the simulation parameters. Rather than relying on standard supervised learning formulations, we utilize deep reinforcement learning to train an external force policy to match real-world trajectories, which enables us to model residual physics with greater fidelity. We analyze the improved simulation fidelity by comparing the simulation trajectories against the real-world ones. We finally demonstrate that the improved simulator allows us to learn better walking and turning policies that can be successfully deployed on the hardware of BALLU. △ Less

Submitted 16 March, 2023; originally announced March 2023.

arXiv:2211.09861 [pdf, other]

Self-Supervised Visual Representation Learning via Residual Momentum

Authors: Trung X. Pham, Axi Niu, Zhang Kang, Sultan Rizky Madjid, Ji Woo Hong, Daehyeok Kim, Joshua Tian Jin Tee, Chang D. Yoo

Abstract: Self-supervised learning (SSL) approaches have shown promising capabilities in learning the representation from unlabeled data. Amongst them, momentum-based frameworks have attracted significant attention. Despite being a great success, these momentum-based SSL frameworks suffer from a large gap in representation between the online encoder (student) and the momentum encoder (teacher), which hinder… ▽ More Self-supervised learning (SSL) approaches have shown promising capabilities in learning the representation from unlabeled data. Amongst them, momentum-based frameworks have attracted significant attention. Despite being a great success, these momentum-based SSL frameworks suffer from a large gap in representation between the online encoder (student) and the momentum encoder (teacher), which hinders performance on downstream tasks. This paper is the first to investigate and identify this invisible gap as a bottleneck that has been overlooked in the existing SSL frameworks, potentially preventing the models from learning good representation. To solve this problem, we propose "residual momentum" to directly reduce this gap to encourage the student to learn the representation as close to that of the teacher as possible, narrow the performance gap with the teacher, and significantly improve the existing SSL. Our method is straightforward, easy to implement, and can be easily plugged into other SSL frameworks. Extensive experimental results on numerous benchmark datasets and diverse network architectures have demonstrated the effectiveness of our method over the state-of-the-art contrastive learning baselines. △ Less

Submitted 21 November, 2022; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: 18 pages, 16 figures

arXiv:2211.09498 [pdf, other]

Automatic Construction of Parallel Algorithm Portfolios for Multi-objective Optimization

Authors: Xiasheng Ma, Shengcai Liu, Wenjing Hong

Abstract: It has been widely observed that there exists no universal best Multi-objective Evolutionary Algorithm (MOEA) dominating all other MOEAs on all possible Multi-objective Optimization Problems (MOPs). In this work, we advocate using the Parallel Algorithm Portfolio (PAP), which runs multiple MOEAs independently in parallel and gets the best out of them, to combine the advantages of different MOEAs.… ▽ More It has been widely observed that there exists no universal best Multi-objective Evolutionary Algorithm (MOEA) dominating all other MOEAs on all possible Multi-objective Optimization Problems (MOPs). In this work, we advocate using the Parallel Algorithm Portfolio (PAP), which runs multiple MOEAs independently in parallel and gets the best out of them, to combine the advantages of different MOEAs. Since the manual construction of PAPs is non-trivial and tedious, we propose to automatically construct high-performance PAPs for solving MOPs. Specifically, we first propose a variant of PAPs, namely MOEAs/PAP, which can better determine the output solution set for MOPs than conventional PAPs. Then, we present an automatic construction approach for MOEAs/PAP with a novel performance metric for evaluating the performance of MOEAs across multiple MOPs. Finally, we use the proposed approach to construct a MOEAs/PAP based on a training set of MOPs and an algorithm configuration space defined by several variants of NSGA-II. Experimental results show that the automatically constructed MOEAs/PAP can even rival the state-of-the-art ensemble MOEAs designed by human experts, demonstrating the huge potential of automatic construction of PAPs in multi-objective optimization. △ Less

Submitted 6 June, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

arXiv:2211.04803 [pdf]

DSCOT: An NFT-Based Blockchain Architecture for the Authentication of IoT-Enabled Smart Devices in Smart Cities

Authors: Usman Khalil, Owais Ahmed Malik, Ong Wee Hong, Mueen Uddin

Abstract: Smart city architecture brings all the underlying architectures, i.e., Internet of Things (IoT), Cyber-Physical Systems (CPSs), Internet of Cyber-Physical Things (IoCPT), and Internet of Everything (IoE), together to work as a system under its umbrella. The goal of smart city architecture is to come up with a solution that may integrate all the real-time response applications. However, the cyber-p… ▽ More Smart city architecture brings all the underlying architectures, i.e., Internet of Things (IoT), Cyber-Physical Systems (CPSs), Internet of Cyber-Physical Things (IoCPT), and Internet of Everything (IoE), together to work as a system under its umbrella. The goal of smart city architecture is to come up with a solution that may integrate all the real-time response applications. However, the cyber-physical space poses threats that can jeopardize the working of a smart city where all the data belonging to people, systems, and processes will be at risk. Various architectures based on centralized and distributed mechanisms support smart cities; however, the security concerns regarding traceability, scalability, security services, platform assistance, and resource management persist. In this paper, private blockchain-based architecture Decentralized Smart City of Things (DSCoT) is proposed. It actively utilizes fog computing for all the users and smart devices connected to a fog node in a particular management system in a smart city, i.e., a smart house or hospital, etc. Non-fungible tokens (NFTs) have been utilized for representation to define smart device attributes. NFTs in the proposed DSCoT architecture provide devices and user authentication (IoT) functionality. DSCoT has been designed to provide a smart city solution that ensures robust security features such as Confidentiality, Integrity, Availability (CIA), and authorization by defining new attributes and functions for Owner, User, Fog, and IoT devices authentication. The evaluation of the proposed functions and components in terms of Gas consumption and time complexity has shown promising results. Comparatively, the Gas consumption for minting DSCoT NFT showed approximately 27%, and a DSCoT approve() was approximately 11% more efficient than the PUF-based NFT solution. △ Less

Submitted 9 November, 2022; originally announced November 2022.

Comments: 18 pages, 15 figures, 5 tables, journal

arXiv:2210.08714 [pdf, other]

doi 10.1007/978-3-031-20059-5_11

Selective Query-guided Debiasing for Video Corpus Moment Retrieval

Authors: Sunjae Yoon, Ji Woo Hong, Eunseop Yoon, Dahyun Kim, Junyeong Kim, Hee Suk Yoon, Chang D. Yoo

Abstract: Video moment retrieval (VMR) aims to localize target moments in untrimmed videos pertinent to a given textual query. Existing retrieval systems tend to rely on retrieval bias as a shortcut and thus, fail to sufficiently learn multi-modal interactions between query and video. This retrieval bias stems from learning frequent co-occurrence patterns between query and moments, which spuriously correlat… ▽ More Video moment retrieval (VMR) aims to localize target moments in untrimmed videos pertinent to a given textual query. Existing retrieval systems tend to rely on retrieval bias as a shortcut and thus, fail to sufficiently learn multi-modal interactions between query and video. This retrieval bias stems from learning frequent co-occurrence patterns between query and moments, which spuriously correlate objects (e.g., a pencil) referred in the query with moments (e.g., scene of writing with a pencil) where the objects frequently appear in the video, such that they converge into biased moment predictions. Although recent debiasing methods have focused on removing this retrieval bias, we argue that these biased predictions sometimes should be preserved because there are many queries where biased predictions are rather helpful. To conjugate this retrieval bias, we propose a Selective Query-guided Debiasing network (SQuiDNet), which incorporates the following two main properties: (1) Biased Moment Retrieval that intentionally uncovers the biased moments inherent in objects of the query and (2) Selective Query-guided Debiasing that performs selective debiasing guided by the meaning of the query. Our experimental results on three moment retrieval benchmarks (i.e., TVR, ActivityNet, DiDeMo) show the effectiveness of SQuiDNet and qualitative analysis shows improved interpretability. △ Less

Submitted 26 November, 2022; v1 submitted 16 October, 2022; originally announced October 2022.

Comments: 16 pages, 6 figures, Accepted in ECCV 2022

Journal ref: In European Conference on Computer Vision (pp. 185-200). Springer, Cham (2022)

arXiv:2209.12713 [pdf, other]

Multi-Agent Sequential Decision-Making via Communication

Authors: Ziluo Ding, Kefan Su, Weixin Hong, Liwen Zhu, Tiejun Huang, Zongqing Lu

Abstract: Communication helps agents to obtain information about others so that better coordinated behavior can be learned. Some existing work communicates predicted future trajectory with others, hoping to get clues about what others would do for better coordination. However, circular dependencies sometimes can occur when agents are treated synchronously so it is hard to coordinate decision-making. In this… ▽ More Communication helps agents to obtain information about others so that better coordinated behavior can be learned. Some existing work communicates predicted future trajectory with others, hoping to get clues about what others would do for better coordination. However, circular dependencies sometimes can occur when agents are treated synchronously so it is hard to coordinate decision-making. In this paper, we propose a novel communication scheme, Sequential Communication (SeqComm). SeqComm treats agents asynchronously (the upper-level agents make decisions before the lower-level ones) and has two communication phases. In negotiation phase, agents determine the priority of decision-making by communicating hidden states of observations and comparing the value of intention, which is obtained by modeling the environment dynamics. In launching phase, the upper-level agents take the lead in making decisions and communicate their actions with the lower-level agents. Theoretically, we prove the policies learned by SeqComm are guaranteed to improve monotonically and converge. Empirically, we show that SeqComm outperforms existing methods in various multi-agent cooperative tasks. △ Less

Submitted 26 September, 2022; originally announced September 2022.

Comments: 20 pages

arXiv:2208.13158 [pdf, other]

Benchmark Results for Bookshelf Organization Problem as Mixed Integer Nonlinear Program with Mode Switch and Collision Avoidance

Authors: Xuan Lin, Gabriel I. Fernandez, Dennis W. Hong

Abstract: Mixed integer convex and nonlinear programs, MICP and MINLP, are expressive but require long solving times. Recent work that combines data-driven methods on solver heuristics has shown potential to overcome this issue allowing for applications on larger scale practical problems. To solve mixed-integer bilinear programs online with data-driven methods, several formulations exist including mathemati… ▽ More Mixed integer convex and nonlinear programs, MICP and MINLP, are expressive but require long solving times. Recent work that combines data-driven methods on solver heuristics has shown potential to overcome this issue allowing for applications on larger scale practical problems. To solve mixed-integer bilinear programs online with data-driven methods, several formulations exist including mathematical programming with complementary constraints (MPCC), mixed-integer programming (MIP). In this work, we benchmark the performances of those data-driven schemes on a bookshelf organization problem that has discrete mode switch and collision avoidance constraints. The success rate, optimal cost and solving time are compared along with non-data-driven methods. Our proposed methods are demonstrated as a high level planner for a robotic arm for the bookshelf problem. △ Less

Submitted 28 August, 2022; originally announced August 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2110.00666

arXiv:2205.15868 [pdf, other]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Authors: Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang

Abstract: Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Its application to video generation is still facing many challenges: The potential huge computation cost makes the training from scratch unaffordable; The scarcity and weak relevance of text-video datasets hinder the model understanding complex movement semantics. In this… ▽ More Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Its application to video generation is still facing many challenges: The potential huge computation cost makes the training from scratch unaffordable; The scarcity and weak relevance of text-video datasets hinder the model understanding complex movement semantics. In this work, we present 9B-parameter transformer CogVideo, trained by inheriting a pretrained text-to-image model, CogView2. We also propose multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models at a large margin in machine and human evaluations. △ Less

Submitted 29 May, 2022; originally announced May 2022.

arXiv:2204.14217 [pdf, other]

CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

Authors: Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang

Abstract: The development of the transformer-based text-to-image models are impeded by its slow generation and complexity for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM), and fi… ▽ More The development of the transformer-based text-to-image models are impeded by its slow generation and complexity for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared to concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing on images. △ Less

Submitted 27 May, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

arXiv:2204.13508 [pdf, other]

Design of Blockchain-based Travel Rule Compliance System

Authors: Chaehyeon Lee, Changhoon Kang, Wonseok Choi, Jehoon Lee, Myunghun Cha, Jongsoo Woo, James Won-Ki Hong

Abstract: In accordance with the guidelines of the Financial Action Task Force (FATF), Virtual Asset Service Providers (VASPs) should comply with a `travel rule', which requires them to exchange originator's and beneficiary's personal information when transferring virtual assets. In this paper, we propose a novel blockchain-based travel rule compliance system that supports fully-decentralized data exchange.… ▽ More In accordance with the guidelines of the Financial Action Task Force (FATF), Virtual Asset Service Providers (VASPs) should comply with a `travel rule', which requires them to exchange originator's and beneficiary's personal information when transferring virtual assets. In this paper, we propose a novel blockchain-based travel rule compliance system that supports fully-decentralized data exchange. The proposed system uses a permissioned blockchain, and thereby eliminates the possibility of leakage of personal information to third parties or even to travel rule service providers, and ensures that travel rule data can be managed securely. △ Less

Submitted 28 April, 2022; originally announced April 2022.

Comments: 3 pages, 1 figure, 1 table. Accepted to IEEE ICBC 2022 as a poster paper

arXiv:2204.11116 [pdf, other]

Human-Robot Shared Control for Surgical Robot Based on Context-Aware Sim-to-Real Adaptation

Authors: Dandan Zhang, Zicong Wu, Junhong Chen, Ruiqi Zhu, Adnan Munawar, Bo Xiao, Yuan Guan, Hang Su, Wuzhou Hong, Yao Guo, Gregory S. Fischer, Benny Lo, Guang-Zhong Yang

Abstract: Human-robot shared control, which integrates the advantages of both humans and robots, is an effective approach to facilitate efficient surgical operation. Learning from demonstration (LfD) techniques can be used to automate some of the surgical subtasks for the construction of the shared control framework. However, a sufficient amount of data is required for the robot to learn the manoeuvres. Usi… ▽ More Human-robot shared control, which integrates the advantages of both humans and robots, is an effective approach to facilitate efficient surgical operation. Learning from demonstration (LfD) techniques can be used to automate some of the surgical subtasks for the construction of the shared control framework. However, a sufficient amount of data is required for the robot to learn the manoeuvres. Using a surgical simulator to collect data is a less resource-demanding approach. With sim-to-real adaptation, the manoeuvres learned from a simulator can be transferred to a physical robot. To this end, we propose a sim-to-real adaptation method to construct a human-robot shared control framework for robotic surgery. In this paper, a desired trajectory is generated from a simulator using LfD method, while dynamic motion primitives (DMPs) based method is used to transfer the desired trajectory from the simulator to the physical robotic platform. Moreover, a role adaptation mechanism is developed such that the robot can adjust its role according to the surgical operation contexts predicted by a neural network model. The effectiveness of the proposed framework is validated on the da Vinci Research Kit (dVRK). Results of the user studies indicated that with the adaptive human-robot shared control framework, the path length of the remote controller, the total clutching number and the task completion time can be reduced significantly. The proposed method outperformed the traditional manual control via teleoperation. △ Less

Submitted 4 June, 2022; v1 submitted 23 April, 2022; originally announced April 2022.

Comments: Accepted by 2022ICRA

arXiv:2204.06697 [pdf, other]

HASA: Hybrid Architecture Search with Aggregation Strategy for Echinococcosis Classification and Ovary Segmentation in Ultrasound Images

Authors: Jikuan Qian, Rui Li, Xin Yang, Yuhao Huang, Mingyuan Luo, Zehui Lin, Wenhui Hong, Ruobing Huang, Haining Fan, Dong Ni, Jun Cheng

Abstract: Different from handcrafted features, deep neural networks can automatically learn task-specific features from data. Due to this data-driven nature, they have achieved remarkable success in various areas. However, manual design and selection of suitable network architectures are time-consuming and require substantial effort of human experts. To address this problem, researchers have proposed neural… ▽ More Different from handcrafted features, deep neural networks can automatically learn task-specific features from data. Due to this data-driven nature, they have achieved remarkable success in various areas. However, manual design and selection of suitable network architectures are time-consuming and require substantial effort of human experts. To address this problem, researchers have proposed neural architecture search (NAS) algorithms which can automatically generate network architectures but suffer from heavy computational cost and instability if searching from scratch. In this paper, we propose a hybrid NAS framework for ultrasound (US) image classification and segmentation. The hybrid framework consists of a pre-trained backbone and several searched cells (i.e., network building blocks), which takes advantage of the strengths of both NAS and the expert knowledge from existing convolutional neural networks. Specifically, two effective and lightweight operations, a mixed depth-wise convolution operator and a squeeze-and-excitation block, are introduced into the candidate operations to enhance the variety and capacity of the searched cells. These two operations not only decrease model parameters but also boost network performance. Moreover, we propose a re-aggregation strategy for the searched cells, aiming to further improve the performance for different vision tasks. We tested our method on two large US image datasets, including a 9-class echinococcosis dataset containing 9566 images for classification and an ovary dataset containing 3204 images for segmentation. Ablation experiments and comparison with other handcrafted or automatically searched architectures demonstrate that our method can generate more powerful and lightweight models for the above US image classification and segmentation tasks. △ Less

Submitted 20 April, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: 17 pages,11 figures. Accepted by Expert Systems and Applications, 2022

arXiv:2204.05639 [pdf, other]

Neural Network Pruning by Cooperative Coevolution

Authors: Haopu Shang, Jia-Liang Wu, Wenjing Hong, Chao Qian

Abstract: Neural network pruning is a popular model compression method which can significantly reduce the computing cost with negligible loss of accuracy. Recently, filters are often pruned directly by designing proper criteria or using auxiliary modules to measure their importance, which, however, requires expertise and trial-and-error. Due to the advantage of automation, pruning by evolutionary algorithms… ▽ More Neural network pruning is a popular model compression method which can significantly reduce the computing cost with negligible loss of accuracy. Recently, filters are often pruned directly by designing proper criteria or using auxiliary modules to measure their importance, which, however, requires expertise and trial-and-error. Due to the advantage of automation, pruning by evolutionary algorithms (EAs) has attracted much attention, but the performance is limited for deep neural networks as the search space can be quite large. In this paper, we propose a new filter pruning algorithm CCEP by cooperative coevolution, which prunes the filters in each layer by EAs separately. That is, CCEP reduces the pruning space by a divide-and-conquer strategy. The experiments show that CCEP can achieve a competitive performance with the state-of-the-art pruning methods, e.g., prune ResNet56 for $63.42\%$ FLOPs on CIFAR10 with $-0.24\%$ accuracy drop, and ResNet50 for $44.56\%$ FLOPs on ImageNet with $0.07\%$ accuracy drop. △ Less

Submitted 9 May, 2022; v1 submitted 12 April, 2022; originally announced April 2022.

arXiv:2203.16406 [pdf, other]

PerfectDou: Dominating DouDizhu with Perfect Information Distillation

Authors: Guan Yang, Minghuan Liu, Weijun Hong, Weinan Zhang, Fei Fang, Guangjun Zeng, Yue Lin

Abstract: As a challenging multi-player card game, DouDizhu has recently drawn much attention for analyzing competition and collaboration in imperfect-information games. In this paper, we propose PerfectDou, a state-of-the-art DouDizhu AI system that dominates the game, in an actor-critic framework with a proposed technique named perfect information distillation. In detail, we adopt a perfect-training-imper… ▽ More As a challenging multi-player card game, DouDizhu has recently drawn much attention for analyzing competition and collaboration in imperfect-information games. In this paper, we propose PerfectDou, a state-of-the-art DouDizhu AI system that dominates the game, in an actor-critic framework with a proposed technique named perfect information distillation. In detail, we adopt a perfect-training-imperfect-execution framework that allows the agents to utilize the global information to guide the training of the policies as if it is a perfect information game and the trained policies can be used to play the imperfect information game during the actual gameplay. To this end, we characterize card and game features for DouDizhu to represent the perfect and imperfect information. To train our system, we adopt proximal policy optimization with generalized advantage estimation in a parallel training paradigm. In experiments we show how and why PerfectDou beats all existing AI programs, and achieves state-of-the-art performance. △ Less

Submitted 27 February, 2024; v1 submitted 30 March, 2022; originally announced March 2022.

Comments: Fix minor errors. 23 pages, 8 figures, 13 tables. Published at NeurIPS 2022. The first two authors contribute equally. Project page at https://github.com/Netease-Games-AI-Lab-Guangzhou/PerfectDou

arXiv:2203.05769 [pdf, other]

DeTRM: Decentralised Trust and Reputation Management for Blockchain-based Supply Chains

Authors: Guntur Dharma Putra, Changhoon Kang, Salil S. Kanhere, James Won-Ki Hong

Abstract: Blockchain has the potential to enhance supply chain management systems by providing stronger assurance in transparency and traceability of traded commodities. However, blockchain does not overcome the inherent issues of data trust in IoT enabled supply chains. Recent proposals attempt to tackle these issues by incorporating generic trust and reputation management, which does not entirely address… ▽ More Blockchain has the potential to enhance supply chain management systems by providing stronger assurance in transparency and traceability of traded commodities. However, blockchain does not overcome the inherent issues of data trust in IoT enabled supply chains. Recent proposals attempt to tackle these issues by incorporating generic trust and reputation management, which does not entirely address the complex challenges of supply chain operations and suffers from significant drawbacks. In this paper, we propose DeTRM, a decentralised trust and reputation management solution for supply chains, which considers complex supply chain operations, such as splitting or merging of product lots, to provide a coherent trust management solution. We resolve data trust by correlating empirical data from adjacent sensor nodes, using which the authenticity of data can be assessed. We design a consortium blockchain, where smart contracts play a significant role in quantifying trustworthiness as a numerical score from different perspectives. A proof-of-concept implementation in Hyperledger Fabric shows that DeTRM is feasible and only incurs relatively small overheads compared to the baseline. △ Less

Submitted 11 March, 2022; originally announced March 2022.

Comments: 9 pages, 8 figures. Accepted to IEEE ICBC 2022 as a short paper

arXiv:2202.10583 [pdf, other]

MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

Authors: Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang, Weijun Hong, Zhongyue Huang, Haicheng Chen, Guangjun Zeng, Yue Lin, Vincent Micheli, Eloi Alonso, François Fleuret, Alexander Nikulin, Yury Belousov, Oleg Svidchenko, Aleksei Shpilman

Abstract: Reinforcement learning competitions advance the field by providing appropriate scope and support to develop solutions toward a specific problem. To promote the development of more broadly applicable methods, organizers need to enforce the use of general techniques, the use of sample-efficient methods, and the reproducibility of the results. While beneficial for the research community, these restri… ▽ More Reinforcement learning competitions advance the field by providing appropriate scope and support to develop solutions toward a specific problem. To promote the development of more broadly applicable methods, organizers need to enforce the use of general techniques, the use of sample-efficient methods, and the reproducibility of the results. While beneficial for the research community, these restrictions come at a cost -- increased difficulty. If the barrier for entry is too high, many potential participants are demoralized. With this in mind, we hosted the third edition of the MineRL ObtainDiamond competition, MineRL Diamond 2021, with a separate track in which we permitted any solution to promote the participation of newcomers. With this track and more extensive tutorials and support, we saw an increased number of submissions. The participants of this easier track were able to obtain a diamond, and the participants of the harder track progressed the generalizable solutions in the same task. △ Less

Submitted 17 February, 2022; originally announced February 2022.

Comments: Under review for PMLR volume on NeurIPS 2021 competitions

arXiv:2202.04243 [pdf, other]

Motion-Aware Transformer For Occluded Person Re-identification

Authors: Mi Zhou, Hongye Liu, Zhekun Lv, Wei Hong, Xiai Chen

Abstract: Recently, occluded person re-identification(Re-ID) remains a challenging task that people are frequently obscured by other people or obstacles, especially in a crowd massing situation. In this paper, we propose a self-supervised deep learning method to improve the location performance for human parts through occluded person Re-ID. Unlike previous works, we find that motion information derived from… ▽ More Recently, occluded person re-identification(Re-ID) remains a challenging task that people are frequently obscured by other people or obstacles, especially in a crowd massing situation. In this paper, we propose a self-supervised deep learning method to improve the location performance for human parts through occluded person Re-ID. Unlike previous works, we find that motion information derived from the photos of various human postures can help identify major human body components. Firstly, a motion-aware transformer encoder-decoder architecture is designed to obtain keypoints heatmaps and part-segmentation maps. Secondly, an affine transformation module is utilized to acquire motion information from the keypoint detection branch. Then the motion information will support the segmentation branch to achieve refined human part segmentation maps, and effectively divide the human body into reasonable groups. Finally, several cases demonstrate the efficiency of the proposed model in distinguishing different representative parts of the human body, which can avoid the background and occlusion disturbs. Our method consistently achieves state-of-the-art results on several popular datasets, including occluded, partial, and holistic. △ Less

Submitted 10 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

Comments: 10 pages, 3 figures

arXiv:2201.11685 [pdf, other]

Generative Adversarial Exploration for Reinforcement Learning

Authors: Weijun Hong, Menghui Zhu, Minghuan Liu, Weinan Zhang, Ming Zhou, Yong Yu, Peng Sun

Abstract: Exploration is crucial for training the optimal reinforcement learning (RL) policy, where the key is to discriminate whether a state visiting is novel. Most previous work focuses on designing heuristic rules or distance metrics to check whether a state is novel without considering such a discrimination process that can be learned. In this paper, we propose a novel method called generative adversar… ▽ More Exploration is crucial for training the optimal reinforcement learning (RL) policy, where the key is to discriminate whether a state visiting is novel. Most previous work focuses on designing heuristic rules or distance metrics to check whether a state is novel without considering such a discrimination process that can be learned. In this paper, we propose a novel method called generative adversarial exploration (GAEX) to encourage exploration in RL via introducing an intrinsic reward output from a generative adversarial network, where the generator provides fake samples of states that help discriminator identify those less frequently visited states. Thus the agent is encouraged to visit those states which the discriminator is less confident to judge as visited. GAEX is easy to implement and of high training efficiency. In our experiments, we apply GAEX into DQN and the DQN-GAEX algorithm achieves convincing performance on challenging exploration problems, including the game Venture, Montezuma's Revenge and Super Mario Bros, without further fine-tuning on complicate learning algorithms. To our knowledge, this is the first work to employ GAN in RL exploration problems. △ Less

Submitted 27 January, 2022; originally announced January 2022.

arXiv:2201.11679 [pdf, other]

DropNAS: Grouped Operation Dropout for Differentiable Architecture Search

Authors: Weijun Hong, Guilin Li, Weinan Zhang, Ruiming Tang, Yunhe Wang, Zhenguo Li, Yong Yu

Abstract: Neural architecture search (NAS) has shown encouraging results in automating the architecture design. Recently, DARTS relaxes the search process with a differentiable formulation that leverages weight-sharing and SGD where all candidate operations are trained simultaneously. Our empirical results show that such procedure results in the co-adaption problem and Matthew Effect: operations with fewer… ▽ More Neural architecture search (NAS) has shown encouraging results in automating the architecture design. Recently, DARTS relaxes the search process with a differentiable formulation that leverages weight-sharing and SGD where all candidate operations are trained simultaneously. Our empirical results show that such procedure results in the co-adaption problem and Matthew Effect: operations with fewer parameters would be trained maturely earlier. This causes two problems: firstly, the operations with more parameters may never have the chance to express the desired function since those with less have already done the job; secondly, the system will punish those underperforming operations by lowering their architecture parameter, and they will get smaller loss gradients, which causes the Matthew Effect. In this paper, we systematically study these problems and propose a novel grouped operation dropout algorithm named DropNAS to fix the problems with DARTS. Extensive experiments demonstrate that DropNAS solves the above issues and achieves promising performance. Specifically, DropNAS achieves 2.26% test error on CIFAR-10, 16.39% on CIFAR-100 and 23.4% on ImageNet (with the same training hyperparameters as DARTS for a fair comparison). It is also observed that DropNAS is robust across variants of the DARTS search space. Code is available at https://github.com/wiljohnhong/DropNAS. △ Less

Submitted 27 January, 2022; originally announced January 2022.

arXiv:2111.01528 [pdf, other]

Effective and Imperceptible Adversarial Textual Attack via Multi-objectivization

Authors: Shengcai Liu, Ning Lu, Wenjing Hong, Chao Qian, Ke Tang

Abstract: The field of adversarial textual attack has significantly grown over the last few years, where the commonly considered objective is to craft adversarial examples (AEs) that can successfully fool the target model. However, the imperceptibility of attacks, which is also essential for practical attackers, is often left out by previous studies. In consequence, the crafted AEs tend to have obvious stru… ▽ More The field of adversarial textual attack has significantly grown over the last few years, where the commonly considered objective is to craft adversarial examples (AEs) that can successfully fool the target model. However, the imperceptibility of attacks, which is also essential for practical attackers, is often left out by previous studies. In consequence, the crafted AEs tend to have obvious structural and semantic differences from the original human-written text, making them easily perceptible. In this work, we advocate leveraging multi-objectivization to address such issue. Specifically, we reformulate the problem of crafting AEs as a multi-objective optimization problem, where the attack imperceptibility is considered as an auxiliary objective. Then, we propose a simple yet effective evolutionary algorithm, dubbed HydraText, to solve this problem. To the best of our knowledge, HydraText is currently the only approach that can be effectively applied to both score-based and decision-based attack settings. Exhaustive experiments involving 44237 instances demonstrate that HydraText consistently achieves competitive attack success rates and better attack imperceptibility than the recently proposed attack approaches. A human evaluation study also shows that the AEs crafted by HydraText are more indistinguishable from human-written text. Finally, these AEs exhibit good transferability and can bring notable robustness improvement to the target model by adversarial training. △ Less

Submitted 14 December, 2023; v1 submitted 2 November, 2021; originally announced November 2021.

arXiv:2110.00666 [pdf, other]

ReDUCE: Reformulation of Mixed Integer Programs using Data from Unsupervised Clusters for Learning Efficient Strategies

Authors: Xuan Lin, Gabriel I. Fernandez, Dennis W. Hong

Abstract: Mixed integer convex and nonlinear programs, MICP and MINLP, are expressive but require long solving times. Recent work that combines learning methods on solver heuristics has shown potential to overcome this issue allowing for applications on larger scale practical problems. Gathering sufficient training data to employ these methods still present a challenge since getting data from traditional so… ▽ More Mixed integer convex and nonlinear programs, MICP and MINLP, are expressive but require long solving times. Recent work that combines learning methods on solver heuristics has shown potential to overcome this issue allowing for applications on larger scale practical problems. Gathering sufficient training data to employ these methods still present a challenge since getting data from traditional solvers are slow and newer learning approaches still require large amounts of data. In order to scale up and make these hybrid learning approaches more manageable we propose ReDUCE, a method that exploits structure within small to medium size datasets. We also introduce the bookshelf organization problem as an MINLP as a way to measure performance of solvers with ReDUCE. Results show that existing algorithms with ReDUCE can solve this problem within a few seconds, a significant improvement over the original formulation. ReDUCE is demonstrated as a high level planner for a robotic arm for the bookshelf problem. △ Less

Submitted 1 October, 2021; originally announced October 2021.

arXiv:2106.07127 [pdf, other]

Transition Motion Planning for Multi-Limbed Vertical Climbing Robots Using Complementarity Constraints

Authors: Jingwen Zhang, Xuan Lin, Dennis W Hong

Abstract: In order to achieve autonomous vertical wall climbing, the transition phase from the ground to the wall requires extra consideration inevitably. This paper focuses on the contact sequence planner to transition between flat terrain and vertical surfaces for multi-limbed climbing robots. To overcome the transition phase, it requires planning both multi-contact and contact wrenches simultaneously whi… ▽ More In order to achieve autonomous vertical wall climbing, the transition phase from the ground to the wall requires extra consideration inevitably. This paper focuses on the contact sequence planner to transition between flat terrain and vertical surfaces for multi-limbed climbing robots. To overcome the transition phase, it requires planning both multi-contact and contact wrenches simultaneously which makes it difficult. Instead of using a predetermined contact sequence, we consider various motions on different environment setups via modeling contact constraints and limb switchability as complementarity conditions. Two safety factors for toe sliding and motor over-torque are the main tuning parameters for different contact sequences. By solving as a nonlinear program (NLP), we can generate several feasible sequences of foot placements and contact forces to avoid failure cases. We verified feasibility with demonstrations on the hardware SiLVIA, a six-legged robot capable of vertically climbing between two walls by bracing itself in-between using only friction. △ Less

Submitted 13 June, 2021; originally announced June 2021.

Comments: ICRA 2021 Accepted. Optimization, Climbing motion planning, Complementarity Constraints

arXiv:2105.13290 [pdf, other]

CogView: Mastering Text-to-Image Generation via Transformers

Authors: Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang

Abstract: Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fash… ▽ More Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E. △ Less

Submitted 5 November, 2021; v1 submitted 26 May, 2021; originally announced May 2021.

Comments: to appear in NeurIPS 2021

arXiv:2012.12453 [pdf, other]

CholecSeg8k: A Semantic Segmentation Dataset for Laparoscopic Cholecystectomy Based on Cholec80

Authors: W. -Y. Hong, C. -L. Kao, Y. -H. Kuo, J. -R. Wang, W. -L. Chang, C. -S. Shih

Abstract: Computer-assisted surgery has been developed to enhance surgery correctness and safety. However, researchers and engineers suffer from limited annotated data to develop and train better algorithms. Consequently, the development of fundamental algorithms such as Simultaneous Localization and Mapping (SLAM) is limited. This article elaborates on the efforts of preparing the dataset for semantic segm… ▽ More Computer-assisted surgery has been developed to enhance surgery correctness and safety. However, researchers and engineers suffer from limited annotated data to develop and train better algorithms. Consequently, the development of fundamental algorithms such as Simultaneous Localization and Mapping (SLAM) is limited. This article elaborates on the efforts of preparing the dataset for semantic segmentation, which is the foundation of many computer-assisted surgery mechanisms. Based on the Cholec80 dataset [3], we extracted 8,080 laparoscopic cholecystectomy image frames from 17 video clips in Cholec80 and annotated the images. The dataset is named CholecSeg8K and its total size is 3GB. Each of these images is annotated at pixel-level for thirteen classes, which are commonly founded in laparoscopic cholecystectomy surgery. CholecSeg8k is released under the license CC BY- NC-SA 4.0. △ Less

Submitted 22 December, 2020; originally announced December 2020.

Comments: 6 pages

arXiv:2011.09967 [pdf, other]

Electric Vehicle Charging Infrastructure Planning: A Scalable Computational Framework

Authors: Wanshi Hong, Cong Zhang, Cy Chan, Bin Wang

Abstract: The optimal charging infrastructure planning problem over a large geospatial area is challenging due to the increasing network sizes of the transportation system and the electric grid. The coupling between the electric vehicle travel behaviors and charging events is therefore complex. This paper focuses on the demonstration of a scalable computational framework for the electric vehicle charging in… ▽ More The optimal charging infrastructure planning problem over a large geospatial area is challenging due to the increasing network sizes of the transportation system and the electric grid. The coupling between the electric vehicle travel behaviors and charging events is therefore complex. This paper focuses on the demonstration of a scalable computational framework for the electric vehicle charging infrastructure planning over the tightly integrated transportation and electric grid networks. On the transportation side, a charging profile generation strategy is proposed leveraging the EV energy consumption model, trip routing, and charger selection methods. On the grid side, a genetic algorithm is utilized within the optimal power flow program to solve the optimal charger placement problem with integer variables by adaptively evaluating candidate solutions in the current iteration and generating new solutions for the next iterations. △ Less

Submitted 17 November, 2020; originally announced November 2020.

arXiv:2009.00577 [pdf, ps, other]

A Short Review on Data Modelling for Vector Fields

Authors: Jun Li, Wanrong Hong, Yusheng Xiang

Abstract: Machine learning methods based on statistical principles have proven highly successful in dealing with a wide variety of data analysis and analytics tasks. Traditional data models are mostly concerned with independent identically distributed data. The recent success of end-to-end modelling scheme using deep neural networks equipped with effective structures such as convolutional layers or skip con… ▽ More Machine learning methods based on statistical principles have proven highly successful in dealing with a wide variety of data analysis and analytics tasks. Traditional data models are mostly concerned with independent identically distributed data. The recent success of end-to-end modelling scheme using deep neural networks equipped with effective structures such as convolutional layers or skip connections allows the extension to more sophisticated and structured practical data, such as natural language, images, videos, etc. On the application side, vector fields are an extremely useful type of data in empirical sciences, as well as signal processing, e.g. non-parametric transformations of 3D point clouds using 3D vector fields, the modelling of the fluid flow in earth science, and the modelling of physical fields. This review article is dedicated to recent computational tools of vector fields, including vector data representations, predictive model of spatial data, as well as applications in computer vision, signal processing, and empirical sciences. △ Less

Submitted 1 September, 2020; originally announced September 2020.

Comments: 18 pages, 0 figures

arXiv:2008.03973 [pdf, other]

Deep Reinforcement Learning with Label Embedding Reward for Supervised Image Hashing

Authors: Zhenzhen Wang, Weixiang Hong, Junsong Yuan

Abstract: Deep hashing has shown promising results in image retrieval and recognition. Despite its success, most existing deep hashing approaches are rather similar: either multi-layer perceptron or CNN is applied to extract image feature, followed by different binarization activation functions such as sigmoid, tanh or autoencoder to generate binary code. In this work, we introduce a novel decision-making a… ▽ More Deep hashing has shown promising results in image retrieval and recognition. Despite its success, most existing deep hashing approaches are rather similar: either multi-layer perceptron or CNN is applied to extract image feature, followed by different binarization activation functions such as sigmoid, tanh or autoencoder to generate binary code. In this work, we introduce a novel decision-making approach for deep supervised hashing. We formulate the hashing problem as travelling across the vertices in the binary code space, and learn a deep Q-network with a novel label embedding reward defined by Bose-Chaudhuri-Hocquenghem (BCH) codes to explore the best path. Extensive experiments and analysis on the CIFAR-10 and NUS-WIDE dataset show that our approach outperforms state-of-the-art supervised hashing methods under various code lengths. △ Less

Submitted 10 August, 2020; originally announced August 2020.

arXiv:2002.09820 [pdf, other]

Deep Reinforcement Learning with Linear Quadratic Regulator Regions

Authors: Gabriel I. Fernandez, Colin Togashi, Dennis W. Hong, Lin F. Yang

Abstract: Practitioners often rely on compute-intensive domain randomization to ensure reinforcement learning policies trained in simulation can robustly transfer to the real world. Due to unmodeled nonlinearities in the real system, however, even such simulated policies can still fail to perform stably enough to acquire experience in real environments. In this paper we propose a novel method that guarantee… ▽ More Practitioners often rely on compute-intensive domain randomization to ensure reinforcement learning policies trained in simulation can robustly transfer to the real world. Due to unmodeled nonlinearities in the real system, however, even such simulated policies can still fail to perform stably enough to acquire experience in real environments. In this paper we propose a novel method that guarantees a stable region of attraction for the output of a policy trained in simulation, even for highly nonlinear systems. Our core technique is to use "bias-shifted" neural networks for constructing the controller and training the network in the simulator. The modified neural networks not only capture the nonlinearities of the system but also provably preserve linearity in a certain region of the state space and thus can be tuned to resemble a linear quadratic regulator that is known to be stable for the real system. We have tested our new method by transferring simulated policies for a swing-up inverted pendulum to real systems and demonstrated its efficacy. △ Less

Submitted 25 February, 2020; v1 submitted 22 February, 2020; originally announced February 2020.

Showing 1–50 of 59 results for author: Hong, W