-
Semantic Communications with Explicit Semantic Bases: Model, Architecture, and Open Problems
Authors:
Fengyu Wang,
Yuan Zheng,
Wenjun Xu,
Junxiao Liang,
Ping Zhang
Abstract:
The increasing demands for massive data transmission pose great challenges to communication systems. Compared to traditional communication systems that focus on the accurate reconstruction of bit sequences, semantic communications (SemComs), which aim to successfully deliver information connotation, have been regarded as the key technology for next-generation communication systems. Most current SemCom systems focus on an end-to-end (E2E) trained neural network (NN) for semantic extraction and interpretation, regarding the parameters of the NN as the implicit synchronized background knowledge. However, the implicit knowledge base (KB)-based architectures lack interpretability and flexibility, which limits the performance of SemComs. In this article, we propose a SemCom architecture that employs explicit semantic bases (Sebs), which serve as the basic units to describe semantic information. First, the mathematical model of Sebs is proposed to build an explicit KB. Then, the Seb-based SemCom architecture is proposed, consisting of a communication mode and a KB update mode to enable the evolution of communication systems. Specifically, the Sem-codec and channel codec modules are purposely designed, with the assistance of the explicit KB, for efficient and robust transmission of semantics. Moreover, unequal error protection is strategically implemented, considering the intent of communications and the importance of Sebs, thereby ensuring reliability of critical semantics. To assess the effectiveness of the proposed Seb-based SemCom architecture, a case study focusing on an image transmission task is conducted. Simulations show that the proposed Seb-based SemComs outperform state-of-the-art works in LPIPS by over 20% under varying communication intents, with more robust performance under fluctuating channel conditions, indicating the flexible and robust transmission of the proposed Seb-based SemComs.
Submitted 10 August, 2024;
originally announced August 2024.
-
Gait Patterns as Biomarkers: A Video-Based Approach for Classifying Scoliosis
Authors:
Zirui Zhou,
Junhao Liang,
Zizhao Peng,
Chao Fan,
Fengwei An,
Shiqi Yu
Abstract:
Scoliosis presents significant diagnostic challenges, particularly in adolescents, where early detection is crucial for effective treatment. Traditional diagnostic and follow-up methods, which rely on physical examinations and radiography, face limitations due to the need for clinical expertise and the risk of radiation exposure, thus restricting their use for widespread early screening. In response, we introduce a novel video-based, non-invasive method for scoliosis classification using gait analysis, effectively circumventing these limitations. This study presents Scoliosis1K, the first large-scale dataset specifically designed for video-based scoliosis classification, encompassing over one thousand adolescents. Leveraging this dataset, we developed ScoNet, an initial model that faced challenges in handling the complexities of real-world data. This led to the development of ScoNet-MT, an enhanced model incorporating multi-task learning, which demonstrates promising diagnostic accuracy for practical applications. Our findings demonstrate that gait can serve as a non-invasive biomarker for scoliosis, revolutionizing screening practices through deep learning and setting a precedent for non-invasive diagnostic methodologies. The dataset and code are publicly available at https://zhouzi180.github.io/Scoliosis1K/.
Submitted 23 August, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano
Authors:
Huan Zhang,
Jinhua Liang,
Simon Dixon
Abstract:
Our study investigates an approach for understanding musical performances through the lens of audio encoding models, focusing on the domain of solo Western classical piano music. Compared to composition-level attribute understanding such as key or genre, we identify a knowledge gap in performance-level music understanding, and address three critical tasks: expertise ranking, difficulty estimation, and piano technique detection, introducing a comprehensive Pianism-Labelling Dataset (PLD) for this purpose. We leverage pre-trained audio encoders, specifically Jukebox, Audio-MAE, MERT, and DAC, demonstrating varied capabilities in tackling downstream tasks, to explore whether domain-specific fine-tuning enhances capability in capturing performance nuances. Our best approach achieved 93.6% accuracy in expertise ranking, 33.7% in difficulty estimation, and 46.7% in technique detection, with Audio-MAE as the overall most effective encoder. Finally, we conducted a case study on Chopin Piano Competition data using trained models for expertise ranking, which highlights the challenge of accurately assessing top-tier performances.
Submitted 19 July, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
MVGT: A Multi-view Graph Transformer Based on Spatial Relations for EEG Emotion Recognition
Authors:
Yanjie Cui,
Xiaohong Liu,
Jing Liang,
Yamin Fu
Abstract:
Electroencephalography (EEG), a medical imaging technique that captures scalp electrical activity of brain structures via electrodes, has been widely used in affective computing. The spatial domain of EEG is rich in affective information. However, few existing studies have simultaneously analyzed EEG signals from the multiple perspectives of geometric and anatomical structures in the spatial domain. In this paper, we propose a multi-view Graph Transformer (MVGT) based on spatial relations, which integrates information from the temporal, frequency and spatial domains, including geometric and anatomical structures, so as to comprehensively enhance the expressive power of the model. We incorporate the spatial information of EEG channels into the model as encoding, thereby improving its ability to perceive the spatial structure of the channels. Experimental results on publicly available datasets demonstrate that our proposed model outperforms recent state-of-the-art methods. In addition, the results show that the MVGT can effectively extract information from multiple domains and capture inter-channel relationships in EEG emotion recognition tasks.
Submitted 6 August, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
EmT: A Novel Transformer for Generalized Cross-subject EEG Emotion Recognition
Authors:
Yi Ding,
Chengxuan Tong,
Shuailei Zhang,
Muyun Jiang,
Yong Li,
Kevin Lim Jun Liang,
Cuntai Guan
Abstract:
Integrating prior knowledge of neurophysiology into neural network architecture enhances the performance of emotion decoding. While numerous techniques emphasize learning spatial and short-term temporal patterns, there has been limited emphasis on capturing the vital long-term contextual information associated with emotional cognitive processes. In order to address this discrepancy, we introduce a novel transformer model called emotion transformer (EmT). EmT is designed to excel in both generalized cross-subject EEG emotion classification and regression tasks. In EmT, EEG signals are transformed into a temporal graph format, creating a sequence of EEG feature graphs using a temporal graph construction module (TGC). A novel residual multi-view pyramid GCN module (RMPG) is then proposed to learn dynamic graph representations for each EEG feature graph within the series, and the learned representations of each graph are fused into one token. Furthermore, we design a temporal contextual transformer module (TCT) with two types of token mixers to learn the temporal contextual information. Finally, the task-specific output module (TSO) generates the desired outputs. Experiments on four publicly available datasets show that EmT achieves higher results than the baseline methods for both EEG emotion classification and regression tasks. The code is available at https://github.com/yi-ding-cs/EmT.
Submitted 26 June, 2024;
originally announced June 2024.
-
DExter: Learning and Controlling Performance Expression with Diffusion Models
Authors:
Huan Zhang,
Shreyan Chowdhury,
Carlos Eduardo Cancino-Chacón,
Jinhua Liang,
Simon Dixon,
Gerhard Widmer
Abstract:
In the pursuit of developing expressive music performance models using artificial intelligence, this paper introduces DExter, a new approach leveraging diffusion probabilistic models to render Western classical piano performances. In this approach, performance parameters are represented in a continuous expression space and a diffusion model is trained to predict these continuous parameters while being conditioned on the musical score. Furthermore, DExter also enables the generation of interpretations (expressive variations of a performance) guided by perceptually meaningful features by conditioning jointly on score and perceptual feature representations. Consequently, we find that our model is useful for learning expressive performance, generating perceptually steered performances, and transferring performance styles. We assess the model through quantitative and qualitative analyses, focusing on specific performance metrics regarding dimensions like asynchrony and articulation, as well as through listening tests comparing generated performances with different human interpretations. Results show that DExter is able to capture the time-varying correlation of the expressive parameters, and compares well to existing rendering models in subjectively evaluated ratings. The perceptual-feature-conditioned generation and transferring capabilities of DExter are verified by a proxy model predicting perceptual characteristics of differently steered performances.
Submitted 20 June, 2024;
originally announced June 2024.
-
NTIRE 2024 Restore Any Image Model (RAIM) in the Wild Challenge
Authors:
Jie Liang,
Radu Timofte,
Qiaosi Yi,
Shuaizheng Liu,
Lingchen Sun,
Rongyuan Wu,
Xindong Zhang,
Hui Zeng,
Lei Zhang
Abstract:
In this paper, we review the NTIRE 2024 challenge on Restore Any Image Model (RAIM) in the Wild. The RAIM challenge constructed a benchmark for image restoration in the wild, including real-world images with/without reference ground truth in various scenarios from real applications. The participants were required to restore the real-captured images from complex and unknown degradation, where generative perceptual quality and fidelity are desired in the restoration result. The challenge consisted of two tasks. Task one employed real referenced data pairs, where quantitative evaluation is available. Task two used unpaired images, and a comprehensive user study was conducted. The challenge attracted more than 200 registrations, of which 39 submitted results, with more than 400 submissions. Top-ranked methods improved the state-of-the-art restoration performance and obtained unanimous recognition from all 18 judges. The proposed datasets are available at https://drive.google.com/file/d/1DqbxUoiUqkAIkExu3jZAqoElr_nu1IXb/view?usp=sharing and the homepage of this challenge is at https://codalab.lisn.upsaclay.fr/competitions/17632.
Submitted 16 May, 2024;
originally announced May 2024.
-
Design and Implementation of Energy-Efficient Wireless Tire Sensing System with Delay Analysis for Intelligent Vehicles
Authors:
Shashank Mishra,
Jia-Ming Liang
Abstract:
The growing prevalence of Internet of Things (IoT) technologies has led to a rise in the popularity of intelligent vehicles that incorporate a range of sensors to monitor various aspects, such as driving speed, fuel usage, proximity distance and tire anomalies. Nowadays, real-time tire sensing systems play important roles for intelligent vehicles in increasing mileage, reducing fuel consumption, improving driving safety, and reducing the potential for traffic accidents. However, current tire sensing systems drain a significant amount of the vehicle's energy and lack effective collection of sensing data, so they may not guarantee the immediacy required for driving safety. Thus, this paper designs an energy-efficient wireless tire sensing system (WTSS), which leverages energy-saving techniques to significantly reduce power consumption while ensuring data retrieval delays during real-time monitoring. Additionally, we mathematically analyze the worst-case transmission delay of the system, based on the collision probabilities of sensor transmissions, to ensure immediacy. The system has been implemented and verified through simulation and field-trial experiments. The results show that the proposed scheme provides enhanced performance in energy efficiency and accurately identifies the worst-case transmission delay.
Submitted 27 May, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
Enhancing High-Speed Cruising Performance of Autonomous Vehicles through Integrated Deep Reinforcement Learning Framework
Authors:
Jinhao Liang,
Kaidi Yang,
Chaopeng Tan,
Jinxiang Wang,
Guodong Yin
Abstract:
High-speed cruising scenarios with mixed traffic greatly challenge the road safety of autonomous vehicles (AVs). Unlike existing works that only look at fundamental modules in isolation, this work enhances AV safety in mixed-traffic high-speed cruising scenarios by proposing an integrated framework that synthesizes three fundamental modules, i.e., behavioral decision-making, path-planning, and motion-control modules. Considering that the integrated framework would increase the system complexity, a bootstrapped deep Q-Network (DQN) is employed to enhance the deep exploration of the reinforcement learning method and achieve adaptive decision making of AVs. Moreover, to make AV behavior understandable to surrounding human-driven vehicles (HDVs) and prevent unexpected operations caused by misinterpretations, we derive an inverse reinforcement learning (IRL) approach to learn the reward function of skilled drivers for the path planning of lane-changing maneuvers. Such a design enables AVs to achieve a human-like tradeoff between multi-performance requirements. Simulations demonstrate that the proposed integrated framework can guide AVs to take safe actions while guaranteeing high-speed cruising performance.
Submitted 22 April, 2024;
originally announced April 2024.
-
Feedback Stability Under Mixed Gain and Phase Uncertainty
Authors:
Jiajin Liang,
Di Zhao,
Li Qiu
Abstract:
In this study, we investigate the robust feedback stability problem for multiple-input-multiple-output linear time-invariant systems involving sectored-disk uncertainty, namely, dynamic uncertainty subject to simultaneous gain and phase constraints. This problem is thereby called a sectored-disk problem. Employing a frequency-wise analysis approach, we derive a fundamental static matrix problem that serves as a key component in addressing the feedback stability. The study of this matrix problem heavily relies on the Davis-Wielandt (DW) shells of matrices, providing a profound insight into matrices subjected to simultaneous gain and phase constraints. This understanding is pivotal for establishing a less conservative sufficient condition for the matrix sectored-disk problem, from which we formulate several robust feedback stability conditions against sectored-disk uncertainty. Finally, several conditions based on linear matrix inequalities are developed for efficient computation and verification of feedback robust stability against sectored-disk uncertainty.
Submitted 8 April, 2024;
originally announced April 2024.
-
Ground-to-UAV sub-Terahertz channel measurement and modeling
Authors:
Da Li,
Peian Li,
Jiabiao Zhao,
Jianjian Liang,
Jiacheng Liu,
Guohao Liu,
Yuanshuai Lei,
Wenbo Liu,
Jianqin Deng,
Fuyong Liu,
Jianjun Ma
Abstract:
Unmanned Aerial Vehicle (UAV) assisted terahertz (THz) wireless communications are expected to play a vital role in the next generation of wireless networks. UAVs can serve as either repeaters or data collectors within the communication link, thereby potentially augmenting the efficacy of communication systems. Despite their promise, the channel analysis and modeling specific to THz wireless channels leveraging UAVs remain underexplored. This work delves into a ground-to-UAV channel at 140 GHz, with a specific focus on the influence of UAV hovering behavior on channel performance. Employing experimental measurements through an unmodulated channel setup and a geometry-based stochastic model (GBSM) that integrates three-dimensional positional coordinates and beamwidth, this work evaluates the impact of UAV dynamic movements and antenna orientation on channel performance. Our findings highlight the minimal impact of UAV orientation adjustments on channel performance and underscore the diminishing necessity for precise alignment between UAVs and ground stations as beamwidth increases.
Submitted 30 July, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection
Authors:
Jinhua Liang,
Ines Nolasco,
Burooj Ghani,
Huy Phan,
Emmanouil Benetos,
Dan Stowell
Abstract:
Detecting the presence of animal vocalisations in nature is essential to study animal populations and their behaviors. A recent development in the field is the introduction of the task known as few-shot bioacoustic sound event detection, which aims to train a versatile animal sound detector using only a small set of audio samples. Previous efforts in this area have utilized different architectures and data augmentation techniques to enhance model performance. However, these approaches have not fully bridged the domain gap between source and target distributions, limiting their applicability in real-world scenarios. In this work, we introduce a new dataset designed to augment the diversity and breadth of classes available for few-shot bioacoustic event detection, building on the foundations of our previous datasets. To establish a robust baseline system tailored for the DCASE 2024 Task 5 challenge, we delve into an array of acoustic features and adopt negative hard sampling as our primary domain adaptation strategy. This approach, chosen in alignment with the challenge's guidelines that necessitate the independent treatment of each audio file, sidesteps the use of transductive learning to ensure compliance while aiming to enhance the system's adaptability to domain shifts. Our experiments show that the proposed baseline system achieves better performance than the vanilla prototypical network. The findings also confirm the effectiveness of each domain adaptation method by ablating different components within the networks. This highlights the potential to improve few-shot bioacoustic sound event detection by further reducing the impact of domain shift.
Submitted 27 March, 2024;
originally announced March 2024.
-
Bi-Level Control of Weaving Sections in Mixed Traffic Environments with Connected and Automated Vehicles
Authors:
Longhao Yan,
Jinhao Liang,
Kaidi Yang
Abstract:
Connected and automated vehicles (CAVs) can be beneficial for improving the operation of highway bottlenecks such as weaving sections. This paper proposes a bi-level control approach based on an upper-level deep reinforcement learning controller and a lower-level model predictive controller to coordinate the lane changes of a mixed fleet of CAVs and human-driven vehicles (HVs) in weaving sections. The upper level represents a roadside controller that collects vehicular information from the entire weaving section and determines the control weights used in the lower-level controller. The lower level is implemented within each CAV, which takes the control weights from the upper-level controller and generates the acceleration and steering angle for individual CAVs based on the local situation. The lower-level controller further incorporates an HV trajectory predictor, which is capable of handling the dynamic topology of vehicles in weaving scenarios with intensive mandatory lane changes. A case study inspired by a real weaving section in Basel, Switzerland, shows that our method consistently outperforms state-of-the-art benchmarks.
Submitted 24 March, 2024;
originally announced March 2024.
-
WavCraft: Audio Editing and Generation with Large Language Models
Authors:
Jinhua Liang,
Huan Zhang,
Haohe Liu,
Yin Cao,
Qiuqiang Kong,
Xubo Liu,
Wenwu Wang,
Mark D. Plumbley,
Huy Phan,
Emmanouil Benetos
Abstract:
We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw audio materials in natural language and prompts the LLM conditioned on audio descriptions and user requests. WavCraft leverages the in-context learning ability of the LLM to decompose users' instructions into several tasks and tackle each task collaboratively with the particular module. Through task decomposition along with a set of task-specific models, WavCraft follows the input instruction to create or edit audio content with more details and rationales, facilitating user control. In addition, WavCraft is able to cooperate with users via dialogue interaction and even produce the audio content without explicit user commands. Experiments demonstrate that WavCraft yields better performance than existing methods, especially when adjusting the local regions of audio clips. Moreover, WavCraft can follow complex instructions to edit and create audio content on top of input recordings, facilitating audio producers in a broader range of applications. Our implementation and demos are available at https://github.com/JinhuaLiang/WavCraft.
Submitted 10 May, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Multi-Center Fetal Brain Tissue Annotation (FeTA) Challenge 2022 Results
Authors:
Kelly Payette,
Céline Steger,
Roxane Licandro,
Priscille de Dumast,
Hongwei Bran Li,
Matthew Barkovich,
Liu Li,
Maik Dannecker,
Chen Chen,
Cheng Ouyang,
Niccolò McConnell,
Alina Miron,
Yongmin Li,
Alena Uus,
Irina Grigorescu,
Paula Ramirez Gilliland,
Md Mahfuzur Rahman Siddiquee,
Daguang Xu,
Andriy Myronenko,
Haoyu Wang,
Ziyan Huang,
Jin Ye,
Mireia Alenyà,
Valentin Comte,
Oscar Camara
, et al. (42 additional authors not shown)
Abstract:
Segmentation is a critical step in analyzing the developing human fetal brain. There have been vast improvements in automatic segmentation methods in the past several years, and the Fetal Brain Tissue Annotation (FeTA) Challenge 2021 helped to establish an excellent standard of fetal brain segmentation. However, FeTA 2021 was a single-center study, and the generalizability of algorithms across different imaging centers remains unsolved, limiting real-world clinical applicability. The multi-center FeTA Challenge 2022 focuses on advancing the generalizability of fetal brain segmentation algorithms for magnetic resonance imaging (MRI). In FeTA 2022, the training dataset contained images and corresponding manually annotated multi-class labels from two imaging centers, and the testing data contained images from these two imaging centers as well as two additional unseen centers. The data from different centers varied in many aspects, including scanners used, imaging parameters, and fetal brain super-resolution algorithms applied. 16 teams participated in the challenge, and 17 algorithms were evaluated. Here, a detailed overview and analysis of the challenge results are provided, focusing on the generalizability of the submissions. Both in and out of domain, the white matter and ventricles were segmented with the highest accuracy, while the most challenging structure remains the cerebral cortex due to anatomical complexity. The FeTA Challenge 2022 was able to successfully evaluate and advance generalizability of multi-class fetal brain tissue segmentation algorithms for MRI and it continues to benchmark new algorithms. The resulting new methods contribute to improving the analysis of brain development in utero.
Submitted 8 February, 2024;
originally announced February 2024.
-
Key-Graph Transformer for Image Restoration
Authors:
Bin Ren,
Yawei Li,
Jingyun Liang,
Rakesh Ranjan,
Mengyuan Liu,
Rita Cucchiara,
Luc Van Gool,
Nicu Sebe
Abstract:
While it is crucial to capture global information for effective image restoration (IR), integrating such cues into transformer-based methods becomes computationally expensive, especially with high input resolution. Furthermore, the self-attention mechanism in transformers is prone to considering unnecessary global cues from unrelated objects or regions, introducing computational inefficiencies. In response to these challenges, we introduce the Key-Graph Transformer (KGT) in this paper. Specifically, KGT views patch features as graph nodes. The proposed Key-Graph Constructor efficiently forms a sparse yet representative Key-Graph by selectively connecting essential nodes instead of all the nodes. Then the proposed Key-Graph Attention is conducted under the guidance of the Key-Graph only among selected nodes with linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed KGT's state-of-the-art performance, showcasing advancements both quantitatively and qualitatively.
Submitted 4 February, 2024;
originally announced February 2024.
-
ClST: A Convolutional Transformer Framework for Automatic Modulation Recognition by Knowledge Distillation
Authors:
Dongbin Hou,
Lixin Li,
Wensheng Lin,
Junli Liang,
Zhu Han
Abstract:
With the rapid development of deep learning (DL) in recent years, automatic modulation recognition (AMR) with DL has achieved high accuracy. However, insufficient training signal data in complicated channel environments and large-scale DL models are critical factors that make DL methods difficult to deploy in practice. To address these problems, we propose a novel neural network named convolution-linked signal transformer (ClST) and a novel knowledge distillation method named signal knowledge distillation (SKD). The ClST is accomplished through three primary modifications: a hierarchy of transformers containing convolution, a novel attention mechanism named parallel spatial-channel attention (PSCA), and a novel convolutional transformer block named convolution-transformer projection (CTP) to leverage a convolutional projection. The SKD is a knowledge distillation method that effectively reduces the parameters and complexity of neural networks. We train two lightweight neural networks using the SKD algorithm, KD-CNN and KD-MobileNet, to meet the demand for neural networks that can be used on miniaturized devices. The simulation results demonstrate that the ClST outperforms advanced neural networks on all datasets. Moreover, both KD-CNN and KD-MobileNet obtain higher recognition accuracy with less network complexity, which is very beneficial for the deployment of AMR on miniaturized communication devices.
Submitted 28 December, 2023;
originally announced December 2023.
-
Perception-Distortion Balanced Super-Resolution: A Multi-Objective Optimization Perspective
Authors:
Lingchen Sun,
Jie Liang,
Shuaizheng Liu,
Hongwei Yong,
Lei Zhang
Abstract:
High perceptual quality and low distortion degree are two important goals in image restoration tasks such as super-resolution (SR). Most of the existing SR methods aim to achieve these goals by minimizing the corresponding yet conflicting losses, such as the $\ell_1$ loss and the adversarial loss. Unfortunately, the commonly used gradient-based optimizers, such as Adam, struggle to balance these objectives because the contradictory losses have opposing gradient descent directions. In this paper, we formulate the perception-distortion trade-off in SR as a multi-objective optimization problem and develop a new optimizer by integrating the gradient-free evolutionary algorithm (EA) with gradient-based Adam, where EA and Adam focus on the divergence and convergence of the optimization directions respectively. As a result, a population of optimal models with different perception-distortion preferences is obtained. We then design a fusion network to merge these models into a single stronger one for an effective perception-distortion trade-off. Experiments demonstrate that with the same backbone network, the perception-distortion balanced SR model trained by our method can achieve better perceptual quality than its competitors while attaining better reconstruction fidelity. Codes and models can be found at https://github.com/csslc/EA-Adam.
Submitted 23 December, 2023;
originally announced December 2023.
-
Latency versus Transmission Power Trade-off in Free-Space Optical (FSO) Satellite Networks with Multiple Inter-Continental Connections
Authors:
Jintao Liang,
Aizaz Chaudhry,
John Chinneck,
Halim Yanikomeroglu,
Gunes Kurt,
Peng Hu,
Khaled Ahmed,
Stephane Martel
Abstract:
In free-space optical satellite networks (FSOSNs), where satellites are connected via laser inter-satellite links (LISLs), latency is a critical factor, especially for long-distance inter-continental connections. Since satellites depend on solar panels for power supply, power consumption is also a vital factor. We investigate the minimization of total network latency (i.e., the sum of the network latencies of all inter-continental connections in a time slot) in a realistic model of an FSOSN, the latest version of the Starlink Phase 1 Version 3 constellation. We develop mathematical formulations of the total network latency over different LISL ranges and different satellite transmission power constraints for multiple simultaneous inter-continental connections. We use practical system models for calculating network latency and satellite optical link transmission power, and we formulate the problem as a binary integer linear program. The results reveal that, for satellite transmission power limits set at 0.5 W, 0.3 W, and 0.1 W, the average total network latency for all five inter-continental connections studied in this work levels off at 339 ms, 361 ms, and 542 ms, respectively. Furthermore, the corresponding LISL ranges required to achieve these average total network latency values are 4500 km, 3000 km, and 1731 km, respectively. Different limitations on satellite transmission power exhibit varying effects on average total network latency (over 100 time slots), and they also induce differing changes in the corresponding LISL ranges. In the absence of satellite transmission power constraints, as the LISL range extends from the minimum feasible range of 1575 km to the maximum feasible range of 5016 km, the average total network latency decreases from 589 ms to 311 ms.
Submitted 7 December, 2023;
originally announced December 2023.
-
Free-Space Optical (FSO) Satellite Networks Performance Analysis: Transmission Power, Latency, and Outage Probability
Authors:
Jintao Liang,
Aizaz U. Chaudhry,
Eylem Erdogan,
Halim Yanikomeroglu,
Gunes Karabulut Kurt,
Peng Hu,
Khaled Ahmed,
Stephane Martel
Abstract:
In free-space optical satellite networks (FSOSNs), satellites can have different laser inter-satellite link (LISL) ranges for connectivity. Greater LISL ranges can reduce network latency of the path but can also result in an increase in transmission power for satellites on the path. Consequently, this tradeoff between satellite transmission power and network latency should be investigated, and in this work we examine it in FSOSNs drawing on the Starlink Phase 1 Version 3 and Kuiper Shell 2 constellations for different LISL ranges and different inter-continental connections. We use appropriate system models for calculating the average satellite transmission power and network latency. The results show that the mean network latency decreases and mean average satellite transmission power increases with an increase in LISL range. For the Toronto--Sydney inter-continental connection in an FSOSN with Starlink's Phase 1 Version 3 constellation, when the LISL range is approximately 2,900 km, the mean network latency and mean average satellite transmission power intersect, and are approximately 135 ms and 380 mW, respectively. For an FSOSN with the Kuiper Shell 2 constellation in this inter-continental connection, this LISL range is around 3,800 km, and the two parameters are approximately 120 ms and 700 mW, respectively. For the Toronto--Istanbul and Toronto--London inter-continental connections, the LISL ranges at the intersection are different and vary from 2,600 km to 3,400 km. Furthermore, we analyze the outage probability performance of the optical uplink/downlink due to atmospheric attenuation and turbulence.
Submitted 7 December, 2023;
originally announced December 2023.
-
Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities
Authors:
Jinhua Liang,
Xubo Liu,
Wenwu Wang,
Mark D. Plumbley,
Huy Phan,
Emmanouil Benetos
Abstract:
The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown their promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate the data scarcity in the audio domain, a multi-task learning strategy is proposed by formulating diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the framework of the audio language model by using interleaved audio-text embeddings as the input sequence. This improved framework imposes zero constraints on the input format and thus is capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses across two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the expert models (i.e., the networks trained on the targeted datasets) across various tasks. We finally demonstrate APT's ability to extend frozen VLMs to the audio domain without finetuning, achieving promising results in the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
Submitted 30 November, 2023;
originally announced December 2023.
-
MoVideo: Motion-Aware Video Generation with Diffusion Models
Authors:
Jingyun Liang,
Yuchen Fan,
Kai Zhang,
Radu Timofte,
Luc Van Gool,
Rakesh Ranjan
Abstract:
While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the latter describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that either exists or is generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, optical flow-based warped latent video and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
Submitted 29 July, 2024; v1 submitted 19 November, 2023;
originally announced November 2023.
-
Detection of Small Targets in Sea Clutter Based on RepVGG and Continuous Wavelet Transform
Authors:
Jingchen Ni,
Haoru Li,
Lilin Xu,
Jing Liang
Abstract:
Constructing a high-performance target detector under the background of sea clutter is always necessary and important. In this work, we propose a RepVGGA0-CWT detector, where RepVGG is a residual network that achieves high detection accuracy. Different from traditional residual networks, RepVGG keeps an acceptable calculation speed. Considering both accuracy and speed, RepVGGA0 is selected among all the variants of RepVGG. Also, continuous wavelet transform (CWT) is employed to extract the radar echoes' time-frequency features effectively. In the tests, other networks (ResNet50, ResNet18 and AlexNet) and feature extraction methods (short-time Fourier transform (STFT), CWT) are combined to build detectors for comparison. Results on different datasets show that the RepVGGA0-CWT detector performs better than those detectors in terms of low controllable false alarm rate, high training speed, high inference speed and low memory usage. This RepVGGA0-CWT detector is hardware-friendly and can be applied in real-time scenes owing to its high inference speed in detection.
Submitted 14 November, 2023;
originally announced November 2023.
-
Fast and High-Performance Learned Image Compression With Improved Checkerboard Context Model, Deformable Residual Module, and Knowledge Distillation
Authors:
Haisheng Fu,
Feng Liang,
Jie Liang,
Yongqiang Wang,
Guohe Zhang,
Jingning Han
Abstract:
Deep learning-based image compression has made great progress recently. However, many leading schemes use a serial context-adaptive entropy model to improve the rate-distortion (R-D) performance, which is very slow. In addition, the complexities of the encoding and decoding networks are quite high and not suitable for many practical applications. In this paper, we introduce four techniques to balance the trade-off between the complexity and performance. First, we are the first to introduce a deformable convolutional module into the compression framework, which can remove more redundancies in the input image, thereby enhancing compression performance. Second, we design a checkerboard context model with two separate distribution parameter estimation networks and different probability models, which enables parallel decoding without sacrificing the performance compared to the sequential context-adaptive model. Third, we develop an improved three-step knowledge distillation and training scheme to achieve different trade-offs between the complexity and the performance of the decoder network, which transfers both the final and intermediate results of the teacher network to the student network to help its training. Fourth, we introduce $L_{1}$ regularization to make the numerical values of the latent representation more sparse. Then we only encode non-zero channels in the encoding and decoding process, which can greatly reduce the encoding and decoding time. Experiments show that compared to the state-of-the-art learned image coding scheme, our method can be about 20 times faster in encoding and 70-90 times faster in decoding, and our R-D performance is also $2.3\%$ higher. Our method outperforms the traditional H.266/VVC-intra (4:4:4) approach and some leading learned schemes in terms of PSNR and MS-SSIM metrics when tested on the Kodak and Tecnick-40 datasets.
Submitted 5 September, 2023;
originally announced September 2023.
-
Enhanced Residual SwinV2 Transformer for Learned Image Compression
Authors:
Yongqiang Wang,
Feng Liang,
Haisheng Fu,
Jie Liang,
Haipeng Qin,
Junzhe Liang
Abstract:
Recently, deep learning technology has been successfully applied in the field of image compression, leading to superior rate-distortion performance. However, a challenge of many learning-based approaches is that they often achieve better performance at the cost of higher complexity, which makes practical deployment difficult. To alleviate this issue, in this paper, we propose an effective and efficient learned image compression framework based on an enhanced residual SwinV2 transformer. To enhance the nonlinear representation of images in our framework, we use a feature enhancement module that consists of three consecutive convolutional layers. In the subsequent coding and hyper coding steps, we utilize a SwinV2 transformer-based attention mechanism to process the input image. The SwinV2 model can help to reduce model complexity while maintaining high performance. Experimental results show that the proposed method achieves performance comparable to some recent learned image compression methods on the Kodak and Tecnick datasets, and outperforms some traditional codecs including VVC. In particular, our method achieves comparable results while reducing model complexity by 56% compared to these recent methods.
Submitted 22 August, 2023;
originally announced August 2023.
-
Efficient collision avoidance for autonomous vehicles in polygonal domains
Authors:
Jiayu Fan,
Nikolce Murgovski,
Jun Liang
Abstract:
This research focuses on trajectory planning problems for autonomous vehicles utilizing numerical optimal control techniques. The study reformulates the constrained optimization problem into a nonlinear programming problem, incorporating explicit collision avoidance constraints. We present three novel, exact formulations to describe collision constraints. The first formulation is derived from a proposition concerning the separation of a point and a convex set. We prove the separating proposition through De Morgan's laws. Then, leveraging the hyperplane separation theorem, we propose two efficient reformulations. Compared with the existing dual formulations and the first formulation, they significantly reduce the number of auxiliary variables to be optimized and inequality constraints within the nonlinear programming problem. Finally, the efficacy of the proposed formulations is demonstrated in the context of typical autonomous parking scenarios compared with the state of the art. For generality, we design three initial guesses to assess the computational effort required for convergence to solutions when using the different collision formulations. The results illustrate that the scheme employing De Morgan's laws performs as well as those utilizing dual formulations, while the other two schemes based on the hyperplane separation theorem exhibit the added benefit of requiring lower computational resources.
Submitted 12 December, 2023; v1 submitted 17 August, 2023;
originally announced August 2023.
-
OnUVS: Online Feature Decoupling Framework for High-Fidelity Ultrasound Video Synthesis
Authors:
Han Zhou,
Dong Ni,
Ao Chang,
Xinrui Zhou,
Rusi Chen,
Yanlin Chen,
Lian Liu,
Jiamin Liang,
Yuhao Huang,
Tong Han,
Zhe Liu,
Deng-Ping Fan,
Xin Yang
Abstract:
Ultrasound (US) imaging is indispensable in clinical practice. To diagnose certain diseases, sonographers must observe corresponding dynamic anatomic structures to gather comprehensive information. However, the limited availability of specific US video cases causes teaching difficulties in identifying corresponding diseases, which potentially impacts the detection rate of such cases. The synthesis of US videos may represent a promising solution to this issue. Nevertheless, it is challenging to accurately animate the intricate motion of dynamic anatomic structures while preserving image fidelity. To address this, we present a novel online feature-decoupling framework called OnUVS for high-fidelity US video synthesis. Our highlights can be summarized by four aspects. First, we introduced anatomic information into keypoint learning through a weakly-supervised training strategy, resulting in improved preservation of anatomical integrity and motion while minimizing the labeling burden. Second, to better preserve the integrity and textural information of US images, we implemented a dual-decoder that decouples the content and textural features in the generator. Third, we adopted a multiple-feature discriminator to extract a comprehensive range of visual cues, thereby enhancing the sharpness and fine details of the generated videos. Fourth, we constrained the motion trajectories of keypoints during online learning to enhance the fluidity of generated videos. Our validation and user studies on in-house echocardiographic and pelvic floor US videos showed that OnUVS synthesizes US videos with high fidelity.
Submitted 16 August, 2023;
originally announced August 2023.
-
WavJourney: Compositional Audio Creation with Large Language Models
Authors:
Xubo Liu,
Zhongkai Zhu,
Haohe Liu,
Yi Yuan,
Meng Cui,
Qiushi Huang,
Jinhua Liang,
Yin Cao,
Qiuqiang Kong,
Mark D. Plumbley,
Wenwu Wang
Abstract:
Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation systems. We present WavJourney, a novel framework that leverages Large Language Models (LLMs) to connect various audio models for audio creation. WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions. Specifically, given a text instruction, WavJourney first prompts LLMs to generate an audio script that serves as a structured semantic representation of audio elements. The audio script is then converted into a computer program, where each line of the program calls a task-specific audio generation model or computational operation function. The computer program is then executed to obtain a compositional and interpretable solution for audio creation. Experimental results suggest that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions, achieving state-of-the-art results on text-to-audio generation benchmarks. Additionally, we introduce a new multi-genre story benchmark. Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text. We further demonstrate that WavJourney can facilitate human-machine co-creation in multi-round dialogues. To foster future research, the code and synthesized audio are available at: https://audio-agi.github.io/WavJourney_demopage/.
Submitted 26 November, 2023; v1 submitted 26 July, 2023;
originally announced July 2023.
-
Adapting Language-Audio Models as Few-Shot Audio Learners
Authors:
Jinhua Liang,
Xubo Liu,
Haohe Liu,
Huy Phan,
Emmanouil Benetos,
Mark D. Plumbley,
Wenwu Wang
Abstract:
We presented the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data. Specifically, we designed CALM to retrieve the probability distribution of text-audio clips over classes using a set of audio-label pairs and combined it with CLAP's zero-shot classification results. Furthermore, we designed a training-free version of the Treff adapter by using CALM as a cosine similarity measure. Experiments showed that the proposed Treff adapter is comparable to, and even better than, fully-supervised methods and adaptation methods in low-shot and data-abundant scenarios. While the Treff adapter shows that combining large-scale pretraining and rapid learning of domain-specific knowledge is non-trivial for obtaining generic representations for few-shot learning, it is still limited to audio classification tasks. In the future, we will explore how to use audio-language models in diverse audio domains.
Submitted 28 May, 2023;
originally announced May 2023.
-
Denoising Diffusion Models for Plug-and-Play Image Restoration
Authors:
Yuanzhi Zhu,
Kai Zhang,
Jingyun Liang,
Jiezhang Cao,
Bihan Wen,
Radu Timofte,
Luc Van Gool
Abstract:
Plug-and-play Image Restoration (IR) has been widely recognized as a flexible and interpretable method for solving various inverse problems by utilizing any off-the-shelf denoiser as the implicit image prior. However, most existing methods focus on discriminative Gaussian denoisers. Although diffusion models have shown impressive performance for high-quality image synthesis, their potential to serve as a generative denoiser prior for plug-and-play IR methods remains to be further explored. While several other attempts have been made to adopt diffusion models for image restoration, they either fail to achieve satisfactory results or typically require an unacceptable number of Neural Function Evaluations (NFEs) during inference. This paper proposes DiffPIR, which integrates the traditional plug-and-play method into the diffusion sampling framework. Compared to plug-and-play IR methods that rely on discriminative Gaussian denoisers, DiffPIR is expected to inherit the generative ability of diffusion models. Experimental results on three representative IR tasks, including super-resolution, image deblurring, and inpainting, demonstrate that DiffPIR achieves state-of-the-art performance on both the FFHQ and ImageNet datasets in terms of reconstruction faithfulness and perceptual quality with no more than 100 NFEs. The source code is available at https://github.com/yuanzhi-zhu/DiffPIR.
Submitted 15 May, 2023;
originally announced May 2023.
-
ROI-based Deep Image Compression with Swin Transformers
Authors:
Binglin Li,
Jie Liang,
Haisheng Fu,
Jingning Han
Abstract:
Encoding the Region Of Interest (ROI) with better quality than the background has many applications including video conferencing systems, video surveillance and object-oriented vision tasks. In this paper, we propose an ROI-based image compression framework with Swin transformers as main building blocks for the autoencoder network. The binary ROI mask is integrated into different layers of the network to provide spatial information guidance. Based on the ROI mask, we can control the relative importance of the ROI and non-ROI by modifying the corresponding Lagrange multiplier $\lambda$ for different regions. Experimental results show our model achieves higher ROI PSNR than other methods and modest average PSNR for human evaluation. When tested on models pre-trained with original images, it has superior object detection and instance segmentation performance on the COCO validation dataset.
Submitted 12 May, 2023;
originally announced May 2023.
-
M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis
Authors:
Jinlong Xue,
Yayue Deng,
Fengping Wang,
Ya Li,
Yingming Gao,
Jianhua Tao,
Jianqing Sun,
Jiaen Liang
Abstract:
Conversational text-to-speech (TTS) aims to synthesize speech with proper prosody of reply based on the historical conversation. However, it is still a challenge to comprehensively model the conversation, and a majority of conversational TTS systems only focus on extracting global information and omit local prosody features, which contain important fine-grained information like keywords and emphasis. Moreover, it is insufficient to only consider the textual features, and acoustic features also contain various prosody information. Hence, we propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system, aiming to comprehensively utilize historical conversation and enhance prosodic expression. More specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results demonstrate that our model mixed with fine-grained context information and additionally considering acoustic features achieves better prosody performance and naturalness in CMOS tests.
Submitted 3 May, 2023;
originally announced May 2023.
-
Driver Profiling and Bayesian Workload Estimation Using Naturalistic Peripheral Detection Study Data
Authors:
Nermin Caber,
Bashar I. Ahmad,
Jiaming Liang,
Simon Godsill,
Alexandra Bremers,
Philip Thomas,
David Oxtoby,
Lee Skrypchuk
Abstract:
Monitoring drivers' mental workload facilitates initiating and maintaining safe interactions with in-vehicle information systems, and thus delivers adaptive human-machine interaction with reduced impact on the primary task of driving. In this paper, we tackle the problem of workload estimation from driving performance data. First, we present a novel on-road study for collecting subjective workload data via a modified peripheral detection task in naturalistic settings. Key environmental factors that induce a high mental workload are identified via video analysis, e.g. junctions and the behaviour of the vehicle in front. Second, a supervised learning framework using state-of-the-art time series classifiers (e.g. convolutional neural network and transform techniques) is introduced to profile drivers based on the average workload they experience during a journey. A Bayesian filtering approach is then proposed for sequentially estimating, in (near) real-time, the driver's instantaneous workload. This computationally efficient and flexible method can be easily personalised to a driver (e.g. incorporate their inferred average workload profile), adapted to driving/environmental contexts (e.g. road type) and extended with data streams from new sources. The efficacy of the presented profiling and instantaneous workload estimation approaches is demonstrated using the on-road study data, showing $F_{1}$ scores of up to 92% and 81%, respectively.
Submitted 8 September, 2023; v1 submitted 26 March, 2023;
originally announced March 2023.
-
SigVIC: Spatial Importance Guided Variable-Rate Image Compression
Authors:
Jiaming Liang,
Meiqin Liu,
Chao Yao,
Chunyu Lin,
Yao Zhao
Abstract:
Variable-rate mechanism has improved the flexibility and efficiency of learning-based image compression that trains multiple models for different rate-distortion tradeoffs. One of the most common approaches for variable-rate is to channel-wisely or spatial-uniformly scale the internal features. However, the diversity of spatial importance is instructive for bit allocation of image compression. In this paper, we introduce a Spatial Importance Guided Variable-rate Image Compression (SigVIC), in which a spatial gating unit (SGU) is designed for adaptively learning a spatial importance mask. Then, a spatial scaling network (SSN) takes the spatial importance mask to guide the feature scaling and bit allocation for variable-rate. Moreover, to improve the quality of decoded image, Top-K shallow features are selected to refine the decoded features through a shallow feature fusion module (SFFM). Experiments show that our method outperforms other learning-based methods (whether variable-rate or not) and traditional codecs, with storage saving and high flexibility.
Submitted 16 March, 2023;
originally announced March 2023.
-
Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study
Authors:
Yi Yuan,
Haohe Liu,
Jinhua Liang,
Xubo Liu,
Mark D. Plumbley,
Wenwu Wang
Abstract:
Deep neural networks have recently achieved breakthroughs in sound generation. Despite the outstanding sample quality, current sound generation models face issues on small-scale datasets (e.g., overfitting), significantly limiting performance. In this paper, we make the first attempt to investigate the benefits of pre-training on sound generation with AudioLDM, the cutting-edge model for audio generation, as the backbone. Our study demonstrates the advantages of the pre-trained AudioLDM, especially in data-scarcity scenarios. In addition, the baselines and evaluation protocol for sound generation systems are not consistent enough to compare different studies directly. Aiming to facilitate further study on sound generation tasks, we benchmark the sound generation task on various frequently-used datasets. We hope our results on transfer learning and benchmarks can provide references for further research on conditional sound generation.
Submitted 29 July, 2024; v1 submitted 7 March, 2023;
originally announced March 2023.
-
Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition
Authors:
Jinhua Liang,
Huy Phan,
Emmanouil Benetos
Abstract:
Everyday sound recognition aims to infer types of sound events in audio streams. While many works succeeded in training models with high performance in a fully-supervised manner, they are still restricted to the demand of large quantities of labelled data and the range of predefined classes. To overcome these drawbacks, this work firstly curates a new database named FSD-FS for multi-label few-shot audio classification. It then explores how to incorporate audio taxonomy in few-shot learning. Specifically, this work proposes label-dependent prototypical networks (LaD-protonet) to exploit parent-children relationships between labels. Plus, it applies taxonomy-aware label smoothing techniques to boost model performance. Experiments demonstrate that LaD-protonet outperforms original prototypical networks as well as other state-of-the-art methods. Moreover, its performance can be further boosted when combined with taxonomy-aware label smoothing.
Submitted 17 December, 2022;
originally announced December 2022.
-
Resource-Interaction Graph: Efficient Graph Representation for Anomaly Detection
Authors:
James Pope,
Jinyuan Liang,
Vijay Kumar,
Francesco Raimondo,
Xinyi Sun,
Ryan McConville,
Thomas Pasquier,
Rob Piechocki,
George Oikonomou,
Bo Luo,
Dan Howarth,
Ioannis Mavromatis,
Adrian Sanchez Mompo,
Pietro Carnelli,
Theodoros Spyridopoulos,
Aftab Khan
Abstract:
Security research has concentrated on converting operating system audit logs into suitable graphs, such as provenance graphs, for analysis. However, provenance graphs can grow very large requiring significant computational resources beyond what is necessary for many security tasks and are not feasible for resource constrained environments, such as edge devices. To address this problem, we present the resource-interaction graph that is built directly from the audit log. We show that the resource-interaction graph's storage requirements are significantly lower than provenance graphs using an open-source data set with two container escape attacks captured from an edge device. We use a graph autoencoder and graph clustering technique to evaluate the representation for an anomaly detection task. Both approaches are unsupervised and are thus suitable for detecting zero-day attacks. The approaches can achieve F1 scores typically over 80% and in some cases over 90% for the selected data set and attacks.
Submitted 16 December, 2022;
originally announced December 2022.
-
A Data Quality Assessment Framework for AI-enabled Wireless Communication
Authors:
Hanning Tang,
Liusha Yang,
Rui Zhou,
Jing Liang,
Hong Wei,
Xuan Wang,
Qingjiang Shi,
Zhi-Quan Luo
Abstract:
Using artificial intelligence (AI) to re-design and enhance the current wireless communication system is a promising pathway for the future sixth-generation (6G) wireless network. The performance of AI-enabled wireless communication depends heavily on the quality of wireless air-interface data. Although there are various approaches to data quality assessment (DQA) for different applications, none has been designed for wireless air-interface data. In this paper, we propose a DQA framework to measure the quality of wireless air-interface data from three aspects: similarity, diversity, and completeness. The similarity measures how close the considered datasets are in terms of their statistical distributions; the diversity measures how well-rounded a dataset is, while the completeness measures to what degree the considered dataset satisfies the required performance metrics in an application scenario. The proposed framework can be applied to various types of wireless air-interface data, such as channel state information (CSI), signal-to-interference-plus-noise ratio (SINR), reference signal received power (RSRP), etc. For simplicity, the validity of our proposed DQA framework is corroborated by applying it to CSI data and using similarity and diversity metrics to improve CSI compression and recovery in Massive MIMO systems.
Submitted 13 December, 2022;
originally announced December 2022.
-
Reconfigurable Intelligent Surface: Power Consumption Modeling and Practical Measurement Validation
Authors:
Jinghe Wang,
Wankai Tang,
Jing Cheng Liang,
Lei Zhang,
Jun Yan Dai,
Xiao Li,
Shi Jin,
Qiang Cheng,
Tie Jun Cui
Abstract:
The reconfigurable intelligent surface (RIS) has received a lot of interest because of its capacity to reconfigure the wireless communication environment in a cost- and energy-efficient way. However, the realistic power consumption modeling and measurement validation of RIS has received far too little attention. Therefore, in this work, we model the power consumption of RIS and conduct measurement validations using various RISs to fill this vacancy. Firstly, we propose a practical power consumption model of RIS. The RIS hardware is divided into three basic parts: the FPGA control board, the drive circuits, and the RIS unit cells. The power consumption of the first two parts is modeled as $P_{\text {static}}$ and that of the last part is modeled as $P_{\text {units}}$. Expressions of $P_{\text {static}}$ and $P_{\text {units}}$ vary amongst different types of RISs. Secondly, we conduct measurements on various RISs to validate the proposed model. Five different RISs including the PIN diode, varactor diode, and RF switch types are measured, and measurement results validate the generality and applicability of the proposed power consumption model of RIS. Finally, we summarize the measurement results and discuss the approaches to achieve the low-power-consumption design of RIS-assisted wireless communication systems.
Submitted 6 February, 2024; v1 submitted 1 November, 2022;
originally announced November 2022.
-
Circuit Solutions towards Broadband Piezoelectric Energy Harvesting: An Impedance Analysis
Authors:
Bao Zhao,
Junrui Liang
Abstract:
In the studies of piezoelectric energy harvesting (PEH) systems, literature has shown that circuit advancement has a significant effect on the enhancement of energy harvesting capability in resonance. On the other hand, some recent studies using the phase-variable (PV) synchronized switch technologies have found that the advanced circuit solutions can also broaden the harvesting bandwidth. However, the available span of the electrically induced dynamics by the existing energy harvesting circuits was not properly defined and demonstrated. Performance comparison among different circuits cannot be fairly achieved without using a common theoretical language. Given these, this paper provides an impedance-based analysis and comparison on the electromechanical joint dynamics of the PEH systems using different interface circuits. Given that the resonance tunability by circuit solutions has received no attention in the conventional ideal model of kinetic energy harvester, we firstly propose a more inclusive ideal model for better generalization. In practice, it was proven that the attainable dynamic ranges of the practical energy harvesting circuits are only some subsets of the ideal realm. A detailed quantitative study on the attainable ranges of the PV circuit solutions is provided after the introduction of the ideal target. Simulation and experimental results of different interface circuits show good agreement with the theoretical analysis. It can be concluded that the resonance tunability strongly depends on the achievable extent in the reactive direction of the equivalent impedance plane. In practice, the electromechanical coupling conditions and dielectric loss might also influence the resonance tunability. The general ideal model and quantitative impedance analysis provided in this paper help guide the future design effort towards high-capability and broadband PEH systems.
Submitted 31 October, 2022;
originally announced October 2022.
-
Electroanatomic Mapping to determine Scar Regions in patients with Atrial Fibrillation
Authors:
Jiyue He,
Kuk Jin Jang,
Katie Walsh,
Jackson Liang,
Sanjay Dixit,
Rahul Mangharam
Abstract:
Left atrial voltage maps are routinely acquired during electroanatomic mapping in patients undergoing catheter ablation for atrial fibrillation. For patients, who have prior catheter ablation when they are in sinus rhythm, the voltage map can be used to identify low voltage areas using a threshold of 0.2 - 0.45 mV. However, such a voltage threshold for maps acquired during atrial fibrillation has not been well established. A prerequisite for defining a voltage threshold is to maximize the topologically matched low voltage areas between the electroanatomic mapping acquired during atrial fibrillation and sinus rhythm. This paper demonstrates a new technique to improve the sensitivity and specificity of the matched low voltage areas. This is achieved by computing omni-directional bipolar voltages and applying Gaussian Process Regression based interpolation to derive the atrial fibrillation map. The proposed method is evaluated on a test cohort of 7 male patients, and a total of 46,589 data points were included in analysis. The low voltage areas in the posterior left atrium and pulmonary vein junction are determined using the standard method and the proposed method. Overall, the proposed method showed patient-specific sensitivity and specificity in matching low voltage areas of 75.70% and 65.55% for a geometric mean of 70.69%. On average, there was an improvement of 3.00% in the geometric mean, 7.88% improvement in sensitivity, 0.30% improvement in specificity compared to the standard method. The results show that the proposed method is an improvement in matching low voltage areas. This may help develop the voltage threshold to better identify low voltage areas in the left atrium for patients in atrial fibrillation.
Submitted 8 November, 2022; v1 submitted 23 October, 2022;
originally announced October 2022.
-
Learning Task-Oriented Flows to Mutually Guide Feature Alignment in Synthesized and Real Video Denoising
Authors:
Jiezhang Cao,
Qin Wang,
Jingyun Liang,
Yulun Zhang,
Kai Zhang,
Radu Timofte,
Luc Van Gool
Abstract:
Video denoising aims at removing noise from videos to recover clean ones. Some existing works show that optical flow can help the denoising by exploiting the additional spatial-temporal clues from nearby frames. However, the flow estimation itself is also sensitive to noise, and can be unusable under large noise levels. To this end, we propose a new multi-scale refined optical flow-guided video denoising method, which is more robust to different noise levels. Our method mainly consists of a denoising-oriented flow refinement (DFR) module and a flow-guided mutual denoising propagation (FMDP) module. Unlike previous works that directly use off-the-shelf flow solutions, DFR first learns robust multi-scale optical flows, and FMDP makes use of the flow guidance by progressively introducing and refining more flow information from low resolution to high resolution. Together with real noise degradation synthesis, the proposed multi-scale flow-guided denoising network achieves state-of-the-art performance on both synthetic Gaussian denoising and real video denoising. The codes will be made publicly available.
Submitted 25 March, 2023; v1 submitted 24 August, 2022;
originally announced August 2022.
-
Learning Quantization in LDPC Decoders
Authors:
Marvin Geiselhart,
Ahmed Elkelesh,
Jannis Clausius,
Fei Liang,
Wen Xu,
Jing Liang,
Stephan ten Brink
Abstract:
Finding optimal message quantization is a key requirement for low complexity belief propagation (BP) decoding. To this end, we propose a floating-point surrogate model that imitates quantization effects as additions of uniform noise, whose amplitudes are trainable variables. We verify that the surrogate model closely matches the behavior of a fixed-point implementation and propose a hand-crafted loss function to realize a trade-off between complexity and error-rate performance. A deep learning-based method is then applied to optimize the message bitwidths. Moreover, we show that parameter sharing can both ensure implementation-friendly solutions and results in faster training convergence than independent parameters. We provide simulation results for 5G low-density parity-check (LDPC) codes and report an error-rate performance within 0.2 dB of floating-point decoding at an average message quantization bitwidth of 3.1 bits. In addition, we show that the learned bitwidths also generalize to other code rates and channels.
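The idea of imitating quantization by additive uniform noise can be checked numerically. The sketch below is not the authors' surrogate model: it uses a fixed step size instead of a trainable amplitude, omits clipping, and all names (`quantize`, `surrogate`, `step`) are chosen for illustration. It only shows that the error of a uniform quantizer and uniform noise on [-step/2, step/2] both have variance close to step²/12.

```python
import random
import statistics


def quantize(x, step):
    """Uniform (mid-tread) quantizer with the given step size, no clipping."""
    return step * round(x / step)


def surrogate(x, step, rng):
    """Floating-point stand-in: add uniform noise of half-width step / 2."""
    return x + rng.uniform(-step / 2, step / 2)


rng = random.Random(0)  # fixed seed so the run is repeatable
step = 0.25
samples = [rng.gauss(0.0, 2.0) for _ in range(50000)]

# Error introduced by the real quantizer and by the noise surrogate.
q_err = [quantize(x, step) - x for x in samples]
s_err = [surrogate(x, step, rng) - x for x in samples]

q_var = statistics.pvariance(q_err)
s_var = statistics.pvariance(s_err)
expected = step ** 2 / 12  # variance of a uniform variable of width `step`
```

Because the noise version is differentiable in the step size, a gradient-based optimizer can adjust it, which is presumably why a surrogate of this kind is convenient for learning bitwidths.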
Submitted 10 August, 2022;
originally announced August 2022.
-
Weakly-supervised High-fidelity Ultrasound Video Synthesis with Feature Decoupling
Authors:
Jiamin Liang,
Xin Yang,
Yuhao Huang,
Kai Liu,
Xinrui Zhou,
Xindi Hu,
Zehui Lin,
Huanjia Luo,
Yuanji Zhang,
Yi Xiong,
Dong Ni
Abstract:
Ultrasound (US) is widely used for its real-time imaging, freedom from radiation, and portability. In clinical practice, analysis and diagnosis often rely on US sequences rather than a single image to obtain dynamic anatomical information. This is challenging for novices to learn because practicing with adequate videos from patients is clinically impractical. In this paper, we propose a novel framework to synthesize high-fidelity US videos. Specifically, the synthesized videos are generated by animating source content images based on the motion of given driving videos. Our highlights are three-fold. First, leveraging the advantages of self- and fully-supervised learning, our proposed system is trained in a weakly-supervised manner for keypoint detection. These keypoints then provide vital information for handling complex high dynamic motions in US videos. Second, we decouple content and texture learning using the dual decoders to effectively reduce the model learning difficulty. Last, we adopt the adversarial training strategy with GAN losses for further improving the sharpness of the generated videos, narrowing the gap between real and synthesized videos. We validate our method on a large in-house pelvic dataset with high dynamic motion. Extensive evaluation metrics and a user study demonstrate the effectiveness of our proposed method.
Submitted 1 July, 2022;
originally announced July 2022.
-
Asymmetric Learned Image Compression with Multi-Scale Residual Block, Importance Map, and Post-Quantization Filtering
Authors:
Haisheng Fu,
Feng Liang,
Jie Liang,
Binglin Li,
Guohe Zhang,
Jingning Han
Abstract:
Recently, deep learning-based image compression has made significant progress, and has achieved better rate-distortion (R-D) performance than the latest traditional method, H.266/VVC, in both subjective metric and the more challenging objective metric. However, a major problem is that many leading learned schemes cannot maintain a good trade-off between performance and complexity. In this paper, we propose an efficient and effective image coding framework, which achieves similar R-D performance with lower complexity than the state of the art. First, we develop an improved multi-scale residual block (MSRB) that can expand the receptive field and more easily captures global information. It can further capture and reduce the spatial correlation of the latent representations. Second, a more advanced importance map network is introduced to adaptively allocate bits to different regions of the image. Third, we apply a 2D post-quantization filter (PQF) to reduce the quantization error, motivated by the Sample Adaptive Offset (SAO) filter in video coding. Moreover, we find that the complexities of the encoder and decoder have different effects on image compression performance. Based on this observation, we design an asymmetric paradigm, in which the encoder employs three stages of MSRBs to improve the learning capacity, whereas the decoder only needs one stage of MSRB to yield satisfactory reconstruction, thereby reducing the decoding complexity without sacrificing performance. Experimental results show that compared to the state-of-the-art method, encoding and decoding with the proposed method are about 17 times faster, and the R-D performance is only reduced by less than 1% on both Kodak and Tecnick datasets, which is still better than H.266/VVC (4:4:4) and other recent learning-based methods. Our source code is publicly available at https://github.com/fengyurenpingsheng.
Submitted 21 June, 2022;
originally announced June 2022.
-
Recurrent Video Restoration Transformer with Guided Deformable Attention
Authors:
Jingyun Liang,
Yuchen Fan,
Xiaoyu Xiang,
Rakesh Ranjan,
Eddy Ilg,
Simon Green,
Jiezhang Cao,
Kai Zhang,
Radu Timofte,
Luc Van Gool
Abstract:
Video restoration aims at restoring multiple high-quality frames from multiple low-quality frames. Existing video restoration methods generally fall into two extreme cases, i.e., they either restore all frames in parallel or restore the video frame by frame in a recurrent way, which would result in different merits and drawbacks. Typically, the former has the advantage of temporal information fusion. However, it suffers from large model size and intensive memory consumption; the latter has a relatively small model size as it shares parameters across frames; however, it lacks long-range dependency modeling ability and parallelizability. In this paper, we attempt to integrate the advantages of the two cases by proposing a recurrent video restoration transformer, namely RVRT. RVRT processes local neighboring frames in parallel within a globally recurrent framework which can achieve a good trade-off between model size, effectiveness, and efficiency. Specifically, RVRT divides the video into multiple clips and uses the previously inferred clip feature to estimate the subsequent clip feature. Within each clip, different frame features are jointly updated with implicit feature aggregation. Across different clips, the guided deformable attention is designed for clip-to-clip alignment, which predicts multiple relevant locations from the whole inferred clip and aggregates their features by the attention mechanism. Extensive experiments on video super-resolution, deblurring, and denoising show that the proposed RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory and runtime.
Submitted 12 November, 2022; v1 submitted 5 June, 2022;
originally announced June 2022.
-
Tensor Shape Search for Optimum Data Compression
Authors:
Ryan Solgi,
Zichang He,
William Jiahua Liang,
Zheng Zhang
Abstract:
Various tensor decomposition methods have been proposed for data compression. In real world applications of the tensor decomposition, selecting the tensor shape for the given data poses a challenge and the shape of the tensor may affect the error and the compression ratio. In this work, we study the effect of the tensor shape on the tensor decomposition and propose an optimization model to find an optimum shape for the tensor train (TT) decomposition. The proposed optimization model maximizes the compression ratio of the TT decomposition given an error bound. We implement a genetic algorithm (GA) linked with the TT-SVD algorithm to solve the optimization model. We apply the proposed method for the compression of RGB images. The results demonstrate the effectiveness of the proposed evolutionary tensor shape search for the TT decomposition.
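Why the tensor shape matters for the compression ratio can be seen from the standard parameter count of a tensor train: core k stores r_{k-1} · n_k · r_k numbers. The sketch below compares a few reshapings of the same 4096 values. The TT-ranks used here are hypothetical placeholders; in the paper's setting they would come out of TT-SVD for a given error bound, and the function names are chosen for illustration.

```python
from math import prod


def tt_num_params(shape, ranks):
    """Number of stored values in a tensor train with the given mode sizes.

    `ranks` has len(shape) + 1 entries and must start and end with 1.
    """
    if len(ranks) != len(shape) + 1 or ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("ranks must have len(shape) + 1 entries, bounded by 1")
    return sum(ranks[k] * n * ranks[k + 1] for k, n in enumerate(shape))


def compression_ratio(shape, ranks):
    """Original element count divided by the tensor-train parameter count."""
    return prod(shape) / tt_num_params(shape, ranks)


# Three reshapings of the same 4096 values, with assumed (not computed) ranks.
candidates = {
    (64, 64): (1, 8, 1),
    (16, 16, 16): (1, 8, 8, 1),
    (8, 8, 8, 8): (1, 4, 4, 4, 1),
}
ratios = {shape: compression_ratio(shape, ranks) for shape, ranks in candidates.items()}
best_shape = max(ratios, key=ratios.get)
```

A search procedure such as the genetic algorithm mentioned in the abstract would explore candidate shapes like these, scoring each by the ratio actually achieved under the error bound.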
Submitted 21 May, 2022;
originally announced May 2022.
-
Link Budget Analysis for Free-Space Optical Satellite Networks
Authors:
Jintao Liang,
Aizaz U. Chaudhry,
Eylem Erdogan,
Halim Yanikomeroglu
Abstract:
Free-space optical satellite networks (FSOSNs) will employ free-space optical links between satellites and between satellites and ground stations, and the link budget for optical inter-satellite links and optical uplink/downlink is analyzed in this paper. The satellites in these FSOSNs will have limited energy and thereby limited power, and we investigate the effect of link distance and link margin on optical inter-satellite link transmission power, and the effect of slant distance, elevation angle, and link margin on optical uplink/downlink transmission power. We model these optical links and compute the results for various parameters. We observe that the transmission power increases when the link distance increases for inter-satellite and uplink/downlink communications, while the transmission power decreases when the elevation angle increases for uplink/downlink transmission. We also observe an inverse relationship between link margin and link distance. Furthermore, we highlight some practical insights and design guidelines gained from this analysis.
Submitted 27 April, 2022;
originally announced April 2022.
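The trends reported in the link budget abstract above (transmission power rising with link distance and falling with elevation angle) can be illustrated with a simplified free-space link equation. This is a rough sketch under stated assumptions, not the model used in the paper. It assumes identical transmit and receive apertures with gain (πD/λ)², free-space path loss (λ/4πd)², and a single lumped optics efficiency, and it ignores pointing loss and atmospheric attenuation. All numeric defaults (wavelength, aperture, efficiency, receiver sensitivity, altitude) are hypothetical placeholders.

```python
import math

EARTH_RADIUS_M = 6_371e3  # mean Earth radius


def slant_distance(altitude_m, elevation_deg):
    """Ground-station-to-satellite distance for a given elevation angle."""
    theta = math.radians(elevation_deg)
    r = EARTH_RADIUS_M
    return math.sqrt((r + altitude_m) ** 2 - (r * math.cos(theta)) ** 2) - r * math.sin(theta)


def required_tx_power(distance_m, rx_sensitivity_w, link_margin_db,
                      wavelength_m=1550e-9, aperture_m=0.08, efficiency=0.8):
    """Transmit power (W) needed to meet the receiver sensitivity plus margin.

    Assumed model: P_R = P_T * eff_T * eff_R * G_T * G_R * L_PS, with
    G = (pi * D / lambda)^2 and L_PS = (lambda / (4 * pi * d))^2.
    """
    gain = (math.pi * aperture_m / wavelength_m) ** 2
    path_loss = (wavelength_m / (4 * math.pi * distance_m)) ** 2
    margin = 10 ** (link_margin_db / 10)
    return rx_sensitivity_w * margin / (efficiency ** 2 * gain ** 2 * path_loss)


# Hypothetical example: 550 km altitude, 3 dB margin, 1 uW receiver sensitivity.
for elevation in (30, 60, 90):
    d = slant_distance(550e3, elevation)
    p = required_tx_power(d, rx_sensitivity_w=1e-6, link_margin_db=3.0)
    print(f"elevation={elevation} deg  slant={d / 1e3:.0f} km  P_T={p * 1e3:.3f} mW")
```

In this simplified model the required power scales with the square of the distance, so a larger elevation angle, which shortens the slant path, reduces the required power. For a fixed transmit power, the same equation gives a link margin that shrinks as the distance grows.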
-
DiRA: Discriminative, Restorative, and Adversarial Learning for Self-supervised Medical Image Analysis
Authors:
Fatemeh Haghighi,
Mohammad Reza Hosseinzadeh Taher,
Michael B. Gotway,
Jianming Liang
Abstract:
Discriminative learning, restorative learning, and adversarial learning have proven beneficial for self-supervised learning schemes in computer vision and medical imaging. Existing efforts, however, omit their synergistic effects on each other in a ternary setup, which, we envision, can significantly benefit deep semantic representation learning. To realize this vision, we have developed DiRA, the first framework that unites discriminative, restorative, and adversarial learning in a unified manner to collaboratively glean complementary visual information from unlabeled medical images for fine-grained semantic representation learning. Our extensive experiments demonstrate that DiRA (1) encourages collaborative learning among three learning ingredients, resulting in more generalizable representation across organs, diseases, and modalities; (2) outperforms fully supervised ImageNet models and increases robustness in small data regimes, reducing annotation cost across multiple medical imaging applications; (3) learns fine-grained semantic representation, facilitating accurate lesion localization with only image-level annotation; and (4) enhances state-of-the-art restorative approaches, revealing that DiRA is a general mechanism for united representation learning. All code and pre-trained models are available at https://github.com/JLiangLab/DiRA.
Submitted 21 April, 2022;
originally announced April 2022.
-
CAiD: Context-Aware Instance Discrimination for Self-supervised Learning in Medical Imaging
Authors:
Mohammad Reza Hosseinzadeh Taher,
Fatemeh Haghighi,
Michael B. Gotway,
Jianming Liang
Abstract:
Recently, self-supervised instance discrimination methods have achieved significant success in learning visual representations from unlabeled photographic images. However, given the marked differences between photographic and medical images, the efficacy of instance-based objectives, which focus on learning the most discriminative global features in the image (e.g., wheels in a bicycle), remains unknown in medical imaging. Our preliminary analysis showed that the high global similarity of medical images in terms of anatomy hampers instance discrimination methods in capturing a set of distinct features, negatively impacting their performance on medical downstream tasks. To alleviate this limitation, we have developed a simple yet effective self-supervised framework, called Context-Aware instance Discrimination (CAiD). CAiD aims to improve instance discrimination learning by providing finer and more discriminative information encoded from a diverse local context of unlabeled medical images. We conduct a systematic analysis to investigate the utility of the learned features from a three-pronged perspective: (i) generalizability and transferability, (ii) separability in the embedding space, and (iii) reusability. Our extensive experiments demonstrate that CAiD (1) enriches representations learned from existing instance discrimination methods; (2) delivers more discriminative features by adequately capturing finer contextual information from individual medical images; and (3) improves reusability of low/mid-level features compared to standard instance discrimination methods. As open science, all code and pre-trained models are available on our GitHub page: https://github.com/JLiangLab/CAiD.
Submitted 15 April, 2022;
originally announced April 2022.