-
Multi-User Continuous-Aperture Array Communications: How to Learn Current Distribution?
Authors:
Jia Guo,
Yuanwei Liu,
Arumugam Nallanathan
Abstract:
The continuous aperture array (CAPA) can provide higher degree-of-freedom and spatial resolution than the spatially discrete array (SDPA), where optimizing multi-user current distributions in CAPA systems is crucial but challenging. The challenge arises from solving non-convex functional optimization problems without closed-form objective functions and constraints. In this paper, we propose a deep…
▽ More
The continuous aperture array (CAPA) can provide higher degree-of-freedom and spatial resolution than the spatially discrete array (SDPA), where optimizing multi-user current distributions in CAPA systems is crucial but challenging. The challenge arises from solving non-convex functional optimization problems without closed-form objective functions and constraints. In this paper, we propose a deep learning framework called L-CAPA to learn current distribution policies. In the framework, we find finite-dimensional representations of channel functions and current distributions, allowing them to be inputted into and outputted from a deep neural network (DNN) for learning the policy. To address the issue that the integrals in the loss function without closed-form expressions hinder training the DNN in an unsupervised manner, we propose to design another two DNNs for learning the integrals. The DNNs are designed as graph neural networks to incorporate with the permutation properties of the mappings to be learned, thereby improving learning performance. Simulation results show that L-CAPA can achieve the performance upper-bound of optimizing precoding in the SDPA system as the number of antennas approaches infinity, and it is with low inference complexity.
△ Less
Submitted 20 August, 2024;
originally announced August 2024.
-
Assessment of Cell Nuclei AI Foundation Models in Kidney Pathology
Authors:
Junlin Guo,
Siqi Lu,
Can Cui,
Ruining Deng,
Tianyuan Yao,
Zhewen Tao,
Yizhe Lin,
Marilyn Lionts,
Quan Liu,
Juming Xiong,
Catie Chang,
Mitchell Wilkes,
Mengmeng Yin,
Haichun Yang,
Yuankai Huo
Abstract:
Cell nuclei instance segmentation is a crucial task in digital kidney pathology. Traditional automatic segmentation methods often lack generalizability when applied to unseen datasets. Recently, the success of foundation models (FMs) has provided a more generalizable solution, potentially enabling the segmentation of any cell type. In this study, we perform a large-scale evaluation of three widely…
▽ More
Cell nuclei instance segmentation is a crucial task in digital kidney pathology. Traditional automatic segmentation methods often lack generalizability when applied to unseen datasets. Recently, the success of foundation models (FMs) has provided a more generalizable solution, potentially enabling the segmentation of any cell type. In this study, we perform a large-scale evaluation of three widely used state-of-the-art (SOTA) cell nuclei foundation models (Cellpose, StarDist, and CellViT). Specifically, we created a highly diverse evaluation dataset consisting of 2,542 kidney whole slide images (WSIs) collected from both human and rodent sources, encompassing various tissue types, sizes, and staining methods. To our knowledge, this is the largest-scale evaluation of its kind to date. Our quantitative analysis of the prediction distribution reveals a persistent performance gap in kidney pathology. Among the evaluated models, CellViT demonstrated superior performance in segmenting nuclei in kidney pathology. However, none of the foundation models are perfect; a performance gap remains in general nuclei segmentation for kidney pathology.
△ Less
Submitted 9 August, 2024;
originally announced August 2024.
-
Hi-SAM: A high-scalable authentication model for satellite-ground Zero-Trust system using mean field game
Authors:
Xuesong Wu,
Tianshuai Zheng,
Runfang Wu,
Jie Ren,
Junyan Guo,
Ye Du
Abstract:
As more and more Internet of Thing (IoT) devices are connected to satellite networks, the Zero-Trust Architecture brings dynamic security to the satellite-ground system, while frequent authentication creates challenges for system availability. To make the system's accommodate more IoT devices, this paper proposes a high-scalable authentication model (Hi-SAM). Hi-SAM introduces the Proof-of-Work id…
▽ More
As more and more Internet of Thing (IoT) devices are connected to satellite networks, the Zero-Trust Architecture brings dynamic security to the satellite-ground system, while frequent authentication creates challenges for system availability. To make the system's accommodate more IoT devices, this paper proposes a high-scalable authentication model (Hi-SAM). Hi-SAM introduces the Proof-of-Work idea to authentication, which allows device to obtain the network resource based on frequency. To optimize the frequency, mean field game is used for competition among devices, which can reduce the decision space of large-scale population games. And a dynamic time-range message authentication code is designed for security. From the test at large population scales, Hi-SAM is superior in the optimization of authentication workload and the anomaly detection efficiency.
△ Less
Submitted 12 August, 2024;
originally announced August 2024.
-
An Explainable Non-local Network for COVID-19 Diagnosis
Authors:
Jingfu Yang,
Peng Huang,
Jing Hu,
Shu Hu,
Siwei Lyu,
Xin Wang,
Jun Guo,
Xi Wu
Abstract:
The CNN has achieved excellent results in the automatic classification of medical images. In this study, we propose a novel deep residual 3D attention non-local network (NL-RAN) to classify CT images included COVID-19, common pneumonia, and normal to perform rapid and explainable COVID-19 diagnosis. We built a deep residual 3D attention non-local network that could achieve end-to-end training. The…
▽ More
The CNN has achieved excellent results in the automatic classification of medical images. In this study, we propose a novel deep residual 3D attention non-local network (NL-RAN) to classify CT images included COVID-19, common pneumonia, and normal to perform rapid and explainable COVID-19 diagnosis. We built a deep residual 3D attention non-local network that could achieve end-to-end training. The network is embedded with a nonlocal module to capture global information, while a 3D attention module is embedded to focus on the details of the lesion so that it can directly analyze the 3D lung CT and output the classification results. The output of the attention module can be used as a heat map to increase the interpretability of the model. 4079 3D CT scans were included in this study. Each scan had a unique label (novel coronavirus pneumonia, common pneumonia, and normal). The CT scans cohort was randomly split into a training set of 3263 scans, a validation set of 408 scans, and a testing set of 408 scans. And compare with existing mainstream classification methods, such as CovNet, CBAM, ResNet, etc. Simultaneously compare the visualization results with visualization methods such as CAM. Model performance was evaluated using the Area Under the ROC Curve(AUC), precision, and F1-score. The NL-RAN achieved the AUC of 0.9903, the precision of 0.9473, and the F1-score of 0.9462, surpass all the classification methods compared. The heat map output by the attention module is also clearer than the heat map output by CAM. Our experimental results indicate that our proposed method performs significantly better than existing methods. In addition, the first attention module outputs a heat map containing detailed outline information to increase the interpretability of the model. Our experiments indicate that the inference of our model is fast. It can provide real-time assistance with diagnosis.
△ Less
Submitted 8 August, 2024;
originally announced August 2024.
-
Wireless Communications in Doubly Selective Channels with Domain Adaptivity
Authors:
J. Andrew Zhang,
Hongyang Zhang,
Kai Wu,
Xiaojing Huang,
Jinhong Yuan,
Y. Jay Guo
Abstract:
Wireless communications are significantly impacted by the propagation environment, particularly in doubly selective channels with variations in both time and frequency domains. Orthogonal Time Frequency Space (OTFS) modulation has emerged as a promising solution; however, its high equalization complexity, if performed in the delay-Doppler domain, limits its universal application. This article expl…
▽ More
Wireless communications are significantly impacted by the propagation environment, particularly in doubly selective channels with variations in both time and frequency domains. Orthogonal Time Frequency Space (OTFS) modulation has emerged as a promising solution; however, its high equalization complexity, if performed in the delay-Doppler domain, limits its universal application. This article explores domain-adaptive system design, dynamically selecting best-fit domains for modulation, pilot placement, and equalization based on channel conditions, to enhance performance across diverse environments. We examine domain classifications and connections, signal designs, and equalization techniques with domain adaptivity, and finally highlight future research opportunities.
△ Less
Submitted 31 July, 2024; v1 submitted 31 July, 2024;
originally announced July 2024.
-
Towards scalable efficient on-device ASR with transfer learning
Authors:
Laxmi Pandey,
Ke Li,
Jinxi Guo,
Debjyoti Paul,
Arthur Guo,
Jay Mahadeokar,
Xuedong Zhang
Abstract:
Multilingual pretraining for transfer learning significantly boosts the robustness of low-resource monolingual ASR models. This study systematically investigates three main aspects: (a) the impact of transfer learning on model performance during initial training or fine-tuning, (b) the influence of transfer learning across dataset domains and languages, and (c) the effect on rare-word recognition…
▽ More
Multilingual pretraining for transfer learning significantly boosts the robustness of low-resource monolingual ASR models. This study systematically investigates three main aspects: (a) the impact of transfer learning on model performance during initial training or fine-tuning, (b) the influence of transfer learning across dataset domains and languages, and (c) the effect on rare-word recognition compared to non-rare words. Our finding suggests that RNNT-loss pretraining, followed by monolingual fine-tuning with Minimum Word Error Rate (MinWER) loss, consistently reduces Word Error Rates (WER) across languages like Italian and French. WER Reductions (WERR) reach 36.2% and 42.8% compared to monolingual baselines for MLS and in-house datasets. Out-of-domain pretraining leads to 28% higher WERR than in-domain pretraining. Both rare and non-rare words benefit, with rare words showing greater improvements with out-of-domain pretraining, and non-rare words with in-domain pretraining.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
Deep Learning-based CSI Feedback in Wi-Fi Systems
Authors:
Fan Qi,
Jiajia Guo,
Yiming Cui,
Xiangyi Li,
Chao-Kai Wen,
Shi Jin
Abstract:
In Wi-Fi systems, channel state information (CSI) plays a crucial role in enabling access points to execute beamforming operations. However, the feedback overhead associated with CSI significantly hampers the throughput improvements. Recent advancements in deep learning (DL) have transformed the approach to CSI feedback in cellular systems. Drawing inspiration from the successes witnessed in the r…
▽ More
In Wi-Fi systems, channel state information (CSI) plays a crucial role in enabling access points to execute beamforming operations. However, the feedback overhead associated with CSI significantly hampers the throughput improvements. Recent advancements in deep learning (DL) have transformed the approach to CSI feedback in cellular systems. Drawing inspiration from the successes witnessed in the realm of mobile communications, this paper introduces a DL-based CSI feedback framework, named EFNet, tailored for Wi-Fi systems. The proposed framework leverages an autoencoder to achieve precise feedback with minimal overhead. The process involves the station utilizing the encoder to compress and quantize a series of matrices into codeword bit streams, which are then fed back to the access point. Subsequently, the decoder installed at the AP reconstructs beamforming matrices from these bit streams. We implement the EFNet system using standard Wi-Fi equipment operating in the 2.4 GHz band. Experimental findings in an office environment reveal a remarkable 80.77% reduction in feedback overhead compared to the 802.11ac standard, alongside a significant boost in net throughput of up to 30.72%.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
Hidden Convexity-Based Distributed Operation of Integrated Electricity-Gas Systems
Authors:
Rong-Peng Liu,
Yue Song,
Junhong Liu,
Xiaozhe Wang,
Jinpeng Guo,
Yunhe Hou
Abstract:
We propose a hidden convexity-based method to address distributed optimal energy flow (OEF) problems for transmission-level integrated electricity-gas systems. First, we develop a node-wise decoupling method to de-compose an OEF problem into multiple OEF subproblems. Then, we propose a hidden convexity-based method to equivalently reformulate nonconvex OEF subproblems as semi-definite programs. Th…
▽ More
We propose a hidden convexity-based method to address distributed optimal energy flow (OEF) problems for transmission-level integrated electricity-gas systems. First, we develop a node-wise decoupling method to de-compose an OEF problem into multiple OEF subproblems. Then, we propose a hidden convexity-based method to equivalently reformulate nonconvex OEF subproblems as semi-definite programs. This method differs from any ap-proximation and convexification methods that may incur infeasible solutions. Since all OEF subproblems are origi-nally convex or equivalently convexified, we adopt an ADMM to solve the hidden convexity-based distributed OEF problem with convergence analysis. Test results validate the effectiveness of the proposed method, especially in handling a large number of agents.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
Free-SurGS: SfM-Free 3D Gaussian Splatting for Surgical Scene Reconstruction
Authors:
Jiaxin Guo,
Jiangliu Wang,
Di Kang,
Wenzhen Dong,
Wenting Wang,
Yun-hui Liu
Abstract:
Real-time 3D reconstruction of surgical scenes plays a vital role in computer-assisted surgery, holding a promise to enhance surgeons' visibility. Recent advancements in 3D Gaussian Splatting (3DGS) have shown great potential for real-time novel view synthesis of general scenes, which relies on accurate poses and point clouds generated by Structure-from-Motion (SfM) for initialization. However, 3D…
▽ More
Real-time 3D reconstruction of surgical scenes plays a vital role in computer-assisted surgery, holding a promise to enhance surgeons' visibility. Recent advancements in 3D Gaussian Splatting (3DGS) have shown great potential for real-time novel view synthesis of general scenes, which relies on accurate poses and point clouds generated by Structure-from-Motion (SfM) for initialization. However, 3DGS with SfM fails to recover accurate camera poses and geometry in surgical scenes due to the challenges of minimal textures and photometric inconsistencies. To tackle this problem, in this paper, we propose the first SfM-free 3DGS-based method for surgical scene reconstruction by jointly optimizing the camera poses and scene representation. Based on the video continuity, the key of our method is to exploit the immediate optical flow priors to guide the projection flow derived from 3D Gaussians. Unlike most previous methods relying on photometric loss only, we formulate the pose estimation problem as minimizing the flow loss between the projection flow and optical flow. A consistency check is further introduced to filter the flow outliers by detecting the rigid and reliable points that satisfy the epipolar geometry. During 3D Gaussian optimization, we randomly sample frames to optimize the scene representations to grow the 3D Gaussian progressively. Experiments on the SCARED dataset demonstrate our superior performance over existing methods in novel view synthesis and pose estimation with high efficiency. Code is available at https://github.com/wrld/Free-SurGS.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
An End-to-End Speech Summarization Using Large Language Model
Authors:
Hengchao Shang,
Zongyao Li,
Jiaxin Guo,
Shaojun Li,
Zhiqiang Rao,
Yuanchang Luo,
Daimeng Wei,
Hao Yang
Abstract:
Abstractive Speech Summarization (SSum) aims to generate human-like text summaries from spoken content. It encounters difficulties in handling long speech input and capturing the intricate cross-modal mapping between long speech inputs and short text summaries. Research on large language models (LLMs) and multimodal information fusion has provided new insights for addressing these challenges. In t…
▽ More
Abstractive Speech Summarization (SSum) aims to generate human-like text summaries from spoken content. It encounters difficulties in handling long speech input and capturing the intricate cross-modal mapping between long speech inputs and short text summaries. Research on large language models (LLMs) and multimodal information fusion has provided new insights for addressing these challenges. In this paper, we propose an end-to-end SSum model that utilizes Q-Former as a connector for the audio-text modality and employs LLMs to generate text summaries directly from speech features. We adopt a multi-stage training approach that includes LLM based ASR and Text Summarization (TSum) tasks as auxiliary tasks. ASR tasks are used to align feature spaces and enhance the LLM's ability to handle longer speech. Then, we utilize a curriculum learning strategy to facilitate the model's transition from TSum to SSum. Finally, our model achieves competitive performance on the How-2 dataset.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
VideoQA-SC: Adaptive Semantic Communication for Video Question Answering
Authors:
Jiangyuan Guo,
Wei Chen,
Yuxuan Sun,
Jialong Xu,
Bo Ai
Abstract:
Although semantic communication (SC) has shown its potential in efficiently transmitting multi-modal data such as text, speeches and images, SC for videos has focused primarily on pixel-level reconstruction. However, these SC systems may be suboptimal for downstream intelligent tasks. Moreover, SC systems without pixel-level video reconstruction present advantages by achieving higher bandwidth eff…
▽ More
Although semantic communication (SC) has shown its potential in efficiently transmitting multi-modal data such as text, speeches and images, SC for videos has focused primarily on pixel-level reconstruction. However, these SC systems may be suboptimal for downstream intelligent tasks. Moreover, SC systems without pixel-level video reconstruction present advantages by achieving higher bandwidth efficiency and real-time performance of various intelligent tasks. The difficulty in such system design lies in the extraction of task-related compact semantic representations and their accurate delivery over noisy channels. In this paper, we propose an end-to-end SC system for video question answering (VideoQA) tasks called VideoQA-SC. Our goal is to accomplish VideoQA tasks directly based on video semantics over noisy or fading wireless channels, bypassing the need for video reconstruction at the receiver. To this end, we develop a spatiotemporal semantic encoder for effective video semantic extraction, and a learning-based bandwidth-adaptive deep joint source-channel coding (DJSCC) scheme for efficient and robust video semantic transmission. Experiments demonstrate that VideoQA-SC outperforms traditional and advanced DJSCC-based SC systems that rely on video reconstruction at the receiver under a wide range of channel conditions and bandwidth constraints. In particular, when the signal-to-noise ratio is low, VideoQA-SC can improve the answer accuracy by 5.17% while saving almost 99.5% of the bandwidth at the same time, compared with the advanced DJSCC-based SC system. Our results show the great potential of task-oriented SC system design for video applications.
△ Less
Submitted 17 May, 2024;
originally announced June 2024.
-
Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases
Authors:
Meng Wang,
Tian Lin,
Aidi Lin,
Kai Yu,
Yuanyuan Peng,
Lianyu Wang,
Cheng Chen,
Ke Zou,
Huiyu Liang,
Man Chen,
Xue Yao,
Meiqin Zhang,
Binwei Huang,
Chaoxin Zheng,
Peixin Zhang,
Wei Chen,
Yilong Luo,
Yifan Chen,
Honghe Xia,
Tingkun Shi,
Qi Zhang,
Jinming Guo,
Xiaolin Chen,
Jingcheng Wang,
Yih Chung Tham
, et al. (24 additional authors not shown)
Abstract:
Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources…
▽ More
Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources, encompassing a diverse range of diseases across multiple ethnicities and countries. RetiZero exhibits superior performance in several downstream tasks, including zero-shot disease recognition, image-to-image retrieval, and internal- and cross-domain disease identification. In zero-shot scenarios, RetiZero achieves Top5 accuracy scores of 0.8430 for 15 fundus diseases and 0.7561 for 52 fundus diseases. For image retrieval, it achieves Top5 scores of 0.9500 and 0.8860 for the same disease sets, respectively. Clinical evaluations show that RetiZero's Top3 zero-shot performance surpasses the average of 19 ophthalmologists from Singapore, China and the United States. Furthermore, RetiZero significantly enhances clinicians' accuracy in diagnosing fundus disease. These findings underscore the value of integrating the RetiZero foundation model into clinical settings, where a variety of fundus diseases are encountered.
△ Less
Submitted 30 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR
Authors:
Shaojun Li,
Daimeng Wei,
Hengchao Shang,
Jiaxin Guo,
ZongYao Li,
Zhanglin Wu,
Zhiqiang Rao,
Yuanchang Luo,
Xianghui He,
Hao Yang
Abstract:
Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, the performance can degrade due to vocal characteristic mismatches between training and testing data, particularly with limited target speaker adaptation data. We propose a novel speaker adaptation approach Speaker-Smoothed kNN that leverages k-Nearest Neighbors (kNN) retrieval techniques to improve model out…
▽ More
Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, the performance can degrade due to vocal characteristic mismatches between training and testing data, particularly with limited target speaker adaptation data. We propose a novel speaker adaptation approach Speaker-Smoothed kNN that leverages k-Nearest Neighbors (kNN) retrieval techniques to improve model output by finding correctly pronounced tokens from its pre-built datastore during the decoding phase. Moreover, we utilize x-vector to dynamically adjust kNN interpolation parameters for data sparsity issue. This approach was validated using KeSpeech and MagicData corpora under in-domain and all-domain settings. Our method consistently performs comparably to fine-tuning without the associated performance degradation during speaker changes. Furthermore, in the all-domain setting, our method achieves state-of-the-art results, reducing the CER in both single speaker and multi-speaker test scenarios.
△ Less
Submitted 1 July, 2024; v1 submitted 7 June, 2024;
originally announced June 2024.
-
CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
Authors:
Yongyi Zang,
Jiatong Shi,
You Zhang,
Ryuichi Yamamoto,
Jionghao Han,
Yuxun Tang,
Shengyuan Xu,
Wenxiao Zhao,
Jing Guo,
Tomoki Toda,
Zhiyao Duan
Abstract:
Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi…
▽ More
Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from training distribution. The CtrSVDD dataset and baselines are publicly accessible.
△ Less
Submitted 18 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
MAMOC: MRI Motion Correction via Masked Autoencoding
Authors:
Lennart Alexander Van der Goten,
Jingyu Guo,
Kevin Smith
Abstract:
The presence of motion artifacts in magnetic resonance imaging (MRI) scans poses a significant challenge, where even minor patient movements can lead to artifacts that may compromise the scan's utility. This paper introduces Masked Motion Correction (MAMOC), a novel method designed to address the issue of Retrospective Artifact Correction (RAC) in motion-affected MRI brain scans. MAMOC uses masked…
▽ More
The presence of motion artifacts in magnetic resonance imaging (MRI) scans poses a significant challenge, where even minor patient movements can lead to artifacts that may compromise the scan's utility. This paper introduces Masked Motion Correction (MAMOC), a novel method designed to address the issue of Retrospective Artifact Correction (RAC) in motion-affected MRI brain scans. MAMOC uses masked autoencoding self-supervision and test-time prediction to efficiently remove motion artifacts, producing state-of-the-art, native resolution scans. Until recently, realistic data to evaluate retrospective motion correction methods did not exist, motion artifacts had to be simulated. Leveraging the MR-ART dataset, this work is the first to evaluate motion correction in MRI scans using real motion data, showing the superiority of MAMOC to existing motion correction (MC) methods.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
Efficient Real-world Image Super-Resolution Via Adaptive Directional Gradient Convolution
Authors:
Long Peng,
Yang Cao,
Renjing Pei,
Wenbo Li,
Jiaming Guo,
Xueyang Fu,
Yang Wang,
Zheng-Jun Zha
Abstract:
Real-SR endeavors to produce high-resolution images with rich details while mitigating the impact of multiple degradation factors. Although existing methods have achieved impressive achievements in detail recovery, they still fall short when addressing regions with complex gradient arrangements due to the intensity-based linear weighting feature extraction manner. Moreover, the stochastic artifact…
▽ More
Real-SR endeavors to produce high-resolution images with rich details while mitigating the impact of multiple degradation factors. Although existing methods have achieved impressive achievements in detail recovery, they still fall short when addressing regions with complex gradient arrangements due to the intensity-based linear weighting feature extraction manner. Moreover, the stochastic artifacts introduced by degradation cues during the imaging process in real LR increase the disorder of the overall image details, further complicating the perception of intrinsic gradient arrangement. To address these challenges, we innovatively introduce kernel-wise differential operations within the convolutional kernel and develop several learnable directional gradient convolutions. These convolutions are integrated in parallel with a novel linear weighting mechanism to form an Adaptive Directional Gradient Convolution (DGConv), which adaptively weights and fuses the basic directional gradients to improve the gradient arrangement perception capability for both regular and irregular textures. Coupled with DGConv, we further devise a novel equivalent parameter fusion method for DGConv that maintains its rich representational capabilities while keeping computational costs consistent with a single Vanilla Convolution (VConv), enabling DGConv to improve the performance of existing super-resolution networks without incurring additional computational expenses. To better leverage the superiority of DGConv, we further develop an Adaptive Information Interaction Block (AIIBlock) to adeptly balance the enhancement of texture and contrast while meticulously investigating the interdependencies, culminating in the creation of a DGPNet for Real-SR through simple stacking. Comparative results with 15 SOTA methods across three public datasets underscore the effectiveness and efficiency of our proposed approach.
△ Less
Submitted 11 May, 2024;
originally announced May 2024.
-
Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey
Authors:
Marcos V. Conde,
Zhijun Lei,
Wen Li,
Cosmin Stejerean,
Ioannis Katsavounidis,
Radu Timofte,
Kihwan Yoon,
Ganzorig Gankhuyag,
Jiangtao Lv,
Long Sun,
Jinshan Pan,
Jiangxin Dong,
Jinhui Tang,
Zhiyuan Li,
Hao Wei,
Chenyang Ge,
Dongyang Zhang,
Tianle Liu,
Huaian Chen,
Yi Jin,
Menghan Zhou,
Yiqiang Yan,
Si Gao,
Biao Wu,
Shaoli Liu
, et al. (50 additional authors not shown)
Abstract:
This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…
▽ More
This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
LVNS-RAVE: Diversified audio generation with RAVE and Latent Vector Novelty Search
Authors:
Jinyue Guo,
Anna-Maria Christodoulou,
Balint Laczko,
Kyrre Glette
Abstract:
Evolutionary Algorithms and Generative Deep Learning have been two of the most powerful tools for sound generation tasks. However, they have limitations: Evolutionary Algorithms require complicated designs, posing challenges in control and achieving realistic sound generation. Generative Deep Learning models often copy from the dataset and lack creativity. In this paper, we propose LVNS-RAVE, a me…
▽ More
Evolutionary Algorithms and Generative Deep Learning have been two of the most powerful tools for sound generation tasks. However, they have limitations: Evolutionary Algorithms require complicated designs, posing challenges in control and achieving realistic sound generation. Generative Deep Learning models often copy from the dataset and lack creativity. In this paper, we propose LVNS-RAVE, a method to combine Evolutionary Algorithms and Generative Deep Learning to produce realistic and novel sounds. We use the RAVE model as the sound generator and the VGGish model as a novelty evaluator in the Latent Vector Novelty Search (LVNS) algorithm. The reported experiments show that the method can successfully generate diversified, novel audio samples under different mutation setups using different pre-trained RAVE models. The characteristics of the generation process can be easily controlled with the mutation parameters. The proposed algorithm can be a creative tool for sound artists and musicians.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
Sparse Direction of Arrival Estimation Method Based on Vector Signal Reconstruction with a Single Vector Sensor
Authors:
Jiabin Guo
Abstract:
This study investigates the application of single vector hydrophones in underwater acoustic signal processing for Direction of Arrival (DOA) estimation. Addressing the limitations of traditional DOA estimation methods in multi-source environments and under noise interference, this research proposes a Vector Signal Reconstruction (VSR) technique. This technique transforms the covariance matrix of s…
▽ More
This study investigates the application of single vector hydrophones in underwater acoustic signal processing for Direction of Arrival (DOA) estimation. Addressing the limitations of traditional DOA estimation methods in multi-source environments and under noise interference, this research proposes a Vector Signal Reconstruction (VSR) technique. This technique transforms the covariance matrix of single vector hydrophone signals into a Toeplitz structure suitable for gridless sparse methods through complex calculations and vector signal reconstruction. Furthermore, two sparse DOA estimation algorithms based on vector signal reconstruction are introduced. Theoretical analysis and simulation experiments demonstrate that the proposed algorithms significantly improve the accuracy and resolution of DOA estimation in multi-source signals and low Signal-to-Noise Ratio (SNR) environments compared to traditional algorithms. The contribution of this study lies in providing an effective new method for DOA estimation with single vector hydrophones in complex environments, introducing new research directions and solutions in the field of vector hydrophone signal processing.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Yajing Pei,
Yiting Lu,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Wei Sun,
Haoning Wu,
Zicheng Zhang,
Jun Jia,
Zhichao Zhang,
Linhan Cao,
Qiubo Chen,
Xiongkuo Min,
Weisi Lin,
Guangtao Zhai,
Jianhui Sun,
Tianyi Wang,
Lei Li,
Han Kong,
Wenxuan Wang,
Bing Li,
Cheng Luo
, et al. (43 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The…
▽ More
This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
Decade-Bandwidth RF-Input Pseudo-Doherty Load Modulated Balanced Amplifier using Signal-Flow-Based Phase Alignment Design
Authors:
Pingzhu Gong,
Jiachen Guo,
Niteesh Bharadwaj Vangipurapu,
Kenle Chen
Abstract:
This paper reports a first-ever decade-bandwidth pseudo-Doherty load-modulated balanced amplifier (PD-LMBA), designed for emerging 4G/5G communications and multi-band operations. By revisiting the LMBA theory using the signal-flow graph, a frequency-agnostic phase-alignment condition is found that is critical for ensuring intrinsically broadband load modulation behavior. This unique design methodo…
▽ More
This paper reports a first-ever decade-bandwidth pseudo-Doherty load-modulated balanced amplifier (PD-LMBA), designed for emerging 4G/5G communications and multi-band operations. By revisiting the LMBA theory using the signal-flow graph, a frequency-agnostic phase-alignment condition is found that is critical for ensuring intrinsically broadband load modulation behavior. This unique design methodology enables, for the first time, the independent optimization of broadband balanced amplifier (BA, as the peaking) and control amplifier (CA, as the carrier), thus fundamentally addressing the longstanding limits imposed on the design of wideband load-modulated power amplifiers (PAs). To prove the proposed concept, an ultra-wideband RF-input PD-LMBA is designed and developed using GaN technology covering the frequency range from 0.2 to 2 GHz. Experimental results demonstrate an efficiency of 51% to 72% for peak output power and 44% to 62% for 10-dB OBO, respectively.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Yawei Li,
Nancy Mehta,
Radu Timofte,
Hongyuan Yu,
Cheng Wan,
Yuxin Hong,
Bingnan Han,
Zhuoyuan Wu,
Yajun Zou,
Yuqing Liu,
Jizhe Li,
Keji He,
Chao Fan,
Heng Zhang,
Xiaolin Zhang,
Xuanwu Yin,
Kunlong Zuo,
Bohao Liao,
Peizhe Xia,
Long Peng,
Zhibo Du,
Xin Di,
Wangkai Li,
Yang Wang
, et al. (109 additional authors not shown)
Abstract:
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…
▽ More
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (ie runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
△ Less
Submitted 25 June, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Effective internal language model training and fusion for factorized transducer model
Authors:
Jinxi Guo,
Niko Moritz,
Yingyi Ma,
Frank Seide,
Chunyang Wu,
Jay Mahadeokar,
Ozlem Kalinli,
Christian Fuegen,
Mike Seltzer
Abstract:
The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various of factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for…
▽ More
The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various of factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for non-blank token prediction. However, even with the adoption of factorized transducer models, limited improvement has been observed compared to shallow fusion. In this paper, we propose a novel ILM training and decoding strategy for factorized transducer models, which effectively combines the blank, acoustic and ILM scores. Our experiments show a 17% relative improvement over the standard decoding method when utilizing a well-trained ILM and the proposed decoding strategy on LibriSpeech datasets. Furthermore, when compared to a strong RNN-T baseline enhanced with external LM fusion, the proposed model yields a 5.5% relative improvement on general-sets and an 8.9% WER reduction for rare words. The proposed model can achieve superior performance without relying on external language models, rendering it highly efficient for production use-cases. To further improve the performance, we propose a novel and memory-efficient ILM-fusion-aware minimum word error rate (MWER) training method which improves ILM integration significantly.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Emotion-Anchored Contrastive Learning Framework for Emotion Recognition in Conversation
Authors:
Fangxu Yu,
Junjie Guo,
Zhen Wu,
Xinyu Dai
Abstract:
Emotion Recognition in Conversation (ERC) involves detecting the underlying emotion behind each utterance within a conversation. Effectively generating representations for utterances remains a significant challenge in this task. Recent works propose various models to address this issue, but they still struggle with differentiating similar emotions such as excitement and happiness. To alleviate thi…
▽ More
Emotion Recognition in Conversation (ERC) involves detecting the underlying emotion behind each utterance within a conversation. Effectively generating representations for utterances remains a significant challenge in this task. Recent works propose various models to address this issue, but they still struggle with differentiating similar emotions such as excitement and happiness. To alleviate this problem, We propose an Emotion-Anchored Contrastive Learning (EACL) framework that can generate more distinguishable utterance representations for similar emotions. To achieve this, we utilize label encodings as anchors to guide the learning of utterance representations and design an auxiliary loss to ensure the effective separation of anchors for similar emotions. Moreover, an additional adaptation process is proposed to adapt anchors to serve as effective classifiers to improve classification performance. Across extensive experiments, our proposed EACL achieves state-of-the-art emotion recognition performance and exhibits superior performance on similar emotions. Our code is available at https://github.com/Yu-Fangxu/EACL.
△ Less
Submitted 29 March, 2024;
originally announced March 2024.
-
Linear Hybrid Asymmetrical Load-Modulated Balanced Amplifier with Multi-Band Reconfigurability and Antenna-VSWR Resilience
Authors:
Jiachen Guo,
Yuchen Cao,
Kenle Chen
Abstract:
This paper presents the first-ever highly linear and load-insensitive three-way load-modulation power amplifier (PA) based on reconfigurable hybrid asymmetrical load modulated balanced amplifier (H-ALMBA). Through proper amplitude and phase controls, the carrier, control amplifier (CA), and two peaking balanced amplifiers (BA1 and BA2) can form a linear high-order load modulation over wide bandwid…
▽ More
This paper presents the first-ever highly linear and load-insensitive three-way load-modulation power amplifier (PA) based on reconfigurable hybrid asymmetrical load modulated balanced amplifier (H-ALMBA). Through proper amplitude and phase controls, the carrier, control amplifier (CA), and two peaking balanced amplifiers (BA1 and BA2) can form a linear high-order load modulation over wide bandwidth. Moreover, it is theoretically unveiled that the load modulation behavior of H-ALMBA can be insensitive to load mismatch by leveraging bias reconfiguration and the intrinsic load-insensitivity of balanced topology. Specifically, the PA's linearity and efficiency profiles can be maintained against arbitrary load mismatch through $Z_\mathrm{L}$-dependent reconfiguration of CA supply voltage ($V_\mathrm{DD,CA}$) and turning-on sequence of BA1 and BA2. Based on the proposed theory, an RF-input linear H-ALMBA is developed with GaN transistors and wideband quadrature hybrids. Over the design bandwidth from $1.7$-$2.9$ GHz, an efficiency of $56.8\%$$-$$72.9\%$ at peak power and $49.8\%$$-$$61.2\%$ at $10$-dB PBO are measured together with linear AMAM and AMPM responses. In modulated evaluation with 4G LTE signal, an EVM of $3.1\%$, ACPR of $-39$ dB, and average efficiency of up to $52\%$ are measured. Moreover, the reconfigurable H-ALMBA experimentally maintains an excellent average efficiency and linearity against arbitrary load mismatch at $2:1$ VSWR, and this mismatch-resilient operation can be achieved at any in-band frequencies. The overall measured performance favorably outperforms the state-of-the-art.
△ Less
Submitted 27 March, 2024;
originally announced March 2024.
-
Fluid Antenna for Mobile Edge Computing
Authors:
Yiping Zuo,
Jiajia Guo,
Biyun Sheng,
Chen Dai,
Fu Xiao,
Shi Jin
Abstract:
In the evolving environment of mobile edge computing (MEC), optimizing system performance to meet the growing demand for low-latency computing services is a top priority. Integrating fluidic antenna (FA) technology into MEC networks provides a new approach to address this challenge. This letter proposes an FA-enabled MEC scheme that aims to minimize the total system delay by leveraging the mobilit…
▽ More
In the evolving environment of mobile edge computing (MEC), optimizing system performance to meet the growing demand for low-latency computing services is a top priority. Integrating fluidic antenna (FA) technology into MEC networks provides a new approach to address this challenge. This letter proposes an FA-enabled MEC scheme that aims to minimize the total system delay by leveraging the mobility of FA to enhance channel conditions and improve computational offloading efficiency. By establishing an optimization problem focusing on the joint optimization of computation offloading and antenna positioning, we introduce an alternating iterative algorithm based on the interior point method and particle swarm optimization (IPPSO). Numerical results demonstrate the advantages of our proposed scheme compared to traditional fixed antenna positions, showing significant improvements in transmission rates and reductions in delays. The proposed IPPSO algorithm exhibits robust convergence properties, further validating the effectiveness of our method.
△ Less
Submitted 18 March, 2024;
originally announced March 2024.
-
Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives
Authors:
Ronghui Li,
YuXiang Zhang,
Yachao Zhang,
Hongwen Zhang,
Jie Guo,
Yan Zhang,
Yebin Liu,
Xiu Li
Abstract:
We propose Lodge, a network capable of generating extremely long dance sequences conditioned on given music. We design Lodge as a two-stage coarse to fine diffusion architecture, and propose the characteristic dance primitives that possess significant expressiveness as intermediate representations between two diffusion models. The first stage is global diffusion, which focuses on comprehending the…
▽ More
We propose Lodge, a network capable of generating extremely long dance sequences conditioned on given music. We design Lodge as a two-stage coarse to fine diffusion architecture, and propose the characteristic dance primitives that possess significant expressiveness as intermediate representations between two diffusion models. The first stage is global diffusion, which focuses on comprehending the coarse-level music-dance correlation and production characteristic dance primitives. In contrast, the second-stage is the local diffusion, which parallelly generates detailed motion sequences under the guidance of the dance primitives and choreographic rules. In addition, we propose a Foot Refine Block to optimize the contact between the feet and the ground, enhancing the physical realism of the motion. Our approach can parallelly generate dance sequences of extremely long length, striking a balance between global choreographic patterns and local motion quality and expressiveness. Extensive experiments validate the efficacy of our method.
△ Less
Submitted 19 April, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
Achievable Rate Analysis and Optimization of Double-RIS Assisted Spatially Correlated MIMO with Statistical CSI
Authors:
Kaizhe Xu,
Jiajia Guo,
Jun Zhang,
Shi Jin,
Shaodan Ma
Abstract:
Reconfigurable intelligent surface (RIS) is a novel meta-material which can form a smart radio environment by dynamically altering reflection directions of the impinging electromagnetic waves. In the prior literature, the inter-RIS links which also contribute to the performance of the whole system are usually neglected when multiple RISs are deployed. In this paper we investigate a general double-…
▽ More
Reconfigurable intelligent surface (RIS) is a novel meta-material which can form a smart radio environment by dynamically altering reflection directions of the impinging electromagnetic waves. In the prior literature, the inter-RIS links which also contribute to the performance of the whole system are usually neglected when multiple RISs are deployed. In this paper we investigate a general double-RIS assisted multiple-input multiple-output (MIMO) wireless communication system under spatially correlated non line-of-sight propagation channels, where the cooperation of the double RISs is also considered. The design objective is to maximize the achievable ergodic rate based on full statistical channel state information (CSI). Specifically, we firstly present a closed-form asymptotic expression for the achievable ergodic rate by utilizing replica method from statistical physics. Then a full statistical CSI-enabled optimal design is proposed which avoids high pilot training overhead compared to instantaneous CSI-enabled design. To further reduce the signal processing overhead and lower the complexity for practical realization, a common-phase scheme is proposed to design the double RISs. Simulation results show that the derived asymptotic ergodic rate is quite accurate even for small-sized antenna arrays. And the proposed optimization algorithm can achieve substantial gain at the expense of a low overhead and complexity. Furthermore, the cooperative double-RIS assisted MIMO framework is proven to achieve superior ergodic rate performance and high communication reliability under harsh propagation environment.
△ Less
Submitted 11 March, 2024;
originally announced March 2024.
-
Recursive GNNs for Learning Precoding Policies with Size-Generalizability
Authors:
Jia Guo,
Chenyang Yang
Abstract:
Graph neural networks (GNNs) have been shown promising in optimizing power allocation and link scheduling with good size generalizability and low training complexity. These merits are important for learning wireless policies under dynamic environments, which partially come from the matched permutation equivariance (PE) properties of the GNNs to the policies to be learned. Nonetheless, it has been…
▽ More
Graph neural networks (GNNs) have been shown promising in optimizing power allocation and link scheduling with good size generalizability and low training complexity. These merits are important for learning wireless policies under dynamic environments, which partially come from the matched permutation equivariance (PE) properties of the GNNs to the policies to be learned. Nonetheless, it has been noticed in literature that only satisfying the PE property of a precoding policy in multi-antenna systems cannot ensure a GNN for learning precoding to be generalizable to the unseen number of users. Incorporating models with GNNs helps improve size generalizability, which however is only applicable to specific problems, settings, and algorithms. In this paper, we propose a framework of size generalizable GNNs for learning precoding policies that are purely data-driven and can learn wireless policies including but not limited to baseband and hybrid precoding in multi-user multi-antenna systems. To this end, we first find a special structure of each iteration of two numerical algorithms for optimizing precoding, from which we identify the key characteristics of a GNN that affect its size generalizability. Then, we design size-generalizable GNNs that are with these key characteristics and satisfy the PE properties of precoding policies in a recursive manner. Simulation results show that the proposed GNNs can be well-generalized to the number of users for learning baseband and hybrid precoding policies and require much fewer samples than existing counterparts to achieve the same performance.
△ Less
Submitted 28 February, 2024;
originally announced February 2024.
-
Sensing in Bi-Static ISAC Systems with Clock Asynchronism: A Signal Processing Perspective
Authors:
Kai Wu,
Jacopo Pegoraro,
Francesca Meneghello,
J. Andrew Zhang,
Jesus O. Lacruz,
Joerg Widmer,
Francesco Restuccia,
Michele Rossi,
Xiaojing Huang,
Daqing Zhang,
Giuseppe Caire,
Y. Jay Guo
Abstract:
Integrated Sensing and Communication (ISAC) has been identified as a pillar usage scenario for the impending 6G era. Bi-static sensing, a major type of sensing in ISAC, is promising to expedite ISAC in the near future, as it requires minimal changes to the existing network infrastructure. However, a critical challenge for bi-static sensing is clock asynchronism due to the use of different clocks a…
▽ More
Integrated Sensing and Communication (ISAC) has been identified as a pillar usage scenario for the impending 6G era. Bi-static sensing, a major type of sensing in ISAC, is promising to expedite ISAC in the near future, as it requires minimal changes to the existing network infrastructure. However, a critical challenge for bi-static sensing is clock asynchronism due to the use of different clocks at far-separated transmitters and receivers. This causes the received signal to be affected by time-varying random phase offsets, severely degrading, or even failing, direct sensing. Hence, to effectively enable ISAC, considerable research has been directed toward addressing the clock asynchronism issue in bi-static sensing. This paper provides an overview of the issue and existing techniques developed in an ISAC background. Based on the review and comparison, we also draw insights into the future research directions and open problems, aiming to nurture the maturation of bi-static sensing in ISAC.
△ Less
Submitted 24 June, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
Distributed Task-Oriented Communication Networks with Multimodal Semantic Relay and Edge Intelligence
Authors:
Jie Guo,
Hao Chen,
Bin Song,
Yuhao Chi,
Chau Yuen,
Fei Richard Yu,
Geoffrey Ye Li,
Dusit Niyato
Abstract:
In this article, we present a novel framework, named distributed task-oriented communication networks (DTCN), based on recent advances in multimodal semantic transmission and edge intelligence. In DTCN, the multimodal knowledge of semantic relays and the adaptive adjustment capability of edge intelligence can be integrated to improve task performance. Specifically, we propose the key techniques in…
▽ More
In this article, we present a novel framework, named distributed task-oriented communication networks (DTCN), based on recent advances in multimodal semantic transmission and edge intelligence. In DTCN, the multimodal knowledge of semantic relays and the adaptive adjustment capability of edge intelligence can be integrated to improve task performance. Specifically, we propose the key techniques in the framework, such as semantic alignment and complement, a semantic relay scheme for deep joint source-channel relay coding, and collaborative device-server optimization and inference. Furthermore, a multimodal classification task is used as an example to demonstrate the benefits of the proposed DTCN over existing methods. Numerical results validate that DTCN can significantly improve the accuracy of classification tasks, even in harsh communication scenarios (e.g., low signal-to-noise regime), thanks to multimodal semantic relay and edge intelligence.
△ Less
Submitted 19 January, 2024; v1 submitted 18 January, 2024;
originally announced January 2024.
-
Anchor-points Assisted Uplink Sensing in Perceptive Mobile Networks
Authors:
Yanmo Hu,
J. Andrew Zhang,
Weibo Deng,
Y. Jay Guo
Abstract:
Uplink sensing in integrated sensing and communications (ISAC) systems, such as Perceptive Mobile Networks, is challenging due to the clock asynchronism between transmitter and receiver. Existing solutions typically require the presence of a dominating line-of-sight path and the knowledge of transmitter location at the receiver. In this paper, relaxing these requirements, we propose a novel and ef…
▽ More
Uplink sensing in integrated sensing and communications (ISAC) systems, such as Perceptive Mobile Networks, is challenging due to the clock asynchronism between transmitter and receiver. Existing solutions typically require the presence of a dominating line-of-sight path and the knowledge of transmitter location at the receiver. In this paper, relaxing these requirements, we propose a novel and effective uplink sensing scheme with the assistance of static anchor points. Two major algorithms are proposed in the scheme. The first algorithm estimates the relative timing and carrier frequency offsets due to clock asynchronism, with respect to those at a randomly selected reference snapshot. Theoretical performance analysis is provided for the algorithm. The estimates from the first algorithm are then used to compensate for the offsets and generate the angle-Doppler maps. Using the maps, the second algorithm identifies the anchor points, and then locates the UE and dynamic targets. Feasibility of UE localization is also analyzed. Simulation results are provided and demonstrate the effectiveness of the proposed algorithms.
△ Less
Submitted 17 January, 2024;
originally announced January 2024.
-
Performance Bounds and Optimization for CSI-Ratio based Bi-static Doppler Sensing in ISAC Systems
Authors:
Yanmo Hu,
Kai Wu,
J. Andrew Zhang,
Weibo Deng,
Y. Jay Guo
Abstract:
Bi-static sensing is crucial for exploring the potential of networked sensing capabilities in integrated sensing and communications (ISAC). However, it suffers from the challenging clock asynchronism issue. CSI ratio-based sensing is an effective means to address the issue. Its performance bounds, particular for Doppler sensing, have not been fully understood yet. This work endeavors to fill the r…
▽ More
Bi-static sensing is crucial for exploring the potential of networked sensing capabilities in integrated sensing and communications (ISAC). However, it suffers from the challenging clock asynchronism issue. CSI ratio-based sensing is an effective means to address the issue. Its performance bounds, particular for Doppler sensing, have not been fully understood yet. This work endeavors to fill the research gap. Focusing on a single dynamic path in high-SNR scenarios, we derive the closed-form CRB. Then, through analyzing the mutual interference between dynamic and static paths, we simplify the CRB results by deriving close approximations, further unveiling new insights of the impact of numerous physical parameters on Doppler sensing. Moreover, utilizing the new CRB and analyses, we propose novel waveform optimization strategies for noise- and interference-limited sensing scenarios, which are also empowered by closed-form and efficient solutions. Extensive simulation results are provided to validate the preciseness of the derived CRB results and analyses, with the aid of the maximum-likelihood estimator. The results also demonstrate the substantial enhanced Doppler sensing accuracy and the sensing capabilities for low-speed target achieved by the proposed waveform design.
△ Less
Submitted 17 January, 2024;
originally announced January 2024.
-
Contrastive Learning With Audio Discrimination For Customizable Keyword Spotting In Continuous Speech
Authors:
Yu Xi,
Baochen Yang,
Hao Li,
Jiaqi Guo,
Kai Yu
Abstract:
Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its real-world application potential. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only audio-text representations matching strategy. However, for KWS in continuous speech, co-art…
▽ More
Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its real-world application potential. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only audio-text representations matching strategy. However, for KWS in continuous speech, co-articulation and streaming word segmentation can easily yield similar audio patterns for different texts, which may consequently trigger false alarms. To address this issue, we propose a novel CL with Audio Discrimination (CLAD) approach to learning keyword representation with both audio-text matching and audio-audio discrimination ability. Here, an InfoNCE loss considering both audio-audio and audio-text CL data pairs is employed for each sliding window during training. Evaluations on the open-source LibriPhrase dataset show that the use of sliding-window level InfoNCE loss yields comparable performance compared to previous CL approaches. Furthermore, experiments on the continuous speech dataset LibriSpeech demonstrate that, by incorporating audio discrimination, CLAD achieves significant performance gain over CL without audio discrimination. Meanwhile, compared to two-stage KWS approaches, the end-to-end KWS with CLAD achieves not only better performance, but also significant speed-up.
△ Less
Submitted 12 January, 2024;
originally announced January 2024.
-
UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction
Authors:
Jiaxin Guo,
Minghan Wang,
Xiaosong Qiao,
Daimeng Wei,
Hengchao Shang,
Zongyao Li,
Zhengzhe Yu,
Yinglu Li,
Chang Su,
Min Zhang,
Shimin Tao,
Hao Yang
Abstract:
Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER). Previous works usually adopt end-to-end models and has strong dependency on Pseudo Paired Data and Original Paired Data. But when only pre-training on Pseudo Paired Data, previous models have negative effect on correction. While fine-tu…
▽ More
Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER). Previous works usually adopt end-to-end models and has strong dependency on Pseudo Paired Data and Original Paired Data. But when only pre-training on Pseudo Paired Data, previous models have negative effect on correction. While fine-tuning on Original Paired Data, the source side data must be transcribed by a well-trained ASR model, which takes a lot of time and not universal. In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR Error Correction. UCorrect has no dependency on the training data mentioned before. The whole procedure is first to detect whether the character is erroneous, then to generate some candidate characters and finally to select the most confident one to replace the error character. Experiments on the public AISHELL-1 dataset and WenetSpeech dataset show the effectiveness of UCorrect for ASR error correction: 1) it achieves significant WER reduction, achieves 6.83\% even without fine-tuning and 14.29\% after fine-tuning; 2) it outperforms the popular NAR correction models by a large margin with a competitive low latency; and 3) it is an universal method, as it reduces all WERs of the ASR model with different decoding strategies and reduces all WERs of ASR models trained on different scale datasets.
△ Less
Submitted 11 January, 2024;
originally announced January 2024.
-
Mapping Information in Feature Extraction Transformation for Chirp Signal
Authors:
Shuyi Gu,
Zhenghua Luo,
Lin Hu,
Yilin Zhang,
Junxiong Guo
Abstract:
Chirp signals have established diverse applications caused by the capable of producing time-dependent linear frequencies. Most feature extraction transformation methods for chirp signals focus on enhancing the performance of transform methods but neglecting the information derived from the transformation process. Consequently, they may fail to fully exploit the information from observations, resul…
▽ More
Chirp signals have established diverse applications caused by the capable of producing time-dependent linear frequencies. Most feature extraction transformation methods for chirp signals focus on enhancing the performance of transform methods but neglecting the information derived from the transformation process. Consequently, they may fail to fully exploit the information from observations, resulting in decreased performance under conditions of low signal-to-noise ratio and limited observations. In this work, we develop a novel post-processing method called mapping information model to addressing this challenge. The model establishes a link between the observation space and feature space in feature extraction transform, enabling interference suppression and obtain more accurate information by iteratively resampling and assigning weights in both spaces. Analysis of the iteration process reveals a continual increase in weight of signal samples and a gradual stability in weight of noise samples. The demonstration of the noise suppression in the iteration process and feature enhancement supports the effectiveness of the mapping information model. Furthermore, numerical simulations also affirm the high efficiency of the proposed model by showcasing enhanced signal detection and estimation performances without requiring additional observations. This superior model allows amplifying performance within feature extraction transformation for chirp signal processing under low SNR and limited observation conditions, opens up new opportunities for areas such as communication, biomedicine, and remote sensing.
△ Less
Submitted 10 January, 2024;
originally announced January 2024.
-
Exploring Multi-Modal Control in Music-Driven Dance Generation
Authors:
Ronghui Li,
Yuqin Dai,
Yachao Zhang,
Jun Li,
Jian Yang,
Jie Guo,
Xiu Li
Abstract:
Existing music-driven 3D dance generation methods mainly concentrate on high-quality dance generation, but lack sufficient control during the generation process. To address these issues, we propose a unified framework capable of generating high-quality dance movements and supporting multi-modal control, including genre control, semantic control, and spatial control. First, we decouple the dance ge…
▽ More
Existing music-driven 3D dance generation methods mainly concentrate on high-quality dance generation, but lack sufficient control during the generation process. To address these issues, we propose a unified framework capable of generating high-quality dance movements and supporting multi-modal control, including genre control, semantic control, and spatial control. First, we decouple the dance generation network from the dance control network, thereby avoiding the degradation in dance quality when adding additional control information. Second, we design specific control strategies for different control information and integrate them into a unified framework. Experimental results show that the proposed dance generation framework outperforms state-of-the-art methods in terms of motion quality and controllability.
△ Less
Submitted 1 January, 2024;
originally announced January 2024.
-
Deep Learning-based MRI Reconstruction with Artificial Fourier Transform (AFT)-Net
Authors:
Yanting Yang,
Jeffery Siyuan Tian,
Matthieu Dagommer,
Jia Guo
Abstract:
The deep complex-valued neural network provides a powerful way to leverage complex number operations and representations, which has succeeded in several phase-based applications. However, most previously published networks have not fully accessed the impact of complex-valued networks in the frequency domain. Here, we introduced a unified complex-valued deep learning framework - artificial Fourier…
▽ More
The deep complex-valued neural network provides a powerful way to leverage complex number operations and representations, which has succeeded in several phase-based applications. However, most previously published networks have not fully accessed the impact of complex-valued networks in the frequency domain. Here, we introduced a unified complex-valued deep learning framework - artificial Fourier transform network (AFT-Net) - which combined domain-manifold learning and complex-valued neural networks. The AFT-Net can be readily used to solve the image inverse problems in domain-transform, especially for accelerated magnetic resonance imaging (MRI) reconstruction and other applications. While conventional methods only accept magnitude images, the proposed method takes raw k-space data in the frequency domain as inputs, allowing a mapping between the k-space domain and the image domain to be determined through cross-domain learning. We show that AFT-Net achieves superior accelerated MRI reconstruction and is comparable to existing approaches. Also, our approach can be applied to different tasks like denoised MRS reconstruction and different datasets with various contrasts. The AFT-Net presented here is a valuable preprocessing component for different preclinical studies and provides an innovative alternative for solving inverse problems in imaging and spectroscopy.
△ Less
Submitted 17 December, 2023;
originally announced December 2023.
-
Enhancing CT Image synthesis from multi-modal MRI data based on a multi-task neural network framework
Authors:
Zhuoyao Xin,
Christopher Wu,
Dong Liu,
Chunming Gu,
Jia Guo,
Jun Hua
Abstract:
Image segmentation, real-value prediction, and cross-modal translation are critical challenges in medical imaging. In this study, we propose a versatile multi-task neural network framework, based on an enhanced Transformer U-Net architecture, capable of simultaneously, selectively, and adaptively addressing these medical image tasks. Validation is performed on a public repository of human brain MR…
▽ More
Image segmentation, real-value prediction, and cross-modal translation are critical challenges in medical imaging. In this study, we propose a versatile multi-task neural network framework, based on an enhanced Transformer U-Net architecture, capable of simultaneously, selectively, and adaptively addressing these medical image tasks. Validation is performed on a public repository of human brain MR and CT images. We decompose the traditional problem of synthesizing CT images into distinct subtasks, which include skull segmentation, Hounsfield unit (HU) value prediction, and image sequential reconstruction. To enhance the framework's versatility in handling multi-modal data, we expand the model with multiple image channels. Comparisons between synthesized CT images derived from T1-weighted and T2-Flair images were conducted, evaluating the model's capability to integrate multi-modal information from both morphological and pixel value perspectives.
△ Less
Submitted 17 December, 2023; v1 submitted 13 December, 2023;
originally announced December 2023.
-
TABSurfer: a Hybrid Deep Learning Architecture for Subcortical Segmentation
Authors:
Aaron Cao,
Vishwanatha M. Rao,
Kejia Liu,
Xinru Liu,
Andrew F. Laine,
Jia Guo
Abstract:
Subcortical segmentation remains challenging despite its important applications in quantitative structural analysis of brain MRI scans. The most accurate method, manual segmentation, is highly labor intensive, so automated tools like FreeSurfer have been adopted to handle this task. However, these traditional pipelines are slow and inefficient for processing large datasets. In this study, we propo…
▽ More
Subcortical segmentation remains challenging despite its important applications in quantitative structural analysis of brain MRI scans. The most accurate method, manual segmentation, is highly labor intensive, so automated tools like FreeSurfer have been adopted to handle this task. However, these traditional pipelines are slow and inefficient for processing large datasets. In this study, we propose TABSurfer, a novel 3D patch-based CNN-Transformer hybrid deep learning model designed for superior subcortical segmentation compared to existing state-of-the-art tools. To evaluate, we first demonstrate TABSurfer's consistent performance across various T1w MRI datasets with significantly shorter processing times compared to FreeSurfer. Then, we validate against manual segmentations, where TABSurfer outperforms FreeSurfer based on the manual ground truth. In each test, we also establish TABSurfer's advantage over a leading deep learning benchmark, FastSurferVINN. Together, these studies highlight TABSurfer's utility as a powerful tool for fully automated subcortical segmentation with high fidelity.
△ Less
Submitted 13 December, 2023;
originally announced December 2023.
-
Quantitative perfusion maps using a novelty spatiotemporal convolutional neural network
Authors:
Anbo Cao,
Pin-Yu Le,
Zhonghui Qie,
Haseeb Hassan,
Yingwei Guo,
Asim Zaman,
Jiaxi Lu,
Xueqiang Zeng,
Huihui Yang,
Xiaoqiang Miao,
Taiyu Han,
Guangtao Huang,
Yan Kang,
Yu Luo,
Jia Guo
Abstract:
Dynamic susceptibility contrast magnetic resonance imaging (DSC-MRI) is widely used to evaluate acute ischemic stroke to distinguish salvageable tissue and infarct core. For this purpose, traditional methods employ deconvolution techniques, like singular value decomposition, which are known to be vulnerable to noise, potentially distorting the derived perfusion parameters. However, deep learning t…
▽ More
Dynamic susceptibility contrast magnetic resonance imaging (DSC-MRI) is widely used to evaluate acute ischemic stroke to distinguish salvageable tissue and infarct core. For this purpose, traditional methods employ deconvolution techniques, like singular value decomposition, which are known to be vulnerable to noise, potentially distorting the derived perfusion parameters. However, deep learning technology could leverage it, which can accurately estimate clinical perfusion parameters compared to traditional clinical approaches. Therefore, this study presents a perfusion parameters estimation network that considers spatial and temporal information, the Spatiotemporal Network (ST-Net), for the first time. The proposed network comprises a designed physical loss function to enhance model performance further. The results indicate that the network can accurately estimate perfusion parameters, including cerebral blood volume (CBV), cerebral blood flow (CBF), and time to maximum of the residual function (Tmax). The structural similarity index (SSIM) mean values for CBV, CBF, and Tmax parameters were 0.952, 0.943, and 0.863, respectively. The DICE score for the hypo-perfused region reached 0.859, demonstrating high consistency. The proposed model also maintains time efficiency, closely approaching the performance of commercial gold-standard software.
△ Less
Submitted 8 December, 2023;
originally announced December 2023.
-
Quantifying white matter hyperintensity and brain volumes in heterogeneous clinical and low-field portable MRI
Authors:
Pablo Laso,
Stefano Cerri,
Annabel Sorby-Adams,
Jennifer Guo,
Farrah Mateen,
Philipp Goebl,
Jiaming Wu,
Peirong Liu,
Hongwei Li,
Sean I. Young,
Benjamin Billot,
Oula Puonti,
Gordon Sze,
Sam Payabavash,
Adam DeHavenon,
Kevin N. Sheth,
Matthew S. Rosen,
John Kirsch,
Nicola Strisciuglio,
Jelmer M. Wolterink,
Arman Eshaghi,
Frederik Barkhof,
W. Taylor Kimberly,
Juan Eugenio Iglesias
Abstract:
Brain atrophy and white matter hyperintensity (WMH) are critical neuroimaging features for ascertaining brain injury in cerebrovascular disease and multiple sclerosis. Automated segmentation and quantification is desirable but existing methods require high-resolution MRI with good signal-to-noise ratio (SNR). This precludes application to clinical and low-field portable MRI (pMRI) scans, thus hamp…
▽ More
Brain atrophy and white matter hyperintensity (WMH) are critical neuroimaging features for ascertaining brain injury in cerebrovascular disease and multiple sclerosis. Automated segmentation and quantification is desirable but existing methods require high-resolution MRI with good signal-to-noise ratio (SNR). This precludes application to clinical and low-field portable MRI (pMRI) scans, thus hampering large-scale tracking of atrophy and WMH progression, especially in underserved areas where pMRI has huge potential. Here we present a method that segments white matter hyperintensity and 36 brain regions from scans of any resolution and contrast (including pMRI) without retraining. We show results on eight public datasets and on a private dataset with paired high- and low-field scans (3T and 64mT), where we attain strong correlation between the WMH ($ρ$=.85) and hippocampal volumes (r=.89) estimated at both fields. Our method is publicly available as part of FreeSurfer, at: http://surfer.nmr.mgh.harvard.edu/fswiki/WMH-SynthSeg.
△ Less
Submitted 15 February, 2024; v1 submitted 8 December, 2023;
originally announced December 2023.
-
Geometrically-Shaped Constellation for Visible Light Communications at Short Blocklength
Authors:
Jia-Ning Guo,
Ru-Han Chen,
Jian Zhang,
Longguang Li,
Xu Yang,
Jing Zhou
Abstract:
In this paper, we present a general framework of designing geometrically shaped constellations for short-packet visible light communications with a peak- and an average-intensity constraints. By leveraging tools from large deviation theory, we first characterize the second-order asymptotics of the optimal constellation shaping region under aforementioned intensity constraints, which serves as a go…
▽ More
In this paper, we present a general framework of designing geometrically shaped constellations for short-packet visible light communications with a peak- and an average-intensity constraints. By leveraging tools from large deviation theory, we first characterize the second-order asymptotics of the optimal constellation shaping region under aforementioned intensity constraints, which serves as a good performance measure for the best geometric shaping in finite blocklength. To further incorporate a sufficiently large coding gain and a nearly-maximum shaping gain, we construct multidimensional constellations by the nested structure of Construction B lattices, where the constellation shaping is implemented by controlling the boundary of the embedded sublattice, i.e., a strategy called coarsely shaping and finely coding. Fast algorithms for constellation mapping and demodulation are presented as well. As an illustrative example, we present an energy-efficient $24$-dimensional constellation design based on the Leech lattice, whose superiority over existing constellation designs is verified by numerical results.
△ Less
Submitted 28 April, 2024; v1 submitted 5 November, 2023;
originally announced November 2023.
-
Knowledge-driven Meta-learning for CSI Feedback
Authors:
Han Xiao,
Wenqiang Tian,
Wendong Liu,
Jiajia Guo,
Zhi Zhang,
Shi Jin,
Zhihua Shi,
Li Guo,
Jia Shen
Abstract:
Accurate and effective channel state information (CSI) feedback is a key technology for massive multiple-input and multiple-output systems. Recently, deep learning (DL) has been introduced for CSI feedback enhancement through massive collected training data and lengthy training time, which is quite costly and impractical for realistic deployment. In this article, a knowledge-driven meta-learning a…
▽ More
Accurate and effective channel state information (CSI) feedback is a key technology for massive multiple-input and multiple-output systems. Recently, deep learning (DL) has been introduced for CSI feedback enhancement through massive collected training data and lengthy training time, which is quite costly and impractical for realistic deployment. In this article, a knowledge-driven meta-learning approach is proposed, where the DL model initialized by the meta model obtained from meta training phase is able to achieve rapid convergence when facing a new scenario during target retraining phase. Specifically, instead of training with massive data collected from various scenarios, the meta task environment is constructed based on the intrinsic knowledge of spatial-frequency characteristics of CSI for meta training. Moreover, the target task dataset is also augmented by exploiting the knowledge of statistical characteristics of wireless channel, so that the DL model can achieve higher performance with small actually collected dataset and short training time. In addition, we provide analyses of rationale for the improvement yielded by the knowledge in both phases. Simulation results demonstrate the superiority of the proposed approach from the perspective of feedback performance and convergence speed.
△ Less
Submitted 25 October, 2023; v1 submitted 24 October, 2023;
originally announced October 2023.
-
Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
Authors:
Jiamin Xie,
Ke Li,
Jinxi Guo,
Andros Tjandra,
Yuan Shangguan,
Leda Sari,
Chunyang Wu,
Junteng Jia,
Jay Mahadeokar,
Ozlem Kalinli
Abstract:
Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training needed to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in…
▽ More
Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training needed to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named as Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
△ Less
Submitted 11 January, 2024; v1 submitted 22 September, 2023;
originally announced September 2023.
-
AI/ML for Beam Management in 5G-Advanced: A Standardization Perspective
Authors:
Qing Xue,
Jiajia Guo,
Binggui Zhou,
Yongjun Xu,
Zhidu Li,
Shaodan Ma
Abstract:
In beamformed wireless cellular systems such as 5G New Radio (NR) networks, beam management (BM) is a crucial operation. In the second phase of 5G NR standardization, known as 5G-Advanced, which is being vigorously promoted, the key component is the use of artificial intelligence (AI) based on machine learning (ML) techniques. AI/ML for BM is selected as a representative use case. This article pro…
▽ More
In beamformed wireless cellular systems such as 5G New Radio (NR) networks, beam management (BM) is a crucial operation. In the second phase of 5G NR standardization, known as 5G-Advanced, which is being vigorously promoted, the key component is the use of artificial intelligence (AI) based on machine learning (ML) techniques. AI/ML for BM is selected as a representative use case. This article provides an overview of the AI/ML for BM in 5G-Advanced. The legacy non-AI and prime AI-enabled BM frameworks are first introduced and compared. Then, the main scope of AI/ML for BM is presented, including improving accuracy, reducing overhead and latency. Finally, the key challenges and open issues in the standardization of AI/ML for BM are discussed, especially the design of new protocols for AI-enabled BM. This article provides a guideline for the study of AI/ML-based BM standardization.
△ Less
Submitted 24 July, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
-
MASTERKEY: Practical Backdoor Attack Against Speaker Verification Systems
Authors:
Hanqing Guo,
Xun Chen,
Junfeng Guo,
Li Xiao,
Qiben Yan
Abstract:
Speaker Verification (SV) is widely deployed in mobile systems to authenticate legitimate users by using their voice traits. In this work, we propose a backdoor attack MASTERKEY, to compromise the SV models. Different from previous attacks, we focus on a real-world practical setting where the attacker possesses no knowledge of the intended victim. To design MASTERKEY, we investigate the limitation…
▽ More
Speaker Verification (SV) is widely deployed in mobile systems to authenticate legitimate users by using their voice traits. In this work, we propose a backdoor attack MASTERKEY, to compromise the SV models. Different from previous attacks, we focus on a real-world practical setting where the attacker possesses no knowledge of the intended victim. To design MASTERKEY, we investigate the limitation of existing poisoning attacks against unseen targets. Then, we optimize a universal backdoor that is capable of attacking arbitrary targets. Next, we embed the speaker's characteristics and semantics information into the backdoor, making it imperceptible. Finally, we estimate the channel distortion and integrate it into the backdoor. We validate our attack on 6 popular SV models. Specifically, we poison a total of 53 models and use our trigger to attack 16,430 enrolled speakers, composed of 310 target speakers enrolled in 53 poisoned models. Our attack achieves 100% attack success rate with a 15% poison rate. By decreasing the poison rate to 3%, the attack success rate remains around 50%. We validate our attack in 3 real-world scenarios and successfully demonstrate the attack through both over-the-air and over-the-telephony-line scenarios.
△ Less
Submitted 13 September, 2023;
originally announced September 2023.
-
A plug-and-play synthetic data deep learning for undersampled magnetic resonance image reconstruction
Authors:
Min Xiao,
Zi Wang,
Jiefeng Guo,
Xiaobo Qu
Abstract:
Magnetic resonance imaging (MRI) plays an important role in modern medical diagnostic but suffers from prolonged scan time. Current deep learning methods for undersampled MRI reconstruction exhibit good performance in image de-aliasing which can be tailored to the specific k-space undersampling scenario. But it is very troublesome to configure different deep networks when the sampling setting chan…
▽ More
Magnetic resonance imaging (MRI) plays an important role in modern medical diagnostic but suffers from prolonged scan time. Current deep learning methods for undersampled MRI reconstruction exhibit good performance in image de-aliasing which can be tailored to the specific k-space undersampling scenario. But it is very troublesome to configure different deep networks when the sampling setting changes. In this work, we propose a deep plug-and-play method for undersampled MRI reconstruction, which effectively adapts to different sampling settings. Specifically, the image de-aliasing prior is first learned by a deep denoiser trained to remove general white Gaussian noise from synthetic data. Then the learned deep denoiser is plugged into an iterative algorithm for image reconstruction. Results on in vivo data demonstrate that the proposed method provides nice and robust accelerated image reconstruction performance under different undersampling patterns and sampling rates, both visually and quantitatively.
△ Less
Submitted 8 October, 2023; v1 submitted 12 September, 2023;
originally announced September 2023.
-
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
Authors:
Jinzuomu Zhong,
Yang Li,
Hui Huang,
Korin Richmond,
Jie Liu,
Zhiba Su,
Jing Guo,
Benlai Tang,
Fengjie Zhu
Abstract:
In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, a two-stage automatic annotation pipeline is novelly proposed in this paper. In the first stage, we use contrastive pretraining of Speech-Silenc…
▽ More
In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, a two-stage automatic annotation pipeline is novelly proposed in this paper. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator, comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance with 0.72 and 0.93 f1 score for Prosodic Word and Prosodic Phrase boundary respectively, while bearing remarkable robustness to data scarcity.
△ Less
Submitted 11 June, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
AVARS -- Alleviating Unexpected Urban Road Traffic Congestion using UAVs
Authors:
Jiaying Guo,
Michael R. Jones,
Soufiene Djahel,
Shen Wang
Abstract:
Reducing unexpected urban traffic congestion caused by en-route events (e.g., road closures, car crashes, etc.) often requires fast and accurate reactions to choose the best-fit traffic signals. Traditional traffic light control systems, such as SCATS and SCOOT, are not efficient as their traffic data provided by induction loops has a low update frequency (i.e., longer than 1 minute). Moreover, th…
▽ More
Reducing unexpected urban traffic congestion caused by en-route events (e.g., road closures, car crashes, etc.) often requires fast and accurate reactions to choose the best-fit traffic signals. Traditional traffic light control systems, such as SCATS and SCOOT, are not efficient as their traffic data provided by induction loops has a low update frequency (i.e., longer than 1 minute). Moreover, the traffic light signal plans used by these systems are selected from a limited set of candidate plans pre-programmed prior to unexpected events' occurrence. Recent research demonstrates that camera-based traffic light systems controlled by deep reinforcement learning (DRL) algorithms are more effective in reducing traffic congestion, in which the cameras can provide high-frequency high-resolution traffic data. However, these systems are costly to deploy in big cities due to the excessive potential upgrades required to road infrastructure. In this paper, we argue that Unmanned Aerial Vehicles (UAVs) can play a crucial role in dealing with unexpected traffic congestion because UAVs with onboard cameras can be economically deployed when and where unexpected congestion occurs. Then, we propose a system called "AVARS" that explores the potential of using UAVs to reduce unexpected urban traffic congestion using DRL-based traffic light signal control. This approach is validated on a widely used open-source traffic simulator with practical UAV settings, including its traffic monitoring ranges and battery lifetime. Our simulation results show that AVARS can effectively recover the unexpected traffic congestion in Dublin, Ireland, back to its original un-congested level within the typical battery life duration of a UAV.
△ Less
Submitted 10 September, 2023;
originally announced September 2023.