Search | arXiv e-print repository

Advancing Humanoid Locomotion: Mastering Challenging Terrains with Denoising World Model Learning

Authors: Xinyang Gu, Yen-Jen Wang, Xiang Zhu, Chengming Shi, Yanjiang Guo, Yichen Liu, Jianyu Chen

Abstract: Humanoid robots, with their human-like skeletal structure, are especially suited for tasks in human-centric environments. However, this structure is accompanied by additional challenges in locomotion controller design, especially in complex real-world environments. As a result, existing humanoid robots are limited to relatively simple terrains, either with model-based control or model-free reinfor… ▽ More Humanoid robots, with their human-like skeletal structure, are especially suited for tasks in human-centric environments. However, this structure is accompanied by additional challenges in locomotion controller design, especially in complex real-world environments. As a result, existing humanoid robots are limited to relatively simple terrains, either with model-based control or model-free reinforcement learning. In this work, we introduce Denoising World Model Learning (DWL), an end-to-end reinforcement learning framework for humanoid locomotion control, which demonstrates the world's first humanoid robot to master real-world challenging terrains such as snowy and inclined land in the wild, up and down stairs, and extremely uneven terrains. All scenarios run the same learned neural network with zero-shot sim-to-real transfer, indicating the superior robustness and generalization capability of the proposed method. △ Less

Submitted 26 August, 2024; originally announced August 2024.

Comments: Robotics: Science and Systems (RSS), 2024. (Best Paper Award Finalist)

arXiv:2408.12897 [pdf, other]

When Diffusion MRI Meets Diffusion Model: A Novel Deep Generative Model for Diffusion MRI Generation

Authors: Xi Zhu, Wei Zhang, Yijie Li, Lauren J. O'Donnell, Fan Zhang

Abstract: Diffusion MRI (dMRI) is an advanced imaging technique characterizing tissue microstructure and white matter structural connectivity of the human brain. The demand for high-quality dMRI data is growing, driven by the need for better resolution and improved tissue contrast. However, acquiring high-quality dMRI data is expensive and time-consuming. In this context, deep generative modeling emerges as… ▽ More Diffusion MRI (dMRI) is an advanced imaging technique characterizing tissue microstructure and white matter structural connectivity of the human brain. The demand for high-quality dMRI data is growing, driven by the need for better resolution and improved tissue contrast. However, acquiring high-quality dMRI data is expensive and time-consuming. In this context, deep generative modeling emerges as a promising solution to enhance image quality while minimizing acquisition costs and scanning time. In this study, we propose a novel generative approach to perform dMRI generation using deep diffusion models. It can generate high dimension (4D) and high resolution data preserving the gradients information and brain structure. We demonstrated our method through an image mapping task aimed at enhancing the quality of dMRI images from 3T to 7T. Our approach demonstrates highly enhanced performance in generating dMRI images when compared to the current state-of-the-art (SOTA) methods. This achievement underscores a substantial progression in enhancing dMRI quality, highlighting the potential of our novel generative approach to revolutionize dMRI imaging standards. △ Less

Submitted 23 August, 2024; originally announced August 2024.

Comments: 11 pages, 3 figures

arXiv:2408.03265 [pdf, other]

BVI-AOM: A New Training Dataset for Deep Video Compression Optimization

Authors: Jakub Nawała, Yuxuan Jiang, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull

Abstract: Deep learning is now playing an important role in enhancing the performance of conventional hybrid video codecs. These learning-based methods typically require diverse and representative training material for optimization in order to achieve model generalization and optimal coding performance. However, existing datasets either offer limited content variability or come with restricted licensing ter… ▽ More Deep learning is now playing an important role in enhancing the performance of conventional hybrid video codecs. These learning-based methods typically require diverse and representative training material for optimization in order to achieve model generalization and optimal coding performance. However, existing datasets either offer limited content variability or come with restricted licensing terms constraining their use to research purposes only. To address these issues, we propose a new training dataset, named BVI-AOM, which contains 956 uncompressed sequences at various resolutions from 270p to 2160p, covering a wide range of content and texture types. The dataset comes with more flexible licensing terms and offers competitive performance when used as a training set for optimizing deep video coding tools. The experimental results demonstrate that when used as a training set to optimize two popular network architectures for two different coding tools, the proposed dataset leads to additional bitrate savings of up to 0.29 and 2.98 percentage points in terms of PSNR-Y and VMAF, respectively, compared to an existing training dataset, BVI-DVC, which has been widely used for deep video coding. The BVI-AOM dataset is available for download under this link: (TBD). △ Less

Submitted 7 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

Comments: 6 pages, 5 figures. Swapped the PSNR-HVS plot in Fig. 3 for a PSNR-YUV plot

arXiv:2407.17758 [pdf, other]

Speed-enhanced Subdomain Adaptation Regression for Long-term Stable Neural Decoding in Brain-computer Interfaces

Authors: Jiyu Wei, Dazhong Rong, Xinyun Zhu, Qinming He, Yueming Wang

Abstract: Brain-computer interfaces (BCIs) offer a means to convert neural signals into control signals, providing a potential restoration of movement for people with paralysis. Despite their promise, BCIs face a significant challenge in maintaining decoding accuracy over time due to neural nonstationarities. However, the decoding accuracy of BCI drops severely across days due to the neural data drift. Whil… ▽ More Brain-computer interfaces (BCIs) offer a means to convert neural signals into control signals, providing a potential restoration of movement for people with paralysis. Despite their promise, BCIs face a significant challenge in maintaining decoding accuracy over time due to neural nonstationarities. However, the decoding accuracy of BCI drops severely across days due to the neural data drift. While current recalibration techniques address this issue to a degree, they often fail to leverage the limited labeled data, to consider the signal correlation between two days, or to perform conditional alignment in regression tasks. This paper introduces a novel approach to enhance recalibration performance. We begin with preliminary experiments that reveal the temporal patterns of neural signal changes and identify three critical elements for effective recalibration: global alignment, conditional speed alignment, and feature-label consistency. Building on these insights, we propose the Speed-enhanced Subdomain Adaptation Regression (SSAR) framework, integrating semi-supervised learning with domain adaptation techniques in regression neural decoding. SSAR employs Speed-enhanced Subdomain Alignment (SeSA) for global and speed conditional alignment of similarly labeled data, with Contrastive Consistency Constraint (CCC) to enhance the alignment of SeSA by reinforcing feature-label consistency through contrastive learning. Our comprehensive set of experiments, both qualitative and quantitative, substantiate the superior recalibration performance and robustness of SSAR. △ Less

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.06767 [pdf, other]

Enhancing Robustness and Security in ISAC Network Design: Leveraging Transmissive Reconfigurable Intelligent Surface with RSMA

Authors: Ziwei Liu, Wen Chen, Qingqing Wu, Zhendong Li, Xusheng Zhu, Qiong Wu, Nan Cheng

Abstract: In this paper, we propose a novel transmissive reconfigurable intelligent surface transceiver-enhanced robust and secure integrated sensing and communication network. A time-division sensing communication mechanism is designed for the scenario, which enables communication and sensing to share wireless resources. To address the interference management problem and hinder eavesdropping, we implement… ▽ More In this paper, we propose a novel transmissive reconfigurable intelligent surface transceiver-enhanced robust and secure integrated sensing and communication network. A time-division sensing communication mechanism is designed for the scenario, which enables communication and sensing to share wireless resources. To address the interference management problem and hinder eavesdropping, we implement rate-splitting multiple access (RSMA), where the common stream is designed as a useful signal and an artificial noise, while taking into account the imperfect channel state information and modeling the channel for the illegal users in a fine-grained manner as well as giving an upper bound on the error. We introduce the secrecy outage probability and construct an optimization problem with secrecy sum-rate as the objective functions to optimize the common stream beamforming matrix, the private stream beamforming matrix and the timeslot duration variable. Due to the coupling of the optimization variables and the infinity of the error set, the proposed problem is a nonconvex optimization problem that cannot be solved directly. In order to address the above challenges, the block coordinate descent-based second-order cone programming algorithm is used to decouple the optimization variables and solving the problem. Specifically, the problem is decoupled into two subproblems concerning the common stream beamforming matrix, the private stream beamforming matrix, and the timeslot duration variable, which are solved by alternating optimization until convergence is reached. To solve the problem, S-procedure, Bernstein's inequality and successive convex approximation are employed to deal with the objective function and non-convex constraints. Numerical simulation results verify the superiority of the proposed scheme in improving the secrecy energy efficiency and the Cramér-Rao boundary. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.05331 [pdf, ps, other]

Channel Characterization of IRS-assisted Resonant Beam Communication Systems

Authors: Wen Fang, Wen Chen, Qingqing Wu, Xusheng Zhu, Qiong Wu, Nan Cheng

Abstract: To meet the growing demand for data traffic, spectrum-rich optical wireless communication (OWC) has emerged as a key technological driver for the development of 6G. The resonant beam communication (RBC) system, which employs spatially separated laser cavities as the transmitter and receiver, is a high-speed OWC technology capable of self-alignment without tracking. However, its transmission throug… ▽ More To meet the growing demand for data traffic, spectrum-rich optical wireless communication (OWC) has emerged as a key technological driver for the development of 6G. The resonant beam communication (RBC) system, which employs spatially separated laser cavities as the transmitter and receiver, is a high-speed OWC technology capable of self-alignment without tracking. However, its transmission through the air is susceptible to losses caused by obstructions. In this paper, we propose an intelligent reflecting surface (IRS) assisted RBC system with the optical frequency doubling method, where the resonant beam in frequency-fundamental and frequency-doubled is transmitted through both direct line-of-sight (LoS) and IRS-assisted channels to maintain steady-state oscillation and enable communication without echo-interference, respectively. Then, we establish the channel model based on Fresnel diffraction theory under the near-field optical propagation to analyze the transmission loss and frequency-doubled power analytically. Furthermore, communication power can be maximized by dynamically controlling the beam-splitting ratio between the two channels according to the loss levels encountered over air. Numerical results validate that the IRS-assisted channel can compensate for the losses in the obstructed LoS channel and misaligned receivers, ensuring that communication performance reaches an optimal value with dynamic ratio adjustments. △ Less

Submitted 15 August, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

arXiv:2407.04737 [pdf, other]

Hierarchical Decoupling Capacitor Optimization for Power Distribution Network of 2.5D ICs with Co-Analysis of Frequency and Time Domains Based on Deep Reinforcement Learning

Authors: Yuanyuan Duan, Haiyang Feng, Zhiping Yu, Hanming Wu, Leilai Shao, Xiaolei Zhu

Abstract: With the growing need for higher memory bandwidth and computation density, 2.5D design, which involves integrating multiple chiplets onto an interposer, emerges as a promising solution. However, this integration introduces significant challenges due to increasing data rates and a large number of I/Os, necessitating advanced optimization of the power distribution networks (PDNs) both on-chip and on… ▽ More With the growing need for higher memory bandwidth and computation density, 2.5D design, which involves integrating multiple chiplets onto an interposer, emerges as a promising solution. However, this integration introduces significant challenges due to increasing data rates and a large number of I/Os, necessitating advanced optimization of the power distribution networks (PDNs) both on-chip and on-interposer to mitigate the small signal noise and simultaneous switching noise (SSN). Traditional PDN optimization strategies in 2.5D systems primarily focus on reducing impedance by integrating decoupling capacitors (decaps) to lessen small signal noises. Unfortunately, relying solely on frequency-domain analysis has been proven inadequate for addressing coupled SSN, as indicated by our experimental results. In this work, we introduce a novel two-phase optimization flow using deep reinforcement learning to tackle both the on-chip small signal noise and SSN. Initially, we optimize the impedance in the frequency domain to maintain the small signal noise within acceptable limits while avoiding over-design. Subsequently, in the time domain, we refine the PDN to minimize the voltage violation integral (VVI), a more accurate measure of SSN severity. To the best of our knowledge, this is the first dual-domain optimization strategy that simultaneously addresses both the small signal noise and SSN propagation through strategic decap placement in on-chip and on-interposer PDNs, offering a significant step forward in the design of robust PDNs for 2.5D integrated systems. △ Less

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.03388 [pdf]

Passenger Route and Departure Time Guidance under Disruptions in Oversaturated Urban Rail Transit Networks

Authors: Siyu Zhuo, Xiaoning Zhu, Pan Shang, Zhengke Liu

Abstract: The urban rail transit (URT) system attracts many commuters with its punctuality and convenience. However, it is vulnerable to disruptions caused by factors like extreme weather and temporary equipment failures, which greatly impact passengers' journeys and diminish the system's service quality. In this study, we propose targeted travel guidance for passengers at different space-time locations by… ▽ More The urban rail transit (URT) system attracts many commuters with its punctuality and convenience. However, it is vulnerable to disruptions caused by factors like extreme weather and temporary equipment failures, which greatly impact passengers' journeys and diminish the system's service quality. In this study, we propose targeted travel guidance for passengers at different space-time locations by devising passenger rescheduling strategies during disruptions. This guidance not only offers insights into route changes but also provides practical recommendations for delaying departure times when required. We present a novel three-feature four-group passenger classification principle, integrating temporal, spatial, and spatio-temporal features to classify passengers in disrupted URT networks. This approach results in the creation of four distinct solution spaces based on passenger groups. A mixed integer programming model is built based on individual level considering the First-in-First-out (FIFO) rule in oversaturated networks. Additionally, we present a two-stage solution approach for handling the complex issues in large-scale networks. Experimental results from both small-scale artificial networks and the real-world Beijing URT network validate the efficacy of our proposed passenger rescheduling strategies in mitigating disruptions. Specifically, when compared to scenarios with no travel guidance during disruptions, our strategies achieve a substantial reduction in total passenger travel time by 29.7% and 50.9% respectively, underscoring the effectiveness in managing unexpected disruptions. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2406.09846 [pdf, ps, other]

Multiple Intelligent Reflecting Surfaces Collaborative Wireless Localization System

Authors: Ziheng Zhang, Wen Chen, Qingqing Wu, Zhendong Li, Xusheng Zhu, Jingfeng Chen, Nan Cheng

Abstract: This paper studies a multiple intelligent reflecting surfaces (IRSs) collaborative localization system where multiple semi-passive IRSs are deployed in the network to locate one or more targets based on time-of-arrival. It is assumed that each semi-passive IRS is equipped with reflective elements and sensors, which are used to establish the line-of-sight links from the base station (BS) to multipl… ▽ More This paper studies a multiple intelligent reflecting surfaces (IRSs) collaborative localization system where multiple semi-passive IRSs are deployed in the network to locate one or more targets based on time-of-arrival. It is assumed that each semi-passive IRS is equipped with reflective elements and sensors, which are used to establish the line-of-sight links from the base station (BS) to multiple targets and process echo signals, respectively. Based on the above model, we derive the Fisher information matrix of the echo signal with respect to the time delay. By employing the chain rule and exploiting the geometric relationship between time delay and position, the Cramer-Rao bound (CRB) for estimating the target's Cartesian coordinate position is derived. Then, we propose a two-stage algorithmic framework to minimize CRB in single- and multi-target localization systems by joint optimizing active beamforming at BS, passive beamforming at multiple IRSs and IRS selection. For the single-target case, we derive the optimal closed-form solution for multiple IRSs coefficients design and propose a lowcomplexity algorithm based on alternating direction method of multipliers to obtain the optimal solution for active beaming design. For the multi-target case, alternating optimization is used to transform the original problem into two subproblems where semi-definite relaxation and successive convex approximation are applied to tackle the quadraticity and indefiniteness in the CRB expression, respectively. Finally, numerical simulation results validate the effectiveness of the proposed algorithm for multiple IRSs collaborative localization system compared to other benchmark schemes as well as the significant performance gains. △ Less

Submitted 17 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: 13 pages, 8 figures

arXiv:2406.09844 [pdf, other]

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

Authors: Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie

Abstract: Zero-shot voice conversion (VC) aims to transform source speech into arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process as well as training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model im… ▽ More Zero-shot voice conversion (VC) aims to transform source speech into arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process as well as training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model improved from Vec-Tok Codec, achieving voice conversion given only a 3s target speaker prompt. We design a residual-enhanced K-Means decoupler to enhance the semantic content extraction with a two-layer clustering process. Besides, we employ teacher-guided refinement to simulate the conversion process to eliminate the training-inference mismatch, forming a dual-mode training strategy. Furthermore, we design a multi-codebook progressive loss function to constrain the layer-wise output of the model from coarse to fine to improve speaker similarity and content accuracy. Objective and subjective evaluations demonstrate that Vec-Tok-VC+ outperforms the strong baselines in naturalness, intelligibility, and speaker similarity. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH2024

arXiv:2406.08920 [pdf, other]

AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis

Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

Abstract: Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given a mono audio emitted by a sound source at a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing binaural audio. However, in addition to low efficiency originating from heavy NeRF rendering, these methods all have a limited ability… ▽ More Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given a mono audio emitted by a sound source at a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing binaural audio. However, in addition to low efficiency originating from heavy NeRF rendering, these methods all have a limited ability of characterizing the entire scene environment such as room geometry, material properties, and the spatial relation between the listener and sound source. To address these issues, we propose a novel Audio-Visual Gaussian Splatting (AV-GS) model. To obtain a material-aware and geometry-aware condition for audio synthesis, we learn an explicit point-based scene representation with an audio-guidance parameter on locally initialized Gaussian points, taking into account the space relation from the listener and sound source. To make the visual scene model audio adaptive, we propose a point densification and pruning strategy to optimally distribute the Gaussian points, with the per-point contribution in sound propagation (e.g., more points needed for texture-less wall surfaces as they affect sound path diversion). Extensive experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets. △ Less

Submitted 14 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.07422 [pdf, other]

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Authors: Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li

Abstract: The multi-codebook speech codec enables the application of large language models (LLM) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermor… ▽ More The multi-codebook speech codec enables the application of large language models (LLM) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit the temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality with a lower bandwidth of only 304bps. The effectiveness of Single-Code is further validated by LLM-TTS experiments, showing improved naturalness and intelligibility. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.06998 [pdf, other]

Movable Antenna Enhanced NOMA Short-Packet Transmission

Authors: Xinyuan He, Wen Chen, Qingqing Wu, Xusheng Zhu, Nan Cheng

Abstract: This letter investigates a short-packet downlink transmission system using non-orthogonal multiple access (NOMA) enhanced via movable antenna (MA). We focuses on maximizing the effective throughput for a core user while ensuring reliable communication for an edge user by optimizing the MAs' coordinates and the power and rate allocations from the access point (AP). The optimization challenge is app… ▽ More This letter investigates a short-packet downlink transmission system using non-orthogonal multiple access (NOMA) enhanced via movable antenna (MA). We focuses on maximizing the effective throughput for a core user while ensuring reliable communication for an edge user by optimizing the MAs' coordinates and the power and rate allocations from the access point (AP). The optimization challenge is approached by decomposing it into two subproblems, utilizing successive convex approximation (SCA) to handle the highly non-concave nature of channel gains. Numerical results confirm that the proposed solution offers substantial improvements in effective throughput compared to NOMA short-packet communication with fixed position antennas (FPAs). △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: 5 pages, 4 figures

arXiv:2406.06880 [pdf, other]

doi 10.1109/TIA.2024.3395570

Multi-Objective Sizing Optimization Method of Microgrid Considering Cost and Carbon Emissions

Authors: Xiang Zhu, Guangchun Ruan, Hua Geng, Honghai Liu, Mingfei Bai, Chao Peng

Abstract: Microgrid serves as a promising solution to integrate and manage distributed renewable energy resources. In this paper, we establish a stochastic multi-objective sizing optimization (SMOSO) model for microgrid planning, which fully captures the battery degradation characteristics and the total carbon emissions. The microgrid operator aims to simultaneously maximize the economic benefits and minimi… ▽ More Microgrid serves as a promising solution to integrate and manage distributed renewable energy resources. In this paper, we establish a stochastic multi-objective sizing optimization (SMOSO) model for microgrid planning, which fully captures the battery degradation characteristics and the total carbon emissions. The microgrid operator aims to simultaneously maximize the economic benefits and minimize carbon emissions, and the degradation of the battery energy storage system (BESS) is modeled as a nonlinear function of power throughput. A self-adaptive multi-objective genetic algorithm (SAMOGA) is proposed to solve the SMOSO model, and this algorithm is enhanced by pre-grouped hierarchical selection and self-adaptive probabilities of crossover and mutation. Several case studies are conducted to determine the microgrid size by analyzing Pareto frontiers, and the simulation results validate that the proposed method has superior performance over other algorithms on the solution quality of optimum and diversity. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted by IEEE Transactions on Industry Applications

arXiv:2406.05976 [pdf, other]

Dynamic Virtual Power Plants With Frequency Regulation Capacity

Authors: Xiang Zhu, Guangchun Ruan, Hua Geng

Abstract: For integrating heterogeneous distributed energy resources to provide fast frequency regulation, this paper proposes a dynamic virtual power plant~(DVPP) with frequency regulation capacity. A parameter anonymity-based approach is established for DVPP aggregating small-scaled inverter-based resources~(IBRs) with privacy concerns. On this basis, a parameter-to-performance mapping is formulated to ev… ▽ More For integrating heterogeneous distributed energy resources to provide fast frequency regulation, this paper proposes a dynamic virtual power plant~(DVPP) with frequency regulation capacity. A parameter anonymity-based approach is established for DVPP aggregating small-scaled inverter-based resources~(IBRs) with privacy concerns. On this basis, a parameter-to-performance mapping is formulated to evaluate how control coefficients impact the DVPP-level power overshoot as well as the IBR-level costs. The objective is to design the best way to provide the frequency response with minimal impacts on grid and the most financial gains. Numerical experiments illustrate the effectiveness of the proposed approach and further analysis validates that our models are able to take dead bands into consideration. △ Less

Submitted 9 June, 2024; originally announced June 2024.

Comments: Accepted by IAS Annual Meeting 2024

arXiv:2406.05672 [pdf, other]

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Authors: Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie

Abstract: Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware(TACA) style modeling approach… ▽ More Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware(TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we introduce the context encoder to two typical TTS models, VITS-based TTS and language model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis. △ Less

Submitted 12 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH2024

arXiv:2406.04111 [pdf, other]

UrbanSARFloods: Sentinel-1 SLC-Based Benchmark Dataset for Urban and Open-Area Flood Mapping

Authors: Jie Zhao, Zhitong Xiong, Xiao Xiang Zhu

Abstract: Due to its cloud-penetrating capability and independence from solar illumination, satellite Synthetic Aperture Radar (SAR) is the preferred data source for large-scale flood mapping, providing global coverage and including various land cover classes. However, most studies on large-scale SAR-derived flood mapping using deep learning algorithms have primarily focused on flooded open areas, utilizing… ▽ More Due to its cloud-penetrating capability and independence from solar illumination, satellite Synthetic Aperture Radar (SAR) is the preferred data source for large-scale flood mapping, providing global coverage and including various land cover classes. However, most studies on large-scale SAR-derived flood mapping using deep learning algorithms have primarily focused on flooded open areas, utilizing available open-access datasets (e.g., Sen1Floods11) and with limited attention to urban floods. To address this gap, we introduce \textbf{UrbanSARFloods}, a floodwater dataset featuring pre-processed Sentinel-1 intensity data and interferometric coherence imagery acquired before and during flood events. It contains 8,879 $512\times 512$ chips covering 807,500 $km^2$ across 20 land cover classes and 5 continents, spanning 18 flood events. We used UrbanSARFloods to benchmark existing state-of-the-art convolutional neural networks (CNNs) for segmenting open and urban flood areas. Our findings indicate that prevalent approaches, including the Weighted Cross-Entropy (WCE) loss and the application of transfer learning with pretrained models, fall short in overcoming the obstacles posed by imbalanced data and the constraints of a small training dataset. Urban flood detection remains challenging. Future research should explore strategies for addressing imbalanced data challenges and investigate transfer learning's potential for SAR-based large-scale flood mapping. Besides, expanding this dataset to include additional flood events holds promise for enhancing its utility and contributing to advancements in flood mapping techniques. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: Accepted by CVPR 2024 EarthVision Workshop

arXiv:2405.18692 [pdf, other]

Movable Antenna Empowered Downlink NOMA Systems: Power Allocation and Antenna Position Optimization

Authors: Yufeng Zhou, Wen Chen, Qingqing Wu, Xusheng Zhu, Nan Cheng

Abstract: This paper investigates a novel communication paradigm employing movable antennas (MAs) within a multiple-input single-output (MISO) non-orthogonal multiple access (NOMA) downlink framework, where users are equipped with MAs. Initially, leveraging the far-field response, we delineate the channel characteristics concerning both the power allocation coefficient and positions of MAs. Subsequently, we… ▽ More This paper investigates a novel communication paradigm employing movable antennas (MAs) within a multiple-input single-output (MISO) non-orthogonal multiple access (NOMA) downlink framework, where users are equipped with MAs. Initially, leveraging the far-field response, we delineate the channel characteristics concerning both the power allocation coefficient and positions of MAs. Subsequently, we endeavor to maximize the channel capacity by jointly optimizing power allocation and antenna positions. To tackle the resultant non-convex problem, we propose an alternating optimization (AO) scheme underpinned by successive convex approximation (SCA) to converge towards a stationary point. Through numerical simulations, our findings substantiate the superiority of the MA-assisted NOMA system over both orthogonal multiple access (OMA) and conventional NOMA configurations in terms of average sum rate and outage probability. △ Less

Submitted 7 August, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.13901 [pdf, other]

DCT-Based Decorrelated Attention for Vision Transformers

Authors: Hongyi Pan, Emadeldeen Hamdan, Xin Zhu, Koushik Biswas, Ahmet Enis Cetin, Ulas Bagci

Abstract: Central to the Transformer architectures' effectiveness is the self-attention mechanism, a function that maps queries, keys, and values into a high-dimensional vector space. However, training the attention weights of queries, keys, and values is non-trivial from a state of random initialization. In this paper, we propose two methods. (i) We first address the initialization problem of Vision Transf… ▽ More Central to the Transformer architectures' effectiveness is the self-attention mechanism, a function that maps queries, keys, and values into a high-dimensional vector space. However, training the attention weights of queries, keys, and values is non-trivial from a state of random initialization. In this paper, we propose two methods. (i) We first address the initialization problem of Vision Transformers by introducing a simple, yet highly innovative, initialization approach utilizing Discrete Cosine Transform (DCT) coefficients. Our proposed DCT-based attention initialization marks a significant gain compared to traditional initialization strategies; offering a robust foundation for the attention mechanism. Our experiments reveal that the DCT-based initialization enhances the accuracy of Vision Transformers in classification tasks. (ii) We also recognize that since DCT effectively decorrelates image information in the frequency domain, this decorrelation is useful for compression because it allows the quantization step to discard many of the higher-frequency components. Based on this observation, we propose a novel DCT-based compression technique for the attention function of Vision Transformers. Since high-frequency DCT coefficients usually correspond to noise, we truncate the high-frequency DCT components of the input patches. Our DCT-based compression reduces the size of weight matrices for queries, keys, and values. While maintaining the same level of accuracy, our DCT compressed Swin Transformers obtain a considerable decrease in the computational overhead. △ Less

Submitted 28 May, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.04285 [pdf, other]

On the Foundations of Earth and Climate Foundation Models

Authors: Xiao Xiang Zhu, Zhitong Xiong, Yi Wang, Adam J. Stewart, Konrad Heidler, Yuanyuan Wang, Zhenghang Yuan, Thomas Dujardin, Qingsong Xu, Yilei Shi

Abstract: Foundation models have enormous potential in advancing Earth and climate sciences, however, current approaches may not be optimal as they focus on a few basic features of a desirable Earth and climate foundation model. Crafting the ideal Earth foundation model, we define eleven features which would allow such a foundation model to be beneficial for any geoscientific downstream application in an en… ▽ More Foundation models have enormous potential in advancing Earth and climate sciences, however, current approaches may not be optimal as they focus on a few basic features of a desirable Earth and climate foundation model. Crafting the ideal Earth foundation model, we define eleven features which would allow such a foundation model to be beneficial for any geoscientific downstream application in an environmental- and human-centric manner.We further shed light on the way forward to achieve the ideal model and to evaluate Earth foundation models. What comes after foundation models? Energy efficient adaptation, adversarial defenses, and interpretability are among the emerging directions. △ Less

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2404.07932 [pdf, other]

FusionMamba: Efficient Image Fusion with State Space Model

Authors: Siran Peng, Xiangyu Zhu, Haoyu Deng, Zhen Lei, Liang-Jian Deng

Abstract: Image fusion aims to generate a high-resolution multi/hyper-spectral image by combining a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Current deep learning (DL)-based methods for image fusion primarily rely on CNNs or Transformers to extract features and merge different types of data. While CNNs are efficient, their receptive fiel… ▽ More Image fusion aims to generate a high-resolution multi/hyper-spectral image by combining a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Current deep learning (DL)-based methods for image fusion primarily rely on CNNs or Transformers to extract features and merge different types of data. While CNNs are efficient, their receptive fields are limited, restricting their capacity to capture global context. Conversely, Transformers excel at learning global information but are hindered by their quadratic complexity. Fortunately, recent advancements in the State Space Model (SSM), particularly Mamba, offer a promising solution to this issue by enabling global awareness with linear complexity. However, there have been few attempts to explore the potential of the SSM in information fusion, which is a crucial ability in domains like image fusion. Therefore, we propose FusionMamba, an innovative method for efficient image fusion. Our contributions mainly focus on two aspects. Firstly, recognizing that images from different sources possess distinct properties, we incorporate Mamba blocks into two U-shaped networks, presenting a novel architecture that extracts spatial and spectral features in an efficient, independent, and hierarchical manner. Secondly, to effectively combine spatial and spectral information, we extend the Mamba block to accommodate dual inputs. This expansion leads to the creation of a new module called the FusionMamba block, which outperforms existing fusion techniques such as concatenation and cross-attention. We conduct a series of experiments on five datasets related to three image fusion tasks. The quantitative and qualitative evaluation results demonstrate that our method achieves SOTA performance, underscoring the superiority of FusionMamba. The code is available at https://github.com/PSRben/FusionMamba. △ Less

Submitted 10 May, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.00872 [pdf, other]

Performance Evaluation of RIS-Assisted Spatial Modulation for Downlink Transmission

Authors: Xusheng Zhu, Qingqing Wu, Wen Chen

Abstract: This paper explores the performance of reconfigurable intelligent surface (RIS) assisted spatial modulation (SM) downlink communication systems, focusing on the average bit error probability (ABEP). Notably, in scenarios with a large number of reflecting units, the composite channel can be approximated by a Gaussian distribution using the central limit theorem. The receiver utilizes a maximum like… ▽ More This paper explores the performance of reconfigurable intelligent surface (RIS) assisted spatial modulation (SM) downlink communication systems, focusing on the average bit error probability (ABEP). Notably, in scenarios with a large number of reflecting units, the composite channel can be approximated by a Gaussian distribution using the central limit theorem. The receiver utilizes a maximum likelihood detector to recover information in both spatial and symbol domains. In the proposed RIS-SM system, we analytically derive a closed-form expression for the union tight upper bound of ABEP, employing the Gaussian-Chebyshev quadrature method. The validity of these results is rigorously confirmed through exhaustive Monte Carlo simulations. △ Less

Submitted 31 March, 2024; originally announced April 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2402.02893

arXiv:2403.19457 [pdf, other]

Transmissive RIS Transmitter Enabled Spatial Modulation for MIMO Systems

Authors: Xusheng Zhu, Qingqing Wu, Wen Chen

Abstract: In this paper, we propose a novel transmissive reconfigurable intelligent surface (TRIS) transmitter-enabled spatial modulation (SM) multiple-input multiple-output (MIMO) system. In the transmission phase, a column-wise activation strategy is implemented for the TRIS panel, where the specific column elements are activated per time slot. Concurrently, the receiver employs the maximum likelihood det… ▽ More In this paper, we propose a novel transmissive reconfigurable intelligent surface (TRIS) transmitter-enabled spatial modulation (SM) multiple-input multiple-output (MIMO) system. In the transmission phase, a column-wise activation strategy is implemented for the TRIS panel, where the specific column elements are activated per time slot. Concurrently, the receiver employs the maximum likelihood detection technique. Based on this, for the transmit signals, we derive the closed-form expressions for the upper bounds of the average bit error probability (ABEP) of the proposed scheme from different perspectives, employing both vector-based and element-based approaches. Furthermore, we provide the asymptotic closed-form expressions for the ABEP of the TRIS-SM scheme, as well as the diversity gain. To improve the performance of the proposed TRIS-SM system, we optimize ABEP with a fixed data rate. Additionally, we provide lower bounds to simplify the computational complexity of improved TRIS-SM scheme. The Monte Carlo simulation method is used to validate the theoretical derivations exhaustively. The results demonstrate that the proposed TRIS-SM scheme can achieve better ABEP performance compared to the conventional SM scheme. Furthermore, the improved TRIS-SM scheme outperforms the TRIS-SM scheme in terms of reliability. △ Less

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.18846 [pdf, other]

The Blind Normalized Stein Variational Gradient Descent-Based Detection for Intelligent Massive Random Access

Authors: Xin Zhu, Ahmet Enis Cetin

Abstract: The lack of an efficient preamble detection algorithm remains a challenge for solving preamble collision problems in intelligent massive random access (RA) in practical communication scenarios. To solve this problem, we present a novel early preamble detection scheme based on a maximum likelihood estimation (MLE) model at the first step of the grant-based RA procedure. A novel blind normalized Ste… ▽ More The lack of an efficient preamble detection algorithm remains a challenge for solving preamble collision problems in intelligent massive random access (RA) in practical communication scenarios. To solve this problem, we present a novel early preamble detection scheme based on a maximum likelihood estimation (MLE) model at the first step of the grant-based RA procedure. A novel blind normalized Stein variational gradient descent (SVGD)-based detector is proposed to obtain an approximate solution to the MLE model. First, by exploring the relationship between the Hadamard transform and wavelet transform, a new modified Hadamard transform (MHT) is developed to separate high-frequencies from important components using the second-order derivative filter. Next, to eliminate noise and mitigate the vanishing gradients problem in the SVGD-based detectors, the block MHT layer is designed based on the MHT, scaling layer, soft-thresholding layer, inverse MHT and sparsity penalty. Then, the blind normalized SVGD algorithm is derived to perform preamble detection without prior knowledge of noise power and the number of active devices. The experimental results show the proposed block MHT layer outperforms other transform-based methods in terms of computation costs and denoising performance. Furthermore, with the assistance of the block MHT layer, the proposed blind normalized SVGD algorithm achieves a higher preamble detection accuracy and throughput than other state-of-the-art detection methods. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.17751 [pdf, other]

Robust Analysis of Full-Duplex Two-Way Space Shift Keying With RIS Systems

Authors: Xusheng Zhu, Wen Chen, Qingqing Wu, Wen Fang, Chaoying Huang, Jun Li

Abstract: Reconfigurable intelligent surface (RIS)-assisted index modulation system schemes are considered a promising technology for sixth-generation (6G) wireless communication systems, which can enhance various system capabilities such as coverage and reliability. However, obtaining perfect channel state information (CSI) is challenging due to the lack of a radio frequency chain in RIS. In this paper, we… ▽ More Reconfigurable intelligent surface (RIS)-assisted index modulation system schemes are considered a promising technology for sixth-generation (6G) wireless communication systems, which can enhance various system capabilities such as coverage and reliability. However, obtaining perfect channel state information (CSI) is challenging due to the lack of a radio frequency chain in RIS. In this paper, we investigate the RIS-assisted full-duplex (FD) two-way space shift keying (SSK) system under imperfect CSI, where the signal emissions are augmented by deploying RISs in the vicinity of two FD users. The maximum likelihood detector is utilized to recover the transmit antenna index. With this in mind, we derive closed-form average bit error probability (ABEP) expression based on the Gaussian-Chebyshev quadrature (GCQ) method and provide the upper bound and asymptotic ABEP expressions in the presence of channel estimation errors. To gain more insights, we also derive the outage probability and provide the throughput of the proposed scheme with imperfect CSI. The correctness of the analytical derivation results is confirmed via Monte Carlo simulations. It is demonstrated that increasing the number of elements of RIS can significantly improve the ABEP performance of the FD system over the half-duplex (HD) system. Furthermore, in the high SNR region, the ABEP performance of the FD system is better than that of the HD system. △ Less

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.05435 [pdf, other]

OmniCount: Multi-label Object Counting with Semantic-Geometric Priors

Authors: Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta

Abstract: Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficien… ▽ More Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficiencies. This paper introduces a more practical approach enabling simultaneous counting of multiple object categories using an open-vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights (priors) from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging varied interactive prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation in OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions. △ Less

Submitted 20 August, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.05024 [pdf, other]

A Probabilistic Hadamard U-Net for MRI Bias Field Correction

Authors: Xin Zhu, Hongyi Pan, Yury Velichko, Adam B. Murphy, Ashley Ross, Baris Turkbey, Ahmet Enis Cetin, Ulas Bagci

Abstract: Magnetic field inhomogeneity correction remains a challenging task in MRI analysis. Most established techniques are designed for brain MRI by supposing that image intensities in the identical tissue follow a uniform distribution. Such an assumption cannot be easily applied to other organs, especially those that are small in size and heterogeneous in texture (large variations in intensity), such as… ▽ More Magnetic field inhomogeneity correction remains a challenging task in MRI analysis. Most established techniques are designed for brain MRI by supposing that image intensities in the identical tissue follow a uniform distribution. Such an assumption cannot be easily applied to other organs, especially those that are small in size and heterogeneous in texture (large variations in intensity), such as the prostate. To address this problem, this paper proposes a probabilistic Hadamard U-Net (PHU-Net) for prostate MRI bias field correction. First, a novel Hadamard U-Net (HU-Net) is introduced to extract the low-frequency scalar field, multiplied by the original input to obtain the prototypical corrected image. HU-Net converts the input image from the time domain into the frequency domain via Hadamard transform. In the frequency domain, high-frequency components are eliminated using the trainable filter (scaling layer), hard-thresholding layer, and sparsity penalty. Next, a conditional variational autoencoder is used to encode possible bias field-corrected variants into a low-dimensional latent space. Random samples drawn from latent space are then incorporated with a prototypical corrected image to generate multiple plausible images. Experimental results demonstrate the effectiveness of PHU-Net in correcting bias-field in prostate MRI with a fast inference speed. It has also been shown that prostate MRI segmentation accuracy improves with the high-quality corrected images from PHU-Net. The code will be available in the final version of this manuscript. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2402.02893 [pdf, other]

On the Performance of RIS-Aided Spatial Modulation for Downlink Transmission

Authors: Xusheng Zhu, Qingqing Wu, Wen Chen

Abstract: In this study, we explore the performance of a reconfigurable reflecting surface (RIS)-assisted transmit spatial modulation (SM) system for downlink transmission, wherein the deployment of RIS serves the purpose of blind area coverage within the channel. At the receiving end, we present three detectors, i.e., maximum likelihood (ML) detector, two-stage ML detection, and greedy detector to recover… ▽ More In this study, we explore the performance of a reconfigurable reflecting surface (RIS)-assisted transmit spatial modulation (SM) system for downlink transmission, wherein the deployment of RIS serves the purpose of blind area coverage within the channel. At the receiving end, we present three detectors, i.e., maximum likelihood (ML) detector, two-stage ML detection, and greedy detector to recover the transmitted signal. By utilizing the ML detector, we initially derive the conditional pair error probability expression for the proposed scheme. Subsequently, we leverage the central limit theorem (CLT) to obtain the probability density function of the combined channel. Following this, the Gaussian-Chebyshev quadrature method is applied to derive a closed-form expression for the unconditional pair error probability and establish the union tight upper bound for the average bit error probability (ABEP). Furthermore, we derive a closed-form expression for the ergodic capacity of the proposed RIS-SM scheme. Monte Carlo simulations are conducted not only to assess the complexity and reliability of the three detection algorithms but also to validate the results obtained through theoretical derivation results. △ Less

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2401.10345 [pdf, other]

Attack and Defense Analysis of Learned Image Compression

Authors: Tianyu Zhu, Heming Sun, Xiankui Xiong, Xuanpeng Zhu, Yong Gong, Minge jing, Yibo Fan

Abstract: Learned image compression (LIC) is becoming more and more popular these years with its high efficiency and outstanding compression quality. Still, the practicality against modified inputs added with specific noise could not be ignored. White-box attacks such as FGSM and PGD use only gradient to compute adversarial images that mislead LIC models to output unexpected results. Our experiments compare… ▽ More Learned image compression (LIC) is becoming more and more popular these years with its high efficiency and outstanding compression quality. Still, the practicality against modified inputs added with specific noise could not be ignored. White-box attacks such as FGSM and PGD use only gradient to compute adversarial images that mislead LIC models to output unexpected results. Our experiments compare the effects of different dimensions such as attack methods, models, qualities, and targets, concluding that in the worst case, there is a 61.55% decrease in PSNR or a 19.15 times increase in bpp under the PGD attack. To improve their robustness, we conduct adversarial training by adding adversarial images into the training datasets, which obtains a 95.52% decrease in the R-D cost of the most vulnerable LIC model. We further test the robustness of H.266, whose better performance on reconstruction quality extends its possibility to defend one-step or iterative adversarial attacks. △ Less

Submitted 27 March, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

arXiv:2312.16850 [pdf, other]

Accent-VITS:accent transfer for end-to-end TTS

Authors: Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, Lei Xie

Abstract: Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is how to effectively disentangle speaker timbre and accent which are entangled in speech. This paper presents a VITS-based end-to-end accent transfer model named Accent-VITS.Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable… ▽ More Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is how to effectively disentangle speaker timbre and accent which are entangled in speech. This paper presents a VITS-based end-to-end accent transfer model named Accent-VITS.Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer.We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrums as constraints.Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes be more stable and effective.Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness as compared with a strong baseline. △ Less

Submitted 29 December, 2023; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: Accepted by NCMMSC2023

arXiv:2312.16064 [pdf, other]

Goal-Oriented Integration of Sensing, Communication, Computing, and Control for Mission-Critical Internet-of-Things

Authors: Jie Cao, Ernest Kurniawan, Amnart Boonkajay, Sumei Sun, Petar Popovski, Xu Zhu

Abstract: Driven by the development goal of network paradigm and demand for various functions in the sixth-generation (6G) mission-critical Internet-of-Things (MC-IoT), we foresee a goal-oriented integration of sensing, communication, computing, and control (GIS3C) in this paper. We first provide an overview of the tasks, requirements, and challenges of MC-IoT. Then we introduce an end-to-end GIS3C architec… ▽ More Driven by the development goal of network paradigm and demand for various functions in the sixth-generation (6G) mission-critical Internet-of-Things (MC-IoT), we foresee a goal-oriented integration of sensing, communication, computing, and control (GIS3C) in this paper. We first provide an overview of the tasks, requirements, and challenges of MC-IoT. Then we introduce an end-to-end GIS3C architecture, in which goal-oriented communication is leveraged to bridge and empower sensing, communication, control, and computing functionalities. By revealing the interplay among multiple subsystems in terms of key performance indicators and parameters, this paper introduces unified metrics, i.e., task completion effectiveness and cost, to facilitate S3C co-design in MC-IoT. The preliminary results demonstrate the benefits of GIS3C in improving task completion effectiveness while reducing costs. We also identify and highlight the gaps and challenges in applying GIS3C in the future 6G networks. △ Less

Submitted 1 January, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

arXiv:2312.15454 [pdf, other]

Risk-Aware and Energy-Efficient AoI Optimization for Multi-Connectivity WNCS with Short Packet Transmissions

Authors: Jie Cao, Xu Zhu, Sumei Sun, Ernest Kurniawan, Amnart Boonkajay

Abstract: Age of Information (AoI) has been proposed to quantify the freshness of information for emerging real-time applications such as remote monitoring and control in wireless networked control systems (WNCSs). Minimization of the average AoI and its outage probability can ensure timely and stable transmission. Energy efficiency (EE) also plays an important role in WNCSs, as many devices are featured by… ▽ More Age of Information (AoI) has been proposed to quantify the freshness of information for emerging real-time applications such as remote monitoring and control in wireless networked control systems (WNCSs). Minimization of the average AoI and its outage probability can ensure timely and stable transmission. Energy efficiency (EE) also plays an important role in WNCSs, as many devices are featured by low cost and limited battery. Multi-connectivity over multiple links enables a decrease in AoI, at the cost of energy. We tackle the unresolved problem of selecting the optimal number of connections that is both AoI-optimal and energy-efficient, while avoiding risky states. To address this issue, the average AoI and peak AoI (PAoI), as well as PAoI violation probability are formulated as functions of the number of connections. Then the EE-PAoI ratio is introduced to allow a tradeoff between AoI and energy, which is maximized by the proposed risk-aware, AoI-optimal and energy-efficient connectivity scheme. To obtain this, we analyze the property of the formulated EE-PAoI ratio and prove the monotonicity of PAoI violation probability. Interestingly, we reveal that the multi-connectivity scheme is not always preferable, and the signal-to-noise ratio (SNR) threshold that determines the selection of the multi-connectivity scheme is derived as a function of the coding rate. Also, the optimal number of connections is obtained and shown to be a decreasing function of the transmit power. Simulation results demonstrate that the proposed scheme enables more than 15 folds of EE-PAoI gain at the low SNR than the single-connectivity scheme. △ Less

Submitted 1 January, 2024; v1 submitted 24 December, 2023; originally announced December 2023.

arXiv:2312.12581 [pdf]

Wireless, Customizable Coaxially-shielded Coils for Magnetic Resonance Imaging

Authors: Ke Wu, Xia Zhu, Stephan W. Anderson, Xin Zhang

Abstract: Anatomy-specific RF receive coil arrays routinely adopted in magnetic resonance imaging (MRI) for signal acquisition, are commonly burdened by their bulky, fixed, and rigid configurations, which may impose patient discomfort, bothersome positioning, and suboptimal sensitivity in certain situations. Herein, leveraging coaxial cables' inherent flexibility and electric field confining property, for t… ▽ More Anatomy-specific RF receive coil arrays routinely adopted in magnetic resonance imaging (MRI) for signal acquisition, are commonly burdened by their bulky, fixed, and rigid configurations, which may impose patient discomfort, bothersome positioning, and suboptimal sensitivity in certain situations. Herein, leveraging coaxial cables' inherent flexibility and electric field confining property, for the first time, we present wireless, ultra-lightweight, coaxially-shielded MRI coils achieving a signal-to-noise ratio (SNR) comparable to or surpassing that of commercially available cutting-edge receive coil arrays with the potential for improved patient comfort, ease of implementation, and significantly reduced costs. The proposed coils demonstrate versatility by functioning both independently in form-fitting configurations, closely adapting to relatively small anatomical sites, and collectively by inductively coupling together as metamaterials, allowing for extension of the field-of-view of their coverage to encompass larger anatomical regions without compromising coil sensitivity. The wireless, coaxially-shielded MRI coils reported herein pave the way toward next generation MRI coils. △ Less

Submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.10018 [pdf]

Wearable Coaxially-shielded Metamaterial for Magnetic Resonance Imaging

Authors: Xia Zhu, Ke Wu, Stephan W. Anderson, Xin Zhang

Abstract: Recent advancements in metamaterials have yielded the possibility of a wireless solution to improve signal-to-noise ratio (SNR) in magnetic resonance imaging (MRI). Unlike traditional closely packed local coil arrays with rigid designs and numerous components, these lightweight, cost-effective metamaterials eliminate the need for radio frequency (RF) cabling, baluns, adapters, and interfaces. Howe… ▽ More Recent advancements in metamaterials have yielded the possibility of a wireless solution to improve signal-to-noise ratio (SNR) in magnetic resonance imaging (MRI). Unlike traditional closely packed local coil arrays with rigid designs and numerous components, these lightweight, cost-effective metamaterials eliminate the need for radio frequency (RF) cabling, baluns, adapters, and interfaces. However, their clinical adoption has been limited by their low sensitivity, bulky physical footprint, and limited, specific use cases. Herein, we introduce a wearable metamaterial developed using commercially available coaxial cable, designed for a 3.0 T MRI system. This metamaterial inherits the coaxially-shielded structure of its constituent coaxial cable, effectively containing the electric field within the cable, thereby mitigating the electric coupling to its loading while ensuring safer clinical adoption, lower signal loss, and resistance to frequency shifts. Weighing only 50g, the metamaterial maximizes its sensitivity by conforming to the anatomical region of interest. MRI images acquired using this metamaterial with various pulse sequences demonstrate an up to 2-fold SNR enhancement when compared to a state-of-the-art 16-channel knee coil. This work introduces a novel paradigm for constructing metamaterials in the MRI environment, paving the way for the development of next-generation wireless MRI technology. △ Less

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.09747 [pdf, other]

SELM: Speech Enhancement Using Discrete Tokens and Language Models

Authors: Ziqian Wang, Xinfa Zhu, Zihan Zhang, YuanJun Lv, Ning Jiang, Guoqing Zhao, Lei Xie

Abstract: Language models (LMs) have shown superior performances in various speech generation tasks recently, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech enhancement, harnessing semantic information holds potential advantages for speech enhancement tasks. In light of this, we propose SELM, a novel paradigm for speech… ▽ More Language models (LMs) have shown superior performances in various speech generation tasks recently, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech enhancement, harnessing semantic information holds potential advantages for speech enhancement tasks. In light of this, we propose SELM, a novel paradigm for speech enhancement, which integrates discrete tokens and leverages language models. SELM comprises three stages: encoding, modeling, and decoding. We transform continuous waveform signals into discrete tokens using pre-trained self-supervised learning (SSL) models and a k-means tokenizer. Language models then capture comprehensive contextual information within these tokens. Finally, a detokenizer and HiFi-GAN restore them into enhanced speech. Experimental results demonstrate that SELM achieves comparable performance in objective metrics alongside superior results in subjective perception. Our demos are available https://honee-w.github.io/SELM/. △ Less

Submitted 7 January, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024

arXiv:2311.13254 [pdf, other]

Unified Domain Adaptive Semantic Segmentation

Authors: Zhe Zhang, Gaochang Wu, Jing Zhang, Xiatian Zhu, Dacheng Tao, Tianyou Chai

Abstract: Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer the supervision from a labeled source domain to an unlabeled target domain. The majority of existing UDA-SS works typically consider images whilst recent attempts have extended further to tackle videos by modeling the temporal dimension. Although the two lines of research share the major challenges -- overcoming the under… ▽ More Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer the supervision from a labeled source domain to an unlabeled target domain. The majority of existing UDA-SS works typically consider images whilst recent attempts have extended further to tackle videos by modeling the temporal dimension. Although the two lines of research share the major challenges -- overcoming the underlying domain distribution shift, their studies are largely independent, resulting in fragmented insights, a lack of holistic understanding, and missed opportunities for cross-pollination of ideas. This fragmentation prevents the unification of methods, leading to redundant efforts and suboptimal knowledge transfer across image and video domains. Under this observation, we advocate unifying the study of UDA-SS across video and image scenarios, enabling a more comprehensive understanding, synergistic advancements, and efficient knowledge sharing. To that end, we explore the unified UDA-SS from a general data augmentation perspective, serving as a unifying conceptual framework, enabling improved generalization, and potential for cross-pollination of ideas, ultimately contributing to the overall progress and practical impact of this field of research. Specifically, we propose a Quad-directional Mixup (QuadMix) method, characterized by tackling distinct point attributes and feature inconsistencies through four-directional paths for intra- and inter-domain mixing in a feature space. To deal with temporal shifts with videos, we incorporate optical flow-guided feature aggregation across spatial and temporal dimensions for fine-grained domain alignment. Extensive experiments show that our method outperforms the state-of-the-art works by large margins on four challenging UDA-SS benchmarks. Our source code and models will be released at \url{https://github.com/ZHE-SAPI/UDASS}. △ Less

Submitted 20 August, 2024; v1 submitted 22 November, 2023; originally announced November 2023.

Comments: 18 pages,10 figures, 11 tables

arXiv:2311.07179 [pdf, other]

SponTTS: modeling and transferring spontaneous style for TTS

Authors: Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie

Abstract: Spontaneous speaking style exhibits notable differences from other speaking styles due to various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech like a smile), posing challenges to modeling and prediction of spontaneous style. Moreover, the limitation of high-quality spontaneous d… ▽ More Spontaneous speaking style exhibits notable differences from other speaking styles due to various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech like a smile), posing challenges to modeling and prediction of spontaneous style. Moreover, the limitation of high-quality spontaneous data constrains spontaneous speech generation for speakers without spontaneous data. To address these problems, we propose SponTTS, a two-stage approach based on neural bottleneck (BN) features to model and transfer spontaneous style for TTS. In the first stage, we adopt a Conditional Variational Autoencoder (CVAE) to capture spontaneous prosody from a BN feature and involve the spontaneous phenomena by the constraint of spontaneous phenomena embedding prediction loss. Besides, we introduce a flow-based predictor to predict a latent spontaneous style representation from the text, which enriches the prosody and context-specific spontaneous phenomena during inference. In the second stage, we adopt a VITS-like module to transfer the spontaneous style learned in the first stage to the target speakers. Experiments demonstrate that SponTTS is effective in modeling spontaneous style and transferring the style to the target speakers, generating spontaneous speech with high naturalness, expressiveness, and speaker similarity. The zero-shot spontaneous style TTS test further verifies the generalization and robustness of SponTTS in generating spontaneous speech for unseen speakers. △ Less

Submitted 8 January, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: 5 pages, 3 figures, Accepted by ICASSP2024

arXiv:2310.17101 [pdf, other]

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning

Authors: Xinfa Zhu, Yuke Li, Yi Lei, Ning Jiang, Guoqing Zhao, Lei Xie

Abstract: This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e. utterance and category level, is leveraged to extract the disentangled style, em… ▽ More This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e. utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method. △ Less

Submitted 25 April, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: 6 pages, 4 figures; Accepted by ICME 2024

arXiv:2310.09840 [pdf, ps, other]

Towards Structural Sparse Precoding: Dynamic Time, Frequency, Space, and Power Multistage Resource Programming

Authors: Zhongxiang Wei, Ping Wang, Qingjiang Shi, Xu Zhu, Christos Masouros

Abstract: In last decades, dynamic resource programming in partial resource domains has been extensively investigated for single time slot optimizations. However, with the emerging real-time media applications in fifth-generation communications, their new quality of service requirements are often measured in temporal dimension. This requires multistage optimization for full resource domain dynamic programmi… ▽ More In last decades, dynamic resource programming in partial resource domains has been extensively investigated for single time slot optimizations. However, with the emerging real-time media applications in fifth-generation communications, their new quality of service requirements are often measured in temporal dimension. This requires multistage optimization for full resource domain dynamic programming. Taking experience rate as a typical temporal multistage metric, we jointly optimize time, frequency, space and power domains resource for multistage optimization. To strike a good tradeoff between system performance and computational complexity, we first transform the formulated mixed integer non-linear constraints into equivalent convex second order cone constraints, by exploiting the coupling effect among the resources. Leveraging the concept of structural sparsity, the objective of max-min experience rate is given as a weighted 1-norm term associated with the precoding matrix. Finally, a low-complexity iterative algorithm is proposed for full resource domain programming, aided by another simple conic optimization for obtaining its feasible initial result. Simulation verifies that our design significantly outperform the benchmarks while maintaining a fast convergence rate, shedding light on full domain dynamic resource programming of multistage optimizations. △ Less

Submitted 15 October, 2023; originally announced October 2023.

arXiv:2310.07246 [pdf, other]

Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

Authors: Xinfa Zhu, Yuanjun Lv, Yi Lei, Tao Li, Wendi He, Hongbin Zhou, Heng Lu, Lei Xie

Abstract: Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating e… ▽ More Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok . △ Less

Submitted 12 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: 15 pages, 2 figures

arXiv:2310.05072 [pdf, other]

Performance Analysis of RIS-Aided Double Spatial Scattering Modulation for mmWave MIMO Systems

Authors: Xusheng Zhu, Wen Chen, Qingqing Wu, Jun Li, Nan Cheng, Fangjiong Chen, Changle Li

Abstract: In this paper, we investigate a practical structure of reconfigurable intelligent surface (RIS)-based double spatial scattering modulation (DSSM) for millimeter-wave (mmWave) multiple-input multiple-output (MIMO) systems. A suboptimal detector is proposed, in which the beam direction is first demodulated according to the received beam strength, and then the remaining information is demodulated by… ▽ More In this paper, we investigate a practical structure of reconfigurable intelligent surface (RIS)-based double spatial scattering modulation (DSSM) for millimeter-wave (mmWave) multiple-input multiple-output (MIMO) systems. A suboptimal detector is proposed, in which the beam direction is first demodulated according to the received beam strength, and then the remaining information is demodulated by adopting the maximum likelihood algorithm. Based on the proposed suboptimal detector, we derive the conditional pairwise error probability expression. Further, the exact numerical integral and closed-form expressions of unconditional pairwise error probability (UPEP) are derived via two different approaches. To provide more insights, we derive the upper bound and asymptotic expressions of UPEP. In addition, the diversity gain of the RIS-DSSM scheme was also given. Furthermore, the union upper bound of average bit error probability (ABEP) is obtained by combining the UPEP and the number of error bits. Simulation results are provided to validate the derived upper bound and asymptotic expressions of ABEP. We found an interesting phenomenon that the ABEP performance of the proposed system-based phase shift keying is better than that of the quadrature amplitude modulation. Additionally, the performance advantage of ABEP is more significant with the increase in the number of RIS elements. △ Less

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2310.04004 [pdf, other]

U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

Authors: Tao Li, Zhichao Wang, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie

Abstract: Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zer… ▽ More Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning. △ Less

Submitted 6 October, 2023; originally announced October 2023.

arXiv:2310.03963 [pdf, other]

Zero-Shot Emotion Transfer For Cross-Lingual Speech Synthesis

Authors: Yuke Li, Xinfa Zhu, Yi Lei, Hai Li, Junhui Liu, Danming Xie, Lei Xie

Abstract: Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from an arbitrary speech reference in the source language to the synthetic speech in the target language. Building such a system faces challenges of unnatural foreign accents and difficulty in modeling the shared emotional expressions of different languages. Building on the DelightfulTTS neural architecture, this… ▽ More Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from an arbitrary speech reference in the source language to the synthetic speech in the target language. Building such a system faces challenges of unnatural foreign accents and difficulty in modeling the shared emotional expressions of different languages. Building on the DelightfulTTS neural architecture, this paper addresses these challenges by introducing specifically-designed modules to model the language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module to improve the naturalness of the synthetic cross-lingual speech. The shared emotional expression between different languages is extracted from a pre-trained self-supervised model HuBERT with strong generalization capabilities. We further use hierarchical emotion modeling to capture more comprehensive emotions across different languages. Experimental results demonstrate the proposed framework's effectiveness in synthesizing bi-lingual emotional speech for the monolingual target speaker without emotional training data. △ Less

Submitted 5 October, 2023; originally announced October 2023.

Comments: Accepted by ASRU2023

arXiv:2310.02862 [pdf, other]

A novel asymmetrical autoencoder with a sparsifying discrete cosine Stockwell transform layer for gearbox sensor data compression

Authors: Xin Zhu, Daoguang Yang, Hongyi Pan, Hamid Reza Karimi, Didem Ozevin, Ahmet Enis Cetin

Abstract: The lack of an efficient compression model remains a challenge for the wireless transmission of gearbox data in non-contact gear fault diagnosis problems. In this paper, we present a signal-adaptive asymmetrical autoencoder with a transform domain layer to compress sensor signals. First, a new discrete cosine Stockwell transform (DCST) layer is introduced to replace linear layers in a multi-layer… ▽ More The lack of an efficient compression model remains a challenge for the wireless transmission of gearbox data in non-contact gear fault diagnosis problems. In this paper, we present a signal-adaptive asymmetrical autoencoder with a transform domain layer to compress sensor signals. First, a new discrete cosine Stockwell transform (DCST) layer is introduced to replace linear layers in a multi-layer autoencoder. A trainable filter is implemented in the DCST domain by utilizing the multiplication property of the convolution. A trainable hard-thresholding layer is applied to reduce redundant data in the DCST layer to make the feature map sparse. In comparison to the linear layer, the DCST layer reduces the number of trainable parameters and improves the accuracy of data reconstruction. Second, training the autoencoder with a sparsifying DCST layer only requires a small number of datasets. The proposed method is superior to other autoencoder-based methods on the University of Connecticut (UoC) and Southeast University (SEU) gearbox datasets, as the average quality score is improved by 2.00% at the lowest and 32.35% at the highest with a limited number of training samples △ Less

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2310.00153 [pdf]

Conformal Metamaterials with Active Tunability and Self-adaptivity for Magnetic Resonance Imaging

Authors: Ke Wu, Xia Zhu, Xiaoguang Zhao, Stephan W. Anderson, Xin Zhang

Abstract: Ongoing effort has been devoted to applying metamaterials to boost the imaging performance of magnetic resonance imaging owing to their unique capacity for electromagnetic field confinement and enhancement. However, there are still major obstacles to widespread clinical adoption of conventional metamaterials due to several notable restrictions, namely: their typically bulky and rigid structures, d… ▽ More Ongoing effort has been devoted to applying metamaterials to boost the imaging performance of magnetic resonance imaging owing to their unique capacity for electromagnetic field confinement and enhancement. However, there are still major obstacles to widespread clinical adoption of conventional metamaterials due to several notable restrictions, namely: their typically bulky and rigid structures, deviations in their optimal resonance frequency, and their inevitable interference with the transmission RF field in MRI. Herein, we address these restrictions and report a conformal, smart metamaterial, which may not only be readily tuned to achieve the desired, precise frequency match with MRI by a controlling circuit, but is also capable of selectively amplifying the magnetic field during the RF reception phase by sensing the excitation signal strength passively, thereby remaining off during the RF transmission phase and thereby ensuring its optimal performance when applied to MRI as an additive technology. By addressing a host of current technological challenges, the metamaterial presented herein paves the way toward the wide-ranging utilization of metamaterials in clinical MRI, thereby translating this promising technology to the MRI bedside. △ Less

Submitted 29 September, 2023; originally announced October 2023.

Comments: 21 pages, 7 figures

arXiv:2309.16499 [pdf, other]

Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks

Authors: Danfeng Hong, Bing Zhang, Hao Li, Yuxuan Li, Jing Yao, Chenyu Li, Martin Werner, Jocelyn Chanussot, Alexander Zipf, Xiao Xiang Zhu

Abstract: Artificial intelligence (AI) approaches nowadays have gained remarkable success in single-modality-dominated remote sensing (RS) applications, especially with an emphasis on individual urban environments (e.g., single cities or regions). Yet these AI models tend to meet the performance bottleneck in the case studies across cities or regions, due to the lack of diverse RS information and cutting-ed… ▽ More Artificial intelligence (AI) approaches nowadays have gained remarkable success in single-modality-dominated remote sensing (RS) applications, especially with an emphasis on individual urban environments (e.g., single cities or regions). Yet these AI models tend to meet the performance bottleneck in the case studies across cities or regions, due to the lack of diverse RS information and cutting-edge solutions with high generalization ability. To this end, we build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, SAR) for the study purpose of the cross-city semantic segmentation task (called C2Seg dataset), which consists of two cross-city scenes, i.e., Berlin-Augsburg (in Germany) and Beijing-Wuhan (in China). Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN for short, to promote the AI model's generalization ability from the multi-city environments. HighDAN is capable of retaining the spatially topological structure of the studied urban scene well in a parallel high-to-low resolution fusion fashion but also closing the gap derived from enormous differences of RS image representations between different cities by means of adversarial learning. In addition, the Dice loss is considered in HighDAN to alleviate the class imbalance issue caused by factors across cities. Extensive experiments conducted on the C2Seg dataset show the superiority of our HighDAN in terms of segmentation performance and generalization ability, compared to state-of-the-art competitors. The C2Seg dataset and the semantic segmentation toolbox (involving the proposed HighDAN) will be available publicly at https://github.com/danfenghong. △ Less

Submitted 3 October, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

arXiv:2309.16468 [pdf, other]

HyperLISTA-ABT: An Ultra-light Unfolded Network for Accurate Multi-component Differential Tomographic SAR Inversion

Authors: Kun Qian, Yuanyuan Wang, Peter Jung, Yilei Shi, Xiao Xiang Zhu

Abstract: Deep neural networks based on unrolled iterative algorithms have achieved remarkable success in sparse reconstruction applications, such as synthetic aperture radar (SAR) tomographic inversion (TomoSAR). However, the currently available deep learning-based TomoSAR algorithms are limited to three-dimensional (3D) reconstruction. The extension of deep learning-based algorithms to four-dimensional (4… ▽ More Deep neural networks based on unrolled iterative algorithms have achieved remarkable success in sparse reconstruction applications, such as synthetic aperture radar (SAR) tomographic inversion (TomoSAR). However, the currently available deep learning-based TomoSAR algorithms are limited to three-dimensional (3D) reconstruction. The extension of deep learning-based algorithms to four-dimensional (4D) imaging, i.e., differential TomoSAR (D-TomoSAR) applications, is impeded mainly due to the high-dimensional weight matrices required by the network designed for D-TomoSAR inversion, which typically contain millions of freely trainable parameters. Learning such huge number of weights requires an enormous number of training samples, resulting in a large memory burden and excessive time consumption. To tackle this issue, we propose an efficient and accurate algorithm called HyperLISTA-ABT. The weights in HyperLISTA-ABT are determined in an analytical way according to a minimum coherence criterion, trimming the model down to an ultra-light one with only three hyperparameters. Additionally, HyperLISTA-ABT improves the global thresholding by utilizing an adaptive blockwise thresholding scheme, which applies block-coordinate techniques and conducts thresholding in local blocks, so that weak expressions and local features can be retained in the shrinkage step layer by layer. Simulations were performed and demonstrated the effectiveness of our approach, showing that HyperLISTA-ABT achieves superior computational efficiency and with no significant performance degradation compared to state-of-the-art methods. Real data experiments showed that a high-quality 4D point cloud could be reconstructed over a large area by the proposed HyperLISTA-ABT with affordable computational resources and in a fast time. △ Less

Submitted 28 September, 2023; originally announced September 2023.

arXiv:2309.13907 [pdf, other]

HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS

Authors: Dake Guo, Xinfa Zhu, Liumeng Xue, Tao Li, Yuanjun Lv, Yuepeng Jiang, Lei Xie

Abstract: Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with high dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN… ▽ More Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with high dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node in the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we perform hierarchical supervision from acoustic prosody on each node of the graph to capture the prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech. △ Less

Submitted 6 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: Accepted by ASRU2023

arXiv:2309.12201 [pdf, other]

Electroencephalogram Sensor Data Compression Using An Asymmetrical Sparse Autoencoder With A Discrete Cosine Transform Layer

Authors: Xin Zhu, Hongyi Pan, Shuaiang Rong, Ahmet Enis Cetin

Abstract: Electroencephalogram (EEG) data compression is necessary for wireless recording applications to reduce the amount of data that needs to be transmitted. In this paper, an asymmetrical sparse autoencoder with a discrete cosine transform (DCT) layer is proposed to compress EEG signals. The encoder module of the autoencoder has a combination of a fully connected linear layer and the DCT layer to reduc… ▽ More Electroencephalogram (EEG) data compression is necessary for wireless recording applications to reduce the amount of data that needs to be transmitted. In this paper, an asymmetrical sparse autoencoder with a discrete cosine transform (DCT) layer is proposed to compress EEG signals. The encoder module of the autoencoder has a combination of a fully connected linear layer and the DCT layer to reduce redundant data using hard-thresholding nonlinearity. Furthermore, the DCT layer includes trainable hard-thresholding parameters and scaling layers to give emphasis or de-emphasis on individual DCT coefficients. Finally, the one-by-one convolutional layer generates the latent space. The sparsity penalty-based cost function is employed to keep the feature map as sparse as possible in the latent space. The latent space data is transmitted to the receiver. The decoder module of the autoencoder is designed using the inverse DCT and two fully connected linear layers to improve the accuracy of data reconstruction. In comparison to other state-of-the-art methods, the proposed method significantly improves the average quality score in various data compression experiments. △ Less

Submitted 15 September, 2023; originally announced September 2023.

arXiv:2309.11109 [pdf, other]

Self-supervised Domain-agnostic Domain Adaptation for Satellite Images

Authors: Fahong Zhang, Yilei Shi, Xiao Xiang Zhu

Abstract: Domain shift caused by, e.g., different geographical regions or acquisition conditions is a common issue in machine learning for global scale satellite image processing. A promising method to address this problem is domain adaptation, where the training and the testing datasets are split into two or multiple domains according to their distributions, and an adaptation method is applied to improve t… ▽ More Domain shift caused by, e.g., different geographical regions or acquisition conditions is a common issue in machine learning for global scale satellite image processing. A promising method to address this problem is domain adaptation, where the training and the testing datasets are split into two or multiple domains according to their distributions, and an adaptation method is applied to improve the generalizability of the model on the testing dataset. However, defining the domain to which each satellite image belongs is not trivial, especially under large-scale multi-temporal and multi-sensory scenarios, where a single image mosaic could be generated from multiple data sources. In this paper, we propose an self-supervised domain-agnostic domain adaptation (SS(DA)2) method to perform domain adaptation without such a domain definition. To achieve this, we first design a contrastive generative adversarial loss to train a generative network to perform image-to-image translation between any two satellite image patches. Then, we improve the generalizability of the downstream models by augmenting the training data with different testing spectral characteristics. The experimental results on public benchmarks verify the effectiveness of SS(DA)2. △ Less

Submitted 25 September, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

Showing 1–50 of 199 results for author: Zhu, X