-
LCM-SVC: Latent Diffusion Model Based Singing Voice Conversion with Inference Acceleration via Latent Consistency Distillation
Authors:
Shihao Chen,
Yu Gu,
Jianwei Cui,
Jie Zhang,
Rilin Chen,
Lirong Dai
Abstract:
Any-to-any singing voice conversion (SVC) aims to transfer a target singer's timbre to other songs using a short voice sample. However, many diffusion-model-based any-to-any SVC methods, despite achieving impressive results, suffer from low efficiency caused by a large number of inference steps. In this paper, we propose LCM-SVC, a latent consistency distillation (LCD) based latent diffusion model (LDM), to accelerate inference. We achieve one-step or few-step inference while maintaining high performance by distilling a pre-trained LDM-based SVC model, which has the advantages of timbre decoupling and sound quality. Experimental results show that our proposed method can significantly reduce the inference time and largely preserve the sound quality and timbre similarity compared with other state-of-the-art SVC models. Audio samples are available at https://sounddemos.github.io/lcm-svc.
Submitted 22 August, 2024;
originally announced August 2024.
-
AutoRG-Brain: Grounded Report Generation for Brain MRI
Authors:
Jiayu Lei,
Xiaoman Zhang,
Chaoyi Wu,
Lisong Dai,
Ya Zhang,
Yanyong Zhang,
Yanfeng Wang,
Weidi Xie,
Yuehua Li
Abstract:
Radiologists are tasked with interpreting a large number of images on a daily basis, with the responsibility of generating corresponding reports. This demanding workload elevates the risk of human error, potentially leading to treatment delays, increased healthcare costs, revenue loss, and operational inefficiencies. To address these challenges, we initiate a series of works on grounded Automatic Report Generation (AutoRG), starting from a brain MRI interpretation system that supports the delineation of brain structures, the localization of anomalies, and the generation of well-organized findings. We make contributions in the following aspects. First, on dataset construction, we release a comprehensive dataset encompassing segmentation masks of anomaly regions and manually authored reports, termed RadGenome-Brain MRI. This data resource is intended to catalyze ongoing research and development in the field of AI-assisted report generation systems. Second, on system design, we propose AutoRG-Brain, the first brain MRI report generation system with pixel-level grounded visual clues. Third, for evaluation, we conduct quantitative assessments and human evaluations of brain structure segmentation, anomaly localization, and report generation tasks to provide evidence of its reliability and accuracy. This system has been integrated into real clinical scenarios, where radiologists were instructed to write reports based on our generated findings and anomaly segmentation masks. The results demonstrate that our system enhances the report-writing skills of junior doctors, aligning their performance more closely with that of senior doctors, thereby boosting overall productivity.
Submitted 29 July, 2024; v1 submitted 23 July, 2024;
originally announced July 2024.
-
Coded Beam Training for RIS Assisted Wireless Communications
Authors:
Yuhao Chen,
Linglong Dai
Abstract:
Reconfigurable intelligent surface (RIS) is considered one of the key technologies for future 6G communications. To fully unleash the performance of RIS, accurate channel state information (CSI) is crucial. Beam training is widely utilized to acquire the CSI. However, before the beam is correctly aligned to establish a stable connection, the signal-to-noise ratio (SNR) at the UE is inevitably low, which reduces the beam training accuracy. To deal with this problem, we exploit the coded beam training framework for RIS systems, which leverages the error correction capability of channel coding to improve the beam training accuracy under low SNR. Specifically, we first extend the coded beam training framework to RIS systems by decoupling the base station-RIS channel and the RIS-user channel. For this framework, codewords that accurately steer to multiple angles are essential for fully unleashing the error correction capability. To realize effective codeword design in RIS systems, we then propose a new codeword design criterion, based on which we propose a relaxed Gerchberg-Saxton (GS) based codeword design scheme that accounts for the constant modulus constraints of RIS elements. In addition, considering the two-dimensional structure of RIS, we further propose a dimension-reduced encoder design scheme, which can not only guarantee a better beam shape, but also enable a stronger error correction capability. Simulation results reveal that the proposed scheme can realize effective and accurate beam training in low SNR scenarios.
Submitted 22 June, 2024;
originally announced June 2024.
-
Sparse MIMO for ISAC: New Opportunities and Challenges
Authors:
Xinrui Li,
Hongqi Min,
Yong Zeng,
Shi Jin,
Linglong Dai,
Yifei Yuan,
Rui Zhang
Abstract:
Multiple-input multiple-output (MIMO) has been a key technology of wireless communications for decades. A typical MIMO system employs antenna arrays with the inter-antenna spacing being half of the signal wavelength, which we term compact MIMO. Looking towards the future sixth-generation (6G) mobile communication networks, MIMO systems are expected to achieve even finer spatial resolution to not only enhance the spectral efficiency of wireless communications, but also enable more accurate wireless sensing. To this end, by removing the restriction of half-wavelength antenna spacing, sparse MIMO has been proposed as a new architecture that is able to significantly enlarge the array aperture compared to conventional compact MIMO with the same number of array elements. In addition, sparse MIMO leads to a new form of virtual MIMO systems for sensing, whose virtual apertures are considerably larger than their physical apertures. As sparse MIMO is expected to be a viable technology for 6G, we provide in this article a comprehensive overview of it, focusing especially on its appealing advantages for integrated sensing and communication (ISAC) towards 6G. Specifically, assorted sparse MIMO architectures are first introduced, followed by their new benefits as well as challenges. We then discuss the main design issues of sparse MIMO, including beam pattern synthesis, signal processing, grating lobe suppression, beam codebook design, and array geometry optimization. Last, we provide numerical results to evaluate the performance of sparse MIMO for ISAC and point out promising directions for future research.
Submitted 18 June, 2024;
originally announced June 2024.
-
Near-Field Wideband Beam Training Based on Distance-Dependent Beam Split
Authors:
Tianyue Zheng,
Mingyao Cui,
Zidong Wu,
Linglong Dai
Abstract:
Near-field beam training is essential for acquiring channel state information in 6G extremely large-scale multiple-input multiple-output (XL-MIMO) systems. To achieve low-overhead beam training, an existing method leverages the near-field beam split effect, deploying true-time-delay arrays to simultaneously search multiple angles of the entire angular range in a distance ring with a single pilot. However, this method still requires an exhaustive search in the distance domain, which limits its efficiency. To address the problem, we propose a distance-dependent beam-split-based beam training method to further reduce the training overhead. Specifically, we first reveal the new phenomenon of distance-dependent beam split, where, by manipulating the configurations of time delays and phase shifts, beams at different frequencies can simultaneously scan the angular domain in multiple distance rings. Leveraging this phenomenon, we propose a near-field beam training method in which different angles and distances can be searched simultaneously in one time slot. Thus, a few pilots are capable of covering the whole angle-distance space for wideband XL-MIMO. Theoretical analysis and numerical simulations are also presented to verify the superiority of the proposed method in terms of beamforming gain and training overhead.
Submitted 12 June, 2024;
originally announced June 2024.
-
LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance
Authors:
Shihao Chen,
Yu Gu,
Jie Zhang,
Na Li,
Rilin Chen,
Liping Chen,
Lirong Dai
Abstract:
Any-to-any singing voice conversion (SVC) is an interesting audio editing technique, aiming to convert the singing voice of one singer into that of another, given only a few seconds of singing data. However, during the conversion process, the issue of timbre leakage is inevitable: the converted singing voice still sounds like the original singer's voice. To tackle this, we propose a latent diffusion model for SVC (LDM-SVC) in this work, which attempts to perform SVC in the latent space using an LDM. We pretrain a variational autoencoder structure using the noted open-source So-VITS-SVC project based on the VITS framework, which is then used for the LDM training. Besides, we propose a singer guidance training method based on classifier-free guidance to further suppress the timbre of the original singer. Experimental results show the superiority of the proposed method over previous works in both subjective and objective evaluations of timbre similarity.
Submitted 7 June, 2024;
originally announced June 2024.
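The singer guidance described in the LDM-SVC abstract above builds on classifier-free guidance. As a rough illustration only, and not the authors' implementation, the sketch below shows the standard classifier-free guidance blend of a conditional and an unconditional noise prediction at one sampling step. The function name `guided_prediction`, the plain-list inputs, and the guidance scale `w` are all hypothetical; the convention assumed here is `uncond + w * (cond - uncond)`, so `w = 1` recovers the conditional prediction.

```python
def guided_prediction(cond, uncond, w):
    """Classifier-free guidance blend of two noise predictions.

    cond   -- prediction with the conditioning (here: the target singer) present
    uncond -- prediction with the conditioning dropped
    w      -- guidance scale (hypothetical parameter); w = 1 returns cond
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    # Move from the unconditional prediction towards (and past) the conditional one.
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


# Toy vectors standing in for model outputs (assumed values, not real data).
print(guided_prediction([1.0, 2.0], [0.0, 0.0], 2.0))
```

Larger values of `w` push the sample further towards the conditioning signal, which is the general mechanism a guidance method of this kind can use to suppress the source singer's timbre.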
-
MIMO Capacity Analysis and Channel Estimation for Electromagnetic Information Theory
Authors:
Jieao Zhu,
Vincent Y. F. Tan,
Linglong Dai
Abstract:
Electromagnetic information theory (EIT) is an interdisciplinary subject that serves to integrate deterministic electromagnetic theory with stochastic Shannon information theory. Existing EIT analysis operates in the continuous space domain, which is not aligned with practical algorithms working in the discrete space domain. This mismatch makes it significantly difficult to apply EIT methodologies to practical discrete space systems, which we call the discrete-continuous gap in this paper. To bridge this gap, we establish the discrete-continuous correspondence with a prolate spheroidal wave function (PSWF)-based ergodic capacity analysis framework. Specifically, we state and prove some discrete-continuous correspondence lemmas to establish a firm theoretical connection between discrete information-theoretic quantities and their continuous counterparts. With these lemmas, we apply the PSWF ergodic capacity bound to advanced MIMO architectures such as continuous-aperture MIMO (CAP-MIMO) and extremely large-scale MIMO (XL-MIMO). From this PSWF capacity bound, we discover the capacity saturation phenomenon both theoretically and empirically. Although the growth of MIMO performance is fundamentally limited in this EIT-based analysis framework, we reveal new opportunities in MIMO channel estimation by exploiting the EIT knowledge about the channel. Inspired by the PSWF capacity bound, we utilize continuous PSWFs to improve the pilot design of discrete MIMO channel estimators, yielding what we call the PSWF channel estimator (PSWF-CE). Simulation results demonstrate improved performance of the proposed PSWF-CE compared to traditional minimum mean squared error (MMSE) and compressed sensing-based estimators.
Submitted 7 June, 2024;
originally announced June 2024.
-
Hierarchical Reinforcement Learning Empowered Task Offloading in V2I Networks
Authors:
Xinyu You,
Haojie Yan,
Yuedong Xu,
Lifeng Wang,
Liangui Dai
Abstract:
Edge computing plays an essential role in vehicle-to-infrastructure (V2I) networks, where vehicles offload their intensive computation tasks to road-side units to save energy and reduce latency. This paper designs the optimal task offloading policy to address the concerns of processing delay, energy consumption, and edge computing cost. Each computation task, consisting of several interdependent sub-tasks, is characterized as a directed acyclic graph (DAG). For such dynamic networks, a novel hierarchical offloading scheme is proposed by leveraging deep reinforcement learning (DRL). The inter-dependencies among the DAGs of the computation tasks are extracted using a graph neural network with an attention mechanism. A parameterized DRL algorithm is developed to deal with the hierarchical action space containing both discrete and continuous actions. Simulation results with a real-world car speed dataset demonstrate that the proposed scheme can effectively reduce the system overhead.
Submitted 18 May, 2024;
originally announced May 2024.
-
Electromagnetic Information Theory for Holographic MIMO Communications
Authors:
Li Wei,
Tierui Gong,
Chongwen Huang,
Zhaoyang Zhang,
Wei E. I. Sha,
Zhi Ning Chen,
Linglong Dai,
Merouane Debbah,
Chau Yuen
Abstract:
Holographic multiple-input multiple-output (HMIMO) utilizes a compact antenna array to form a nearly continuous aperture, thereby enabling higher capacity and more flexible configurations than conventional MIMO systems, which makes it attractive in current scientific research. Key questions naturally arise regarding the potential of HMIMO to surpass Shannon's theoretical limits and how far its capabilities can be extended. However, traditional Shannon information theory falls short in addressing these questions because it focuses only on the information itself while neglecting the underlying carrier, electromagnetic (EM) waves, and environmental interactions. To bridge the gap between theoretical analysis and practical application for HMIMO systems, we introduce electromagnetic information theory (EIT) in this paper. This paper begins by laying the foundation for HMIMO-oriented EIT, encompassing EM wave equations and communication regions. In the context of HMIMO systems, the resultant physical limitations are presented, involving Chu's limit, Harrington's limit, Hannan's limit, and the evaluation of coupling effects. Field sampling and HMIMO-assisted oversampling are also discussed to guide the optimal HMIMO design within the EIT framework. To comprehensively depict the EM-compliant propagation process, we present the approximate and exact channel modeling approaches in near-/far-field zones. Furthermore, we discuss both traditional Shannon information theory, which employs the probabilistic method, and Kolmogorov information theory, which utilizes functional analysis, for HMIMO-oriented EIT systems.
Submitted 25 May, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Near-Optimal Channel Estimation for Dense Array Systems
Authors:
Mingyao Cui,
Zijian Zhang,
Linglong Dai,
Kaibin Huang
Abstract:
By deploying a large number of antennas with sub-half-wavelength spacing in a compact space, dense array systems (DASs) can fully unleash the multiplexing-and-diversity gains of limited apertures. To acquire these gains, accurate channel state information acquisition is necessary but challenging due to the large number of antennas. To overcome this obstacle, this paper reveals that exploiting the high spatial correlation of DAS channels is crucial when designing the observation matrix for optimal/near-optimal channel estimation. First, we prove that the observation matrix design is equivalent to a time-domain duality of multiple-input multiple-output precoding, which can be ideally addressed by the water-filling principle. For practical realizations, a novel ice-filling algorithm is proposed to design amplitude-and-phase controllable observation matrices, and a majorization-minimization algorithm is proposed to address the phase-only controllable case. In particular, we prove that the ice-filling algorithm can be viewed as a "quantized" water-filling algorithm. To support the sub-optimality of the proposed designs, we provide comprehensive analyses of the achievable mean square errors and their asymptotic expressions. Finally, numerical simulations verify that our proposed channel estimation designs can achieve near-optimal performance and outperform existing approaches significantly.
Submitted 10 April, 2024;
originally announced April 2024.
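The abstract above relies on the water-filling principle. For readers unfamiliar with it, the following is a minimal sketch of classical water-filling power allocation across parallel channels, offered under stated assumptions; it is not the paper's observation matrix design nor its ice-filling algorithm. The function name `water_filling` and the inputs `noise_levels` (per-channel noise-to-gain ratios) and `total_power` are hypothetical. Each channel receives `max(mu - n_i, 0)`, where the water level `mu` is chosen so the allocations sum to the power budget.

```python
def water_filling(noise_levels, total_power):
    """Classical water-filling: p_i = max(mu - n_i, 0) with sum(p_i) = total_power.

    noise_levels -- per-channel noise-to-gain ratios n_i (assumed non-negative)
    total_power  -- total power budget (assumed non-negative)
    """
    ordered = sorted(noise_levels)
    mu = 0.0
    # Try filling the k best channels, dropping the worst one until the
    # water level rises above every channel that is still in use.
    for k in range(len(ordered), 0, -1):
        mu = (total_power + sum(ordered[:k])) / k
        if mu > ordered[k - 1]:
            break
    return [max(mu - n, 0.0) for n in noise_levels]


# Toy example: three channels sharing a power budget of 3.
print(water_filling([1.0, 2.0, 5.0], 3.0))
```

In this toy case the noisiest channel lies above the water level and receives no power, which is the characteristic behaviour of the principle.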
-
CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation
Authors:
Yongrui Yu,
Hanyu Chen,
Zitian Zhang,
Qiong Xiao,
Wenhui Lei,
Linrui Dai,
Yu Fu,
Hui Tan,
Guan Wang,
Peng Gao,
Xiaofan Zhang
Abstract:
Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node generation and the nnU-Net model for lymph node segmentation to improve the segmentation performance of abdominal lymph nodes through synthesizing a diversity of realistic abdominal lymph node data. We propose LN-DDPM, a conditional denoising diffusion probabilistic model (DDPM) for lymph node (LN) generation. LN-DDPM utilizes lymph node masks and anatomical structure masks as model conditions. These conditions work in two conditioning mechanisms: global structure conditioning and local detail conditioning, to distinguish between lymph nodes and their surroundings and better capture lymph node characteristics. The obtained paired abdominal lymph node images and masks are used for the downstream segmentation task. Experimental results on the abdominal lymph node datasets demonstrate that LN-DDPM outperforms other generative methods in the abdominal lymph node image synthesis and better assists the downstream abdominal lymph node segmentation task.
Submitted 26 March, 2024;
originally announced March 2024.
-
Holography inspired self-controlled reconfigurable intelligent surface
Authors:
Jieao Zhu,
Ze Gu,
Qian Ma,
Linglong Dai,
Tie Jun Cui
Abstract:
Among various promising candidate technologies for the sixth-generation (6G) wireless communications, recent advances in microwave metasurfaces have sparked a new research area of reconfigurable intelligent surfaces (RISs). By controllably reprogramming the wireless propagation channel, RISs are envisioned to achieve low-cost wireless capacity boosting, coverage extension, and enhanced energy efficiency. To reprogram the channel, each meta-atom on RIS needs an external control signal, which is usually generated by base station (BS). However, BS-controlled RISs require complicated control cables, which hamper their massive deployments. Here, we eliminate the need for BS control by proposing a self-controlled RIS (SC-RIS), which is inspired by the optical holography principle. Different from the existing BS-controlled RISs, each meta-atom of SC-RIS is integrated with an additional power detector for holographic recording. By applying the classical Fourier-transform processing to the measured hologram, SC-RIS is capable of retrieving the user's channel state information required for beamforming, thus enabling autonomous RIS beamforming without control cables. Owing to this WiFi-like plug-and-play capability without the BS control, SC-RISs are expected to enable easy and massive deployments in the future 6G systems.
Submitted 24 March, 2024;
originally announced March 2024.
-
Near-Field Channel Modeling for Electromagnetic Information Theory
Authors:
Zhongzhichao Wan,
Jieao Zhu,
Linglong Dai
Abstract:
Electromagnetic information theory (EIT) is one of the emerging topics for 6G communication due to its potential to reveal the performance limit of wireless communication systems. For EIT, the research foundation is reasonable and accurate channel modeling. Existing channel modeling works for EIT in non-line-of-sight (NLoS) scenarios focus on far-field modeling, which cannot accurately capture the characteristics of the channel in the near field. In this paper, we propose a near-field channel model for EIT based on electromagnetic scattering theory. We model the channel using non-stationary Gaussian random fields and derive the analytical expression of the correlation function of the fields. Furthermore, we analyze the characteristics of the proposed channel model, e.g., the channel degrees of freedom (DoF). Finally, we design a channel estimation scheme for near-field scenarios by integrating the electromagnetic prior information of the proposed model. Numerical analysis verifies the correctness of the proposed scheme and shows that it can outperform existing schemes such as least squares (LS) and orthogonal matching pursuit (OMP).
Submitted 26 May, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
GuideGen: A Text-guided Framework for Joint CT Volume and Anatomical structure Generation
Authors:
Linrui Dai,
Rongzhao Zhang,
Zhongzhen Huang,
Xiaofan Zhang
Abstract:
The annotation burden and extensive labor for gathering a large medical dataset with images and corresponding labels are rarely cost-effective and highly intimidating. This results in a lack of abundant training data that undermines downstream tasks and partially contributes to the challenge image analysis faces in the medical field. As a workaround, given the recent success of generative neural models, it is now possible to synthesize image datasets at high fidelity guided by external constraints. This paper explores this possibility and presents GuideGen: a pipeline that jointly generates CT images and tissue masks for abdominal organs and colorectal cancer conditioned on a text prompt. First, we introduce a Volumetric Mask Sampler to fit the discrete distribution of mask labels and generate low-resolution 3D tissue masks. Second, our Conditional Image Generator autoregressively generates CT slices conditioned on a corresponding mask slice to incorporate both style information and anatomical guidance. This pipeline guarantees high fidelity and variability as well as exact alignment between generated CT volumes and tissue masks. Both qualitative and quantitative experiments on 3D abdominal CTs demonstrate the high performance of our proposed pipeline, thereby showing that our method can serve as a dataset generator and provide potential benefits to downstream tasks. It is hoped that our work will offer a promising solution for the multimodal generation of CT and its anatomical mask. Our source code is publicly available at https://github.com/OvO1111/JointImageGeneration.
Submitted 11 March, 2024;
originally announced March 2024.
-
Electromagnetic Hybrid Beamforming for Holographic Communications
Authors:
Ran Ji,
Chongwen Huang,
Xiaoming Chen,
Wei E. I. Sha,
Linglong Dai,
Jiguang He,
Zhaoyang Zhang,
Chau Yuen,
Mérouane Debbah
Abstract:
It is well known that there is inherent radiation pattern distortion for the commercial base station antenna array, which usually needs three antenna sectors to cover the whole space. To eliminate pattern distortion and further enhance beamforming performance, we propose an electromagnetic hybrid beamforming (EHB) scheme based on a three-dimensional (3D) superdirective holographic antenna array. Specifically, EHB consists of antenna excitation current vectors (analog beamforming) and digital precoding matrices, where the implementation of analog beamforming involves the real-time adjustment of the radiation pattern to adapt it to the dynamic wireless environment. Meanwhile, the digital beamforming is optimized based on the channel characteristics of analog beamforming to further improve the achievable rate of communication systems. An electromagnetic channel model incorporating array radiation patterns and the mutual coupling effect is also developed to evaluate the benefits of our proposed scheme. Simulation results demonstrate that our proposed EHB scheme with a 3D holographic array achieves a relatively flat superdirective beamforming gain and allows for programmable focusing directions throughout the entire spatial domain. Furthermore, they also verify that the proposed scheme achieves a sum rate gain of over 150% compared to traditional beamforming algorithms.
Submitted 9 March, 2024;
originally announced March 2024.
-
Successive Bayesian Reconstructor for FAS Channel Estimation
Authors:
Zijian Zhang,
Jieao Zhu,
Linglong Dai,
Robert W. Heath Jr
Abstract:
Fluid antenna systems (FASs) can reconfigure their locations freely within a spatially continuous space. To keep favorable antenna positions, channel state information (CSI) acquisition for FASs is essential. While some techniques have been proposed, most existing FAS channel estimators require several channel assumptions, such as slow variation and angular-domain sparsity. When these assumptions are not reasonable, the model mismatch may lead to unpredictable performance loss. In this paper, we propose the successive Bayesian reconstructor (S-BAR) as a general solution to estimate FAS channels. Unlike model-based estimators, the proposed S-BAR is prior-aided and builds an experiential kernel for CSI acquisition. Inspired by Bayesian regression, the key idea of S-BAR is to model the FAS channels as a stochastic process, whose uncertainty can be successively eliminated by kernel-based sampling and regression. In this way, the predictive mean of the regressed stochastic process can be viewed as the maximum a posteriori (MAP) estimator of FAS channels. Simulation results verify that, in both model-mismatched and model-matched cases, the proposed S-BAR can achieve higher estimation accuracy than the existing schemes.
Submitted 4 February, 2024;
originally announced February 2024.
-
Adversarial speech for voice privacy protection from Personalized Speech generation
Authors:
Shihao Chen,
Liping Chen,
Jie Zhang,
KongAik Lee,
Zhenhua Ling,
Lirong Dai
Abstract:
The rapid progress in personalized speech generation technology, including personalized text-to-speech (TTS) and voice conversion (VC), makes it challenging for human listeners to distinguish between generated and real speech, resulting in an urgent demand for protecting speakers' voices from malicious misuse. In this regard, we propose a speaker protection method based on adversarial attacks. The proposed method perturbs speech signals by minimally altering the original speech while rendering downstream speech generation models unable to accurately generate the voice of the target speaker. For validation, we employ the open-source pre-trained YourTTS model for speech generation and protect the target speaker's speech in the white-box scenario. Automatic speaker verification (ASV) evaluations were carried out on the generated speech to assess the voice protection capability. Our experimental results show that we successfully perturbed the speaker encoder of the YourTTS model using the gradient-based I-FGSM adversarial perturbation method. Furthermore, the adversarial perturbation is effective in preventing the YourTTS model from generating the speech of the target speaker. Audio samples can be found at https://voiceprivacy.github.io/Adeversarial-Speech-with-YourTTS.
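As a rough illustration of the gradient-based I-FGSM step named in this abstract, the sketch below repeatedly moves an input along the sign of a loss gradient and clips the result to an L-infinity ball around the original signal. Everything specific here is an assumption made for illustration: the linear map `W` is a toy stand-in for a speaker encoder (it is not YourTTS), the loss is the squared distance between the perturbed and original embeddings, and the small random start is added only so that this particular loss has a non-zero gradient at the first step.

```python
import numpy as np

def ifgsm(x, grad_fn, eps=0.01, alpha=0.002, steps=10, seed=0):
    """Iterative sign-gradient ascent, kept inside an L-inf ball of radius eps."""
    rng = np.random.default_rng(seed)
    # Assumption: tiny random start, because the toy loss below has zero
    # gradient exactly at the clean input.
    x_adv = x + rng.uniform(-alpha, alpha, size=x.shape)
    for _ in range(steps):
        # Step in the direction that increases the loss.
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        # Project back so that |x_adv - x| <= eps element-wise.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# Hypothetical toy "speaker encoder": a fixed random linear map.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 64))
x = rng.standard_normal(64)      # stand-in for a clean speech segment
e_ref = W @ x                    # embedding of the clean input

def loss(x_in):
    # Squared distance between perturbed and clean embeddings.
    return float(np.sum((W @ x_in - e_ref) ** 2))

def grad(x_in):
    # Analytic gradient of the loss with respect to the input.
    return 2.0 * W.T @ (W @ x_in - e_ref)

x_adv = ifgsm(x, grad)
```

With a real model, `grad` would come from automatic differentiation through the encoder rather than from a closed-form expression.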
Submitted 22 January, 2024;
originally announced January 2024.
-
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Authors:
Qiushi Zhu,
Jie Zhang,
Yu Gu,
Yuchen Hu,
Lirong Dai
Abstract:
Self-supervised speech pre-training methods have developed rapidly in recent years and have been shown to be very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing suffers from the scarcity of labeled multichannel data and from complex ambient noise. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. First, we propose a multi-path structure to process multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the spatiotemporal information in multichannel speech data. Second, based on contrastive learning, we use additional single-channel audio data, which is trained jointly to improve the performance of speech representation. Finally, we use a Chinese multichannel multi-modal dataset in real scenarios to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.
Submitted 7 January, 2024;
originally announced January 2024.
-
Coded Beam Training
Authors:
Tianyue Zheng,
Jieao Zhu,
Qiumo Yu,
Yongli Yan,
Linglong Dai
Abstract:
In extremely large-scale multiple input multiple output (XL-MIMO) systems for future sixth-generation (6G) communications, codebook-based beam training stands out as a promising technology to acquire channel state information (CSI). Despite their effectiveness, when the pilot overhead is limited, existing beam training methods suffer from significant achievable rate degradation for remote users with low signal-to-noise ratio (SNR). To tackle this challenge, leveraging the error-correcting capability of channel codes, we introduce channel coding theory into hierarchical beam training to extend the coverage area. Specifically, we establish the duality between hierarchical beam training and channel coding, and the proposed coded beam training scheme serves as a general framework. Then, we present two specific implementations exemplified by coded beam training methods based on Hamming codes and convolutional codes, during which the beam encoding and decoding processes are refined respectively to better accommodate the beam training problem. Simulation results have demonstrated that the proposed coded beam training method can enable reliable beam training performance for remote users with low SNR while keeping training overhead low.
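To make the error-correcting capability mentioned in this abstract concrete, here is a minimal sketch of a standard Hamming(7,4) code that recovers four data bits when any one of the seven received bits is flipped. It shows only the channel-coding side; how bits are mapped to beams and training layers is not described in the abstract, so reading a flipped bit as "one unreliable beam decision" is an illustrative assumption, and the function names are hypothetical.

```python
def hamming74_encode(d):
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4] (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(r):
    """Correct at most one flipped bit, then return the 4 data bits."""
    r = list(r)
    s1 = r[0] ^ r[2] ^ r[4] ^ r[6]
    s2 = r[1] ^ r[2] ^ r[5] ^ r[6]
    s3 = r[3] ^ r[4] ^ r[5] ^ r[6]
    # The syndrome, read as a binary number, is the 1-based error position.
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        r[pos - 1] ^= 1
    return [r[2], r[4], r[5], r[6]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[4] ^= 1                      # one unreliable decision (illustrative)
recovered = hamming74_decode(code)
```

The three extra parity bits are the price paid for tolerating one wrong decision, which mirrors the trade-off between training overhead and reliability that the abstract refers to.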
Submitted 6 March, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Successive Bayesian Reconstructor for Channel Estimation in Fluid Antenna Systems
Authors:
Zijian Zhang,
Jieao Zhu,
Linglong Dai,
Robert W. Heath Jr
Abstract:
Fluid antenna systems (FASs) can reconfigure their antenna locations freely within a spatially continuous space. To keep favorable antenna positions, the channel state information (CSI) acquisition for FASs is essential. While some techniques have been proposed, most existing FAS channel estimators require several channel assumptions, such as slow variation and angular-domain sparsity. When these assumptions are not reasonable, the model mismatch may lead to unpredictable performance loss. In this paper, we propose the successive Bayesian reconstructor (S-BAR) as a general solution to estimate FAS channels. Unlike model-based estimators, the proposed S-BAR is prior-aided, which builds the experiential kernel for CSI acquisition. Inspired by Bayesian regression, the key idea of S-BAR is to model the FAS channels as a stochastic process, whose uncertainty can be successively eliminated by kernel-based sampling and regression. In this way, the predictive mean of the regressed stochastic process can be viewed as the maximum a posteriori (MAP) estimator of FAS channels. Simulation results verify that, in both model-mismatched and model-matched cases, the proposed S-BAR can achieve higher estimation accuracy than the existing schemes.
Submitted 17 January, 2024; v1 submitted 11 December, 2023;
originally announced December 2023.
-
MD-IQA: Learning Multi-scale Distributed Image Quality Assessment with Semi Supervised Learning for Low Dose CT
Authors:
Tao Song,
Ruizhi Hou,
Lisong Dai,
Lei Xiang
Abstract:
Image quality assessment (IQA) plays a critical role in optimizing radiation dose and developing novel medical imaging techniques in computed tomography (CT). Traditional IQA methods relying on hand-crafted features have limitations in summarizing the subjective perceptual experience of image quality. Recent deep learning-based approaches have demonstrated strong modeling capabilities and potential for medical IQA, but challenges remain regarding model generalization and perceptual accuracy. In this work, we propose a multi-scale distributions regression approach to predict quality scores by constraining the output distribution, thereby improving model generalization. Furthermore, we design a dual-branch alignment network to enhance feature extraction capabilities. Additionally, semi-supervised learning is introduced by utilizing pseudo-labels for unlabeled data to guide model training. Extensive qualitative experiments demonstrate the effectiveness of our proposed method for advancing the state-of-the-art in deep learning-based medical IQA. Code is available at: https://github.com/zunzhumu/MD-IQA.
Submitted 14 November, 2023;
originally announced November 2023.
-
Enhancing Energy Efficiency for Reconfigurable Intelligent Surfaces with Practical Power Models
Authors:
Zhiyi Li,
Jida Zhang,
Jieao Zhu,
Shi Jin,
Linglong Dai
Abstract:
Reconfigurable intelligent surfaces (RISs) are widely considered a promising technology for future wireless communication systems. As an important indicator of RIS-assisted communication systems in green wireless communications, energy efficiency (EE) has recently received intensive research interest as an optimization target. However, most previous works have ignored the different power consumption between ON and OFF states of the PIN diodes attached to each RIS element. This oversight results in extensive unnecessary power consumption and reduction of actual EE due to the inaccurate power model. To address this issue, in this paper, we first utilize a practical power model for a RIS-assisted multi-user multiple-input single-output (MU-MISO) communication system, which takes into account the difference in power dissipation caused by ON-OFF states of RIS's PIN diodes. Based on this model, we formulate a more accurate EE optimization problem. However, this problem is non-convex and has mixed-integer properties, which poses a challenge for optimization. To solve the problem, an effective alternating optimization (AO) algorithm framework is utilized to optimize the base station and RIS beamforming precoder separately. To obtain the essential RIS beamforming precoder, we develop two effective methods based on maximum gradient search and SDP relaxation respectively. Theoretical analysis shows the exponential complexity of the original problem has been reduced to polynomial complexity. Simulation results demonstrate that the proposed algorithm outperforms the existing ones, leading to a significant increase in EE across a diverse set of scenarios.
Submitted 24 October, 2023;
originally announced October 2023.
-
Can Electromagnetic Information Theory Improve Wireless Systems? A Channel Estimation Example
Authors:
Jieao Zhu,
Zhongzhichao Wan,
Linglong Dai,
Tie Jun Cui
Abstract:
Electromagnetic information theory (EIT) is an emerging interdisciplinary subject that integrates classical Maxwell electromagnetics and Shannon information theory. The goal of EIT is to uncover the information transmission mechanisms from an electromagnetic (EM) perspective in wireless systems. Existing works on EIT are mainly focused on the analysis of EM channel characteristics, degrees-of-freedom, and system capacity. However, these works do not clarify whether EIT can improve wireless communication systems. To fill this gap, in this paper, we provide a novel example showing that EIT can improve the performance of classical minimum mean squared error (MMSE) channel estimators by replacing the channel covariance matrix with an EM correlation function (EMCF). Specifically, by averaging the solutions of Maxwell's equations over a tunable angular distribution, we obtain a spatio-temporal correlation function (STCF) of the EM channel, which we name the EMCF. Since classical MMSE estimators can exploit prior information contained in the channel covariance matrix, the substitution of EMCF for the covariance matrix introduces EM side information into MMSE estimators. Furthermore, we dynamically tune the EMCF parameters to better fit the channel observations. Simulation results show that the proposed EIT-MMSE channel estimator outperforms traditional MMSE estimators, thus proving that EIT is beneficial to wireless communication systems.
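The role that a covariance matrix plays in a classical MMSE channel estimator can be sketched in a few lines. For an observation y = h + n with prior covariance R and noise variance sigma^2, the linear MMSE estimate is R (R + sigma^2 I)^{-1} y. The example below is a sketch under stated assumptions, not the paper's EIT-MMSE estimator: the sinc-shaped correlation (the textbook result for isotropic 3D scattering, here with quarter-wavelength spacing) is only a placeholder for whatever correlation function supplies the prior, and the array size, noise level, and trial count are arbitrary.

```python
import numpy as np

def mmse_estimate(y, R, noise_var):
    """Linear MMSE estimate of h from y = h + n, given prior covariance R."""
    n = R.shape[0]
    # Solve (R + noise_var * I) z = y rather than forming an explicit inverse.
    return R @ np.linalg.solve(R + noise_var * np.eye(n), y)

rng = np.random.default_rng(0)
N, noise_var, trials = 16, 0.5, 200

# Placeholder prior: sinc correlation between elements spaced a quarter
# wavelength apart (np.sinc(x) = sin(pi x) / (pi x)).
idx = np.arange(N)
R = np.sinc(0.5 * np.abs(idx[:, None] - idx[None, :]))

# Matrix square root of R, used to draw channels with covariance R.
eigvals, eigvecs = np.linalg.eigh(R)
sqrt_R = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

err_mmse = 0.0
err_raw = 0.0
for _ in range(trials):
    w = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    h = sqrt_R @ w                                   # channel with covariance R
    n = np.sqrt(noise_var / 2) * (rng.standard_normal(N)
                                  + 1j * rng.standard_normal(N))
    y = h + n                                        # noisy observation
    err_mmse += np.mean(np.abs(mmse_estimate(y, R, noise_var) - h) ** 2)
    err_raw += np.mean(np.abs(y - h) ** 2)           # using y directly

err_mmse /= trials
err_raw /= trials
```

Comparing `err_mmse` with `err_raw` shows how much the prior helps when it matches the true channel statistics; a mismatched `R` would shrink that gain, which is the motivation the abstract gives for tuning the correlation function to the observations.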
Submitted 6 February, 2024; v1 submitted 18 October, 2023;
originally announced October 2023.
-
DISCO Might Not Be Funky: Random Intelligent Reflective Surface Configurations That Attack
Authors:
Huan Huang,
Lipeng Dai,
Hongliang Zhang,
Chongfu Zhang,
Zhongxing Tian,
Yi Cai,
A. Lee Swindlehurst,
Zhu Han
Abstract:
Emerging intelligent reflective surfaces (IRSs) significantly improve system performance, but also pose a significant risk for physical layer security (PLS). Unlike the extensive research on legitimate IRS-enhanced communications, in this article we present an adversarial IRS-based fully-passive jammer (FPJ). We describe typical application scenarios for Disco IRS (DIRS)-based FPJ, where an illegitimate IRS with random, time-varying reflection properties acts like a "disco ball" to randomly change the propagation environment. We introduce the principles of DIRS-based FPJ and overview existing investigations of the technology, including a design example employing one-bit phase shifters. The DIRS-based FPJ can be implemented without either jamming power or channel state information (CSI) for the legitimate users (LUs). It does not suffer from the energy constraints of traditional active jammers, nor does it require any knowledge of the LU channels. In addition to the proposed jamming attack, we also propose an anti-jamming strategy that requires only statistical rather than instantaneous CSI. Furthermore, we present a data frame structure that enables the legitimate access point (AP) to estimate the DIRS-jammed channels' statistical characteristics in the presence of the DIRS jamming. Typical cases are discussed to show the impact of the DIRS-based FPJ and the feasibility of the anti-jamming precoder (AJP). Moreover, we outline future research directions and challenges for the DIRS-based FPJ and its anti-jamming precoding to stimulate this line of research and pave the way for practical applications.
Submitted 10 June, 2024; v1 submitted 1 October, 2023;
originally announced October 2023.
-
Toward Beamfocusing-Aided Near-Field Communications: Research Advances, Potential, and Challenges
Authors:
Jiancheng An,
Chau Yuen,
Linglong Dai,
Marco Di Renzo,
Merouane Debbah,
Lajos Hanzo
Abstract:
Next-generation mobile networks promise to support high throughput, massive connectivity, and improved energy efficiency. To achieve these ambitious goals, extremely large-scale antenna arrays (ELAAs) and terahertz communications constitute a pair of promising technologies. This will result in future wireless communications occurring in the near-field regions. To accurately portray the channel characteristics of near-field wireless propagation, spherical wavefront-based models are required, which present both opportunities and challenges. Following the basics of near-field communications (NFC), we contrast it with conventional far-field communications. Moreover, we cover the key challenges of NFC, including its channel modeling and estimation, near-field beamfocusing, as well as hardware design. Our numerical results demonstrate the potential of NFC in improving the spatial multiplexing gain and positioning accuracy. Finally, a suite of open issues is identified to motivate future research.
Submitted 17 September, 2023;
originally announced September 2023.
-
Anti-Jamming Precoding Against Disco Intelligent Reflecting Surfaces Based Fully-Passive Jamming Attacks
Authors:
Huan Huang,
Lipeng Dai,
Hongliang Zhang,
Zhongxing Tian,
Yi Cai,
Chongfu Zhang,
A. Lee Swindlehurst,
Zhu Han
Abstract:
Emerging intelligent reflecting surfaces (IRSs) significantly improve system performance, but also pose a huge risk for physical layer security. Existing works have illustrated that a disco IRS (DIRS), i.e., an illegitimate IRS with random time-varying reflection properties (like a "disco ball"), can be employed by an attacker to actively age the channels of legitimate users (LUs). Such active channel aging (ACA) generated by the DIRS can be employed to jam multi-user multiple-input single-output (MU-MISO) systems without relying on either jamming power or LU channel state information (CSI). To address the significant threats posed by DIRS-based fully-passive jammers (FPJs), an anti-jamming precoder is proposed that requires only the statistical characteristics of the DIRS-based ACA channels instead of their CSI. The statistical characteristics of DIRS-jammed channels are first derived, and then the anti-jamming precoder is derived based on the statistical characteristics. Furthermore, we prove that the anti-jamming precoder can achieve the maximum signal-to-jamming-plus-noise ratio (SJNR). To acquire the ACA statistics without changing the system architecture or cooperating with the illegitimate DIRS, we design a data frame structure that the legitimate access point (AP) can use to estimate the statistical characteristics. During the designed data frame, the LUs only need to feed back their received power to the legitimate AP when they detect jamming attacks. Numerical results are also presented to evaluate the effectiveness of the proposed anti-jamming precoder against the DIRS-based FPJs and the feasibility of the designed data frame used by the legitimate AP to estimate the statistical characteristics.
Submitted 24 January, 2024; v1 submitted 29 August, 2023;
originally announced August 2023.
-
Rep2wav: Noise Robust text-to-speech Using self-supervised representations
Authors:
Qiushi Zhu,
Yu Gu,
Rilin Chen,
Chao Weng,
Yuchen Hu,
Lirong Dai,
Jie Zhang
Abstract:
Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. The data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are often trained using the enhanced speech, which thus suffer from speech distortion and background noise that affect the quality of the synthesized speech. Meanwhile, it was shown that self-supervised pre-trained models exhibit excellent noise robustness on many speech tasks, implying that the learned representation has a better tolerance for noise perturbations. In this work, we therefore explore pre-trained models to improve the noise robustness of TTS models. Based on HiFi-GAN, we first propose a representation-to-waveform vocoder, which aims to learn to map the representation of pre-trained models to the waveform. We then propose a text-to-representation FastSpeech2 model, which aims to learn to map text to pre-trained model representations. Experimental results on the LJSpeech and LibriTTS datasets show that our method outperforms those using speech enhancement methods in both subjective and objective metrics. Audio samples are available at: https://zqs01.github.io/rep2wav.
Submitted 3 September, 2023; v1 submitted 28 August, 2023;
originally announced August 2023.
-
Continuous-Time Channel Prediction Based on Tensor Neural Ordinary Differential Equation
Authors:
Mingyao Cui,
Hao Jiang,
Yuhao Chen,
Yang Du,
Linglong Dai
Abstract:
Channel prediction is critical to address the channel aging issue in mobile scenarios. Existing channel prediction techniques are mainly designed for discrete channel prediction, which can only predict the future channel in a fixed time slot per frame, while the other intra-frame channels are usually recovered by interpolation. However, these approaches suffer from a serious interpolation loss, especially for mobile millimeter wave communications. To solve this challenging problem, we propose a tensor neural ordinary differential equation (TN-ODE) based continuous-time channel prediction scheme to realize the direct prediction of intra-frame channels. Specifically, inspired by the recently developed continuous mapping model named neural ODE in the field of machine learning, we first utilize the neural ODE model to predict future continuous-time channels. To improve the channel prediction accuracy and reduce computational complexity, we then propose the TN-ODE scheme to learn the structural characteristics of the high-dimensional channel through a low-dimensional learnable transform. Simulation results show that the proposed scheme is able to achieve higher intra-frame channel prediction accuracy than existing schemes.
Submitted 31 July, 2023;
originally announced July 2023.
-
Robust Weighted Sum-Rate Maximization for Transmissive RIS Transmitter Enabled RSMA Networks
Authors:
Bojiang Li,
Wen Chen,
Zhendong Li,
Qingqing Wu,
Nan Cheng,
Changle Li,
Linglong Dai
Abstract:
Motivated by the low power consumption and low cost of the transmissive reconfigurable intelligent surface (RIS), in this paper we propose a downlink multi-user rate-splitting multiple access (RSMA) architecture based on the transmissive RIS transmitter, where the channel state information (CSI) is only partially acquired. We investigate the weighted sum-rate maximization problem by jointly optimizing the power, RIS transmissive coefficients and common rate allocated to each user. Due to the coupling of optimization variables, the problem is nonconvex, and it is difficult to directly obtain the optimal solution. Hence, a block coordinate descent (BCD) algorithm based on sample average approximation (SAA) and weighted minimum mean square error (WMMSE) is proposed to tackle it. Numerical results illustrate that the transmissive RIS transmitter with rate-splitting architecture has advantages over conventional space division multiple access (SDMA) and non-orthogonal multiple access (NOMA).
Submitted 23 July, 2023;
originally announced July 2023.
-
Near-Field Beam Management for Extremely Large-Scale Array Communications
Authors:
Changsheng You,
Yunpu Zhang,
Chenyu Wu,
Yong Zeng,
Beixiong Zheng,
Li Chen,
Linglong Dai,
A. Lee Swindlehurst
Abstract:
Extremely large-scale arrays (XL-arrays) have emerged as a promising technology to achieve super-high spectral efficiency and spatial resolution in future wireless systems. The large aperture of XL-arrays means that spherical rather than planar wavefronts must be considered, and a paradigm shift from far-field to near-field communications is necessary. Unlike existing works that have mainly considered far-field beam management, we study the new near-field beam management for XL-arrays. We first provide an overview of near-field communications and introduce various applications of XL-arrays in both outdoor and indoor scenarios. Then, three typical near-field beam management methods for XL-arrays are discussed: near-field beam training, beam tracking, and beam scheduling. We point out their main design issues and propose promising solutions to address them. Moreover, other important directions in near-field communications are also highlighted to motivate future research.
Submitted 28 June, 2023;
originally announced June 2023.
-
On the Role of ViT and CNN in Semantic Communications: Analysis and Prototype Validation
Authors:
Hanju Yoo,
Linglong Dai,
Songkuk Kim,
Chan-Byoung Chae
Abstract:
Semantic communications have shown promising advancements by optimizing source and channel coding jointly. However, the dynamics of these systems remain understudied, limiting research and performance gains. Inspired by the robustness of Vision Transformers (ViTs) in handling image nuisances, we propose a ViT-based model for semantic communications. Our approach achieves a peak signal-to-noise ratio (PSNR) gain of +0.5 dB over convolutional neural network variants. We introduce novel measures, average cosine similarity and Fourier analysis, to analyze the inner workings of semantic communications and optimize the system's performance. We also validate our approach through a real wireless channel prototype using software-defined radio (SDR). To the best of our knowledge, this is the first investigation of the fundamental workings of a semantic communications system, accompanied by the pioneering hardware implementation. To facilitate reproducibility and encourage further research, we provide open-source code, including neural network implementations and LabVIEW codes for SDR-based wireless transmission systems.
Submitted 5 June, 2023;
originally announced June 2023.
-
CASA-ASR: Context-Aware Speaker-Attributed ASR
Authors:
Mohan Shi,
Zhihao Du,
Qian Chen,
Fan Yu,
Yangze Li,
Shiliang Zhang,
Jie Zhang,
Li-Rong Dai
Abstract:
Recently, speaker-attributed automatic speech recognition (SA-ASR), which aims at answering the question "who spoke what", has attracted wide attention. Different from modular systems, end-to-end (E2E) SA-ASR minimizes the speaker-dependent recognition errors directly and shows promising applicability. In this paper, we propose a context-aware SA-ASR (CASA-ASR) model by enhancing the contextual modeling ability of E2E SA-ASR. Specifically, in CASA-ASR, a contextual text encoder is involved to aggregate the semantic information of the whole utterance, and a context-dependent scorer is employed to model the speaker discriminability by contrasting with speakers in the context. In addition, a two-pass decoding strategy is further proposed to fully leverage the contextual modeling ability, resulting in better recognition performance. Experimental results on the AliMeeting corpus show that the proposed CASA-ASR model outperforms the original E2E SA-ASR system with a relative improvement of 11.76% in terms of speaker-dependent character error rate.
Submitted 21 May, 2023;
originally announced May 2023.
-
Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction
Authors:
Mohan Shi,
Yuchun Shu,
Lingyun Zuo,
Qian Chen,
Shiliang Zhang,
Jie Zhang,
Li-Rong Dai
Abstract:
For speech interaction, voice activity detection (VAD) is often used as a front-end. However, traditional VAD algorithms usually need to wait for a continuous tail silence to reach a preset maximum duration before segmentation, resulting in a large latency that affects user experience. In this paper, we propose a novel semantic VAD for low-latency segmentation. Different from existing methods, a frame-level punctuation prediction task is added to the semantic VAD, and the artificial endpoint is included in the classification category in addition to the often-used speech presence and absence. To enhance the semantic information of the model, we also incorporate an automatic speech recognition (ASR) related semantic loss. Evaluations on an internal dataset show that the proposed method can reduce the average latency by 53.3% without significant deterioration of character error rate in the back-end ASR compared to the traditional VAD approach.
Submitted 21 May, 2023;
originally announced May 2023.
-
Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection
Authors:
Xiao-Min Zeng,
Yan Song,
Zhu Zhuo,
Yu Zhou,
Yu-Hong Li,
Hui Xue,
Li-Rong Dai,
Ian McLoughlin
Abstract:
In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to perform frame-level prediction. The output of the PAE, together with the original normal samples, is used for supervised contrastive representation learning in a multi-task framework. Besides the cross-entropy loss between classes, a contrastive loss is used to separate the PAE output and the original samples within each class. GeCo aims to better capture context information among frames, thanks to the self-attention mechanism in the PAE model. Furthermore, GeCo combines generative and contrastive learning, from which we aim to yield more effective and informative representations compared to existing methods. Extensive experiments have been conducted on the DCASE2020 Task2 development dataset, showing that GeCo outperforms state-of-the-art generative and discriminative methods.
Submitted 20 May, 2023;
originally announced May 2023.
-
Reconfigurable Intelligent Surfaces for 6G: Nine Fundamental Issues and One Critical Problem
Authors:
Zijian Zhang,
Linglong Dai
Abstract:
Thanks to the recent advances in metamaterials, reconfigurable intelligent surfaces (RISs) have emerged as a promising technology for future 6G wireless communications. Benefiting from their high array gain, low cost, and low power consumption, RISs are expected to greatly enlarge signal coverage, improve system capacity, and increase energy efficiency. In this article, we systematically overview the emerging RIS technology with the focus on its key basics, nine fundamental issues, and one critical problem. Specifically, we first explain the RIS basics, including its working principles, hardware structures, and potential benefits for communications. Based on these basics, nine fundamental issues of RISs, such as "What's the differences between RISs and massive MIMO?" and "Is RIS really intelligent?", are explicitly addressed to elaborate its technical features, distinguish it from existing technologies, and clarify some misunderstandings in the literature. Then, one critical problem of RISs is revealed: due to the "multiplicative fading" effect, existing passive RISs can hardly achieve visible performance gains in many communication scenarios with strong direct links. To address this critical problem, a potential solution called active RISs is introduced, and its effectiveness is demonstrated by numerical simulations.
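The "multiplicative fading" effect can be made concrete with a back-of-the-envelope link budget. The sketch below is a simplified illustration, not the article's signal model: it assumes free-space propagation, unit-gain isotropic RIS elements, perfect coherent combining across the elements, and hypothetical distances and carrier frequency.

```python
import math


def free_space_gain(distance_m, wavelength_m):
    """Free-space power gain between two isotropic antennas."""
    return (wavelength_m / (4 * math.pi * distance_m)) ** 2


def passive_ris_gain(d_tx_ris_m, d_ris_rx_m, wavelength_m, num_elements):
    """Reflected-link power gain under the simplified model.

    Each element sees the product of the two hop gains (the multiplicative
    fading); coherent combining over N elements scales the power by N**2.
    """
    per_element = (free_space_gain(d_tx_ris_m, wavelength_m)
                   * free_space_gain(d_ris_rx_m, wavelength_m))
    return num_elements ** 2 * per_element


# Hypothetical scenario: 3.5 GHz carrier, 100 m direct link,
# and a 256-element RIS placed 50 m from both ends.
wavelength = 3e8 / 3.5e9
direct = free_space_gain(100.0, wavelength)
reflected = passive_ris_gain(50.0, 50.0, wavelength, num_elements=256)
```

Under these assumptions the reflected link stays far weaker than the direct link even with an N² combining gain, which is why a strong direct link masks the contribution of a passive RIS.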
Submitted 19 May, 2023;
originally announced May 2023.
-
The Manifestation of Spatial Wideband Effect in Circular Array: From Beam Split to Beam Defocus
Authors:
Zidong Wu,
Linglong Dai
Abstract:
Millimeter-wave (mmWave) and terahertz (THz) communications with hybrid precoding architectures have been regarded as energy-efficient solutions to fulfill the vision of high-speed transmissions for 6G communications. Benefiting from the advantages of providing a wide scan range and flat array gain, the uniform circular array (UCA) has attracted much attention. However, the growing bandwidth of mmWave and THz communications requires frequency-dependent phase shifts, which cannot be perfectly realized through frequency-independent phase shifters (PSs) in classical hybrid precoding architectures. This mismatch causes the beam defocus effect in UCA wideband communications, where high-gain beams cannot form at non-central frequencies. In this paper, we first investigate the characteristics of the beam defocus effect, which distinguish it from the beam split effect in uniform linear array (ULA) systems. The beam pattern of UCA in both the frequency domain and the angular domain is analyzed, characterizing the beamforming loss caused by the beam defocus effect. Then, the delay-phase-precoding (DPP) architecture, which leverages true-time-delay (TTD) devices to generate frequency-dependent phase shifts, is employed to mitigate the beam defocus effect. Finally, performance analysis and extensive simulation results are provided to evaluate the effectiveness of the DPP architecture in UCA systems.
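The mismatch described above can be reproduced numerically. The following sketch is an idealised illustration rather than the paper's DPP architecture: it assumes a far-field UCA steering model, one phase shifter or one true-time-delay line per element, and hypothetical values for the radius, element count and frequencies.

```python
import cmath
import math

SPEED_OF_LIGHT = 3e8  # m/s


def element_delay(radius_m, num_elements, index, angle_rad):
    """Propagation delay offset of element `index` for a far-field direction."""
    psi = 2 * math.pi * index / num_elements
    return radius_m * math.cos(angle_rad - psi) / SPEED_OF_LIGHT


def gain_phase_shifters(freq_hz, center_hz, radius_m, num_elements, angle_rad):
    """Normalised array gain when the weights are fixed at the centre frequency."""
    total = 0j
    for n in range(num_elements):
        tau = element_delay(radius_m, num_elements, n, angle_rad)
        channel = cmath.exp(1j * 2 * math.pi * freq_hz * tau)
        weight = cmath.exp(-1j * 2 * math.pi * center_hz * tau)  # frequency-independent
        total += weight * channel
    return abs(total) / num_elements


def gain_true_time_delay(freq_hz, radius_m, num_elements, angle_rad):
    """Normalised array gain when each element applies a true time delay."""
    total = 0j
    for n in range(num_elements):
        tau = element_delay(radius_m, num_elements, n, angle_rad)
        channel = cmath.exp(1j * 2 * math.pi * freq_hz * tau)
        weight = cmath.exp(-1j * 2 * math.pi * freq_hz * tau)  # phase scales with frequency
        total += weight * channel
    return abs(total) / num_elements
```

With a 0.1 m radius, 256 elements and a 100 GHz centre frequency, the phase-shifter gain is 1 at the centre but drops well below 1 at 105 GHz, while the per-element delay lines keep the gain at 1 across the band. Practical designs share a small number of delay lines among many elements, which this sketch does not model.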
Submitted 4 May, 2023;
originally announced May 2023.
-
Diverse and Vivid Sound Generation from Text Descriptions
Authors:
Guangwei Li,
Xuenan Xu,
Lingfeng Dai,
Mengyue Wu,
Kai Yu
Abstract:
Previous audio generation mainly focuses on specified sound classes such as speech or music, whose form and content are greatly restricted. In this paper, we go beyond specific audio generation by using natural language description as a clue to generate broad sounds. Unlike visual information, a text description is concise by its nature but has rich hidden meanings beneath, which increases the range of possibilities and the complexity of the audio to be generated. A Variation-Quantized GAN is used to train a codebook learning discrete representations of spectrograms. For a given text description, its pre-trained embedding is fed to a Transformer to sample codebook indices, which are decoded into a spectrogram that is further transformed into a waveform by a MelGAN vocoder. The generated waveform has high quality and fidelity while corresponding closely to the given text. Experiments show that our proposed method is capable of generating natural, vivid audio, achieving superb quantitative and qualitative results.
Submitted 3 May, 2023;
originally announced May 2023.
-
AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer
Authors:
Kang Li,
Yan Song,
Li-Rong Dai,
Ian McLoughlin,
Xin Fang,
Lin Liu
Abstract:
In this paper, we propose an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model, pretrained on the large-scale AudioSet for the audio tagging (AT) task, termed AST-SED. Pretrained AST models have recently shown promise on DCASE2022 challenge task4, where they help mitigate a lack of sufficient real annotated data. However, mainly due to differences between the AT and SED tasks, it is suboptimal to directly utilize outputs from a pretrained AST model. Hence the proposed AST-SED adopts an encoder-decoder architecture to enable effective and efficient fine-tuning without needing to redesign or retrain the AST model. Specifically, the Frequency-wise Transformer Encoder (FTE) consists of transformers with self-attention along the frequency axis to address the issue of multiple overlapping audio events in a single clip. The Local Gated Recurrent Units Decoder (LGD) consists of nearest-neighbor interpolation (NNI) and Bidirectional Gated Recurrent Units (Bi-GRU) to compensate for temporal resolution loss in the pretrained AST model output. Experimental results on the DCASE2022 task4 development set have demonstrated the superiority of the proposed AST-SED with FTE-LGD architecture. Specifically, the Event-Based F1-score (EB-F1) of 59.60% and the Polyphonic Sound Detection Score scenario1 (PSDS1) of 0.5140 significantly outperform CRNN and other pretrained AST-based systems.
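The nearest-neighbor interpolation step mentioned for the decoder can be sketched in a few lines. This is only an illustration of the interpolation itself under the assumption of an integer upsampling factor; the function name is hypothetical and the Bi-GRU that follows in the described decoder is omitted.

```python
def nearest_neighbor_upsample(frames, factor):
    """Repeat each frame `factor` times to restore a finer time resolution."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [frame for frame in frames for _ in range(factor)]


# Two coarse frames become six fine frames with factor 3.
upsampled = nearest_neighbor_upsample([0.1, 0.9], 3)
```

Repetition alone produces blocky frame sequences, which is a plausible reason for pairing it with a recurrent layer that smooths along the time axis.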
Submitted 7 March, 2023;
originally announced March 2023.
-
Location Division Multiple Access for Near-Field Communications
Authors:
Zidong Wu,
Linglong Dai
Abstract:
Spatial division multiple access (SDMA) is essential to improve the spectrum efficiency for multi-user multiple-input multiple-output (MIMO) communications. The classical SDMA for massive MIMO with hybrid precoding heavily relies on the angular orthogonality in the far field to distinguish multiple users at different angles, which fails to fully exploit spatial resources in the distance domain. With the dramatically increasing number of antennas, the extremely large-scale antenna array (ELAA) introduces additional resolution in the distance domain in the near field. In this paper, we propose the concept of location division multiple access (LDMA) to provide a new possibility to enhance spectrum efficiency. The key idea is to exploit extra spatial resources in the distance domain to serve different users at different locations (determined by angles and distances) in the near field. Specifically, the asymptotic orthogonality of beam focusing vectors in the distance domain is proved, which reveals that near-field beam focusing is able to focus signals on specific locations to mitigate inter-user interference. Simulation results verify the superiority of the proposed LDMA over classical SDMA in different scenarios.
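The distance-domain resolution claimed above can be checked numerically on a small example. The sketch below is an illustration under assumptions, not the paper's proof: it uses a uniform linear array with half-wavelength spacing, a spherical-wave model with exact element-to-user distances, and hypothetical values for the carrier wavelength, array size and user distances.

```python
import cmath
import math


def focusing_vector(num_antennas, spacing_m, wavelength_m, angle_rad, distance_m):
    """Unit-norm near-field beam focusing vector for a centred linear array."""
    vector = []
    for n in range(num_antennas):
        x = (n - (num_antennas - 1) / 2) * spacing_m  # element position on the array axis
        r_n = math.sqrt(distance_m ** 2 + x ** 2
                        - 2 * distance_m * x * math.sin(angle_rad))
        phase = -2 * math.pi * (r_n - distance_m) / wavelength_m
        vector.append(cmath.exp(1j * phase) / math.sqrt(num_antennas))
    return vector


def correlation(u, v):
    """Magnitude of the inner product between two focusing vectors."""
    return abs(sum(a.conjugate() * b for a, b in zip(u, v)))


# Hypothetical setup: 30 GHz carrier (1 cm wavelength), 512 antennas,
# two users in the same direction at 5 m and 20 m.
near = focusing_vector(512, 0.005, 0.01, 0.0, 5.0)
far = focusing_vector(512, 0.005, 0.01, 0.0, 20.0)
```

Here `correlation(near, near)` equals 1, while `correlation(near, far)` is much smaller even though both users share the same angle, which is the spatial resource that location-based multiple access relies on.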
Submitted 22 January, 2023;
originally announced January 2023.
-
Cross Far- and Near-field Wireless Communications in Terahertz Ultra-large Antenna Array Systems
Authors:
Chong Han,
Yuhang Chen,
Longfei Yan,
Zhi Chen,
Linglong Dai
Abstract:
The terahertz (THz) band, with its abundant multi-ten-GHz bandwidth, is capable of supporting Terabit-per-second wireless communications and is a pillar technology for 6G and beyond systems. With sub-millimeter-long antennas, ultra-massive (UM) MIMO and intelligent surface (IS) systems with thousands of array elements are exploited to effectively combat the distance limitation and blockage problems, which compose a promising THz ultra-large antenna array (ULAA) system. As a combined effect of wavelength and array aperture, the resulting coverage of THz systems ranges from near-field to far-field, leading to a new paradigm of cross-field communications. Although channel models, communications theories, and networking strategies have been studied for far-field and near-field separately, the unified design of cross-field communications that achieves high spectral efficiency and low complexity is still missing. In this article, the challenges and features of THz ULAA cross-field communications are investigated. Furthermore, cross-field solutions from three perspectives are presented, including a hybrid spherical- and planar-wave channel model, cross-field channel estimation, and widely-spaced multi-subarray hybrid beamforming, where a subarray as a basic unit in THz ULAA systems is exploited. The channel modeling approximation error, spectral efficiency, and estimation error of these designs are numerically evaluated. Finally, as a roadmap of THz ULAA cross-field communications, multiple open problems and potential research directions are elaborated.
Submitted 3 August, 2023; v1 submitted 8 January, 2023;
originally announced January 2023.
-
Active RISs: Signal Modeling, Asymptotic Analysis, and Beamforming Design
Authors:
Zijian Zhang,
Linglong Dai,
Xibi Chen,
Changhao Liu,
Fan Yang,
Robert Schober,
H. Vincent Poor
Abstract:
Reconfigurable intelligent surfaces (RISs) have emerged as a candidate technology for future 6G networks. However, due to the "multiplicative fading" effect, the existing passive RISs only achieve a negligible capacity gain in environments with strong direct links. In this paper, the concept of active RISs is studied to overcome this fundamental limitation. Unlike the existing passive RISs that reflect signals without amplification, active RISs can amplify the reflected signals via amplifiers integrated into their elements. To characterize the signal amplification and incorporate the noise introduced by the active components, we verify the signal model of active RISs through the experimental measurements on a fabricated active RIS element. Based on the verified signal model, we formulate the sum-rate maximization problem for an active RIS aided multi-user multiple-input single-output (MU-MISO) system and a joint transmit precoding and reflect beamforming algorithm is proposed to solve this problem. Simulation results show that, in a typical wireless system, the existing passive RISs can realize only a negligible sum-rate gain of 3%, while the active RISs can achieve a significant sum-rate gain of 62%, thus overcoming the "multiplicative fading" effect. Finally, we develop a 64-element active RIS aided wireless communication prototype, and the significant gain of active RISs is validated by field test.
Submitted 31 December, 2022;
originally announced January 2023.
-
Enabling More Users to Benefit from Near-Field Communications: From Linear to Circular Array
Authors:
Zidong Wu,
Mingyao Cui,
Linglong Dai
Abstract:
Massive multiple-input multiple-output (MIMO) for 5G is evolving into the extremely large-scale antenna array (ELAA) to increase the spectrum efficiency by orders of magnitude for 6G communications. ELAA introduces spherical-wave-based near-field communications, where channel capacity can be significantly improved for single-user and multi-user scenarios. Unfortunately, the near-field region at large incidence/emergence angles is greatly reduced with the widely studied uniform linear array (ULA). Thus, many randomly distributed users may fail to benefit from near-field communications. In this paper, we leverage the rotational symmetry of the uniform circular array (UCA) to provide uniform and enlarged near-field regions at all angles, enabling more users to benefit from near-field communications. Specifically, by exploiting the geometrical relationship between UCA and users, the near-field beamforming technique for UCA is developed. Based on the analysis of near-field beamforming, we reveal that UCA is able to provide a larger near-field region than ULA in terms of the effective Rayleigh distance. Moreover, a concentric-ring codebook is designed to realize efficient codebook-based beamforming in the near-field region. In addition, we find that UCA can generate orthogonal near-field beams along the same direction when the focal point of the near-field beam lies exactly at the zeros of other beams, which has the potential to further improve spectrum efficiency in multi-user communications compared with ULA. Simulation results are provided to verify the effectiveness of the theoretical analysis and the feasibility of UCA to enable more users to benefit from near-field communications by broadening the near-field region.
Submitted 30 October, 2023; v1 submitted 30 December, 2022;
originally announced December 2022.
-
Near-Field Wideband Channel Estimation for Extremely Large-Scale MIMO
Authors:
Mingyao Cui,
Linglong Dai
Abstract:
Extremely large-scale multiple-input-multiple-output (XL-MIMO) at millimeter-wave (mmWave) and terahertz (THz) bands plays an important role in supporting extremely high beamforming gain as well as ultra-wideband spectrum resources. Unfortunately, accurate wideband XL-MIMO channel estimation suffers from a new challenge called the near-field beam split effect. Prior works either neglect the accurate near-field channel model or fail to exploit the beam split effect, resulting in poor channel estimation accuracy for wideband XL-MIMO. To tackle this problem, this paper proposes a bilinear pattern detection (BPD) based approach to accurately recover the wideband XL-MIMO channel. Specifically, by analyzing the characteristics of near-field wideband channels, we first reveal the bilinear pattern of the near-field beam split effect, which implies that the sparse support set of near-field channels in both the angle and the distance domains can be regarded as a linear function of frequency. Then, inspired by the classical simultaneous orthogonal matching pursuit technique, we use the bilinear pattern to estimate the angle-of-arrival (AoA) and distance parameters of each near-field path component at all frequencies. In this way, the entire wideband XL-MIMO channel can be recovered by compressed sensing algorithms. Moreover, we provide the computational complexity of the proposed algorithm compared with existing algorithms. Finally, simulation results demonstrate that our scheme can achieve accurate estimation of the near-field wideband XL-MIMO channel in the presence of the near-field beam split effect.
Submitted 16 December, 2022;
originally announced December 2022.
-
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Authors:
Qiushi Zhu,
Long Zhou,
Ziqiang Zhang,
Shujie Liu,
Binxing Jiao,
Jie Zhang,
Lirong Dai,
Daxin Jiang,
Jinyu Li,
Furu Wei
Abstract:
Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision and text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual related downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR) tasks. Results show that the proposed VATLM outperforms the previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
Submitted 19 May, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings
Authors:
Mohan Shi,
Jie Zhang,
Zhihao Du,
Fan Yu,
Qian Chen,
Shiliang Zhang,
Li-Rong Dai
Abstract:
Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR tasks. It was shown that single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT) and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR) can be exploited to partially solve this problem. In this paper, we propose three corresponding multichannel (MC) SA-ASR approaches, namely MC-FD-SOT, MC-WD-SOT and MC-TS-ASR. For different tasks/models, different multichannel data fusion strategies are considered, including channel-level cross-channel attention for MC-FD-SOT, frame-level cross-channel attention for MC-WD-SOT and neural beamforming for MC-TS-ASR. Results on the AliMeeting corpus reveal that our proposed models can consistently outperform the corresponding single-channel counterparts in terms of the speaker-dependent character error rate.
Submitted 1 March, 2023; v1 submitted 1 November, 2022;
originally announced November 2022.
-
Robust Data2vec: Noise-robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning
Authors:
Qiu-Shi Zhu,
Long Zhou,
Jie Zhang,
Shu-Jie Liu,
Yu-Chen Hu,
Li-Rong Dai
Abstract:
Self-supervised pre-training methods based on contrastive learning or regression tasks can utilize more unlabeled data to improve the performance of automatic speech recognition (ASR). However, the robustness impact of combining the two pre-training tasks and constructing different negative samples for contrastive learning still remains unclear. In this paper, we propose a noise-robust data2vec for self-supervised speech representation learning by jointly optimizing the contrastive learning and regression tasks in the pre-training stage. Furthermore, we present two improved methods to facilitate contrastive learning. More specifically, we first propose to construct patch-based non-semantic negative samples to boost the noise robustness of the pre-training model, which is achieved by dividing the features into patches of different sizes (i.e., the so-called negative samples). Second, by analyzing the distribution of positive and negative samples, we propose to remove the easily distinguishable negative samples to improve the discriminative capacity of pre-training models. Experimental results on the CHiME-4 dataset show that our method is able to improve the performance of the pre-trained model in noisy scenarios. We find that joint training of the contrastive learning and regression tasks can avoid model collapse to some extent compared to only training the regression task.
Submitted 27 October, 2022;
originally announced October 2022.
-
SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
Authors:
Ziqiang Zhang,
Long Zhou,
Junyi Ao,
Shujie Liu,
Lirong Dai,
Jinyu Li,
Furu Wei
Abstract:
The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden-unit as an interface to align speech and text, we can decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and text data respectively. Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT gets substantial improvements over strong baselines, and achieves state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. To better understand the proposed SpeechUT, detailed analyses are conducted. The code and pre-trained models are available at https://aka.ms/SpeechUT.
Submitted 7 October, 2022;
originally announced October 2022.
-
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
Authors:
Ziqiang Zhang,
Sanyuan Chen,
Long Zhou,
Yu Wu,
Shuo Ren,
Shujie Liu,
Zhuoyuan Yao,
Xun Gong,
Lirong Dai,
Jinyu Li,
Furu Wei
Abstract:
How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
Submitted 15 June, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Workflow-based Fast Data-driven Predictive Control with Disturbance Observer in Cloud-edge Collaborative Architecture
Authors:
Runze Gao,
Qiwen Li,
Li Dai,
Yufeng Zhan,
Yuanqing Xia
Abstract:
Data-driven predictive control (DPC) has been studied and used in various scenarios, since it can generate the predicted control sequence relying only on the historical input and output data. Recently, based on cloud computing, the data-driven predictive cloud control system (DPCCS) has been proposed with the advantage of sufficient computational resources. However, the existing computation mode of DPCCS is centralized. This computation mode cannot fully utilize the computing power of cloud computing, whose structure is distributed. Thus, the computation delay cannot be reduced and still affects the control quality. In this paper, a novel cloud-edge collaborative containerised workflow-based DPC system with disturbance observer (DOB) is proposed to improve the computation efficiency and guarantee the control accuracy. First, a construction method for the DPC workflow is designed to match the distributed processing environment of cloud computing. However, the non-computation overheads of the workflow tasks are relatively high. Therefore, second, a cloud-edge collaborative control scheme with DOB is designed. The low-weight data can be truncated to reduce the non-computation overheads. Meanwhile, we design an edge DOB to estimate and compensate for the uncertainty in cloud workflow processing, and obtain the composite control variable. The uniformly ultimately bounded (UUB) stability of the DOB is also proved. Third, to execute the workflow-based DPC controller and evaluate the proposed cloud-edge collaborative control scheme with DOB in the real cloud environment, we design and implement a practical workflow-based cloud control experimental system based on container technology. Finally, a series of evaluations show that the computation times are decreased by 45.19% and 74.35% for two real-time control examples, respectively, and by at most 85.10% for a high-dimension control example.
Submitted 16 September, 2022;
originally announced September 2022.
-
Dynamic Write-Voltage Design and Read-Voltage Optimization for MLC NAND Flash Memory
Authors:
Runbin Cai,
Yi Fang,
Zhifang Shi,
Lin Dai,
Guojun Han
Abstract:
To mitigate the impact of noise and interference on multi-level-cell (MLC) flash memory with the use of low-density parity-check (LDPC) codes, we propose a dynamic write-voltage design scheme considering the asymmetric property of raw bit error rate (RBER), which can obtain the optimal write voltage by minimizing a cost function. In order to further improve the decoding performance of flash memory, we put forward a low-complexity entropy-based read-voltage optimization scheme, which derives the read voltages by searching for the optimal entropy value via a log-likelihood ratio (LLR)-aware cost function. Simulation results demonstrate the superiority of our proposed dynamic write-voltage design scheme and read-voltage optimization scheme with respect to the existing counterparts.
Submitted 3 September, 2022;
originally announced September 2022.