-
Structural Optimization of Lightweight Bipedal Robot via SERL
Authors:
Yi Cheng,
Chenxi Han,
Yuheng Min,
Linqi Ye,
Houde Liu,
Hang Liu
Abstract:
Designing a bipedal robot is a complex and challenging task, especially when dealing with a multitude of structural parameters. Traditional design methods often rely on human intuition and experience. However, such approaches are time-consuming and labor-intensive, lack theoretical guidance, and struggle to find optimal designs within vast design spaces, thus failing to fully exploit the inherent performance potential of robots. In this context, this paper introduces the SERL (Structure Evolution Reinforcement Learning) algorithm, which combines reinforcement learning for locomotion tasks with evolution algorithms. The aim is to identify the optimal parameter combinations within a given multidimensional design space. Through the SERL algorithm, we successfully designed a bipedal robot named Wow Orin, where the optimal leg lengths are obtained through optimization based on body structure and motor torque. We have experimentally validated the effectiveness of the SERL algorithm, which is capable of finding the best structure within a specified design space and task conditions. Additionally, to assess the performance gap between our designed robot and the current state-of-the-art robots, we compared Wow Orin with the mainstream bipedal robots Cassie and Unitree H1. A series of experimental results demonstrate the outstanding energy efficiency and performance of Wow Orin, further validating the feasibility of applying the SERL algorithm to practical design.
Submitted 28 August, 2024;
originally announced August 2024.
-
PredIN: Towards Open-Set Gesture Recognition via Prediction Inconsistency
Authors:
Chen Liu,
Can Han,
Chengfeng Zhou,
Crystal Cai,
Dahong Qian
Abstract:
Gesture recognition based on surface electromyography (sEMG) has achieved significant progress in human-machine interaction (HMI). However, accurately recognizing predefined gestures within a closed set is still inadequate in practice; a robust open-set system needs to effectively reject unknown gestures while correctly classifying known ones. To handle this challenge, we first report prediction inconsistency discovered for unknown classes due to ensemble diversity, which can significantly facilitate the detection of unknown classes. Based on this insight, we propose an ensemble learning approach, PredIN, to explicitly magnify the prediction inconsistency by enhancing ensemble diversity. Specifically, PredIN maximizes the class feature distribution inconsistency among ensemble members to enhance diversity. Meanwhile, it optimizes inter-class separability within an individual ensemble member to maintain individual performance. Comprehensive experiments on various benchmark datasets demonstrate that PredIN outperforms state-of-the-art methods by a clear margin. Our proposed method simultaneously achieves accurate closed-set classification for predefined gestures and effective rejection for unknown gestures, exhibiting its efficacy and superiority in open-set gesture recognition based on sEMG.
Submitted 29 July, 2024;
originally announced July 2024.
-
Transfer Learning Enabled Transformer based Generative Adversarial Networks (TT-GAN) for Terahertz Channel Modeling and Generating
Authors:
Zhengdong Hu,
Yuanbo Li,
Chong Han
Abstract:
Terahertz (THz) communications, ranging from 100 GHz to 10 THz, are envisioned as a promising technology for 6G and beyond wireless systems. As the foundation for designing THz communications, channel modeling and characterization are crucial to scrutinize the potential of the new spectrum. However, current channel modeling and standardization heavily rely on measurements, which are both time-consuming and costly to obtain in the THz band. Here, we propose a Transfer learning enabled Transformer based Generative Adversarial Network (TT-GAN) for THz channel modeling. Specifically, as a fundamental building block, a GAN is exploited to generate channel parameters, which can substitute for measurements. To greatly improve the accuracy, the first T, i.e., a transformer structure with a self-attention mechanism, is incorporated in the GAN. Since errors still remain compared with ground-truth measurements, the second T, i.e., transfer learning, is designed to solve the mismatch between the formulated network and measurements. The proposed TT-GAN can achieve high accuracy in channel modeling while requiring only a rather limited amount of measurement, making it a promising complement to channel standardization that fundamentally differs from current techniques that heavily rely on measurement.
Submitted 8 July, 2024;
originally announced July 2024.
-
Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis
Authors:
Xilin Jiang,
Yinghao Aaron Li,
Adrian Nicolas Florea,
Cong Han,
Nima Mesgarani
Abstract:
It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing Mamba with transformers in terms of both performance and efficiency in multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compare them with transformers of similar sizes in performance, memory, and speed. Our Mamba or Mamba-transformer hybrid models show comparable or higher performance than their transformer counterparts: Sepformer, Conformer, and VALL-E. They are more efficient than transformers in memory and speed for speech longer than a threshold duration, inversely related to the resolution of a speech token. Mamba for separation is the most efficient, and Mamba for recognition is the least. Further, we show that Mamba is not more efficient than transformer for speech shorter than the threshold duration and performs worse in models that require joint modeling of text and speech, such as cross or masked attention of two inputs. Therefore, we argue that the superiority of Mamba or transformer depends on particular problems and models. Code available at https://github.com/xi-j/Mamba-TasNet and https://github.com/xi-j/Mamba-ASR.
Submitted 12 July, 2024;
originally announced July 2024.
-
EDPNet: An Efficient Dual Prototype Network for Motor Imagery EEG Decoding
Authors:
Can Han,
Chen Liu,
Crystal Cai,
Jun Wang,
Dahong Qian
Abstract:
Motor imagery electroencephalograph (MI-EEG) decoding plays a crucial role in developing motor imagery brain-computer interfaces (MI-BCIs). However, decoding intentions from MI remains challenging due to the inherent complexity of EEG signals relative to the small sample size. In this paper, we propose an Efficient Dual Prototype Network (EDPNet) to enable accurate and fast MI decoding. EDPNet employs a lightweight adaptive spatial-spectral fusion module, which promotes more efficient information fusion between multiple EEG electrodes. Subsequently, a parameter-free multi-scale variance pooling module extracts more comprehensive temporal features. Furthermore, we introduce dual prototypical learning to optimize the feature space distribution and training process, thereby improving the model's generalization ability on small-sample MI datasets. Our experimental results show that EDPNet outperforms state-of-the-art models with superior classification accuracy and kappa values (84.11% and 0.7881 for dataset BCI competition IV 2a, 86.65% and 0.7330 for dataset BCI competition IV 2b). Additionally, we use the BCI competition III IVa dataset, which has less training data, to further validate the generalization ability of the proposed EDPNet. We also achieve superior performance with 82.03% classification accuracy. Benefiting from the lightweight parameters and superior decoding accuracy, our EDPNet shows great potential for MI-BCI applications. The code is publicly available at https://github.com/hancan16/EDPNet.
Submitted 3 July, 2024;
originally announced July 2024.
-
HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model
Authors:
Di Wang,
Meiqi Hu,
Yao Jin,
Yuchun Miao,
Jiaqi Yang,
Yichu Xu,
Xiaolei Qin,
Jiaqi Ma,
Lingyu Sun,
Chenxing Li,
Chuan Fu,
Hongruixuan Chen,
Chengxi Han,
Naoto Yokoya,
Jing Zhang,
Minqiang Xu,
Lin Liu,
Lefei Zhang,
Chen Wu,
Bo Du,
Dacheng Tao,
Liangpei Zhang
Abstract:
Foundation models (FMs) are revolutionizing the analysis and understanding of remote sensing (RS) scenes, including aerial RGB, multispectral, and SAR images. However, hyperspectral images (HSIs), which are rich in spectral information, have not seen much application of FMs, with existing methods often restricted to specific tasks and lacking generality. To fill this gap, we introduce HyperSIGMA, a vision transformer-based foundation model for HSI interpretation, scalable to over a billion parameters. To tackle the spectral and spatial redundancy challenges in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which effectively promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training, which contains about 450K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA's versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transferring capability, and real-world applicability.
Submitted 17 June, 2024;
originally announced June 2024.
-
Understanding Pedestrian Movement Using Urban Sensing Technologies: The Promise of Audio-based Sensors
Authors:
Chaeyeon Han,
Pavan Seshadri,
Yiwei Ding,
Noah Posner,
Bon Woo Koo,
Animesh Agrawal,
Alexander Lerch,
Subhrajit Guhathakurta
Abstract:
While various sensors have been deployed to monitor vehicular flows, sensing pedestrian movement is still nascent. Yet walking is a significant mode of travel in many cities, especially those in Europe, Africa, and Asia. Understanding pedestrian volumes and flows is essential for designing safer and more attractive pedestrian infrastructure and for controlling periodic overcrowding. This study discusses a new approach to scale up urban sensing of people with the help of novel audio-based technology. It assesses the benefits and limitations of microphone-based sensors as compared to other forms of pedestrian sensing. A large-scale dataset called ASPED is presented, which includes high-quality audio recordings along with video recordings used for labeling the pedestrian count data. The baseline analyses highlight the promise of using audio sensors for pedestrian tracking, although algorithmic and technological improvements to make the sensors practically usable continue. This study also demonstrates how the data can be leveraged to predict pedestrian trajectories. Finally, it discusses the use cases and scenarios where audio-based pedestrian sensing can support better urban and transportation planning.
Submitted 14 June, 2024;
originally announced June 2024.
-
CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting
Authors:
Sichen Jin,
Youngmoon Jung,
Seungjin Lee,
Jaeyoung Roh,
Changwoo Han,
Hoonyoung Cho
Abstract:
This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment. For every input frame, the proposed method finds the optimal alignment ending at the frame using connectionist temporal classification (CTC) and aggregates the frame-level acoustic embedding (AE) to obtain higher-level (i.e., character, word, or phrase) AE that aligns with the text embedding (TE) of the target keyword text. After that, we calculate the similarity of the aggregated AE and the TE. To the best of our knowledge, this is the first attempt to dynamically align the audio and the keyword text on-the-fly to attain the joint audio-text embedding for KWS. Despite operating in a streaming fashion, our approach achieves competitive performance on the LibriPhrase dataset compared to the non-streaming methods with a mere 155K model parameters and a decoding algorithm with time complexity O(U), where U is the length of the target keyword at inference time.
Submitted 12 June, 2024;
originally announced June 2024.
-
Relational Proxy Loss for Audio-Text based Keyword Spotting
Authors:
Youngmoon Jung,
Seungjin Lee,
Joon-Young Yang,
Jaeyoung Roh,
Chang Woo Han,
Hoon-Young Cho
Abstract:
In recent years, there has been an increasing focus on user convenience, leading to increased interest in text-based keyword enrollment systems for keyword spotting (KWS). Since the system utilizes text input during the enrollment phase and audio input during actual usage, we call this task audio-text based KWS. To enable this task, both acoustic and text encoders are typically trained using deep metric learning loss functions, such as triplet- and proxy-based losses. This study aims to improve existing methods by leveraging the structural relations within acoustic embeddings and within text embeddings. Unlike previous studies that only compare acoustic and text embeddings on a point-to-point basis, our approach focuses on the relational structures within the embedding space by introducing the concept of Relational Proxy Loss (RPL). By incorporating RPL, we demonstrated improved performance on the Wall Street Journal (WSJ) corpus.
Submitted 7 June, 2024;
originally announced June 2024.
-
Cross Far- and Near-Field Channel Measurement and Modeling in Extremely Large-scale Antenna Array (ELAA) Systems
Authors:
Yiqin Wang,
Chong Han,
Shu Sun,
Jianhua Zhang
Abstract:
Technologies like ultra-massive multiple-input-multiple-output (UM-MIMO) and reconfigurable intelligent surfaces (RISs) are of special interest to meet the key performance indicators of future wireless systems including ubiquitous connectivity and lightning-fast data rates. One of their common features, the extremely large-scale antenna array (ELAA) system with hundreds or thousands of antennas, gives rise to near-field (NF) propagation and brings new challenges to channel modeling and characterization. In this paper, a cross-field channel model for ELAA systems is proposed, which improves the statistical model in 3GPP TR 38.901 by refining the propagation path with its first and last bounces and differentiating the characterization of parameters like path loss, delay, and angles in near- and far-fields. A comprehensive analysis of cross-field boundaries and closed-form expressions of corresponding NF or FF parameters are provided. Furthermore, cross-field experiments carried out in a typical indoor scenario at 300 GHz verify the variation of MPC parameters across the antenna array, and demonstrate the distinction of channels between different antenna elements. Finally, detailed generation procedures of the cross-field channel model are provided, based on which simulations and analysis on NF probabilities and channel coefficients are conducted for $4\times4$, $8\times8$, $16\times16$, and $9\times21$ uniform planar arrays at different frequency bands.
Submitted 27 May, 2024;
originally announced May 2024.
-
Track Role Prediction of Single-Instrumental Sequences
Authors:
Changheon Han,
Suhyun Lee,
Minsam Ko
Abstract:
In the composition process, selecting appropriate single-instrumental music sequences and assigning their track-role is an indispensable task. However, manually determining the track-role for a myriad of music samples can be time-consuming and labor-intensive. This study introduces a deep learning model designed to automatically predict the track-role of single-instrumental music sequences. Our evaluations show a prediction accuracy of 87% in the symbolic domain and 84% in the audio domain. The proposed track-role prediction methods hold promise for future applications in AI music generation and analysis.
Submitted 20 April, 2024;
originally announced April 2024.
-
Change Guiding Network: Incorporating Change Prior to Guide Change Detection in Remote Sensing Imagery
Authors:
Chengxi Han,
Chen Wu,
Haonan Guo,
Meiqi Hu,
Jiepan Li,
Hongruixuan Chen
Abstract:
The rapid advancement of automated artificial intelligence algorithms and remote sensing instruments has benefited change detection (CD) tasks. However, there is still considerable room for improvement in precise detection, especially regarding the edge integrity and internal holes of change features. To solve these problems, we design the Change Guiding Network (CGNet) to tackle the insufficient expression of change features in the conventional U-Net structure adopted in previous methods, which causes inaccurate edge detection and internal holes. Change maps from deep features with rich semantic information are generated and used as prior information to guide multi-scale feature fusion, which can improve the expression ability of change features. Meanwhile, we propose a self-attention module named Change Guide Module (CGM), which can effectively capture the long-distance dependency among pixels and overcome the insufficient receptive field of traditional convolutional neural networks. On four major CD datasets, we verify the usefulness and efficiency of CGNet, and a large number of experiments and ablation studies demonstrate its effectiveness. We will open-source our code at https://github.com/ChengxiHAN/CGNet-CD.
Submitted 14 April, 2024;
originally announced April 2024.
-
ChangeMamba: Remote Sensing Change Detection With Spatiotemporal State Space Model
Authors:
Hongruixuan Chen,
Jian Song,
Chengxi Han,
Junshi Xia,
Naoto Yokoya
Abstract:
Convolutional neural networks (CNNs) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings: CNNs are constrained by a limited receptive field that may hinder their ability to capture broader spatial contexts, while Transformers are computationally intensive, making them costly to train and deploy on large datasets. Recently, the Mamba architecture, based on state space models, has shown remarkable performance in a series of natural language processing tasks, which can effectively compensate for the shortcomings of the above two architectures. In this paper, we explore for the first time the potential of the Mamba architecture for remote sensing CD tasks. We tailor the corresponding frameworks, called MambaBCD, MambaSCD, and MambaBDA, for binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA), respectively. All three frameworks adopt the cutting-edge Visual Mamba architecture as the encoder, which allows full learning of global spatial contextual information from the input images. For the change decoder, which is available in all three architectures, we propose three spatio-temporal relationship modeling mechanisms, which can be naturally combined with the Mamba architecture and fully utilize its attributes to achieve spatio-temporal interaction of multi-temporal features, thereby obtaining accurate change information. On five benchmark datasets, our proposed frameworks outperform current CNN- and Transformer-based approaches without using any complex training strategies or tricks, fully demonstrating the potential of the Mamba architecture in CD tasks. Further experiments show that our architecture is quite robust to degraded data. The source code will be available at https://github.com/ChenHongruixuan/MambaCD
Submitted 26 July, 2024; v1 submitted 4 April, 2024;
originally announced April 2024.
-
iMD4GC: Incomplete Multimodal Data Integration to Advance Precise Treatment Response Prediction and Survival Analysis for Gastric Cancer
Authors:
Fengtao Zhou,
Yingxue Xu,
Yanfen Cui,
Shenyan Zhang,
Yun Zhu,
Weiyang He,
Jiguang Wang,
Xin Wang,
Ronald Chan,
Louis Ho Shing Lau,
Chu Han,
Dafu Zhang,
Zhenhui Li,
Hao Chen
Abstract:
Gastric cancer (GC) is a prevalent malignancy worldwide, ranking as the fifth most common cancer with over 1 million new cases and 700 thousand deaths in 2020. Locally advanced gastric cancer (LAGC) accounts for approximately two-thirds of GC diagnoses, and neoadjuvant chemotherapy (NACT) has emerged as the standard treatment for LAGC. However, the effectiveness of NACT varies significantly among patients, with a considerable subset displaying treatment resistance. Ineffective NACT not only leads to adverse effects but also misses the optimal therapeutic window, resulting in a lower survival rate. However, existing multimodal learning methods assume the availability of all modalities for each patient, which does not align with the reality of clinical practice. The limited availability of modalities for each patient would cause information loss, adversely affecting predictive accuracy. In this study, we propose an incomplete multimodal data integration framework for GC (iMD4GC) to address the challenges posed by incomplete multimodal data, enabling precise response prediction and survival analysis. Specifically, iMD4GC incorporates unimodal attention layers for each modality to capture intra-modal information. Subsequently, the cross-modal interaction layers explore potential inter-modal interactions and capture complementary information across modalities, thereby enabling information compensation for missing modalities. To evaluate iMD4GC, we collected three multimodal datasets for GC study: GastricRes (698 cases) for response prediction, GastricSur (801 cases) for survival analysis, and TCGA-STAD (400 cases) for survival analysis. The scale of our datasets is significantly larger than that of previous studies. iMD4GC achieved impressive performance with an 80.2% AUC on GastricRes, 71.4% C-index on GastricSur, and 66.1% C-index on TCGA-STAD, significantly surpassing other compared methods.
Submitted 1 April, 2024;
originally announced April 2024.
-
Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation
Authors:
Xilin Jiang,
Cong Han,
Nima Mesgarani
Abstract:
Transformers have been the most successful architecture for various speech modeling tasks, including speech separation. However, the self-attention mechanism in transformers with quadratic complexity is inefficient in computation and memory. Recent models incorporate new layers and modules along with transformers for better performance but also introduce extra model complexity. In this work, we replace transformers with Mamba, a selective state space model, for speech separation. We propose dual-path Mamba, which models short-term and long-term forward and backward dependency of speech signals using selective state spaces. Our experimental results on the WSJ0-2mix data show that our dual-path Mamba models of comparably smaller sizes outperform state-of-the-art RNN model DPRNN, CNN model WaveSplit, and transformer model Sepformer. Code: https://github.com/xi-j/Mamba-TasNet
Submitted 30 April, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience
Authors:
Xilin Jiang,
Cong Han,
Yinghao Aaron Li,
Nima Mesgarani
Abstract:
In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Edit" (LCE), a novel multimodal sound mixture editor that modifies each sound source in a mixture based on user-provided text instructions. LCE distinguishes itself with a user-friendly chat interface and its unique ability to edit multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for editing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles it into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse editing tasks like extraction, removal, and volume control. Our experiments demonstrate significant improvements in signal quality across all editing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
Submitted 6 February, 2024;
originally announced February 2024.
-
Vision Transformer-based Multimodal Feature Fusion Network for Lymphoma Segmentation on PET/CT Images
Authors:
Huan Huang,
Liheng Qiu,
Shenmiao Yang,
Longxi Li,
Jiaofen Nan,
Yanting Li,
Chuang Han,
Fubao Zhu,
Chen Zhao,
Weihua Zhou
Abstract:
Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Methods: Our lymphoma segmentation approach combines a vision transformer with dual encoders, fusing PET and CT data via a multimodal cross-attention fusion (MMCAF) module. In this study, PET and CT data from 165 DLBCL patients were analyzed. A 5-fold cross-validation was employed to evaluate the performance and generalization ability of our method. Ground truths were annotated by experienced nuclear medicine experts. We calculated the total metabolic tumor volume (TMTV) and performed a statistical analysis on our results. Results: The proposed method exhibited accurate performance in DLBCL lesion segmentation, achieving a Dice similarity coefficient of 0.9173$\pm$0.0071, a Hausdorff distance of 2.71$\pm$0.25 mm, a sensitivity of 0.9462$\pm$0.0223, and a specificity of 0.9986$\pm$0.0008. Additionally, a Pearson correlation coefficient of 0.9030$\pm$0.0179 and an R-square of 0.8586$\pm$0.0173 were observed in TMTV when measured on manual annotation compared to our segmentation results. Conclusion: This study highlights the advantages of MMCAF and vision transformer for lymphoma segmentation using PET and CT, offering great promise for computer-aided lymphoma diagnosis and treatment.
Submitted 4 February, 2024;
originally announced February 2024.
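For readers unfamiliar with the Dice similarity coefficient reported in the abstract above, the following is a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), on flattened binary masks. The function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions made for illustration; this is not the authors' evaluation code.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two equal-length binary masks."""
    pred = [bool(p) for p in pred]
    truth = [bool(t) for t in truth]
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled as lesion in both the prediction and the ground truth.
    overlap = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * overlap / total
```

A score of 1.0 means the predicted and annotated masks coincide exactly, and 0.0 means they share no voxels.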
-
Dynamic Indoor Fingerprinting Localization based on Few-Shot Meta-Learning with CSI Images
Authors:
Jiyu Jiao,
Xiaojun Wang,
Chenpei Han,
Yuhua Huang,
Yizhuo Zhang
Abstract:
While fingerprinting localization is favored for its effectiveness, it is hindered by high data acquisition costs and the inaccuracy of static database-based estimates. Addressing these issues, this letter presents an innovative indoor localization method using a data-efficient meta-learning algorithm. This approach, grounded in the "Learning to Learn" paradigm of meta-learning, utilizes historical localization tasks to improve adaptability and learning efficiency in dynamic indoor environments. We introduce a task-weighted loss to enhance knowledge transfer within this framework. Our comprehensive experiments confirm the method's robustness and superiority over current benchmarks, achieving a notable 23.13% average gain in Mean Euclidean Distance, particularly effective in scenarios with limited CSI data.
Submitted 11 January, 2024;
originally announced January 2024.
-
Far- and Near-Field Channel Measurements and Characterization in the Terahertz Band Using a Virtual Antenna Array
Authors:
Yiqin Wang,
Shu Sun,
Chong Han
Abstract:
Extremely large-scale antenna array (ELAA) technologies, consisting of ultra-massive multiple-input-multiple-output (UM-MIMO) or reconfigurable intelligent surfaces (RISs), are emerging to meet the demand of wireless systems in sixth-generation and beyond communications for enhanced coverage and extreme data rates up to Terabits per second. For ELAA operating at Terahertz (THz) frequencies, the Rayleigh distance expands, and users are likely to be located in both far-field (FF) and near-field (NF) regions. On the one hand, new features like NF propagation and spatial non-stationarity need to be characterized. On the other hand, the transition of properties near the FF and NF boundary is worth exploring. In this paper, a complete experimental analysis of far- and near-field channel characteristics using a THz virtual antenna array is provided based on measurements of the multi-input-single-output channel with the virtual uniform planar array (UPA) structure of at most 4096 elements. In particular, non-linear phase change is observed in the NF, and the Rayleigh criterion regarding the maximum phase error is verified. Then, a new cross-field path loss model is proposed, which characterizes the power change at antenna elements in the UPA and is compatible with both FF and NF cases.
Submitted 3 February, 2024; v1 submitted 20 December, 2023;
originally announced December 2023.
-
Can Far-field Beam Training Be Deployed for Cross-field Beam Alignment in Terahertz UM-MIMO Communications?
Authors:
Yuhang Chen,
Chong Han,
Emil Björnson
Abstract:
Ultra-massive multiple-input multiple-output (UM-MIMO) is the enabler of Terahertz (THz) communications in next-generation wireless networks. In THz UM-MIMO systems, a new paradigm of cross-field communications spanning from near-field to far-field is emerging, since the near-field range expands with higher frequencies and larger array apertures. Precise beam alignment in cross-field is critical but challenging. Specifically, unlike far-field beams that rely only on the angle domain, the incorporation of dual-domain (angle and distance) training significantly increases overhead. A natural question arises of whether far-field beam training can be deployed for cross-field beam alignment. In this paper, this question is answered by demonstrating that the far-field training enables sufficient signal-to-noise ratio (SNR) in both far- and near-field scenarios, while exciting all channel dimensions. Based on that, we propose a subarray-coordinated hierarchical (SCH) training with greatly reduced overhead. To further obtain high-precision beam designs, we propose a two-phase angle and distance beam estimator (TPBE). Extensive simulations demonstrate the effectiveness of the proposed methods. Compared to near-field exhaustive search, the SCH requires only 0.2% of the training overhead. The TPBE achieves 0.01 degrees and 0.02 m estimation root-mean-squared errors for angle and distance. Furthermore, with the estimated beam directions, a near-optimal SNR with 0.11 dB deviation is attained after beam alignment.
Submitted 12 January, 2024; v1 submitted 16 December, 2023;
originally announced December 2023.
-
A Robust Deep Learning Method with Uncertainty Estimation for the Pathological Classification of Renal Cell Carcinoma based on CT Images
Authors:
Ni Yao,
Hang Hu,
Kaicong Chen,
Chen Zhao,
Yuan Guo,
Boya Li,
Jiaofen Nan,
Yanting Li,
Chuang Han,
Fubao Zhu,
Weihua Zhou,
Li Tian
Abstract:
Objectives: To develop and validate a deep learning-based diagnostic model incorporating uncertainty estimation so as to facilitate radiologists in the preoperative differentiation of the pathological subtypes of renal cell carcinoma (RCC) based on CT images. Methods: Data from 668 consecutive patients with pathologically proven RCC were retrospectively collected from Center 1. By using five-fold cross-validation, a deep learning model incorporating uncertainty estimation was developed to classify RCC subtypes into clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC (chRCC). An external validation set of 78 patients from Center 2 further evaluated the model's performance. Results: In the five-fold cross-validation, the model's area under the receiver operating characteristic curve (AUC) for the classification of ccRCC, pRCC, and chRCC was 0.868 (95% CI: 0.826-0.923), 0.846 (95% CI: 0.812-0.886), and 0.839 (95% CI: 0.802-0.88), respectively. In the external validation set, the AUCs were 0.856 (95% CI: 0.838-0.882), 0.787 (95% CI: 0.757-0.818), and 0.793 (95% CI: 0.758-0.831) for ccRCC, pRCC, and chRCC, respectively. Conclusions: The developed deep learning model demonstrated robust performance in predicting the pathological subtypes of RCC, while the incorporated uncertainty emphasized the importance of understanding model confidence, which is crucial for assisting clinical decision-making for patients with renal tumors. Clinical relevance statement: Our deep learning approach, integrated with uncertainty estimation, offers clinicians a dual advantage: accurate RCC subtype predictions complemented by diagnostic confidence references, promoting informed decision-making for patients with RCC.
Submitted 12 November, 2023; v1 submitted 1 November, 2023;
originally announced November 2023.
-
Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation
Authors:
Xilin Jiang,
Cong Han,
Yinghao Aaron Li,
Nima Mesgarani
Abstract:
In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode 'what' and 'where' of spatial audios. MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audios, thereby enhancing both event classification and sound localization in downstream tasks. At its core, we propose a multi-level data augmentation pipeline that augments different levels of audio features, including waveforms, Mel spectrograms, and generalized cross-correlation (GCC) features. In addition, we introduce simple yet effective channel-wise augmentation methods to randomly swap the order of the microphones and mask Mel and GCC channels. By using these augmentations, we find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error. We also perform a comprehensive analysis of the effect of each augmentation method and a comparison of the fine-tuning performance using different amounts of labeled data.
Submitted 27 September, 2023;
originally announced September 2023.
-
How to Differentiate between Near Field and Far Field: Revisiting the Rayleigh Distance
Authors:
Shu Sun,
Renwang Li,
Xingchen Liu,
Liuxun Xue,
Chong Han,
Meixia Tao
Abstract:
Future wireless communication systems are likely to adopt extremely large aperture arrays and millimeter-wave/sub-THz frequency bands to achieve higher throughput, lower latency, and higher energy efficiency. Conventional wireless systems predominantly operate in the far field (FF) of the radiation source of signals. As the array size increases and the carrier wavelength shrinks, however, the near field (NF) becomes non-negligible. Since the NF and FF differ in many aspects, it is essential to distinguish their corresponding regions. In this article, we first provide a comprehensive overview of the existing NF-FF boundaries, then introduce a novel NF-FF demarcation method based on effective degrees of freedom (EDoF) of the channel. Since EDoF is intimately related to spectral efficiency, the EDoF-based border is able to characterize key channel performance more accurately, as compared with the classic Rayleigh distance. Furthermore, we analyze the main features of the EDoF-based NF-FF boundary and provide insights into wireless system design.
Submitted 22 September, 2023;
originally announced September 2023.
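As background for the classic Rayleigh distance that the abstract above revisits, the following is a minimal sketch using the textbook expression 2D²/λ, where D is the largest aperture dimension and λ is the carrier wavelength. The function name `rayleigh_distance` and the example aperture and frequency are illustrative assumptions and are not taken from the paper, whose EDoF-based boundary is a different criterion.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def rayleigh_distance(aperture_m, frequency_hz):
    """Classic Rayleigh distance 2 * D**2 / wavelength, in metres."""
    if aperture_m <= 0 or frequency_hz <= 0:
        raise ValueError("aperture and frequency must be positive")
    wavelength = SPEED_OF_LIGHT / frequency_hz
    return 2.0 * aperture_m ** 2 / wavelength

# Hypothetical example: a 0.1 m aperture at 300 GHz gives a boundary of
# roughly 20 m, which shows why near-field users become likely in this band.
boundary_m = rayleigh_distance(0.1, 300e9)
```

Because the distance scales with D² and with frequency, both larger arrays and higher carrier frequencies push the boundary outward.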
-
HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform
Authors:
Yinghao Aaron Li,
Cong Han,
Xilin Jiang,
Nima Mesgarani
Abstract:
Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain that uses a sinusoidal source from the fundamental frequency (F0) inferred via a pre-trained F0 estimation network for fast inference speed. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves comparable performance to BigVGAN while being four times faster with only $1/6$ of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high-quality speech synthesis.
Submitted 18 September, 2023;
originally announced September 2023.
-
ASPED: An Audio Dataset for Detecting Pedestrians
Authors:
Pavan Seshadri,
Chaeyeon Han,
Bon-Woo Koo,
Noah Posner,
Subhrajit Guhathakurta,
Alexander Lerch
Abstract:
We introduce the new audio analysis task of pedestrian detection and present a new large-scale dataset for this task. While the preliminary results prove the viability of using audio approaches for pedestrian detection, they also show that this challenging task cannot be easily solved with standard approaches.
Submitted 16 January, 2024; v1 submitted 12 September, 2023;
originally announced September 2023.
-
Attenuation and Loss of Spatial Coherence Modeling for Atmospheric Turbulence in Terahertz UAV MIMO Channels
Authors:
Weijun Gao,
Chong Han,
Zhi Chen
Abstract:
Terahertz (THz) wireless communications have the potential to realize ultra-high-speed and secure data transfer with miniaturized devices for unmanned aerial vehicle (UAV) communications. The atmospheric turbulence due to random airflow leads to spatial inhomogeneity of the communication medium, which is missing from most existing studies and causes additional propagation loss and even loss of spatial coherence (LoSC) in MIMO systems. In this paper, the attenuation and loss of spatial coherence for atmospheric turbulence are modeled in THz UAV MIMO channels. Specifically, the frequency- and altitude-dependency of the refractive index structure constant (RISC), as a critical statistical parameter characterizing the intensity of turbulence, is first investigated. Then, the LoSC, fading, and attenuation caused by atmospheric turbulence are modeled, where the turbulence-induced fading is modeled by a Gamma-Gamma distribution, and the turbulence attenuation as a function of altitude and frequency is derived. Numerical results show that the turbulence leads to at most 10 dB attenuation with frequency less than 1 THz and distance less than 10 km. Furthermore, when the distance is 10 km and the RISC is $10^{-9}$ m$^{-2/3}$, the loss of spatial coherence effect leads to 10 dB additional loss for a 1024$\times$1024 ultra-massive MIMO system.
Submitted 26 January, 2024; v1 submitted 20 August, 2023;
originally announced August 2023.
-
A Universal Attenuation Model of Terahertz Wave in Space-Air-Ground Channel Medium
Authors:
Zhirong Yang,
Weijun Gao,
Chong Han
Abstract:
Providing continuous bandwidth over several tens of GHz, the Terahertz (THz) band (0.1-10 THz) supports space-air-ground integrated network (SAGIN) in 6G and beyond wireless networks. However, it is still a mystery how THz waves interact with the channel medium in SAGIN. In this paper, a universal space-air-ground attenuation model is proposed for THz waves, which incorporates the attenuation effects induced by particles including condensed particles, molecules, and free electrons. The proposed model is developed from insight into the attenuation effects, namely, the physical picture that attenuation is the result of collisions between the photons that are the essence of THz waves and particles in the environment. Based on the attenuation model, the propagation loss of THz waves in the atmosphere and in outer space is numerically assessed. The results indicate that the attenuation effects except free space loss are all negligible at altitudes above 50 km, while they need to be considered in the atmosphere below 50 km. Furthermore, the capacities of THz SAGIN are evaluated in space-ground, space-sea, ground-sea, and sea-sea scenarios, respectively.
Submitted 2 August, 2023;
originally announced August 2023.
-
Exploring the Interactions between Target Positive and Negative Information for Acoustic Echo Cancellation
Authors:
Chang Han,
Xinmeng Xu,
Weiping Tu,
Yuhong Yang,
Yajie Liu
Abstract:
Acoustic echo cancellation (AEC) aims to remove interference signals while leaving near-end speech least distorted. Because the patterns of near-end speech and interference signals are hard to distinguish, near-end speech cannot be separated completely, causing speech distortion and residual interference signals. We observe that besides target positive information, e.g., ground-truth speech and features, the target negative information, such as interference signals and features, helps make the patterns of target speech and interference signals more discriminative. Therefore, we present a novel encoder-decoder AEC model guided by negative information, termed CMNet. A collaboration module (CM) is designed to establish the correlation between the target positive and negative information in a learnable manner via three blocks: target positive, target negative, and interactive block. Experimental results demonstrate that our CMNet outperforms recent methods.
Submitted 25 July, 2023;
originally announced July 2023.
-
SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs
Authors:
Yinghao Aaron Li,
Cong Han,
Nima Mesgarani
Abstract:
In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement. These applications typically involve mapping text or speech inputs to pre-trained SLM representations, from which target speech is decoded. This paper introduces a new approach, SLMGAN, to leverage SLM representations for discriminative tasks within the generative adversarial network (GAN) framework, specifically for voice conversion. Building upon StarGANv2-VC, we add our novel SLM-based WavLM discriminators on top of the mel-based discriminators along with our newly designed SLM feature matching loss function, resulting in an unsupervised zero-shot voice conversion system that does not require text labels during training. Subjective evaluation results show that SLMGAN outperforms existing state-of-the-art zero-shot voice conversion models in terms of naturalness and achieves comparable similarity, highlighting the potential of SLM-based discriminators for related applications.
Submitted 18 July, 2023;
originally announced July 2023.
-
Still Waters Run Deep: Extend THz Coverage with Non-Intelligent Reflecting Surface
Authors:
Chong Han,
Yuanbo Li,
Yinqin Wang
Abstract:
Large reflection and diffraction losses in the Terahertz (THz) band give rise to degraded coverage abilities in non-line-of-sight (NLoS) areas. To overcome this, a non-intelligent reflecting surface (NIRS) can be used, which is essentially a rough surface made of metal materials. NIRS is not only able to enhance received power in large NLoS areas through rich reflections and scattering, but is also costless and very easy to fabricate and implement. In this article, we first thoroughly compare NIRS with the widely discussed intelligent reflecting surface (IRS) and point out the unique advantages of NIRS over IRS. Furthermore, experimental results are elaborated to show the effectiveness of NIRS in improving coverage. Last but not least, open problems and future directions are highlighted to inspire future research efforts on NIRS.
Submitted 10 July, 2023;
originally announced July 2023.
-
Time-Frequency-Space Transmit Design and Receiver Processing with Dynamic Subarray for Terahertz Integrated Sensing and Communication
Authors:
Yongzhi Wu,
Chong Han,
Meixia Tao
Abstract:
Terahertz (THz) integrated sensing and communication (ISAC) enables simultaneous data transmission with Terabit-per-second (Tbps) rate and millimeter-level accurate sensing. To realize such a blueprint, ultra-massive antenna arrays with directional beamforming are used to compensate for severe path loss in the THz band. In this paper, the time-frequency-space transmit design is investigated for THz ISAC to generate time-varying scanning sensing beams and stable communication beams. Specifically, with the dynamic array-of-subarray (DAoSA) hybrid beamforming architecture and multi-carrier modulation, two ISAC hybrid precoding algorithms are proposed, namely, a vectorization (VEC) based algorithm that outperforms existing ISAC hybrid precoding methods and a low-complexity sensing codebook assisted (SCA) approach. Meanwhile, coupled with the transmit design, parameter estimation algorithms are proposed to realize high-accuracy sensing, including a wideband DAoSA MUSIC method for angle estimation and a sum-DFT-GSS approach for range and velocity estimation. Numerical results indicate that the proposed algorithms can realize centi-degree-level angle estimation accuracy and millimeter-level range estimation accuracy, which are one or two orders of magnitude better than the methods in the millimeter-wave band. In addition, to overcome the cyclic prefix limitation and Doppler effects, an inter-symbol interference- and inter-carrier interference-tackled sensing algorithm is developed to refine sensing capabilities for THz ISAC.
Submitted 25 March, 2024; v1 submitted 10 July, 2023;
originally announced July 2023.
-
MLA-BIN: Model-level Attention and Batch-instance Style Normalization for Domain Generalization of Federated Learning on Medical Image Segmentation
Authors:
Fubao Zhu,
Yanhui Tian,
Chuang Han,
Yanting Li,
Jiaofen Nan,
Ni Yao,
Weihua Zhou
Abstract:
The privacy protection mechanism of federated learning (FL) offers an effective solution for cross-center medical collaboration and data sharing. In multi-site medical image segmentation, each medical site serves as a client of FL, and its data naturally forms a domain. FL makes it possible to improve the performance of the model on seen domains. However, there is a problem of domain generalization (DG) in actual deployment, that is, the performance of the model trained by FL decreases in unseen domains. Hence, MLA-BIN is proposed to solve the DG of FL in this study. Specifically, the model-level attention module (MLA) and batch-instance style normalization (BIN) block were designed. The MLA represents the unseen domain as a linear combination of seen domain models. The attention mechanism is introduced for the weighting coefficient to obtain the optimal coefficient according to the similarity of inter-domain data features. MLA enables the global model to generalize to the unseen domain. In the BIN block, batch normalization (BN) and instance normalization (IN) are combined and applied in the shallow layers of the segmentation network for style normalization, addressing the influence of inter-domain image style differences on DG. The extensive experimental results of two medical image segmentation tasks demonstrate that the proposed MLA-BIN outperforms state-of-the-art methods.
Submitted 29 June, 2023;
originally announced June 2023.
-
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Authors:
Yinghao Aaron Li,
Cong Han,
Vinay S. Raghavan,
Gavin Mischler,
Nima Mesgarani
Abstract:
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
Submitted 19 November, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Transformer-based GAN for Terahertz Spatial-Temporal Channel Modeling and Generating
Authors:
Zhengdong Hu,
Yuanbo Li,
Chong Han
Abstract:
Terahertz (THz) communications are envisioned as a promising technology for 6G and beyond wireless systems, providing ultra-broad continuous bandwidth and thus Terabit-per-second (Tbps) data rates. However, as the foundation for designing THz communications, channel modeling and characterization are fundamental to scrutinize the potential of the new spectrum. Relying on time-consuming and costly physical measurements, traditional statistical channel modeling methods suffer from low accuracy because of their assumed distributions and empirical parameters. In this paper, a transformer-based generative adversarial network modeling method (T-GAN) is proposed in the THz band, which exploits the advantage of GAN in modeling the complex distribution, and the powerful expressive capability of transformer structure. Experimental results reveal that the distribution of channels generated by the proposed T-GAN method shows good agreement with the original channels in terms of the delay spread and angular spread. Moreover, T-GAN achieves good performance in modeling the power delay angular profile, with 2.18 dB root-mean-square error (RMSE).
Submitted 12 June, 2023;
originally announced June 2023.
-
Unsupervised Multi-channel Separation and Adaptation
Authors:
Cong Han,
Kevin Wilson,
Scott Wisdom,
John R. Hershey
Abstract:
A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the multi-channel setting. We use MixIT to train a model on far-field microphone array recordings of overlapping reverberant and noisy speech from the AMI Corpus. The models are trained on both supervised and unsupervised training data, and are tested on real AMI recordings containing overlapping speech. To objectively evaluate our models, we also use a synthetic multi-channel AMI test set. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest improvement to SI-SNR and to human listening ratings across synthetic and real datasets, outperforming supervised models trained on well-matched synthetic data. Our results demonstrate that unsupervised learning through MixIT enables model adaptation on both single- and multi-channel real-world speech recordings.
Submitted 22 March, 2024; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Scintillation and Attenuation Modelling of Atmospheric Turbulence for Terahertz UAV Channels
Authors:
Weijun Gao,
Chong Han,
Zhi Chen
Abstract:
Terahertz (THz) wireless communications have the potential to realize ultra-high-speed and secure data transfer with miniaturized devices for unmanned aerial vehicle (UAV) communications. Existing THz channel models for aerial scenarios assume a homogeneous medium along the line-of-sight propagation path. However, atmospheric turbulence due to random airflow leads to temporal and spatial inhomogeneity of the communication medium, motivating analysis and modelling of the THz UAV communication channel. In this paper, we statistically model the scintillation and attenuation effects of turbulence on THz UAV channels. Specifically, the frequency and altitude dependency of the refractive index structure constant, a critical statistical parameter characterizing the intensity of turbulence, is first investigated. Then, the scintillation characteristic and attenuation of THz communications caused by atmospheric turbulence are modelled, where the scintillation effect is modelled by a Gamma-Gamma distribution, and the turbulence attenuation as a function of altitude and frequency is derived. Numerical simulations of the refractive index structure constant, scintillation, and attenuation in the THz band are presented to quantitatively analyze the influence of turbulence on THz UAV channels. It is found that THz turbulence can lead to at most 10 dB attenuation at frequencies below 1 THz and distances below 10 km.
Submitted 15 May, 2023;
originally announced May 2023.
-
All Information is Necessary: Integrating Speech Positive and Negative Information by Contrastive Learning for Speech Enhancement
Authors:
Xinmeng Xu,
Weiping Tu,
Chang Han,
Yuhong Yang
Abstract:
Monaural speech enhancement (SE) is an ill-posed problem due to the irreversible degradation process. Recent SE methods rely solely on positive information, e.g., ground-truth speech and speech-relevant features. In contrast, we observe that negative information, such as the original speech mixture and speech-irrelevant features, is valuable for guiding the SE model training procedure. In this study, we propose an SE model that integrates both positive and negative speech information to improve SE performance by adopting contrastive learning, with two innovations. (1) We design a collaboration module (CM), which contains two parts: contrastive attention for separating relevant and irrelevant features via contrastive learning, and interactive attention for establishing the correlation between both speech features in a learnable and self-adaptive manner. (2) We propose a contrastive regularization (CR) built upon contrastive learning to ensure that the estimated speech is pulled closer to the clean speech and pushed far away from the noisy speech in the representation space by integrating self-supervised models. We term the proposed SE network with CM and CR as CMCR-Net. Experimental results demonstrate that our CMCR-Net achieves comparable and superior performance to recent approaches.
Submitted 26 April, 2023;
originally announced April 2023.
-
Good Neighbors Are All You Need for Chinese Grapheme-to-Phoneme Conversion
Authors:
Jungjun Kim,
Changjin Han,
Gyuhyeon Nam,
Gyeongsu Chae
Abstract:
Most Chinese Grapheme-to-Phoneme (G2P) systems employ a three-stage framework that first transforms input sequences into character embeddings, obtains linguistic information using language models, and then predicts the phonemes based on global context about the entire input sequence. However, linguistic knowledge alone is often inadequate. Language models frequently encode overly general structures of a sentence and fail to cover specific cases needed to use phonetic knowledge. Also, a handcrafted post-processing system is needed to address the problems relevant to the tone of the characters. However, the system exhibits inconsistency in the segmentation of word boundaries which consequently degrades the performance of the G2P system. To address these issues, we propose the Reinforcer that provides strong inductive bias for language models by emphasizing the phonological information between neighboring characters to help disambiguate pronunciations. Experimental results show that the Reinforcer boosts the cutting-edge architectures by a large margin. We also combine the Reinforcer with a large-scale pre-trained model and demonstrate the validity of using neighboring context in knowledge transfer scenarios.
Submitted 14 March, 2023;
originally announced March 2023.
-
Online Binaural Speech Separation of Moving Speakers With a Wavesplit Network
Authors:
Cong Han,
Nima Mesgarani
Abstract:
Binaural speech separation in real-world scenarios often involves moving speakers. Most current speech separation methods use utterance-level permutation invariant training (u-PIT) for training. At inference time, however, the order of outputs can be inconsistent over time, particularly in long-form speech separation. This situation, which is referred to as the speaker swap problem, is even more problematic when speakers constantly move in space and therefore poses a challenge for consistent placement of speakers in output channels. Here, we describe a real-time binaural speech separation model based on a Wavesplit network to mitigate the speaker swap problem for moving speaker separation. Our model computes a speaker embedding for each speaker at each time frame from the mixed audio, aggregates embeddings using online clustering, and uses cluster centroids as speaker profiles to track each speaker throughout the long duration. Experimental results on reverberant, long-form moving multitalker speech separation show that the proposed method is less prone to speaker swap and achieves performance comparable to u-PIT based models with ground truth tracking in both separation accuracy and preservation of the interaural cues.
Submitted 13 March, 2023;
originally announced March 2023.
-
FedDBL: Communication and Data Efficient Federated Deep-Broad Learning for Histopathological Tissue Classification
Authors:
Tianpeng Deng,
Yanqi Huang,
Guoqiang Han,
Zhenwei Shi,
Jiatai Lin,
Qi Dou,
Zaiyi Liu,
Xiao-jing Guo,
C. L. Philip Chen,
Chu Han
Abstract:
Histopathological tissue classification is a fundamental task in computational pathology. Deep learning-based models have achieved superior performance, but centralized training with data centralization suffers from the privacy leakage problem. Federated learning (FL) can safeguard privacy by keeping training samples local, but existing FL-based frameworks require a large number of well-annotated training samples and numerous rounds of communication, which hinder their practicability in real-world clinical scenarios. In this paper, we propose a universal and lightweight federated learning framework, named Federated Deep-Broad Learning (FedDBL), to achieve superior classification performance with limited training samples and only one-round communication. By simply associating a pre-trained deep learning feature extractor, a fast and lightweight broad learning inference system and a classical federated aggregation approach, FedDBL can dramatically reduce data dependency and improve communication efficiency. Five-fold cross-validation demonstrates that FedDBL greatly outperforms the competitors with only one-round communication and limited training samples, while it even achieves performance comparable to the ones under multiple-round communications. Furthermore, due to the lightweight design and one-round communication, FedDBL reduces the communication burden from 4.6GB to only 276.5KB per client using the ResNet-50 backbone at 50-round training. Since no data or deep models are shared across different clients, the privacy issue is well solved and model security is guaranteed, with no risk of model inversion attacks. Code is available at https://github.com/tianpeng-deng/FedDBL.
Submitted 17 December, 2023; v1 submitted 24 February, 2023;
originally announced February 2023.
-
HCGMNET: A Hierarchical Change Guiding Map Network For Change Detection
Authors:
Chengxi Han,
Chen Wu,
Bo Du
Abstract:
Very-high-resolution (VHR) remote sensing (RS) image change detection (CD) has been a challenging task for its very rich spatial information and sample imbalance problem. In this paper, we have proposed a hierarchical change guiding map network (HCGMNet) for change detection. The model uses hierarchical convolution operations to extract multiscale features, continuously merges multi-scale features layer by layer to improve the expression of global and local information, and guides the model to gradually refine edge features and comprehensive performance by a change guide module (CGM), which is a self-attention with changing guide map. Extensive experiments on two CD datasets show that the proposed HCGMNet architecture achieves better CD performance than existing state-of-the-art (SOTA) CD methods.
Submitted 13 March, 2023; v1 submitted 20 February, 2023;
originally announced February 2023.
-
Improved Decoding of Attentional Selection in Multi-Talker Environments with Self-Supervised Learned Speech Representation
Authors:
Cong Han,
Vishal Choudhari,
Yinghao Aaron Li,
Nima Mesgarani
Abstract:
Auditory attention decoding (AAD) is a technique used to identify and amplify the talker that a listener is focused on in a noisy environment. This is done by comparing the listener's brainwaves to a representation of all the sound sources to find the closest match. The representation is typically the waveform or spectrogram of the sounds. The effectiveness of these representations for AAD is uncertain. In this study, we examined the use of self-supervised learned speech representation in improving the accuracy and speed of AAD. We recorded the brain activity of three subjects using invasive electrocorticography (ECoG) as they listened to two conversations and focused on one. We used WavLM to extract a latent representation of each talker and trained a spatiotemporal filter to map brain activity to intermediate representations of speech. During the evaluation, the reconstructed representation is compared to each speaker's representation to determine the target speaker. Our results indicate that speech representation from WavLM provides better decoding accuracy and speed than the speech envelope and spectrogram. Our findings demonstrate the advantages of self-supervised learned speech representation for auditory attention decoding and pave the way for developing brain-controlled hearable technologies.
Submitted 11 February, 2023;
originally announced February 2023.
-
Incremental Value and Interpretability of Radiomics Features of Both Lung and Epicardial Adipose Tissue for Detecting the Severity of COVID-19 Infection
Authors:
Ni Yao,
Yanhui Tian,
Daniel Gama das Neves,
Chen Zhao,
Claudio Tinoco Mesquita,
Wolney de Andrade Martins,
Alair Augusto Sarmet Moreira Damas dos Santos,
Yanting Li,
Chuang Han,
Fubao Zhu,
Neng Dai,
Weihua Zhou
Abstract:
Epicardial adipose tissue (EAT) is known for its pro-inflammatory properties and association with Coronavirus Disease 2019 (COVID-19) severity. However, current EAT segmentation methods do not consider positional information. Additionally, the detection of COVID-19 severity lacks consideration for EAT radiomics features, which limits interpretability. This study investigates the use of radiomics features from EAT and lungs to detect the severity of COVID-19 infections. A retrospective analysis of 515 patients with COVID-19 (Cohort1: 415, Cohort2: 100) was conducted using a proposed three-stage deep learning approach for EAT extraction. Lung segmentation was achieved using a published method. A hybrid model for detecting the severity of COVID-19 was built in a derivation cohort, and its performance and uncertainty were evaluated in internal (125, Cohort1) and external (100, Cohort2) validation cohorts. For EAT extraction, the Dice similarity coefficients (DSC) of the two centers were 0.972 (±0.011) and 0.968 (±0.005), respectively. For severity detection, the hybrid model with radiomics features of both lungs and EAT showed improvements in AUC, net reclassification improvement (NRI), and integrated discrimination improvement (IDI) compared to the model with only lung radiomics features. The hybrid model exhibited an increase of 0.1 (p<0.001), 19.3%, and 18.0%, respectively, in the internal validation cohort and an increase of 0.09 (p<0.001), 18.0%, and 18.0%, respectively, in the external validation cohort while outperforming existing detection methods. Uncertainty quantification and radiomics features analysis confirmed the interpretability of case prediction after inclusion of EAT features.
Submitted 6 December, 2023; v1 submitted 28 January, 2023;
originally announced January 2023.
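The Dice similarity coefficient (DSC) reported for EAT extraction in the entry above is the standard overlap score 2|A∩B| / (|A| + |B|) between a predicted and a reference segmentation mask. The following is a minimal sketch of that formula only, not the authors' evaluation code; the function name, the flattened-list mask representation, the empty-mask convention, and the toy masks are all assumptions made for illustration.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical toy masks (not data from the study).
pred_mask = [0, 1, 1, 1, 0, 0]
truth_mask = [0, 1, 1, 0, 0, 1]
print(dice_coefficient(pred_mask, truth_mask))
```

A value of 1.0 means the two masks coincide exactly, and 0.0 means they share no foreground voxels.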
-
Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions
Authors:
Yinghao Aaron Li,
Cong Han,
Xilin Jiang,
Nima Mesgarani
Abstract:
Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sub-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.
Submitted 20 January, 2023;
originally announced January 2023.
-
Cross Far- and Near-field Wireless Communications in Terahertz Ultra-large Antenna Array Systems
Authors:
Chong Han,
Yuhang Chen,
Longfei Yan,
Zhi Chen,
Linglong Dai
Abstract:
The Terahertz (THz) band, with its abundant multi-ten-GHz bandwidth, is capable of supporting Terabit-per-second wireless communications and is a pillar technology for 6G and beyond systems. With sub-millimeter-long antennas, ultra-massive (UM) MIMO and intelligent surface (IS) systems with thousands of array elements are exploited to effectively combat the distance limitation and blockage problems, which together form a promising THz ultra-large antenna array (ULAA) system. As a combined effect of wavelength and array aperture, the resulting coverage of THz systems ranges from near-field to far-field, leading to a new paradigm of cross-field communications. Although channel models, communications theories, and networking strategies have been studied for far-field and near-field separately, the unified design of cross-field communications that achieve high spectral efficiency and low complexity is still missing. In this article, the challenges and features of THz ULAA cross-field communications are investigated. Furthermore, cross-field solutions from three perspectives are presented, including a hybrid spherical- and planar-wave channel model, cross-field channel estimation, and widely-spaced multi-subarray hybrid beamforming, where a subarray as a basic unit in THz ULAA systems is exploited. The approximation error of channel modeling accuracy, spectral efficiency, and estimation error of these designs are numerically evaluated. Finally, as a roadmap of THz ULAA cross-field communications, multiple open problems and potential research directions are elaborated.
Submitted 3 August, 2023; v1 submitted 8 January, 2023;
originally announced January 2023.
-
Transfer Generative Adversarial Networks (T-GAN)-based Terahertz Channel Modeling
Authors:
Zhengdong Hu,
Yuanbo Li,
Chong Han
Abstract:
Terahertz (THz) communications are envisioned as a promising technology for 6G and beyond wireless systems, providing ultra-broad bandwidth and thus Terabit-per-second (Tbps) data rates. However, as the foundation of designing THz communications, channel modeling and characterization are fundamental to scrutinizing the potential of the new spectrum. Relying on physical measurements, traditional statistical channel modeling methods suffer from low accuracy due to their assumed distributions and empirical parameters. Moreover, it is time-consuming and expensive to acquire extensive channel measurements in the THz band. In this paper, a transfer generative adversarial network (T-GAN) based modeling method is proposed in the THz band, which exploits the advantage of GANs in modeling complex distributions, and the benefit of transfer learning in transferring knowledge from a source task to improve generalization on the target task with limited training data. Specifically, to start with, the proposed GAN is pre-trained using a simulated dataset generated by the standard channel model from the 3rd Generation Partnership Project (3GPP). Furthermore, by transferring the knowledge and fine-tuning the pre-trained GAN, the T-GAN is developed using a small THz measured dataset. Experimental results reveal that the distribution of power delay profiles (PDPs) generated by the proposed T-GAN method shows good agreement with measurement. Moreover, T-GAN achieves good performance in channel modeling, with 9 dB improved root-mean-square error (RMSE) and higher Structure Similarity Index Measure (SSIM), compared with the traditional 3GPP method.
Submitted 3 January, 2023;
originally announced January 2023.
-
StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models
Authors:
Yinghao Aaron Li,
Cong Han,
Nima Mesgarani
Abstract:
One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity and speech content, a task that still remains challenging. Here, we propose a novel approach to learning disentangled speech representation by transfer learning from style-based text-to-speech (TTS) models. With cycle consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through a teacher-student knowledge transfer and novel data augmentation scheme, our approach results in disentangled speech representation without needing the input text. The subjective evaluation shows that our approach can significantly outperform the previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.
Submitted 29 December, 2022;
originally announced December 2022.
-
DSS-o-SAGE: Direction-Scan Sounding-Oriented SAGE Algorithm for Channel Parameter Estimation in mmWave and THz Bands
Authors:
Yuanbo Li,
Chong Han,
Yi Chen,
Ziming Yu,
Xuefeng Yin
Abstract:
Investigation of millimeter-wave (mmWave) and Terahertz (THz) channels relies on channel measurements and estimation of multi-path component (MPC) parameters. As a common measurement technique in the mmWave and THz bands, direction-scan sounding (DSS) resolves angular information and increases the measurable distance. Through mechanical rotation, the DSS creates a virtual multi-antenna sounding system, which however incurs signal phase instability and large data sizes. These are not fully considered in existing estimation algorithms and thus make them ineffective. To tackle this research gap, in this paper, a DSS-oriented space-alternating generalized expectation-maximization (DSS-o-SAGE) algorithm is proposed for channel parameter estimation in mmWave and THz bands. To appropriately capture the measured data in mmWave and THz DSS, the phase instability is modeled by the scanning-direction-dependent signal phases. Furthermore, based on the signal model, the DSS-o-SAGE algorithm is developed, which not only addresses the problems brought by phase instability, but also achieves ultra-low computational complexity by exploiting the narrow antenna beam property of DSS. Simulations in synthetic channels are conducted to demonstrate the efficacy of the proposed algorithm and explore the applicable region of the far-field approximation in DSS-o-SAGE. Last but not least, the proposed DSS-o-SAGE algorithm is applied to real measurements in an indoor corridor scenario at 300 GHz. Compared with results using the baseline noise-elimination method, the channel is characterized more correctly and reasonably based on the DSS-o-SAGE.
Submitted 4 March, 2024; v1 submitted 28 November, 2022;
originally announced December 2022.
-
Terahertz Channel Measurement and Analysis on a University Campus Street
Authors:
Yiqin Wang,
Yuanbo Li,
Yi Chen,
Ziming Yu,
Chong Han
Abstract:
Owning abundant bandwidth resources, the Terahertz (0.1-10 THz) band is a promising spectrum to support sixth-generation (6G) and beyond communications. As the foundation of channel study in the spectrum, channel measurement is ongoing to cover representative 6G communication scenarios and promising THz frequency bands. In this paper, a wideband channel measurement in an L-shaped university campus street is conducted at 306-321 GHz and 356-371 GHz. In particular, ten line-of-sight (LoS) and eight non-line-of-sight (NLoS) points are measured at the two frequency bands, respectively. In total, 6480 channel impulse responses (CIRs) are obtained from the measurement, based on which multi-path propagation in the L-shaped roadway in the THz band is elaborated to identify major scatterers such as walls and vehicles in the environment and their impact on multi-path components (MPCs). Furthermore, outdoor THz channel characteristics in the two frequency bands are analyzed, including path losses, shadow fading, cluster parameters, delay spread and angular spread. In contrast with their counterparts in similar outdoor scenarios at lower frequencies, the results verify the sparsity of MPCs at THz frequencies and indicate smaller power spreads in both temporal and spatial domains in the THz band.
Submitted 21 November, 2022;
originally announced November 2022.
-
300 GHz Wideband Channel Measurement and Analysis in a Lobby
Authors:
Yiqin Wang,
Yuanbo Li,
Yi Chen,
Ziming Yu,
Chong Han
Abstract:
The Terahertz (0.1-10 THz) band has been envisioned as one of the promising spectrum bands to support ultra-broadband sixth-generation (6G) and beyond communications. In this paper, a wideband channel measurement campaign in a 500-square-meter indoor lobby at 306-321 GHz is presented. The measurement system consists of a vector network analyzer (VNA)-based channel sounder, and a directional antenna equipped at the receiver to resolve multi-path components (MPCs) in the angular domain. In particular, 21 positions and 3780 channel impulse responses (CIRs) are measured in the lobby, including the line-of-sight (LoS), non-line-of-sight (NLoS) and obstructed-line-of-sight (OLoS) cases. The multi-path characteristics are summarized as follows. First, the main scatterers in the lobby include the glass, the pillar, and the LED screen. Second, best direction and omni-directional path losses are analyzed. Compared with the close-in path loss model, the optimal path loss offset in the alpha-beta path loss model exceeds 86 dB in the LoS case, and accordingly, the exponent decreases to 1.57 and below. Third, more than 10 clusters are observed in OLoS and NLoS cases, compared to 2.17 clusters on average in the LoS case. Fourth, the average power dispersion of MPCs is smaller in both temporal and angular domains in the LoS case, compared with the NLoS and OLoS counterparts. Finally, in contrast to hallway scenarios measured in previous works at the same frequency band, the lobby which is larger in dimension and square in shape, features larger path losses and smaller delay and angular spreads.
Submitted 23 May, 2023; v1 submitted 20 November, 2022;
originally announced November 2022.
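The lobby entry above contrasts the close-in (CI) and alpha-beta (AB) path loss models. In their commonly used forms, the CI model fixes the intercept at the free-space path loss at a 1 m reference distance and fits only the exponent n, PL = FSPL(f, 1 m) + 10 n log10(d), while the AB model fits both a slope alpha and a free offset beta, PL = 10 alpha log10(d) + beta. The sketch below fits both by ordinary least squares. It illustrates these textbook forms only and is not the authors' processing chain; the function names and the synthetic distances and path losses are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def fspl_1m_db(freq_hz):
    """Free-space path loss in dB at the 1 m reference distance."""
    return 20.0 * math.log10(4.0 * math.pi * freq_hz / C)


def fit_close_in(dist_m, pl_db, freq_hz):
    """Least-squares CI exponent n, with the intercept fixed to FSPL at 1 m."""
    ref = fspl_1m_db(freq_hz)
    x = [10.0 * math.log10(d) for d in dist_m]
    y = [p - ref for p in pl_db]
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_alpha_beta(dist_m, pl_db):
    """Least-squares AB slope alpha and offset beta (in dB)."""
    x = [10.0 * math.log10(d) for d in dist_m]
    mean_x = sum(x) / len(x)
    mean_y = sum(pl_db) / len(pl_db)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, pl_db))
    alpha = sxy / sxx
    return alpha, mean_y - alpha * mean_x


# Hypothetical synthetic samples (not measurement data from the paper).
distances_m = [2.0, 5.0, 10.0, 20.0]
path_loss_db = [10.0 * 1.5 * math.log10(d) + 86.0 for d in distances_m]
print(fit_alpha_beta(distances_m, path_loss_db))
print(fit_close_in(distances_m, path_loss_db, 300e9))
```

Because the AB model has a free offset, its fitted exponent can differ noticeably from the CI exponent on the same data, which is the kind of trade-off between offset and exponent that the abstract describes.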