
Showing 1–50 of 117 results for author: Han, C

Searching in archive eess. Search in all archives.
  1. arXiv:2408.15632  [pdf, other]

    eess.SY cs.AI

    Structural Optimization of Lightweight Bipedal Robot via SERL

    Authors: Yi Cheng, Chenxi Han, Yuheng Min, Linqi Ye, Houde Liu, Hang Liu

    Abstract: Designing a bipedal robot is a complex and challenging task, especially when dealing with a multitude of structural parameters. Traditional design methods often rely on human intuition and experience. However, such approaches are time-consuming and labor-intensive, lack theoretical guidance, and struggle to obtain optimal design results within vast design spaces, thus failing to fully exploit the inherent…

    Submitted 28 August, 2024; originally announced August 2024.

  2. arXiv:2407.19753  [pdf, other]

    cs.CV eess.SP

    PredIN: Towards Open-Set Gesture Recognition via Prediction Inconsistency

    Authors: Chen Liu, Can Han, Chengfeng Zhou, Crystal Cai, Dahong Qian

    Abstract: Gesture recognition based on surface electromyography (sEMG) has achieved significant progress in human-machine interaction (HMI). However, accurately recognizing predefined gestures within a closed set is still inadequate in practice; a robust open-set system needs to effectively reject unknown gestures while correctly classifying known ones. To handle this challenge, we first report prediction i…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Under review

  3. arXiv:2407.17510  [pdf]

    eess.SP

    Transfer Learning Enabled Transformer based Generative Adversarial Networks (TT-GAN) for Terahertz Channel Modeling and Generating

    Authors: Zhengdong Hu, Yuanbo Li, Chong Han

    Abstract: Terahertz (THz) communications, ranging from 100 GHz to 10 THz, are envisioned as a promising technology for 6G and beyond wireless systems. As the foundation for designing THz communications, channel modeling and characterization are crucial to scrutinize the potential of the new spectrum. However, current channel modeling and standardization heavily rely on measurements, which are both time-consuming…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2306.06902

  4. arXiv:2407.09732  [pdf, other]

    eess.AS cs.LG cs.SD

    Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis

    Authors: Xilin Jiang, Yinghao Aaron Li, Adrian Nicolas Florea, Cong Han, Nima Mesgarani

    Abstract: It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing Mamba with transformers in terms of both performance and efficiency in multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compar…

    Submitted 12 July, 2024; originally announced July 2024.

  5. arXiv:2407.03177  [pdf, other]

    cs.HC eess.SP

    EDPNet: An Efficient Dual Prototype Network for Motor Imagery EEG Decoding

    Authors: Can Han, Chen Liu, Crystal Cai, Jun Wang, Dahong Qian

    Abstract: Motor imagery electroencephalograph (MI-EEG) decoding plays a crucial role in developing motor imagery brain-computer interfaces (MI-BCIs). However, decoding intentions from MI remains challenging due to the inherent complexity of EEG signals relative to the small-sample size. In this paper, we propose an Efficient Dual Prototype Network (EDPNet) to enable accurate and fast MI decoding. EDPNet emp…

    Submitted 3 July, 2024; originally announced July 2024.

  6. arXiv:2406.11519  [pdf, other]

    cs.CV eess.IV

    HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model

    Authors: Di Wang, Meiqi Hu, Yao Jin, Yuchun Miao, Jiaqi Yang, Yichu Xu, Xiaolei Qin, Jiaqi Ma, Lingyu Sun, Chenxing Li, Chuan Fu, Hongruixuan Chen, Chengxi Han, Naoto Yokoya, Jing Zhang, Minqiang Xu, Lin Liu, Lefei Zhang, Chen Wu, Bo Du, Dacheng Tao, Liangpei Zhang

    Abstract: Foundation models (FMs) are revolutionizing the analysis and understanding of remote sensing (RS) scenes, including aerial RGB, multispectral, and SAR images. However, hyperspectral images (HSIs), which are rich in spectral information, have not seen much application of FMs, with existing methods often restricted to specific tasks and lacking generality. To fill this gap, we introduce HyperSIGMA,…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: The code and models will be released at https://github.com/WHU-Sigma/HyperSIGMA

  7. arXiv:2406.09998  [pdf, other]

    eess.AS cs.AI cs.LG cs.MM cs.SD

    Understanding Pedestrian Movement Using Urban Sensing Technologies: The Promise of Audio-based Sensors

    Authors: Chaeyeon Han, Pavan Seshadri, Yiwei Ding, Noah Posner, Bon Woo Koo, Animesh Agrawal, Alexander Lerch, Subhrajit Guhathakurta

    Abstract: While various sensors have been deployed to monitor vehicular flows, sensing pedestrian movement is still nascent. Yet walking is a significant mode of travel in many cities, especially those in Europe, Africa, and Asia. Understanding pedestrian volumes and flows is essential for designing safer and more attractive pedestrian infrastructure and for controlling periodic overcrowding. This study dis…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: submitted to Urban Informatics

  8. arXiv:2406.07923  [pdf, other]

    cs.SD cs.AI eess.AS

    CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting

    Authors: Sichen Jin, Youngmoon Jung, Seungjin Lee, Jaeyoung Roh, Changwoo Han, Hoonyoung Cho

    Abstract: This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment. For every input frame, the proposed method finds the optimal alignment ending at the frame using connectionist temporal classification (CTC) and aggregates the frame-level acoustic embedding (AE) to obtain higher-level (i.e., character, word, or phrase) AE that aligns with…

    Submitted 12 June, 2024; originally announced June 2024.

  9. arXiv:2406.05314  [pdf, other]

    eess.AS cs.AI eess.SP

    Relational Proxy Loss for Audio-Text based Keyword Spotting

    Authors: Youngmoon Jung, Seungjin Lee, Joon-Young Yang, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho

    Abstract: In recent years, there has been an increasing focus on user convenience, leading to increased interest in text-based keyword enrollment systems for keyword spotting (KWS). Since the system utilizes text input during the enrollment phase and audio input during actual usage, we call this task audio-text based KWS. To enable this task, both acoustic and text encoders are typically trained using deep…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, Accepted by Interspeech 2024

  10. arXiv:2405.16893  [pdf, other]

    cs.IT eess.SP

    Cross Far- and Near-Field Channel Measurement and Modeling in Extremely Large-scale Antenna Array (ELAA) Systems

    Authors: Yiqin Wang, Chong Han, Shu Sun, Jianhua Zhang

    Abstract: Technologies like ultra-massive multiple-input-multiple-output (UM-MIMO) and reconfigurable intelligent surfaces (RISs) are of special interest to meet the key performance indicators of future wireless systems including ubiquitous connectivity and lightning-fast data rates. One of their common features, the extremely large-scale antenna array (ELAA) systems with hundreds or thousands of antennas,…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 14 pages, 33 figures

  11. arXiv:2404.13286  [pdf, other]

    cs.SD cs.IR eess.AS

    Track Role Prediction of Single-Instrumental Sequences

    Authors: Changheon Han, Suhyun Lee, Minsam Ko

    Abstract: In the composition process, selecting appropriate single-instrumental music sequences and assigning their track-role is an indispensable task. However, manually determining the track-role for a myriad of music samples can be time-consuming and labor-intensive. This study introduces a deep learning model designed to automatically predict the track-role of single-instrumental music sequences. Our ev…

    Submitted 20 April, 2024; originally announced April 2024.

    Comments: ISMIR LBD 2023

  12. Change Guiding Network: Incorporating Change Prior to Guide Change Detection in Remote Sensing Imagery

    Authors: Chengxi Han, Chen Wu, Haonan Guo, Meiqi Hu, Jiepan Li, Hongruixuan Chen

    Abstract: The rapid advancement of automated artificial intelligence algorithms and remote sensing instruments has benefited change detection (CD) tasks. However, there is still considerable room to improve detection precision, especially regarding the edge integrity and internal holes of change features. To solve these problems, we design the Change Guiding Network (CGNet) to tackle the insufficien…

    Submitted 14 April, 2024; originally announced April 2024.

  13. arXiv:2404.03425  [pdf, other]

    eess.IV cs.AI cs.CV

    ChangeMamba: Remote Sensing Change Detection With Spatiotemporal State Space Model

    Authors: Hongruixuan Chen, Jian Song, Chengxi Han, Junshi Xia, Naoto Yokoya

    Abstract: Convolutional neural networks (CNNs) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings: CNNs are constrained by a limited receptive field that may hinder their ability to capture broader spatial contexts, while Transformers are computationally intensive, making them costly to train and deploy on…

    Submitted 26 July, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted by IEEE TGRS: https://ieeexplore.ieee.org/document/10565926

    Journal ref: IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-20, 2024, Art no. 4409720

  14. arXiv:2404.01192  [pdf, other]

    eess.IV cs.CV

    iMD4GC: Incomplete Multimodal Data Integration to Advance Precise Treatment Response Prediction and Survival Analysis for Gastric Cancer

    Authors: Fengtao Zhou, Yingxue Xu, Yanfen Cui, Shenyan Zhang, Yun Zhu, Weiyang He, Jiguang Wang, Xin Wang, Ronald Chan, Louis Ho Shing Lau, Chu Han, Dafu Zhang, Zhenhui Li, Hao Chen

    Abstract: Gastric cancer (GC) is a prevalent malignancy worldwide, ranking as the fifth most common cancer with over 1 million new cases and 700 thousand deaths in 2020. Locally advanced gastric cancer (LAGC) accounts for approximately two-thirds of GC diagnoses, and neoadjuvant chemotherapy (NACT) has emerged as the standard treatment for LAGC. However, the effectiveness of NACT varies significantly among…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: 27 pages, 9 figures, 3 tables (under review)

  15. arXiv:2403.18257  [pdf, other]

    eess.AS cs.SD

    Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation

    Authors: Xilin Jiang, Cong Han, Nima Mesgarani

    Abstract: Transformers have been the most successful architecture for various speech modeling tasks, including speech separation. However, the self-attention mechanism in transformers with quadratic complexity is inefficient in computation and memory. Recent models incorporate new layers and modules along with transformers for better performance but also introduce extra model complexity. In this work, we re…

    Submitted 30 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

    Comments: work in progress

  16. arXiv:2402.03710  [pdf, other]

    eess.AS cs.CL cs.SD

    Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

    Authors: Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani

    Abstract: In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Edit" (LCE), a novel multimodal sound mixture editor that modifies each sound source in a mixture based on user-provided text instructions. LCE distinguishes itself with a user-friendly chat interface and its unique ability to…

    Submitted 6 February, 2024; originally announced February 2024.

    Comments: preprint

  17. arXiv:2402.02349  [pdf]

    eess.IV cs.CV

    Vision Transformer-based Multimodal Feature Fusion Network for Lymphoma Segmentation on PET/CT Images

    Authors: Huan Huang, Liheng Qiu, Shenmiao Yang, Longxi Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Chen Zhao, Weihua Zhou

    Abstract: Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Metho…

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: 14 pages, 6 figures; reference added

  18. arXiv:2401.05711  [pdf, other]

    cs.LG eess.SP

    Dynamic Indoor Fingerprinting Localization based on Few-Shot Meta-Learning with CSI Images

    Authors: Jiyu Jiao, Xiaojun Wang, Chenpei Han, Yuhua Huang, Yizhuo Zhang

    Abstract: While fingerprinting localization is favored for its effectiveness, it is hindered by high data acquisition costs and the inaccuracy of static database-based estimates. Addressing these issues, this letter presents an innovative indoor localization method using a data-efficient meta-learning algorithm. This approach, grounded in the "Learning to Learn" paradigm of meta-learning, utilizes histori…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: 5 pages, 7 figures

  19. arXiv:2312.12964  [pdf, other]

    cs.IT eess.SP

    Far- and Near-Field Channel Measurements and Characterization in the Terahertz Band Using a Virtual Antenna Array

    Authors: Yiqin Wang, Shu Sun, Chong Han

    Abstract: Extremely large-scale antenna array (ELAA) technologies consisting of ultra-massive multiple-input-multiple-output (UM-MIMO) or reconfigurable intelligent surfaces (RISs), are emerging to meet the demand of wireless systems in sixth-generation and beyond communications for enhanced coverage and extreme data rates up to Terabits per second. For ELAA operating at Terahertz (THz) frequencies, the Ray…

    Submitted 3 February, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: 5 pages, 10 figures

  20. arXiv:2312.10394  [pdf, ps, other]

    cs.IT eess.SP

    Can Far-field Beam Training Be Deployed for Cross-field Beam Alignment in Terahertz UM-MIMO Communications?

    Authors: Yuhang Chen, Chong Han, Emil Björnson

    Abstract: Ultra-massive multiple-input multiple-output (UM-MIMO) is the enabler of Terahertz (THz) communications in next-generation wireless networks. In THz UM-MIMO systems, a new paradigm of cross-field communications spanning from near-field to far-field is emerging, since the near-field range expands with higher frequencies and larger array apertures. Precise beam alignment in cross-field is critical b…

    Submitted 12 January, 2024; v1 submitted 16 December, 2023; originally announced December 2023.

  21. arXiv:2311.00567  [pdf]

    eess.IV cs.CV cs.LG physics.med-ph q-bio.QM

    A Robust Deep Learning Method with Uncertainty Estimation for the Pathological Classification of Renal Cell Carcinoma based on CT Images

    Authors: Ni Yao, Hang Hu, Kaicong Chen, Chen Zhao, Yuan Guo, Boya Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Weihua Zhou, Li Tian

    Abstract: Objectives: To develop and validate a deep learning-based diagnostic model incorporating uncertainty estimation so as to facilitate radiologists in the preoperative differentiation of the pathological subtypes of renal cell carcinoma (RCC) based on CT images. Methods: Data from 668 consecutive patients with pathologically proven RCC were retrospectively collected from Center 1. By using five-fold cross…

    Submitted 12 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: 16 pages, 6 figures

  22. arXiv:2309.15938  [pdf, other]

    eess.AS cs.LG cs.SD

    Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation

    Authors: Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani

    Abstract: In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode 'what' and 'where' of spatial audios. MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audios, thereby enhancing both event classification and sound localization in downstream tasks. At its core, we propose a multi-level data augmentation pipeline that augment…

    Submitted 27 September, 2023; originally announced September 2023.

  23. arXiv:2309.13238  [pdf, other]

    eess.SP

    How to Differentiate between Near Field and Far Field: Revisiting the Rayleigh Distance

    Authors: Shu Sun, Renwang Li, Xingchen Liu, Liuxun Xue, Chong Han, Meixia Tao

    Abstract: Future wireless communication systems are likely to adopt extremely large aperture arrays and millimeter-wave/sub-THz frequency bands to achieve higher throughput, lower latency, and higher energy efficiency. Conventional wireless systems predominantly operate in the far field (FF) of the radiation source of signals. As the array size increases and the carrier wavelength shrinks, however, the near…

    Submitted 22 September, 2023; originally announced September 2023.

  24. arXiv:2309.09493  [pdf, other]

    eess.AS cs.AI cs.SD

    HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

    Authors: Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani

    Abstract: Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In th…

    Submitted 18 September, 2023; originally announced September 2023.

  25. arXiv:2309.06531  [pdf, other]

    eess.AS cs.SD

    ASPED: An Audio Dataset for Detecting Pedestrians

    Authors: Pavan Seshadri, Chaeyeon Han, Bon-Woo Koo, Noah Posner, Subhrajit Guhathakurta, Alexander Lerch

    Abstract: We introduce the new audio analysis task of pedestrian detection and present a new large-scale dataset for this task. While the preliminary results prove the viability of using audio approaches for pedestrian detection, they also show that this challenging task cannot be easily solved with standard approaches.

    Submitted 16 January, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

    Comments: 4+1 pages, ICASSP 2024

  26. arXiv:2308.10424  [pdf, other]

    eess.SY

    Attenuation and Loss of Spatial Coherence Modeling for Atmospheric Turbulence in Terahertz UAV MIMO Channels

    Authors: Weijun Gao, Chong Han, Zhi Chen

    Abstract: Terahertz (THz) wireless communications have the potential to realize ultra-high-speed and secure data transfer with miniaturized devices for unmanned aerial vehicle (UAV) communications. The atmospheric turbulence due to random airflow leads to spatial inhomogeneity of the communication medium, which is yet missing in most existing studies, leading to additional propagation loss and even loss of…

    Submitted 26 January, 2024; v1 submitted 20 August, 2023; originally announced August 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2305.08820

  27. arXiv:2308.00966  [pdf, other]

    eess.SY

    A Universal Attenuation Model of Terahertz Wave in Space-Air-Ground Channel Medium

    Authors: Zhirong Yang, Weijun Gao, Chong Han

    Abstract: Providing continuous bandwidth over several tens of GHz, the Terahertz (THz) band (0.1-10 THz) supports space-air-ground integrated network (SAGIN) in 6G and beyond wireless networks. However, it is still a mystery how THz waves interact with the channel medium in SAGIN. In this paper, a universal space-air-ground attenuation model is proposed for THz waves, which incorporates the attenuation effect…

    Submitted 2 August, 2023; originally announced August 2023.

  28. arXiv:2307.13888  [pdf, other]

    eess.AS cs.SD

    Exploring the Interactions between Target Positive and Negative Information for Acoustic Echo Cancellation

    Authors: Chang Han, Xinmeng Xu, Weiping Tu, Yuhong Yang, Yajie Liu

    Abstract: Acoustic echo cancellation (AEC) aims to remove interference signals while leaving near-end speech least distorted. Because the patterns of near-end speech and interference signals are hard to distinguish, near-end speech cannot be separated completely, causing speech distortion and residual interference signals. We observe that besides target positive information, e.g., ground-truth speech and features,…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Accepted at INTERSPEECH 2023

  29. arXiv:2307.09435  [pdf, other]

    eess.AS cs.AI cs.SD

    SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs

    Authors: Yinghao Aaron Li, Cong Han, Nima Mesgarani

    Abstract: In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement. These applications typically involve mapping text or speech inputs to pre-trained SLM representations, from which target speech is decoded. This paper introduc…

    Submitted 18 July, 2023; originally announced July 2023.

    Comments: WASPAA 2023

  30. arXiv:2307.04961  [pdf, other]

    cs.IT eess.SP

    Still Waters Run Deep: Extend THz Coverage with Non-Intelligent Reflecting Surface

    Authors: Chong Han, Yuanbo Li, Yinqin Wang

    Abstract: Large reflection and diffraction losses in the Terahertz (THz) band give rise to degraded coverage abilities in non-line-of-sight (NLoS) areas. To overcome this, a non-intelligent reflecting surface (NIRS) can be used, which is essentially a rough surface made of metal. NIRS is not only able to enhance received power in large NLoS areas through rich reflections and scattering, but also c…

    Submitted 10 July, 2023; originally announced July 2023.

    Comments: 6 pages, 5 figures, 1 table

  31. arXiv:2307.04440  [pdf, ps, other]

    cs.IT eess.SP

    Time-Frequency-Space Transmit Design and Receiver Processing with Dynamic Subarray for Terahertz Integrated Sensing and Communication

    Authors: Yongzhi Wu, Chong Han, Meixia Tao

    Abstract: Terahertz (THz) integrated sensing and communication (ISAC) enables simultaneous data transmission with Terabit-per-second (Tbps) rate and millimeter-level accurate sensing. To realize such a blueprint, ultra-massive antenna arrays with directional beamforming are used to compensate for severe path loss in the THz band. In this paper, the time-frequency-space transmit design is investigated for TH…

    Submitted 25 March, 2024; v1 submitted 10 July, 2023; originally announced July 2023.

  32. arXiv:2306.17008  [pdf]

    eess.IV cs.CV

    MLA-BIN: Model-level Attention and Batch-instance Style Normalization for Domain Generalization of Federated Learning on Medical Image Segmentation

    Authors: Fubao Zhu, Yanhui Tian, Chuang Han, Yanting Li, Jiaofen Nan, Ni Yao, Weihua Zhou

    Abstract: The privacy protection mechanism of federated learning (FL) offers an effective solution for cross-center medical collaboration and data sharing. In multi-site medical image segmentation, each medical site serves as a client of FL, and its data naturally forms a domain. FL makes it possible to improve model performance on seen domains. However, there is a problem of domain generalizatio…

    Submitted 29 June, 2023; originally announced June 2023.

    Comments: 9 pages, 8 figures, 2 tables

  33. arXiv:2306.07691  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

    Authors: Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

    Abstract: In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, a…

    Submitted 19 November, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023

  34. arXiv:2306.06902  [pdf, other]

    eess.SP

    Transformer-based GAN for Terahertz Spatial-Temporal Channel Modeling and Generating

    Authors: Zhengdong Hu, Yuanbo Li, Chong Han

    Abstract: Terahertz (THz) communications are envisioned as a promising technology for 6G and beyond wireless systems, providing ultra-broad continuous bandwidth and thus Terabit-per-second (Tbps) data rates. However, as the foundation for designing THz communications, channel modeling and characterization are fundamental to scrutinize the potential of the new spectrum. Relying on time-consuming and costly physica…

    Submitted 12 June, 2023; originally announced June 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2301.00981

  35. arXiv:2305.11151  [pdf, other]

    cs.SD eess.AS

    Unsupervised Multi-channel Separation and Adaptation

    Authors: Cong Han, Kevin Wilson, Scott Wisdom, John R. Hershey

    Abstract: A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the multi-channel setting. We use MixIT to train a model on far-field microphone array recordings of overlapping reverberant and noisy speech from the AMI Corpus. Th…

    Submitted 22 March, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  36. arXiv:2305.08820  [pdf, other]

    eess.SP

    Scintillation and Attenuation Modelling of Atmospheric Turbulence for Terahertz UAV Channels

    Authors: Weijun Gao, Chong Han, Zhi Chen

    Abstract: Terahertz (THz) wireless communications have the potential to realize ultra-high-speed and secure data transfer with miniaturized devices for unmanned aerial vehicle (UAV) communications. Existing THz channel models for aerial scenarios assume a homogeneous medium along the line-of-sight propagation path. However, the atmospheric turbulence due to random airflow leads to temporal and spatial inhom…

    Submitted 15 May, 2023; originally announced May 2023.

  37. arXiv:2304.13439  [pdf, other]

    eess.AS

    All Information is Necessary: Integrating Speech Positive and Negative Information by Contrastive Learning for Speech Enhancement

    Authors: Xinmeng Xu, Weiping Tu, Chang Han, Yuhong Yang

    Abstract: Monaural speech enhancement (SE) is an ill-posed problem due to the irreversible degradation process. Recent methods to achieve SE tasks rely solely on positive information, e.g., ground-truth speech and speech-relevant features. Different from the above, we observe that negative information, such as the original speech mixture and speech-irrelevant features, is valuable to guide the SE model tra…

    Submitted 26 April, 2023; originally announced April 2023.

  38. arXiv:2303.07726  [pdf, other]

    cs.CL cs.LG eess.AS

    Good Neighbors Are All You Need for Chinese Grapheme-to-Phoneme Conversion

    Authors: Jungjun Kim, Changjin Han, Gyuhyeon Nam, Gyeongsu Chae

    Abstract: Most Chinese Grapheme-to-Phoneme (G2P) systems employ a three-stage framework that first transforms input sequences into character embeddings, obtains linguistic information using language models, and then predicts the phonemes based on global context about the entire input sequence. However, linguistic knowledge alone is often inadequate. Language models frequently encode overly general structure…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  39. arXiv:2303.07458  [pdf, other]

    eess.AS cs.SD

    Online Binaural Speech Separation of Moving Speakers With a Wavesplit Network

    Authors: Cong Han, Nima Mesgarani

    Abstract: Binaural speech separation in real-world scenarios often involves moving speakers. Most current speech separation methods use utterance-level permutation invariant training (u-PIT) for training. At inference time, however, the order of outputs can be inconsistent over time, particularly in long-form speech separation. This situation, which is referred to as the speaker swap problem, is even more prob…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: To appear in ICASSP 2023

  40. arXiv:2302.12662  [pdf, other]

    eess.IV cs.CV

    FedDBL: Communication and Data Efficient Federated Deep-Broad Learning for Histopathological Tissue Classification

    Authors: Tianpeng Deng, Yanqi Huang, Guoqiang Han, Zhenwei Shi, Jiatai Lin, Qi Dou, Zaiyi Liu, Xiao-jing Guo, C. L. Philip Chen, Chu Han

    Abstract: Histopathological tissue classification is a fundamental task in computational pathology. Deep learning-based models have achieved superior performance but centralized training with data centralization suffers from the privacy leakage problem. Federated learning (FL) can safeguard privacy by keeping training samples locally, but existing FL-based frameworks require a large number of well-annotated…

    Submitted 17 December, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

  41. arXiv:2302.10420  [pdf]

    cs.CV eess.IV

    HCGMNET: A Hierarchical Change Guiding Map Network For Change Detection

    Authors: Chengxi Han, Chen Wu, Bo Du

    Abstract: Very-high-resolution (VHR) remote sensing (RS) image change detection (CD) has been a challenging task for its very rich spatial information and sample imbalance problem. In this paper, we have proposed a hierarchical change guiding map network (HCGMNet) for change detection. The model uses hierarchical convolution operations to extract multiscale features, continuously merges multi-scale features…

    Submitted 13 March, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

  42. arXiv:2302.05756  [pdf, other]

    eess.AS cs.SD eess.SP

    Improved Decoding of Attentional Selection in Multi-Talker Environments with Self-Supervised Learned Speech Representation

    Authors: Cong Han, Vishal Choudhari, Yinghao Aaron Li, Nima Mesgarani

    Abstract: Auditory attention decoding (AAD) is a technique used to identify and amplify the talker that a listener is focused on in a noisy environment. This is done by comparing the listener's brainwaves to a representation of all the sound sources to find the closest match. The representation is typically the waveform or spectrogram of the sounds. The effectiveness of these representations for AAD is unce…

    Submitted 11 February, 2023; originally announced February 2023.

  43. arXiv:2301.12340  [pdf]

    eess.IV cs.CV

    Incremental Value and Interpretability of Radiomics Features of Both Lung and Epicardial Adipose Tissue for Detecting the Severity of COVID-19 Infection

    Authors: Ni Yao, Yanhui Tian, Daniel Gama das Neves, Chen Zhao, Claudio Tinoco Mesquita, Wolney de Andrade Martins, Alair Augusto Sarmet Moreira Damas dos Santos, Yanting Li, Chuang Han, Fubao Zhu, Neng Dai, Weihua Zhou

    Abstract: Epicardial adipose tissue (EAT) is known for its pro-inflammatory properties and association with Coronavirus Disease 2019 (COVID-19) severity. However, current EAT segmentation methods do not consider positional information. Additionally, the detection of COVID-19 severity lacks consideration for EAT radiomics features, which limits interpretability. This study investigates the use of radiomics f…

    Submitted 6 December, 2023; v1 submitted 28 January, 2023; originally announced January 2023.

    Comments: 20 pages, 7 figures

  44. arXiv:2301.08810  [pdf, other]

    cs.CL cs.SD eess.AS

    Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

    Authors: Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani

    Abstract: Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sub-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we pro…

    Submitted 20 January, 2023; originally announced January 2023.

  45. arXiv:2301.03035  [pdf, ps, other]

    cs.IT eess.SP

    Cross Far- and Near-field Wireless Communications in Terahertz Ultra-large Antenna Array Systems

    Authors: Chong Han, Yuhang Chen, Longfei Yan, Zhi Chen, Linglong Dai

    Abstract: The Terahertz (THz) band, with its abundant multi-ten-GHz bandwidth, is capable of supporting Terabit-per-second wireless communications, which is a pillar technology for 6G and beyond systems. With sub-millimeter-long antennas, ultra-massive (UM) MIMO and intelligent surface (IS) systems with thousands of array elements are exploited to effectively combat the distance limitation and blockage problems, w…

    Submitted 3 August, 2023; v1 submitted 8 January, 2023; originally announced January 2023.

  46. arXiv:2301.00981  [pdf, other]

    eess.SP

    Transfer Generative Adversarial Networks (T-GAN)-based Terahertz Channel Modeling

    Authors: Zhengdong Hu, Yuanbo Li, Chong Han

    Abstract: Terahertz (THz) communications are envisioned as a promising technology for 6G and beyond wireless systems, providing ultra-broad bandwidth and thus Terabit-per-second (Tbps) data rates. However, as the foundation for designing THz communications, channel modeling and characterization are fundamental to scrutinizing the potential of the new spectrum. Relying on physical measurements, traditional statistic…

    Submitted 3 January, 2023; originally announced January 2023.

  47. arXiv:2212.14227  [pdf, other]

    eess.AS cs.SD

    StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

    Authors: Yinghao Aaron Li, Cong Han, Nima Mesgarani

    Abstract: One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity and speech content, a task that remains challenging. Here, we propose a novel approach to learning disentangled speech representation by transfer learning f…

    Submitted 29 December, 2022; originally announced December 2022.

    Comments: SLT 2022

  48. arXiv:2212.11756  [pdf, ps, other]

    cs.IT eess.SP

    DSS-o-SAGE: Direction-Scan Sounding-Oriented SAGE Algorithm for Channel Parameter Estimation in mmWave and THz Bands

    Authors: Yuanbo Li, Chong Han, Yi Chen, Ziming Yu, Xuefeng Yin

    Abstract: Investigation of millimeter-wave (mmWave) and Terahertz (THz) channels relies on channel measurements and estimation of multi-path component (MPC) parameters. As a common measurement technique in the mmWave and THz bands, direction-scan sounding (DSS) resolves angular information and increases the measurable distance. Through mechanical rotation, the DSS creates a virtual multi-antenna sounding system,…

    Submitted 4 March, 2024; v1 submitted 28 November, 2022; originally announced December 2022.

    Comments: 15 pages, 10 figures, 3 tables

  49. arXiv:2211.11185  [pdf, other]

    cs.IT eess.SP

    Terahertz Channel Measurement and Analysis on a University Campus Street

    Authors: Yiqin Wang, Yuanbo Li, Yi Chen, Ziming Yu, Chong Han

    Abstract: With its abundant bandwidth resources, the Terahertz (0.1-10 THz) band is a promising spectrum to support sixth-generation (6G) and beyond communications. As the foundation of channel study in the spectrum, channel measurements are ongoing to cover representative 6G communication scenarios and promising THz frequency bands. In this paper, a wideband channel measurement in an L-shaped university camp…

    Submitted 21 November, 2022; originally announced November 2022.

    Comments: 6 pages, 15 figures

  50. arXiv:2211.11180  [pdf, other]

    cs.IT eess.SP

    300 GHz Wideband Channel Measurement and Analysis in a Lobby

    Authors: Yiqin Wang, Yuanbo Li, Yi Chen, Ziming Yu, Chong Han

    Abstract: The Terahertz (0.1-10 THz) band has been envisioned as one of the promising spectrum bands to support ultra-broadband sixth-generation (6G) and beyond communications. In this paper, a wideband channel measurement campaign in a 500-square-meter indoor lobby at 306-321 GHz is presented. The measurement system consists of a vector network analyzer (VNA)-based channel sounder, and a directional anten…

    Submitted 23 May, 2023; v1 submitted 20 November, 2022; originally announced November 2022.

    Comments: 6 pages, 6 figures