Showing 1–50 of 134 results for author: Lu, H

Searching in archive eess.
  1. arXiv:2408.09844  [pdf, ps, other]

    eess.SP

    Joint Beamforming and Power Control for D2D-Assisted Integrated Sensing and Communication Networks

    Authors: Zhenyu Xue, Yuang Chen, Hancheng Lu, Baolin Chong, Wanqing Long

    Abstract: Integrated sensing and communication (ISAC) is an emerging technology in next-generation communication networks. However, the communication performance of the ISAC system may be severely affected by interference from the radar system if the sensing task has demanding performance requirements. In this paper, we exploit device-to-device communication (D2D) to improve system communication capacity. T…

    Submitted 19 August, 2024; originally announced August 2024.

  2. Resonant Beam Enabled DoA Estimation in Passive Positioning System

    Authors: Yixuan Guo, Qingwei Jiang, Mengyuan Xu, Wen Fang, Qingwen Liu, Gang Yan, Qunhui Yang, Hai Lu

    Abstract: The rapid advancement of the next generation of communications and internet of things (IoT) technologies has made the provision of location-based services for diverse devices an increasingly pressing necessity. Localizing devices with/without intelligent computing abilities, including both active and passive devices is essential, especially in indoor scenarios. For traditional RF positioning syste…

    Submitted 7 August, 2024; originally announced August 2024.

  3. arXiv:2408.02582  [pdf, other]

    cs.SD cs.AI eess.AS

    Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition

    Authors: Jaeyoung Kim, Han Lu, Soheil Khorram, Anshuman Tripathi, Qian Zhang, Hasim Sak

    Abstract: Modern automatic speech recognition (ASR) systems are typically trained on more than tens of thousands hours of speech data, which is one of the main factors for their great success. However, the distribution of such data is typically biased towards common accents or typical speech patterns. As a result, those systems often poorly perform on atypical accented speech. In this paper, we present acce…

    Submitted 5 August, 2024; originally announced August 2024.

  4. arXiv:2407.13306  [pdf, ps, other]

    cs.IT eess.SP

    Group Movable Antenna With Flexible Sparsity: Joint Array Position and Sparsity Optimization

    Authors: Haiquan Lu, Yong Zeng, Shi Jin, Rui Zhang

    Abstract: Movable antenna (MA) is a promising technology to exploit the spatial variation of wireless channel for performance enhancement, by dynamically varying the antenna position within a certain region. However, for multi-antenna communication systems, moving each antenna independently not only requires prohibitive complexity to find the optimal antenna positions, but also incurs sophisticated movement…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 5 pages, 5 figures

  5. arXiv:2407.11031  [pdf, other]

    cs.LG eess.SP

    Purification Of Contaminated Convolutional Neural Networks Via Robust Recovery: An Approach with Theoretical Guarantee in One-Hidden-Layer Case

    Authors: Hanxiao Lu, Zeyu Huang, Ren Wang

    Abstract: Convolutional neural networks (CNNs), one of the key architectures of deep learning models, have achieved superior performance on many machine learning tasks such as image classification, video recognition, and power systems. Despite their success, CNNs can be easily contaminated by natural noises and artificially injected noises such as backdoor attacks. In this paper, we propose a robust recover…

    Submitted 3 July, 2024; originally announced July 2024.

  6. arXiv:2407.05407  [pdf, other]

    cs.SD cs.AI eess.AS

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

    Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role…

    Submitted 9 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

  7. arXiv:2407.04051  [pdf, other]

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp…

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  8. arXiv:2407.03169  [pdf, other]

    cs.CL cs.SD eess.AS

    Investigating Decoder-only Large Language Models for Speech-to-text Translation

    Authors: Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri

    Abstract: Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs to the task of speech-to-text translation (S2TT). We propose a decoder-only architecture that enables the LLM to directly consume the encoded sp…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted to Interspeech 2024

  9. arXiv:2406.14186  [pdf, other]

    eess.IV cs.CV

    CriDiff: Criss-cross Injection Diffusion Framework via Generative Pre-train for Prostate Segmentation

    Authors: Tingwei Liu, Miao Zhang, Leiye Liu, Jialong Zhong, Shuyao Wang, Yongri Piao, Huchuan Lu

    Abstract: Recently, the Diffusion Probabilistic Model (DPM)-based methods have achieved substantial success in the field of medical image segmentation. However, most of these methods fail to enable the diffusion model to learn edge features and non-edge features effectively and to inject them efficiently into the diffusion backbone. Additionally, the domain gap between the images features and the diffusion…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Accepted in MICCAI 2024

  10. arXiv:2406.02940  [pdf, other]

    cs.SD eess.AS

    Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

    Authors: Haohan Guo, Fenglong Xie, Dongchao Yang, Hui Lu, Xixin Wu, Helen Meng

    Abstract: VQ-VAE, as a mainstream approach of speech tokenizer, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into c…

    Submitted 5 June, 2024; originally announced June 2024.

  11. arXiv:2405.09353  [pdf, other]

    eess.IV cs.CV

    Large coordinate kernel attention network for lightweight image super-resolution

    Authors: Fangwei Hao, Jiesheng Wu, Haotian Lu, Ji Du, Jing Xu, Xiaoxuan Xu

    Abstract: The multi-scale receptive field and large kernel attention (LKA) module have been shown to significantly improve performance in the lightweight image super-resolution task. However, existing lightweight super-resolution (SR) methods seldom pay attention to designing efficient building block with multi-scale receptive field for local modeling, and their LKA modules face a quadratic increase in comp…

    Submitted 30 August, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

    Comments: 13 pages

  12. arXiv:2405.05336  [pdf, other]

    eess.IV cs.AI cs.CV

    Joint semi-supervised and contrastive learning enables zero-shot domain-adaptation and multi-domain segmentation

    Authors: Alvaro Gomariz, Yusuke Kikuchi, Yun Yvonna Li, Thomas Albrecht, Andreas Maunz, Daniela Ferrara, Huanxiang Lu, Orcun Goksel

    Abstract: Despite their effectiveness, current deep learning models face challenges with images coming from different domains with varying appearance and content. We introduce SegCLR, a versatile framework designed to segment volumetric images across different domains, employing supervised and contrastive learning simultaneously to effectively learn from both labeled and unlabeled data. We demonstrate the s…

    Submitted 8 May, 2024; originally announced May 2024.

  13. arXiv:2405.02151  [pdf, other]

    cs.SD cs.AI eess.AS

    GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition

    Authors: Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao

    Abstract: The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, inadequately capturing the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer le…

    Submitted 16 June, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  14. arXiv:2405.00316  [pdf, other]

    cs.RO eess.SY

    Enhance Planning with Physics-informed Safety Controller for End-to-end Autonomous Driving

    Authors: Hang Zhou, Haichao Liu, Hongliang Lu, Dan Xu, Jun Ma, Yiding Ji

    Abstract: Recent years have seen a growing research interest in applications of Deep Neural Networks (DNN) on autonomous vehicle technology. The trend started with perception and prediction a few years ago and it is gradually being applied to motion planning tasks. Despite the performance of networks improve over time, DNN planners inherit the natural drawbacks of Deep Learning. Learning-based planners have…

    Submitted 5 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

  15. arXiv:2404.06265  [pdf, other]

    cs.CV eess.IV

    Spatial-Temporal Multi-level Association for Video Object Segmentation

    Authors: Deshui Miao, Xin Li, Zhenyu He, Huchuan Lu, Ming-Hsuan Yang

    Abstract: Existing semi-supervised video object segmentation methods either focus on temporal feature matching or spatial-temporal feature modeling. However, they do not address the issues of sufficient target interaction and efficient parallel processing simultaneously, thereby constraining the learning of dynamic, target-aware features. To tackle these limitations, this paper proposes a spatial-temporal m…

    Submitted 9 April, 2024; originally announced April 2024.

  16. arXiv:2404.00327   

    eess.IV cs.CV cs.LG

    YNetr: Dual-Encoder architecture on Plain Scan Liver Tumors (PSLT)

    Authors: Wen Sheng, Zhong Zheng, Jiajun Liu, Han Lu, Hanyuan Zhang, Zhengyong Jiang, Zhihong Zhang, Daoping Zhu

    Abstract: Background: Liver tumors are abnormal growths in the liver that can be either benign or malignant, with liver cancer being a significant health concern worldwide. However, there is no dataset for plain scan segmentation of liver tumors, nor any related algorithms. To fill this gap, we propose Plain Scan Liver Tumors(PSLT) and YNetr. Methods: A collection of 40 liver tumor plain scan segmentation d…

    Submitted 4 July, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: My academic research interests have undergone significant changes. I believe that continuing to retain the paper is no longer in line with my academic development path, and may also mislead readers. And some of the content may involve the boundaries of personal privacy. To respect and protect the privacy of relevant personnel, I decided to withdraw it to avoid any unnecessary controversy or harm

  17. arXiv:2403.12408  [pdf, other]

    cs.CL cs.SD eess.AS

    MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

    Authors: Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong

    Abstract: There have been emerging research interest and advances in speech-to-speech translation (S2ST), translating utterances from one language to another. This work proposes Multitask Speech Language Model (MSLM), which is a decoder-only speech language model trained in a multitask setting. Without reliance on text training data, our model is able to support multilingual S2ST with speaker style preserve…

    Submitted 18 March, 2024; originally announced March 2024.

  18. arXiv:2403.12402  [pdf, other]

    cs.CL cs.SD eess.AS

    An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis

    Authors: Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong

    Abstract: Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning. A typical speech LM takes discrete semantic units as content and a short utterance as prompt, and synthesizes speech which preserves the content's semantics but mimics the prompt's style. However, there is no systematic understanding on how the synthesized audio is controlled by the prompt and…

    Submitted 18 March, 2024; originally announced March 2024.

  19. arXiv:2402.16027  [pdf, other]

    cs.IT eess.SP

    Enhancing xURLLC with RSMA-Assisted Massive-MIMO Networks: Performance Analysis and Optimization

    Authors: Yuang Chen, Hancheng Lu, Chenwu Zhang, Yansha Deng, Arumugam Nallanathan

    Abstract: Massive interconnection has sparked people's envisioning for next-generation ultra-reliable and low-latency communications (xURLLC), prompting the design of customized next-generation advanced transceivers (NGAT). Rate-splitting multiple access (RSMA) has emerged as a pivotal technology for NGAT design, given its robustness to imperfect channel state information (CSI) and resilience to quality of…

    Submitted 25 February, 2024; originally announced February 2024.

    Comments: 14 pages, 11 figures, Submitted to IEEE for potential publication

  20. Linear Periodically Time-Variant Digital PLL Phase Noise Modeling Using Conversion Matrices and Uncorrelated Upsampling

    Authors: Hongyu Lu, Patrick P. Mercier

    Abstract: This paper introduces a conversion matrix method for linear periodically time-variant (LPTV) digital phase-locked loop (DPLL) phase noise modeling that offers precise and computationally efficient results to enable rapid design iteration and optimization. Unlike many previous studies, which either assume linear time-invariance (LTI) and therefore overlook phase noise aliasing effects, or solve LPT…

    Submitted 27 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: 13 pages, 24 figures

  21. arXiv:2401.13051  [pdf, other]

    cs.CV eess.IV

    PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation

    Authors: Zhaozhi Xie, Bochen Guan, Weihao Jiang, Muyang Yi, Yue Ding, Hongtao Lu, Lei Zhang

    Abstract: The Segment Anything Model (SAM) has exhibited outstanding performance in various image segmentation tasks. Despite being trained with over a billion masks, SAM faces challenges in mask prediction quality in numerous scenarios, especially in real-world contexts. In this paper, we introduce a novel prompt-driven adapter into SAM, namely Prompt Adapter Segment Anything Model (PA-SAM), aiming to enha…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: Code is available at https://github.com/xzz2/pa-sam

  22. arXiv:2401.08935  [pdf, other]

    eess.SP

    Privacy Protected Contactless Cardio-respiratory Monitoring using Defocused Cameras during Sleep

    Authors: Yingen Zhu, Jia Huang, Hongzhou Lu, Wenjin Wang

    Abstract: The monitoring of vital signs such as heart rate (HR) and respiratory rate (RR) during sleep is important for the assessment of sleep quality and detection of sleep disorders. Camera-based HR and RR monitoring gained popularity in sleep monitoring in recent years. However, they are all facing with serious privacy issues when using a video camera in the sleeping scenario. In this paper, we propose…

    Submitted 16 January, 2024; originally announced January 2024.

  23. arXiv:2312.01499  [pdf, other]

    eess.SY cs.DC eess.SP

    Towards Decentralized Task Offloading and Resource Allocation in User-Centric Mobile Edge Computing

    Authors: Langtian Qin, Hancheng Lu, Yuang Chen, Baolin Chong, Feng Wu

    Abstract: In the traditional cellular-based mobile edge computing (MEC), users at the edge of the cell are prone to suffer severe inter-cell interference and signal attenuation, leading to low throughput even transmission interruptions. Such edge effect severely obstructs offloading of tasks to MEC servers. To address this issue, we propose user-centric mobile edge computing (UCMEC), a novel MEC architectur…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: 16 pages, 13 figures

  24. arXiv:2311.08153  [pdf, other]

    eess.SY cs.AI

    When Mining Electric Locomotives Meet Reinforcement Learning

    Authors: Ying Li, Zhencai Zhu, Xiaoqiang Li, Chunyu Yang, Hao Lu

    Abstract: As the most important auxiliary transportation equipment in coal mines, mining electric locomotives are mostly operated manually at present. However, due to the complex and ever-changing coal mine environment, electric locomotive safety accidents occur frequently these years. A mining electric locomotive control method that can adapt to different complex mining environments is needed. Reinforcemen…

    Submitted 14 November, 2023; originally announced November 2023.

  25. arXiv:2310.11044  [pdf, ps, other]

    cs.IT eess.SP

    A Tutorial on Near-Field XL-MIMO Communications Towards 6G

    Authors: Haiquan Lu, Yong Zeng, Changsheng You, Yu Han, Jiayi Zhang, Zhe Wang, Zhenjun Dong, Shi Jin, Cheng-Xiang Wang, Tao Jiang, Xiaohu You, Rui Zhang

    Abstract: Extremely large-scale multiple-input multiple-output (XL-MIMO) is a promising technology for the sixth-generation (6G) mobile communication networks. By significantly boosting the antenna number or size to at least an order of magnitude beyond current massive MIMO systems, XL-MIMO is expected to unprecedentedly enhance the spectral efficiency and spatial resolution for wireless communication. The…

    Submitted 3 April, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: 42 pages

  26. arXiv:2310.07246  [pdf, other]

    cs.SD eess.AS

    Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

    Authors: Xinfa Zhu, Yuanjun Lv, Yi Lei, Tao Li, Wendi He, Hongbin Zhou, Heng Lu, Lei Xie

    Abstract: Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating e…

    Submitted 12 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

    Comments: 15 pages, 2 figures

  27. arXiv:2310.05051  [pdf, other]

    cs.SD eess.AS

    SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation

    Authors: Yuanjun Lv, Jixun Yao, Peikun Chen, Hongbin Zhou, Heng Lu, Lei Xie

    Abstract: Speaker anonymization aims to conceal a speaker's identity without degrading speech quality and intelligibility. Most speaker anonymization systems disentangle the speaker representation from the original speech and achieve anonymization by averaging or modifying the speaker representation. However, the anonymized speech is subject to reduction in pseudo speaker distinctiveness, speech quality and…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: 8 pages, 3 figures; Accepted by ASRU2023

  28. arXiv:2310.03275  [pdf, other]

    eess.SY

    Power Optimization in Multi-IRS Aided Delay-Constrained IoVT Systems

    Authors: Baolin Chong, Hancheng Lu, Langtian Qin, Chenwu Zhang, Jiasen Li, Chang Wen Chen

    Abstract: With the advancement of video sensors in the Internet of Things, Internet of Video Things (IoVT) systems, capable of delivering abundant and diverse information, have been increasingly deployed for various applications. However, the extensive transmission of video data in IoVT poses challenges in terms of delay and power consumption. Intelligent reconfigurable surface (IRS), as an emerging technol…

    Submitted 24 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

  29. arXiv:2310.03268  [pdf, other]

    cs.IT eess.SY

    On the Distribution of SINR for Cell-Free Massive MIMO Systems

    Authors: Baolin Chong, Fengqian Guo, Hancheng Lu, Langtian Qin

    Abstract: Cell-free (CF) massive multiple-input multiple-output (mMIMO) has been considered as a potential technology for Beyond 5G communication systems. However, the performance of CF mMIMO systems has not been well studied. Most existing analytical work on CF mMIMO systems is based on the expected signal-to-interference-plus-noise ratio (SINR). The statistical characteristics of the SINR, which is critic…

    Submitted 4 October, 2023; originally announced October 2023.

  30. arXiv:2309.16247  [pdf, other]

    eess.AS cs.SD

    PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

    Authors: Xiang Lyu, Yuhang Cao, Qing Wang, Jingjing Yin, Yuguang Yang, Pengpeng Zou, Yanni Hu, Heng Lu

    Abstract: Speaker-attributed automatic speech recognition (SA-ASR) improves the accuracy and applicability of multi-speaker ASR systems in real-world scenarios by assigning speaker labels to transcribed texts. However, SA-ASR poses unique challenges due to factors such as speaker overlap, speaker variability, background noise, and reverberation. In this study, we propose PP-MeT system, a real-world personal…

    Submitted 28 September, 2023; originally announced September 2023.

  31. arXiv:2309.09262  [pdf, other]

    eess.AS cs.SD

    PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

    Authors: Jixun Yao, Yuguang Yang, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, Jingjing Yin, Hongbin Zhou, Heng Lu, Lei Xie

    Abstract: Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, the current style voice conversion approach relies on pre-defined labels or reference speech to control the conversion process, which leads to limitations in style diversity or falls short in terms of the intuitive and interpretability of style representation…

    Submitted 26 December, 2023; v1 submitted 17 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

  32. arXiv:2309.08377  [pdf, other]

    eess.AS cs.CL cs.SD

    DiaCorrect: Error Correction Back-end For Speaker Diarization

    Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Diez, Lukas Burget, Yuhang Cao, Heng Lu, Jan Cernocky

    Abstract: In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initia…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  33. arXiv:2309.08023  [pdf, other]

    eess.AS cs.LG cs.SD

    USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models

    Authors: Guanlong Zhao, Yongqiang Wang, Jason Pelecanos, Yu Zhang, Hank Liao, Yiling Huang, Han Lu, Quan Wang

    Abstract: We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages. This model is adapted from a speech foundation model trained on a large quantity of supervised and unsupervised data, demonstrating the utility of fine-tuning from a large generic foundation model for a downstream task. We analyze the performance of th…

    Submitted 6 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: 5 pages, 2 figures, 4 tables

  34. arXiv:2309.00792  [pdf, ps, other]

    cs.IT eess.SP

    Delay-Doppler Alignment Modulation for Spatially Sparse Massive MIMO Communication

    Authors: Haiquan Lu, Yong Zeng

    Abstract: Delay alignment modulation (DAM) is an emerging technique for achieving inter-symbol interference (ISI)-free wideband communications using spatial-delay processing, without relying on channel equalization or multi-carrier transmission. However, existing works on DAM only consider multiple-input single-output (MISO) communication systems and assume time-invariant channels. In this paper, by extendi…

    Submitted 1 September, 2023; originally announced September 2023.

    Comments: 15 pages, 12 figures

  35. arXiv:2309.00391  [pdf, other]

    cs.IT eess.SP

    Achievable Rate Region and Path-Based Beamforming for Multi-User Single-Carrier Delay Alignment Modulation

    Authors: Xingwei Wang, Haiquan Lu, Yong Zeng, Xiaoli Xu, Jie Xu

    Abstract: Delay alignment modulation (DAM) is a novel wideband transmission technique for mmWave massive MIMO systems, which exploits the high spatial resolution and multi-path sparsity to mitigate ISI, without relying on channel equalization or multi-carrier transmission. In particular, DAM leverages the delay pre-compensation and path-based beamforming to effectively align the multi-path components, thus…

    Submitted 1 September, 2023; originally announced September 2023.

    Comments: 13 pages, 5 figures

  36. arXiv:2308.13908  [pdf, other]

    eess.SP

    Sparse Recovery with Attention: A Hybrid Data/Model Driven Solution for High Accuracy Position and Channel Tracking at mmWave

    Authors: Yun Chen, Nuria González-Prelcic, Takayuki Shimizu, Hongshen Lu, Chinmay Mahabal

    Abstract: In this paper, we propose first a mmWave channel tracking algorithm based on multidimensional orthogonal matching pursuit algorithm (MOMP) using reduced sparsifying dictionaries, which exploits information from channel estimates in previous frames. Then, we present an algorithm to obtain the vehicle's initial location for the current frame by solving a system of geometric equations that leverage t…

    Submitted 26 August, 2023; originally announced August 2023.

  37. arXiv:2308.04025  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition

    Authors: Yu Pan, Yuguang Yang, Yuheng Huang, Jixun Yao, Jingjing Yin, Yanni Hu, Heng Lu, Lei Ma, Jianjun Zhao

    Abstract: Despite notable progress, speech emotion recognition (SER) remains challenging due to the intricate and ambiguous nature of speech emotion, particularly in wild world. While current studies primarily focus on recognition and generalization abilities, our research pioneers an investigation into the reliability of SER methods in the presence of semantic data shifts and explores how to exert fine-gra…

    Submitted 22 March, 2024; v1 submitted 7 August, 2023; originally announced August 2023.

    Comments: 12 pages

  38. arXiv:2308.00507  [pdf, other]

    eess.IV cs.CV cs.LG

    Improved Prognostic Prediction of Pancreatic Cancer Using Multi-Phase CT by Integrating Neural Distance and Texture-Aware Transformer

    Authors: Hexin Dong, Jiawen Yao, Yuxing Tang, Mingze Yuan, Yingda Xia, Jian Zhou, Hong Lu, Jingren Zhou, Bin Dong, Le Lu, Li Zhang, Zaiyi Liu, Yu Shi, Ling Zhang

    Abstract: Pancreatic ductal adenocarcinoma (PDAC) is a highly lethal cancer in which the tumor-vascular involvement greatly affects the resectability and, thus, overall survival of patients. However, current prognostic prediction methods fail to explicitly and accurately investigate relationships between the tumor and nearby important vessels. This paper proposes a novel learnable neural distance that descr…

    Submitted 13 September, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

    Comments: MICCAI 2023

  39. arXiv:2307.15951  [pdf, other]

    eess.AS

    METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer

    Authors: Xinfa Zhu, Yi Lei, Tao Li, Yongmao Zhang, Hongbin Zhou, Heng Lu, Lei Xie

    Abstract: Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored synthesizing emotional aspects of speech due to the challenges of cross-speaker cross-lingual emotion transfer - the heavy entanglement of speaker timbre, emotion, and language factors in the speech…

    Submitted 29 July, 2023; originally announced July 2023.

    Comments: 10 pages, 3 figures

  40. arXiv:2307.00167  [pdf, other]

    eess.SP

    Learning to Localize with Attention: from sparse mmWave channel estimates from a single BS to high accuracy 3D location

    Authors: Yun Chen, Nuria González-Prelcic, Takayuki Shimizu, Hongsheng Lu

    Abstract: One strategy to obtain user location information in a wireless network operating at millimeter wave (mmWave) is based on the exploitation of the geometric relationships between the channel parameters and the user position. These relationships can be easily built from the LoS path and/or first order reflections, but high resolution channel estimates are required for high accuracy. In this paper, we…

    Submitted 30 June, 2023; originally announced July 2023.

    Comments: Journal

  41. arXiv:2306.07848  [pdf, other]

    cs.CL cs.MM cs.SD eess.AS

    GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition

    Authors: Yu Pan, Yanni Hu, Yuguang Yang, Wen Fei, Jixun Yao, Heng Lu, Lei Ma, Jianjun Zhao

    Abstract: Contrastive cross-modality pretraining has recently exhibited impressive success in diverse fields, whereas there is limited research on their merits in speech emotion recognition (SER). In this paper, we propose GEmo-CLAP, a kind of gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for SER. Specifically, we first construct an effective emotion CLAP (Emo-CLAP) for SER,…

    Submitted 4 December, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: 5 pages

  42. arXiv:2306.02107  [pdf, other]

    cs.IT eess.SY

    Achievable Sum Rate Optimization on NOMA-aided Cell-Free Massive MIMO with Finite Blocklength Coding

    Authors: Baolin Chong, Hancheng Lu, Yuang Chen, Langtian Qin, Fengqian Guo

    Abstract: Non-orthogonal multiple access (NOMA)-aided cell-free massive multiple-input multiple-output (CFmMIMO) has been considered as a promising technology to fulfill strict quality of service requirements for ultra-reliable low-latency communications (URLLC). However, finite blocklength coding (FBC) in URLLC makes it challenging to achieve the optimal performance in the NOMA-aided CFmMIMO system. In thi…

    Submitted 25 March, 2024; v1 submitted 3 June, 2023; originally announced June 2023.

  43. arXiv:2305.17860  [pdf, other]

    cs.SD eess.AS

    Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition

    Authors: Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang, Xiaobao Wang, Shiliang Zhang

    Abstract: In recent years, the joint training of speech enhancement front-end and automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods only use enhanced speech as input for the backend. However, it is difficult for speech enhancement systems to directly separate speech from input due to the diverse types of noise with d…

    Submitted 30 May, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

  44. arXiv:2305.10651  [pdf, other]

    eess.IV

    Accelerated MR Fingerprinting with Low-Rank and Generative Subspace Modeling

    Authors: Hengfa Lu, Huihui Ye, Lawrence L. Wald, Bo Zhao

    Abstract: Magnetic Resonance (MR) Fingerprinting is an emerging multi-parametric quantitative MR imaging technique, for which image reconstruction methods utilizing low-rank and subspace constraints have achieved state-of-the-art performance. However, this class of methods often suffers from an ill-conditioned model-fitting issue, which degrades the performance as the data acquisition lengths become short a…

    Submitted 24 May, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

  45. arXiv:2305.07935  [pdf, other]

    cs.IT eess.IV

    Streaming 360-degree VR Video with Statistical QoS Provisioning in mmWave Networks from Delay and Rate Perspectives

    Authors: Yuang Chen, Hancheng Lu, Langtian Qin, Chang Wu, Chang Wen Chen

    Abstract: Millimeter-wave (mmWave) technology has emerged as a promising enabler for unleashing the full potential of 360-degree virtual reality (VR). However, the explosive growth of VR services, coupled with the reliability issues of mmWave communications, poses enormous challenges in terms of wireless resource and quality-of-service (QoS) provisioning for mmWave-enabled 360-degree VR. In this paper, we pr…

    Submitted 13 May, 2023; originally announced May 2023.

    Comments: 31 pages, 8 figures

  46. arXiv:2304.12184  [pdf, other]

    eess.SP cs.AI cs.IT cs.LG

    Active RIS-aided EH-NOMA Networks: A Deep Reinforcement Learning Approach

    Authors: Zhaoyuan Shi, Huabing Lu, Xianzhong Xie, Helin Yang, Chongwen Huang, Jun Cai, Zhiguo Ding

    Abstract: An active reconfigurable intelligent surface (RIS)-aided multi-user downlink communication system is investigated, where non-orthogonal multiple access (NOMA) is employed to improve spectral efficiency, and the active RIS is powered by energy harvesting (EH). The problem of joint control of the RIS's amplification matrix and phase shift matrix is formulated to maximize the communication success ra…

    Submitted 11 April, 2023; originally announced April 2023.

  47. arXiv:2303.08636  [pdf, other]

    eess.AS cs.SD

    HYBRIDFORMER: improving SqueezeFormer with hybrid attention and NSR mechanism

    Authors: Yuguang Yang, Yu Pan, Jingjing Yin, Jiangyu Han, Lei Ma, Heng Lu

    Abstract: SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax-attention (SA). In addition, limited by the large convolution kernel size, the local modeling ability of SqueezeFormer is insufficient. In this paper, we propose a novel method HybridFormer to improve SqueezeFormer in a fast an…

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP2023

  48. arXiv:2303.07626  [pdf, other]

    cs.SD cs.MM eess.AS

    CAT: Causal Audio Transformer for Audio Classification

    Authors: Xiaoyu Liu, Hanlin Lu, Jianbo Yuan, Xinyu Li

    Abstract: The attention-based Transformers have been increasingly applied to audio classification because of their global receptive field and ability to handle long-term dependency. However, the existing frameworks which are mainly extended from the Vision Transformers are not perfectly compatible with audio signals. In this paper, we introduce a Causal Audio Transformer (CAT) consisting of a Multi-Resoluti…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  49. arXiv:2302.10558  [pdf, other]

    eess.SP

    Joint Optimization of Base Station Clustering and Service Caching in User-Centric MEC

    Authors: Langtian Qin, Hancheng Lu, Yao Lu, Chenwu Zhang, Feng Wu

    Abstract: Edge service caching can effectively reduce the delay or bandwidth overhead for acquiring and initializing applications. To address single-base station (BS) transmission limitation and serious edge effect in traditional cellular-based edge service caching networks, in this paper, we propose a novel user-centric edge service caching framework where each user is jointly provided with edge caching a…

    Submitted 21 February, 2023; originally announced February 2023.

  50. arXiv:2302.10515  [pdf, other]

    eess.SP cs.DC cs.PF

    Energy-Efficient Blockchain-enabled User-Centric Mobile Edge Computing

    Authors: Langtian Qin, Hancheng Lu, Yuang Chen, Zhuojia Gu, Dan Zhao, Feng Wu

    Abstract: In the traditional mobile edge computing (MEC) system, the availability of MEC services is greatly limited for the edge users of the cell due to serious signal attenuation and inter-cell interference. User-centric MEC (UC-MEC) can be seen as a promising solution to address this issue. In UC-MEC, each user is served by a dedicated access point (AP) cluster enabled with MEC capability instead of a s…

    Submitted 21 February, 2023; originally announced February 2023.