Showing 1–50 of 704 results for author: Chen, Z

Searching in archive eess.
  1. arXiv:2408.15947  [pdf, other]

    eess.IV cs.CV

    Auxiliary Input in Training: Incorporating Catheter Features into Deep Learning Models for ECG-Free Dynamic Coronary Roadmapping

    Authors: Yikang Liu, Lin Zhao, Eric Z. Chen, Xiao Chen, Terrence Chen, Shanhui Sun

    Abstract: Dynamic coronary roadmapping is a technology that overlays the vessel maps (the "roadmap") extracted from an offline image sequence of X-ray angiography onto a live stream of X-ray fluoroscopy in real-time. It aims to offer navigational guidance for interventional surgeries without the need for repeated contrast agent injections, thereby reducing the risks associated with radiation exposure and ki…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: MICCAI 2024

  2. arXiv:2408.15668  [pdf, ps, other]

    cs.IT eess.SP

    Movable Antennas Meet Intelligent Reflecting Surface: When Do We Need Movable Antennas?

    Authors: Xin Wei, Weidong Mei, Qingqing Wu, Boyu Ning, Zhi Chen

    Abstract: Intelligent reflecting surface (IRS) and movable antenna (MA)/fluid antenna (FA) techniques have both received increasing attention in the realm of wireless communications due to their ability to reconfigure and improve wireless channel conditions. In this paper, we investigate the integration of MAs/FAs into an IRS-assisted wireless communication system. In particular, we consider the downlink tr…

    Submitted 29 August, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

    Comments: 6 pages, 6 figures, submitted to IEEE WCNC 2025

  3. arXiv:2408.15585  [pdf, other]

    cs.SD eess.AS

    Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models

    Authors: Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: In this paper, Whisper, a large-scale pre-trained model for automatic speech recognition, is proposed to apply to speaker verification. A partial multi-scale feature aggregation (PMFA) approach is proposed based on a subset of Whisper encoder blocks to derive highly discriminative speaker embeddings. Experimental results demonstrate that using the middle to later blocks of the Whisper encoder keeps…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: Accepted by Interspeech 2024

  4. arXiv:2408.15508  [pdf, other]

    cs.SD cs.AI eess.AS

    EmoAttack: Utilizing Emotional Voice Conversion for Speech Backdoor Attacks on Deep Speech Classification Models

    Authors: Wenhan Yao, Zedong Xing, Xiarun Chen, Jia Liu, Yongqiang He, Weiping Wen

    Abstract: Deep speech classification tasks, mainly including keyword spotting and speaker verification, play a crucial role in speech-based human-computer interaction. Recently, the security of these technologies has been demonstrated to be vulnerable to backdoor attacks. Specifically speaking, speech samples are attacked by noisy disruption and component modification in present triggers. We suggest that sp…

    Submitted 27 August, 2024; originally announced August 2024.

    Comments: Submitted to ICASSP 2025

  5. arXiv:2408.13733  [pdf, other]

    eess.IV cs.CV

    Anatomical Consistency Distillation and Inconsistency Synthesis for Brain Tumor Segmentation with Missing Modalities

    Authors: Zheyu Zhang, Xinzhao Liu, Zheng Chen, Yueyi Zhang, Huanjing Yue, Yunwei Ou, Xiaoyan Sun

    Abstract: Multi-modal Magnetic Resonance Imaging (MRI) is imperative for accurate brain tumor segmentation, offering indispensable complementary information. Nonetheless, the absence of modalities poses significant challenges in achieving precise segmentation. Recognizing the shared anatomical structures between mono-modal and multi-modal representations, it is noteworthy that mono-modal images typically ex…

    Submitted 25 August, 2024; originally announced August 2024.

    Comments: Accepted Paper to European Conference on Artificial Intelligence (ECAI 2024)

  6. arXiv:2408.11982  [pdf, other]

    eess.IV cs.CV cs.MM

    AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results

    Authors: Maksim Smirnov, Aleksandr Gushchin, Anastasia Antsiferova, Dmitry Vatolin, Radu Timofte, Ziheng Jia, Zicheng Zhang, Wei Sun, Jiaying Qian, Yuqin Cao, Yinan Sun, Yuxin Zhu, Xiongkuo Min, Guangtao Zhai, Kanjar De, Qing Luo, Ao-Xiang Zhang, Peng Zhang, Haibo Lei, Linyan Jiang, Yaqing Li, Wenhui Meng, Xiaoheng Tan, Haiqiang Wang, Xiaozhong Xu, et al. (11 additional authors not shown)

    Abstract: Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dat…

    Submitted 28 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

  7. arXiv:2408.11787  [pdf, other]

    eess.IV cs.CV

    NuSegDG: Integration of Heterogeneous Space and Gaussian Kernel for Domain-Generalized Nuclei Segmentation

    Authors: Zhenye Lou, Qing Xu, Zekun Jiang, Xiangjian He, Zhen Chen, Yi Wang, Chenxin Li, Maggie M. He, Wenting Duan

    Abstract: Domain-generalized nuclei segmentation refers to the generalizability of models to unseen domains based on knowledge learned from source domains and is challenged by various image conditions, cell types, and stain strategies. Recently, the Segment Anything Model (SAM) has made great success in universal image segmentation by interactive prompt modes (e.g., point and box). Despite its strengths, th…

    Submitted 24 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: Under Review

  8. arXiv:2408.09851  [pdf, other]

    cs.NI eess.SY

    ISAC-Fi: Enabling Full-fledged Monostatic Sensing over Wi-Fi Communication

    Authors: Zhe Chen, Chao Hu, Tianyue Zheng, Hangcheng Cao, Yanbing Yang, Yen Chu, Hongbo Jiang, Jun Luo

    Abstract: Whereas Wi-Fi communications have been exploited for sensing purpose for over a decade, the bistatic or multistatic nature of Wi-Fi still poses multiple challenges, hampering real-life deployment of integrated sensing and communication (ISAC) within Wi-Fi framework. In this paper, we aim to re-design WiFi so that monostatic sensing (mimicking radar) can be achieved over the multistatic communicati…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: 14 pages, 22 figures

  9. arXiv:2408.09151  [pdf, other]

    cs.CV eess.IV

    Realistic Extreme Image Rescaling via Generative Latent Space Learning

    Authors: Ce Wang, Wanjie Sun, Zhenzhong Chen

    Abstract: Image rescaling aims to learn the optimal downscaled low-resolution (LR) image that can be accurately reconstructed to its original high-resolution (HR) counterpart. This process is crucial for efficient image processing and storage, especially in the era of ultra-high definition media. However, extreme downscaling factors pose significant challenges due to the highly ill-posed nature of the inver…

    Submitted 17 August, 2024; originally announced August 2024.

  10. arXiv:2408.08322  [pdf, other]

    eess.SP cs.IT

    Movable-Antenna Position Optimization for Physical-Layer Security via Discrete Sampling

    Authors: Weidong Mei, Xin Wei, Yijie Liu, Boyu Ning, Zhi Chen

    Abstract: Fluid antennas (FAs) and mobile antennas (MAs) are innovative technologies in wireless communications that are able to proactively improve channel conditions by dynamically adjusting the transmit/receive antenna positions within a given spatial region. In this paper, we investigate an MA-enhanced multiple-input single-output (MISO) secure communication system, aiming to maximize the secrecy rate b…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: This paper is accepted by IEEE Globecom 2024. arXiv admin note: substantial text overlap with arXiv:2403.16886

  11. arXiv:2408.07320  [pdf, other]

    eess.SP

    Coordinated Spectral Efficiency Prediction for Real-World 5G CoMP Systems

    Authors: Zhixing Chen, Zhaoyu Fan, Yang Li, Yibin Kang, Qi Yan, Qingjiang Shi

    Abstract: Coordinated multipoint (CoMP) systems incur substantial resource consumption due to the management of backhaul links and the coordination among various base stations (BSs). Accurate prediction of coordinated spectral efficiency (CSE) can guide the optimization of network parameters, resulting in enhanced resource utilization efficiency. However, characterizing the CSE is intractable due to the inh…

    Submitted 14 August, 2024; originally announced August 2024.

  12. arXiv:2408.07293  [pdf, other]

    eess.IV cs.CV q-bio.NC

    Discriminating retinal microvascular and neuronal differences related to migraines: Deep Learning based Crossectional Study

    Authors: Feilong Tang, Matt Trinh, Annita Duong, Angelica Ly, Fiona Stapleton, Zhe Chen, Zongyuan Ge, Imran Razzak

    Abstract: Migraine, a prevalent neurological disorder, has been associated with various ocular manifestations suggestive of neuronal and microvascular deficits. However, there is limited understanding of the extent to which retinal imaging may discriminate between individuals with migraines versus without migraines. In this study, we apply convolutional neural networks to color fundus photography (CFP) and…

    Submitted 29 July, 2024; originally announced August 2024.

  13. arXiv:2408.06109  [pdf]

    eess.SP q-bio.QM

    Inferring directed spectral information flow between mixed-frequency time series

    Authors: Qiqi Xian, Zhe Sage Chen

    Abstract: Identifying directed spectral information flow between multivariate time series is important for many applications in finance, climate, geophysics and neuroscience. Spectral Granger causality (SGC) is a prediction-based measure characterizing directed information flow at specific oscillatory frequencies. However, traditional vector autoregressive (VAR) approaches are insufficient to assess SGC whe…

    Submitted 17 August, 2024; v1 submitted 12 August, 2024; originally announced August 2024.

  14. arXiv:2408.05254  [pdf]

    eess.SY

    Optimal Power Flow in Renewable-Integrated Power Systems: A Comprehensive Review

    Authors: Zigang Chen

    Abstract: This paper explores the integration of renewable energy sources into power systems, highlighting the resulting complexities such as variability and intermittency that challenge traditional power flow dynamics. We delve into innovative Optimal Power Flow (OPF) strategies designed to manage the unpredictability of renewable sources while ensuring economically viable and stable grid operations. A tho…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: 22 pages

  15. arXiv:2408.04273  [pdf, other]

    eess.IV cs.CV

    SG-JND: Semantic-Guided Just Noticeable Distortion Predictor For Image Compression

    Authors: Linhan Cao, Wei Sun, Xiongkuo Min, Jun Jia, Zicheng Zhang, Zijian Chen, Yucheng Zhu, Lizhou Liu, Qiubo Chen, Jing Chen, Guangtao Zhai

    Abstract: Just noticeable distortion (JND), representing the threshold of distortion in an image that is minimally perceptible to the human visual system (HVS), is crucial for image compression algorithms to achieve a trade-off between transmission bit rate and image quality. However, traditional JND prediction methods only rely on pixel-level or sub-band level features, lacking the ability to capture the i…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Accepted by ICIP 2024

  16. arXiv:2408.03446  [pdf, other]

    cs.NI eess.SP

    Optimizing NOMA Transmissions to Advance Federated Learning in Vehicular Networks

    Authors: Ziru Chen, Zhou Ni, Peiyuan Guan, Lu Wang, Lin X. Cai, Morteza Hashemi, Zongzhi Li

    Abstract: Diverse critical data, such as location information and driving patterns, can be collected by IoT devices in vehicular networks to improve driving experiences and road safety. However, drivers are often reluctant to share their data due to privacy concerns. The Federated Vehicular Network (FVN) is a promising technology that tackles these concerns by transmitting model parameters instead of raw da…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: The paper is accepted by IEEE Globecom 2024

  17. arXiv:2408.02622  [pdf, other]

    cs.CL cs.AI cs.HC cs.SD eess.AS

    Language Model Can Listen While Speaking

    Authors: Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

    Abstract: Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisf…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Demo can be found at https://ddlbojack.github.io/LSLM

  18. arXiv:2408.00434  [pdf, other]

    eess.SP

    Flexible Beam Coverage Optimization for Movable-Antenna Array

    Authors: Dong Wang, Weidong Mei, Boyu Ning, Zhi Chen

    Abstract: Fluid antennas (FAs) and movable antennas (MAs) have attracted increasing attention in wireless communications recently. As compared to the conventional fixed-position antennas (FPAs), their geometry can be dynamically reconfigured, such that more flexible beamforming can be achieved for signal coverage and/or interference nulling. In this paper, we investigate the use of MAs to achieve uniform co…

    Submitted 23 August, 2024; v1 submitted 1 August, 2024; originally announced August 2024.

  19. arXiv:2408.00284  [pdf, other]

    cs.CL cs.SD eess.AS

    Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

    Authors: Xinhan Di, Zihao Chen, Yunming Liang, Junjie Zheng, Yihua Wang, Chaofan Ding

    Abstract: Large-scale text-to-speech (TTS) models have made significant progress recently. However, they still fall short in the generation of Chinese dialectal speech. To address this, we propose Bailing-TTS, a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. Bailing-TTS serves as a foundation model for Chinese dialectal speech generation. First, continual semi-su…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: 8 pages, 2 figures

  20. arXiv:2407.19763  [pdf, other]

    eess.IV cs.CV

    TeleOR: Real-time Telemedicine System for Full-Scene Operating Room

    Authors: Yixuan Wu, Kaiyuan Hu, Qian Shao, Jintai Chen, Danny Z. Chen, Jian Wu

    Abstract: The advent of telemedicine represents a transformative development in leveraging technology to extend the reach of specialized medical expertise to remote surgeries, a field where the immediacy of expert guidance is paramount. However, the intricate dynamics of Operating Room (OR) scene pose unique challenges for telemedicine, particularly in achieving high-fidelity, real-time scene reconstruction…

    Submitted 29 July, 2024; originally announced July 2024.

  21. arXiv:2407.16933  [pdf, other]

    eess.SY cs.LG

    Deep Koopman-based Control of Quality Variation in Multistage Manufacturing Systems

    Authors: Zhiyi Chen, Harshal Maske, Devesh Upadhyay, Huanyi Shui, Xun Huan, Jun Ni

    Abstract: This paper presents a modeling-control synthesis to address the quality control challenges in multistage manufacturing systems (MMSs). A new feedforward control scheme is developed to minimize the quality variations caused by process disturbances in MMSs. Notably, the control framework leverages a stochastic deep Koopman (SDK) model to capture the quality propagation mechanism in the MMSs, highlig…

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: The paper was in the proceeding of 2024 American Control Conference. This submitted version addresses a minor correction to one equation (Eq. 14), while the results and conclusions remain the same

  22. arXiv:2407.15188  [pdf, other]

    eess.AS cs.SD

    Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning

    Authors: Shuai Wang, Zhengyang Chen, Kong Aik Lee, Yanmin Qian, Haizhou Li

    Abstract: Speaker individuality information is among the most critical elements within speech signals. By thoroughly and accurately modeling this information, it can be utilized in various intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this article, we aim to present, from a unique perspective, the developmental history,…

    Submitted 21 July, 2024; originally announced July 2024.

  23. arXiv:2407.14153  [pdf, other]

    eess.IV cs.CV

    ESP-MedSAM: Efficient Self-Prompting SAM for Universal Domain-Generalized Medical Image Segmentation

    Authors: Qing Xu, Jiaxuan Li, Xiangjian He, Ziyu Liu, Zhen Chen, Wenting Duan, Chenxin Li, Maggie M. He, Fiseha B. Tesema, Wooi P. Cheah, Yi Wang, Rong Qu, Jonathan M. Garibaldi

    Abstract: The universality of deep neural networks across different modalities and their generalization capabilities to unseen domains play an essential role in medical image segmentation. The recent Segment Anything Model (SAM) has demonstrated its potential in both settings. However, the huge computational costs, demand for manual annotations as prompts and conflict-prone decoding process of SAM degrade i…

    Submitted 17 August, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

    Comments: Under Review

  24. arXiv:2407.12780  [pdf, other]

    physics.med-ph eess.IV

    Hallucination Index: An Image Quality Metric for Generative Reconstruction Models

    Authors: Matthew Tivnan, Siyeop Yoon, Zhennong Chen, Xiang Li, Dufan Wu, Quanzheng Li

    Abstract: Generative image reconstruction algorithms such as measurement conditioned diffusion models are increasingly popular in the field of medical imaging. These powerful models can transform low signal-to-noise ratio (SNR) inputs into outputs with the appearance of high SNR. However, the outputs can have a new type of error called hallucinations. In medical imaging, these hallucinations may not be obvi…

    Submitted 17 July, 2024; originally announced July 2024.

  25. arXiv:2407.11700  [pdf, other]

    cs.CV eess.IV

    Rate-Distortion-Cognition Controllable Versatile Neural Image Compression

    Authors: Jinming Liu, Ruoyu Feng, Yunpeng Qi, Qiuyu Chen, Zhibo Chen, Wenjun Zeng, Xin Jin

    Abstract: Recently, the field of Image Coding for Machines (ICM) has garnered heightened interest and significant advances thanks to the rapid progress of learning-based techniques for image compression and analysis. Previous studies often require training separate codecs to support various bitrate levels, machine tasks, and networks, thus lacking both flexibility and practicality. To address these challeng…

    Submitted 17 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

    Comments: ECCV2024

  26. arXiv:2407.11413  [pdf, other]

    math.OC eess.SY

    Distributed Prescribed-Time Convex Optimization: Cascade Design and Time-Varying Gain Approach

    Authors: Gewei Zuo, Lijun Zhu, Yujuan Wang, Zhiyong Chen

    Abstract: In this paper, we address the distributed prescribed-time convex optimization (DPTCO) problem for a class of nonlinear multi-agent systems (MASs) under undirected connected graph. A cascade design framework is proposed such that the DPTCO implementation is divided into two parts: distributed optimal trajectory generator design and local reference trajectory tracking controller design. The DPTCO pr…

    Submitted 16 July, 2024; originally announced July 2024.

  27. arXiv:2407.11408  [pdf, other]

    eess.SY

    Prescribed-time Cooperative Output Regulation of Linear Heterogeneous Multi-agent Systems

    Authors: Gewei Zuo, Lijun Zhu, Yujuan Wang, Zhiyong Chen

    Abstract: A finite-time protocol for a multi-agent systems (MASs) can guarantee the convergence of every agent in a finite time interval in contrast to the asymptotic convergence, but the settling time depends on the initial condition and design parameters and is inconsistent across the agents. In this paper, we study the prescribed-time cooperative output regulation (PTCOR) problem for a class of linear he…

    Submitted 16 July, 2024; originally announced July 2024.

  28. arXiv:2407.10833  [pdf, other]

    eess.IV cs.CV

    MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

    Authors: Yulin Ren, Xin Li, Bingchen Li, Xingrui Wang, Mengxi Guo, Shijie Zhao, Li Zhang, Zhibo Chen

    Abstract: We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. This intends to handle two pivotal challenges in the existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. Specifically, our MoE-DiffIR develo…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  29. arXiv:2407.10603  [pdf, other]

    eess.AS cs.CL cs.SD

    Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data

    Authors: Liang-Hsuan Tseng, Zih-Ching Chen, Wei-Shun Chang, Cheng-Kuang Lee, Tsung-Ren Huang, Hung-yi Lee

    Abstract: Recent advances in automatic speech recognition (ASR) often rely on large speech foundation models for generating high-quality transcriptions. However, these models can be impractical due to limited computing resources. The situation is even more severe in terms of more realistic or difficult scenarios, such as code-switching ASR (CS-ASR). To address this, we present a framework for developing mor…

    Submitted 15 July, 2024; originally announced July 2024.

  30. arXiv:2407.10325  [pdf, other]

    eess.IV cs.CV

    Light Field Compression Based on Implicit Neural Representation

    Authors: Henan Wang, Hanxin Zhu, Zhibo Chen

    Abstract: Light field, as a new data representation format in multimedia, has the ability to capture both intensity and direction of light rays. However, the additional angular information also brings a large volume of data. Classical coding methods are not effective to describe the relationship between different views, leading to redundancy left. To address this problem, we propose a novel light field comp…

    Submitted 7 May, 2024; originally announced July 2024.

    Comments: PCS2022

  31. arXiv:2407.05289  [pdf, other]

    cs.IT eess.SP

    DM-MIMO: Diffusion Models for Robust Semantic Communications over MIMO Channels

    Authors: Yiheng Duan, Tong Wu, Zhiyong Chen, Meixia Tao

    Abstract: This paper investigates robust semantic communications over multiple-input multiple-output (MIMO) fading channels. Current semantic communications over MIMO channels mainly focus on channel adaptive encoding and decoding, which lacks exploration of signal distribution. To leverage the potential of signal distribution in signal space denoising, we develop a diffusion model over MIMO channels (DM-MI…

    Submitted 7 July, 2024; originally announced July 2024.

  32. arXiv:2407.04675  [pdf, other]

    eess.AS cs.SD

    Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

    Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, et al. (30 additional authors not shown)

    Abstract: Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this wor…

    Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

  33. arXiv:2407.04416  [pdf, other]

    cs.SD cs.MM eess.AS

    Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions

    Authors: Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang

    Abstract: Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models.…

    Submitted 14 August, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: 5 pages with 1 appendix

  34. arXiv:2407.04174  [pdf, other]

    cs.NI eess.SP

    Gemini: Integrating Full-fledged Sensing upon Millimeter Wave Communications

    Authors: Yilong Li, Zhe Chen

    Abstract: Integrating millimeter wave (mmWave) technology in both communication and sensing is promising as it enables the reuse of existing spectrum and infrastructure without draining resources. Most existing systems piggyback sensing onto conventional communication modes without fully exploiting the potential of integrated sensing and communication (ISAC) in mmWave radios (not full-fledged). In this paper…

    Submitted 27 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: 12 pages

  35. arXiv:2407.01097  [pdf, other]

    eess.SY

    HGNET: A Hierarchical Feature Guided Network for Occupancy Flow Field Prediction

    Authors: Zhan Chen, Chen Tang, Lu Xiong

    Abstract: Predicting the motion of multiple traffic participants has always been one of the most challenging tasks in autonomous driving. The recently proposed occupancy flow field prediction method has shown to be a more effective and scalable representation compared to general trajectory prediction methods. However, in complex multi-agent traffic scenarios, it remains difficult to model the interactions a…

    Submitted 1 July, 2024; originally announced July 2024.

  36. arXiv:2406.19954  [pdf, other]

    cs.CL cs.HC cs.SD eess.AS

    BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

    Authors: Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg

    Abstract: Incorporating speech understanding capabilities into pretrained large-language models has become a vital research direction (SpeechLLM). The previous architectures can be categorized as: i) GPT-style, prepend speech prompts to the text prompts as a sequence of LLM inputs like a decoder-only model; ii) T5-style, introduce speech cross-attention to each layer of the pretrained LLMs. We propose BESTO…

    Submitted 28 June, 2024; originally announced June 2024.

    MSC Class: 68T10; ACM Class: I.2.7

  37. arXiv:2406.19674  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

    Authors: Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

    Abstract: Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while b…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech-2024

  38. Joint Beamforming and Antenna Position Optimization for Movable Antenna-Assisted Spectrum Sharing

    Authors: Xin Wei, Weidong Mei, Dong Wang, Boyu Ning, Zhi Chen

    Abstract: Fluid antennas (FAs) and movable antennas (MAs) have drawn increasing attention in wireless communications recently due to their ability to create favorable channel conditions via local antenna movement within a confined region. In this letter, we advance their application for cognitive radio to facilitate efficient spectrum sharing between primary and secondary communication systems. In particula…

    Submitted 23 August, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

    Comments: Accepted to IEEE Wireless Communications Letters

  39. arXiv:2406.18871  [pdf, other]

    eess.AS cs.CL

    DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

    Abstract: Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby fa…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  40. arXiv:2406.18547  [pdf]

    eess.IV cs.CV

    Enhancing Medical Imaging with GANs Synthesizing Realistic Images from Limited Data

    Authors: Yinqiu Feng, Bo Zhang, Lingxi Xiao, Yutian Yang, Tana Gegen, Zexi Chen

    Abstract: In this research, we introduce an innovative method for synthesizing medical images using generative adversarial networks (GANs). Our proposed GANs method demonstrates the capability to produce realistic synthetic images even when trained on a limited quantity of real medical image data, showcasing commendable generalization prowess. To achieve this, we devised a generator and discriminator networ…

    Submitted 22 May, 2024; originally announced June 2024.

  41. arXiv:2406.18361  [pdf, other]

    cs.CV cs.AI eess.IV

    Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

    Authors: Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Weijiang Yu, Fudan Zheng

    Abstract: Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first laten…

    Submitted 9 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg

  42. arXiv:2406.16981  [pdf]

    eess.IV cs.AI cs.LG eess.SP

    Research on Feature Extraction Data Processing System For MRI of Brain Diseases Based on Computer Deep Learning

    Authors: Lingxi Xiao, Jinxin Hu, Yutian Yang, Yinqiu Feng, Zichao Li, Zexi Chen

    Abstract: Most of the existing wavelet image processing techniques are carried out in the form of single-scale reconstruction and multiple iterations. However, processing high-quality fMRI data presents problems such as mixed noise and excessive computation time. This project proposes the use of matrix operations by combining mixed noise elimination methods with wavelet analysis to replace traditional itera…

    Submitted 23 June, 2024; originally announced June 2024.

  43. arXiv:2406.16297  [pdf, other]

    cs.CV eess.IV

    Priorformer: A UGC-VQA Method with content and distortion priors

    Authors: Yajing Pei, Shiyu Huang, Yiting Lu, Xin Li, Zhibo Chen

    Abstract: User Generated Content (UGC) videos are susceptible to complicated and variant degradations and contents, which prevents the existing blind video quality assessment (BVQA) models from good performance since the lack of the adaptability of distortions and contents. To mitigate this, we propose a novel prior-augmented perceptual vision transformer (PriorFormer) for the BVQA of UGC, which boosts its ad…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 7 pages

  44. arXiv:2406.15752  [pdf, other]

    eess.AS cs.AI cs.CL

    TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers

    Authors: Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Guanrou Yang, Xie Chen

    Abstract: Neural codec language model (LM) has demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, the codec LM often suffers from limitations in inference speed and stability, due to its auto-regressive nature and implicit alignment between text and audio. In this work, to handle these challenges, we introduce a new variant of neural codec LM, namely TacoLM. Specifically, T…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: INTERSPEECH 2024

  45. arXiv:2406.14976  [pdf, other]

    eess.IV cs.CV

    CoCPF: Coordinate-based Continuous Projection Field for Ill-Posed Inverse Problem in Imaging

    Authors: Zixuan Chen, Lingxiao Yang, Jian-Huang Lai, Xiaohua Xie

    Abstract: Sparse-view computed tomography (SVCT) reconstruction aims to acquire CT images based on sparsely-sampled measurements. It allows subjects to be exposed to less ionizing radiation, reducing the lifetime risk of developing cancers. Recent studies employ implicit neural representation (INR) techniques to reconstruct CT images from a single SV sinogram. However, due to ill-posedness, these INR-based…

    Submitted 21 June, 2024; originally announced June 2024.

  46. arXiv:2406.14878  [pdf, other]

    cs.CV cs.LG eess.IV

    MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection

    Authors: Zhuoxiao Chen, Junjie Meng, Mahsa Baktashmotlagh, Zi Huang, Yadan Luo

    Abstract: LiDAR-based 3D object detection is pivotal across many applications, yet the performance of such detection systems often degrades after deployment, especially when faced with unseen test point clouds originating from diverse locations or subjected to corruption. In this work, we introduce a new online adaptation framework for detectors named Model Synergy (MOS). Specifically, MOS dynamically assem…

    Submitted 21 June, 2024; originally announced June 2024.

  47. arXiv:2406.13705  [pdf, other]

    eess.IV cs.AI cs.CV

    EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy

    Authors: Long Bai, Tong Chen, Qiaozhi Tan, Wan Jun Nah, Yanheng Li, Zhicheng He, Sishen Yuan, Zhen Chen, Jinlin Wu, Mobarakol Islam, Zhen Li, Hongbin Liu, Hongliang Ren

    Abstract: Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels rema…

    Submitted 8 July, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: To appear in MICCAI 2024. Code and dataset availability: https://github.com/longbai1006/EndoUIC

  48. arXiv:2406.12946  [pdf]

    eess.AS cs.AI cs.CL cs.LG

    Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

    Authors: Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

    Abstract: In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships be…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted for Interspeech 2024

  49. arXiv:2406.10856  [pdf, other]

    cs.NI eess.SY

    LEO Satellite Networks Assisted Geo-distributed Data Processing

    Authors: Zhiyuan Zhao, Zhe Chen, Zheng Lin, Wenjun Zhu, Kun Qiu, Chaoqun You, Yue Gao

    Abstract: Nowadays, the increasing deployment of edge clouds globally provides users with low-latency services. However, connecting an edge cloud to a core cloud via optic cables in terrestrial networks poses significant barriers due to the prohibitively expensive building cost of optic cables. Fortunately, emerging Low Earth Orbit (LEO) satellite networks (e.g., Starlink) offer a more cost-effective soluti…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 6 pages, 5 figures

  50. arXiv:2406.09546  [pdf, other]

    cs.CV eess.IV

    Q-Mamba: On First Exploration of Vision Mamba for Image Quality Assessment

    Authors: Fengbin Guan, Xin Li, Zihao Yu, Yiting Lu, Zhibo Chen

    Abstract: In this work, we take the first exploration of the recently popular foundation model, i.e., State Space Model/Mamba, in image quality assessment, aiming at observing and excavating the perception potential in vision Mamba. A series of works on Mamba has shown its significant potential in various fields, e.g., segmentation and classification. However, the perception capability of Mamba has been und…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 17 pages, 3 figures