Showing 1–50 of 987 results for author: Li, J

Searching in archive eess.
  1. arXiv:2408.16315  [pdf, other]

    cs.HC cs.LG eess.SP

    Passenger hazard perception based on EEG signals for highly automated driving vehicles

    Authors: Ashton Yu Xuan Tan, Yingkai Yang, Xiaofei Zhang, Bowen Li, Xiaorong Gao, Sifa Zheng, Jianqiang Wang, Xinyu Gu, Jun Li, Yang Zhao, Yuxin Zhang, Tania Stathaki

    Abstract: Enhancing the safety of autonomous vehicles is crucial, especially given recent accidents involving automated systems. As passengers in these vehicles, humans' sensory perception and decision-making can be integrated with autonomous systems to improve safety. This study explores neural mechanisms in passenger-vehicle interactions, leading to the development of a Passenger Cognitive Model (PCM) and…

    Submitted 29 August, 2024; originally announced August 2024.

  2. arXiv:2408.13522  [pdf, other]

    cs.SD eess.AS

    StreamAAD: Decoding Spatial Auditory Attention with a Streaming Architecture

    Authors: Zelin Qiu, Dingding Yao, Junfeng Li

    Abstract: In this paper, we present our approach for the Track 1 of the Chinese Auditory Attention Decoding (Chinese AAD) Challenge at ISCSLP 2024. Most existing spatial auditory attention decoding (Sp-AAD) methods employ an isolated window architecture, focusing solely on global invariant features without considering relationships between different decision windows, which can lead to suboptimal performance…

    Submitted 24 August, 2024; originally announced August 2024.

  3. arXiv:2408.13495  [pdf]

    eess.IV cs.CV

    Topological GCN for Improving Detection of Hip Landmarks from B-Mode Ultrasound Images

    Authors: Tianxiang Huang, Jing Shi, Ge Jin, Juncheng Li, Jun Wang, Jun Du, Jun Shi

    Abstract: The B-mode ultrasound based computer-aided diagnosis (CAD) has demonstrated its effectiveness for diagnosis of Developmental Dysplasia of the Hip (DDH) in infants. However, due to effect of speckle noise in ultrasound images, it is still a challenge task to accurately detect hip landmarks. In this work, we propose a novel hip landmark detection model by integrating the Topological GCN (TGCN) with…

    Submitted 24 August, 2024; originally announced August 2024.

  4. arXiv:2408.13483  [pdf, other]

    eess.SP cs.IT

    Transmissive RIS Enabled Transceiver Systems: Architecture, Design Issues and Opportunities

    Authors: Zhendong Li, Wen Chen, Qingqing Wu, Ziwei Liu, Chong He, Xudong Bai, Jun Li

    Abstract: Reconfigurable intelligent surface (RIS) is anticipated to augment the performance of beyond fifth-generation (B5G) and sixth-generation (6G) networks by intelligently manipulating the state of its components. Rather than employing reflective RIS for aided communications, this paper proposes an innovative transmissive RIS-enabled transceiver (TRTC) architecture that can accomplish the functions of…

    Submitted 24 August, 2024; originally announced August 2024.

    Journal ref: IEEE VTM, 2024

  5. arXiv:2408.13470  [pdf, other]

    eess.SP

    Performance Analysis of Photon-Limited Free-Space Optical Communications with Practical Photon-Counting Receivers

    Authors: Chen Wang, Zhiyong Xu, Jingyuan Wang, Jianhua Li, Weifeng Mou, Huatao Zhu, Jiyong Zhao, Yang Su, Yimin Wang, Ailin Qi

    Abstract: The non-perfect factors of practical photon-counting receiver are recognized as a significant challenge for long-distance photon-limited free-space optical (FSO) communication systems. This paper presents a comprehensive analytical framework for modeling the statistical properties of time-gated single-photon avalanche diode (TG-SPAD) based photon-counting receivers in presence of dead time, non-ph…

    Submitted 24 August, 2024; originally announced August 2024.

  6. arXiv:2408.10800  [pdf, other]

    eess.SP

    A Novel Signal Detection Method for Photon-Counting Communications with Nonlinear Distortion Effects

    Authors: Chen Wang, Zhiyong Xu, Jingyuan Wang, Jianhua Li, Weifeng Mou, Huatao Zhu, Jiyong Zhao, Yang Su, Yimin Wang, Ailin Qi

    Abstract: This paper proposes a method for estimating and detecting optical signals in practical photon-counting receivers. There are two important aspects of non-perfect photon-counting receivers, namely, (i) dead time which results in blocking loss, and (ii) non-photon-number-resolving, which leads to counting loss during the gate-ON interval. These factors introduce nonlinear distortion to the detected p…

    Submitted 20 August, 2024; originally announced August 2024.

  7. arXiv:2408.08496  [pdf, other]

    cs.NI eess.SP

    Generative AI for Energy Harvesting Internet of Things Network: Fundamental, Applications, and Opportunities

    Authors: Wenwen Xie, Geng Sun, Jiahui Li, Jiacheng Wang, Hongyang Du, Dusit Niyato, Octavia A. Dobre

    Abstract: Internet of Things (IoT) devices are typically powered by small-sized batteries with limited energy storage capacity, requiring regular replacement or recharging. To reduce costs and maintain connectivity in IoT networks, energy harvesting technologies are regarded as a promising solution. Notably, due to its robust analytical and generative capabilities, generative artificial intelligence (GenAI)…

    Submitted 15 August, 2024; originally announced August 2024.

  8. arXiv:2408.07866  [pdf, other]

    eess.SY

    Certifiable Deep Learning for Reachability Using a New Lipschitz Continuous Value Function

    Authors: Jingqi Li, Donggun Lee, Jaewon Lee, Kris Shengjun Dong, Somayeh Sojoudi, Claire Tomlin

    Abstract: We propose a new reachability learning framework for high-dimensional nonlinear systems, focusing on reach-avoid problems. These problems require computing the reach-avoid set, which ensures that all its elements can safely reach a target set despite any disturbance within pre-specified bounds. Our framework has two main parts: offline learning of a newly designed reach-avoid value function and po…

    Submitted 19 August, 2024; v1 submitted 14 August, 2024; originally announced August 2024.

    Comments: Submitted, under review

  9. arXiv:2408.06870  [pdf, ps, other]

    eess.SP

    Spectrum Prediction With Deep 3D Pyramid Vision Transformer Learning

    Authors: Guangliang Pan, Qihui Wu, Bo Zhou, Jie Li, Wei Wang, Guoru Ding, David K. Y. Yau

    Abstract: In this paper, we propose a deep learning (DL)-based task-driven spectrum prediction framework, named DeepSPred. The DeepSPred comprises a feature encoder and a task predictor, where the encoder extracts spectrum usage pattern features, and the predictor configures different networks according to the task requirements to predict future spectrum. Based on the DeepSPred, we first propose a novel 3…

    Submitted 20 August, 2024; v1 submitted 13 August, 2024; originally announced August 2024.

  10. arXiv:2408.05554  [pdf, other]

    eess.AS cs.CL cs.SD

    Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text

    Authors: Jinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang

    Abstract: Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost data to improve the performance of Whisper on under-represented languages. In this study, we utilized easily accessible unpaired speech and tex…

    Submitted 10 August, 2024; originally announced August 2024.

    Comments: Accepted by INTERSPEECH 2024; Minor typo correction

  11. arXiv:2408.04912  [pdf, other]

    cs.SD cs.CE cs.ET cs.LG eess.AS

    AcousAF: Acoustic Sensing-Based Atrial Fibrillation Detection System for Mobile Phones

    Authors: Xuanyu Liu, Haoxian Liu, Jiao Li, Zongqi Yang, Yi Huang, Jin Zhang

    Abstract: Atrial fibrillation (AF) is characterized by irregular electrical impulses originating in the atria, which can lead to severe complications and even death. Due to the intermittent nature of the AF, early and timely monitoring of AF is critical for patients to prevent further exacerbation of the condition. Although ambulatory ECG Holter monitors provide accurate monitoring, the high cost of these d…

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: Accepted for publication in Companion of the 2024 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp Companion '24)

  12. arXiv:2408.04865  [pdf, other]

    cs.SD cs.MM eess.AS

    TEAdapter: Supply abundant guidance for controllable text-to-music generation

    Authors: Jialing Zou, Jiahao Mei, Xudong Nan, Jinghua Li, Daoguo Dong, Liang He

    Abstract: Although current text-guided music generation technology can cope with simple creative scenarios, achieving fine-grained control over individual text-modality conditions remains challenging as user demands become more intricate. Accordingly, we introduce the TEAcher Adapter (TEAdapter), a compact plugin designed to guide the generation process with diverse control information provided by users. In…

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: Accepted by ICME'24: IEEE International Conference on Multimedia and Expo

    Journal ref: 2024 IEEE International Conference on Multimedia and Expo (ICME 2024)

  13. arXiv:2408.04505  [pdf, other]

    cs.IT eess.SP

    Feedback Design with VQ-VAE for Robust Precoding in Multi-User FDD Systems

    Authors: Nurettin Turan, Michael Baur, Jianqing Li, Wolfgang Utschick

    Abstract: In this letter, we propose a vector quantized-variational autoencoder (VQ-VAE)-based feedback scheme for robust precoder design in multi-user frequency division duplex (FDD) systems. We demonstrate how the VQ-VAE can be tailored to specific propagation environments, focusing on systems with low pilot overhead, which is crucial in massive multiple-input multiple-output (MIMO). Extensive simulations…

    Submitted 8 August, 2024; originally announced August 2024.

  14. Highly Efficient No-reference 4K Video Quality Assessment with Full-Pixel Covering Sampling and Training Strategy

    Authors: Xiaoheng Tan, Jiabin Zhang, Yuhui Quan, Jing Li, Yajing Wu, Zilin Bian

    Abstract: Deep Video Quality Assessment (VQA) methods have shown impressive high-performance capabilities. Notably, no-reference (NR) VQA methods play a vital role in situations where obtaining reference videos is restricted or not feasible. Nevertheless, as more streaming videos are being created in ultra-high definition (e.g., 4K) to enrich viewers' experiences, the current deep VQA methods face unaccepta…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: Accepted by ACM MM 2024

  15. Content-driven Magnitude-Derivative Spectrum Complementary Learning for Hyperspectral Image Classification

    Authors: Huiyan Bai, Tingfa Xu, Huan Chen, Peifu Liu, Jianan Li

    Abstract: Extracting discriminative information from complex spectral details in hyperspectral image (HSI) for HSI classification is pivotal. While current prevailing methods rely on spectral magnitude features, they could cause confusion in certain classes, resulting in misclassification and decreased accuracy. We find that the derivative spectrum proves more adept at capturing concealed information, there…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: accepted by TGRS

  16. arXiv:2407.17460  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG eess.SY

    SoNIC: Safe Social Navigation with Adaptive Conformal Inference and Constrained Reinforcement Learning

    Authors: Jianpeng Yao, Xiaopan Zhang, Yu Xia, Zejin Wang, Amit K. Roy-Chowdhury, Jiachen Li

    Abstract: Reinforcement Learning (RL) has enabled social robots to generate trajectories without human-designed rules or interventions, which makes it more effective than hard-coded systems for generalizing to complex real-world scenarios. However, social navigation is a safety-critical task that requires robots to avoid collisions with pedestrians while previous RL-based solutions fall short in safety perf…

    Submitted 24 July, 2024; originally announced July 2024.

    Comments: Project website: https://sonic-social-nav.github.io/

  17. arXiv:2407.16543  [pdf, other]

    eess.SP

    Joint Active and Passive Beamforming Design for IRS-aided MIMO ISAC Based on Sensing Mutual Information

    Authors: Jin Li, Gui Zhou, Tantao Gong, Nan Liu, Rui Zhang

    Abstract: In this paper, we investigate the intelligent reflecting surface (IRS)/reconfigurable intelligent surface (RIS)-aided integrated sensing and communication (ISAC) system based on sensing mutual information (MI). Specifically, the base station (BS) perceives the sensing target via the reflected sensing signal by the IRS, while communicating with the users simultaneously. Our aim is to maximize the s…

    Submitted 23 July, 2024; originally announced July 2024.

  18. arXiv:2407.14153  [pdf, other]

    eess.IV cs.CV

    ESP-MedSAM: Efficient Self-Prompting SAM for Universal Domain-Generalized Medical Image Segmentation

    Authors: Qing Xu, Jiaxuan Li, Xiangjian He, Ziyu Liu, Zhen Chen, Wenting Duan, Chenxin Li, Maggie M. He, Fiseha B. Tesema, Wooi P. Cheah, Yi Wang, Rong Qu, Jonathan M. Garibaldi

    Abstract: The universality of deep neural networks across different modalities and their generalization capabilities to unseen domains play an essential role in medical image segmentation. The recent Segment Anything Model (SAM) has demonstrated its potential in both settings. However, the huge computational costs, demand for manual annotations as prompts and conflict-prone decoding process of SAM degrade i…

    Submitted 17 August, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

    Comments: Under Review

  19. arXiv:2407.13895  [pdf, other]

    eess.AS

    Improving Robustness and Clinical Applicability of Respiratory Sound Classification via Audio Enhancement

    Authors: Jing-Tong Tzeng, Jeng-Lin Li, Huan-Yu Chen, Chun-Hsiang Huang, Chi-Hsin Chen, Cheng-Yi Fan, Edward Pei-Chuan Huang, Chi-Chun Lee

    Abstract: Deep learning techniques have shown promising results in the automatic classification of respiratory sounds. However, accurately distinguishing these sounds in real-world noisy conditions poses challenges for clinical deployment. Additionally, predicting signals with only background noise could undermine user trust in the system. In this study, we propose an audio enhancement (AE) pipeline as a pr…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: The following article has been submitted to The Journal of the Acoustical Society of America (JASA). After it is published, it will be found at https://pubs.aip.org/asa/jasa

  20. arXiv:2407.12295  [pdf, ps, other]

    cs.CV eess.IV

    Exploiting Inter-Image Similarity Prior for Low-Bitrate Remote Sensing Image Compression

    Authors: Junhui Li, Xingsong Hou

    Abstract: Deep learning-based methods have garnered significant attention in remote sensing (RS) image compression due to their superior performance. Most of these methods focus on enhancing the coding capability of the compression network and improving entropy model prediction accuracy. However, they typically compress and decompress each image independently, ignoring the significant inter-image similarity…

    Submitted 16 July, 2024; originally announced July 2024.

  21. arXiv:2407.12229  [pdf, other]

    eess.AS cs.AI eess.SP

    Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

    Authors: Haibin Wu, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Daniel Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li, Naoyuki Kanda

    Abstract: People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. Em…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: See https://aka.ms/emoctrl-tts for demo samples

  22. arXiv:2407.11161  [pdf, other]

    eess.SP

    S-RAN: Semantic-Aware Radio Access Networks

    Authors: Yao Sun, Lan Zhang, Linke Guo, Jian Li, Dusit Niyato, Yuguang Fang

    Abstract: Semantic communication (SemCom) has been a transformative paradigm, emphasizing the precise exchange of meaningful information over traditional bit-level transmissions. However, existing SemCom research, primarily centered on simplified scenarios like single-pair transmissions with direct wireless links, faces significant challenges when applied to real-world radio access networks (RANs). This art…

    Submitted 15 July, 2024; originally announced July 2024.

  23. arXiv:2407.10926  [pdf, other]

    eess.IV cs.CV

    In-Loop Filtering via Trained Look-Up Tables

    Authors: Zhuoyuan Li, Jiacheng Li, Yao Li, Li Li, Dong Liu, Feng Wu

    Abstract: In-loop filtering (ILF) is a key technology for removing the artifacts in image/video coding standards. Recently, neural network-based in-loop filtering methods achieve remarkable coding gains beyond the capability of advanced video coding standards, which becomes a powerful coding tool candidate for future video coding standards. However, the utilization of deep neural networks brings heavy time…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 11 pages, 6 figures

  24. arXiv:2407.09935  [pdf, other]

    cs.CV cs.MM eess.IV

    LeRF: Learning Resampling Function for Adaptive and Efficient Image Interpolation

    Authors: Jiacheng Li, Chang Chen, Fenglong Song, Youliang Yan, Zhiwei Xiong

    Abstract: Image resampling is a basic technique that is widely employed in daily applications, such as camera photo editing. Recent deep neural networks (DNNs) have made impressive progress in performance by introducing learned data priors. Still, these methods are not the perfect substitute for interpolation, due to the drawbacks in efficiency and versatility. In this work, we propose a novel method of Lea…

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: Code: https://github.com/ddlee-cn/LeRF-PyTorch

  25. arXiv:2407.08914  [pdf, other]

    cs.NI eess.SP

    Multi-objective Aerial Collaborative Secure Communication Optimization via Generative Diffusion Model-enabled Deep Reinforcement Learning

    Authors: Chuang Zhang, Geng Sun, Jiahui Li, Qingqing Wu, Jiacheng Wang, Dusit Niyato, Yuanwei Liu

    Abstract: Due to flexibility and low-cost, unmanned aerial vehicles (UAVs) are increasingly crucial for enhancing coverage and functionality of wireless networks. However, incorporating UAVs into next-generation wireless communication systems poses significant challenges, particularly in sustaining high-rate and long-range secure communications against eavesdropping attacks. In this work, we consider a UAV…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: This paper has been submitted to IEEE Transactions on Mobile Computing

  26. arXiv:2407.08551  [pdf, other]

    cs.CL cs.SD eess.AS

    Autoregressive Speech Synthesis without Vector Quantization

    Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei

    Abstract: We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which are originally designed for audio compression and sacrifice fidelity compared to mel-spectrograms. Specifically, (i) instead of cross…

    Submitted 11 July, 2024; originally announced July 2024.

  27. arXiv:2407.07295  [pdf, other]

    eess.IV cs.CE cs.CV

    Deformation-Recovery Diffusion Model (DRDM): Instance Deformation for Image Manipulation and Synthesis

    Authors: Jian-Qing Zheng, Yuanhan Mo, Yang Sun, Jiahua Li, Fuping Wu, Ziyang Wang, Tonia Vincent, Bartłomiej W. Papież

    Abstract: In medical imaging, the diffusion models have shown great potential in synthetic image generation tasks. However, these models often struggle with the interpretable connections between the generated and existing images and could create illusions. To address these challenges, our research proposes a novel diffusion-based generative model based on deformation diffusion and recovery. This model, name…

    Submitted 21 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

  28. arXiv:2407.06614  [pdf, other]

    eess.IV cs.CV

    Implicit Regression in Subspace for High-Sensitivity CEST Imaging

    Authors: Chu Chen, Yang Liu, Se Weon Park, Jizhou Li, Kannie W. Y. Chan, Raymond H. F. Chan

    Abstract: Chemical Exchange Saturation Transfer (CEST) MRI demonstrates its capability in significantly enhancing the detection of proteins and metabolites with low concentrations through exchangeable protons. The clinical application of CEST, however, is constrained by its low contrast and low signal-to-noise ratio (SNR) in the acquired data. Denoising, as one of the post-processing stages for CEST data, c…

    Submitted 9 July, 2024; originally announced July 2024.

  29. arXiv:2407.06524  [pdf, other]

    cs.SD cs.MM eess.AS

    Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer

    Authors: Jizhen Li, Xinmeng Xu, Weiping Tu, Yuhong Yang, Rong Zhu

    Abstract: Recent speech enhancement methods based on convolutional neural networks (CNNs) and transformer have been demonstrated to efficaciously capture time-frequency (T-F) information on spectrogram. However, the correlation of each channels of speech features is failed to explore. Theoretically, each channel map of speech features obtained by different convolution kernels contains information with diffe…

    Submitted 13 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

  30. Soli-enabled Noncontact Heart Rate Detection for Sleep and Meditation Tracking

    Authors: Luzhou Xu, Jaime Lien, Haiguang Li, Nicholas Gillian, Rajeev Nongpiur, Jihan Li, Qian Zhang, Jian Cui, David Jorgensen, Adam Bernstein, Lauren Bedal, Eiji Hayashi, Jin Yamanaka, Alex Lee, Jian Wang, D Shin, Ivan Poupyrev, Trausti Thormundsson, Anupam Pathak, Shwetak Patel

    Abstract: Heart rate (HR) is a crucial physiological signal that can be used to monitor health and fitness. Traditional methods for measuring HR require wearable devices, which can be inconvenient or uncomfortable, especially during sleep and meditation. Noncontact HR detection methods employing microwave radar can be a promising alternative. However, the existing approaches in the literature usually use hi…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 15 pages

    Journal ref: Sci Rep 13, 18008 (2023)

  31. arXiv:2407.06116  [pdf]

    eess.IV cs.CV cs.LG

    Data-driven Nucleus Subclassification on Colon H&E using Style-transferred Digital Pathology

    Authors: Lucas W. Remedios, Shunxing Bao, Samuel W. Remedios, Ho Hin Lee, Leon Y. Cai, Thomas Li, Ruining Deng, Nancy R. Newlin, Adam M. Saunders, Can Cui, Jia Li, Qi Liu, Ken S. Lau, Joseph T. Roland, Mary K Washington, Lori A. Coburn, Keith T. Wilson, Yuankai Huo, Bennett A. Landman

    Abstract: Understanding the way cells communicate, co-locate, and interrelate is essential to furthering our understanding of how the body functions. H&E is widely available, however, cell subtyping often requires expert knowledge and the use of specialized stains. To reduce the annotation burden, AI has been proposed for the classification of cells on H&E. For example, the recent Colon Nucleus Identificati…

    Submitted 15 May, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2401.05602

  32. arXiv:2407.05361  [pdf, other]

    eess.AS cs.CL

    Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

    Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

    Abstract: Recently, speech generation models have made significant progress by using large-scale training data. However, the research community struggle to produce highly spontaneous and human-like speech due to the lack of large-scale, diverse, and spontaneous speech data. This paper present Emilia, the first multilingual speech generation dataset from in-the-wild speech data, and Emilia-Pipe, the first op…

    Submitted 12 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: Fix typos

  33. arXiv:2407.04711  [pdf, other]

    cs.CV cs.AI eess.IV

    MetaFruit Meets Foundation Models: Leveraging a Comprehensive Multi-Fruit Dataset for Advancing Agricultural Foundation Models

    Authors: Jiajia Li, Kyle Lammers, Xunyuan Yin, Xiang Yin, Long He, Renfu Lu, Zhaojian Li

    Abstract: Fruit harvesting poses a significant labor and financial burden for the industry, highlighting the critical need for advancements in robotic harvesting solutions. Machine vision-based fruit detection has been recognized as a crucial component for robust identification of fruits to guide robotic manipulation. Despite considerable progress in leveraging deep learning and machine learning techniques…

    Submitted 13 May, 2024; originally announced July 2024.

    Comments: 14 pages, 5 figures, 7 tables

  34. arXiv:2406.19769  [pdf, other]

    eess.SP

    Decision Transformer for IRS-Assisted Systems with Diffusion-Driven Generative Channels

    Authors: Jie Zhang, Jun Li, Zhe Wang, Yu Han, Long Shi, Bin Cao

    Abstract: In this paper, we propose a novel diffusion-decision transformer (D2T) architecture to optimize the beamforming strategies for intelligent reflecting surface (IRS)-assisted multiple-input single-output (MISO) communication systems. The first challenge lies in the expensive computation cost to recover the real-time channel state information (CSI) from the received pilot signals, which usually requi…

    Submitted 28 June, 2024; originally announced June 2024.

  35. arXiv:2406.18054  [pdf, other]

    eess.IV cs.CV

    Leveraging Pre-trained Models for FF-to-FFPE Histopathological Image Translation

    Authors: Qilai Zhang, Jiawen Li, Peiran Liao, Jiali Hu, Tian Guan, Anjia Han, Yonghong He

    Abstract: The two primary types of Hematoxylin and Eosin (H&E) slides in histopathology are Formalin-Fixed Paraffin-Embedded (FFPE) and Fresh Frozen (FF). FFPE slides offer high quality histopathological images but require a labor-intensive acquisition process. In contrast, FF slides can be prepared quickly, but the image quality is relatively poor. Our task is to translate FF images into FFPE style, thereb…

    Submitted 26 June, 2024; originally announced June 2024.

  36. arXiv:2406.17957  [pdf, other]

    cs.SD cs.AI eess.AS

    Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

    Authors: Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg

    Abstract: Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text c…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Published as a conference paper at INTERSPEECH 2024

  37. arXiv:2406.17190  [pdf, other]

    cs.SD cs.LG eess.AS

    Sound Tagging in Infant-centric Home Soundscapes

    Authors: Mohammad Nur Hossain Khan, Jialu Li, Nancy L. McElwain, Mark Hasegawa-Johnson, Bashima Islam

    Abstract: Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted in IEEE/ACM CHASE 2024

  38. arXiv:2406.15885  [pdf, other]

    cs.SD cs.AI eess.AS

    The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language Models

    Authors: Jiajia Li, Lu Yang, Mingni Tang, Cong Chen, Zuchao Li, Ping Wang, Hai Zhao

    Abstract: Benchmark plays a pivotal role in assessing the advancements of large language models (LLMs). While numerous benchmarks have been proposed to evaluate LLMs' capabilities, there is a notable absence of a dedicated benchmark for assessing their musical abilities. To address this gap, we present ZIQI-Eval, a comprehensive and large-scale music benchmark specifically designed to evaluate the music-rel…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL-Findings 2024

  39. arXiv:2406.15172  [pdf, other]

    eess.IV cs.CV

    Multimodal Deformable Image Registration for Long-COVID Analysis Based on Progressive Alignment and Multi-perspective Loss

    Authors: Jiahua Li, James T. Grist, Fergus V. Gleeson, Bartłomiej W. Papież

    Abstract: Long COVID is characterized by persistent symptoms, particularly pulmonary impairment, which necessitates advanced imaging for accurate diagnosis. Hyperpolarised Xenon-129 MRI (XeMRI) offers a promising avenue by visualising lung ventilation, perfusion, as well as gas transfer. Integrating functional data from XeMRI with structural data from Computed Tomography (CT) is crucial for comprehensive an…

    Submitted 21 June, 2024; originally announced June 2024.

  40. arXiv:2406.14953  [pdf, other]

    cs.CV cs.AI cs.LG eess.SP

    Deep Imbalanced Regression to Estimate Vascular Age from PPG Data: a Novel Digital Biomarker for Cardiovascular Health

    Authors: Guangkun Nie, Qinghao Zhao, Gongzheng Tang, Jun Li, Shenda Hong

    Abstract: Photoplethysmography (PPG) is emerging as a crucial tool for monitoring human hemodynamics, with recent studies highlighting its potential in assessing vascular aging through deep learning. However, real-world age distributions are often imbalanced, posing significant challenges for deep learning models. In this paper, we introduce a novel, simple, and effective loss function named the Dist Loss t…

    Submitted 2 July, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

  41. Interpretable modulated differentiable STFT and physics-informed balanced spectrum metric for freight train wheelset bearing cross-machine transfer fault diagnosis under speed fluctuations

    Authors: Chao He, Hongmei Shi, Ruixin Li, Jianbo Li, ZuJun Yu

    Abstract: The service conditions of wheelset bearings has a direct impact on the safe operation of railway heavy haul freight trains as the key components. However, speed fluctuation of the trains and few fault samples are the two main problems that restrict the accuracy of bearing fault diagnosis. Therefore, a cross-machine transfer diagnosis (pyDSN) network coupled with interpretable modulated differentia…

    Submitted 16 June, 2024; originally announced June 2024.

    Journal ref: Advanced Engineering Informatics, 2024

  42. arXiv:2406.11546  [pdf, other]

    eess.AS cs.CL cs.SD

    GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

    Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

    Abstract: The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired spee…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Under review

  43. arXiv:2406.10276  [pdf, other]

    cs.CL cs.SD eess.AS

    Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation

    Authors: Peidong Wang, Jian Xue, Jinyu Li, Junkun Chen, Aswin Shanmugam Subramanian

    Abstract: Language-agnostic many-to-one end-to-end speech translation models can convert audio signals from different source languages into text in a target language. These models do not need source language identification, which improves user experience. In some cases, the input language can be given or estimated. Our goal is to use this additional language information while preserving the quality of the o…

    Submitted 11 June, 2024; originally announced June 2024.

  44. arXiv:2406.10052  [pdf, other]

    cs.SD cs.CL eess.AS

    Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection

    Authors: Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li

    Abstract: As a robust and large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive d…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  45. arXiv:2406.09656  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    RSEND: Retinex-based Squeeze and Excitation Network with Dark Region Detection for Efficient Low Light Image Enhancement

    Authors: Jingcheng Li, Ye Qiao, Haocheng Xu, Sitao Huang

    Abstract: Images captured under low-light scenarios often suffer from low quality. Previous CNN-based deep learning methods often involve using Retinex theory. Nevertheless, most of them cannot perform well in more complicated datasets like LOL-v2 while consuming too much computational resources. Besides, some of these methods require sophisticated training at different stages, making the procedure even mor…

    Submitted 13 June, 2024; originally announced June 2024.

  46. arXiv:2406.09167  [pdf, other]

    cs.SD eess.AS

    Vision Transformer Segmentation for Visual Bird Sound Denoising

    Authors: Sahil Kumar, Jialu Li, Youshan Zhang

    Abstract: Audio denoising, especially in the context of bird sounds, remains a challenging task due to persistent residual noise. Traditional and deep learning methods often struggle with artificial or low-frequency noise. In this work, we propose ViTVS, a novel approach that leverages the power of the vision transformer (ViT) architecture. ViTVS adeptly combines segmentation techniques to disentangle clean…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: INTERSPEECH 2024

  47. arXiv:2406.09161  [pdf, other]

    cs.SD eess.AS

    Complex Image-Generative Diffusion Transformer for Audio Denoising

    Authors: Junhui Li, Pu Wang, Jialu Li, Youshan Zhang

    Abstract: The audio denoising technique has captured widespread attention in the deep neural network field. Recently, the audio denoising problem has been converted into an image generation task, and deep learning-based approaches have been applied to tackle this problem. However, its performance is still limited, leaving room for further improvement. In order to enhance audio denoising performance, this pa…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: INTERSPEECH 2024

  48. arXiv:2406.09154  [pdf, other]

    cs.SD cs.CL eess.AS

    Diffusion Gaussian Mixture Audio Denoise

    Authors: Pu Wang, Junhui Li, Jialu Li, Liangdong Guo, Youshan Zhang

    Abstract: Recent diffusion models have achieved promising performances in audio-denoising tasks. The unique property of the reverse process could recover clean signals. However, the distribution of real-world noises does not comply with a single Gaussian distribution and is even unknown. The sampling of Gaussian noise conditions limits its application scenarios. To overcome these challenges, we propose a Di…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: INTERSPEECH 2024

  49. arXiv:2406.08634  [pdf, other]

    eess.IV cs.CV cs.LG

    Unveiling Incomplete Modality Brain Tumor Segmentation: Leveraging Masked Predicted Auto-Encoder and Divergence Learning

    Authors: Zhongao Sun, Jiameng Li, Yuhan Wang, Jiarong Cheng, Qing Zhou, Chun Li

    Abstract: Brain tumor segmentation remains a significant challenge, particularly in the context of multi-modal magnetic resonance imaging (MRI) where missing modality images are common in clinical settings, leading to reduced segmentation accuracy. To address this issue, we propose a novel strategy, which is called masked predicted pre-training, enabling robust feature learning from incomplete modality data…

    Submitted 12 June, 2024; originally announced June 2024.

  50. arXiv:2406.08268  [pdf, other]

    eess.SY

    Multi-Static ISAC based on Network-Assisted Full-Duplex Cell-Free Networks: Performance Analysis and Duplex Mode Optimization

    Authors: Fan Zeng, Ruoyun Liu, Xiaoyu Sun, Jingxuan Yu, Jiamin Li, Pengchen Zhu, Dongming Wang, Xiaohu You

    Abstract: Multi-static integrated sensing and communication (ISAC) technology, which can achieve a wider coverage range and avoid self-interference, is an important trend for the future development of ISAC. Existing multi-static ISAC designs are unable to support the asymmetric uplink (UL)/downlink (DL) communication requirements in the scenario while simultaneously achieving optimal sensing performance. Th…

    Submitted 12 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.