Zum Hauptinhalt springen

Showing 1–50 of 189 results for author: Xia, J

Searching in archive eess. Search in all archives.
.
  1. arXiv:2408.11982  [pdf, other

    eess.IV cs.CV cs.MM

    AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results

    Authors: Maksim Smirnov, Aleksandr Gushchin, Anastasia Antsiferova, Dmitry Vatolin, Radu Timofte, Ziheng Jia, Zicheng Zhang, Wei Sun, Jiaying Qian, Yuqin Cao, Yinan Sun, Yuxin Zhu, Xiongkuo Min, Guangtao Zhai, Kanjar De, Qing Luo, Ao-Xiang Zhang, Peng Zhang, Haibo Lei, Linyan Jiang, Yaqing Li, Wenhui Meng, Xiaoheng Tan, Haiqiang Wang, Xiaozhong Xu , et al. (11 additional authors not shown)

    Abstract: Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dat… ▽ More

    Submitted 28 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

  2. arXiv:2408.08228  [pdf, other

    eess.IV cs.CV

    Rethinking Medical Anomaly Detection in Brain MRI: An Image Quality Assessment Perspective

    Authors: Zixuan Pan, Jun Xia, Zheyu Yan, Guoyue Xu, Yawen Wu, Zhenge Jia, Jianxu Chen, Yiyu Shi

    Abstract: Reconstruction-based methods, particularly those leveraging autoencoders, have been widely adopted to perform anomaly detection in brain MRI. While most existing works try to improve detection accuracy by proposing new model structures or algorithms, we tackle the problem through image quality assessment, an underexplored perspective in the field. We propose a fusion quality loss function that com… ▽ More

    Submitted 15 August, 2024; originally announced August 2024.

  3. arXiv:2408.07592  [pdf, other

    eess.SP

    Multi-periodicity dependency Transformer based on spectrum offset for radio frequency fingerprint identification

    Authors: Jing Xiao, Wenrui Ding, Zeqi Shao, Duona Zhang, Yanan Ma, Yufeng Wang, Jian Wang

    Abstract: Radio Frequency Fingerprint Identification (RFFI) has emerged as a pivotal task for reliable device authentication. Despite advancements in RFFI methods, background noise and intentional modulation features result in weak energy and subtle differences in the RFF features. These challenges diminish the capability of RFFI methods in feature representation, complicating the effective identification o… ▽ More

    Submitted 14 August, 2024; originally announced August 2024.

  4. arXiv:2407.05619  [pdf, other

    cs.RO eess.SY

    AIRA: A Low-cost IR-based Approach Towards Autonomous Precision Drone Landing and NLOS Indoor Navigation

    Authors: Yanchen Liu, Minghui Zhao, Kaiyuan Hou, Junxi Xia, Charlie Carver, Stephen Xia, Xia Zhou, Xiaofan Jiang

    Abstract: Automatic drone landing is an important step for achieving fully autonomous drones. Although there are many works that leverage GPS, video, wireless signals, and active acoustic sensing to perform precise landing, autonomous drone landing remains an unsolved challenge for palm-sized microdrones that may not be able to support the high computational requirements of vision, wireless, or active audio… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  5. Deep Learning Segmentation of Ascites on Abdominal CT Scans for Automatic Volume Quantification

    Authors: Benjamin Hou, Sung-Won Lee, Jung-Min Lee, Christopher Koh, Jing Xiao, Perry J. Pickhardt, Ronald M. Summers

    Abstract: Purpose: To evaluate the performance of an automated deep learning method in detecting ascites and subsequently quantifying its volume in patients with liver cirrhosis and ovarian cancer. Materials and Methods: This retrospective study included contrast-enhanced and non-contrast abdominal-pelvic CT scans of patients with cirrhotic ascites and patients with ovarian cancer from two institutions, N… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

  6. arXiv:2406.10869  [pdf, other

    eess.IV cs.CV

    Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution

    Authors: Cuixin Yang, Rongkang Dong, Jun Xiao, Cong Zhang, Kin-Man Lam, Fei Zhou, Guoping Qiu

    Abstract: As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike 2D plain images that are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs, therefore, requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI sup… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 13 pages, 12 figures, journal

  7. Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation

    Authors: Jingyuan Xia, Zhixiong Yang, Shengxi Li, Shuanghui Zhang, Yaowen Fu, Deniz Gündüz, Xiang Li

    Abstract: Learning-based approaches have witnessed great successes in blind single image super-resolution (SISR) tasks, however, handcrafted kernel priors and learning based kernel priors are typically required. In this paper, we propose a Meta-learning and Markov Chain Monte Carlo (MCMC) based SISR approach to learn kernel priors from organized randomness. In concrete, a lightweight network is adopted as k… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: This paper has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

  8. arXiv:2406.08835  [pdf, other

    cs.SD eess.AS

    EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed

    Authors: Ziyang Zhuang, Chenfeng Miao, Kun Zou, Ming Fang, Tao Wei, Zijian Li, Ning Cheng, Wei Hu, Shaojun Wang, Jing Xiao

    Abstract: Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. It uses an Index Mappin… ▽ More

    Submitted 28 August, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Submitted to ICASSP 2025

  9. arXiv:2405.17028  [pdf, other

    cs.SD eess.AS

    RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

    Authors: Haoxiang Shi, Jianzong Wang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao

    Abstract: Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted by the 8th APWeb-WAIM International Joint Conference on Web and Big Data

  10. arXiv:2405.01242  [pdf, other

    cs.SD cs.AI cs.LG eess.AS

    TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

    Authors: Yueyuan Sui, Minghui Zhao, Junxi Xia, Xiaofan Jiang, Stephen Xia

    Abstract: We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state of-art m… ▽ More

    Submitted 29 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

  11. arXiv:2405.00930  [pdf, other

    cs.SD eess.AS

    MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

    Authors: Pengcheng Li, Jianzong Wang, Xulong Zhang, Yong Zhang, Jing Xiao, Ning Cheng

    Abstract: One-shot voice conversion aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing methods face difficulties in satisfactory speech representation disentanglement and suffer from sizable networks as some of them leverage numerous complex modules for disentanglement. In this paper, we propose a model named MAIN-VC to effectively… ▽ More

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  12. arXiv:2405.00603  [pdf, other

    cs.SD eess.AS

    Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

    Authors: Yimin Deng, Jianzong Wang, Xulong Zhang, Ning Cheng, Jing Xiao

    Abstract: Voice conversion is the task to transform voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage while the prosodic information of hidden units lacks use. To address these issue… ▽ More

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  13. arXiv:2405.00577  [pdf

    cs.LG eess.SP q-bio.NC

    Discovering robust biomarkers of neurological disorders from functional MRI using graph neural networks: A Review

    Authors: Yi Hao Chan, Deepank Girish, Sukrit Gupta, Jing Xia, Chockalingam Kasi, Yinan He, Conghao Wang, Jagath C. Rajapakse

    Abstract: Graph neural networks (GNN) have emerged as a popular tool for modelling functional magnetic resonance imaging (fMRI) datasets. Many recent studies have reported significant improvements in disorder classification performance via more sophisticated GNN designs and highlighted salient features that could be potential biomarkers of the disorder. In this review, we provide an overview of how GNN and… ▽ More

    Submitted 1 May, 2024; originally announced May 2024.

  14. arXiv:2404.19214  [pdf, other

    cs.SD eess.AS

    EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

    Authors: Jianzong Wang, Ziqi Liang, Xulong Zhang, Ning Cheng, Jing Xiao

    Abstract: In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared… ▽ More

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  15. arXiv:2404.19212  [pdf, other

    cs.SD eess.AS

    EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

    Authors: Ziqi Liang, Jianzong Wang, Xulong Zhang, Yong Zhang, Ning Cheng, Jing Xiao

    Abstract: Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally take into account disentangling speech components through human-crafted bottleneck features which can not achieve sufficient information disentangling, while pitch and rhythm may still be mixed together. There is a risk of informat… ▽ More

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  16. arXiv:2404.19187  [pdf, other

    cs.SD eess.AS

    CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition

    Authors: Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

    Abstract: Singing voice beautifying is a novel task that has application value in people's daily life, aiming to correct the pitch of the singing voice and improve the expressiveness without changing the original timbre and content. Existing methods rely on paired data or only concentrate on the correction of pitch. However, professional songs and amateur songs from the same person are hard to obtain, and s… ▽ More

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  17. arXiv:2404.15704  [pdf, other

    cs.LG cs.AI cs.SD eess.AS

    Efficient Multi-Model Fusion with Adversarial Complementary Representation Learning

    Authors: Zuheng Kang, Yayun He, Jianzong Wang, Junqing Peng, Jing Xiao

    Abstract: Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification, relying heavily on partial prior knowledge during decision-making, resulting in suboptimal performance. Although multi-model fusion (MMF) can mitigate some of these issues, redundancy in learned representations may limits improvements. To this end, we propose an adversarial comp… ▽ More

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  18. arXiv:2404.15620  [pdf, other

    eess.IV

    A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution

    Authors: Zhixiong Yang, Jingyuan Xia, Shengxi Li, Xinghua Huang, Shuanghui Zhang, Zhen Liu, Yaowen Fu, Yongxiang Liu

    Abstract: Deep learning-based methods have achieved significant successes on solving the blind super-resolution (BSR) problem. However, most of them request supervised pre-training on labelled datasets. This paper proposes an unsupervised kernel estimation model, named dynamic kernel prior (DKP), to realize an unsupervised and pre-training-free learning-based algorithm for solving the BSR problem. DKP can a… ▽ More

    Submitted 25 April, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: Accepted for publication in CVPR 2024

  19. arXiv:2404.13892  [pdf, other

    cs.SD cs.AI eess.AS

    Retrieval-Augmented Audio Deepfake Detection

    Authors: Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

    Abstract: With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired… ▽ More

    Submitted 23 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Conference on Multimedia Retrieval (ICMR 2024)

  20. arXiv:2404.13786  [pdf, other

    eess.SY cs.AI cs.DC cs.LG

    Soar: Design and Deployment of A Smart Roadside Infrastructure System for Autonomous Driving

    Authors: Shuyao Shi, Neiwen Ling, Zhehao Jiang, Xuan Huang, Yuze He, Xiaoguang Zhao, Bufang Yang, Chen Bian, Jingfei Xia, Zhenyu Yan, Raymond Yeung, Guoliang Xing

    Abstract: Recently,smart roadside infrastructure (SRI) has demonstrated the potential of achieving fully autonomous driving systems. To explore the potential of infrastructure-assisted autonomous driving, this paper presents the design and deployment of Soar, the first end-to-end SRI system specifically designed to support autonomous driving systems. Soar consists of both software and hardware components ca… ▽ More

    Submitted 21 April, 2024; originally announced April 2024.

  21. arXiv:2404.03425  [pdf, other

    eess.IV cs.AI cs.CV

    ChangeMamba: Remote Sensing Change Detection With Spatiotemporal State Space Model

    Authors: Hongruixuan Chen, Jian Song, Chengxi Han, Junshi Xia, Naoto Yokoya

    Abstract: Convolutional neural networks (CNN) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings: CNN are constrained by a limited receptive field that may hinder their ability to capture broader spatial contexts, while Transformers are computationally intensive, making them costly to train and deploy on… ▽ More

    Submitted 26 July, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted by IEEE TGRS: https://ieeexplore.ieee.org/document/10565926

    Journal ref: IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-20, 2024, Art no. 4409720

  22. arXiv:2403.07919  [pdf, ps, other

    cs.IT eess.SP

    Freshness-aware Resource Allocation for Non-orthogonal Wireless-powered IoT Networks

    Authors: Yunfeng Chen, Yong Liu, Jinhao Xiao, Qunying Wu, Han Zhang, Fen Hou

    Abstract: This paper investigates a wireless-powered Internet of Things (IoT) network comprising a hybrid access point (HAP) and two devices. The HAP facilitates downlink wireless energy transfer (WET) for device charging and uplink wireless information transfer (WIT) to collect status updates from the devices. To keep the information fresh, concurrent WET and WIT are allowed, and orthogonal multiple access… ▽ More

    Submitted 27 February, 2024; originally announced March 2024.

  23. arXiv:2402.04356  [pdf, other

    cs.SD cs.CV eess.AS

    Bidirectional Autoregressive Diffusion Model for Dance Generation

    Authors: Canyu Zhang, Youbao Tang, Ning Zhang, Ruei-Sung Lin, Mei Han, Jing Xiao, Song Wang

    Abstract: Dance serves as a powerful medium for expressing human emotions, but the lifelike generation of dance is still a considerable challenge. Recently, diffusion models have showcased remarkable generative abilities across various domains. They hold promise for human motion generation due to their adaptable many-to-many nature. Nonetheless, current diffusion-based motion generation models often create… ▽ More

    Submitted 22 June, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  24. arXiv:2401.11141  [pdf, other

    cs.IT eess.SP

    Wideband Beamforming for RIS Assisted Near-Field Communications

    Authors: Ji Wang, Jian Xiao, Yixuan Zou, Wenwu Xie, Yuanwei Liu

    Abstract: A near-field wideband beamforming scheme is investigated for reconfigurable intelligent surface (RIS) assisted multiple-input multiple-output (MIMO) systems, in which a deep learning-based end-to-end (E2E) optimization framework is proposed to maximize the system spectral efficiency. To deal with the near-field double beam split effect, the base station is equipped with frequency-dependent hybrid… ▽ More

    Submitted 3 July, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

  25. arXiv:2401.08166  [pdf, other

    eess.AS cs.SD

    ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

    Authors: Haobin Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

    Abstract: Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our p… ▽ More

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)

  26. arXiv:2401.08096  [pdf, other

    cs.SD eess.AS

    Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

    Authors: Yimin Deng, Huaizhen Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

    Abstract: Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from input audio has the potential ability to well represent content. Besides, the speaker-style modeling with pre-trained models making the process more complex. To tackle these… ▽ More

    Submitted 17 January, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

    Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)

  27. arXiv:2401.08049  [pdf, other

    cs.CV cs.SD eess.AS

    EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model

    Authors: Bingyuan Zhang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao, Jianzong Wang

    Abstract: In recent years, the field of talking faces generation has attracted considerable attention, with certain methods adept at generating virtual faces that convincingly imitate human expressions. However, existing methods face challenges related to limited generalization, particularly when dealing with challenging identities. Furthermore, methods for editing expressions are often confined to a singul… ▽ More

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)

  28. arXiv:2311.08670  [pdf, other

    cs.SD eess.AS

    CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

    Authors: Yimin Deng, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

    Abstract: Better disentanglement of speech representation is essential to improve the quality of voice conversion. Recently contrastive learning is applied to voice conversion successfully based on speaker labels. However, the performance of model will reduce in conversion between similar speakers. Hence, we propose an augmented negative sample selection to address the issue. Specifically, we create hard ne… ▽ More

    Submitted 14 November, 2023; originally announced November 2023.

    Comments: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)

  29. arXiv:2311.07965  [pdf, other

    cs.SD eess.AS

    DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

    Authors: Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

    Abstract: Most existing neural-based text-to-speech methods rely on extensive datasets and face challenges under low-resource condition. In this paper, we introduce a novel semi-supervised text-to-speech synthesis model that learns from both paired and unpaired data to address this challenge. The key component of the proposed model is a dynamic quantized representation module, which is integrated into a seq… ▽ More

    Submitted 2 February, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: Accepted by the 13th IEEE International Conference on Big Data and Cloud Computing (IEEE BDCloud 2023)

  30. arXiv:2310.09918  [pdf, other

    eess.IV

    Pedestrian Accessible Infrastructure Inventory: Assessing Zero-Shot Segmentation on Multi-Mode Geospatial Data for All Pedestrian Types

    Authors: Jiahao Xia, Gavin Gong, Jiawei Liu, Zhigang Zhu, Hao Tang

    Abstract: In this paper, a Segment Anything Model (SAM)-based pedestrian infrastructure segmentation workflow is designed and optimized, which is capable of efficiently processing multi-sourced geospatial data including LiDAR data and satellite imagery data. We used an expanded definition of pedestrian infrastructure inventory which goes beyond the traditional transportation elements to include street furni… ▽ More

    Submitted 27 November, 2023; v1 submitted 15 October, 2023; originally announced October 2023.

  31. arXiv:2310.04681  [pdf, other

    cs.SD cs.AI eess.AS

    VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model

    Authors: Yayun He, Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao

    Abstract: Speaker verification (SV) performance deteriorates as utterances become shorter. To this end, we propose a new architecture called VoiceExtender which provides a promising solution for improving SV performance when handling short-duration speech signals. We use two guided diffusion models, the built-in and the external speaker embedding (SE) guided diffusion model, both of which utilize a diffusio… ▽ More

    Submitted 6 October, 2023; originally announced October 2023.

    Comments: Accepted by the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2023)

  32. arXiv:2309.08839  [pdf, other

    cs.SD cs.MM eess.AS

    Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

    Authors: Kaiyi Luo, Xulong Zhang, Jianzong Wang, Huaxiong Li, Ning Cheng, Jing Xiao

    Abstract: Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, has posed a great challenge due to the difficulty to uncover discriminative features from audio clips and texts. Existing studies are restricted… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)

  33. arXiv:2309.08838  [pdf, other

    cs.CV eess.IV

    AOSR-Net: All-in-One Sandstorm Removal Network

    Authors: Yazhong Si, Xulong Zhang, Fan Yang, Jianzong Wang, Ning Cheng, Jing Xiao

    Abstract: Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restrict their applicability in real-world scenarios. In addition, these approaches often adopt a strategy of color correction followed by dust removal, which makes the algorithm structure too complex. To solve the issue, we introduce a novel image restoration model, named all-in-one… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)

  34. arXiv:2309.08837  [pdf, other

    cs.SD eess.AS

    FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

    Authors: Jianzong Wang, Xulong Zhang, Aolan Sun, Ning Cheng, Jing Xiao

    Abstract: This paper integrates graph-to-sequence into an end-to-end text-to-speech framework for syntax-aware modelling with syntactic information of input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract the syntactic hidden information, which is concatenated with phoneme embedding and i… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)

  35. arXiv:2308.14319  [pdf, other

    cs.SD eess.AS

    Voice Conversion with Denoising Diffusion Probabilistic GAN Models

    Authors: Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

    Abstract: Voice conversion is a method that allows for the transformation of speaking style while maintaining the integrity of linguistic information. There are many researchers using deep generative models for voice conversion tasks. Generative Adversarial Networks (GANs) can quickly generate high-quality samples, but the generated samples lack diversity. The samples generated by the Denoising Diffusion Pr… ▽ More

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)

  36. arXiv:2308.14317  [pdf, other

    cs.SD eess.AS

    Symbolic & Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music

    Authors: Kexin Zhu, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

    Abstract: Music Emotion Recognition involves the automatic identification of emotional elements within music tracks, and it has garnered significant attention due to its broad applicability in the field of Music Information Retrieval. It can also be used as the upstream task of many other human-related tasks such as emotional music generation and music recommendation. Due to existing psychology research, mu… ▽ More

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)

  37. PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

    Authors: Yimin Deng, Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

    Abstract: Voice conversion as the style transfer task applied to speech, refers to converting one person's speech into a new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC tasks. However, a good voice conversion model should not only match the timbre information of the target speaker, but also expressive information such as prosod… ▽ More

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted by the 31st ACM International Conference on Multimedia (MM2023)

  38. arXiv:2308.10142  [pdf, ps, other

    eess.IV cs.CV

    Polymerized Feature-based Domain Adaptation for Cervical Cancer Dose Map Prediction

    Authors: Jie Zeng, Zeyu Han, Xingchen Peng, Jianghong Xiao, Peng Wang, Yan Wang

    Abstract: Recently, deep learning (DL) has automated and accelerated the clinical radiation therapy (RT) planning significantly by predicting accurate dose maps. However, most DL-based dose map prediction methods are data-driven and not applicable for cervical cancer where only a small amount of data is available. To address this problem, this paper proposes to transfer the rich knowledge learned from anoth… ▽ More

    Submitted 19 August, 2023; originally announced August 2023.

    Comments: Accepted and presented in ISBI 2023. To be published in Proceedings

  39. arXiv:2308.03772  [pdf, other

    cs.CV cs.AI cs.GR cs.LG eess.IV

    Improved Neural Radiance Fields Using Pseudo-depth and Fusion

    Authors: Jingliang Li, Qiang Zhou, Chaohui Yu, Zhengda Lu, Jun Xiao, Zhibin Wang, Fan Wang

    Abstract: Since the advent of Neural Radiance Fields, novel view synthesis has received tremendous attention. The existing approach for the generalization of radiance field reconstruction primarily constructs an encoding volume from nearby source images as additional inputs. However, these approaches cannot efficiently encode the geometric information of real scenes with various scale objects/structures. In… ▽ More

    Submitted 27 July, 2023; originally announced August 2023.

  40. arXiv:2308.03448  [pdf, other

    cs.CV eess.IV

    Make Explicit Calibration Implicit: Calibrate Denoiser Instead of the Noise Model

    Authors: Xin Jin, Jia-Wen Xiao, Ling-Hao Han, Chunle Guo, Xialei Liu, Chongyi Li, Ming-Ming Cheng

    Abstract: Explicit calibration-based methods have dominated RAW image denoising under extremely low-light environments. However, these methods are impeded by several critical limitations: a) the explicit calibration process is both labor- and time-intensive, b) challenge exists in transferring denoisers across different camera models, and c) the disparity between synthetic and real noise is exacerbated by d… ▽ More

    Submitted 25 December, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

  41. arXiv:2306.00648  [pdf, other

    cs.SD eess.AS

    EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

    Authors: Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

    Abstract: There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on the synthesis of a limited number of emotion types and have achieved unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with specified intensity or a mixture of emo… ▽ More

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)

  42. arXiv:2305.19581  [pdf, other

    cs.SD cs.AI eess.AS

    SVVAD: Personal Voice Activity Detection for Speaker Verification

    Authors: Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao

    Abstract: Voice activity detection (VAD) improves the performance of speaker verification (SV) by preserving speech segments and attenuating the effects of non-speech. However, this scheme is not ideal: (1) it fails in noisy environments or multi-speaker conversations; (2) it is trained based on inaccurate non-SV sensitive labels. To address this, we propose a speaker verification-based voice activity detec… ▽ More

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023

  43. arXiv:2305.17778  [pdf

    physics.med-ph eess.IV

    PND-Net: Physics based Non-local Dual-domain Network for Metal Artifact Reduction

    Authors: Jinqiu Xia, Yiwen Zhou, Hailong Wang, Wenxin Deng, Jing Kang, Wangjiang Wu, Mengke Qi, Linghong Zhou, Jianhui Ma, Yuan Xu

    Abstract: Metal artifacts caused by the presence of metallic implants tremendously degrade the reconstructed computed tomography (CT) image quality, affecting clinical diagnosis or reducing the accuracy of organ delineation and dose calculation in radiotherapy. Recently, deep learning methods in sinogram and image domains have been rapidly applied on metal artifact reduction (MAR) task. The supervised dual-… ▽ More

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: 19 pages, 8 figures

  44. arXiv:2305.14778  [pdf, other

    eess.AS cs.SD

    P-vectors: A Parallel-Coupled TDNN/Transformer Network for Speaker Verification

    Authors: Xiyuan Wang, Fangyuan Wang, Bo Xu, Liang Xu, Jing Xiao

    Abstract: Typically, the Time-Delay Neural Network (TDNN) and Transformer can serve as a backbone for Speaker Verification (SV). Both of them have advantages and disadvantages from the perspective of global and local feature modeling. How to effectively integrate these two style features is still an open issue. In this paper, we explore a Parallel-coupled TDNN/Transformer Network (p-vectors) to replace the… ▽ More

    Submitted 25 May, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023

  45. arXiv:2305.14374  [pdf, other

    cs.LG eess.SY

    Inferring Attracting Basins of Power System with Machine Learning

    Authors: Yao Du, Qing Li, Huawei Fan, Meng Zhan, Jinghua Xiao, Xingang Wang

    Abstract: Power systems dominated by renewable energy encounter frequently large, random disturbances, and a critical challenge faced in power-system management is how to anticipate accurately whether the perturbed systems will return to the functional state after the transient or collapse. Whereas model-based studies show that the key to addressing the challenge lies in the attracting basins of the functio… ▽ More

    Submitted 20 May, 2023; originally announced May 2023.

    Comments: 13 pages, 7 figures

  46. arXiv:2305.10983  [pdf, other

    cs.CV eess.IV

    Assessor360: Multi-sequence Network for Blind Omnidirectional Image Quality Assessment

    Authors: Tianhe Wu, Shuwei Shi, Haoming Cai, Mingdeng Cao, Jing Xiao, Yinqiang Zheng, Yujiu Yang

    Abstract: Blind Omnidirectional Image Quality Assessment (BOIQA) aims to objectively assess the human perceptual quality of omnidirectional images (ODIs) without relying on pristine-quality image information. It is becoming more significant with the increasing advancement of virtual reality (VR) technology. However, the quality assessment of ODIs is severely hampered by the fact that the existing BOIQA pipe… ▽ More

    Submitted 10 October, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

  47. arXiv:2305.09543  [pdf, other

    eess.SP cs.AI cs.LG

    EEG-based Sleep Staging with Hybrid Attention

    Authors: Xinliang Zhou, Chenyu Liu, Jiaping Xiao, Yang Liu

    Abstract: Sleep staging is critical for assessing sleep quality and diagnosing sleep disorders. However, capturing both the spatial and temporal relationships within electroencephalogram (EEG) signals during different sleep stages remains challenging. In this paper, we propose a novel framework called the Hybrid Attention EEG Sleep Staging (HASS) Framework. Specifically, we propose a well-designed spatio-te… ▽ More

    Submitted 16 May, 2023; originally announced May 2023.

  48. arXiv:2304.14920  [pdf, other

    eess.SP cs.AI cs.LG

    An EEG Channel Selection Framework for Driver Drowsiness Detection via Interpretability Guidance

    Authors: Xinliang Zhou, Dan Lin, Ziyu Jia, Jiaping Xiao, Chenyu Liu, Liming Zhai, Yang Liu

    Abstract: Drowsy driving has a crucial influence on driving safety, creating an urgent demand for driver drowsiness detection. Electroencephalogram (EEG) signal can accurately reflect the mental fatigue state and thus has been widely studied in drowsiness monitoring. However, the raw EEG data is inherently noisy and redundant, which is neglected by existing works that just use single-channel EEG data or ful… ▽ More

    Submitted 26 April, 2023; originally announced April 2023.

  49. arXiv:2304.11547  [pdf, other

    cs.SD eess.AS

    SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

    Authors: Jianzong Wang, Xulong Zhang, Haobin Tang, Aolan Sun, Ning Cheng, Jing Xiao

    Abstract: In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples by solely conditioning on acoustic features predicted from an acoustic model. However, there are always distortions existing in the predicted acoustic features, compared to those of the groundtruth, especially in the common case of poor acoustic modeling due to low-quality training data. To overcome such limits… ▽ More

    Submitted 23 April, 2023; originally announced April 2023.

    Comments: Accepted by IJCNN2023. 2023 International Joint Conference on Neural Networks (IJCNN2023)

  50. arXiv:2303.14869  [pdf, other

    eess.IV cs.CV cs.LG

    Label-Free Liver Tumor Segmentation

    Authors: Qixin Hu, Yixiong Chen, Junfei Xiao, Shuwen Sun, Jieneng Chen, Alan Yuille, Zongwei Zhou

    Abstract: We demonstrate that AI models can accurately segment liver tumors without the need for manual annotation by using synthetic tumors in CT scans. Our synthetic tumors have two intriguing advantages: (I) realistic in shape and texture, which even medical professionals can confuse with real tumors; (II) effective for training AI models, which can perform liver tumor segmentation similarly to the model… ▽ More

    Submitted 26 March, 2023; originally announced March 2023.

    Comments: CVPR 2023