
Showing 1–50 of 69 results for author: Du, Z

Searching in archive eess. Search in all archives.
  1. arXiv:2407.05407  [pdf, other]

    cs.SD cs.AI eess.AS

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

    Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role…

    Submitted 9 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

  2. arXiv:2407.04051  [pdf, other]

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp…

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  3. arXiv:2406.14869  [pdf, other]

    eess.SP

    Cost-Effective RF Fingerprinting Based on Hybrid CVNN-RF Classifier with Automated Multi-Dimensional Early-Exit Strategy

    Authors: Jiayan Gan, Zhixing Du, Qiang Li, Huaizong Shao, Jingran Lin, Ye Pan, Zhongyi Wen, Shafei Wang

    Abstract: While the Internet of Things (IoT) technology is booming and offers huge opportunities for information exchange, it also faces unprecedented security challenges. As an important complement to the physical layer security technologies for IoT, radio frequency fingerprinting (RFF) is of great interest due to its difficulty in counterfeiting. Recently, many machine learning (ML)-based RFF algorithms h…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Accepted by IEEE Internet of Things Journal

  4. arXiv:2406.09950  [pdf, other]

    cs.SD cs.CL eess.AS

    An efficient text augmentation approach for contextualized Mandarin speech recognition

    Authors: Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Zhou Huan

    Abstract: Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA)…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: accepted to interspeech2024

  5. arXiv:2406.04494  [pdf, other]

    eess.AS

    Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline

    Authors: Ali N. Salman, Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen, Carlos Busso, Berrak Sisman

    Abstract: Voice conversion (VC) research traditionally depends on scripted or acted speech, which lacks the natural spontaneity of real-life conversations. While natural speech data is limited for VC, our study focuses on filling in this gap. We introduce a novel data-sourcing pipeline that makes the release of a natural speech dataset for VC, named NaturalVoices. The pipeline extracts rich information in s…

    Submitted 6 June, 2024; originally announced June 2024.

  6. arXiv:2405.11413  [pdf, other]

    eess.AS cs.LG

    Exploring speech style spaces with language models: Emotional TTS without emotion labels

    Authors: Shreeram Suresh Chandra, Zongyang Du, Berrak Sisman

    Abstract: Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents a tough challenge due to the subjective nature of emotions. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or t…

    Submitted 18 May, 2024; originally announced May 2024.

    Comments: Accepted at Speaker Odyssey 2024

  7. arXiv:2405.01730  [pdf, other]

    eess.AS cs.SD

    Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

    Authors: Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman

    Abstract: Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders.…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: Accepted by Speaker Odyssey 2024

  8. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  9. arXiv:2404.10343  [pdf, other]

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  10. arXiv:2402.08846  [pdf, other]

    cs.CL cs.AI cs.MM cs.SD eess.AS

    An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

    Authors: Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen

    Abstract: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning f…

    Submitted 13 February, 2024; originally announced February 2024.

    Comments: Working in progress and will open-source soon

  11. Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

    Authors: Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, Berrak Sisman

    Abstract: Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and stat…

    Submitted 19 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  12. arXiv:2312.16381  [pdf, other]

    eess.SP

    Frame Structure and Protocol Design for Sensing-Assisted NR-V2X Communications

    Authors: Yunxin Li, Fan Liu, Zhen Du, Weijie Yuan, Qingjiang Shi, Christos Masouros

    Abstract: The emergence of the fifth-generation (5G) New Radio (NR) technology has provided unprecedented opportunities for vehicle-to-everything (V2X) networks, enabling enhanced quality of services. However, high-mobility V2X networks require frequent handovers and acquiring accurate channel state information (CSI) necessitates the utilization of pilot signals, leading to increased overhead and reduced co…

    Submitted 26 December, 2023; originally announced December 2023.

    Comments: 14 pages, 14 figures

  13. arXiv:2312.16006  [pdf, other]

    eess.SP

    Interference-Resilient OFDM Waveform Design with Subcarrier Interval Constraint for ISAC Systems

    Authors: Qinghui Lu, Zhen Du, Zenghui Zhang

    Abstract: Conventional orthogonal frequency division multiplexing (OFDM) waveform design in integrated sensing and communications (ISAC) systems usually selects the channels with high-frequency responses to transmit communication data, which does not fully consider the possible interference in the environment. To mitigate these adverse effects, we propose an optimization model by weighting between peak side…

    Submitted 26 December, 2023; originally announced December 2023.

  14. arXiv:2312.15941  [pdf, other]

    eess.SP

    Reshaping the ISAC Tradeoff Under OFDM Signaling: A Probabilistic Constellation Shaping Approach

    Authors: Zhen Du, Fan Liu, Yifeng Xiong, Tony Xiao Han, Yonina C. Eldar, Shi Jin

    Abstract: Integrated sensing and communications is regarded as a key enabling technology in the sixth generation networks, where a unified waveform, such as orthogonal frequency division multiplexing (OFDM) signal, is adopted to facilitate both sensing and communications (S&C). However, the random communication data embedded in the OFDM signal results in severe variability in the sidelobes of its ambiguity…

    Submitted 26 December, 2023; originally announced December 2023.

  15. arXiv:2311.11151  [pdf, ps, other]

    eess.SY cs.LG stat.ML

    On the Hardness of Learning to Stabilize Linear Systems

    Authors: Xiong Zeng, Zexiang Liu, Zhe Du, Necmiye Ozay, Mario Sznaier

    Abstract: Inspired by the work of Tsiamis et al. (2022), in this paper we study the statistical hardness of learning to stabilize linear time-invariant systems. Hardness is measured by the number of samples required to achieve a learning task with a given probability. The work of Tsiamis et al. (2022) shows that there exist system classes that are hard to learn to stabilize with the…

    Submitted 18 November, 2023; originally announced November 2023.

    Comments: 7 pages, 2 figures, accepted by CDC 2023

  16. arXiv:2310.19477  [pdf, other]

    cs.CV cs.MM eess.IV

    VDIP-TGV: Blind Image Deconvolution via Variational Deep Image Prior Empowered by Total Generalized Variation

    Authors: Tingting Wu, Zhiyan Du, Zhi Li, Feng-Lei Fan, Tieyong Zeng

    Abstract: Recovering clear images from blurry ones with an unknown blur kernel is a challenging problem. Deep image prior (DIP) proposes to use the deep network as a regularizer for a single image rather than as a supervised model, which achieves encouraging results in the nonblind deblurring problem. However, since the relationship between images and the network architectures is unclear, it is hard to find…

    Submitted 10 November, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: 13 pages, 5 figures

  17. arXiv:2310.18090  [pdf, ps, other]

    eess.SP

    Probabilistic Constellation Shaping for OFDM-Based ISAC Signaling

    Authors: Zhen Du, Fan Liu, Yifeng Xiong, Tony Xiao Han, Weijie Yuan, Yuanhao Cui, Changhua Yao, Yonina C. Eldar

    Abstract: Integrated Sensing and Communications (ISAC) has garnered significant attention as a promising technology for the upcoming sixth-generation wireless communication systems (6G). In pursuit of this goal, a common strategy is that a unified waveform, such as Orthogonal Frequency Division Multiplexing (OFDM), should serve dual-functional roles by enabling simultaneous sensing and communications (S&C)…

    Submitted 27 October, 2023; originally announced October 2023.

  18. arXiv:2310.04863  [pdf, other]

    cs.SD eess.AS

    SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

    Authors: Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie

    Abstract: Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we intro…

    Submitted 7 October, 2023; originally announced October 2023.

  19. arXiv:2310.04673  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

    Authors: Zhihao Du, Jiaming Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

    Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as a…

    Submitted 2 July, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: 10 pages, work in progress

  20. arXiv:2309.14372  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Human Transcription Quality Improvement

    Authors: Jian Gao, Hanbo Sun, Cheng Cao, Zheng Du

    Abstract: High quality transcription data is crucial for training automatic speech recognition (ASR) systems. However, the existing industry-level data collection pipelines are expensive to researchers, while the quality of crowdsourced transcription is low. In this paper, we propose a reliable method to collect speech transcriptions. We introduce two mechanisms to improve transcription quality: confidence…

    Submitted 23 September, 2023; originally announced September 2023.

    Comments: 5 pages, 3 figures, 5 tables, INTERSPEECH 2023

    MSC Class: 68T50; ACM Class: I.2.7

    Journal ref: INTERSPEECH 2023

  21. arXiv:2309.13573  [pdf, other]

    cs.SD eess.AS

    The second multi-channel multi-party meeting transcription challenge (M2MeT 2.0): A benchmark for speaker-attributed ASR

    Authors: Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu, Zhuo Chen, Kong Aik Lee, Zhijie Yan, Hui Bu

    Abstract: With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of speaker-attributed ASR (SA-ASR), which directly addresses the practical and challenging problem of "who spoke what at when" at typical meeting scenario. We particularly established two sub-tr…

    Submitted 5 October, 2023; v1 submitted 24 September, 2023; originally announced September 2023.

    Comments: 8 pages, Accepted by ASRU2023

  22. arXiv:2309.10089  [pdf, other]

    eess.AS cs.AI cs.CL cs.HC cs.LG cs.SD

    HTEC: Human Transcription Error Correction

    Authors: Hanbo Sun, Jian Gao, Xiaomin Wu, Anjie Fang, Cheng Cao, Zheng Du

    Abstract: High-quality human transcription is essential for training and improving Automatic Speech Recognition (ASR) models. A recent study [libricrowd] has found that every 1% worse transcription Word Error Rate (WER) increases approximately 2% ASR WER by using the transcriptions to train ASR models. Transcription errors are inevitable for even highly-trained annotators. However, few studies have explo…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: 13 pages, 4 figures, 11 tables, AMLC 2023

    MSC Class: 68T50; ACM Class: I.2.7

  23. arXiv:2309.07405  [pdf, other]

    cs.SD cs.AI eess.AS

    FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

    Authors: Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng

    Abstract: This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as…

    Submitted 6 October, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: 5 pages, 3 figures, submitted to ICASSP 2024

  24. arXiv:2309.05058  [pdf, other]

    cs.SD cs.MM eess.AS

    Multimodal Fish Feeding Intensity Assessment in Aquaculture

    Authors: Meng Cui, Xubo Liu, Haohe Liu, Zhuangzhuang Du, Tao Chen, Guoping Lian, Daoliang Li, Wenwu Wang

    Abstract: Fish feeding intensity assessment (FFIA) aims to evaluate fish appetite changes during feeding, which is crucial in industrial aquaculture applications. Existing FFIA methods are limited by their robustness to noise, computational complexity, and the lack of public datasets for developing the models. To address these issues, we first introduce AV-FFIA, a new dataset containing 27,000 labeled audio…

    Submitted 19 May, 2024; v1 submitted 10 September, 2023; originally announced September 2023.

  25. arXiv:2308.08536  [pdf, other]

    eess.SY cs.AI cs.LG

    Can Transformers Learn Optimal Filtering for Unknown Systems?

    Authors: Haldun Balim, Zhe Du, Samet Oymak, Necmiye Ozay

    Abstract: Transformer models have shown great success in natural language processing; however, their potential remains mostly unexplored for dynamical systems. In this work, we investigate the optimal output estimation problem using transformers, which generate output predictions using all the past ones. Particularly, we train the transformer using various distinct systems and then evaluate the performance…

    Submitted 11 June, 2024; v1 submitted 16 August, 2023; originally announced August 2023.

    Comments: Minor differences between the implementation and the originally provided descriptions are corrected, ensuring better clarity and accuracy of the content

  26. arXiv:2305.12459  [pdf, other]

    eess.AS cs.SD

    CASA-ASR: Context-Aware Speaker-Attributed ASR

    Authors: Mohan Shi, Zhihao Du, Qian Chen, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang, Li-Rong Dai

    Abstract: Recently, speaker-attributed automatic speech recognition (SA-ASR) has attracted a wide attention, which aims at answering the question "who spoke what". Different from modular systems, end-to-end (E2E) SA-ASR minimizes the speaker-dependent recognition errors directly and shows a promising applicability. In this paper, we propose a context-aware SA-ASR (CASA-ASR) model by enhancing the contextu…

    Submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech2023

  27. arXiv:2305.11013  [pdf, other]

    cs.SD cs.CL eess.AS

    FunASR: A Fundamental End-to-End Speech Recognition Toolkit

    Authors: Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, Shiliang Zhang

    Abstract: This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications. The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model that has been trained on a manual…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: 5 pages, 3 figures, accepted by INTERSPEECH 2023

  28. arXiv:2305.00681  [pdf, other]

    eess.SP

    Towards ISAC-Empowered Vehicular Networks: Framework, Advances, and Opportunities

    Authors: Zhen Du, Fan Liu, Yunxin Li, Weijie Yuan, Yuanhao Cui, Zenghui Zhang, Christos Masouros, Bo Ai

    Abstract: Connected and autonomous vehicle (CAV) networks face several challenges, such as low throughput, high latency, and poor localization accuracy. These challenges severely impede the implementation of CAV networks for immersive metaverse applications and driving safety in future 6G wireless networks. To alleviate these issues, integrated sensing and communications (ISAC) is envisioned as a game-chang…

    Submitted 1 May, 2023; originally announced May 2023.

  29. arXiv:2303.13243  [pdf, other]

    eess.AS cs.SD

    Pyramid Multi-branch Fusion DCNN with Multi-Head Self-Attention for Mandarin Speech Recognition

    Authors: Kai Liu, Hailiang Xiong, Gangqiang Yang, Zhengfeng Du, Yewen Cao, Danyal Shah

    Abstract: As one of the major branches of automatic speech recognition, attention-based models greatly improves the feature representation ability of the model. In particular, the multi-head mechanism is employed in the attention, hoping to learn speech features of more aspects in different attention subspaces. For speech recognition of complex languages, on the one hand, a small head size will lead to an o…

    Submitted 23 March, 2023; originally announced March 2023.

  30. arXiv:2303.06550  [pdf, other]

    eess.IV cs.CV

    Spatial Correspondence between Graph Neural Network-Segmented Images

    Authors: Qian Li, Yunguan Fu, Qianye Yang, Zhijiang Du, Hongjian Yu, Yipeng Hu

    Abstract: Graph neural networks (GNNs) have been proposed for medical image segmentation, by predicting anatomical structures represented by graphs of vertices and edges. One such type of graph is predefined with fixed size and connectivity to represent a reference of anatomical regions of interest, thus known as templates. This work explores the potentials in these GNNs with common topology for establishin…

    Submitted 16 March, 2023; v1 submitted 11 March, 2023; originally announced March 2023.

    Comments: Accepted at MIDL 2023 (The Medical Imaging with Deep Learning conference, 2023)

  31. arXiv:2303.05397  [pdf, other]

    cs.SD cs.AI eess.AS

    TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization

    Authors: Jiaming Wang, Zhihao Du, Shiliang Zhang

    Abstract: Recently, end-to-end neural diarization (EEND) is introduced and achieves promising results in speaker-overlapped scenarios. In EEND, speaker diarization is formulated as a multi-label prediction problem, where speaker activities are estimated independently and their dependency are not well considered. To overcome these disadvantages, we employ the power set encoding to reformulate speaker diariza…

    Submitted 13 December, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP2023

  32. arXiv:2303.05023  [pdf, other]

    eess.AS cs.AI cs.SD

    X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

    Authors: Kai Liu, Ziqing Du, Xucheng Wan, Huan Zhou

    Abstract: Target speech extraction (TSE) systems are designed to extract target speech from a multi-talker mixture. The popular training objective for most prior TSE networks is to enhance reconstruction performance of extracted speech waveform. However, it has been reported that a TSE system delivers high reconstruction performance may still suffer low-quality experience problems in practice. One such expe…

    Submitted 8 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  33. arXiv:2303.02722  [pdf, other]

    cs.IT eess.SP

    Performance of OTFS-NOMA Scheme for Coordinated Direct and Relay Transmission Networks in High-Mobility Scenarios

    Authors: Yao Xu, Zhen Du, Weijie Yuan, Shaobo Jia, Victor C. M. Leung

    Abstract: In this letter, an orthogonal time frequency space (OTFS) based non-orthogonal multiple access (NOMA) scheme is investigated for the coordinated direct and relay transmission system, where a source directly communicates with a near user with high mobile speed, and it needs the relaying assistance to serve the far user also having high mobility. Due to the coexistence of signal superposition coding…

    Submitted 5 March, 2023; originally announced March 2023.

  34. arXiv:2301.12787  [pdf, other]

    eess.SP

    ISAC-Enabled V2I Networks Based on 5G NR: How Much Can the Overhead Be Reduced?

    Authors: Yunxin Li, Fan Liu, Zhen Du, Weijie Yuan, Christos Masouros

    Abstract: The emergence of the fifth-generation (5G) New Radio (NR) brings additional possibilities to vehicle-to-everything (V2X) network with improved quality of services. In order to obtain accurate channel state information (CSI) in high-mobility V2X networks, pilot signals and frequent handover between vehicles and infrastructures are required to establish and maintain the communication link, which inc…

    Submitted 21 March, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

    Comments: 6 pages, 5 figures

  35. arXiv:2301.06277  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

    Authors: Kai Liu, Xucheng Wan, Ziqing Du, Huan Zhou

    Abstract: As a practical alternative of speech separation, target speaker extraction (TSE) aims to extract the speech from the desired speaker using additional speaker cue extracted from the speaker. Its main challenge lies in how to properly extract and leverage the speaker cue to benefit the extracted speech quality. The cue extraction method adopted in majority existing TSE studies is to directly utilize…

    Submitted 16 January, 2023; originally announced January 2023.

    Comments: ACCEPTED by NCMMSC 2022

  36. arXiv:2211.10243  [pdf, other]

    cs.SD cs.MM eess.AS

    Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis

    Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan

    Abstract: Recently, hybrid systems of clustering and neural diarization models have been successfully applied in multi-party meeting analysis. However, current models always treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome the disadvantages, we reformulate overlapped speaker diarization task as a single-l…

    Submitted 18 November, 2022; originally announced November 2022.

    Comments: Accepted by EMNLP 2022

  37. arXiv:2211.07143  [pdf]

    eess.IV cs.CV

    WSC-Trans: A 3D network model for automatic multi-structural segmentation of temporal bone CT

    Authors: Xin Hua, Zhijiang Du, Hongjian Yu, Jixin Ma, Fanjun Zheng, Cheng Zhang, Qiaohui Lu, Hui Zhao

    Abstract: Cochlear implantation is currently the most effective treatment for patients with severe deafness, but mastering cochlear implantation is extremely challenging because the temporal bone has extremely complex and small three-dimensional anatomical structures, and it is important to avoid damaging the corresponding structures when performing surgery. The spatial location of the relevant anatomical t…

    Submitted 14 November, 2022; originally announced November 2022.

    Comments: 10 pages,7 figures

  38. arXiv:2211.00511  [pdf, other]

    eess.AS cs.SD

    A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

    Authors: Mohan Shi, Jie Zhang, Zhihao Du, Fan Yu, Qian Chen, Shiliang Zhang, Li-Rong Dai

    Abstract: Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR task. It was shown that single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT) and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR) can be explo…

    Submitted 1 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

  39. arXiv:2211.00434  [pdf, other]

    eess.SP

    On the Performance Gain of Integrated Sensing and Communications: A Subspace Correlation Perspective

    Authors: Shihang Lu, Xiao Meng, Zhen Du, Yifeng Xiong, Fan Liu

    Abstract: In this paper, we shed light on the performance gain of integrated sensing and communications (ISAC) from the perspective of channel correlations between radar sensing and communication (S&C), namely ISAC subspace correlation. To begin with, we consider a multi-input multi-output (MIMO) ISAC system and reveal that the optimal ISAC signal is in the subspace spanned by the transmitted steering vecto…

    Submitted 2 November, 2022; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: 6 pages, 5 figures, submitted to IEEE conference

  40. arXiv:2210.05265  [pdf, other]

    cs.SD eess.AS

    MFCCA:Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario

    Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yuhao Liang, Zhihao Du, Yuxiao Lin, Lei Xie

    Abstract: Recently cross-channel attention, which better leverages multi-channel signals from microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses on either learning global correlations between sequences of different channels or exploiting fine-grained channel-wise information effectively at each time step. Considering the delay of microphone arr…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT 2022

  41. arXiv:2209.11906  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    Joint Speech Activity and Overlap Detection with Multi-Exit Architecture

    Authors: Ziqing Du, Kai Liu, Xucheng Wan, Huan Zhou

    Abstract: Overlapped speech detection (OSD) is critical for speech applications in multi-party conversation scenarios. Despite numerous research efforts and much progress, compared with speech activity detection (VAD), OSD remains an open challenge and its overall performance is far from satisfactory. The majority of prior research typically formulates the OSD problem as a standard classification problem, to…

    Submitted 23 September, 2022; originally announced September 2022.

  42. arXiv:2209.11905  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

    Authors: Xucheng Wan, Kai Liu, Ziqing Du, Huan Zhou

    Abstract: To address the monaural speech enhancement problem, numerous research studies have been conducted to enhance speech via operations either in the time domain on the inner domain learned from the speech mixture or in the time-frequency domain on the fixed full-band short-time Fourier transform (STFT) spectrograms. Very recently, a few studies on sub-band based speech enhancement have been proposed. By enha…

    Submitted 23 September, 2022; originally announced September 2022.

  43. arXiv:2206.12300  [pdf]

    eess.IV cs.CV

    Automatic extraction of coronary arteries using deep learning in invasive coronary angiograms

    Authors: Yinghui Meng, Zhenglong Du, Chen Zhao, Minghao Dong, Drew Pienta, Zhihui Xu, Weihua Zhou

    Abstract: Accurate extraction of coronary arteries from invasive coronary angiography (ICA) is important in clinical decision-making for the diagnosis and risk stratification of coronary artery disease (CAD). In this study, we develop a method using deep learning to automatically extract the coronary artery lumen. Methods. A deep learning model U-Net 3+, which incorporates the full-scale skip connections an…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: 22 pages, 5 figures

  44. arXiv:2205.05675  [pdf, other]

    cs.CV eess.IV

    NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results

    Authors: Yawei Li, Kai Zhang, Radu Timofte, Luc Van Gool, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du, Ding Liu, Chenhui Zhou, Jingyi Chen, Qingrui Han, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Yu Qiao, Chao Dong, Long Sun, Jinshan Pan, Yi Zhu, Zhikai Zong, Xiaoxiao Liu, Zheng Hui, Tao Yang , et al. (86 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of e…

    Submitted 11 May, 2022; originally announced May 2022.

    Comments: Validation code of the baseline model is available at https://github.com/ofsoundof/IMDN. Validation of all submitted models is available at https://github.com/ofsoundof/NTIRE2022_ESR

  45. Mode Reduction for Markov Jump Systems

    Authors: Zhe Du, Laura Balzano, Necmiye Ozay

    Abstract: Switched systems are capable of modeling processes with underlying dynamics that may change abruptly over time. To achieve accurate modeling in practice, one may need a large number of modes, but this may in turn increase the model complexity drastically. Existing work on reducing system complexity mainly considers state space reduction, yet reducing the number of modes is less studied. In this wo…

    Submitted 20 October, 2022; v1 submitted 5 May, 2022; originally announced May 2022.

  46. arXiv:2204.08397  [pdf, other]

    eess.IV cs.CV

    Fast and Memory-Efficient Network Towards Efficient Image Super-Resolution

    Authors: Zongcai Du, Ding Liu, Jie Liu, Jie Tang, Gangshan Wu, Lean Fu

    Abstract: Runtime and memory consumption are two important aspects for efficient image super-resolution (EISR) models to be deployed on resource-constrained devices. Recent advances in EISR exploit distillation and aggregation strategies with plenty of channel split and concatenation operations to make full use of limited hierarchical features. In contrast, sequential network operations avoid frequently acc…

    Submitted 18 April, 2022; originally announced April 2022.

    Comments: Accepted by NTIRE 2022 (CVPR Workshop)

  47. arXiv:2203.16834  [pdf, other]

    cs.SD cs.CL eess.AS

    A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

    Authors: Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie

    Abstract: In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic with increasing attention in meeting rich transcription. Specifically, three approaches are evaluated in this study. The first approach, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR to recognize utt…

    Submitted 1 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: accepted by INTERSPEECH 2022, 5 pages, 2 figures

  48. arXiv:2203.09767  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

    Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan

    Abstract: Overlapping speech diarization has been traditionally treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding multiple binary labels into a single label with the power set, which represents the possible combinations of target speakers. This formulation has two benefits. First, the overlaps of target speakers are expl…

    Submitted 30 March, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 2 figures

  49. arXiv:2202.03647  [pdf, other]

    cs.SD eess.AS

    Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

    Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

    Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Ma…

    Submitted 25 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

    Comments: Accepted by ICASSP 2022

  50. arXiv:2111.13694  [pdf, other]

    cs.SD cs.LG eess.AS

    Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information

    Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei

    Abstract: Overlapping speech diarization is always treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set. Specifically, we propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels according to the similarities between speech feat…

    Submitted 28 November, 2021; originally announced November 2021.

    Comments: Submitted to ICASSP 2022, 5 pages, 2 figures