Search | arXiv e-print repository

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role… ▽ More Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models. △ Less

Submitted 9 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

arXiv:2407.04051 [pdf, other]

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang , et al. (8 additional authors not shown)

Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp… ▽ More This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM. △ Less

Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Work in progress. Authors are listed in alphabetical order by family name

arXiv:2406.14869 [pdf, other]

Cost-Effective RF Fingerprinting Based on Hybrid CVNN-RF Classifier with Automated Multi-Dimensional Early-Exit Strategy

Authors: Jiayan Gan, Zhixing Du, Qiang Li, Huaizong Shao, Jingran Lin, Ye Pan, Zhongyi Wen, Shafei Wang

Abstract: While the Internet of Things (IoT) technology is booming and offers huge opportunities for information exchange, it also faces unprecedented security challenges. As an important complement to the physical layer security technologies for IoT, radio frequency fingerprinting (RFF) is of great interest due to its difficulty in counterfeiting. Recently, many machine learning (ML)-based RFF algorithms h… ▽ More While the Internet of Things (IoT) technology is booming and offers huge opportunities for information exchange, it also faces unprecedented security challenges. As an important complement to the physical layer security technologies for IoT, radio frequency fingerprinting (RFF) is of great interest due to its difficulty in counterfeiting. Recently, many machine learning (ML)-based RFF algorithms have emerged. In particular, deep learning (DL) has shown great benefits in automatically extracting complex and subtle features from raw data with high classification accuracy. However, DL algorithms face the computational cost problem as the difficulty of the RFF task and the size of the DNN have increased dramatically. To address the above challenge, this paper proposes a novel costeffective early-exit neural network consisting of a complex-valued neural network (CVNN) backbone with multiple random forest branches, called hybrid CVNN-RF. Unlike conventional studies that use a single fixed DL model to process all RF samples, our hybrid CVNN-RF considers differences in the recognition difficulty of RF samples and introduces an early-exit mechanism to dynamically process the samples. When processing "easy" samples that can be well classified with high confidence, the hybrid CVNN-RF can end early at the random forest branch to reduce computational cost. Conversely, subsequent network layers will be activated to ensure accuracy. To further improve the early-exit rate, an automated multi-dimensional early-exit strategy is proposed to achieve scheduling control from multiple dimensions within the network depth and classification category. Finally, our experiments on the public ADS-B dataset show that the proposed algorithm can reduce the computational cost by 83% while improving the accuracy by 1.6% under a classification task with 100 categories. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted by IEEE Internet of Things Journal

arXiv:2406.09950 [pdf, other]

An efficient text augmentation approach for contextualized Mandarin speech recognition

Authors: Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Zhou Huan

Abstract: Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA)… ▽ More Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. In particular, to contextualize a pre-trained CIF-based ASR, we construct a codebook using limited speech-text data. By utilizing a simple codebook lookup process, we convert available text-only data into latent text embeddings. These embeddings then enhance the inputs for the contextualized ASR. Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance. The top-performing system shows relative CER improvements of up to 30% on rare words and 15% across all words in general. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: accepted to interspeech2024

arXiv:2406.04494 [pdf, other]

Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline

Authors: Ali N. Salman, Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen, Carlos Busso, Berrak Sisman

Abstract: Voice conversion (VC) research traditionally depends on scripted or acted speech, which lacks the natural spontaneity of real-life conversations. While natural speech data is limited for VC, our study focuses on filling in this gap. We introduce a novel data-sourcing pipeline that makes the release of a natural speech dataset for VC, named NaturalVoices. The pipeline extracts rich information in s… ▽ More Voice conversion (VC) research traditionally depends on scripted or acted speech, which lacks the natural spontaneity of real-life conversations. While natural speech data is limited for VC, our study focuses on filling in this gap. We introduce a novel data-sourcing pipeline that makes the release of a natural speech dataset for VC, named NaturalVoices. The pipeline extracts rich information in speech such as emotion and signal-to-noise ratio (SNR) from raw podcast data, utilizing recent deep learning methods and providing flexibility and ease of use. NaturalVoices marks a large-scale, spontaneous, expressive, and emotional speech dataset, comprising over 3,800 hours speech sourced from the original podcasts in the MSP-Podcast dataset. Objective and subjective evaluations demonstrate the effectiveness of using our pipeline for providing natural and expressive data for VC, suggesting the potential of NaturalVoices for broader speech generation tasks. △ Less

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2405.11413 [pdf, other]

Exploring speech style spaces with language models: Emotional TTS without emotion labels

Authors: Shreeram Suresh Chandra, Zongyang Du, Berrak Sisman

Abstract: Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents a tough challenge due to the subjective nature of emotions. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or t… ▽ More Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents a tough challenge due to the subjective nature of emotions. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or text prompts. We present TEMOTTS, a two-stage framework for E-TTS that is trained without emotion labels and is capable of inference without auxiliary inputs. Our proposed method performs knowledge transfer between the linguistic space learned by BERT and the emotional style space constructed by global style tokens. Our experimental results demonstrate the effectiveness of our proposed framework, showcasing improvements in emotional accuracy and naturalness. This is one of the first studies to leverage the emotional correlation between spoken content and expressive delivery for emotional TTS. △ Less

Submitted 18 May, 2024; originally announced May 2024.

Comments: Accepted at Speaker Odyssey 2024

arXiv:2405.01730 [pdf, other]

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

Authors: Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman

Abstract: Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders.… ▽ More Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders. A major challenge of expressive VC lies in emotion prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Codes and samples are publicly available. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: Accepted by Speaker Odyssey 2024

arXiv:2404.16484 [pdf, other]

Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu , et al. (50 additional authors not shown)

Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod… ▽ More This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: CVPR 2024, AI for Streaming (AIS) Workshop

arXiv:2404.10343 [pdf, other]

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang , et al. (109 additional authors not shown)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such… ▽ More This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (ie runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/. △ Less

Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

arXiv:2402.08846 [pdf, other]

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Authors: Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen

Abstract: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning f… ▽ More In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup and little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the Librispeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive pair data. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLM with cross-modality capacity and shed light on the LLM-based ASR community. △ Less

Submitted 13 February, 2024; originally announced February 2024.

Comments: Working in progress and will open-source soon

arXiv:2401.11017 [pdf, other]

doi 10.1109/ICASSP48485.2024.10447060

Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

Authors: Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, Berrak Sisman

Abstract: Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and stat… ▽ More Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters. By conducting a thorough clustering analysis, we demonstrate that emotion information can be readily extracted from speaker embeddings. In order to leverage this information, we introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition. The proposed approach involves the sampling of positive and the negative examples based on the intra-speaker clusters of speaker embeddings. The proposed strategy, which leverages extensive emotion-unlabeled data, leads to a significant improvement in SER performance, whether employed as a standalone pretraining task or integrated into a multi-task pretraining setting. △ Less

Submitted 19 January, 2024; originally announced January 2024.

Comments: Accepted to ICASSP 2024

arXiv:2312.16381 [pdf, other]

Frame Structure and Protocol Design for Sensing-Assisted NR-V2X Communications

Authors: Yunxin Li, Fan Liu, Zhen Du, Weijie Yuan, Qingjiang Shi, Christos Masouros

Abstract: The emergence of the fifth-generation (5G) New Radio (NR) technology has provided unprecedented opportunities for vehicle-to-everything (V2X) networks, enabling enhanced quality of services. However, high-mobility V2X networks require frequent handovers and acquiring accurate channel state information (CSI) necessitates the utilization of pilot signals, leading to increased overhead and reduced co… ▽ More The emergence of the fifth-generation (5G) New Radio (NR) technology has provided unprecedented opportunities for vehicle-to-everything (V2X) networks, enabling enhanced quality of services. However, high-mobility V2X networks require frequent handovers and acquiring accurate channel state information (CSI) necessitates the utilization of pilot signals, leading to increased overhead and reduced communication throughput. To address this challenge, integrated sensing and communications (ISAC) techniques have been employed at the base station (gNB) within vehicle-to-infrastructure (V2I) networks, aiming to minimize overhead and improve spectral efficiency. In this study, we propose novel frame structures that incorporate ISAC signals for three crucial stages in the NR-V2X system: initial access, connected mode, and beam failure and recovery. These new frame structures employ 75% fewer pilots and reduce reference signals by 43.24%, capitalizing on the sensing capability of ISAC signals. Through extensive link-level simulations, we demonstrate that our proposed approach enables faster beam establishment during initial access, higher throughput and more precise beam tracking in connected mode with reduced overhead, and expedited detection and recovery from beam failures. Furthermore, the numerical results obtained from our simulations showcase enhanced spectrum efficiency, improved communication performance and minimal overhead, validating the effectiveness of the proposed ISAC-based techniques in NR V2I networks. △ Less

Submitted 26 December, 2023; originally announced December 2023.

Comments: 14 pages, 14 figures

arXiv:2312.16006 [pdf, other]

Interference-Resilient OFDM Waveform Design with Subcarrier Interval Constraint for ISAC Systems

Authors: Qinghui Lu, Zhen Du, Zenghui Zhang

Abstract: Conventional orthogonal frequency division multiplexing (OFDM) waveform design in integrated sensing and communications (ISAC) systems usually selects the channels with high-frequency responses to transmit communication data, which does not fully consider the possible interference in the environment. To mitigate these adverse effects, we propose an optimization model by weighting between peak side… ▽ More Conventional orthogonal frequency division multiplexing (OFDM) waveform design in integrated sensing and communications (ISAC) systems usually selects the channels with high-frequency responses to transmit communication data, which does not fully consider the possible interference in the environment. To mitigate these adverse effects, we propose an optimization model by weighting between peak sidelobe level and communication data rate, with power and communication subcarrier interval constraints. To tackle the resultant nonconvex problem, an iterative adaptive cyclic minimization (ACM) algorithm is developed, where an adaptive iterative factor is introduced to improve convergence. Subsequently, the least squares algorithm is used to reduce the coefficient of variation of envelopes by further optimizing the phase of the OFDM waveform. Finally, the numerical simulations are provided to demonstrate the interference-resilient ability of the proposed OFDM strategy and the robustness of the ACM algorithm. △ Less

Submitted 26 December, 2023; originally announced December 2023.

arXiv:2312.15941 [pdf, other]

Reshaping the ISAC Tradeoff Under OFDM Signaling: A Probabilistic Constellation Shaping Approach

Authors: Zhen Du, Fan Liu, Yifeng Xiong, Tony Xiao Han, Yonina C. Eldar, Shi Jin

Abstract: Integrated sensing and communications is regarded as a key enabling technology in the sixth generation networks, where a unified waveform, such as orthogonal frequency division multiplexing (OFDM) signal, is adopted to facilitate both sensing and communications (S&C). However, the random communication data embedded in the OFDM signal results in severe variability in the sidelobes of its ambiguity… ▽ More Integrated sensing and communications is regarded as a key enabling technology in the sixth generation networks, where a unified waveform, such as orthogonal frequency division multiplexing (OFDM) signal, is adopted to facilitate both sensing and communications (S&C). However, the random communication data embedded in the OFDM signal results in severe variability in the sidelobes of its ambiguity function (AF), which leads to missed detection of weak targets and false detection of ghost targets, thereby impairing the sensing performance. Therefore, balancing between preserving communication capability (i.e., the randomness) while improving sensing performance remains a challenging task. To cope with this issue, we characterize the random AF of OFDM communication signals, and demonstrate that the AF variance is determined by the fourth-moment of the constellation amplitudes. Subsequently, we propose an optimal probabilistic constellation shaping (PCS) approach by maximizing the achievable information rate (AIR) under the fourth-moment, power and probability constraints, where the optimal input distribution may be numerically specified through a modified Blahut-Arimoto algorithm. To reduce the computational overheads, we further propose a heuristic PCS approach by actively controlling the value of the fourth-moment, without involving the communication metric in the optimization model, despite that the AIR is passively scaled with the variation of the input distribution. Numerical results show that both approaches strike a scalable performance tradeoff between S&C, where the superiority of the PCS-enabled constellations over conventional uniform constellations is also verified. Notably, the heuristic approach achieves very close performance to the optimal counterpart, at a much lower computational complexity. △ Less

Submitted 26 December, 2023; originally announced December 2023.

arXiv:2311.11151 [pdf, ps, other]

On the Hardness of Learning to Stabilize Linear Systems

Authors: Xiong Zeng, Zexiang Liu, Zhe Du, Necmiye Ozay, Mario Sznaier

Abstract: Inspired by the work of Tsiamis et al. \cite{tsiamis2022learning}, in this paper we study the statistical hardness of learning to stabilize linear time-invariant systems. Hardness is measured by the number of samples required to achieve a learning task with a given probability. The work in \cite{tsiamis2022learning} shows that there exist system classes that are hard to learn to stabilize with the… ▽ More Inspired by the work of Tsiamis et al. \cite{tsiamis2022learning}, in this paper we study the statistical hardness of learning to stabilize linear time-invariant systems. Hardness is measured by the number of samples required to achieve a learning task with a given probability. The work in \cite{tsiamis2022learning} shows that there exist system classes that are hard to learn to stabilize with the core reason being the hardness of identification. Here we present a class of systems that can be easy to identify, thanks to a non-degenerate noise process that excites all modes, but the sample complexity of stabilization still increases exponentially with the system dimension. We tie this result to the hardness of co-stabilizability for this class of systems using ideas from robust control. △ Less

Submitted 18 November, 2023; originally announced November 2023.

Comments: 7 pages, 2 figures, accepted by CDC 2023

arXiv:2310.19477 [pdf, other]

VDIP-TGV: Blind Image Deconvolution via Variational Deep Image Prior Empowered by Total Generalized Variation

Authors: Tingting Wu, Zhiyan Du, Zhi Li, Feng-Lei Fan, Tieyong Zeng

Abstract: Recovering clear images from blurry ones with an unknown blur kernel is a challenging problem. Deep image prior (DIP) proposes to use the deep network as a regularizer for a single image rather than as a supervised model, which achieves encouraging results in the nonblind deblurring problem. However, since the relationship between images and the network architectures is unclear, it is hard to find… ▽ More Recovering clear images from blurry ones with an unknown blur kernel is a challenging problem. Deep image prior (DIP) proposes to use the deep network as a regularizer for a single image rather than as a supervised model, which achieves encouraging results in the nonblind deblurring problem. However, since the relationship between images and the network architectures is unclear, it is hard to find a suitable architecture to provide sufficient constraints on the estimated blur kernels and clean images. Also, DIP uses the sparse maximum a posteriori (MAP), which is insufficient to enforce the selection of the recovery image. Recently, variational deep image prior (VDIP) was proposed to impose constraints on both blur kernels and recovery images and take the standard deviation of the image into account during the optimization process by the variational principle. However, we empirically find that VDIP struggles with processing image details and tends to generate suboptimal results when the blur kernel is large. Therefore, we combine total generalized variational (TGV) regularization with VDIP in this paper to overcome these shortcomings of VDIP. TGV is a flexible regularization that utilizes the characteristics of partial derivatives of varying orders to regularize images at different scales, reducing oil painting artifacts while maintaining sharp edges. The proposed VDIP-TGV effectively recovers image edges and details by supplementing extra gradient information through TGV. Additionally, this model is solved by the alternating direction method of multipliers (ADMM), which effectively combines traditional algorithms and deep learning methods. Experiments show that our proposed VDIP-TGV surpasses various state-of-the-art models quantitatively and qualitatively. △ Less

Submitted 10 November, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

Comments: 13 pages, 5 figures

arXiv:2310.18090 [pdf, ps, other]

Probabilistic Constellation Shaping for OFDM-Based ISAC Signaling

Authors: Zhen Du, Fan Liu, Yifeng Xiong, Tony Xiao Han, Weijie Yuan, Yuanhao Cui, Changhua Yao, Yonina C. Eldar

Abstract: Integrated Sensing and Communications (ISAC) has garnered significant attention as a promising technology for the upcoming sixth-generation wireless communication systems (6G). In pursuit of this goal, a common strategy is that a unified waveform, such as Orthogonal Frequency Division Multiplexing (OFDM), should serve dual-functional roles by enabling simultaneous sensing and communications (S&C)… ▽ More Integrated Sensing and Communications (ISAC) has garnered significant attention as a promising technology for the upcoming sixth-generation wireless communication systems (6G). In pursuit of this goal, a common strategy is that a unified waveform, such as Orthogonal Frequency Division Multiplexing (OFDM), should serve dual-functional roles by enabling simultaneous sensing and communications (S&C) operations. However, the sensing performance of an OFDM communication signal is substantially affected by the randomness of the data symbols mapped from bit streams. Therefore, achieving a balance between preserving communication capability (i.e., the randomness) while improving sensing performance remains a challenging task. To cope with this issue, in this paper we analyze the ambiguity function of the OFDM communication signal modulated by random data. Subsequently, a probabilistic constellation shaping (PCS) method is proposed to devise the probability distributions of constellation points, which is able to strike a scalable S&C tradeoff of the random transmitted signal. Finally, the superiority of the proposed PCS method over conventional uniformly distributed constellations is validated through numerical simulations. △ Less

Submitted 27 October, 2023; originally announced October 2023.

arXiv:2310.04863 [pdf, other]

SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Authors: Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie

Abstract: Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR).Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we intro… ▽ More Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR).Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we introduce a recently proposed non-autoregressive model Paraformer as an acoustic model in the SA-ASR model.Paraformer uses a single-step decoder to enable parallel generation, obtaining comparable performance to the SOTA AR transformer models. Besides, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's ability in acoustic modeling. Experiments on the AliMeeting corpus show that our model outperforms the cascaded SA-ASR model by a 6.1% relative speaker-dependent character error rate (SD-CER) reduction on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% with only 1/10 RTF compared with the SOTA joint AR SA-ASR model. △ Less

Submitted 7 October, 2023; originally announced October 2023.

arXiv:2310.04673 [pdf, other]

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Authors: Zhihao Du, Jiaming Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as a… ▽ More Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement over models using continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modalities. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT consistently achieves comparable to superior performance compared to strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding. △ Less

Submitted 2 July, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: 10 pages, work in progress

arXiv:2309.14372 [pdf, other]

Human Transcription Quality Improvement

Authors: Jian Gao, Hanbo Sun, Cheng Cao, Zheng Du

Abstract: High quality transcription data is crucial for training automatic speech recognition (ASR) systems. However, the existing industry-level data collection pipelines are expensive to researchers, while the quality of crowdsourced transcription is low. In this paper, we propose a reliable method to collect speech transcriptions. We introduce two mechanisms to improve transcription quality: confidence… ▽ More High quality transcription data is crucial for training automatic speech recognition (ASR) systems. However, the existing industry-level data collection pipelines are expensive to researchers, while the quality of crowdsourced transcription is low. In this paper, we propose a reliable method to collect speech transcriptions. We introduce two mechanisms to improve transcription quality: confidence estimation based reprocessing at labeling stage, and automatic word error correction at post-labeling stage. We collect and release LibriCrowd - a large-scale crowdsourced dataset of audio transcriptions on 100 hours of English speech. Experiment shows the Transcription WER is reduced by over 50%. We further investigate the impact of transcription error on ASR model performance and found a strong correlation. The transcription quality improvement provides over 10% relative WER reduction for ASR models. We release the dataset and code to benefit the research community. △ Less

Submitted 23 September, 2023; originally announced September 2023.

Comments: 5 pages, 3 figures, 5 tables, INTERSPEECH 2023

MSC Class: 68T50 ACM Class: I.2.7

Journal ref: INTERSPEECH 2023

arXiv:2309.13573 [pdf, other]

The second multi-channel multi-party meeting transcription challenge (M2MeT) 2.0): A benchmark for speaker-attributed ASR

Authors: Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu, Zhuo Chen, Kong Aik Lee, Zhijie Yan, Hui Bu

Abstract: With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of \emph{speaker-attributed ASR (SA-ASR)}, which directly addresses the practical and challenging problem of ``who spoke what at when" at typical meeting scenario. We particularly established two sub-tr… ▽ More With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of \emph{speaker-attributed ASR (SA-ASR)}, which directly addresses the practical and challenging problem of ``who spoke what at when" at typical meeting scenario. We particularly established two sub-tracks. The fixed training condition sub-track, where the training data is constrained to predetermined datasets, but participants can use any open-source pre-trained model. The open training condition sub-track, which allows for the use of all available data and models without limitation. In addition, we release a new 10-hour test set for challenge ranking. This paper provides an overview of the dataset, track settings, results, and analysis of submitted systems, as a benchmark to show the current state of speaker-attributed ASR. △ Less

Submitted 5 October, 2023; v1 submitted 24 September, 2023; originally announced September 2023.

Comments: 8 pages, Accepted by ASRU2023

arXiv:2309.10089 [pdf, other]

HTEC: Human Transcription Error Correction

Authors: Hanbo Sun, Jian Gao, Xiaomin Wu, Anjie Fang, Cheng Cao, Zheng Du

Abstract: High-quality human transcription is essential for training and improving Automatic Speech Recognition (ASR) models. Recent study~\cite{libricrowd} has found that every 1% worse transcription Word Error Rate (WER) increases approximately 2% ASR WER by using the transcriptions to train ASR models. Transcription errors are inevitable for even highly-trained annotators. However, few studies have explo… ▽ More High-quality human transcription is essential for training and improving Automatic Speech Recognition (ASR) models. Recent study~\cite{libricrowd} has found that every 1% worse transcription Word Error Rate (WER) increases approximately 2% ASR WER by using the transcriptions to train ASR models. Transcription errors are inevitable for even highly-trained annotators. However, few studies have explored human transcription correction. Error correction methods for other problems, such as ASR error correction and grammatical error correction, do not perform sufficiently for this problem. Therefore, we propose HTEC for Human Transcription Error Correction. HTEC consists of two stages: Trans-Checker, an error detection model that predicts and masks erroneous words, and Trans-Filler, a sequence-to-sequence generative model that fills masked positions. We propose a holistic list of correction operations, including four novel operations handling deletion errors. We further propose a variant of embeddings that incorporates phoneme information into the input of the transformer. HTEC outperforms other methods by a large margin and surpasses human annotators by 2.2% to 4.5% in WER. Finally, we deployed HTEC to assist human annotators and showed HTEC is particularly effective as a co-pilot, which improves transcription quality by 15.1% without sacrificing transcription velocity. △ Less

Submitted 18 September, 2023; originally announced September 2023.

Comments: 13 pages, 4 figures, 11 tables, AMLC 2023

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2309.07405 [pdf, other]

FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

Authors: Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng

Abstract: This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as… ▽ More This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec. △ Less

Submitted 6 October, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: 5 pages, 3 figures, submitted to ICASSP 2024

arXiv:2309.05058 [pdf, other]

Multimodal Fish Feeding Intensity Assessment in Aquaculture

Authors: Meng Cui, Xubo Liu, Haohe Liu, Zhuangzhuang Du, Tao Chen, Guoping Lian, Daoliang Li, Wenwu Wang

Abstract: Fish feeding intensity assessment (FFIA) aims to evaluate fish appetite changes during feeding, which is crucial in industrial aquaculture applications. Existing FFIA methods are limited by their robustness to noise, computational complexity, and the lack of public datasets for developing the models. To address these issues, we first introduce AV-FFIA, a new dataset containing 27,000 labeled audio… ▽ More Fish feeding intensity assessment (FFIA) aims to evaluate fish appetite changes during feeding, which is crucial in industrial aquaculture applications. Existing FFIA methods are limited by their robustness to noise, computational complexity, and the lack of public datasets for developing the models. To address these issues, we first introduce AV-FFIA, a new dataset containing 27,000 labeled audio and video clips that capture different levels of fish feeding intensity. Then, we introduce multi-modal approaches for FFIA by leveraging the models pre-trained on individual modalities and fused with data fusion methods. We perform benchmark studies of these methods on AV-FFIA, and demonstrate the advantages of the multi-modal approach over the single-modality based approach, especially in noisy environments. However, compared to the methods developed for individual modalities, the multimodal approaches may involve higher computational costs due to the need for independent encoders for each modality. To overcome this issue, we further present a novel unified mixed-modality based method for FFIA, termed as U-FFIA. U-FFIA is a single model capable of processing audio, visual, or audio-visual modalities, by leveraging modality dropout during training and knowledge distillation using the models pre-trained with data from single modality. We demonstrate that U-FFIA can achieve performance better than or on par with the state-of-the-art modality-specific FFIA models, with significantly lower computational overhead, enabling robust and efficient FFIA for improved aquaculture management. △ Less

Submitted 19 May, 2024; v1 submitted 10 September, 2023; originally announced September 2023.

arXiv:2308.08536 [pdf, other]

Can Transformers Learn Optimal Filtering for Unknown Systems?

Authors: Haldun Balim, Zhe Du, Samet Oymak, Necmiye Ozay

Abstract: Transformer models have shown great success in natural language processing; however, their potential remains mostly unexplored for dynamical systems. In this work, we investigate the optimal output estimation problem using transformers, which generate output predictions using all the past ones. Particularly, we train the transformer using various distinct systems and then evaluate the performance… ▽ More Transformer models have shown great success in natural language processing; however, their potential remains mostly unexplored for dynamical systems. In this work, we investigate the optimal output estimation problem using transformers, which generate output predictions using all the past ones. Particularly, we train the transformer using various distinct systems and then evaluate the performance on unseen systems with unknown dynamics. Empirically, the trained transformer adapts exceedingly well to different unseen systems and even matches the optimal performance given by the Kalman filter for linear systems. In more complex settings with non-i.i.d. noise, time-varying dynamics, and nonlinear dynamics like a quadrotor system with unknown parameters, transformers also demonstrate promising results. To support our experimental findings, we provide statistical guarantees that quantify the amount of training data required for the transformer to achieve a desired excess risk. Finally, we point out some limitations by identifying two classes of problems that lead to degraded performance, highlighting the need for caution when using transformers for control and estimation. △ Less

Submitted 11 June, 2024; v1 submitted 16 August, 2023; originally announced August 2023.

Comments: Minor differences between the implementation and the originally provided descriptions are corrected, ensuring better clarity and accuracy of the content

arXiv:2305.12459 [pdf, other]

CASA-ASR: Context-Aware Speaker-Attributed ASR

Authors: Mohan Shi, Zhihao Du, Qian Chen, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang, Li-Rong Dai

Abstract: Recently, speaker-attributed automatic speech recognition (SA-ASR) has attracted a wide attention, which aims at answering the question ``who spoke what''. Different from modular systems, end-to-end (E2E) SA-ASR minimizes the speaker-dependent recognition errors directly and shows a promising applicability. In this paper, we propose a context-aware SA-ASR (CASA-ASR) model by enhancing the contextu… ▽ More Recently, speaker-attributed automatic speech recognition (SA-ASR) has attracted a wide attention, which aims at answering the question ``who spoke what''. Different from modular systems, end-to-end (E2E) SA-ASR minimizes the speaker-dependent recognition errors directly and shows a promising applicability. In this paper, we propose a context-aware SA-ASR (CASA-ASR) model by enhancing the contextual modeling ability of E2E SA-ASR. Specifically, in CASA-ASR, a contextual text encoder is involved to aggregate the semantic information of the whole utterance, and a context-dependent scorer is employed to model the speaker discriminability by contrasting with speakers in the context. In addition, a two-pass decoding strategy is further proposed to fully leverage the contextual modeling ability resulting in a better recognition performance. Experimental results on AliMeeting corpus show that the proposed CASA-ASR model outperforms the original E2E SA-ASR system with a relative improvement of 11.76% in terms of speaker-dependent character error rate. △ Less

Submitted 21 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech2023

arXiv:2305.11013 [pdf, other]

FunASR: A Fundamental End-to-End Speech Recognition Toolkit

Authors: Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, Shiliang Zhang

Abstract: This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications. The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model that has been trained on a manual… ▽ More This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications. The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model that has been trained on a manually annotated Mandarin speech recognition dataset that contains 60,000 hours of speech. To improve the performance of Paraformer, we have added timestamp prediction and hotword customization capabilities to the standard Paraformer backbone. In addition, to facilitate model deployment, we have open-sourced a voice activity detection model based on the Feedforward Sequential Memory Network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both of which were trained on industrial corpora. These functional modules provide a solid foundation for building high-precision long audio speech recognition services. Compared to other models trained on open datasets, Paraformer demonstrates superior performance. △ Less

Submitted 18 May, 2023; originally announced May 2023.

Comments: 5 pages, 3 figures, accepted by INTERSPEECH 2023

arXiv:2305.00681 [pdf, other]

Towards ISAC-Empowered Vehicular Networks: Framework, Advances, and Opportunities

Authors: Zhen Du, Fan Liu, Yunxin Li, Weijie Yuan, Yuanhao Cui, Zenghui Zhang, Christos Masouros, Bo Ai

Abstract: Connected and autonomous vehicle (CAV) networks face several challenges, such as low throughput, high latency, and poor localization accuracy. These challenges severely impede the implementation of CAV networks for immersive metaverse applications and driving safety in future 6G wireless networks. To alleviate these issues, integrated sensing and communications (ISAC) is envisioned as a game-chang… ▽ More Connected and autonomous vehicle (CAV) networks face several challenges, such as low throughput, high latency, and poor localization accuracy. These challenges severely impede the implementation of CAV networks for immersive metaverse applications and driving safety in future 6G wireless networks. To alleviate these issues, integrated sensing and communications (ISAC) is envisioned as a game-changing technology for future CAV networks. This article presents a comprehensive overview on the application of ISAC techniques in vehicle-to-infrastructure (V2I) networks. We cover the general system framework, representative advances, and a detailed case study on using the 5G New Radio (NR) waveform for sensing-assisted communications in V2I networks. Finally, we highlight open problems and opportunities in the field. △ Less

Submitted 1 May, 2023; originally announced May 2023.

arXiv:2303.13243 [pdf, other]

Pyramid Multi-branch Fusion DCNN with Multi-Head Self-Attention for Mandarin Speech Recognition

Authors: Kai Liu, Hailiang Xiong, Gangqiang Yang, Zhengfeng Du, Yewen Cao, Danyal Shah

Abstract: As one of the major branches of automatic speech recognition, attention-based models greatly improves the feature representation ability of the model. In particular, the multi-head mechanism is employed in the attention, hoping to learn speech features of more aspects in different attention subspaces. For speech recognition of complex languages, on the one hand, a small head size will lead to an o… ▽ More As one of the major branches of automatic speech recognition, attention-based models greatly improves the feature representation ability of the model. In particular, the multi-head mechanism is employed in the attention, hoping to learn speech features of more aspects in different attention subspaces. For speech recognition of complex languages, on the one hand, a small head size will lead to an obvious shortage of learnable aspects. On the other hand, we need to reduce the dimension of each subspace to keep the size of the overall feature space unchanged when we increase the number of heads, which will significantly weaken the ability to represent the feature of each subspace. Therefore, this paper explores how to use a small attention subspace to represent complete speech features while ensuring many heads. In this work we propose a novel neural network architecture, namely, pyramid multi-branch fusion DCNN with multi-head self-attention. The proposed architecture is inspired by Dilated Convolution Neural Networks (DCNN), it uses multiple branches with DCNN to extract the feature of the input speech under different receptive fields. To reduce the number of parameters, every two branches are merged until all the branches are merged into one. Thus, its shape is like a pyramid rotated 90 degrees. We demonstrate that on Aishell-1, a widely used Mandarin speech dataset, our model achieves a character error rate (CER) of 6.45% on the test sets. △ Less

Submitted 23 March, 2023; originally announced March 2023.

arXiv:2303.06550 [pdf, other]

Spatial Correspondence between Graph Neural Network-Segmented Images

Authors: Qian Li, Yunguan Fu, Qianye Yang, Zhijiang Du, Hongjian Yu, Yipeng Hu

Abstract: Graph neural networks (GNNs) have been proposed for medical image segmentation, by predicting anatomical structures represented by graphs of vertices and edges. One such type of graph is predefined with fixed size and connectivity to represent a reference of anatomical regions of interest, thus known as templates. This work explores the potentials in these GNNs with common topology for establishin… ▽ More Graph neural networks (GNNs) have been proposed for medical image segmentation, by predicting anatomical structures represented by graphs of vertices and edges. One such type of graph is predefined with fixed size and connectivity to represent a reference of anatomical regions of interest, thus known as templates. This work explores the potentials in these GNNs with common topology for establishing spatial correspondence, implicitly maintained during segmenting two or more images. With an example application of registering local vertebral sub-regions found in CT images, our experimental results showed that the GNN-based segmentation is capable of accurate and reliable localization of the same interventionally interesting structures between images, not limited to the segmentation classes. The reported average target registration errors of 2.2$\pm$1.3 mm and 2.7$\pm$1.4 mm, for aligning holdout test images with a reference and for aligning two test images, respectively, were by a considerable margin lower than those from the tested non-learning and learning-based registration algorithms. Further ablation studies assess the contributions towards the registration performance, from individual components in the originally segmentation-purposed network and its training algorithm. The results highlight that the proposed segmentation-in-lieu-of-registration approach shares methodological similarities with existing registration methods, such as the use of displacement smoothness constraint and point distance minimization albeit on non-grid graphs, which interestingly yielded benefits for both segmentation and registration. We, therefore, conclude that the template-based GNN segmentation can effectively establish spatial correspondence in our application, without any other dedicated registration algorithms. △ Less

Submitted 16 March, 2023; v1 submitted 11 March, 2023; originally announced March 2023.

Comments: Accepted at MIDL 2023 (The Medical Imaging with Deep Learning conference, 2023)

arXiv:2303.05397 [pdf, other]

TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization

Authors: Jiaming Wang, Zhihao Du, Shiliang Zhang

Abstract: Recently, end-to-end neural diarization (EEND) is introduced and achieves promising results in speaker-overlapped scenarios. In EEND, speaker diarization is formulated as a multi-label prediction problem, where speaker activities are estimated independently and their dependency are not well considered. To overcome these disadvantages, we employ the power set encoding to reformulate speaker diariza… ▽ More Recently, end-to-end neural diarization (EEND) is introduced and achieves promising results in speaker-overlapped scenarios. In EEND, speaker diarization is formulated as a multi-label prediction problem, where speaker activities are estimated independently and their dependency are not well considered. To overcome these disadvantages, we employ the power set encoding to reformulate speaker diarization as a single-label classification problem and propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependency can be modeled explicitly. Inspired by the success of two-stage hybrid systems, we further propose a novel Two-stage OverLap-aware Diarization framework (TOLD) by involving a speaker overlap-aware post-processing (SOAP) model to iteratively refine the diarization results of EEND-OLA. Experimental results show that, compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in terms of diarization error rates (DER), and utilizing SOAP provides another 19.33% relative improvement. As a result, our method TOLD achieves a DER of 10.14% on the CALLHOME dataset, which is a new state-of-the-art result on this benchmark to the best of our knowledge. △ Less

Submitted 13 December, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP2023

arXiv:2303.05023 [pdf, other]

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Authors: Kai Liu, Ziqing Du, Xucheng Wan, Huan Zhou

Abstract: Target speech extraction (TSE) systems are designed to extract target speech from a multi-talker mixture. The popular training objective for most prior TSE networks is to enhance reconstruction performance of extracted speech waveform. However, it has been reported that a TSE system delivers high reconstruction performance may still suffer low-quality experience problems in practice. One such expe… ▽ More Target speech extraction (TSE) systems are designed to extract target speech from a multi-talker mixture. The popular training objective for most prior TSE networks is to enhance reconstruction performance of extracted speech waveform. However, it has been reported that a TSE system delivers high reconstruction performance may still suffer low-quality experience problems in practice. One such experience problem is wrong speaker extraction (called speaker confusion, SC), which leads to strong negative experience and hampers effective conversations. To mitigate the imperative SC issue, we reformulate the training objective and propose two novel loss schemes that explore the metric of reconstruction improvement performance defined at small chunk-level and leverage the metric associated distribution information. Both loss schemes aim to encourage a TSE network to pay attention to those SC chunks based on the said distribution information. On this basis, we present X-SepFormer, an end-to-end TSE model with proposed loss schemes and a backbone of SepFormer. Experimental results on the benchmark WSJ0-2mix dataset validate the effectiveness of our proposals, showing consistent improvements on SC errors (by 14.8% relative). Moreover, with SI-SDRi of 19.4 dB and PESQ of 3.81, our best system significantly outperforms the current SOTA systems and offers the top TSE results reported till date on the WSJ0-2mix. △ Less

Submitted 8 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.02722 [pdf, other]

Performance of OTFS-NOMA Scheme for Coordinated Direct and Relay Transmission Networks in High-Mobility Scenarios

Authors: Yao Xu, Zhen Du, Weijie Yuan, Shaobo Jia, Victor C. M. Leung

Abstract: In this letter, an orthogonal time frequency space (OTFS) based non-orthogonal multiple access (NOMA) scheme is investigated for the coordinated direct and relay transmission system, where a source directly communicates with a near user with high mobile speed, and it needs the relaying assistance to serve the far user also having high mobility. Due to the coexistence of signal superposition coding… ▽ More In this letter, an orthogonal time frequency space (OTFS) based non-orthogonal multiple access (NOMA) scheme is investigated for the coordinated direct and relay transmission system, where a source directly communicates with a near user with high mobile speed, and it needs the relaying assistance to serve the far user also having high mobility. Due to the coexistence of signal superposition coding and multi-domain transformation, the performance of OTFS-based NOMA is usually challenging to be measured from a theoretical perspective. To accurately evaluate the system performance of the proposed scheme, we derive the closed-form expressions for the outage probability and the outage sum rate by using the Inversion formula and characteristic function. Numerical results verify the performance superiority and the effectiveness of the proposed scheme. △ Less

Submitted 5 March, 2023; originally announced March 2023.

arXiv:2301.12787 [pdf, other]

ISAC-Enabled V2I Networks Based on 5G NR: How Much Can the Overhead Be Reduced?

Authors: Yunxin Li, Fan Liu, Zhen Du, Weijie Yuan, Christos Masouros

Abstract: The emergence of the fifth-generation (5G) New Radio (NR) brings additional possibilities to vehicle-to-everything (V2X) network with improved quality of services. In order to obtain accurate channel state information (CSI) in high-mobility V2X networks, pilot signals and frequent handover between vehicles and infrastructures are required to establish and maintain the communication link, which inc… ▽ More The emergence of the fifth-generation (5G) New Radio (NR) brings additional possibilities to vehicle-to-everything (V2X) network with improved quality of services. In order to obtain accurate channel state information (CSI) in high-mobility V2X networks, pilot signals and frequent handover between vehicles and infrastructures are required to establish and maintain the communication link, which increases the overheads and reduces the communication throughput. To address this issue, integrated sensing and communications (ISAC) was employed at the base station (BS) in the vehicle-to-infrastructure (V2I) network to reduce a certain amount of overheads, thus improve the spectral efficiency. Nevertheless, the exact amount of overheads reduction remains unclear, particularly for practical NR based V2X networks. In this paper, we study a link-level NR based V2I system employing ISAC signaling to facilitate the communication beam management, where the Extended Kalman filtering (EKF) algorithm is performed to realize the functions of tracking and predicting the motion of the vehicle. We provide detailed analysis on the overheads reduction with the aid of ISAC, and show that up to 43.24% overheads can be reduced under assigned NR frame structure. In addition, numerical results are provided to validate the improved performance on the beam tracking and communication throughput. △ Less

Submitted 21 March, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

Comments: 6 pages, 5 figures

arXiv:2301.06277 [pdf, ps, other]

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

Authors: Kai Liu, Xucheng Wan, Ziqing Du, Huan Zhou

Abstract: As a practical alternative of speech separation, target speaker extraction (TSE) aims to extract the speech from the desired speaker using additional speaker cue extracted from the speaker. Its main challenge lies in how to properly extract and leverage the speaker cue to benefit the extracted speech quality. The cue extraction method adopted in majority existing TSE studies is to directly utilize… ▽ More As a practical alternative of speech separation, target speaker extraction (TSE) aims to extract the speech from the desired speaker using additional speaker cue extracted from the speaker. Its main challenge lies in how to properly extract and leverage the speaker cue to benefit the extracted speech quality. The cue extraction method adopted in majority existing TSE studies is to directly utilize discriminative speaker embedding, which is extracted from the pre-trained models for speaker verification. Although the high speaker discriminability is a most desirable property for speaker verification task, we argue that it may be too sophisticated for TSE. In this study, we propose that a simplified speaker cue with clear class separability might be preferred for TSE. To verify our proposal, we introduce several forms of speaker cues, including naive speaker embedding (such as, x-vector and xi-vector) and new speaker embeddings produced from sparse LDA-transform. Corresponding TSE models are built by integrating these speaker cues with SepFormer (one SOTA speech separation model). Performances of these TSE models are examined on the benchmark WSJ0-2mix dataset. Experimental results validate the effectiveness and generalizability of our proposal, showing up to 9.9% relative improvement in SI-SDRi. Moreover, with SI-SDRi of 19.4 dB and PESQ of 3.78, our best TSE system significantly outperforms the current SOTA systems and offers the top TSE results reported till date on the WSJ0-2mix. △ Less

Submitted 16 January, 2023; originally announced January 2023.

Comments: ACCEPTED by NCMMSC 2022

arXiv:2211.10243 [pdf, other]

Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis

Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan

Abstract: Recently, hybrid systems of clustering and neural diarization models have been successfully applied in multi-party meeting analysis. However, current models always treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome the disadvantages, we reformulate overlapped speaker diarization task as a single-l… ▽ More Recently, hybrid systems of clustering and neural diarization models have been successfully applied in multi-party meeting analysis. However, current models always treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome the disadvantages, we reformulate overlapped speaker diarization task as a single-label prediction problem via the proposed power set encoding (PSE). Through this formulation, speaker dependency and overlaps can be explicitly modeled. To fully leverage this formulation, we further propose the speaker overlap-aware neural diarization (SOND) model, which consists of a context-independent (CI) scorer to model global speaker discriminability, a context-dependent scorer (CD) to model local discriminability, and a speaker combining network (SCN) to combine and reassign speaker activities. Experimental results show that using the proposed formulation can outperform the state-of-the-art methods based on target speaker voice activity detection, and the performance can be further improved with SOND, resulting in a 6.30% relative diarization error reduction. △ Less

Submitted 18 November, 2022; originally announced November 2022.

Comments: Accepted by EMNLP 2022

arXiv:2211.07143 [pdf]

WSC-Trans: A 3D network model for automatic multi-structural segmentation of temporal bone CT

Authors: Xin Hua, Zhijiang Du, Hongjian Yu, Jixin Ma, Fanjun Zheng, Cheng Zhang, Qiaohui Lu, Hui Zhao

Abstract: Cochlear implantation is currently the most effective treatment for patients with severe deafness, but mastering cochlear implantation is extremely challenging because the temporal bone has extremely complex and small three-dimensional anatomical structures, and it is important to avoid damaging the corresponding structures when performing surgery. The spatial location of the relevant anatomical t… ▽ More Cochlear implantation is currently the most effective treatment for patients with severe deafness, but mastering cochlear implantation is extremely challenging because the temporal bone has extremely complex and small three-dimensional anatomical structures, and it is important to avoid damaging the corresponding structures when performing surgery. The spatial location of the relevant anatomical tissues within the target area needs to be determined using CT prior to the procedure. Considering that the target structures are too small and complex, the time required for manual segmentation is too long, and it is extremely challenging to segment the temporal bone and its nearby anatomical structures quickly and accurately. To overcome this difficulty, we propose a deep learning-based algorithm, a 3D network model for automatic segmentation of multi-structural targets in temporal bone CT that can automatically segment the cochlea, facial nerve, auditory tubercle, vestibule and semicircular canal. The algorithm combines CNN and Transformer for feature extraction and takes advantage of spatial attention and channel attention mechanisms to further improve the segmentation effect, the experimental results comparing with the results of various existing segmentation algorithms show that the dice similarity scores, Jaccard coefficients of all targets anatomical structures are significantly higher while HD95 and ASSD scores are lower, effectively proving that our method outperforms other advanced methods. △ Less

Submitted 14 November, 2022; originally announced November 2022.

Comments: 10 pages,7 figures

arXiv:2211.00511 [pdf, other]

A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

Authors: Mohan Shi, Jie Zhang, Zhihao Du, Fan Yu, Qian Chen, Shiliang Zhang, Li-Rong Dai

Abstract: Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR task. It was shown that single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT) and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR) can be explo… ▽ More Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR task. It was shown that single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT) and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR) can be exploited to partially solve this problem. In this paper, we propose three corresponding multichannel (MC) SA-ASR approaches, namely MC-FD-SOT, MC-WD-SOT and MC-TS-ASR. For different tasks/models, different multichannel data fusion strategies are considered, including channel-level cross-channel attention for MC-FD-SOT, frame-level cross-channel attention for MC-WD-SOT and neural beamforming for MC-TS-ASR. Results on the AliMeeting corpus reveal that our proposed models can consistently outperform the corresponding single-channel counterparts in terms of the speaker-dependent character error rate. △ Less

Submitted 1 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

arXiv:2211.00434 [pdf, other]

On the Performance Gain of Integrated Sensing and Communications: A Subspace Correlation Perspective

Authors: Shihang Lu, Xiao Meng, Zhen Du, Yifeng Xiong, Fan Liu

Abstract: In this paper, we shed light on the performance gain of integrated sensing and communications (ISAC) from the perspective of channel correlations between radar sensing and communication (S&C), namely ISAC subspace correlation. To begin with, we consider a multi-input multi-output (MIMO) ISAC system and reveal that the optimal ISAC signal is in the subspace spanned by the transmitted steering vecto… ▽ More In this paper, we shed light on the performance gain of integrated sensing and communications (ISAC) from the perspective of channel correlations between radar sensing and communication (S&C), namely ISAC subspace correlation. To begin with, we consider a multi-input multi-output (MIMO) ISAC system and reveal that the optimal ISAC signal is in the subspace spanned by the transmitted steering vectors of the sensing channel and the right singular matrix of the communication channel. By leveraging this result, we study a basic ISAC scenario with a single target and a single-antenna communication user, and derive the optimal waveform covariance matrix for minimizing the estimation error under a given communication rate constraint. To quantify the integration gain of ISAC systems, we define the subspace "correlation coefficient" to characterize the coupling effect between S&C channels. Finally, numerical results are provided to validate the effectiveness of the proposed approaches. △ Less

Submitted 2 November, 2022; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: 6 pages, 5 figures, submitted to IEEE conference

arXiv:2210.05265 [pdf, other]

MFCCA:Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario

Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yuhao Liang, Zhihao Du, Yuxiao Lin, Lei Xie

Abstract: Recently cross-channel attention, which better leverages multi-channel signals from microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses on either learning global correlations between sequences of different channels or exploiting fine-grained channel-wise information effectively at each time step. Considering the delay of microphone arr… ▽ More Recently cross-channel attention, which better leverages multi-channel signals from microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses on either learning global correlations between sequences of different channels or exploiting fine-grained channel-wise information effectively at each time step. Considering the delay of microphone array receiving sound, we propose a multi-frame cross-channel attention, which models cross-channel information between adjacent frames to exploit the complementarity of both frame-wise and channel-wise knowledge. Besides, we also propose a multi-layer convolutional mechanism to fuse the multi-channel output and a channel masking strategy to combat the channel number mismatch problem between training and inference. Experiments on the AliMeeting, a real-world corpus, reveal that our proposed model outperforms single-channel model by 31.7\% and 37.0\% CER reduction on Eval and Test sets. Moreover, with comparable model parameters and training data, our proposed model achieves a new SOTA performance on the AliMeeting corpus, as compared with the top ranking systems in the ICASSP2022 M2MeT challenge, a recently held multi-channel multi-speaker ASR challenge. △ Less

Submitted 11 October, 2022; originally announced October 2022.

Comments: Accepted by SLT 2022

arXiv:2209.11906 [pdf, other]

Joint Speech Activity and Overlap Detection with Multi-Exit Architecture

Authors: Ziqing Du, Kai Liu, Xucheng Wan, Huan Zhou

Abstract: Overlapped speech detection (OSD) is critical for speech applications in scenario of multi-party conversion. Despite numerous research efforts and progresses, comparing with speech activity detection (VAD), OSD remains an open challenge and its overall performance is far from satisfactory. The majority of prior research typically formulates the OSD problem as a standard classification problem, to… ▽ More Overlapped speech detection (OSD) is critical for speech applications in scenario of multi-party conversion. Despite numerous research efforts and progresses, comparing with speech activity detection (VAD), OSD remains an open challenge and its overall performance is far from satisfactory. The majority of prior research typically formulates the OSD problem as a standard classification problem, to identify speech with binary (OSD) or three-class label (joint VAD and OSD) at frame level. In contrast to the mainstream, this study investigates the joint VAD and OSD task from a new perspective. In particular, we propose to extend traditional classification network with multi-exit architecture. Such an architecture empowers our system with unique capability to identify class using either low-level features from early exits or high-level features from last exit. In addition, two training schemes, knowledge distillation and dense connection, are adopted to further boost our system performance. Experimental results on benchmark datasets (AMI and DIHARD-III) validated the effectiveness and generality of our proposed system. Our ablations further reveal the complementary contribution of proposed schemes. With $F_1$ score of 0.792 on AMI and 0.625 on DIHARD-III, our proposed system outperforms several top performing models on these datasets, but also surpasses the current state-of-the-art by large margins across both datasets. Besides the performance benefit, our proposed system offers another appealing potential for quality-complexity trade-offs, which is highly preferred for efficient OSD deployment. △ Less

Submitted 23 September, 2022; originally announced September 2022.

arXiv:2209.11905 [pdf, other]

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

Authors: Xucheng Wan, Kai Liu, Ziqing Du, Huan Zhou

Abstract: To address the monaural speech enhancement problem, numerous research studies have been conducted to enhance speech via operations either in time-domain on the inner-domain learned from the speech mixture or in time--frequency domain on the fixed full-band short time Fourier transform (STFT) spectrograms. Very recently, a few studies on sub-band based speech enhancement have been proposed. By enha… ▽ More To address the monaural speech enhancement problem, numerous research studies have been conducted to enhance speech via operations either in time-domain on the inner-domain learned from the speech mixture or in time--frequency domain on the fixed full-band short time Fourier transform (STFT) spectrograms. Very recently, a few studies on sub-band based speech enhancement have been proposed. By enhancing speech via operations on sub-band spectrograms, those studies demonstrated competitive performances on the benchmark dataset of DNS2020. Despite attractive, this new research direction has not been fully explored and there is still room for improvement. As such, in this study, we delve into the latest research direction and propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE. Specially, our proposed PT-FSE model improves its backbone, a full-band and sub-band fusion model, by three efforts. First, we design a frequency transformation module that aims to strengthen the global frequency correlation. Then a temporal transformation is introduced to capture long range temporal contexts. Lastly, a novel loss, with leverage of properties of human auditory perception, is proposed to facilitate the model to focus on low frequency enhancement. To validate the effectiveness of our proposed model, extensive experiments are conducted on the DNS2020 dataset. Experimental results show that our PT-FSE system achieves substantial improvements over its backbone, but also outperforms the current state-of-the-art while being 27\% smaller than the SOTA. With average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported till date. △ Less

Submitted 23 September, 2022; originally announced September 2022.

arXiv:2206.12300 [pdf]

doi 10.3233/THC-230278

Automatic extraction of coronary arteries using deep learning in invasive coronary angiograms

Authors: Yinghui Meng, Zhenglong Du, Chen Zhao, Minghao Dong, Drew Pienta, Zhihui Xu, Weihua Zhou

Abstract: Accurate extraction of coronary arteries from invasive coronary angiography (ICA) is important in clinical decision-making for the diagnosis and risk stratification of coronary artery disease (CAD). In this study, we develop a method using deep learning to automatically extract the coronary artery lumen. Methods. A deep learning model U-Net 3+, which incorporates the full-scale skip connections an… ▽ More Accurate extraction of coronary arteries from invasive coronary angiography (ICA) is important in clinical decision-making for the diagnosis and risk stratification of coronary artery disease (CAD). In this study, we develop a method using deep learning to automatically extract the coronary artery lumen. Methods. A deep learning model U-Net 3+, which incorporates the full-scale skip connections and deep supervisions, was proposed for automatic extraction of coronary arteries from ICAs. Transfer learning and a hybrid loss function were employed in this novel coronary artery extraction framework. Results. A data set containing 616 ICAs obtained from 210 patients was used. In the technical evaluation, the U-Net 3+ achieved a Dice score of 0.8942 and a sensitivity of 0.8735, which is higher than U-Net ++ (Dice score: 0.8814, the sensitivity of 0.8331) and U-net (Dice score: 0.8799, the sensitivity of 0.8305). Conclusion. Our study demonstrates that the U-Net 3+ is superior to other segmentation frameworks for the automatic extraction of the coronary arteries from ICAs. This result suggests great promise for clinical use. △ Less

Submitted 24 June, 2022; originally announced June 2022.

Comments: 22 pages,5 figures

arXiv:2205.05675 [pdf, other]

NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results

Authors: Yawei Li, Kai Zhang, Radu Timofte, Luc Van Gool, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du, Ding Liu, Chenhui Zhou, Jingyi Chen, Qingrui Han, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Yu Qiao, Chao Dong, Long Sun, Jinshan Pan, Yi Zhu, Zhikai Zong, Xiaoxiao Liu, Zheng Hui, Tao Yang , et al. (86 additional authors not shown)

Abstract: This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of e… ▽ More This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining the PSNR of 29.00dB on DIV2K validation set. IMDN is set as the baseline for efficiency measurement. The challenge had 3 tracks including the main track (runtime), sub-track one (model complexity), and sub-track two (overall performance). In the main track, the practical runtime performance of the submissions was evaluated. The rank of the teams were determined directly by the absolute value of the average runtime on the validation set and test set. In sub-track one, the number of parameters and FLOPs were considered. And the individual rankings of the two metrics were summed up to determine a final ranking in this track. In sub-track two, all of the five metrics mentioned in the description of the challenge including runtime, parameter count, FLOPs, activations, and memory consumption were considered. Similar to sub-track one, the rankings of five metrics were summed up to determine a final ranking. The challenge had 303 registered participants, and 43 teams made valid submissions. They gauge the state-of-the-art in efficient single image super-resolution. △ Less

Submitted 11 May, 2022; originally announced May 2022.

Comments: Validation code of the baseline model is available at https://github.com/ofsoundof/IMDN. Validation of all submitted models is available at https://github.com/ofsoundof/NTIRE2022_ESR

arXiv:2205.02697 [pdf, other]

doi 10.1109/OJCSYS.2022.3212613

Mode Reduction for Markov Jump Systems

Authors: Zhe Du, Laura Balzano, Necmiye Ozay

Abstract: Switched systems are capable of modeling processes with underlying dynamics that may change abruptly over time. To achieve accurate modeling in practice, one may need a large number of modes, but this may in turn increase the model complexity drastically. Existing work on reducing system complexity mainly considers state space reduction, yet reducing the number of modes is less studied. In this wo… ▽ More Switched systems are capable of modeling processes with underlying dynamics that may change abruptly over time. To achieve accurate modeling in practice, one may need a large number of modes, but this may in turn increase the model complexity drastically. Existing work on reducing system complexity mainly considers state space reduction, yet reducing the number of modes is less studied. In this work, we consider Markov jump linear systems (MJSs), a special class of switched systems where the active mode switches according to a Markov chain, and several issues associated with its mode complexity. Specifically, inspired by clustering techniques from unsupervised learning, we are able to construct a reduced MJS with fewer modes that approximates well the original MJS under various metrics. Furthermore, both theoretically and empirically, we show how one can use the reduced MJS to analyze stability and design controllers with significant reduction in computational cost while achieving guaranteed accuracy. △ Less

Submitted 20 October, 2022; v1 submitted 5 May, 2022; originally announced May 2022.

arXiv:2204.08397 [pdf, other]

Fast and Memory-Efficient Network Towards Efficient Image Super-Resolution

Authors: Zongcai Du, Ding Liu, Jie Liu, Jie Tang, Gangshan Wu, Lean Fu

Abstract: Runtime and memory consumption are two important aspects for efficient image super-resolution (EISR) models to be deployed on resource-constrained devices. Recent advances in EISR exploit distillation and aggregation strategies with plenty of channel split and concatenation operations to make full use of limited hierarchical features. In contrast, sequential network operations avoid frequently acc… ▽ More Runtime and memory consumption are two important aspects for efficient image super-resolution (EISR) models to be deployed on resource-constrained devices. Recent advances in EISR exploit distillation and aggregation strategies with plenty of channel split and concatenation operations to make full use of limited hierarchical features. In contrast, sequential network operations avoid frequently accessing preceding states and extra nodes, and thus are beneficial to reducing the memory consumption and runtime overhead. Following this idea, we design our lightweight network backbone by mainly stacking multiple highly optimized convolution and activation layers and decreasing the usage of feature fusion. We propose a novel sequential attention branch, where every pixel is assigned an important factor according to local and global contexts, to enhance high-frequency details. In addition, we tailor the residual block for EISR and propose an enhanced residual block (ERB) to further accelerate the network inference. Finally, combining all the above techniques, we construct a fast and memory-efficient network (FMEN) and its small version FMEN-S, which runs 33% faster and reduces 74% memory consumption compared with the state-of-the-art EISR model: E-RFDN, the champion in AIM 2020 efficient super-resolution challenge. Besides, FMEN-S achieves the lowest memory consumption and the second shortest runtime in NTIRE 2022 challenge on efficient super-resolution. Code is available at https://github.com/NJU-Jet/FMEN. △ Less

Submitted 18 April, 2022; originally announced April 2022.

Comments: Accepted by NTIRE 2022 (CVPR Workshop)

arXiv:2203.16834 [pdf, other]

A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

Authors: Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie

Abstract: In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic with increasing attention in meeting rich transcription. Specifically, three approaches are evaluated in this study. The first approach, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR to recognize utt… ▽ More In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic with increasing attention in meeting rich transcription. Specifically, three approaches are evaluated in this study. The first approach, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR to recognize utterances. The speaker-attributed transcriptions are obtained by aligning the diarization results and recognized hypotheses. However, such an alignment strategy may suffer from erroneous timestamps due to the modular independence, severely hindering the model performance. Therefore, we propose the second approach, WD-SOT, to address alignment errors by introducing a word-level diarization model, which can get rid of such timestamp alignment dependency. To further mitigate the alignment issues, we propose the third approach, TS-ASR, which trains a target-speaker separation module and an ASR module jointly. By comparing various strategies for each SA-ASR approach, experimental results on a real meeting scenario corpus, AliMeeting, reveal that the WD-SOT approach achieves 10.7% relative reduction on averaged speaker-dependent character error rate (SD-CER), compared with the FD-SOT approach. In addition, the TS-ASR approach also outperforms the FD-SOT approach and brings 16.5% relative average SD-CER reduction. △ Less

Submitted 1 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: accepted by INTERSPEECH 2022, 5 pages, 2 figures

arXiv:2203.09767 [pdf, other]

Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan

Abstract: Overlapping speech diarization has been traditionally treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding multiple binary labels into a single label with the power set, which represents the possible combinations of target speakers. This formulation has two benefits. First, the overlaps of target speakers are expl… ▽ More Overlapping speech diarization has been traditionally treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding multiple binary labels into a single label with the power set, which represents the possible combinations of target speakers. This formulation has two benefits. First, the overlaps of target speakers are explicitly modeled. Second, threshold selection is no longer needed. Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings. Experimental results show that SEND has a stable learning process and can be trained on highly overlapped data without extra initialization. More importantly, our method achieves the state-of-the-art performance in real meeting scenarios with fewer model parameters and lower computational complexity. △ Less

Submitted 30 March, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

Comments: Submitted to INTERSPEECH 2022, 5 parges, 2 figure

arXiv:2202.03647 [pdf, other]

Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Ma… ▽ More The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by 8-channel microphone array as well as near-field data collected by each participants' headset microphone. We briefly describe the released dataset, track setups, baselines and summarize the challenge results and major techniques used in the submissions. △ Less

Submitted 25 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2111.13694 [pdf, other]

Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information

Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei

Abstract: Overlapping speech diarization is always treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set. Specifically, we propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels according to the similarities between speech feat… ▽ More Overlapping speech diarization is always treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set. Specifically, we propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels according to the similarities between speech features and given speaker embeddings. Our method is further extended and integrated with downstream tasks by utilizing the textual information, which has not been well studied in previous literature. The experimental results show that our method achieves lower diarization error rate than the target-speaker voice activity detection. When textual information is involved, the diarization errors can be further reduced. For the real meeting scenario, our method can achieve 34.11% relative improvement compared with the Bayesian hidden Markov model based clustering algorithm. △ Less

Submitted 28 November, 2021; originally announced November 2021.

Comments: Submitted to ICASSP 2022, 5 pages, 2 figures

Showing 1–50 of 69 results for author: Du, Z