-
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing
Authors:
Chunyu Qiang,
Wang Geng,
Yi Zhao,
Ruibo Fu,
Tao Wang,
Cheng Gong,
Tianrui Wang,
Qiuyu Liu,
Jiangyan Yi,
Zhengqi Wen,
Chen Zhang,
Hao Che,
Longbiao Wang,
Jianwu Dang,
Jianhua Tao
Abstract:
Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses the cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. The VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25Hz from 24kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
Submitted 11 August, 2024;
originally announced August 2024.
-
Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition
Authors:
Yuchun Shu,
Bo Hu,
Yifeng He,
Hao Shi,
Longbiao Wang,
Jianwu Dang
Abstract:
Accurately finding the wrong words in the automatic speech recognition (ASR) hypothesis and recovering them in a well-founded way is the goal of speech error correction. In this paper, we propose a non-autoregressive speech error correction method. A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses as a reference for locating wrong words. In addition, acoustic features from the ASR encoder are used to provide correct pronunciation references. N-best candidates from ASR are aligned using the edit path, to confirm each other and recover some missing character errors. Furthermore, the cross-attention mechanism fuses the information between error correction references and the ASR hypothesis. The experimental results show that both the acoustic and confidence references help with error correction. The proposed system reduces the error rate by 21% compared with the ASR model.
Submitted 29 June, 2024;
originally announced July 2024.
-
AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations
Authors:
Sheng Wu,
Jiaxing Liu,
Longbiao Wang,
Dongxiao He,
Xiaobao Wang,
Jianwu Dang
Abstract:
Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there is a dearth of investigation into effective multimodal fusion methods. We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network, which performs rich representation learning through dimension transformation of different modalities and a parameter-efficient inception block. In addition, the Modality Interaction Network performs interaction fusion of the extracted inter-modal and intra-modal features. Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics compared to the state-of-the-art (SOTA) models.
Submitted 12 April, 2024;
originally announced July 2024.
-
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
Authors:
Cheng Gong,
Erica Cooper,
Xin Wang,
Chunyu Qiang,
Mengzhe Geng,
Dan Wells,
Longbiao Wang,
Jianwu Dang,
Marc Tessier,
Aidan Pine,
Korin Richmond,
Junichi Yamagishi
Abstract:
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
Submitted 13 June, 2024;
originally announced June 2024.
-
Performance Trade-off and Joint Waveform Design for MIMO-OFDM DFRC Systems
Authors:
Tianchen Liu,
Liang Wu,
Bo An,
Zaichen Zhang,
Jian Dang,
Jiangzhou Wang
Abstract:
Dual-functional radar-communication (DFRC) has attracted considerable attention. This paper considers the frequency-selective multipath fading environment and proposes DFRC waveform design strategies based on multiple-input and multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) techniques. In the proposed waveform design strategies, the Cramer-Rao bound (CRB) of the radar system, the inter-stream interference (ISI) and the achievable rate of the communication system, are respectively considered as the performance metrics. In this paper, we focus on the performance trade-off between the radar system and the communication system, and the optimization problems are formulated. In the ISI minimization based waveform design strategy, the optimization problem is convex and can be easily solved. In the achievable rate maximization based waveform design strategy, we propose a water-filling (WF) and sequential quadratic programming (SQP) based algorithm to derive the covariance matrix and the precoding matrix. Simulation results validate the proposed DFRC waveform designs and show that the achievable rate maximization based strategy has a better performance than the ISI minimization based strategy.
Submitted 4 January, 2024;
originally announced January 2024.
-
ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
Authors:
Cheng Gong,
Xin Wang,
Erica Cooper,
Dan Wells,
Longbiao Wang,
Jianwu Dang,
Korin Richmond,
Junichi Yamagishi
Abstract:
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TTS systems are typically built using a single speaker's voices, but there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. Our proposed model has zero-shot generalization ability not only for unseen speakers but also for unseen languages. We have conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has proven effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetically low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.
Submitted 26 August, 2024; v1 submitted 21 December, 2023;
originally announced December 2023.
-
A Refining Underlying Information Framework for Monaural Speech Enhancement
Authors:
Rui Cao,
Tianrui Wang,
Meng Ge,
Longbiao Wang,
Jianwu Dang
Abstract:
Supervised speech enhancement has gained significantly from recent advancements in neural networks, especially due to their ability to non-linearly fit the diverse representations of target speech, such as waveform or spectrum. However, these direct-fitting solutions continue to face challenges with degraded speech and residual noise in hearing evaluations. In this letter, by bridging speech enhancement and the Information Bottleneck principle, we rethink a universal plug-and-play strategy and propose a Refining Underlying Information framework called RUI to rise to these challenges both in theory and practice. Specifically, we first transform the objective of speech enhancement into an incremental convergence problem of mutual information between comprehensive speech characteristics and individual speech characteristics, e.g., spectral and acoustic characteristics. By doing so, compared with the existing direct-fitting solutions, the underlying information stems from the conditional entropy of acoustic characteristics given spectral characteristics. Therefore, we design a dual-path multiple refinement iterator based on the chain rule of entropy to refine this underlying information for further approximating target speech. Experimental results on the DNS-Challenge dataset show that our solution consistently improves the PESQ score by more than 0.3 over baselines, with only 1.18 M additional parameters. The source code is available at https://github.com/caoruitju/RUI_SE.
Submitted 24 December, 2023; v1 submitted 18 December, 2023;
originally announced December 2023.
-
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models
Authors:
Chunyu Qiang,
Hao Li,
Yixin Tian,
Yi Zhao,
Ying Zhang,
Longbiao Wang,
Jianwu Dang
Abstract:
Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic & acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from information redundancy and dimension explosion in semantic representation, and high-frequency waveform distortion in discrete acoustic representation. Autoregressive frameworks exhibit typical instability and uncontrollability issues, and non-autoregressive frameworks suffer from prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method, where all modules are constructed based on diffusion models. The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) is used as an intermediate semantic representation to solve the problems of information redundancy and dimension explosion in existing semantic coding methods. The mel-spectrogram is used as the acoustic representation. Both semantic and acoustic representations are predicted by continuous variable regression tasks to solve the problem of high-frequency fine-grained waveform distortion. Experimental results show that our proposed method outperforms the baseline method. We provide audio samples on our website.
Submitted 18 December, 2023; v1 submitted 27 September, 2023;
originally announced September 2023.
-
Learning Speech Representation From Contrastive Token-Acoustic Pretraining
Authors:
Chunyu Qiang,
Hao Li,
Yixin Tian,
Ruibo Fu,
Tao Wang,
Longbiao Wang,
Jianwu Dang
Abstract:
For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content is emphasized, while the paralinguistic information such as speaker identity and acoustic details should be de-emphasized. However, existing methods for extracting fine-grained intermediate representations from speech suffer from issues of excessive redundancy and dimension explosion. Contrastive learning is a good method for modeling intermediate representations from two modalities. However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these issues, we propose a method named "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space, learning how to connect phoneme and speech at the frame level. The CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR. The proposed CTAP method offers a promising solution for fine-grained generation and recognition downstream tasks in speech processing. We provide a website with audio samples.
Submitted 18 December, 2023; v1 submitted 1 September, 2023;
originally announced September 2023.
-
Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding
Authors:
Chunyu Qiang,
Hao Li,
Hao Ni,
He Qu,
Ruibo Fu,
Tao Wang,
Longbiao Wang,
Jianwu Dang
Abstract:
Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion problems of existing semantic encoding methods. To address these problems, three progressive methods are proposed. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models, which models the semantic embedding into the mel-spectrogram based on a diffusion model to achieve higher audio quality. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Second, we propose Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion model-based modules that design a duration diffusion model to achieve diverse prosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive structure consisting of three diffusion model-based modules that verify the non-necessity of existing semantic encoding models and achieve the best results. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.
Submitted 18 December, 2023; v1 submitted 28 July, 2023;
originally announced July 2023.
-
Downlink Precoding for Cell-free FBMC/OQAM Systems With Asynchronous Reception
Authors:
Yuhao Qi,
Jian Dang,
Zaichen Zhang,
Liang Wu,
Yongpeng Wu
Abstract:
In this work, an efficient precoding design scheme is proposed for downlink cell-free distributed massive multiple-input multiple-output (DM-MIMO) filter bank multi-carrier (FBMC) systems with asynchronous reception and high frequency selectivity. The proposed scheme includes a multiple interpolation structure to eliminate the impact of the response difference we recently discovered, which gives better performance in highly frequency-selective channels. In addition, we consider the phase shift in asynchronous reception and introduce a phase compensation in the design process. The phase compensation also benefits from the multiple interpolation structure and better adapts to asynchronous reception. Based on the proposed scheme, we theoretically analyze its ergodic achievable rate performance and derive a closed-form expression. Simulation results show that the derived expression can accurately characterize the rate performance, and FBMC with the proposed scheme outperforms orthogonal frequency-division multiplexing (OFDM) in the asynchronous scenario.
Submitted 13 July, 2023;
originally announced July 2023.
-
Rethinking the visual cues in audio-visual speaker extraction
Authors:
Junjie Li,
Meng Ge,
Zexu Pan,
Rui Cao,
Longbiao Wang,
Jianwu Dang,
Shiliang Zhang
Abstract:
The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visual cue contributes more to the speaker extraction performance. This raises the question of how to better utilize visual cues. To address this issue, we propose two training strategies that decouple the learning of the two visual cues. Our experimental results demonstrate that both visual cues are useful, with the synchronization cue having a higher impact. We introduce a more explainable model, the Decoupled Audio-Visual Speaker Extraction (DAVSE) model, which leverages both visual cues.
Submitted 5 June, 2023;
originally announced June 2023.
-
Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition
Authors:
Haoyu Lu,
Nan Li,
Tongtong Song,
Longbiao Wang,
Jianwu Dang,
Xiaobao Wang,
Shiliang Zhang
Abstract:
In recent years, the joint training of speech enhancement front-end and automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods only use enhanced speech as input for the backend. However, it is difficult for speech enhancement systems to directly separate speech from input due to the diverse types of noise with different intensities. Furthermore, speech distortion and residual noise are often observed in enhanced speech, and the distortion of speech and noise is different. Most existing methods focus on fusing enhanced and noisy features to address this issue. In this paper, we propose a dual-stream spectrogram refine network to simultaneously refine the speech and noise and decouple the noise from the noisy input. Our proposed method can achieve better performance with a relative 8.6% CER reduction.
Submitted 30 May, 2023; v1 submitted 28 May, 2023;
originally announced May 2023.
-
Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation
Authors:
Yanjie Fu,
Meng Ge,
Honglong Wang,
Nan Li,
Haoran Yin,
Longbiao Wang,
Gaoyan Zhang,
Jianwu Dang,
Chengyun Deng,
Fei Wang
Abstract:
Recently, stunning improvements in multi-channel speech separation have been achieved by neural beamformers when direction information is available. However, most of them neglect the speaker's 2-dimensional (2D) location cues contained in the mixture signal, which limits performance when two sources come from close directions. In this paper, we propose an end-to-end beamforming network for 2D location guided speech separation given only the mixture signal. It first estimates discriminable direction and 2D location cues, which indicate the directions the sources come from in multiple views of the microphones and their 2D coordinates. These cues are then integrated into a location-aware neural beamformer, thus allowing accurate reconstruction of the two sources' speech signals. Experiments show that our proposed model not only achieves a comprehensive improvement compared to baseline systems, but also avoids inferior performance in spatially overlapping cases.
Submitted 2 June, 2023; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder
Authors:
Hao Shi,
Masato Mimura,
Longbiao Wang,
Jianwu Dang,
Tatsuya Kawahara
Abstract:
Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information, we supplement multiple spectrograms in different frame lengths into the time-domain encoders. They extract stationary frequency information in both narrowband and wideband. We also adopt multiple decoder outputs, each of which computes its corresponding resolution frequency loss. Experimental results show that (1) it is more effective to fuse stationary frequency features than non-stationary features in the encoder, and (2) the multiple outputs consistent with the frequency loss improve performance. Experiments on the Voice-Bank dataset show that the proposed method obtained a 0.14 PESQ improvement.
Submitted 25 March, 2023;
originally announced March 2023.
-
Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification
Authors:
Meng Liu,
Kong Aik Lee,
Longbiao Wang,
Hanyi Zhang,
Chang Zeng,
Jianwu Dang
Abstract:
Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement over independently trained audio-only/visual-only and baseline fusion systems, respectively.
Submitted 22 February, 2023;
originally announced February 2023.
-
MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation
Authors:
Yanjie Fu,
Haoran Yin,
Meng Ge,
Longbiao Wang,
Gaoyan Zhang,
Jianwu Dang,
Chengyun Deng,
Fei Wang
Abstract:
Recently, many deep learning based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker features, face images, or directional information. In this paper, we propose an end-to-end beamforming network for direction guided speech separation given merely the mixture signal, namely MIMO-DBnet. Specifically, we design a multi-channel input and multiple outputs architecture to predict the direction-of-arrival based embeddings and beamforming weights for each source. The precisely estimated directional embedding provides effective spatial discrimination guidance for the neural beamformer to offset the effect of phase wrapping, thus allowing more accurate reconstruction of the two sources' speech signals. Experiments show that our proposed MIMO-DBnet not only achieves a comprehensive improvement compared to baseline systems, but also maintains performance on high frequency bands when phase wrapping occurs.
Submitted 6 December, 2022;
originally announced December 2022.
-
Monolingual Recognizers Fusion for Code-switching Speech Recognition
Authors:
Tongtong Song,
Qiang Xu,
Haoyu Lu,
Longbiao Wang,
Hao Shi,
Yuqin Lin,
Yanbing Yang,
Jianwu Dang
Abstract:
The bi-encoder structure has been intensively investigated in code-switching (CS) automatic speech recognition (ASR). However, most existing methods require that the two monolingual ASR models (MAMs) share the same structure, and they use only the encoders of the MAMs. As a result, pre-trained MAMs cannot be used promptly and fully for CS ASR. In this paper, we propose a monolingual recognizers fusion method for CS ASR. It has two stages: the speech awareness (SA) stage and the language fusion (LF) stage. In the SA stage, acoustic features are mapped to two language-specific predictions by two independent MAMs. To keep the MAMs focused on their own language, we further extend the language-aware training strategy for the MAMs. In the LF stage, the BELM fuses the two language-specific predictions to get the final prediction. Moreover, we propose a text simulation strategy to simplify the training process of the BELM and reduce reliance on CS data. Experiments on a Mandarin-English corpus show the efficiency of the proposed method. The mix error rate is significantly reduced on the test set after using open-source pre-trained MAMs.
Submitted 2 November, 2022;
originally announced November 2022.
-
Asynchronous RIS-assisted Localization: A Comprehensive Analysis of Fundamental Limits
Authors:
Ziyi Gong,
Liang Wu,
Zaichen Zhang,
Jian Dang,
Yongpeng Wu,
Jiangzhou Wang
Abstract:
The reconfigurable intelligent surface (RIS) has drawn considerable attention for its ability to enhance the performance of not only wireless communication but also indoor localization at low cost. This paper investigates the performance limits of RIS-based near-field localization in the asynchronous scenario, and analyzes the impact of each part of the cascaded channel on the localization performance. The Fisher information matrix (FIM) and the position error bound (PEB) are derived. We also derive the equivalent Fisher information (EFI) for the position-related intermediate parameters. Using the derived EFI, we verify that both the ranging and bearing information of the user can be obtained when the near-field model is considered for the RIS-user equipment (UE) part of the channel, while only the direction of the UE can be inferred in the far-field scenario. This result is well known for the case in which the curvature of arrival (COA) is directly sensed by a traditional active large-scale array, and we prove that it still holds when the COA is sensed passively by a large RIS. For the base station (BS)-RIS part of the channel, we reveal that this part determines the type of gain provided by the BS antenna array. In addition, in the single-carrier, single-snapshot case, localizing the UE requires both the BS-RIS and the RIS-UE parts of the channel to operate in the near field. We also show that the well-known focusing control scheme for RIS, which maximizes the received SNR, is not always a good choice and may degrade the localization performance in the asynchronous scenario. The simulation results validate the analytical work. The impact of the focusing control scheme on the PEB performance under synchronous and asynchronous conditions is also investigated.
Submitted 26 March, 2023; v1 submitted 19 October, 2022;
originally announced October 2022.
-
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
Authors:
Junjie Li,
Meng Ge,
Zexu Pan,
Longbiao Wang,
Jianwu Dang
Abstract:
Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of every modality. In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database demonstrate that our proposed VCSE network consistently outperforms other state-of-the-art baselines.
Submitted 9 October, 2022;
originally announced October 2022.
-
Deep Spectro-temporal Artifacts for Detecting Synthesized Speech
Authors:
Xiaohui Liu,
Meng Liu,
Lin Zhang,
Linjuan Zhang,
Chang Zeng,
Kai Li,
Nan Li,
Kong Aik Lee,
Longbiao Wang,
Jianwu Dang
Abstract:
The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features. To address track 1, low-quality data augmentation, domain adaptation via finetuning, and various complementary feature information fusion were aggregated in our system. Furthermore, we analyzed the clustering characteristics of subsystems with different features by visualization method and explained the effectiveness of our proposed greedy fusion strategy. As for track 2, frame transition and smoothing were detected using self-supervised learning structure to capture the manipulation of PF attacks in the time domain. We ranked 4th and 5th in track 1 and track 2, respectively.
Submitted 11 October, 2022;
originally announced October 2022.
-
MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources
Authors:
Haoran Yin,
Meng Ge,
Yanjie Fu,
Gaoyan Zhang,
Longbiao Wang,
Lei Zhang,
Lin Qiu,
Jianwu Dang
Abstract:
Recent neural-network-based Direction of Arrival (DoA) estimation algorithms have performed well in scenarios with an unknown number of sound sources. These algorithms usually map the multi-channel audio input to a single output (i.e., the overall spatial pseudo-spectrum (SPS) of all sources), an approach called MISO. However, such MISO algorithms depend strongly on an empirical threshold setting and on the assumption that the angles between the sound sources are greater than a fixed angle. To address these limitations, we propose a novel multi-channel input and multiple outputs DoA network called MIMO-DoAnet. Unlike the general MISO algorithms, MIMO-DoAnet predicts the SPS coding of each sound source with the help of the informative spatial covariance matrix. By doing so, the threshold task of detecting the number of sound sources becomes the easier task of detecting whether there is a sound source in each output, and the serious interaction between sound sources disappears during the inference stage. Experimental results show that, compared with the MISO baseline system, MIMO-DoAnet improves the F1 score by 18.6% relative (13.3% absolute) in 3-source scenes and by 34.4% relative (20.2% absolute) in 4-source scenes. The results also demonstrate that MIMO-DoAnet alleviates the threshold setting problem and solves the angle assumption problem effectively.
Submitted 16 November, 2022; v1 submitted 15 July, 2022;
originally announced July 2022.
-
Language-specific Characteristic Assistance for Code-switching Speech Recognition
Authors:
Tongtong Song,
Qiang Xu,
Meng Ge,
Longbiao Wang,
Hao Shi,
Yongjie Lv,
Yuqin Lin,
Jianwu Dang
Abstract:
Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, most existing methods have no language constraints on LSEs and underutilize language-specific knowledge of LSMs. In this paper, we propose a language-specific characteristic assistance (LSCA) method to mitigate the above problems. Specifically, during training, we introduce two language-specific losses as language constraints and generate corresponding language-specific targets for them. During decoding, we take the decoding abilities of LSMs into account by combining the output probabilities of two LSMs and the mixture model to obtain the final predictions. Experiments show that either the training or decoding method of LSCA can improve the model's performance. Furthermore, the best result can obtain up to 15.4% relative error reduction on the code-switching test set by combining the training and decoding methods of LSCA. Moreover, the system can process code-switching speech recognition tasks well without extra shared parameters or even retraining based on two pre-trained LSMs by using our method.
Submitted 11 July, 2022; v1 submitted 29 June, 2022;
originally announced June 2022.
-
Iterative Sound Source Localization for Unknown Number of Sources
Authors:
Yanjie Fu,
Meng Ge,
Haoran Yin,
Xinyuan Qian,
Longbiao Wang,
Gaoyan Zhang,
Jianwu Dang
Abstract:
Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of unknown number of sources, existing localization algorithms attempt to predict a likelihood-based coding (i.e., spatial spectrum) and employ a pre-determined threshold to detect the source number and corresponding DOA value. However, these threshold-based algorithms are not stable since they are limited by the careful choice of threshold. To address this problem, we propose an iterative sound source localization approach called ISSL, which can iteratively extract each source's DOA without threshold until the termination criterion is met. Unlike threshold-based algorithms, ISSL designs an active source detector network based on binary classifier to accept residual spatial spectrum and decide whether to stop the iteration. By doing so, our ISSL can deal with an arbitrary number of sources, even more than the number of sources seen during the training stage. The experimental results show that our ISSL achieves significant performance improvements in both DOA estimation and source number detection compared with the existing threshold-based algorithms.
Submitted 24 June, 2022;
originally announced June 2022.
-
Fast and Arbitrary Beam Pattern Design for RIS-Assisted Terahertz Wireless Communication
Authors:
Jian Dang,
Zaichen Zhang,
Yewei Li,
Liang Wu,
Bingcheng Zhu,
Lei Wang
Abstract:
A reconfigurable intelligent surface (RIS) can assist terahertz wireless communication by restoring fragile line-of-sight links and facilitating beam steering. Arbitrary reflection beam patterns are desired to meet diverse requirements in different applications. This paper establishes a relationship between RIS beam pattern design and two-dimensional finite impulse response filter design, and proposes a fast non-iterative algorithm to solve the problem. Simulations show that the proposed method outperforms the baseline method. Hence, it represents a promising solution for fast and arbitrary beam pattern design in RIS-assisted terahertz wireless communication.
Submitted 5 May, 2022;
originally announced May 2022.
-
TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding
Authors:
Ruiteng Zhang,
Jianguo Wei,
Xugang Lu,
Wenhuan Lu,
Di Jin,
Junhai Xu,
Lin Zhang,
Yantao Ji,
Jianwu Dang
Abstract:
Speaker embedding is an important front-end module to explore discriminative speaker features for many speech applications where speaker information is needed. Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation. However, naively adding many branches of multi-scale features with a simple fully convolutional operation cannot efficiently improve the performance, due to the rapid increase in model parameters and computational complexity. Therefore, in most current state-of-the-art network architectures, only a few branches corresponding to a limited number of temporal scales can be designed for speaker embeddings. To address this problem, in this paper, we propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be efficiently designed in a speaker embedding network with almost no increase in computational cost. The new model is based on the conventional TDNN, where the network architecture is separated into two modeling operators: a channel-modeling operator and a temporal multi-branch modeling operator. Adding temporal multi-scale features in the temporal multi-branch operator requires only a slight increase in the number of parameters, and thus saves computational budget for adding more branches with large temporal scales. Moreover, for the inference stage, we further developed a systemic re-parameterization method to convert the TMS-based model into a single-path-based topology in order to increase inference speed. We investigated the performance of the new TMS method for automatic speaker verification (ASV) under in-domain and out-of-domain conditions. Results show that the TMS-based model achieves a significant performance improvement over the SOTA ASV models while also offering a faster inference speed.
Submitted 17 March, 2022;
originally announced March 2022.
-
L-SpEx: Localized Target Speaker Extraction
Authors:
Meng Ge,
Chenglin Xu,
Longbiao Wang,
Eng Siong Chng,
Jianwu Dang,
Haizhou Li
Abstract:
Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., face image or video. In this paper, we propose an end-to-end localized target speaker extraction on pure speech cues, that is called L-SpEx. Specifically, we design a speaker localizer driven by the target speaker's embedding to extract the spatial features, including direction-of-arrival (DOA) of the target speaker and beamforming output. Then, the spatial cues and target speaker's embedding are both used to form a top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset called MC-Libri2Mix show that our L-SpEx approach significantly outperforms the baseline system.
Submitted 21 February, 2022;
originally announced February 2022.
-
A Novel Two-stage Design Scheme of Equalizers for Uplink FBMC/OQAM-based Massive MIMO Systems
Authors:
Yuhao Qi,
Jian Dang,
Zaichen Zhang,
Liang Wu,
Yongpeng Wu
Abstract:
The self-equalization property has attracted great attention in systems that combine offset-quadrature-amplitude-modulation-based filter bank multi-carrier (FBMC/OQAM) with massive multiple-input multiple-output (MIMO), because it reduces the interference caused by highly frequency-selective channels as the number of base station (BS) antennas increases. However, existing works show that residual interference remains after single-tap equalization even with an infinite number of BS antennas, which limits the achievable signal-to-interference-plus-noise ratio (SINR) performance. In this paper, we propose a two-stage design scheme of equalizers to remove this limitation. In the first stage, we design high-rate equalizers working before FBMC demodulation to avoid the potential loss of channel information obtained at the BS. In the second stage, we transform the high-rate equalizers into low-rate equalizers after FBMC demodulation to reduce the implementation complexity. Compared with prior works, the proposed scheme has affordable complexity under massive MIMO and requires only instantaneous channel state information (CSI), without statistical CSI or additional equalizers. Simulation results show that the scheme improves SINR performance. Moreover, even with a finite number of BS antennas, the interference caused by the channels can be almost eliminated.
Submitted 4 December, 2021;
originally announced December 2021.
-
Using multiple reference audios and style embedding constraints for speech synthesis
Authors:
Cheng Gong,
Longbiao Wang,
Zhenhua Ling,
Ju Zhang,
Jianwu Dang
Abstract:
The end-to-end speech synthesis model can directly take an utterance as reference audio, and generate speech from the text with prosody and speaker characteristics similar to the reference audio. However, an appropriate acoustic embedding must be manually selected during inference. Because only matched text and speech are used in the training process, using unmatched text and speech for inference would cause the model to synthesize speech with low content quality. In this study, we propose to mitigate these two problems by using multiple reference audios and style embedding constraints rather than using only the target audio. Multiple reference audios are automatically selected using the sentence similarity determined by Bidirectional Encoder Representations from Transformers (BERT). In addition, we use the "target" style embedding from a pre-trained encoder as a constraint by considering the mutual information between the predicted and "target" style embeddings. The experimental results show that the proposed model can improve the speech naturalness and content quality with multiple reference audios and can also outperform the baseline model in ABX preference tests of style similarity.
Submitted 9 October, 2021;
originally announced October 2021.
-
Exploring Deep Learning for Joint Audio-Visual Lip Biometrics
Authors:
Meng Liu,
Longbiao Wang,
Kong Aik Lee,
Hanyi Zhang,
Chang Zeng,
Jianwu Dang
Abstract:
Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication. Previous works have demonstrated the usefulness of AV lip biometrics. However, the lack of a sizeable AV database hinders the exploration of deep-learning-based audio-visual lip biometrics. To address this problem, we compile a moderate-size database using existing public databases. Meanwhile, we establish the DeepLip AV lip biometrics system realized with a convolutional neural network (CNN) based video module, a time-delay neural network (TDNN) based audio module, and a multimodal fusion module. Our experiments show that DeepLip outperforms traditional speaker recognition models in context modeling and achieves over 50% relative improvements compared with our best single modality baseline, with an equal error rate of 0.75% and 1.11% on the test datasets, respectively.
Submitted 17 April, 2021;
originally announced April 2021.
-
Two New Approaches to Optical IRSs: Schemes and Comparative Analysis
Authors:
Haibo Wang,
Zaichen Zhang,
Bingcheng Zhu,
Jian Dang,
Liang Wu
Abstract:
Oriented to the point-to-multipoint free space optical communication (FSO) scenarios, this paper analyzes the micro-mirror array and phased array-type optical intelligent reflecting surface (OIRS) in terms of control mode, power efficiency, and beam splitting. We build the physical models of the two types of OIRSs. Based on the models, the closed form solution of OIRSs' output power density distribution and power efficiency, along with their control algorithms have been derived. Then we propose the algorithms of beam splitting and multi-beam power allocation for two types of OIRSs. The channel fading in FSO system and the comparison of two types of OIRSs in actual systems are discussed according to the analytical results. Experiments and simulations are both presented to verify the feasibility of models and algorithms.
Submitted 30 December, 2020;
originally announced December 2020.
-
Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals
Authors:
Meng Ge,
Chenglin Xu,
Longbiao Wang,
Eng Siong Chng,
Jianwu Dang,
Haizhou Li
Abstract:
Speaker extraction requires a speech sample from the target speaker as the reference. However, enrolling a speaker with a long speech sample is not practical. We propose a speaker extraction technique that operates in multiple stages to take full advantage of a short reference speech sample. The speech extracted in early stages is used as the reference speech for later stages. For the first time, we use a frame-level sequential speech embedding as the reference for the target speaker. This is a departure from the traditional utterance-based speaker embedding reference. In addition, a signal fusion scheme is proposed to combine the decoded signals at multiple scales with automatically learned weights. Experiments on WSJ0-2mix and its noisy versions (WHAM! and WHAMR!) show that SpEx++ consistently outperforms other state-of-the-art baselines.
Submitted 2 April, 2021; v1 submitted 18 November, 2020;
originally announced November 2020.
-
Transmit Covariance and Waveform Optimization for Non-orthogonal CP-FBMA System
Authors:
Yuhao Qi,
Jian Dang,
Zaichen Zhang,
Liang Wu,
Yongpeng Wu
Abstract:
Filter bank multiple access (FBMA) without subband orthogonality has been proposed as a new candidate waveform to better meet the requirements of future wireless communication systems and scenarios. It can directly process complex symbols without any elaborate preprocessing. With the use of a cyclic prefix (CP) and a wide-banded subband design, CP-FBMA can further improve the peak-to-average power ratio and bit error rate performance while reducing the length of the filters. However, the potential gain of removing the orthogonality constraint on the subband filters has not been fully exploited from the perspective of waveform design, which inspires us to optimize the subband filters of the CP-FBMA system to maximize the achievable rate. In addition, we propose a joint optimization algorithm to optimize both the waveform and the covariance matrices iteratively. Furthermore, the joint optimization algorithm can meet the requirements of filter design in practical applications in which the available spectrum consists of several isolated bandwidth parts. Both the general framework and the detailed derivation of the algorithms are presented. Simulation results show that the algorithms converge after only a few iterations and can improve the sum rate dramatically while reducing the transmission delay of information symbols.
Submitted 13 October, 2020; v1 submitted 12 October, 2020;
originally announced October 2020.
-
SpEx+: A Complete Time Domain Speaker Extraction Network
Authors:
Meng Ge,
Chenglin Xu,
Longbiao Wang,
Eng Siong Chng,
Jianwu Dang,
Haizhou Li
Abstract:
Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation in frequency-domain approaches. Unfortunately, SpEx is not fully a time-domain solution since it performs time-domain speech encoding for speaker extraction, while taking frequency-domain speaker embedding as the reference. The size of the analysis window for time-domain and the size for frequency-domain input are also different. Such mismatch has an adverse effect on the system performance. To eliminate such mismatch, we propose a complete time-domain speaker extraction solution, that is called SpEx+. Specifically, we tie the weights of two identical speech encoder networks, one for the encoder-extractor-decoder pipeline, another as part of the speaker encoder. Experiments show that the SpEx+ achieves 0.8dB and 2.1dB SDR improvement over the state-of-the-art SpEx baseline, under different and same gender conditions on WSJ0-2mix-extr database respectively.
Submitted 17 August, 2020; v1 submitted 10 May, 2020;
originally announced May 2020.
-
Joint User Identification and Channel Estimation Over Rician Fading Channels
Authors:
Liang Wu,
Zaichen Zhang,
Jian Dang,
Yongpeng Wu,
Huaping Liu,
Jiangzhou Wang
Abstract:
This paper considers crowded massive multiple input multiple output (MIMO) communications over a Rician fading channel, where the number of users is much greater than the number of available pilot sequences. A joint user identification and line-of-sight (LOS) component derivation algorithm is proposed without requiring a threshold. Based on the derived LOS component, we design a LOS-only channel estimator and an updated channel estimator.
Submitted 13 April, 2020;
originally announced April 2020.
-
Performance of Wireless Optical Communication With Reconfigurable Intelligent Surfaces and Random Obstacles
Authors:
Haibo Wang,
Zaichen Zhang,
Bingcheng Zhu,
Jian Dang,
Liang Wu,
Lei Wang,
Kehan Zhang,
Yidi Zhang
Abstract:
Free-space optical communication is difficult to apply in mobile communication because obstacles in the environment obstruct the link, a problem that reconfigurable intelligent surface technology is expected to solve. The reconfigurable intelligent surface is a new type of digital coding metamaterial that can reflect, compute, and program electromagnetic and optical waves in real time. We propose a controllable multi-branch wireless optical communication system based on optical reconfigurable intelligent surface technology. By setting up multiple optical reconfigurable intelligent surfaces in the environment, multiple artificial channels are built to improve system performance and reduce the outage probability. Three factors affecting the channel coefficients are investigated in this paper: beam jitter, jitter of the reconfigurable intelligent surface, and the probability of obstruction. Based on the model, we derive the closed-form probability density function of the channel coefficients, as well as the system's asymptotic average bit error rate and outage probability, for systems with single and multiple branches. It is revealed that the probability density function contains an impulse function, which causes irreducible error rate and outage probability floors. Numerical results indicate that, compared with free-space optical communication systems with a single direct path, the multi-branch system achieves better performance and a lower outage probability.
Submitted 16 January, 2020;
originally announced January 2020.
-
A 3D Non-Stationary Wideband Geometry-Based Channel Model for MIMO Vehicle-to-Vehicle Communication System
Authors:
Hao Jiang,
Zaichen Zhang,
Liang Wu,
Jian Dang,
Guan Gui
Abstract:
In this paper, we present a three-dimensional (3D) non-wide-sense stationary (non-WSS) wideband geometry-based channel model for vehicle-to-vehicle (V2V) communication environments. We introduce a two-cylinder model to describe moving vehicles as well as multiple confocal semi-ellipsoid models to depict stationary roadside scenarios. The received signal is constructed as a sum of the line-of-sight (LoS), single-, and double-bounced rays with different energies. Accordingly, the proposed channel model is sufficient for depicting a wide variety of V2V environments, such as macro-, micro-, and picocells. The relative movement between the mobile transmitter (MT) and mobile receiver (MR) results in time-variant geometric statistics that make our channel model non-stationary. Using this channel model, the proposed channel statistics, i.e., the time-variant space correlation functions (CFs), frequency CFs, and corresponding Doppler power spectral density (PSD), were studied for different relative moving time instants. The numerical results demonstrate that the proposed 3D non-WSS wideband channel model is practical for characterizing real V2V channels.
Submitted 22 January, 2018;
originally announced January 2018.