Search | arXiv e-print repository

SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech

Authors: Jingru Lin, Meng Ge, Junyi Ao, Liqun Deng, Haizhou Li

Abstract: It was shown that pre-trained models with self-supervised learning (SSL) techniques are effective in various downstream speech tasks. However, most such models are trained on single-speaker speech data, limiting their effectiveness in mixture speech. This motivates us to explore pre-training on mixture speech. This work presents SA-WavLM, a novel pre-trained model for mixture speech. Specifically, SA-WavLM follows an "extract-merge-predict" pipeline in which the representations of each speaker in the input mixture are first extracted individually and then merged before the final prediction. In this pipeline, SA-WavLM performs speaker-informed extractions with the consideration of the interactions between different speakers. Furthermore, a speaker shuffling strategy is proposed to enhance the robustness towards the speaker absence. Experiments show that SA-WavLM either matches or improves upon the state-of-the-art pre-trained models.

Submitted 3 July, 2024; originally announced July 2024.

Comments: InterSpeech 2024

arXiv:2403.05772 [pdf, other]

sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks

Authors: Qu Yang, Qianhui Liu, Nan Li, Meng Ge, Zeyang Song, Haizhou Li

Abstract: Speech applications are expected to be low-power and robust under noisy conditions. An effective Voice Activity Detection (VAD) front-end lowers the computational need. Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient. However, SNN-based VADs have yet to achieve noise robustness and often require large models for high performance. This paper introduces a novel SNN-based VAD model, referred to as sVAD, which features an auditory encoder with an SNN-based attention mechanism. Particularly, it provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms. The classifier utilizes Spiking Recurrent Neural Networks (sRNN) to exploit temporal speech information. Experimental results demonstrate that our sVAD achieves remarkable noise robustness and meanwhile maintains low power consumption and a small footprint, making it a promising solution for real-world VAD applications.

Submitted 8 March, 2024; originally announced March 2024.

Comments: Accepted by ICASSP 2024

arXiv:2401.09686 [pdf, other]

An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement

Authors: Qiquan Zhang, Meng Ge, Hongxu Zhu, Eliathamby Ambikairajah, Qi Song, Zhaoheng Ni, Haizhou Li

Abstract: Transformer architecture has enabled recent progress in speech enhancement. Since Transformers are position-agnostic, positional encoding is the de facto standard component used to enable Transformers to distinguish the order of elements in a sequence. However, it remains unclear how positional encoding exactly impacts speech enhancement based on Transformer architectures. In this paper, we perform a comprehensive empirical study evaluating five positional encoding methods, i.e., Sinusoidal and learned absolute position embedding (APE), T5-RPE, KERPLE, as well as the Transformer without positional encoding (No-Pos), across both causal and noncausal configurations. We conduct extensive speech enhancement experiments, involving spectral mapping and masking methods. Our findings establish that positional encoding is not quite helpful for the models in a causal configuration, which indicates that causal attention may implicitly incorporate position information. In a noncausal configuration, the models significantly benefit from the use of positional encoding. In addition, we find that among the four position embeddings, relative position embeddings outperform APEs.

Submitted 13 February, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024

arXiv:2401.02626 [pdf, other]

Gradient weighting for speaker verification in extremely low Signal-to-Noise Ratio

Authors: Yi Ma, Kong Aik Lee, Ville Hautamäki, Meng Ge, Haizhou Li

Abstract: Speaker verification is hampered by background noise, particularly at extremely low Signal-to-Noise Ratio (SNR) under 0 dB. It is difficult to suppress noise without introducing unwanted artifacts, which adversely affects speaker verification. We proposed the mechanism called Gradient Weighting (Grad-W), which dynamically identifies and reduces artifact noise during prediction. The mechanism is based on the property that the gradient indicates which parts of the input the model is paying attention to. Specifically, when the speaker network focuses on a region in the denoised utterance but not on the clean counterpart, we consider it artifact noise and assign higher weights for this region during optimization of enhancement. We validate it by training an enhancement model and testing the enhanced utterance on speaker verification. The experimental results show that our approach effectively reduces artifact noise, improving speaker verification across various SNR levels.

Submitted 4 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024

arXiv:2312.16002 [pdf, other]

The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge

Authors: Meng Ge, Yizhou Peng, Yidi Jiang, Jingru Lin, Junyi Ao, Mehmet Sinan Yildirim, Shuai Wang, Haizhou Li, Mengling Feng

Abstract: This paper summarizes our team's efforts in both tracks of the ICMC-ASR Challenge for in-car multi-channel automatic speech recognition. Our submitted systems for ICMC-ASR Challenge include the multi-channel front-end enhancement and diarization, training data augmentation, speech recognition modeling with multi-channel branches. Tested on the official Eval1 and Eval2 set, our best system achieves a relative 34.3% improvement in CER and 56.5% improvement in cpCER, compared to the official baseline system.

Submitted 26 December, 2023; originally announced December 2023.

Comments: Technical Report. 2 pages. For ICMC-ASR-2023 Challenge

arXiv:2312.11201 [pdf, other]

A Refining Underlying Information Framework for Monaural Speech Enhancement

Authors: Rui Cao, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

Abstract: Supervised speech enhancement has gained significantly from recent advancements in neural networks, especially due to their ability to non-linearly fit the diverse representations of target speech, such as waveform or spectrum. However, these direct-fitting solutions continue to face challenges with degraded speech and residual noise in hearing evaluations. By bridging the speech enhancement and the Information Bottleneck principle in this letter, we rethink a universal plug-and-play strategy and propose a Refining Underlying Information framework called RUI to rise to the challenges both in theory and practice. Specifically, we first transform the objective of speech enhancement into an incremental convergence problem of mutual information between comprehensive speech characteristics and individual speech characteristics, e.g., spectral and acoustic characteristics. By doing so, compared with the existing direct-fitting solutions, the underlying information stems from the conditional entropy of acoustic characteristic given spectral characteristics. Therefore, we design a dual-path multiple refinement iterator based on the chain rule of entropy to refine this underlying information for further approximating target speech. Experimental results on DNS-Challenge dataset show that our solution consistently improves 0.3+ PESQ score over baselines, with only additional 1.18 M parameters. The source code is available at https://github.com/caoruitju/RUI_SE.

Submitted 24 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: 5 pages

arXiv:2311.04526 [pdf, other]

Selective HuBERT: Self-Supervised Pre-Training for Target Speaker in Clean and Mixture Speech

Authors: Jingru Lin, Meng Ge, Wupeng Wang, Haizhou Li, Mengling Feng

Abstract: Self-supervised pre-trained speech models were shown effective for various downstream speech processing tasks. Since they are mainly pre-trained to map input speech to pseudo-labels, the resulting representations are only effective for the type of pre-train data used, either clean or mixture speech. With the idea of selective auditory attention, we propose a novel pre-training solution called Selective-HuBERT, or SHuBERT, which learns the selective extraction of target speech representations from either clean or mixture speech. Specifically, SHuBERT is trained to predict pseudo labels of a target speaker, conditioned on an enrolled speech from the target speaker. By doing so, SHuBERT is expected to selectively attend to the target speaker in a complex acoustic environment, thus benefiting various downstream tasks. We further introduce a dual-path training strategy and use the cross-correlation constraint between the two branches to encourage the model to generate noise-invariant representation. Experiments on SUPERB benchmark and LibriMix dataset demonstrate the universality and noise-robustness of SHuBERT. Furthermore, we find that our high-quality representation can be easily integrated with conventional supervised learning methods to achieve significant performance, even under extremely low-resource labeled data.

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2309.10674 [pdf, other]

USED: Universal Speaker Extraction and Diarization

Authors: Junyi Ao, Mehmet Sinan Yıldırım, Ruijie Tao, Meng Ge, Shuai Wang, Yanmin Qian, Haizhou Li

Abstract: Speaker extraction and diarization are two enabling techniques for real-world speech applications. Speaker extraction aims to extract a target speaker's voice from a speech mixture, while speaker diarization demarcates speech segments by speaker, annotating 'who spoke when'. Previous studies have typically treated the two tasks independently. In practical applications, it is more meaningful to have knowledge about 'who spoke what and when', which is captured by the two tasks. The two tasks share a similar objective of disentangling speakers. Speaker extraction operates in the frequency domain, whereas diarization is in the temporal domain. It is logical to believe that speaker activities obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker activity detection than the speech mixture. In this paper, we propose a unified model called Universal Speaker Extraction and Diarization (USED) to address output inconsistency and scenario mismatch issues. It is designed to manage speech mixture with varying overlap ratios and variable number of speakers. We show that the USED model significantly outperforms the competitive baselines for speaker extraction and diarization tasks on LibriMix and SparseLibriMix datasets. We further validate the diarization performance on CALLHOME, a dataset based on real recordings, and experimental results indicate that our model surpasses recently proposed approaches.

Submitted 9 May, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

arXiv:2309.08408 [pdf, other]

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

Authors: Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang, Haizhou Li

Abstract: Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario only accounts for a small percentage of real-world conversations. In this paper, we aim at the sparsely overlapped scenarios in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization that could be used for speaker disentanglement. Experimental results show our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of SI-SNR.

Submitted 15 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024

arXiv:2309.06723 [pdf, other]

doi 10.21437/Interspeech.2023-889

PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network

Authors: Qinghua Liu, Meng Ge, Zhizheng Wu, Haizhou Li

Abstract: It is common in everyday spoken communication that we look at the turning head of a talker to listen to his/her voice. Humans see the talker to listen better, so do machines. However, previous studies on audio-visual speaker extraction have not effectively handled the varying talking face. This paper studies how to take full advantage of the varying talking face. We propose a Pose-Invariant Audio-Visual Speaker Extraction Network (PIAVE) that incorporates an additional pose-invariant view to improve audio-visual speaker extraction. Specifically, we generate the pose-invariant view from each original pose orientation, which enables the model to receive a consistent frontal view of the talker regardless of his/her head pose, therefore, forming a multi-view visual input for the speaker. Experiments on the multi-view MEAD and in-the-wild LRS3 dataset demonstrate that PIAVE outperforms the state-of-the-art and is more robust to pose variations.

Submitted 13 September, 2023; originally announced September 2023.

Comments: Interspeech 2023

Journal ref: Proc. INTERSPEECH 2023, 3719-3723

arXiv:2306.02625 [pdf, other]

Rethinking the visual cues in audio-visual speaker extraction

Authors: Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang, Shiliang Zhang

Abstract: The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visual cue contributes more to the speaker extraction performance. This raises the question of how to better utilize visual cues. To address this issue, we propose two training strategies that decouple the learning of the two visual cues. Our experimental results demonstrate that both visual cues are useful, with the synchronization cue having a higher impact. We introduce a more explainable model, the Decoupled Audio-Visual Speaker Extraction (DAVSE) model, which leverages both visual cues.

Submitted 5 June, 2023; originally announced June 2023.

Comments: Accepted in Interspeech 2023

arXiv:2305.10821 [pdf, other]

Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation

Authors: Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

Abstract: Recently, stunning improvements on multi-channel speech separation have been achieved by neural beamformers when direction information is available. However, most of them neglect to utilize speaker's 2-dimensional (2D) location cues contained in mixture signal, which limits the performance when two sources come from close directions. In this paper, we propose an end-to-end beamforming network for 2D location guided speech separation merely given mixture signal. It first estimates discriminable direction and 2D location cues, which imply directions the sources come from in multi views of microphones and their 2D coordinates. These cues are then integrated into location-aware neural beamformer, thus allowing accurate reconstruction of two sources' speech signals. Experiments show that our proposed model not only achieves a comprehensive decent improvement compared to baseline systems, but avoids inferior performance on spatial overlapping cases.

Submitted 2 June, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023. arXiv admin note: substantial text overlap with arXiv:2212.03401

arXiv:2304.04154 [pdf, other]

doi 10.1016/j.cja.2023.03.002

Review of X-ray pulsar spacecraft autonomous navigation

Authors: Yidi Wang, Wei Zheng, Shuangnan Zhang, Minyu Ge, Liansheng Li, Kun Jiang, Xiaoqian Chen, Xiang Zhang, Shijie Zheng, Fangjun Lu

Abstract: This article provides a review on X-ray pulsar-based navigation (XNAV). The review starts with the basic concept of XNAV, and briefly introduces the past, present and future projects concerning XNAV. This paper focuses on the advances of the key techniques supporting XNAV, including the navigation pulsar database, the X-ray detection system, and the pulse time of arrival estimation. Moreover, the methods to improve the estimation performance of XNAV are reviewed. Finally, some remarks on the future development of XNAV are provided.

Submitted 9 April, 2023; originally announced April 2023.

Comments: has been accepted by Chinese Journal of Aeronautics

Journal ref: Chinese Journal of Aeronautics, 2023

arXiv:2212.03401 [pdf, other]

MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation

Authors: Yanjie Fu, Haoran Yin, Meng Ge, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

Abstract: Recently, many deep learning based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker feature, face image or directional information. In this paper, we propose an end-to-end beamforming network for direction guided speech separation given merely the mixture signal, namely MIMO-DBnet. Specifically, we design a multi-channel input and multiple outputs architecture to predict the direction-of-arrival based embeddings and beamforming weights for each source. The precisely estimated directional embedding provides quite effective spatial discrimination guidance for the neural beamformer to offset the effect of phase wrapping, thus allowing more accurate reconstruction of two sources' speech signals. Experiments show that our proposed MIMO-DBnet not only achieves a comprehensive decent improvement compared to baseline systems, but also maintains the performance on high frequency bands when phase wrapping occurs.

Submitted 6 December, 2022; originally announced December 2022.

Comments: Submitted to ICASSP 2023

arXiv:2210.06177 [pdf, other]

VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

Authors: Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang

Abstract: Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of every modality. In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database demonstrate that our proposed VCSE network consistently outperforms other state-of-the-art baselines.

Submitted 9 October, 2022; originally announced October 2022.

arXiv:2207.07307 [pdf, other]

doi 10.21437/Interspeech.2022-10493

MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources

Authors: Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang

Abstract: Recent neural network based Direction of Arrival (DoA) estimation algorithms have performed well on unknown number of sound sources scenarios. These algorithms are usually achieved by mapping the multi-channel audio input to the single output (i.e. overall spatial pseudo-spectrum (SPS) of all sources), that is called MISO. However, such MISO algorithms strongly depend on empirical threshold setting and the angle assumption that the angles between the sound sources are greater than a fixed angle. To address these limitations, we propose a novel multi-channel input and multiple outputs DoA network called MIMO-DoAnet. Unlike the general MISO algorithms, MIMO-DoAnet predicts the SPS coding of each sound source with the help of the informative spatial covariance matrix. By doing so, the threshold task of detecting the number of sound sources becomes an easier task of detecting whether there is a sound source in each output, and the serious interaction between sound sources disappears during inference stage. Experimental results show that MIMO-DoAnet achieves relative 18.6% and absolute 13.3%, relative 34.4% and absolute 20.2% F1 score improvement compared with the MISO baseline system in 3, 4 sources scenes. The results also demonstrate MIMO-DoAnet alleviates the threshold setting problem and solves the angle assumption problem effectively.

Submitted 16 November, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

Comments: Accepted by Interspeech 2022

arXiv:2206.14580 [pdf, other]

Language-specific Characteristic Assistance for Code-switching Speech Recognition

Authors: Tongtong Song, Qiang Xu, Meng Ge, Longbiao Wang, Hao Shi, Yongjie Lv, Yuqin Lin, Jianwu Dang

Abstract: Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, most existing methods have no language constraints on LSEs and underutilize language-specific knowledge of LSMs. In this paper, we propose a language-specific characteristic assistance (LSCA) method to mitigate the above problems. Specifically, during training, we introduce two language-specific losses as language constraints and generate corresponding language-specific targets for them. During decoding, we take the decoding abilities of LSMs into account by combining the output probabilities of two LSMs and the mixture model to obtain the final predictions. Experiments show that either the training or decoding method of LSCA can improve the model's performance. Furthermore, the best result can obtain up to 15.4% relative error reduction on the code-switching test set by combining the training and decoding methods of LSCA. Moreover, the system can process code-switching speech recognition tasks well without extra shared parameters or even retraining based on two pre-trained LSMs by using our method.

Submitted 11 July, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

Comments: Accepted by Interspeech 2022

arXiv:2206.12273 [pdf, other]

Iterative Sound Source Localization for Unknown Number of Sources

Authors: Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang

Abstract: Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of unknown number of sources, existing localization algorithms attempt to predict a likelihood-based coding (i.e., spatial spectrum) and employ a pre-determined threshold to detect the source number and corresponding DOA value. However, these threshold-based algorithms are not stable since they are limited by the careful choice of threshold. To address this problem, we propose an iterative sound source localization approach called ISSL, which can iteratively extract each source's DOA without threshold until the termination criterion is met. Unlike threshold-based algorithms, ISSL designs an active source detector network based on binary classifier to accept residual spatial spectrum and decide whether to stop the iteration. By doing so, our ISSL can deal with an arbitrary number of sources, even more than the number of sources seen during the training stage. The experimental results show that our ISSL achieves significant performance improvements in both DOA estimation and source number detection compared with the existing threshold-based algorithms.

Submitted 24 June, 2022; originally announced June 2022.

Comments: Accepted by Interspeech 2022

arXiv:2203.16843 [pdf, other]

A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction

Authors: Zexu Pan, Meng Ge, Haizhou Li

Abstract: The speaker extraction algorithm extracts the target speech from a mixture speech containing interference speech and background noise. The extraction process sometimes over-suppresses the extracted target speech, which not only creates artifacts during listening but also harms the performance of downstream automatic speech recognition algorithms. We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to settle the over-suppression problem. On top of the waveform-level loss used for superior signal quality, i.e., SI-SDR, we introduce a multi-resolution delta spectrum loss in the frequency-domain, to ensure the continuity of an extracted speech signal, thus alleviating the over-suppression. We examine the hybrid continuity loss function using a time-domain audio-visual speaker extraction algorithm on the YouTube LRS2-BBC dataset. Experimental results show that the proposed loss function reduces the over-suppression and improves the word error rate of speech recognition on both clean and noisy two-speakers mixtures, without harming the reconstructed speech quality.

Submitted 20 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: Accepted by Interspeech 2022

arXiv:2202.09995 [pdf, other]

L-SpEx: Localized Target Speaker Extraction

Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

Abstract: Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., face image or video. In this paper, we propose an end-to-end localized target speaker extraction on pure speech cues, that is called L-SpEx. Specifically, we design a speaker localizer driven by the target speaker's embedding to extract the spatial features, including direction-of-arrival (DOA) of the target speaker and beamforming output. Then, the spatial cues and target speaker's embedding are both used to form a top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset called MC-Libri2Mix show that our L-SpEx approach significantly outperforms the baseline system.

Submitted 21 February, 2022; originally announced February 2022.

Comments: Accepted in ICASSP 2022

arXiv:2111.10596 [pdf, other]

Semi-supervised Impedance Inversion by Bayesian Neural Network Based on 2-d CNN Pre-training

Authors: Muyang Ge, Wenlong Wang, Wangxiangming Zheng

Abstract: Seismic impedance inversion can be performed with a semi-supervised learning algorithm, which only needs a few logs as labels and is less likely to overfit. However, classical semi-supervised learning algorithms usually lead to artifacts in the predicted impedance image. In this article, we improve semi-supervised learning in two respects. First, by replacing the 1-d convolutional neural network (CNN) layers in the deep learning structure with 2-d CNN layers and 2-d maxpooling layers, the prediction accuracy is improved. Second, prediction uncertainty can also be estimated by embedding the network into a Bayesian inference framework. The local reparameterization trick is used during forward propagation of the network to reduce sampling cost. Tests with the Marmousi2 model and the SEAM model validate the feasibility of the proposed strategy.

Submitted 20 November, 2021; originally announced November 2021.

arXiv:2109.14831 [pdf, other]

USEV: Universal Speaker Extraction with Visual Cue

Authors: Zexu Pan, Meng Ge, Haizhou Li

Abstract: A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-talker speech mixture. The prior studies focus mostly on speaker extraction from a highly overlapped multi-talker speech mixture. However, the target-interference speaker overlapping ratios could vary over a wide range from 0% to 100% in natural speech communication. Furthermore, the target speaker could be absent from the speech mixture. The speech mixtures in such universal multi-talker scenarios are described as general speech mixtures. The speaker extraction algorithm requires an auxiliary reference, such as a video recording or a pre-recorded speech, to form top-down auditory attention on the target speaker. We advocate that a visual cue, i.e., lip movement, is more informative than an audio cue, i.e., pre-recorded speech, to serve as the auxiliary reference for speaker extraction in disentangling the target speaker from a general speech mixture. In this paper, we propose a universal speaker extraction network with a visual cue that works for all multi-talker scenarios. In addition, we propose a scenario-aware differentiated loss function for network training, to balance the network performance over different target-interference speaker pairing scenarios. The experimental results show that our proposed method outperforms various competitive baselines for general speech mixtures in terms of signal fidelity.

Submitted 30 August, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

Comments: Accepted by TASLP

arXiv:2011.09624 [pdf, other]

Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals

Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

Abstract: Speaker extraction requires a sample speech from the target speaker as the reference. However, enrolling a speaker with a long speech is not practical. We propose a speaker extraction technique that operates in multiple stages to take full advantage of a short reference speech sample. The extracted speech in early stages is used as the reference speech for late stages. For the first time, we use frame-level sequential speech embedding as the reference for the target speaker. This is a departure from the traditional utterance-based speaker embedding reference. In addition, a signal fusion scheme is proposed to combine the decoded signals in multiple scales with automatically learned weights. Experiments on WSJ0-2mix and its noisy versions (WHAM! and WHAMR!) show that SpEx++ consistently outperforms other state-of-the-art baselines.

Submitted 2 April, 2021; v1 submitted 18 November, 2020; originally announced November 2020.

Comments: Accepted in ICASSP 2021

arXiv:2006.12372 [pdf, other]

Edge server deployment scheme of blockchain in IoVs

Authors: Liya Xu, Mingzhu Ge, Weili Wu

Abstract: With the development of intelligent vehicles, secure and reliable communication between vehicles has become a key problem to be solved in the Internet of vehicles (IoVs). Blockchain is considered a feasible solution due to its advantages of decentralization, unforgeability and collective maintenance. However, the computing power of nodes in IoVs is limited, while the consensus mechanism of blockchain requires that the miners in the system have strong computing power for mining calculation. The nodes consequently cannot satisfy these requirements, which is a challenge for the application of blockchain in IoVs. In fact, the application of blockchain in IoVs can be implemented by employing edge computing. The key entities of edge computing are the edge servers (ESs). Roadside nodes (RSUs) can be deployed as ESs of edge computing in IoVs. We have studied the ES deployment scheme for covering more vehicle nodes in IoVs, and propose a randomized algorithm to calculate approximation solutions. Finally, we simulated the performance of the proposed scheme and compared it with other deployment schemes.

Submitted 16 June, 2020; originally announced June 2020.

arXiv:2005.04686 [pdf, other]

SpEx+: A Complete Time Domain Speaker Extraction Network

Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

Abstract: Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation in frequency-domain approaches. Unfortunately, SpEx is not fully a time-domain solution since it performs time-domain speech encoding for speaker extraction, while taking frequency-domain speaker embedding as the reference. The size of the analysis window for time-domain and the size for frequency-domain input are also different. Such mismatch has an adverse effect on the system performance. To eliminate such mismatch, we propose a complete time-domain speaker extraction solution, that is called SpEx+. Specifically, we tie the weights of two identical speech encoder networks, one for the encoder-extractor-decoder pipeline, another as part of the speaker encoder. Experiments show that the SpEx+ achieves 0.8dB and 2.1dB SDR improvement over the state-of-the-art SpEx baseline, under different and same gender conditions on WSJ0-2mix-extr database respectively.

Submitted 17 August, 2020; v1 submitted 10 May, 2020; originally announced May 2020.

Comments: accepted in INTERSPEECH 2020

arXiv:2001.04198 [pdf, ps, other]

Predefined-time Terminal Sliding Mode Control of Robot Manipulators

Authors: Chang-Duo Liang, Ming-Feng Ge, Zhi-Wei Liu, Yan-Wu Wang, Hamid Reza Karimi

Abstract: In this paper, we present a new terminal sliding mode control to achieve predefined-time stability of robot manipulators. The proposed control is developed based on a novel predefined-time terminal sliding mode (PTSM) surface, on which the states are forced to reach the origin in a predefined time, i.e., the settling time is independent of the initial condition and can be explicitly user-defined by adjusting some specific parameters called the predefined-time parameters. It is also demonstrated that the proposed control can provide satisfactory steady-state performance in the case of both external disturbances and parametric uncertainties. Besides, we present a formal systemic analysis method to derive the sufficient conditions for guaranteeing the predefined-time convergence of the closed-loop system. Finally, the effectiveness and performance of the presented control scheme are illustrated through both theoretical comparisons and numerical simulations.

Submitted 25 April, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

Comments: 10 pages, 9 figures, This draft is not intended for publication

arXiv:1607.07543 [pdf, ps, other]

doi 10.1016/j.jfranklin.2016.06.025

Task-space coordinated tracking of multiple heterogeneous manipulators via controller-estimator approaches

Authors: Ming-Feng Ge, Zhi-Hong Guan, Chao Yang, Chao-Yang Chen, Ding-Fu Zheng, Ming Chi

Abstract: This paper studies the task-space coordinated tracking of a time-varying leader for multiple heterogeneous manipulators (MHMs), containing redundant manipulators and nonredundant ones. Different from the traditional coordinated control, distributed controller-estimator algorithms (DCEA), which consist of local algorithms and networked algorithms, are developed for MHMs with parametric uncertainties and input disturbances. By invoking differential inclusions, nonsmooth analysis, and input-to-state stability, some conditions (including sufficient conditions, necessary and sufficient conditions) on the asymptotic stability of the task-space tracking errors and the subtask errors are developed. Simulation results are given to show the effectiveness of the presented DCEA.

Submitted 26 July, 2016; originally announced July 2016.

Comments: 17 pages, 7 figures, Journal of the Franklin Institute

arXiv:1607.07535 [pdf, ps, other]

doi 10.1016/j.neucom.2016.03.008

Time-varying formation tracking of multiple manipulators via distributed finite-time control

Authors: Ming-Feng Ge, Zhi-Hong Guan, Chao Yang, Tao Li, Yan-Wu Wang

Abstract: Compared with traditional fixed formation for a group of dynamical systems, time-varying formation can produce the following benefits: i) covering the greater part of complex environments; ii) collision avoidance. This paper studies the time-varying formation tracking for multiple manipulator systems (MMSs) under fixed and switching directed graphs with a dynamic leader, whose acceleration cannot change too fast. An explicit mathematical formulation of time-varying formation is developed based on the related practical applications. A class of extended inverse dynamics control algorithms combined with distributed sliding-mode estimators is developed to address the aforementioned problem. By invoking finite-time stability arguments, several novel criteria (including sufficient criteria, necessary and sufficient criteria) for global finite-time stability of MMSs are established. Finally, numerical experiments are presented to verify the effectiveness of the theoretical results.

Submitted 26 July, 2016; originally announced July 2016.

Journal ref: Neurocomputing, 2016, 202: 20-26

arXiv:1605.08542 [pdf, ps, other]

doi 10.1016/j.automatica.2016.03.008

Distributed controller-estimator for target tracking of networked robotic systems under sampled interaction

Authors: Ming-Feng Ge, Zhi-Hong Guan, Bin Hu, Ding-Xin He, Rui-Quan Liao

Abstract: This paper investigates the target tracking problem for networked robotic systems (NRSs) under sampled interaction. The target is assumed to be time-varying and described by a second-order oscillator. Two novel distributed controller-estimator algorithms (DCEA), which consist of both continuous and discontinuous signals, are presented. Based on the properties of small-value norms and Lyapunov stability theory, the conditions on the interaction topology, the sampling period, and the other control parameters are given such that the practical stability of the tracking error is achieved and the stability region is regulated quantitatively. The advantages of the presented DCEA are illustrated by comparisons with each other and the existing coordination algorithms. Simulation examples are given to demonstrate the theoretical results.

Submitted 27 May, 2016; originally announced May 2016.

Comments: 8 pages, 4 figures, Published in Automatica

Journal ref: Automatica, 2016, 69: 410-417

Showing 1–29 of 29 results for author: Ge, M