Showing 1–37 of 37 results for author: Dang, J

Searching in archive eess.

  1. arXiv:2408.05758  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

    Authors: Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, Jianhua Tao

    Abstract: Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the spe…

    Submitted 11 August, 2024; originally announced August 2024.

  2. arXiv:2407.12817  [pdf, other]

    cs.CL cs.SD eess.AS

    Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition

    Authors: Yuchun Shu, Bo Hu, Yifeng He, Hao Shi, Longbiao Wang, Jianwu Dang

    Abstract: Accurately finding the wrong words in the automatic speech recognition (ASR) hypothesis and recovering them well-founded is the goal of speech error correction. In this paper, we propose a non-autoregressive speech error correction method. A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses as the reference to find the wrong word position. Besides, the acoustic f…

    Submitted 29 June, 2024; originally announced July 2024.

  3. arXiv:2407.00743  [pdf, other]

    cs.MM cs.AI cs.CL eess.AS

    AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

    Authors: Sheng Wu, Jiaxing Liu, Longbiao Wang, Dongxiao He, Xiaobao Wang, Jianwu Dang

    Abstract: Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there exists a dearth of investigation into effective multimodal fusion methods. We propose a novel framework called AIMDiT to solve the problem of multimodal fusion…

    Submitted 12 April, 2024; originally announced July 2024.

  4. arXiv:2406.08911  [pdf, other]

    cs.CL eess.AS

    An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

    Authors: Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

    Abstract: Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  5. arXiv:2401.02081  [pdf, ps, other]

    cs.IT eess.SP

    Performance Trade-off and Joint Waveform Design for MIMO-OFDM DFRC Systems

    Authors: Tianchen Liu, Liang Wu, Bo An, Zaichen Zhang, Jian Dang, Jiangzhou Wang

    Abstract: Dual-functional radar-communication (DFRC) has attracted considerable attention. This paper considers the frequency-selective multipath fading environment and proposes DFRC waveform design strategies based on multiple-input and multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) techniques. In the proposed waveform design strategies, the Cramer-Rao bound (CRB) of the radar…

    Submitted 4 January, 2024; originally announced January 2024.

  6. arXiv:2312.14398  [pdf, other]

    cs.SD eess.AS

    ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

    Authors: Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi

    Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TTS systems are typically built using a single speaker's voices, but there is growing interest in developing systems that can synthesize voices for new…

    Submitted 26 August, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: Accepted by IEEE/ACM TASLP, 16 pages plus 1 page of bio and photos

  7. arXiv:2312.11201  [pdf, other]

    eess.AS cs.SD eess.SP

    A Refining Underlying Information Framework for Monaural Speech Enhancement

    Authors: Rui Cao, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

    Abstract: Supervised speech enhancement has gained significantly from recent advancements in neural networks, especially due to their ability to non-linearly fit the diverse representations of target speech, such as waveform or spectrum. However, these direct-fitting solutions continue to face challenges with degraded speech and residual noise in hearing evaluations. By bridging the speech enhancement and t…

    Submitted 24 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: 5 pages

  8. arXiv:2309.15512  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

    Authors: Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang

    Abstract: Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic & acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from inform…

    Submitted 18 December, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024. arXiv admin note: substantial text overlap with arXiv:2307.15484; text overlap with arXiv:2309.00424

  9. arXiv:2309.00424  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Learning Speech Representation From Contrastive Token-Acoustic Pretraining

    Authors: Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang

    Abstract: For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content is emphasized, while the paralinguistic informati…

    Submitted 18 December, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

  10. arXiv:2307.15484  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

    Authors: Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang

    Abstract: Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging pr…

    Submitted 18 December, 2023; v1 submitted 28 July, 2023; originally announced July 2023.

    Comments: Accepted by ICASSP 2024

  11. arXiv:2307.06657  [pdf, other]

    cs.IT eess.SP

    Downlink Precoding for Cell-free FBMC/OQAM Systems With Asynchronous Reception

    Authors: Yuhao Qi, Jian Dang, Zaichen Zhang, Liang Wu, Yongpeng Wu

    Abstract: In this work, an efficient precoding design scheme is proposed for downlink cell-free distributed massive multiple-input multiple-output (DM-MIMO) filter bank multi-carrier (FBMC) systems with asynchronous reception and highly frequency selectivity. The proposed scheme includes a multiple interpolation structure to eliminate the impact of response difference we recently discovered, which has bette…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: 16 pages, 4 figures

  12. arXiv:2306.02625  [pdf, other]

    cs.SD eess.AS

    Rethinking the visual cues in audio-visual speaker extraction

    Authors: Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang, Shiliang Zhang

    Abstract: The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visual cue contributes more to the speaker extraction p…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Accepted in Interspeech 2023

  13. arXiv:2305.17860  [pdf, other]

    cs.SD eess.AS

    speech and noise dual-stream spectrogram refine network with speech distortion loss for robust speech recognition

    Authors: Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang, Xiaobao Wang, Shiliang Zhang

    Abstract: In recent years, the joint training of speech enhancement front-end and automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods only use enhanced speech as input for the backend. However, it is difficult for speech enhancement systems to directly separate speech from input due to the diverse types of noise with d…

    Submitted 30 May, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

  14. arXiv:2305.10821  [pdf, other]

    eess.AS

    Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation

    Authors: Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

    Abstract: Recently, stunning improvements on multi-channel speech separation have been achieved by neural beamformers when direction information is available. However, most of them neglect to utilize speaker's 2-dimensional (2D) location cues contained in mixture signal, which limits the performance when two sources come from close directions. In this paper, we propose an end-to-end beamforming network for…

    Submitted 2 June, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech 2023. arXiv admin note: substantial text overlap with arXiv:2212.03401

  15. arXiv:2303.14593  [pdf, other]

    cs.SD eess.AS

    Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

    Authors: Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang, Tatsuya Kawahara

    Abstract: Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information,…

    Submitted 25 March, 2023; originally announced March 2023.

  16. arXiv:2302.11254  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS eess.IV

    Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

    Authors: Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang

    Abstract: Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-mod…

    Submitted 22 February, 2023; originally announced February 2023.

  17. arXiv:2212.03401  [pdf, other]

    eess.AS cs.LG cs.SD

    MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation

    Authors: Yanjie Fu, Haoran Yin, Meng Ge, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

    Abstract: Recently, many deep learning based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker feature, face image or directional information. In this paper, we propose an end-to-end beamforming network for direction guided speech separation given merely the mixture signal, namely MIMO-DBnet. Specifically, we d…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  18. arXiv:2211.01046  [pdf, other]

    eess.AS cs.CL cs.SD

    Monolingual Recognizers Fusion for Code-switching Speech Recognition

    Authors: Tongtong Song, Qiang Xu, Haoyu Lu, Longbiao Wang, Hao Shi, Yuqin Lin, Yanbing Yang, Jianwu Dang

    Abstract: The bi-encoder structure has been intensively investigated in code-switching (CS) automatic speech recognition (ASR). However, most existing methods require the structures of two monolingual ASR models (MAMs) should be the same and only use the encoder of MAMs. This leads to the problem that pre-trained MAMs cannot be timely and fully used for CS ASR. In this paper, we propose a monolingual recogn…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  19. arXiv:2210.10401  [pdf, other]

    cs.IT eess.SP

    Asynchronous RIS-assisted Localization: A Comprehensive Analysis of Fundamental Limits

    Authors: Ziyi Gong, Liang Wu, Zaichen Zhang, Jian Dang, Yongpeng Wu, Jiangzhou Wang

    Abstract: The reconfigurable intelligent surface (RIS) has drawn considerable attention for its ability to enhance the performance of not only the wireless communication but also the indoor localization with low-cost. This paper investigates the performance limits of the RIS-based near-field localization in the asynchronous scenario, and analyzes the impact of each part of the cascaded channel on the locali…

    Submitted 26 March, 2023; v1 submitted 19 October, 2022; originally announced October 2022.

  20. arXiv:2210.06177  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS

    VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

    Authors: Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang

    Abstract: Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previou…

    Submitted 9 October, 2022; originally announced October 2022.

  21. arXiv:2210.05254  [pdf, other]

    cs.SD cs.AI eess.AS

    Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

    Authors: Xiaohui Liu, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, Jianwu Dang

    Abstract: The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding featur…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: 7 pages, 1 figure, Accepted by Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia

  22. MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources

    Authors: Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang

    Abstract: Recent neural network based Direction of Arrival (DoA) estimation algorithms have performed well on unknown number of sound sources scenarios. These algorithms are usually achieved by mapping the multi-channel audio input to the single output (i.e. overall spatial pseudo-spectrum (SPS) of all sources), that is called MISO. However, such MISO algorithms strongly depend on empirical threshold settin…

    Submitted 16 November, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech 2022

  23. arXiv:2206.14580  [pdf, other]

    cs.CL eess.AS

    Language-specific Characteristic Assistance for Code-switching Speech Recognition

    Authors: Tongtong Song, Qiang Xu, Meng Ge, Longbiao Wang, Hao Shi, Yongjie Lv, Yuqin Lin, Jianwu Dang

    Abstract: Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, most existing methods have no language constraints on LSEs and underutili…

    Submitted 11 July, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted by Interspeech 2022

  24. arXiv:2206.12273  [pdf, other]

    eess.AS cs.LG

    Iterative Sound Source Localization for Unknown Number of Sources

    Authors: Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang

    Abstract: Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of unknown number of sources, existing localization algorithms attempt to predict a likelihood-based coding (i.e., spatial spectrum) and employ a pre-determined threshold to detect the source number and corresponding DOA value. However, these t…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: Accepted by Interspeech 2022

  25. arXiv:2205.02999  [pdf, ps, other]

    eess.SP

    Fast and Arbitrary Beam Pattern Design for RIS-Assisted Terahertz Wireless Communication

    Authors: Jian Dang, Zaichen Zhang, Yewei Li, Liang Wu, Bingcheng Zhu, Lei Wang

    Abstract: Reconfigurable intelligent surface (RIS) can assist terahertz wireless communication to restore the fragile line-of-sight links and facilitate beam steering. Arbitrary reflection beam patterns are desired to meet diverse requirements in different applications. This paper establishes relationship between RIS beam pattern design with two-dimensional finite impulse response filter design and proposes…

    Submitted 5 May, 2022; originally announced May 2022.

    Comments: 5 pages, 5 figures

  26. arXiv:2203.09098  [pdf, other]

    cs.SD cs.LG eess.AS

    TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding

    Authors: Ruiteng Zhang, Jianguo Wei, Xugang Lu, Wenhuan Lu, Di Jin, Junhai Xu, Lin Zhang, Yantao Ji, Jianwu Dang

    Abstract: Speaker embedding is an important front-end module to explore discriminative speaker features for many speech applications where speaker information is needed. Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation. However, naively adding many branches of multi-scale f…

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file

  27. arXiv:2202.09995  [pdf, other]

    eess.AS cs.SD

    L-SpEx: Localized Target Speaker Extraction

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., face image or video. In this…

    Submitted 21 February, 2022; originally announced February 2022.

    Comments: Accepted in ICASSP 2022

  28. arXiv:2112.02324  [pdf, other]

    eess.SP

    A Novel Two-stage Design Scheme of Equalizers for Uplink FBMC/OQAM-based Massive MIMO Systems

    Authors: Yuhao Qi, Jian Dang, Zaichen Zhang, Liang Wu, Yongpeng Wu

    Abstract: The self-equalization property has raised great concern in the combination of offset-quadratic-amplitude-modulation-based filter bank multi-carrier (FBMC/OQAM) and massive multiple-input multiple-output (MIMO) system, which enables to decrease the interference brought by the highly frequency-selective channels as the number of base station (BS) antennas increases. However, existing works show that…

    Submitted 4 December, 2021; originally announced December 2021.

    Comments: 14 pages, 11 figures

  29. arXiv:2110.04451  [pdf, other]

    cs.SD cs.AI eess.AS

    Using multiple reference audios and style embedding constraints for speech synthesis

    Authors: Cheng Gong, Longbiao Wang, Zhenhua Ling, Ju Zhang, Jianwu Dang

    Abstract: The end-to-end speech synthesis model can directly take an utterance as reference audio, and generate speech from the text with prosody and speaker characteristics similar to the reference audio. However, an appropriate acoustic embedding must be manually selected during inference. Due to the fact that only the matched text and speech are used in the training process, using unmatched text and spee…

    Submitted 9 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, submitted to ICASSP 2022

  30. arXiv:2104.08510  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Exploring Deep Learning for Joint Audio-Visual Lip Biometrics

    Authors: Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Chang Zeng, Jianwu Dang

    Abstract: Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication. Previous works have demonstrated the usefulness of AV lip biometrics. However, the lack of a sizeable AV database hinders the exploration of deep-learning-based audio-visual lip biometrics. To address this problem, we compile a modera…

    Submitted 17 April, 2021; originally announced April 2021.

  31. arXiv:2012.15398  [pdf, other]

    eess.SY

    Two New Approaches to Optical IRSs: Schemes and Comparative Analysis

    Authors: Haibo Wang, Zaichen Zhang, Bingcheng Zhu, Jian Dang, Liang Wu

    Abstract: Oriented to the point-to-multipoint free space optical communication (FSO) scenarios, this paper analyzes the micro-mirror array and phased array-type optical intelligent reflecting surface (OIRS) in terms of control mode, power efficiency, and beam splitting. We build the physical models of the two types of OIRSs. Based on the models, the closed form solution of OIRSs' output power density distri…

    Submitted 30 December, 2020; originally announced December 2020.

    Comments: 26 pages, 11 figures

  32. arXiv:2011.09624  [pdf, other]

    eess.AS cs.LG

    Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction requires a sample speech from the target speaker as the reference. However, enrolling a speaker with a long speech is not practical. We propose a speaker extraction technique, that performs in multiple stages to take full advantage of short reference speech sample. The extracted speech in early stages is used as the reference speech for late stages. For the first time, we use fr…

    Submitted 2 April, 2021; v1 submitted 18 November, 2020; originally announced November 2020.

    Comments: Accepted in ICASSP 2021

  33. arXiv:2010.05530  [pdf, other]

    cs.IT eess.SP

    Transmit Covariance and Waveform Optimization for Non-orthogonal CP-FBMA System

    Authors: Yuhao Qi, Jian Dang, Zaichen Zhang, Liang Wu, Yongpeng Wu

    Abstract: Filter bank multiple access (FBMA) without subbands orthogonality has been proposed as a new candidate waveform to better meet the requirements of future wireless communication systems and scenarios. It has the ability to process directly the complex symbols without any fancy preprocessing. Along with the usage of cyclic prefix (CP) and wide-banded subband design, CP-FBMA can further improve the p…

    Submitted 13 October, 2020; v1 submitted 12 October, 2020; originally announced October 2020.

    Comments: 30 pages, 9 figures, accepted for publication in the IEEE Transactions on Communications

  34. arXiv:2005.04686  [pdf, other]

    eess.AS cs.SD

    SpEx+: A Complete Time Domain Speaker Extraction Network

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation in frequency-domain approaches. Unfortunately, SpEx is not fully a time-domain solution since it performs time-domain speech encoding for speaker extraction, while taking frequency-…

    Submitted 17 August, 2020; v1 submitted 10 May, 2020; originally announced May 2020.

    Comments: accepted in INTERSPEECH 2020

  35. arXiv:2004.05772  [pdf, ps, other]

    cs.IT eess.SP

    Joint User Identification and Channel Estimation Over Rician Fading Channels

    Authors: Liang Wu, Zaichen Zhang, Jian Dang, Yongpeng Wu, Huaping Liu, Jiangzhou Wang

    Abstract: This paper considers crowded massive multiple input multiple output (MIMO) communications over a Rician fading channel, where the number of users is much greater than the number of available pilot sequences. A joint user identification and line-of-sight (LOS) component derivation algorithm is proposed without requiring a threshold. Based on the derived LOS component, we design a LOS-only channel e…

    Submitted 13 April, 2020; originally announced April 2020.

  36. arXiv:2001.05715  [pdf, other]

    eess.SY eess.SP

    Performance of Wireless Optical Communication With Reconfigurable Intelligent Surfaces and Random Obstacles

    Authors: Haibo Wang, Zaichen Zhang, Bingcheng Zhu, Jian Dang, Liang Wu, Lei Wang, Kehan Zhang, Yidi Zhang

    Abstract: It is difficult for free space optical communication to be applied in mobile communication due to the obstruction of obstacles in the environment, which is expected to be solved by reconfigurable intelligent surface technology. The reconfigurable intelligent surface is a new type of digital coding meta-materials, which can reflect, compute and program electromagnetic and optical waves in real time…

    Submitted 16 January, 2020; originally announced January 2020.

  37. arXiv:1801.07336  [pdf, ps, other]

    eess.SP

    A 3D Non-Stationary Wideband Geometry-Based Channel Model for MIMO Vehicle-to-Vehicle Communication System

    Authors: Hao Jiang, Zaichen Zhang, Liang Wu, Jian Dang, Guan Gui

    Abstract: In this paper, we present a three-dimensional (3D) non-wide-sense stationary (non-WSS) wideband geometry-based channel model for vehicle-to-vehicle (V2V) communication environments. We introduce a two-cylinder model to describe moving vehicles as well as multiple confocal semi-ellipsoid models to depict stationary roadside scenarios. The received signal is constructed as a sum of the line-of-sight…

    Submitted 22 January, 2018; originally announced January 2018.