Enhancement of Over-the-Air Federated Learning by Using AI-based Fluid Antenna System

Authors: Mohsen Ahmadzadeh, Saeid Pakravan, Ghosheh Abed Hodtani, Ming Zeng, Jean-Yves Chouinard

Abstract: This letter investigates an over-the-air federated learning (OTA-FL) system that employs fluid antennas (FAs) at the access point (AP) to enhance learning performance by leveraging the additional degrees of freedom provided by antenna mobility. First, we analyze the convergence of the OTA-FL system and derive the optimality gap to illustrate the influence of FAs on learning performance. Then, we formulate a nonconvex optimization problem to minimize the optimality gap by jointly optimizing the positions of the FAs and the beamforming vector. To address the dynamic environment, we cast this optimization problem as a Markov decision process (MDP) and propose the recurrent deep deterministic policy gradient (RDPG) algorithm. Finally, extensive simulations show that the FA-assisted OTA-FL system outperforms systems with fixed-position antennas and that the RDPG algorithm surpasses the existing methods.

Submitted 3 July, 2024; originally announced July 2024.

Comments: 5 pages, 3 figures

arXiv:2405.17809

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Authors: Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng

Abstract: There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework by concatenating speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complexities involved in direct translation tasks and the scarcity of data. In this study, we introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion yet facilitates end-to-end inference through joint probability. Furthermore, we propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process, making it highly suitable for scenarios such as video dubbing. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.

Submitted 28 May, 2024; originally announced May 2024.

Comments: Work in progress

arXiv:2405.08423

NAFRSSR: a Lightweight Recursive Network for Efficient Stereo Image Super-Resolution

Authors: Yihong Chen, Zhen Fan, Shuai Dong, Zhiwei Chen, Wenjie Li, Minghui Qin, Min Zeng, Xubing Lu, Guofu Zhou, Xingsen Gao, Jun-Ming Liu

Abstract: Stereo image super-resolution (SR) refers to the reconstruction of a high-resolution (HR) image from a pair of low-resolution (LR) images as typically captured by a dual-camera device. To enhance the quality of SR images, most previous studies focused on increasing the number and size of feature maps and introducing complex and computationally intensive structures, resulting in models with high computational complexity. Here, we propose a simple yet efficient stereo image SR model called NAFRSSR, which is modified from the previous state-of-the-art model NAFSSR by introducing recursive connections and lightweighting the constituent modules. Our NAFRSSR model is composed of nonlinear activation free and group convolution-based blocks (NAFGCBlocks) and depth-separated stereo cross attention modules (DSSCAMs). The NAFGCBlock improves feature extraction and reduces the number of parameters by removing the simple channel attention mechanism from NAFBlock and using group convolution. The DSSCAM enhances feature fusion and reduces the number of parameters by replacing the 1x1 pointwise convolution in SCAM with a weight-shared 3x3 depthwise convolution. Besides, we propose to incorporate a trainable edge detection operator into NAFRSSR to further improve the model performance. Four variants of NAFRSSR with different sizes, namely, NAFRSSR-Mobile (NAFRSSR-M), NAFRSSR-Tiny (NAFRSSR-T), NAFRSSR-Super (NAFRSSR-S) and NAFRSSR-Base (NAFRSSR-B) are designed, and they all exhibit fewer parameters, higher PSNR/SSIM, and faster speed than the previous state-of-the-art models. In particular, to the best of our knowledge, NAFRSSR-M is the lightest (0.28M parameters) and fastest (50 ms inference time) model achieving an average PSNR/SSIM as high as 24.657 dB/0.7622 on the benchmark datasets. Codes and models will be released at https://github.com/JNUChenYiHong/NAFRSSR.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2404.06690

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Authors: Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

Abstract: Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by instances generated in a single channel where one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are available at https://aka.ms/covomix.

Submitted 29 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2402.07383

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

Authors: Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng

Abstract: Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their applications and user experience. While there have been prior works to generate natural laughter, they fell short in terms of controlling the timing and variety of the laughter to be generated. In this work, we propose ELaTE, a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt with precise control of laughter timing and expression. Specifically, ELaTE works on the audio prompt to mimic the voice characteristic, the text prompt to indicate the contents of the generated speech, and the input to control the laughter expression, which can be either the start and end times of laughter, or the additional audio prompt that contains laughter to be mimicked. We develop our model based on the foundation of conditional flow-matching-based zero-shot TTS, and fine-tune it with frame-level representation from a laughter detector as additional conditioning. With a simple scheme to mix small-scale laughter-conditioned data with large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS model can be readily fine-tuned to generate natural laughter with precise controllability, without losing any quality of the pre-trained zero-shot TTS model. Through objective and subjective evaluations, we show that ELaTE can generate laughing speech with significantly higher quality and controllability compared to conventional models. See https://aka.ms/elate/ for demo samples.

Submitted 4 March, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

Comments: See https://aka.ms/elate/ for demo samples, v2: subjective evaluation has been added

arXiv:2402.04395

Auto-Encoder Optimized PAM IM/DD Transceivers for Amplified Fiber Links

Authors: Amir Omidi, Mai Banawan, Erwan Weckenmann, Benoit Paquin, Alireza Geravand, Zibo Zheng, Wei Shi, Ming Zeng, Leslie A. Rusch

Abstract: We examine pulse amplitude modulation (PAM) for intensity modulation and direct detection systems. Using a straightforward, mixed noise model, we optimize the constellations with an autoencoder-based neural network (NN) and improve the required signal-to-noise ratio by 4 dB for amplified spontaneous emission (ASE)-limited PAM4 and PAM8, without increasing system complexity. Performance can also be improved in an O-band wavelength division multiplexing system with semiconductor optical amplifier amplification and chromatic dispersion. We show via simulation that for such a system operating at 53 Gbaud, we can extend the reach of PAM4 by 10-25 km with an optimized constellation and a NN decoder. We present an experimental validation of a 4 dB improvement of an ASE-limited PAM4 at 60 Gbaud using an optimized constellation and a NN decoder.

Submitted 29 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: 9 pages and 13 figures

arXiv:2311.03282

Resource Allocation for RIS-Empowered Wireless Communications: Low-Complexity and Robust Designs

Authors: Ming Zeng, Wanming Hao, Zhangjie Peng, Zheng Chu, Xingwang Li, Changsheng You, Cunhua Pan

Abstract: This article delves into advancements in resource allocation techniques tailored for systems utilizing reconfigurable intelligent surfaces (RIS), with a primary focus on achieving low-complexity and resilient solutions. The investigation of low-complexity approaches for RIS holds significant relevance, primarily owing to the intricate characteristics inherent in RIS-based systems and the need to deploy large-scale RIS arrays. Concurrently, the exploration of robust solutions aims to address the issue of hardware impairments occurring at both the transceivers and RIS components in practical RIS-assisted systems. In the realm of both low-complexity and robust resource allocation, this article not only elucidates the fundamental techniques underpinning these methodologies but also offers comprehensive numerical results for illustrative purposes. The necessity of adopting resource allocation strategies that are both low in complexity and resilient is thoroughly established. Ultimately, this article provides prospective research avenues in the domain of low-complexity and robust resource allocation techniques tailored for RIS-assisted systems.

Submitted 6 November, 2023; originally announced November 2023.

Comments: submitted to IEEE WCM

arXiv:2309.13874

Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction

Authors: Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Xinkai Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng

Abstract: Target Speech Extraction (TSE) is a crucial task in speech processing that focuses on isolating the clean speech of a specific speaker from complex mixtures. While discriminative methods are commonly used for TSE, they can introduce distortion in terms of speech perception quality. On the other hand, generative approaches, particularly diffusion-based methods, can enhance speech quality perceptually but suffer from slower inference speed. We propose an efficient generative approach named Diffusion Conditional Expectation Model (DCEM) for TSE. It can handle multi- and single-speaker scenarios in both noisy and clean conditions. Additionally, we introduce Regenerate-DCEM (R-DCEM) that can regenerate and optimize speech quality based on pre-processed speech from a discriminative model. Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics and demonstrates notable strengths in inference efficiency and robustness to unseen tasks. Audio examples are available online (https://vivian556123.github.io/dcem).

Submitted 25 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024

arXiv:2308.05813

doi 10.1109/JIOT.2023.3296319

Physical Layer Security for NOMA Systems: Requirements, Issues, and Recommendations

Authors: Saeid Pakravan, Jean-Yves Chouinard, Xingwang Li, Ming Zeng, Wanming Hao, Quoc-Viet Pham, Octavia A. Dobre

Abstract: Non-orthogonal multiple access (NOMA) has been viewed as a potential candidate for the upcoming generation of wireless communication systems. Compared to traditional orthogonal multiple access (OMA), multiplexing users in the same time-frequency resource block can increase the number of served users and improve the spectral efficiency of the systems. Nevertheless, from a security viewpoint, when multiple users are utilizing the same time-frequency resource, there may be concerns regarding keeping information confidential. In this context, physical layer security (PLS) has been introduced as a supplement of protection to conventional encryption techniques by making use of the random nature of wireless transmission media for ensuring communication secrecy. Recent years have seen significant interest in PLS being applied to NOMA networks. Numerous scenarios have been investigated to assess the security of NOMA systems, including when active and passive eavesdroppers are present, as well as when these systems are combined with relay and reconfigurable intelligent surfaces (RIS). Additionally, the security of ambient backscatter (AmB)-NOMA systems is another issue that has lately drawn a lot of attention. In this paper, a thorough analysis of the state of the art in PLS-assisted NOMA systems research is presented. In this regard, we begin by outlining the foundations of NOMA and PLS, respectively. Following that, we discuss the PLS performances for NOMA systems in four categories depending on the type of the eavesdropper, the existence of relay, RIS, and AmB systems in different conditions. Finally, a thorough explanation of the most recent PLS-assisted NOMA systems is given.

Submitted 10 August, 2023; originally announced August 2023.

Comments: 17 pages, 4 figures

Journal ref: IEEE Internet of Things Journal

arXiv:2307.08234

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

Authors: Shaoshi Ling, Yuxuan Hu, Shuangbei Qian, Guoli Ye, Yao Qian, Yifan Gong, Ed Lin, Michael Zeng

Abstract: Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions. Pretrained large language models (LLMs) have the potential to improve the performance of E2E ASR. However, integrating a pretrained language model into an E2E speech recognition model has shown limited benefits due to the mismatches between text-based LLMs and those used in E2E ASR. In this paper, we explore an alternative approach by adapting pretrained LLMs to speech. Our experiments on fully-formatted E2E ASR transcription tasks across various domains demonstrate that our approach can effectively leverage the strengths of pretrained LLMs to produce more readable ASR transcriptions. Our model, which is based on pretrained large language models with either an encoder-decoder or decoder-only structure, surpasses strong ASR models such as Whisper in terms of recognition error rate, considering formats like punctuation and capitalization as well.

Submitted 2 August, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

arXiv:2305.18747

Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

Authors: Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng

Abstract: State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages. However, it remains a challenge for these models to recognize overlapped speech, which is often seen in meeting conversations. We propose an approach to adapt USMs for multi-talker ASR. We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction. That is, we predict the ASR hypotheses for all speakers, count the speakers, and estimate the utterance timestamps at the same time. We further introduce a lightweight adapter module to maintain the multilingual property of the USMs even when we perform the adaptation with only a single language. Experimental results obtained using the AMI and AliMeeting corpora show that our proposed approach effectively transfers the USMs to a strong multilingual multi-talker ASR model with timestamp prediction capability.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023

arXiv:2305.14838

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Authors: Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng, Xuedong Huang

Abstract: Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.

Submitted 14 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023, Poster

arXiv:2305.12311

i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

Authors: Ziyi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

Abstract: The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models which lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder in order to flexibly project combinations of modalities into a shared representational space. Next, language tokens are generated from these representations via an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that can be generalized across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.

Submitted 20 May, 2023; originally announced May 2023.

arXiv:2305.11846

Any-to-Any Generation via Composable Diffusion

Authors: Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal

Abstract: We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis. The project page with demonstrations and code is at https://codi-gen.github.io

Submitted 19 May, 2023; originally announced May 2023.

Comments: Project Page: https://codi-gen.github.io

arXiv:2303.14044

MusicFace: Music-driven Expressive Singing Face Synthesis

Authors: Pengfei Liu, Wenjin Deng, Hengda Li, Jintai Wang, Yinglin Zheng, Yiwei Ding, Xiaohu Guo, Ming Zeng

Abstract: It is still an interesting and challenging problem to synthesize a vivid and realistic singing face driven by a music signal. In this paper, we present a method for this task with natural motions of the lip, facial expression, head pose, and eye states. Due to the coupling of the mixed information of human voice and background music in common signals of music audio, we design a decouple-and-fuse strategy to tackle the challenge. We first decompose the input music audio into a human voice stream and a background music stream. Due to the implicit and complicated correlation between the two-stream input signals and the dynamics of the facial expressions, head motions and eye states, we model their relationship with an attention scheme, where the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we propose to decompose head movement generation into speed generation and direction generation, and decompose eye state generation into short-time eye blinking generation and long-time eye closing generation to model them separately. We also build a novel SingingFace Dataset to support the training and evaluation of this task, and to facilitate future works on this topic. Extensive experiments and a user study show that our proposed method is capable of synthesizing a vivid singing face, which is better than state-of-the-art methods qualitatively and quantitatively.

Submitted 24 March, 2023; originally announced March 2023.

Comments: Accepted to CVMJ

arXiv:2303.10949

Code-Switching Text Generation and Injection in Mandarin-English ASR

Authors: Haibin Yu, Yuxuan Hu, Yao Qian, Ma Jin, Linquan Liu, Shujie Liu, Yu Shi, Yanmin Qian, Edward Lin, Michael Zeng

Abstract: Code-switching speech refers to a means of expression by mixing two or more languages within a single utterance. Automatic Speech Recognition (ASR) with End-to-End (E2E) modeling for such speech can be a challenging task due to the lack of data. In this study, we investigate text generation and injection for improving the performance of a streaming model commonly used in industry, Transformer-Transducer (T-T), in Mandarin-English code-switching speech recognition. We first propose a strategy to generate code-switching text data and then investigate injecting generated text into the T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces. Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models, i.e., 16% relative Token-based Error Rate (TER) reduction averaged on three evaluation sets, and the approach of tying speech and text latent spaces is superior to that of TTS conversion on the evaluation set which contains more homogeneous data with the training set.

Submitted 20 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.08372

Target Sound Extraction with Variable Cross-modality Clues

Authors: Chenda Li, Yao Qian, Zhuo Chen, Dongmei Wang, Takuya Yoshioka, Shujie Liu, Yanmin Qian, Michael Zeng

Abstract: Automatic target sound extraction (TSE) is a machine learning approach to mimic the human auditory perception capability of attending to a sound source of interest from a mixture of sources. It often uses a model conditioned on a fixed form of target sound clues, such as a sound class label, which limits the ways in which users can interact with the model to specify the target sounds. To leverage a variable number of clues across modalities available in the inference phase, including a video, a sound event class, and a text caption, we propose a unified transformer-based TSE model architecture, where a multi-clue attention module integrates all the clues across the modalities. Since there is no off-the-shelf benchmark to evaluate our proposed approach, we build a dataset based on public corpora, Audioset and AudioCaps. Experimental results for seen and unseen target-sound evaluation sets show that our proposed TSE model can effectively deal with a varying number of clues which improves the TSE performance and robustness against partially compromised clues.

Submitted 15 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2210.15936 [pdf, other]

A comprehensive study on self-supervised distillation for speaker representation learning

Authors: Zhengyang Chen, Yao Qian, Bing Han, Yanmin Qian, Michael Zeng

Abstract: In real application scenarios, it is often challenging to obtain a large amount of labeled data for speaker representation learning due to speaker privacy concerns. Self-supervised learning with no labels has become a more and more promising way to solve it. Compared with contrastive learning, self-distilled approaches use only positive samples in the loss function and thus are more attractive. In this paper, we present a comprehensive study on self-distilled self-supervised speaker representation learning, especially on critical data augmentation. Our proposed strategy of audio perturbation augmentation has pushed the performance of the speaker representation to a new limit. The experimental results show that our model can achieve a new SoTA on Voxceleb1 speaker verification evaluation benchmark (i.e., equal error rate (EER) 2.505%, 2.473%, and 4.791% for trial Vox1-O, Vox1-E and Vox1-H, respectively), discarding any speaker labels in the training phase.

Submitted 25 November, 2022; v1 submitted 28 October, 2022; originally announced October 2022.

Comments: Accepted by SLT2022

arXiv:2205.01818 [pdf, other]

i-Code: An Integrative and Composable Multimodal Learning Framework

Authors: Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie, Robert Gmyr, Noel Codella, Naoyuki Kanda, Bin Xiao, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

Abstract: Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.

Submitted 5 May, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

arXiv:2202.07140 [pdf, ps, other]

Securing Reconfigurable Intelligent Surface-Aided Cell-Free Networks

Authors: Wanming Hao, Junjie Li, Gangcan Sun, Ming Zeng, Octavia A. Dobre

Abstract: In this paper, we investigate the physical layer security in the reconfigurable intelligent surface (RIS)-aided cell-free networks. A maximum weighted sum secrecy rate problem is formulated by jointly optimizing the active beamforming (BF) at the base stations and passive BF at the RISs. To handle this non-trivial problem, we adopt the alternating optimization to decouple the original problem into two sub-ones, which are solved using the semidefinite relaxation and continuous convex approximation theory. To decrease the complexity for obtaining overall channel state information (CSI), we extend the proposed framework to the case that only requires part of the RIS' CSI. This is achieved via deliberately discarding the RIS that has a small contribution to the user's secrecy rate. Based on this, we formulate a mixed integer non-linear programming problem, and the linear conic relaxation is used to obtain the solutions. Finally, the simulation results show that the proposed schemes can obtain a higher secrecy rate than the existing ones.

Submitted 14 February, 2022; originally announced February 2022.

arXiv:2202.07137 [pdf, other]

Ultra Wide Band THz IRS Communications: Applications, Challenges, Key Techniques, and Research Opportunities

Authors: Wanming Hao, Fuhui Zhou, Ming Zeng, Octavia A. Dobre, Naofal Al-Dhahir

Abstract: Terahertz (THz) communication is a promising technology for future wireless networks due to its ultra-wide bandwidth. However, THz signals suffer from severe attenuation and poor diffraction capability, making them vulnerable to blocking obstacles. To compensate for these two shortcomings and improve the system performance, an intelligent reflecting surface (IRS) can be exploited to change the propagation direction and enhance the signal strength. In this article, we investigate this promising ultra wide band (UWB) THz IRS communication paradigm. We start by motivating our research and describing several potential application scenarios. Then, we identify major challenges faced by UWB THz IRS communications. To overcome these challenges, several effective key techniques are developed, i.e., the time delayer-based sparse radio frequency antenna structure, delay hybrid precoding and IRS deployment. Simulation results are also presented to compare the system performance for these proposed techniques, thus demonstrating their effectiveness. Finally, we highlight several open issues and research opportunities for UWB THz IRS communications.

Submitted 14 February, 2022; originally announced February 2022.

Journal ref: IEEE Network,2022

arXiv:2112.05826 [pdf, other]

doi 10.21437/Interspeech.2020-2020

Sequence-level self-learning with multiple hypotheses

Authors: Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng

Abstract: In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, the imperfect ASR result makes unsupervised learning difficult to consistently improve recognition performance especially in the case that multiple powerful teacher models are unavailable. In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework where the n-th best ASR hypothesis is used as the label of each task. The seq2seq network is updated through the MTL framework so as to find the common representation that can cover multiple hypotheses. By doing so, the effect of the hard-decision errors can be alleviated. We first demonstrate the effectiveness of our self-learning methods through ASR experiments in an accent adaptation task between the US and British English speech. Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only. Moreover, we investigate the effect of our proposed methods in a federated learning scenario.

Submitted 10 December, 2021; originally announced December 2021.

Comments: Published in Interspeech 2020: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

Report number: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

Journal ref: Proc. Interspeech 2020, page 3775-3779

arXiv:2111.13608 [pdf, other]

Joint Wireless and Computing Resources Allocation in Multi-Cell MEC

Authors: M. Zeng

Abstract: This paper addresses joint wireless and computing resource allocation in mobile edge computing (MEC) systems with several access points and with the possibility that users connect to many access points and utilize the computation capability of many servers at the same time. The problem of sum transmission energy minimization under response time constraints is considered. It is proved that the optimization problem is non-convex. The complexity of optimization of a part of the system parameters is investigated, and based on these results an Iterative Resource Allocation procedure is proposed that converges to a local optimum. The performance of the joint resource allocation is evaluated by comparing it to lower and upper bounds defined by less or more flexible multi-cell MEC architectures. The results show that the free selection of the access point is crucial for good system performance.

Submitted 26 November, 2021; originally announced November 2021.

Comments: 6 pages, 3 figures. arXiv admin note: text overlap with arXiv:1910.04841

arXiv:2110.13900 [pdf, other]

doi 10.1109/JSTSP.2022.3188113

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Authors: Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei

Abstract: Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, WavLM does not only keep the speech content modeling capability by the masked speech prediction, but also improves the potential to non-ASR tasks by the speech denoising. In addition, WavLM employs gated relative position bias for the Transformer structure to better capture the sequence ordering of input speech. We also scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pre-trained models are available at https://aka.ms/wavlm.

Submitted 17 June, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

Comments: Submitted to the Journal of Selected Topics in Signal Processing (JSTSP)

arXiv:2110.12138 [pdf, other]

Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding

Authors: Wei Wang, Shuo Ren, Yao Qian, Shujie Liu, Yu Shi, Yanmin Qian, Michael Zeng

Abstract: The advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text encoder for leveraging extensive text data and thus capture more context-aware linguistic information. However, this approach brings a mismatch problem between the speech encoder and the text encoder due to the different units used for modeling. In this paper, we propose an embedding aligner and modality switch training to better align the speech and text latent spaces. The embedding aligner is a shared linear projection between text encoder and speech encoder trained by masked language modeling (MLM) loss and connectionist temporal classification (CTC), respectively. The modality switch training randomly swaps speech and text embeddings based on the forced alignment result to learn a joint representation space. Experimental results show that our proposed approach achieves a relative 14% to 19% word error rate (WER) reduction on Librispeech ASR task. We further verify its effectiveness on spoken language understanding (SLU), i.e., an absolute 2.5% to 2.8% F1 score improvement on SNIPS slot filling task.

Submitted 23 October, 2021; originally announced October 2021.

Comments: submitted to ICASSP 2022

arXiv:2110.09707 [pdf]

PI(t)D(t) Control and Motion Profiling for Omnidirectional Mobile Robots

Authors: Michael Zeng

Abstract: Recently, a trend is emerging toward human-servicing autonomous mobile robots, with diverse applications including delivery of supplies in hospitals, hotels, or labs where personnel are scarce, or reacting to indoor emergencies. However, existing autonomous mobile robot (AMR) motion is slow and inefficient, a foundational barrier to proliferation of human-servicing applications. This research has developed a motion control architecture that demonstrates the potential of several algorithms for increasing speed and efficiency. These include a novel PI(t)D(t) controller that sets integral and derivative gains as functions of time, and motion-profiling applied for holonomic motion. Resulting performance indicates potential for faster, more efficient AMRs, that maintain high levels of accuracy and repeatability. The hope is that this research can serve as a proof of concept for faster motion-control, to remove a key barrier to further use of human-servicing mobile robots.

Submitted 18 October, 2021; originally announced October 2021.

Comments: 12 pages, 13 figures

arXiv:2110.05777 [pdf, other]

Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification

Authors: Zhengyang Chen, Sanyuan Chen, Yu Wu, Yao Qian, Chengyi Wang, Shujie Liu, Yanmin Qian, Michael Zeng

Abstract: The speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and thus attract a lot of interest to be applied for various downstream tasks. In this paper, we explore the limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN [1], as a downstream model. The representations from all hidden layers of the pre-trained model are firstly averaged with learnable weights and then fed into the ECAPA-TDNN as input features. The experimental results on Voxceleb dataset show that the weighted average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, separately. Accordingly, the ensemble system with three pre-trained models can further improve the EER to 0.479%, 0.536% and 1.023%. Among the three evaluation trials, our best system outperforms the winner system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.

Submitted 24 January, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: Accepted by ICASSP 2022

arXiv:2106.05630 [pdf, other]

MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training

Authors: Mingliang Zeng, Xu Tan, Rui Wang, Zeqian Ju, Tao Qin, Tie-Yan Liu

Abstract: Symbolic music understanding, which refers to the understanding of music from the symbolic data (e.g., MIDI format, but not audio), covers many music applications such as genre classification, emotion classification, and music pieces matching. While good music representations are beneficial for these applications, the lack of training data hinders representation learning. Inspired by the success of pre-training models in natural language processing, in this paper, we develop MusicBERT, a large-scale pre-trained model for music understanding. To this end, we construct a large-scale symbolic music corpus that contains more than 1 million music songs. Since symbolic music contains more structural (e.g., bar, position) and diverse information (e.g., tempo, instrument, and pitch), simply adopting the pre-training techniques from NLP to symbolic music only brings marginal gains. Therefore, we design several mechanisms, including OctupleMIDI encoding and bar-level masking strategy, to enhance pre-training with symbolic music data. Experiments demonstrate the advantages of MusicBERT on four music understanding tasks, including melody completion, accompaniment suggestion, genre classification, and style classification. Ablation studies also verify the effectiveness of our designs of OctupleMIDI encoding and bar-level masking strategy in MusicBERT.

Submitted 10 June, 2021; originally announced June 2021.

Comments: Accepted by ACL 2021 Findings

arXiv:2102.11114 [pdf, other]

Generating Human Readable Transcript for Automatic Speech Recognition with Pre-trained Language Model

Authors: Junwei Liao, Yu Shi, Ming Gong, Linjun Shou, Sefik Eskimez, Liyang Lu, Hong Qu, Michael Zeng

Abstract: Modern Automatic Speech Recognition (ASR) systems can achieve high performance in terms of recognition accuracy. However, a perfectly accurate transcript still can be challenging to read due to disfluency, filler words, and other errata common in spoken communication. Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and ASR system alike will be propagated to the next task in the pipeline. In this work, we propose an ASR post-processing model that aims to transform the incorrect and noisy ASR output into a readable text for humans and downstream tasks. We leverage the Metadata Extraction (MDE) corpus to construct a task-specific dataset for our study. Since the dataset is small, we propose a novel data augmentation method and use a two-stage training strategy to fine-tune the RoBERTa pre-trained model. On the constructed test set, our model outperforms a production two-step pipeline-based post-processing method by a large margin of 13.26 on readability-aware WER (RA-WER) and 17.53 on BLEU metrics. Human evaluation also demonstrates that our method can generate more human-readable transcripts than the baseline method.

Submitted 22 February, 2021; originally announced February 2021.

Comments: Accepted in 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2021)

arXiv:2102.06283 [pdf, other]

Speech-language Pre-training for End-to-end Spoken Language Understanding

Authors: Yao Qian, Ximo Bian, Yu Shi, Naoyuki Kanda, Leo Shen, Zhen Xiao, Michael Zeng

Abstract: End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available or sufficient to train an E2E SLU model in a real production environment. In this paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder. The unified speech-language pre-trained model (SLP) is continually enhanced on limited labeled data from a target domain by using a conditional masked language model (MLM) objective, and thus can effectively generate a sequence of intent, slot type, and slot value for given input speech in the inference. The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method. It also outperforms the present state-of-the-art approaches to E2E SLU with much less paired data.

Submitted 11 February, 2021; originally announced February 2021.

arXiv:2101.07597 [pdf, other]

UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

Authors: Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

Abstract: In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.

Submitted 10 June, 2021; v1 submitted 19 January, 2021; originally announced January 2021.

Comments: accepted by ICML2021

arXiv:2008.05798 [pdf, ps, other]

Hardware Impaired Ambient Backscatter NOMA Systems: Reliability and Security

Authors: Xingwang Li, Mengle Zhao, Ming Zeng, Shahid Mumtaz, Varun G Menon, Zhiguo Ding, Octavia A. Dobre

Abstract: Non-orthogonal multiple access (NOMA) and ambient backscatter communication have been envisioned as two promising technologies for the Internet-of-things due to their high spectral efficiency and energy efficiency. Motivated by this fact, we consider an ambient backscatter NOMA system in the presence of a malicious eavesdropper. Under some realistic assumptions of residual hardware impairments (RHIs), channel estimation errors (CEEs) and imperfect successive interference cancellation (ipSIC), we investigate the physical layer security (PLS) of the ambient backscatter NOMA systems focusing on reliability and security. In order to further improve the security of the considered system, an artificial noise scheme is proposed where the radio frequency (RF) source acts as a jammer that transmits interference signal to the legitimate receivers and eavesdropper. On this basis, the analytical expressions for the outage probability (OP) and the intercept probability (IP) are derived. To gain more insights, the asymptotic analysis and diversity orders for the OP in the high signal-to-noise ratio (SNR) regime are carried out, and the asymptotic behaviors of the IP in the high main-to-eavesdropper ratio (MER) region are explored as well.
Numerical results show that: 1) RHIs, CEEs and ipSIC have negative effects on the OP but positive effects on the IP; 2) Compared with CEEs, RHIs have a more serious impact on the reliability and security of the considered system; 3) There exists a trade-off between reliability and security, and this trade-off can be optimized by reducing the power coefficient of the artificial noise or increasing the interfering factor of readers; 4) There are error floors for the OP due to the CEEs and the reflection coefficient; 5) As MER grows large, the security for Rn and Rf is improved, while the security for T is reduced.

Submitted 13 August, 2020; originally announced August 2020.

arXiv:2007.10001 [pdf, other]

Power Minimization for Multi-cell Uplink NOMA with Imperfect SIC

Authors: M. Zeng, W. Hao, O. A. Dobre, Z. Ding, H. V. Poor

Abstract: In this paper, we investigate a multi-cell uplink non-orthogonal multiple access (NOMA) system with imperfect successive interference cancellation (SIC). The objective of the formulated optimization problem is to minimize the total power consumption under users' quality-of-service constraints. The considered problem is first transformed into a linear programming problem, upon which centralized and distributed optimal solutions are proposed. Numerical results are presented to verify the performance of the proposed solutions and evaluate the impact of imperfect SIC on the system performance.

Submitted 20 July, 2020; originally announced July 2020.

Comments: accepted by IEEE WCL; NOMA, uplink, multi-cell, power minimization, imperfect SIC

arXiv:2006.05407 [pdf, other]

D-VPnet: A Network for Real-time Dominant Vanishing Point Detection in Natural Scenes

Authors: Yin-Bo Liu, Ming Zeng, Qing-Hao Meng

Abstract: As an important part of linear perspective, vanishing points (VPs) provide useful clues for mapping objects from 2D photos to 3D space. Existing methods are mainly focused on extracting structural features such as lines or contours and then clustering these features to detect VPs. However, these techniques suffer from ambiguous information due to the large number of line segments and contours detected in outdoor environments. In this paper, we present a new convolutional neural network (CNN) to detect dominant VPs in natural scenes, i.e., the Dominant Vanishing Point detection Network (D-VPnet). The key component of our method is the feature line-segment proposal unit (FLPU), which can be directly utilized to predict the location of the dominant VP. Moreover, the model also uses the two main parallel lines as an assistant to determine the position of the dominant VP. The proposed method was tested using a public dataset and a Parallel Line based Vanishing Point (PLVP) dataset. The experimental results suggest that the detection accuracy of our approach outperforms those of state-of-the-art methods under various conditions in real-time, achieving rates of 115 fps.

Submitted 9 June, 2020; originally announced June 2020.

Comments: 18 pages, 6 figures, under review

ACM Class: I.4.7

arXiv:2004.10791 [pdf, other]

doi 10.1109/LCOMM.2020.3025978

Sum Rate Maximization for IRS-assisted Uplink NOMA

Authors: M. Zeng, X. Li, G. Li, W. Hao, O. A. Dobre

Abstract: An intelligent reflecting surface (IRS) consists of a large number of low-cost reflecting elements, which can steer the incident signal collaboratively by passive beamforming. This way, IRS reconfigures the wireless environment to boost the system performance. In this paper, we consider an IRS-assisted uplink non-orthogonal multiple access (NOMA) system. The objective is to maximize the sum rate of all users under individual power constraint. The considered problem requires a joint power control at the users and beamforming design at the IRS, and is nonconvex. To handle it, semidefinite relaxation is employed, which provides a near-optimal solution. Presented numerical results show that the proposed NOMA-based scheme achieves a larger sum rate than orthogonal multiple access (OMA)-based one. Moreover, the impact of the number of reflecting elements on the sum rate is revealed.

Submitted 22 April, 2020; originally announced April 2020.

Comments: IEEE COMML, IRS, RIS, NOMA, sum rate, uplink

Journal ref: IEEE COMML 2020

arXiv:2002.04169 [pdf, other]

Edge Cache-assisted Secure Low-Latency Millimeter Wave Transmission

Authors: Wanming Hao, Ming Zeng, Gangcan Sun, Pei Xiao

Abstract: In this paper, we consider an edge cache-assisted millimeter wave cloud radio access network (C-RAN). Each remote radio head (RRH) in the C-RAN has a local cache, which can pre-fetch and store the files requested by the actuators. Multiple RRHs form a cluster to cooperatively serve the actuators, which acquire their required files either from the local caches or from the central processor via multicast fronthaul links. For such a scenario, we formulate a beamforming design problem to minimize the secure transmission delay under transmit power constraint of each RRH. Due to the difficulty of directly solving the formulated problem, we divide it into two independent ones: i) minimizing the fronthaul transmission delay by jointly optimizing the transmit and receive beamforming; ii) minimizing the maximum access transmission delay by jointly designing cooperative beamforming among RRHs. An alternatively iterative algorithm is proposed to solve the first optimization problem. For the latter, we first design the analog beamforming based on the channel state information of the actuators. Then, with the aid of successive convex approximation and $S$-procedure techniques, a semidefinite program (SDP) is formulated, and an iterative algorithm is proposed through SDP relaxation. Finally, simulation results are provided to verify the performance of the proposed schemes.

Submitted 10 February, 2020; originally announced February 2020.

Comments: IEEE_IoT, Accept

arXiv:1910.04841 [pdf, ps, other]

Dynamic Spectrum Sharing for Load Balancing in Multi-Cell Mobile Edge Computing

Authors: Ming Zeng, Viktoria Fodor

Abstract: Large-scale mobile edge computing (MEC) systems require scalable solutions to allocate communication and computing resources to the users. In this letter we address this challenge by applying dynamic spectrum sharing among the base stations (BSs), together with local resource allocation in the cells. We show that the network-wide resource allocation can be transformed into a convex optimization problem, and propose a distributed, hierarchical solution with limited information exchange among the BSs. Numerical results demonstrate that the proposed solution is superior to other baseline algorithms, when wireless and computing resource allocation is not jointly optimized, or the wireless resources allocated to the BSs are fixed.

Submitted 10 October, 2019; originally announced October 2019.

Comments: IEEE WCL

arXiv:1909.06218 [pdf, other]

Codebook-Based Max-Min Energy-Efficient Resource Allocation for Uplink mmWave MIMO-NOMA Systems

Authors: Wanming Hao, Ming Zeng, Gangcan Sun, Osamu Muta, Octavia A. Dobre, Shouyi Yang, Haris Gacanin

Abstract: In this paper, we investigate the energy-efficient resource allocation problem in an uplink non-orthogonal multiple access (NOMA) millimeter wave system, where the fully-connected-based sparse radio frequency chain antenna structure is applied at the base station (BS). To relieve the pilot overhead for channel estimation, we propose a codebook-based analog beam design scheme, which only requires to obtain the equivalent channel gain. On this basis, users belonging to the same analog beam are served via NOMA. Meanwhile, an advanced NOMA decoding scheme is proposed by exploiting the global information available at the BS. Under predefined minimum rate and maximum transmit power constraints for each user, we formulate a max-min user energy efficiency (EE) optimization problem by jointly optimizing the detection matrix at the BS and transmit power at the users. We first transform the original fractional objective function into a subtractive one. Then, we propose a two-loop iterative algorithm to solve the reformulated problem. Specifically, the inner loop updates the detection matrix and transmit power iteratively, while the outer loop adopts the bisection method. Meanwhile, to decrease the complexity of the inner loop, we propose a zero-forcing (ZF)-based iterative algorithm, where the detection matrix is designed via the ZF technique. Finally, simulation results show that the proposed schemes obtain a better performance in terms of spectral efficiency and EE than the conventional schemes.

Submitted 13 September, 2019; originally announced September 2019.

Comments: IEEE_T_COM, accepted

arXiv:1907.10001 [pdf]

Non-Orthogonal Multiple Access (NOMA): How It Meets 5G and Beyond

Authors: S. M. Riazul Islam, Ming Zeng, Octavia A. Dobre, Kyung-Sup Kwak

Abstract: Due to massive connectivity and increasing demands of various services and data-hungry applications, a full-scale implementation of the fifth generation (5G) wireless systems requires more effective radio access techniques. In this regard, non-orthogonal multiple access (NOMA) has recently gained ever-growing attention from both academia and industry. Compared to orthogonal multiple access (OMA) techniques, NOMA is superior in terms of spectral efficiency and is thus appropriate for 5G and Beyond. In this article, we provide an overview of NOMA principles and applications. Specifically, the article discusses the fundamentals of power-domain NOMA with single and multiple antennas in both uplink and downlink settings. In addition, the basic principles of code-domain NOMA are elaborated. Further, the article explains various resource allocation techniques such as user pairing and power allocation for NOMA systems; discusses the basic form of cooperative NOMA and its variants; and addresses several opportunities and challenges associated with the compatibility of NOMA with other advanced communication paradigms such as heterogeneous networks and millimeter wave communications.

Submitted 23 July, 2019; originally announced July 2019.

Comments: 38 pages, 9 figures, Wiley 5G Ref

arXiv:1905.02545 [pdf, other]

Meeting Transcription Using Virtual Microphone Arrays

Authors: Takuya Yoshioka, Zhuo Chen, Dimitrios Dimitriadis, William Hinthorn, Xuedong Huang, Andreas Stolcke, Michael Zeng

Abstract: We describe a system that generates speaker-annotated transcripts of meetings by using a virtual microphone array, a set of spatially distributed asynchronous recording devices such as laptops and mobile phones. The system is composed of continuous audio stream alignment, blind beamforming, speech recognition, speaker diarization using prior speaker information, and system combination. When utilizing seven input audio streams, our system achieves a word error rate (WER) of 22.3% and comes within 3% of the close-talking microphone WER on the non-overlapping speech segments. The speaker-attributed WER (SAWER) is 26.7%. The relative gains in SAWER over the single-device system are 14.8%, 20.3%, and 22.4% for three, five, and seven microphones, respectively. The presented system achieves a 13.6% diarization error rate when 10% of the speech duration contains more than one speaker. The contribution of each component to the overall performance is also investigated, and we validate the system with experiments on the NIST RT-07 conference meeting test set.

Submitted 7 July, 2019; v1 submitted 3 May, 2019; originally announced May 2019.

Report number: MSR-TR-2019-11

Showing 1–40 of 40 results for author: Zeng, M