Search | arXiv e-print repository

Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification

Authors: Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie

Abstract: Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2403.18621 [pdf, other]

doi 10.1109/TVT.2024.3420880

Performance Analysis of Integrated Sensing and Communication Networks with Blockage Effects

Authors: Zezhong Sun, Shi Yan, Ning Jiang, Jiaen Zhou, Mugen Peng

Abstract: Communication-sensing integration represents an up-and-coming area of research, enabling wireless networks to simultaneously perform communication and sensing tasks. However, in urban cellular networks, the blockage of buildings results in a complex signal propagation environment, affecting the performance analysis of integrated sensing and communication (ISAC) networks. To overcome this obstacle, this paper constructs a comprehensive framework considering building blockage and employs a distance-correlated blockage model to analyze interference from line of sight (LoS), non-line of sight (NLoS), and target reflection cascading (TRC) links. Using stochastic geometric theory, expressions for signal-to-interference-plus-noise ratio (SINR) and coverage probability for communication and sensing in the presence of blockage are derived, allowing for a comprehensive comparison under the same parameters. The research findings indicate that blockage can positively impact coverage, especially in enhancing communication performance. The analysis also suggests that there exists an optimal base station (BS) density when blockage is of the same order of magnitude as the BS density, maximizing communication or sensing coverage probability.

Submitted 2 July, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: This paper has been accepted by IEEE Transactions on Vehicular Technology

arXiv:2403.09536 [pdf]

Mixed Algorithm of SINDy and HAVOK for Measure-Based Analysis of Power System with Inverter-based Resources

Authors: Reza Saeed Kandezy, John Ning Jiang

Abstract: Artificial intelligence and machine learning are enhancing electric grids by offering data analysis tools that can be used to operate the power grid more reliably. However, the complex nonlinear dynamics, particularly when coupled with multi-scale interactions among inverter-based renewable energy resources, call for effective algorithms for power system applications. This paper presents an effective novel algorithm to detect various nonlinear dynamics, which is built upon the Sparse Identification of Nonlinear Dynamics (SINDy) method for nonlinear dynamics detection and the Hankel Alternative View of Koopman (HAVOK) method for multi-scale decomposition. We show that, by appropriately integrating the strengths of the two, the mixed algorithm not only detects the nonlinearity but also distinguishes the nonlinearity caused by coupled inverter-based resources from the more familiar ones caused by synchronous generators. This shows that the proposed algorithm can be a promising application of artificial intelligence and machine learning for measurement-based data analysis to support the operation of power systems with integrated renewables.

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2401.03697 [pdf, other]

An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge

Authors: Runduo Han, Xiaopeng Yan, Weiming Xu, Pengcheng Guo, Jiayao Sun, He Wang, Quan Lu, Ning Jiang, Lei Xie

Abstract: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. Specifically, our approach adopts different extraction strategies based on the audio quality, striking a balance between interference removal and speech preservation, which benefits the back-end automatic speech recognition (ASR) systems. Experiments show that our approach achieves a character error rate (CER) of 24.2% and 33.2% on the Dev and Eval sets, respectively, obtaining second place in the challenge.

Submitted 6 March, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024

arXiv:2312.09747 [pdf, other]

SELM: Speech Enhancement Using Discrete Tokens and Language Models

Authors: Ziqian Wang, Xinfa Zhu, Zihan Zhang, YuanJun Lv, Ning Jiang, Guoqing Zhao, Lei Xie

Abstract: Language models (LMs) have recently shown superior performance in various speech generation tasks, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech enhancement, harnessing semantic information holds potential advantages for speech enhancement tasks. In light of this, we propose SELM, a novel paradigm for speech enhancement, which integrates discrete tokens and leverages language models. SELM comprises three stages: encoding, modeling, and decoding. We transform continuous waveform signals into discrete tokens using pre-trained self-supervised learning (SSL) models and a k-means tokenizer. Language models then capture comprehensive contextual information within these tokens. Finally, a detokenizer and HiFi-GAN restore them into enhanced speech. Experimental results demonstrate that SELM achieves comparable performance in objective metrics alongside superior results in subjective perception. Our demos are available at https://honee-w.github.io/SELM/.

Submitted 7 January, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024

arXiv:2310.17101 [pdf, other]

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning

Authors: Xinfa Zhu, Yuke Li, Yi Lei, Ning Jiang, Guoqing Zhao, Lei Xie

Abstract: This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e. utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.

Submitted 25 April, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: 6 pages, 4 figures; Accepted by ICME 2024

arXiv:2310.14278 [pdf, other]

doi 10.1109/TASLP.2024.3389630

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Authors: Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie

Abstract: Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.

Submitted 27 April, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

Comments: TASLP

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

arXiv:2310.04760 [pdf, other]

Multi-objective Progressive Clustering for Semi-supervised Domain Adaptation in Speaker Verification

Authors: Ze Li, Yuke Lin, Ning Jiang, Xiaoyi Qin, Guoqing Zhao, Haiying Wu, Ming Li

Abstract: Utilizing the pseudo-labeling algorithm with large-scale unlabeled data becomes crucial for semi-supervised domain adaptation in speaker verification tasks. In this paper, we propose a novel pseudo-labeling method named Multi-objective Progressive Clustering (MoPC), specifically designed for semi-supervised domain adaptation. Firstly, we utilize limited labeled data from the target domain to derive domain-specific descriptors based on multiple distinct objectives, namely within-graph denoising, intra-class denoising and inter-class denoising. Then, the Infomap algorithm is adopted for embedding clustering, and the descriptors are leveraged to further refine the target domain's pseudo-labels. Moreover, to further improve the quality of pseudo-labels, we introduce the subcenter-purification and progressive-merging strategy for label denoising. Our proposed MoPC method achieves 4.95% EER and ranked first on the evaluation set of VoxSRC 2023 Track 3. We also conduct additional experiments on the FFSVC dataset and obtain promising results.

Submitted 7 October, 2023; originally announced October 2023.

arXiv:2309.14125 [pdf]

doi 10.1016/j.apenergy.2024.123122

Driving behavior-guided battery health monitoring for electric vehicles using machine learning

Authors: Nanhua Jiang, Jiawei Zhang, Weiran Jiang, Yao Ren, Jing Lin, Edwin Khoo, Ziyou Song

Abstract: An accurate estimation of the state of health (SOH) of batteries is critical to ensuring the safe and reliable operation of electric vehicles (EVs). Feature-based machine learning methods have exhibited enormous potential for rapidly and precisely monitoring battery health status. However, simultaneously using various health indicators (HIs) may weaken estimation performance due to feature redundancy. Furthermore, ignoring real-world driving behaviors can lead to inaccurate estimation results as some features are rarely accessible in practical scenarios. To address these issues, we proposed a feature-based machine learning pipeline for reliable battery health monitoring, enabled by evaluating the acquisition probability of features under real-world driving conditions. We first summarized and analyzed various individual HIs with mechanism-related interpretations, which provide insightful guidance on how these features relate to battery degradation modes. Moreover, all features were carefully evaluated and screened based on estimation accuracy and correlation analysis on three public battery degradation datasets. Finally, the scenario-based feature fusion and acquisition probability-based practicality evaluation method construct a useful tool for feature extraction with consideration of driving behaviors. This work highlights the importance of balancing the performance and practicality of HIs during the development of feature-based battery health monitoring algorithms.

Submitted 25 September, 2023; originally announced September 2023.

Journal ref: Applied Energy (2024)

arXiv:2309.14109 [pdf, other]

Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification

Authors: Yuke Lin, Xiaoyi Qin, Ning Jiang, Guoqing Zhao, Ming Li

Abstract: It is widely acknowledged that discriminative representation for speaker verification can be extracted from verbal speech. However, how much speaker information non-verbal vocalization carries is still a puzzle. This paper explores speaker verification based on the most ubiquitous form of non-verbal voice, laughter. First, we use a semi-automatic pipeline to collect a new Haha-Pod dataset from open-source podcast media. The dataset contains over 240 speakers' laughter clips with corresponding high-quality verbal speech. Second, we propose a Two-Stage Teacher-Student (2S-TS) framework to minimize the within-speaker embedding distance between verbal and non-verbal (laughter) signals. Considering Haha-Pod as a test set, two trials (S2L-Eval) are designed to verify the speaker's identity through laugh sounds. Experimental results demonstrate that our method can significantly improve the performance on the S2L-Eval test set with only a minor degradation on the VoxCeleb1 test set. The resources for the Haha-Pod dataset can be found at https://github.com/nevermoreLin/HahaPod.

Submitted 9 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: accepted by ASRU 2023

arXiv:2308.08766 [pdf, other]

The DKU-MSXF Speaker Verification System for the VoxCeleb Speaker Recognition Challenge 2023

Authors: Ze Li, Yuke Lin, Xiaoyi Qin, Ning Jiang, Guoqing Zhao, Ming Li

Abstract: This paper is the system description of the DKU-MSXF System for Track 1, Track 2 and Track 3 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). For Track 1, we utilize a network structure based on ResNet for training. By constructing a cross-age QMF training set, we achieve a substantial improvement in system performance. For Track 2, we inherit the pre-trained model from Track 1 and conduct mixed training by incorporating the VoxBlink-clean dataset. In comparison to Track 1, the models incorporating VoxBlink-clean data exhibit a relative performance improvement of more than 10%. For Track 3, the semi-supervised domain adaptation task, a novel pseudo-labeling method based on triple thresholds and sub-center purification is adopted to perform domain adaptation. The final submission achieves mDCF of 0.1243 in Track 1, mDCF of 0.1165 in Track 2 and EER of 4.952% in Track 3.

Submitted 16 August, 2023; originally announced August 2023.

Comments: arXiv admin note: text overlap with arXiv:2210.05092

arXiv:2308.07595 [pdf, other]

The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

Authors: Ming Cheng, Weiqing Wang, Xiaoyi Qin, Yuke Lin, Ning Jiang, Guoqing Zhao, Ming Li

Abstract: This paper describes the DKU-MSXF submission to track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our system pipeline contains voice activity detection, clustering-based diarization, overlapped speech detection, and target-speaker voice activity detection, where each procedure has a fused output from 3 sub-models. Finally, we fuse different clustering-based and TSVAD-based diarization systems using DOVER-Lap and achieve a 4.30% diarization error rate (DER), which ranks first on track 4 of the challenge leaderboard.

Submitted 16 August, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

arXiv:2308.07056 [pdf, other]

VoxBlink: A Large Scale Speaker Verification Dataset on Camera

Authors: Yuke Lin, Xiaoyi Qin, Guoqing Zhao, Ming Cheng, Ning Jiang, Haiyang Wu, Ming Li

Abstract: In this paper, we introduce a large-scale and high-quality audio-visual speaker verification dataset, named VoxBlink. We propose an innovative and robust automatic audio-visual data mining pipeline to curate this dataset, which contains 1.45M utterances from 38K speakers. Due to the inherent nature of automated data collection, introducing noisy data is inevitable. Therefore, we also utilize a multi-modal purification step to generate a cleaner version of VoxBlink, named VoxBlink-clean, comprising 18K identities and 1.02M utterances. In contrast to VoxCeleb, VoxBlink is sourced from short videos of ordinary users, and the covered scenarios better align with real-life situations. To the best of our knowledge, the VoxBlink dataset is one of the largest publicly available speaker verification datasets. Leveraging the VoxCeleb and VoxBlink-clean datasets together, we employ diverse speaker verification models with multiple architectural backbones to conduct comprehensive evaluations on the VoxCeleb test sets. Experimental results indicate a substantial relative performance improvement, ranging from 12% to 30%, across various backbone architectures upon incorporating VoxBlink-clean into the training process. The details of the dataset can be found at http://voxblink.github.io

Submitted 12 December, 2023; v1 submitted 14 August, 2023; originally announced August 2023.

Comments: Accepted by ICASSP 2024

arXiv:2307.04630 [pdf, other]

The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Authors: Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, Guoqing Zhao

Abstract: This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task, which aims to translate multi-source English speech into Chinese speech. The system is built in a cascaded manner consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Specifically, to improve the robustness to multi-source speech input, we adopt various data augmentation strategies and a ROVER-based score fusion on multiple ASR model outputs. To better handle the noisy ASR transcripts, we introduce a three-stage fine-tuning strategy to improve translation accuracy. Finally, we build a TTS model with high naturalness and sound quality, which leverages a two-stage framework, using network bottleneck features as a robust intermediate representation for speaker timbre and linguistic content disentanglement. Based on the two-stage framework, pre-trained speaker embedding is leveraged as a condition to transfer the speaker timbre in the source English speech to the translated Chinese speech. Experimental results show that our system has high translation accuracy, speech naturalness, sound quality, and speaker similarity. Moreover, it shows good robustness to multi-source data.

Submitted 10 July, 2023; originally announced July 2023.

Comments: IWSLT@ACL 2023 system paper. Our submitted system ranks 1st in the S2ST task of the IWSLT 2023 evaluation campaign

arXiv:2307.04133 [pdf, other]

Ultrasonic Image's Annotation Removal: A Self-supervised Noise2Noise Approach

Authors: Yuanheng Zhang, Nan Jiang, Zhaoheng Xie, Junying Cao, Yueyang Teng

Abstract: Accurately annotated ultrasonic images are vital components of a high-quality medical report. Hospitals often have strict guidelines on the types of annotations that should appear on imaging results. However, manually inspecting these images can be a cumbersome task. While a neural network could potentially automate the process, training such a model typically requires a dataset of paired input and target images, which in turn involves significant human labour. This study introduces an automated approach for detecting annotations in images. This is achieved by treating the annotations as noise, creating a self-supervised pretext task and using a model trained under the Noise2Noise scheme to restore the image to a clean state. We tested a variety of model structures on the denoising task against different types of annotation, including body marker annotation, radial line annotation, etc. Our results demonstrate that most models trained under the Noise2Noise scheme outperformed their counterparts trained with noisy-clean data pairs. The customized U-Net yielded the best outcome on the body marker annotation dataset, with high scores on segmentation precision and reconstruction similarity. We released our code at https://github.com/GrandArth/UltrasonicImage-N2N-Approach.

Submitted 9 July, 2023; originally announced July 2023.

Comments: 10 pages, 7 figures

arXiv:2306.05297 [pdf]

Connectional-Style-Guided Contextual Representation Learning for Brain Disease Diagnosis

Authors: Gongshu Wang, Ning Jiang, Yunxiao Ma, Tiantian Liu, Duanduan Chen, Jinglong Wu, Guoqi Li, Dong Liang, Tianyi Yan

Abstract: Structural magnetic resonance imaging (sMRI) has shown great clinical value and has been widely used in deep learning (DL) based computer-aided brain disease diagnosis. Previous approaches focused on local shapes and textures in sMRI that may be significant only within a particular domain. The learned representations are likely to contain spurious information and have a poor generalization ability in other diseases and datasets. To facilitate capturing meaningful and robust features, it is necessary to first comprehensively understand the intrinsic pattern of the brain that is not restricted within a single data/task domain. Considering that the brain is a complex connectome of interlinked neurons, the connectional properties in the brain have strong biological significance, which is shared across multiple domains and covers most pathological information. In this work, we propose a connectional style contextual representation learning model (CS-CRL) to capture the intrinsic pattern of the brain, used for multiple brain disease diagnosis. Specifically, it has a vision transformer (ViT) encoder and leverages mask reconstruction as the proxy task and Gram matrices to guide the representation of connectional information. It facilitates the capture of global context and the aggregation of features with biological plausibility. The results indicate that CS-CRL achieves superior accuracy in multiple brain disease diagnosis tasks across six datasets and three diseases and outperforms state-of-the-art models. Furthermore, we demonstrate that CS-CRL captures more brain-network-like properties, better aggregates features, is easier to optimize and is more robust to noise, which explains its superiority in theory. Our source code will be released soon.

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2302.11224 [pdf, other]

MADI: Inter-domain Matching and Intra-domain Discrimination for Cross-domain Speech Recognition

Authors: Jiaming Zhou, Shiwan Zhao, Ning Jiang, Guoqing Zhao, Yong Qin

Abstract: End-to-end automatic speech recognition (ASR) usually suffers from performance degradation when applied to a new domain due to domain shift. Unsupervised domain adaptation (UDA) aims to improve the performance on the unlabeled target domain by transferring knowledge from the source to the target domain. To improve transferability, existing UDA approaches mainly focus on matching the distributions of the source and target domains globally and/or locally, while ignoring the model discriminability. In this paper, we propose a novel UDA approach for ASR via inter-domain MAtching and intra-domain DIscrimination (MADI), which improves the model transferability by fine-grained inter-domain matching and discriminability by intra-domain contrastive discrimination simultaneously. Evaluations on the Libri-Adapt dataset demonstrate the effectiveness of our approach. MADI achieves relative word error rate (WER) reductions of 17.7% and 22.8% on cross-device and cross-environment ASR, respectively.

Submitted 22 February, 2023; originally announced February 2023.

Comments: Accepted to ICASSP 2023

arXiv:2212.09337 [pdf, other]

doi 10.1109/LSP.2023.3266115

Information Bottleneck-Inspired Type Based Multiple Access for Remote Estimation in IoT Systems

Authors: Meiyi Zhu, Chunyan Feng, Caili Guo, Nan Jiang, Osvaldo Simeone

Abstract: Type-based multiple access (TBMA) is a semantics-aware multiple access protocol for remote inference. In TBMA, codewords are reused across transmitting sensors, with each codeword being assigned to a different observation value. Existing TBMA protocols are based on fixed shared codebooks and on conventional maximum-likelihood or Bayesian decoders, which require knowledge of the distributions of observations and channels. In this letter, we propose a novel design principle for TBMA based on the information bottleneck (IB). In the proposed IB-TBMA protocol, the shared codebook is jointly optimized with a decoder based on artificial neural networks (ANNs), so as to adapt to source, observations, and channel statistics based on data only. We also introduce the Compressed IB-TBMA (CIB-TBMA) protocol, which improves IB-TBMA by enabling a reduction in the number of codewords via an IB-inspired clustering phase. Numerical results demonstrate the importance of a joint design of codebook and neural decoder, and validate the benefits of codebook compression.

Submitted 5 April, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: 5 pages, 3 figures, accepted by IEEE Signal Processing Letters (SPL)

arXiv:2210.17349 [pdf, other]

Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

Authors: Kun Song, Jian Cong, Xinsheng Wang, Yongmao Zhang, Lei Xie, Ning Jiang, Haiying Wu

Abstract: In the current two-stage neural text-to-speech (TTS) paradigm, it is ideal to have a universal neural vocoder that, once trained, is robust to imperfect mel-spectrograms predicted from the acoustic model. To this end, we propose the Robust MelGAN vocoder by solving the original multi-band MelGAN's metallic sound problem and increasing its generalization ability. Specifically, we introduce a fine-grained network dropout strategy to the generator. With a specifically designed over-smooth handler which separates the speech signal into periodic and aperiodic components, we only perform network dropout on the aperiodic components, which alleviates metallic sounding and maintains good speaker similarity. To further improve generalization ability, we introduce several data augmentation methods to augment fake data in the discriminator, including harmonic shift, harmonic noise and phase noise. Experiments show that Robust MelGAN can be used as a universal vocoder, significantly improving sound quality in TTS systems built on various types of data.

Submitted 2 November, 2022; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: Accepted by ISCSLP 2022

arXiv:2207.00883 [pdf, other]

Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Authors: Kun Wei, Pengcheng Guo, Ning Jiang

Abstract: Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and even shown superior performance over the conventional hybrid framework. The main idea of Transformers is to capture the long-range global context within an utterance by self-attention layers. However, for scenarios like conversational speech, such utterance-level modeling will neglect contextual dependencies that span across utterances. In this paper, we propose to explicitly model the inter-sentential information in a Transformer-based end-to-end architecture for conversational speech recognition. Specifically, for the encoder network, we capture the contexts of previous speech and incorporate such historic information into the current input by a context-aware residual attention mechanism. For the decoder, the prediction of the current utterance is also conditioned on the historic linguistic information through a conditional decoder framework. We show the effectiveness of our proposed method on several open-source dialogue corpora, and the proposed method consistently improves performance over the utterance-level Transformer-based ASR models.

Submitted 2 July, 2022; originally announced July 2022.

Comments: Accepted by Interspeech2022

arXiv:2204.08910 [pdf, other]

Adaptable Semantic Compression and Resource Allocation for Task-Oriented Communications

Authors: Chuanhong Liu, Caili Guo, Yang Yang, Nan Jiang

Abstract: Task-oriented communication is a new paradigm that aims at providing efficient connectivity for accomplishing intelligent tasks rather than the reception of every transmitted bit. In this paper, a deep learning-based task-oriented communication architecture is proposed where the user extracts, compresses and transmits semantics in an end-to-end (E2E) manner. Furthermore, an approach is proposed to compress the semantics according to their importance relevant to the task, namely, adaptable semantic compression (ASC). Assuming a delay-intolerant system, supporting multiple users indicates a problem that executing with the higher compression ratio requires fewer channel resources but leads to the distortion of semantics, while executing with the lower compression ratio requires more channel resources and thus may lead to a transmission failure due to delay constraint. To solve the problem, both compression ratio and resource allocation are optimized for the task-oriented communication system to maximize the success probability of tasks. Specifically, due to the nonconvexity of the problem, we propose a compression ratio and resource allocation (CRRA) algorithm by separating the problem into two subproblems and solving iteratively to obtain the convergent solution. Furthermore, considering the scenarios where users have various service levels, a compression ratio, resource allocation, and user selection (CRRAUS) algorithm is proposed to deal with the problem. In CRRAUS, users are adaptively selected to complete the corresponding intelligent tasks based on branch and bound method at the expense of higher algorithm complexity compared with CRRA. Simulation results show that the proposed CRRA and CRRAUS algorithms can obtain at least 15% and 10% success gains over baseline algorithms, respectively.

Submitted 19 April, 2022; originally announced April 2022.

arXiv:2201.01051 [pdf]

doi 10.1038/s41597-022-01836-y

Open Access Dataset for Electromyography based Multi-code Biometric Authentication

Authors: Ashirbad Pradhan, Jiayuan He, Ning Jiang

Abstract: Recently, surface electromyogram (EMG) has been proposed as a novel biometric trait for addressing some key limitations of current biometrics, such as spoofing and liveness. The EMG signals possess a unique characteristic: they are inherently different for individuals (biometrics), and they can be customized to realize multi-length codes or passwords (for example, by performing different gestures). However, current EMG-based biometric research has two critical limitations: 1) a small subject pool, compared to other more established biometric traits, and 2) limited to single-session or single-day data sets. In this study, forearm and wrist EMG data were collected from 43 participants over three different days with long separation while they performed static hand and wrist gestures. The multi-day biometric authentication resulted in a median EER of 0.017 for the forearm setup and 0.025 for the wrist setup, comparable to well-established biometric traits suggesting consistent performance over multiple days. The presented large-sample multi-day data set and findings could facilitate further research on EMG-based biometrics and other gesture recognition-based applications.

Submitted 5 January, 2022; v1 submitted 4 January, 2022; originally announced January 2022.

Comments: manuscript for open access dataset (paper and appendix)

Journal ref: Sci Data 9, 733 (2022)

arXiv:2104.13873 [pdf, other]

Evaluating the Performance of Over-the-Air Time Synchronization for 5G and TSN Integration

Authors: Haochuan Shi, Adnan Aijaz, Nan Jiang

Abstract: The IEEE 802.1 time-sensitive networking (TSN) standards aim at improving the real-time capabilities of standard Ethernet. TSN is widely recognized as the long-term replacement of proprietary technologies for industrial control systems. However, wired connectivity alone is not sufficient to meet the requirements of future industrial systems. The fifth-generation (5G) mobile/cellular technology has been designed with native support for ultra-reliable low-latency communication (uRLLC). 5G is promising to meet the stringent requirements of industrial systems in the wireless domain. Converged operation of 5G and TSN systems is crucial for achieving end-to-end deterministic connectivity in industrial networks. Accurate time synchronization is key to integrated operation of 5G and TSN systems. To this end, this paper evaluates the performance of over-the-air time synchronization mechanism which has been proposed in 3GPP Release 16. We analyze the accuracy of time synchronization through the boundary clock approach in the presence of clock drift and different air-interface timing errors related to reference time indication. We also investigate frequency and scalability aspects of over-the-air time synchronization. Our performance evaluation reveals the conditions under which 1 $μ$s or below requirement for TSN time synchronization can be achieved.

Submitted 28 April, 2021; originally announced April 2021.

Comments: accepted for IEEE BlackSeaCom 2021

arXiv:2103.15295 [pdf, other]

Best-Buddy GANs for Highly Detailed Image Super-Resolution

Authors: Wenbo Li, Kun Zhou, Lu Qi, Liying Lu, Nianjuan Jiang, Jiangbo Lu, Jiaya Jia

Abstract: We consider the single image super-resolution (SISR) problem, where a high-resolution (HR) image is generated based on a low-resolution (LR) input. Recently, generative adversarial networks (GANs) have become popular to hallucinate details. Most methods along this line rely on a predefined single-LR-single-HR mapping, which is not flexible enough for the SISR task. Also, GAN-generated fake details may often undermine the realism of the whole image. We address these issues by proposing best-buddy GANs (Beby-GAN) for rich-detail SISR. Relaxing the immutable one-to-one constraint, we allow the estimated patches to dynamically seek the best supervision during training, which is beneficial to producing more reasonable details. Besides, we propose a region-aware adversarial learning strategy that directs our model to focus on generating details for textured areas adaptively. Extensive experiments justify the effectiveness of our method. An ultra-high-resolution 4K dataset is also constructed to facilitate future super-resolution research.

Submitted 27 December, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

arXiv:2103.06015 [pdf]

Performance Optimization of Surface Electromyography (sEMG) based Biometric Sensing System for both Verification and Identification

Authors: Ashirbad Pradhan, Jiayuan He, Ning Jiang

Abstract: Recently, surface electromyography (sEMG) emerged as a novel biometric authentication method. Since EMG system parameters, such as the feature extraction methods and the number of channels, have been known to affect system performances, it is important to investigate these effects on the performance of the sEMG-based biometric system to determine optimal system parameters. In this study, three robust feature extraction methods, Time-domain (TD) feature, Frequency Division Technique (FDT), and Autoregressive (AR) feature, and their combinations were investigated while the number of channels varied from one to eight. For these system parameters, the performance of sixteen static wrist and hand gestures was systematically investigated in two authentication modes: verification and identification. The results from 24 participants showed that the TD features significantly (p<0.05) and consistently outperformed FDT and AR features for all channel numbers. The results also showed that the performance of a four-channel setup was not significantly different from those with higher number of channels. The average equal error rate (EER) for a four-channel sEMG verification system was 4% for TD features, 5.3% for FDT features, and 10% for AR features. For an identification system, the average Rank-1 error (R1E) for a four-channel configuration was 3% for TD features, 12.4% for FDT features, and 36.3% for AR features. The electrode position on the flexor carpi ulnaris (FCU) muscle had a critical contribution to the authentication performance. Thus, the combination of the TD feature set and a four-channel sEMG system with one of the electrodes positioned on the FCU is recommended for optimal authentication performance.

Submitted 10 March, 2021; originally announced March 2021.

Comments: 12 pages, 6 figures, and one table

arXiv:2009.06782 [pdf, other]

Analysis of Random Access in NB-IoT Networks with Three Coverage Enhancement Groups: A Stochastic Geometry Approach

Authors: Yan Liu, Yansha Deng, Nan Jiang, Maged Elkashlan, Arumugam Nallanathan

Abstract: NarrowBand-Internet of Things (NB-IoT) is a new 3GPP radio access technology designed to provide better coverage for Low Power Wide Area (LPWA) networks. To provide reliable connections with extended coverage, a repetition transmission scheme and up to three Coverage Enhancement (CE) groups are introduced into NB-IoT during both Random Access CHannel (RACH) procedure and data transmission procedure, where each CE group is configured with different repetition values and transmission resources. To characterize the RACH performance of the NB-IoT network with three CE groups, this paper develops a novel traffic-aware spatio-temporal model to analyze the RACH success probability, where both the preamble transmission outage and the collision events of each CE group jointly determine the traffic evolution and the RACH success probability. Based on this analytical model, we derive the analytical expression for the RACH success probability of a randomly chosen IoT device in each CE group over multiple time slots with different RACH schemes, including baseline, back-off (BO), access class barring (ACB), and hybrid ACB and BO schemes (ACB&BO). Our results have shown that the RACH success probabilities of the devices in three CE groups outperform that of a single CE group network but not for all the groups, which is affected by the choice of the categorizing parameters. This mathematical model and analytical framework can be applied to evaluate the performance of multiple group users of other networks with spatial separations.

Submitted 14 September, 2020; originally announced September 2020.

Comments: 15 pages, 8 figures. Accepted in IEEE TWC

arXiv:2005.01092 [pdf, other]

A Decoupled Learning Strategy for Massive Access Optimization in Cellular IoT Networks

Authors: Nan Jiang, Yansha Deng, Arumugam Nallanathan, Jinghong Yuan

Abstract: Cellular-based networks are expected to offer connectivity for massive Internet of Things (mIoT) systems. However, their Random Access CHannel (RACH) procedure suffers from unreliability, due to the collision from the simultaneous massive access. Although this collision problem has been treated in existing RACH schemes, these schemes usually organize IoT devices' transmission and re-transmission along with fixed parameters, and thus can hardly adapt to time-varying traffic patterns. Without adaptation, the RACH procedure easily suffers from high access delay, high energy consumption, or even access unavailability. With the goal of improving the RACH procedure, this paper aims to optimize the RACH procedure in real-time by maximizing a long-term hybrid multi-objective function, which consists of the number of access success devices, the average energy consumption, and the average access delay. To do so, we first optimize the long-term objective in the number of access success devices by using Deep Reinforcement Learning (DRL) algorithms for different RACH schemes, including Access Class Barring (ACB), Back-Off (BO), and Distributed Queuing (DQ). The converging capability and efficiency of different DRL algorithms including Policy Gradient (PG), Actor-Critic (AC), Deep Q-Network (DQN), and Deep Deterministic Policy Gradients (DDPG) are compared. Inspired by the results from this comparison, a decoupled learning strategy is developed to jointly and dynamically adapt the access control factors of those three access schemes. This decoupled strategy first leverages a Recurrent Neural Network (RNN) model to predict the real-time traffic values of the network environment, and then uses multiple DRL agents to cooperatively configure parameters of each RACH scheme.

Submitted 3 May, 2020; originally announced May 2020.

arXiv:2004.04979 [pdf, other]

Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos

Authors: Jiawei Liu, Zheng-Jun Zha, Xierong Zhu, Na Jiang

Abstract: Person re-identification aims at identifying a certain pedestrian across non-overlapping camera networks. Video-based re-identification approaches have gained significant attention recently, expanding image-based approaches by learning features from multiple frames. In this work, we propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos. It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions, towards learning discriminative pedestrian representation. Specifically, multiple co-saliency learning modules within CSTNet are designed to utilize the correlated information across video frames to extract the salient features from the task-relevant regions and suppress background interference. Moreover, multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies on such features and spatial-temporal information correlation, to enhance feature representation. Extensive experiments on two benchmarks have demonstrated the effectiveness of the proposed method.

Submitted 11 May, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

arXiv:2002.07759 [pdf, other]

Traffic Prediction and Random Access Control Optimization: Learning and Non-learning based Approaches

Authors: Nan Jiang, Yansha Deng, Arumugam Nallanathan

Abstract: Random access schemes in modern wireless communications are generally based on the framed-ALOHA (f-ALOHA), which can be optimized by flexibly organizing devices' transmission and re-transmission. However, this optimization is generally intractable due to the lack of information about complex traffic generation statistics and the occurrence of the random collision. In this article, we first summarize the general structure of access control optimization for different random access schemes, and then review the existing access control optimization based on Machine Learning (ML) and non-ML techniques. We demonstrate that the ML-based methods can better optimize the access control problem compared with non-ML based methods, due to their capability in solving high complexity long-term optimization problem and learning experiential knowledge from reality. To further improve the random access performance, we propose two-step learning optimizers for access control optimization, which individually execute the traffic prediction and the access control configuration. In detail, our traffic prediction method relies on online supervised learning adopting Recurrent Neural Networks (RNNs) that can accurately capture traffic statistics over consecutive frames, and the access control configuration can use either a non-ML based controller or a cooperatively trained Deep Reinforcement Learning (DRL) based controller depending on the complexity of different random access schemes. Numerical results show that the proposed two-step cooperative learning optimizer considerably outperforms the conventional Deep Q-Network (DQN) in terms of higher training efficiency and better access performance.

Submitted 18 February, 2020; originally announced February 2020.

arXiv:2002.03082 [pdf, other]

RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning

Authors: Nan Jiang, Sheng Jin, Zhiyao Duan, Changshui Zhang

Abstract: This paper presents a deep reinforcement learning algorithm for online accompaniment generation, with potential for real-time interactive human-machine duet improvisation. Different from offline music generation and harmonization, online music accompaniment requires the algorithm to respond to human input and generate the machine counterpart in a sequential order. We cast this as a reinforcement learning problem, where the generation agent learns a policy to generate a musical note (action) based on previously generated context (state). The key of this algorithm is the well-functioning reward model. Instead of defining it using music composition rules, we learn this model from monophonic and polyphonic training data. This model considers the compatibility of the machine-generated note with both the machine-generated context and the human-generated context. Experiments show that this algorithm is able to respond to the human part and generate a melodic, harmonic and diverse machine part. Subjective evaluations on preferences show that the proposed algorithm generates music pieces of higher quality than the baseline method.

Submitted 7 February, 2020; originally announced February 2020.

arXiv:1907.11064 [pdf, other]

Online Supervised Learning for Traffic Load Prediction in Framed-ALOHA Networks

Authors: Nan Jiang, Yansha Deng, Osvaldo Simeone, Arumugam Nallanathan

Abstract: Predicting the current backlog, or traffic load, in framed-ALOHA networks enables the optimization of resource allocation, e.g., of the frame size. However, this prediction is made difficult by the lack of information about the cardinality of collisions and by possibly complex packet generation statistics. Assuming no prior information about the traffic model, apart from a bound on its temporal memory, this paper develops an online learning-based adaptive traffic load prediction method that is based on Recurrent Neural Networks (RNN) and specifically on the Long Short-Term Memory (LSTM) architecture. In order to enable online training in the absence of feedback on the exact cardinality of collisions, the proposed strategy leverages a novel approximate labeling technique that is inspired by Method of Moments (MOM) estimators. Numerical results show that the proposed online predictor considerably outperforms conventional methods and is able to adapt to changing traffic statistics.

Submitted 25 July, 2019; originally announced July 2019.

arXiv:1905.13161 [pdf]

doi 10.1088/1741-2552/ab85b2

Simultaneous induction of SSMVEP and SMR Using a Gaiting video stimulus: a novel hybrid brain-computer interface

Authors: Xin Zhang, Guanghua Xu, Aravind Ravi, Sarah Pearce, Ning Jiang

Abstract: We proposed a novel visual stimulus for brain-computer interface. The stimulus is in the form of a gaiting sequence of a human. The hypothesis is that observing such a visual stimulus would simultaneously induce 1) steady-state motion visual evoked potential (SSMVEP) in the occipital area, similarly to an SSVEP stimulus; and 2) sensorimotor rhythm (SMR) in the primary sensorimotor area, because such action observation (AO) could activate the mirror neuron system. Canonical correlation analysis (CCA) was used to detect SSMVEP from occipital EEG, and event-related spectral perturbations (ERSP) were used to identify SMR in the EEG from the sensorimotor area. The results showed that the proposed visual gaiting stimulus induced SSMVEP, with classification accuracies of 88.9 $\pm$ 12.0% in a four-class scenario. More importantly, it induced clear and sustained event-related desynchronization/synchronization (ERD/ERS) in the EEG from the sensorimotor area, while no ERD/ERS in the sensorimotor area could be observed when the other two SSVEP stimuli were used. Further, for participants with sufficiently clear SSMVEP pattern (classification accuracy > 85%), the ERD index values in the mu-beta band induced by the proposed gaiting stimulus were statistically different from those of the other two types of stimulus. Therefore, a novel BCI based on the proposed stimulus has potential in neurorehabilitation applications because it simultaneously has the high accuracy of an SSMVEP (~90% accuracy in a four-class setup) and the ability to activate the sensorimotor cortex. This potential will be further explored in future studies.

Submitted 30 May, 2019; originally announced May 2019.

Comments: 22 pages, 7 figures and 2 tables

Journal ref: Journal of Neural Engineering, Mar. 2020

arXiv:1511.03722 [pdf, other]

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

Authors: Nan Jiang, Lihong Li

Abstract: We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. This problem is often a critical step when applying RL in real-world problems. Despite its importance, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators. We demonstrate the estimator's accuracy in several benchmark problems, and illustrate its use as a subroutine in safe policy improvement. We also provide theoretical results on the hardness of the problem, and show that our estimator can match the lower bound in certain scenarios.

Submitted 26 May, 2016; v1 submitted 11 November, 2015; originally announced November 2015.

Comments: 14 pages; 4 figures; ICML 2016

Showing 1–33 of 33 results for author: Jiang, N