
doi 10.1109/TMI.2024.3435450

Prototype Learning Guided Hybrid Network for Breast Tumor Segmentation in DCE-MRI

Authors: Lei Zhou, Yuzhong Zhang, Jiadong Zhang, Xuejun Qian, Chen Gong, Kun Sun, Zhongxiang Ding, Xing Wang, Zhenhui Li, Zaiyi Liu, Dinggang Shen

Abstract: Automated breast tumor segmentation on the basis of dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) has shown great promise in clinical practice, particularly for identifying the presence of breast disease. However, accurate segmentation of breast tumors is a challenging task, often necessitating the development of complex networks. To strike an optimal trade-off between computational costs and segmentation performance, we propose a hybrid network via the combination of convolutional neural network (CNN) and transformer layers. Specifically, the hybrid network consists of an encoder-decoder architecture built by stacking convolution and deconvolution layers. Effective 3D transformer layers are then implemented after the encoder subnetworks, to capture global dependencies between the bottleneck features. To improve the efficiency of the hybrid network, two parallel encoder subnetworks are designed for the decoder and the transformer layers, respectively. To further enhance the discriminative capability of the hybrid network, a prototype learning guided prediction module is proposed, where the category-specified prototypical features are calculated through on-line clustering. All learned prototypical features are finally combined with the features from the decoder for tumor mask prediction. The experimental results on private and public DCE-MRI datasets demonstrate that the proposed hybrid network outperforms the state-of-the-art (SOTA) methods, while maintaining a balance between segmentation accuracy and computation cost. Moreover, we demonstrate that automatically generated tumor masks can be effectively applied to distinguish the HER2-positive subtype from the HER2-negative subtype with accuracy similar to that of the analysis based on manual tumor segmentation. The source code is available at https://github.com/ZhouL-lab/PLHN.

Submitted 11 August, 2024; originally announced August 2024.

Journal ref: 2024,IEEE Transactions on Medical Imaging

arXiv:2408.05758 [pdf, other]

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Authors: Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, Jianhua Tao

Abstract: Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses the cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. The VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25 Hz from 24 kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
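The 960-fold figure quoted above follows directly from the two rates. A quick check, with variable names that are purely illustrative and not taken from the paper:

```python
# Sanity check of the compression ratio quoted in the abstract.
# Variable names are illustrative only.
input_sample_rate_hz = 24_000  # 24 kHz input waveform
token_rate_hz = 25             # 25 Hz frame-level representation

reduction_factor = input_sample_rate_hz // token_rate_hz
print(reduction_factor)
```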

Submitted 11 August, 2024; originally announced August 2024.

arXiv:2406.08911 [pdf, other]

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

Authors: Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

Abstract: Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2405.18739 [pdf, other]

FlocOff: Data Heterogeneity Resilient Federated Learning with Communication-Efficient Edge Offloading

Authors: Mulei Ma, Chenyu Gong, Liekang Zeng, Yang Yang, Liantao Wu

Abstract: Federated Learning (FL) has emerged as a fundamental learning paradigm to harness massive data scattered at geo-distributed edge devices in a privacy-preserving way. Given the heterogeneous deployment of edge devices, however, their data are usually Non-IID, introducing significant challenges to FL including degraded training accuracy, intensive communication costs, and high computing complexity. To tackle these challenges, traditional approaches typically utilize adaptive mechanisms, which may suffer from scalability issues, increased computational overhead, and limited adaptability to diverse edge environments. This paper instead leverages the observation that computation offloading involves inherent functionalities, such as node matching and service correlation, that can achieve data reshaping, and proposes the Federated learning based on computing Offloading (FlocOff) framework to address data heterogeneity and resource-constrained challenges. Specifically, FlocOff formulates the FL process with Non-IID data in edge scenarios and derives a rigorous analysis of the impact of imbalanced data distribution. Based on this, FlocOff decouples the optimization into two steps, namely: (1) Minimizing the Kullback-Leibler (KL) divergence via Computation Offloading scheduling (MKL-CO); (2) Minimizing the Communication Cost through Resource Allocation (MCC-RA). Extensive experimental results demonstrate that the proposed FlocOff effectively improves model convergence and accuracy by 14.3%-32.7% while reducing data heterogeneity under various data distributions.
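As a rough illustration of the quantity that step (1) targets, the sketch below computes the KL divergence between a device's label distribution and a reference distribution. The abstract does not say which reference distribution FlocOff uses or how offloading reshapes the data; the uniform reference, the function names, and the sample counts here are all assumptions made for illustration.

```python
import math

def label_distribution(counts):
    """Normalize per-class sample counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions over the same classes.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical devices: one heavily skewed, one nearly balanced.
uniform = [1 / 3] * 3  # assumed reference distribution
skewed = label_distribution([90, 5, 5])
balanced = label_distribution([34, 33, 33])

kl_skewed = kl_divergence(skewed, uniform)
kl_balanced = kl_divergence(balanced, uniform)
print(kl_skewed > kl_balanced)  # the skewed device is further from the reference
```

A lower value means the device's data look more like the reference, which is the sense in which reducing KL divergence reduces data heterogeneity.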

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2402.19013 [pdf, other]

Ultraviolet Positioning via TDOA: Error Analysis and System Prototype

Authors: Shihui Yu, Chubing Lv, Yueke Yang, Yuchen Pan, Lei Sun, Juliang Cao, Ruihang Yu, Chen Gong, Wenqi Wu, Zhengyuan Xu

Abstract: This work performs the design, real-time hardware realization, and experimental evaluation of a positioning system by ultra-violet (UV) communication under photon-level signal detection. The positioning is based on the time-difference-of-arrival (TDOA) principle. Time division-based transmission of a synchronization sequence from three transmitters with known positions is applied. We investigate the positioning error by decomposing it into two parts, the transmitter-side timing error and the receiver-side synchronization error. The theoretical average error matches well with the simulation results, which indicates that theoretical fitting can provide reliable guidance and prediction for hardware experiments. We also conduct real-time hardware realization of the TDOA-based positioning system using a Field Programmable Gate Array (FPGA), which is experimentally evaluated via outdoor experiments. Experimental results match well with the theoretical and simulation results.

Submitted 14 April, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

arXiv:2312.15195 [pdf, other]

Mutual Information as Intrinsic Reward of Reinforcement Learning Agents for On-demand Ride Pooling

Authors: Xianjie Zhang, Jiahao Sun, Chen Gong, Kai Wang, Yifei Cao, Hao Chen, Hao Chen, Yu Liu

Abstract: The emergence of on-demand ride pooling services allows each vehicle to serve multiple passengers at a time, thus increasing drivers' income and enabling passengers to travel at lower prices than taxi/car on-demand services (where only one passenger can be assigned to a car at a time, as with UberX and Lyft). Although on-demand ride pooling services can bring so many benefits, they need a well-defined matching strategy to maximize the benefits for all parties (passengers, drivers, aggregation companies and environment), in which the regional dispatching of vehicles has a significant impact on the matching and revenue. Existing algorithms often only consider revenue maximization, which makes it difficult for requests with unusual distribution to get a ride. How to increase revenue while ensuring a reasonable assignment of requests brings a challenge to ride pooling service companies (aggregation companies). In this paper, we propose a framework for vehicle dispatching for ride pooling tasks, which splits the city into discrete dispatching regions and uses the reinforcement learning (RL) algorithm to dispatch vehicles in these regions. We also consider the mutual information (MI) between vehicle and order distribution as the intrinsic reward of the RL algorithm to improve the correlation between their distributions, thus ensuring the possibility of getting a ride for unusually distributed requests. In experimental results on a real-world taxi dataset, we demonstrate that our framework can significantly increase revenue up to an average of 3% over the existing best on-demand ride pooling method.

Submitted 7 January, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

Comments: Accepted by AAMAS 2024

arXiv:2312.14398 [pdf, other]

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Authors: Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi

Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TTS systems are typically built using a single speaker's voices, but there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. Our proposed model has zero-shot generalization ability not only for unseen speakers but also for unseen languages. We have conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has proven effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetically low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.

Submitted 26 August, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: Accepted by IEEE/ACM TASLP, 16 pages plus 1 page of bio and photos

arXiv:2312.03376 [pdf, other]

Beacon-enabled TDMA Ultraviolet Communication Network System Design and Realization

Authors: Yuchen Pan, Fei Long, Ping Li, Haotian Shi, Jiazhao Shi, Hanlin Xiao, Chen Gong, Zhengyuan Xu

Abstract: Non-line-of-sight (NLOS) ultraviolet (UV) scattering communication can serve as a good candidate for outdoor optical wireless communication (OWC) in the cases of non-perfect transmitter-receiver alignment and radio silence. We design and demonstrate an NLOS UV scattering communication network system in this paper, where a beacon-enabled time division multiple access (TDMA) scheme is adopted. In our system, an LED and a PMT are employed as the transmitter and receiver devices, respectively. Furthermore, we design algorithms for beacon transmission, beacon reception, time compensation, and time slot transition for hardware realization on a field-programmable gate array (FPGA) board based on a master-slave structure, where the master node periodically transmits beacon signals to slave nodes. Experimental results are provided to evaluate the time synchronization error and specify the system key parameters for real-time implementation. We perform field tests for a real-time communication network with the transmission range over 110 × 90 square meters, where the system throughput reaches 800 kbps.

Submitted 15 April, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

arXiv:2306.17790 [pdf, other]

Theoretical Analysis of Heterodyne Rydberg Atomic Receiver Sensitivity Based on Transit Relaxation Effect and Frequency Detuning

Authors: Shanchi Wu, Chen Gong, Shangbin Li, Rui Ni, Jinkang Zhu

Abstract: We conduct a theoretical investigation into the impacts of local microwave electric field frequency detuning, laser frequency detuning, and transit relaxation rate on enhancing heterodyne Rydberg atomic receiver sensitivity. To optimize the output signal amplitude given the input microwave signal, we derive the steady-state solutions of the atomic density matrix. Numerical results show that laser frequency detuning and local microwave electric field frequency detuning can improve the system detection sensitivity, which can help the system achieve extra sensitivity gain. They also show that the heterodyne Rydberg atomic receiver can detect weak microwave signals continuously over a wide frequency range with the same or even higher sensitivity than in the resonance case. To evaluate the transit relaxation effect, a modified Liouville equation is used. We find that the transit relaxation rate increases the time it takes to reach steady state and decreases the sensitivity of the system detection.

Submitted 30 June, 2023; originally announced June 2023.

Comments: 9 pages, 9 figures, 19 references

arXiv:2306.08823 [pdf, other]

Plug-in Hybrid Electric Vehicle Energy Management with Clutch Engagement Control via Continuous-Discrete Reinforcement Learning

Authors: Changfu Gong, Jinming Xu, Yuan Lin

Abstract: Energy management strategy (EMS) is a key technology for plug-in hybrid electric vehicles (PHEVs). The energy management of certain series-parallel PHEVs involves the control of continuous variables, such as engine torque, and discrete variables, such as clutch engagement/disengagement. We establish a control-oriented model for a series-parallel plug-in hybrid system with clutch engagement control from the perspective of mixed-integer programming. Subsequently, we design an EMS based on continuous-discrete reinforcement learning (CDRL), which enables simultaneous output of continuous and discrete variables. During training, we introduce state-of-charge (SOC) randomization to ensure that the hybrid system exhibits optimal energy-saving performance in both high and low SOC. Finally, the effectiveness of the proposed CDRL strategy is verified by comparison against an EMS based on charge-depleting charge-sustaining (CD-CS) with rule-based clutch engagement control, and against Dynamic Programming (DP). The simulation results show that, under a high SOC, the CDRL strategy proposed in this paper can improve energy efficiency by 8.3% compared to CD-CS, and the energy consumption is just 6.6% higher than the global optimum based on DP, while under a low SOC, the numbers are 4.1% and 3.9%, respectively.

Submitted 2 March, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

arXiv:2304.12804 [pdf, other]

Channel Estimation and Signal Detection for NLOS Ultraviolet Scattering Communication with Space Division Multiple Access

Authors: Yubo Zhang, Yuchen Pan, Chen Gong, Beiyuan Liu, Zhengyuan Xu

Abstract: We design a receiver assembling several photomultipliers (PMTs) as an array to increase the field of view (FOV) of the receiver and adapt to the multiuser situation over non-line-of-sight (NLOS) ultraviolet (UV) channels. Channel estimation and signal detection are investigated according to the space division characteristics of the structure. Firstly, we adopt the balanced structure on the pilot matrix, analyze the channel estimation mean square error (MSE), and optimize the structure parameters. Then, with the estimated parameters, an analytical threshold detection rule is proposed as a preliminary work of multiuser detection. The detection rule can be optimized by analyzing the separability of two users based on the Gaussian approximation of the Poisson weighted sum. To assess the effect of imperfect estimation, the sensitivity analysis of channel estimation error on two-user signal detection is performed. Moreover, we propose a successive elimination method for on-off keying (OOK) modulated multiuser symbol detection based on the previous threshold detection rule. A closed-form upper bound on the detection error rate is calculated, which turns out to be a good approximation of that of multiuser maximum-likelihood (ML) detection. The proposed successive elimination method is twenty times faster than the ML detection with negligible detection error rate degradation.

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2303.01218 [pdf, ps, other]

Co-Optimization of Adaptive Cruise Control and Hybrid Electric Vehicle Energy Management via Model Predictive Mixed Integer Control

Authors: Qitao Li, Changfu Gong, Yuan Lin

Abstract: In this paper, a model predictive mixed integer control method for the BYD Qin Plus DM-i (Dual Model intelligent) plug-in hybrid electric vehicle (PHEV) is proposed for co-optimization to reduce fuel consumption during car following. First, the adaptive cruise control (ACC) model for energy-saving driving is established. Then, a control-oriented energy management strategy (EMS) model considering the clutch engagement and disengagement is constructed. Finally, the co-optimization structure integrating the ACC model and the EMS model is created and converted to a mixed integer nonlinear programming (MINLP) problem. The results show that this modeling method can be applied to EMS based on the model predictive control (MPC) framework and verify that co-optimization can achieve a 5.1% reduction in fuel consumption compared to sequential optimization while guaranteeing ACC performance.

Submitted 24 April, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

arXiv:2302.03227 [pdf, other]

Automatic Sleep Stage Classification with Cross-modal Self-supervised Features from Deep Brain Signals

Authors: Chen Gong, Yue Chen, Yanan Sui, Luming Li

Abstract: The detection of human sleep stages is widely used in the diagnosis and intervention of neurological and psychiatric diseases. Some patients with implanted deep brain stimulators could have their neural activities recorded from the deep brain. Sleep stage classification based on deep brain recording has great potential to provide more precise treatment for patients. The accuracy and generalizability of existing sleep stage classifiers based on local field potentials are still limited. We proposed an applicable cross-modal transfer learning method for sleep stage classification with implanted devices. This end-to-end deep learning model contained cross-modal self-supervised feature representation, self-attention, and a classification framework. We tested the model with deep brain recording data from 12 patients with Parkinson's disease. The best total accuracy reached 83.2% for sleep stage classification. Results showed that speech self-supervised features capture the conversion pattern of sleep stages effectively. We provide a new method for transfer learning from acoustic signals to local field potentials. This method offers an effective solution for the insufficient scale of clinical data. This sleep stage classification model could be adapted for chronic and continuous sleep monitoring for Parkinson's patients in daily life, and potentially utilized for more precise treatment in deep brain-machine interfaces, such as closed-loop deep brain stimulation.

Submitted 6 February, 2023; originally announced February 2023.

Comments: 4 pages, 5 figures, 11th International IEEE EMBS Conference on Neural Engineering (NER)

arXiv:2211.02903 [pdf, other]

VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Authors: Yongmao Zhang, Heyang Xue, Hanzhao Li, Lei Xie, Tingwei Guo, Ruixiong Zhang, Caixia Gong

Abstract: The end-to-end singing voice synthesis (SVS) model VISinger can achieve better performance than the typical two-stage model with fewer parameters. However, VISinger has several problems: the text-to-phase problem, where the end-to-end model learns the meaningless mapping of text-to-phase; the glitches problem, where the harmonic components corresponding to the periodic signal of the voiced segment undergo a sudden change with audible artefacts; and a low sampling rate, where the sampling rate of 24 kHz does not meet the application needs of high-fidelity generation with the full-band rate (44.1 kHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP), we incorporate a DSP synthesizer into the decoder to solve the above issues. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer to generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract the latent representation without phase information and prevents the prior encoder from modelling text-to-phase mapping. To avoid glitch artefacts, the HiFi-GAN is modified to accept the waveforms generated by the DSP synthesizer as a condition to produce the singing voice. Moreover, with the improved waveform decoder, VISinger 2 manages to generate 44.1 kHz singing audio with richer expression and better quality. Experiments on the OpenCpop corpus show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics.

Submitted 5 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2208.01559 [pdf, other]

The design and optimization of synchronization sequence for Ultraviolet communication

Authors: Shihui Yu, Chen Gong, Zhengyuan Xu

Abstract: In ultraviolet (UV) scattering communication, the received signals exhibit the characteristics of discrete photoelectrons due to path loss. The synchronization is based on the maximum Pulse Number-Sequence correlation problem. The accuracy of synchronization is vital to channel estimation and decoding. This article focuses on improving synchronization accuracy by designing and optimizing synchronization sequences. For the maximum Pulse Number-Sequence correlation problem, it is assumed that the correlation values satisfy the Gaussian distribution, and their mathematical expectation, variance and covariance are derived to express the upper bound of the synchronization offset. The synchronization sequence we designed has two equilong random parts (whose symbols follow a Bernoulli distribution with equal probability) and a $\{1,0,1,0,1,0,...,1,0,1,0\}$ part between them, with $\alpha$ as its proportion of the entire sequence. On the premise of ensuring the synchronization reliability, the synchronization deviation can be reduced by optimizing $\alpha$. Simulation experiments verify the correctness of the derivation, the reasonableness of the hypothesis and the reliability of the optimization. Compared with an equilong random sequence, the synchronization accuracy of the optimized synchronization sequence is significantly improved.
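The sequence layout described in the abstract (random part, alternating part, random part) can be sketched as follows. This is only an illustration of the structure: the function name, the rounding rule for the alternating part, and the example parameters are assumptions, and the optimization of $\alpha$ itself is not reproduced here.

```python
import random

def build_sync_sequence(length, alpha, seed=None):
    """Build a 0/1 sequence: random head, alternating 1,0,1,0,... middle, random tail.

    `alpha` is the fraction of the sequence taken by the alternating part.
    The rounding below is an assumption; the abstract does not specify it.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    alt_len = round(alpha * length)
    rand_total = length - alt_len
    if rand_total % 2:  # keep the two random parts the same length
        alt_len += 1
        rand_total -= 1
    half = rand_total // 2
    head = [rng.randint(0, 1) for _ in range(half)]    # Bernoulli(1/2) symbols
    middle = [1 - (i % 2) for i in range(alt_len)]     # 1, 0, 1, 0, ...
    tail = [rng.randint(0, 1) for _ in range(half)]
    return head + middle + tail

# Hypothetical parameters: 20 symbols, 40% of them in the alternating part.
seq = build_sync_sequence(20, 0.4, seed=0)
print(seq[6:14])  # the alternating middle section
```

Sweeping `alpha` over a grid and scoring each candidate by simulated synchronization error would be one simple way to explore the trade-off the abstract describes.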

Submitted 2 August, 2022; originally announced August 2022.

arXiv:2207.00583 [pdf, other]

Feature-selected Graph Spatial Attention Network for Addictive Brain-Networks Identification

Authors: Changwei Gong, Changhong Jing, Junren Pan, Shuqiang Wang

Abstract: Functional alterations in the relevant neural circuits occur from drug addiction over a certain period, and these significant alterations are also revealed by analyzing fMRI. However, because of fMRI's high dimensionality and poor signal-to-noise ratio, it is challenging to encode efficient and robust brain regional embeddings for both graph-level identification and region-level biomarker detection tasks between nicotine addiction (NA) and healthy control (HC) groups. In this work, we represent the fMRI of the rat brain as a graph with biological attributes and propose a novel feature-selected graph spatial attention network (FGSAN) to extract the biomarkers of addiction and identify addiction from these brain networks. Specifically, a graph spatial attention encoder is employed to capture the features of spatiotemporal brain networks with spatial information. The method simultaneously adopts a Bayesian feature selection strategy to optimize the model and improve the classification task by constraining features. Experiments on an addiction-related neural imaging dataset show that the proposed model can obtain superior performance and detect interpretable biomarkers associated with addiction-relevant neural circuits.

Submitted 5 July, 2022; v1 submitted 29 June, 2022; originally announced July 2022.

arXiv:2204.13863 [pdf, other]

Indoor 3-Dimensional Visible Light Positioning: Error Metric and LED Layout Optimization

Authors: Jiaojiao Xu, Nuo Huang, Chen Gong

Abstract: We consider 3-dimensional (3D) visible light positioning (VLP) based on smartphone camera in an indoor scenario. Based on the positioning model in the quantized pixel-domain, we characterize the 3D normalized positioning error metric (NPEM) through the partial derivative of the positioning function, and evaluate the NPEM for horizontal and non-horizontal receiver camera positions. Moreover, under horizontal receiver terminal position, we explore the relationship between the NPEM and the light-emitting diode (LED) cell layout, approximate the relationship between the NPEM and the number of LEDs captured by the camera, and evaluate the approximation accuracy according to the simulated positioning error. Based on the approximation results, we optimize the LED transmitter cell layout to minimize NPEM assuming structured square cell layouts with certain distance parameters.

Submitted 28 April, 2022; originally announced April 2022.

arXiv:2204.06086

A Post Auto-regressive GAN Vocoder Focused on Spectrum Fracture

Authors: Zhenxing Lu, Mengnan He, Ruixiong Zhang, Caixia Gong

Abstract: Generative adversarial networks (GANs) have shown their superiority in real-time speech synthesis. Nevertheless, most of them use deep convolutional layers as their backbone, which may cause the absence of previous signal information. However, the generation of speech signals invariably requires preceding waveform samples in its reconstruction, as the lack of this can lead to artifacts in generated speech. To address this conflict, in this paper, we propose an improved model: a post auto-regressive (AR) GAN vocoder with a self-attention layer, which merges self-attention into an AR loop. It does not participate in inference, but can assist the generator in learning temporal dependencies within frames during training. Furthermore, an ablation study was done to confirm the contribution of each part. Systematic experiments show that our model leads to a consistent improvement on both objective and subjective evaluation performance.

Submitted 16 February, 2023; v1 submitted 12 April, 2022; originally announced April 2022.

Comments: Experimental parts should be improved

arXiv:2111.07549 [pdf, other]

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Authors: Zhu Li, Yuqing Zhang, Mengxi Nie, Ming Yan, Mengnan He, Ruixiong Zhang, Caixia Gong

Abstract: Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech. However, training these models typically requires a large amount of high-fidelity speech data, and for unseen texts, the prosody of synthesized speech is relatively unnatural. To address these issues, we propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling. The pre-trained BERT is fine-tuned on the polyphone disambiguation task, the joint Chinese word segmentation (CWS) and part-of-speech (POS) tagging task, and the prosody structure prediction (PSP) task in a multi-task learning framework. FastSpeech 2 is pre-trained on large-scale external data that are noisy but easier to obtain. Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 can improve prosody, especially for those structurally complex sentences.

Submitted 15 November, 2021; originally announced November 2021.

arXiv:2111.02921 [pdf, ps, other]

Map-Assisted Constellation Design for mmWave WDM with OAM in Short-Range LOS Environment

Authors: Yuan Wang, Chen Gong, Nuo Huang, Zhengyuan Xu

Abstract: We consider a system that integrates positioning and single-user millimeter wave (mmWave) communication, where the communication part adopts wavelength division multiplexing (WDM) and orbital angular momentum (OAM). This paper addresses the multi-dimensional constellation design in a short-range line-of-sight (LOS) environment, with stable communication links. We propose a map-assisted method to quantify the system parameters based on positions and reduce real-time computing overhead. We explore the possibility of using a few patterns in the maps, and investigate its performance loss. We first investigate the features of OAM beams, and find that the link gain ratio between any two sub-channels remains unchanged at some positions. Then, we prove that a fixed constellation can be adopted for the positions where the link gain matrices are sufficiently close to be proportional. Moreover, we prove that the system can adopt a fixed power vector to generate a multi-dimensional constellation if the difference between the fixed power vector and the optimal power vector is small. Finally, we find that the constellation design for all receiver locations can be represented by a few constellation sets.

Submitted 11 October, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

arXiv:2110.04451 [pdf, other]

Using multiple reference audios and style embedding constraints for speech synthesis

Authors: Cheng Gong, Longbiao Wang, Zhenhua Ling, Ju Zhang, Jianwu Dang

Abstract: The end-to-end speech synthesis model can directly take an utterance as reference audio, and generate speech from the text with prosody and speaker characteristics similar to the reference audio. However, an appropriate acoustic embedding must be manually selected during inference. Because only the matched text and speech are used in the training process, using unmatched text and speech for inference would cause the model to synthesize speech with low content quality. In this study, we propose to mitigate these two problems by using multiple reference audios and style embedding constraints rather than using only the target audio. Multiple reference audios are automatically selected using the sentence similarity determined by Bidirectional Encoder Representations from Transformers (BERT). In addition, we use the "target" style embedding from a pre-trained encoder as a constraint by considering the mutual information between the predicted and "target" style embedding. The experimental results show that the proposed model can improve the speech naturalness and content quality with multiple reference audios and can also outperform the baseline model in ABX preference tests of style similarity.

Submitted 9 October, 2021; originally announced October 2021.

Comments: 5 pages,3 figures submitted to ICASSP2022

arXiv:2109.07210 [pdf]

Life-Long Multi-Task Learning of Adaptive Path Tracking Policy for Autonomous Vehicle

Authors: Cheng Gong, Jianwei Gong, Chao Lu, Zhe Liu, Zirui Li

Abstract: This paper proposes a life-long adaptive path tracking policy learning method for autonomous vehicles that can self-evolve and self-adapt with multi-task knowledge. Firstly, the proposed method can learn a model-free control policy for path tracking directly from the historical driving experience, where the property of vehicle dynamics and corresponding control strategy can be learned simultaneously. Secondly, by utilizing the life-long learning method, the proposed method can learn the policy with task-incremental knowledge without encountering catastrophic forgetting. Thus, with continual multi-task knowledge learned, the policy can iteratively adapt to new tasks and improve its performance with knowledge from new tasks. Thirdly, a memory evaluation and updating method is applied to optimize memory structure for life-long learning which enables the policy to learn toward selected directions. Experiments are conducted using a high-fidelity vehicle dynamic model in a complex curvy road to evaluate the performance of the proposed method. Results show that the proposed method can effectively evolve with continual multi-task knowledge and adapt to the new environment, where the performance of the proposed method can also surpass two commonly used baseline methods after evolving.

Submitted 15 September, 2021; originally announced September 2021.

arXiv:2109.03513 [pdf, other]

doi 10.1109/TPDS.2021.3129615

Elastic Significant Bit Quantization and Acceleration for Deep Neural Networks

Authors: Cheng Gong, Ye Lu, Kunpeng Xie, Zongming Jin, Tao Li, Yanzhi Wang

Abstract: Quantization has been proven to be a vital method for improving the inference efficiency of deep neural networks (DNNs). However, it is still challenging to strike a good balance between accuracy and efficiency while quantizing DNN weights or activation values from high-precision formats to their quantized counterparts. We propose a new method called elastic significant bit quantization (ESB) that controls the number of significant bits of quantized values to obtain better inference accuracy with fewer resources. We design a unified mathematical formula to constrain the quantized values of the ESB with a flexible number of significant bits. We also introduce a distribution difference aligner (DDA) to quantitatively align the distributions between the full-precision weight or activation values and quantized values. Consequently, ESB is suitable for various bell-shaped distributions of weights and activation of DNNs, thus maintaining a high inference accuracy. Benefitting from fewer significant bits of quantized values, ESB can reduce the multiplication complexity. We implement ESB as an accelerator and quantitatively evaluate its efficiency on FPGAs. Extensive experimental results illustrate that ESB quantization consistently outperforms state-of-the-art methods and achieves average accuracy improvements of 4.78%, 1.92%, and 3.56% over AlexNet, ResNet18, and MobileNetV2, respectively. Furthermore, ESB as an accelerator can achieve 10.95 GOPS peak performance of 1k LUTs without DSPs on the Xilinx ZCU102 FPGA platform. Compared with CPU, GPU, and state-of-the-art accelerators on FPGAs, the ESB accelerator can improve the energy efficiency by up to 65x, 11x, and 26x, respectively.

Submitted 17 November, 2021; v1 submitted 8 September, 2021; originally announced September 2021.

Comments: 15 pages, 14 figures

ACM Class: B.2.4.a; I.2.6.g; I.5.1.d; I.5.4.b

Journal ref: IEEE Transactions on Parallel and Distributed Systems, 2021

arXiv:2108.01831 [pdf, other]

Information Sieve: Content Leakage Reduction in End-to-End Prosody For Expressive Speech Synthesis

Authors: Xudong Dai, Cheng Gong, Longbiao Wang, Kaili Zhang

Abstract: Expressive neural text-to-speech (TTS) systems incorporate a style encoder to learn a latent embedding as the style information. However, this embedding process may encode redundant textual information. This phenomenon is called content leakage. Researchers have attempted to resolve this problem by adding an ASR or other auxiliary supervision loss functions. In this study, we propose an unsupervised method called the "information sieve" to reduce the effect of content leakage in prosody transfer. The rationale of this approach is that the style encoder can be forced to focus on style information rather than on textual information contained in the reference speech by a well-designed downsample-upsample filter, i.e., the extracted style embeddings can be downsampled at a certain interval and then upsampled by duplication. Furthermore, we used instance normalization in convolution layers to help the system learn a better latent style space. Objective metrics such as the significantly lower word error rate (WER) demonstrate the effectiveness of this model in mitigating content leakage. Listening tests indicate that the model retains its prosody transferability compared with the baseline models such as the original GST-Tacotron and ASR-guided Tacotron.

Submitted 3 August, 2021; originally announced August 2021.

Comments: Accepted By Interspeech 2021
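
The downsample-upsample filter mentioned in the abstract above (keep embeddings at a fixed interval, then upsample by duplication) can be sketched in a few lines of Python. This is only an illustration of the general idea on a plain list; the function name `sieve`, the `interval` parameter, and the choice to trim the output back to the input length are assumptions for this example and are not taken from the paper.

```python
def sieve(frames, interval):
    """Downsample `frames` every `interval` steps, then upsample by duplication.

    Each kept frame is repeated `interval` times, and the result is trimmed
    to the original length (an assumption of this sketch).
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")

    kept = frames[::interval]            # downsample at a fixed interval

    upsampled = []
    for frame in kept:
        upsampled.extend([frame] * interval)   # upsample by duplication

    return upsampled[:len(frames)]
```

With `interval=3`, the input `[1, 2, 3, 4, 5, 6, 7]` becomes `[1, 1, 1, 4, 4, 4, 7]`: fine-grained variation between kept frames is discarded, while the coarse trajectory is preserved.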

arXiv:2107.09889 [pdf, other]

Fine-Grained Music Plagiarism Detection: Revealing Plagiarists through Bipartite Graph Matching and a Comprehensive Large-Scale Dataset

Authors: Wenxuan Liu, Tianyao He, Chen Gong, Ning Zhang, Hua Yang, Junchi Yan

Abstract: Music plagiarism detection is gaining more and more attention due to the popularity of music production and society's emphasis on intellectual property. We aim to find fine-grained plagiarism in music pairs since conventional methods are coarse-grained and cannot match real-life scenarios. Considering that there is no sizeable dataset designed for the music plagiarism task, we establish a large-scale simulated dataset, named Music Plagiarism Detection Dataset (MPD-Set) under the guidance and expertise of renowned researchers from national-level professional institutions in the field of music. MPD-Set considers diverse music plagiarism cases found in real life from the melodic, rhythmic, and tonal levels respectively. Further, we establish a Real-life Dataset for evaluation, where all plagiarism pairs are real cases. To detect the fine-grained plagiarism pairs effectively, we propose a graph-based method called Bipartite Melody Matching Detector (BMM-Det), which formulates the problem as a max matching problem in the bipartite graph. Experimental results on both the simulated and Real-life Datasets demonstrate that BMM-Det outperforms the existing plagiarism detection methods, and is robust to common plagiarism cases like transpositions, pitch shifts, duration variance, and melody change. Datasets and source code are open-sourced at https://github.com/xuan301/BMMDet_MPDSet.

Submitted 2 July, 2023; v1 submitted 21 July, 2021; originally announced July 2021.
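
The abstract above casts detection as a max matching problem in a bipartite graph. As a generic illustration of that kind of problem, and not of BMM-Det itself, the following Python sketch computes a maximum-cardinality bipartite matching with the standard augmenting-path method. The function name `max_bipartite_matching` and the adjacency-list input format are assumptions for this example; how the paper builds the graph from melodies, and whether its matching is weighted, is not stated in the abstract.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum-cardinality matching in a bipartite graph.

    adj[u] lists the right-side vertices adjacent to left vertex u.
    Returns (matching size, match_right), where match_right[v] is the left
    vertex matched to right vertex v, or -1 if v is unmatched.
    """
    match_right = [-1] * n_right

    def try_augment(u, seen):
        # Look for an augmenting path starting at left vertex u.
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                # v is free, or its current partner can be re-matched elsewhere.
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, [False] * n_right):
            size += 1
    return size, match_right
```

For instance, with `adj = [[0, 1], [0], [1, 2]]` and `n_right = 3`, all three left vertices can be matched, so the returned size is 3.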

arXiv:2106.06237 [pdf, other]

KRADA: Known-region-aware Domain Alignment for Open-set Domain Adaptation in Semantic Segmentation

Authors: Chenhong Zhou, Feng Liu, Chen Gong, Rongfei Zeng, Tongliang Liu, William K. Cheung, Bo Han

Abstract: In semantic segmentation, we aim to train a pixel-level classifier to assign category labels to all pixels in an image, where labeled training images and unlabeled test images are from the same distribution and share the same label set. However, in an open world, the unlabeled test images probably contain unknown categories and have different distributions from the labeled images. Hence, in this paper, we consider a new, more realistic, and more challenging problem setting where the pixel-level classifier has to be trained with labeled images and unlabeled open-world images -- we name it open-set domain adaptation segmentation (OSDAS). In OSDAS, the trained classifier is expected to identify unknown-class pixels and classify known-class pixels well. To solve OSDAS, we first investigate which distribution unknown-class pixels obey. Then, motivated by the goodness-of-fit test, we use statistical measurements to show how a pixel fits the distribution of an unknown class and select highly-fitted pixels to form the unknown region in each test image. Eventually, we propose an end-to-end learning framework, known-region-aware domain alignment (KRADA), to distinguish unknown classes while aligning the distributions of known classes in labeled and unlabeled open-world images. The effectiveness of KRADA has been verified on two synthetic tasks and one COVID-19 segmentation task.

Submitted 19 February, 2023; v1 submitted 11 June, 2021; originally announced June 2021.

Comments: 18 pages

Journal ref: Transactions on Machine Learning Research, 2023

arXiv:2105.14704 [pdf, other]

Parkinsonian Chinese Speech Analysis towards Automatic Classification of Parkinson's Disease

Authors: Hao Fang, Chen Gong, Chen Zhang, Yanan Sui, Luming Li

Abstract: Speech disorders often occur at the early stage of Parkinson's disease (PD). The speech impairments could be indicators of the disorder for early diagnosis, while motor symptoms are not obvious. In this study, we constructed a new speech corpus of Mandarin Chinese and addressed classification of patients with PD. We implemented classical machine learning methods with ranking algorithms for feature selection, convolutional and recurrent deep networks, and an end-to-end system. Our classification accuracy significantly surpassed state-of-the-art studies. The result suggests that free talk has stronger classification power than standard speech tasks, which could help the design of future speech tasks for efficient early diagnosis of the disease. Based on existing classification methods and our natural speech study, the automatic detection of PD from daily conversation could be accessible to the majority of the clinical population.

Submitted 31 May, 2021; originally announced May 2021.

Comments: 12 pages, 5 figures, proceedings of the Machine Learning for Health NeurIPS Workshop, PMLR 136:114-125, 2020

arXiv:2101.03548 [pdf, ps, other]

Channel Modeling and Signal Processing for Array-based Visible Light Communication System in Misalignment

Authors: Jiaqi Wei, Chen Gong, Nuo Huang, Zhengyuan Xu

Abstract: This paper proposes an indoor visible light communication (VLC) system with multiple transmitters and receivers. Due to the diffusivity of LED light beams, the photodiodes receive signals from many directions. We use one concave and one convex lens as the optical antenna, and obtain the optimal lens structure, which corresponds to the minimum condition number of the channel gain matrix. In this way, the light emitted by different LEDs can be separated well from each other, thereby minimizing signal interference. However, interference increases in the case of system deviation, so we explore the system mobility. Subsequent signal processing is then carried out, including signal combining and successive interference cancellation (SIC). We combine the same signal received by different receivers to improve the signal to interference noise ratio (SINR), and SIC can effectively restore interference and eliminate its impact. The simulation results show that channel capacity can be increased by more than 5 times and up to 20 times under the condition of receiver and transmitter alignment. In the case of movement, channel capacity can also be increased by about 4 times on average. Moreover, the mobile range of the system is also significantly expanded.

Submitted 10 January, 2021; originally announced January 2021.

arXiv:2010.09275 [pdf, other]

DiDiSpeech: A Large Scale Mandarin Speech Corpus

Authors: Tingwei Guo, Cheng Wen, Dongwei Jiang, Ne Luo, Ruixiong Zhang, Shuaijiang Zhao, Wubo Li, Cheng Gong, Wei Zou, Kun Han, Xiangang Li

Abstract: This paper introduces a new open-sourced Mandarin speech corpus, called DiDiSpeech. It consists of about 800 hours of speech data at 48kHz sampling rate from 6000 speakers and the corresponding texts. All speech data in the corpus is recorded in quiet environment and is suitable for various speech processing tasks, such as voice conversion, multi-speaker text-to-speech and automatic speech recognition. We conduct experiments with multiple speech tasks and evaluate the performance, showing that it is promising to use the corpus for both academic research and practical application. The corpus is available at https://outreach.didichuxing.com/research/opendata/.

Submitted 8 February, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

Comments: 5 pages, 2 figures, 11 tables

arXiv:2006.14497 [pdf, other]

Quantumized Microwave Detection Based on $Λ$-Type Three-level Superconducting System: HMM Modeling and Performance Prediction

Authors: Junyu Zhang, Chen Gong, Shangbin Li, Shanchi Wu, Rui Ni, Chengjie Zuo, Jinkang Zhu, Ming Zhao, Zhengyuan Xu

Abstract: We adopt an artificial $Λ$-type three-level system with superconducting devices for microwave signal detection, where the signal intensity reaches the level of discrete photons instead of a continuous waveform. Based on the state transition principles of the three-level system, we propose a statistical model for microwave signal detection. Moreover, we investigate the achievable transmission rate and signal detection based on the statistical model. It is predicted that the proposed detection can achieve significantly higher sensitivity compared with the currently deployed 4G/5G communication system. We further characterize the received signal considering the saturation phenomenon, which reveals negligible performance degradation caused by saturation under the weak received power regime.

Submitted 27 August, 2021; v1 submitted 25 June, 2020; originally announced June 2020.

Comments: 12 pages, 18 figures

arXiv:2006.14471 [pdf, other]

Wireless Communication Based on Microwave Photon-Level Detection With Superconducting Devices: Achievable Rate Prediction

Authors: Junyu Zhang, Chen Gong, Shangbin Li, Rui Ni, Chengjie Zuo, Jinkang Zhu, Ming Zhao, Zhengyuan Xu

Abstract: Future wireless communication systems embrace physical-layer signal detection with high sensitivity, especially at the microwave photon level. Currently, the receiver primarily adopts signal detection based on semiconductor devices, while this paper introduces high-sensitivity photon-level microwave detection based on superconducting structure. We first overview existing works on photon-level communication in the optical spectrum as well as microwave photon-level sensing based on superconducting structure from both theoretical and experimental perspectives, including the microwave detection circuit model based on the Josephson junction, the microwave photon counter based on the Josephson junction, and two reconstruction approaches under background noise. In addition, we characterize channel modeling based on two different microwave photon detection approaches, including the absorption barrier and the dual-path Hanbury Brown-Twiss (HBT) experiments, and predict the corresponding achievable rates. According to the performance prediction, microwave photon-level signal detection can increase the receiver sensitivity compared with the state-of-the-art standardized communication system with waveform signal reception, with a gain of over $10$ dB.

Submitted 25 June, 2020; originally announced June 2020.

Comments: 9 pages, 13 figures

arXiv:2003.12933 [pdf, other]

Weak Radio Frequency Signal Detection Based on Piezo-Opto-Electro-Mechanical System: Architecture Design and Sensitivity Prediction

Authors: Shanchi Wu, Chen Gong, Chengjie Zuo, Shangbin Li, Junyu Zhang, Zhongbin Dai, Kai Yang, Ming Zhao, Rui Ni, Zhengyuan Xu, Jinkang Zhu

Abstract: We propose a novel radio-frequency (RF) receiving architecture based on micro-electro-mechanical system (MEMS) and optical coherent detection module. The architecture converts the received electrical signal into mechanical vibration through the piezoelectric effect and adopts an optical detection module to detect the mechanical vibration. We analyze the response function of piezoelectric film to an RF signal, the noise limited sensitivity of the optical detection module and the system transfer function in the frequency domain. Finally, we adopt simple on-off keying (OOK) modulation with bandwidth 1 kHz and carrier frequency 1 GHz, to numerically evaluate the detection sensitivity. The result shows that, considering the main noise sources in wireless channel and circuits, the signal detection sensitivity can reach around -160 dBm with a 50 $Ω$ impedance. Such sensitivity significantly outperforms that of the currently deployed Long Term Evolution (LTE) system, when normalizing the transmission bandwidth also to 1 kHz.

Submitted 8 October, 2020; v1 submitted 28 March, 2020; originally announced March 2020.

Comments: 15 pages, 16 figures, 6 tables

arXiv:1909.11953 [pdf, other]

Hyperspectral Image Classification With Context-Aware Dynamic Graph Convolutional Network

Authors: Sheng Wan, Chen Gong, Ping Zhong, Shirui Pan, Guangyu Li, Jian Yang

Abstract: In hyperspectral image (HSI) classification, spatial context has demonstrated its significance in achieving promising performance. However, conventional spatial context-based methods simply assume that spatially neighboring pixels should correspond to the same land-cover class, so they often fail to correctly discover the contextual relations among pixels in complex situations, and thus lead to imperfect classification results on some irregular or inhomogeneous regions such as class boundaries. To address this deficiency, we develop a new HSI classification method based on the recently proposed Graph Convolutional Network (GCN), as it can flexibly encode the relations among arbitrarily structured non-Euclidean data. Different from traditional GCN, our method adopts two novel strategies to further exploit the contextual relations for accurate HSI classification. First, since the receptive field of traditional GCN is often limited to a fairly small neighborhood, we propose to capture long-range contextual relations in HSI by performing successive graph convolutions on a learned region-induced graph which is transformed from the original 2D image grids. Second, we refine the graph edge weight and the connective relationships among image regions by learning the improved adjacency matrix and the 'edge filter', so that the graph can be gradually refined to adapt to the representations generated by each graph convolutional layer. Such an updated graph will in turn result in accurate region representations, and vice versa. The experiments carried out on three real-world benchmark datasets demonstrate that the proposed method yields significant improvement in the classification performance when compared with some state-of-the-art approaches.

Submitted 26 September, 2019; originally announced September 2019.

arXiv:1907.11458 [pdf, other]

Multiple Human Association between Top and Horizontal Views by Matching Subjects' Spatial Distributions

Authors: Ruize Han, Yujun Zhang, Wei Feng, Chenxing Gong, Xiaoyu Zhang, Jiewen Zhao, Liang Wan, Song Wang

Abstract: Video surveillance can be significantly enhanced by using both top-view data, e.g., those from drone-mounted cameras in the air, and horizontal-view data, e.g., those from wearable cameras on the ground. Collaborative analysis of different-view data can facilitate various kinds of applications, such as human tracking, person identification, and human activity recognition. However, for such collaborative analysis, the first step is to associate people, referred to as subjects in this paper, across these two views. This is a very challenging problem due to large human-appearance difference between top and horizontal views. In this paper, we present a new approach to address this problem by exploring and matching the subjects' spatial distributions between the two views. More specifically, on the top-view image, we model and match subjects' relative positions to the horizontal-view camera in both views and define a matching cost to decide the actual location of horizontal-view camera and its view angle in the top-view image. We collect a new dataset consisting of top-view and horizontal-view image pairs for performance evaluation and the experimental results show the effectiveness of the proposed method.

Submitted 26 July, 2019; originally announced July 2019.

arXiv:1905.06133 [pdf, other]

Multi-scale Dynamic Graph Convolutional Network for Hyperspectral Image Classification

Authors: Sheng Wan, Chen Gong, Ping Zhong, Bo Du, Lefei Zhang, Jian Yang

Abstract: Convolutional Neural Network (CNN) has demonstrated impressive ability to represent hyperspectral images and to achieve promising results in hyperspectral image classification. However, traditional CNN models can only operate convolution on regular square image regions with fixed size and weights, so they cannot universally adapt to the distinct local regions with various object distributions and geometric appearances. Therefore, their classification performances are still to be improved, especially in class boundaries. To alleviate this shortcoming, we consider employing the recently proposed Graph Convolutional Network (GCN) for hyperspectral image classification, as it can conduct the convolution on arbitrarily structured non-Euclidean data and is applicable to the irregular image regions represented by graph topological information. Different from the commonly used GCN models which work on a fixed graph, we enable the graph to be dynamically updated along with the graph convolution process, so that these two steps can benefit from each other to gradually produce the discriminative embedded features as well as a refined graph. Moreover, to comprehensively deploy the multi-scale information inherited by hyperspectral images, we establish multiple input graphs with different neighborhood scales to extensively exploit the diversified spectral-spatial correlations at multiple scales. Therefore, our method is termed 'Multi-scale Dynamic Graph Convolutional Network' (MDGCN). The experimental results on three typical benchmark datasets firmly demonstrate the superiority of the proposed MDGCN to other state-of-the-art methods in both qualitative and quantitative aspects.

Submitted 14 May, 2019; originally announced May 2019.

arXiv:1904.03575 [pdf, other]

Two Dimension Intensity Distribution of Ultraviolet Scattering Communication

Authors: Difan Zou, Zhengyuan Xu, Chen Gong

Abstract: Consider an ultraviolet (UV) scattering communication system where the position of the transmitter is fixed and the receiver can move around on the ground. To obtain the link gain effectively and economically, we propose an algorithm based on one-dimensional (1D) numerical integration and an off-line data library. Moreover, we analyze the 2D scattering intensity distributions for both LED and laser, and observe that the contours can be well fitted by elliptic models. The relationships between the characteristics of fitting ellipses and the source parameters are provided by numerical results.

Submitted 6 April, 2019; originally announced April 2019.

Comments: Work was done when Difan Zou was in USTC

arXiv:1811.11874 [pdf, other]

RetinaMatch: Efficient Template Matching of Retina Images for Teleophthalmology

Authors: Chen Gong, N. Benjamin Erichson, John P. Kelly, Laura Trutoiu, Brian T. Schowengerdt, Steven L. Brunton, Eric J. Seibel

Abstract: Retinal template matching and registration is an important challenge in teleophthalmology with low-cost imaging devices. However, the images from such devices generally have a small field of view (FOV) and image quality degradations, making matching difficult. In this work, we develop an efficient and accurate retinal matching technique that combines dimension reduction and mutual information (MI), called RetinaMatch. The dimension reduction initializes the MI optimization as a coarse localization process, which narrows the optimization domain and avoids local optima. The effectiveness of RetinaMatch is demonstrated on the open fundus image database STARE with simulated reduced FOV and anticipated degradations, and on retinal images acquired by adapter-based optics attached to a smartphone. RetinaMatch achieves a success rate over 94% on human retinal images with the matched target registration errors below 2 pixels on average, excluding the observer variability. It outperforms the standard template matching solutions. In the application of measuring vessel diameter repeatedly, single pixel errors are expected. In addition, our method can be used in the process of image mosaicking with area-based registration, providing a robust approach when the feature based methods fail. To the best of our knowledge, this is the first template matching algorithm for retina images with small template images from unconstrained retinal areas. In the context of the emerging mixed reality market, we envision automated retinal image matching and registration methods as transformative for advanced teleophthalmology and long-term retinal monitoring.

Submitted 28 November, 2018; originally announced November 2018.

arXiv:1810.13091 [pdf, other]

Towards End-to-End Code-Switching Speech Recognition

Authors: Ne Luo, Dongwei Jiang, Shuaijiang Zhao, Caixia Gong, Wei Zou, Xiangang Li

Abstract: Code-switching speech recognition has attracted increasing interest recently, but the need for expert linguistic knowledge has always been a big issue. End-to-end automatic speech recognition (ASR) simplifies the building of ASR systems considerably by predicting graphemes or characters directly from acoustic input. In the meantime, the need for expert linguistic knowledge is also eliminated, which makes it an attractive choice for code-switching ASR. This paper presents a hybrid CTC-Attention based end-to-end Mandarin-English code-switching (CS) speech recognition system and studies the effect of hybrid CTC-Attention based models, different modeling units, the inclusion of language identification and different decoding strategies on the task of code-switching ASR. On the SEAME corpus, our system achieves a mixed error rate (MER) of 34.24%.

Submitted 1 November, 2018; v1 submitted 30 October, 2018; originally announced October 2018.

Comments: 5 pages, submitted to ICASSP 2019

arXiv:1808.03486 [pdf, other]

Pulse-laser Based Long-range Non-line-of-sight Ultraviolet Communication with Pulse Response Position Estimation

Authors: Ruixiong Xu, Chen Gong, Zhengyuan Xu

Abstract: We propose pulse laser-based ultra-violet communication over long distance, such that the pulse response signals can be detected at the receiver at the cost of low data transmission rate. We characterize the signal and achievable performance for the pulse laser-based communication. Since the detection performance critically depends on the pulse response position estimation, we also propose two approaches to estimate the pulse response positions, one based on counting the number of pulses in a window, and the other based on the correlation of pulse response shape and the number of detected photoelectrons. It is seen that the correlation-based position estimation approach can achieve more accurate estimation compared with the counting-based one.

Submitted 10 August, 2018; originally announced August 2018.

arXiv:1805.07766 [pdf, other]

Constrained Partial Group Decoding with Max-Min Fairness for Multi-color Multi-user Visible Light Communication

Authors: Guangtao Zheng, Chen Gong, Zhengyuan Xu

Abstract: A visible light communication (VLC) system can adopt multi-color light emitting diode (LED) arrays to support multiple users. In this paper, a multi-layer coding and constrained partial group decoding (CPGD) method is proposed to tackle strong color interference and increase the system throughput. After channel model formulation, user information rates are allocated and decoding order for all the received data layers is obtained by solving a max-min fairness problem using a greedy algorithm. An achievable rate is derived under the truncated Gaussian input distribution. To reduce the decoding complexity, a map on the decoding order and rate allocation is constructed for all positions of interest on the receiver plane and its size is reduced by a classification-based algorithm. Meanwhile, the symmetrical geometry of LED arrays is exploited. Finally, the transmitter-user association problem is formulated and solved by a genetic algorithm. It is observed that the system throughput increases as the receivers are slightly misaligned with corresponding LED arrays due to the reduced interference level, but decreases afterwards due to the weakened link gain.

Submitted 20 May, 2018; originally announced May 2018.

Comments: 28 pages, 12 figures, submitted to TCOM

arXiv:1805.02199 [pdf, other]

Asynchronous Multiple Access in Optical Wireless Scattering Communication: Achievable Transmission Rates and Receiver Design

Authors: Guanchu Wang, Chen Gong, Zhimeng Jiang, Zhengyuan Xu

Abstract: We investigate the asynchronous multiple user access communication in optical wireless scattering communication, where different users transmit signals without perfect alignment in the time domain. Firstly, we characterize the received signal based on hidden Markov model (HMM) such that the misalignment among different users can be characterized by the state transition. Then, we investigate the achievable rates based on that of the HMM and obtain the approximated solution using the Monte Carlo method. We propose the channel estimation based on expectation-maximization (EM) algorithm. Furthermore, we adopt Viterbi and Bahl-Cocke-Jelinek-Raviv (BCJR) algorithms for joint iterative multi-user decoding. Numerical and experimental results illustrate the performance of proposed channel estimation, joint detection and decoding. It is seen from the experimental results that the proposed approaches perform close to the simulation results.

Submitted 21 January, 2019; v1 submitted 6 May, 2018; originally announced May 2018.

arXiv:1802.03944 [pdf, other]

A 1Mbps Real-time NLOS UV Scattering Communication System with Receiver Diversity over 1km

Authors: Guanchu Wang, Kun Wang, Chen Gong, Difan Zou, Zhimeng Jiang, Zhengyuan Xu

Abstract: In the non-line of sight (NLOS) ultraviolet (UV) scattering communication, the received signals exhibit the characteristics of discrete photoelectrons due to the extremely large path loss. We design and demonstrate an NLOS UV scattering communication system in this work, where the receiver-side signal detection is designed based on a discrete-time Poisson channel model. In our system, a laser and multiple photomultiplier tubes are employed as the optical transmitter and detector, respectively. Furthermore, we design algorithms for pulse-counting, synchronization, channel estimation and LLR computation for hardware realization on an FPGA board. Simulation results are provided to evaluate the proposed system design and specify the system key parameters. We perform field tests for real-time communication with the transmission range over 1 km, where the system throughput reaches 1 Mbps.

Submitted 12 February, 2018; originally announced February 2018.

arXiv:1710.10976 [pdf, ps, other]

SCMA with Low Complexity Symmetric Codebook Design for Visible Light Communication

Authors: Shun Lou, Chen Gong, Qian Gao, Zhengyuan Xu

Abstract: Sparse code multiple access (SCMA) is attracting significant research interest currently, and is considered a promising multiple access technique for 5G systems. It serves as a good candidate for the future communication network with massive nodes due to its capability of handling user overloading. Introducing SCMA to visible light communication (VLC) can provide another opportunity on design of transmission protocols for the communication network with massive nodes due to the limited communication range of VLC, which reduces the interference intensity. However, when applying SCMA in VLC systems, we need to modify the SCMA codebook to accommodate the real and positive signal requirement for VLC. We apply multidimensional constellation design methods to SCMA codebook. To reduce the design complexity, we also propose a symmetric codebook design. For all the proposed design approaches, the minimum Euclidean distance aims to be maximized. Our symmetric codebook design can reduce design and detection complexity simultaneously. Simulation results show that our design implies fast convergence with respect to the number of iterations, and outperforms the design that simply modifies the existing approaches to VLC signal requirements.

Submitted 30 October, 2017; originally announced October 2017.

Showing 1–43 of 43 results for author: Gong, C