Search | arXiv e-print repository

Residual Deep Reinforcement Learning for Inverter-based Volt-Var Control

Authors: Qiong Liu, Ye Guo, Lirong Deng, Haotian Liu, Dongyu Li, Hongbin Sun

Abstract: A residual deep reinforcement learning (RDRL) approach is proposed by integrating DRL with model-based optimization for inverter-based volt-var control in active distribution networks when the accurate power flow model is unknown. RDRL learns a residual action with a reduced residual action space, based on the action of the model-based approach with an approximate model. RDRL inherits the control capability of the approximate-model-based optimization and enhances the policy optimization capability by residual policy learning. Additionally, it improves the approximation accuracy of the critic and reduces the search difficulties of the actor by reducing residual action space. To address the issues of "too small" or "too large" residual action space of RDRL and further improve the optimization performance, we extend RDRL to a boosting RDRL approach. It selects a much smaller residual action space and learns a residual policy by using the policy of RDRL as a base policy. Simulations demonstrate that RDRL and boosting RDRL improve the optimization performance considerably throughout the learning stage and verify their rationales point-by-point, including 1) inheriting the capability of the approximate model-based optimization, 2) residual policy learning, and 3) learning in a reduced action space.

Submitted 13 August, 2024; originally announced August 2024.

Comments: arXiv admin note: text overlap with arXiv:2210.07360
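
The residual-action idea in the abstract above can be illustrated with a minimal sketch. It assumes the usual residual policy learning form, in which the final action is the model-based action plus a bounded learned correction; the function names, the limits, and the numbers are hypothetical and are not taken from the paper.

```python
def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def residual_action(base_action, residual, residual_limit, action_low, action_high):
    """Combine a model-based action with a learned residual (illustrative only).

    base_action:    action from the approximate-model-based optimizer
    residual:       correction proposed by the learned policy
    residual_limit: half-width of the reduced residual action space (assumed symmetric)
    """
    # Keep the learned correction inside the reduced residual action space.
    bounded = [clip(r, -residual_limit, residual_limit) for r in residual]
    # Add the correction to the base action and respect the actuator limits.
    return [clip(b + r, action_low, action_high) for b, r in zip(base_action, bounded)]


if __name__ == "__main__":
    base = [0.30, -0.10]     # hypothetical per-unit reactive power set-points
    learned = [0.50, -0.02]  # hypothetical raw residual from the policy
    print(residual_action(base, learned, 0.05, -1.0, 1.0))
```

A smaller `residual_limit` keeps the final action close to the model-based one, which is the trade-off the abstract describes as a "too small" or "too large" residual action space.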

arXiv:2407.02826 [pdf, other]

SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech

Authors: Jingru Lin, Meng Ge, Junyi Ao, Liqun Deng, Haizhou Li

Abstract: It was shown that pre-trained models with self-supervised learning (SSL) techniques are effective in various downstream speech tasks. However, most such models are trained on single-speaker speech data, limiting their effectiveness in mixture speech. This motivates us to explore pre-training on mixture speech. This work presents SA-WavLM, a novel pre-trained model for mixture speech. Specifically, SA-WavLM follows an "extract-merge-predict" pipeline in which the representations of each speaker in the input mixture are first extracted individually and then merged before the final prediction. In this pipeline, SA-WavLM performs speaker-informed extractions with the consideration of the interactions between different speakers. Furthermore, a speaker shuffling strategy is proposed to enhance the robustness towards the speaker absence. Experiments show that SA-WavLM either matches or improves upon the state-of-the-art pre-trained models.

Submitted 3 July, 2024; originally announced July 2024.

Comments: InterSpeech 2024

arXiv:2406.09676 [pdf, other]

Optimizing Byte-level Representation for End-to-end ASR

Authors: Roger Hsiao, Liuhui Deng, Erik McDermott, Ruchir Travadi, Xiaodan Zhuang

Abstract: We propose a novel approach to optimizing a byte-level representation for end-to-end automatic speech recognition (ASR). Byte-level representation is often used by large scale multilingual ASR systems when the character set of the supported languages is large. The compactness and universality of byte-level representation allow the ASR models to use smaller output vocabularies and therefore, provide more flexibility. UTF-8 is a commonly used byte-level representation for multilingual ASR, but it is not designed to optimize machine learning tasks directly. By using auto-encoder and vector quantization, we show that we can optimize a byte-level representation for ASR and achieve better accuracy. Our proposed framework can incorporate information from different modalities, and provides an error correction mechanism. In an English/Mandarin dictation task, we show that a bilingual ASR model built with this approach can outperform UTF-8 representation by 5% relative in error rate.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 5 pages, 1 figure
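
To make the UTF-8 baseline mentioned in the abstract above concrete, the following sketch shows how English and Mandarin text both map into one vocabulary of at most 256 byte values. It illustrates plain UTF-8 only, not the optimized representation the paper proposes; the helper name `utf8_byte_tokens` is hypothetical.

```python
def utf8_byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 bytes of text as integer tokens in the range 0..255."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    english = utf8_byte_tokens("hi")    # ASCII characters take one byte each
    mandarin = utf8_byte_tokens("你好")  # these CJK characters take three bytes each
    print(english, mandarin)
    # Any byte sequence produced this way decodes back to the original text.
    print(bytes(mandarin).decode("utf-8"))
```

The output vocabulary stays small regardless of how many characters the languages use, at the cost of longer token sequences for scripts such as Chinese.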

arXiv:2406.03875 [pdf, other]

Energy-storing analysis and fishtail stiffness optimization for a wire-driven elastic robotic fish

Authors: Xiaocun Liao, Chao Zhou, Junfeng Fan, Zhuoliang Zhang, Zhaoran Yin, Liangwei Deng

Abstract: The robotic fish with high propulsion efficiency and good maneuverability achieves underwater fishlike propulsion by commonly adopting the motor to drive the fishtail, causing the significant fluctuations of the motor power due to the uneven swing speed of the fishtail in one swing cycle. Hence, we propose a wire-driven robotic fish with a spring-steel-based active-segment elastic spine. This bionic spine can produce elastic deformation to store energy under the action of the wire driving and motor for responding to the fluctuations of the motor power. Further, we analyze the effects of the energy-storing of the active-segment elastic spine on the smoothness of motor power. Based on the developed Lagrangian dynamic model and cantilever beam model, the power-variance-based nonlinear optimization model for the stiffness of the active-segment elastic spine is established to respond to the sharp fluctuations of motor power during each fishtail swing cycle. Results validate that the energy-storing of the active-segment elastic spine plays a vital role in improving the power fluctuations and maximum frequency of the motor by adjusting its stiffness reasonably, which is beneficial to achieving high propulsion and high speed for robotic fish. Compared with the active-segment rigid spine that is incapable of storing energy, the energy-storing of the active-segment elastic spine is beneficial to increase the maximum frequency of the motor and the average thrust of the fishtail by 0.41 Hz, and 0.06 N, respectively.

Submitted 6 June, 2024; originally announced June 2024.

Comments: 14 pages, 19 figures

arXiv:2406.02430 [pdf, other]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, et al. (21 additional authors not shown)

Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at \url{https://bytedancespeech.github.io/seedtts_tech_report}.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2404.11537 [pdf, other]

SSDiff: Spatial-spectral Integrated Diffusion Model for Remote Sensing Pansharpening

Authors: Yu Zhong, Xiao Wu, Liang-Jian Deng, Zihan Cao

Abstract: Pansharpening is a significant image fusion technique that merges the spatial content and spectral characteristics of remote sensing images to generate high-resolution multispectral images. Recently, denoising diffusion probabilistic models have been gradually applied to visual tasks, enhancing controllable image generation through low-rank adaptation (LoRA). In this paper, we introduce a spatial-spectral integrated diffusion model for the remote sensing pansharpening task, called SSDiff, which considers the pansharpening process as the fusion process of spatial and spectral components from the perspective of subspace decomposition. Specifically, SSDiff utilizes spatial and spectral branches to learn spatial details and spectral features separately, then employs a designed alternating projection fusion module (APFM) to accomplish the fusion. Furthermore, we propose a frequency modulation inter-branch module (FMIM) to modulate the frequency distribution between branches. The two components of SSDiff can perform favorably against the APFM when utilizing a LoRA-like branch-wise alternative fine-tuning method. It refines SSDiff to capture component-discriminating features more sufficiently. Finally, extensive experiments on four commonly used datasets, i.e., WorldView-3, WorldView-2, GaoFen-2, and QuickBird, demonstrate the superiority of SSDiff both visually and quantitatively. The code will be made open source after possible acceptance.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.07932 [pdf, other]

FusionMamba: Efficient Image Fusion with State Space Model

Authors: Siran Peng, Xiangyu Zhu, Haoyu Deng, Zhen Lei, Liang-Jian Deng

Abstract: Image fusion aims to generate a high-resolution multi/hyper-spectral image by combining a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Current deep learning (DL)-based methods for image fusion primarily rely on CNNs or Transformers to extract features and merge different types of data. While CNNs are efficient, their receptive fields are limited, restricting their capacity to capture global context. Conversely, Transformers excel at learning global information but are hindered by their quadratic complexity. Fortunately, recent advancements in the State Space Model (SSM), particularly Mamba, offer a promising solution to this issue by enabling global awareness with linear complexity. However, there have been few attempts to explore the potential of the SSM in information fusion, which is a crucial ability in domains like image fusion. Therefore, we propose FusionMamba, an innovative method for efficient image fusion. Our contributions mainly focus on two aspects. Firstly, recognizing that images from different sources possess distinct properties, we incorporate Mamba blocks into two U-shaped networks, presenting a novel architecture that extracts spatial and spectral features in an efficient, independent, and hierarchical manner. Secondly, to effectively combine spatial and spectral information, we extend the Mamba block to accommodate dual inputs. This expansion leads to the creation of a new module called the FusionMamba block, which outperforms existing fusion techniques such as concatenation and cross-attention. We conduct a series of experiments on five datasets related to three image fusion tasks. The quantitative and qualitative evaluation results demonstrate that our method achieves SOTA performance, underscoring the superiority of FusionMamba. The code is available at https://github.com/PSRben/FusionMamba.

Submitted 10 May, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.07543 [pdf, other]

Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening

Authors: Yule Duan, Xiao Wu, Haoyu Deng, Liang-Jian Deng

Abstract: Currently, machine learning-based methods for remote sensing pansharpening have progressed rapidly. However, existing pansharpening methods often do not fully exploit differentiating regional information in non-local spaces, thereby limiting the effectiveness of the methods and resulting in redundant learning parameters. In this paper, we introduce a so-called content-adaptive non-local convolution (CANConv), a novel method tailored for remote sensing image pansharpening. Specifically, CANConv employs adaptive convolution, ensuring spatial adaptability, and incorporates non-local self-similarity through the similarity relationship partition (SRP) and the partition-wise adaptive convolution (PWAC) sub-modules. Furthermore, we also propose a corresponding network architecture, called CANNet, which mainly utilizes the multi-scale self-similarity. Extensive experiments demonstrate the superior performance of CANConv, compared with recent promising fusion methods. Besides, we substantiate the method's effectiveness through visualization, ablation experiments, and comparison with existing methods on multiple test sets. The source code is publicly available at https://github.com/duanyll/CANConv.

Submitted 11 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024

arXiv:2404.01121 [pdf, other]

CMT: Cross Modulation Transformer with Hybrid Loss for Pansharpening

Authors: Wen-Jie Shu, Hong-Xia Dou, Rui Wen, Xiao Wu, Liang-Jian Deng

Abstract: Pansharpening aims to enhance remote sensing image (RSI) quality by merging high-resolution panchromatic (PAN) with multispectral (MS) images. However, prior techniques struggled to optimally fuse PAN and MS images for enhanced spatial and spectral information, due to a lack of a systematic framework capable of effectively coordinating their individual strengths. In response, we present the Cross Modulation Transformer (CMT), a pioneering method that modifies the attention mechanism. This approach utilizes a robust modulation technique from signal processing, integrating it into the attention mechanism's calculations. It dynamically tunes the weights of the carrier's value (V) matrix according to the modulator's features, thus resolving historical challenges and achieving a seamless integration of spatial and spectral attributes. Furthermore, considering that RSI exhibits large-scale features and edge details along with local textures, we crafted a hybrid loss function that combines Fourier and wavelet transforms to effectively capture these characteristics, thereby enhancing both spatial and spectral accuracy in pansharpening. Extensive experiments demonstrate our framework's superior performance over existing state-of-the-art methods. The code will be publicly available to encourage further research.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2402.08934 [pdf, other]

Extreme Video Compression with Pre-trained Diffusion Models

Authors: Bohan Li, Yiming Liu, Xueyan Niu, Bo Bai, Lei Deng, Deniz Gündüz

Abstract: Diffusion models have achieved remarkable success in generating high quality image and video data. More recently, they have also been used for image compression with high perceptual quality. In this paper, we present a novel approach to extreme video compression leveraging the predictive power of diffusion-based generative models at the decoder. The conditional diffusion model takes several neural compressed frames and generates subsequent frames. When the reconstruction quality drops below the desired level, new frames are encoded to restart prediction. The entire video is sequentially encoded to achieve a visually pleasing reconstruction, considering perceptual quality metrics such as the learned perceptual image patch similarity (LPIPS) and the Frechet video distance (FVD), at bit rates as low as 0.02 bits per pixel (bpp). Experimental results demonstrate the effectiveness of the proposed scheme compared to standard codecs such as H.264 and H.265 in the low bpp regime. The results showcase the potential of exploiting the temporal relations in video data using generative models. Code is available at: https://github.com/ElesionKyrie/Extreme-Video-Compression-With-Prediction-Using-Pre-trainded-Diffusion-Models-

Submitted 13 February, 2024; originally announced February 2024.
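
As a rough sense of scale for the 0.02 bpp figure quoted in the abstract above, the sketch below converts a bits-per-pixel rate into a per-frame bit budget. The 128x128 frame size is an assumption chosen for illustration and is not stated in the listing; the function name is hypothetical.

```python
def bits_for_frame(width: int, height: int, bits_per_pixel: float) -> float:
    """Average number of bits available for one frame at a given bpp rate."""
    return width * height * bits_per_pixel


if __name__ == "__main__":
    # Assumed 128x128 frame; 0.02 bpp is the rate quoted in the abstract.
    bits = bits_for_frame(128, 128, 0.02)
    print(f"{bits:.2f} bits, about {bits / 8:.0f} bytes per frame on average")
```

Under that assumed resolution the budget is roughly 328 bits, about 41 bytes, per frame averaged over the sequence.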

arXiv:2312.06197 [pdf, other]

MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer

Authors: Dong Yao, Jieming Zhu, Jiahao Xun, Shengyu Zhang, Zhou Zhao, Liqun Deng, Wenqiao Zhang, Zhenhua Dong, Xin Jiang

Abstract: Recent research in self-supervised contrastive learning of music representations has demonstrated remarkable results across diverse downstream tasks. However, a prevailing trend in existing methods involves representing equally-sized music clips in either waveform or spectrogram formats, often overlooking the intrinsic part-whole hierarchies within music. In our quest to comprehend the bottom-up structure of music, we introduce MART, a hierarchical music representation learning approach that facilitates feature interactions among cropped music clips while considering their part-whole hierarchies. Specifically, we propose a hierarchical part-whole transformer to capture the structural relationships between music clips in a part-whole hierarchy. Furthermore, a hierarchical contrastive learning objective is crafted to align part-whole music representations at adjacent levels, progressively establishing a multi-hierarchy representation space. The effectiveness of our music representation learning from part-whole hierarchies has been empirically validated across multiple downstream tasks, including music classification and cover song identification.

Submitted 19 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: Short paper accepted by WWW 2024. This is revised and condensed based on the previous version titled "Music-PAW: Learning Music Representations via Hierarchical Part-whole Interaction and Contrast". For more experimental details and discussions, please refer to the original long paper at arXiv:2312.06197v1

arXiv:2310.14823 [pdf, other]

Prompt-driven Target Speech Diarization

Authors: Yidi Jiang, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, Haizhou Li

Abstract: We introduce a novel task named 'target speech diarization', which seeks to determine 'when target event occurred' within an audio signal. We devise a neural architecture called Prompt-driven Target Speech Diarization (PTSD), that works with diverse prompts that specify the target speech events of interest. We train and evaluate PTSD using sim2spk, sim3spk and sim4spk datasets, which are derived from the Librispeech. We show that the proposed framework accurately localizes target speech events. Furthermore, our framework exhibits versatility through its impressive performance in three diarization-related tasks: target speaker voice activity detection, overlapped speech detection and gender diarization. In particular, PTSD achieves comparable performance to specialized models across these tasks on both real and simulated data. This work serves as a reference benchmark and provides valuable insights into prompt-driven target speech processing.

Submitted 8 January, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: Accepted by ICASSP 2024

arXiv:2309.15889 [pdf, other]

High Perceptual Quality Wireless Image Delivery with Denoising Diffusion Models

Authors: Selim F. Yilmaz, Xueyan Niu, Bo Bai, Wei Han, Lei Deng, Deniz Gunduz

Abstract: We consider the image transmission problem over a noisy wireless channel via deep learning-based joint source-channel coding (DeepJSCC) along with a denoising diffusion probabilistic model (DDPM) at the receiver. Specifically, we are interested in the perception-distortion trade-off in the practical finite block length regime, in which separate source and channel coding can be highly suboptimal. We introduce a novel scheme that utilizes the range-null space decomposition of the target image. We transmit the range-space of the image after encoding and employ DDPM to progressively refine its null space contents. Through extensive experiments, we demonstrate significant improvements in distortion and perceptual quality of reconstructed images compared to standard DeepJSCC and the state-of-the-art generative learning-based method. We will publicly share our source code to facilitate further research and reproducibility.

Submitted 27 September, 2023; originally announced September 2023.

Comments: 6 pages, 4 figures
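
The range-null space decomposition named in the abstract above can be written out for the generic linear case. The abstract does not specify the operator, so the sketch assumes a linear map $\mathbf{A}$ with Moore-Penrose pseudo-inverse $\mathbf{A}^{\dagger}$ acting on an image vector $\mathbf{x}$; these symbols are illustrative and not taken from the paper.

```latex
\mathbf{x}
  = \underbrace{\mathbf{A}^{\dagger}\mathbf{A}\,\mathbf{x}}_{\text{range-space part}}
  + \underbrace{(\mathbf{I} - \mathbf{A}^{\dagger}\mathbf{A})\,\mathbf{x}}_{\text{null-space part}},
\qquad
\mathbf{A}\,(\mathbf{I} - \mathbf{A}^{\dagger}\mathbf{A})\,\mathbf{x} = \mathbf{0}
```

The second identity follows from $\mathbf{A}\mathbf{A}^{\dagger}\mathbf{A} = \mathbf{A}$: the null-space part is invisible to $\mathbf{A}$, which is consistent with the abstract's description of transmitting the range-space content and leaving the null-space content to be refined at the receiver.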

arXiv:2309.02835 [pdf]

A flexible and accurate total variation and cascaded denoisers-based image reconstruction algorithm for hyperspectrally compressed ultrafast photography

Authors: Zihan Guo, Jiali Yao, Dalong Qi, Pengpeng Ding, Chengzhi Jin, Ning Xu, Zhiling Zhang, Yunhua Yao, Lianzhong Deng, Zhiyong Wang, Zhenrong Sun, Shian Zhang

Abstract: Hyperspectrally compressed ultrafast photography (HCUP) based on compressed sensing and the time- and spectrum-to-space mappings can simultaneously realize the temporal and spectral imaging of non-repeatable or difficult-to-repeat transient events passively in a single exposure. It possesses an incredibly high frame rate of tens of trillions of frames per second and a sequence depth of several hundred, and plays a revolutionary role in single-shot ultrafast optical imaging. However, due to the ultra-high data compression ratio induced by the extremely large sequence depth as well as the limited fidelities of traditional reconstruction algorithms over the reconstruction process, HCUP suffers from a poor image reconstruction quality and fails to capture fine structures in complex transient scenes. To overcome these restrictions, we propose a flexible image reconstruction algorithm based on the total variation (TV) and cascaded denoisers (CD) for HCUP, named the TV-CD algorithm. It applies the TV denoising model cascaded with several advanced deep learning-based denoising models in the iterative plug-and-play alternating direction method of multipliers framework, which can preserve the image smoothness while utilizing the deep denoising networks to obtain more priori, and thus solving the common sparsity representation problem in local similarity and motion compensation. Both simulation and experimental results show that the proposed TV-CD algorithm can effectively improve the image reconstruction accuracy and quality of HCUP, and further promote the practical applications of HCUP in capturing high-dimensional complex physical, chemical and biological ultrafast optical scenes.

Submitted 6 September, 2023; originally announced September 2023.

Comments: 25 pages, 5 figures and 1 table

arXiv:2307.09775 [pdf, other]

DisCover: Disentangled Music Representation Learning for Cover Song Identification

Authors: Jiahao Xun, Shengyu Zhang, Yanting Yang, Jieming Zhu, Liqun Deng, Zhou Zhao, Zhenhua Dong, Ruiqi Li, Lichao Zhang, Fei Wu

Abstract: In the field of music information retrieval (MIR), cover song identification (CSI) is a challenging task that aims to identify cover versions of a query song from a massive collection. Existing works still suffer from high intra-song variances and inter-song correlations, due to the entangled nature of version-specific and version-invariant factors in their modeling. In this work, we set the goal of disentangling version-specific and version-invariant factors, which could make it easier for the model to learn invariant music representations for unseen query songs. We analyze the CSI task in a disentanglement view with the causal graph technique, and identify the intra-version and inter-version effects biasing the invariant learning. To block these effects, we propose the disentangled music representation learning framework (DisCover) for CSI. DisCover consists of two critical components: (1) Knowledge-guided Disentanglement Module (KDM) and (2) Gradient-based Adversarial Disentanglement Module (GADM), which block intra-version and inter-version biased effects, respectively. KDM minimizes the mutual information between the learned representations and version-variant factors that are identified with prior domain knowledge. GADM identifies version-variant factors by simulating the representation transitions between intra-song versions, and exploits adversarial distillation for effect blocking. Extensive comparisons with best-performing methods and in-depth analysis demonstrate the effectiveness of DisCover and the necessity of disentanglement for CSI.

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.00699 [pdf]

doi 10.1109/JSEN.2022.3178441

Game Theory and Coverage Optimization Based Multihop Routing Protocol for Network Lifetime in Wireless Sensor Networks

Authors: Yindi Yao, Xiong Li, Yanpeng Cui, Lang Deng, Chen Wang

Abstract: Wireless sensor networks (WSNs) are self-organizing monitoring networks with a large number of randomly deployed microsensor nodes to collect various physical information to realize tasks such as intelligent perception, efficient control, and decision-making. However, WSN nodes are powered by batteries, so they will run out of energy after a certain time. This energy limitation will greatly constrain the network performance like network lifetime and energy efficiency. In this study, to prolong the network lifetime, we proposed a multi-hop routing protocol based on game theory and coverage optimization (MRP-GTCO). Briefly, in the stage of setup, two innovational strategies including a clustering game with penalty function and cluster head coverage set were designed to realize the uniformity of cluster head distribution and improve the rationality of cluster head election. In the data transmission stage, we first derived the applicable conditions theorem of inter-cluster multi-hop routing. Based on this, a novel multi-hop path selection algorithm related to residual energy and node degree was proposed to provide an energy-efficient data transmission path. The simulation results showed that the MRP-GTCO protocol can effectively reduce the network energy consumption and extend the network lifetime by 159.22%, 50.76%, and 16.46% compared with LGCA, RLEACH, and ECAGT protocols.

Submitted 2 July, 2023; originally announced July 2023.

Comments: 14 pages, 13 figures, 3 tables

Journal ref: in IEEE Sensors Journal, vol. 22, no. 13, pp. 13739-13752, July, 2022

arXiv:2306.02541 [pdf, other]

OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition

Authors: Li Fu, Siqi Li, Qingtao Li, Fangzhu Li, Liping Deng, Lu Fan, Meng Chen, Youzheng Wu, Xiaodong He

Abstract: Self-Supervised Learning (SSL) Automatic Speech Recognition (ASR) models have shown great promise over Supervised Learning (SL) ones in low-resource settings. However, the advantages of SSL are gradually weakened when the amount of labeled data increases in many industrial applications. To further improve the ASR performance when abundant labels are available, we first explore the potential of combining SL and SSL ASR models via analyzing their complementarity in recognition accuracy and optimization property. Then, we propose a novel Optimal Transport based Fusion (OTF) method for SL and SSL models without incurring extra computation cost in inference. Specifically, optimal transport is adopted to softly align the layer-wise weights to unify the two different networks into a single one. Experimental results on the public 1k-hour English LibriSpeech dataset and our in-house 2.6k-hour Chinese dataset show that OTF largely outperforms the individual models with lower error rates.

Submitted 4 June, 2023; originally announced June 2023.

Comments: Accepted by Interspeech 2023

arXiv:2305.13652 [pdf, ps, other]

Cross-lingual Knowledge Transfer and Iterative Pseudo-labeling for Low-Resource Speech Recognition with Transducers

Authors: Jan Silovsky, Liuhui Deng, Arturo Argueta, Tresi Arvizo, Roger Hsiao, Sasha Kuznietsov, Yiu-Chang Lin, Xiaoqiang Xiao, Yuanyuan Zhang

Abstract: Voice technology has become ubiquitous recently. However, the accuracy, and hence experience, in different languages varies significantly, which makes the technology not equally inclusive. The availability of data for different languages is one of the key factors affecting accuracy, especially in training of all-neural end-to-end automatic speech recognition systems. Cross-lingual knowledge transfer and iterative pseudo-labeling are two techniques that have been shown to be successful for improving the accuracy of ASR systems, in particular for low-resource languages, like Ukrainian. Our goal is to train an all-neural Transducer-based ASR system to replace a DNN-HMM hybrid system with no manually annotated training data. We show that the Transducer system trained using transcripts produced by the hybrid system achieves an 18% reduction in word error rate. However, using a combination of cross-lingual knowledge transfer from related languages and iterative pseudo-labeling, we are able to achieve a 35% reduction of the error rate.

Submitted 22 May, 2023; originally announced May 2023.

arXiv:2304.04774 [pdf, other]

DDRF: Denoising Diffusion Model for Remote Sensing Image Fusion

Authors: ZiHan Cao, ShiQi Cao, Xiao Wu, JunMing Hou, Ran Ran, Liang-Jian Deng

Abstract: The denoising diffusion model, as a generative model, has received a lot of attention in the field of image generation recently, thanks to its powerful generation capability. However, diffusion models have not yet received sufficient research in the field of image fusion. In this article, we introduce the diffusion model to the image fusion field, treating the image fusion task as image-to-image translation and designing two different conditional injection modulation modules (i.e., style transfer modulation and wavelet modulation) to inject coarse-grained style information and fine-grained high-frequency and low-frequency information into the diffusion UNet, thereby generating fused images. In addition, we also discuss the residual learning and the selection of training objectives of the diffusion model in the image fusion task. Extensive experimental results based on quantitative and qualitative assessments compared with benchmarks demonstrate state-of-the-art results and good generalization performance in image fusion tasks. Finally, it is hoped that our method can inspire other works and gain insight into this field to better apply the diffusion model to image fusion tasks. Code shall be released for better reproducibility.

Submitted 10 April, 2023; originally announced April 2023.

arXiv:2212.06466 [pdf, other]

doi 10.1145/3581783.3612084

U2Net: A General Framework with Spatial-Spectral-Integrated Double U-Net for Image Fusion

Authors: Siran Peng, Chenhao Guo, Xiao Wu, Liang-Jian Deng

Abstract: In image fusion tasks, images obtained from different sources exhibit distinct properties. Consequently, treating them uniformly with a single-branch network can lead to inadequate feature extraction. Additionally, numerous works have demonstrated that multi-scaled networks capture information more sufficiently than single-scaled models in pixel-level computer vision problems. Considering these factors, we propose U2Net, a spatial-spectral-integrated double U-shape network for image fusion. The U2Net utilizes a spatial U-Net and a spectral U-Net to extract spatial details and spectral characteristics, which allows for the discriminative and hierarchical learning of features from diverse images. In contrast to most previous works that merely employ concatenation to merge spatial and spectral information, this paper introduces a novel spatial-spectral integration structure called S2Block, which combines feature maps from different sources in a logical and effective way. We conduct a series of experiments on two image fusion tasks, including remote sensing pansharpening and hyperspectral image super-resolution (HISR). The U2Net outperforms representative state-of-the-art (SOTA) approaches in both quantitative and qualitative evaluations, demonstrating the superiority of our method. The code is available at https://github.com/PSRben/U2Net.

Submitted 2 October, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

Comments: Accepted by the 31st ACM International Conference on Multimedia (ACM MM '23)

arXiv:2210.14515 [pdf, other]

UFO2: A unified pre-training framework for online and offline speech recognition

Authors: Li Fu, Siqi Li, Qingtao Li, Liping Deng, Fangzhu Li, Lu Fan, Meng Chen, Xiaodong He

Abstract: In this paper, we propose a Unified pre-training Framework for Online and Offline (UFO2) Automatic Speech Recognition (ASR), which 1) simplifies the two separate training workflows for online and offline modes into one process, and 2) improves the Word Error Rate (WER) performance with limited utterance annotating. Specifically, we extend the conventional offline-mode Self-Supervised Learning (SSL)-based ASR approach to a unified manner, where the model training is conditioned on both the full-context and dynamic-chunked inputs. To enhance the pre-trained representation model, stop-gradient operation is applied to decouple the online-mode objectives to the quantizer. Moreover, in both the pre-training and the downstream fine-tuning stages, joint losses are proposed to train the unified model with full-weight sharing for the two modes. Experimental results on the LibriSpeech dataset show that UFO2 outperforms the SSL-based baseline method by 29.7% and 18.2% relative WER reduction in offline and online modes, respectively.

Submitted 3 April, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: Accepted by ICASSP 2023

arXiv:2210.12214 [pdf, ps, other]

Optimizing Bilingual Neural Transducer with Synthetic Code-switching Text Generation

Authors: Thien Nguyen, Nathalie Tran, Liuhui Deng, Thiago Fraga da Silva, Matthew Radzihovsky, Roger Hsiao, Henry Mason, Stefan Braun, Erik McDermott, Dogan Can, Pawel Swietojanski, Lyan Verwimp, Sibel Oyman, Tresi Arvizo, Honza Silovsky, Arnab Ghoshal, Mathieu Martel, Bharat Ram Ambati, Mohamed Ali

Abstract: Code-switching describes the practice of using more than one language in the same sentence. In this study, we investigate how to optimize a neural transducer based bilingual automatic speech recognition (ASR) model for code-switching speech. Focusing on the scenario where the ASR model is trained without supervised code-switching data, we found that semi-supervised training and synthetic code-switched data can improve the bilingual ASR system on code-switching speech. We analyze how each of the neural transducer's encoders contributes towards code-switching performance by measuring encoder-specific recall values, and evaluate our English/Mandarin system on the ASCEND data set. Our final system achieves 25% mixed error rate (MER) on the ASCEND English/Mandarin code-switching test set -- reducing the MER by 2.1% absolute compared to the previous literature -- while maintaining good accuracy on the monolingual test sets.

Submitted 21 October, 2022; originally announced October 2022.

Comments: 5 pages, 1 figure, submitted to ICASSP 2023, *: equal contributions

arXiv:2210.07360 [pdf, other]

Reducing Action Space: Reference-Model-Assisted Deep Reinforcement Learning for Inverter-based Volt-Var Control

Authors: Qiong Liu, Ye Guo, Lirong Deng, Haotian Liu, Dongyu Li, Hongbin Sun

Abstract: Reference-model-assisted deep reinforcement learning (DRL) for inverter-based Volt-Var Control (IB-VVC) in active distribution networks is proposed. We find that a large action space increases the learning difficulties of DRL and degrades the optimization performance in the process of generating data and training neural networks. To reduce the action space of DRL, we design a reference-model-assisted DRL approach. We introduce definitions of the reference model, reference-model-based optimization, and reference actions. The reference-model-assisted DRL learns the residual actions between the reference actions and optimal actions, rather than learning the optimal actions directly. Since the residual actions are considerably smaller than the optimal actions for a reference model, we can design a smaller action space for the reference-model-assisted DRL. It reduces the learning difficulties of DRL and optimises the performance of the reference-model-assisted DRL approach. It is noteworthy that the reference-model-assisted DRL approach is compatible with any policy gradient DRL algorithms for continuous action problems. This work takes the soft actor-critic algorithm as an example and designs a reference-model-assisted soft actor-critic algorithm. Simulations show that 1) a large action space degrades the performance of DRL in the whole training stage, and 2) reference-model-assisted DRL requires fewer iterations and returns a better optimization performance.

Submitted 9 October, 2022; originally announced October 2022.

Comments: 10 pages, 9 figures
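
The residual-action idea in the abstract above (learn a small correction on top of a reference action rather than the full action) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `combine_action`, the symmetric `residual_limit`, and the per-dimension clipping are assumptions made only for illustration.

```python
def combine_action(reference_action, residual_action, residual_limit,
                   action_low, action_high):
    """Add a bounded residual to a reference action (hypothetical helper).

    reference_action: action from the reference-model-based optimization
    residual_action:  correction proposed by the learned policy
    residual_limit:   assumed half-width of the reduced residual action space
    action_low/high:  assumed feasible range of each control dimension
    """
    combined = []
    for ref, res, low, high in zip(reference_action, residual_action,
                                   action_low, action_high):
        # Keep the learned correction inside the reduced residual action space.
        res = max(-residual_limit, min(residual_limit, res))
        # Keep the final set-point inside the feasible range of the device.
        combined.append(max(low, min(high, ref + res)))
    return combined


# Toy usage with made-up numbers (two control dimensions).
final_action = combine_action(
    reference_action=[0.5, -0.875],
    residual_action=[0.75, -0.5],
    residual_limit=0.25,
    action_low=[-1.0, -1.0],
    action_high=[1.0, 1.0],
)
print(final_action)
```

The smaller `residual_limit` is, the closer the final action stays to the reference action, which is the trade-off the abstract describes as designing a smaller action space.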

arXiv:2205.00485 [pdf, ps, other]

Bilingual End-to-End ASR with Byte-Level Subwords

Authors: Liuhui Deng, Roger Hsiao, Arnab Ghoshal

Abstract: In this paper, we investigate how the output representation of an end-to-end neural network affects multilingual automatic speech recognition (ASR). We study different representations including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE) representations, and analyze their strengths and weaknesses. We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages in a single utterance but may change languages across utterances. We conduct our experiments on English and Mandarin dictation tasks, and we find that BBPE with penalty schemes can improve utterance-based bilingual ASR performance by 2% to 5% relative even with a smaller number of outputs and fewer parameters. We conclude with analysis that indicates directions for further improving multilingual ASR.

Submitted 1 May, 2022; originally announced May 2022.

Comments: 5 pages, to be published in IEEE ICASSP 2022

arXiv:2204.05460 [pdf, other]

CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction

Authors: Daxin Tan, Liqun Deng, Nianzu Zheng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee

Abstract: This study proposes a fully automated system for speech correction and accent reduction. Consider the application scenario in which a recorded speech audio contains certain errors, e.g., inappropriate words or mispronunciations, that need to be corrected. The proposed system, named CorrectSpeech, performs the correction in three steps: recognizing the recorded speech and converting it into a time-stamped symbol sequence, aligning the recognized symbol sequence with the target text to determine locations and types of required edit operations, and generating the corrected speech. Experiments show that the quality and naturalness of corrected speech depend on the performance of speech recognition and alignment modules, as well as the granularity level of editing operations. The proposed system is evaluated on two corpora: a manually perturbed version of VCTK and L2-ARCTIC. The results demonstrate that our system is able to correct mispronunciation and reduce accent in speech recordings. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/CorrectSpeech/ .

Submitted 13 October, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: Accepted by ISCSLP 2022

arXiv:2203.04402 [pdf]

High Noise Immune Time-domain Inversion via Cascade Network (TICaN) for Complex Scatterers

Authors: Hongyu Gao, Yinpeng Wang, Qiang Ren, Zixi Wang, Liangcheng Deng, Chenyu Shi

Abstract: In this paper, a high noise immune time-domain inversion cascade network (TICaN) is proposed to reconstruct scatterers from the measured electromagnetic fields. The TICaN is comprised of a denoising block aiming at improving the signal-to-noise ratio, and an inversion block to reconstruct the electromagnetic properties from the raw time-domain measurements. The scatterers investigated in this study include complicated geometry shapes and high contrast, which cover the stratum layer, lossy medium and hyperfine structure, etc. After being well trained, the performance of the TICaN is evaluated from the perspective of accuracy, noise-immunity, computational acceleration, and generalizability. It can be proven that the proposed framework can realize high-precision inversion under high-intensity noise environments. Compared with traditional reconstruction methods, TICaN avoids the tedious iterative calculation by utilizing the parallel computing ability of GPU and thus significantly reduces the computing time. Besides, the proposed TICaN has certain generalization ability in reconstructing the unknown scatterers such as the famous Austria rings. We are confident that the proposed TICaN will serve as a new path for real-time quantitative microwave imaging for various practical scenarios.

Submitted 2 March, 2022; originally announced March 2022.

Comments: 9 pages, 11 figures

arXiv:2201.12155 [pdf, other]

Reducing language context confusion for end-to-end code-switching automatic speech recognition

Authors: Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Jianhua Tao, Yu Ting Yeung, Liqun Deng

Abstract: Code-switching deals with alternating languages in the communication process. Training end-to-end (E2E) automatic speech recognition (ASR) systems for code-switching is especially challenging as code-switching training data are always insufficient to combat the increased multilingual context confusion due to the presence of more than one language. We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model based on the Equivalence Constraint (EC) Theory. The linguistic theory requires that any monolingual fragment that occurs in the code-switching sentence must occur in one of the monolingual sentences. The theory establishes a bridge between monolingual data and code-switching data. We leverage this linguistic theory to design the code-switching E2E ASR model. The proposed model efficiently transfers language knowledge from rich monolingual data to improve the performance of the code-switching ASR model. We evaluate our model on the ASRU 2019 Mandarin-English code-switching challenge dataset. Compared to the baseline model, our proposed model achieves a 17.12% relative error reduction.

Submitted 29 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

Comments: arXiv admin note: text overlap with arXiv:2010.14798; the paper has been accepted by Interspeech 2022

arXiv:2112.02237 [pdf, other]

A Triple-Double Convolutional Neural Network for Panchromatic Sharpening

Authors: Tian-Jing Zhang, Liang-Jian Deng, Ting-Zhu Huang, Jocelyn Chanussot, Gemine Vivone

Abstract: Pansharpening refers to the fusion of a panchromatic image with a high spatial resolution and a multispectral image with a low spatial resolution, aiming to obtain a high spatial resolution multispectral image. In this paper, we propose a novel deep neural network architecture with a level-domain based loss function for pansharpening by taking into account the following double-type structures, i.e., double-level, double-branch, and double-direction, called the triple-double network (TDNet). By using the structure of TDNet, the spatial details of the panchromatic image can be fully exploited and progressively injected into the low spatial resolution multispectral image, thus yielding the high spatial resolution output. The specific network design is motivated by the physical formula of the traditional multi-resolution analysis (MRA) methods. Hence, an effective MRA fusion module is also integrated into the TDNet. Besides, we adopt a few ResNet blocks and some multi-scale convolution kernels to deepen and widen the network to effectively enhance the feature extraction and the robustness of the proposed TDNet. Extensive experiments on reduced- and full-resolution datasets acquired by WorldView-3, QuickBird, and GaoFen-2 sensors demonstrate the superiority of the proposed TDNet compared with some recent state-of-the-art pansharpening approaches. An ablation study has also corroborated the effectiveness of the proposed approach.

Submitted 3 December, 2021; originally announced December 2021.

arXiv:2111.08191 [pdf, other]

CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis

Authors: Nianzu Zheng, Liqun Deng, Wenyong Huang, Yu Ting Yeung, Baohua Xu, Yuanyuan Guo, Yasheng Wang, Xiao Chen, Xin Jiang, Qun Liu

Abstract: Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significant time latency especially for long paragraphs. We propose a streaming e2e MDD model called CoCA-MDD. We utilize a conv-transformer structure to encode input speech in a streaming manner. A coupled cross-attention (CoCA) mechanism is proposed to integrate frame-level acoustic features with encoded reference linguistic features. CoCA also enables our model to perform mispronunciation classification with whole utterances. The proposed model allows system fusion between the streaming output and mispronunciation classification output for further performance enhancement. We evaluate CoCA-MDD on publicly available corpora. CoCA-MDD achieves F1 scores of 57.03% and 60.78% for streaming and fusion modes respectively on L2-ARCTIC. For phone-level pronunciation scoring, CoCA-MDD achieves a Pearson correlation coefficient (PCC) of 0.58 on SpeechOcean762.

Submitted 29 June, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

Comments: 5 pages, 4 figures, Accepted by INTERSPEECH 2022

arXiv:2108.03176 [pdf, ps, other]

Dynamic Control for Random Access in Deadline-Constrained Broadcasting

Authors: Aoyu Gong, Lei Deng, Fang Liu, Yijin Zhang

Abstract: This paper considers random access in deadline-constrained broadcasting with frame-synchronized traffic. To enhance the maximum achievable timely delivery ratio (TDR), we define a dynamic control scheme that allows each active node to determine the transmission probability with certainty based on the current delivery urgency and the knowledge of current contention intensity. For an idealized environment where the contention intensity is completely known, we develop an analytical framework based on the theory of Markov Decision Process (MDP), which leads to an optimal scheme by applying backward induction. For a realistic environment where the contention intensity is incompletely known, we develop a framework using Partially Observable Markov Decision Process (POMDP), which can in theory be solved. We show that for both environments, there exists an optimal scheme that is optimal over all types of policies. To overcome the infeasibility in obtaining an optimal or near-optimal scheme from the POMDP framework, we investigate the behaviors of the optimal scheme for two extreme cases in the MDP framework, and leverage intuition gained from these behaviors to propose a heuristic scheme for the realistic environment with TDR close to the maximum achievable TDR in the idealized environment. In addition, we propose an approximation on the knowledge of contention intensity to further simplify this heuristic scheme. Numerical results with respect to a wide range of configurations are provided to validate our study.

Submitted 6 August, 2021; originally announced August 2021.
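
The abstract above mentions obtaining an optimal scheme for a finite-horizon MDP by backward induction. The following is a generic sketch of that technique only, not the paper's model: the state names, the candidate transmission probabilities, the assumed success probability, and the reward definition are all hypothetical toy choices.

```python
def backward_induction(horizon, states, actions, transition, reward):
    """Generic finite-horizon backward induction.

    transition(s, a) -> list of (probability, next_state) pairs
    reward(s, a)     -> expected immediate reward
    Returns (value, policy): value[t][s] is the best expected return from
    slot t onward, and policy[t][s] is an action that attains it.
    """
    value = [{s: 0.0 for s in states} for _ in range(horizon + 1)]
    policy = [{} for _ in range(horizon)]
    # Sweep from the last decision slot back to the first.
    for t in range(horizon - 1, -1, -1):
        for s in states:
            best_action, best_q = None, float("-inf")
            for a in actions:
                q = reward(s, a) + sum(
                    p * value[t + 1][s2] for p, s2 in transition(s, a)
                )
                if q > best_q:
                    best_action, best_q = a, q
            value[t][s] = best_q
            policy[t][s] = best_action
    return value, policy


# Toy example (all numbers are assumptions): a node holds one packet and
# chooses a transmission probability; a transmission succeeds with an
# assumed probability of 0.5.
STATES = ["pending", "done"]
ACTIONS = [0.0, 0.5, 1.0]

def toy_transition(state, action):
    if state == "done":
        return [(1.0, "done")]
    success = action * 0.5
    return [(success, "done"), (1.0 - success, "pending")]

def toy_reward(state, action):
    # Expected reward of 1 per delivered packet, earned only while pending.
    return action * 0.5 if state == "pending" else 0.0

value, policy = backward_induction(2, STATES, ACTIONS, toy_transition, toy_reward)
print(value[0]["pending"], policy[0]["pending"])
```

In this toy setting there is no contention between nodes, so always transmitting is optimal; the contention-aware trade-off studied in the paper is not modeled here.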

arXiv:2107.11617 [pdf, other]

LAConv: Local Adaptive Convolution for Image Fusion

Authors: Zi-Rong Jin, Liang-Jian Deng, Tai-Xiang Jiang, Tian-Jing Zhang

Abstract: The convolution operation is a powerful tool for feature extraction and plays a prominent role in the field of computer vision. However, when targeting the pixel-wise tasks like image fusion, it would not fully perceive the particularity of each pixel in the image if the uniform convolution kernel is used on different patches. In this paper, we propose a local adaptive convolution (LAConv), which is dynamically adjusted to different spatial locations. LAConv enables the network to pay attention to every specific local area in the learning process. Besides, the dynamic bias (DYB) is introduced to provide more possibilities for the depiction of features and make the network more flexible. We further design a residual structure network equipped with the proposed LAConv and DYB modules, and apply it to two image fusion tasks. Experiments for pansharpening and hyperspectral image super-resolution (HISR) demonstrate the superiority of our method over other state-of-the-art methods. It is worth mentioning that LAConv can also be competent for other super-resolution tasks with less computation effort.

Submitted 24 July, 2021; originally announced July 2021.

arXiv:2107.01554 [pdf, other]

EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion

Authors: Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee

Abstract: This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance, without causing audible degradation in speech quality and naturalness. The EditSpeech system is developed upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to effectively incorporate the contextual information related to the edited region and achieve smooth transition at both left and right boundaries. Distortion introduced to the unmodified parts of the utterance is alleviated. The EditSpeech system is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluations demonstrate that EditSpeech outperforms a few baseline systems in terms of low spectral distortion and preferred speech quality. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/EditSpeech/ .

Submitted 7 October, 2021; v1 submitted 4 July, 2021; originally announced July 2021.

Comments: Accepted by ASRU 2021

arXiv:2106.10132 [pdf, other]

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

Authors: Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

Abstract: One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations, by reducing their inter-dependencies in an unsupervised manner. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations for retaining source linguistic content and intonation variations, while capturing target speaker characteristics. In doing so, the proposed approach achieves higher speech naturalness and speaker similarity than current state-of-the-art one-shot VC systems. Our code, pre-trained models and demo are available at https://github.com/Wendison/VQMIVC.

Submitted 18 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021. Code, pre-trained models and demo are available at https://github.com/Wendison/VQMIVC

arXiv:2106.10127 [pdf, other]

Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization

Authors: Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

Abstract: Dysarthric speech detection (DSD) systems aim to detect characteristics of the neuromotor disorder from speech. Such systems are particularly susceptible to domain mismatch where the training and testing data come from the source and target domains respectively, but the two domains may differ in terms of speech stimuli, disease etiology, etc. It is hard to acquire labelled data in the target domain, due to high costs of annotating sizeable datasets. This paper makes a first attempt to formulate cross-domain DSD as an unsupervised domain adaptation (UDA) problem. We use labelled source-domain data and unlabelled target-domain data, and propose a multi-task learning strategy, including dysarthria presence classification (DPC), domain adversarial training (DAT) and mutual information minimization (MIM), which aim to learn dysarthria-discriminative and domain-invariant biomarker embeddings. Specifically, DPC helps biomarker embeddings capture critical indicators of dysarthria; DAT forces biomarker embeddings to be indistinguishable in source and target domains; and MIM further reduces the correlation between biomarker embeddings and domain-related cues. By treating the UASPEECH and TORGO corpora respectively as the source and target domains, experiments show that the incorporation of UDA attains absolute increases of 22.2% and 20.0% respectively in utterance-level weighted average recall and speaker-level accuracy.

Submitted 18 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021

arXiv:2011.13657 [pdf]

Community Energy Storage Management for Welfare Optimization Using a Markov Decision Process

Authors: Lirong Deng, Xuan Zhang, Tianshu Yang, Hongbin Sun, Shmuel S. Oren

Abstract: In this paper, we address an optimal management problem of community energy storage in the real-time electricity market under a stochastic renewable environment. In a real-time electricity market, complete market information may not be assessable for a strategic participant, hence we propose a paradigm that uses partial information including the forecast of real-time prices and slopes of the aggregate supply curve to model the price impact of storage use in the price-maker storage management problem. As a price maker, the community energy storage can not only earn profits through energy arbitrage but also smooth price trajectories and further influence social welfare. We formulate the problem as a finite-horizon Markov decision process that aims to maximize the energy arbitrage and social welfare of the prosumer-based community. The advance of the management scheme is that the optimal policy has a threshold structure. The structure has an analytic form that can guide the energy storage to charge/discharge by comparing its current marginal value and the expected future marginal value. Case studies indicate that welfare-maximizing storage earns more benefits than profit-maximizing storage. The proposed threshold-based algorithm can guarantee optimality and largely decrease the computational complexity of standard stochastic dynamic programming.

Submitted 27 November, 2020; originally announced November 2020.

arXiv:2011.13652 [pdf]

Optimal Planning of Integrated Heat and Electricity Systems: a Tightening McCormick Approach

Authors: Lirong Deng, Xuan Zhang, Tianshu Yang, Hongbin Sun

Abstract: In this paper, we propose a convex planning model of integrated heat and electricity systems considering variable mass flow rates. The main challenge comes from the non-convexity of the bilinear terms in the district heating network, i.e., the product of mass flow rate and nodal temperature. To resolve this issue, we first reformulate the district heating network model through equivalent transformation and variable substitution. It shows that the reformulated model has only one set of nonconvex constraints with reduced bilinear terms and the others are linear constraints. Such a reformulation not only guarantees the optimality but fastens the solving process. To relax the remaining bilinear constraints, we apply McCormick envelopes and further propose a heuristic tightening method to constrict the bounds of the McCormick approach and get a nearby feasible solution. Case studies show that the tightening McCormick method quickly solves the heat-electricity planning problem with acceptable feasibility check and optimality.

Submitted 27 November, 2020; originally announced November 2020.

arXiv:2007.04289 [pdf]

A Quadratic Convex Approximation of Optimal Power Flow in Distribution System with Application in Loss Allocation

Authors: Tianshu Yang, Ye Guo, Lirong Deng, Hongbin Sun, Wenchuan Wu

Abstract: In this paper, a novel quadratic convex optimal power flow model, namely, MDOPF, is proposed to determine the optimal dispatches of distributed generators. Based on the results of MDOPF, two price mechanisms, distribution locational marginal price (DLMP) and distribution locational price (DLP), are analyzed. For DLMP, an explicit method is developed to calculate the marginal loss that does not require a backward/forward sweep algorithm and thus reduces the computational complexity. However, the marginal loss component in DLMP will cause over-collection of losses (OCL). To address this issue, DLP is defined, which contains two components, the energy cost component and loss component, where the loss component is determined by the proposed loss allocation method (LAM). Numerical tests show that the proposed MDOPF has a better accuracy than existing OPF models based on linear power flow equations. In addition, the proposed marginal loss method and DLMP algorithm have satisfactory accuracy compared with benchmarks provided by ACOPF, and the proposed DLP can eliminate OCL.

Submitted 4 September, 2020; v1 submitted 8 July, 2020; originally announced July 2020.

arXiv:2007.02074 [pdf]

doi 10.1109/TSG.2020.3039984

A Linear Branch Flow Model for Radial Distribution Networks and its Application to Reactive Power Optimization and Network Reconfiguration

Authors: Tianshu Yang, Ye Guo, Lirong Deng, Hongbin Sun, Wenchuan Wu

Abstract: This paper presents a cold-start linear branch flow model named modified DistFlow. In modified DistFlow, the active and reactive power are replaced by their ratios to voltage magnitude as state variables, so that errors introduced by conventional branch flow linearization approaches due to their complete ignoring of the quadratic term are reduced. Based on the path-branch incidence matrix, branch power flows and nodal voltage magnitudes can be obtained in a non-iterative and explicit manner. Subsequently, the proposed modified DistFlow model is applied to the problem of reactive power optimization and network reconfiguration, transforming it into a mixed-integer quadratic programming (MIQP). Simulations show that the proposed modified DistFlow has a better accuracy than existing cold-start linear branch flow models for distribution networks, and the resulting MIQP model for reactive power optimization and network reconfiguration is much more computationally efficient than existing benchmarks.

Submitted 22 November, 2020; v1 submitted 4 July, 2020; originally announced July 2020.

arXiv:2005.14400 [pdf, other]

Hyperspectral Image Super-resolution via Deep Spatio-spectral Convolutional Neural Networks

Authors: Jin-Fan Hu, Ting-Zhu Huang, Liang-Jian Deng, Tai-Xiang Jiang, Gemine Vivone, Jocelyn Chanussot

Abstract: Hyperspectral images are of crucial importance in order to better understand features of different materials. To reach this goal, they leverage on a high number of spectral bands. However, this interesting characteristic is often paid by a reduced spatial resolution compared with traditional multispectral image systems. In order to alleviate this issue, in this work, we propose a simple and efficient architecture for deep convolutional neural networks to fuse a low-resolution hyperspectral image (LR-HSI) and a high-resolution multispectral image (HR-MSI), yielding a high-resolution hyperspectral image (HR-HSI). The network is designed to preserve both spatial and spectral information thanks to an architecture from two folds: one is to utilize the HR-HSI at a different scale to get an output with a satisfied spectral preservation; another one is to apply concepts of multi-resolution analysis to extract high-frequency information, aiming to output high quality spatial details. Finally, a plain mean squared error loss function is used to measure the performance during the training. Extensive experiments demonstrate that the proposed network architecture achieves best performance (both qualitatively and quantitatively) compared with recent state-of-the-art hyperspectral image super-resolution approaches. Moreover, other significant advantages can be pointed out by the use of the proposed approach, such as, a better network generalization ability, a limited computational burden, and a robustness with respect to the number of training samples.

Submitted 29 May, 2020; originally announced May 2020.

arXiv:2005.02183 [pdf, other]

Comparing SNNs and RNNs on Neuromorphic Vision Datasets: Similarities and Differences

Authors: Weihua He, YuJie Wu, Lei Deng, Guoqi Li, Haoyu Wang, Yang Tian, Wei Ding, Wenhui Wang, Yuan Xie

Abstract: Neuromorphic data, recording frameless spike events, have attracted considerable attention for the spatiotemporal information components and the event-driven processing fashion. Spiking neural networks (SNNs) represent a family of event-driven models with spatiotemporal dynamics for neuromorphic computing, which are widely benchmarked on neuromorphic data. Interestingly, researchers in the machine learning community can argue that recurrent (artificial) neural networks (RNNs) also have the capability to extract spatiotemporal features although they are not event-driven. Thus, the question of "what will happen if we benchmark these two kinds of models together on neuromorphic data" comes out but remains unclear. In this work, we make a systematic study to compare SNNs and RNNs on neuromorphic data, taking the vision datasets as a case study. First, we identify the similarities and differences between SNNs and RNNs (including the vanilla RNNs and LSTM) from the modeling and learning perspectives. To improve comparability and fairness, we unify the supervised learning algorithm based on backpropagation through time (BPTT), the loss function exploiting the outputs at all timesteps, the network structure with stacked fully-connected or convolutional layers, and the hyper-parameters during training. Especially, given the mainstream loss function used in RNNs, we modify it inspired by the rate coding scheme to approach that of SNNs. Furthermore, we tune the temporal resolution of datasets to test model robustness and generalization. At last, a series of contrast experiments are conducted on two types of neuromorphic datasets: DVS-converted (N-MNIST) and DVS-captured (DVS Gesture).

Submitted 2 May, 2020; originally announced May 2020.

arXiv:2001.01587 [pdf, other]

Exploring Adversarial Attack in Spiking Neural Networks with Spike-Compatible Gradient

Authors: Ling Liang, Xing Hu, Lei Deng, Yujie Wu, Guoqi Li, Yufei Ding, Peng Li, Yuan Xie

Abstract: Recently, backpropagation through time inspired learning algorithms are widely introduced into SNNs to improve the performance, which brings the possibility to attack the models accurately given Spatio-temporal gradient maps. We propose two approaches to address the challenges of gradient input incompatibility and gradient vanishing. Specifically, we design a gradient to spike converter to convert continuous gradients to ternary ones compatible with spike inputs. Then, we design a gradient trigger to construct ternary gradients that can randomly flip the spike inputs with a controllable turnover rate, when meeting all zero gradients. Putting these methods together, we build an adversarial attack methodology for SNNs trained by supervised algorithms. Moreover, we analyze the influence of the training loss function and the firing threshold of the penultimate layer, which indicates a "trap" region under the cross-entropy loss that can be escaped by threshold tuning. Extensive experiments are conducted to validate the effectiveness of our solution. Besides the quantitative analysis of the influence factors, we evidence that SNNs are more robust against adversarial attack than ANNs. This work can help reveal what happens in SNN attack and might stimulate more research on the security of SNN models and neuromorphic devices.

Submitted 30 September, 2020; v1 submitted 1 January, 2020; originally announced January 2020.

arXiv:1912.12419 [pdf, other]

Transfer Learning in General Lensless Imaging through Scattering Media

Authors: Yukuan Yang, Lei Deng, Peng Jiao, Yansong Chua, Jing Pei, Cheng Ma, Guoqi Li

Abstract: Recently deep neural networks (DNNs) have been successfully introduced to the field of lensless imaging through scattering media. By solving an inverse problem in computational imaging, DNNs can overcome several shortcomings in the conventional lensless imaging through scattering media methods, namely, high cost, poor quality, complex control, and poor anti-interference. However, for training, a large number of training samples on various datasets have to be collected, with a DNN trained on one dataset generally performing poorly for recovering images from another dataset. The underlying reason is that lensless imaging through scattering media is a high dimensional regression problem and it is difficult to obtain an analytical solution. In this work, transfer learning is proposed to address this issue. Our main idea is to train a DNN on a relatively complex dataset using a large number of training samples and fine-tune the last few layers using very few samples from other datasets. Instead of the thousands of samples required to train from scratch, transfer learning alleviates the problem of costly data acquisition. Specifically, considering the difference in sample sizes and similarity among datasets, we propose two DNN architectures, namely LISMU-FCN and LISMU-OCN, and a balance loss function designed for balancing smoothness and sharpness. LISMU-FCN, with much fewer parameters, can achieve imaging across similar datasets while LISMU-OCN can achieve imaging across significantly different datasets. What's more, we establish a set of simulation algorithms which are close to the real experiment, and it is of great significance and practical value in the research on lensless scattering imaging. In summary, this work provides a new solution for lensless imaging through scattering media using transfer learning in DNNs.

Submitted 28 December, 2019; originally announced December 2019.

arXiv:1911.00822 [pdf, other]

Comprehensive SNN Compression Using ADMM Optimization and Activity Regularization

Authors: Lei Deng, Yujie Wu, Yifan Hu, Ling Liang, Guoqi Li, Xing Hu, Yufei Ding, Peng Li, Yuan Xie

Abstract: As well known, the huge memory and compute costs of both artificial neural networks (ANNs) and spiking neural networks (SNNs) greatly hinder their deployment on edge devices with high efficiency. Model compression has been proposed as a promising technique to improve the running efficiency via parameter and operation reduction. Whereas, this technique is mainly practiced in ANNs rather than SNNs. It is interesting to answer how much an SNN model can be compressed without compromising its functionality, where two challenges should be addressed: i) the accuracy of SNNs is usually sensitive to model compression, which requires an accurate compression methodology; ii) the computation of SNNs is event-driven rather than static, which produces an extra compression dimension on dynamic spikes. To this end, we realize a comprehensive SNN compression through three steps. First, we formulate the connection pruning and weight quantization as a constrained optimization problem. Second, we combine spatio-temporal backpropagation (STBP) and alternating direction method of multipliers (ADMM) to solve the problem with minimum accuracy loss. Third, we further propose activity regularization to reduce the spike events for fewer active operations. These methods can be applied in either a single way for moderate compression or a joint way for aggressive compression. We define several quantitative metrics to evaluate the compression performance for SNNs. Our methodology is validated in pattern recognition tasks over MNIST, N-MNIST, CIFAR10, and CIFAR100 datasets, where extensive comparisons, analyses, and insights are provided. To our best knowledge, this is the first work that studies SNN compression in a comprehensive manner by exploiting all compressible components and achieves better results.

Submitted 20 August, 2020; v1 submitted 3 November, 2019; originally announced November 2019.

Comments: Under review

arXiv:1810.11390 [pdf, other]

Joint Estimation of DOA and Frequency with Sub-Nyquist Sampling in a Binary Array Radar System

Authors: Zhan Zhang, Ping Wei, Lijuan Deng, Huaguo Zhang

Abstract: Recently, several array radar structures combined with sub-Nyquist techniques and corresponding algorithms have been extensively studied. Carrier frequency and direction-of-arrival (DOA) estimations of multiple narrow-band signals received by array radars at the sub-Nyquist rates are considered in this paper. We propose a new sub-Nyquist array radar architecture (a binary array radar separately connected to a multi-coset structure with M branches) and an efficient joint estimation algorithm which can match frequencies up with corresponding DOAs. We further come up with a delay pattern augmenting method, by which the capability of the number of identifiable signals can increase from M-1 to Q-1 (Q is extended degrees of freedom). We further conclude that the minimum total sampling rate 2MB is sufficient to identify $ {K \leq Q-1}$ narrow-band signals of maximum bandwidth $B$ inside. The effectiveness and performance of the estimation algorithm together with the augmenting method have been verified by simulations.

Submitted 26 October, 2018; originally announced October 2018.

Comments: 6 pages, 2 figures, conference

arXiv:1509.03044 [pdf, other]

Recurrent Reinforcement Learning: A Hybrid Approach

Authors: Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, Ji He

Abstract: Successful applications of reinforcement learning in real-world problems often require dealing with partially observable states. It is in general very challenging to construct and infer hidden states as they often depend on the agent's entire interaction history and may require substantial domain knowledge. In this work, we investigate a deep-learning approach to learning the representation of states in partially observable tasks, with minimal prior knowledge of the domain. In particular, we propose a new family of hybrid models that combines the strength of both supervised learning (SL) and reinforcement learning (RL), trained in a joint fashion: The SL component can be a recurrent neural networks (RNN) or its long short-term memory (LSTM) version, which is equipped with the desired property of being able to capture long-term dependency on history, thus providing an effective way of learning the representation of hidden states. The RL component is a deep Q-network (DQN) that learns to optimize the control for maximizing long-term rewards. Extensive experiments in a direct mailing campaign problem demonstrate the effectiveness and advantages of the proposed approach, which performs the best among a set of previous state-of-the-art methods.

Submitted 19 November, 2015; v1 submitted 10 September, 2015; originally announced September 2015.

Comments: 11 pages, 6 figures

Showing 1–45 of 45 results for author: Deng, L