-
Reduce, Reuse, Recycle: Categories for Compositional Reinforcement Learning
Authors:
Georgios Bakirtzis,
Michail Savvas,
Ruihan Zhao,
Sandeep Chinchali,
Ufuk Topcu
Abstract:
In reinforcement learning, conducting task composition by forming cohesive, executable sequences from multiple tasks remains challenging. However, the ability to (de)compose tasks is a linchpin in developing robotic systems capable of learning complex behaviors. Yet, compositional reinforcement learning is beset with difficulties, including the high dimensionality of the problem space, scarcity of rewards, and absence of system robustness after task composition. To surmount these challenges, we view task composition through the prism of category theory -- a mathematical discipline exploring structures and their compositional relationships. The categorical properties of Markov decision processes untangle complex tasks into manageable sub-tasks, allowing for strategical reduction of dimensionality, facilitating more tractable reward structures, and bolstering system robustness. Experimental results support the categorical theory of reinforcement learning by enabling skill reduction, reuse, and recycling when learning complex robotic arm tasks.
Submitted 23 August, 2024;
originally announced August 2024.
-
Enhanced Equivalent Circuit Model for High Current Discharge of Lithium-Ion Batteries with Application to Electric Vertical Takeoff and Landing Aircraft
Authors:
Alireza Goshtasbi,
Ruxiu Zhao,
Ruiting Wang,
Sangwoo Han,
Wenting Ma,
Jeremy Neubauer
Abstract:
Conventional battery equivalent circuit models (ECMs) have limited capability to predict performance at high discharge rates, where lithium depleted regions may develop and cause a sudden exponential drop in the cell's terminal voltage. Having accurate predictions of performance under such conditions is necessary for electric vertical takeoff and landing (eVTOL) aircraft applications, where high discharge currents can be required during fault scenarios and the inability to provide these currents can be safety-critical. To address this challenge, we utilize data-driven modeling methods to derive a parsimonious addition to a conventional ECM that can capture the observed rapid voltage drop with only one additional state. We also provide a detailed method for identifying the resulting model parameters, including an extensive characterization data set along with a well-regularized objective function formulation. The model is validated against a novel data set of over 150 flights encompassing a wide array of conditions for an eVTOL aircraft using an application-specific and safety-relevant reserve duration metric for quantifying accuracy. The model is shown to predict the landing hover capability with an error mean and standard deviation of 2.9 and 6.2 seconds, respectively, defining the model's ability to capture the cell voltage behavior under high discharge currents.
Submitted 15 August, 2024;
originally announced August 2024.
-
Securing V2I Backscattering from Eavesdropper
Authors:
Ruotong Zhao,
Deepak Mishra,
Aruna Seneviratne
Abstract:
As our cities become more intelligent and more connected with new technologies like 6G, improving communication between vehicles and infrastructure is essential while reducing energy consumption. This study proposes a secure framework for vehicle-to-infrastructure (V2I) backscattering near an eavesdropping vehicle to maximize the sum secrecy rate of V2I backscatter communication over multiple coherence slots. This sustainable framework aims to jointly optimize the reflection coefficients at the backscattering vehicle, carrier emitter power, and artificial noise at the infrastructure, along with the target vehicle's linear trajectory in the presence of an eavesdropping vehicle in the parallel lane. To achieve this optimization, we separated the problem into three parts: backscattering coefficient, power allocation, and trajectory design problems. We respectively adopted parallel computing, fractional programming, and finding all the candidates for the global optimal solution to obtain the global optimal solution for these three problems. Our simulations verified the fast convergence of our alternating optimization algorithm and showed that our proposed secure V2I backscattering outperforms the existing benchmark by over 4.7 times in terms of secrecy rate for 50 slots. Overall, this fundamental research on V2I backscattering provided insights to improve vehicular communication's connectivity, efficiency, and security.
Submitted 22 July, 2024;
originally announced July 2024.
-
R$^2$-Gaussian: Rectifying Radiative Gaussian Splatting for Tomographic Reconstruction
Authors:
Ruyi Zha,
Tao Jun Lin,
Yuanhao Cai,
Jiwen Cao,
Yanhao Zhang,
Hongdong Li
Abstract:
3D Gaussian splatting (3DGS) has shown promising results in image rendering and surface reconstruction. However, its potential in volumetric reconstruction tasks, such as X-ray computed tomography, remains under-explored. This paper introduces R2-Gaussian, the first 3DGS-based framework for sparse-view tomographic reconstruction. By carefully deriving X-ray rasterization functions, we discover a previously unknown integration bias in the standard 3DGS formulation, which hampers accurate volume retrieval. To address this issue, we propose a novel rectification technique via refactoring the projection from 3D to 2D Gaussians. Our new method presents three key innovations: (1) introducing tailored Gaussian kernels, (2) extending rasterization to X-ray imaging, and (3) developing a CUDA-based differentiable voxelizer. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by 0.93 dB in PSNR and 0.014 in SSIM. Crucially, it delivers high-quality results in 3 minutes, which is 12x faster than NeRF-based methods and on par with traditional algorithms. The superior performance and rapid convergence of our method highlight its practical value.
Submitted 31 May, 2024;
originally announced May 2024.
-
AD-Aligning: Emulating Human-like Generalization for Cognitive Domain Adaptation in Deep Learning
Authors:
Zhuoying Li,
Bohua Wan,
Cong Mu,
Ruzhang Zhao,
Shushan Qiu,
Chao Yan
Abstract:
Domain adaptation is pivotal for enabling deep learning models to generalize across diverse domains, a task complicated by variations in presentation and cognitive nuances. In this paper, we introduce AD-Aligning, a novel approach that combines adversarial training with source-target domain alignment to enhance generalization capabilities. By pretraining with Coral loss and standard loss, AD-Aligning aligns target domain statistics with those of the pretrained encoder, preserving robustness while accommodating domain shifts. Through extensive experiments on diverse datasets and domain shift scenarios, including noise-induced shifts and cognitive domain adaptation tasks, we demonstrate AD-Aligning's superior performance compared to existing methods such as Deep Coral and ADDA. Our findings highlight AD-Aligning's ability to emulate the nuanced cognitive processes inherent in human perception, making it a promising solution for real-world applications requiring adaptable and robust domain adaptation strategies.
Submitted 21 May, 2024; v1 submitted 14 May, 2024;
originally announced May 2024.
-
Resilient control of networked switched systems subject to deception attack and DoS attack
Authors:
Rui Zhao,
Zhiqiang Zuo,
Ying Tan,
Yijing Wang,
Wentao Zhang
Abstract:
In this paper, the resilient control for switched systems in the presence of deception attack and denial-of-service (DoS) attack is addressed. Due to the interaction of two kinds of attacks and the asynchronous phenomenon of controller mode and subsystem mode, the system dynamics becomes much more complex. A criterion is derived to ensure the mean square security level of the closed-loop system. This in turn reveals the balance of system resilience and control performance. Furthermore, a mixed-switching control strategy is put forward to make the system globally asymptotically stable. It is shown that the system will still converge to the equilibrium even if the deception attack occurs. Finally, simulations are carried out to verify the effectiveness of the theoretical results.
Submitted 9 May, 2024;
originally announced May 2024.
-
Advanced Long-Content Speech Recognition With Factorized Neural Transducer
Authors:
Xun Gong,
Yu Wu,
Jinyu Li,
Shujie Liu,
Rui Zhao,
Xie Chen,
Yanmin Qian
Abstract:
In this paper, we propose two novel approaches that integrate long-content information into the factorized neural transducer (FNT) based architecture in both non-streaming (referred to as LongFNT) and streaming (referred to as SLongFNT) scenarios. We first investigate whether long-content transcriptions can improve the vanilla conformer transducer (C-T) models. Our experiments indicate that the vanilla C-T models do not exhibit improved performance when utilizing long-content transcriptions, possibly because the predictor network of C-T models does not function as a pure language model. In contrast, FNT shows potential in utilizing long-content information, so we propose the LongFNT model and explore the impact of long-content information in both text (LongFNT-Text) and speech (LongFNT-Speech). The proposed LongFNT-Text and LongFNT-Speech models further complement each other to achieve better performance, with transcription history proving more valuable to the model. The effectiveness of our LongFNT approach is evaluated on the LibriSpeech and GigaSpeech corpora, where it obtains relative word error rate (WER) reductions of 19% and 12%, respectively. Furthermore, we extend the LongFNT model to the streaming scenario as SLongFNT, which consists of SLongFNT-Text and SLongFNT-Speech approaches that utilize long-content text and speech information. Experiments show that the proposed SLongFNT model achieves relative WER reductions of 26% and 17% on LibriSpeech and GigaSpeech, respectively, compared to the FNT baseline, while keeping good latency. Overall, our proposed LongFNT and SLongFNT highlight the significance of considering long-content speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems.
Submitted 20 March, 2024;
originally announced March 2024.
-
Towards a Digital Twin Framework in Additive Manufacturing: Machine Learning and Bayesian Optimization for Time Series Process Optimization
Authors:
Vispi Karkaria,
Anthony Goeckner,
Rujing Zha,
Jie Chen,
Jianjing Zhang,
Qi Zhu,
Jian Cao,
Robert X. Gao,
Wei Chen
Abstract:
Laser-directed-energy deposition (DED) offers advantages in additive manufacturing (AM) for creating intricate geometries and material grading. Yet, challenges like material inconsistency and part variability remain, mainly due to its layer-wise fabrication. A key issue is heat accumulation during DED, which affects the material microstructure and properties. While closed-loop control methods for heat management are common in DED research, few integrate real-time monitoring, physics-based modeling, and control in a unified framework. Our work presents a digital twin (DT) framework for real-time predictive control of DED process parameters to meet specific design objectives. We develop a surrogate model using Long Short-Term Memory (LSTM)-based machine learning with Bayesian Inference to predict temperatures in DED parts. This model predicts future temperature states in real time. We also introduce Bayesian Optimization (BO) for Time Series Process Optimization (BOTSPO), based on traditional BO but featuring a unique time series process profile generator with reduced dimensions. BOTSPO dynamically optimizes processes, identifying optimal laser power profiles to attain desired mechanical properties. The established process trajectory guides online optimizations, aiming to enhance performance. This paper outlines the digital twin framework's components, promoting its integration into a comprehensive system for AM.
Submitted 27 February, 2024;
originally announced February 2024.
-
Dual-modal Tactile E-skin: Enabling Bidirectional Human-Robot Interaction via Integrated Tactile Perception and Feedback
Authors:
Shilong Mu,
Runze Zhao,
Zenan Lin,
Yan Huang,
Shoujie Li,
Chenchang Li,
Xiao-Ping Zhang,
Wenbo Ding
Abstract:
To foster an immersive and natural human-robot interaction, the implementation of tactile perception and feedback becomes imperative, effectively bridging the conventional sensory gap. In this paper, we propose a dual-modal electronic skin (e-skin) that integrates magnetic tactile sensing and vibration feedback for enhanced human-robot interaction. The dual-modal tactile e-skin offers multi-functional tactile sensing and programmable haptic feedback, underpinned by a layered structure comprised of flexible magnetic films, soft silicone, a Hall sensor and actuator array, and a microcontroller unit. The e-skin captures the magnetic field changes caused by subtle deformations through Hall sensors, employing deep learning for accurate tactile perception. Simultaneously, the actuator array generates mechanical vibrations to facilitate haptic feedback, delivering diverse mechanical stimuli. Notably, the dual-modal e-skin is capable of transmitting tactile information bidirectionally, enabling object recognition and fine-weighing operations. This bidirectional tactile interaction framework will enhance the immersion and efficiency of interactions between humans and robots.
Submitted 8 February, 2024;
originally announced February 2024.
-
PDiT: Interleaving Perception and Decision-making Transformers for Deep Reinforcement Learning
Authors:
Hangyu Mao,
Rui Zhao,
Ziyue Li,
Zhiwei Xu,
Hao Chen,
Yiqun Chen,
Bin Zhang,
Zhen Xiao,
Junge Zhang,
Jiangjin Yin
Abstract:
Designing better deep networks and designing better reinforcement learning (RL) algorithms are both important for deep RL. This work studies the former. Specifically, the Perception and Decision-making Interleaving Transformer (PDiT) network is proposed, which cascades two Transformers in a natural way: the perceiving one focuses on environmental perception by processing the observation at the patch level, whereas the deciding one attends to decision-making by conditioning on the history of the desired returns, the perceiver's outputs, and the actions. Such a network design is generally applicable to many deep RL settings, e.g., both online and offline RL algorithms under environments with image observations, proprioception observations, or hybrid image-language observations. Extensive experiments show that PDiT not only achieves better performance than strong baselines in different settings but also extracts explainable feature representations. Our code is available at https://github.com/maohangyu/PDiT.
Submitted 25 December, 2023;
originally announced December 2023.
-
MLP Based Continuous Gait Recognition of a Powered Ankle Prosthesis with Serial Elastic Actuator
Authors:
Yanze Li,
Feixing Chen,
Jingqi Cao,
Ruoqi Zhao,
Xuan Yang,
Xingbang Yang,
Yubo Fan
Abstract:
Powered ankle prostheses effectively assist people with lower limb amputation in performing daily activities. High-performance prostheses with adjustable compliance and the capability to predict and implement the amputee's intent are crucial for them to be comparable to or better than a real limb. However, current designs fail to provide simple yet effective joint compliance with full potential for modification, and they lack an accurate real-time gait prediction method. This paper proposes an innovative design of a powered ankle prosthesis with a serial elastic actuator (SEA), and puts forward an MLP-based gait recognition method that can accurately and continuously predict more gait parameters for motion sensing and control. The prosthesis mimics a biological joint with similar weight, torque, and power, and can assist walking at up to 4 m/s. A new planar torsional spring design is proposed for the SEA, which has better stiffness, endurance, and potential for modification than current designs. The gait recognition system simultaneously generates locomotive speed, gait phase, ankle angle, and angular velocity using only the signals of a single IMU, with advantages in continuity, adaptability across the speed range, accuracy, and multi-function capability.
Submitted 30 March, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability
Authors:
Jian Wu,
Naoyuki Kanda,
Takuya Yoshioka,
Rui Zhao,
Zhuo Chen,
Jinyu Li
Abstract:
Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its applicability for text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of $\langle \text{cc}\rangle$ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the $\langle \text{cc}\rangle$ tokens within the LM. The proposed t-SOT FNT model achieves comparable performance to the original t-SOT model while retaining the ability to reduce word error rate (WER) on both single and multi-talker datasets through text-only adaptation.
Submitted 14 September, 2023;
originally announced September 2023.
-
Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation
Authors:
Shaoshi Ling,
Guoli Ye,
Rui Zhao,
Yifan Gong
Abstract:
The attention-based encoder-decoder (AED) speech recognition model has been widely successful in recent years. However, the joint optimization of the acoustic model and language model in an end-to-end manner has created challenges for text adaptation. In particular, adapting to text effectively, quickly, and inexpensively has become a primary concern for deploying AED systems in industry. To address this issue, we propose a novel model, the hybrid attention-based encoder-decoder (HAED) speech recognition model, which preserves the modularity of conventional hybrid automatic speech recognition systems. Our HAED model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques. We demonstrate that the proposed HAED model yields a 21% relative word error rate (WER) improvement when out-of-domain text data is used for language model adaptation, with only a minor degradation in WER on a general test set compared with the conventional AED model.
Submitted 13 September, 2023;
originally announced September 2023.
-
Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals
Authors:
Running Zhao,
Jiangtao Yu,
Hang Zhao,
Edith C. H. Ngai
Abstract:
Millimeter wave (mmWave) based speech recognition provides more possibility for audio-related applications, such as conference speech transcription and eavesdropping. However, considering the practicality in real scenarios, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer that is capable of effectively learning representations of speech-related features, paving the way for streaming ASR with a large vocabulary. To alleviate the deficiency of streaming networks unable to access entire future inputs, we propose the Guidance Initialization that facilitates the transfer of feature knowledge related to the global context from the non-streaming Transformer to the tailored streaming Transformer through weight inheritance. Further, we propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low quality mmWave signals on recognition performance. In the cross-modal KD, the audio streaming Transformer provides feature and response guidance that inherit fruitful and accurate speech information to supervise the training of the tailored radio streaming Transformer. The experimental results show that our Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words.
Submitted 15 August, 2023;
originally announced August 2023.
-
High-Dimensional MR Reconstruction Integrating Subspace and Adaptive Generative Models
Authors:
Ruiyang Zhao,
Xi Peng,
Varun A. Kelkar,
Mark A. Anastasio,
Fan Lam
Abstract:
We present a novel method that integrates subspace modeling with an adaptive generative image prior for high-dimensional MR image reconstruction. The subspace model imposes an explicit low-dimensional representation of the high-dimensional images, while the generative image prior serves as a spatial constraint on the "contrast-weighted" images or the spatial coefficients of the subspace model. A formulation was introduced to synergize these two components with complementary regularization such as joint sparsity. A special pretraining plus subject-specific network adaptation strategy was proposed to construct an accurate generative-model-based representation for images with varying contrasts, validated by experimental data. An iterative algorithm was introduced to jointly update the subspace coefficients and the multiresolution latent space of the generative image model, leveraging a recently developed intermediate layer optimization technique for network inversion. We evaluated the utility of the proposed method in two high-dimensional imaging applications: accelerated MR parameter mapping and high-resolution MRSI. Improved performance over state-of-the-art subspace-based methods was demonstrated in both cases. Our work demonstrated the potential of integrating data-driven and adaptive generative models with low-dimensional representation for high-dimensional imaging problems.
Submitted 16 June, 2023; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Dynamic Causal Graph Convolutional Network for Traffic Prediction
Authors:
Junpeng Lin,
Ziyue Li,
Zhishuai Li,
Lei Bai,
Rui Zhao,
Chen Zhang
Abstract:
Modeling complex spatiotemporal dependencies in correlated traffic series is essential for traffic prediction. While recent works have shown improved prediction performance by using neural networks to extract spatiotemporal correlations, their effectiveness depends on the quality of the graph structures used to represent the spatial topology of the traffic network. In this work, we propose a novel approach for traffic prediction that embeds time-varying dynamic Bayesian network to capture the fine spatiotemporal topology of traffic data. We then use graph convolutional networks to generate traffic forecasts. To enable our method to efficiently model nonlinear traffic propagation patterns, we develop a deep learning-based module as a hyper-network to generate stepwise dynamic causal graphs. Our experimental results on a real traffic dataset demonstrate the superior prediction performance of the proposed method. The code is available at https://github.com/MonBG/DCGCN.
Submitted 7 September, 2023; v1 submitted 12 June, 2023;
originally announced June 2023.
-
Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models
Authors:
Rui Zhao,
Jian Xue,
Partha Parthasarathy,
Veljko Miljanic,
Jinyu Li
Abstract:
Neural transducer is now the most popular end-to-end model for speech recognition, due to its naturally streaming ability. However, it is challenging to adapt it with text-only data. Factorized neural transducer (FNT) model was proposed to mitigate this problem. The improved adaptation ability of FNT on text-only adaptation data came at the cost of lowered accuracy compared to the standard neural…
▽ More
Neural transducer is now the most popular end-to-end model for speech recognition, due to its naturally streaming ability. However, it is challenging to adapt it with text-only data. Factorized neural transducer (FNT) model was proposed to mitigate this problem. The improved adaptation ability of FNT on text-only adaptation data came at the cost of lowered accuracy compared to the standard neural transducer model. We propose several methods to improve the performance of the FNT model. They are: adding CTC criterion during training, adding KL divergence loss during adaptation, using a pre-trained language model to seed the vocabulary predictor, and an efficient adaptation approach by interpolating the vocabulary predictor with the n-gram language model. A combination of these approaches results in a relative word-error-rate reduction of 9.48\% from the standard FNT model. Furthermore, n-gram interpolation with the vocabulary predictor improves the adaptation speed hugely with satisfactory adaptation performance.
Submitted 23 February, 2023; v1 submitted 4 December, 2022;
originally announced December 2022.
-
LongFNT: Long-form Speech Recognition with Factorized Neural Transducer
Authors:
Xun Gong,
Yu Wu,
Jinyu Li,
Shujie Liu,
Rui Zhao,
Xie Chen,
Yanmin Qian
Abstract:
Traditional automatic speech recognition (ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending to a longer transcription history for a vanilla neural transducer model shows little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, containing a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder RoBERTa to further boost the performance. Moreover, we propose the LongFNT architecture by extending the long-form speech to the original speech input and achieve the best performance. The effectiveness of our LongFNT approach is validated on LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.
Submitted 17 November, 2022;
originally announced November 2022.
-
RF-CHORD: Towards Deployable RFID Localization System for Logistics Network
Authors:
Bo Liang,
Purui Wang,
Renjie Zhao,
Heyu Guo,
Pengyu Zhang,
Junchen Guo,
Shunmin Zhu,
Hongqiang Harry Liu,
Xinyu Zhang,
Chenren Xu
Abstract:
RFID localization is considered the key enabler of automating the process of inventory tracking and management for high-performance logistics networks. A practical and deployable RFID localization system needs to meet reliability, throughput, and range requirements. This paper presents RF-Chord, the first RFID localization system that simultaneously meets all three requirements. RF-Chord features a one-shot multisine-constructed wideband design that can process RF signals with a 200 MHz bandwidth in real-time to facilitate one-shot localization at scale. In addition, multiple SINR enhancement techniques are designed for range extension. On top of that, a kernel-layer-based near-field localization framework and a multipath-suppression algorithm are proposed to reduce the 99% long-tail errors. Our empirical results show that RF-Chord can localize more than 180 tags 6 m away from a reader within 1 second and with a 99% long-tail error of 0.786 m, achieving a 0% miss reading rate and ~0.01% cross-reading rate in the warehouse and fresh food delivery store deployment.
Submitted 1 November, 2022;
originally announced November 2022.
-
NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction
Authors:
Ruyi Zha,
Yanhao Zhang,
Hongdong Li
Abstract:
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction (Cone Beam Computed Tomography) that requires no external training data. Specifically, the desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network. We synthesize projections discretely and train the network by minimizing the error between real and synthesized projections. A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details. This encoder outperforms the commonly used frequency-domain encoder in terms of having higher performance and efficiency, because it exploits the smoothness and sparsity of human organs. Experiments have been conducted on both human organ and phantom datasets. The proposed method achieves state-of-the-art accuracy and spends reasonably short computation time.
Submitted 29 September, 2022;
originally announced September 2022.
-
Online Poisoning Attacks Against Data-Driven Predictive Control
Authors:
Yue Yu,
Ruihan Zhao,
Sandeep Chinchali,
Ufuk Topcu
Abstract:
Data-driven predictive control (DPC) is a feedback control method for systems with unknown dynamics. It repeatedly optimizes a system's future trajectories based on past input-output data. We develop a numerical method that computes poisoning attacks that inject additive perturbations to the online output data to change the trajectories optimized by DPC. This method is based on implicitly differentiating the solution map of the trajectory optimization in DPC. We demonstrate that the resulting attacks can cause an output tracking error one order of magnitude higher than random perturbations in numerical experiments.
Submitted 23 November, 2022; v1 submitted 19 September, 2022;
originally announced September 2022.
-
Approximate synchronization of coupled multi-valued logical networks
Authors:
Rong Zhao,
Jun-e Feng,
Biao Wang
Abstract:
This article deals with the approximate synchronization of two coupled multi-valued logical networks. According to the initial state set from which both systems start, two kinds of approximate synchronization problem, local approximate synchronization and global approximate synchronization, are proposed for the first time. Three new notions: approximate synchronization state set (ASSS), the maximum approximate synchronization basin (MASB) and the shortest approximate synchronization time (SAST) are introduced and analyzed. Based on ASSS, several necessary and sufficient conditions are obtained for approximate synchronization. MASB, the set of all possible initial states, from which the systems are approximately synchronous, is investigated combining with the maximum invariant subset. And the calculation method of the SAST, associated with transient period, is presented. By virtue of MASB, pinning control scheme is investigated to make two coupled systems achieve global approximate synchronization. Furthermore, the related theories are also applied to the complete synchronization problem of $k$-valued ($k\geq2$) logical networks. Finally, four examples are given to illustrate the obtained results.
Submitted 13 July, 2022;
originally announced July 2022.
-
Radio2Speech: High Quality Speech Recovery from Radio Frequency Signals
Authors:
Running Zhao,
Jiangtao Yu,
Tingle Li,
Hang Zhao,
Edith C. H. Ngai
Abstract:
Considering the microphone is easily affected by noise and soundproof materials, the radio frequency (RF) signal is a promising candidate to recover audio as it is immune to noise and can traverse many soundproof objects. In this paper, we introduce Radio2Speech, a system that uses RF signals to recover high quality speech from the loudspeaker. Radio2Speech can recover speech comparable to the quality of the microphone, advancing from recovering only single tone music or incomprehensible speech in existing approaches. We use Radio UNet to accurately recover speech in time-frequency domain from RF signals with limited frequency band. Also, we incorporate the neural vocoder to synthesize the speech waveform from the estimated time-frequency representation without using the contaminated phase. Quantitative and qualitative evaluations show that in quiet, noisy and soundproof scenarios, Radio2Speech achieves state-of-the-art performance and is on par with the microphone that works in quiet scenarios.
Submitted 22 June, 2022;
originally announced June 2022.
-
Class-Aware Adversarial Transformers for Medical Image Segmentation
Authors:
Chenyu You,
Ruihan Zhao,
Fenglin Liu,
Siyuan Dong,
Sandeep Chinchali,
Ufuk Topcu,
Lawrence Staib,
James S. Duncan
Abstract:
Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: (1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; (2) the models suffer from information loss because they only consider single-scale feature representations; and (3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CASTformer, a novel type of adversarial transformers, for 2D medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CASTformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining 2.54%-5.88% absolute improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CASTformer a strong starting point for downstream medical image analysis tasks.
Submitted 15 December, 2022; v1 submitted 25 January, 2022;
originally announced January 2022.
-
MEGAN: Memory Enhanced Graph Attention Network for Space-Time Video Super-Resolution
Authors:
Chenyu You,
Lianyi Han,
Aosong Feng,
Ruihan Zhao,
Hui Tang,
Wei Fan
Abstract:
Space-time video super-resolution (STVSR) aims to construct a high space-time resolution video sequence from the corresponding low-frame-rate, low-resolution video sequence. Inspired by the recent success to consider spatial-temporal information for space-time super-resolution, our main goal in this work is to take full considerations of spatial and temporal correlations within the video sequences of fast dynamic events. To this end, we propose a novel one-stage memory enhanced graph attention network (MEGAN) for space-time video super-resolution. Specifically, we build a novel long-range memory graph aggregation (LMGA) module to dynamically capture correlations along the channel dimensions of the feature maps and adaptively aggregate channel features to enhance the feature representations. We introduce a non-local residual block, which enables each channel-wise feature to attend global spatial hierarchical features. In addition, we adopt a progressive fusion module to further enhance the representation ability by extensively exploiting spatial-temporal correlations from multiple frames. Experiment results demonstrate that our method achieves better results compared with the state-of-the-art methods quantitatively and visually.
Submitted 29 November, 2021; v1 submitted 28 October, 2021;
originally announced October 2021.
-
End-to-End AI-based MRI Reconstruction and Lesion Detection Pipeline for Evaluation of Deep Learning Image Reconstruction
Authors:
Ruiyang Zhao,
Yuxin Zhang,
Burhaneddin Yaman,
Matthew P. Lungren,
Michael S. Hansen
Abstract:
Deep learning techniques have emerged as a promising approach to highly accelerated MRI. However, recent reconstruction challenges have shown several drawbacks in current deep learning approaches, including the loss of fine image details even using models that perform well in terms of global quality metrics. In this study, we propose an end-to-end deep learning framework for image reconstruction and pathology detection, which enables a clinically aware evaluation of deep learning reconstruction quality. The solution is demonstrated for a use case in detecting meniscal tears on knee MRI studies, ultimately finding a loss of fine image details with common reconstruction methods expressed as a reduced ability to detect important pathology like meniscal tears. Despite the common practice of quantitative reconstruction methodology evaluation with metrics such as SSIM, impaired pathology detection as an automated pathology-based reconstruction evaluation approach suggests existing quantitative methods do not capture clinically important reconstruction outcomes.
Submitted 23 September, 2021;
originally announced September 2021.
-
fastMRI+: Clinical Pathology Annotations for Knee and Brain Fully Sampled Multi-Coil MRI Data
Authors:
Ruiyang Zhao,
Burhaneddin Yaman,
Yuxin Zhang,
Russell Stewart,
Austin Dixon,
Florian Knoll,
Zhengnan Huang,
Yvonne W. Lui,
Michael S. Hansen,
Matthew P. Lungren
Abstract:
Improving speed and image quality of Magnetic Resonance Imaging (MRI) via novel reconstruction approaches remains one of the highest impact applications for deep learning in medical imaging. The fastMRI dataset, unique in that it contains large volumes of raw MRI data, has enabled significant advances in accelerating MRI using deep learning-based reconstruction methods. While the impact of the fastMRI dataset on the field of medical imaging is unquestioned, the dataset currently lacks clinical expert pathology annotations, critical to addressing clinically relevant reconstruction frameworks and exploring important questions regarding rendering of specific pathology using such novel approaches. This work introduces fastMRI+, which consists of 16154 subspecialist expert bounding box annotations and 13 study-level labels for 22 different pathology categories on the fastMRI knee dataset, and 7570 subspecialist expert bounding box annotations and 643 study-level labels for 30 different pathology categories for the fastMRI brain dataset. The fastMRI+ dataset is open access and aims to support further research and advancement of medical imaging in MRI reconstruction and beyond.
Submitted 13 September, 2021; v1 submitted 8 September, 2021;
originally announced September 2021.
-
Momentum Contrastive Voxel-wise Representation Learning for Semi-supervised Volumetric Medical Image Segmentation
Authors:
Chenyu You,
Ruihan Zhao,
Lawrence Staib,
James S. Duncan
Abstract:
Contrastive learning (CL) aims to learn useful representations without relying on expert annotations in the context of medical image segmentation. Existing approaches mainly contrast a single positive vector (i.e., an augmentation of the same image) against a set of negatives within the entire remainder of the batch by simply mapping all input features into the same constant vector. Despite the impressive empirical performance, those methods have the following shortcomings: (1) it remains a formidable challenge to prevent collapse to trivial solutions; and (2) we argue that not all voxels within the same image are equally positive, since there exist dissimilar anatomical structures within the same image. In this work, we present a novel Contrastive Voxel-wise Representation Learning (CVRL) method to effectively learn low-level and high-level features by capturing 3D spatial context and rich anatomical information along both the feature and the batch dimensions. Specifically, we first introduce a novel CL strategy to promote feature diversity among the 3D representation dimensions. We train the framework through bi-level contrastive optimization (i.e., low-level and high-level) on 3D images. Experiments on two benchmark datasets and different labeled settings demonstrate the superiority of our proposed framework. More importantly, we also prove that our method inherits the benefit of the hardness-aware property from the standard CL approaches.
Submitted 7 March, 2022; v1 submitted 14 May, 2021;
originally announced May 2021.
-
On Addressing Practical Challenges for RNN-Transducer
Authors:
Rui Zhao,
Jian Xue,
Jinyu Li,
Wenning Wei,
Lei He,
Yifan Gong
Abstract:
In this paper, several methods are proposed to address practical challenges in deploying an RNN Transducer (RNN-T) based speech recognition system. These challenges are adapting a well-trained RNN-T model to a new domain without collecting audio data, and obtaining time stamps and confidence scores at the word level. The first challenge is solved with a splicing data method which concatenates the speech segments extracted from the source domain data. To get the time stamps, a phone prediction branch is added to the RNN-T model by sharing the encoder for the purpose of forced alignment. Finally, we obtain word-level confidence scores by utilizing several types of features calculated during decoding and from the confusion network. Evaluated with Microsoft production data, the splicing data adaptation method improves on the baseline and on adaptation with the text-to-speech method by 58.03% and 15.25% relative word error rate reduction, respectively. The proposed time stamping method achieves less than 50 ms average word timing difference from the ground truth alignment while maintaining the recognition accuracy of the RNN-T model. We also obtain high confidence annotation performance with limited computation cost.
Submitted 18 July, 2021; v1 submitted 27 April, 2021;
originally announced May 2021.
-
Energy Efficiency Maximization in RIS-Aided Cell-Free Network with Limited Backhaul
Authors:
Quang Nhat Le,
Van-Dinh Nguyen,
Octavia A. Dobre,
Ruiqin Zhao
Abstract:
Integrating the reconfigurable intelligent surface in a cell-free (RIS-CF) network is an effective solution to improve the capacity and coverage of future wireless systems with low cost and power consumption. The reflecting coefficients of RISs can be programmed to enhance signals received at users. This letter addresses a joint design of transmit beamformers at access points and reflecting coefficients at RISs to maximize the energy efficiency (EE) of RIS-CF networks, taking into account the limited backhaul capacity constraints. Due to a very computationally challenging nonconvex problem, we develop a simple yet efficient alternating descent algorithm for its solution. Numerical results verify that the EE of RIS-CF networks is greatly improved, showing the benefit of using RISs.
Submitted 8 March, 2021; v1 submitted 21 December, 2020;
originally announced December 2020.
-
Full-Duplex Non-Orthogonal Multiple Access Cooperative Overlay Spectrum-Sharing Networks with SWIPT
Authors:
Quang Nhat Le,
Animesh Yadav,
Nam-Phong Nguyen,
Octavia A. Dobre,
Ruiqin Zhao
Abstract:
This paper proposes a novel non-orthogonal multiple access (NOMA) assisted cooperative spectrum sharing network, in which one of the full-duplex (FD) secondary transmitters (STs) is chosen among many for forwarding the primary transmitter's and its own information to primary receiver and secondary receivers, respectively, using NOMA technique. To stimulate the ST to conduct cooperative transmission and sustain its operations, the simultaneous wireless information and power transfer (SWIPT) technique is utilized by the ST to harvest the primary signal's energy. In order to evaluate the proposed system's performance, the outage probability and system throughput for the primary and secondary networks are derived in tight closed-form approximations. Further, the sum rate optimization problem is formulated for the proposed cooperative network and a rapid convergent iterative algorithm is proposed to obtain the optimized power allocation coefficients. Numerical results show that FD, SWIPT, and NOMA techniques greatly boost the performance of cooperative spectrum-sharing network in terms of outage probability, system throughput, and sum rate compared to that of half-duplex NOMA and the conventional orthogonal multiple access-time division multiple access networks.
Submitted 19 November, 2020;
originally announced November 2020.
-
Learning-Assisted User Clustering in Cell-Free Massive MIMO-NOMA Networks
Authors:
Quang Nhat Le,
Van-Dinh Nguyen,
Nam-Phong Nguyen,
Symeon Chatzinotas,
Octavia A. Dobre,
Ruiqin Zhao
Abstract:
The superior spectral efficiency (SE) and user fairness feature of non-orthogonal multiple access (NOMA) systems are achieved by exploiting user clustering (UC) more efficiently. However, a random UC certainly results in a suboptimal solution while an exhaustive search method comes at the cost of high complexity, especially for systems of medium-to-large size. To address this problem, we develop two efficient unsupervised machine learning (ML) based UC algorithms, namely k-means++ and improved k-means++, to effectively cluster users into disjoint clusters in cell-free massive multiple-input multiple-output (CFmMIMO) system. Using full-pilot zero-forcing at access points, we derive the sum SE in closed-form expression taking into account the impact of intra-cluster pilot contamination, inter-cluster interference, and imperfect successive interference cancellation. To comprehensively assess the system performance, we formulate the sum SE optimization problem, and then develop a simple yet efficient iterative algorithm for its solution. In addition, the performance of collocated massive MIMO-NOMA (COmMIMO-NOMA) system is also characterized. Numerical results are provided to show the superior performance of the proposed UC algorithms compared to other baseline schemes. The effectiveness of applying NOMA in CFmMIMO and COmMIMO systems is also validated.
Submitted 15 November, 2020;
originally announced November 2020.
-
Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition
Authors:
Zhong Meng,
Sarangarajan Parthasarathy,
Eric Sun,
Yashesh Gaur,
Naoyuki Kanda,
Liang Lu,
Xie Chen,
Rui Zhao,
Jinyu Li,
Yifan Gong
Abstract:
The external language models (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR) which has no clear division between acoustic and language models. In this work, we propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models with no additional model training, including the most popular recurrent neural network transducer (RNN-T) and attention-based encoder-decoder (AED) models. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the training data in the source domain. With ILME, the internal LM scores of an E2E model are estimated and subtracted from the log-linear interpolation between the scores of the E2E model and the external LM. The internal LM scores are approximated as the output of an E2E model when eliminating its acoustic components. ILME can alleviate the domain mismatch between training and testing, or improve the multi-domain E2E ASR. Experimented with 30K-hour trained RNN-T and AED models, ILME achieves up to 15.5% and 6.8% relative word error rate reductions from Shallow Fusion on out-of-domain LibriSpeech and in-domain Microsoft production test sets, respectively.
Submitted 3 November, 2020;
originally announced November 2020.
-
Enhancing and Learning Denoiser without Clean Reference
Authors:
Rui Zhao,
Daniel P. K. Lun,
Kin-Man Lam
Abstract:
Recent studies on learning-based image denoising have achieved promising performance on various noise reduction tasks. Most of these deep denoisers are trained either under the supervision of clean references, or unsupervised on synthetic noise. The assumption with the synthetic noise leads to poor generalization when facing real photographs. To address this issue, we propose a novel deep image-denoising method by regarding the noise reduction task as a special case of the noise transference task. Learning noise transference enables the network to acquire the denoising ability by observing the corrupted samples. The results on real-world denoising benchmarks demonstrate that our proposed method achieves promising performance on removing realistic noises, making it a potential solution to practical noise reduction problems.
Submitted 28 March, 2021; v1 submitted 9 September, 2020;
originally announced September 2020.
-
Transfer Learning Approaches for Streaming End-to-End Speech Recognition System
Authors:
Vikas Joshi,
Rui Zhao,
Rupesh R. Mehta,
Kshitiz Kumar,
Jinyu Li
Abstract:
Transfer learning (TL) is widely used in conventional hybrid automatic speech recognition (ASR) system, to transfer the knowledge from source to target language. TL can be applied to end-to-end (E2E) ASR system such as recurrent neural network transducer (RNN-T) models, by initializing the encoder and/or prediction network of the target language with the pre-trained models from source language. In the hybrid ASR system, transfer learning is typically done by initializing the target language acoustic model (AM) with source language AM. Several transfer learning strategies exist in the case of the RNN-T framework, depending upon the choice of the initialization model for encoder and prediction networks. This paper presents a comparative study of four different TL methods for RNN-T framework. We show 17% relative word error rate reduction with different TL methods over randomly initialized RNN-T model. We also study the impact of TL with varying amount of training data ranging from 50 hours to 1000 hours and show the efficacy of TL for languages with small amount of training data.
Submitted 17 August, 2020; v1 submitted 11 August, 2020;
originally announced August 2020.
-
Deep Reinforcement Learning Based Mobile Edge Computing for Intelligent Internet of Things
Authors:
Rui Zhao,
Xinjie Wang,
Junjuan Xia,
Liseng Fan
Abstract:
In this paper, we investigate mobile edge computing (MEC) networks for the intelligent internet of things (IoT), where multiple users have computational tasks assisted by multiple computational access points (CAPs). By offloading some tasks to the CAPs, the system performance can be improved by reducing latency and energy consumption, which are the two important metrics of interest in MEC networks. We design the system by proposing an intelligent offloading strategy based on a deep reinforcement learning algorithm. In this algorithm, a Deep Q-Network is used to automatically learn the offloading decision in order to optimize the system performance, and a neural network (NN) is trained to predict the offloading action, where the training data is generated from the environmental system. Moreover, we employ bandwidth allocation to optimize the wireless spectrum for the links between the users and CAPs, and propose several bandwidth allocation schemes. Furthermore, we use CAP selection to choose the single best CAP to assist the computational tasks from the users. Simulation results are finally presented to show the effectiveness of the proposed reinforcement learning offloading strategy. In particular, the system cost of latency and energy consumption can be reduced significantly by the proposed deep reinforcement learning based algorithm.
Submitted 1 August, 2020;
originally announced August 2020.
-
Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability
Authors:
Jinyu Li,
Rui Zhao,
Zhong Meng,
Yanqing Liu,
Wenning Wei,
Sarangarajan Parthasarathy,
Vadim Mazalov,
Zhenghao Wang,
Lei He,
Sheng Zhao,
Yifan Gong
Abstract:
Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition. In this paper, we describe our recent development of RNN-T models with reduced GPU memory consumption during training, better initialization strategy, and advanced encoder modeling with future lookahead. When trained with Microsoft's 65 thousand hours of anonymized training data, the developed RNN-T model surpasses a very well trained hybrid model with both better recognition accuracy and lower latency. We further study how to customize RNN-T models to a new domain, which is important for deploying E2E models to practical scenarios. By comparing several methods leveraging text-only data in the new domain, we found that updating RNN-T's prediction and joint networks using text-to-speech generated from domain-specific text is the most effective.
Submitted 29 July, 2020;
originally announced July 2020.
-
Enhancement of a CNN-Based Denoiser Based on Spatial and Spectral Analysis
Authors:
Rui Zhao,
Kin-Man Lam,
Daniel P. K. Lun
Abstract:
Convolutional neural network (CNN)-based image denoising methods have been widely studied recently, because of their high-speed processing capability and good visual quality. However, most of the existing CNN-based denoisers learn the image prior from the spatial domain, and suffer from the problem of spatially variant noise, which limits their performance in real-world image denoising tasks. In this paper, we propose a discrete wavelet denoising CNN (WDnCNN), which restores images corrupted by various noise with a single model. Since most of the content or energy of natural images resides in the low-frequency spectrum, their transformed coefficients in the frequency domain are highly imbalanced. To address this issue, we present a band normalization module (BNM) to normalize the coefficients from different parts of the frequency spectrum. Moreover, we employ a band discriminative training (BDT) criterion to enhance the model regression. We evaluate the proposed WDnCNN, and compare it with other state-of-the-art denoisers. Experimental results show that WDnCNN achieves promising performance in both synthetic and real noise reduction, making it a potential solution to many practical image denoising applications.
Submitted 28 June, 2020;
originally announced June 2020.
-
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition
Authors:
Jinyu Li,
Yu Wu,
Yashesh Gaur,
Chengyi Wang,
Rui Zhao,
Shujie Liu
Abstract:
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As E2E models are more data hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in streaming mode if its encoder can be properly initialized. Among all three E2E models, Transformer-AED achieved the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.
Submitted 29 July, 2020; v1 submitted 28 May, 2020;
originally announced May 2020.
-
Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition
Authors:
Hu Hu,
Rui Zhao,
Jinyu Li,
Liang Lu,
Yifan Gong
Abstract:
Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research because it is capable of online streaming speech recognition. However, RNN-T training is made difficult by its huge memory requirements and complicated neural structure. A common solution to ease RNN-T training is to employ a connectionist temporal classification (CTC) model along with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we conversely leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on 65,000 hours of anonymized Microsoft production data with personally identifiable information removed, our proposed methods obtain significant improvement. In particular, the encoder pre-training solution achieved a 10% and an 8% relative word error rate reduction when compared with random initialization and the widely used CTC+RNNLM initialization strategy, respectively. Our solutions also significantly reduce the RNN-T model latency from the baseline.
Submitted 1 May, 2020;
originally announced May 2020.
-
High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model
Authors:
Jinyu Li,
Rui Zhao,
Eric Sun,
Jeremy H. M. Wong,
Amit Das,
Zhong Meng,
Yifan Gong
Abstract:
While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition. To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks, and incorporates future context frames to get more information for accurate acoustic modeling. We further improve the training strategy with sequence-level teacher-student learning. To obtain low latency, we design a two-head cltLSTM, in which one head has zero latency and the other head has a small latency, compared to an LSTM. When trained with Microsoft's 65 thousand hours of anonymized training data and evaluated with test sets with 1.8 million words, the proposed two-head cltLSTM model with the proposed training strategy yields a 28.2% relative WER reduction over the conventional LSTM acoustic model, with a similar perceived latency.
Submitted 16 March, 2020;
originally announced March 2020.
-
Harvest-and-Opportunistically-Relay: Analyses on Transmission Outage and Covertness
Authors:
Yuanjian Li,
Rui Zhao,
Zhiqiao Nie,
A. Hamid Aghvami
Abstract:
To enhance the transmission performance, privacy level, and energy management efficiency of wireless networks, this paper proposes a novel simultaneous wireless information and power transfer (SWIPT) full-duplex (FD) relaying protocol, termed harvest-and-opportunistically-relay (HOR). In the proposed HOR protocol, the relay can work opportunistically in either pure energy harvesting (PEH) or the FD SWIPT mode. Due to the FD characteristics, the dynamic fluctuation of the relay's residual energy is difficult to quantify and track. To solve this problem, we apply a novel discrete-state Markov Chain (MC) method in which the practical finite-capacity energy storage is considered. Furthermore, to improve the privacy level of the proposed HOR relaying system, a covert transmission performance analysis is developed, in which closed-form expressions of the optimal detection threshold and the minimum detection error probability are derived. Finally, with the aid of the stationary distribution of the MC, a closed-form expression of the transmission outage probability is calculated, based on which transmission outage performance is analyzed. Numerical results validate the correctness of the analyses on transmission outage and covert communication. The impacts of key system parameters on the performance of transmission outage and covert communication are given and discussed. Based on mathematical analysis and numerical results, the proposed HOR model is able not only to reliably enhance the transmission performance by smartly managing residual energy but also to efficiently improve the privacy level of the legitimate transmission party by dynamically adjusting the optimal detection threshold.
Submitted 7 February, 2020;
originally announced February 2020.
-
Improving RNN Transducer Modeling for End-to-End Speech Recognition
Authors:
Jinyu Li,
Rui Zhao,
Hu Hu,
Yifan Gong
Abstract:
In the last few years, an emerging trend in automatic speech recognition research has been the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the three most popular methods. Among these three methods, RNN-T has the advantage of supporting online streaming, which is challenging for AED, and it does not have CTC's frame-independence assumption. In this paper, we improve RNN-T training in two aspects. First, we optimize the training algorithm of RNN-T to reduce the memory consumption so that we can use a larger training minibatch for faster training speed. Second, we propose better model structures so that we obtain RNN-T models with very good accuracy but a small footprint. Trained with 30 thousand hours of anonymized and transcribed Microsoft production data, the best RNN-T model, with an even smaller model size (216 Megabytes), achieves up to 11.8% relative word error rate (WER) reduction from the baseline RNN-T model. This best RNN-T model is significantly better than the device hybrid model of similar size, achieving up to 15.0% relative WER reduction, and obtains similar WERs to the server hybrid model of 5120 Megabytes in size.
Submitted 26 September, 2019;
originally announced September 2019.
-
A Refined Equilibrium Generative Adversarial Network for Retinal Vessel Segmentation
Authors:
Yukun Zhou,
Zailiang Chen,
Hailan Shen,
Xianxian Zheng,
Rongchang Zhao,
Xuanchu Duan
Abstract:
Objective: Recognizing retinal vessel abnormity is vital to early diagnosis of ophthalmological diseases and cardiovascular events. However, segmentation results are highly influenced by elusive vessels, especially in low-contrast backgrounds and lesion regions. In this work, we present an end-to-end synthetic neural network, containing a symmetric equilibrium generative adversarial network (SEGAN), multi-scale features refine blocks (MSFRB), and an attention mechanism (AM) to enhance the performance on vessel segmentation. Method: The proposed network is granted powerful multi-scale representation capability to extract detail information. First, SEGAN constructs a symmetric adversarial architecture, which forces the generator to produce more realistic images with local details. Second, MSFRB are devised to prevent high-resolution features from being obscured, thereby merging multi-scale features better. Finally, the AM is employed to encourage the network to concentrate on discriminative features. Results: On the public datasets DRIVE, STARE, CHASEDB1, and HRF, we evaluate our network quantitatively and compare it with state-of-the-art works. The ablation experiment shows that SEGAN, MSFRB, and AM all contribute to the desirable performance. Conclusion: The proposed network outperforms the mature methods and functions effectively in elusive vessel segmentation, achieving the highest scores in Sensitivity, G-Mean, Precision, and F1-Score while maintaining the top level in other metrics. Significance: The appreciable performance and computational efficiency offer great potential in clinical retinal vessel segmentation applications. Meanwhile, the network could be utilized to extract detail information in other biomedical issues.
Submitted 18 December, 2019; v1 submitted 26 September, 2019;
originally announced September 2019.
-
A Simple Change Comparison Method for Image Sequences Based on Uncertainty Coefficient
Authors:
Ruzhang Zhao,
Yajun Fang,
Berthold K. P. Horn
Abstract:
For identification of change information in image sequences, most studies focus on change detection in one image sequence, while few studies have considered the change level comparison between two different image sequences. Moreover, most studies require the detection of image information in detail, for example, object detection. Based on the Uncertainty Coefficient (UC), this paper proposes an innovative method, CCUC, for change comparison between two image sequences. The proposed method is computationally efficient and simple to implement. The change comparison problem stems from video monitoring systems, where the limited number of available screens and the large number of monitoring cameras require the videos or image sequences to be ordered by change level. We demonstrate this new method by applying it to two publicly available image sequences. The results show that the method can distinguish different change levels between sequences.
Submitted 14 October, 2018;
originally announced October 2018.