-
MCDubber: Multimodal Context-Aware Expressive Video Dubbing
Authors:
Yuan Zhao,
Zhenqi Jia,
Rui Liu,
De Hu,
Feilong Bao,
Guanglai Gao
Abstract:
Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed MCDubber, to convert the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) A context duration aligner aims to learn the context-aware alignment between the text and lip frames; (2) A context prosody predictor seeks to read the global context visual sequence and predict the context-aware global energy and pitch; (3) A context acoustic decoder ultimately predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The extracted mel-spectrogram belonging to the target sentence from the output context mel-spectrograms is the final required dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that our MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
Submitted 21 August, 2024;
originally announced August 2024.
-
Benchmarking Conventional and Learned Video Codecs with a Low-Delay Configuration
Authors:
Siyue Teng,
Yuxuan Jiang,
Ge Gao,
Fan Zhang,
Thomas Davis,
Zoe Liu,
David Bull
Abstract:
Recent advances in video compression have seen significant coding performance improvements with the development of new standards and learning-based video codecs. However, most of these works focus on application scenarios that allow a certain amount of system delay (e.g., Random Access mode in MPEG codecs), which is not always acceptable for live delivery. This paper conducts a comparative study of state-of-the-art conventional and learned video coding methods based on a low delay configuration. Specifically, this study includes two MPEG standard codecs (H.266/VVC VTM and JVET ECM), two AOM codecs (AV1 libaom and AVM), and two recent neural video coding models (DCVC-DC and DCVC-FM). To allow a fair and meaningful comparison, the evaluation was performed on test sequences defined in the AOM and MPEG common test conditions in the YCbCr 4:2:0 color space. The evaluation results show that the JVET ECM codecs offer the best overall coding performance among all codecs tested, with a 16.1% (based on PSNR) average BD-rate saving over AOM AVM, and 11.0% over DCVC-FM. We also observed inconsistent performance with the learned video codecs, DCVC-DC and DCVC-FM, for test content with large background motions.
Submitted 9 August, 2024;
originally announced August 2024.
-
AI for Equitable Tennis Training: Leveraging AI for Equitable and Accurate Classification of Tennis Skill Levels and Training Phases
Authors:
Gyanna Gao,
Hao-Yu Liao,
Zhenhong Hu
Abstract:
Numerous studies have demonstrated the manifold benefits of tennis, such as increasing overall physical and mental health. Unfortunately, many children and youth from low-income families are unable to engage in this sport, mainly due to financial constraints such as private lesson expenses as well as logistical concerns about travel to and from such lessons and clinics. While several tennis self-training systems exist, they are often tailored for professionals and are prohibitively expensive. The present study aims to classify tennis players' skill levels and classify tennis strokes into phases characterized by motion attributes for the future development of an AI-based tennis self-training model for affordable and convenient applications running on devices used in daily life such as an iPhone or an Apple Watch for tennis skill improvement. We collected motion data, including Motion Yaw, Roll and Pitch from inertial measurement units (IMUs) worn by participating junior tennis players. For this pilot study, data from twelve participants were processed using Support Vector Machine (SVM) algorithms. The SVM models demonstrated an overall accuracy of 77% in classifying players as beginners or intermediates, with low rates of false positives and false negatives, effectively distinguishing skill levels. Additionally, the tennis swings were successfully classified into five phases based on the collected motion data. These findings indicate that SVM-based classification can be a reliable foundation for developing an equitable and accessible AI-driven tennis training system.
Submitted 23 June, 2024;
originally announced June 2024.
-
Anomaly Detection Utilizing a Riemann Metric for Robust Myoelectric Pattern Recognition
Authors:
ZongYe Hu,
Ge Gao,
Xiang Chen,
Xu Zhang
Abstract:
Traditional myoelectric pattern recognition (MPR) systems excel within controlled laboratory environments but suffer interference when confronted with anomalous or novel motions not encountered during the training phase. A prevalent approach to alleviating such interference is to use a distance metric to distinguish target from novel motions by comparing extracted features against the training set. An innovative method for anomaly motion detection was proposed based on the simplified log-Euclidean distance (SLED) of symmetric positive definite manifolds. The SLED enhances the discrimination between target and novel motions. Moreover, it generates a more flexible shaping of motion boundaries to segregate target and novel motions, therefore effectively detecting the novel ones. The proposed method was evaluated using surface-electromyographic (sEMG) armband data recorded while performing 6 target and 8 novel hand motions. Based on linear discriminant analysis (LDA) and convolution prototype network (CPN) feature extractors, the proposed method achieved accuracies of 89.7% and 93.9% in novel motion detection respectively, while maintaining a target motion classification accuracy of 90%, outperforming existing methods with statistical significance (p<0.05). This study provided a valuable solution for improving the robustness of MPR systems against anomaly motion interference.
Submitted 12 June, 2024;
originally announced June 2024.
-
A Survey of Machine Learning Techniques for Improving Global Navigation Satellite Systems
Authors:
Adyasha Mohanty,
Grace Gao
Abstract:
Global Navigation Satellite Systems (GNSS)-based positioning plays a crucial role in various applications, including navigation, transportation, logistics, mapping, and emergency services. Traditional GNSS positioning methods are model-based and they utilize satellite geometry and the known properties of satellite signals. However, model-based methods have limitations in challenging environments and often lack adaptability to uncertain noise models. This paper highlights recent advances in Machine Learning (ML) and its potential to address these limitations. It covers a broad range of ML methods, including supervised learning, unsupervised learning, deep learning, and hybrid approaches. The survey provides insights into positioning applications related to GNSS such as signal analysis, anomaly detection, multi-sensor integration, prediction, and accuracy enhancement using ML. It discusses the strengths, limitations, and challenges of current ML-based approaches for GNSS positioning, providing a comprehensive overview of the field.
Submitted 29 March, 2024;
originally announced June 2024.
-
Triage of 3D pathology data via 2.5D multiple-instance learning to guide pathologist assessments
Authors:
Gan Gao,
Andrew H. Song,
Fiona Wang,
David Brenes,
Rui Wang,
Sarah S. L. Chow,
Kevin W. Bishop,
Lawrence D. True,
Faisal Mahmood,
Jonathan T. C. Liu
Abstract:
Accurate patient diagnoses based on human tissue biopsies are hindered by current clinical practice, where pathologists assess only a limited number of thin 2D tissue slices sectioned from 3D volumetric tissue. Recent advances in non-destructive 3D pathology, such as open-top light-sheet microscopy, enable comprehensive imaging of spatially heterogeneous tissue morphologies, offering the feasibility to improve diagnostic determinations. A potential early route towards clinical adoption for 3D pathology is to rely on pathologists for final diagnosis based on viewing familiar 2D H&E-like image sections from the 3D datasets. However, manual examination of the massive 3D pathology datasets is infeasible. To address this, we present CARP3D, a deep learning triage approach that automatically identifies the highest-risk 2D slices within 3D volumetric biopsy, enabling time-efficient review by pathologists. For a given slice in the biopsy, we estimate its risk by performing attention-based aggregation of 2D patches within each slice, followed by pooling of the neighboring slices to compute a context-aware 2.5D risk score. For prostate cancer risk stratification, CARP3D achieves an area under the curve (AUC) of 90.4% for triaging slices, outperforming methods relying on independent analysis of 2D sections (AUC=81.3%). These results suggest that integrating additional depth context enhances the model's discriminative capabilities. In conclusion, CARP3D has the potential to improve pathologist diagnosis via accurate triage of high-risk slices within large-volume 3D pathology datasets.
Submitted 11 June, 2024;
originally announced June 2024.
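The 2.5D scoring idea in this abstract (a per-slice risk that is pooled with neighboring slices before the highest-risk slice is chosen) can be illustrated with a small sketch. This is not the CARP3D implementation: the function names are hypothetical, the per-slice scores are assumed to be already computed, and mean pooling over a fixed window is an assumption, since the abstract does not say which pooling is used.

```python
def pooled_slice_risk(slice_scores, radius=1):
    """Average each slice's risk with its neighbors within `radius` slices.

    Assumption: plain mean pooling; windows are truncated at the volume edges.
    """
    n = len(slice_scores)
    pooled = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        window = slice_scores[lo:hi]
        pooled.append(sum(window) / len(window))
    return pooled


def highest_risk_slice(slice_scores, radius=1):
    """Return the index of the slice with the largest pooled risk."""
    pooled = pooled_slice_risk(slice_scores, radius)
    return max(range(len(pooled)), key=pooled.__getitem__)


# Hypothetical per-slice risks for a six-slice biopsy volume.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1]
print(highest_risk_slice(scores, radius=1))
```

Pooling rewards slices whose neighbors are also high-risk, so an isolated spike is less likely to be selected than a sustained high-risk region.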
-
Spreading Code Optimization for Low-Earth Orbit Satellites via Mixed-Integer Convex Programming
Authors:
Alan Yang,
Tara Mina,
Grace Gao
Abstract:
Optimizing the correlation properties of spreading codes is critical for minimizing inter-channel interference in satellite navigation systems. By improving the codes' correlation sidelobes, we can enhance navigation performance while minimizing the required spreading code lengths. In the case of low earth orbit (LEO) satellite navigation, shorter code lengths (on the order of a hundred) are preferred due to their ability to achieve fast signal acquisition. Additionally, the relatively high signal-to-noise ratio (SNR) in LEO systems reduces the need for longer spreading codes to mitigate inter-channel interference. In this work, we propose a two-stage block coordinate descent (BCD) method which optimizes the codes' correlation properties while enforcing the autocorrelation sidelobe zero (ACZ) property. In each iteration of the BCD method, we solve a mixed-integer convex program (MICP) over a block of 25 binary variables. Our method is applicable to spreading code families of arbitrary sizes and lengths, and we demonstrate its effectiveness for a problem with 66 length-127 codes and a problem with 130 length-257 codes.
Submitted 19 April, 2024;
originally announced April 2024.
-
Accelerating Learnt Video Codecs with Gradient Decay and Layer-wise Distillation
Authors:
Tianhao Peng,
Ge Gao,
Heming Sun,
Fan Zhang,
David Bull
Abstract:
In recent years, end-to-end learnt video codecs have demonstrated their potential to compete with conventional coding algorithms in terms of compression efficiency. However, most learning-based video compression models are associated with high computational complexity and latency, in particular at the decoder side, which limits their deployment in practical applications. In this paper, we present a novel model-agnostic pruning scheme based on gradient decay and adaptive layer-wise distillation. Gradient decay enhances parameter exploration during sparsification whilst preventing runaway sparsity and is superior to the standard Straight-Through Estimation. The adaptive layer-wise distillation regulates the sparse training in various stages based on the distortion of intermediate features. This stage-wise design efficiently updates parameters with minimal computational overhead. The proposed approach has been applied to three popular end-to-end learnt video codecs, FVC, DCVC, and DCVC-HEM. Results confirm that our method yields up to 65% reduction in MACs and 2x speed-up with less than 0.3dB drop in BD-PSNR. Supporting code and supplementary material can be downloaded from: https://jasminepp.github.io/lightweightdvc/
Submitted 5 December, 2023;
originally announced December 2023.
-
Spoofing-Resilient LiDAR-GPS Factor Graph Localization with Chimera Authentication
Authors:
Adam Dai,
Tara Mina,
Ashwin Kanhere,
Grace Gao
Abstract:
Many vehicle platforms typically use sensors such as LiDAR or camera for locally-referenced navigation with GPS for globally-referenced navigation. However, due to the unencrypted nature of GPS signals, all civilian users are vulnerable to spoofing attacks, where a malicious spoofer broadcasts fabricated signals and causes the user to track a false position fix. To protect against such GPS spoofing attacks, Chips-Message Robust Authentication (Chimera) has been developed and will be tested on the Navigation Technology Satellite 3 (NTS-3) satellite being launched later this year. However, Chimera authentication is not continuously available and may not provide sufficient protection for vehicles which rely on more frequent GPS measurements. In this paper, we propose a factor graph-based state estimation framework which integrates LiDAR and GPS while simultaneously detecting and mitigating spoofing attacks experienced between consecutive Chimera authentications. Our proposed framework combines GPS pseudorange measurements with LiDAR odometry to provide a robust navigation solution. A chi-squared detector, based on pseudorange residuals, is used to detect and mitigate any potential GPS spoofing attacks. We evaluate our method using real-world LiDAR data from the KITTI dataset and simulated GPS measurements, both nominal and with spoofing. Across multiple trajectories and Monte Carlo runs, our method consistently achieves position errors under 5 m during nominal conditions, and successfully bounds positioning error to within odometry drift levels during spoofed conditions.
Submitted 10 July, 2023;
originally announced July 2023.
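The chi-squared detector mentioned in this abstract can be illustrated with a generic residual test. This is a minimal sketch, not the authors' detector: the function names are hypothetical, the residuals are assumed to be independent with a common standard deviation `sigma`, and the threshold is supplied by the caller rather than derived from the paper.

```python
def chi_squared_statistic(residuals, sigma):
    """Sum of squared pseudorange residuals, each normalized by sigma.

    Assumption: residuals are independent, zero-mean Gaussian under nominal
    conditions, so the statistic is chi-squared with len(residuals) degrees
    of freedom.
    """
    return sum((r / sigma) ** 2 for r in residuals)


def is_spoofing_suspected(residuals, sigma, threshold):
    """Flag a measurement epoch whose statistic exceeds the threshold."""
    return chi_squared_statistic(residuals, sigma) > threshold


# 13.277 is the 0.99 quantile of a chi-squared distribution with 4 degrees
# of freedom; the false-alarm rate of 1% is an illustrative choice.
THRESHOLD_4DOF_99 = 13.277

nominal = [1.0, -1.0, 0.5, 0.5]   # metres, hypothetical
spoofed = [5.0, 5.0, 5.0, 5.0]    # metres, hypothetical
print(is_spoofing_suspected(nominal, 1.0, THRESHOLD_4DOF_99))
print(is_spoofing_suspected(spoofed, 1.0, THRESHOLD_4DOF_99))
```

In a full system the residuals would come from comparing measured pseudoranges with those predicted from the current state estimate, and the threshold would be set from the desired false-alarm probability and the number of satellites in view.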
-
HiNeRV: Video Compression with Hierarchical Encoding-based Neural Representation
Authors:
Ho Man Kwan,
Ge Gao,
Fan Zhang,
Andrew Gower,
David Bull
Abstract:
Learning-based video compression is currently a popular research topic, offering the potential to compete with conventional standard video codecs. In this context, Implicit Neural Representations (INRs) have previously been used to represent and compress image and video content, demonstrating relatively high decoding speed compared to other methods. However, existing INR-based methods have failed to deliver rate-quality performance comparable with the state of the art in video compression. This is mainly due to the simplicity of the employed network architectures, which limits their representation capability. In this paper, we propose HiNeRV, an INR that combines lightweight layers with novel hierarchical positional encodings. We employ depth-wise convolutional, MLP and interpolation layers to build a deep and wide network architecture with high capacity. HiNeRV is also a unified representation encoding videos in both frames and patches at the same time, which offers higher performance and flexibility than existing methods. We further build a video codec based on HiNeRV and a refined pipeline for training, pruning and quantization that can better preserve HiNeRV's performance during lossy model compression. The proposed method has been evaluated on both UVG and MCL-JCV datasets for video compression, demonstrating significant improvement over all existing INR baselines and competitive performance when compared to learning-based codecs (72.3% overall bit rate saving over HNeRV and 43.4% over DCVC on the UVG dataset, measured in PSNR).
Submitted 26 January, 2024; v1 submitted 16 June, 2023;
originally announced June 2023.
-
MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset
Authors:
Kailin Liang,
Bin Liu,
Yifan Hu,
Rui Liu,
Feilong Bao,
Guanglai Gao
Abstract:
Text-to-Speech (TTS) synthesis for low-resource languages is an attractive research issue in academia and industry nowadays. Mongolian is the official language of the Inner Mongolia Autonomous Region and a representative low-resource language spoken by over 10 million people worldwide. However, there is a relative lack of open-source datasets for Mongolian TTS. Therefore, we make public an open-source multi-speaker Mongolian TTS dataset, named MnTTS2, for the benefit of related researchers. In this work, we prepare the transcription from various topics and invite three professional Mongolian announcers to form a three-speaker TTS dataset, in which each announcer records 10 hours of speech in Mongolian, resulting in 30 hours in total. Furthermore, we build the baseline system based on the state-of-the-art FastSpeech2 model and HiFi-GAN vocoder. The experimental results suggest that the constructed MnTTS2 dataset is sufficient to build robust multi-speaker TTS models for real-world applications. The MnTTS2 dataset, training recipe, and pretrained models are released at: https://github.com/ssmlkl/MnTTS2
Submitted 11 December, 2022;
originally announced January 2023.
-
Binary sequence set optimization for CDMA applications via mixed-integer quadratic programming
Authors:
Alan Yang,
Tara Mina,
Grace Gao
Abstract:
Finding sets of binary sequences with low auto- and cross-correlation properties is a hard combinatorial optimization problem with numerous applications, including multiple-input-multiple-output (MIMO) radar and global navigation satellite systems (GNSS). The sum of squared correlations, sometimes referred to as the integrated sidelobe level (ISL), is a quartic function in the variables and is a commonly-used metric of sequence set quality. In this paper, we show that the ISL minimization problem may be formulated as a mixed-integer quadratic program (MIQP). We then present a block coordinate descent (BCD) algorithm that iteratively optimizes over subsets of variables. The subset optimization subproblems are also MIQPs which may be handled more efficiently using specialized solvers than using exhaustive search; this allows us to perform BCD over larger variable subsets than previously possible. Our approach was used to find sets of four binary sequences of lengths up to 1023 with better ISL performance than Gold codes and sequence sets found using existing BCD methods.
Submitted 14 March, 2023; v1 submitted 1 November, 2022;
originally announced November 2022.
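As a rough illustration of the integrated sidelobe level (ISL) described in this abstract, the sketch below sums squared correlations over a small set of ±1 sequences. The function names are hypothetical, and several conventions are assumptions rather than details from the paper: correlations are periodic (circular), each unordered pair of sequences is counted once, and no normalization is applied.

```python
from itertools import combinations


def periodic_correlation(a, b, shift):
    """Circular correlation of two equal-length +/-1 sequences at one shift."""
    n = len(a)
    return sum(a[i] * b[(i + shift) % n] for i in range(n))


def integrated_sidelobe_level(codes):
    """Sum of squared autocorrelation sidelobes and squared cross-correlations.

    Autocorrelation terms skip shift 0 (the main peak); cross-correlation
    terms include every shift. Each correlation is quadratic in the entries,
    so the squared sum is quartic, as the abstract notes.
    """
    n = len(codes[0])
    total = 0
    for c in codes:
        for k in range(1, n):
            total += periodic_correlation(c, c, k) ** 2
    for a, b in combinations(codes, 2):
        for k in range(n):
            total += periodic_correlation(a, b, k) ** 2
    return total


a = [1, 1, 1, -1]   # all periodic autocorrelation sidelobes are zero
b = [1, 1, 1, 1]    # worst case: every sidelobe equals the length
print(integrated_sidelobe_level([a]))
print(integrated_sidelobe_level([b]))
print(integrated_sidelobe_level([a, b]))
```

A block coordinate descent scheme of the kind the abstract describes would repeatedly pick a subset of entries, hold the rest fixed, and choose the values of that subset that minimize this quantity.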
-
Explicit Intensity Control for Accented Text-to-speech
Authors:
Rui Liu,
Haolin Zuo,
De Hu,
Guanglai Gao,
Haizhou Li
Abstract:
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). How to control the intensity of accent in the process of TTS is a very interesting research direction, and has attracted more and more attention. Recent work designs a speaker-adversarial loss to disentangle the speaker and accent information, and then adjusts the loss weight to control the accent intensity. However, such a control method lacks interpretability, and there is no direct correlation between the controlling factor and natural accent intensity. To this end, this paper proposes a new intuitive and explicit accent intensity control scheme for accented TTS. Specifically, we first extract the posterior probability, called "goodness of pronunciation" (GoP), from the L1 speech recognition model to quantify the phoneme accent intensity for accented speech, then design a FastSpeech2-based TTS model, named Ai-TTS, to take the accent intensity expression into account during speech generation. Experiments show that our method outperforms the baseline model in terms of accent rendering and intensity control.
Submitted 27 October, 2022;
originally announced October 2022.
-
FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis
Authors:
Yifan Hu,
Rui Liu,
Guanglai Gao,
Haizhou Li
Abstract:
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context. The correlation between the current utterance and the dialogue history at the utterance level was used to improve the expressiveness of synthesized speech. However, the fine-grained information in the dialogue history at the word level also has an important impact on the prosodic expression of an utterance, which has not been well studied in the prior work. Therefore, we propose a novel expressive conversational TTS model, termed FCTalker, that learns the fine and coarse grained context dependency at the same time during speech generation. Specifically, the FCTalker includes fine and coarse grained encoders to exploit the word and utterance-level context dependency. To model the word-level dependencies between an utterance and its dialogue history, the fine-grained dialogue encoder is built on top of a dialogue BERT model. The experimental results show that the proposed method outperforms all baselines and generates more expressive speech that is contextually appropriate. We release the source code at: https://github.com/walker-hyf/FCTalker.
Submitted 27 October, 2022;
originally announced October 2022.
-
The Brain-Inspired Cooperative Shared Control for Brain-Machine Interface
Authors:
Shengjie Zheng,
Ling Liu,
Junjie Yang,
Lang Qian,
Gang Gao,
Xin Chen,
Wenqi Jin,
Chunshan Deng,
Xiaojian Li
Abstract:
In practical applications of brain-machine interface technology, a common problem is that the neural signals collected by the electrodes have low information content and high noise and are difficult to decode, which makes it difficult for the robot to obtain stable instructions to complete the task. Under the principle of cooperative shared control, general motor commands are extracted from brain activity while the fine details of the movement are delegated to the robot for completion, or the brain can retain complete control. This study proposes a brain-machine interface shared control system based on spiking neural networks for robotic arm movement control and for wheeled-robot wheel speed control and steering, respectively. The former can reliably control the robotic arm to move to the destination position, while the latter controls the wheeled robots for object tracking and map generation. The results show that shared control based on brain-inspired intelligence can perform some typical tasks in complex environments and improve the fluency and ease of use of brain-machine interaction, and also demonstrate the potential of this control method in clinical applications of brain-machine interfaces.
Submitted 25 June, 2024; v1 submitted 17 October, 2022;
originally announced October 2022.
-
Rethinking the Detection Head Configuration for Traffic Object Detection
Authors:
Yi Shi,
Jiang Wu,
Shixuan Zhao,
Gangyao Gao,
Tao Deng,
Hongmei Yan
Abstract:
Multi-scale detection plays an important role in object detection models. However, researchers are often unsure how to reasonably configure detection heads that combine multi-scale features at different input resolutions. We find that there are different matching relationships between the object distribution and the detection head at different input resolutions. Based on these findings, we propose a lightweight traffic object detection network based on matching between detection head and object distribution, termed MHD-Net. It consists of three main parts. The first is the detection head and object distribution matching strategy, which guides the rational configuration of the detection head, so as to leverage multi-scale features to effectively detect objects at vastly different scales. The second is the cross-scale detection head configuration guideline, which recommends replacing multiple detection heads with only two detection heads possessing rich feature representations to achieve an excellent balance between detection accuracy, model parameters, FLOPs and detection speed. The third is the receptive field enlargement method, which combines the dilated convolution module with shallow features of the backbone to further improve the detection accuracy at the cost of increasing model parameters very slightly. The proposed model achieves more competitive performance than other models on the BDD100K dataset and our proposed ETFOD-v2 dataset. The code will be available.
Submitted 7 October, 2022;
originally announced October 2022.
-
MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline
Authors:
Yifan Hu,
Pengkai Yin,
Rui Liu,
Feilong Bao,
Guanglai Gao
Abstract:
This paper introduces a high-quality open-source text-to-speech (TTS) synthesis dataset for Mongolian, a low-resource language spoken by over 10 million people worldwide. The dataset, named MnTTS, consists of about 8 hours of transcribed audio recordings spoken by a 22-year-old professional female Mongolian announcer. It is the first publicly available dataset developed to promote Mongolian TTS applications in both academia and industry. In this paper, we share our experience by describing the dataset development procedures and the challenges faced. To demonstrate the reliability of our dataset, we built a powerful non-autoregressive baseline system based on the FastSpeech2 model and the HiFi-GAN vocoder, and evaluated it using the subjective mean opinion score (MOS) and real time factor (RTF) metrics. Evaluation results show that the baseline system trained on our dataset achieves MOS above 4 and RTF about $3.30\times10^{-1}$, which makes it applicable for practical use. The dataset, training recipe, and pretrained TTS models are freely available at https://github.com/walker-hyf/MnTTS.
Submitted 22 September, 2022;
originally announced September 2022.
-
Controllable Accented Text-to-Speech Synthesis
Authors:
Rui Liu,
Berrak Sisman,
Guanglai Gao,
Haizhou Li
Abstract:
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). Accented TTS synthesis is challenging as L2 differs from L1 in both phonetic rendering and prosody pattern. Furthermore, there is no easy solution to the control of the accent intensity in an utterance. In this work, we propose a neural TTS architecture that allows us to control the accent and its intensity during inference. This is achieved through three novel mechanisms: 1) an accent variance adaptor to model the complex accent variance with three prosody controlling factors, namely pitch, energy and duration; 2) an accent intensity modeling strategy to quantify the accent intensity; 3) a consistency constraint module to encourage the TTS system to render the expected accent intensity at a fine level. Experiments show that the proposed system attains superior performance to the baseline models in terms of accent rendering and intensity control. To the best of our knowledge, this is the first study of accented TTS synthesis with explicit intensity control.
Submitted 22 September, 2022;
originally announced September 2022.
-
Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning
Authors:
Rui Liu,
Berrak Sisman,
Björn Schuller,
Guanglai Gao,
Haizhou Li
Abstract:
Emotion classification of speech and assessment of the emotion strength are required in applications such as emotional text-to-speech and voice conversion. The emotion attribute ranking function based on Support Vector Machine (SVM) was proposed to predict emotion strength for emotional speech corpus. However, the trained ranking function doesn't generalize to new domains, which limits the scope of applications, especially for out-of-domain or unseen speech. In this paper, we propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech. This is achieved by the fusion of emotional data from various domains. We follow a multi-task learning network architecture that includes an acoustic encoder, a strength predictor, and an auxiliary emotion predictor. Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech. We release the source codes at: https://github.com/ttslr/StrengthNet.
Submitted 14 June, 2022;
originally announced June 2022.
-
Multiple Degradation and Reconstruction Network for Single Image Denoising via Knowledge Distillation
Authors:
Juncheng Li,
Hanhui Yang,
Qiaosi Yi,
Faming Fang,
Guangwei Gao,
Tieyong Zeng,
Guixu Zhang
Abstract:
Single image denoising (SID) has achieved significant breakthroughs with the development of deep learning. However, the proposed methods are often accompanied by plenty of parameters, which greatly limits their application scenarios. Different from previous works that blindly increase the depth of the network, we explore the degradation mechanism of the noisy image and propose a lightweight Multiple Degradation and Reconstruction Network (MDRN) to progressively remove noise. Meanwhile, we propose two novel Heterogeneous Knowledge Distillation Strategies (HMDS) to enable MDRN to learn richer and more accurate features from heterogeneous models, which make it possible to reconstruct higher-quality denoised images under extreme conditions. Extensive experiments show that our MDRN achieves favorable performance against other SID models with fewer parameters. Meanwhile, plenty of ablation studies demonstrate that the introduced HMDS can improve the performance of tiny models or the model under high noise levels, which is extremely useful for related applications.
Submitted 29 April, 2022;
originally announced April 2022.
-
Safe Reinforcement Learning Using Black-Box Reachability Analysis
Authors:
Mahmoud Selim,
Amr Alanwar,
Shreyas Kousik,
Grace Gao,
Marco Pavone,
Karl H. Johansson
Abstract:
Reinforcement learning (RL) is capable of sophisticated motion planning and control for robots in uncertain environments. However, state-of-the-art deep RL approaches typically lack safety guarantees, especially when the robot and environment models are unknown. To justify widespread deployment, robots must respect safety constraints without sacrificing performance. Thus, we propose a Black-box Reachability-based Safety Layer (BRSL) with three main components: (1) data-driven reachability analysis for a black-box robot model, (2) a trajectory rollout planner that predicts future actions and observations using an ensemble of neural networks trained online, and (3) a differentiable polytope collision check between the reachable set and obstacles that enables correcting unsafe actions. In simulation, BRSL outperforms other state-of-the-art safe RL methods on a Turtlebot 3, a quadrotor, a trajectory-tracking point mass, and a hexarotor in wind with an unsafe set adjacent to the area of highest reward.
Submitted 21 November, 2022; v1 submitted 15 April, 2022;
originally announced April 2022.
-
Feature Distillation Interaction Weighting Network for Lightweight Image Super-Resolution
Authors:
Guangwei Gao,
Wenjie Li,
Juncheng Li,
Fei Wu,
Huimin Lu,
Yi Yu
Abstract:
Convolutional neural network-based single-image super-resolution (SISR) has made great progress in recent years. However, it is difficult to apply these methods to real-world scenarios due to their computational and memory cost. Meanwhile, how to take full advantage of the intermediate features under the constraints of limited parameters and calculations is also a huge challenge. To alleviate these issues, we propose a lightweight yet efficient Feature Distillation Interaction Weighted Network (FDIWN). Specifically, FDIWN utilizes a series of specially designed Feature Shuffle Weighted Groups (FSWG) as the backbone, and several novel mutual Wide-residual Distillation Interaction Blocks (WDIB) form an FSWG. In addition, Wide Identical Residual Weighting (WIRW) units and Wide Convolutional Residual Weighting (WCRW) units are introduced into WDIB for better feature distillation. Moreover, a Wide-Residual Distillation Connection (WRDC) framework and a Self-Calibration Fusion (SCF) unit are proposed to let features at different scales interact more flexibly and efficiently. Extensive experiments show that our FDIWN strikes a better balance between model performance and efficiency than other models. The code is available at https://github.com/IVIPLab/FDIWN.
Submitted 11 April, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
A Novel Full-Polarization SAR Images Ship Detector Based on the Scattering Mechanisms and the Wave Polarization Anisotropy
Authors:
Chuan Zhang,
Gui Gao,
Linlin Zhang,
C. Chen,
S. Gao,
Libo Yao,
Shiquan Gou
Abstract:
Synthetic aperture radar (SAR) is considered a good option for earth observation owing to its unique advantages. In this paper, we propose an adaptive ship detector using full-polarization SAR images. First, by thoroughly investigating the scattering characteristics of ships and their background, and the wave polarization anisotropy, a novel ship detector is proposed by joining the two characteristics, named Scattering-Anisotropy joint (joint-SA). Based on the theoretical analysis, we show that the joint-SA is an effective physical quantity for showing the difference between the ship and its background, and thus joint-SA can be used for ship detection in full-polarization image data. Second, the generalized Gamma distribution is used to characterize the joint-SA statistics of sea clutter with a large range of homogeneity. As a result, an adaptive constant false alarm rate (CFAR) method is implemented based on the joint-SA. Finally, RADARSAT-2 and GF-3 data in C-band and ALOS data in L-band are used for verification. We tested on five datasets, and the experimental results verify the correctness and superiority of the CFAR method based on the joint-SA. In addition, the experimental results also show that the signal-clutter ratio (SCR) of the proposed ship detector joint-SA (33.17 dB, 35.98 dB, 57.25 dB) is better than that of DBSP (8.92 dB, 3.43 dB, 25.40 dB) and RsDVH (17.28 dB, 11.17 dB, 54.55 dB). More importantly, the proposed detector joint-SA has higher detection accuracy and a lower false alarm rate.
Submitted 6 December, 2021; v1 submitted 6 December, 2021;
originally announced December 2021.
-
A Systematic Survey of Deep Learning-based Single-Image Super-Resolution
Authors:
Juncheng Li,
Zehua Pei,
Wenjie Li,
Guangwei Gao,
Longguang Wang,
Yingqian Wang,
Tieyong Zeng
Abstract:
Single-image super-resolution (SISR) is an important task in image processing, which aims to enhance the resolution of imaging systems. Recently, SISR has made a huge leap and has achieved promising results with the help of deep learning (DL). In this survey, we give an overview of DL-based SISR methods and group them according to their design targets. Specifically, we first introduce the problem definition, research background, and the significance of SISR. Secondly, we introduce some related works, including benchmark datasets, upsampling methods, optimization objectives, and image quality assessment methods. Thirdly, we provide a detailed investigation of SISR and give some domain-specific applications of it. Fourthly, we present the reconstruction results of some classic SISR methods to intuitively know their performance. Finally, we discuss some issues that still exist in SISR and summarize some new trends and future directions. This is an exhaustive survey of SISR, which can help researchers better understand SISR and inspire more exciting research in this field. An investigation project for SISR is provided at https://github.com/CV-JunchengLi/SISR-Survey.
Submitted 12 April, 2024; v1 submitted 29 September, 2021;
originally announced September 2021.
-
Ellipsotopes: Combining Ellipsoids and Zonotopes for Reachability Analysis and Fault Detection
Authors:
Shreyas Kousik,
Adam Dai,
Grace Gao
Abstract:
Ellipsoids are a common representation for reachability analysis, because they can be transformed efficiently under affine maps, and allow conservative approximation of Minkowski sums, which let one incorporate uncertainty and linearization error in a dynamical system by expanding the size of the reachable set. Zonotopes, a type of symmetric, convex polytope, are similarly frequently used due to efficient numerical implementation of affine maps and exact Minkowski sums. Both of these representations also enable efficient, convex collision detection for fault detection or formal verification tasks, wherein one checks if the reachable set of a system collides (i.e., intersects) with an unsafe set. However, both representations often result in conservative representations for reachable sets of arbitrary systems, and neither is closed under intersection. Recently, representations such as constrained zonotopes and constrained polynomial zonotopes have been shown to overcome some of these conservativeness challenges, and are closed under intersection. However, constrained zonotopes can not represent shapes with smooth boundaries such as ellipsoids, and constrained polynomial zonotopes can require solving a non-convex program for collision checking or fault detection. This paper introduces ellipsotopes, a set representation that is closed under affine maps, Minkowski sums, and intersections. Ellipsotopes combine the advantages of ellipsoids and zonotopes while ensuring convex collision checking. The utility of this representation is demonstrated on several examples.
Submitted 21 June, 2022; v1 submitted 3 August, 2021;
originally announced August 2021.
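The zonotope properties cited in the abstract above (closure under affine maps and exact Minkowski sums) can be illustrated with a minimal sketch. This is plain zonotope arithmetic, not the ellipsotope representation the paper introduces; the `Zonotope` class and every helper name below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Zonotope:
    """Set {center + sum_k beta_k * g_k : |beta_k| <= 1} (illustrative type)."""
    center: list      # length-n list of floats
    generators: list  # list of length-n generator vectors


def matvec(A, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def affine_map(z, A, b):
    # The image of a zonotope under x -> A x + b is again a zonotope:
    # the center maps affinely, each generator maps linearly.
    center = [y + bi for y, bi in zip(matvec(A, z.center), b)]
    return Zonotope(center, [matvec(A, g) for g in z.generators])


def minkowski_sum(z1, z2):
    # Exact Minkowski sum: add the centers, concatenate the generator lists.
    center = [a + b for a, b in zip(z1.center, z2.center)]
    return Zonotope(center, z1.generators + z2.generators)


def interval_hull(z):
    # Tightest axis-aligned box: per-axis radius is the sum of |generator entries|.
    radius = [sum(abs(g[i]) for g in z.generators) for i in range(len(z.center))]
    return [(c - r, c + r) for c, r in zip(z.center, radius)]


unit_box = Zonotope([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
segment = Zonotope([1.0, 1.0], [[0.5, 0.5]])
summed = minkowski_sum(unit_box, segment)
stretched = affine_map(unit_box, [[2.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

Both operations only touch the center and generator lists, which is why they are cheap to implement numerically.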
-
Guided Training: A Simple Method for Single-channel Speaker Separation
Authors:
Hao Li,
Xueliang Zhang,
Guanglai Gao
Abstract:
Deep learning has shown great potential for speech separation, especially for speech and non-speech separation. However, it encounters the permutation problem in multi-speaker separation, where both target and interference are speech. Permutation invariant training (PIT) was proposed to solve this problem by permuting the order of the multiple speakers. Another way is to use an anchor speech, a short speech of the target speaker, to model the speaker identity. In this paper, we propose a simple strategy to train a long short-term memory (LSTM) model to solve the permutation problem in speaker separation. Specifically, we insert a short speech of the target speaker at the beginning of a mixture as guide information, so that the first appearing speaker is defined as the target. Owing to its powerful sequence modeling capability, the LSTM can use its memory cells to track and separate target speech from interfering speech. Experimental results show that the proposed training strategy is effective for speaker separation.
Submitted 26 March, 2021;
originally announced March 2021.
-
JDSR-GAN: Constructing An Efficient Joint Learning Network for Masked Face Super-Resolution
Authors:
Guangwei Gao,
Lei Tang,
Fei Wu,
Huimin Lu,
Jian Yang
Abstract:
With the growing importance of preventing the spread of the COVID-19 virus, face images obtained in most video surveillance scenarios are both low-resolution and masked. However, most previous face super-resolution solutions cannot handle both tasks in one model. In this work, we treat the mask occlusion as image noise and construct a joint and collaborative learning network, called JDSR-GAN, for the masked face super-resolution task. Given a low-quality face image with the mask as input, the role of the generator, composed of a denoising module and a super-resolution module, is to acquire a high-quality high-resolution face image. The discriminator utilizes some carefully designed loss functions to ensure the quality of the recovered face images. Moreover, we incorporate the identity information and attention mechanism into our network for feasible correlated feature expression and informative feature learning. By jointly performing denoising and face super-resolution, the two tasks can complement each other and attain promising performance. Extensive qualitative and quantitative results show the superiority of our proposed JDSR-GAN over some comparable methods which perform the two tasks separately.
Submitted 29 January, 2023; v1 submitted 25 March, 2021;
originally announced March 2021.
-
Lightweight Image Super-Resolution with Multi-scale Feature Interaction Network
Authors:
Zhengxue Wang,
Guangwei Gao,
Juncheng Li,
Yi Yu,
Huimin Lu
Abstract:
Recently, single image super-resolution (SISR) approaches with deep and complex convolutional neural network structures have achieved promising performance. However, those methods improve performance at the cost of higher memory consumption, which makes them difficult to apply on mobile devices with limited storage and computing resources. To solve this problem, we present a lightweight multi-scale feature interaction network (MSFIN). For lightweight SISR, MSFIN expands the receptive field and adequately exploits the informative features of the low-resolution observed images from various scales and interactive connections. In addition, we design a lightweight recurrent residual channel attention block (RRCAB) so that the network can benefit from the channel attention mechanism while being sufficiently lightweight. Extensive experiments on some benchmarks have confirmed that our proposed MSFIN can achieve comparable performance against state-of-the-art methods with a more lightweight model.
Submitted 21 June, 2021; v1 submitted 24 March, 2021;
originally announced March 2021.
-
GPS Spoofing Mitigation and Timing Risk Analysis in Networked PMUs via Stochastic Reachability
Authors:
Sriramya Bhamidipati,
Grace Xingxin Gao
Abstract:
To address PMU vulnerability against spoofing, we propose a set-valued state estimation technique known as Stochastic Reachability-based Distributed Kalman Filter (SR-DKF) that computes secure GPS timing across a network of receivers. Utilizing stochastic reachability, we estimate not only GPS time but also its stochastic reachable set, which is parameterized via a probabilistic zonotope (p-Zonotope). While requiring known measurement error bounds in only non-spoofed conditions, we design a two-tier approach: we first perform measurement-level spoofing mitigation via deviation of the measurement innovation from its expected p-Zonotope, and second perform state-level timing risk analysis via the intersection probability of the estimated p-Zonotope with an unsafe set that violates IEEE C37.118.1a-2014 standards. We validate the proposed SR-DKF by subjecting a simulated receiver network to coordinated signal-level spoofing. We demonstrate improved GPS timing accuracy and successful spoofing mitigation via our SR-DKF. We validate the robustness of the estimated timing risk as the number of receivers is varied.
Submitted 12 January, 2021;
originally announced January 2021.
-
Designing Low-Correlation GPS Spreading Codes with a Natural Evolution Strategy Machine Learning Algorithm
Authors:
Tara Yasmin Mina,
Grace Xingxin Gao
Abstract:
With the birth of the next-generation GPS III constellation and the upcoming launch of the Navigation Technology Satellite-3 (NTS-3) testing platform to explore future technologies for GPS, we are indeed entering a new era of satellite navigation. Correspondingly, it is time to revisit the design methods of the GPS spreading code families. In this work, we develop a natural evolution strategy (NES) machine learning algorithm with a Gaussian proposal distribution which constructs high-quality families of spreading code sequences. We minimize the maximum between the mean-squared auto-correlation and the mean-squared cross-correlation and demonstrate the ability of our algorithm to achieve better performance than well-chosen families of equal-length Gold codes and Weil codes, for sequences of up to length-1023 and length-1031 bits and family sizes of up to 31 codes. Furthermore, we compare our algorithm with an analogous genetic algorithm implementation assigned the same code evaluation metric. To the best of the authors' knowledge, this is the first work to explore using a machine learning approach for designing navigation spreading code sequences.
Submitted 28 December, 2021; v1 submitted 7 January, 2021;
originally announced January 2021.
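The evaluation metric named in the abstract above (the maximum of the mean-squared auto-correlation and the mean-squared cross-correlation) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the use of periodic correlation, the division by code length, the exclusion of the zero shift from the auto-correlation, and the function names are all assumptions.

```python
def periodic_correlation(a, b, shift):
    # Normalized periodic (circular) correlation of two +/-1 sequences.
    n = len(a)
    return sum(a[i] * b[(i + shift) % n] for i in range(n)) / n


def family_objective(codes):
    """Max of mean-squared auto- and cross-correlation over a code family.

    codes: list of equal-length sequences with entries +1 or -1.
    Assumptions: auto-correlation skips the zero shift (always 1);
    cross-correlation covers every shift of every distinct pair.
    """
    n = len(codes[0])
    auto_sq = [periodic_correlation(c, c, k) ** 2
               for c in codes for k in range(1, n)]
    cross_sq = [periodic_correlation(codes[i], codes[j], k) ** 2
                for i in range(len(codes))
                for j in range(i + 1, len(codes))
                for k in range(n)]
    mean_auto = sum(auto_sq) / len(auto_sq) if auto_sq else 0.0
    mean_cross = sum(cross_sq) / len(cross_sq) if cross_sq else 0.0
    return max(mean_auto, mean_cross)


single = family_objective([[1, 1, 1, -1]])
pair = family_objective([[1, 1, 1, -1], [1, 1, 1, 1]])
```

A search method such as the evolution strategy described in the abstract would treat a function of this kind as the cost to minimize over candidate code families.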
-
Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS
Authors:
Rui Liu,
Berrak Sisman,
Feilong Bao,
Guanglai Gao,
Haizhou Li
Abstract:
Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training, that optimizes the system to predict both Mel spectrum and phrase breaks. To our best knowledge, this is the first implementation of multi-task learning for Tacotron based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
Submitted 11 August, 2020;
originally announced August 2020.
-
Expressive TTS Training with Frame and Style Reconstruction Loss
Authors:
Rui Liu,
Berrak Sisman,
Guanglai Gao,
Haizhou Li
Abstract:
We propose a novel training strategy for Tacotron-based text-to-speech (TTS) system to improve the expressiveness of speech. One of the key challenges in prosody modeling is the lack of reference that makes explicit modeling difficult. The proposed technique doesn't require prosody annotations from training data. It doesn't attempt to model prosody explicitly either, but rather encodes the association between input text and its prosody styles using a Tacotron-based TTS framework. Our proposed idea marks a departure from the style token paradigm where prosody is explicitly modeled by a bank of prosody embeddings. The proposed training strategy adopts a combination of two objective functions: 1) frame level reconstruction loss, that is calculated between the synthesized and target spectral features; 2) utterance level style reconstruction loss, that is calculated between the deep style features of synthesized and target speech. The proposed style reconstruction loss is formulated as a perceptual loss to ensure that utterance level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms a state-of-the-art baseline in both naturalness and expressiveness. To our best knowledge, this is the first study to incorporate utterance level perceptual quality as a loss function into Tacotron training for improved expressiveness.
Submitted 12 April, 2021; v1 submitted 4 August, 2020;
originally announced August 2020.
-
Distributed Energy Trading and Scheduling among Microgrids via Multiagent Reinforcement Learning
Authors:
Guanyu Gao,
Yonggang Wen,
Xiaohu Wu,
Ran Wang
Abstract:
The development of renewable energy generation empowers microgrids to generate electricity to supply itself and to trade the surplus on energy markets. To minimize the overall cost, a microgrid must determine how to schedule its energy resources and electrical loads and how to trade with others. The control decisions are influenced by various factors, such as energy storage, renewable energy yield, electrical load, and competition from other microgrids. Making the optimal control decision is challenging, due to the complexity of the interconnected microgrids, the uncertainty of renewable energy generation and consumption, and the interplay among microgrids. The previous works mainly adopted the modeling-based approaches for deriving the control decision, yet they relied on the precise information of future system dynamics, which can be hard to obtain in a complex environment. This work provides a new perspective of obtaining the optimal control policy for distributed energy trading and scheduling by directly interacting with the environment, and proposes a multiagent deep reinforcement learning approach for learning the optimal control policy. Each microgrid is modeled as an agent, and different agents learn collaboratively for maximizing their rewards. The agent of each microgrid can make the local scheduling decision without knowing others' information, which can well maintain the autonomy of each microgrid. We evaluate the performances of our proposed method using real-world datasets. The experimental results show that our method can significantly reduce the cost of the microgrids compared with the baseline methods.
Submitted 8 July, 2020;
originally announced July 2020.
-
An Edge Information and Mask Shrinking Based Image Inpainting Approach
Authors:
Huali Xu,
Xiangdong Su,
Meng Wang,
Xiang Hao,
Guanglai Gao
Abstract:
In the image inpainting task, the ability to repair both high-frequency and low-frequency information in the missing regions has a substantial influence on the quality of the restored image. However, existing inpainting methods usually fail to consider both high-frequency and low-frequency information simultaneously. To solve this problem, this paper proposes an edge information and mask shrinking based image inpainting approach, which consists of two models. The first model is an edge generation model used to generate complete edge information from the damaged image, and the second model is an image completion model used to fix the missing regions with the generated edge information and the valid contents of the damaged image. The mask shrinking strategy is employed in the image completion model to track the areas to be repaired. The proposed approach is evaluated qualitatively and quantitatively on the Places2 dataset. The results show that our approach outperforms state-of-the-art methods.
Submitted 11 June, 2020;
originally announced June 2020.
-
SNR-Based Teachers-Student Technique for Speech Enhancement
Authors:
Xiang Hao,
Xiangdong Su,
Zhiyu Wang,
Qiang Zhang,
Huali Xu,
Guanglai Gao
Abstract:
It is very challenging for speech enhancement methods to achieve robust performance under both high signal-to-noise ratio (SNR) and low SNR simultaneously. In this paper, we propose a method that integrates an SNR-based teachers-student technique and a time-domain U-Net to deal with this problem. Specifically, this method consists of multiple teacher models and a student model. We first train the teacher models under multiple small-range SNRs that do not coincide with each other so that they can perform speech enhancement well within the specific SNR range. Then, we choose different teacher models to supervise the training of the student model according to the SNR of the training data. Eventually, the student model can perform speech enhancement under both high SNR and low SNR. To evaluate the proposed method, we constructed a dataset with an SNR ranging from -20 dB to 20 dB based on a public dataset. We experimentally analyzed the effectiveness of the SNR-based teachers-student technique and compared the proposed method with several state-of-the-art methods.
Submitted 29 October, 2020; v1 submitted 29 May, 2020;
originally announced May 2020.
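The teacher selection step described in this abstract (choosing a teacher according to the SNR of each training sample) can be illustrated with a minimal sketch. The bin edges and the name `select_teacher` are hypothetical; the abstract only states that the teachers cover non-overlapping small SNR ranges and that the dataset spans -20 dB to 20 dB.

```python
from bisect import bisect_right

# Hypothetical bin edges in dB (an assumption, not taken from the paper).
# Teacher i is taken to cover the range [edges[i], edges[i + 1]).
BIN_EDGES_DB = [-20.0, -10.0, 0.0, 10.0, 20.0]


def select_teacher(snr_db, edges=BIN_EDGES_DB):
    """Return the index of the teacher whose SNR range contains snr_db."""
    if not edges[0] <= snr_db <= edges[-1]:
        raise ValueError(f"SNR {snr_db} dB is outside the covered range")
    # bisect_right counts the edges that are <= snr_db; subtracting one
    # gives the bin index. The clamp puts the top edge into the last bin.
    return min(bisect_right(edges, snr_db) - 1, len(edges) - 2)
```

In a training loop, the returned index would pick which pre-trained teacher supervises the student for that sample; how the supervision loss is formed is not specified in the abstract and is left out here.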
-
Sub-Band Knowledge Distillation Framework for Speech Enhancement
Authors:
Xiang Hao,
Shixue Wen,
Xiangdong Su,
Yun Liu,
Guanglai Gao,
Xiaofei Li
Abstract:
In single-channel speech enhancement, methods based on full-band spectral features have been widely studied. However, only a few methods pay attention to non-full-band spectral features. In this paper, we explore a knowledge distillation framework based on sub-band spectral mapping for single-channel speech enhancement. Specifically, we divide the full frequency band into multiple sub-bands and pre-train an elite-level sub-band enhancement model (teacher model) for each sub-band. These teacher models are dedicated to processing their own sub-bands. Next, under the teacher models' guidance, we train a general sub-band enhancement model (student model) that works for all sub-bands. Without increasing the number of model parameters and computational complexity, the student model's performance is further improved. To evaluate our proposed method, we conducted a large number of experiments on an open-source data set. The final experimental results show that the guidance from the elite-level teacher models dramatically improves the student model's performance, which exceeds the full-band model by employing fewer parameters.
Submitted 29 October, 2020; v1 submitted 29 May, 2020;
originally announced May 2020.
-
WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
Authors:
Rui Liu,
Berrak Sisman,
Feilong Bao,
Guanglai Gao,
Haizhou Li
Abstract:
Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input. Such frameworks typically consist of a feature prediction network that maps character sequences to frequency-domain acoustic features, followed by a waveform reconstruction algorithm or a neural vocoder that generates the time-domain waveform from acoustic features. As the loss function is usually calculated only for frequency-domain acoustic features, it does not directly control the quality of the generated time-domain waveform. To address this problem, we propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, that has two loss functions: 1) a time-domain loss, denoted as the waveform loss, that measures the distortion between the natural and generated waveforms; and 2) a frequency-domain loss that measures the Mel-scale acoustic feature loss between the natural and generated acoustic features. WaveTTS ensures the quality of both the acoustic features and the resulting speech waveform. To the best of our knowledge, this is the first implementation of Tacotron with joint time-frequency domain loss. Experimental results show that the proposed framework outperforms the baselines and achieves high-quality synthesized speech.
Submitted 6 April, 2020; v1 submitted 2 February, 2020;
originally announced February 2020.
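The joint time-frequency loss described in this abstract can be sketched as a weighted sum of a waveform term and a Mel-feature term. This is only an illustration: the abstract does not state which distortion measure or weighting is used, so the mean squared error, the weights, and the names `joint_loss` and `mean_squared_error` are assumptions.

```python
def mean_squared_error(reference, estimate):
    """Mean squared error between two equal-length sequences of floats."""
    if len(reference) != len(estimate):
        raise ValueError("sequences must have the same length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def joint_loss(wave_nat, wave_gen, mel_nat, mel_gen,
               time_weight=1.0, freq_weight=1.0):
    """Weighted sum of a time-domain and a frequency-domain loss.

    wave_* are waveform samples; mel_* are flattened Mel-scale features.
    MSE and the default weights are assumptions made for this sketch.
    """
    time_loss = mean_squared_error(wave_nat, wave_gen)  # waveform loss
    freq_loss = mean_squared_error(mel_nat, mel_gen)    # Mel feature loss
    return time_weight * time_loss + freq_weight * freq_loss
```

Setting `time_weight=0.0` recovers a purely frequency-domain objective, which corresponds to the conventional setup the abstract contrasts against.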
-
Teacher-Student Training for Robust Tacotron-based TTS
Authors:
Rui Liu,
Berrak Sisman,
Jingdong Li,
Feilong Bao,
Guanglai Gao,
Haizhou Li
Abstract:
While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways, the exposure bias problem in the autoregressive models remains an issue to be resolved. The exposure bias problem arises from the mismatch between the training and inference processes, which results in unpredictable performance for out-of-domain test data at run-time. To overcome this, we propose a teacher-student training scheme for Tacotron-based TTS by introducing a distillation loss function in addition to the feature loss function. We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this model serves as the teacher. We then train another Tacotron2-based model as a student model, whose decoder takes the predicted speech frames as input, similar to how the decoder works during run-time inference. With the distillation loss, the student model learns the output probabilities from the teacher model, a process called knowledge distillation. Experiments show that our proposed training scheme consistently improves the voice quality for out-of-domain test data in both Chinese and English systems.
Submitted 11 February, 2020; v1 submitted 7 November, 2019;
originally announced November 2019.
-
SLAM-based Integrity Monitoring Using GPS and Fish-eye Camera
Authors:
Sriramya Bhamidipati,
Grace Xingxin Gao
Abstract:
Urban navigation using GPS and fish-eye camera suffers from multipath effects in GPS measurements and data association errors in pixel intensities across image frames. We propose a Simultaneous Localization and Mapping (SLAM)-based Integrity Monitoring (IM) algorithm to compute the position protection levels while accounting for multiple faults in both GPS and vision. We perform graph optimization using the sequential data of GPS pseudoranges, pixel intensities, vehicle dynamics, and satellite ephemeris to simultaneously localize the vehicle as well as the landmarks, namely GPS satellites and key image pixels in the world frame. We estimate the fault mode vector by analyzing the temporal correlation across the GPS measurement residuals and spatial correlation across the vision intensity residuals. In particular, to detect and isolate the vision faults, we developed a superpixel-based piecewise Random Sample Consensus (RANSAC) technique to perform spatial voting across image pixels. For an estimated fault mode, we compute the protection levels by applying worst-case failure slope analysis to the linearized Graph-SLAM framework. We perform ground vehicle experiments in the semi-urban area of Champaign, IL and have demonstrated the successful detection and isolation of multiple faults. We also validate tighter protection levels and lower localization errors achieved via the proposed algorithm as compared to SLAM-based IM that utilizes only GPS measurements.
Submitted 4 October, 2019;
originally announced October 2019.
-
A photonic-assisted method based on the MDA technique for the frequency estimation precision improvement
Authors:
Guangyu Gao,
Naijin Liu
Abstract:
A novel photonics-assisted method based on presampling and the MDA technique is proposed to significantly improve frequency estimation precision without introducing other complex algorithms. This method is also compatible with existing FFT-based high-precision estimation algorithms.
Submitted 31 July, 2019;
originally announced August 2019.
-
Energy-Efficient Thermal Comfort Control in Smart Buildings via Deep Reinforcement Learning
Authors:
Guanyu Gao,
Jie Li,
Yonggang Wen
Abstract:
Heating, Ventilation, and Air Conditioning (HVAC) is extremely energy-consuming, accounting for 40% of total building energy consumption. Therefore, it is crucial to design energy-efficient building thermal control policies that reduce the energy consumption of HVAC while maintaining the comfort of the occupants. However, implementing such a policy is challenging because it involves various influencing factors in a building environment, which are usually hard to model and may differ from case to case. To address this challenge, we propose a deep reinforcement learning based framework for energy optimization and thermal comfort control in smart buildings. We formulate building thermal control as a cost-minimization problem that jointly considers the energy consumption of HVAC and the thermal comfort of the occupants. To solve the problem, we first adopt a deep neural network based approach for predicting the occupants' thermal comfort, and then adopt Deep Deterministic Policy Gradients (DDPG) for learning the thermal control policy. To evaluate the performance, we implement a building thermal control simulation system and test the method under various settings. The experimental results show that our method can improve the thermal comfort prediction accuracy and reduce the energy consumption of HVAC while improving the occupants' thermal comfort.
Submitted 15 January, 2019;
originally announced January 2019.
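The cost-minimization formulation mentioned in this abstract, which jointly considers HVAC energy and occupant comfort, might look like the following per-step cost. The abstract gives no formula, so the absolute-deviation comfort penalty, the `comfort_weight` trade-off, and the function names are all assumptions made for illustration.

```python
def step_cost(energy_kwh, predicted_comfort, target_comfort=0.0,
              comfort_weight=1.0):
    """Per-step cost: HVAC energy plus a weighted comfort penalty.

    predicted_comfort stands in for the output of a comfort predictor;
    the penalty form and the weight are hypothetical choices.
    """
    if energy_kwh < 0:
        raise ValueError("energy consumption cannot be negative")
    discomfort = abs(predicted_comfort - target_comfort)
    return energy_kwh + comfort_weight * discomfort


def step_reward(energy_kwh, predicted_comfort, **kwargs):
    """Reward for a reinforcement learning agent: the negated step cost."""
    return -step_cost(energy_kwh, predicted_comfort, **kwargs)
```

A larger `comfort_weight` makes the agent trade more energy for comfort; a policy learner such as DDPG would then maximize the cumulative `step_reward`.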
-
A code-free optical undersampling technique for broadband microwave spectrum measurement
Authors:
Guangyu Gao,
Xueshuang Xiang,
Qijun Liang,
Naijin Liu
Abstract:
A novel broadband microwave (MW) spectrum measurement (BMSM) scheme based on code-free optical undersampling and homodyne detection is proposed. Optical pulses at a far-less-than-Nyquist rate are generated in a fully analog manner by modulating cascaded electro-optical modulators with a single RF tone, without any high-speed coding sequence modulation. Homodyne detection reduces the analysis bandwidth of BMSM and enhances the detection performance for weak signals. A multi-band signal with a 20 GHz spectral range and SNR = 61 dB is used to investigate the BMSM performance of this scheme, and the results show good performance for BMSM. The potential for further optimization in practice is also discussed.
Submitted 31 July, 2019; v1 submitted 29 April, 2018;
originally announced May 2018.